3 datasets found
  1. AIT Log Data Set V2.0

    Explore at: zenodo.org; data.niaid.nih.gov
    Cite: Max Landauer; Florian Skopik; Maximilian Frank; Wolfgang Hotwagner; Markus Wurzenberger; Andreas Rauber (2024). AIT Log Data Set V2.0 [Dataset]. http://doi.org/10.5281/zenodo.5789064
    Available download formats: zip
    Dataset updated
    Jun 28, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Max Landauer; Florian Skopik; Maximilian Frank; Wolfgang Hotwagner; Markus Wurzenberger; Andreas Rauber
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    AIT Log Data Sets

    This repository contains synthetic log data suitable for evaluation of intrusion detection systems, federated learning, and alert aggregation. A detailed description of the dataset is available in [1]. The logs were collected from eight testbeds that were built at the Austrian Institute of Technology (AIT) following the approach by [2]. Please cite these papers if the data is used for academic publications.

    In brief, each of the datasets corresponds to a testbed representing a small enterprise network including mail server, file share, WordPress server, VPN, firewall, etc. Normal user behavior is simulated to generate background noise over a time span of 4-6 days. At some point, a sequence of attack steps is launched against the network. Log data is collected from all hosts and includes Apache access and error logs, authentication logs, DNS logs, VPN logs, audit logs, Suricata logs, network traffic packet captures, horde logs, exim logs, syslog, and system monitoring logs. Separate ground truth files are used to label events that are related to the attacks. Compared to the AIT-LDSv1.1, a more complex network and diverse user behavior is simulated, and logs are collected from all hosts in the network. If you are only interested in network traffic analysis, we also provide the AIT-NDS containing the labeled netflows of the testbed networks. We also provide the AIT-ADS, an alert data set derived by forensically applying open-source intrusion detection systems on the log data.

    The datasets in this repository have the following structure:

    • The gather directory contains all logs collected from the testbed. Logs collected from each host are located in gather/.
    • The labels directory contains the ground truth of the dataset that indicates which events are related to attacks. The directory mirrors the structure of the gather directory so that each label file is located at the same path and has the same name as the corresponding log file. Each line in the label files references the log event corresponding to an attack by the line number counted from the beginning of the file ("line"), the labels assigned to the line that state the respective attack step ("labels"), and the labeling rules that assigned the labels ("rules"). An example is provided below.
    • The processing directory contains the source code that was used to generate the labels.
    • The rules directory contains the labeling rules.
    • The environment directory contains the source code that was used to deploy the testbed and run the simulation using the Kyoushi Testbed Environment.
    • The dataset.yml file specifies the start and end time of the simulation.
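Because the labels directory mirrors the gather directory, each label file can be paired with its log file by simple path substitution. A minimal Python sketch of that pairing; the function name and traversal are illustrative, not part of the dataset's tooling:

```python
import os

def pair_label_files(dataset_root):
    """Walk the labels directory and yield (label_file, log_file) pairs.

    The labels tree mirrors the gather tree, so the corresponding log
    file is found by swapping the top-level directory name.
    """
    labels_root = os.path.join(dataset_root, "labels")
    pairs = []
    for dirpath, _dirnames, filenames in os.walk(labels_root):
        for name in filenames:
            label_file = os.path.join(dirpath, name)
            # Relative path inside labels/ equals the path inside gather/.
            rel = os.path.relpath(label_file, labels_root)
            log_file = os.path.join(dataset_root, "gather", rel)
            pairs.append((label_file, log_file))
    return pairs
```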

    The following list summarizes relevant properties of the datasets:

    • fox
      • Simulation time: 2022-01-15 00:00 - 2022-01-20 00:00
      • Attack time: 2022-01-18 11:59 - 2022-01-18 13:15
      • Scan volume: High
      • Unpacked size: 26 GB
    • harrison
      • Simulation time: 2022-02-04 00:00 - 2022-02-09 00:00
      • Attack time: 2022-02-08 07:07 - 2022-02-08 08:38
      • Scan volume: High
      • Unpacked size: 27 GB
    • russellmitchell
      • Simulation time: 2022-01-21 00:00 - 2022-01-25 00:00
      • Attack time: 2022-01-24 03:01 - 2022-01-24 04:39
      • Scan volume: Low
      • Unpacked size: 14 GB
    • santos
      • Simulation time: 2022-01-14 00:00 - 2022-01-18 00:00
      • Attack time: 2022-01-17 11:15 - 2022-01-17 11:59
      • Scan volume: Low
      • Unpacked size: 17 GB
    • shaw
      • Simulation time: 2022-01-25 00:00 - 2022-01-31 00:00
      • Attack time: 2022-01-29 14:37 - 2022-01-29 15:21
      • Scan volume: Low
      • Data exfiltration is not visible in DNS logs
      • Unpacked size: 27 GB
    • wardbeck
      • Simulation time: 2022-01-19 00:00 - 2022-01-24 00:00
      • Attack time: 2022-01-23 12:10 - 2022-01-23 12:56
      • Scan volume: Low
      • Unpacked size: 26 GB
    • wheeler
      • Simulation time: 2022-01-26 00:00 - 2022-01-31 00:00
      • Attack time: 2022-01-30 07:35 - 2022-01-30 17:53
      • Scan volume: High
      • No password cracking in attack chain
      • Unpacked size: 30 GB
    • wilson
      • Simulation time: 2022-02-03 00:00 - 2022-02-09 00:00
      • Attack time: 2022-02-07 10:57 - 2022-02-07 11:49
      • Scan volume: High
      • Unpacked size: 39 GB

    The following attacks are launched in the network:

    • Scans (nmap, WPScan, dirb)
    • Webshell upload (CVE-2020-24186)
    • Password cracking (John the Ripper)
    • Privilege escalation
    • Remote command execution
    • Data exfiltration (DNSteal)

    Note that attack parameters and their execution orders vary in each dataset. Labeled log files are trimmed to the simulation time to ensure that their labels (which reference the related event by the line number in the file) are not misleading. Other log files, however, also contain log events generated before or after the simulation time and may therefore be affected by testbed setup or data collection. It is therefore recommended to only consider logs with timestamps within the simulation time for analysis.
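As an illustration of that recommendation, the following sketch keeps only audit events whose embedded epoch timestamp falls within the simulation window. The window values are the russellmitchell simulation times listed above; treating them as UTC is an assumption made for this example:

```python
import re
from datetime import datetime, timezone

# Simulation window for the russellmitchell dataset (from its dataset.yml);
# interpreting these times as UTC is an assumption for this sketch.
SIM_START = datetime(2022, 1, 21, 0, 0, tzinfo=timezone.utc)
SIM_END = datetime(2022, 1, 25, 0, 0, tzinfo=timezone.utc)

# Audit log lines embed an epoch timestamp, e.g. msg=audit(1642999060.603:2226)
AUDIT_TS = re.compile(r"msg=audit\((\d+\.\d+):\d+\)")

def audit_event_in_window(line):
    """Return True if an audit log line's timestamp lies in the window."""
    m = AUDIT_TS.search(line)
    if m is None:
        return False
    ts = datetime.fromtimestamp(float(m.group(1)), tz=timezone.utc)
    return SIM_START <= ts < SIM_END
```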

    In the following, the structure of labels is explained using the audit logs from the intranet server in the russellmitchell dataset as an example. The first four labels in the labels/intranet_server/logs/audit/audit.log file are as follows:

    {"line": 1860, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}

    {"line": 1861, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}

    {"line": 1862, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}

    {"line": 1863, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}

    Each JSON object in this file assigns a label to one specific log line in the corresponding log file located at gather/intranet_server/logs/audit/audit.log. The field "line" in each JSON object specifies the line number of the respective event in the original log file, while the field "labels" comprises the corresponding labels. For example, the sample above indicates that lines 1860-1863 in the gather/intranet_server/logs/audit/audit.log file are labeled with "attacker_change_user" and "escalate", corresponding to the attack step where the attacker obtains escalated privileges. Inspecting these lines shows that they indeed correspond to the user authenticating as root:

    type=USER_AUTH msg=audit(1642999060.603:2226): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:authentication acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'

    type=USER_ACCT msg=audit(1642999060.603:2227): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:accounting acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'

    type=CRED_ACQ msg=audit(1642999060.615:2228): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:setcred acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'

    type=USER_START msg=audit(1642999060.627:2229): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:session_open acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'

    The same applies to all other labels for this log file and all other log files. There are no labels for logs generated by "normal" (i.e., non-attack) behavior; instead, all log events that have no corresponding JSON object in one of the files from the labels directory, such as lines 1-1859 in the example above, can be considered to be labeled as "normal". To determine the labels for the log data, it is therefore necessary to track line numbers while processing the original logs from the gather directory and check whether those line numbers appear in the corresponding file in the labels directory.
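That lookup can be sketched in a few lines of Python; the helper functions are hypothetical, not part of the dataset's processing code:

```python
import json

def load_label_map(label_file):
    """Map line numbers to label lists from a label file (one JSON object per line)."""
    label_map = {}
    with open(label_file) as f:
        for raw in f:
            raw = raw.strip()
            if not raw:
                continue
            obj = json.loads(raw)
            label_map[obj["line"]] = obj["labels"]
    return label_map

def labeled_events(log_file, label_map):
    """Yield (line_number, labels, event); lines without labels are 'normal'."""
    with open(log_file) as f:
        for lineno, event in enumerate(f, start=1):
            yield lineno, label_map.get(lineno, ["normal"]), event.rstrip("\n")
```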

    Besides the attack labels, a general overview of the exact times when specific attack steps are launched is available in gather/attacker_0/logs/attacks.log. An enumeration of all hosts and their IP addresses is provided in processing/config/servers.yml. Moreover, configurations of each host are provided in gather/ and gather/.

    Version history:

    • AIT-LDS-v1.x: Four datasets, logs from single host, fine-granular audit logs, mail/CMS.
    • AIT-LDS-v2.0: Eight datasets, logs from all hosts, system logs and network traffic, mail/CMS/cloud/web.

    Acknowledgements: Partially funded by the FFG projects INDICAETING (868306) and DECEPT (873980), and the EU projects GUARD (833456) and PANDORA (SI2.835928).

    If you use the dataset, please cite the following publications:

    [1] M. Landauer, F. Skopik, M. Frank, W. Hotwagner,

  2. Kyoushi Log Data Set

    Explore at: zenodo.org; data.niaid.nih.gov
    Cite: Max Landauer; Maximilian Frank; Florian Skopik; Wolfgang Hotwagner; Markus Wurzenberger; Andreas Rauber (2023). Kyoushi Log Data Set [Dataset]. http://doi.org/10.5281/zenodo.5779411
    Available download formats: zip
    Dataset updated
    Oct 18, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Max Landauer; Maximilian Frank; Florian Skopik; Wolfgang Hotwagner; Markus Wurzenberger; Andreas Rauber
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This repository contains synthetic log data suitable for evaluation of intrusion detection systems. The logs were collected from a testbed that was built at the Austrian Institute of Technology (AIT) following the approaches by [1], [2], and [3]. Please refer to these papers for more detailed information on the dataset and cite them if the data is used for academic publications. Unlike the related AIT-LDSv1.1, this dataset involves a more complex network structure, makes use of a different attack scenario, and collects log data from multiple hosts in the network. In brief, the testbed simulates a small enterprise network including mail server, file share, WordPress server, VPN, firewall, etc. Normal user behavior is simulated to generate background noise. After some days, two attack scenarios are launched against the network. Note that the AIT-LDSv2.0 extends this dataset with additional attack cases and variations of attack parameters.

    The archives have the following structure. The gather directory contains the raw log data from each host in the network, as well as their system configurations. The labels directory contains the ground truth for those log files that are labeled. The processing directory contains configurations for the labeling procedure and the rules directory contains the labeling rules. Labeling of events that are related to the attacks is carried out with the Kyoushi Labeling Framework.

    Each dataset contains traces of a specific attack scenario:

    • Scenario 1 (see gather/attacker_0/logs/sm.log for detailed attack log):
      • nmap scan
      • WPScan
      • dirb scan
      • webshell upload through wpDiscuz exploit (CVE-2020-24186)
      • privilege escalation
    • Scenario 2 (see gather/attacker_0/logs/dnsteal.log for detailed attack log):
      • DNSteal data exfiltration

    The log data collected from the servers includes

    • Apache access and error logs (labeled)
    • audit logs (labeled)
    • auth logs (labeled)
    • VPN logs (labeled)
    • DNS logs (labeled)
    • syslog
    • suricata logs
    • exim logs
    • horde logs
    • mail logs

    Note that only log files from affected servers are labeled. Label files and the directories in which they are located have the same name as their corresponding log file in the gather directory. Labels are in JSON format and comprise the following attributes: line (number of the line in the corresponding log file), labels (list of labels assigned to that log line), and rules (names of labeling rules matching that log line). Note that not all attack traces are labeled in all log files; please refer to the labeling rules if some labels are unclear.
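Because each label object carries these three attributes, summary statistics fall out directly; for instance, tallying how often each label occurs in a label file. The helper below is illustrative, not part of the Kyoushi tooling:

```python
import json
from collections import Counter

def count_labels(label_lines):
    """Tally how often each label occurs across a label file's JSON objects."""
    counts = Counter()
    for raw in label_lines:
        obj = json.loads(raw)
        counts.update(obj["labels"])
    return counts
```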

    Acknowledgements: Partially funded by the FFG projects INDICAETING (868306) and DECEPT (873980), and the EU project GUARD (833456).

    If you use the dataset, please cite the following publications:

    [1] M. Landauer, F. Skopik, M. Wurzenberger, W. Hotwagner and A. Rauber, "Have it Your Way: Generating Customized Log Datasets With a Model-Driven Simulation Testbed," in IEEE Transactions on Reliability, vol. 70, no. 1, pp. 402-415, March 2021, doi: 10.1109/TR.2020.3031317.

    [2] M. Landauer, M. Frank, F. Skopik, W. Hotwagner, M. Wurzenberger, and A. Rauber, "A Framework for Automatic Labeling of Log Datasets from Model-driven Testbeds for HIDS Evaluation". ACM Workshop on Secure and Trustworthy Cyber-Physical Systems (ACM SaT-CPS 2022), April 27, 2022, Baltimore, MD, USA. ACM.

    [3] M. Frank, "Quality improvement of labels for model-driven benchmark data generation for intrusion detection systems", Master's Thesis, Vienna University of Technology, 2021.

  3. Cloud-based User Entity Behavior Analytics Log Data Set

    Explore at: zenodo.org; data.niaid.nih.gov
    Cite: Max Landauer; Florian Skopik; Georg Höld; Markus Wurzenberger (2023). Cloud-based User Entity Behavior Analytics Log Data Set [Dataset]. http://doi.org/10.5281/zenodo.7119953
    Available download formats: zip
    Dataset updated
    Oct 30, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Max Landauer; Florian Skopik; Georg Höld; Markus Wurzenberger
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains the CLUE-LDS (CLoud-based User Entity behavior analytics Log Data Set). The data set contains log events from real users of a cloud storage system, suitable for User Entity Behavior Analytics (UEBA). Events include logins, file accesses, link shares, config changes, etc. The data set contains around 50 million events generated by more than 5000 distinct users over more than five years (2017-07-07 to 2022-09-29, or 1910 days). The data set is complete except for 109 events missing on 2021-04-22, 2021-08-20, and 2021-09-05 due to a database failure. The unpacked file size is around 14.5 GB. A detailed analysis of the data set is provided in [1].

    The logs are provided in JSON format with the following attributes in the first level:

    • id: Unique log line identifier that starts at 1 and increases incrementally, e.g., 1.
    • time: Time stamp of the event in ISO format, e.g., 2021-01-01T00:00:02Z.
    • uid: Unique anonymized identifier for the user generating the event, e.g., old-pink-crane-sharedealer.
    • uidType: Specifier for uid, which is either the user name or IP address for logged out users.
    • type: The action carried out by the user, e.g., file_accessed.
    • params: Additional event parameters (e.g., paths, groups) stored in a nested dictionary.
    • isLocalIP: Optional flag for event origin, which is either internal (true) or external (false).
    • role: Optional user role: consulting, administration, management, sales, technical, or external.
    • location: Optional IP-based geolocation of event origin, including city, country, longitude, latitude, etc.

    In the following data sample, the first object depicts a successful user login (see type: login_successful) and the second object depicts a file access (see type: file_accessed) from a remote location:

    {"params": {"user": "intact-gray-marlin-trademarkagent"}, "type": "login_successful", "time": "2019-11-14T11:26:43Z", "uid": "intact-gray-marlin-trademarkagent", "id": 21567530, "uidType": "name"}

    {"isLocalIP": false, "params": {"path": "/proud-copper-orangutan-artexer/doubtful-plum-ptarmigan-merchant/insufficient-amaranth-earthworm-qualitycontroller/curious-silver-galliform-tradingstandards/incredible-indigo-octopus-printfinisher/wicked-bronze-sloth-claimsmanager/frantic-aquamarine-horse-cleric"}, "type": "file_accessed", "time": "2019-11-14T11:26:51Z", "uid": "graceful-olive-spoonbill-careersofficer", "id": 21567531, "location": {"countryCode": "AT", "countryName": "Austria", "region": "4", "city": "Gmunden", "latitude": 47.915, "longitude": 13.7959, "timezone": "Europe/Vienna", "postalCode": "4810", "metroCode": null, "regionName": "Upper Austria", "isInEuropeanUnion": true, "continent": "Europe", "accuracyRadius": 50}, "uidType": "ipaddress"}

    The data set was generated at the premises of Huemer Group, a midsize IT service provider located in Vienna, Austria. Huemer Group offers a range of Infrastructure-as-a-Service solutions for enterprises, including cloud computing and storage. In particular, their cloud storage solution called hBOX enables customers to upload their data, synchronize them with multiple devices, share files with others, create versions and backups of their documents, collaborate with team members in shared data spaces, and query the stored documents using search terms. The hBOX extends the open-source project Nextcloud with interfaces and functionalities tailored to the requirements of customers.

    The data set comprises only normal user behavior, but it can be used to evaluate anomaly detection approaches by simulating account hijacking. We provide an implementation for identifying similar users, switching pairs of users to simulate changes in behavior patterns, and a sample detection approach in our GitHub repository.
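The user-switching idea can be sketched as follows: after a chosen switch time, events of two users are attributed to each other, so each account suddenly exhibits the other user's behavior. The function and its signature are hypothetical, not the repository's actual API:

```python
def switch_users(events, uid_a, uid_b, switch_time):
    """Simulate account hijacking by swapping two users after switch_time.

    events: list of dicts with at least "time" (ISO string) and "uid";
    ISO-8601 timestamps compare correctly as plain strings.
    Returns a new list; the input events are not modified.
    """
    mapping = {uid_a: uid_b, uid_b: uid_a}
    swapped = []
    for ev in events:
        ev = dict(ev)  # copy so the original event is untouched
        if ev["time"] >= switch_time and ev["uid"] in mapping:
            ev["uid"] = mapping[ev["uid"]]
        swapped.append(ev)
    return swapped
```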

    Acknowledgements: Partially funded by the FFG project DECEPT (873980). The authors thank Walter Huemer, Oskar Kruschitz, Kevin Truckenthanner, and Christian Aigner from Huemer Group for supporting the collection of the data set.

    If you use the dataset, please cite the following publication:

    [1] M. Landauer, F. Skopik, G. Höld, and M. Wurzenberger. "A User and Entity Behavior Analytics Log Data Set for Anomaly Detection in Cloud Computing". 2022 IEEE International Conference on Big Data - 6th International Workshop on Big Data Analytics for Cyber Intelligence and Defense (BDA4CID 2022), December 17-20, 2022, Osaka, Japan. IEEE. [PDF]

