11 datasets found
  1. AIT Log Data Set V2.0

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 28, 2024
    Cite
    Skopik, Florian (2024). AIT Log Data Set V2.0 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5789063
    Dataset updated
    Jun 28, 2024
    Dataset provided by
    Rauber, Andreas
    Skopik, Florian
    Wurzenberger, Markus
    Landauer, Max
    Hotwagner, Wolfgang
    Frank, Maximilian
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    AIT Log Data Sets

    This repository contains synthetic log data suitable for evaluation of intrusion detection systems, federated learning, and alert aggregation. A detailed description of the dataset is available in [1]. The logs were collected from eight testbeds that were built at the Austrian Institute of Technology (AIT) following the approach by [2]. Please cite these papers if the data is used for academic publications.

    In brief, each of the datasets corresponds to a testbed representing a small enterprise network including mail server, file share, WordPress server, VPN, firewall, etc. Normal user behavior is simulated to generate background noise over a time span of 4-6 days. At some point, a sequence of attack steps is launched against the network. Log data is collected from all hosts and includes Apache access and error logs, authentication logs, DNS logs, VPN logs, audit logs, Suricata logs, network traffic packet captures, Horde logs, Exim logs, syslog, and system monitoring logs. Separate ground truth files are used to label events that are related to the attacks. Compared to the AIT-LDSv1.1, a more complex network and more diverse user behavior are simulated, and logs are collected from all hosts in the network. If you are only interested in network traffic analysis, we also provide the AIT-NDS containing the labeled netflows of the testbed networks. We also provide the AIT-ADS, an alert data set derived by forensically applying open-source intrusion detection systems on the log data.

    The datasets in this repository have the following structure:

    The gather directory contains all logs collected from the testbed. Logs collected from each host are located in gather/<host>/logs/.

    The labels directory contains the ground truth of the dataset that indicates which events are related to attacks. The directory mirrors the structure of the gather directory so that each label file is located at the same path and has the same name as the corresponding log file. Each line in the label files references the log event corresponding to an attack by the line number counted from the beginning of the file ("line"), the labels assigned to the line that state the respective attack step ("labels"), and the labeling rules that assigned the labels ("rules"). An example is provided below.

    The processing directory contains the source code that was used to generate the labels.

    The rules directory contains the labeling rules.

    The environment directory contains the source code that was used to deploy the testbed and run the simulation using the Kyoushi Testbed Environment.

    The dataset.yml file specifies the start and end time of the simulation.

    The following table summarizes relevant properties of the datasets:

    Dataset          Simulation time                        Attack time                            Scan volume  Unpacked size  Notes
    fox              2022-01-15 00:00 - 2022-01-20 00:00    2022-01-18 11:59 - 2022-01-18 13:15    High         26 GB
    harrison         2022-02-04 00:00 - 2022-02-09 00:00    2022-02-08 07:07 - 2022-02-08 08:38    High         27 GB
    russellmitchell  2022-01-21 00:00 - 2022-01-25 00:00    2022-01-24 03:01 - 2022-01-24 04:39    Low          14 GB
    santos           2022-01-14 00:00 - 2022-01-18 00:00    2022-01-17 11:15 - 2022-01-17 11:59    Low          17 GB
    shaw             2022-01-25 00:00 - 2022-01-31 00:00    2022-01-29 14:37 - 2022-01-29 15:21    Low          27 GB          Data exfiltration is not visible in DNS logs
    wardbeck         2022-01-19 00:00 - 2022-01-24 00:00    2022-01-23 12:10 - 2022-01-23 12:56    Low          26 GB
    wheeler          2022-01-26 00:00 - 2022-01-31 00:00    2022-01-30 07:35 - 2022-01-30 17:53    High         30 GB          No password cracking in attack chain
    wilson           2022-02-03 00:00 - 2022-02-09 00:00    2022-02-07 10:57 - 2022-02-07 11:49    High         39 GB

    The following attacks are launched in the network:

    Scans (nmap, WPScan, dirb)

    Webshell upload (CVE-2020-24186)

    Password cracking (John the Ripper)

    Privilege escalation

    Remote command execution

    Data exfiltration (DNSteal)

    Note that attack parameters and their execution orders vary in each dataset. Labeled log files are trimmed to the simulation time to ensure that their labels (which reference the related event by the line number in the file) are not misleading. Other log files, however, also contain log events generated before or after the simulation time and may therefore be affected by testbed setup or data collection. It is therefore recommended to only consider logs with timestamps within the simulation time for analysis.

    The structure of labels is explained using the audit logs from the intranet server in the russellmitchell data set as an example in the following. The first four labels in the labels/intranet_server/logs/audit/audit.log file are as follows:

    {"line": 1860, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}

    {"line": 1861, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}

    {"line": 1862, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}

    {"line": 1863, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}

    Each JSON object in this file assigns a label to one specific log line in the corresponding log file located at gather/intranet_server/logs/audit/audit.log. The field "line" in the JSON objects specifies the line number of the respective event in the original log file, while the field "labels" comprises the corresponding labels. For example, the lines in the sample above provide the information that lines 1860-1863 in the gather/intranet_server/logs/audit/audit.log file are labeled with "attacker_change_user" and "escalate", corresponding to the attack step where the attacker obtains escalated privileges. Inspecting these lines shows that they indeed correspond to the user authenticating as root:

    type=USER_AUTH msg=audit(1642999060.603:2226): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:authentication acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'

    type=USER_ACCT msg=audit(1642999060.603:2227): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:accounting acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'

    type=CRED_ACQ msg=audit(1642999060.615:2228): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:setcred acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'

    type=USER_START msg=audit(1642999060.627:2229): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:session_open acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'

    The same applies to all other labels for this log file and all other log files. There are no labels for logs generated by "normal" (i.e., non-attack) behavior; instead, all log events that have no corresponding JSON object in one of the files from the labels directory, such as lines 1-1859 in the example above, can be considered to be labeled as "normal". This means that, in order to determine the labels for the log data, it is necessary to keep track of the line numbers while processing the original logs from the gather directory and to check whether these line numbers also appear in the corresponding file in the labels directory (a minimal sketch of this lookup is given below).
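
    The following is a minimal sketch of such a lookup in Python (illustrative only, not part of the official processing code; it assumes the dataset has been unpacked into the current directory so that the relative paths below exist):

    import json

    log_path = "gather/intranet_server/logs/audit/audit.log"
    label_path = "labels/intranet_server/logs/audit/audit.log"

    # Build a mapping from line number to the assigned attack labels.
    labels_by_line = {}
    with open(label_path) as f:
        for label_line in f:
            entry = json.loads(label_line)
            labels_by_line[entry["line"]] = entry["labels"]

    # Walk through the raw log and attach labels; unlabeled lines count as "normal".
    labeled_events = []
    with open(log_path) as f:
        for line_number, event in enumerate(f, start=1):
            labels = labels_by_line.get(line_number, ["normal"])
            labeled_events.append((line_number, labels, event.rstrip("\n")))

    print(sum(1 for _, labels, _ in labeled_events if labels != ["normal"]), "attack-related lines")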

    Besides the attack labels, a general overview of the exact times when specific attack steps are launched is available in gather/attacker_0/logs/attacks.log. An enumeration of all hosts and their IP addresses is stated in processing/config/servers.yml. Moreover, configurations of each host are provided in gather/<host>/configs/ and gather/<host>/facts.json.

    Version history:

    AIT-LDS-v1.x: Four datasets, logs from single host, fine-granular audit logs, mail/CMS.

    AIT-LDS-v2.0: Eight datasets, logs from all hosts, system logs and network traffic, mail/CMS/cloud/web.

    Acknowledgements: Partially funded by the FFG projects INDICAETING (868306) and DECEPT (873980), and the EU projects GUARD (833456) and PANDORA (SI2.835928).

    If you use the dataset, please cite the following publications:

    [1] M. Landauer, F. Skopik, M. Frank, W. Hotwagner, M. Wurzenberger, and A. Rauber, "Maintainable Log Datasets for Evaluation of Intrusion Detection Systems," IEEE Transactions on Dependable and Secure Computing, vol. 20, no. 4, pp. 3466-3482, 2023, doi: 10.1109/TDSC.2022.3201582.

    [2] M. Landauer, F. Skopik, M. Wurzenberger, W. Hotwagner, and A. Rauber, "Have it Your Way: Generating Customized Log Datasets With a Model-Driven Simulation Testbed," IEEE Transactions on Reliability, vol. 70, no. 1, pp. 402-415, March 2021, doi: 10.1109/TR.2020.3031317.

  2. LO2: Microservice Dataset of Logs and Metrics

    • zenodo.org
    bin, pdf, zip
    Updated Dec 4, 2024
    + more versions
    Cite
    Alexander Bakhtin; Jesse Nyyssölä; Yuqing Wang; Noman Ahmad; Ke Ping; Matteo Esposito; Mika Mäntylä; Davide Taibi (2024). LO2: Microservice Dataset of Logs and Metrics [Dataset]. http://doi.org/10.5281/zenodo.14265858
    Explore at: bin, zip, pdf (available download formats)
    Dataset updated
    Dec 4, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alexander Bakhtin; Jesse Nyyssölä; Yuqing Wang; Noman Ahmad; Ke Ping; Matteo Esposito; Mika Mäntylä; Davide Taibi
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    LO2 dataset

    This is the data repository for the LO2 dataset.

    Here is an overview of the contents.

    lo2-data.zip

    This is the main dataset: the completely unedited output of our data collection process. Note that the uncompressed size is around 540 GB. For more information, see the paper and the data-appendix in this repository.

    lo2-sample.zip

    This is a sample that contains the data used for the preliminary analysis. It contains only service logs and the most relevant metrics for the first 100 runs. Furthermore, the metrics are combined at the run level into a single CSV to make them easier to utilize.

    data-appendix.pdf

    This document contains further details and stats about the full dataset. These include file size distributions, empty file analysis, log type analysis and the appearance of an unknown file.

    lo2-scripts.zip

    Various scripts for processing the data to create the sample, to conduct the preliminary analysis and to create the statistics seen in the data-appendix.

    • csv_generator.py, csv_merge*.py: These scripts create and combine the metrics into csv files. They need to be run in order. Merging runs to global is very memory intensive.
    • findempty.py: Finds empty files in the folders. As some are expected to be empty, it also counts the unexpected ones. Used in the data-appendix (a minimal sketch of this kind of check follows the list).
    • loglead_lo2.py: Script for the preliminary analysis of the logs for error detection. Requires LogLead version 1.2.1.
    • logstats.py: Counts log lines and their type. Used for creating the figure of number of lines per type and service.
    • node_exporter_metrics.txt: Metric descriptions exported from Prometheus (text file).
    • pca.py: The Principal Component Analysis script used for preliminary analysis.
    • reduce_logs.py: Very important for fair analysis, as the beginning of each file contains some initialization rows that leak information regarding correctness.
    • requirements.txt: Required Python libraries to run the scripts.
    • sizedist.py: Creating distributions of file sizes per filename for the data-appendix.
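
    As a rough illustration of the empty-file check described for findempty.py (a hedged sketch, not the actual script; the root folder and the set of files expected to be empty are placeholders):

    import os

    root = "lo2-data"  # placeholder: wherever the dataset was unpacked
    expected_empty = {"README_empty.txt"}  # placeholder: file names that are legitimately empty

    unexpected = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getsize(path) == 0 and name not in expected_empty:
                unexpected.append(path)

    print(len(unexpected), "unexpectedly empty files")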

    Version v2: Fixed LogLead version number and minor changes in scripts

  3. Disk replacement log file examples from a very large RAID disk system for predictive maintenance analysis

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 24, 2020
    Cite
    Schomaker, Lambert (2020). Disk replacement log file examples from a very large RAID disk system for predictive maintenance analysis [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2580161
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Schomaker, Lambert
    Strikwerda, Ger
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    README.txt

    Maintenance example belonging to:

    The MANTIS Book: Cyber Physical System Based Proactive Collaborative Maintenance, Chapter 9, "The Future of Maintenance" (2019). Lambert Schomaker, Michele Albano, Erkki Jantunen, Luis Lino Ferreira. River Publishers (DK). ISBN: 9788793609853, e-ISBN: 9788793609846, https://doi.org/10.13052/rp-9788793609846

    The figure (.pdf) did not make it into the book. Here are the raw data, the processed logs, and the .gnu script to produce it.

    Data: event logs on disk failure in two racks of a huge RAID disk system (2009-2016).

    disks1.raw, disks2.raw: the raw event logs.

    RC-filt-disks-log.c and do-RC-filter-to-make-spikes-more-visible (bash script): convert the event logs into RC-filtered time series --> disks1.log, disks2.log.

    Disrupted-operations-threshold: a constant (horizontal line) indicating the level at which users experienced system downtime.

    disk-replacement-log.gnu, disk-replacement-log.pdf: the plotting script (.gnu) and the resulting figure.
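
    The RC filtering step above is implemented by the C program and bash script in the archive. As a rough, hypothetical illustration of what a first-order RC-style (exponential) filter over a replacement-count series does (the smoothing constant and the input values below are placeholders, not taken from the dataset):

    def rc_filter(values, alpha=0.1):
        # First-order RC-style filter: y[t] = y[t-1] + alpha * (x[t] - y[t-1])
        filtered, state = [], 0.0
        for x in values:
            state += alpha * (x - state)
            filtered.append(state)
        return filtered

    # A burst of replacements decays slowly in the filtered series and stands out on a long time axis.
    daily_replacements = [0, 1, 0, 0, 7, 1, 0, 0, 2, 0]
    print(rc_filter(daily_replacements))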

  4. LO2: Microservice Dataset of Logs and Metrics

    • zenodo.org
    bin, pdf, zip
    Updated Dec 2, 2024
    Cite
    Alexander Bakhtin; Jesse Nyyssölä; Yuqing Wang; Noman Ahmad; Ke Ping; Matteo Esposito; Mika Mäntylä; Davide Taibi (2024). LO2: Microservice Dataset of Logs and Metrics [Dataset]. http://doi.org/10.5281/zenodo.14257990
    Explore at: zip, bin, pdf (available download formats)
    Dataset updated
    Dec 2, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alexander Bakhtin; Jesse Nyyssölä; Yuqing Wang; Noman Ahmad; Ke Ping; Matteo Esposito; Mika Mäntylä; Davide Taibi
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    LO2 dataset

    This is the data repository for the LO2 dataset.

    Here is an overview of the contents.

    lo2-data.zip

    This is the main dataset: the completely unedited output of our data collection process. Note that the uncompressed size is around 540 GB. For more information, see the paper and the data-appendix in this repository.

    lo2-sample.zip

    This is a sample that contains the data used for the preliminary analysis. It contains only service logs and the most relevant metrics for the first 100 runs. Furthermore, the metrics are combined at the run level into a single CSV to make them easier to utilize.

    data-appendix.pdf

    This document contains further details and stats about the full dataset. These include file size distributions, empty file analysis, log type analysis and the appearance of an unknown file.

    lo2-scripts.zip

    Various scripts for processing the data to create the sample, to conduct the preliminary analysis and to create the statistics seen in the data-appendix.

    • csv_generator.py, csv_merge*.py: These scripts create and combine the metrics into csv files. They need to be run in order. Merging runs to global is very memory intensive.
    • findempty.py: Finds empty files in the folders. As some are expected to be empty, it also counts the unexpected ones. Used in data-appendix.
    • loglead_lo2.py: Script for the preliminary analysis of the logs for error detection. Requires LogLead version 1.2.0.
    • logstats.py: Counts log lines and their type. Used for creating the figure of number of lines per type and service.
    • node_exporter_metrics.txt: Metric descriptions exported from Prometheus (text file).
    • pca.py: The Principal Component Analysis script used for preliminary analysis.
    • reduce_logs.py: Very important for fair analysis, as the beginning of each file contains some initialization rows that leak information regarding correctness.
    • requirements.txt: Required Python libraries to run the scripts.
    • sizedist.py: Creating distributions of file sizes per filename for the data-appendix.
  5. Data from: Log Decomposition Dynamics in Interior Alaska 4 - Nutrient Data

    • search.dataone.org
    • dataone.org
    • +1 more
    Updated Jun 18, 2014
    Cite
    John Yarie; Bonanza Creek LTER (2014). Log Decomposition Dynamics in Interior Alaska 4 - Nutrient Data [Dataset]. https://search.dataone.org/view/knb-lter-bnz.411.14
    Dataset updated
    Jun 18, 2014
    Dataset provided by
    Long Term Ecological Research Network (http://www.lternet.edu/)
    Authors
    John Yarie; Bonanza Creek LTER
    Time period covered
    Jun 1, 1996 - Dec 1, 2013
    Area covered
    Variables measured
    %C, %K, %N, %P, %S, S#, %Ca, %Mg, DPC, Loc, and 19 more
    Description

    The entire dataset (all 7 files) contains detailed information on a time series study of log decomposition in interior Alaska. The species studied include white and black spruce, aspen, birch, balsam poplar and aspen starting as green trees. In addition, white and black spruce in recently burned sites are included. The study was designed to produce a time series of log decomposition measurements over the next 100 years. The information to be measured on the logs includes weight and density changes over specified time periods, changes in nutrient concentrations, and hemicellulose, cellulose, and lignin concentrations, and changes in the quantity of nutrients and hemicellulose, cellulose and lignin. (This file contains the sample nutrient analysis data for the log decomposition study.)

  6. Programme for the International Assessment of Adult Competencies (PIAAC), log files

    • search.gesis.org
    • datacatalogue.cessda.eu
    • +2 more
    Updated Dec 22, 2017
    Cite
    OECD - Organisation for Economic Co-operation and Development; ETS - Educational Testing Service; GESIS - Leibniz-Institute for the Social Sciences; DIPF - German Institute for International Educational Research; cApStAn - Linguistic Quality Control; ROA - Research Centre for Education and the Labour Market; IEA DPC (Data Processing and research Center) - International Association for the Evaluation of Educational Achievement; Westat; Public Research Center Henri Tudor (2017). Programme for the International Assessment of Adult Competencies (PIAAC), log files [Dataset]. http://doi.org/10.4232/1.12955
    Explore at:
    (123548651), (98728590), (203045229), (121004610), (142825877), (134944984), (176109313), (128110088), (107728812), (244790579), (142594414), (139672688), (217161002), (129368670), (141204817), (124427779), (186644073) (available download formats)
    Dataset updated
    Dec 22, 2017
    Dataset provided by
    GESIS search
    GESIS Data Archive
    Authors
    OECD - Organisation for Economic Co-operation and Development; ETS - Educational Testing Service; GESIS - Leibniz-Institute for the Social Sciences; DIPF - German Institute for International Educational Research; cApStAn - Linguistic Quality Control; ROA - Research Centre for Education and the Labour Market; IEA DPC (Data Processing and research Center) - International Association for the Evaluation of Educational Achievement; Westat; Public Research Center Henri Tudor
    License

    https://www.gesis.org/en/institute/data-usage-terms

    Time period covered
    Aug 1, 2011 - Nov 24, 2012
    Description

    Objective: The PIAAC 2012 study was the first fully computer-based large scale assessment in education. During the assessment, user interactions were logged automatically. This means that most of the users’ actions within the assessment tool were recorded and stored with time stamps in separate files called log files. The log files contain paradata for each participant in the domains literacy, numeracy, and problem solving in technology-rich environments. The availability of these log files offers new opportunities to researchers, for instance to reproduce test-taking behavior of individuals and to better understand test-taking behavior.

    Method: PIAAC 2012 was conducted August 2011-November 2012 among a representative international sample of around 166,000 adults in 24 different countries. The following dataset includes the log files from 17 countries. Each country was allowed to choose its own sampling technique as long as the technique applies full selection probability methods to select a representative sample from the PIAAC target population. The countries were able to oversample particular subgroups of the target population. Persons aged 55-65 and recent immigrants were oversampled in Denmark, and persons aged 19-26 were oversampled in Poland. The administration of the background questionnaires was conducted face-to-face using computer-assisted personal interviewing (CAPI). After the questionnaire, the respondent completed a computer-based or paper-based cognitive assessment under the supervision of the interviewer in one or two of the following competence domains: literacy, numeracy and problem solving in technology-rich environments.

    Variables: With the help of the PIAAC LogDataAnalyzer you can generate a data set. The Log Data Extraction software is a self-contained system that manages activities like data extraction, data cleaning, and visualization of OECD-PIAAC 2012 assessment log data files. It serves as a basis for data related analysis tasks using the tool itself or by exporting the cleaned data to external tools like statistics packages. You can generate the following Variables: Number of Using Cancel Button, Number of Using Help Menu, Time on Task, Time Till the First Interaction, Final Response, Number of Switching Environment, Sequence of Switching Environment, Number of Highlight Events, Time Since Last Answer Interaction, Number of Created Emails, Sequence of Viewed Emails, Number of Different Email Views, Number of Revisited Emails, Number of Email Views, Sequence of Visited Webpages, Time-Sequence of Spent Time on Webpages, Number of Different Page Visits, Number of Page Visits, Number of Page Revisits.

  7. Cloud-based User Entity Behavior Analytics Log Data Set

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Oct 30, 2023
    Cite
    Max Landauer; Florian Skopik; Georg Höld; Markus Wurzenberger (2023). Cloud-based User Entity Behavior Analytics Log Data Set [Dataset]. http://doi.org/10.5281/zenodo.7119953
    Explore at: zip (available download formats)
    Dataset updated
    Oct 30, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Max Landauer; Florian Skopik; Georg Höld; Markus Wurzenberger
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains the CLUE-LDS (CLoud-based User Entity behavior analytics Log Data Set). The data set contains log events from real users utilizing a cloud storage system, suitable for User Entity Behavior Analytics (UEBA). Events include logins, file accesses, link shares, config changes, etc. The data set contains around 50 million events generated by more than 5000 distinct users over more than five years (2017-07-07 to 2022-09-29, or 1910 days). The data set is complete except for 109 events missing on 2021-04-22, 2021-08-20, and 2021-09-05 due to database failure. The unpacked file size is around 14.5 GB. A detailed analysis of the data set is provided in [1].

    The logs are provided in JSON format with the following attributes in the first level:

    • id: Unique log line identifier that starts at 1 and increases incrementally, e.g., 1.
    • time: Time stamp of the event in ISO format, e.g., 2021-01-01T00:00:02Z.
    • uid: Unique anonymized identifier for the user generating the event, e.g., old-pink-crane-sharedealer.
    • uidType: Specifier for uid, which is either the user name or, for logged-out users, the IP address.
    • type: The action carried out by the user, e.g., file_accessed.
    • params: Additional event parameters (e.g., paths, groups) stored in a nested dictionary.
    • isLocalIP: Optional flag for event origin, which is either internal (true) or external (false).
    • role: Optional user role: consulting, administration, management, sales, technical, or external.
    • location: Optional IP-based geolocation of event origin, including city, country, longitude, latitude, etc.

    In the following data sample, the first object depicts a successful user login (see type: login_successful) and the second object depicts a file access (see type: file_accessed) from a remote location:

    {"params": {"user": "intact-gray-marlin-trademarkagent"}, "type": "login_successful", "time": "2019-11-14T11:26:43Z", "uid": "intact-gray-marlin-trademarkagent", "id": 21567530, "uidType": "name"}

    {"isLocalIP": false, "params": {"path": "/proud-copper-orangutan-artexer/doubtful-plum-ptarmigan-merchant/insufficient-amaranth-earthworm-qualitycontroller/curious-silver-galliform-tradingstandards/incredible-indigo-octopus-printfinisher/wicked-bronze-sloth-claimsmanager/frantic-aquamarine-horse-cleric"}, "type": "file_accessed", "time": "2019-11-14T11:26:51Z", "uid": "graceful-olive-spoonbill-careersofficer", "id": 21567531, "location": {"countryCode": "AT", "countryName": "Austria", "region": "4", "city": "Gmunden", "latitude": 47.915, "longitude": 13.7959, "timezone": "Europe/Vienna", "postalCode": "4810", "metroCode": null, "regionName": "Upper Austria", "isInEuropeanUnion": true, "continent": "Europe", "accuracyRadius": 50}, "uidType": "ipaddress"}

    The data set was generated at the premises of Huemer Group, a midsize IT service provider located in Vienna, Austria. Huemer Group offers a range of Infrastructure-as-a-Service solutions for enterprises, including cloud computing and storage. In particular, their cloud storage solution called hBOX enables customers to upload their data, synchronize them with multiple devices, share files with others, create versions and backups of their documents, collaborate with team members in shared data spaces, and query the stored documents using search terms. The hBOX extends the open-source project Nextcloud with interfaces and functionalities tailored to the requirements of customers.

    The data set comprises only normal user behavior, but can be used to evaluate anomaly detection approaches by simulating account hijacking. We provide an implementation for identifying similar users, switching pairs of users to simulate changes of behavior patterns, and a sample detection approach in our GitHub repository.
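
    The user-switching idea can be sketched roughly as follows (a hypothetical illustration, not the authors' implementation): after a chosen cutoff time, the events of two users are relabeled with each other's uid, so that each account suddenly exhibits the other user's behavior.

    def swap_users(events, user_a, user_b, cutoff_time):
        # Relabel events of two users after cutoff_time to simulate account hijacking.
        # ISO timestamps in a fixed format compare correctly as strings.
        swapped = []
        for event in events:
            event = dict(event)  # copy so the original events stay untouched
            if event["time"] >= cutoff_time:
                if event["uid"] == user_a:
                    event["uid"] = user_b
                elif event["uid"] == user_b:
                    event["uid"] = user_a
            swapped.append(event)
        return swapped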

    Acknowledgements: Partially funded by the FFG project DECEPT (873980). The authors thank Walter Huemer, Oskar Kruschitz, Kevin Truckenthanner, and Christian Aigner from Huemer Group for supporting the collection of the data set.

    If you use the dataset, please cite the following publication:

    [1] M. Landauer, F. Skopik, G. Höld, and M. Wurzenberger, "A User and Entity Behavior Analytics Log Data Set for Anomaly Detection in Cloud Computing," 2022 IEEE International Conference on Big Data - 6th International Workshop on Big Data Analytics for Cyber Intelligence and Defense (BDA4CID 2022), December 17-20, 2022, Osaka, Japan. IEEE.

  8. Example of log file data from PISA 2012 problem solving.

    • plos.figshare.com
    xls
    Updated May 23, 2024
    Cite
    Guanyu Chen; Yan Liu; Yue Mao (2024). Example of log file data from PISA 2012 problem solving. [Dataset]. http://doi.org/10.1371/journal.pone.0304109.t001
    Explore at: xls (available download formats)
    Dataset updated
    May 23, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Guanyu Chen; Yan Liu; Yue Mao
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Example of log file data from PISA 2012 problem solving.

  9. Data from: Research data and example scripts for the paper "Bayesian Target-Vector Optimization for Efficient Parameter Reconstruction"

    • data.niaid.nih.gov
    Updated Mar 6, 2024
    Cite
    Burger, Sven (2024). Research data and example scripts for the paper "Bayesian Target-Vector Optimization for Efficient Parameter Reconstruction" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6359593
    Dataset updated
    Mar 6, 2024
    Dataset provided by
    Andrle, Kas
    Burger, Sven
    Schneider, Philipp-Immanuel
    Plock, Matthias
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Bayesian Target-Vector Optimization for Efficient Parameter Reconstruction

    This publication contains the research data and example scripts for the paper “Bayesian Target-Vector Optimization for Efficient Parameter Reconstruction” [1]. The research data is found in the directory research_data; the example scripts are found in the directory example_scripts.

    The research data contains all necessary information to be able to reconstruct the figures and values given in the paper, as well as all result figures shown. Where possible, the directories contain the necessary scripts to recreate the results themselves, up to stochastic variations.

    The example scripts are intended to show how one can (i), perform a least-square type optimization of a model function (here we focus our efforts on the analytical model functions MGH17 and Gauss3, as described in the paper) using various methods (BTVO, LM, BO, L-BFGS-B, NM, including using derivative information when applicable), and (ii), perform Markov chain Monte Carlo (MCMC) sampling around the found maximum likelihood estimate (MLE) to estimate the uncertainties of the MLE parameter (both using a surrogate model of the actual model function, as well as using the actual model function directly).

    Research data

    Contained are directories for the experimental problem GIXRF, and the two analytical model functions MGH17 and Gauss3. What follows is a listing of directories and the contents:

    gauss3_optimization: Optimization logs for the Gauss3 model function for BTVO, LM, BO, L-BFGS-B, NM (with derivatives when applicable), .npy files used for generating the plots, a benchmark.py file used for the generation of the data, as well as the plots shown in the paper.

    mgh17_optimization: Optimization logs for the MGH17 model function for BTVO, LM, BO, L-BFGS-B, NM (with derivatives when applicable), .npy files used for generating the plots, a benchmark.py file used for the generation of the data, as well as the plots shown in the paper.

    mgh17_mcmc_analytical: Scripts for the creation of the plots (does not use an optimization log), as well as plots shown in the paper. This uses the model function directly to perform the MCMC sampling.

    mgh17_mcmc_surrogate: Optimization log of the MGH17 function used for the creation of the MCMC plots, scripts for the creation of the plots (use the optimization log), as well as plots shown in the paper. This uses a surrogate model to perform the MCMC sampling.

    gixrf_optimization: benchmark.py file to perform the optimization, the optimization logs for the various methods (BTVO, LM, BO, L-BFGS-B, NM), .npy files and scripts used for the creation of the plots, and the plots shown in the paper.

    gixrf_mcmc_supplement: optimization log used for the creation of the plot, pickle file used for the creation of the plot, script to create the MCMC plot.

    gixrf_optimum_difference_supplement: optimization logs of BTVO optimization of the GIXRF problem, scripts to create the difference/error plots shown for the GIXRF problem in the supplement, and the plots themselves.

    Employed software for creating the research data

    The software used in the creation is:

    JCMsuite Analysis and Optimization toolkit, development version, commit d55e99b (the closest commercial release is found in JCMsuite version 5.0.2)

    A list of Python packages installed (excerpt from conda list, name and version): corner 2.1.0, emcee 3.0.2, jax 0.2.22, jaxlib 0.1.72, matplotlib 3.2.1, numba 0.40.1, numpy 1.18.1, pandas 0.24.1, python 3.7.11, scikit-optimize 0.7.4, scipy 1.7.1, tikzplotlib 0.9.9

    JCMsuite 4.6.3 for the evaluation of the experimental model

    Example scripts

    This directory contains a few sample files that show how parameter reconstructions can be performed using the JCMsuite analysis and optimization toolbox, with a particular focus on the Bayesian target-vector optimization method shown in the paper.

    It also contains example files that show how an uncertainty quantification can be performed using MCMC, both directly using a model function, as well as using a surrogate model of the model function.

    What follows is a listing of the contents of the directory:

    mcmc_mgh17_analytical.py: performs an MCMC analysis of the MGH17 model function directly, without constructing a surrogate model. Uses emcee (a minimal, generic emcee sketch follows this listing).

    mcmc_mgh17_surrogate.py: performs an MCMC analysis of the MGH17 model function by constructing a surrogate model of the model function. Uses the JCMsuite analysis and optimization toolbox.

    opt_gauss3.py: performs a parameter reconstruction of the Gauss3 model function using various methods (BTVO, LM, BO, L-BFGS-B, NM, with derivatives when applicable).

    opt_mgh17.py: performs a parameter reconstruction of the MGH17 model function using various methods (BTVO, LM, BO, L-BFGS-B, NM, with derivatives when applicable).

    util/model_functions.py: contains the MGH17 and Gauss3 model functions, their (automatic) derivatives, and objective functions used in the optimizations.
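
    Below is a minimal, generic emcee sketch of MCMC sampling for a least-squares likelihood of the MGH17-type model (a constant plus two exponentials, following the NIST definition). It is not the script shipped with this dataset; the synthetic data, noise level, prior, and starting point are placeholders.

    import numpy as np
    import emcee

    def mgh17(x, b):
        # MGH17 (NIST StRD): b1 + b2*exp(-b4*x) + b3*exp(-b5*x)
        return b[0] + b[1] * np.exp(-b[3] * x) + b[2] * np.exp(-b[4] * x)

    # Placeholder data: synthetic observations around a chosen parameter vector.
    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 320.0, 33)
    true_b = np.array([0.37, 1.94, -1.46, 0.013, 0.022])
    sigma = 0.01
    y = mgh17(x, true_b) + rng.normal(0.0, sigma, x.size)

    def log_prob(b):
        # Gaussian likelihood with known noise level and a flat prior.
        residuals = y - mgh17(x, b)
        return -0.5 * np.sum((residuals / sigma) ** 2)

    ndim, nwalkers = 5, 32
    p0 = true_b + 1e-3 * rng.normal(size=(nwalkers, ndim))  # start near the reference point
    sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob)
    sampler.run_mcmc(p0, 2000, progress=False)
    samples = sampler.get_chain(discard=500, flat=True)
    print(samples.mean(axis=0), samples.std(axis=0))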

    Requirements to execute the example scripts

    These scripts have been developed and tested under Linux, Debian 10. We have tried to make sure that they would also work in a Windows environment, but can unfortunately give no guarantees for that.

    We mainly use Python to run the reconstructions. To execute the files, a few Python packages have to be installed. In addition to the usual scientific Python stack (NumPy, SciPy, matplotlib, pandas, etc.), the packages jax and jaxlib (for automatic differentiation of Python/NumPy functions), emcee and corner (for MCMC sampling and subsequent plotting of the results) have to be installed.

    This can be achieved, for example, using pip:

    pip install -r requirements.txt

    Additionally, JCMsuite has to be installed. For this you can visit [2] and download a free trial version.

    On Linux, the installation has to be added to the PATH, e.g. by adding the following to your .bashrc file:

    export JCMROOT=/FULL/PATH/TO/BASE/DIRECTORY
    export PATH=$JCMROOT/bin:$PATH
    export PYTHONPATH=$JCMROOT/ThirdPartySupport/Python:$PYTHONPATH

    Bibliography

    [1] M. Plock, K. Andrle, S. Burger, P.-I. Schneider, Bayesian Target-Vector Optimization for Efficient Parameter Reconstruction. Adv. Theory Simul. 5, 2200112 (2022).

    [2] https://jcmwave.com/

  10. User Interface Logs based on mockups or real-life screenshots for task mining applications

    • zenodo.org
    zip
    Updated Apr 22, 2025
    Cite
    Antonio Martínez-Rojas; Antonio Rodríguez Ruiz; Andrés Jiménez-Ramírez; José González Enríquez; Hajo A. Reijers (2025). User Interface Logs based on mockups or real-life screenshots for task mining applications [Dataset]. http://doi.org/10.5281/zenodo.15195200
    Explore at: zip (available download formats)
    Dataset updated
    Apr 22, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Antonio Martínez-Rojas; Antonio Rodríguez Ruiz; Andrés Jiménez-Ramírez; José González Enríquez; Hajo A. Reijers
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset corresponds to the problems analyzed in the work "Assessing Reproducibility in Screenshot-Based Task Mining: A Decision Discovery Perspective," published in the Information Systems Journal.

    The artifacts provided correspond to multiple instances of the execution of a particular process model, covering all its variants in the form of UI Logs. These UI Logs are divided into two groups:

    1. A set where the process is executed in a synthetic environment, using mockups to represent the user interface.
    2. A set where real screenshots of the user interfaces involved in the process are used.

    Additionally, UI Logs are synthetically generated from an original UI Log in both cases.

    For generating the UI Logs, a real-world process based on handling unsubscription requests from users of a telephone company has been selected. This case was selected based on the following criteria: (1) the process is replicated from a real company, and (2) decision discovery relies on visual elements present in the screenshots, specifically an email attachment and a checkbox in a web form. Thus, the selected process consists of 10 activities, a single decision point, and 4 process variants.

    The dataset includes:

    • UI Logs Folder:
      • Contains the UI logs used to evaluate the approach. The dataset contains 10 folders: 5 corresponding to logs based on real screenshots and 5 to those using mockups. Inside these folders, the data is further structured into subfolders based on problem characteristics, with names formatted as ProblemType_LogSize_Balanced, where LogSize is one of {75, 100, 300, 500} and Balanced is either Balanced or Imbalanced. Each problem subfolder contains the corresponding UI Log and associated screenshots. Each subfolder includes:
        • log.csv: A CSV file containing the UI log data.
        • Screenshots and associated metadata:
          • 1_img.png: A sample screenshot image.
          • 1_img.png.json: JSON file containing metadata for the corresponding screenshot.
        • Original obtained evaluation results:
          • flattened_dataset.csv: A flattened version of the dataset used for decision tree analysis.
          • preprocessed_df.csv: Preprocessed data frame used for analysis.
          • decision_tree.log: Log file documenting the decision tree process.
          • CHAID-tree-feature-importance.csv: CSV file detailing feature importance from the CHAID decision tree.
    • Process Discovery Files Folder:
      • Contains the necessary data about the process for the framework to parse the UI logs and identify the decision points within the process. These files include:
        • bpmn.bpmn: BPMN file representing the process model.
        • bpmn.dot: DOT file representing the BPMN process model.
        • pn.dot: DOT file representing the Petri net process model.
        • traceability.json: JSON file mapping decision point branches to rules from the decision model.
      • These files map to the files outputted by the third phase of the proposed framework and are mocked for the purpose of restricting the evaluation to the fourth phase of the framework.
    • Scripts Folder:
      • Contains the necessary scripts for processing the UI logs to a format that can be processed by the framework, creating the experiments, populating the database of the framework with the experiment data, running the experiments, and collecting results. The scripts include:
        • collect_results.py: Script to collect experiment results.
        • db_populate.json: Configuration file for populating the database.
        • hierarchy_constructor.py: Script to construct the hierarchy of UI elements.
        • models_populate.json: Configuration file for populating models.
        • process_logs.py: Script to process UI logs.
        • process_reproducibility_data.py: Script to process reproducibility data.
        • process_uielements.py: Script to process UI elements.
        • run_experiments.py: Script to run experiments.
        • run_experiments.sh: Shell script to execute the experiments.

    To create the evaluation objects, we generated event logs of different sizes (|L|) by deriving events from the sample event log. We consider log sizes of {75, 100, 300, 500} events. Each log contains only complete process instances: if adding a further instance would exceed |L|, that instance is removed (a rough sketch of this trimming is given below).
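
    As a rough illustration of that trimming rule (a hedged sketch with hypothetical column names, not the dataset's actual generation script):

    import pandas as pd

    def trim_to_size(ui_log: pd.DataFrame, max_events: int, case_col: str = "case_id") -> pd.DataFrame:
        # Keep whole process instances, in order, while the total number of events stays within max_events.
        kept_cases, total = [], 0
        for case_id, group in ui_log.groupby(case_col, sort=False):
            if total + len(group) > max_events:
                break
            kept_cases.append(case_id)
            total += len(group)
        return ui_log[ui_log[case_col].isin(kept_cases)]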

    To average results across different problem instances, we trained decision trees 30 times on synthetic variations of the dataset, obtaining the mean of the metrics as experiment metadata.

  11. Combined log file obtained with bdmm

    • plos.figshare.com
    txt
    Updated Jun 2, 2023
    + more versions
    Cite
    Luis Roger Esquivel Gomez; Cyril Savin; Voahangy Andrianaivoarimanana; Soloandry Rahajandraibe; Lovasoa Nomena Randriantseheno; Zhemin Zhou; Arthur Kocher; Xavier Didelot; Minoarisoa Rajerison; Denise Kühnert (2023). Combined log file obtained with bdmm [Dataset]. http://doi.org/10.1371/journal.pntd.0010362.s005
    Explore at: txt (available download formats)
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS Neglected Tropical Diseases
    Authors
    Luis Roger Esquivel Gomez; Cyril Savin; Voahangy Andrianaivoarimanana; Soloandry Rahajandraibe; Lovasoa Nomena Randriantseheno; Zhemin Zhou; Arthur Kocher; Xavier Didelot; Minoarisoa Rajerison; Denise Kühnert
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Plague is a zoonotic disease caused by the bacterium Yersinia pestis, highly prevalent in the Central Highlands, a mountainous region in the center of Madagascar. After a plague-free period of over 60 years in the northwestern coast city of Mahajanga, the disease reappeared in 1991 and caused several outbreaks until 1999. Previous research indicates that the disease was reintroduced to the city of Mahajanga from the Central Highlands instead of reemerging from a local reservoir. However, it is not clear how many reintroductions occurred and when they took place.

    Methodology/Principal findings: In this study we applied a Bayesian phylogeographic model to detect and date migrations of Y. pestis between the two locations that could be linked to the re-emergence of plague in Mahajanga. Genome sequences of 300 Y. pestis strains sampled between 1964 and 2012 were analyzed. Four migrations from the Central Highlands to Mahajanga were detected. Two resulted in persistent transmission in humans, one was responsible for most of the human cases recorded between 1995 and 1999, while the other produced plague cases in 1991 and 1992. We dated the emergence of the Y. pestis sub-branch 1.ORI3, which is only present in Madagascar and Turkey, to the beginning of the 20th century, using a Bayesian molecular dating analysis. The split between 1.ORI3 and its ancestor lineage 1.ORI2 was dated to the second half of the 19th century.

    Conclusions/Significance: Our results indicate that two independent migrations from the Central Highlands caused the plague outbreaks in Mahajanga during the 1990s, with both introductions occurring during the early 1980s. They happened over a decade before the detection of human cases, thus the pathogen likely survived in wild reservoirs until the spillover to humans was possible. This study demonstrates the value of Bayesian phylogenetics in elucidating the re-emergence of infectious diseases.
