Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
AIT Log Data Sets
This repository contains synthetic log data suitable for evaluation of intrusion detection systems, federated learning, and alert aggregation. A detailed description of the dataset is available in [1]. The logs were collected from eight testbeds that were built at the Austrian Institute of Technology (AIT) following the approach by [2]. Please cite these papers if the data is used for academic publications.
In brief, each of the datasets corresponds to a testbed representing a small enterprise network including mail server, file share, WordPress server, VPN, firewall, etc. Normal user behavior is simulated to generate background noise over a time span of 4-6 days. At some point, a sequence of attack steps is launched against the network. Log data is collected from all hosts and includes Apache access and error logs, authentication logs, DNS logs, VPN logs, audit logs, Suricata logs, network traffic packet captures, horde logs, exim logs, syslog, and system monitoring logs. Separate ground truth files are used to label events that are related to the attacks. Compared to the AIT-LDSv1.1, a more complex network and diverse user behavior is simulated, and logs are collected from all hosts in the network. If you are only interested in network traffic analysis, we also provide the AIT-NDS containing the labeled netflows of the testbed networks. We also provide the AIT-ADS, an alert data set derived by forensically applying open-source intrusion detection systems on the log data.
The datasets in this repository have the following structure:
The gather directory contains all logs collected from the testbed. Logs collected from each host are located in gather//logs/.
The labels directory contains the ground truth of the dataset, indicating which events are related to attacks. The directory mirrors the structure of the gather directory so that each label file is located at the same path and has the same name as the corresponding log file. Each line in the label files references the log event corresponding to an attack by the line number counted from the beginning of the file ("line"), the labels assigned to the line that state the respective attack step ("labels"), and the labeling rules that assigned the labels ("rules"). An example is provided below.
The processing directory contains the source code that was used to generate the labels.
The rules directory contains the labeling rules.
The environment directory contains the source code that was used to deploy the testbed and run the simulation using the Kyoushi Testbed Environment.
The dataset.yml file specifies the start and end time of the simulation.
The following table summarizes relevant properties of the datasets:
fox
Simulation time: 2022-01-15 00:00 - 2022-01-20 00:00
Attack time: 2022-01-18 11:59 - 2022-01-18 13:15
Scan volume: High
Unpacked size: 26 GB
harrison
Simulation time: 2022-02-04 00:00 - 2022-02-09 00:00
Attack time: 2022-02-08 07:07 - 2022-02-08 08:38
Scan volume: High
Unpacked size: 27 GB
russellmitchell
Simulation time: 2022-01-21 00:00 - 2022-01-25 00:00
Attack time: 2022-01-24 03:01 - 2022-01-24 04:39
Scan volume: Low
Unpacked size: 14 GB
santos
Simulation time: 2022-01-14 00:00 - 2022-01-18 00:00
Attack time: 2022-01-17 11:15 - 2022-01-17 11:59
Scan volume: Low
Unpacked size: 17 GB
shaw
Simulation time: 2022-01-25 00:00 - 2022-01-31 00:00
Attack time: 2022-01-29 14:37 - 2022-01-29 15:21
Scan volume: Low
Data exfiltration is not visible in DNS logs
Unpacked size: 27 GB
wardbeck
Simulation time: 2022-01-19 00:00 - 2022-01-24 00:00
Attack time: 2022-01-23 12:10 - 2022-01-23 12:56
Scan volume: Low
Unpacked size: 26 GB
wheeler
Simulation time: 2022-01-26 00:00 - 2022-01-31 00:00
Attack time: 2022-01-30 07:35 - 2022-01-30 17:53
Scan volume: High
No password cracking in attack chain
Unpacked size: 30 GB
wilson
Simulation time: 2022-02-03 00:00 - 2022-02-09 00:00
Attack time: 2022-02-07 10:57 - 2022-02-07 11:49
Scan volume: High
Unpacked size: 39 GB
The following attacks are launched in the network:
Scans (nmap, WPScan, dirb)
Webshell upload (CVE-2020-24186)
Password cracking (John the Ripper)
Privilege escalation
Remote command execution
Data exfiltration (DNSteal)
Note that attack parameters and their execution orders vary in each dataset. Labeled log files are trimmed to the simulation time to ensure that their labels (which reference the related event by the line number in the file) are not misleading. Other log files, however, also contain log events generated before or after the simulation time and may therefore be affected by testbed setup or data collection. It is therefore recommended to only consider logs with timestamps within the simulation time for analysis.
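A minimal sketch of such a filter, assuming dataset.yml exposes the simulation window under keys named start and end and that a per-format timestamp parser is supplied by the caller (both the key names and the parsing are assumptions and should be checked against the dataset):

```python
# Sketch: keep only log lines whose timestamps fall within the simulation window.
# Assumptions: dataset.yml stores the window under keys "start" and "end" (verify
# against the actual file), and the caller supplies a parser that extracts a
# datetime from a raw log line (formats differ between log sources).
from datetime import datetime
import yaml  # PyYAML

def load_window(dataset_yml="dataset.yml"):
    with open(dataset_yml) as f:
        meta = yaml.safe_load(f)
    return meta["start"], meta["end"]  # assumed key names

def within_simulation(lines, parse_ts, start: datetime, end: datetime):
    """Yield only lines whose parsed timestamp lies inside [start, end]."""
    for line in lines:
        ts = parse_ts(line)
        if ts is not None and start <= ts <= end:
            yield line
```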
The structure of the labels is explained below using the audit logs from the intranet server in the russellmitchell dataset as an example. The first four labels in the labels/intranet_server/logs/audit/audit.log file are as follows:
{"line": 1860, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}
{"line": 1861, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}
{"line": 1862, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}
{"line": 1863, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}
Each JSON object in this file assigns a label to one specific log line in the corresponding log file located at gather/intranet_server/logs/audit/audit.log. The field "line" specifies the line number of the respective event in the original log file, while the field "labels" comprises the corresponding labels. For example, the sample above states that lines 1860-1863 in the gather/intranet_server/logs/audit/audit.log file are labeled with "attacker_change_user" and "escalate", corresponding to the attack step where the attacker obtains escalated privileges. Inspecting these lines shows that they indeed correspond to the user authenticating as root:
type=USER_AUTH msg=audit(1642999060.603:2226): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:authentication acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'
type=USER_ACCT msg=audit(1642999060.603:2227): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:accounting acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'
type=CRED_ACQ msg=audit(1642999060.615:2228): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:setcred acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'
type=USER_START msg=audit(1642999060.627:2229): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:session_open acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'
The same applies to all other labels for this log file and all other log files. There are no labels for logs generated by "normal" (i.e., non-attack) behavior; instead, all log events that have no corresponding JSON object in one of the files from the labels directory, such as lines 1-1859 in the example above, can be considered to be labeled as "normal". This means that, to determine the labels for the log data, it is necessary to track line numbers while processing the original logs from the gather directory and check whether those line numbers appear in the corresponding file in the labels directory.
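A minimal sketch of this lookup in Python, following the russellmitchell example above (the helper name and the "normal" default label are illustrative, not part of the dataset itself):

```python
# Sketch: attach labels to raw log lines by line number.
# Lines without an entry in the label file are treated as "normal".
import json

def label_log(log_path, label_path):
    # Map line number -> list of labels, as given in the label file.
    labels_by_line = {}
    with open(label_path) as f:
        for raw in f:
            entry = json.loads(raw)
            labels_by_line[entry["line"]] = entry["labels"]

    # Pair every log line (1-indexed) with its labels, defaulting to ["normal"].
    with open(log_path, errors="replace") as f:
        for lineno, line in enumerate(f, start=1):
            yield lineno, labels_by_line.get(lineno, ["normal"]), line.rstrip("\n")

# Example (russellmitchell):
# for no, labels, line in label_log(
#         "gather/intranet_server/logs/audit/audit.log",
#         "labels/intranet_server/logs/audit/audit.log"):
#     ...
```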
Besides the attack labels, an overview of the exact times when specific attack steps are launched is available in gather/attacker_0/logs/attacks.log. An enumeration of all hosts and their IP addresses is stated in processing/config/servers.yml. Moreover, configurations of each host are provided in gather//configs/ and gather//facts.json.
Version history:
AIT-LDS-v1.x: Four datasets, logs from single host, fine-granular audit logs, mail/CMS.
AIT-LDS-v2.0: Eight datasets, logs from all hosts, system logs and network traffic, mail/CMS/cloud/web.
Acknowledgements: Partially funded by the FFG projects INDICAETING (868306) and DECEPT (873980), and the EU projects GUARD (833456) and PANDORA (SI2.835928).
If you use the dataset, please cite the following publications:
[1] M. Landauer, F. Skopik, M. Frank, W. Hotwagner, M. Wurzenberger, and A. Rauber, "Maintainable Log Datasets for Evaluation of Intrusion Detection Systems," IEEE Transactions on Dependable and Secure Computing, vol. 20, no. 4, pp. 3466-3482, doi: 10.1109/TDSC.2022.3201582.
[2] M. Landauer, F. Skopik, M. Wurzenberger, W. Hotwagner, and A. Rauber, "Have it Your Way: Generating Customized Log Datasets With a Model-Driven Simulation Testbed," IEEE Transactions on Reliability, vol. 70, no. 1, pp. 402-415, March 2021, doi: 10.1109/TR.2020.3031317.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the data repository for the LO2 dataset.
Here is an overview of the contents.
lo2-data.zip
This is the main dataset. This is the completely unedited output of our data collection process. Note that the uncompressed size is around 540 GB. For more information, see the paper and the data-appendix in this repository.
lo2-sample.zip
This is a sample that contains the data used for the preliminary analysis. It contains only service logs and the most relevant metrics for the first 100 runs. Furthermore, the metrics are combined at the run level into a single CSV to make them easier to use.
data-appendix.pdf
This document contains further details and stats about the full dataset. These include file size distributions, empty file analysis, log type analysis and the appearance of an unknown file.
lo2-scripts.zip
Various scripts for processing the data: creating the sample, conducting the preliminary analysis, and producing the statistics shown in the data-appendix.
Version v2: Fixed LogLead version number and minor changes in scripts
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
README.txt
Maintenance example belonging to:
The MANTIS Book: Cyber Physical System Based Proactive Collaborative Maintenance, Chapter 9, The Future of Maintenance (2019). Lambert Schomaker, Michele Albano, Erkki Jantunen, Luis Lino Ferreira. River Publishers (DK). ISBN: 9788793609853, e-ISBN: 9788793609846, https://doi.org/10.13052/rp-9788793609846
The figure PDF did not make it into the book. Provided here are the raw data, the processed logs, and the .gnu script to produce it.
Data: event logs on disk failure in two racks of a huge RAID disk system (2009-2016): disks1.raw, disks2.raw
Processing: RC-filt-disks-log.c together with the bash script do-RC-filter-to-make-spikes-more-visible converts the event logs into RC-filtered time series: disks1.log, disks2.log
Disrupted-operations-threshold: constant (horizontal line) indicating the level at which users experienced system-down time
Plot: disk-replacement-log.gnu produces disk-replacement-log.pdf
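As one plausible reading of the RC filtering step (the actual RC-filt-disks-log.c may differ; the input binning and time constant below are assumptions), a short Python sketch of a first-order RC response that turns per-interval failure counts into decaying spikes:

```python
# Sketch of RC-style (exponential) filtering of a disk-failure event series.
# One plausible reading of RC-filt-disks-log.c, not its actual implementation;
# the input binning and time constant are illustrative assumptions.
def rc_filter(event_counts, dt=1.0, tau=30.0):
    """event_counts: failure events per time bin of width dt.
    tau: RC time constant in the same units as dt.
    Returns the filtered series, where each event adds a spike that decays."""
    alpha = dt / tau
    y, out = 0.0, []
    for x in event_counts:
        y += x               # each event injects a spike ...
        y *= (1.0 - alpha)   # ... which decays with time constant tau
        out.append(y)
    return out
```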
The entire dataset (all 7 files) contains detailed information on a time series study of log decomposition in interior Alaska. The species studied include white and black spruce, aspen, birch, balsam poplar and aspen starting as green trees. In addition, white and black spruce in recently burned sites are included. The study was designed to produce a time series of log decomposition measurements over the next 100 years. The information to be measured on the logs includes weight and density changes over specified time periods, changes in nutrient concentrations and in hemicellulose, cellulose, and lignin concentrations, and changes in the quantity of nutrients, hemicellulose, cellulose, and lignin. (This file contains the sample nutrient analysis data for the log decomposition study.)
Data usage terms: https://www.gesis.org/en/institute/data-usage-terms
Objective: The PIAAC 2012 study was the first fully computer-based large scale assessment in education. During the assessment, user interactions were logged automatically. This means that most of the users’ actions within the assessment tool were recorded and stored with time stamps in separate files called log files. The log files contain paradata for each participant in the domains literacy, numeracy, and problem solving in technology-rich environments. The availability of these log files offers new opportunities to researchers, for instance to reproduce test-taking behavior of individuals and to better understand test-taking behavior.
Method: PIAAC 2012 was conducted from August 2011 to November 2012 among a representative international sample of around 166,000 adults in 24 different countries. The following dataset includes the log files from 17 countries. Each country was allowed to choose its own sampling technique as long as the technique applies full selection probability methods to select a representative sample from the PIAAC target population. The countries were able to oversample particular subgroups of the target population: persons aged 55-65 and recent immigrants were oversampled in Denmark, and persons aged 19-26 were oversampled in Poland. The administration of the background questionnaires was conducted face-to-face using computer-assisted personal interviewing (CAPI). After the questionnaire, the respondent completed a computer-based or paper-based cognitive assessment under the supervision of the interviewer in one or two of the following competence domains: literacy, numeracy, and problem solving in technology-rich environments.
Variables: With the help of the PIAAC LogDataAnalyzer you can generate a data set. The Log Data Extraction software is a self-contained system that manages activities like data extraction, data cleaning, and visualization of OECD-PIAAC 2012 assessment log data files. It serves as a basis for data-related analysis tasks using the tool itself or by exporting the cleaned data to external tools such as statistics packages. You can generate the following variables: Number of Using Cancel Button, Number of Using Help Menu, Time on Task, Time Till the First Interaction, Final Response, Number of Switching Environment, Sequence of Switching Environment, Number of Highlight Events, Time Since Last Answer Interaction, Number of Created Emails, Sequence of Viewed Emails, Number of Different Email Views, Number of Revisited Emails, Number of Email Views, Sequence of Visited Webpages, Time-Sequence of Spent Time on Webpages, Number of Different Page Visits, Number of Page Visits, Number of Page Revisits.
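As an illustration of how such variables relate to the underlying paradata, a hedged sketch that derives two of them (Time on Task and Number of Page Visits) from a hypothetical table of timestamped events; the column and event names are assumptions, not the LogDataAnalyzer's actual output schema:

```python
# Hedged sketch: derive "Time on Task" and "Number of Page Visits" per respondent
# and item from a hypothetical event table. Column names (respondent_id, item_id,
# timestamp, event_type) are illustrative, not the actual PIAAC log schema.
import pandas as pd

def derive_variables(events: pd.DataFrame) -> pd.DataFrame:
    events = events.sort_values("timestamp")
    grouped = events.groupby(["respondent_id", "item_id"])
    time_on_task = grouped["timestamp"].agg(
        lambda t: (t.max() - t.min()).total_seconds())
    page_visits = grouped["event_type"].agg(
        lambda e: int((e == "page_visit").sum()))
    return pd.DataFrame({"time_on_task_s": time_on_task,
                         "n_page_visits": page_visits})
```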
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the CLUE-LDS (CLoud-based User Entity behavior analytics Log Data Set). The data set contains log events from real users utilizing a cloud storage system and is suitable for User Entity Behavior Analytics (UEBA). Events include logins, file accesses, link shares, config changes, etc. The data set contains around 50 million events generated by more than 5,000 distinct users over more than five years (2017-07-07 to 2022-09-29, or 1910 days). The data set is complete except for 109 events missing on 2021-04-22, 2021-08-20, and 2021-09-05 due to a database failure. The unpacked file size is around 14.5 GB. A detailed analysis of the data set is provided in [1].
The logs are provided in JSON format. The first-level attributes, as visible in the sample below, include id, time, type, uid, uidType, params, and, where available, isLocalIP and location.
In the following data sample, the first object depicts a successful user login (see type: login_successful) and the second object depicts a file access (see type: file_accessed) from a remote location:
{"params": {"user": "intact-gray-marlin-trademarkagent"}, "type": "login_successful", "time": "2019-11-14T11:26:43Z", "uid": "intact-gray-marlin-trademarkagent", "id": 21567530, "uidType": "name"}
{"isLocalIP": false, "params": {"path": "/proud-copper-orangutan-artexer/doubtful-plum-ptarmigan-merchant/insufficient-amaranth-earthworm-qualitycontroller/curious-silver-galliform-tradingstandards/incredible-indigo-octopus-printfinisher/wicked-bronze-sloth-claimsmanager/frantic-aquamarine-horse-cleric"}, "type": "file_accessed", "time": "2019-11-14T11:26:51Z", "uid": "graceful-olive-spoonbill-careersofficer", "id": 21567531, "location": {"countryCode": "AT", "countryName": "Austria", "region": "4", "city": "Gmunden", "latitude": 47.915, "longitude": 13.7959, "timezone": "Europe/Vienna", "postalCode": "4810", "metroCode": null, "regionName": "Upper Austria", "isInEuropeanUnion": true, "continent": "Europe", "accuracyRadius": 50}, "uidType": "ipaddress"}
The data set was generated at the premises of Huemer Group, a midsize IT service provider located in Vienna, Austria. Huemer Group offers a range of Infrastructure-as-a-Service solutions for enterprises, including cloud computing and storage. In particular, their cloud storage solution called hBOX enables customers to upload their data, synchronize them with multiple devices, share files with others, create versions and backups of their documents, collaborate with team members in shared data spaces, and query the stored documents using search terms. The hBOX extends the open-source project Nextcloud with interfaces and functionalities tailored to the requirements of customers.
The data set comprises only normal user behavior, but it can be used to evaluate anomaly detection approaches by simulating account hijacking. We provide an implementation for identifying similar users, switching pairs of users to simulate changes of behavior patterns, and a sample detection approach in our GitHub repository.
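The user-switching idea can be sketched as follows: after a chosen cut-over time, the events of two (ideally similar) users are relabeled with each other's identity, so each account suddenly continues with the other's behavior. This is a simplified stand-in for the implementation in the GitHub repository, not the authors' actual code:

```python
# Simplified stand-in for the user-switching approach: after switch_time, swap
# the identities of two users so that each account continues with the other's
# behavior. Not the implementation from the linked repository.
def switch_users(events, user_a, user_b, switch_time):
    """events: iterable of dicts as in the sample above; switch_time: ISO 8601
    string (lexicographic comparison works for the uniform format used here)."""
    swapped = []
    for event in events:
        e = dict(event)
        if e["time"] >= switch_time:
            if e["uid"] == user_a:
                e["uid"] = user_b
            elif e["uid"] == user_b:
                e["uid"] = user_a
        swapped.append(e)
    return swapped
```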
Acknowledgements: Partially funded by the FFG project DECEPT (873980). The authors thank Walter Huemer, Oskar Kruschitz, Kevin Truckenthanner, and Christian Aigner from Huemer Group for supporting the collection of the data set.
If you use the dataset, please cite the following publication:
[1] M. Landauer, F. Skopik, G. Höld, and M. Wurzenberger, "A User and Entity Behavior Analytics Log Data Set for Anomaly Detection in Cloud Computing," 2022 IEEE International Conference on Big Data - 6th International Workshop on Big Data Analytics for Cyber Intelligence and Defense (BDA4CID 2022), December 17-20, 2022, Osaka, Japan. IEEE.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Example of log file data from PISA 2012 problem solving.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Bayesian Target-Vector Optimization for Efficient Parameter Reconstruction
This publication contains the research data and example scripts for the paper “Bayesian Target-Vector Optimization for Efficient Parameter Reconstruction” [1]. The research data is found in the directory research_data; the example scripts are found in the directory example_scripts.
The research data contains all necessary information to be able to reconstruct the figures and values given in the paper, as well as all result figures shown. Where possible, the directories contain the necessary scripts to recreate the results themselves, up to stochastic variations.
The example scripts are intended to show how one can (i) perform a least-squares-type optimization of a model function (here we focus on the analytical model functions MGH17 and Gauss3, as described in the paper) using various methods (BTVO, LM, BO, L-BFGS-B, NM, including using derivative information when applicable), and (ii) perform Markov chain Monte Carlo (MCMC) sampling around the found maximum likelihood estimate (MLE) to estimate the uncertainties of the MLE parameters (both using a surrogate model of the actual model function and using the actual model function directly).
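As a small, self-contained illustration of point (i), the sketch below fits the MGH17 model function (in its common NIST StRD form, y = b1 + b2*exp(-b4*x) + b3*exp(-b5*x)) to synthetic data with SciPy's generic Levenberg-Marquardt solver; this is not the BTVO method or the JCMsuite toolkit used in the paper, and the data, noise level, and starting values are made up for the example:

```python
# Illustrative least-squares reconstruction of the MGH17 parameters with SciPy.
# Generic LM fit, not the paper's BTVO method or JCMsuite toolkit; the synthetic
# data, noise level, and starting values are assumptions for the example.
import numpy as np
from scipy.optimize import least_squares

def mgh17(b, x):
    # Common NIST StRD form: b1 + b2*exp(-b4*x) + b3*exp(-b5*x)
    return b[0] + b[1] * np.exp(-b[3] * x) + b[2] * np.exp(-b[4] * x)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 320.0, 33)
b_true = np.array([0.37, 1.94, -1.46, 0.013, 0.022])
y = mgh17(b_true, x) + rng.normal(scale=1e-3, size=x.size)

fit = least_squares(lambda b: mgh17(b, x) - y,
                    x0=[0.5, 1.5, -1.0, 0.01, 0.02], method="lm")
print(fit.x)  # reconstructed parameters, close to b_true
```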
Research data
Contained are directories for the experimental problem GIXRF and the two analytical model functions MGH17 and Gauss3. What follows is a listing of the directories and their contents:
gauss3_optimization: Optimization logs for the Gauss3 model function for BTVO, LM, BO, L-BFGS-B, NM (with derivatives when applicable), .npy files used for generating the plots, a benchmark.py file used for the generation of the data, as well as the plots shown in the paper.
mgh17_optimization: Optimization logs for the MGH17 model function for BTVO, LM, BO, L-BFGS-B, NM (with derivatives when applicable), .npy files used for generating the plots, a benchmark.py file used for the generation of the data, as well as the plots shown in the paper.
mgh17_mcmc_analytical: Scripts for the creation of the plots (does not use an optimization log), as well as plots shown in the paper. This uses the model function directly to perform the MCMC sampling.
mgh17_mcmc_surrogate: Optimization log of the MGH17 function used for the creation of the MCMC plots, scripts for the creation of the plots (use the optimization log), as well as plots shown in the paper. This uses a surrogate model to perform the MCMC sampling.
gixrf_optimization: benchmark.py file to perform the optimization, the optimization logs for the various methods (BTVO, LM, BO, L-BFGS-B, NM), .npy files and scripts used for the creation of the plots, and the plots shown in the paper.
gixrf_mcmc_supplement: optimization log used for the creation of the plot, pickle file used for the creation of the plot, script to create the MCMC plot.
gixrf_optimum_difference_supplement: optimization logs of BTVO optimization of the GIXRF problem, scripts to create the difference/error plots shown for the GIXRF problem in the supplement, and the plots themselves.
Employed software for creating the research data
The software used in the creation is:
JCMsuite Analysis and Optimization toolkit, development version, commit d55e99b (the closest commercial release is found in JCMsuite version 5.0.2)
A list of Python packages installed (excerpt from conda list, name and version)
corner 2.1.0
emcee 3.0.2
jax 0.2.22
jaxlib 0.1.72
matplotlib 3.2.1
numba 0.40.1
numpy 1.18.1
pandas 0.24.1
python 3.7.11
scikit-optimize 0.7.4
scipy 1.7.1
tikzplotlib 0.9.9
JCMsuite 4.6.3 for the evaluation of the experimental model
Example scripts
This directory contains a few sample files that show how parameter reconstructions can be performed using the JCMsuite analysis and optimization toolbox, with a particular focus on the Bayesian target-vector optimization method shown in the paper.
It also contains example files that show how an uncertainty quantification can be performed using MCMC, both directly using a model function, as well as using a surrogate model of the model function.
What follows is a listing of the contents of the directory:
mcmc_mgh17_analytical.py: performs an MCMC analysis of the MGH17 model function directly, without constructing a surrogate model. Uses emcee.
mcmc_mgh17_surrogate.py: performs an MCMC analysis of the MGH17 model function by constructing a surrogate model of the model function. Uses the JCMsuite analysis and optimization toolbox.
opt_gauss3.py: performs a parameter reconstruction of the Gauss3 model function using various methods (BTVO, LM, BO, L-BFGS-B, NM, with derivatives when applicable).
opt_mgh17.py: performs a parameter reconstruction of the MGH17 model function using various methods (BTVO, LM, BO, L-BFGS-B, NM, with derivatives when applicable).
util/model_functions.py: contains the MGH17 and Gauss3 model functions, their (automatic) derivatives, and objective functions used in the optimizations.
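And for point (ii), a minimal emcee sketch in the spirit of mcmc_mgh17_analytical.py (not the actual script): MCMC sampling around an MLE of a simple exponential-decay model that stands in for MGH17, assuming independent Gaussian noise with known sigma and a flat prior:

```python
# Minimal MCMC-around-the-MLE sketch with emcee, in the spirit of
# mcmc_mgh17_analytical.py but not the actual script. A simple exponential-decay
# model stands in for MGH17; Gaussian noise with known sigma and a flat prior
# are assumptions.
import numpy as np
import emcee
from scipy.optimize import least_squares

def model(b, x):
    return b[0] + b[1] * np.exp(-b[2] * x)

rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 50)
b_true, sigma = np.array([0.5, 2.0, 0.7]), 0.02
y = model(b_true, x) + rng.normal(scale=sigma, size=x.size)

# Maximum likelihood estimate via least squares.
mle = least_squares(lambda b: model(b, x) - y, x0=[1.0, 1.0, 1.0]).x

def log_prob(b):
    return -0.5 * np.sum(((model(b, x) - y) / sigma) ** 2)  # flat prior assumed

nwalkers, ndim = 32, 3
p0 = mle + 1e-4 * rng.standard_normal((nwalkers, ndim))
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob)
sampler.run_mcmc(p0, 2000)
samples = sampler.get_chain(discard=500, flat=True)
print(samples.std(axis=0))  # rough 1-sigma uncertainties of the parameters
```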
Requirements to execute the example scripts
These scripts have been developed and tested under Linux, Debian 10. We have tried to make sure that they would also work in a Windows environment, but can unfortunately give no guarantees for that.
We mainly use Python to run the reconstructions. To execute the files, a few Python packages have to be installed. In addition to the usual scientific Python stack (NumPy, SciPy, matplotlib, pandas, etc.), the packages jax and jaxlib (for automatic differentiation of Python/NumPy functions), emcee and corner (for MCMC sampling and subsequent plotting of the results) have to be installed.
This can be achieved for example using pip, e.g.
pip install -r requirements.txt
Additionally, JCMsuite has to be installed. For this you can visit [2] and download a free trial version.
On Linux, the installation has to be added to the PATH, e.g. by adding the following to your .bashrc file:
export JCMROOT=/FULL/PATH/TO/BASE/DIRECTORY
export PATH=$JCMROOT/bin:$PATH
export PYTHONPATH=$JCMROOT/ThirdPartySupport/Python:$PYTHONPATH
Bibliography
[1] M. Plock, K. Andrle, S. Burger, P.-I. Schneider, Bayesian Target-Vector Optimization for Efficient Parameter Reconstruction. Adv. Theory Simul. 5, 2200112 (2022).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset corresponds to the problems analyzed in the work "Assessing Reproducibility in Screenshot-Based Task Mining: A Decision Discovery Perspective," published in the Information Systems Journal.
The artifacts provided correspond to multiple instances of the execution of a particular process model, covering all its variants in the form of UI Logs. These UI Logs are divided into two groups:
Additionally, UI Logs are synthetically generated from an original UI Log in both cases.
For generating the UI Logs, a real-world process based on handling unsubscription requests from users of a telephone company has been selected. This case was selected based on the following criteria: (1) the process is replicated from a real company, and (2) decision discovery relies on visual elements present in the screenshots, specifically an email attachment and a checkbox in a web form. The selected process consists of 10 activities, a single decision point, and 4 process variants.
The dataset includes problem instances organized into subfolders based on different problem characteristics, named ProblemType_LogSize_Balanced, where LogSize is one of {75, 100, 300, 500} and Balanced is either Balanced or Imbalanced. Each problem subfolder contains the corresponding UI Log and associated screenshots. Each subfolder includes:
log.csv: A CSV file containing the UI log data.
1_img.png: A sample screenshot image.
1_img.png.json: JSON file containing metadata for the corresponding screenshot.
flattened_dataset.csv: A flattened version of the dataset used for decision tree analysis.
preprocessed_df.csv: Preprocessed data frame used for analysis.
decision_tree.log: Log file documenting the decision tree process.
CHAID-tree-feature-importance.csv: CSV file detailing feature importance from the CHAID decision tree.
bpmn.bpmn: BPMN file representing the process model.
bpmn.dot: DOT file representing the BPMN process model.
pn.dot: DOT file representing the Petri net process model.
traceability.json: JSON file mapping decision point branches to rules from the decision model.
collect_results.py: Script to collect experiment results.
db_populate.json: Configuration file for populating the database.
hierarchy_constructor.py: Script to construct the hierarchy of UI elements.
models_populate.json: Configuration file for populating models.
process_logs.py: Script to process UI logs.
process_reproducibility_data.py: Script to process reproducibility data.
process_uielements.py: Script to process UI elements.
run_experiments.py: Script to run experiments.
run_experiments.sh: Shell script to execute the experiments.
To create the evaluation objects, we generated event logs of different sizes (|L|) by deriving events from the sample event log. We consider log sizes of {75, 100, 300, 500} events. Each log contains complete process instances, ensuring that if an additional instance exceeds |L|, it is removed.
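A hedged sketch of that size-capping rule: complete traces from the (synthetically varied) seed log are added as long as they fit within the target number of events, and an instance that would push the log past |L| is left out. This illustrates the rule as stated, not the authors' generation code:

```python
# Sketch of the size-capping rule: add complete process instances (traces) as
# long as the total number of events stays within the target size |L|; an
# instance that would exceed it is left out. Illustrative only, not the
# dataset authors' generation code.
import random

def build_log(traces, target_size, seed=0):
    """traces: list of traces, each a list of events."""
    rng = random.Random(seed)
    pool = list(traces)
    rng.shuffle(pool)
    log, n_events = [], 0
    for trace in pool:
        if n_events + len(trace) > target_size:
            continue  # this instance would exceed |L|, so it is removed
        log.append(trace)
        n_events += len(trace)
    return log

# Example: logs of the sizes used in the dataset.
# logs = {size: build_log(seed_traces, size) for size in (75, 100, 300, 500)}
```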
To average results across different problem instances, we trained decision trees 30 times on synthetic variations of the dataset, obtaining the mean of the metrics as experiment metadata.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Plague is a zoonotic disease caused by the bacterium Yersinia pestis, highly prevalent in the Central Highlands, a mountainous region in the center of Madagascar. After a plague-free period of over 60 years in the northwestern coast city of Mahajanga, the disease reappeared in 1991 and caused several outbreaks until 1999. Previous research indicates that the disease was reintroduced to the city of Mahajanga from the Central Highlands instead of reemerging from a local reservoir. However, it is not clear how many reintroductions occurred and when they took place.
Methodology/Principal findings: In this study we applied a Bayesian phylogeographic model to detect and date migrations of Y. pestis between the two locations that could be linked to the re-emergence of plague in Mahajanga. Genome sequences of 300 Y. pestis strains sampled between 1964 and 2012 were analyzed. Four migrations from the Central Highlands to Mahajanga were detected. Two resulted in persistent transmission in humans; one was responsible for most of the human cases recorded between 1995 and 1999, while the other produced plague cases in 1991 and 1992. We dated the emergence of the Y. pestis sub-branch 1.ORI3, which is only present in Madagascar and Turkey, to the beginning of the 20th century, using a Bayesian molecular dating analysis. The split between 1.ORI3 and its ancestor lineage 1.ORI2 was dated to the second half of the 19th century.
Conclusions/Significance: Our results indicate that two independent migrations from the Central Highlands caused the plague outbreaks in Mahajanga during the 1990s, with both introductions occurring during the early 1980s. They happened over a decade before the detection of human cases, thus the pathogen likely survived in wild reservoirs until the spillover to humans was possible. This study demonstrates the value of Bayesian phylogenetics in elucidating the re-emergence of infectious diseases.