87 datasets found
  1. AIT Log Data Set V1.1

    • zenodo.org
    • explore.openaire.eu
    zip
    Updated Oct 18, 2023
    + more versions
    Cite
    Max Landauer; Florian Skopik; Markus Wurzenberger; Wolfgang Hotwagner; Andreas Rauber (2023). AIT Log Data Set V1.1 [Dataset]. http://doi.org/10.5281/zenodo.4264796
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 18, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Max Landauer; Florian Skopik; Markus Wurzenberger; Wolfgang Hotwagner; Andreas Rauber
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    AIT Log Data Sets

    This repository contains synthetic log data suitable for evaluation of intrusion detection systems. The logs were collected from four independent testbeds that were built at the Austrian Institute of Technology (AIT) following the approach by Landauer et al. (2020) [1]. Please refer to the paper for more detailed information on automatic testbed generation and cite it if the data is used for academic publications. In brief, each testbed simulates user accesses to a webserver that runs Horde Webmail and OkayCMS. The duration of the simulation is six days. On the fifth day (2020-03-04) two attacks are launched against each web server.

    The archive AIT-LDS-v1_0.zip contains the directories "data" and "labels".

    The data directory is structured as follows. Each directory mail.

    Setup details of the web servers:

    • OS: Debian Stretch 9.11.6
    • Services:
      • Apache2
      • PHP7
      • Exim 4.89
      • Horde 5.2.22
      • OkayCMS 2.3.4
      • Suricata
      • ClamAV
      • MariaDB

    Setup details of user machines:

    • OS: Ubuntu Bionic
    • Services:
      • Chromium
      • Firefox

    User host machines are assigned to web servers in the following way:

    • mail.cup.com is accessed by users from host machines user-{0, 1, 2, 6}
    • mail.spiral.com is accessed by users from host machines user-{3, 5, 8}
    • mail.insect.com is accessed by users from host machines user-{4, 9}
    • mail.onion.com is accessed by users from host machines user-{7, 10}

    The following attacks are launched against the web servers (different starting times for each web server, please check the labels for exact attack times):

    • Attack 1: multi-step attack with sequential execution of the following attacks:
      • nmap scan
      • nikto scan
      • smtp-user-enum tool for account enumeration
      • hydra brute force login
      • webshell upload through Horde exploit (CVE-2019-9858)
      • privilege escalation through Exim exploit (CVE-2019-10149)
    • Attack 2: webshell injection through malicious cookie (CVE-2019-16885)

    Attacks are launched from the following user host machines. In each of the corresponding directories user-

    • user-6 attacks mail.cup.com
    • user-5 attacks mail.spiral.com
    • user-4 attacks mail.insect.com
    • user-7 attacks mail.onion.com

    The log data collected from the web servers includes

    • Apache access and error logs
    • syscall logs collected with the Linux audit daemon
    • suricata logs
    • exim logs
    • auth logs
    • daemon logs
    • mail logs
    • syslogs
    • user logs


    Note that due to their large size, the audit/audit.log files of each server were compressed into a .zip archive. If these logs are needed for analysis, they must first be unzipped.
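
    A minimal Python sketch for unpacking them in place (the glob pattern and directory layout are assumptions based on the description above, not a documented structure):

    import zipfile
    from pathlib import Path

    # Unpack every zipped audit log found under the extracted data directory;
    # adjust the pattern to the actual layout of AIT-LDS-v1_0.zip.
    for archive in Path("data").glob("**/audit/*.zip"):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(archive.parent)  # audit.log lands next to the archive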

    Labels are organized in the same directory structure as logs. Each file contains two labels for each log line separated by a comma, the first one based on the occurrence time, the second one based on similarity and ordering. Note that this does not guarantee correct labeling for all lines and that no manual corrections were conducted.
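
    A minimal Python sketch for pairing log lines with their two labels; the relative path is hypothetical and is assumed to exist under both the data and labels directories, as described:

    from collections import Counter

    rel = "mail.cup.com/apache2/access.log"  # hypothetical relative path
    label_counts = Counter()
    with open(f"data/{rel}") as logs, open(f"labels/{rel}") as label_file:
        for event, label_line in zip(logs, label_file):
            # first label: occurrence time; second: similarity and ordering
            time_label, similarity_label = label_line.strip().split(",", 1)
            label_counts[(time_label, similarity_label)] += 1
    print(label_counts)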

    Version history and related data sets:

    • AIT-LDS-v1.0: Four datasets, logs from single host, fine-granular audit logs, mail/CMS.
      • AIT-LDS-v1.1: Removed carriage return of line endings in audit.log files.
    • AIT-LDS-v2.0: Eight datasets, logs from all hosts, system logs and network traffic, mail/CMS/cloud/web.

    Acknowledgements: Partially funded by the FFG projects INDICAETING (868306) and DECEPT (873980), and the EU project GUARD (833456).

    If you use the dataset, please cite the following publication:

    [1] M. Landauer, F. Skopik, M. Wurzenberger, W. Hotwagner and A. Rauber, "Have it Your Way: Generating Customized Log Datasets With a Model-Driven Simulation Testbed," in IEEE Transactions on Reliability, vol. 70, no. 1, pp. 402-415, March 2021, doi: 10.1109/TR.2020.3031317. [PDF]

  2. E-Commerce Website Logs

    • kaggle.com
    Updated Dec 15, 2023
    Cite
    KZ Data Lover (2023). E-Commerce Website Logs [Dataset]. https://www.kaggle.com/datasets/kzmontage/e-commerce-website-logs
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 15, 2023
    Dataset provided by
    Kaggle
    Authors
    KZ Data Lover
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    This is an e-commerce website log dataset created to help data analysts practice exploratory data analysis and data visualization. The dataset records when the website was accessed, the IP address of the source, the country, the language in which the website was accessed, and the amount of sales made by that IP address.

    Included columns:

    • Time and duration of accessing the website
    • Country, language, and platform in which it was accessed
    • Number of bytes used and IP address of the person accessing the website
    • Sales or return amount of that person

  3. AIT Log Data Set V2.0

    • zenodo.org
    • explore.openaire.eu
    • +1 more
    zip
    Updated Jun 28, 2024
    + more versions
    Cite
    Max Landauer; Florian Skopik; Maximilian Frank; Wolfgang Hotwagner; Markus Wurzenberger; Andreas Rauber (2024). AIT Log Data Set V2.0 [Dataset]. http://doi.org/10.5281/zenodo.5789064
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 28, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Max Landauer; Florian Skopik; Maximilian Frank; Wolfgang Hotwagner; Markus Wurzenberger; Andreas Rauber
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    AIT Log Data Sets

    This repository contains synthetic log data suitable for evaluation of intrusion detection systems, federated learning, and alert aggregation. A detailed description of the dataset is available in [1]. The logs were collected from eight testbeds that were built at the Austrian Institute of Technology (AIT) following the approach by [2]. Please cite these papers if the data is used for academic publications.

    In brief, each of the datasets corresponds to a testbed representing a small enterprise network including mail server, file share, WordPress server, VPN, firewall, etc. Normal user behavior is simulated to generate background noise over a time span of 4-6 days. At some point, a sequence of attack steps is launched against the network. Log data is collected from all hosts and includes Apache access and error logs, authentication logs, DNS logs, VPN logs, audit logs, Suricata logs, network traffic packet captures, Horde logs, Exim logs, syslog, and system monitoring logs. Separate ground truth files are used to label events that are related to the attacks. Compared to the AIT-LDSv1.1, a more complex network and more diverse user behavior are simulated, and logs are collected from all hosts in the network. If you are only interested in network traffic analysis, we also provide the AIT-NDS containing the labeled netflows of the testbed networks. We also provide the AIT-ADS, an alert data set derived by forensically applying open-source intrusion detection systems on the log data.

    The datasets in this repository have the following structure:

    • The gather directory contains all logs collected from the testbed. Logs collected from each host are located in gather/.
    • The labels directory contains the ground truth of the dataset that indicates which events are related to attacks. The directory mirrors the structure of the gather directory so that each label file is located at the same path and has the same name as the corresponding log file. Each line in the label files references the log event corresponding to an attack by the line number counted from the beginning of the file ("line"), the labels assigned to the line that state the respective attack step ("labels"), and the labeling rules that assigned the labels ("rules"). An example is provided below.
    • The processing directory contains the source code that was used to generate the labels.
    • The rules directory contains the labeling rules.
    • The environment directory contains the source code that was used to deploy the testbed and run the simulation using the Kyoushi Testbed Environment.
    • The dataset.yml file specifies the start and end time of the simulation.

    The following table summarizes relevant properties of the datasets:

    • fox
      • Simulation time: 2022-01-15 00:00 - 2022-01-20 00:00
      • Attack time: 2022-01-18 11:59 - 2022-01-18 13:15
      • Scan volume: High
      • Unpacked size: 26 GB
    • harrison
      • Simulation time: 2022-02-04 00:00 - 2022-02-09 00:00
      • Attack time: 2022-02-08 07:07 - 2022-02-08 08:38
      • Scan volume: High
      • Unpacked size: 27 GB
    • russellmitchell
      • Simulation time: 2022-01-21 00:00 - 2022-01-25 00:00
      • Attack time: 2022-01-24 03:01 - 2022-01-24 04:39
      • Scan volume: Low
      • Unpacked size: 14 GB
    • santos
      • Simulation time: 2022-01-14 00:00 - 2022-01-18 00:00
      • Attack time: 2022-01-17 11:15 - 2022-01-17 11:59
      • Scan volume: Low
      • Unpacked size: 17 GB
    • shaw
      • Simulation time: 2022-01-25 00:00 - 2022-01-31 00:00
      • Attack time: 2022-01-29 14:37 - 2022-01-29 15:21
      • Scan volume: Low
      • Data exfiltration is not visible in DNS logs
      • Unpacked size: 27 GB
    • wardbeck
      • Simulation time: 2022-01-19 00:00 - 2022-01-24 00:00
      • Attack time: 2022-01-23 12:10 - 2022-01-23 12:56
      • Scan volume: Low
      • Unpacked size: 26 GB
    • wheeler
      • Simulation time: 2022-01-26 00:00 - 2022-01-31 00:00
      • Attack time: 2022-01-30 07:35 - 2022-01-30 17:53
      • Scan volume: High
      • No password cracking in attack chain
      • Unpacked size: 30 GB
    • wilson
      • Simulation time: 2022-02-03 00:00 - 2022-02-09 00:00
      • Attack time: 2022-02-07 10:57 - 2022-02-07 11:49
      • Scan volume: High
      • Unpacked size: 39 GB

    The following attacks are launched in the network:

    • Scans (nmap, WPScan, dirb)
    • Webshell upload (CVE-2020-24186)
    • Password cracking (John the Ripper)
    • Privilege escalation
    • Remote command execution
    • Data exfiltration (DNSteal)

    Note that attack parameters and their execution orders vary in each dataset. Labeled log files are trimmed to the simulation time to ensure that their labels (which reference the related event by the line number in the file) are not misleading. Other log files, however, also contain log events generated before or after the simulation time and may therefore be affected by testbed setup or data collection. It is therefore recommended to only consider logs with timestamps within the simulation time for analysis.

    The structure of labels is explained using the audit logs from the intranet server in the russellmitchell data set as an example in the following. The first four labels in the labels/intranet_server/logs/audit/audit.log file are as follows:

    {"line": 1860, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}

    {"line": 1861, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}

    {"line": 1862, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}

    {"line": 1863, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}

    Each JSON object in this file assigns a label to one specific log line in the corresponding log file located at gather/intranet_server/logs/audit/audit.log. The field "line" in the JSON objects specifies the line number of the respective event in the original log file, while the field "labels" contains the corresponding labels. For example, the lines in the sample above provide the information that lines 1860-1863 in the gather/intranet_server/logs/audit/audit.log file are labeled with "attacker_change_user" and "escalate", corresponding to the attack step where the attacker receives escalated privileges. Inspecting these lines shows that they indeed correspond to the user authenticating as root:

    type=USER_AUTH msg=audit(1642999060.603:2226): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:authentication acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'

    type=USER_ACCT msg=audit(1642999060.603:2227): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:accounting acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'

    type=CRED_ACQ msg=audit(1642999060.615:2228): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:setcred acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'

    type=USER_START msg=audit(1642999060.627:2229): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:session_open acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'

    The same applies to all other labels for this log file and all other log files. There are no labels for logs generated by "normal" (i.e., non-attack) behavior; instead, all log events that have no corresponding JSON object in one of the files from the labels directory, such as the lines 1-1859 in the example above, can be considered to be labeled as "normal". This means that in order to figure out the labels for the log data it is necessary to store the line numbers when processing the original logs from the gather directory and see if these line numbers also appear in the corresponding file in the labels directory.
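
    A minimal Python sketch of this lookup, using the file paths from the example above; the rest follows directly from the label format:

    import json

    log_path = "gather/intranet_server/logs/audit/audit.log"
    label_path = "labels/intranet_server/logs/audit/audit.log"

    # Map line numbers to their attack labels; lines absent from this map
    # are considered "normal".
    labels_by_line = {}
    with open(label_path) as f:
        for raw in f:
            entry = json.loads(raw)
            labels_by_line[entry["line"]] = entry["labels"]

    with open(log_path) as f:
        for line_number, event in enumerate(f, start=1):  # "line" counts from 1
            labels = labels_by_line.get(line_number, ["normal"])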

    Besides the attack labels, a general overview of the exact times when specific attack steps are launched is available in gather/attacker_0/logs/attacks.log. An enumeration of all hosts and their IP addresses is stated in processing/config/servers.yml. Moreover, configurations of each host are provided in gather/ and gather/.

    Version history:

    • AIT-LDS-v1.x: Four datasets, logs from single host, fine-granular audit logs, mail/CMS.
    • AIT-LDS-v2.0: Eight datasets, logs from all hosts, system logs and network traffic, mail/CMS/cloud/web.

    Acknowledgements: Partially funded by the FFG projects INDICAETING (868306) and DECEPT (873980), and the EU projects GUARD (833456) and PANDORA (SI2.835928).

    If you use the dataset, please cite the following publications:

    [1] M. Landauer, F. Skopik, M. Frank, W. Hotwagner, M. Wurzenberger and A. Rauber, "Maintainable Log Datasets for Evaluation of Intrusion Detection Systems," in IEEE Transactions on Dependable and Secure Computing, vol. 20, no. 4, pp. 3466-3482, 2023. [PDF]

    [2] M. Landauer, F. Skopik, M. Wurzenberger, W. Hotwagner and A. Rauber, "Have it Your Way: Generating Customized Log Datasets With a Model-Driven Simulation Testbed," in IEEE Transactions on Reliability, vol. 70, no. 1, pp. 402-415, March 2021, doi: 10.1109/TR.2020.3031317. [PDF]

  4. Archive Query Log 2022

    • webis.de
    Updated 2023
    Cite
    Martin Potthast; Benno Stein; Matthias Hagen (2023). Archive Query Log 2022 [Dataset]. https://webis.de/data/aql-22.html
    Explore at:
    Dataset updated
    2023
    Dataset provided by
    Bauhaus-Universität Weimar
    The Web Technology & Information Systems Network
    Friedrich Schiller University Jena
    Leipzig University
    Authors
    Martin Potthast; Benno Stein; Matthias Hagen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The AQL-22 is a query log collected at the Internet Archive over the last 25 years. It includes 356 M queries, 137 M search result pages, and 1.4 billion search results across 550 search providers. The AQL-22 is the first public query log that is on par with commercial logs with respect to size, scope, and diversity. Provided in a privacy-preserving manner, it promotes open research as well as more transparency and accountability in the search industry.

  5. Server Logs

    • kaggle.com
    Updated Oct 12, 2021
    Cite
    Vishnu U (2021). Server Logs [Dataset]. https://www.kaggle.com/vishnu0399/server-logs/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 12, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Vishnu U
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    The dataset is a synthetically generated server log based on the Apache Server Logging Format. Each line corresponds to one log entry. A log entry has the following parameters:

    Components in a log entry (a parsing sketch follows this list):

    • IP of client: This refers to the IP address of the client that sent the request to the server.
    • Remote Log Name: Remote name of the User performing the request. In the majority of the applications, this is confidential information and is hidden or not available.
    • User ID: The ID of the user performing the request. In the majority of the applications, this is a piece of confidential information and is hidden or not available.
    • Date and Time in UTC format: The date and time of the request, represented in UTC format as follows: Day/Month/Year:Hour:Minute:Second +Time-Zone-Correction.
    • Request Type: The type of request (GET, POST, PUT, DELETE) that the server got. This depends on the operation that the request will do.
    • API: The API of the website to which the request is related. Example: when a user accesses a cart on a shopping website, the API comes as /usr/cart.
    • Protocol and Version: Protocol used for connecting with server and its version.
    • Status Code: Status code that the server returned for the request. Eg: 404 is sent when a requested resource is not found. 200 is sent when the request was successfully served.
    • Byte: The amount of data in bytes that was sent back to the client.
    • Referrer: The websites/source from where the user was directed to the current website. If none it is represented by “-“.
    • UA String: The user agent string contains details of the browser and the host device (like the name, version, device type etc.).
    • Response Time: The response time the server took to serve the request. This is the difference between the timestamps when the request was received and when the request was served.
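
    As noted above, here is a Python sketch for parsing one such entry (Combined Log Format with a trailing response-time field); the sample line and the exact position of the response-time field are assumptions, not taken from the dataset itself:

    import re

    LOG_PATTERN = re.compile(
        r'(?P<ip>\S+) (?P<remote_log_name>\S+) (?P<user_id>\S+) '
        r'\[(?P<datetime>[^\]]+)\] "(?P<request>[^"]*)" '
        r'(?P<status>\d{3}) (?P<bytes>\S+) "(?P<referrer>[^"]*)" '
        r'"(?P<user_agent>[^"]*)" (?P<response_time>\d+)'
    )

    sample = ('203.0.113.7 - - [09/Mar/2021:14:12:31 +0000] '
              '"GET /usr/cart HTTP/1.1" 200 5121 "-" "Mozilla/5.0" 4530')
    match = LOG_PATTERN.match(sample)
    if match:
        print(match.groupdict())  # one entry per parsed log component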

    Content

    The dataset consists of two files:

    • logfiles.log - the actual log file in text format
    • TestFileGenerator.py - the synthetic log file generator; the number of log entries required can be edited in the code

  6. Internet users that have signed into websites with a Facebook login in...

    • statista.com
    Updated Jul 9, 2025
    Cite
    Statista (2025). Internet users that have signed into websites with a Facebook login in Poland 2018 [Dataset]. https://www.statista.com/statistics/986974/poland-facebook-login-used-to-sign-in-to-other-websites/
    Explore at:
    Dataset updated
    Jul 9, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    May 2018
    Area covered
    Poland
    Description

    This statistic shows the share of internet users that have signed into websites using their Facebook login in Poland as of ********. According to the survey results, ********* of respondents had logged into other websites automatically using their Facebook account. The source also notes that Internet users up to 24 years of age used this logging mechanism more often than people over 35 years of age.

  7. EDGAR Log File Data Sets

    • catalog.data.gov
    Updated Jul 22, 2025
    Cite
    EDGAR Business Office (2025). EDGAR Log File Data Sets [Dataset]. https://catalog.data.gov/dataset/edgar-log-file-data-set
    Explore at:
    Dataset updated
    Jul 22, 2025
    Dataset provided by
    Electronic Data Gathering, Analysis, and Retrieval (http://www.sec.gov/edgar.shtml)
    Description

    The data sets provide information on internet search traffic for EDGAR filings through SEC.gov.

  8. Data from: Pillar 3: Pre-processed web server log file dataset of the...

    • data.mendeley.com
    Updated Dec 6, 2021
    + more versions
    Cite
    Michal Munk (2021). Pillar 3: Pre-processed web server log file dataset of the banking institution [Dataset]. http://doi.org/10.17632/5bvkm76sdc.1
    Explore at:
    Dataset updated
    Dec 6, 2021
    Authors
    Michal Munk
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset represents the pre-processed web server log file of a commercial bank. The source of the data is the web server of the bank, and it records the accesses of web users from 2009 to 2012. It contains accesses to the bank website during and after the financial crisis. Unnecessary data saved by the web server was removed to keep the focus only on the textual content of the website. Many variables were added to the original log file to make the analysis workable. To protect the privacy of website users, sensitive information in the log file was anonymized. The dataset offers a way to understand the behaviour of stakeholders during and after the crisis and how they comply with the Basel regulations.

  9. Log Analysis Service Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated May 16, 2025
    Cite
    Archive Market Research (2025). Log Analysis Service Report [Dataset]. https://www.archivemarketresearch.com/reports/log-analysis-service-566234
    Explore at:
    Available download formats: ppt, pdf, doc
    Dataset updated
    May 16, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global Log Analysis Service market is experiencing robust growth, driven by the increasing adoption of cloud computing, the proliferation of IoT devices generating massive amounts of data, and the rising need for enhanced security and compliance. Businesses across all sectors are generating exponentially more log data, demanding sophisticated solutions for real-time analysis, anomaly detection, and security threat identification. This demand is fueling the market's expansion. Let's assume, for illustrative purposes, a 2025 market size of $15 billion and a Compound Annual Growth Rate (CAGR) of 15% over the forecast period (2025-2033). This implies a significant market expansion, reaching an estimated value exceeding $50 billion by 2033. Key market segments include cloud-based and web-based solutions catering to both SMEs and large enterprises. The competitive landscape is characterized by a mix of established players like Splunk, Datadog, and Sumo Logic, alongside cloud giants such as Microsoft and Google, and open-source alternatives like Apache. The market's growth is further propelled by advancements in AI and machine learning, enabling more accurate and proactive log analysis. Regional variations exist, with North America currently holding a significant market share due to early adoption and a strong technological ecosystem. However, Asia-Pacific is projected to witness the fastest growth rate due to increasing digitalization and expanding IT infrastructure. Restraints to market growth include the complexity of deploying and managing log analysis solutions, the need for skilled personnel, and the cost associated with implementation and maintenance. Despite these challenges, the market outlook for Log Analysis Services remains overwhelmingly positive, indicating continued substantial investment and innovation in this critical area of IT infrastructure. The rise of security information and event management (SIEM) solutions integrated with log analysis further contributes to market expansion.

  10. Site A2 - Event Log / Derived Data

    • osti.gov
    Updated Jul 2, 2025
    + more versions
    Cite
    Bodini, Nicola; Letizia, Stefano (2025). Site A2 - Event Log / Derived Data [Dataset]. https://www.osti.gov/dataexplorer/biblio/dataset/2568472
    Explore at:
    Dataset updated
    Jul 2, 2025
    Dataset provided by
    United States Department of Energy (http://energy.gov/)
    Pacific Northwest National Laboratory
    Authors
    Bodini, Nicola; Letizia, Stefano
    Description

    This dataset contains the event log table with 10-minute wind statistics from the scanning lidar at AWAKEN's site A2. This is a good dataset to start from for people unfamiliar with the AWAKEN project.

  11. AIT Alert Data Set

    • zenodo.org
    csv, zip
    Updated Oct 14, 2024
    Cite
    Max Landauer; Florian Skopik; Markus Wurzenberger (2024). AIT Alert Data Set [Dataset]. http://doi.org/10.5281/zenodo.8263181
    Explore at:
    Available download formats: zip, csv
    Dataset updated
    Oct 14, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Max Landauer; Florian Skopik; Markus Wurzenberger
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains the AIT Alert Data Set (AIT-ADS), a collection of synthetic alerts suitable for evaluation of alert aggregation, alert correlation, alert filtering, and attack graph generation approaches. The alerts were forensically generated from the AIT Log Data Set V2 (AIT-LDSv2) and originate from three intrusion detection systems, namely Suricata, Wazuh, and AMiner. The data sets comprise eight scenarios, each of which has been targeted by a multi-step attack with attack steps such as scans, web application exploits, password cracking, remote command execution, privilege escalation, etc. Each scenario and attack chain has certain variations so that attack manifestations and resulting alert sequences vary in each scenario; this means that the data set allows researchers to develop and evaluate approaches that compute similarities of attack chains or merge them into meta-alerts. Since only a few benchmark alert data sets are publicly available, the AIT-ADS was developed to address common issues in the research domain of multi-step attack analysis; specifically, the alert data set contains many false positives caused by normal user behavior (e.g., user login attempts or software updates), heterogeneous alert formats (although all alerts are in JSON format, their fields are different for each IDS), repeated executions of attacks according to an attack plan, collection of alerts from diverse log sources (application logs and network traffic) and all components in the network (mail server, web server, DNS, firewall, file share, etc.), and labels for attack phases. For more information on how this alert data set was generated, check out our paper accompanying this data set [1] or our GitHub repository. More information on the original log data set, including a detailed description of scenarios and attacks, can be found in [2].

    The alert data set contains two files for each of the eight scenarios, and a file for their labels (a loading sketch follows this list):

    • contains alerts from AMiner IDS
    • contains alerts from Wazuh IDS and Suricata IDS
    • labels.csv contains the start and end times of attack phases in each scenario
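
    A small Python sketch for loading one scenario's files; the per-scenario alert file names are elided in the listing above, so the names used here are hypothetical:

    import csv
    import json

    # One JSON alert per line is assumed for the alert files.
    with open("fox_aminer.json") as f:  # hypothetical file name
        aminer_alerts = [json.loads(line) for line in f]

    with open("labels.csv", newline="") as f:
        attack_phases = list(csv.DictReader(f))  # start/end times of attack phases

    print(len(aminer_alerts), "AMiner alerts,", len(attack_phases), "attack phases")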

    Besides false positive alerts, the alerts in the AIT-ADS correspond to the following attacks:

    • Scans (nmap, WPScan, dirb)
    • Webshell upload (CVE-2020-24186)
    • Password cracking (John the Ripper)
    • Privilege escalation
    • Remote command execution
    • Data exfiltration (DNSteal) and stopped service

    The total number of alerts in the data set is 2,655,821, of which 2,293,628 originate from Wazuh, 306,635 from Suricata, and 55,558 from AMiner. The numbers of alerts in each scenario are as follows. fox: 473,104; harrison: 593,948; russellmitchell: 45,544; santos: 130,779; shaw: 70,782; wardbeck: 91,257; wheeler: 616,161; wilson: 634,246.

    Acknowledgements: Partially funded by the European Defence Fund (EDF) projects AInception (101103385) and NEWSROOM (101121403), and the FFG project PRESENT (FO999899544). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union. The European Union cannot be held responsible for them.

    If you use the AIT-ADS, please cite the following publications:

    [1] Landauer, M., Skopik, F., Wurzenberger, M. (2024): Introducing a New Alert Data Set for Multi-Step Attack Analysis. Proceedings of the 17th Cyber Security Experimentation and Test Workshop. [PDF]

    [2] Landauer M., Skopik F., Frank M., Hotwagner W., Wurzenberger M., Rauber A. (2023): Maintainable Log Datasets for Evaluation of Intrusion Detection Systems. IEEE Transactions on Dependable and Secure Computing, vol. 20, no. 4, pp. 3466-3482. [PDF]

  12. EDGAR Log Files (2014 - 2016)

    • redivis.com
    application/jsonl +7
    Updated Mar 13, 2025
    Cite
    Stanford Graduate School of Business Library (2025). EDGAR Log Files (2014 - 2016) [Dataset]. https://redivis.com/datasets/e16k-fn86c13fx
    Explore at:
    Available download formats: sas, spss, application/jsonl, csv, stata, parquet, arrow, avro
    Dataset updated
    Mar 13, 2025
    Dataset provided by
    Redivis Inc.
    Authors
    Stanford Graduate School of Business Library
    Time period covered
    Jan 1, 2014 - Dec 31, 2016
    Description

    Abstract

    The EDGAR log file data set provides information on internet search traffic for EDGAR filings through SEC.gov. The data sets contain information extracted from log files from the EDGAR Archive on SEC.gov, and the information can be used to infer user access statistics.

    The current version of this dataset covers search traffic from January 1, 2014 through December 31, 2016.

    Methodology

    Due to the substantial volume of the raw EDGAR Log Files data set, we (Stanford GSB) implemented a series of transformations aimed at reducing its size while retaining essential information needed for research. Below is a summary of the modifications applied to the raw data, resulting in the four tables currently available in this Redivis dataset:

    raw_single_day_per_year:

    • This table contains a single unprocessed day of Edgar Log Data (June 15) from each year. June 15 was selected because it is near the midpoint of the year and is not a U.S. federal holiday.
    • This table was created to help researchers examine a sample of raw data to understand how aggregation was performed in other tables and potentially identify trends using all the unfiltered fields


    aggregated_{YEAR}:

    • Filter rows to include only those with a code value of '200' and doc/extention values ending in htm, txt, xml, pdf, sgml, html, or xsd


    • Remove fields cik, time, idx, size, and browser. Our reasoning for removal of these fields: cik can be obtained through merging with our EDGAR Filings dataset using accession; idx shouldn't change over time for the same doc and can be manually recreated via a transform of doc; browser is NULL in more than 99.99% of rows across logs and is fully NULL for many dates; size varies according to doc, which we have aggregated to reduce size; time does not have a time zone specified, and daily data granularity is likely sufficient for research purposes
    • Aggregate identical rows into a doc_count to represent the number of times an IP viewed a filing each day while keeping the same browser metadata/parameters
    • Filter aggregated data to remove rows where doc_count > 10,000
    • Standardize data by ensuring relevant field types align with the expected usage by researchers (a pandas sketch of these steps follows this list)
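
    A rough pandas sketch of the filter/aggregate steps, assuming a hypothetical input file and taking the column names (code, extention, cik, time, idx, size, browser) from the description above:

    import pandas as pd

    KEPT_EXTENSIONS = ("htm", "txt", "xml", "pdf", "sgml", "html", "xsd")

    raw = pd.read_csv("edgar_log_raw.csv")  # hypothetical export of one raw table

    # Keep successful requests for the listed document types, then drop the
    # fields removed in the published tables.
    filtered = raw[
        (raw["code"] == 200)
        & raw["extention"].astype(str).str.endswith(KEPT_EXTENSIONS)
    ].drop(columns=["cik", "time", "idx", "size", "browser"])

    # Collapse identical rows into a doc_count, then drop extreme outliers.
    aggregated = (
        filtered.groupby(list(filtered.columns), dropna=False)
                .size()
                .reset_index(name="doc_count")
    )
    aggregated = aggregated[aggregated["doc_count"] <= 10_000]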


    raw_{YEAR}:

    • These tables contain a year of unprocessed Edgar Log Data.
    • This table was created to help researchers use the raw data to potentially identify trends using all the unfiltered fields


    Usage

    From the SEC Edgar Log Website:

    • The full 2003 - 2017 data set does not have SEC IP addresses for some periods because, at the time, SEC users were not routed to EDGAR the same way as external visitors. For those periods of time, SEC IP addresses do not appear in the logs.
    • Due to certain limitations, including the existence of lost or damaged files, the information assembled does not capture all SEC.gov website traffic. In addition, it is possible inaccuracies or other errors were introduced into the data during the process of extracting and compiling the data.


  13. Longitudinal navigation log data on the Radboud University web domain -...

    • b2find.eudat.eu
    Updated Apr 28, 2016
    + more versions
    Cite
    (2016). Longitudinal navigation log data on the Radboud University web domain - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/31f242e7-01fe-5a59-85f3-0fb23d51b0a9
    Explore at:
    Dataset updated
    Apr 28, 2016
    Description

    We have collected the access logs for our university's web domain over a time span of 4.5 years. We now release the pre-processed web server log of a 3-month period for research into user navigation behavior. We preprocessed the data so that only successful GET requests of web pages by non-bot users are kept. The information that is included per entry is: unique user id, timestamp, GET request (URL), status code, the size of the object returned to the client, and the referrer URL. The resulting size of the 3-month collection is 9.6M page visits (190K unique URLs) by 744K unique visitors. The data collection allows for research on, among other things, user navigation, browsing and stopping behavior and web user clustering.

  14. Log files data from online store

    • zenodo.org
    csv
    Updated Jan 24, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Grzegorz Chodak; Yash Chawla; Katarzyna Kubicz (2020). Log files data from online store [Dataset]. http://doi.org/10.5281/zenodo.3251889
    Explore at:
    Available download formats: csv
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Grzegorz Chodak; Yash Chawla; Katarzyna Kubicz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    As the number of online stores as well as buyers is increasing rapidly, researchers are working on understanding and improving the performance of online stores by studying customer behavior, interests, engagement, etc., along with the technical aspects of online stores. This, however, requires access to real-world log files. With this objective in mind, we have prepared and made publicly available a high-frequency dataset containing one month of log files from an actual and popular Polish online store. This dataset can provide insights into user behavior as well as the performance of the online store.

  15. Network Traffic Dataset

    • kaggle.com
    Updated Oct 31, 2023
    Cite
    Ravikumar Gattu (2023). Network Traffic Dataset [Dataset]. https://www.kaggle.com/datasets/ravikumargattu/network-traffic-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 31, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ravikumar Gattu
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    The data presented here was obtained on a Kali machine at the University of Cincinnati, Cincinnati, Ohio, by carrying out packet captures for one hour during the evening of Oct 9, 2023, using Wireshark. The dataset consists of 394,137 instances stored in a CSV (comma-separated values) file.

    The dataset can be used for a variety of machine learning tasks, such as network intrusion detection, traffic classification, network performance monitoring, network security and traffic management, and anomaly detection.

    Content:

    This network traffic dataset consists of 7 features. Each instance contains the source and destination IP addresses, among other information. The majority of the properties are numeric in nature; however, there are also nominal and date types due to the Timestamp.

    The network traffic flow statistics (No. Time Source Destination Protocol Length Info) were obtained using Wireshark (https://www.wireshark.org/).

    Dataset Columns:

    • No: number of the instance
    • Timestamp: timestamp of the network traffic instance
    • Source IP: IP address of the source
    • Destination IP: IP address of the destination
    • Protocol: protocol used by the instance
    • Length: length of the instance
    • Info: information about the traffic instance
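
    A minimal pandas sketch for loading the file; the CSV file name is an assumption, and the column names follow the list above:

    import pandas as pd

    df = pd.read_csv("network_traffic.csv")  # hypothetical file name
    df["Timestamp"] = pd.to_datetime(df["Timestamp"], errors="coerce")
    print(df["Protocol"].value_counts().head())  # traffic mix by protocol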

    Acknowledgements:

    I would like to thank the University of Cincinnati for providing the infrastructure for the generation of the network traffic dataset.

    Ravikumar Gattu, Susmitha Choppadandi

    Inspiration: This dataset goes beyond the majority of network traffic classification datasets, which only identify the type of application (WWW, DNS, ICMP, ARP, RARP) that an IP flow contains. Instead, it enables machine learning models that can identify specific applications (like TikTok, Wikipedia, Instagram, YouTube, websites, blogs, etc.) from IP flow statistics (there are currently 25 applications in total).

    Dataset License: CC0: Public Domain

    Dataset Usages: This dataset can be used for different machine learning applications in the field of cybersecurity, such as classification of network traffic, network performance monitoring, network security management, network traffic management, network intrusion detection, and anomaly detection.

    ML techniques that benefit from this dataset:

    This dataset is highly useful because it consists of 394,137 instances of network traffic data obtained by using the 25 applications on public, private, and enterprise networks. The dataset also contains very important features that can be used for most applications of machine learning in cybersecurity. Here are a few of the potential machine learning applications that could benefit from this dataset:

    1. Network Performance Monitoring: this large network traffic dataset can be utilised for analysing the traffic to identify patterns in the network. This helps in designing network security algorithms that minimise network problems.

    2. Anomaly Detection: the large network traffic dataset can be utilised for training machine learning models to find irregularities in the traffic, which could help identify cyber attacks.

    3. Network Intrusion Detection: this large dataset could be utilised for training machine learning algorithms and designing models for detection of traffic issues, malicious traffic, network attacks, and DoS attacks.

  16. Website Traffic Dataset

    • gts.ai
    json
    Updated Aug 23, 2024
    Cite
    GTS (2024). Website Traffic Dataset [Dataset]. https://gts.ai/dataset-download/website-traffic-dataset/
    Explore at:
    Available download formats: json
    Dataset updated
    Aug 23, 2024
    Dataset provided by
    GLOBOSE TECHNOLOGY SOLUTIONS PRIVATE LIMITED
    Authors
    GTS
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Explore our detailed website traffic dataset featuring key metrics like page views, session duration, bounce rate, traffic source, and conversion rates.

  17. Web browser useragent and activity tracking data

    • zenodo.org
    bz2
    Updated Dec 16, 2024
    Cite
    Lucz Geza (2024). Web browser useragent and activity tracking data [Dataset]. http://doi.org/10.5281/zenodo.14497695
    Explore at:
    Available download formats: bz2
    Dataset updated
    Dec 16, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Lucz Geza
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Mar 1, 2019
    Description

    600 000 000 web traffic records normalized into MySQL tables using TokuDB storage, complete with original web server response codes. Suitable for browser data and trend analysis as well as AI training of exploit and bot detection algorithms. The data had been collected from multiple Apache 2.x web servers across 8000+ domain names with special care for GDPR compliance.

  18. Moodle Course Logs of a Brazilian Higher Education Institution

    • figshare.com
    zip
    Updated Nov 13, 2018
    Cite
    Bernardo Pereira Nunes (2018). Moodle Course Logs of a Brazilian Higher Education Institution [Dataset]. http://doi.org/10.6084/m9.figshare.7335860.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 13, 2018
    Dataset provided by
    figshare
    Authors
    Bernardo Pereira Nunes
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the records of anonymised user interactions in seven online courses at a higher education institution in Brazil. For each course, the dataset covers a period spanning from 2017.1 to 2018.1, equivalent to three Brazilian academic periods. All online courses used the Moodle learning platform.

    The dataset covers the following courses:

    • F - An introductory course in Philosophy - mandatory for all students
    • C - An introductory course in Religion - mandatory for all students
    • S - An introductory course in Political Theory - mandatory for students of the School of Humanities and Social Sciences
    • M1 - Differential and Difference Equations course - mandatory for students of the School of Engineering and Exact Sciences
    • M2 - Single Variable Calculus course - mandatory for students of the School of Engineering and Exact Sciences
    • E9 - An introductory course in the Design of Control Systems - mandatory for students of the School of Industrial Engineering
    • E0 - Foundations of Engineering course - mandatory for all students of the School of Engineering

    The data is compressed in .zip format and can be uncompressed by standard compression utilities. Each course has three separate files grouped by user interactions from different academic periods. For example, the records for the course 'F' are split into F1, F2 and F3. F1 covers the records of the first academic period, whereas F2 and F3 contain the records for the second and third academic periods respectively. Note that each instance of a course is independent and that the same student (identified by the same id) may only occur in the same course in different academic periods if they failed and opted to retake that course in one of the following periods covered by the data available here. The student id is preserved among the courses and academic periods.

    A description of the log fields contained in this dataset can be found at: https://docs.moodle.org/dev/Event_2#Information_contained_in_events

  19. Devices used to log into online banking in Czechia 2023, by security

    • statista.com
    Updated Jun 30, 2025
    Cite
    Statista (2025). Devices used to log into online banking in Czechia 2023, by security [Dataset]. https://www.statista.com/statistics/1373188/czechia-devices-used-to-log-into-online-banking-by-security/
    Explore at:
    Dataset updated
    Jun 30, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Oct 2023
    Area covered
    Czechia
    Description

    As of October 2023, over ** percent of people in Czechia logged into their internet banking only on the devices they solely controlled, and they knew the security settings. Around ** percent used different (such as work) or shared devices but still knew the security settings.

  20. sumu-log.com Website Traffic, Ranking, Analytics [June 2025]

    • fadxfab.com
    Updated Jul 12, 2025
    Cite
    Semrush (2025). sumu-log.com Website Traffic, Ranking, Analytics [June 2025] [Dataset]. https://fadxfab.com/website/sumu-log.com/overview/
    Explore at:
    Dataset updated
    Jul 12, 2025
    Dataset authored and provided by
    Semrush (https://fr.semrush.com/)
    License

    https://fadxfab.com/company/legal/terms-of-service/

    Time period covered
    Jul 12, 2025
    Area covered
    Worldwide
    Variables measured
    visits, backlinks, bounceRate, pagesPerVisit, authorityScore, organicKeywords, avgVisitDuration, referringDomains, trafficByCountry, paidSearchTraffic, and 3 more
    Measurement technique
    Semrush Traffic Analytics; Click-stream data
    Description

    sumu-log.com is ranked #8945 in JP with 287.87K Traffic. Categories: Online Services. Learn more about website traffic, market share, and more!
