58 datasets found
  1. Web Server Access Logs

    • kaggle.com
    zip
    Updated Feb 13, 2021
    Cite
    Elias Dabbas (2021). Web Server Access Logs [Dataset]. https://www.kaggle.com/eliasdabbas/web-server-access-logs
    Explore at:
    zip(279607491 bytes)
    Dataset updated
    Feb 13, 2021
    Authors
    Elias Dabbas
    License

    CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    Context

    Web server logs contain information on any event that was registered/logged. They hold a lot of insight into website visitors and their behavior, crawlers accessing the site, business performance, security issues, and more.

    This is a dataset for trying to gain insights from such a file.

    Content

    3.3GB of logs from an Iranian ecommerce website zanbil.ir.

    Acknowledgements

    Zaker, Farzin, 2019, "Online Shopping Store - Web Server Logs", https://doi.org/10.7910/DVN/3QBYB5, Harvard Dataverse, V1

    Inspiration

    Trying to create an efficient pipeline for reading, parsing, compressing, and analyzing web server log files.
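
    The dataset description does not spell out the exact log format, so the following is only a minimal sketch that assumes Apache's Combined Log Format (a common default); the regex and the file name are assumptions and would need adjusting to the actual file.

    import re

    # Combined Log Format: host ident user [time] "request" status size "referrer" "user-agent"
    # Assumption: the access log follows this layout; adjust the pattern otherwise.
    LOG_PATTERN = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
        r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
        r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
    )

    def parse_line(line):
        """Return a dict of fields for one log line, or None if the line does not match."""
        match = LOG_PATTERN.match(line)
        return match.groupdict() if match else None

    with open("access.log", encoding="utf-8", errors="replace") as fh:   # hypothetical file name
        for line in fh:
            record = parse_line(line)
            if record and record["status"].startswith("5"):
                print(record["time"], record["request"])   # e.g. list server-side errors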

  2. Apache Web Server - Access Log Pre-processing for Web Intrusion Detection

    • ieee-dataport.org
    Updated Aug 9, 2021
    Cite
    Muhammad Anis Al Hilmi (2021). Apache Web Server - Access Log Pre-processing for Web Intrusion Detection [Dataset]. https://ieee-dataport.org/open-access/apache-web-server-access-log-pre-processing-web-intrusion-detection
    Explore at:
    Dataset updated
    Aug 9, 2021
    Dataset provided by
    IEEE Dataport
    Authors
    Muhammad Anis Al Hilmi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is from an Apache web server access log. It contains: IP address, datetime, GMT, request, status, size, user agent, country, and label. The dataset shows malicious activity in the IP address, request, and other fields, which can be analyzed as intrusion detection parameters. Paper: http://jtiik.ub.ac.id/index.php/jtiik/article/view/4107
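
    As a quick start (not an official recipe), one might load the labeled table and look at the label distribution; the column names below are assumptions taken from the field list above and must be adapted to the actual download.

    import pandas as pd

    # Assumption: the download provides a delimited table with the columns listed above.
    df = pd.read_csv("apache_access_labeled.csv")        # hypothetical file name

    print(df["label"].value_counts())                    # how often each label occurs
    print(df.groupby("label")["ip address"].nunique())   # distinct source IPs per label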

  3. LogHub - Apache Log Data

    • kaggle.com
    zip
    Updated Oct 13, 2023
    Cite
    Om Duggineni (2023). LogHub - Apache Log Data [Dataset]. https://www.kaggle.com/datasets/omduggineni/loghub-apache-log-data
    Explore at:
    zip(254455 bytes)
    Dataset updated
    Oct 13, 2023
    Authors
    Om Duggineni
    Description

    Dataset

    This dataset was created by Om Duggineni


  4. Web robot detection - Server logs

    • data.niaid.nih.gov
    • explore.openaire.eu
    Updated Jan 4, 2021
    + more versions
    Cite
    Tsoumakas, Grigorios (2021). Web robot detection - Server logs [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3477931
    Explore at:
    Dataset updated
    Jan 4, 2021
    Dataset provided by
    Lagopoulos, Athanasios
    Tsoumakas, Grigorios
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains server logs from the search engine of the library and information center of the Aristotle University of Thessaloniki in Greece (http://search.lib.auth.gr/). The search engine enables users to check the availability of books and other written works, and search for digitized material and scientific publications. The server logs obtained span an entire month, from March 1st to March 31 2018 and consist of 4,091,155 requests with an average of 131,973 requests per day and a standard deviation of 36,996.7 requests. In total, there are requests from 27,061 unique IP addresses and 3,441 unique user-agent strings. The server logs are in JSON format and they are anonymized by masking the last 6 digits of the IP address and by hashing the last part of the URLs requested (after last /). The dataset also contains the processed form of the server logs as a labelled dataset of log entries grouped into sessions along with their extracted features (simple semantic features). We make this dataset publicly available, the first one in this domain, in order to provide a common ground for testing web robot detection methods, as well as other methods that analyze server logs.
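
    The processed release already ships sessions with extracted features; if one wanted to re-derive sessions from the raw JSON logs, a common heuristic (not necessarily the authors' exact procedure) is to group requests by IP and user agent and to cut a session after 30 minutes of inactivity. The field names below are assumptions.

    import json
    from collections import defaultdict

    SESSION_TIMEOUT = 30 * 60   # seconds of inactivity that closes a session (heuristic)

    def sessionize(entries):
        """Group log entries into sessions per (ip, user_agent) client.

        The keys 'ip', 'user_agent' and numeric 'timestamp' are assumptions;
        adjust them to the fields actually present in the released JSON logs.
        """
        sessions = defaultdict(list)   # (ip, ua) -> list of sessions, each a list of entries
        for entry in sorted(entries, key=lambda e: e["timestamp"]):
            key = (entry["ip"], entry["user_agent"])
            bucket = sessions[key]
            if not bucket or entry["timestamp"] - bucket[-1][-1]["timestamp"] > SESSION_TIMEOUT:
                bucket.append([])      # no open session yet, or the previous one timed out
            bucket[-1].append(entry)
        return sessions

    with open("server_logs.json", encoding="utf-8") as fh:   # hypothetical file name
        entries = [json.loads(line) for line in fh]          # assuming one JSON object per line
    sessions = sessionize(entries)
    print("clients:", len(sessions), "sessions:", sum(len(v) for v in sessions.values()))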

  5. Kyoushi Log Data Set

    • data.niaid.nih.gov
    • explore.openaire.eu
    • +1more
    Updated Oct 18, 2023
    + more versions
    Cite
    Frank, Maximilian (2023). Kyoushi Log Data Set [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5779410
    Explore at:
    Dataset updated
    Oct 18, 2023
    Dataset provided by
    Frank, Maximilian
    Skopik, Florian
    Hotwagner, Wolfgang
    Wurzenberger, Markus
    Landauer, Max
    Rauber, Andreas
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This repository contains synthetic log data suitable for evaluation of intrusion detection systems. The logs were collected from a testbed that was built at the Austrian Institute of Technology (AIT) following the approaches by [1], [2], and [3]. Please refer to these papers for more detailed information on the dataset and cite them if the data is used for academic publications. Unlike the related AIT-LDSv1.1, this dataset involves a more complex network structure, makes use of a different attack scenario, and collects log data from multiple hosts in the network. In brief, the testbed simulates a small enterprise network including mail server, file share, WordPress server, VPN, firewall, etc. Normal user behavior is simulated to generate background noise. After some days, two attack scenarios are launched against the network. Note that the AIT-LDSv2.0 extends this dataset with additional attack cases and variations of attack parameters.

    The archives have the following structure. The gather directory contains the raw log data from each host in the network, as well as their system configurations. The labels directory contains the ground truth for those log files that are labeled. The processing directory contains configurations for the labeling procedure and the rules directory contains the labeling rules. Labeling of events that are related to the attacks is carried out with the Kyoushi Labeling Framework.

    Each dataset contains traces of a specific attack scenario:

    Scenario 1 (see gather/attacker_0/logs/sm.log for detailed attack log):

    nmap scan

    WPScan

    dirb scan

    webshell upload through wpDiscuz exploit (CVE-2020-24186)

    privilege escalation

    Scenario 2 (see gather/attacker_0/logs/dnsteal.log for detailed attack log):

    DNSteal data exfiltration

    The log data collected from the servers includes

    Apache access and error logs (labeled)

    audit logs (labeled)

    auth logs (labeled)

    VPN logs (labeled)

    DNS logs (labeled)

    syslog

    suricata logs

    exim logs

    horde logs

    mail logs

    Note that only log files from affected servers are labeled. Label files and the directories in which they are located have the same name as their corresponding log file in the gather directory. Labels are in JSON format and comprise the following attributes: line (number of line in corresponding log file), labels (list of labels assigned to that log line), rules (names of labeling rules matching that log line). Note that not all attack traces are labeled in all log files; please refer to the labeling rules in case that some labels are not clear.

    Acknowledgements: Partially funded by the FFG projects INDICAETING (868306) and DECEPT (873980), and the EU project GUARD (833456).

    If you use the dataset, please cite the following publications:

    [1] M. Landauer, F. Skopik, M. Wurzenberger, W. Hotwagner and A. Rauber, "Have it Your Way: Generating Customized Log Datasets With a Model-Driven Simulation Testbed," in IEEE Transactions on Reliability, vol. 70, no. 1, pp. 402-415, March 2021, doi: 10.1109/TR.2020.3031317.

    [2] M. Landauer, M. Frank, F. Skopik, W. Hotwagner, M. Wurzenberger, and A. Rauber, "A Framework for Automatic Labeling of Log Datasets from Model-driven Testbeds for HIDS Evaluation". ACM Workshop on Secure and Trustworthy Cyber-Physical Systems (ACM SaT-CPS 2022), April 27, 2022, Baltimore, MD, USA. ACM.

    [3] M. Frank, "Quality improvement of labels for model-driven benchmark data generation for intrusion detection systems", Master's Thesis, Vienna University of Technology, 2021.

  6. AIT Log Data Set V2.0

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jun 28, 2024
    + more versions
    Cite
    Max Landauer; Florian Skopik; Maximilian Frank; Wolfgang Hotwagner; Markus Wurzenberger; Andreas Rauber (2024). AIT Log Data Set V2.0 [Dataset]. http://doi.org/10.5281/zenodo.5789064
    Explore at:
    zip
    Dataset updated
    Jun 28, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Max Landauer; Florian Skopik; Maximilian Frank; Wolfgang Hotwagner; Markus Wurzenberger; Andreas Rauber
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    AIT Log Data Sets

    This repository contains synthetic log data suitable for evaluation of intrusion detection systems, federated learning, and alert aggregation. A detailed description of the dataset is available in [1]. The logs were collected from eight testbeds that were built at the Austrian Institute of Technology (AIT) following the approach by [2]. Please cite these papers if the data is used for academic publications.

    In brief, each of the datasets corresponds to a testbed representing a small enterprise network including mail server, file share, WordPress server, VPN, firewall, etc. Normal user behavior is simulated to generate background noise over a time span of 4-6 days. At some point, a sequence of attack steps is launched against the network. Log data is collected from all hosts and includes Apache access and error logs, authentication logs, DNS logs, VPN logs, audit logs, Suricata logs, network traffic packet captures, horde logs, exim logs, syslog, and system monitoring logs. Separate ground truth files are used to label events that are related to the attacks. Compared to the AIT-LDSv1.1, a more complex network and diverse user behavior is simulated, and logs are collected from all hosts in the network. If you are only interested in network traffic analysis, we also provide the AIT-NDS containing the labeled netflows of the testbed networks. We also provide the AIT-ADS, an alert data set derived by forensically applying open-source intrusion detection systems on the log data.

    The datasets in this repository have the following structure:

    • The gather directory contains all logs collected from the testbed. Logs collected from each host are located in gather/.
    • The labels directory contains the ground truth of the dataset that indicates which events are related to attacks. The directory mirrors the structure of the gather directory so that each label file is located at the same path and has the same name as the corresponding log file. Each line in the label files references the log event corresponding to an attack by the line number counted from the beginning of the file ("line"), the labels assigned to the line that state the respective attack step ("labels"), and the labeling rules that assigned the labels ("rules"). An example is provided below.
    • The processing directory contains the source code that was used to generate the labels.
    • The rules directory contains the labeling rules.
    • The environment directory contains the source code that was used to deploy the testbed and run the simulation using the Kyoushi Testbed Environment.
    • The dataset.yml file specifies the start and end time of the simulation.

    The following table summarizes relevant properties of the datasets:

    • fox
      • Simulation time: 2022-01-15 00:00 - 2022-01-20 00:00
      • Attack time: 2022-01-18 11:59 - 2022-01-18 13:15
      • Scan volume: High
      • Unpacked size: 26 GB
    • harrison
      • Simulation time: 2022-02-04 00:00 - 2022-02-09 00:00
      • Attack time: 2022-02-08 07:07 - 2022-02-08 08:38
      • Scan volume: High
      • Unpacked size: 27 GB
    • russellmitchell
      • Simulation time: 2022-01-21 00:00 - 2022-01-25 00:00
      • Attack time: 2022-01-24 03:01 - 2022-01-24 04:39
      • Scan volume: Low
      • Unpacked size: 14 GB
    • santos
      • Simulation time: 2022-01-14 00:00 - 2022-01-18 00:00
      • Attack time: 2022-01-17 11:15 - 2022-01-17 11:59
      • Scan volume: Low
      • Unpacked size: 17 GB
    • shaw
      • Simulation time: 2022-01-25 00:00 - 2022-01-31 00:00
      • Attack time: 2022-01-29 14:37 - 2022-01-29 15:21
      • Scan volume: Low
      • Data exfiltration is not visible in DNS logs
      • Unpacked size: 27 GB
    • wardbeck
      • Simulation time: 2022-01-19 00:00 - 2022-01-24 00:00
      • Attack time: 2022-01-23 12:10 - 2022-01-23 12:56
      • Scan volume: Low
      • Unpacked size: 26 GB
    • wheeler
      • Simulation time: 2022-01-26 00:00 - 2022-01-31 00:00
      • Attack time: 2022-01-30 07:35 - 2022-01-30 17:53
      • Scan volume: High
      • No password cracking in attack chain
      • Unpacked size: 30 GB
    • wilson
      • Simulation time: 2022-02-03 00:00 - 2022-02-09 00:00
      • Attack time: 2022-02-07 10:57 - 2022-02-07 11:49
      • Scan volume: High
      • Unpacked size: 39 GB

    The following attacks are launched in the network:

    • Scans (nmap, WPScan, dirb)
    • Webshell upload (CVE-2020-24186)
    • Password cracking (John the Ripper)
    • Privilege escalation
    • Remote command execution
    • Data exfiltration (DNSteal)

    Note that attack parameters and their execution orders vary in each dataset. Labeled log files are trimmed to the simulation time to ensure that their labels (which reference the related event by the line number in the file) are not misleading. Other log files, however, also contain log events generated before or after the simulation time and may therefore be affected by testbed setup or data collection. It is therefore recommended to only consider logs with timestamps within the simulation time for analysis.

    The structure of labels is explained using the audit logs from the intranet server in the russellmitchell data set as an example in the following. The first four labels in the labels/intranet_server/logs/audit/audit.log file are as follows:

    {"line": 1860, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}

    {"line": 1861, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}

    {"line": 1862, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}

    {"line": 1863, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}

    Each JSON object in this file assigns a label to one specific log line in the corresponding log file located at gather/intranet_server/logs/audit/audit.log. The field "line" in the JSON objects specifies the line number of the respective event in the original log file, while the field "labels" comprises the corresponding labels. For example, the lines in the sample above provide the information that lines 1860-1863 in the gather/intranet_server/logs/audit/audit.log file are labeled with "attacker_change_user" and "escalate", corresponding to the attack step where the attacker receives escalated privileges. Inspecting these lines shows that they indeed correspond to the user authenticating as root:

    type=USER_AUTH msg=audit(1642999060.603:2226): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:authentication acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'

    type=USER_ACCT msg=audit(1642999060.603:2227): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:accounting acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'

    type=CRED_ACQ msg=audit(1642999060.615:2228): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:setcred acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'

    type=USER_START msg=audit(1642999060.627:2229): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:session_open acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'

    The same applies to all other labels for this log file and all other log files. There are no labels for logs generated by "normal" (i.e., non-attack) behavior; instead, all log events that have no corresponding JSON object in one of the files from the labels directory, such as the lines 1-1859 in the example above, can be considered to be labeled as "normal". This means that in order to figure out the labels for the log data it is necessary to store the line numbers when processing the original logs from the gather directory and see if these line numbers also appear in the corresponding file in the labels directory.
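
    A minimal sketch of this join, using the russellmitchell paths from the example above: read the label file into a line-number index, then walk the raw log and treat unlabeled lines as normal.

    import json

    LOG_PATH = "gather/intranet_server/logs/audit/audit.log"
    LABEL_PATH = "labels/intranet_server/logs/audit/audit.log"

    # Map line number -> list of labels assigned to that line.
    labels_by_line = {}
    with open(LABEL_PATH, encoding="utf-8") as fh:
        for raw in fh:
            obj = json.loads(raw)
            labels_by_line[obj["line"]] = obj["labels"]

    # Walk the original log; lines without an entry in the label file count as normal.
    with open(LOG_PATH, encoding="utf-8", errors="replace") as fh:
        for lineno, line in enumerate(fh, start=1):
            labels = labels_by_line.get(lineno, ["normal"])
            if "escalate" in labels:
                print(lineno, labels, line.rstrip())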

    Besides the attack labels, a general overview of the exact times at which specific attack steps are launched is available in gather/attacker_0/logs/attacks.log. An enumeration of all hosts and their IP addresses is stated in processing/config/servers.yml. Moreover, configurations of each host are provided in gather/ and gather/.

    Version history:

    • AIT-LDS-v1.x: Four datasets, logs from single host, fine-granular audit logs, mail/CMS.
    • AIT-LDS-v2.0: Eight datasets, logs from all hosts, system logs and network traffic, mail/CMS/cloud/web.

    Acknowledgements: Partially funded by the FFG projects INDICAETING (868306) and DECEPT (873980), and the EU projects GUARD (833456) and PANDORA (SI2.835928).

    If you use the dataset, please cite the following publications:

    [1] M. Landauer, F. Skopik, M. Frank, W. Hotwagner,

  7. Comprehensive Network Logs Dataset for Multi-Device Analysis

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 11, 2024
    Cite
    Salman, Mahmood (2024). Comprehensive Network Logs Dataset for Multi-Device Analysis [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10492769
    Explore at:
    Dataset updated
    Jan 11, 2024
    Dataset provided by
    Salman, Mahmood
    Hasan, Raza
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset comprises diverse logs from various sources, including cloud services, routers, switches, virtualization, network security appliances, authentication systems, DNS, operating systems, packet captures, proxy servers, servers, syslog data, and network data. The logs encompass a wide range of information such as traffic details, user activities, authentication events, DNS queries, network flows, security actions, and system events. By analyzing these logs collectively, users can gain insights into network patterns, anomalies, user authentication, cloud service usage, DNS traffic, network flows, security incidents, and system activities. The dataset is invaluable for network monitoring, performance analysis, anomaly detection, security investigations, and correlating events across the entire network infrastructure.

  8. NASA HTTP Logs Dataset - Processed for LSTM Models

    • kaggle.com
    Updated Jul 26, 2024
    Cite
    Pasan Bhanu Guruge (2024). NASA HTTP Logs Dataset - Processed for LSTM Models [Dataset]. https://www.kaggle.com/datasets/pasanbhanuguruge/nasa-http-logs-dataset-processed-for-lstm-models
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jul 26, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Pasan Bhanu Guruge
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    These two traces contain two months' worth of all HTTP requests to the NASA Kennedy Space Center WWW server in Florida. The first log was collected from 00:00:00 July 1, 1995 through 23:59:59 July 31, 1995, a total of 31 days. The second log was collected from 00:00:00 August 1, 1995 through 23:59:59 August 31, 1995, a total of 31 days. In this two-month period there were 3,461,612 requests. Timestamps have 1-second resolution. Note that from 01/Aug/1995:14:52:01 until 03/Aug/1995:04:36:13 there are no accesses recorded, as the web server was shut down due to Hurricane Erin.

    Acknowledgements

    The logs were collected by Jim Dumoulin of the Kennedy Space Center, and contributed by Martin Arlitt (mfa126@cs.usask.ca) and Carey Williamson (carey@cs.usask.ca) of the University of Saskatchewan.

    Source

    https://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html

  9. AIT Log Data Set V1.1

    • data.niaid.nih.gov
    Updated Oct 18, 2023
    + more versions
    Cite
    Hotwagner, Wolfgang (2023). AIT Log Data Set V1.1 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3723082
    Explore at:
    Dataset updated
    Oct 18, 2023
    Dataset provided by
    Skopik, Florian
    Hotwagner, Wolfgang
    Wurzenberger, Markus
    Landauer, Max
    Rauber, Andreas
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    AIT Log Data Sets

    This repository contains synthetic log data suitable for evaluation of intrusion detection systems. The logs were collected from four independent testbeds that were built at the Austrian Institute of Technology (AIT) following the approach by Landauer et al. (2020) [1]. Please refer to the paper for more detailed information on automatic testbed generation and cite it if the data is used for academic publications. In brief, each testbed simulates user accesses to a webserver that runs Horde Webmail and OkayCMS. The duration of the simulation is six days. On the fifth day (2020-03-04) two attacks are launched against each web server.

    The archive AIT-LDS-v1_0.zip contains the directories "data" and "labels".

    The data directory is structured as follows. Each directory mail.

    Setup details of the web servers:

    OS: Debian Stretch 9.11.6

    Services:

    Apache2

    PHP7

    Exim 4.89

    Horde 5.2.22

    OkayCMS 2.3.4

    Suricata

    ClamAV

    MariaDB

    Setup details of user machines:

    OS: Ubuntu Bionic

    Services:

    Chromium

    Firefox

    User host machines are assigned to web servers in the following way:

    mail.cup.com is accessed by users from host machines user-{0, 1, 2, 6}

    mail.spiral.com is accessed by users from host machines user-{3, 5, 8}

    mail.insect.com is accessed by users from host machines user-{4, 9}

    mail.onion.com is accessed by users from host machines user-{7, 10}

    The following attacks are launched against the web servers (different starting times for each web server, please check the labels for exact attack times):

    Attack 1: multi-step attack with sequential execution of the following attacks:

    nmap scan

    nikto scan

    smtp-user-enum tool for account enumeration

    hydra brute force login

    webshell upload through Horde exploit (CVE-2019-9858)

    privilege escalation through Exim exploit (CVE-2019-10149)

    Attack 2: webshell injection through malicious cookie (CVE-2019-16885)

    Attacks are launched from the following user host machines. In each of the corresponding directories user-

    user-6 attacks mail.cup.com

    user-5 attacks mail.spiral.com

    user-4 attacks mail.insect.com

    user-7 attacks mail.onion.com

    The log data collected from the web servers includes

    Apache access and error logs

    syscall logs collected with the Linux audit daemon

    suricata logs

    exim logs

    auth logs

    daemon logs

    mail logs

    syslogs

    user logs

    Note that due to their large size, the audit/audit.log files of each server were compressed in a .zip-archive. In case that these logs are needed for analysis, they must first be unzipped.

    Labels are organized in the same directory structure as logs. Each file contains two labels for each log line separated by a comma, the first one based on the occurrence time, the second one based on similarity and ordering. Note that this does not guarantee correct labeling for all lines and that no manual corrections were conducted.
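
    Since labels and logs share the same relative paths, the two label columns can be paired with the raw log lines by index; the paths below are purely illustrative (the exact directory layout is not reproduced here) and would need to be replaced with real file locations from the archive.

    from collections import Counter

    # Hypothetical paths: label files mirror the log directory structure.
    LOG_PATH = "data/mail.cup.com/apache2/access.log"
    LABEL_PATH = "labels/mail.cup.com/apache2/access.log"

    counts = Counter()
    with open(LOG_PATH, errors="replace") as logs, open(LABEL_PATH) as labels:
        for line, label_row in zip(logs, labels):
            time_label, order_label = label_row.strip().split(",", 1)
            counts[(time_label, order_label)] += 1   # occurrence-time label vs. similarity/ordering label
    print(counts.most_common(5))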

    Version history and related data sets:

    AIT-LDS-v1.0: Four datasets, logs from single host, fine-granular audit logs, mail/CMS.

    AIT-LDS-v1.1: Removed carriage return of line endings in audit.log files.

    AIT-LDS-v2.0: Eight datasets, logs from all hosts, system logs and network traffic, mail/CMS/cloud/web.

    Acknowledgements: Partially funded by the FFG projects INDICAETING (868306) and DECEPT (873980), and the EU project GUARD (833456).

    If you use the dataset, please cite the following publication:

    [1] M. Landauer, F. Skopik, M. Wurzenberger, W. Hotwagner and A. Rauber, "Have it Your Way: Generating Customized Log Datasets With a Model-Driven Simulation Testbed," in IEEE Transactions on Reliability, vol. 70, no. 1, pp. 402-415, March 2021, doi: 10.1109/TR.2020.3031317. [PDF]

  10. server_logs

    • kaggle.com
    zip
    Updated Aug 2, 2022
    Cite
    Kevin Odoyo (2022). server_logs [Dataset]. https://www.kaggle.com/datasets/kevinodoyo/server-logs
    Explore at:
    zip(43706569 bytes)
    Dataset updated
    Aug 2, 2022
    Authors
    Kevin Odoyo
    Description

    Dataset

    This dataset was created by Kevin Odoyo


  11. Data from: Pillar 3: Pre-processed web server log file dataset of the banking institution

    • data.mendeley.com
    Updated Dec 6, 2021
    + more versions
    Cite
    Michal Munk (2021). Pillar 3: Pre-processed web server log file dataset of the banking institution [Dataset]. http://doi.org/10.17632/5bvkm76sdc.1
    Explore at:
    Dataset updated
    Dec 6, 2021
    Authors
    Michal Munk
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset represents the pre-processed web server log file of a commercial bank. The source of the data is the bank's web server, which recorded the accesses of web users from 2009 until 2012. It contains accesses to the bank website during and after the financial crisis. Unnecessary data saved by the web server was removed to keep the focus on the textual content of the website. Many variables were added to the original log file to make the analysis workable. To preserve the privacy of website users, sensitive information in the log file was anonymized. The dataset offers a way to understand the behaviour of stakeholders during and after the crisis and how they comply with the Basel regulations.

  12. CRAWDAD usc/mobilib

    • ieee-dataport.org
    Updated Aug 20, 2008
    Cite
    Wei-jen Hsu (2008). CRAWDAD usc/mobilib [Dataset]. http://doi.org/10.15783/C79W25
    Explore at:
    Dataset updated
    Aug 20, 2008
    Dataset provided by
    IEEE Dataport
    Authors
    Wei-jen Hsu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    VPN session, DHCP log, and trap log data from the wireless network at USC. This dataset includes VPN session, DHCP log, and trap log data for 79 access points and several thousand users at USC.

    • date/time of measurement start: 2003-12-23
    • date/time of measurement end: 2006-04-28
    • collection environment: This data set was collected during 2003-2005 at the USC campus, where the number of WLAN users was over 4500.
    • network configuration: At the time of collection, the USC wireless LAN had 79 APs.
    • data collection methodology: These traces are logs of the (start|stop) timestamps of VPN sessions. At USC, wireless users must establish a connection to a VPN server before they can use the network. Hence the session log contains the periods during which users were potentially using the network, together with their private (dynamic) IP addresses.

    Traceset usc/mobilib/session: VPN session logs from the USC wireless network (file: USC_sessions.tgz).
    • description: Logs of the (start|stop) timestamps of VPN sessions.
    • measurement purpose: User Mobility Characterization, Usage Characterization, Human Behavior Modeling.
    • configuration: The session logs are collected at the VPN server for wireless users at USC. Before using the network, users must establish a VPN session to the server. The "Start" and "Stop" timestamps in the trace represent the beginning and the end of these VPN sessions.
    • format: The fields in each line are: 1. Day of the week (Sun, Mon, Tue, Wed, Thu, Fri, Sat) 2. Month 3. Day 4. Time (HH:MM:SS) 5. Action ("Start" or "Stop" of a session) 6. Private IP in the USC network 7. Public IP given to the host.

    Traceset usc/mobilib/dhcp: DHCP logs from the USC wireless network (file: USC_dhcp.tgz).
    • description: The DHCP log of the private IP assignments to MAC addresses. The listed private IP is given to the MAC address at the indicated time.
    • measurement purpose: User Mobility Characterization, Usage Characterization, Human Behavior Modeling.
    • format: The fields are: 1. Month 2. Day 3. Time (HH:MM:SS) 4. Private IP in the USC network 5. MAC address.

    Traceset usc/mobilib/trap: Trap logs from the USC wireless network (files: USC_traps.tgz, USC_old_trap.tgz; trap_log covers 2005, old_trap_log covers 2003-2005).
    • description: The trap log of the (switch port, MAC address) association recorded when a user comes online. It records the approximate location of nodes, since switch ports correspond to buildings in the USC network.
    • measurement purpose: User Mobility Characterization, Usage Characterization, Human Behavior Modeling.
    • methodology: If a MAC re-appears at the same switch port where it was last online, the trap log may NOT record this event; hence the trap log must be used in conjunction with the session log to discover all association sessions. The trap log is therefore mainly used as an indication of the "last seen" location of a node, which is assumed not to move unless a new trap entry indicates otherwise. The file [Mapping] gives the mapping between switch (IP, port) and the building codes of the USC campus; the USC campus map is available through the university website.
    • limitation: WARNING: the trap log alone does NOT contain all user online events. If a user comes online at the same switch port repeatedly, no separate trap entry is created for each new online event. Also, the trap log only records the online epoch, not online duration information of any kind.
    • format: The fields are: 1. Month 2. Day 3. Time (HH:MM:SS) 4. Switch IP 5. Switch port (switch IP + switch port locate the node on the USC campus map; the Mapping file is also available online) 6. MAC address.

    Traceset usc/mobilib/association: Association history from the USC wireless network (files: trace_processing_code.tgz, USC_duration_trace.tgz, USC_2005_summer.tgz, USC_06spring_trace.tar.gz).
    • description: "Association history" traces for individual MAC addresses, consisting of the start times and end times of a MAC associated with various locations. The location granularity is per switch port, roughly corresponding to buildings on campus.
    • measurement purpose: User Mobility Characterization, Usage Characterization, Human Behavior Modeling.
    • methodology: From the raw traces (session, dhcp, and trap) it is possible to find out user locations when they are online. Three files are involved: (1) the session file, with records of the start/stop of association sessions and the corresponding private IP address; (2) the dhcp file, with records of private IP to MAC address bindings; (3) the trap file, with records of MAC addresses showing up at switch ports. The conversion obtains session durations from (1), converts the IP addresses in (1) to MAC addresses using (2), and finally finds the locations of these MAC addresses using (3). The file [Processing code] is the program code used for trace processing; for more detail, see [Memo of USC trace processing].
    • traces:
      • duration_log: association history for one month; each MAC address has its history in a separate file. Fields: 1. Start timestamp (elapsed time since Apr. 1, 2005, in seconds) 2. Location (the building code of the association record) 3. Duration (of the association record, in seconds).
      • summer_duration_log: association history for the whole summer of 2005, one file per MAC address. Note that the USC summer vacation runs from mid-May to mid-August, and WLAN activity was significantly reduced during that period. Same fields as duration_log.
      • spring_2006_duration_log: association history for Spring 2006, covering 25,481 users that appeared between Jan. 25, 2006 and Apr. 28, 2006 across 137 unique locations. Each location roughly corresponds to a building on campus and is encoded as IP_port (the actual switch port that controls traffic to/from this location). Fields: 1. Start timestamp (elapsed time since Jan. 1, 2006, in seconds) 2. Location (IP_port format) 3. Duration (in seconds). For more information on the trace format and the processing procedure, refer to the documents [Memo Format USC06] and [Memo processing USC06].
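
    A small sketch for reading the VPN session trace based on the field list above; the whitespace layout and the name of the extracted trace file are assumptions.

    from collections import namedtuple

    SessionEvent = namedtuple("SessionEvent", "weekday month day time action private_ip public_ip")

    def parse_session_line(line):
        """Split one session-log line into the seven documented fields (assumed whitespace-separated)."""
        parts = line.split()
        return SessionEvent(*parts[:7]) if len(parts) >= 7 else None

    starts = stops = 0
    with open("USC_sessions.txt") as fh:          # hypothetical name of an extracted trace file
        for line in fh:
            event = parse_session_line(line)
            if event is None:
                continue
            if event.action == "Start":
                starts += 1
            elif event.action == "Stop":
                stops += 1
    print("sessions started:", starts, "stopped:", stops)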

  13. Server_logs_dataset

    • kaggle.com
    zip
    Updated Jul 1, 2020
    Cite
    Azmat (2020). Server_logs_dataset [Dataset]. https://www.kaggle.com/azmatsiddique/server-logs-dataset
    Explore at:
    zip(12997 bytes)
    Dataset updated
    Jul 1, 2020
    Authors
    Azmat
    Description

    Dataset

    This dataset was created by Azmat


  14. Elasticsearch Server JSON Logs

    • zenodo.org
    application/gzip
    Updated Jan 16, 2024
    Cite
    Rui Wang; Devin Gibson (2024). Elasticsearch Server JSON Logs [Dataset]. http://doi.org/10.5281/zenodo.10516227
    Explore at:
    application/gzip
    Dataset updated
    Jan 16, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Rui Wang; Devin Gibson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Logs from an Elasticsearch server instance. The logs were generated by using Elasticsearch to index another JSON dataset.

  15. Linux Logs

    • kaggle.com
    Updated Feb 22, 2023
    Cite
    Srinidhi (2023). Linux Logs [Dataset]. https://www.kaggle.com/datasets/ggsri123/linux-logs
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Feb 22, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Srinidhi
    Description

    Linux Logs

    About the Dataset

    This data contains 2k log lines from the Linux dataset in the LogPai GitHub repository. The first file contains just the log lines. The second contains the log lines with their categorized fields, namely Month, Date, Time, Level, Component, PID, Content, EventID, and EventTemplate.
    

    Interesting Task Ideas

    1. Understanding the frequency of different Event Types (EventID) that occur in the log set.
    2. Identifying anomalies in the logs, if any exist.
    3. Named Entity Recognition - To identify different fields of the log set from the set-aside data.
    4. Multiclass classification - To identify which Event Type (EventID) a log line belongs to.
    5. Giving variable parts (<*>) a name, and adding them to the entity recognition task. [Boss level!]
    

    Point 5 explanation: In the 3rd file, named Linux_2k.log_templates.csv, there is a template for each of the event types (given by EventIDs). The template consists of a variable portion (given by <*>) and a constant portion (the other words in the template). The value of a variable part can be found by comparing the template against a log line containing this template. A name could be assigned to the variable part so that it can be included in the named entity recognition task. Keep in mind that the frequency of a variable part might be limited.
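
    A small sketch of the template-matching idea: turn an EventTemplate into a regular expression with one capture group per <*> placeholder and recover the variable values from a matching line (the template and log line below are made-up examples, not taken from the files).

    import re

    def template_to_regex(template):
        """Compile an EventTemplate such as 'session opened for user <*> by (uid=<*>)'
        into a regex with one capture group per <*> placeholder."""
        parts = [re.escape(part) for part in template.split("<*>")]
        return re.compile("(.+?)".join(parts) + r"$")

    template = "session opened for user <*> by (uid=<*>)"   # illustrative template
    line = "session opened for user news by (uid=0)"        # illustrative log content
    match = template_to_regex(template).match(line)
    if match:
        print(match.groups())   # ('news', '0') -> candidate named entities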

    Note: An important idea to have in mind is that one will have to focus on the syntax more than the semantics of a log line.

    Have fun understanding how to apply NLP concepts to Log Datasets! 😀

    Check out my other Datasets here

    MIT License
    
    Copyright (c) 2018 LogPAI
    
    Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
    
    The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
    
  16. Web Server Access Log

    • zenodo.org
    zip
    Updated May 20, 2024
    Cite
    Rikhi Ram Jagat; Dilip Singh Sisodia; Pradeep Singh (2024). Web Server Access Log [Dataset]. http://doi.org/10.5281/zenodo.7895435
    Explore at:
    zip
    Dataset updated
    May 20, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Rikhi Ram Jagat; Dilip Singh Sisodia; Pradeep Singh
    Description

    Web server access log of a small e-commerce website that sells courses.

  17. 1998 World Cup Website Access Logs

    • zenodo.org
    application/gzip
    Updated Jul 30, 2021
    Cite
    Sona Ghahremani (2021). 1998 World Cup Website Access Logs [Dataset]. http://doi.org/10.5281/zenodo.5145855
    Explore at:
    application/gzip
    Dataset updated
    Jul 30, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sona Ghahremani
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    World
    Description

    Description:

    The access logs, as well as the accompanying description, are directly taken from [1] and include traffic of the 1998 World Cup website on three days as follows. The log files have the following naming format "wc_dayX_Y.gz"

    where:

    • X is an integer that represents the day the access log was collected
    • Y is an integer that represents the subinterval for a particular day

    This collection includes three log files containing the access traffic on three different days as listed below:

    wc_day25_1.gz May 20, 1998 -> TR1
    wc_day9_1.gz May 4, 1998  -> TR2
    wc_day28_1.gz May 23, 1998 -> TR3

    Format

    The access logs from the 1998 World Cup Web site were originally in the Common Log Format. In order to reduce both the size of the logs and the analysis time the access logs were converted to a binary format (big endian = network order). Each entry in the binary log is a fixed size and represents a single request to the site. The format of a request in the binary log looks like:

    struct request
    {
     uint32_t timestamp;
     uint32_t clientID;
     uint32_t objectID;
     uint32_t size;
     uint8_t method;
     uint8_t status;
     uint8_t type;
     uint8_t server;
    };

    The fields of the request structure contain the following information:

    timestamp - the time of the request, stored as the number of seconds since the Epoch. The timestamp has been converted to GMT to allow for portability. During the World Cup the local time was 2 hours ahead of GMT (+0200). In order to determine the local time, each timestamp must be adjusted by this amount.

    clientID - a unique integer identifier for the client that issued the request (this may be a proxy); due to privacy concerns these mappings cannot be released; note that each clientID maps to exactly one IP address, and the mappings are preserved across the entire data set - that is if IP address 0.0.0.0 mapped to clientID X on day Y then any request in any of the data sets containing clientID X also came from IP address 0.0.0.0

    objectID - a unique integer identifier for the requested URL; these mappings are also 1-to-1 and are preserved across the entire data set

    size - the number of bytes in the response

    method - the method contained in the client's request (e.g., GET).

    status - this field contains two pieces of information; the 2 highest order bits contain the HTTP version indicated in the client's request (e.g., HTTP/1.0); the remaining 6 bits indicate the response status code (e.g., 200 OK).

    type - the type of file requested (e.g., HTML, IMAGE, etc), generally based on the file extension (.html), or the presence of a parameter list (e.g., '?' indicates a DYNAMIC request). If the url ends with '/', it is considered a DIRECTORY.

    server - indicates which server handled the request. The upper 3 bits indicate which region the server was at (e.g., SANTA CLARA, PLANO, HERNDON, PARIS); the remaining bits indicate which server at the site handled the request. All 8 bits can also be used to determine a unique server.
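
    Based on the struct above, each record is 20 bytes in big-endian order (four uint32 fields followed by four uint8 fields), so the files can be decoded without the original tools; note that the status, type, and server fields hold encoded values as described, so mapping them back to HTTP status codes, file types, or server names requires the accompanying documentation. A minimal sketch:

    import gzip
    import struct
    from datetime import datetime, timezone

    RECORD = struct.Struct(">IIIIBBBB")   # big-endian: 4 x uint32 then 4 x uint8, 20 bytes per request

    with gzip.open("wc_day25_1.gz", "rb") as fh:
        for _ in range(5):                  # decode just the first few records as a demonstration
            chunk = fh.read(RECORD.size)
            if len(chunk) < RECORD.size:
                break
            (timestamp, client_id, object_id, size,
             method, status, ftype, server) = RECORD.unpack(chunk)
            http_version = status >> 6      # two highest-order bits of the status field
            status_bits = status & 0x3F     # remaining six bits (encoded value, not the literal HTTP code)
            region = server >> 5            # upper three bits of the server field
            print(datetime.fromtimestamp(timestamp, tz=timezone.utc),
                  client_id, object_id, size, http_version, status_bits, region)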

    Reference

    [1] M. Arlitt and T. Jin, "1998 World Cup Web Site Access Logs", August 1998.

  18. Longitudinal navigation log data on the Radboud University web domain

    • phys-techsciences.datastations.nl
    pdf, txt, zip
    Updated Jan 8, 2024
    + more versions
    Cite
    S. Verberne; de Vries; W. Kraaij (2024). Longitudinal navigation log data on the Radboud University web domain [Dataset]. http://doi.org/10.17026/dans-28m-mwht
    Explore at:
    txt(1896), pdf(539643), zip(17660), zip(244268022)
    Dataset updated
    Jan 8, 2024
    Dataset provided by
    Data Archiving and Networked Services
    Authors
    S. Verberne; de Vries; W. Kraaij
    License

    https://doi.org/10.17026/fp39-0x58

    Description

    We have collected the access logs for our university's web domain over a time span of 4.5 years. We now release the pre-processed web server log of a 3-month period for research into user navigation behavior. We preprocessed the data so that only successful GET requests of web pages by non-bot users are kept. The information that is included per entry is: unique user id, timestamp, GET request (URL), status code, the size of the object returned to the client, and the referrer URL. The resulting size of the 3-month collection is 9.6M page visits (190K unique URLs) by 744K unique visitors. The data collection allows for research on, among other things, user navigation, browsing and stopping behavior and web user clustering. Date Submitted: 2016-04-28

  19. Passive Operating System Fingerprinting Revisited - Network Flows Dataset

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Feb 14, 2023
    Cite
    Martin Laštovička; Martin Husák; Petr Velan; Tomáš Jirsík; Pavel Čeleda (2023). Passive Operating System Fingerprinting Revisited - Network Flows Dataset [Dataset]. http://doi.org/10.5281/zenodo.7635138
    Explore at:
    zip
    Dataset updated
    Feb 14, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Martin Laštovička; Martin Husák; Petr Velan; Tomáš Jirsík; Pavel Čeleda
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    For the evaluation of OS fingerprinting methods, we need a dataset with the following requirements:

    • First, the dataset needs to be big enough to capture the variability of the data. In this case, we need many connections from different operating systems.
    • Second, the dataset needs to be annotated, which means that the corresponding operating system needs to be known for each network connection captured in the dataset. Therefore, we cannot just capture any network traffic for our dataset; we need to be able to determine the OS reliably.

    To overcome these issues, we have decided to create the dataset from the traffic of several web servers at our university. This allows us to address the first issue by collecting traces from thousands of devices ranging from user computers and mobile phones to web crawlers and other servers. The ground truth values are obtained from the HTTP User-Agent, which resolves the second of the presented issues. Even though most traffic is encrypted, the User-Agent can be recovered from the web server logs that record every connection’s details. By correlating the IP address and timestamp of each log record to the captured traffic, we can add the ground truth to the dataset.

    For this dataset, we have selected a cluster of five web servers that host 475 unique university domains for public websites. The monitoring point recording the traffic was placed at the backbone network connecting the university to the Internet.

    The dataset used in this paper was collected from approximately 8 hours of university web traffic throughout a single workday. The logs were collected from Microsoft IIS web servers and converted from W3C extended logging format to JSON. The logs are referred to as web logs and are used to annotate the records generated from packet capture obtained by using a network probe tapped into the link to the Internet.

    The entire dataset creation process consists of seven steps:

    1. The packet capture was processed by the Flowmon flow exporter (https://www.flowmon.com) to obtain primary flow data containing information from TLS and HTTP protocols.
    2. Additional statistical features were extracted using GoFlows flow exporter (https://github.com/CN-TU/go-flows).
    3. The primary flows were filtered to remove incomplete records and network scans.
    4. The flows from both exporters were merged together into records containing fields from both sources.
    5. Web logs were filtered to cover the same time frame as the flow records.
    6. Web logs were paired with the flow records based on shared properties (IP address, port, time).
    7. The last step was to convert the User-Agent values into the operating system using a Python version of the open-source tool ua-parser (https://github.com/ua-parser/uap-python). We replaced the unstructured User-Agent string in the records with the resulting OS.
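
    Step 7 can be reproduced with the open-source ua-parser library; the snippet below uses the legacy uap-python entry point (newer releases expose a different API), and the example User-Agent string is illustrative.

    from ua_parser import user_agent_parser   # pip install ua-parser (legacy uap-python API)

    ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
          "(KHTML, like Gecko) Chrome/96.0 Safari/537.36")

    parsed = user_agent_parser.Parse(ua)       # dict with 'user_agent', 'os', and 'device' sub-dicts
    os_info = parsed["os"]
    print(os_info["family"], os_info.get("major"), os_info.get("minor"))   # e.g. "Windows 10 None"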

    The collected and enriched flows contain 111 data fields that can be used as features for OS fingerprinting or any other data analyses. The fields grouped by their area are listed below:

    • basic flow properties - flow_ID;start;end;L3 PROTO;L4 PROTO;BYTES A;PACKETS A;SRC IP;DST IP;TCP flags A;SRC port;DST port;packetTotalCountforward;packetTotalCountbackward;flowDirection;flowEndReason;
    • IP parameters - IP ToS;maximumTTLforward;maximumTTLbackward;IPv4DontFragmentforward;IPv4DontFragmentbackward;
    • TCP parameters - TCP SYN Size;TCP Win Size;TCP SYN TTL;tcpTimestampFirstPacketbackward;tcpOptionWindowScaleforward;tcpOptionWindowScalebackward;tcpOptionSelectiveAckPermittedforward;tcpOptionSelectiveAckPermittedbackward;tcpOptionMaximumSegmentSizeforward;tcpOptionMaximumSegmentSizebackward;tcpOptionNoOperationforward;tcpOptionNoOperationbackward;synAckFlag;tcpTimestampFirstPacketforward;
    • HTTP - HTTP Request Host;URL;
    • User-agent - UA OS family;UA OS major;UA OS minor;UA OS patch;UA OS patch minor;
    • TLS - TLS_CONTENT_TYPE;TLS_HANDSHAKE_TYPE;TLS_SETUP_TIME;TLS_SERVER_VERSION;TLS_SERVER_RANDOM;TLS_SERVER_SESSION_ID;TLS_CIPHER_SUITE;TLS_ALPN;TLS_SNI;TLS_SNI_LENGTH;TLS_CLIENT_VERSION;TLS_CIPHER_SUITES;TLS_CLIENT_RANDOM;TLS_CLIENT_SESSION_ID;TLS_EXTENSION_TYPES;TLS_EXTENSION_LENGTHS;TLS_ELLIPTIC_CURVES;TLS_EC_POINT_FORMATS;TLS_CLIENT_KEY_LENGTH;TLS_ISSUER_CN;TLS_SUBJECT_CN;TLS_SUBJECT_ON;TLS_VALIDITY_NOT_BEFORE;TLS_VALIDITY_NOT_AFTER;TLS_SIGNATURE_ALG;TLS_PUBLIC_KEY_ALG;TLS_PUBLIC_KEY_LENGTH;TLS_JA3_FINGERPRINT;
    • Packet timings - NPM_CLIENT_NETWORK_TIME;NPM_SERVER_NETWORK_TIME;NPM_SERVER_RESPONSE_TIME;NPM_ROUND_TRIP_TIME;NPM_RESPONSE_TIMEOUTS_A;NPM_RESPONSE_TIMEOUTS_B;NPM_TCP_RETRANSMISSION_A;NPM_TCP_RETRANSMISSION_B;NPM_TCP_OUT_OF_ORDER_A;NPM_TCP_OUT_OF_ORDER_B;NPM_JITTER_DEV_A;NPM_JITTER_AVG_A;NPM_JITTER_MIN_A;NPM_JITTER_MAX_A;NPM_DELAY_DEV_A;NPM_DELAY_AVG_A;NPM_DELAY_MIN_A;NPM_DELAY_MAX_A;NPM_DELAY_HISTOGRAM_1_A;NPM_DELAY_HISTOGRAM_2_A;NPM_DELAY_HISTOGRAM_3_A;NPM_DELAY_HISTOGRAM_4_A;NPM_DELAY_HISTOGRAM_5_A;NPM_DELAY_HISTOGRAM_6_A;NPM_DELAY_HISTOGRAM_7_A;NPM_JITTER_DEV_B;NPM_JITTER_AVG_B;NPM_JITTER_MIN_B;NPM_JITTER_MAX_B;NPM_DELAY_DEV_B;NPM_DELAY_AVG_B;NPM_DELAY_MIN_B;NPM_DELAY_MAX_B;NPM_DELAY_HISTOGRAM_1_B;NPM_DELAY_HISTOGRAM_2_B;NPM_DELAY_HISTOGRAM_3_B;NPM_DELAY_HISTOGRAM_4_B;NPM_DELAY_HISTOGRAM_5_B;NPM_DELAY_HISTOGRAM_6_B;NPM_DELAY_HISTOGRAM_7_B;
    • ICMP - ICMP TYPE;

    The details of OS distribution grouped by the OS family are summarized in the table below. The Other OS family contains records generated by web crawling bots that do not include OS information in the User-Agent.

    OS Family       Number of flows
    Other           42474
    Windows         40349
    Android         10290
    iOS             8840
    Mac OS X        5324
    Linux           1589
    Ubuntu          653
    Fedora          88
    Chrome OS       53
    Symbian OS      1
    Slackware       1
    Linux Mint      1

  20. TDA-Gallica

    • marketplace.sshopencloud.eu
    Updated May 10, 2023
    Cite
    (2023). TDA-Gallica [Dataset]. https://marketplace.sshopencloud.eu/dataset/armYxP
    Explore at:
    Dataset updated
    May 10, 2023
    Description

    This dataset, containing a topological analysis of server logs, was created in a project that aimed at documenting the behavior of scientists on online platforms by making sense of the digital traces they generate while navigating. The repository contains the Jupyter notebook that was run on the cluster (its aim was to construct sessions from the large volume of Gallica user navigation data), the Jupyter notebook that contains the topological data analysis and cluster visualizations, and the final report of the project.
