5 datasets found
  1. Server Logs

    • kaggle.com
    Updated Oct 12, 2021
    Cite
    Vishnu U (2021). Server Logs [Dataset]. https://www.kaggle.com/vishnu0399/server-logs/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 12, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Vishnu U
    License

    CC0 1.0 Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    Context

    The dataset is a synthetically generated server log based on the Apache Server Logging Format. Each line corresponds to one log entry, and each entry has the following components (a parsing sketch follows the list):

    Components in a Log Entry:

    • IP of client: This refers to the IP address of the client that sent the request to the server.
    • Remote Log Name: Remote name of the User performing the request. In the majority of the applications, this is confidential information and is hidden or not available.
    • User ID: The ID of the user performing the request. In the majority of the applications, this is a piece of confidential information and is hidden or not available.
    • Date and Time in UTC format: The date and time of the request, represented in UTC as follows: Day/Month/Year:Hour:Minute:Second +Time-Zone-Correction.
    • Request Type: The type of request (GET, POST, PUT, DELETE) that the server got. This depends on the operation that the request will do.
    • API: The API of the website to which the request is related. Example: when a user accesses the cart on a shopping website, the API appears as /usr/cart.
    • Protocol and Version: Protocol used for connecting with server and its version.
    • Status Code: The status code that the server returned for the request. E.g., 404 is sent when a requested resource is not found; 200 is sent when the request was successfully served.
    • Byte: The amount of data in bytes that was sent back to the client.
    • Referrer: The website/source from which the user was directed to the current website. If there is none, it is represented by "-".
    • UA String: The user agent string contains details of the browser and the host device (like the name, version, device type etc.).
    • Response Time: The response time the server took to serve the request. This is the difference between the timestamps when the request was received and when the request was served.
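
    As an illustrative aid (not part of the dataset), the following minimal Python sketch parses one such entry with a regular expression, assuming the field order listed above (a combined-log-style layout with a trailing response-time field); adjust the pattern if the generated log deviates from this guess:

    import re

    # Assumed layout: IP, remote log name, user ID, [timestamp], "METHOD API PROTOCOL/VERSION",
    # status, bytes, "referrer", "user agent", response time.
    LOG_PATTERN = re.compile(
        r'(?P<ip>\S+) (?P<remote_log_name>\S+) (?P<user_id>\S+) '
        r'\[(?P<timestamp>[^\]]+)\] "(?P<method>\S+) (?P<api>\S+) (?P<protocol>[^"]+)" '
        r'(?P<status>\d{3}) (?P<bytes>\d+) "(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)" '
        r'(?P<response_time>\d+)'
    )

    def parse_line(line):
        """Return the log entry as a dict of named fields, or None if the line does not match."""
        match = LOG_PATTERN.match(line)
        return match.groupdict() if match else None

    # Hypothetical sample entry following the layout above
    sample = ('203.0.113.7 - user42 [12/Oct/2021:08:30:00 +0000] '
              '"GET /usr/cart HTTP/1.1" 200 5123 "-" "Mozilla/5.0" 120')
    print(parse_line(sample))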

    Content

    The dataset consists of two files:

    • logfiles.log – the actual log file in text format
    • TestFileGenerator.py – the synthetic log file generator; the number of log entries required can be edited in the code.

  2. AIT Log Data Set V2.0

    • zenodo.org
    • explore.openaire.eu
    • +1 more
    zip
    Updated Jun 28, 2024
    Cite
    Max Landauer; Florian Skopik; Maximilian Frank; Wolfgang Hotwagner; Markus Wurzenberger; Andreas Rauber (2024). AIT Log Data Set V2.0 [Dataset]. http://doi.org/10.5281/zenodo.5789064
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 28, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Max Landauer; Florian Skopik; Maximilian Frank; Wolfgang Hotwagner; Markus Wurzenberger; Andreas Rauber
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    AIT Log Data Sets

    This repository contains synthetic log data suitable for evaluation of intrusion detection systems, federated learning, and alert aggregation. A detailed description of the dataset is available in [1]. The logs were collected from eight testbeds that were built at the Austrian Institute of Technology (AIT) following the approach by [2]. Please cite these papers if the data is used for academic publications.

    In brief, each of the datasets corresponds to a testbed representing a small enterprise network including mail server, file share, WordPress server, VPN, firewall, etc. Normal user behavior is simulated to generate background noise over a time span of 4-6 days. At some point, a sequence of attack steps is launched against the network. Log data is collected from all hosts and includes Apache access and error logs, authentication logs, DNS logs, VPN logs, audit logs, Suricata logs, network traffic packet captures, horde logs, exim logs, syslog, and system monitoring logs. Separate ground truth files are used to label events that are related to the attacks. Compared to the AIT-LDSv1.1, a more complex network and diverse user behavior is simulated, and logs are collected from all hosts in the network. If you are only interested in network traffic analysis, we also provide the AIT-NDS containing the labeled netflows of the testbed networks. We also provide the AIT-ADS, an alert data set derived by forensically applying open-source intrusion detection systems on the log data.

    The datasets in this repository have the following structure:

    • The gather directory contains all logs collected from the testbed. Logs collected from each host are located in gather/.
    • The labels directory contains the ground truth of the dataset, indicating which events are related to attacks. The directory mirrors the structure of the gather directory so that each label file is located at the same path and has the same name as the corresponding log file. Each line in the label files references the log event corresponding to an attack by the line number counted from the beginning of the file ("line"), the labels assigned to that line stating the respective attack step ("labels"), and the labeling rules that assigned the labels ("rules"). An example is provided below.
    • The processing directory contains the source code that was used to generate the labels.
    • The rules directory contains the labeling rules.
    • The environment directory contains the source code that was used to deploy the testbed and run the simulation using the Kyoushi Testbed Environment.
    • The dataset.yml file specifies the start and end time of the simulation.

    The following list summarizes relevant properties of the datasets:

    • fox
      • Simulation time: 2022-01-15 00:00 - 2022-01-20 00:00
      • Attack time: 2022-01-18 11:59 - 2022-01-18 13:15
      • Scan volume: High
      • Unpacked size: 26 GB
    • harrison
      • Simulation time: 2022-02-04 00:00 - 2022-02-09 00:00
      • Attack time: 2022-02-08 07:07 - 2022-02-08 08:38
      • Scan volume: High
      • Unpacked size: 27 GB
    • russellmitchell
      • Simulation time: 2022-01-21 00:00 - 2022-01-25 00:00
      • Attack time: 2022-01-24 03:01 - 2022-01-24 04:39
      • Scan volume: Low
      • Unpacked size: 14 GB
    • santos
      • Simulation time: 2022-01-14 00:00 - 2022-01-18 00:00
      • Attack time: 2022-01-17 11:15 - 2022-01-17 11:59
      • Scan volume: Low
      • Unpacked size: 17 GB
    • shaw
      • Simulation time: 2022-01-25 00:00 - 2022-01-31 00:00
      • Attack time: 2022-01-29 14:37 - 2022-01-29 15:21
      • Scan volume: Low
      • Data exfiltration is not visible in DNS logs
      • Unpacked size: 27 GB
    • wardbeck
      • Simulation time: 2022-01-19 00:00 - 2022-01-24 00:00
      • Attack time: 2022-01-23 12:10 - 2022-01-23 12:56
      • Scan volume: Low
      • Unpacked size: 26 GB
    • wheeler
      • Simulation time: 2022-01-26 00:00 - 2022-01-31 00:00
      • Attack time: 2022-01-30 07:35 - 2022-01-30 17:53
      • Scan volume: High
      • No password cracking in attack chain
      • Unpacked size: 30 GB
    • wilson
      • Simulation time: 2022-02-03 00:00 - 2022-02-09 00:00
      • Attack time: 2022-02-07 10:57 - 2022-02-07 11:49
      • Scan volume: High
      • Unpacked size: 39 GB

    The following attacks are launched in the network:

    • Scans (nmap, WPScan, dirb)
    • Webshell upload (CVE-2020-24186)
    • Password cracking (John the Ripper)
    • Privilege escalation
    • Remote command execution
    • Data exfiltration (DNSteal)

    Note that attack parameters and their execution orders vary in each dataset. Labeled log files are trimmed to the simulation time to ensure that their labels (which reference the related event by the line number in the file) are not misleading. Other log files, however, also contain log events generated before or after the simulation time and may therefore be affected by testbed setup or data collection. It is therefore recommended to only consider logs with timestamps within the simulation time for analysis.

    The structure of labels is explained using the audit logs from the intranet server in the russellmitchell data set as an example in the following. The first four labels in the labels/intranet_server/logs/audit/audit.log file are as follows:

    {"line": 1860, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}

    {"line": 1861, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}

    {"line": 1862, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}

    {"line": 1863, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}

    Each JSON object in this file assigns a label to one specific log line in the corresponding log file located at gather/intranet_server/logs/audit/audit.log. The field "line" in each JSON object specifies the line number of the respective event in the original log file, while the field "labels" comprises the corresponding labels. For example, the lines in the sample above state that lines 1860-1863 in the gather/intranet_server/logs/audit/audit.log file are labeled with "attacker_change_user" and "escalate", corresponding to the attack step where the attacker obtains escalated privileges. Inspecting these lines shows that they indeed correspond to the user authenticating as root:

    type=USER_AUTH msg=audit(1642999060.603:2226): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:authentication acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'

    type=USER_ACCT msg=audit(1642999060.603:2227): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:accounting acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'

    type=CRED_ACQ msg=audit(1642999060.615:2228): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:setcred acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'

    type=USER_START msg=audit(1642999060.627:2229): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:session_open acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'

    The same applies to all other labels for this log file and all other log files. There are no labels for logs generated by "normal" (i.e., non-attack) behavior; instead, all log events that have no corresponding JSON object in one of the files from the labels directory, such as lines 1-1859 in the example above, can be considered labeled as "normal". This means that in order to determine the labels for the log data, it is necessary to keep track of line numbers when processing the original logs from the gather directory and check whether those line numbers also appear in the corresponding file in the labels directory.
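
    As an illustration of this lookup (a minimal sketch under the conventions described above, not part of the dataset's own tooling), the following Python snippet attaches the listed labels to the corresponding lines of a gather-side log and treats all unlabeled lines as "normal":

    import json

    def load_labels(label_path):
        """Map 1-based line numbers (field "line") to their attack labels (field "labels")."""
        labels = {}
        with open(label_path) as f:
            for raw in f:
                raw = raw.strip()
                if raw:
                    entry = json.loads(raw)
                    labels[entry["line"]] = entry["labels"]
        return labels

    def label_log(log_path, label_path):
        """Yield (line_number, labels, log_line); lines without a label entry are 'normal'."""
        labels = load_labels(label_path)
        with open(log_path, errors="replace") as f:
            for line_no, line in enumerate(f, start=1):
                yield line_no, labels.get(line_no, ["normal"]), line.rstrip("\n")

    # Example paths from the russellmitchell dataset discussed above:
    # for no, labs, event in label_log("gather/intranet_server/logs/audit/audit.log",
    #                                  "labels/intranet_server/logs/audit/audit.log"):
    #     if labs != ["normal"]:
    #         print(no, labs, event)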

    Besides the attack labels, a general overview of the exact times when specific attack steps are launched is available in gather/attacker_0/logs/attacks.log. An enumeration of all hosts and their IP addresses is stated in processing/config/servers.yml. Moreover, configurations of each host are provided in gather/ and gather/.

    Version history:

    • AIT-LDS-v1.x: Four datasets, logs from single host, fine-granular audit logs, mail/CMS.
    • AIT-LDS-v2.0: Eight datasets, logs from all hosts, system logs and network traffic, mail/CMS/cloud/web.

    Acknowledgements: Partially funded by the FFG projects INDICAETING (868306) and DECEPT (873980), and the EU projects GUARD (833456) and PANDORA (SI2.835928).

    If you use the dataset, please cite the following publications:

    [1] M. Landauer, F. Skopik, M. Frank, W. Hotwagner,

  3. MaRV Scripts and Dataset

    • zenodo.org
    zip
    Updated Dec 15, 2024
    Cite
    Anonymous; Anonymous (2024). MaRV Scripts and Dataset [Dataset]. http://doi.org/10.5281/zenodo.14450098
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 15, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous; Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The MaRV dataset consists of 693 manually evaluated code pairs extracted from 126 GitHub Java repositories, covering four types of refactoring. The dataset also includes metadata describing the refactored elements. Each code pair was assessed by two reviewers selected from a pool of 40 participants. The MaRV dataset is continuously evolving and is supported by a web-based tool for evaluating refactoring representations. This dataset aims to enhance the accuracy and reliability of state-of-the-art models in refactoring tasks, such as refactoring candidate identification and code generation, by providing high-quality annotated data.

    Our dataset is located at the path dataset/MaRV.json

    The guidelines for replicating the study are provided below:

    Requirements

    1. Software Dependencies:

    • Python 3.10+ with packages in requirements.txt
    • Git: Required to clone repositories.
    • Java 17: RefactoringMiner requires Java 17 to perform the analysis.
    • PHP 8.0: Required to host the Web tool.
    • MySQL 8: Required to store the Web tool data.

    2. Environment Variables:

    • Create a .env file based on .env.example in the src folder and set the variables (an illustrative example follows this list):
      • CSV_PATH: Path to the CSV file containing the list of repositories to be processed.
      • CLONE_DIR: Directory where repositories will be cloned.
      • JAVA_PATH: Path to the Java executable.
      • REFACTORING_MINER_PATH: Path to RefactoringMiner.
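
    • For illustration only, a populated .env might look like the following (all paths here are hypothetical placeholders; src/.env.example remains the authoritative template):

      CSV_PATH=/home/user/marv/repositories.csv
      CLONE_DIR=/home/user/marv/clones
      JAVA_PATH=/usr/lib/jvm/java-17-openjdk/bin/java
      REFACTORING_MINER_PATH=/home/user/tools/RefactoringMiner/bin/RefactoringMiner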

    Refactoring Technique Selection

    1. Environment Setup:

    • Ensure all dependencies are installed. Install the required Python packages with:
      pip install -r requirements.txt
      

    2. Configuring the Repositories CSV:

    • The CSV file specified in CSV_PATH should contain a column named name with GitHub repository names (format: username/repo).
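
    • For instance, a minimal repositories CSV could look like this (the repository names below are placeholders, not part of the study's repository list):

      name
      someuser/some-java-project
      anotheruser/another-java-project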

    3. Executing the Script:

    • Configure the environment variables in the .env file and set up the repositories CSV, then run:
      python3 src/run_rm.py
      
    • The RefactoringMiner output from the 126 repositories of our study is available at:
      https://zenodo.org/records/14395034

    4. Script Behavior:

    • The script clones each repository listed in the CSV file into the directory specified by CLONE_DIR, retrieves the default branch, and runs RefactoringMiner to analyze it.
    • Results and Logs:
      • Analysis results from RefactoringMiner are saved as .json files in CLONE_DIR.
      • Logs for each repository, including error messages, are saved as .log files in the same directory.

    5. Count Refactorings:

    • To count instances for each refactoring technique, run:
      python3 src/count_refactorings.py
      
    • The output CSV file, named refactoring_count_by_type_and_file, shows the number of refactorings for each technique, grouped by repository.

    Data Gathering

    • To collect snippets before and after refactoring and their metadata, run:

      python3 src/diff.py '[refactoring technique]'
      

      Replace [refactoring technique] with the desired technique name (e.g., Extract Method).

    • The script creates a directory for each repository and subdirectories named with the commit SHA. Each commit may have one or more refactorings.

    • Dataset Availability:

      • The snippets and metadata from the 126 repositories of our study are available in the dataset directory.
    • To generate the SQL file for the Web tool, run:

      python3 src/generate_refactorings_sql.py
      

    Web Tool for Manual Evaluation

    • The Web tool scripts are available in the web directory.
    • Populate the data/output/snippets folder with the output of src/diff.py.
    • Run the sql/create_database.sql script in your database.
    • Import the SQL file generated by src/generate_refactorings_sql.py.
    • Run dataset.php to generate the MaRV dataset file.
    • The MaRV dataset, generated by the Web tool, is available in the dataset directory of the replication package.
  4. The toxicity data of compounds

    • figshare.com
    zip
    Updated Mar 21, 2025
    + more versions
    Cite
    Jiang Lu (2025). The toxicity data of compounds [Dataset]. http://doi.org/10.6084/m9.figshare.27195339.v5
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 21, 2025
    Dataset provided by
    figshare
    Authors
    Jiang Lu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    toxric_30_datasets.zip: The expanded predictive toxicology dataset is sourced from TOXRIC, a comprehensive and standardized toxicology database. It contains 30 assay datasets with ~150,000 measurements related to five categories of toxicity assessment: genetic toxicity, organic toxicity, clinical toxicity, developmental and reproductive toxicity, and reactive toxicity.

    multiple_endpoint_acute_toxicity_dataset.zip & all_descriptors.txt: This 59-endpoint acute toxicity dataset is sourced from TOXRIC. It includes 59 toxicity endpoints with 80,081 unique compounds represented as SMILES strings and 122,594 usable toxicity measurements described by continuous values with a unified toxicity unit of -log(mol/kg); the larger the measurement value, the stronger the toxicity of the corresponding compound towards a given endpoint. The 59 endpoints involve 15 species (mouse, rat, rabbit, guinea pig, dog, cat, bird wild, quail, duck, chicken, frog, mammal, man, women, and human), 8 administration routes (intraperitoneal, intravenous, oral, skin, subcutaneous, intramuscular, parenteral, and unreported), and 3 measurement indicators (LD50 (lethal dose 50%), LDLo (lethal dose low), and TDLo (toxic dose low)). Each compound has measurement values for only a small number of endpoints, so the dataset is very sparse, with nearly 97.4% of compound-to-endpoint measurements missing. It is also extremely unbalanced: some endpoints have tens of thousands of measurements (e.g., mouse-intraperitoneal-LD50 has 36,295, mouse-oral-LD50 has 23,373, and rat-oral-LD50 has 10,190), while others contain only around 100 (e.g., mouse-intravenous-LDLo, rat-intravenous-LDLo, frog-subcutaneous-LD50, and human-oral-TDLo). This sparsity and imbalance make acute toxicity evaluation a challenging problem. Among the 59 endpoints, 21 with fewer than 200 measurements were considered small-sized and 11 with more than 1,000 measurements were treated as large-sized. Three endpoints targeting humans (human-oral-TDLo, women-oral-TDLo, and man-oral-TDLo) are typical small-sized endpoints, with only 140, 156, and 163 available measurements, respectively. The acute toxicity measurement values of the 80,081 compounds for the 59 endpoints, as well as the 5-fold random splits, are provided in multiple_endpoint_acute_toxicity_dataset.zip; the molecular fingerprints or feature descriptors of the 80,081 compounds, such as Avalon, Morgan, and AtomPair, are given in all_descriptors.txt.

    115-endpoint_acute_toxiciy_dataset.zip: We collected additional acute toxicity data of compounds from the PubChem database through web crawling, unified all toxicity measurement units into -log(mol/kg), and retained endpoints with no fewer than 30 available samples each, establishing a brand-new acute toxicity dataset containing 115 endpoints. Compared with the previous 59-endpoint dataset from TOXRIC, the number of endpoints has doubled, adding more species (e.g., goat, monkey, hamster), administration routes (e.g., intracerebral, intratracheal), and measurement indicators (e.g., LD10, LD20). The sample imbalance among endpoints and the missing-data rate are even more severe: the sparsity rate reaches 98.7%, and the dataset contains 68 small-sample endpoints (i.e., endpoints with fewer than 200 toxicity measurements), the smallest of which has only 30 available measurements. This dataset is therefore more challenging for all current acute toxicity prediction models.
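
    As a side note on the -log(mol/kg) convention (an illustrative sketch of the unit itself, not part of the dataset files): a dose reported in mg/kg can be converted using the compound's molecular weight, since mol/kg = (mg/kg) / (1000 × MW) with MW in g/mol:

    import math

    def to_neg_log_mol_per_kg(dose_mg_per_kg, mol_weight_g_per_mol):
        """Convert a dose in mg/kg to -log10(mol/kg); larger values mean stronger toxicity."""
        dose_mol_per_kg = dose_mg_per_kg / (1000.0 * mol_weight_g_per_mol)
        return -math.log10(dose_mol_per_kg)

    # Hypothetical example: an LD50 of 300 mg/kg for a compound of molecular weight 180 g/mol
    print(to_neg_log_mol_per_kg(300, 180))  # ~2.78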

  5. Controlled Anomalies Time Series (CATS) Dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv
    Updated Jul 11, 2024
    + more versions
    Cite
    Patrick Fleith; Patrick Fleith (2024). Controlled Anomalies Time Series (CATS) Dataset [Dataset]. http://doi.org/10.5281/zenodo.8338435
    Explore at:
    Available download formats: csv, bin
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Solenix Engineering GmbH
    Authors
    Patrick Fleith; Patrick Fleith
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Controlled Anomalies Time Series (CATS) Dataset consists of commands, external stimuli, and telemetry readings of a simulated complex dynamical system with 200 injected anomalies.

    The CATS Dataset exhibits a set of desirable properties that make it very suitable for benchmarking Anomaly Detection Algorithms in Multivariate Time Series [1]:

    • Multivariate (17 variables) including sensor readings and control signals. It simulates the operational behaviour of an arbitrary complex system, including:
      • 4 Deliberate Actuations / Control Commands sent by a simulated operator / controller, for instance, commands of an operator to turn ON/OFF some equipment.
      • 3 Environmental Stimuli / External Forces acting on the system and affecting its behaviour, for instance, the wind affecting the orientation of a large ground antenna.
      • 10 Telemetry Readings representing the observable states of the complex system by means of sensors, for instance, a position, a temperature, a pressure, a voltage, current, humidity, velocity, acceleration, etc.
    • 5 million timestamps. Sensor readings are at a 1 Hz sampling frequency.
      • 1 million nominal observations (the first 1 million datapoints). This is suitable to start learning the "normal" behaviour (see the loading sketch after this list).
      • 4 million observations that include both nominal and anomalous segments. This is suitable to evaluate both semi-supervised approaches (novelty detection) as well as unsupervised approaches (outlier detection).
    • 200 anomalous segments. One anomalous segment may contain several successive anomalous observations / timestamps. Only the last 4 million observations contain anomalous segments.
    • Different types of anomalies to understand what anomaly types can be detected by different approaches. The categories are available in the dataset and in the metadata.
    • Fine control over ground truth. As this is a simulated system with deliberate anomaly injection, the start and end time of the anomalous behaviour is known very precisely. In contrast to real world datasets, there is no risk that the ground truth contains mislabelled segments which is often the case for real data.
    • Suitable for root cause analysis. In addition to the anomaly category, the time series channel in which the anomaly first developed is recorded and made available as part of the metadata. This can be useful to evaluate the performance of algorithms in tracing anomalies back to the right root cause channel.
    • Affected channels. In addition to the knowledge of the root cause channel in which the anomaly first developed itself, we provide information of channels possibly affected by the anomaly. This can also be useful to evaluate the explainability of anomaly detection systems which may point out to the anomalous channels (root cause and affected).
    • Obvious anomalies. The simulated anomalies have been designed to be "easy" for human eyes to detect (i.e., there are very large spikes or oscillations), and hence detectable by most algorithms. This makes the synthetic dataset useful for screening tasks (i.e., to eliminate algorithms that are not capable of detecting those obvious anomalies). However, during our initial experiments, the dataset turned out to be challenging enough even for state-of-the-art anomaly detection approaches, making it suitable also for regular benchmark studies.
    • Context provided. Some variables can only be considered anomalous in relation to other behaviours. A typical example consists of a light and switch pair. The light being either on or off is nominal, the same goes for the switch, but having the switch on and the light off shall be considered anomalous. In the CATS dataset, users can choose (or not) to use the available context, and external stimuli, to test the usefulness of the context for detecting anomalies in this simulation.
    • Pure signal ideal for robustness-to-noise analysis. The simulated signals are provided without noise: while this may seem unrealistic at first, it is an advantage since users of the dataset can decide to add on top of the provided series any type of noise and choose an amplitude. This makes it well suited to test how sensitive and robust detection algorithms are against various levels of noise.
    • No missing data. You can drop whatever data you want to assess the impact of missing values on your detector with respect to a clean baseline.
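
    As an illustrative starting point (our own sketch: the file name data.csv and the use of a single flat table are assumptions; take the actual file and column names from the downloaded release), the nominal/evaluation split described above can be reproduced with pandas:

    import pandas as pd

    # Assumption: one row per timestamp with the 17 signal columns
    # (control commands, external stimuli, telemetry readings).
    df = pd.read_csv("data.csv")  # or pd.read_parquet(...) for the parquet file

    # The first 1,000,000 timestamps are purely nominal: learn "normal" behaviour here.
    train_nominal = df.iloc[:1_000_000]

    # The remaining 4,000,000 timestamps mix nominal and anomalous segments: evaluate here.
    test = df.iloc[1_000_000:]

    print(train_nominal.shape, test.shape)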

    Change Log

    Version 2

    • Metadata: we include a metadata.csv with information about:
      • Anomaly categories
      • Root cause channel (signal in which the anomaly is first visible)
      • Affected channel (signal into which the anomaly might propagate through coupled system dynamics)
    • Removal of anomaly overlaps: version 1 contained anomalies which overlapped with each other resulting in only 190 distinct anomalous segments. Now, there are no more anomaly overlaps.
    • Two data files: CSV and parquet for convenience.

    [1] Example Benchmark of Anomaly Detection in Time Series: “Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. Anomaly Detection in Time Series: A Comprehensive Evaluation. PVLDB, 15(9): 1779 - 1797, 2022. doi:10.14778/3538598.3538602”

    About Solenix

    Solenix is an international company providing software engineering, consulting services and software products for the space market. Solenix is a dynamic company that brings innovative technologies and concepts to the aerospace market, keeping up to date with technical advancements and actively promoting spin-in and spin-out technology activities. We combine modern solutions which complement conventional practices. We aspire to achieve maximum customer satisfaction by fostering collaboration, constructivism, and flexibility.

