License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
📝 Description: This dataset captures simulated student interactions in a digital learning environment. Each row represents a unique learning session, containing comprehensive information about student behavior, engagement, performance, and progression over time.
The dataset is designed to support research and development in personalized education, adaptive learning systems, student engagement analysis, and feedback optimization. It enables the study of learning patterns and offers insights into how students interact with digital content, how they perform in assessments, and how their learning behavior evolves across sessions.
🌟 Key Features:
- student_id: Unique identifier for each student
- session_id: Unique ID for each learning session
- timestamp: Date and time of the session
- module_id: Course/module accessed during the session
- time_spent_minutes: Time spent in the session (in minutes)
- pages_visited: Number of content pages visited
- video_watched_percent: Percentage of video watched during the session
- click_events: Number of interactions (clicks, navigations, etc.)
- notes_taken: Whether the student took notes (1 = yes, 0 = no)
- forum_posts: Number of forum posts/comments made
- revisit_flag: Indicates if content was revisited
- quiz_score: Score obtained in the session quiz (0–100)
- attempts_taken: Number of quiz attempts made
- assignment_score: Score in the session’s assignment (0–100)
- feedback_rating: Student’s feedback rating for the session (1–5)
- days_since_last_activity: Number of days since last session
- cumulative_quiz_score: Running total of all previous quiz scores
- learning_trend: Average performance across sessions
- attention_score: Derived indicator of engagement during the session
- feedback_type: Type of feedback given (e.g., revise topic, pace slow)
- next_module_prediction: Suggested next module for the student
- success_label: Indicator of learning success (1 = successful, 0 = not)
📊 Dataset Overview:
Total Records: 9,000+ learning sessions
Total Students: 300
Total Features: 22
Data Format: CSV
Time-Series Ready: Yes (sequential session data per student)
💡 Use Cases:
Analyze and visualize student learning patterns
Evaluate content engagement and session behavior
Develop personalized learning dashboards or analytics tools
Simulate adaptive feedback systems for digital education
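For orientation, a minimal sketch of exploring such a file with pandas follows; the column names come from the feature list above, while the file name student_sessions.csv is an assumption.

import pandas as pd

# Load the session-level data (file name is hypothetical).
df = pd.read_csv("student_sessions.csv", parse_dates=["timestamp"])

# Order each student's sessions chronologically to exploit the time-series structure.
df = df.sort_values(["student_id", "timestamp"])

# Running average quiz performance per student, one way to derive a learning trend.
df["running_avg_quiz"] = (
    df.groupby("student_id")["quiz_score"]
      .expanding().mean()
      .reset_index(level=0, drop=True)
)

# Compare engagement between successful and unsuccessful sessions.
print(df.groupby("success_label")[["time_spent_minutes", "video_watched_percent"]].mean())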
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset comprises diverse logs from various sources, including cloud services, routers, switches, virtualization, network security appliances, authentication systems, DNS, operating systems, packet captures, proxy servers, servers, syslog data, and network data. The logs encompass a wide range of information such as traffic details, user activities, authentication events, DNS queries, network flows, security actions, and system events. By analyzing these logs collectively, users can gain insights into network patterns, anomalies, user authentication, cloud service usage, DNS traffic, network flows, security incidents, and system activities. The dataset is invaluable for network monitoring, performance analysis, anomaly detection, security investigations, and correlating events across the entire network infrastructure.
License: MIT (https://opensource.org/licenses/MIT)
License information was derived automatically
This realistic incident management event log simulates a common IT service process and includes key inefficiencies found in real-world operations. You'll uncover SLA violations, multiple reassignments, bottlenecks, and conformance issues—making it an ideal dataset for hands-on process mining, root cause analysis, and performance optimization exercises.
You can find more event logs + use case handbooks to guide your analysis here: https://processminingdata.com/
Standard Process Flow: Ticket Created -> Ticket Assigned to Level 1 Support -> WIP - Level 1 Support -> Level 1 Escalates to Level 2 Support -> WIP - Level 2 Support -> Ticket Solved by Level 2 Support -> Customer Feedback Received -> Ticket Closed
Total Number of Incident Tickets: 31,000+
Process Variants: 13
Number of Events: 242,000+
Year: 2023
File Format: CSV
File Size: 65MB
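For a first pass in Python, each ticket can be collapsed into its sequence of activities and compared against the standard flow above. A minimal sketch, assuming hypothetical column names case_id, activity, and timestamp:

import pandas as pd

# Column and file names are assumptions; adjust to the actual CSV header.
log = pd.read_csv("incident_event_log.csv", parse_dates=["timestamp"])
log = log.sort_values(["case_id", "timestamp"])

# Collapse each ticket into its ordered sequence of activities (its variant).
variants = log.groupby("case_id")["activity"].agg(tuple)

expected = (
    "Ticket Created",
    "Ticket Assigned to Level 1 Support",
    "WIP - Level 1 Support",
    "Level 1 Escalates to Level 2 Support",
    "WIP - Level 2 Support",
    "Ticket Solved by Level 2 Support",
    "Customer Feedback Received",
    "Ticket Closed",
)

# Share of tickets following the standard flow, plus the most common variants.
print(f"Conforming traces: {variants.map(lambda v: v == expected).mean():.1%}")
print(variants.value_counts().head(13))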
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
Description:
Data Fields:
- Status: A numerical indicator of the event status (e.g., 0 for success, 1 for error).
- Event: A textual description of the action or event, including error text if an error occurred.
- Device Identification: Information about the mobile device, including model and Android version.
- App Version: The version of the mobile application experiencing the event.
- App Language: The language in which the application is running.
- Android Version: The version of the Android operating system on the device.
- Session Identifiers: Unique session or device identifiers associated with the event.
- Additional Data: Additional event details, such as the country and other characteristics.
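A hedged sketch of how these fields might be used, for example to compare error rates across app versions; the exact column names and the file name are assumptions.

import pandas as pd

# File and column names are hypothetical; adjust to the actual export.
events = pd.read_csv("mobile_app_events.csv")

# Status: 0 = success, 1 = error (per the field description above),
# so the mean per group is the error rate.
print(events.groupby("App Version")["Status"].mean().sort_values(ascending=False))

# Most common error messages.
print(events.loc[events["Status"] == 1, "Event"].value_counts().head(10))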
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
License information was derived automatically
AIT Log Data Sets
This repository contains synthetic log data suitable for evaluation of intrusion detection systems, federated learning, and alert aggregation. A detailed description of the dataset is available in [1]. The logs were collected from eight testbeds that were built at the Austrian Institute of Technology (AIT) following the approach by [2]. Please cite these papers if the data is used for academic publications.
In brief, each of the datasets corresponds to a testbed representing a small enterprise network including mail server, file share, WordPress server, VPN, firewall, etc. Normal user behavior is simulated to generate background noise over a time span of 4-6 days. At some point, a sequence of attack steps is launched against the network. Log data is collected from all hosts and includes Apache access and error logs, authentication logs, DNS logs, VPN logs, audit logs, Suricata logs, network traffic packet captures, Horde logs, Exim logs, syslog, and system monitoring logs. Separate ground truth files are used to label events that are related to the attacks. Compared to the AIT-LDSv1.1, a more complex network and more diverse user behavior are simulated, and logs are collected from all hosts in the network. If you are only interested in network traffic analysis, we also provide the AIT-NDS containing the labeled netflows of the testbed networks. We also provide the AIT-ADS, an alert data set derived by forensically applying open-source intrusion detection systems to the log data.
The datasets in this repository have the following structure:
The following table summarizes relevant properties of the datasets:
The following attacks are launched in the network:
Note that attack parameters and their execution orders vary in each dataset. Labeled log files are trimmed to the simulation time to ensure that their labels (which reference the related event by the line number in the file) are not misleading. Other log files, however, also contain log events generated before or after the simulation time and may therefore be affected by testbed setup or data collection. It is therefore recommended to only consider logs with timestamps within the simulation time for analysis.
In the following, the structure of labels is explained using the audit logs from the intranet server in the russellmitchell data set as an example. The first four labels in the labels/intranet_server/logs/audit/audit.log file are as follows:
{"line": 1860, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}
{"line": 1861, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}
{"line": 1862, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}
{"line": 1863, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}
Each JSON object in this file assigns a label to one specific log line in the corresponding log file located at gather/intranet_server/logs/audit/audit.log. The field "line" in each JSON object specifies the line number of the respective event in the original log file, while the field "labels" comprises the corresponding labels. For example, the lines in the sample above provide the information that lines 1860-1863 in the gather/intranet_server/logs/audit/audit.log file are labeled with "attacker_change_user" and "escalate", corresponding to the attack step where the attacker receives escalated privileges. Inspecting these lines shows that they indeed correspond to the user authenticating as root:
type=USER_AUTH msg=audit(1642999060.603:2226): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:authentication acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'
type=USER_ACCT msg=audit(1642999060.603:2227): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:accounting acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'
type=CRED_ACQ msg=audit(1642999060.615:2228): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:setcred acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'
type=USER_START msg=audit(1642999060.627:2229): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:session_open acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'
The same applies to all other labels for this log file and all other log files. There are no labels for logs generated by "normal" (i.e., non-attack) behavior; instead, all log events that have no corresponding JSON object in one of the files from the labels directory, such as lines 1-1859 in the example above, can be considered labeled as "normal". To determine the labels for the log data, it is therefore necessary to track line numbers when processing the original logs from the gather directory and check whether those line numbers appear in the corresponding file in the labels directory.
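The lookup described above is straightforward to script. A minimal sketch using the paths from the russellmitchell example; everything beyond the documented file layout is an assumption.

import json

log_path = "gather/intranet_server/logs/audit/audit.log"
label_path = "labels/intranet_server/logs/audit/audit.log"

# Map line numbers to label lists; lines absent from this map are "normal".
labels_by_line = {}
with open(label_path) as f:
    for raw in f:
        obj = json.loads(raw)
        labels_by_line[obj["line"]] = obj["labels"]

# Walk the original log and attach labels by 1-indexed line number.
with open(log_path) as f:
    for lineno, line in enumerate(f, start=1):
        labels = labels_by_line.get(lineno, ["normal"])
        # ... feed (line, labels) into the evaluation pipeline ...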
Besides the attack labels, a general overview of the exact times when specific attack steps are launched is available in gather/attacker_0/logs/attacks.log. An enumeration of all hosts and their IP addresses is stated in processing/config/servers.yml. Moreover, configurations of each host are provided in the gather/ directory.
Version history:
Acknowledgements: Partially funded by the FFG projects INDICAETING (868306) and DECEPT (873980), and the EU projects GUARD (833456) and PANDORA (SI2.835928).
If you use the dataset, please cite the following publications:
[1] M. Landauer, F. Skopik, M. Frank, W. Hotwagner, M. Wurzenberger, and A. Rauber, "Maintainable Log Datasets for Evaluation of Intrusion Detection Systems," IEEE Transactions on Dependable and Secure Computing, vol. 20, no. 4, pp. 3466-3482, 2023.
[2] M. Landauer, F. Skopik, M. Wurzenberger, W. Hotwagner, and A. Rauber, "Have it Your Way: Generating Customized Log Datasets With a Model-Driven Simulation Testbed," IEEE Transactions on Reliability, vol. 70, no. 1, pp. 402-415, March 2021.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
The dataset developed in the paper titled 'Adapting Large Language Models to Log Analysis with Interpretable Domain Knowledge' is designed to transform information found in logs into interpretable knowledge for subsequent use in large language model training.
A CSV file prepared from application logs of an SAP BI Warehouse system. This realistic dataset was generated to showcase the SAP pilot use case of the TOREADOR project. Each line corresponds to a user action. Extid, object, and subobject were extracted from the BI system logs, along with the user name and event date. Role was retrieved from the standard user actions. Label indicates whether the event is benign or malign: Elevation_of_privileges is an event the user should not be able to perform within the boundaries of their role; Priv_abuse is a privileged account performing an action that breaches a confidentiality clause (e.g., an administrator reading sensitive data); Forgotten_user is an account that stayed inactive for a long time before being used again (e.g., an employee left the company but the account was not terminated). Records where no malign activity was detected were marked 'benign'. Outbushours was computed from the time of the action and mapped to 'inside' or 'outside' business hours.
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
License information was derived automatically
AIT Log Data Sets
This repository contains synthetic log data suitable for evaluation of intrusion detection systems. The logs were collected from four independent testbeds that were built at the Austrian Institute of Technology (AIT) following the approach by Landauer et al. (2020) [1]. Please refer to the paper for more detailed information on automatic testbed generation and cite it if the data is used for academic publications. In brief, each testbed simulates user accesses to a webserver that runs Horde Webmail and OkayCMS. The duration of the simulation is six days. On the fifth day (2020-03-04) two attacks are launched against each web server.
The archive AIT-LDS-v1_0.zip contains the directories "data" and "labels".
The data directory is structured as follows: each directory mail.<domain> contains the logs collected from one web server.
Setup details of the web servers:
Setup details of user machines:
User host machines are assigned to web servers in the following way:
The following attacks are launched against the web servers (different starting times for each web server, please check the labels for exact attack times):
Attacks are launched from the following user host machines. In each of the corresponding directories user-
The log data collected from the web servers includes
Note that due to their large size, the audit/audit.log files of each server were compressed into a .zip archive. If these logs are needed for analysis, they must first be unzipped.
Labels are organized in the same directory structure as logs. Each file contains two labels for each log line separated by a comma, the first one based on the occurrence time, the second one based on similarity and ordering. Note that this does not guarantee correct labeling for all lines and that no manual corrections were conducted.
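A minimal sketch of reading a log file together with its label file line by line; the paths are hypothetical, and the two comma-separated labels per line follow the description above.

# Paths are hypothetical; label files mirror the directory structure of the logs.
with open("data/mail.example/apache2/access.log") as logs, \
     open("labels/mail.example/apache2/access.log") as labels:
    for log_line, label_line in zip(logs, labels):
        time_label, similarity_label = label_line.strip().split(",")
        # ... e.g., treat a line as malicious if either label flags it ...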
Version history and related data sets:
Acknowledgements: Partially funded by the FFG projects INDICAETING (868306) and DECEPT (873980), and the EU project GUARD (833456).
If you use the dataset, please cite the following publication:
[1] M. Landauer, F. Skopik, M. Wurzenberger, W. Hotwagner and A. Rauber, "Have it Your Way: Generating Customized Log Datasets With a Model-Driven Simulation Testbed," in IEEE Transactions on Reliability, vol. 70, no. 1, pp. 402-415, March 2021, doi: 10.1109/TR.2020.3031317. [PDF]
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset is designed to support research in anomaly detection for OS kernels, particularly in the context of power monitoring systems used in embedded environments. It simulates the interaction between system-level operations and power consumption behaviors, providing a rich set of features for training and evaluating hybrid models.
The dataset contains 1,000 records of simulated yet realistic system behavior, including:
System call sequences
Power usage logs (in watts)
CPU and memory utilization
Process identifiers and names
Timestamps
Labeled entries (Normal or Anomaly)
Anomalies are injected using fuzzy testing principles to simulate abnormal power spikes, syscall irregularities, or excessive resource usage, mimicking real-world kernel faults or malicious activity. This dataset enables the development of robust models that can learn complex, uncertain system behavior patterns for enhanced security and stability of embedded power monitoring applications.
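As an illustration of how the numeric features might feed a detector, a hedged sketch using scikit-learn's IsolationForest; the file and column names are assumptions.

import pandas as pd
from sklearn.ensemble import IsolationForest

# File and column names are hypothetical; adjust to the actual CSV.
df = pd.read_csv("power_monitoring_logs.csv")
features = df[["power_watts", "cpu_percent", "memory_percent"]]

# Unsupervised detector; fit_predict returns -1 for flagged anomalies.
pred = IsolationForest(contamination=0.1, random_state=0).fit_predict(features)
df["predicted_anomaly"] = pred == -1

# Rough sanity check against the provided Normal/Anomaly labels.
print(pd.crosstab(df["label"], df["predicted_anomaly"]))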
License: GNU GPL v3 (https://www.gnu.org/licenses/gpl-3.0.html)
This failure dataset contains the injected faults, the workload, the effects of failure (both the user-side impact and our own in-depth correctness checks), and the error logs produced by the OpenStack cloud management system. Please refer to the paper "How Bad Can a Bug Get? Empirical Analysis of Software Failures in the OpenStack Cloud Computing Platform" (ESEC/FSE '19). Please cite the following paper if you use the dataset:

@inproceedings{cotroneo2019bad,
  title={How bad can a bug get? an empirical analysis of software failures in the OpenStack cloud computing platform},
  author={Cotroneo, Domenico and De Simone, Luigi and Liguori, Pietro and Natella, Roberto and Bidokhti, Nematollah},
  booktitle={Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering},
  pages={200--211},
  year={2019}
}

Visit the GitHub repo for any updates: https://github.com/dessertlab/Fault-Injection-Dataset
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Modern technologies such as the Internet of Things (IoT) are becoming increasingly important in various domains, including Business Process Management (BPM) research. One main research area in BPM is process mining, which can be used to analyze event logs, e.g., for checking the conformance of running processes. However, only a few IoT-based event logs are available for research purposes. Some of them are artificially generated, and a problem is that they do not always completely reflect the actual physical properties of smart environments. In this paper, we present an IoT-enriched XES event log that is generated by a physical smart factory. For this purpose, we create the DataStream/SensorStream XES extension for representing IoT data in event logs. Finally, we present some preliminary analysis and properties of the log.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Abstract:
In recent years there has been an increased interest in Artificial Intelligence for IT Operations (AIOps). This field utilizes monitoring data from IT systems, big data platforms, and machine learning to automate various operations and maintenance (O&M) tasks for distributed systems.
The major contributions have materialized in the form of novel algorithms.
Typically, researchers have taken on the challenge of exploring one specific type of observability data source, such as application logs, metrics, or distributed traces, to create new algorithms.
Nonetheless, due to the low signal-to-noise ratio of monitoring data, there is a consensus that only the analysis of multi-source monitoring data will enable the development of useful algorithms with better performance.
Unfortunately, existing datasets usually contain only a single source of data, often logs or metrics. This limits the possibilities for greater advances in AIOps research.
Thus, we generated high-quality multi-source data composed of distributed traces, application logs, and metrics from a complex distributed system. This paper provides detailed descriptions of the experiment, statistics of the data, and identifies how such data can be analyzed to support O&M tasks such as anomaly detection, root cause analysis, and remediation.
General Information:
This repository contains simple scripts for data statistics and a link to the multi-source distributed system dataset.
You may find details of this dataset in the original paper:
Sasho Nedelkoski, Jasmin Bogatinovski, Ajay Kumar Mandapati, Soeren Becker, Jorge Cardoso, Odej Kao, "Multi-Source Distributed System Data for AI-powered Analytics".
If you use the data, implementation, or any details of the paper, please cite!
BIBTEX:
@inproceedings{nedelkoski2020multi,
title={Multi-source Distributed System Data for AI-Powered Analytics},
author={Nedelkoski, Sasho and Bogatinovski, Jasmin and Mandapati, Ajay Kumar and Becker, Soeren and Cardoso, Jorge and Kao, Odej},
booktitle={European Conference on Service-Oriented and Cloud Computing},
pages={161--176},
year={2020},
organization={Springer}
}
The multi-source/multimodal dataset is composed of distributed traces, application logs, and metrics produced by running a complex distributed system (OpenStack). In addition, we also provide the workload and fault scripts together with the Rally report, which can serve as ground truth. We provide two datasets, which differ in how the workload is executed. The sequential_data is generated by executing a workload of sequential user requests. The concurrent_data is generated by executing a workload of concurrent user requests.
The raw logs in both datasets contain the same files. Users who want the logs filtered by time with respect to the two datasets should refer to the timestamps in the metrics (they provide the time window). In addition, we suggest using the provided aggregated, time-ranged logs for both datasets in CSV format.
Important: The logs and the metrics are synchronized in time, and both are recorded in CEST (Central European Summer Time). The traces are in UTC (Coordinated Universal Time, two hours behind CEST). They should be synchronized if the user develops multimodal methods. Please read the IMPORTANT_experiment_start_end.txt file before working with the data.
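A minimal sketch of the alignment this implies, converting trace timestamps from UTC to the logs' CEST timeline before joining; file and column names are assumptions, and the Europe/Berlin zone is used as a stand-in for CEST.

import pandas as pd

# Hypothetical file and column names; adjust to the dataset layout.
traces = pd.read_csv("traces.csv", parse_dates=["timestamp"])
logs = pd.read_csv("aggregated_logs.csv", parse_dates=["timestamp"])

# Traces are recorded in UTC; logs and metrics are in CEST.
traces["timestamp"] = traces["timestamp"].dt.tz_localize("UTC").dt.tz_convert("Europe/Berlin")
logs["timestamp"] = logs["timestamp"].dt.tz_localize("Europe/Berlin")

# With a shared timeline, the sources can be merged, e.g. to the nearest second.
merged = pd.merge_asof(
    traces.sort_values("timestamp"),
    logs.sort_values("timestamp"),
    on="timestamp",
    tolerance=pd.Timedelta("1s"),
)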
Our GitHub repository with the code for the workloads and scripts for basic analysis can be found at: https://github.com/SashoNedelkoski/multi-source-observability-dataset/
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
- Financial expenses1 dataset: This dataset consists of simulated event logs generated from the financial expense data analysis process model. Each trace provides a detailed description of the process of analyzing office expense data.
- Financial expenses2 dataset: This dataset consists of simulated event logs generated from the travel expense data analysis process model. Each trace provides a detailed description of the process of analyzing travel expense data.
- Financial expenses3 dataset: This dataset consists of simulated event logs generated from the sales expense data analysis process model. Each trace provides a detailed description of the process of analyzing sales expense data.
- Financial expenses4 dataset: This dataset consists of simulated event logs generated from the management expense data analysis process model. Each trace provides a detailed description of the process of analyzing management expense data.
- Financial expenses5 dataset: This dataset consists of simulated event logs generated from the manufacturing expense data analysis process model. Each trace provides a detailed description of the process of analyzing manufacturing expense data.
- Financial expenses6 dataset: This dataset consists of simulated event logs generated from the financial statement data analysis process model. Each trace provides a detailed description of the process of analyzing financial statement data.
License: Database Contents License (DbCL) v1.0 (http://opendatacommons.org/licenses/dbcl/1.0/)
This dataset consolidates data from multiple sources to provide a comprehensive view of security anomalies, insider threats, system updates, and user management. It includes information such as user behavior patterns, anomaly detection metrics, system update details, and user contact information. Designed for multi-dimensional analysis, the dataset is ideal for tasks like anomaly detection, insider threat assessment, system update tracking, and user data management in cybersecurity applications. Each record is enriched with timestamps and other relevant attributes to enable dynamic analysis and decision-making.
License: terms of use at https://doi.org/10.4121/resource:terms_of_use
The comma-separated-value dataset contains process data from a production process, including data on cases, activities, resources, timestamps, and additional data fields.
This dataset captures logs from a distributed system, providing a comprehensive view of system behavior and performance. The logs encompass a range of activities, including system events, errors, and performance metrics, offering valuable insights for understanding and optimizing distributed system architectures.
Content:
1. File Format: CSV
2. Column Description:
- Timestamp: Records the date and time of each logged event in the format [2023-11-20T08:40:50.664842], providing a chronological sequence for system activities.
- LogLevel: Indicates the severity or importance of the logged event, classifying entries into levels such as INFO, WARNING, ERROR, or FATAL.
- Service: Specifies the name or identifier of the service associated with each log entry, facilitating categorization and analysis of events by the distributed system's modular components.
- Message: Contains descriptive information or details related to the logged event, offering insight into the nature and context of the activity.
- RequestID: Uniquely identifies each request, enabling traceability and correlation of log entries associated with specific transactions or operations.
- User: Represents the user associated with the logged event, providing information about the entity interacting with the system and aiding user-centric analysis.
- ClientIP: Uniquely identifies the client or application associated with the logged event, facilitating tracking and analysis of activities performed by different clients.
- TimeTaken: Records the duration, in milliseconds or another specified unit, taken to complete the corresponding operation or transaction.
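A brief sketch of loading these columns and surfacing failing services and slow operations; the file name is an assumption.

import pandas as pd

# File name is hypothetical; column names follow the description above.
df = pd.read_csv("distributed_system_logs.csv", parse_dates=["Timestamp"])

# Error analysis: most frequent ERROR/FATAL messages per service.
errors = df[df["LogLevel"].isin(["ERROR", "FATAL"])]
print(errors.groupby("Service")["Message"].value_counts().head(10))

# Performance: slowest services by mean TimeTaken.
print(df.groupby("Service")["TimeTaken"].mean().sort_values(ascending=False))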
Key Features:
- Error Analysis: Logs capture error messages and exceptions, facilitating the identification and resolution of issues within the distributed system.
- Performance Metrics: Explore performance-related metrics to assess system health, response times, and resource utilization.
- Temporal Patterns: Analyze temporal patterns and trends to understand system behavior over time.
Potential Use Cases:
1. Anomaly Detection: Leverage the dataset for anomaly detection algorithms to identify unusual patterns or behaviors.
2. Performance Optimization: Use performance metrics to optimize resource allocation and improve overall system efficiency.
3. Predictive Maintenance: Anticipate potential issues by analyzing historical logs, enabling proactive system maintenance.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset was acquired during Cyber Czech – a hands-on cyber defense exercise (Red Team/Blue Team) held in March 2019 at Masaryk University, Brno, Czech Republic. Network traffic flows and a high variety of event logs were captured in an exercise network deployed in the KYPO Cyber Range Platform.
Contents
The dataset covers two distinct time intervals, which correspond to the official schedule of the exercise. The timestamps provided below are in the ISO 8601 date format.
Day 1, March 19, 2019
Start: 2019-03-19T11:00:00.000000+01:00
End: 2019-03-19T18:00:00.000000+01:00
Day 2, March 20, 2019
Start: 2019-03-20T08:00:00.000000+01:00
End: 2019-03-20T15:30:00.000000+01:00
The captured and collected data were normalized into three distinct event types and they are stored as structured JSON. The data are sorted by a timestamp, which represents the time they were observed. Each event type includes a raw payload ready for further processing and analysis. The description of the respective event types and the corresponding data files follows.
cz.muni.csirt.IpfixEntry.tgz – an archive of IPFIX traffic flows enriched with an additional payload of parsed application protocols in raw JSON.
cz.muni.csirt.SyslogEntry.tgz – an archive of Linux Syslog entries with the payload of corresponding text-based log messages.
cz.muni.csirt.WinlogEntry.tgz – an archive of Windows Event Log entries with the payload of original events in raw XML.
Each archive listed above includes a directory of the same name with the following four files, ready to be processed.
data.json.gz – the actual data entries in a single gzipped JSON file.
dictionary.yml – data dictionary for the entries.
schema.ddl – data schema for Apache Spark analytics engine.
schema.jsch – JSON schema for the entries.
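A hedged sketch of reading one of these archives after extraction, e.g. the Syslog entries, assuming one JSON object per line in data.json.gz (adjust if the file holds a single JSON array).

import gzip
import json

# Path follows the layout described above; the per-line format is an assumption.
path = "cz.muni.csirt.SyslogEntry/data.json.gz"

with gzip.open(path, "rt", encoding="utf-8") as f:
    for line in f:
        event = json.loads(line)
        # ... inspect the normalized fields and the raw payload ...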
Finally, the exercise network topology is described in a machine-readable NetJSON format as part of the auxiliary files archive (auxiliary-material.tgz), which includes the following.
global-gateway-config.json – the network configuration of the global gateway in the NetJSON format.
global-gateway-routing.json – the routing configuration of the global gateway in the NetJSON format.
redteam-attack-schedule.{csv,odt} – the schedule of the Red Team attacks in CSV and ODT format. Source for Table 2.
redteam-reserved-ip-ranges.{csv,odt} – the list of IP segments reserved for the Red Team in CSV and ODT format. Source for Table 1.
topology.{json,pdf,png} – the topology of the complete Cyber Czech exercise network in the NetJSON, PDF and PNG format.
topology-small.{pdf,png} – simplified topology in the PDF and PNG format. Source for Figure 1.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This repository contains the CLUE-LDS (CLoud-based User Entity behavior analytics Log Data Set). The data set contains log events from real users utilizing a cloud storage, suitable for User Entity Behavior Analytics (UEBA). Events include logins, file accesses, link shares, config changes, etc. The data set contains around 50 million events generated by more than 5,000 distinct users over more than five years (2017-07-07 to 2022-09-29, or 1,910 days). The data set is complete except for 109 events missing on 2021-04-22, 2021-08-20, and 2021-09-05 due to database failure. The unpacked file size is around 14.5 GB. A detailed analysis of the data set is provided in [1].
The logs are provided in JSON format with the following attributes in the first level:
In the following data sample, the first object depicts a successful user login (see type: login_successful) and the second object depicts a file access (see type: file_accessed) from a remote location:
{"params": {"user": "intact-gray-marlin-trademarkagent"}, "type": "login_successful", "time": "2019-11-14T11:26:43Z", "uid": "intact-gray-marlin-trademarkagent", "id": 21567530, "uidType": "name"}
{"isLocalIP": false, "params": {"path": "/proud-copper-orangutan-artexer/doubtful-plum-ptarmigan-merchant/insufficient-amaranth-earthworm-qualitycontroller/curious-silver-galliform-tradingstandards/incredible-indigo-octopus-printfinisher/wicked-bronze-sloth-claimsmanager/frantic-aquamarine-horse-cleric"}, "type": "file_accessed", "time": "2019-11-14T11:26:51Z", "uid": "graceful-olive-spoonbill-careersofficer", "id": 21567531, "location": {"countryCode": "AT", "countryName": "Austria", "region": "4", "city": "Gmunden", "latitude": 47.915, "longitude": 13.7959, "timezone": "Europe/Vienna", "postalCode": "4810", "metroCode": null, "regionName": "Upper Austria", "isInEuropeanUnion": true, "continent": "Europe", "accuracyRadius": 50}, "uidType": "ipaddress"}
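A small sketch of streaming the events and counting successful logins per user, assuming one JSON object per line as in the sample above; the file name is an assumption.

import json
from collections import Counter

logins = Counter()
with open("clue_lds.json") as f:
    for line in f:
        event = json.loads(line)
        if event["type"] == "login_successful":
            logins[event["uid"]] += 1

print(logins.most_common(10))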
The data set was generated at the premises of Huemer Group, a midsize IT service provider located in Vienna, Austria. Huemer Group offers a range of Infrastructure-as-a-Service solutions for enterprises, including cloud computing and storage. In particular, their cloud storage solution called hBOX enables customers to upload their data, synchronize them with multiple devices, share files with others, create versions and backups of their documents, collaborate with team members in shared data spaces, and query the stored documents using search terms. The hBOX extends the open-source project Nextcloud with interfaces and functionalities tailored to the requirements of customers.
The data set comprises only normal user behavior, but it can be used to evaluate anomaly detection approaches by simulating account hijacking. We provide an implementation for identifying similar users, switching pairs of users to simulate changes of behavior patterns, and a sample detection approach in our GitHub repo.
Acknowledgements: Partially funded by the FFG project DECEPT (873980). The authors thank Walter Huemer, Oskar Kruschitz, Kevin Truckenthanner, and Christian Aigner from Huemer Group for supporting the collection of the data set.
If you use the dataset, please cite the following publication:
[1] M. Landauer, F. Skopik, G. Höld, and M. Wurzenberger. "A User and Entity Behavior Analytics Log Data Set for Anomaly Detection in Cloud Computing". 2022 IEEE International Conference on Big Data - 6th International Workshop on Big Data Analytics for Cyber Intelligence and Defense (BDA4CID 2022), December 17-20, 2022, Osaka, Japan. IEEE. [PDF]
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
This dataset contains key characteristics of the data described in the Data Descriptor "Multivariate time series dataset for space weather data analytics". Contents:
1. human readable metadata summary table in CSV format
2. machine readable metadata file in JSON format
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset contains the results of 311 tests conducted to evaluate a server monitoring system that integrates artificial intelligence for incident detection and analysis. The system uses tools such as Grafana and Prometheus for metric collection, Grafana Loki for log management, and the OpenAI API for log analysis. The dataset includes metrics on CPU usage, memory, storage, and service logs, as well as response times for alerts sent via Telegram and the GPT model's analysis.