License: Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
CPU utilization time series dataset for anomaly detection
License: MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This dataset contains time-series system performance metrics collected from an AWS EC2 instance over the course of one full day. The primary focus is to support Trend Detection, Seasonality Analysis, and Pattern Recognition tasks under the AIOps (Artificial Intelligence for IT Operations) domain.
📥 Dataset Contents:
Timestamp: Time of log entry (every few seconds)
CPU Usage (%): Real-time CPU utilization of the EC2 instance
Memory Usage (%): Real-time memory consumption
Disk Usage (%): Real-time disk space utilization
The data was collected using custom Python scripts that automatically introduced usage spikes via background processes (using stress and dd commands) to simulate real-world high-load scenarios.
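A minimal sketch of how the injected load spikes might be located in the CPU column. The readings below are synthetic, and the window size and 30-point threshold are illustrative assumptions, not values taken from the dataset.

```python
from collections import deque
from statistics import mean

def rolling_mean(values, window):
    """Trailing moving average; early points average whatever is available."""
    buf = deque(maxlen=window)
    out = []
    for v in values:
        buf.append(v)
        out.append(mean(buf))
    return out

def find_spikes(values, window=5, threshold=30.0):
    """Indices where a reading exceeds the mean of the preceding
    `window` readings by more than `threshold` percentage points."""
    spikes = []
    for i in range(1, len(values)):
        previous = values[max(0, i - window):i]
        if values[i] - mean(previous) > threshold:
            spikes.append(i)
    return spikes

# Synthetic "CPU Usage (%)" readings with a stress-like burst (assumed data).
cpu = [12.0, 14.0, 13.0, 15.0, 12.0, 95.0, 97.0, 14.0, 13.0, 12.0]
print(find_spikes(cpu))  # [5, 6]
```

The trailing mean smooths the series for trend inspection, while the spike check compares each reading only against earlier ones, so it could also run on a live feed.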
License: https://www.apache.org/licenses/LICENSE-2.0.html
The largest real-world dataset for multivariate time series anomaly detection (MTSAD), drawn from the AIOps system of a Real-Time Data Warehouse (RTDW) at a top cloud computing company. All the metrics and labels in our dataset are derived from real-world scenarios. All metrics were obtained from the RTDW instance monitoring system and cover a rich variety of metric types, including CPU usage, queries per second (QPS), and latency, which relate to many important modules within the RTDW. We obtain labels from the ticket system, which integrates three main sources of instance anomalies: user service requests, instance unavailability, and fault simulations. User service requests refer to tickets that are submitted directly by users, whereas instance unavailability is typically detected through existing monitoring tools or discovered by Site Reliability Engineers (SREs). Since the system is usually very stable, we augment the anomaly samples by conducting fault simulations. A fault simulation is a special type of anomaly, planned beforehand, that is introduced into the system to test its performance under extreme conditions. All records in the ticket system are subject to follow-up processing by engineers, who meticulously mark the start and end times of each ticket. This rigorous approach ensures the accuracy of the labels in our dataset.
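As an illustration of how ticket start and end times can become point-wise anomaly labels, here is a small sketch. The function name, the timestamps, and the single ticket are hypothetical; the dataset's own file layout is not described above.

```python
from datetime import datetime, timedelta

def label_points(timestamps, tickets):
    """Return 1 for each timestamp inside any [start, end] ticket window, else 0."""
    return [
        int(any(start <= ts <= end for start, end in tickets))
        for ts in timestamps
    ]

# Hypothetical one-minute metric timestamps and one ticket window.
t0 = datetime(2024, 1, 1, 0, 0)
timestamps = [t0 + timedelta(minutes=m) for m in range(6)]
tickets = [(t0 + timedelta(minutes=2), t0 + timedelta(minutes=3))]

labels = label_points(timestamps, tickets)
print(labels)  # [0, 0, 1, 1, 0, 0]
```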
License: Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
TimeTrack is a publicly available dataset collected from an OpenAirInterface (OAI) cluster running CI/CD workloads. It includes metrics such as CPU, memory, disk usage, and latency, recorded at 45-second intervals from seven computing nodes over 30 days. The cluster was running OpenShift. If you use this dataset, please cite the paper: "TimeTrack: A Dataset for Exploring Temporal Patterns and Predictive Insights into OpenAirInterface (OAI) CI/CD Cluster."
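For a rough sense of scale, the stated sampling scheme implies the following upper bound on the number of samples. This assumes gap-free collection on every node, which the description does not guarantee.

```python
INTERVAL_S = 45   # sampling interval stated in the description
DAYS = 30         # collection period
NODES = 7         # computing nodes

# Upper bound, assuming no missing samples (an assumption, not a documented fact).
samples_per_node = DAYS * 24 * 3600 // INTERVAL_S
total_samples = samples_per_node * NODES

print(samples_per_node)  # 57600
print(total_samples)     # 403200
```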
License: https://cdla.io/sharing-1-0/
Data for completing a task in an IoT course held by SIC Egypt. The data contains CPU metrics from my laptop, such as CPU usage, syscalls, and interrupts. It is meant for trying two different approaches to predicting CPU usage with linear regression: time-series regression on lagged data, and simple regression on the other metrics.
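The two regression approaches could be sketched as follows. The readings are made up for illustration, and `fit_line` is a hand-rolled least-squares helper, not something shipped with the dataset.

```python
def fit_line(x, y):
    """Ordinary least squares for y ~ slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Made-up readings; the real column names and values are assumptions.
cpu = [20.0, 22.0, 25.0, 24.0, 28.0, 30.0, 29.0, 33.0]
interrupts = [1000.0, 1100.0, 1250.0, 1200.0, 1400.0, 1500.0, 1450.0, 1650.0]

# Approach 1: time-series regression on lag data (predict cpu[t] from cpu[t-1]).
slope_lag, intercept_lag = fit_line(cpu[:-1], cpu[1:])
next_cpu = slope_lag * cpu[-1] + intercept_lag

# Approach 2: simple regression on another metric (predict cpu from interrupts).
slope_metric, intercept_metric = fit_line(interrupts, cpu)

print(f"lag-1 forecast for next reading: {next_cpu:.1f}")
print(f"cpu ~ {slope_metric:.4f} * interrupts + {intercept_metric:.2f}")
```

The lag model needs only the CPU series itself, while the metric model needs the other counters to be available at prediction time.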
License: MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
harpertokenSysMon Dataset
Dataset Summary
This open-source dataset captures real-time system metrics from macOS for time-series analysis, anomaly detection, and predictive maintenance.
Dataset Features
OS Compatibility: macOS
Data Collection Interval: 1-5 seconds
Total Storage Limit: 4GB
File Format: CSV & Parquet
Data Fields:
timestamp: Date and time of capture
cpu_usage: CPU usage percentage per core
memory_used_mb: RAM usage in MB… See the full description on the dataset page: https://huggingface.co/datasets/harpertoken/harpertokenSysMon.
License: https://dataverse.unimi.it/api/datasets/:persistentId/versions/2.1/customlicense?persistentId=doi:10.13130/RD_UNIMI/LJ6Z8V
Dataset containing real-world and synthetic time-series samples of legitimate software and malware. The samples cover machine-level performance metrics: CPU usage, RAM usage, and the number of bytes read from and written to disk and network. Synthetic samples are generated using a GAN.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Average CPU time (s) of all the referenced algorithms on the benchmark function.
License: Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) (https://creativecommons.org/licenses/by-nc-nd/4.0/)
License information was derived automatically
For full details of the data please refer to the paper "The MIT Supercloud Dataset", available at https://ieeexplore.ieee.org/abstract/document/9622850 or https://arxiv.org/abs/2108.02037
Dataset
Datacenter monitoring systems offer a variety of data streams and events. The Datacenter Challenge datasets are a combination of high-level data (e.g. Slurm Workload Manager scheduler data) and low-level job-specific time series data. The high-level data includes parameters such as the number of nodes requested, number of CPU/GPU/memory requests, exit codes, and run time data. The low-level time series data is collected on the order of seconds for each job. This granular time series data includes CPU/GPU/memory utilization, amount of disk I/O, and environmental parameters such as power drawn and temperature. Ideally, leveraging both high-level scheduler data and low-level time series data will facilitate the development of AI/ML algorithms which not only predict/detect failures, but also allow for the accurate determination of their cause.
Here I will only include the high-level data.
If you are interested in using the dataset, please cite this paper.
@INPROCEEDINGS{9773216,
author={Li, Baolin and Arora, Rohin and Samsi, Siddharth and Patel, Tirthak and Arcand, William and Bestor, David and Byun, Chansup and Roy, Rohan Basu and Bergeron, Bill and Holodnak, John and Houle, Michael and Hubbell, Matthew and Jones, Michael and Kepner, Jeremy and Klein, Anna and Michaleas, Peter and McDonald, Joseph and Milechin, Lauren and Mullen, Julie and Prout, Andrew and Price, Benjamin and Reuther, Albert and Rosa, Antonio and Weiss, Matthew and Yee, Charles and Edelman, Daniel and Vanterpool, Allan and Cheng, Anson and Gadepally, Vijay and Tiwari, Devesh},
booktitle={2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA)},
title={AI-Enabling Workloads on Large-Scale GPU-Accelerated System: Characterization, Opportunities, and Implications},
year={2022},
volume={},
number={},
pages={1224-1237},
doi={10.1109/HPCA53966.2022.00093}}
Reference: https://dcc.mit.edu/ https://github.com/boringlee24/HPCA22_SuperCloud
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This repository contains input and analysis scripts supporting the findings of "Thermal transport of glasses via machine learning driven simulations", by P. Pegolo and F. Grasselli. Content:
README.md: this file, information about the repository
SiO2: vitreous silica parent folder
  NEP: folder with datasets and input scripts for NEP training
    train.xyz: training dataset
    test.xyz: validation dataset
    nep.in: NEP input script
    nep.txt: NEP model
    nep.restart: NEP restart file
  DP: folder with datasets and input scripts for DP training
    input.json: DeePMD training input
    dataset: DeePMD training dataset
    validation: DeePMD validation dataset
    frozen_model.pb: DP model
  GKMD: scripts for the GKMD simulations
    Tersoff: Tersoff reference simulation
      model.xyz: initial configuration
      run.in: GPUMD script
      SiO2.gpumd.tersoff88: Tersoff model parameters
      convert_movie_to_dump.py: script to convert GPUMD XYZ trajectory to LAMMPS format for re-running the trajectory with the MLPs
    DP: DP simulation
      init.data: LAMMPS initial configuration
      in.lmp: LAMMPS input to re-run the Tersoff trajectory with the DP
    NEP: NEP simulation
      init.data: LAMMPS initial configuration
      in.lmp: LAMMPS input to re-run the Tersoff trajectory with the NEP. Note that this needs the NEP-CPU user package installed in LAMMPS. At the moment it is not possible to re-run a trajectory with GPUMD.
  QHGK: scripts for the QHGK simulations
    DP: DP data
      second.npy: second-order interatomic force constants
      third.npy: third-order interatomic force constants
      replicated_atoms.xyz: configuration
      dynmat: scripts to compute interatomic force constants with the DP model. Analogous scripts were also used to compute IFCs with the other potentials.
        initial.data: non-optimized configuration
        in.dynmat.lmp: LAMMPS script to minimize the structure and compute second-order interatomic force constants
        in.third.lmp: LAMMPS script to compute third-order interatomic force constants
    Tersoff: Tersoff data
      second.npy: second-order interatomic force constants
      third.npy: third-order interatomic force constants
      replicated_atoms.xyz: configuration
    NEP: NEP data
      second.npy: second-order interatomic force constants
      third.npy: third-order interatomic force constants
      replicated_atoms.xyz: configuration
    qhgk.py: script to compute QHGK lifetimes and thermal conductivity
Si: vitreous silicon parent folder
  QHGK: scripts for the QHGK simulations
    qhgk.py: script to compute QHGK lifetimes
    [N]: folder with the calculations on an N-atom system
      second.npy: second-order interatomic force constants
      third.npy: third-order interatomic force constants
      replicated_atoms.xyz: configuration
LiSi: vitreous lithium-intercalated silicon parent folder
  NEP: folder with datasets and input scripts for NEP training
    train.xyz: training dataset
    test.xyz: validation dataset
    nep.in: NEP input script
    nep.txt: NEP model
    nep.restart: NEP restart file
  EMD: folder with data on the equilibrium molecular dynamics simulations
    70k: data of the simulations with ~70k atoms
      1-45: folder with input scripts for the simulations at different Li concentration
        fraction.dat: Li fraction, y, as in Li_{y}Si
        quench: scripts for the melt-quench-anneal sample preparation
          model.xyz: initial configuration
          restart.xyz: final configuration
          run.in: GPUMD input
        gk: scripts for the GKMD simulation
          model.xyz: initial configuration
          restart.xyz: final configuration
          run.in: GPUMD input
        cepstral: folder for cepstral analysis
          analyze.py: Python script for cepstral analysis of the fluxes' time series generated by the GKMD runs
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This repository contains the datasets used in Mirzadeh et al., 2022. It includes three InSAR time-series datasets from the Envisat descending orbit, ALOS-1 ascending orbit, and Sentinel-1A in ascending and descending orbits, acquired over the Abarkuh Plain, Iran, as well as the geological map of the study area and the GNSS and hydrogeological data used in this research.
Dataset 1: Envisat descending track 292
Date: 06 Oct 2003 - 05 Sep 2005 (12 acquisitions)
Processor: ISCE/stripmapStack + MintPy
Displacement time-series (in HDF-EOS5 format): timeseries_LOD_tropHgt_ramp_demErr.h5
Mean LOS Velocity (in HDF-EOS5 format): velocity.h5
Mask Temporal Coherence (in HDF-EOS5 format): maskTempCoh.h5
Geometry (in HDF-EOS5 format): geometryRadar.h5
Dataset 2: ALOS-1 ascending track 569
Date: 06 Dec 2006 - 17 Dec 2010 (14 acquisitions)
Processor: ISCE/stripmapStack + MintPy
Displacement time-series (in HDF-EOS5 format): timeseries_ERA5_ramp_demErr.h5
Mean LOS Velocity (in HDF-EOS5 format): velocity.h5
Mask Temporal Coherence (in HDF-EOS5 format): maskTempCoh.h5
Geometry (in HDF-EOS5 format): geometryRadar.h5
Dataset 3: Sentinel-1 ascending track 130 and descending track 137
Date: 14 Oct 2014 - 28 Mar 2020 (129 ascending acquisitions) + 27 Oct 2014 - 29 Mar 2020 (114 descending acquisitions)
Processor: ISCE/topsStack + MintPy
Displacement time-series (in HDF-EOS5 format): timeseries_ERA5_ramp_demErr.h5
Mean LOS Velocity (in HDF-EOS5 format): velocity.h5
Mask Temporal Coherence (in HDF-EOS5 format): maskTempCoh.h5
Geometry (in HDF-EOS5 format): geometryRadar.h5
The time series and Mean LOS Velocity (MVL) products can be georeferenced and resampled using the maskTempCoh and geometryRadar products and the MintPy commands/functions.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
The rapid development of Digital Twin (DT) technology has underlined challenges in resource-constrained mobile devices, especially in the application of extended realities (XR), which includes Augmented Reality (AR) and Virtual Reality (VR). These challenges lead to computational inefficiencies that negatively impact user experience when dealing with sizeable 3D model assets. This article applies multiple lossless compression algorithms to improve the efficiency of digital twin asset delivery in Unity’s AssetBundle and Addressable asset management frameworks. In this study, an optimal model will be obtained that reduces both bundle size and time required in visualization, simultaneously reducing CPU and RAM usage on mobile devices. This study has assessed compression methods, such as LZ4, LZMA, Brotli, Fast LZ, and 7-Zip, among others, for their influence on AR performance. This study also creates mathematical models for predicting resource utilization, like RAM and CPU time, required by AR mobile applications. Experimental results show a detailed comparison among these compression algorithms, which can give insights and help choose the best method according to the compression ratio, decompression speed, and resource usage. It finally leads to more efficient implementations of AR digital twins on resource-constrained mobile platforms with greater flexibility in development and a better end-user experience. Our results show that LZ4 and Fast LZ perform best in speed and resource efficiency, especially with RAM caching. At the same time, 7-Zip/LZMA achieves the highest compression ratios at the cost of slower loading. Brotli emerged as a strong option for web-based AR/VR content, striking a balance between compression efficiency and decompression speed, outperforming Gzip in WebGL contexts. The Addressable Asset system with LZ4 offers the most efficient balance for real-time AR applications. 
This study will deliver practical guidance on optimal compression method selection to improve user experience and scalability for AR digital twin implementations.
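The kind of comparison described above (compression ratio against decompression time) can be sketched with Python's standard library. LZ4, Brotli, and Fast LZ are not in the standard library, so this sketch uses `zlib` (DEFLATE, the algorithm behind Gzip) and `lzma` as stand-ins, and the payload is a synthetic placeholder rather than a Unity asset bundle.

```python
import lzma
import time
import zlib

def benchmark(name, compress, decompress, payload):
    """Measure compression ratio and decompression time for one codec."""
    packed = compress(payload)
    start = time.perf_counter()
    restored = decompress(packed)
    elapsed = time.perf_counter() - start
    assert restored == payload  # lossless round trip
    return {
        "codec": name,
        "ratio": len(payload) / len(packed),
        "decompress_s": elapsed,
    }

# Synthetic stand-in for a 3D asset: repetitive vertex-like text records.
payload = b"".join(
    b"v %d.0 %d.5 %d.25\n" % (i, i % 7, i % 13) for i in range(20000)
)

results = [
    benchmark("zlib (DEFLATE)", zlib.compress, zlib.decompress, payload),
    benchmark("lzma", lzma.compress, lzma.decompress, payload),
]
for r in results:
    print(f"{r['codec']:<15} ratio={r['ratio']:.1f} "
          f"decompress={r['decompress_s'] * 1000:.2f} ms")
```

Timings from a single run on a desktop interpreter are noisy and say little about mobile hardware; the sketch only shows the shape of the measurement.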
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Autonomous Underwater Vehicle (AUV) Monterey Bay Time Series from Feb 2016. This data set includes CTD and fluorometer data from the Makai AUV, as context for ecogenomic sampling using an onboard Environmental Sample Processor (ESP).
The Numenta Anomaly Benchmark (NAB) is a novel benchmark for evaluating algorithms for anomaly detection in streaming, online applications. It is comprised of over 50 labeled real-world and artificial timeseries data files plus a novel scoring mechanism designed for real-time applications. All of the data and code is fully open-source, with extensive documentation, and a scoreboard of anomaly detection algorithms: github.com/numenta/NAB. The full dataset is included here, but please go to the repo for details on how to evaluate anomaly detection algorithms on NAB.
The NAB corpus of 58 timeseries data files is designed to provide data for research in streaming anomaly detection. It is comprised of both real-world and artificial timeseries data containing labeled anomalous periods of behavior. Data are ordered, timestamped, single-valued metrics. All data files contain anomalies, unless otherwise noted.
The majority of the data is real-world from a variety of sources such as AWS server metrics, Twitter volume, advertisement clicking metrics, traffic data, and more. All data is included in the repository, with more details in the data readme. We are in the process of adding more data, and actively searching for more data. Please contact us at nab@numenta.org if you have similar data (ideally with known anomalies) that you would like to see incorporated into NAB.
The NAB version will be updated whenever new data (and corresponding labels) is added to the corpus; NAB is currently in v1.0.
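To make the streaming setting concrete, here is a minimal rolling z-score detector that scores each value using only earlier values. It is a simple baseline sketch, not one of the algorithms on the NAB scoreboard and not NAB's scoring mechanism; the window, warm-up length, and threshold are illustrative assumptions.

```python
from collections import deque
from statistics import mean, pstdev

class RollingZScoreDetector:
    """Flags a value whose z-score against a trailing window exceeds a threshold."""

    def __init__(self, window=20, threshold=3.0, warmup=5):
        self.history = deque(maxlen=window)
        self.threshold = threshold
        self.warmup = warmup

    def update(self, value):
        """Score `value` against past values only, then add it to the window."""
        flagged = False
        if len(self.history) >= self.warmup:
            mu = mean(self.history)
            sigma = pstdev(self.history)
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                flagged = True
        self.history.append(value)
        return flagged

# Synthetic single-valued metric with one obvious outlier.
stream = [10, 11, 10, 12, 11, 10, 11, 50, 10, 11]
detector = RollingZScoreDetector()
flags = [detector.update(v) for v in stream]
print([i for i, f in enumerate(flags) if f])  # [7]
```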
realAWSCloudwatch/
AWS server metrics as collected by the AmazonCloudwatch service. Example metrics include CPU Utilization, Network Bytes In, and Disk Read Bytes.
realAdExchange/
Online advertisement clicking rates, where the metrics are cost-per-click (CPC) and cost per thousand impressions (CPM). One of the files is normal, without anomalies.
realKnownCause/
This is data for which we know the anomaly causes; no hand labeling.
ambient_temperature_system_failure.csv: The ambient temperature in an office setting.
cpu_utilization_asg_misconfiguration.csv: From Amazon Web Services (AWS) monitoring CPU usage, i.e. average CPU usage across a given cluster. When usage is high, AWS spins up a new machine, and uses fewer machines when usage is low.
ec2_request_latency_system_failure.csv: CPU usage data from a server in Amazon's East Coast datacenter. The dataset ends with complete system failure resulting from a documented failure of AWS API servers. There's an interesting story behind this data in the Numenta blog (http://numenta.com/blog/anomaly-of-the-week.html).
machine_temperature_system_failure.csv: Temperature sensor data of an internal component of a large, industrial machine. The first anomaly is a planned shutdown of the machine. The second anomaly is difficult to detect and directly led to the third anomaly, a catastrophic failure of the machine.
nyc_taxi.csv: Number of NYC taxi passengers, where the five anomalies occur during the NYC marathon, Thanksgiving, Christmas, New Year's Day, and a snow storm. The raw data is from the NYC Taxi and Limousine Commission. The data file included here consists of aggregating the total number of taxi passengers into 30 minute buckets.
rogue_agent_key_hold.csv: Timing the key holds for several users of a computer, where the anomalies represent a change in the user.
rogue_agent_key_updown.csv: Timing the key strokes for several users of a computer, where the anomalies represent a change in the user.
realTraffic/
Real-time traffic data from the Twin Cities Metro area in Minnesota, collected by the Minnesota Department of Transportation. Metrics include occupancy, speed, and travel time from specific sensors.
realTweets/
A collection of Twitter mentions of large publicly-traded companies such as Google and IBM. The metric value represents the number of mentions for a given ticker symbol every 5 minutes.
artificialNoAnomaly/
Artificially-generated data without any anomalies.
artificialWithAnomaly/
Artificially-generated data with varying types of anomalies.
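The 30-minute aggregation described for nyc_taxi.csv can be sketched as below. The ride records are invented for illustration; the raw Taxi and Limousine Commission format is not shown in this description.

```python
from collections import Counter
from datetime import datetime

def bucket_30min(ts):
    """Floor a timestamp to the start of its 30-minute bucket."""
    return ts.replace(minute=(ts.minute // 30) * 30, second=0, microsecond=0)

# Invented (pickup time, passenger count) records.
rides = [
    (datetime(2014, 7, 1, 0, 5), 2),
    (datetime(2014, 7, 1, 0, 29), 1),
    (datetime(2014, 7, 1, 0, 31), 3),
    (datetime(2014, 7, 1, 1, 0), 1),
]

totals = Counter()
for ts, passengers in rides:
    totals[bucket_30min(ts)] += passengers

for bucket in sorted(totals):
    print(bucket.isoformat(), totals[bucket])
```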
We encourage you to publish your results on running NAB, and share them with us at nab@numenta.org. Please cite the following publication when referring to NAB:
Lavin, Alexander and Ahmad, Subutai. "Evaluating Real-time Anomaly Detection Algorithms – the Numenta Anomaly Benchmark", Fourteenth International Conference on Machine Learning and Applications, December 2015.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Summary of multiple linear regression analysis for total time prediction: Effects of vertex count and video size.
License: Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Prior works have noted that existing public traces on anomaly detection and bottleneck localization in microservices applications only contain single, severe bottlenecks that are not representative of real-world scenarios. When such a bottleneck is introduced, the resulting latency increases by an order of magnitude (100x), making it trivial to detect that single bottleneck using a simple grid search or threshold-based approaches.
To create a more realistic dataset that includes traces with multiple bottlenecks at different intensities, we carefully benchmarked the social networking application under different interference intensities and duration of interference. We chose intensities and duration values that degrade the application performance but do not cause any faults or errors that can be trivially detected. We induced interference on different VMs at different times and also simultaneously. A single VM could be induced with different types of interference (e.g., CPU and memory), resulting in the hosted microservices experiencing a mixture of interference patterns. The resulting dataset consists of around 40 million request traces along with corresponding time series of CPU, memory, I/O, and network metrics. The dataset also includes application, VM, and Kubernetes logs.
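A small sketch of why threshold-based detection handles a single severe bottleneck but not milder interference. The latency values and the 10x factor are invented for illustration and are not drawn from the traces.

```python
from statistics import median

def threshold_flags(latencies_ms, baseline_ms, factor=10.0):
    """Flag requests whose latency exceeds `factor` times the baseline."""
    return [lat > factor * baseline_ms for lat in latencies_ms]

# Invented latencies in milliseconds.
normal = [20, 22, 19, 21, 20]
severe = [20, 21, 2000, 2100, 22]   # roughly 100x slowdown: trivially caught
mild = [20, 21, 30, 32, 22]         # about 1.5x slowdown: slips under the threshold

baseline = median(normal)
print(threshold_flags(severe, baseline))  # [False, False, True, True, False]
print(threshold_flags(mild, baseline))    # [False, False, False, False, False]
```

Lowering the factor far enough to catch the mild case would also flag ordinary jitter, which is the gap that traces with multiple bottlenecks at different intensities are meant to expose.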
A detailed description of the files is provided in the Data Explorer section. Please reach out to gagan at cs dot stonybrook dot edu if you have any questions or concerns.
If you find the dataset useful, please cite our WWW'24 paper "GAMMA: Graph Neural Network-Based Multi-Bottleneck Localization for Microservices Applications." Citation format (bibtex):
author = {Somashekar, Gagan and Dutt, Anurag and Adak, Mainak and Lorido Botran, Tania and Gandhi, Anshul},
title = {GAMMA: Graph Neural Network-Based Multi-Bottleneck Localization for Microservices Applications.},
year = {2024},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3589334.3645665},
doi = {10.1145/3589334.3645665},
booktitle = {Proceedings of the ACM Web Conference 2024},
location = {Singapore},
series = {WWW '24}
}
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Net-Income-Applicable-To-Common-Shares Time Series for Ingenic Semiconductor. Ingenic Semiconductor Co.,Ltd. engages in the research and development, design, and sale of integrated circuit chip products in China and internationally. It offers multi-core crossover IoT micro-processor, multi-core heterogeneous crossover micro-processor, low-power AIoT micro-processor, low power image recognition micro-processor, ultra-low-power IoT micro-processor, low power AI video processor, 4K video and AI vision application processor, balanced video processor, dual camera low power video processor, 2K HEVC video-IOT MCU, and professional security backend processor. The company also provides computing, storage, analog, and interconnect chips. Its products are used in automotive electronics, industrial and medical, communication equipment, consumer electronics, and other fields. The company was founded in 2005 and is headquartered in Beijing, China.
License: https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains 1440 rows of time-series metrics collected from a multi-tenant cloud environment, capturing concealed resource overuse.
Key features include:
Timestamped resource metrics (CPU, memory, disk I/O, network I/O)
Multiple users (tenants) to simulate shared infrastructure
Workload labels (e.g., Web Service, Backup, Crypto Mining)
Anomaly labels indicating resource overuse, including hidden anomalies
License: https://creativecommons.org/publicdomain/zero/1.0/
The SmartSys-CTI dataset is a synthetically generated yet realistic dataset created for research and development in anomaly detection and cyber threat intelligence (CTI) within smart system environments. It simulates activity logs and network behavior from smart devices commonly found in IoT-enabled infrastructures such as smart homes, industrial IoT, smart grids, and healthcare systems.
It includes both normal operational data and anomalous activity patterns such as Denial-of-Service (DoS), spoofing, and data injection, making it ideal for training and evaluating intelligent intrusion detection systems (IDS).
⭐ Key Features
🔐 Cyber Threat Scenarios: Includes labeled data for multiple cyberattacks: DoS, spoofing, injection.
📊 Rich Feature Set: Covers CPU/memory usage, network traffic, packet rate, encryption status, location variance, and more.
🧠 Deep Learning Ready: Designed for Capsule Networks (CapsNet), Extreme Learning Machines (ELM), and other hybrid deep models.
⏱️ Time-Series Support: Timestamped logs simulate real-time operations for sequential models (e.g., RNNs, LSTMs).
🧪 Multi-Class Labels: Provides a labeled target column for normal vs specific attack types, aiding multiclass classification.
⚡ Scalable and Lightweight: Efficient format suitable for real-time detection system prototyping and testing.
This dataset provides a practical foundation for developing scalable, accurate, and adaptive cybersecurity solutions in modern smart environments. Researchers and practitioners can use it to evaluate model performance, test feature extraction techniques, or simulate real-time defense systems.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Summary of multiple linear regression analysis for maximum RAM prediction: Effects of vertex count and video size.