MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This Cybersecurity Intrusion Detection Dataset is designed for detecting cyber intrusions based on network traffic and user behavior. Below, I’ll explain each aspect in detail, including the dataset structure, feature importance, possible analysis approaches, and how it can be used for machine learning.
The dataset consists of network-based and user behavior-based features. Each feature provides valuable information about potential cyber threats.
These features describe network-level information such as packet size, protocol type, and encryption methods.
network_packet_size (Packet Size in Bytes)
protocol_type (Communication Protocol)
encryption_used (Encryption Protocol)
These features track user activities, such as login attempts and session duration.
login_attempts (Number of Logins)
session_duration (Session Length in Seconds)
failed_logins (Failed Login Attempts)
unusual_time_access (Login Time Anomaly)
A binary flag (0 or 1) indicating whether access happened at an unusual time.
ip_reputation_score (Trustworthiness of IP Address)
browser_type (User’s Browser)
Target Variable (attack_detected)
1 means an attack was detected, 0 means normal activity.
This dataset can be used for intrusion detection systems (IDS) and cybersecurity research. Some key applications include:
Supervised Learning Approaches
Classification models (with attack_detected as the target).
Deep Learning Approaches
If attack labels are missing, anomaly detection can be used:
- Autoencoders: learn normal traffic and flag anomalies.
- Isolation Forest: detects outliers based on feature isolation.
- One-Class SVM: learns normal behavior and detects deviations.
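A minimal sketch of the unsupervised route using scikit-learn's Isolation Forest. The synthetic values and the choice of numeric columns (packet size, session duration, IP reputation score) are assumptions for illustration, not real dataset rows:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Synthetic stand-ins for numeric features such as network_packet_size,
# session_duration, and ip_reputation_score (values are invented).
normal = rng.normal(loc=[500, 300, 0.2], scale=[100, 60, 0.05], size=(950, 3))
attacks = rng.normal(loc=[1500, 30, 0.9], scale=[200, 10, 0.05], size=(50, 3))
X = np.vstack([normal, attacks])

# contamination = expected fraction of anomalies; tune it on validation data.
model = IsolationForest(n_estimators=100, contamination=0.05, random_state=0)
labels = model.fit_predict(X)  # -1 = anomaly, 1 = normal
n_flagged = int((labels == -1).sum())
```

Because Isolation Forest never sees the labels, the same recipe applies whether or not attack_detected is available, which makes it a useful baseline before training supervised models.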
AGPL-3.0 License (http://www.gnu.org/licenses/agpl-3.0.html)
❗️❗️❗️**The current version of SKAB (v0.9) contains 34 datasets with collective anomalies. But the upcoming update to v1.0 (probably up to the summer of 2021) will contain 300+ additional files with point and collective anomalies. It will make SKAB one of the largest changepoint-containing benchmarks, especially in the technical field.**
We propose the Skoltech Anomaly Benchmark (SKAB), designed for evaluating anomaly detection algorithms. SKAB allows working with two main problems (there are two anomaly markups):
* Outlier detection (anomalies considered and marked up as single-point anomalies)
* Changepoint detection (anomalies considered and marked up as collective anomalies)
SKAB consists of the following artifacts:
* Datasets
* Leaderboard (scoreboard)
* Python modules for algorithm evaluation
* Notebooks: Python notebooks with anomaly detection algorithms
The IIoT testbed system is located at the Skolkovo Institute of Science and Technology (Skoltech). All details regarding the testbed and the experimental process are presented in the following artifacts:
- Position paper (currently submitted for publication)
- Slides about the project
The SKAB v0.9 corpus contains 35 individual data files in .csv format. Each file represents a single experiment and contains a single anomaly. The dataset represents a multivariate time series collected from the sensors installed on the testbed. The data folder contains the datasets from the benchmark; its layout is described in the structure file. The columns in each data file are as follows:
- datetime - Represents dates and times of the moment when the value is written to the database (YYYY-MM-DD hh:mm:ss)
- Accelerometer1RMS - Shows a vibration acceleration (Amount of g units)
- Accelerometer2RMS - Shows a vibration acceleration (Amount of g units)
- Current - Shows the amperage on the electric motor (Ampere)
- Pressure - Represents the pressure in the loop after the water pump (Bar)
- Temperature - Shows the temperature of the engine body (degrees Celsius)
- Thermocouple - Represents the temperature of the fluid in the circulation loop (degrees Celsius)
- Voltage - Shows the voltage on the electric motor (Volt)
- RateRMS - Represents the circulation flow rate of the fluid inside the loop (Liter per minute)
- anomaly - Shows if the point is anomalous (0 or 1)
- changepoint - Shows if the point is a changepoint for collective anomalies (0 or 1)
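As a sketch of how one of these files can be loaded and split into sensor channels and labels, assuming semicolon-separated values (adjust `sep` if your copy differs); the two sample rows below are illustrative, not real SKAB measurements:

```python
import io
import pandas as pd

# Tiny in-memory sample with the SKAB column layout (values are invented).
sample = io.StringIO(
    "datetime;Accelerometer1RMS;Accelerometer2RMS;Current;Pressure;"
    "Temperature;Thermocouple;Voltage;RateRMS;anomaly;changepoint\n"
    "2020-02-08 16:46:16;0.027;0.040;1.33;0.40;68.9;24.2;219.8;32.0;0.0;0.0\n"
    "2020-02-08 16:46:17;0.031;0.045;1.35;0.41;68.9;24.3;220.1;32.1;1.0;1.0\n"
)
df = pd.read_csv(sample, sep=";", index_col="datetime", parse_dates=["datetime"])

# Separate the eight sensor channels from the two label columns.
labels = df[["anomaly", "changepoint"]]
features = df.drop(columns=["anomaly", "changepoint"])
anomaly_fraction = labels["anomaly"].mean()
```

Replacing the `StringIO` object with a path into the data folder loads a real experiment file the same way.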
Here we propose the leaderboard for SKAB v0.9 for both the outlier and changepoint detection problems. You can also present and evaluate your algorithm using SKAB on Kaggle. The results in the tables are calculated in the Python notebooks from the notebooks folder.
Sorted by F1; for F1, bigger is better; for both FAR and MAR, less is better.
| Algorithm | F1 | FAR, % | MAR, % |
|---|---|---|---|
| Perfect detector | 1 | 0 | 0 |
| T-squared+Q (PCA) | 0.67 | 13.95 | 36.32 |
| LSTM | 0.64 | 15.4 | 39.93 |
| MSCRED | 0.64 | 13.56 | 41.16 |
| T-squared | 0.56 | 12.14 | 52.56 |
| Autoencoder | 0.45 | 7.56 | 66.57 |
| Isolation forest | 0.4 | 6.86 | 72.09 |
| Null detector | 0 | 0 | 100 |
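The three metrics in the table can be computed point-wise from a confusion matrix. A small sketch, assuming the usual definitions FAR = FP / (FP + TN) and MAR = FN / (FN + TP):

```python
import numpy as np

def detection_metrics(y_true, y_pred):
    """Point-wise F1, false-alarm rate (FAR) and missed-alarm rate (MAR)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = int(np.sum(y_true & y_pred))
    fp = int(np.sum(~y_true & y_pred))
    fn = int(np.sum(y_true & ~y_pred))
    tn = int(np.sum(~y_true & ~y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    far = fp / (fp + tn) if fp + tn else 0.0
    mar = fn / (fn + tp) if fn + tp else 0.0
    return f1, far, mar

y = [0, 0, 1, 1, 0, 1]
perfect = detection_metrics(y, y)          # matches the "Perfect detector" row
null = detection_metrics(y, [0] * len(y))  # matches the "Null detector" row
```

The two reference rows of the table fall out directly: a perfect detector gives (1, 0, 0) and a detector that never fires gives F1 = 0, FAR = 0, MAR = 1 (i.e. 100%).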
Sorted by NAB (standard); for all metrics, bigger is better.
| Algorithm | NAB (standard) | NAB (lowFP) | NAB (lowFN) |
|---|---|---|---|
| Perfect detector | 100 | 100 | 100 |
| Isolation forest | 37.53 | 17.09 | 45.02 |
| MSCRED | 28.74 | 23.43 | 31.21 |
| LSTM | 27.09 | 11.06 | 32.68 |
| T-squared+Q (PCA) | 26.71 | 22.42 | 28.32 |
| T-squared | 17.87 | 3.44 | 23.2 |
| ArimaFD | 16.06 | 14.03 | 17.12 |
| Autoencoder | 15.59 | 0.78 | 20.91 |
| Null detector | 0 | 0 | 0 |
The notebooks folder contains Python notebooks with the code for reproducing the proposed leaderboard results.
We have calculated the results for five common anomaly detection algorithms:
- Hotelling's T-squared statistic;
- Hotelling's T-squared statistic + Q statistic based on PCA;
- Isolation forest;
- LSTM-based NN;
- Feed-forward autoencoder.
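As an illustration of the first of these algorithms, a minimal Hotelling's T-squared sketch on synthetic data. The empirical-quantile threshold is a simplification (the classical chi-square/F threshold can be used instead), and the data here is invented, not SKAB:

```python
import numpy as np

rng = np.random.default_rng(1)
train = rng.normal(size=(500, 4))  # stand-in for "normal" multivariate sensor data

# Fit the statistic's parameters on normal data only.
mu = train.mean(axis=0)
S_inv = np.linalg.inv(np.cov(train, rowvar=False))

def t_squared(X):
    # T^2 = (x - mu)^T S^-1 (x - mu); large values indicate anomalies.
    d = X - mu
    return np.einsum("ij,jk,ik->i", d, S_inv, d)

scores = t_squared(train)
# Flag points whose statistic exceeds a high empirical quantile of training scores.
threshold = np.quantile(scores, 0.99)
outlier_score = t_squared(np.array([[8.0, 8.0, 8.0, 8.0]]))[0]
```

A point far from the training distribution, like the one scored above, lands well beyond the threshold, which is exactly how the T-squared detector in the notebooks flags anomalous timestamps.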
Additionally, the results of the following algorithms were added to the repository:
- ArimaFD;
- MSCRED.
The aim of this study was to design a new canine posture estimation system specifically for working dogs. The system was composed of commercially available Inertial Measurement Units (IMUs) and a supervised learning algorithm developed for different behaviours. Three IMUs, each containing a 3-axis accelerometer, gyroscope, and magnetometer, were attached to the dogs’ chest, back, and neck. To build and test the model, data were collected during a video-recorded behaviour test where the trainee assistance dogs performed static postures (standing, sitting, lying down) and dynamic activities (walking, body shake). Advanced feature extraction techniques were employed for the first time in this field, including statistical, temporal, and spectral methods. The most important features for posture prediction were chosen using Select K Best with the ANOVA F-value. The individual contributions of each IMU, sensor, and feature type were analysed using Select K Best scores and Random Forest feature importance. Results showed that the back and chest IMUs were more important than the neck IMU, and the accelerometers were more important than the gyroscopes. The addition of IMUs to the chest and back of dog harnesses is recommended to improve performance. Additionally, the statistical and temporal feature domains were more important than the spectral feature domain. Three novel cascade arrangements of Random Forest and Isolation Forest were fitted to the dataset. The best classifier achieved an f1-macro of 0.83 and an f1-weighted of 0.90 for the prediction of the five postures, demonstrating better performance than previous studies. These results were attributed to the data collection methodology (number of subjects and observations, multiple IMUs, use of common working dog breeds) and the novel machine learning techniques (advanced feature extraction, feature selection, and modelling arrangements) employed.
The dataset and code used are publicly available on Mendeley Data and GitHub, respectively.
📌 Overview
This dataset provides a real-world representation of credit card transactions, labeled as fraudulent or legitimate. It is designed to aid the development of machine learning models for fraud detection and financial security applications. Given the rising number of online fraud cases, detecting suspicious transactions is crucial for financial institutions.
This dataset allows users to experiment with various fraud detection techniques, such as supervised and unsupervised learning models, anomaly detection, and pattern recognition.
📊 Dataset Details
Number of Transactions: 1,852,394
Number of Features: 23
Fraudulent Transactions: labeled is_fraud = 1 for fraud and is_fraud = 0 for legitimate payments.
📁 Columns Explained Transaction Information:
trans_date_trans_time – Timestamp of the transaction
cc_num – Unique (anonymized) credit card number
merchant – Merchant where the transaction occurred
category – Type of transaction (e.g., travel, food, personal care)
amt – Transaction amount
Cardholder Details:
first, last – First and last name of the cardholder
gender – Gender of the cardholder
street, city, state, zip – Address of the cardholder
lat, long – Geographical location of the cardholder
city_pop – Population of the cardholder’s city
job – Profession of the cardholder
dob – Date of birth of the cardholder
Transaction Identifiers & Timing:
trans_num – Unique transaction identifier
unix_time – Timestamp of the transaction in Unix format
Merchant Details:
merch_lat, merch_long – Merchant's location (latitude & longitude)
Fraud Indicator:
is_fraud – Target variable (1 = Fraud, 0 = Legitimate)
🎯 Usage
This dataset is ideal for:
✅ Fraud detection research
✅ Machine learning model development
✅ Anomaly detection projects
✅ Financial analytics
🛠️ Suggested Machine Learning Approaches
Supervised Learning:
Logistic Regression
Decision Trees / Random Forest
XGBoost / LightGBM
Deep Learning (Neural Networks)
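A minimal supervised sketch with one of the listed models (Random Forest). The features and label rule here are synthetic stand-ins invented for illustration (amount, hour of day, distance from home are assumptions, not columns computed from the real file):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
# Synthetic stand-ins for engineered numeric features.
amt = rng.lognormal(3, 1, n)        # transaction amount
hour = rng.integers(0, 24, n)       # hour of day from trans_date_trans_time
dist = rng.exponential(10, n)       # cardholder-to-merchant distance (invented)
# Toy label rule: large night-time transactions far from home are "fraud".
y = ((amt > 50) & ((hour < 6) | (hour > 22)) & (dist > 10)).astype(int)
X = np.column_stack([amt, hour, dist])

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)
# class_weight="balanced" compensates for the heavy class imbalance
# typical of fraud data.
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                             random_state=0)
clf.fit(X_tr, y_tr)
accuracy = clf.score(X_te, y_te)
```

With real fraud data, accuracy is a misleading headline number because legitimate transactions dominate; precision, recall, and PR-AUC on the fraud class are the metrics worth reporting.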
Unsupervised Learning:
Autoencoders
Isolation Forest
DBSCAN for anomaly detection
Feature Engineering Ideas:
Creating transaction frequency features
Aggregating spending behavior per merchant/category
Analyzing location-based fraud patterns
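The first two feature-engineering ideas can be sketched with pandas groupby/transform. The cc_num, category, and amt columns come from the dataset; the derived column names and the toy values are assumptions:

```python
import pandas as pd

# Toy transactions table using a subset of the dataset's columns.
df = pd.DataFrame({
    "cc_num":   [1, 1, 1, 2, 2],
    "category": ["travel", "food", "food", "food", "travel"],
    "amt":      [120.0, 15.0, 22.0, 9.0, 300.0],
})

# Transaction frequency per card.
df["card_txn_count"] = df.groupby("cc_num")["cc_num"].transform("count")
# Average spend per card and category, and the deviation from it --
# a large positive deviation can hint at out-of-pattern spending.
df["cat_avg_amt"] = df.groupby(["cc_num", "category"])["amt"].transform("mean")
df["amt_dev"] = df["amt"] - df["cat_avg_amt"]
```

Using `transform` (rather than `agg`) keeps the result aligned row-by-row with the original frame, so the engineered columns can be fed straight into a model alongside the raw ones.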
⚠️ Disclaimer This dataset has been anonymized and should be used strictly for research and educational purposes. It does not contain any real-world personal information, and the credit card numbers have been randomly generated for simulation purposes.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Number of observations after feature extraction per dataset per posture.
To advance our understanding of forest cover changes, given the discrepancies, this work provides an original analysis by assessing five available remote sensing datasets (ALOS PALSAR forest and non-forest data, ESA CCI Land Cover, MODIS IGBP, Hansen/GFW on global tree cover loss, and Terra-I) to estimate the likely extent of current forests (circa 2018) and forest cover loss from 2001-2018, for which data was available. This assumes that no single approach or data source can capture major trends everywhere; therefore, an all-available data approach is needed to overcome the shortcomings of individual datasets. The main shortcomings of this approach, however, are that it does not account for forest gains, tends to underestimate the conversion in dry forest ecosystems, and lacks an explicit assessment of uncertainties across the different datasets.

“Forest cover loss” in the all-available data analysis consists of observations (pixels) changing from forest to non-forest at any time during 2000 to 2018. The spatial resolution chosen was 250 m given the original resolutions of the incorporated datasets and on the understanding that forest areas should be a minimum of 250 x 250 m (6.25 ha) to contain the functional attributes of a forest (e.g. species distribution, ecology, ecosystem services), rather than depicting individual trees or groups of trees.

According to our analysis, about 20% of total forest cover loss takes place in core forest, which we label “primary forest loss”, while the remaining 80% results from the conversion of edge and patched forests, which is labelled “secondary forest loss”. Two thirds of total forest cover loss in the period from 2000-2018 occurred in the tropics and subtropics, followed by boreal and temperate forests.

A portion of the loss in temperate and boreal forests will not be permanent and might refer to other types of natural forest disturbance produced by insects, fire, and severe weather, as well as by the felling of plantations or semi-natural forests as part of forest management. Much tropical forest cover loss is in South America and Asia, while subtropical forest cover loss is mainly in South America and Africa. When looking at countries by income level, as defined by the World Bank, much deforestation takes place in upper-middle and lower-middle-income countries. At the risk of oversimplifying, this suggests an increasing pressure on forests in the transition that occurs when countries increase economic development. In the tropics, upper-middle-income countries dominate forest cover loss in South America, due to the influence of Brazil, and lower-middle-income countries in Asia, due to the influence of Indonesia. Forest cover loss in the subtropics occurs mainly in Brazil and Argentina in South America, many lower-middle-income countries in South America, and lower-income countries in sub-Saharan Africa. Most temperate and boreal forest cover loss, likely not all permanent, occurs in high-income countries (Russia) and North America (United States and Canada).

Unfortunately, this data does not identify changes over time or land use interactions among countries. Reduced forest cover loss in some mainly high-income countries, except North America, is associated with forest cover loss, particularly in lower- and upper-middle-income countries in the tropics. Interactions are informed by the “forest transition” effect. Forest transition dynamics occur when net forest restoration replaces net forest cover loss in some specific place. The countries that underwent a forest transition that reduced forest loss and encouraged regrowth may have placed additional pressure on forests outside their borders, thus displacing deforestation.

The debate on forest transitions and leakage is quite controversial given its policy implications. Recent analysis, based on a land-balance model that quantifies deforestation due to global trade at country level in the tropics and sub-tropics, linked to a country-to-country trade model, found that from 2005-2013, 62% of forest loss was caused by commercial agriculture, pasture and plantations. About 26% of total deforestation was attributed to international demand, 87% of which was exported to countries with decreasing deforestation or increasing forest cover in Europe and Asia (i.e. China, India). Some of this displacement pressure may be reduced by land intensification.

Global patterns of forest fragmentation

In this analysis we consider forest degradation alongside forest cover loss. Degradation is a multi-factorial phenomenon that includes, amongst others, the loss of native species, the appearance of invasive species, pollution damage, structural changes, selective timber removal, and many more. Here we use fragmentation as a proxy that can be detected through remote sensing; this is a critical aspect of forest degradation but does not capture all aspects. The change in spatial pattern and structure by fragmentation of forest into smaller patches or “islands” damages forest ecosystem services such as carbon storage and climate mitigation, regulation, water provision, and habitat for biodiversity. These impacts are created by changes at forest edges, which include increased exposure to different climate, fire, wind, mortality, and human access.

The increasing isolation of forest patches contributes to long-term changes in biodiversity, including species richness and productivity, creating fundamental changes in forest ecosystems.

We evaluated the fragmentation of forests using morphological spatial pattern analysis (MSPA) assessed on the two all-available data global forest cover maps corresponding to 2000 and 2018, to determine forest cover transitions between different types of fragmentation classes (i.e. stable core, inner edges, outer edges, and patches). Changes between fragmentation classes over time are defined as primary and secondary degradation based on their initial state, in contrast to forests which remain in the same fragmentation class as stable core, inner edge, outer edge, or patch. In this definition, primary degradation is a result of the fragmentation of core forests into forest with more edges, reducing the area of continuous forest extent and resulting in greater losses of carbon and associated ecosystem services such as the biodiversity present in intact forests. Secondary degradation is the conversion of edge forests into more fragmented classes, occurring in secondary forests which may already be degraded and are more accessible and easier to deforest.
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
About Dataset
This synthetic dataset emulates a distributed, adaptive Battery Management System (BMS) architecture across multiple modules. It contains timestamped measurements and diagnostics for four battery modules over sequential 5‑minute intervals. Key attributes include voltage, temperature, current, power, control commands, health indicators, latency, and anomaly flags—providing a rich foundation for time‑series forecasting, anomaly detection, control‑chart monitoring, and reliability analysis.
This early release is intended to showcase the structure and potential workflows for distributed BMS analytics. Your insights on field completeness, data realism, schema design, and potential additional parameters (e.g., per‑cell voltages, extended telemetry) are highly valued. Please share feedback via the project repository’s issue tracker or by email. Your suggestions will guide enhancements for future versions, including expanded module counts, varying sampling rates, and integration with additional sensor data.
Each row represents one 5‑minute interval for a given module. Noteworthy fields include:
timestamp: ISO-formatted datetime of the record (UTC).
module_id: Identifier for each battery module (e.g., Module_A).
cell_voltage_v: Average cell voltage (V).
cell_temperature_c: Mean cell temperature (°C).
module_current_a: Module charge/discharge current (A).
module_power_kw: Instantaneous power output/input (kW).
converter_command_pct: Power converter duty cycle (%) commanded per module.
soc_pct: State-of-Charge (%) at the end of the interval.
soh_pct: State-of-Health (%) relative to nominal capacity.
anomaly_score_pct: Unsupervised anomaly score (0–100%).
diagnostic_flag: Boolean flag indicating a potential fault (true if anomaly_score > threshold).
latency_ms: Cloud communication round-trip latency (ms).

A distributed, adaptive BMS elevates module-level intelligence to optimize safety, performance, and lifetime. To support development and research:
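One of the workflows mentioned above, control-chart monitoring, can be sketched on a single module's voltage series. The readings below are synthetic stand-ins for cell_voltage_v, and the window size and 3-sigma threshold are assumptions to tune per deployment:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# One day of synthetic 5-minute cell_voltage_v readings for one module.
volts = rng.normal(3.70, 0.01, 288)
volts[200] = 3.45  # injected voltage sag to emulate a fault

s = pd.Series(volts)
# Rolling control-chart check: flag points outside mean +/- 3 sigma,
# estimated over a trailing window (48 intervals = 4 hours, an assumption).
roll = s.rolling(window=48, min_periods=24)
z = (s - roll.mean()) / roll.std()
diagnostic_flag = z.abs() > 3  # analogue of the dataset's diagnostic_flag
```

The same pattern extends to cell_temperature_c or latency_ms, and the rolling z-score can feed an anomaly_score_pct-style metric by mapping |z| onto a 0-100 scale.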
This initial dataset focuses on core metrics; future releases may include finer sampling, environmental context (e.g., ambient temperature), and firmware versioning.
This synthetic dataset is released under the MIT License. For any publications, presentations, or derivative works, please cite:
Synthetic Distributed Adaptive BMS Dataset (v1.0), 2025.
Contributions and issue reports are welcome via the project’s GitHub repository.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Type, posture, and the number of observations in the IMU Posture dataset.