8 datasets found
  1. Cybersecurity 🪪 Intrusion 🦠 Detection Dataset

    • kaggle.com
    Updated Feb 10, 2025
    Cite
    Dinesh Naveen Kumar Samudrala (2025). Cybersecurity 🪪 Intrusion 🦠 Detection Dataset [Dataset]. https://www.kaggle.com/datasets/dnkumars/cybersecurity-intrusion-detection-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 10, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Dinesh Naveen Kumar Samudrala
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    This Cybersecurity Intrusion Detection Dataset is designed for detecting cyber intrusions based on network traffic and user behavior. Below, I’ll explain each aspect in detail, including the dataset structure, feature importance, possible analysis approaches, and how it can be used for machine learning.

    1. Understanding the Features

    The dataset consists of network-based and user behavior-based features. Each feature provides valuable information about potential cyber threats.

    A. Network-Based Features

    These features describe network-level information such as packet size, protocol type, and encryption methods.

    1. network_packet_size (Packet Size in Bytes)

      • Represents the size of network packets, ranging from 64 to 1500 bytes.
      • Packets on the lower end (~64 bytes) may indicate control messages, while larger packets (~1500 bytes) often carry bulk data.
      • Attackers may use abnormally small or large packets for reconnaissance or exploitation attempts.
    2. protocol_type (Communication Protocol)

      • The protocol used in the session: TCP, UDP, or ICMP.
      • TCP (Transmission Control Protocol): Reliable, connection-oriented (common for HTTP, HTTPS, SSH).
      • UDP (User Datagram Protocol): Faster but less reliable (used for VoIP, streaming).
      • ICMP (Internet Control Message Protocol): Used for network diagnostics (ping); often abused in Denial-of-Service (DoS) attacks.
    3. encryption_used (Encryption Protocol)

      • Values: AES, DES, None.
      • AES (Advanced Encryption Standard): Strong encryption, commonly used.
      • DES (Data Encryption Standard): Older encryption, weaker security.
      • None: Indicates unencrypted communication, which can be risky.
      • Attackers might use no encryption to avoid detection or weak encryption to exploit vulnerabilities.

    B. User Behavior-Based Features

    These features track user activities, such as login attempts and session duration.

    1. login_attempts (Number of Logins)

      • High values might indicate brute-force attacks (repeated login attempts).
      • Typical users have 1–3 login attempts, while an attack may have hundreds or thousands.
    2. session_duration (Session Length in Seconds)

      • A very long session might indicate unauthorized access or persistence by an attacker.
      • Attackers may try to stay connected to maintain access.
    3. failed_logins (Failed Login Attempts)

      • High failed login counts indicate credential stuffing or dictionary attacks.
      • Many failed attempts followed by a successful login could suggest an account was compromised.
    4. unusual_time_access (Login Time Anomaly)

      • A binary flag (0 or 1) indicating whether access happened at an unusual time.
      • Attackers often operate outside normal business hours to evade detection.
    5. ip_reputation_score (Trustworthiness of IP Address)

      • A score from 0 to 1, where higher values indicate suspicious activity.
      • IP addresses associated with botnets, spam, or previous attacks tend to have higher scores.
    6. browser_type (User’s Browser)

      • Common browsers: Chrome, Firefox, Edge, Safari.
      • Unknown: Could be an indicator of automated scripts or bots.

    2. Target Variable (attack_detected)

    • Binary classification: 1 means an attack was detected, 0 means normal activity.
    • The dataset is useful for supervised machine learning, where a model learns from labeled attack patterns.

    3. Possible Use Cases

    This dataset can be used for intrusion detection systems (IDS) and cybersecurity research. Some key applications include:

    A. Machine Learning-Based Intrusion Detection

    1. Supervised Learning Approaches

      • Classification Models (Logistic Regression, Decision Trees, Random Forest, XGBoost, SVM)
      • Train the model using labeled data (attack_detected as the target).
      • Evaluate using accuracy, precision, recall, and F1-score (see the sketch after this list).
    2. Deep Learning Approaches

      • Use Neural Networks (DNN, LSTM, CNN) for pattern recognition.
      • LSTMs work well for time-series-based network traffic analysis.
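
    A minimal supervised sketch along these lines (the CSV file name, the column handling, and the model choice are assumptions; adapt them to the actual files and drop any pure identifier columns before encoding):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# File name is assumed; one-hot encode categorical fields such as
# protocol_type, encryption_used, and browser_type.
df = pd.read_csv("cybersecurity_intrusion_data.csv")
X = pd.get_dummies(df.drop(columns=["attack_detected"]))
y = df["attack_detected"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
clf = RandomForestClassifier(n_estimators=300, random_state=42).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))  # precision, recall, F1 per class
```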

    B. Anomaly Detection (Unsupervised Learning)

    If attack labels are missing, anomaly detection can be used:
    • Autoencoders: Learn normal traffic and flag anomalies.
    • Isolation Forest: Detects outliers based on feature isolation.
    • One-Class SVM: Learns normal behavior and detects deviations.
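
    A minimal unsupervised sketch with Isolation Forest (the feature subset and contamination rate are assumptions):

```python
from sklearn.ensemble import IsolationForest

# df: the same DataFrame loaded in the supervised sketch above.
num_cols = ["network_packet_size", "login_attempts", "session_duration",
            "failed_logins", "ip_reputation_score"]
iso = IsolationForest(n_estimators=200, contamination=0.1, random_state=42)
# fit_predict returns -1 for anomalies and 1 for normal points.
df["anomaly_flag"] = (iso.fit_predict(df[num_cols]) == -1).astype(int)
```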

    C. Rule-Based Detection

    • If certain thresholds are met (e.g., failed_logins > 10 & ip_reputation_score > 0.8), an alert is triggered.
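
    The same threshold rule, expressed directly over the DataFrame loaded above:

```python
# Boolean mask of sessions matching the example rule.
rule = (df["failed_logins"] > 10) & (df["ip_reputation_score"] > 0.8)
print(f"{rule.sum()} sessions match the rule and would raise an alert")
```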

    4. Challenges & Considerations

    • Adversarial Attacks: Attackers may modify traffic to evade detection.
    • Concept Drift: Cyber threats...
  2. SKAB - Skoltech Anomaly Benchmark

    • kaggle.com
    zip
    Updated Nov 28, 2020
    Cite
    Iurii Katser (2020). SKAB - Skoltech Anomaly Benchmark [Dataset]. https://www.kaggle.com/datasets/yuriykatser/skoltech-anomaly-benchmark-skab/code
    Explore at:
    zip (1300142 bytes)
    Dataset updated
    Nov 28, 2020
    Authors
    Iurii Katser
    License

    GNU AGPL v3.0 (http://www.gnu.org/licenses/agpl-3.0.html)

    Description

    ❗️❗️❗️ The current version of SKAB (v0.9) contains 34 datasets with collective anomalies. The upcoming update to v1.0 (probably by the summer of 2021) will contain 300+ additional files with point and collective anomalies, making SKAB one of the largest changepoint-containing benchmarks, especially in the technical field.


    We propose the Skoltech Anomaly Benchmark (SKAB), designed for evaluating anomaly detection algorithms. SKAB supports two main problems (there are two anomaly markups):
    • Outlier detection (anomalies considered and marked up as single-point anomalies)
    • Changepoint detection (anomalies considered and marked up as collective anomalies)

    SKAB consists of the following artifacts:
    • Datasets
    • Leaderboard (scoreboard)
    • Python modules for algorithms’ evaluation
    • Notebooks: Python notebooks with anomaly detection algorithms

    The IIoT testbed is located at the Skolkovo Institute of Science and Technology (Skoltech). All details regarding the testbed and the experimental process are presented in the following artifacts:
    • Position paper (currently submitted for publication)
    • Slides about the project

    Datasets

    The SKAB v0.9 corpus contains 35 individual data files in .csv format. Each file represents a single experiment and contains a single anomaly. The dataset is a multivariate time series collected from the sensors installed on the testbed. The data folder contains the benchmark datasets; its layout is described in the structure file. The columns in each data file are as follows:
    • datetime - Date and time when the value was written to the database (YYYY-MM-DD hh:mm:ss)
    • Accelerometer1RMS - Vibration acceleration (in g units)
    • Accelerometer2RMS - Vibration acceleration (in g units)
    • Current - Amperage on the electric motor (Ampere)
    • Pressure - Pressure in the loop after the water pump (Bar)
    • Temperature - Temperature of the engine body (degrees Celsius)
    • Thermocouple - Temperature of the fluid in the circulation loop (degrees Celsius)
    • Voltage - Voltage on the electric motor (Volt)
    • RateRMS - Circulation flow rate of the fluid inside the loop (litres per minute)
    • anomaly - Whether the point is anomalous (0 or 1)
    • changepoint - Whether the point is a changepoint for collective anomalies (0 or 1)
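
    A minimal loading sketch for one experiment file (the path and the semicolon separator are assumptions; adjust them to the actual files and the structure file):

```python
import pandas as pd

# Path and separator are assumed; each file is one experiment with the columns listed above.
df = pd.read_csv("data/valve1/0.csv", sep=";", index_col="datetime", parse_dates=True)

feature_cols = [c for c in df.columns if c not in ("anomaly", "changepoint")]
X = df[feature_cols]                # multivariate sensor readings
y_outlier = df["anomaly"]           # point-wise anomaly labels (0/1)
y_changepoint = df["changepoint"]   # collective-anomaly changepoint labels (0/1)
print(X.shape, y_outlier.mean())    # size and fraction of anomalous points
```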

    Leaderboard (Scoreboard)

    Here we propose the leaderboard for SKAB v0.9 both for outlier and changepoint detection problems. You can also present and evaluate your algorithm using SKAB on kaggle. The results in the tables are calculated in the python notebooks from the notebooks folder.

    Outlier detection problem

    Sorted by F1; for F1, bigger is better; for FAR and MAR, less is better.

    | Algorithm | F1 | FAR, % | MAR, % |
    |---|---|---|---|
    | Perfect detector | 1 | 0 | 0 |
    | T-squared+Q (PCA) | 0.67 | 13.95 | 36.32 |
    | LSTM | 0.64 | 15.4 | 39.93 |
    | MSCRED | 0.64 | 13.56 | 41.16 |
    | T-squared | 0.56 | 12.14 | 52.56 |
    | Autoencoder | 0.45 | 7.56 | 66.57 |
    | Isolation forest | 0.4 | 6.86 | 72.09 |
    | Null detector | 0 | 0 | 100 |
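
    For orientation, a minimal sketch of how F1, FAR, and MAR could be computed from point-wise 0/1 labels, assuming FAR and MAR denote the false-alarm and missed-alarm rates (an assumption; the official scores are produced by the notebooks in the repository):

```python
import numpy as np
from sklearn.metrics import f1_score

def far_mar(y_true, y_pred):
    """False-alarm rate (alarms on normal points) and missed-alarm rate (missed anomalies), in %."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    far = 100 * fp / max((y_true == 0).sum(), 1)
    mar = 100 * fn / max((y_true == 1).sum(), 1)
    return far, mar

# y_true = df["anomaly"]; y_pred = a detector's 0/1 predictions
# print(f1_score(y_true, y_pred), far_mar(y_true, y_pred))
```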

    Changepoint detection problem

    Sorted by NAB (standard); for all metrics, bigger is better.

    | Algorithm | NAB (standard) | NAB (lowFP) | NAB (lowFN) |
    |---|---|---|---|
    | Perfect detector | 100 | 100 | 100 |
    | Isolation forest | 37.53 | 17.09 | 45.02 |
    | MSCRED | 28.74 | 23.43 | 31.21 |
    | LSTM | 27.09 | 11.06 | 32.68 |
    | T-squared+Q (PCA) | 26.71 | 22.42 | 28.32 |
    | T-squared | 17.87 | 3.44 | 23.2 |
    | ArimaFD | 16.06 | 14.03 | 17.12 |
    | Autoencoder | 15.59 | 0.78 | 20.91 |
    | Null detector | 0 | 0 | 0 |

    Notebooks

    The notebooks folder contains Python notebooks with the code for reproducing the proposed leaderboard results.

    We have calculated the results for five quite common anomaly detection algorithms:
    • Hotelling's T-squared statistics;
    • Hotelling's T-squared statistics + Q statistics based on PCA;
    • Isolation forest;
    • LSTM-based NN;
    • Feed-Forward Autoencoder.

    Additionally, results for the following algorithms were added to the repository: ArimaFD and MSCRED.

    Citat...

  3. Categories of behaviour analysed.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Jun 21, 2023
    Cite
    Galvin, Paul; O’Flynn, Brendan; Marcato, Marinara; Tedesco, Salvatore; O’Mahony, Conor (2023). Categories of behaviour analysed. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001099745
    Explore at:
    Dataset updated
    Jun 21, 2023
    Authors
    Galvin, Paul; O’Flynn, Brendan; Marcato, Marinara; Tedesco, Salvatore; O’Mahony, Conor
    Description

    The aim of this study was to design a new canine posture estimation system specifically for working dogs. The system was composed of Inertial Measurement Units (IMUs) that are commercially available, and a supervised learning algorithm which was developed for different behaviours. Three IMUs, each containing a 3-axis accelerometer, gyroscope, and magnetometer, were attached to the dogs’ chest, back, and neck. To build and test the model, data were collected during a video-recorded behaviour test where the trainee assistance dogs performed static postures (standing, sitting, lying down) and dynamic activities (walking, body shake). Advanced feature extraction techniques were employed for the first time in this field, including statistical, temporal, and spectral methods. The most important features for posture prediction were chosen using Select K Best with ANOVA F-value. The individual contributions of each IMU, sensor, and feature type were analysed using Select K Best scores and Random Forest feature importance. Results showed that the back and chest IMUs were more important than the neck IMU, and the accelerometers were more important than the gyroscopes. The addition of IMUs to the chest and back of dog harnesses is recommended to improve performance. Additionally, statistical and temporal feature domains were more important than spectral feature domains. Three novel cascade arrangements of Random Forest and Isolation Forest were fitted to the dataset. The best classifier achieved an f1-macro of 0.83 and an f1-weighted of 0.90 for the prediction of the five postures, demonstrating a better performance than previous studies. These results were attributed to the data collection methodology (number of subjects and observations, multiple IMUs, use of common working dog breeds) and novel machine learning techniques (advanced feature extraction, feature selection and modelling arrangements) employed. The dataset and code used are publicly available on Mendeley Data and GitHub, respectively.
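
    As a rough illustration of the feature-selection and classification steps described above (a sketch only; the synthetic data stands in for the extracted IMU features and this is not the authors' actual pipeline):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Placeholder feature matrix standing in for the extracted IMU features (5 postures).
X, y = make_classification(n_samples=1000, n_features=200, n_informative=30,
                           n_classes=5, n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Select K Best with the ANOVA F-value, then a Random Forest posture classifier.
model = make_pipeline(
    SelectKBest(score_func=f_classif, k=50),
    RandomForestClassifier(n_estimators=200, random_state=0),
)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```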

  4. Credit Card Fraud Detection

    • kaggle.com
    zip
    Updated Feb 28, 2025
    Cite
    Tushar Bhadouria (2025). Credit Card Fraud Detection [Dataset]. https://www.kaggle.com/datasets/tusharbhadouria/credit-card-fraud-detection/code
    Explore at:
    zip (211766662 bytes)
    Dataset updated
    Feb 28, 2025
    Authors
    Tushar Bhadouria
    Description

    📌 Overview

    This dataset provides a real-world representation of credit card transactions, labeled as fraudulent or legitimate. It is designed to aid in the development of machine learning models for fraud detection and financial security applications. Given the rising cases of online fraud, detecting suspicious transactions is crucial for financial institutions.

    This dataset allows users to experiment with various fraud detection techniques, such as supervised and unsupervised learning models, anomaly detection, and pattern recognition.

    📊 Dataset Details

    • Number of Transactions: 1,852,394
    • Number of Features: 23
    • Fraud label: is_fraud = 1 marks fraudulent transactions and is_fraud = 0 marks legitimate payments.

    📁 Columns Explained

    Transaction Information:

    • trans_date_trans_time – Timestamp of the transaction
    • cc_num – Unique (anonymized) credit card number
    • merchant – Merchant where the transaction occurred
    • category – Type of transaction (e.g., travel, food, personal care)
    • amt – Transaction amount

    Cardholder Details:

    • first, last – First and last name of the cardholder
    • gender – Gender of the cardholder
    • street, city, state, zip – Address of the cardholder
    • lat, long – Geographical location of the cardholder
    • city_pop – Population of the cardholder’s city
    • job – Profession of the cardholder
    • dob – Date of birth of the cardholder

    Transaction Identifiers & Timing:

    • trans_num – Unique transaction identifier
    • unix_time – Timestamp of the transaction in Unix format

    Merchant Details:

    • merch_lat, merch_long – Merchant’s location (latitude & longitude)

    Fraud Indicator:

    • is_fraud – Target variable (1 = Fraud, 0 = Legitimate)

    🎯 Usage

    This dataset is ideal for:
    ✅ Fraud detection research
    ✅ Machine learning model development
    ✅ Anomaly detection projects
    ✅ Financial analytics

    🛠️ Suggested Machine Learning Approaches

    Supervised Learning:

    • Logistic Regression
    • Decision Trees / Random Forest
    • XGBoost / LightGBM
    • Deep Learning (Neural Networks)

    Unsupervised Learning:

    • Autoencoders
    • Isolation Forest
    • DBSCAN for anomaly detection

    Feature Engineering Ideas:

    • Creating transaction frequency features
    • Aggregating spending behavior per merchant/category
    • Analyzing location-based fraud patterns
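
    A hedged sketch of the first two ideas, using the column names listed above (the file name is an assumption):

```python
import pandas as pd

df = pd.read_csv("credit_card_transactions.csv",
                 parse_dates=["trans_date_trans_time"])   # assumed file name
df = df.sort_values("trans_date_trans_time")

# Transaction frequency: number of transactions on the same card in the previous 24 hours.
df["tx_last_24h"] = (
    df.groupby("cc_num")
      .rolling("24h", on="trans_date_trans_time")["amt"]
      .count()
      .reset_index(level=0, drop=True)
)

# Spending behaviour per category: deviation of the amount from the card's usual spend in that category.
cat_stats = df.groupby(["cc_num", "category"])["amt"].agg(["mean", "std"]).add_prefix("cat_amt_")
df = df.join(cat_stats, on=["cc_num", "category"])
df["amt_zscore"] = (df["amt"] - df["cat_amt_mean"]) / df["cat_amt_std"].replace(0, 1)
```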

    ⚠️ Disclaimer This dataset has been anonymized and should be used strictly for research and educational purposes. It does not contain any real-world personal information, and the credit card numbers have been randomly generated for simulation purposes.

  5. Number of observations after feature extraction per dataset per posture.

    • plos.figshare.com
    • figshare.com
    xls
    Updated Jun 21, 2023
    Cite
    Marinara Marcato; Salvatore Tedesco; Conor O’Mahony; Brendan O’Flynn; Paul Galvin (2023). Number of observations after feature extraction per dataset per posture. [Dataset]. http://doi.org/10.1371/journal.pone.0286311.t005
    Explore at:
    xls
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Marinara Marcato; Salvatore Tedesco; Conor O’Mahony; Brendan O’Flynn; Paul Galvin
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Number of observations after feature extraction per dataset per posture.

  6. Primary Humid Forest

    • globil.panda.org
    Updated Jul 2, 2020
    Cite
    World Wide Fund for Nature (2020). Primary Humid Forest [Dataset]. https://globil.panda.org/maps/primary-humid-forest
    Explore at:
    Dataset updated
    Jul 2, 2020
    Dataset authored and provided by
    World Wide Fund for Nature (http://wwf.org/)
    Description

    To advance our understanding of forest cover changes, given the discrepancies, this work provides an original analysis by assessing five available remote sensing datasets (ALOS PALSAR forest and non-forest data, ESA CCI Land Cover, MODIS IGBP, Hansen/GFW on global tree cover loss, and Terra-I) to estimate the likely extent of current forests (circa 2018) and forest cover loss from 2001-2018, for which data was available. This assumes that no single approach or data source can capture major trends everywhere; therefore, an all-available data approach is needed to overcome shortcomings of individual datasets. The main shortcomings of this approach, however, are that it does not account for forest gains, tends to underestimate the conversion in dry forest ecosystems, and lacks an explicit assessment of uncertainties across the different datasets.

    “Forest cover loss” in the all-available data analysis consists of observations (pixels) changing from forest to non-forest at any time during 2000 to 2018. The spatial resolution chosen was 250 m, given the original resolutions of the datasets incorporated and on the understanding that forest areas should be a minimum of 250 x 250 m (6.25 ha) to contain the functional attributes of a forest (e.g. species distribution, ecology, ecosystem services), rather than depicting individual trees or groups of trees.

    According to our analysis, about 20% of total forest cover loss takes place in core forest, which we label “primary forest loss”, while the remaining 80% results from the conversion of edge and patched forests, which is labelled as “secondary forest loss”. Two thirds of total forest cover loss in the period from 2000-2018 occurred in the tropics and subtropics, followed by boreal and temperate forests. A portion of the loss in temperate and boreal forests will not be permanent and might refer to other types of natural forest disturbances produced by insects, fire, and severe weather, as well as by felling of plantations or semi-natural forests as part of forest management.

    Much tropical forest cover loss is in South America and Asia, while subtropical forest cover loss is mainly in South America and Africa. When looking at countries by income levels, as defined by the World Bank, much of deforestation takes place in upper-middle and lower-middle-income countries. At the risk of simplifying, this suggests an increasing pressure on forests in the transition that occurs when countries increase economic development. In the tropics, upper-middle income countries dominate forest cover loss in South America, due to the influence of Brazil, and lower-middle income countries in Asia, due to the influence of Indonesia. Forest cover loss in the subtropics occurs mainly in Brazil and Argentina in South America, many lower-middle income countries in South America, and lower-income countries in sub-Saharan Africa. Most temperate and boreal forest cover loss, likely not all permanent, occurs in high-income countries (Russia) and North America (United States and Canada).

    Unfortunately, this data does not identify changes over time or land use interactions among countries. Reduced forest cover loss in some mainly high-income countries, except North America, is associated with forest cover loss, particularly in lower- and upper-middle income countries in the tropics. Interactions are informed by the “forest transition” effect. Forest transition dynamics occur when net forest restoration replaces net forest cover loss in some specific place. The countries that underwent a forest transition that reduced forest loss and encouraged regrowth may have placed additional pressure on forests outside their borders, thus displacing deforestation. The debate on forest transitions and leakage is quite controversial given its policy implications.

    Recent analysis, based on a land-balance model that quantifies deforestation due to global trade at country level in the tropics and sub-tropics, linked to a country-to-country trade model, found that from 2005-2013, 62% of forest loss was caused by commercial agriculture, pasture and plantations. About 26% of total deforestation was attributed to international demand, 87% of which was exported to countries with decreasing deforestation or increasing forest cover in Europe and Asia (i.e. China, India). Some of this displacement pressure may be reduced by land intensification.

    Global patterns of forest fragmentation

    In this analysis we consider forest degradation alongside forest cover loss. Degradation is a multi-factorial phenomenon that includes, amongst others, loss of native species, appearance of invasive species, pollution damage, structural changes, selective timber removal and many more. Here we use fragmentation as a proxy that can be detected through remote sensing; this is a critical aspect of forest degradation but does not capture all aspects. The change in spatial pattern and structure by fragmentation of forest into smaller patches or “islands” damages forest ecosystem services such as carbon storage and climate mitigation, regulation, water provision, and habitat for biodiversity. These impacts are created by changes at forest edges, which include increased exposure to different climate, fire, wind, mortality, and human access. The increasing isolation of forest patches contributes to long-term changes in biodiversity, including species richness and productivity, creating fundamental changes in forest ecosystems.

    We evaluated the fragmentation of forests using morphological spatial pattern analysis (MSPA) assessed on the two all-available data global forest cover maps corresponding to 2000 and 2018, to determine forest cover transitions between different types of fragmentation classes (i.e. stable core, inner edges, outer edges, and patches). Changes between fragmentation classes over time are defined as primary and secondary degradation based on their initial state, in contrast to forests which remain in the same fragmentation class as stable core, inner edge, outer edge, and patch. In this definition, primary degradation is a result of the fragmentation of core forests into forest with more edges, reducing the area of continuous forest extent, and resulting in greater losses of carbon and associated ecosystem services such as biodiversity present in intact forests. Secondary degradation is the conversion of edge forests into more fragmented classes, occurring in secondary forests which may already be degraded and are more accessible and easier to deforest.

  7. Battery Management System

    • kaggle.com
    zip
    Updated May 17, 2025
    Cite
    MicAmadi (2025). Battery Management System [Dataset]. https://www.kaggle.com/datasets/micamadi/synthetic-distributed-battery-management-system
    Explore at:
    zip (15253 bytes)
    Dataset updated
    May 17, 2025
    Authors
    MicAmadi
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    About Dataset

    Dataset Summary

    This synthetic dataset emulates a distributed, adaptive Battery Management System (BMS) architecture across multiple modules. It contains timestamped measurements and diagnostics for four battery modules over sequential 5‑minute intervals. Key attributes include voltage, temperature, current, power, control commands, health indicators, latency, and anomaly flags—providing a rich foundation for time‑series forecasting, anomaly detection, control‑chart monitoring, and reliability analysis.

    Invitation for Feedback

    This early release is intended to showcase the structure and potential workflows for distributed BMS analytics. Your insights on field completeness, data realism, schema design, and potential additional parameters (e.g., per‑cell voltages, extended telemetry) are highly valued. Please share feedback via the project repository’s issue tracker or by email. Your suggestions will guide enhancements for future versions, including expanded module counts, varying sampling rates, and integration with additional sensor data.

    Data Fields

    Each row represents one 5‑minute interval for a given module. Noteworthy fields include:

    • timestamp: ISO‑formatted datetime of record (UTC).
    • module_id: Identifier for each battery module (e.g., Module_A).
    • cell_voltage_v: Average cell voltage (V).
    • cell_temperature_c: Mean cell temperature (°C).
    • module_current_a: Module charge/discharge current (A).
    • module_power_kw: Instantaneous power output/input (kW).
    • converter_command_pct: Power converter duty cycle (%) commanded per module.
    • soc_pct: State‑of‑Charge (%) at end of interval.
    • soh_pct: State‑of‑Health (%) relative to nominal capacity.
    • anomaly_score_pct: Unsupervised anomaly score (0–100%).
    • diagnostic_flag: Boolean flag indicating potential fault (true if anomaly_score > threshold; see the sketch after this list).
    • latency_ms: Cloud communication round‑trip latency (ms).
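
    For instance, the diagnostic flag can be re-derived from the anomaly score with a simple threshold (the file name and the 80% cut-off are assumptions, not documented values for this dataset):

```python
import pandas as pd

df = pd.read_csv("synthetic_bms.csv", parse_dates=["timestamp"])   # assumed file name

THRESHOLD_PCT = 80.0   # assumed cut-off; the dataset's actual threshold is not stated here
df["derived_flag"] = df["anomaly_score_pct"] > THRESHOLD_PCT

# Compare the derived flag with the shipped diagnostic_flag, per module.
print(df.groupby("module_id")[["diagnostic_flag", "derived_flag"]].mean())
```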

    Curation Rationale

    A distributed, adaptive BMS elevates module‑level intelligence to optimize safety, performance, and lifetime. To support development and research:

    • Real‑world alignment: Fields mirror those in advanced BMS deployments—per‑module sensors, converter commands, and cloud telemetry.
    • Analytical readiness: Time‑series indices and health indicators facilitate direct use in forecasting, SPC, and anomaly workflows.
    • Extendibility: Schema is designed for easy augmentation (e.g., per‑cell voltages, voltage imbalance metrics, additional sensors).

    This initial dataset focuses on core metrics; future releases may include finer sampling, environmental context (e.g., ambient temperature), and firmware versioning.

    Intended Use Cases

    • Time‑Series Forecasting: Model SoH/SOC degradation and predict remaining useful life.
    • Anomaly Detection: Develop and benchmark algorithms (control charts, isolation forests, autoencoders); a control-chart sketch follows this list.
    • Control‑Loop Simulation: Test adaptive converter command strategies and closed‑loop performance.
    • Reliability Analysis: Perform survival curves and failure‑mode discovery on diagnostic flags.
    • Feature Engineering & Visualization: Explore PCA, correlation structures, and multivariate controls.
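
    A minimal control-chart sketch on cell voltage, continuing from the frame loaded above (the window length and the 3-sigma limits are conventional choices, not dataset specifications):

```python
# Flag points outside mean ± 3 standard deviations of a trailing window, per module.
window = 24   # 24 five-minute intervals = 2 hours (assumed choice)
df = df.sort_values(["module_id", "timestamp"])
grouped = df.groupby("module_id")["cell_voltage_v"]
rolling_mean = grouped.transform(lambda s: s.rolling(window, min_periods=window).mean())
rolling_std = grouped.transform(lambda s: s.rolling(window, min_periods=window).std())
df["out_of_control"] = (df["cell_voltage_v"] - rolling_mean).abs() > 3 * rolling_std
print(int(df["out_of_control"].sum()), "out-of-control points")
```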

    Licensing & Attribution

    This synthetic dataset is released under the MIT License. For any publications, presentations, or derivative works, please cite:

    Synthetic Distributed Adaptive BMS Dataset (v1.0), 2025.

    Contributions and issue reports are welcome via the project’s GitHub repository.

  8. Type, posture, and the number of observations in the IMU Posture dataset.

    • plos.figshare.com
    xls
    Updated Jun 21, 2023
    Cite
    Marinara Marcato; Salvatore Tedesco; Conor O’Mahony; Brendan O’Flynn; Paul Galvin (2023). Type, posture, and the number of observations in the IMU Posture dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0286311.t004
    Explore at:
    xls
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Marinara Marcato; Salvatore Tedesco; Conor O’Mahony; Brendan O’Flynn; Paul Galvin
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Type, posture, and the number of observations in the IMU Posture dataset.

