As of 2025, nearly 63 percent of businesses worldwide were affected by ransomware attacks. This figure represents a decrease from the previous year and was by far the lowest share reported since 2020. Overall, since 2018, more than half of survey respondents each year stated that their organizations had been victimized by ransomware.

Most targeted industries
In 2024, the critical manufacturing industry in the United States was once again the sector most targeted by ransomware attacks. Organizations in this industry experienced 258 such attacks in the measured year. Healthcare and the public health sector ranked second, followed by government facilities, with 238 and 220 attacks, respectively.

Ransomware in the manufacturing industry
The manufacturing industry, along with its subindustries, is a constant target of ransomware attacks, which cause data loss, business disruption, and reputational damage. Such cyberattacks are often international in nature and politically motivated. In 2024, exploited vulnerabilities were the leading cause of ransomware attacks in the manufacturing industry.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Ransomware, leveraging sophisticated encryption techniques, poses a significant threat by encrypting crucial data, thereby rendering it inaccessible. The proliferation of diverse ransomware variants has caused considerable harm to governments, corporations, and individual users alike. Despite the increasing prevalence of cyber threats, existing solutions often struggle with real-time detection and early identification of ransomware families. To address this challenge, we introduce FCG-RFD, a novel benchmark dataset featuring extensive Function Call Graphs (FCGs) tailored for ransomware family detection. Given the constantly evolving nature of malware, antivirus scanners face ongoing challenges, necessitating access to recent and updated datasets. Our dataset comprises 8,095 samples sourced from reputable repositories including VirusSamples, VirusShare, VirusSign, theZoo, and MalwareBazaar. Additionally, we include 8,020 benign files obtained from trusted sources such as the Microsoft Store and Softonic. Through FCG-RFD, we aim to facilitate more robust and timely detection of ransomware families, ultimately enhancing cybersecurity measures against this pervasive threat.
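As a rough illustration of how function call graphs can be turned into features for family detection, here is a minimal sketch using networkx on a toy graph; the edge list and feature choices are illustrative assumptions, not the FCG-RFD extraction pipeline.

import networkx as nx

# Toy function call graph: nodes are functions, edges are call relations.
# In FCG-RFD the graphs come from real binaries; this edge list is made up.
edges = [
    ("start", "find_files"),
    ("find_files", "read_file"),
    ("start", "gen_key"),
    ("gen_key", "encrypt_file"),
    ("read_file", "encrypt_file"),
    ("encrypt_file", "write_ransom_note"),
]
fcg = nx.DiGraph(edges)

# Simple graph-level features a classifier could consume.
features = {
    "num_functions": fcg.number_of_nodes(),
    "num_calls": fcg.number_of_edges(),
    "density": nx.density(fcg),
    "max_out_degree": max(d for _, d in fcg.out_degree()),
}
print(features)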
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is rooted in a study aimed at unveiling the origins and motivations behind the creation of malware repositories on GitHub. Our research embarks on an innovative journey to dissect the profiles and intentions of GitHub users who have been involved in this dubious activity.
Employing a robust methodology, we meticulously identified 14,000 GitHub users linked to malware repositories. By leveraging advanced large language model (LLM) analytics, we classified these individuals into distinct categories based on their perceived intent: 3,339 were deemed Malicious, 3,354 Likely Malicious, and 7,574 Benign, offering a nuanced perspective on the community behind these repositories.
Our analysis penetrates the veil of anonymity and obscurity often associated with these GitHub profiles, revealing stark contrasts in their characteristics. Malicious authors were found to typically possess sparse profiles focused on nefarious activities, while Benign authors presented well-rounded profiles, actively contributing to cybersecurity education and research. Those labeled as Likely Malicious exhibited a spectrum of engagement levels, underlining the complexity and diversity within this digital ecosystem.
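The paper's actual prompts and classification pipeline are not included in this release; purely as a hedged sketch of what LLM-based intent classification of a GitHub profile could look like (the model name, prompt wording, and profile fields below are assumptions), one might do something like the following.

from openai import OpenAI  # requires an OpenAI API key in the environment

client = OpenAI()

# Hypothetical profile summary; the real pipeline's input format is not published here.
profile_summary = (
    "Login: exampleuser; Bio: 'red team tools'; Public repos: 3; "
    "Followers: 1; Repo names: ['keylogger', 'rat-builder', 'notes']"
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model choice
    messages=[
        {"role": "system",
         "content": "Classify the GitHub profile as Malicious, Likely Malicious, or Benign. "
                    "Answer with the label only."},
        {"role": "user", "content": profile_summary},
    ],
)
print(resp.choices[0].message.content)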
We offer two datasets in this paper. First, a list of malware repositories: we collected and extended the malware repositories on GitHub in 2022 following the original papers. Second, a CSV file with the GitHub users' profile information and their maliciousness classification labels.
malware_repos.txt
Purpose: This file contains a curated list of GitHub repositories identified as containing malware. These repositories were identified following the methodology outlined in the research paper "SourceFinder: Finding Malware Source-Code from Publicly Available Repositories in GitHub."
Contents: The file is structured as a simple text file, with each line representing a unique repository in the format username/reponame. This format allows for easy identification and access to each repository on GitHub for further analysis or review.
Usage: The list serves as a critical resource for researchers and cybersecurity professionals interested in studying malware, understanding its distribution on platforms like GitHub, or developing defense mechanisms against such malicious content.
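Since each line follows the username/reponame convention, the list can be consumed with a few lines of Python; the snippet below simply builds GitHub URLs from the file described above.

# Read the repository list and build GitHub URLs for further analysis.
with open("malware_repos.txt") as fh:
    repos = [line.strip() for line in fh if line.strip()]

urls = [f"https://github.com/{repo}" for repo in repos]
print(f"{len(repos)} repositories listed, e.g. {urls[:3]}")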
obfuscated_github_user_dataset.csv
Purpose: Accompanying the list of malware repositories, this CSV file contains detailed, albeit obfuscated, profile information of the GitHub users who authored these repositories. The obfuscation process has been applied to protect user privacy and comply with ethical standards, especially given the sensitive nature of associating individuals with potentially malicious activities.
Contents: The dataset includes several columns representing different aspects of user profiles, such as obfuscated identifiers (e.g., ID, login, name), contact information (e.g., email, blog), and GitHub-specific metrics (e.g., followers count, number of public repositories). Notably, sensitive information has been masked or replaced with generic placeholders to prevent user identification.
Usage: This dataset can be instrumental for researchers analyzing behaviors, patterns, or characteristics of users involved in creating malware repositories on GitHub. It provides a basis for statistical analysis, trend identification, or the development of predictive models, all while upholding the necessary ethical considerations.
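A minimal sketch of loading the CSV with pandas and summarizing profiles per label follows; the column names used below (label, followers_count, public_repos) are assumptions and may differ from the actual header, so inspect the columns first.

import pandas as pd

users = pd.read_csv("obfuscated_github_user_dataset.csv")

# Column names are assumed for illustration; check users.columns against the real file.
summary = (
    users.groupby("label")[["followers_count", "public_repos"]]
         .median()
         .rename(columns=lambda c: f"median_{c}")
)
print(summary)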
This study seeks to obtain data that will help address gaps in machine learning-based malware research. The specific objective is to build a benchmark dataset of Windows operating system API calls for various malware. This is the first study to use metamorphic malware to build sequential API call data. It is hoped that this research will contribute to a deeper understanding of how metamorphic malware change their behavior (i.e., their API calls) by inserting meaningless opcodes with their own disassembler/assembler components.
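One common way to consume such sequential API call data is to treat each trace as a "sentence" of API names and build n-gram features; a minimal sketch with made-up call sequences (not samples from this dataset) could look like this.

from sklearn.feature_extraction.text import CountVectorizer

# Each sample is a space-separated sequence of Windows API calls (made up here).
traces = [
    "CreateFileW WriteFile CloseHandle RegSetValueExW",
    "VirtualAlloc WriteProcessMemory CreateRemoteThread",
]

# Unigrams and bigrams over API names capture short behavioral patterns.
vectorizer = CountVectorizer(ngram_range=(1, 2), token_pattern=r"[^ ]+")
X = vectorizer.fit_transform(traces)
print(X.shape, vectorizer.get_feature_names_out()[:5])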
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Android has become the most popular operating system on mobile devices, making it a prime target for threat actors creating malware. The research conducted by the author aims to detect reverse TCP exploits in network traffic. The tools used are Metasploit for Android, Termux, PCAPdroid, Wireshark, OpenVPN, and Apktool, in both terminal and application versions. The supporting hardware for this research consists of a smartphone, a VPS, a MikroTik router, and a laptop.
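As a very small illustration (not the author's actual detection method), one could scan a capture for TCP handshakes toward Metasploit's conventional default listener port 4444 with scapy; the capture file name below is a placeholder.

from scapy.all import rdpcap, IP, TCP

packets = rdpcap("capture.pcap")  # placeholder file name

# Flag TCP SYNs toward port 4444, Metasploit's default reverse-TCP listener port.
suspects = set()
for pkt in packets:
    if IP in pkt and TCP in pkt and pkt[TCP].dport == 4444 and pkt[TCP].flags & 0x02:
        suspects.add((pkt[IP].src, pkt[IP].dst))

for src, dst in suspects:
    print(f"possible reverse TCP handshake: {src} -> {dst}:4444")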
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The rapid and widespread increase of Android malware presents substantial obstacles to cybersecurity research. To support malware research at scale, we present the MH-1M dataset, a compilation of 1,340,515 APK samples. The dataset encompasses a wide range of attributes and metadata, offering a comprehensive perspective, and the use of the VirusTotal API provides threat assessments that amalgamate various detection techniques. Our research indicates that MH-1M is a highly current dataset that provides valuable insights into the changing nature of malware.

MH-1M consists of 23,247 features that cover a wide range of application behavior, from intents::accept to apicalls::landroid/window/splashscreenview.remove. The features fall into four primary categories:

Feature type | Count
---|---
API calls | 22,394
Intents | 407
OPCodes | 232
Permissions | 214

The dataset is stored efficiently, occupying 29.0 GB of memory, a substantial yet manageable size. It comprises 1,221,421 benign applications and 119,094 malware applications, a realistic class distribution for malware detection and analysis.

The MH-1M repository also offers a wide variety of APK metadata, providing useful insight into the development of malicious software over a period of more than ten years. The metadata includes SHA256 hashes, file names, package names, compilation APIs, and various other details. The GitHub repository contains over 400 GB of data, making it the largest and most comprehensive dataset available for advancing research and development in Android malware detection.
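Because every feature name is prefixed with its category (e.g. apicalls::, intents::), selecting one feature family from a loaded matrix is straightforward. A minimal sketch follows, assuming the features are available as a pandas DataFrame; the loading step depends on the repository's file layout and the prefixes for opcodes and permissions are assumed to follow the same pattern.

import pandas as pd

def select_family(X: pd.DataFrame, prefix: str) -> pd.DataFrame:
    """Return only the columns belonging to one feature family, e.g. 'apicalls::'."""
    return X.loc[:, [c for c in X.columns if c.startswith(prefix)]]

# Example usage, assuming X is a DataFrame of MH-1M features (one row per APK):
# api_features = select_family(X, "apicalls::")
# intent_features = select_family(X, "intents::")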
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Machine learning-based intrusion detection requires suitable and realistic data sets for training and testing. However, data sets that originate from real networks are rare. Network data is considered privacy sensitive, and the purposeful introduction of malicious traffic is usually not possible. In this paper we introduce a labeled data set captured at a smart factory located in Vienna, Austria, during normal operation and during penetration tests with different attack types. The data set contains 173 GB of PCAP files, which represent 16 days (395 hours) of factory operation. It includes MQTT, OPC UA, and Modbus/TCP traffic. The captured malicious traffic originated from a professional penetration tester who performed two types of attacks: (a) aggressive attacks that are easier to detect and (b) stealthy attacks that are harder to detect. Our data set includes the raw PCAP files and extracted flow data. Labels for packets and flows indicate whether packets (or flows) originated from a specific attack or from benign communication. We describe the methodology for creating the data set, conduct an analysis of the data, and provide detailed information about the recorded traffic itself. The data set is freely available to support reproducible research and the comparability of results in the area of intrusion detection in industrial networks.

File description:
a_day1, a_day2, s_day1, s_day2, tf_a and tf_s: main data set; files starting with "tf" are training files containing only benign, operational data, and all other files are attack files containing both operational data and attack data.
images.zip: contains descriptive images about the data.
extractions.zip: contains extracted packets and flows, in both labeled and unlabeled form.
a_day_tuesday_dos.zip: an additional day of attack traffic containing benign and attack data, including a DoS attack. This day is not labeled.
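A small sketch of a first look at the PCAPs, counting packets per industrial protocol by their standard TCP ports (Modbus/TCP 502, MQTT 1883, OPC UA 4840); the capture file name is a placeholder for one of the files described above.

from collections import Counter
from scapy.all import rdpcap, TCP

PORTS = {502: "Modbus/TCP", 1883: "MQTT", 4840: "OPC UA"}

packets = rdpcap("a_day1.pcap")  # placeholder name; see the file description above
counts = Counter()
for pkt in packets:
    if TCP in pkt:
        for port in (pkt[TCP].sport, pkt[TCP].dport):
            if port in PORTS:
                counts[PORTS[port]] += 1

print(counts)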
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
To cite the dataset, please reference it as: "Stratosphere Laboratory. A labeled dataset with malicious and benign IoT network traffic. January 22nd. Agustin Parmisano, Sebastian Garcia, Maria Jose Erquiaga. https://www.stratosphereips.org/datasets-iot23"
This dataset includes labels that explain the linkages between flows connected with harmful or possibly malicious activity to provide network malware researchers and analysts with more thorough information. These labels were painstakingly created at the Stratosphere labs using malware capture analysis.
We present a concise explanation of the labels used for the identification of malicious flows, based on manual network analysis, below:
Attack: This label signifies the occurrence of an attack originating from an infected device directed towards another host. Any flow that endeavors to exploit a vulnerable service, discerned through payload and behavioral analysis, falls under this classification. Examples include brute force attempts on telnet logins or header-based command injections in GET requests.
Benign: The "Benign" label denotes connections where no suspicious or malicious activities have been detected.
C&C (Command and Control): This label indicates that the infected device has established a connection with a Command and Control server. This observation is rooted in the periodic nature of connections or activities such as binary downloads or the exchange of IRC-like or decoded commands.
DDoS (Distributed Denial of Service): "DDoS" is assigned when the infected device is actively involved in a Distributed Denial of Service attack, identifiable by the volume of flows directed towards a single IP address.
FileDownload: This label signifies that a file is being downloaded to the infected device. It is determined by examining connections with response bytes exceeding a specified threshold (typically 3KB or 5KB), often in conjunction with known suspicious destination ports or IPs associated with Command and Control servers.
HeartBeat: "HeartBeat" designates connections where packets serve the purpose of tracking the infected host by the Command and Control server. Such connections are identified through response bytes below a certain threshold (typically 1B) and exhibit periodic similarities. This is often associated with known suspicious destination ports or IPs linked to Command and Control servers.
Mirai: This label is applied when connections exhibit characteristics resembling those of the Mirai botnet, based on patterns consistent with common Mirai attack profiles.
Okiru: Similar to "Mirai," the "Okiru" label is assigned to connections displaying characteristics of the Okiru botnet. The parameters for this label are the same as for Mirai, but Okiru is a less prevalent botnet family.
PartOfAHorizontalPortScan: This label is employed when connections are involved in a horizontal port scan aimed at gathering information for potential subsequent attacks. The labeling decision hinges on patterns such as shared ports, similar transmitted byte counts, and multiple distinct destination IPs among the connections.
Torii: The "Torii" label is used when connections exhibit traits indicative of the Torii botnet, with labeling criteria similar to those used for Mirai, albeit in the context of a less common botnet family.
Field Name | Description | Type |
---|---|---|
ts | The timestamp of the connection event. | time |
uid | A unique identifier for the connection. | string |
id.orig_h | The source IP address. | addr |
id.orig_p | The source port. | port |
id.resp_h | The destination IP address. | addr |
id.resp_p | The destination port. | port |
proto | The network protocol used (e.g., 'tcp'). | enum |
service | The service associated with the connection. | string |
duration | The duration of the connection. | interval |
orig_bytes | The number of bytes sent from the source to the destination. | count |
resp_bytes | The number of bytes sent from the destination to the source. | count |
conn_state | The state of the connection. | string |
local_orig | Indicates whether the connection is considered local or not. | bool |
local_resp | Indicates whether the connection is considered... |
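A minimal sketch of applying one of the heuristics described above (the FileDownload-style response-size threshold) to flow records that use the field names from this table; the toy records below are made up, and the real labeled flow files may contain additional columns.

import pandas as pd

# Toy flow records using the field names from the table above.
flows = pd.DataFrame([
    {"uid": "C1", "id.orig_h": "192.168.1.10", "id.resp_h": "203.0.113.5",
     "id.resp_p": 80, "orig_bytes": 120, "resp_bytes": 9000},
    {"uid": "C2", "id.orig_h": "192.168.1.10", "id.resp_h": "203.0.113.9",
     "id.resp_p": 23, "orig_bytes": 40, "resp_bytes": 0},
])

# FileDownload-style heuristic: response size above a 3 KB threshold.
downloads = flows[flows["resp_bytes"] > 3 * 1024]
print(downloads[["uid", "id.resp_h", "resp_bytes"]])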
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Brenner, B., Fabini, J., Offermanns, M., Semper, S., & Zseby, T. (2024). Malware communication in smart factories: A network traffic data set. Computer Networks, 255, 110804. Or in BibTeX: @article{brenner2024malware, title={Malware communication in smart factories: A network traffic data set}, author={Brenner, Bernhard and Fabini, Joachim and Offermanns, Magnus and Semper, Sabrina and Zseby, Tanja}, journal={Computer Networks}, volume={255}, pages={110804}, year={2024}, publisher={Elsevier}}

Context and methodology
Machine learning-based intrusion detection requires suitable and realistic data sets for training and testing. However, data sets that originate from real networks are rare. Network data is considered privacy-sensitive, and the purposeful introduction of malicious traffic is usually not possible. In this paper, we introduce a labeled data set captured at a smart factory located in Vienna, Austria, during normal operation and during penetration tests with different attack types. The data set contains 173 GB of PCAP files, representing 16 days (395 hours) of factory operation. It includes MQTT, OPC UA, and Modbus/TCP traffic. The captured malicious traffic originated from a professional penetration tester who performed two types of attacks: (a) aggressive attacks that are easier to detect, and (b) stealthy attacks that are harder to detect. Our data set includes the raw PCAP files and extracted flow data. Labels for packets and flows indicate whether they originated from a specific attack or from benign communication. We describe the methodology for creating the dataset, conduct an analysis of the data, and provide detailed information about the recorded traffic itself. The dataset is freely available to support reproducible research and the comparability of results in the area of intrusion detection in industrial networks.

Technical details
readme.txt: information about the data collection, the data format, and the software and versions necessary to access it.
GNU General Public License v3.0: https://www.gnu.org/licenses/gpl-3.0-standalone.html
The WinMET dataset contains the execution traces generated with the CAPE sandbox after analyzing several malware samples. The execution traces are valid JSON files that contain the spawned processes, the sequence of WinAPI and system calls invoked by each process, their parameters, their return values, and the OS resources accessed, amongst many other details.
Please use this DOI reference, which always points to the latest WinMET version: https://doi.org/10.5281/zenodo.12647555
All 7z files are password protected. The password is: infected
The execution traces (.json files) are split into 5 volumes. Checksums of the five volumes:
bf51181eafc8452090bb6ce9f47b6714
aee86b4591a46c69b0d027de80ff1011
996723774909bc2e6745382697317460
4f5acbabeb9d24c96dadef71f56bd916
b3d15c97990dd0dfb0d94e369f486025

cape_report_to_label_mapping.json contains the mapping of each report to its corresponding label as assigned by the CAPE sandbox labeling algorithm, sorted in descending order (by the number of reports belonging to each label/family).

avclass_report_to_label_mapping.json contains the mapping of each report to its corresponding label as assigned by AVClass, sorted in descending order (by the number of reports belonging to each label/family).

reports_consensus_label.json contains both labels (CAPE and AVClass) for each execution trace.

If you use this dataset, cite it as follows:
Raducu, R., Villagrasa-Labrador, A., Rodríguez, R. J., & Álvarez, P. (2025). WinMET Dataset [Data set]. Zenodo. https://doi.org/10.5281/zenodo.12647555
BibTex:
@misc{WinMET_dataset,
author = {Raducu, Razvan and Villagrasa-Labrador, Alain and Rodríguez, Ricardo J. and Álvarez, Pedro},
title = {{WinMET Dataset: Windows Malware Execution Traces}},
howpublished = {Zenodo, dataset},
year = {2025},
doi = {10.5281/zenodo.12647555},
url = {https://doi.org/10.5281/zenodo.12647555}
}
This dataset was generated using the MALVADA framework (https://github.com/reverseame/MALVADA/tree/main), which you can read more about in our publication: https://doi.org/10.1016/j.softx.2025.102082. The article also provides insights about the contents of this dataset.
Razvan Raducu, Alain Villagrasa-Labrador, Ricardo J. Rodríguez, Pedro Álvarez, MALVADA: A framework for generating datasets of malware execution traces, SoftwareX, Volume 30, 2025, 102082, ISSN 2352-7110, https://doi.org/10.1016/j.softx.2025.102082. (https://www.sciencedirect.com/science/article/pii/S2352711025000494)
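A minimal sketch of how a single execution trace and the consensus labels could be inspected after extracting the archives; the report file name is a placeholder, and the JSON paths used (behavior -> processes -> calls -> api) follow the usual CAPE report layout and should be verified against the actual files.

import json

# Placeholder file name; WinMET reports are JSON files inside the 7z volumes.
with open("report_example.json") as fh:
    report = json.load(fh)

# CAPE-style layout (verify against the real files): behavior -> processes -> calls -> api.
for proc in report.get("behavior", {}).get("processes", []):
    apis = [call.get("api") for call in proc.get("calls", [])]
    print(proc.get("process_name"), len(apis), apis[:5])

# The per-report family labels live in the mapping files described above.
with open("reports_consensus_label.json") as fh:
    labels = json.load(fh)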
Family-level statistics (and many more) can be obtained by analyzing the WinMET dataset with the MALVADA framework (https://github.com/reverseame/MALVADA); the per-report family labels are recorded in reports_consensus_label.json.*

* Execution traces with no label are assigned to the "(n/a)" family, which is omitted here.

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
* MalOSS: subset of malicious packages from MalOSS dataset [RQ1, RQ2, RQ4]
* BackStabber: subset of malicious packages from the Backstabber's Knife Collection [RQ1, RQ2, RQ4]
* MalRegistry: subset of malicious packages from Python MalRegistry dataset [RQ1, RQ2, RQ4]
* Popular: a collection of the top-100 most popular Python packages from PyPI [RQ1, RQ2, RQ3, RQ4]
* Trusted: a collection of packages from trusted organizations hosted in PyPI [RQ1, RQ2, RQ3, RQ4]
* DataKund: a collection of newly identified malicious packages from PyPI [Case Study]
* Recent: a collection of packages that were recently (2024 Oct) added to PyPI [Macaron Case Study]
https://www.datainsightsmarket.com/privacy-policy
The Database Security Audits Services market is experiencing robust growth, driven by the increasing reliance on databases across various industries and the escalating threat landscape. The market's expansion is fueled by several key factors. Firstly, stringent data privacy regulations like GDPR and CCPA are compelling organizations to prioritize database security and conduct regular audits to ensure compliance. Secondly, the rising frequency and sophistication of cyberattacks targeting databases, including ransomware and data breaches, are prompting proactive security measures, including comprehensive audits. Thirdly, the shift towards cloud-based databases introduces new security challenges and necessitates specialized audit services to address vulnerabilities inherent in cloud environments. The market is segmented by application (Financial, Medical, Telecom, Government, Manufacturing, Others) and type (Cloud-based, On-premise), with cloud-based services witnessing faster adoption due to the expanding cloud computing market. North America and Europe currently hold significant market share, but regions like Asia-Pacific are exhibiting rapid growth potential owing to increasing digitalization and adoption of advanced technologies. Major players are investing in innovative solutions and expanding their service portfolios to cater to diverse client needs, fostering competition and driving market evolution. While the market faces restraints like high implementation costs and a shortage of skilled professionals, the overall growth trajectory remains positive, propelled by the escalating demand for robust database security and compliance.

The forecast period (2025-2033) anticipates continued expansion, potentially exceeding a compound annual growth rate (CAGR) of 15%. This optimistic projection is based on several factors. First, the ongoing digital transformation across industries will lead to increased reliance on databases and, subsequently, heightened demand for security audits. Second, the continuous evolution of cyber threats will necessitate more frequent and comprehensive audits, further boosting market growth. Third, the market will benefit from technological advancements in database security tools and methodologies, enabling more efficient and effective audits. However, challenges remain, particularly in addressing the skills gap and ensuring the affordability of these services for smaller organizations. Nevertheless, the long-term outlook for the Database Security Audits Services market remains strongly positive, with significant opportunities for market expansion and innovation.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
COSOCO: Compromised Software Containers Image Dataset
Paper: Malware Detection in Docker Containers: An Image is Worth a Thousand Logs
Dataset Documentation: COSOCO Dataset Documentation
Dataset Description
COSOCO (Compromised Software Containers) is a synthetic dataset of 3364 images representing benign and malware-compromised software containers. Each image in the dataset represents a dockerized software container that has been converted to an image using common… See the full description on the dataset page: https://huggingface.co/datasets/k3ylabs/cosoco-image-dataset.
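Since the dataset is hosted on the Hugging Face Hub, it can be pulled with the datasets library; the split name is an assumption and should be checked against the dataset page.

from datasets import load_dataset

# Dataset ID as given above; the split and column names may differ.
cosoco = load_dataset("k3ylabs/cosoco-image-dataset", split="train")
print(cosoco)            # shows the available columns and number of rows
print(cosoco[0].keys())  # inspect one sample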
These datasets (DREBIN and AndroZoo) are contributions to the paper "Fast & Furious: On the Modelling of Malware Detection as an Evolving Data Stream". If you use them in your work, please cite our paper using the BibTeX below:
@article{CESCHIN2022118590,
title = {Fast & Furious: On the modelling of malware detection as an evolving data stream},
journal = {Expert Systems with Applications},
pages = {118590},
year = {2022},
issn = {0957-4174},
doi = {https://doi.org/10.1016/j.eswa.2022.118590},
url = {https://www.sciencedirect.com/science/article/pii/S0957417422016463},
author = {Fabrício Ceschin and Marcus Botacin and Heitor Murilo Gomes and Felipe Pinagé and Luiz S. Oliveira and André Grégio},
keywords = {Machine learning, Data streams, Concept drift, Malware detection, Android}
}
Both datasets are saved in the parquet file format. To read them, use the following code:
import pandas as pd

data_drebin = pd.read_parquet("drebin_drift.parquet.zip")
data_androzoo = pd.read_parquet("androbin.parquet.zip")
Note that these datasets are different from their original versions. The original DREBIN dataset does not contain the samples' timestamps, which we collected using VirusTotal API. Our version of the AndroZoo dataset is a subset of reports from their dataset previously available in their APK Analysis API, which was discontinued.
The DREBIN dataset is composed of ten textual attributes from Android APKs (lists of API calls, permissions, URLs, etc.); it is publicly available to download and contains 123,453 benign and 5,560 malicious Android applications. Their distribution over time is shown below.
[Figure: DREBIN dataset distribution by month (https://i.imgur.com/IGKOMtE.png)]
The AndroZoo dataset is a subset of Android application reports provided by the AndroZoo API, composed of eight textual attributes (resource names, source code classes and methods, manifest permissions, etc.) and containing 213,928 benign and 70,340 malicious applications. The distribution over time of our AndroZoo subset, which keeps the same goodware and malware distribution as the original dataset of roughly 10 million apps, is shown below.
[Figure: AndroZoo dataset distribution by month (https://i.imgur.com/8zxH3M4.png)]
The source code for all the experiments shown in the paper is also available here on Kaggle (note that the experiments using the AndroZoo dataset did not run in the Kaggle environment due to high memory usage).
Experiment 1 (The Best-Case Scenario for AVs - ML Cross-Validation)
Here we classify all samples together to compare which feature extraction algorithm is the best and to report baseline results. We tested several parameters for both algorithms and fixed the vocabulary size at 100 for TF-IDF (top-100 features ordered by term frequency) and created projections with 100 dimensions for Word2Vec, resulting in 1,000 and 800 features for each app, for DREBIN and AndroZoo, respectively. All results are reported after 10-fold cross-validation procedures, a method commonly used in ML to evaluate models because its results are less prone to biases (note that we are training new classifiers and feature extractors at every iteration of the cross-validation process). In practice, folding the dataset implies that the AV company has a mixed view of both past and future threats, despite temporal effects, which is the best scenario for AV operation and ML evaluation.
Source Codes: DREBIN TFIDF | DREBIN W2V | ANDROZOO TFIDF | ANDROZOO W2V
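The experiment design above can be reproduced in spirit with a scikit-learn pipeline, so that the TF-IDF vocabulary is refit inside every fold exactly as described; the classifier choice and toy data below are illustrative assumptions, not the paper's exact setup.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

texts = ["SEND_SMS INTERNET ...", "READ_CONTACTS INTERNET ..."] * 50   # toy attribute strings
labels = [1, 0] * 50                                                   # 1 = malware, 0 = goodware

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=100)),   # top-100 terms, as in the experiment
    ("clf", RandomForestClassifier(n_estimators=100, random_state=42)),
])

# 10-fold cross-validation; the vectorizer is refit on each training fold.
scores = cross_val_score(pipeline, texts, labels, cv=10, scoring="f1")
print(scores.mean())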
Experiment 2 (On Classification Failure - Temporal Classification)
Although the currently used classification methodology helps reduce dataset biases, it would demand knowledge about future threats to work properly. AV companies train their classifiers using data from past samples and leverage them to predict future threats, expecting them to present the same characteristics as past ones. However, malware samples are very dynamic, thus this strategy is the worst-case scenario for AV companies. To demonstrate the effects of predicting future threats based on past data, we split our datasets in two: we used the first half (oldest samples) to train our classifiers, which were then used to predict the newest samples from the second half. The results in the paper indicate a drop in all metrics when compared to the 10-fold experiment in both the DREBIN and AndroZoo datasets...
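A minimal sketch of the temporal split described above, assuming a DataFrame with a timestamp column and a label column (the column names are assumptions):

import pandas as pd

def temporal_split(df: pd.DataFrame, ts_col: str = "timestamp"):
    """Train on the oldest half of the samples, test on the newest half."""
    df = df.sort_values(ts_col).reset_index(drop=True)
    half = len(df) // 2
    return df.iloc[:half], df.iloc[half:]

# Example usage (column names assumed):
# train_df, test_df = temporal_split(data_drebin, ts_col="timestamp")
# model.fit(extract_features(train_df), train_df["label"])
# predictions = model.predict(extract_features(test_df))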
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Android is the most popular operating system on the latest mobile smart devices. With this operating system, many Android applications have been developed and have become an essential part of our daily lives. Unfortunately, different kinds of Android malware have also been generated alongside this endless stream of applications, installed through API calls, granted permissions, and extra package installations, badly affecting system security and harming the device. Therefore, it is compulsory to detect and classify Android malware to protect users' privacy and avoid further damage. Much research has already been conducted on different techniques related to Android malware detection and classification. In this work, we present AMDDLmodel, a deep learning technique that consists of a convolutional neural network. This model works based on different parameters, filter sizes, numbers of epochs, learning rates, and layers to detect and classify Android malware. The Drebin dataset, consisting of 215 features, was used for this model's evaluation. The model shows an accuracy value of 99.92%; the other statistical measures reported are precision, recall, and F1-score. AMDDLmodel introduces an innovative deep learning approach for Android malware detection, enhancing accuracy and practical user security through inventive feature engineering and comprehensive performance evaluation. AMDDLmodel shows the highest accuracy values compared to the existing techniques.
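AMDDLmodel's exact architecture is not reproduced here; purely as a hedged sketch of a 1D-CNN over the 215 Drebin features (the layer sizes, epochs, and learning rate below are placeholders, not the published configuration):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(215, 1)),        # 215 Drebin features as a 1D sequence
    tf.keras.layers.Conv1D(32, kernel_size=3, activation="relu"),
    tf.keras.layers.MaxPooling1D(2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # binary: malware vs benign
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
# model.fit(X_train.reshape(-1, 215, 1), y_train, epochs=10, batch_size=64)  # placeholders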
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
State-of-the-art comparison with the existing techniques.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
With the widespread adoption of internet technologies and email communication systems, the exponential growth in email usage has precipitated a corresponding surge in spam proliferation. These unsolicited messages not only consume users’ valuable time through information overload but also pose significant cybersecurity threats through malware distribution and phishing schemes, thereby jeopardizing both digital security and user experience. This emerging challenge underscores the critical importance of developing effective spam detection mechanisms as a cornerstone of modern cybersecurity infrastructure. Through empirical analysis of machine learning (ML) performance on publicly available spam datasets, we established that algorithmic ensemble methods consistently outperform individual models in detection accuracy. We propose an optimized stacking ensemble framework that strategically combines predictions from four heterogeneous base models (NBC, k-NN, LR, XGBoost) through meta-learner integration. Our methodology incorporates grid search cross-validation with hyperparameter space optimization, enabling systematic identification of parameter configurations that maximize detection performance. The enhanced model was rigorously evaluated using comprehensive metrics including accuracy (99.79%), precision, recall, and F1-score, demonstrating statistically significant improvements over both baseline models and existing solutions documented in the literature.
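A minimal sketch of the described stacking setup with scikit-learn and XGBoost follows; the meta-learner, text features, and hyperparameter grid below are illustrative assumptions rather than the paper's tuned configuration.

from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

base_models = [
    ("nbc", MultinomialNB()),
    ("knn", KNeighborsClassifier()),
    ("lr", LogisticRegression(max_iter=1000)),
    ("xgb", XGBClassifier(eval_metric="logloss")),
]

# Logistic regression as the meta-learner is an assumption, not the paper's choice.
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression(max_iter=1000),
                           cv=5)

# Illustrative grid; the paper's searched hyperparameter space is larger.
param_grid = {"xgb__n_estimators": [100, 300], "knn__n_neighbors": [3, 5]}
search = GridSearchCV(stack, param_grid, cv=5, scoring="f1")
# search.fit(X_train, y_train)   # X_train: e.g. TF-IDF features of emails, y_train: spam labels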
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically