SOREL-20M is a large-scale dataset consisting of nearly 20 million files with pre-extracted features and metadata, high-quality labels derived from multiple sources, information about vendor detections of the malware samples at the time of collection, and additional “tags” related to each malware sample to serve as additional targets.
A dataset intended to support research on machine learning techniques for detecting malware. It includes metadata and EMBER-v2 features for approximately 10 million benign and 10 million malicious Portable Executable files, with disarmed but otherwise complete files for all malware samples. All samples are labeled using Sophos in-house labeling methods and have features extracted using the EMBER-v2 feature set, as well as metadata obtained via the pefile Python library, detection counts obtained via ReversingLabs telemetry, and additional behavioral tags that indicate the rough behavior of the samples.
SOREL-20M Subset Dataset
Dataset Description
This is a subset of the SOREL-20M dataset containing malware detection features and labels.
Dataset Summary
Total samples: 196534
Malicious samples: 99506
Benign samples: 97028
Feature dimensions: 2351 (EMBER v2 features)
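As a quick sanity check on the summary figures, the following minimal sketch confirms that the two class counts add up to the stated total and reports the class balance. The counts are copied from the summary; the function name `class_balance` is a hypothetical helper introduced only for illustration.

```python
# Counts copied from the dataset summary; nothing here is read from the dataset itself.
TOTAL_SAMPLES = 196534
MALICIOUS_SAMPLES = 99506
BENIGN_SAMPLES = 97028


def class_balance(malicious: int, benign: int) -> dict:
    """Return the combined count and the fraction of each class (hypothetical helper)."""
    total = malicious + benign
    return {
        "total": total,
        "malicious_frac": malicious / total,
        "benign_frac": benign / total,
    }


balance = class_balance(MALICIOUS_SAMPLES, BENIGN_SAMPLES)

# The two class counts should account for every sample in the subset.
assert balance["total"] == TOTAL_SAMPLES

print(f"malicious: {balance['malicious_frac']:.1%}, benign: {balance['benign_frac']:.1%}")
```

The subset is close to balanced, with malicious samples making up slightly more than half, so plain accuracy is a less misleading metric here than it would be on a heavily skewed malware corpus.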
Dataset Structure
The dataset contains the following columns:
sha256: SHA256 hash of the original file
label: Binary label (0=benign, 1=malicious)
feature_0 to feature_2350: EMBER v2…
See the full description on the dataset page: https://huggingface.co/datasets/reveng-grp-2025/sorel20m-100k.
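Given the column layout described above, a row can be split into its identifier, label, and feature vector roughly as follows. This is a sketch only: it assumes each row is available as a dict-like mapping from column name to value (however it was loaded), and `split_row` is a hypothetical helper, not part of any dataset tooling. The demo row is synthetic, not real data.

```python
from typing import List, Mapping, Tuple

N_FEATURES = 2351  # feature_0 .. feature_2350, per the column description
FEATURE_COLUMNS = [f"feature_{i}" for i in range(N_FEATURES)]


def split_row(row: Mapping[str, object]) -> Tuple[str, int, List[float]]:
    """Split one row into (sha256, label, features). Hypothetical helper."""
    missing = [c for c in ("sha256", "label", *FEATURE_COLUMNS) if c not in row]
    if missing:
        raise KeyError(f"row is missing {len(missing)} column(s), e.g. {missing[0]!r}")

    label = int(row["label"])
    if label not in (0, 1):
        raise ValueError(f"label must be 0 (benign) or 1 (malicious), got {label}")

    # Keep features in column order so index i always corresponds to feature_i.
    features = [float(row[c]) for c in FEATURE_COLUMNS]
    return str(row["sha256"]), label, features


# Synthetic placeholder row used only to exercise the helper.
demo_row = {"sha256": "0" * 64, "label": 1}
demo_row.update({c: 0.0 for c in FEATURE_COLUMNS})

sha, label, features = split_row(demo_row)
print(sha[:8], label, len(features))
```

Validating the column set up front makes a truncated or differently versioned export fail loudly instead of silently producing a shorter feature vector.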
Malware binary datasets used for adversarial malware generation research