4 datasets found
  1. SOREL-20M Dataset

    • paperswithcode.com
    Updated Dec 13, 2020
    Cite
    Richard Harang; Ethan M. Rudd (2020). SOREL-20M Dataset [Dataset]. https://paperswithcode.com/dataset/sorel-20m
    Authors
    Richard Harang; Ethan M. Rudd
    Description

    SOREL-20M is a large-scale dataset consisting of nearly 20 million files with pre-extracted features and metadata, high-quality labels derived from multiple sources, information about vendor detections of the malware samples at the time of collection, and additional “tags” related to each malware sample to serve as additional targets.

  2. Sophos/ReversingLabs 20 Million malware detection dataset

    • registry.opendata.aws
    Updated Dec 18, 2020
    Cite
    Sophos AI (2020). Sophos/ReversingLabs 20 Million malware detection dataset [Dataset]. https://registry.opendata.aws/sorel-20m/
    Dataset provided by
    Sophos (http://sophos.com/)
    Description

    A dataset intended to support research on machine learning techniques for detecting malware. It includes metadata and EMBER-v2 features for approximately 10 million benign and 10 million malicious Portable Executable files, with disarmed but otherwise complete files for all malware samples. All samples are labeled using Sophos in-house labeling methods and have features extracted using the EMBER-v2 feature set, as well as metadata obtained via the pefile Python library, detection counts obtained via ReversingLabs telemetry, and additional behavioral tags that indicate the rough behavior of the samples.

  3. sorel20m-100k

    • huggingface.co
    Updated Dec 16, 2020
    Cite
    RevEngGrp25 (2020). sorel20m-100k [Dataset]. https://huggingface.co/datasets/reveng-grp-2025/sorel20m-100k
    Dataset authored and provided by
    RevEngGrp25
    Description

    SOREL-20M Subset Dataset

      Dataset Description

    This is a subset of the SOREL-20M dataset containing malware detection features and labels.

      Dataset Summary

    Total samples: 196534
    Malicious samples: 99506
    Benign samples: 97028
    Feature dimensions: 2351 (EMBER v2 features)

      Dataset Structure

    The dataset contains the following columns:

    sha256: SHA256 hash of the original file
    label: Binary label (0=benign, 1=malicious)
    feature_0 to feature_2350: EMBER v2…

    See the full description on the dataset page: https://huggingface.co/datasets/reveng-grp-2025/sorel20m-100k.
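
The column layout described above (sha256, label, then feature_0 to feature_2350) can be illustrated with a minimal sketch. The helper names `expected_columns` and `split_row` are hypothetical, and the record is synthetic; nothing here loads or reproduces the real dataset, and the actual file format on the dataset page may differ.

```python
NUM_FEATURES = 2351  # EMBER v2 feature dimensions, per the dataset summary


def expected_columns(num_features=NUM_FEATURES):
    """Build the column names as described: sha256, label, feature_0..feature_{n-1}."""
    return ["sha256", "label"] + [f"feature_{i}" for i in range(num_features)]


def split_row(row, num_features=NUM_FEATURES):
    """Split one record (a dict keyed by column name) into (sha256, label, features)."""
    features = [float(row[f"feature_{i}"]) for i in range(num_features)]
    return row["sha256"], int(row["label"]), features


# Synthetic record for illustration only; these are not real SOREL-20M values.
fake_row = {"sha256": "0" * 64, "label": "1"}
fake_row.update({f"feature_{i}": "0.0" for i in range(NUM_FEATURES)})

columns = expected_columns()
sha, label, feats = split_row(fake_row)
print(len(columns), label, len(feats))
```

Keeping the hash and label separate from the feature vector is a common first step before handing the features to a classifier; whether the subset is distributed in a form that maps directly onto dicts like this is an assumption.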

  4. Prithviraj Dasgupta, Zachariah Osman (2025). Dataset: SOREL 20M....

    • service.tib.eu
    Updated Jan 3, 2025
    Cite
    Prithviraj Dasgupta, Zachariah Osman (2025). Dataset: SOREL 20M [Dataset]. https://doi.org/10.57702/7qjl9x2s. https://service.tib.eu/ldmservice/dataset/sorel-20m
    Description

    Malware binary datasets used for adversarial malware generation research

SOREL-20M Dataset

Sophos/ReversingLabs-20 Million
