2 datasets found
  1. A Dataset of Information (DNS, IP, WHOIS/RDAP, TLS, GeoIP) for a Large...

    • zenodo.org
    json
    Updated Dec 10, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Radek Hranický; Radek Hranický; Adam Horák; Ondřej Ondryáš; Ondřej Ondryáš; Adam Horák (2024). A Dataset of Information (DNS, IP, WHOIS/RDAP, TLS, GeoIP) for a Large Corpus of Benign, Phishing, and Malware Domain Names 2024 [Dataset]. http://doi.org/10.5281/zenodo.13330074
    Explore at:
    jsonAvailable download formats
    Dataset updated
    Dec 10, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Radek Hranický; Radek Hranický; Adam Horák; Ondřej Ondryáš; Ondřej Ondryáš; Adam Horák
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Aug 16, 2024
    Description

    The dataset contains DNS records, IP-related features, WHOIS/RDAP information, information from TLS handshakes and certificates, and GeoIP information for 368,956 benign domains from Cisco Umbrella, 461,338 benign domains from the actual CESNET network traffic, 164,425 phishing domains from PhishTank and OpenPhish services, and 100,809 malware domains from various sources like ThreatFox, The Firebog, MISP threat intelligence platform, and other sources. The ground truth for the phishing dataset was double-check with the VirusTotal (VT) service. Domain names not considered malicious by VT have been removed from phishing and malware datasets. Similarly, benign domain names that were considered risky by VT have been removed from the benign datasets. The data was collected between March 2023 and July 2024. The final assessment of the data was conducted in August 2024.

    The dataset is useful for cybersecurity research, e.g. statistical analysis of domain data or feature extraction for training machine learning-based classifiers, e.g. for phishing and malware website detection.

    Data Files

    • The data is located in the following individual files:

      • benign_umbrella.json - data for 368,956 benign domains from Cisco Umbrella,
      • benign_cesnet.json - data for 461,338 benign domains from the CESNET network,
      • phishing.json - data for 164,425 phishing domains, and
      • malware.json - data for 100,809 malware domains.

    Data Structure

    Both files contain a JSON array of records generated using mongoexport. The following table documents the structure of a record. Please note that:

    • some fields may be missing (they should be interpreted as nulls),
    • extra fields may be present (they should be ignored).

    Field name

    Field type

    Nullable

    Description

    domain_name

    String

    No

    The evaluated domain name

    url

    String

    No

    The source URL for the domain name

    evaluated_on

    Date

    No

    Date of last collection attempt

    source

    String

    No

    An identifier of the source

    sourced_on

    Date

    No

    Date of ingestion of the domain name

    dns

    Object

    Yes

    Data from DNS scan

    rdap

    Object

    Yes

    Data from RDAP or WHOIS

    tls

    Object

    Yes

    Data from TLS handshake

    ip_data

    Array of Objects

    Yes

    Array of data objects capturing the IP addresses related to the domain name

    DNS data (dns field)

    A

    Array of Strings

    No

    Array of IPv4 addresses

    AAAA

    Array of Strings

    No

    Array of IPv6 addresses

    TXT

    Array of Strings

    No

    Array of raw TXT values

    CNAME

    Object

    No

    The CNAME target and related IPs

    MX

    Array of Objects

    No

    Array of objects with the MX target hostname, priority and related IPs

    NS

    Array of Objects

    No

    Array of objects with the NS target hostname and related IPs

    SOA

    Object

    No

    All the SOA fields, present if found at the target domain name

    zone_SOA

    Object

    No

    The SOA fields of the target’s zone (closest point of delegation), present if found and not a record in the target domain directly

    dnssec

    Object

    No

    Flags describing the DNSSEC validation result for each record type

    ttls

    Object

    No

    The TTL values for each record type

    remarks

    Object

    No

    The zone domain name and DNSSEC flags

    RDAP data (rdap field)

    copyright_notice

    String

    No

    RDAP/WHOIS data usage copyright notice

    dnssec

    Bool

    No

    DNSSEC presence flag

    entitites

    Object

    No

    An object with various arrays representing the found related entity types (e.g. abuse, admin, registrant). The arrays contain objects describing the individual entities.

    expiration_date

    Date

    Yes

    The current date of expiration

    handle

    String

    No

    RDAP handle

    last_changed_date

    Date

    Yes

    The date when the domain was last changed

    name

    String

    No

    The target domain name for which the data in this object are stored

    nameservers

    Array of Strings

    No

    Nameserver hostnames provided by RDAP or WHOIS

    registration_date

    Date

    Yes

    First registration date

    status

    Array of Strings

  2. f

    Learning rate—Performance comparison with Phishtank dataset.

    • figshare.com
    xls
    Updated Jun 9, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ashit Kumar Dutta (2023). Learning rate—Performance comparison with Phishtank dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0258361.t003
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Jun 9, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Ashit Kumar Dutta
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Learning rate—Performance comparison with Phishtank dataset.

  3. Not seeing a result you expected?
    Learn how you can add new datasets to our index.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Radek Hranický; Radek Hranický; Adam Horák; Ondřej Ondryáš; Ondřej Ondryáš; Adam Horák (2024). A Dataset of Information (DNS, IP, WHOIS/RDAP, TLS, GeoIP) for a Large Corpus of Benign, Phishing, and Malware Domain Names 2024 [Dataset]. http://doi.org/10.5281/zenodo.13330074
Organization logo

A Dataset of Information (DNS, IP, WHOIS/RDAP, TLS, GeoIP) for a Large Corpus of Benign, Phishing, and Malware Domain Names 2024

Explore at:
jsonAvailable download formats
Dataset updated
Dec 10, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Radek Hranický; Radek Hranický; Adam Horák; Ondřej Ondryáš; Ondřej Ondryáš; Adam Horák
License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Time period covered
Aug 16, 2024
Description

The dataset contains DNS records, IP-related features, WHOIS/RDAP information, information from TLS handshakes and certificates, and GeoIP information for 368,956 benign domains from Cisco Umbrella, 461,338 benign domains from the actual CESNET network traffic, 164,425 phishing domains from PhishTank and OpenPhish services, and 100,809 malware domains from various sources like ThreatFox, The Firebog, MISP threat intelligence platform, and other sources. The ground truth for the phishing dataset was double-check with the VirusTotal (VT) service. Domain names not considered malicious by VT have been removed from phishing and malware datasets. Similarly, benign domain names that were considered risky by VT have been removed from the benign datasets. The data was collected between March 2023 and July 2024. The final assessment of the data was conducted in August 2024.

The dataset is useful for cybersecurity research, e.g. statistical analysis of domain data or feature extraction for training machine learning-based classifiers, e.g. for phishing and malware website detection.

Data Files

  • The data is located in the following individual files:

    • benign_umbrella.json - data for 368,956 benign domains from Cisco Umbrella,
    • benign_cesnet.json - data for 461,338 benign domains from the CESNET network,
    • phishing.json - data for 164,425 phishing domains, and
    • malware.json - data for 100,809 malware domains.

Data Structure

Both files contain a JSON array of records generated using mongoexport. The following table documents the structure of a record. Please note that:

  • some fields may be missing (they should be interpreted as nulls),
  • extra fields may be present (they should be ignored).

Field name

Field type

Nullable

Description

domain_name

String

No

The evaluated domain name

url

String

No

The source URL for the domain name

evaluated_on

Date

No

Date of last collection attempt

source

String

No

An identifier of the source

sourced_on

Date

No

Date of ingestion of the domain name

dns

Object

Yes

Data from DNS scan

rdap

Object

Yes

Data from RDAP or WHOIS

tls

Object

Yes

Data from TLS handshake

ip_data

Array of Objects

Yes

Array of data objects capturing the IP addresses related to the domain name

DNS data (dns field)

A

Array of Strings

No

Array of IPv4 addresses

AAAA

Array of Strings

No

Array of IPv6 addresses

TXT

Array of Strings

No

Array of raw TXT values

CNAME

Object

No

The CNAME target and related IPs

MX

Array of Objects

No

Array of objects with the MX target hostname, priority and related IPs

NS

Array of Objects

No

Array of objects with the NS target hostname and related IPs

SOA

Object

No

All the SOA fields, present if found at the target domain name

zone_SOA

Object

No

The SOA fields of the target’s zone (closest point of delegation), present if found and not a record in the target domain directly

dnssec

Object

No

Flags describing the DNSSEC validation result for each record type

ttls

Object

No

The TTL values for each record type

remarks

Object

No

The zone domain name and DNSSEC flags

RDAP data (rdap field)

copyright_notice

String

No

RDAP/WHOIS data usage copyright notice

dnssec

Bool

No

DNSSEC presence flag

entitites

Object

No

An object with various arrays representing the found related entity types (e.g. abuse, admin, registrant). The arrays contain objects describing the individual entities.

expiration_date

Date

Yes

The current date of expiration

handle

String

No

RDAP handle

last_changed_date

Date

Yes

The date when the domain was last changed

name

String

No

The target domain name for which the data in this object are stored

nameservers

Array of Strings

No

Nameserver hostnames provided by RDAP or WHOIS

registration_date

Date

Yes

First registration date

status

Array of Strings

Search
Clear search
Close search
Google apps
Main menu