Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains DNS records, IP-related features, WHOIS/RDAP information, information from TLS handshakes and certificates, and GeoIP information for 368,956 benign domains from Cisco Umbrella, 461,338 benign domains from the actual CESNET network traffic, 164,425 phishing domains from PhishTank and OpenPhish services, and 100,809 malware domains from various sources like ThreatFox, The Firebog, MISP threat intelligence platform, and other sources. The ground truth for the phishing dataset was double-check with the VirusTotal (VT) service. Domain names not considered malicious by VT have been removed from phishing and malware datasets. Similarly, benign domain names that were considered risky by VT have been removed from the benign datasets. The data was collected between March 2023 and July 2024. The final assessment of the data was conducted in August 2024.
The dataset is useful for cybersecurity research, e.g. statistical analysis of domain data or feature extraction for training machine learning-based classifiers, e.g. for phishing and malware website detection.
The data is located in the following individual files:
Both files contain a JSON array of records generated using mongoexport. The following table documents the structure of a record. Please note that:
Field name |
Field type |
Nullable |
Description |
domain_name |
String |
No |
The evaluated domain name |
url |
String |
No |
The source URL for the domain name |
evaluated_on |
Date |
No |
Date of last collection attempt |
source |
String |
No |
An identifier of the source |
sourced_on |
Date |
No |
Date of ingestion of the domain name |
dns |
Object |
Yes |
Data from DNS scan |
rdap |
Object |
Yes |
Data from RDAP or WHOIS |
tls |
Object |
Yes |
Data from TLS handshake |
ip_data |
Array of Objects |
Yes |
Array of data objects capturing the IP addresses related to the domain name |
DNS data (dns field) | |||
A |
Array of Strings |
No |
Array of IPv4 addresses |
AAAA |
Array of Strings |
No |
Array of IPv6 addresses |
TXT |
Array of Strings |
No |
Array of raw TXT values |
CNAME |
Object |
No |
The CNAME target and related IPs |
MX |
Array of Objects |
No |
Array of objects with the MX target hostname, priority and related IPs |
NS |
Array of Objects |
No |
Array of objects with the NS target hostname and related IPs |
SOA |
Object |
No |
All the SOA fields, present if found at the target domain name |
zone_SOA |
Object |
No |
The SOA fields of the target’s zone (closest point of delegation), present if found and not a record in the target domain directly |
dnssec |
Object |
No |
Flags describing the DNSSEC validation result for each record type |
ttls |
Object |
No |
The TTL values for each record type |
remarks |
Object |
No |
The zone domain name and DNSSEC flags |
RDAP data (rdap field) | |||
copyright_notice |
String |
No |
RDAP/WHOIS data usage copyright notice |
dnssec |
Bool |
No |
DNSSEC presence flag |
entitites |
Object |
No |
An object with various arrays representing the found related entity types (e.g. abuse, admin, registrant). The arrays contain objects describing the individual entities. |
expiration_date |
Date |
Yes |
The current date of expiration |
handle |
String |
No |
RDAP handle |
last_changed_date |
Date |
Yes |
The date when the domain was last changed |
name |
String |
No |
The target domain name for which the data in this object are stored |
nameservers |
Array of Strings |
No |
Nameserver hostnames provided by RDAP or WHOIS |
registration_date |
Date |
Yes |
First registration date |
status |
Array of Strings |
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Learning rate—Performance comparison with Phishtank dataset.
Not seeing a result you expected?
Learn how you can add new datasets to our index.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains DNS records, IP-related features, WHOIS/RDAP information, information from TLS handshakes and certificates, and GeoIP information for 368,956 benign domains from Cisco Umbrella, 461,338 benign domains from the actual CESNET network traffic, 164,425 phishing domains from PhishTank and OpenPhish services, and 100,809 malware domains from various sources like ThreatFox, The Firebog, MISP threat intelligence platform, and other sources. The ground truth for the phishing dataset was double-check with the VirusTotal (VT) service. Domain names not considered malicious by VT have been removed from phishing and malware datasets. Similarly, benign domain names that were considered risky by VT have been removed from the benign datasets. The data was collected between March 2023 and July 2024. The final assessment of the data was conducted in August 2024.
The dataset is useful for cybersecurity research, e.g. statistical analysis of domain data or feature extraction for training machine learning-based classifiers, e.g. for phishing and malware website detection.
The data is located in the following individual files:
Both files contain a JSON array of records generated using mongoexport. The following table documents the structure of a record. Please note that:
Field name |
Field type |
Nullable |
Description |
domain_name |
String |
No |
The evaluated domain name |
url |
String |
No |
The source URL for the domain name |
evaluated_on |
Date |
No |
Date of last collection attempt |
source |
String |
No |
An identifier of the source |
sourced_on |
Date |
No |
Date of ingestion of the domain name |
dns |
Object |
Yes |
Data from DNS scan |
rdap |
Object |
Yes |
Data from RDAP or WHOIS |
tls |
Object |
Yes |
Data from TLS handshake |
ip_data |
Array of Objects |
Yes |
Array of data objects capturing the IP addresses related to the domain name |
DNS data (dns field) | |||
A |
Array of Strings |
No |
Array of IPv4 addresses |
AAAA |
Array of Strings |
No |
Array of IPv6 addresses |
TXT |
Array of Strings |
No |
Array of raw TXT values |
CNAME |
Object |
No |
The CNAME target and related IPs |
MX |
Array of Objects |
No |
Array of objects with the MX target hostname, priority and related IPs |
NS |
Array of Objects |
No |
Array of objects with the NS target hostname and related IPs |
SOA |
Object |
No |
All the SOA fields, present if found at the target domain name |
zone_SOA |
Object |
No |
The SOA fields of the target’s zone (closest point of delegation), present if found and not a record in the target domain directly |
dnssec |
Object |
No |
Flags describing the DNSSEC validation result for each record type |
ttls |
Object |
No |
The TTL values for each record type |
remarks |
Object |
No |
The zone domain name and DNSSEC flags |
RDAP data (rdap field) | |||
copyright_notice |
String |
No |
RDAP/WHOIS data usage copyright notice |
dnssec |
Bool |
No |
DNSSEC presence flag |
entitites |
Object |
No |
An object with various arrays representing the found related entity types (e.g. abuse, admin, registrant). The arrays contain objects describing the individual entities. |
expiration_date |
Date |
Yes |
The current date of expiration |
handle |
String |
No |
RDAP handle |
last_changed_date |
Date |
Yes |
The date when the domain was last changed |
name |
String |
No |
The target domain name for which the data in this object are stored |
nameservers |
Array of Strings |
No |
Nameserver hostnames provided by RDAP or WHOIS |
registration_date |
Date |
Yes |
First registration date |
status |
Array of Strings |