2 datasets found

A Dataset of Information (DNS, IP, WHOIS/RDAP, TLS, GeoIP) for a Large...

zenodo.org

json

Updated Dec 10, 2024

+ more versions

Facebook

Twitter

Click to copy link

Link copied

Cite

Radek Hranický; Radek Hranický; Adam Horák; Ondřej Ondryáš; Ondřej Ondryáš; Adam Horák (2024). A Dataset of Information (DNS, IP, WHOIS/RDAP, TLS, GeoIP) for a Large Corpus of Benign, Phishing, and Malware Domain Names 2024 [Dataset]. http://doi.org/10.5281/zenodo.13330074

Explore at:

jsonAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.13330074

Dataset updated

Dec 10, 2024

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Radek Hranický; Radek Hranický; Adam Horák; Ondřej Ondryáš; Ondřej Ondryáš; Adam Horák

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Time period covered

Aug 16, 2024

Description

The dataset contains DNS records, IP-related features, WHOIS/RDAP information, information from TLS handshakes and certificates, and GeoIP information for 368,956 benign domains from Cisco Umbrella, 461,338 benign domains from the actual CESNET network traffic, 164,425 phishing domains from PhishTank and OpenPhish services, and 100,809 malware domains from various sources like ThreatFox, The Firebog, MISP threat intelligence platform, and other sources. The ground truth for the phishing dataset was double-check with the VirusTotal (VT) service. Domain names not considered malicious by VT have been removed from phishing and malware datasets. Similarly, benign domain names that were considered risky by VT have been removed from the benign datasets. The data was collected between March 2023 and July 2024. The final assessment of the data was conducted in August 2024.

The dataset is useful for cybersecurity research, e.g. statistical analysis of domain data or feature extraction for training machine learning-based classifiers, e.g. for phishing and malware website detection.

Data Files

The data is located in the following individual files:
- benign_umbrella.json - data for 368,956 benign domains from Cisco Umbrella,
- benign_cesnet.json - data for 461,338 benign domains from the CESNET network,
- phishing.json - data for 164,425 phishing domains, and
- malware.json - data for 100,809 malware domains.

Data Structure

Both files contain a JSON array of records generated using mongoexport. The following table documents the structure of a record. Please note that:

some fields may be missing (they should be interpreted as nulls),
extra fields may be present (they should be ignored).

Field name	Field type	Nullable	Description
domain_name	String	No	The evaluated domain name
url	String	No	The source URL for the domain name
evaluated_on	Date	No	Date of last collection attempt
source	String	No	An identifier of the source
sourced_on	Date	No	Date of ingestion of the domain name
dns	Object	Yes	Data from DNS scan
rdap	Object	Yes	Data from RDAP or WHOIS
tls	Object	Yes	Data from TLS handshake
ip_data	Array of Objects	Yes	Array of data objects capturing the IP addresses related to the domain name
DNS data (dns field)
A	Array of Strings	No	Array of IPv4 addresses
AAAA	Array of Strings	No	Array of IPv6 addresses
TXT	Array of Strings	No	Array of raw TXT values
CNAME	Object	No	The CNAME target and related IPs
MX	Array of Objects	No	Array of objects with the MX target hostname, priority and related IPs
NS	Array of Objects	No	Array of objects with the NS target hostname and related IPs
SOA	Object	No	All the SOA fields, present if found at the target domain name
zone_SOA	Object	No	The SOA fields of the target’s zone (closest point of delegation), present if found and not a record in the target domain directly
dnssec	Object	No	Flags describing the DNSSEC validation result for each record type
ttls	Object	No	The TTL values for each record type
remarks	Object	No	The zone domain name and DNSSEC flags
RDAP data (rdap field)
copyright_notice	String	No	RDAP/WHOIS data usage copyright notice
dnssec	Bool	No	DNSSEC presence flag
entitites	Object	No	An object with various arrays representing the found related entity types (e.g. abuse, admin, registrant). The arrays contain objects describing the individual entities.
expiration_date	Date	Yes	The current date of expiration
handle	String	No	RDAP handle
last_changed_date	Date	Yes	The date when the domain was last changed
name	String	No	The target domain name for which the data in this object are stored
nameservers	Array of Strings	No	Nameserver hostnames provided by RDAP or WHOIS
registration_date	Date	Yes	First registration date
status	Array of Strings

f
Learning rate—Performance comparison with Phishtank dataset.
figshare.com
xls
Updated Jun 9, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ashit Kumar Dutta (2023). Learning rate—Performance comparison with Phishtank dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0258361.t003
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0258361.t003
Dataset updated
Jun 9, 2023
Dataset provided by
PLOS ONE
Authors
Ashit Kumar Dutta
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Learning rate—Performance comparison with Phishtank dataset.
Not seeing a result you expected?
Learn how you can add new datasets to our index.

Facebook

Twitter

Click to copy link

Link copied

Cite

A Dataset of Information (DNS, IP, WHOIS/RDAP, TLS, GeoIP) for a Large Corpus of Benign, Phishing, and Malware Domain Names 2024

Explore at:

jsonAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.13330074

Dataset updated

Dec 10, 2024

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Radek Hranický; Radek Hranický; Adam Horák; Ondřej Ondryáš; Ondřej Ondryáš; Adam Horák

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Time period covered

Aug 16, 2024

Description

Data Files

The data is located in the following individual files:
- benign_umbrella.json - data for 368,956 benign domains from Cisco Umbrella,
- benign_cesnet.json - data for 461,338 benign domains from the CESNET network,
- phishing.json - data for 164,425 phishing domains, and
- malware.json - data for 100,809 malware domains.

Data Structure

Both files contain a JSON array of records generated using mongoexport. The following table documents the structure of a record. Please note that:

some fields may be missing (they should be interpreted as nulls),
extra fields may be present (they should be ignored).

Field name	Field type	Nullable	Description
domain_name	String	No	The evaluated domain name
url	String	No	The source URL for the domain name
evaluated_on	Date	No	Date of last collection attempt
source	String	No	An identifier of the source
sourced_on	Date	No	Date of ingestion of the domain name
dns	Object	Yes	Data from DNS scan
rdap	Object	Yes	Data from RDAP or WHOIS
tls	Object	Yes	Data from TLS handshake
ip_data	Array of Objects	Yes	Array of data objects capturing the IP addresses related to the domain name
DNS data (dns field)
A	Array of Strings	No	Array of IPv4 addresses
AAAA	Array of Strings	No	Array of IPv6 addresses
TXT	Array of Strings	No	Array of raw TXT values
CNAME	Object	No	The CNAME target and related IPs
MX	Array of Objects	No	Array of objects with the MX target hostname, priority and related IPs
NS	Array of Objects	No	Array of objects with the NS target hostname and related IPs
SOA	Object	No	All the SOA fields, present if found at the target domain name
zone_SOA	Object	No	The SOA fields of the target’s zone (closest point of delegation), present if found and not a record in the target domain directly
dnssec	Object	No	Flags describing the DNSSEC validation result for each record type
ttls	Object	No	The TTL values for each record type
remarks	Object	No	The zone domain name and DNSSEC flags
RDAP data (rdap field)
copyright_notice	String	No	RDAP/WHOIS data usage copyright notice
dnssec	Bool	No	DNSSEC presence flag
entitites	Object	No	An object with various arrays representing the found related entity types (e.g. abuse, admin, registrant). The arrays contain objects describing the individual entities.
expiration_date	Date	Yes	The current date of expiration
handle	String	No	RDAP handle
last_changed_date	Date	Yes	The date when the domain was last changed
name	String	No	The target domain name for which the data in this object are stored
nameservers	Array of Strings	No	Nameserver hostnames provided by RDAP or WHOIS
registration_date	Date	Yes	First registration date
status	Array of Strings

Clear search

Close search

Google apps

Main menu

A Dataset of Information (DNS, IP, WHOIS/RDAP, TLS, GeoIP) for a Large...

Data Files

Data Structure

Learning rate—Performance comparison with Phishtank dataset.

A Dataset of Information (DNS, IP, WHOIS/RDAP, TLS, GeoIP) for a Large Corpus of Benign, Phishing, and Malware Domain Names 2024See More Versions

Data Files

Data Structure

A Dataset of Information (DNS, IP, WHOIS/RDAP, TLS, GeoIP) for a Large Corpus of Benign, Phishing, and Malware Domain Names 2024