62 datasets found

jdb-url.com - Historical whois Lookup
whoisdatacenter.com
csv
Cite
AllHeart Web Inc, jdb-url.com - Historical whois Lookup [Dataset]. https://whoisdatacenter.com/domain/jdb-url.com/
Available download formats
csv
Dataset provided by
AllHeart Web
Authors
AllHeart Web Inc
License
https://whoisdatacenter.com/terms-of-use/
Time period covered
Mar 15, 1985 - Aug 1, 2025
Description
Explore the historical Whois records related to jdb-url.com (Domain). Get insights into ownership history and changes over time.
free-tiny-url.com - Historical whois Lookup
whoisdatacenter.com
csv
Cite
AllHeart Web Inc, free-tiny-url.com - Historical whois Lookup [Dataset]. https://whoisdatacenter.com/domain/free-tiny-url.com/
Available download formats
csv
Dataset provided by
AllHeart Web
Authors
AllHeart Web Inc
License
https://whoisdatacenter.com/terms-of-use/
Time period covered
Mar 15, 1985 - Jul 11, 2025
Description
Explore the historical Whois records related to free-tiny-url.com (Domain). Get insights into ownership history and changes over time.
url-classifications
huggingface.co
Cite
snats, url-classifications [Dataset]. https://huggingface.co/datasets/snats/url-classifications
Authors
snats
Description
Model Card: URL Classifications Dataset

Dataset Summary

The URL Classifications Dataset is a collection of URL classifications for PDF documents, primarily derived from the SafeDocs corpus. It contains multiple CSV files with different subsets of classifications, including both raw and processed data.

Supported Tasks

This dataset supports the following tasks:

Text Classification
URL-based Document Classification
PDF Content Inference

Languages

The… See the full description on the dataset page: https://huggingface.co/datasets/snats/url-classifications.
Phishing websites
kaggle.com
Updated Jun 21, 2023
Cite
Satya Ganesh Kumar (2023). Phishing websites [Dataset]. https://www.kaggle.com/datasets/satyaganeshkumar/phishing-websites
Dataset updated
Jun 21, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Satya Ganesh Kumar
Description
The "Phishing Data" dataset is a comprehensive collection of information specifically curated for analyzing and understanding phishing attacks. Phishing attacks involve malicious attempts to deceive individuals or organizations into disclosing sensitive information such as passwords or credit card details. This dataset comprises 18 distinct features that offer valuable insights into the characteristics of phishing attempts. These features include the URL of the website being analyzed, the length of the URL, the use of URL shortening services, the presence of the "@" symbol, the presence of redirection using "//", the presence of prefixes or suffixes in the URL, the number of subdomains, the usage of secure connection protocols (HTTPS), the length of time since domain registration, the presence of a favicon, the presence of HTTP or HTTPS tokens in the domain name, the URL of requested external resources, the presence of anchors in the URL, the number of hyperlinks in HTML tags, the server form handler used, the submission of data to email addresses, abnormal URL patterns, and estimated website traffic or popularity. Together, these features enable the analysis and detection of phishing attempts in the "Phishing Data" dataset, aiding in the development of models and algorithms to combat phishing attacks.
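Several of the features listed above are lexical and can be read from the URL string alone. The following is a minimal sketch, not the dataset author's extraction code: the function name `url_features`, the feature keys, the shortener host list, and the position threshold for "//" are all assumptions made for illustration.

```python
from urllib.parse import urlparse

# Hypothetical list of URL-shortener hosts, for illustration only.
SHORTENER_HOSTS = {"bit.ly", "tinyurl.com", "is.gd"}


def url_features(url: str) -> dict:
    """Compute a few lexical features from a URL string."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    # Index of the last "//"; anything past the scheme separator
    # ("http://" or "https://") is treated as a redirection marker.
    last_double_slash = url.rfind("//")
    return {
        "url_length": len(url),
        "uses_shortener": host in SHORTENER_HOSTS,
        "has_at_symbol": "@" in url,
        "has_redirect_double_slash": last_double_slash > 7,
        "has_hyphen_in_host": "-" in host,
        # Simplification: counts dots and ignores multi-part suffixes such as .co.uk.
        "subdomain_count": max(host.count(".") - 1, 0),
        "uses_https": parsed.scheme == "https",
    }


features = url_features("https://secure-login.example.com/a//b@c")
```

Features that need page content or external lookups (favicon, domain age, traffic) are out of scope for this sketch.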
the-duke-of-url.com - Historical whois Lookup
whoisdatacenter.com
csv
Cite
AllHeart Web Inc, the-duke-of-url.com - Historical whois Lookup [Dataset]. https://whoisdatacenter.com/domain/the-duke-of-url.com/
Available download formats
csv
Dataset provided by
AllHeart Web
Authors
AllHeart Web Inc
License
https://whoisdatacenter.com/terms-of-use/
Time period covered
Mar 15, 1985 - Jul 13, 2025
Description
Explore the historical Whois records related to the-duke-of-url.com (Domain). Get insights into ownership history and changes over time.
City website pages (by URL) visited in 2013
processor1.francecentral.cloudapp.azure.com
Updated Oct 29, 2014
Cite
City of Brussels/Web Cell (2014). City website pages (by URL) visited in 2013 [Dataset]. http://processor1.francecentral.cloudapp.azure.com/dataset/pages-of-the-website-by-url-of-the-city-visited-in-2013
Available download formats
https://www.iana.org/assignments/media-types/application/json, https://www.iana.org/assignments/media-types/text/csv
Dataset updated
Oct 29, 2014
Dataset provided by
City of Brussels/Web Cell
License
Licence Ouverte / Open Licence 1.0 (https://www.etalab.gouv.fr/wp-content/uploads/2014/05/Open_Licence.pdf)
License information was derived automatically
Description
Statistics of visits to the pages of the website of the City of Brussels, with the address of the page (according to www.bruxelles.be), the number of pages viewed, the number of unique consultations, and the average time spent on the page. Source: Google Analytics. Each page of the website of the City of Brussels is identified by four digits appearing in its address. A search on these four digits retrieves the statistics for the page in question.
Visited web pages of the City (per URL) in 2015
processor1.francecentral.cloudapp.azure.com
Updated Feb 5, 2016
Cite
City of Brussels/Cell Web (2016). Visited web pages of the City (per URL) in 2015 [Dataset]. http://processor1.francecentral.cloudapp.azure.com/dataset/visited-webpages-of-the-city-by-url-in-2015
Available download formats
https://www.iana.org/assignments/media-types/text/csv, https://www.iana.org/assignments/media-types/application/json
Dataset updated
Feb 5, 2016
Dataset provided by
City of Brussels/Cell Web
License
Licence Ouverte / Open Licence 1.0 (https://www.etalab.gouv.fr/wp-content/uploads/2014/05/Open_Licence.pdf)
License information was derived automatically
Description
Visitor statistics of the pages of the website of the City of Brussels (2014) with the address of the page (following www.brussels.be), the number of pages viewed, the number of unique visits. Source: Google Analytics.
phishing-url
huggingface.co
Updated Feb 13, 2025
Cite
pirocheto (2025). phishing-url [Dataset]. https://huggingface.co/datasets/pirocheto/phishing-url
Dataset updated
Feb 13, 2025
Authors
pirocheto
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset Description

The provided dataset includes 11430 URLs with 87 extracted features. The dataset is designed to be used as a benchmark for machine-learning-based phishing detection systems. The dataset is balanced: it contains exactly 50% phishing and 50% legitimate URLs.
Features are from three different classes:

56 extracted from the structure and syntax of URLs
24 extracted from the content of their corresponding pages
7 extracted by querying external services

The… See the full description on the dataset page: https://huggingface.co/datasets/pirocheto/phishing-url.
URL classification
kaggle.com
Updated Aug 30, 2020
Cite
Sainath Krothapalli (2020). URL classification [Dataset]. https://www.kaggle.com/datasets/sainathkrothapalli/url-classification/data
Dataset updated
Aug 30, 2020
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Sainath Krothapalli
Description
This is a URL classification problem. The dataset consists of 17 features, of which 16 are independent and statistical_report is the dependent feature. In statistical_report, 0 means benign (good) and 1 means malicious (bad).
phishing-site-url
huggingface.co
Updated May 7, 2025
Cite
Kaushi Gihan (2025). phishing-site-url [Dataset]. https://huggingface.co/datasets/KaushiGihan/phishing-site-url
Dataset updated
May 7, 2025
Authors
Kaushi Gihan
Description
Dataset reference link: https://www.kaggle.com/datasets/taruntiwarihp/phishing-site-urls/data
PhiUSIIL Phishing URL Dataset
data.mendeley.com
Updated Nov 15, 2023
Cite
Arvind Prasad (2023). PhiUSIIL Phishing URL Dataset [Dataset]. http://doi.org/10.17632/shwpxscxy2.2
Unique identifier
https://doi.org/10.17632/shwpxscxy2.2
Dataset updated
Nov 15, 2023
Authors
Arvind Prasad
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
PhiUSIIL Phishing URL Dataset is a substantial dataset comprising 134,850 legitimate and 100,945 phishing URLs. Most of the URLs we analyzed while constructing the dataset are the latest URLs. Features are extracted from the source code of the webpage and URL. Features such as CharContinuationRate, URLTitleMatchScore, URLCharProb, and TLDLegitimateProb are derived from existing features.

Citation: Prasad, A., & Chandra, S. (2023). PhiUSIIL: A diverse security profile empowered phishing URL detection framework based on similarity index and incremental learning. Computers & Security, 103545. doi: https://doi.org/10.1016/j.cose.2023.103545
Random sample of Common Crawl domains from 2021
kaggle.com
Updated Aug 19, 2021
Cite
HiHarshSinghal (2021). Random sample of Common Crawl domains from 2021 [Dataset]. https://www.kaggle.com/datasets/harshsinghal/random-sample-of-common-crawl-domains-from-2021/code
Dataset updated
Aug 19, 2021
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
HiHarshSinghal
Description
Context

The Common Crawl project has fascinated me ever since I learned about it. It provides a large number of data formats and presents challenges across skill and interest areas. I am particularly interested in URL analysis for applications such as typosquatting, malicious URLs, and just about anything interesting that can be done with domain names.

Content

I have sampled 1% of the domains from the Common Crawl Index dataset that is available on AWS in Parquet format. You can read more about how I extracted this dataset @ https://harshsinghal.dev/create-a-url-dataset-for-nlp/

Acknowledgements

Thanks a ton to the folks at https://commoncrawl.org/ for making this immensely valuable resource available to the world for free. Please find their Terms of Use here.

Inspiration

My interests are in working with string similarity functions and I continue to find scalable ways of doing this. I wrote about using a Postgres extension to compute string distances and used Common Crawl URL domains as the input dataset (you can read more @ https://harshsinghal.dev/postgres-text-similarity-with-commoncrawl-domains/).

I am also interested in identifying fraudulent domains and understanding malicious URL patterns.
weychi@ms12.url.com.tw - Reverse Whois Lookup
whoisdatacenter.com
csv
Cite
AllHeart Web Inc, weychi@ms12.url.com.tw - Reverse Whois Lookup [Dataset]. https://whoisdatacenter.com/email/weychi@ms12.url.com.tw/
Available download formats
csv
Dataset authored and provided by
AllHeart Web Inc
License
https://whoisdatacenter.com/terms-of-use/
Time period covered
Mar 15, 1985 - Jun 18, 2025
Description
Explore historical ownership and registration records by performing a reverse Whois lookup for the email address weychi@ms12.url.com.tw.
Websites using URL Rewrite
webtechsurvey.com
csv
Updated Dec 1, 2021
Cite
WebTechSurvey (2021). Websites using URL Rewrite [Dataset]. https://webtechsurvey.com/technology/url-rewrite
Available download formats
csv
Dataset updated
Dec 1, 2021
Dataset authored and provided by
WebTechSurvey
License
https://webtechsurvey.com/terms
Time period covered
2025
Area covered
Global
Description
A complete list of live websites using the URL Rewrite technology, compiled through global website indexing conducted by WebTechSurvey.
Ministry of Justice and its affiliated agencies mobile website address...
data.gov.tw
csv
Cite
Department of Information Management, Ministry of Justice and its affiliated agencies mobile website address information [Dataset]. https://data.gov.tw/en/datasets/13001
Available download formats
csv
Dataset authored and provided by
Department of Information Management
License
https://data.gov.tw/license
Description
Mobile Website URL Information on the World Wide Web
Repository Analytics and Metrics Portal (RAMP) 2017 data
data.niaid.nih.gov
datadryad.org
zip
Updated Jul 27, 2021
Cite
Jonathan Wheeler; Kenning Arlitsch (2021). Repository Analytics and Metrics Portal (RAMP) 2017 data [Dataset]. http://doi.org/10.5061/dryad.r7sqv9scf
Available download formats
zip
Unique identifier
https://doi.org/10.5061/dryad.r7sqv9scf
Dataset updated
Jul 27, 2021
Dataset provided by
Montana State University
University of New Mexico
Authors
Jonathan Wheeler; Kenning Arlitsch
License
https://spdx.org/licenses/CC0-1.0.html
Description
The Repository Analytics and Metrics Portal (RAMP) is a web service that aggregates use and performance data of institutional repositories (IR). The data are a subset of data from RAMP (http://rampanalytics.org), consisting of data from all participating repositories for the calendar year 2017. For a description of the data collection, processing, and output methods, please see the "methods" section below.

Methods

RAMP Data Documentation – January 1, 2017 through August 18, 2018

Data Collection

RAMP data are downloaded for participating IR from Google Search Console (GSC) via the Search Console API. The data consist of aggregated information about IR pages which appeared in search result pages (SERP) within Google properties (including web search and Google Scholar).

Data from January 1, 2017 through August 18, 2018 were downloaded in one dataset per participating IR. The following fields were downloaded for each URL, with one row per URL:

url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property.
impressions: The number of times the URL appears within the SERP.
clicks: The number of clicks on a URL which took users to a page outside of the SERP.
clickThrough: Calculated as the number of clicks divided by the number of impressions.
position: The position of the URL within the SERP.
country: The country from which the corresponding search originated.
device: The device used for the search.
date: The date of the search.

Following the data processing described below, on ingest into RAMP an additional field, citableContent, is added to the page-level data.

Note that no personally identifiable information is downloaded by RAMP. Google does not make such information available.

More information about click-through rates, impressions, and position is available from Google's Search Console API documentation: https://developers.google.com/webmaster-tools/search-console-api-original/v3/searchanalytics/query and https://support.google.com/webmasters/answer/7042828?hl=en

Data Processing

Upon download from GSC, data are processed to identify URLs that point to citable content. Citable content is defined within RAMP as any URL which points to any type of non-HTML content file (PDF, CSV, etc.). As part of the daily download of statistics from Google Search Console (GSC), URLs are analyzed to determine whether they point to HTML pages or actual content files. URLs that point to content files are flagged as "citable content." In addition to the fields downloaded from GSC described above, following this brief analysis one more field, citableContent, is added to the data which records whether each URL in the GSC data points to citable content. Possible values for the citableContent field are "Yes" and "No."
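As a rough illustration of the flagging step described above, the sketch below marks a URL as citable content when its path ends in a known content-file extension. This is an assumption about how such a check could work, not RAMP's actual implementation: the function name `citable_content`, the extension allow-list, and the example URLs are hypothetical.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical allow-list of non-HTML content-file extensions.
CONTENT_EXTENSIONS = {".pdf", ".csv", ".docx", ".xlsx", ".zip"}


def citable_content(url: str) -> str:
    """Return "Yes" if the URL path ends in a content-file extension, else "No"."""
    path = urlparse(url).path  # query string and fragment are excluded
    extension = posixpath.splitext(path)[1].lower()
    return "Yes" if extension in CONTENT_EXTENSIONS else "No"


flag = citable_content("https://repo.example.edu/bitstream/1/paper.pdf?sequence=1")
```

An allow-list is used here instead of "anything that is not .html" so that extension-less item pages and paths containing dots are not flagged by accident.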

Processed data are then saved in a series of Elasticsearch indices. From January 1, 2017, through August 18, 2018, RAMP stored data in one index per participating IR.

About Citable Content Downloads

Data visualizations and aggregations in RAMP dashboards present information about citable content downloads, or CCD. As a measure of use of institutional repository content, CCD represent click activity on IR content that may correspond to research use.

CCD information is summary data calculated on the fly within the RAMP web application. As noted above, data provided by GSC include whether and how many times a URL was clicked by users. Within RAMP, a "click" is counted as a potential download, so a CCD is calculated as the sum of clicks on pages/URLs that are determined to point to citable content (as defined above).

For any specified date range, the steps to calculate CCD are:

Filter data to only include rows where "citableContent" is set to "Yes."
Sum the value of the "clicks" field on these rows.
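The two steps above can be sketched with the standard library. This is a minimal example under stated assumptions: the rows are taken to be already restricted to the date range of interest, the sample data is invented, and the function name `citable_content_downloads` is hypothetical. Only the `clicks` and `citableContent` field names come from the description.

```python
import csv
import io


def citable_content_downloads(rows) -> int:
    """Sum clicks over rows whose citableContent field is "Yes"."""
    return sum(
        int(row["clicks"])
        for row in rows
        if row["citableContent"] == "Yes"
    )


# Invented sample data in the shape of a RAMP CSV export (subset of fields).
sample = io.StringIO(
    "url,clicks,citableContent\n"
    "https://repo.example.edu/a.pdf,3,Yes\n"
    "https://repo.example.edu/handle/1,5,No\n"
    "https://repo.example.edu/b.csv,2,Yes\n"
)
ccd = citable_content_downloads(csv.DictReader(sample))
print(ccd)  # prints 5
```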

Output to CSV

Published RAMP data are exported from the production Elasticsearch instance and converted to CSV format. The CSV data consist of one "row" for each page or URL from a specific IR which appeared in search result pages (SERP) within Google properties as described above.

The data in these CSV files include the following fields:

url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property.
impressions: The number of times the URL appears within the SERP.
clicks: The number of clicks on a URL which took users to a page outside of the SERP.
clickThrough: Calculated as the number of clicks divided by the number of impressions.
position: The position of the URL within the SERP.
country: The country from which the corresponding search originated.
device: The device used for the search.
date: The date of the search.
citableContent: Whether or not the URL points to a content file (ending with pdf, csv, etc.) rather than HTML wrapper pages. Possible values are Yes or No.
index: The Elasticsearch index corresponding to page click data for a single IR.
repository_id: This is a human readable alias for the index and identifies the participating repository corresponding to each row. As RAMP has undergone platform and version migrations over time, index names as defined for the index field have not remained consistent. That is, a single participating repository may have multiple corresponding Elasticsearch index names over time. The repository_id is a canonical identifier that has been added to the data to provide an identifier that can be used to reference a single participating repository across all datasets. Filtering and aggregation for individual repositories or groups of repositories should be done using this field.

Filenames for files containing these data follow the format 2017-01_RAMP_all.csv. Using this example, the file 2017-01_RAMP_all.csv contains all data for all RAMP participating IR for the month of January, 2017.
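When working with several monthly files, the naming pattern can be parsed to recover the year and month. The sketch below assumes every file follows the `YYYY-MM_RAMP_all.csv` form shown in the example; the function name `parse_ramp_filename` is hypothetical.

```python
import re

# Matches names like 2017-01_RAMP_all.csv and captures year and month.
RAMP_FILENAME = re.compile(r"^(\d{4})-(\d{2})_RAMP_all\.csv$")


def parse_ramp_filename(name: str) -> tuple[int, int]:
    """Return (year, month) for a RAMP monthly export filename."""
    match = RAMP_FILENAME.match(name)
    if match is None:
        raise ValueError(f"not a RAMP monthly file: {name!r}")
    year, month = int(match.group(1)), int(match.group(2))
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range in {name!r}")
    return year, month


year, month = parse_ramp_filename("2017-01_RAMP_all.csv")
```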

References

Google, Inc. (2021). Search Console APIs. Retrieved from https://developers.google.com/webmaster-tools/search-console-api-original.
URLs (url) for government agencies and municipalities
data.europa.eu
unknown
Updated Feb 5, 2022
Cite
(2022). URLs (url) for government agencies and municipalities [Dataset]. https://data.europa.eu/data/datasets/https-data-norge-no-node-584
Available download formats
unknown
Dataset updated
Feb 5, 2022
License
https://data.norge.no/nlod/en/2.0/
Description
The dataset contains Internet addresses for government agencies and municipalities. It is intended to be used together with the data set of the units of public administration. This dataset is part of several data sets about public enterprises. The data sets are referred to as the agency base and were previously on Norge.no. They contain an overview of public enterprises, i.e. government agencies and enterprises’ central, regional and local units, county municipalities and municipalities. Data sets are not updated.

The data sets contain information about the name of the enterprise, visiting address, postal address, telephone number, e-mail address, web address (URL), map coordinates (position), coverage (which municipalities the business covers), organisation number, overarching activity, type of organisation, type of affiliation (the way in which an enterprise is linked to the executive government) and quality assessments of the website. Look up the keyword/tag agency base to see the other datasets.

The establishment base is closed and is no longer maintained by the Directorate of Digitalisation (formerly Difi). The datasets were last updated in January 2012. Note that this does not mean that all data was updated in January 2012, but that the last changes were made at that time.

Reference to the source

When using this dataset, we ask that the source be referred to as follows (cf. the NLOD license): The service is based on open data sets from the Directorate of Digitalisation and is subject to the Norwegian License for Public Data (NLOD). The data was last updated in 2012 and is no longer maintained by the Directorate of Digitalisation.
Common industries for malicious URL redirections South Korea H2 2021
statista.com
Updated Jul 18, 2025
Cite
Statista (2025). Common industries for malicious URL redirections South Korea H2 2021 [Dataset]. https://www.statista.com/statistics/1311491/south-korea-malicious-url-redirect-common-industries/
Dataset updated
Jul 18, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Area covered
South Korea
Description
In the second half of 2021, websites regarding manufacturing were the most common websites to be targeted by malicious URL redirections, with ** percent of detected cases being found on these sites. Although manufacturing websites have been a common target for malware attacks before, finds on these sites have largely increased compared to the first half of the year, which recorded around ** percent of cases redirecting through that industry.
Seattle Parks and Recreation GIS Map Layer Web Services URL - Community...
catalog.data.gov
data.seattle.gov
Updated Jan 31, 2025
Cite
data.seattle.gov (2025). Seattle Parks and Recreation GIS Map Layer Web Services URL - Community Centers [Dataset]. https://catalog.data.gov/dataset/seattle-parks-and-recreation-gis-map-layer-web-services-url-community-centers-a654a
Dataset updated
Jan 31, 2025
Dataset provided by
data.seattle.gov
Area covered
Seattle
Description
Seattle Parks and Recreation ArcGIS park feature map layer web services are hosted on Seattle Public Utilities' ArcGIS server. This web services URL provides a live, read-only data connection to the Seattle Parks and Recreation Community Centers dataset.
URL Shortening Services Report
datainsightsmarket.com
doc, pdf, ppt
Updated Feb 1, 2025
Cite
Data Insights Market (2025). URL Shortening Services Report [Dataset]. https://www.datainsightsmarket.com/reports/url-shortening-services-1445675
Available download formats
ppt, doc, pdf
Dataset updated
Feb 1, 2025
Dataset authored and provided by
Data Insights Market
License
https://www.datainsightsmarket.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The global URL shortening services market is projected to expand at a CAGR of xx% during the forecast period (2023-2033). The market growth is primarily attributed to the rising adoption of social media, e-commerce, and mobile applications, which require shortened URLs to share content. Additionally, the increasing prevalence of smartphones and tablets is fueling the demand for convenient and efficient ways to access information, further driving the growth of URL shortening services. The key players in the URL shortening services market include TinyURL, Bit.ly, Ff.im, Is.gd, Twurl.nl, Clkin, CloudApp, Droplr, Geniuslink, Rebrandly, Short.com, Shortswitch, Dwz, and CMCC. These companies offer a range of services, including URL shortening, link tracking, and analytics, to meet the diverse needs of their customers. The market is highly competitive, with new entrants emerging regularly. However, the established players have a strong brand presence and a loyal customer base, which gives them a competitive edge.