https://whoisdatacenter.com/terms-of-use/
Explore the historical Whois records related to jdb-url.com (Domain). Get insights into ownership history and changes over time.
https://whoisdatacenter.com/terms-of-use/
Explore the historical Whois records related to free-tiny-url.com (Domain). Get insights into ownership history and changes over time.
Model Card: URL Classifications Dataset
Dataset Summary
The URL Classifications Dataset is a collection of URL classifications for PDF documents, primarily derived from the SafeDocs corpus. It contains multiple CSV files with different subsets of classifications, including both raw and processed data.
Supported Tasks
This dataset supports the following tasks:
Text Classification
URL-based Document Classification
PDF Content Inference
Languages
The… See the full description on the dataset page: https://huggingface.co/datasets/snats/url-classifications.
The "Phishing Data" dataset is a comprehensive collection of information specifically curated for analyzing and understanding phishing attacks. Phishing attacks involve malicious attempts to deceive individuals or organizations into disclosing sensitive information such as passwords or credit card details. This dataset comprises 18 distinct features that offer valuable insights into the characteristics of phishing attempts. These features include the URL of the website being analyzed, the length of the URL, the use of URL shortening services, the presence of the "@" symbol, the presence of redirection using "//", the presence of prefixes or suffixes in the URL, the number of subdomains, the usage of secure connection protocols (HTTPS), the length of time since domain registration, the presence of a favicon, the presence of HTTP or HTTPS tokens in the domain name, the URL of requested external resources, the presence of anchors in the URL, the number of hyperlinks in HTML tags, the server form handler used, the submission of data to email addresses, abnormal URL patterns, and estimated website traffic or popularity. Together, these features enable the analysis and detection of phishing attempts in the "Phishing Data" dataset, aiding in the development of models and algorithms to combat phishing attacks.
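A few of the URL-level features listed above can be computed from the URL string alone. The following is a minimal sketch, not the dataset's actual extraction code: the function name, the feature names, and the short list of shortener hosts are all assumptions made for illustration, and the subdomain count is a crude dot count that assumes a single-label TLD.

```python
from urllib.parse import urlparse

# Hypothetical, non-exhaustive list of shortener hosts (not taken from the dataset).
SHORTENER_HOSTS = {"bit.ly", "tinyurl.com", "is.gd", "t.co"}


def lexical_features(url: str) -> dict:
    """Compute a handful of string-based phishing indicators for one URL."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    # Ignore a leading "www." so it is not counted as a subdomain.
    bare_host = host[4:] if host.startswith("www.") else host
    # Everything after the scheme separator, used to spot an extra "//".
    after_scheme = url.split("://", 1)[-1]
    return {
        "url_length": len(url),
        "uses_shortener": bare_host in SHORTENER_HOSTS,
        "has_at_symbol": "@" in url,
        "has_double_slash_redirect": "//" in after_scheme,
        "has_hyphen_in_host": "-" in host,
        # Crude estimate: dots in the host minus the one before the TLD.
        "subdomain_count": max(bare_host.count(".") - 1, 0),
        "uses_https": parsed.scheme.lower() == "https",
        "http_token_in_host": "http" in host,
    }
```

Features such as domain age, favicon presence, or traffic rank need page content or external lookups and are outside the scope of this sketch.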
https://whoisdatacenter.com/terms-of-use/
Explore the historical Whois records related to the-duke-of-url.com (Domain). Get insights into ownership history and changes over time.
Licence Ouverte / Open Licence 1.0 https://www.etalab.gouv.fr/wp-content/uploads/2014/05/Open_Licence.pdf
License information was derived automatically
Visit statistics for the pages of the City of Brussels website, giving the address of the page (following www.bruxelles.be), the number of pages viewed, the number of unique visits, and the average time spent on the page. Source: Google Analytics. Each page of the City of Brussels website is identified by 4 digits appearing in its address. Searching for these 4 digits returns the statistics for the page in question.
Licence Ouverte / Open Licence 1.0 https://www.etalab.gouv.fr/wp-content/uploads/2014/05/Open_Licence.pdf
License information was derived automatically
Visitor statistics of the pages of the website of the City of Brussels (2014) with the address of the page (following www.brussels.be), the number of pages viewed, the number of unique visits. Source: Google Analytics.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Description
The provided dataset includes 11430 URLs with 87 extracted features. The dataset is designed to be used as a benchmark for machine-learning-based phishing detection systems. The dataset is balanced: it contains exactly 50% phishing and 50% legitimate URLs.
Features are from three different classes:
56 extracted from the structure and syntax of URLs
24 extracted from the content of their corresponding pages
7 extracted by querying external services
The… See the full description on the dataset page: https://huggingface.co/datasets/pirocheto/phishing-url.
This is a URL classification problem. The dataset consists of 17 features: 16 are independent, and statistical_report is the dependent feature. In statistical_report, 0 means benign (good) and 1 means malicious (bad).
Dataset reference link: https://www.kaggle.com/datasets/taruntiwarihp/phishing-site-urls/data
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PhiUSIIL Phishing URL Dataset is a substantial dataset comprising 134,850 legitimate and 100,945 phishing URLs. Most of the URLs we analyzed while constructing the dataset are the latest URLs. Features are extracted from the source code of the webpage and URL. Features such as CharContinuationRate, URLTitleMatchScore, URLCharProb, and TLDLegitimateProb are derived from existing features.
Citation: Prasad, A., & Chandra, S. (2023). PhiUSIIL: A diverse security profile empowered phishing URL detection framework based on similarity index and incremental learning. Computers & Security, 103545. doi: https://doi.org/10.1016/j.cose.2023.103545
Common Crawl project has fascinated me ever since I learned about it. It provides a large number of data formats and presents challenges across skill and interest areas. I am particularly interested in URL analysis for applications such as typosquatting, malicious URLs, and just about anything interesting that can be done with domain names.
I have sampled 1% of the domains from the Common Crawl Index dataset that is available on AWS in Parquet format. You can read more about how I extracted this dataset @ https://harshsinghal.dev/create-a-url-dataset-for-nlp/
Thanks a ton to the folks at https://commoncrawl.org/ for making this immensely valuable resource available to the world for free. Please find their Terms of Use here.
My interests are in working with string similarity functions and I continue to find scalable ways of doing this. I wrote about using a Postgres extension to compute string distances and used Common Crawl URL domains as the input dataset (you can read more @ https://harshsinghal.dev/postgres-text-similarity-with-commoncrawl-domains/).
I am also interested in identifying fraudulent domains and understanding malicious URL patterns.
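To make the string-distance idea concrete, here is a small sketch of Levenshtein (edit) distance applied to domain names. It is a plain-Python stand-in written for illustration, not the Postgres extension mentioned above, and the example domains in the usage note are made up.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # Keep the shorter string in the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]
```

A domain such as examp1e.com sits at distance 1 from example.com, so a small distance to a well-known domain can serve as one rough typosquatting signal. Comparing every pair this way is quadratic in the number of domains, which is why a database-side or indexed approach is attractive at Common Crawl scale.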
https://whoisdatacenter.com/terms-of-use/
Explore historical ownership and registration records by performing a reverse Whois lookup for the email address weychi@ms12.url.com.tw.
https://webtechsurvey.com/terms
A complete list of live websites using the URL Rewrite technology, compiled through global website indexing conducted by WebTechSurvey.
https://data.gov.tw/license
Mobile Website URL Information on the World Wide Web
https://spdx.org/licenses/CC0-1.0.html
The Repository Analytics and Metrics Portal (RAMP) is a web service that aggregates use and performance data of institutional repositories (IR). The data are a subset of data from RAMP (http://rampanalytics.org), consisting of data from all participating repositories for the calendar year 2017. For a description of the data collection, processing, and output methods, please see the "methods" section below.
Methods
RAMP Data Documentation – January 1, 2017 through August 18, 2018
Data Collection
RAMP data are downloaded for participating IR from Google Search Console (GSC) via the Search Console API. The data consist of aggregated information about IR pages which appeared in search result pages (SERP) within Google properties (including web search and Google Scholar).
Data from January 1, 2017 through August 18, 2018 were downloaded in one dataset per participating IR. The following fields were downloaded for each URL, with one row per URL:
url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property.
impressions: The number of times the URL appears within the SERP.
clicks: The number of clicks on a URL which took users to a page outside of the SERP.
clickThrough: Calculated as the number of clicks divided by the number of impressions.
position: The position of the URL within the SERP.
country: The country from which the corresponding search originated.
device: The device used for the search.
date: The date of the search.
Following the data processing described below, an additional field, citableContent, is added to the page-level data on ingest into RAMP.
Note that no personally identifiable information is downloaded by RAMP. Google does not make such information available.
More information about click-through rates, impressions, and position is available from Google's Search Console API documentation: https://developers.google.com/webmaster-tools/search-console-api-original/v3/searchanalytics/query and https://support.google.com/webmasters/answer/7042828?hl=en
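The clickThrough field defined above is a simple ratio of clicks to impressions. A minimal sketch follows; the function name is hypothetical, and returning 0.0 when there are no impressions is an assumption, since the source does not say how that case is handled.

```python
def click_through(clicks: int, impressions: int) -> float:
    """Clicks divided by impressions, as in the clickThrough field definition."""
    if impressions == 0:
        # Assumption: treat "never shown" as a rate of zero instead of failing.
        return 0.0
    return clicks / impressions
```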
Data Processing
Upon download from GSC, data are processed to identify URLs that point to citable content. Citable content is defined within RAMP as any URL which points to any type of non-HTML content file (PDF, CSV, etc.). As part of the daily download of statistics from Google Search Console (GSC), URLs are analyzed to determine whether they point to HTML pages or actual content files. URLs that point to content files are flagged as "citable content." In addition to the fields downloaded from GSC described above, following this brief analysis one more field, citableContent, is added to the data which records whether each URL in the GSC data points to citable content. Possible values for the citableContent field are "Yes" and "No."
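One plausible way to flag citable content is to look at the file extension in the URL path. The sketch below is an illustration only, not RAMP's actual implementation: the function name is hypothetical, and the extension set contains just the two types the text names (PDF and CSV), so a real list would presumably be longer.

```python
import posixpath
from urllib.parse import urlparse

# Assumed extension list; the source names only PDF and CSV, followed by "etc."
CONTENT_EXTENSIONS = {".pdf", ".csv"}


def citable_content(url: str) -> str:
    """Return "Yes" if the URL path ends in a known content-file extension."""
    path = urlparse(url).path  # drops the query string and fragment
    extension = posixpath.splitext(path)[1].lower()
    return "Yes" if extension in CONTENT_EXTENSIONS else "No"
```

Under these assumptions a URL such as https://repo.example.edu/bitstream/1/2/thesis.pdf?sequence=1 is flagged "Yes", while an HTML landing page such as https://repo.example.edu/handle/1/2 is flagged "No".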
Processed data are then saved in a series of Elasticsearch indices. From January 1, 2017, through August 18, 2018, RAMP stored data in one index per participating IR.
About Citable Content Downloads
Data visualizations and aggregations in RAMP dashboards present information about citable content downloads, or CCD. As a measure of use of institutional repository content, CCD represent click activity on IR content that may correspond to research use.
CCD information is summary data calculated on the fly within the RAMP web application. As noted above, data provided by GSC include whether and how many times a URL was clicked by users. Within RAMP, a "click" is counted as a potential download, so a CCD is calculated as the sum of clicks on pages/URLs that are determined to point to citable content (as defined above).
For any specified date range, the steps to calculate CCD are:
Filter data to only include rows where "citableContent" is set to "Yes."
Sum the value of the "clicks" field on these rows.
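The two steps above can be sketched in a few lines. This is a hedged illustration, not the RAMP web application's code: the function name and the sample rows are made up, and the date field is assumed to be in ISO format (YYYY-MM-DD).

```python
import csv
import io
from datetime import date


def citable_content_downloads(rows, start: date, end: date) -> int:
    """Sum clicks on rows flagged as citable content within [start, end]."""
    total = 0
    for row in rows:
        day = date.fromisoformat(row["date"])  # assumes ISO-formatted dates
        if start <= day <= end and row["citableContent"] == "Yes":
            total += int(row["clicks"])
    return total


# Made-up sample data using field names from the documentation.
SAMPLE_CSV = """url,clicks,date,citableContent
https://repo.example.edu/a.pdf,3,2017-01-02,Yes
https://repo.example.edu/handle/1,5,2017-01-02,No
https://repo.example.edu/b.csv,2,2017-01-15,Yes
https://repo.example.edu/a.pdf,4,2017-02-01,Yes
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
ccd_january = citable_content_downloads(rows, date(2017, 1, 1), date(2017, 1, 31))
```

With this sample, only the two January rows flagged "Yes" contribute, so ccd_january is 5.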
Output to CSV
Published RAMP data are exported from the production Elasticsearch instance and converted to CSV format. The CSV data consist of one "row" for each page or URL from a specific IR which appeared in search result pages (SERP) within Google properties as described above.
The data in these CSV files include the following fields:
url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property.
impressions: The number of times the URL appears within the SERP.
clicks: The number of clicks on a URL which took users to a page outside of the SERP.
clickThrough: Calculated as the number of clicks divided by the number of impressions.
position: The position of the URL within the SERP.
country: The country from which the corresponding search originated.
device: The device used for the search.
date: The date of the search.
citableContent: Whether or not the URL points to a content file (ending with pdf, csv, etc.) rather than HTML wrapper pages. Possible values are Yes or No.
index: The Elasticsearch index corresponding to page click data for a single IR.
repository_id: This is a human readable alias for the index and identifies the participating repository corresponding to each row. As RAMP has undergone platform and version migrations over time, index names as defined for the index field have not remained consistent. That is, a single participating repository may have multiple corresponding Elasticsearch index names over time. The repository_id is a canonical identifier that has been added to the data to provide an identifier that can be used to reference a single participating repository across all datasets. Filtering and aggregation for individual repositories or groups of repositories should be done using this field.
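As a sketch of why repository_id is the recommended grouping key, the example below totals clicks per repository even when the index name changes between rows. The function name, the index names, and the repository identifiers are invented for illustration.

```python
from collections import defaultdict


def clicks_by_repository(rows) -> dict:
    """Total the clicks field per repository_id, ignoring the index name."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["repository_id"]] += int(row["clicks"])
    return dict(totals)


# Hypothetical rows: the same repository appears under two index names.
sample_rows = [
    {"index": "repo_a_v1", "repository_id": "repo_a", "clicks": "3"},
    {"index": "repo_a_v2", "repository_id": "repo_a", "clicks": "4"},
    {"index": "repo_b_v1", "repository_id": "repo_b", "clicks": "1"},
]
totals = clicks_by_repository(sample_rows)
```

Grouping the same rows by index would split repo_a into two partial totals, which is the inconsistency the canonical identifier is meant to avoid.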
Filenames for files containing these data follow the format 2017-01_RAMP_all.csv. Using this example, the file 2017-01_RAMP_all.csv contains all data for all RAMP participating IR for the month of January, 2017.
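The filename convention can be parsed to recover the year and month a file covers. This is a minimal sketch assuming every file follows the YYYY-MM_RAMP_all.csv pattern shown in the example; the function name is hypothetical.

```python
import re

# Pattern inferred from the single example filename 2017-01_RAMP_all.csv.
FILENAME_PATTERN = re.compile(r"^(\d{4})-(\d{2})_RAMP_all\.csv$")


def parse_ramp_filename(name: str) -> tuple:
    """Return (year, month) for a filename such as 2017-01_RAMP_all.csv."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected filename: {name!r}")
    year, month = int(match.group(1)), int(match.group(2))
    if not 1 <= month <= 12:
        raise ValueError(f"invalid month in filename: {name!r}")
    return year, month
```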
References
Google, Inc. (2021). Search Console APIs. Retrieved from https://developers.google.com/webmaster-tools/search-console-api-original.
https://data.norge.no/nlod/en/2.0/
The dataset contains Internet addresses for government agencies and municipalities. It is intended to be used together with the dataset of the units of public administration. This dataset is one of several datasets about public enterprises. Together they are referred to as the agency base and were previously published on Norge.no. They contain an overview of public enterprises, i.e. government agencies and enterprises’ central, regional and local units, county municipalities and municipalities. The datasets are not updated.
The datasets contain information about the name of the enterprise, visiting address, postal address, telephone number, e-mail address, web address (URL), map coordinates (position), coverage (which municipalities the enterprise covers), organisation number, overarching activity, type of organisation, type of affiliation (the way in which an enterprise is linked to the executive government) and quality assessments of the website. Search for the keyword/tag "agency base" to see the other datasets.
The agency base is closed and is no longer maintained by the Directorate of Digitalisation (formerly Difi). The datasets were last updated in January 2012. Note that this does not mean that all data was updated in January 2012, but that the last changes were made at that time.
Reference to the source
When using this dataset, we ask that the source be referred to as follows (cf. the NLOD license): The service is based on open datasets from the Directorate of Digitalisation and is subject to the Norwegian License for Public Data (NLOD). The data was last updated in 2012 and is no longer maintained by the Directorate of Digitalisation.
In the second half of 2021, manufacturing websites were the most common target of malicious URL redirections, with ** percent of detected cases found on these sites. Although manufacturing websites have been a common target for malware attacks before, detections on these sites increased substantially compared to the first half of the year, which recorded around ** percent of cases redirecting through that industry.
Seattle Parks and Recreation ArcGIS park feature map layer web services are hosted on Seattle Public Utilities' ArcGIS server. This web service URL provides a live, read-only data connection to the Seattle Parks and Recreation Community Centers dataset.
https://www.datainsightsmarket.com/privacy-policy
The global URL shortening services market is projected to expand at a CAGR of xx% during the forecast period (2023-2033). The market growth is primarily attributed to the rising adoption of social media, e-commerce, and mobile applications, which require shortened URLs to share content. Additionally, the increasing prevalence of smartphones and tablets is fueling the demand for convenient and efficient ways to access information, further driving the growth of URL shortening services. The key players in the URL shortening services market include TinyURL, Bit.ly, Ff.im, Is.gd, Twurl.nl, Clkin, CloudApp, Droplr, Geniuslink, Rebrandly, Short.com, Shortswitch, Dwz, and CMCC. These companies offer a range of services, including URL shortening, link tracking, and analytics, to meet the diverse needs of their customers. The market is highly competitive, with new entrants emerging regularly. However, the established players have a strong brand presence and a loyal customer base, which gives them a competitive edge.