50 datasets found

URL Filtering Market Research Report 2033
dataintelo.com
csv, pdf, pptx
Updated Oct 1, 2025
Cite
Dataintelo (2025). URL Filtering Market Research Report 2033 [Dataset]. https://dataintelo.com/report/url-filtering-market
Explore at:
Available download formats: pptx, csv, pdf
Dataset updated
Oct 1, 2025
Dataset authored and provided by
Dataintelo
License
https://dataintelo.com/privacy-and-policy
Time period covered
2024 - 2032
Area covered
Global
Description
URL Filtering Market Outlook

According to our latest research, the global URL Filtering market size reached USD 2.45 billion in 2024, reflecting robust growth and heightened demand for advanced cybersecurity solutions. The market is expected to expand at a CAGR of 16.2% from 2025 to 2033, with projections estimating the market will achieve a value of USD 7.09 billion by 2033. This remarkable growth is driven by the escalating incidence of cyber threats, increasing digital transformation initiatives, and rising regulatory compliance requirements across industries worldwide.
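
The headline figures above can be cross-checked with the standard compound-growth formula. The sketch below is only an arithmetic aid: the function names are illustrative, and treating 2024 to 2033 as nine compounding periods is an assumption, since the report does not state its compounding convention. If the computed values differ from the stated ones, the report may be using a different base year or convention.

```python
def project_value(base: float, cagr: float, years: int) -> float:
    """Value reached after compounding `base` at annual rate `cagr` for `years` periods."""
    return base * (1.0 + cagr) ** years


def implied_cagr(base: float, final: float, years: int) -> float:
    """Constant annual growth rate that turns `base` into `final` over `years` periods."""
    return (final / base) ** (1.0 / years) - 1.0


# Figures quoted in the report summary (USD billion).
BASE_2024 = 2.45
TARGET_2033 = 7.09
STATED_CAGR = 0.162
YEARS = 2033 - 2024  # assumption: nine compounding periods

if __name__ == "__main__":
    projected = project_value(BASE_2024, STATED_CAGR, YEARS)
    rate = implied_cagr(BASE_2024, TARGET_2033, YEARS)
    print(f"Value implied by the stated CAGR: {projected:.2f}")
    print(f"CAGR implied by the stated end value: {rate:.1%}")
```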

One of the foremost growth factors propelling the URL Filtering market is the exponential rise in cyberattacks, phishing attempts, and malicious web content, which has made robust web security a non-negotiable priority for organizations. As internet usage surges and remote work becomes mainstream, enterprises are increasingly vulnerable to sophisticated threats targeting web traffic. URL Filtering solutions are at the forefront of defending businesses by blocking access to harmful or inappropriate websites, thereby reducing the risk of data breaches and ensuring employee productivity. Furthermore, the proliferation of cloud-based applications and the Internet of Things (IoT) has expanded the attack surface, compelling organizations to invest in advanced filtering technologies that can adapt to evolving threat landscapes.

Another significant driver is the tightening regulatory environment, especially in sectors such as BFSI, healthcare, and government, where data protection and privacy are paramount. Stringent regulations like GDPR, HIPAA, and industry-specific compliance mandates are compelling organizations to implement comprehensive web filtering policies to safeguard sensitive information and avoid hefty penalties. As a result, there is a growing preference for URL Filtering solutions that offer granular policy controls, real-time analytics, and seamless integration with existing security frameworks. This trend is further amplified by the increasing adoption of Bring Your Own Device (BYOD) policies and the need to secure endpoints across diverse networks.

The rapid advancement of artificial intelligence and machine learning technologies is also transforming the URL Filtering market. Modern solutions are leveraging AI-driven algorithms to detect and categorize new and emerging threats in real-time, significantly enhancing the accuracy and efficiency of filtering mechanisms. This technological evolution not only minimizes false positives but also enables proactive threat mitigation. Additionally, the integration of URL Filtering with broader security architectures such as Secure Web Gateways (SWG) and Security Information and Event Management (SIEM) systems is creating new growth avenues, as organizations seek holistic and scalable security solutions to address multi-vector threats.

Regionally, North America continues to dominate the URL Filtering market, accounting for the largest revenue share in 2024, followed closely by Europe and the Asia Pacific. The region’s leadership is attributed to the high adoption rate of advanced cybersecurity solutions, the presence of major technology providers, and a robust regulatory landscape. Meanwhile, Asia Pacific is witnessing the fastest growth, fueled by rapid digitalization, increasing internet penetration, and rising awareness about cybersecurity risks among enterprises of all sizes. As organizations globally recognize the strategic importance of web security, the demand for sophisticated URL Filtering solutions is set to accelerate across both developed and emerging markets.

Component Analysis

The URL Filtering market is segmented by component into Software, Hardware, and Services, each playing a pivotal role in shaping the overall market landscape. Software solutions represent the largest share, driven by their flexibility, scalability, and ease of deployment across diverse IT environments. Organizations are increasingly favoring cloud-based and on-premises software offerings that provide real-time filtering, dynamic categorization, and comprehensive policy management. The software segment is further augmented by the integration of advanced analytics and AI capabilities, enabling more accurate detection of malicious URLs and adaptive threat response. As cyber threats become more sophisticated, software vendors are continuously enhancing their offerings to include features such as SSL inspe
URL Classification Dataset for Malicious Traffic
kaggle.com
zip
Updated May 23, 2023
Cite
Jacob (2023). URL Classification Dataset for Malicious Traffic [Dataset]. https://www.kaggle.com/datasets/bobaaayoung/url-dataset/code
Explore at:
Available download formats: zip (2548282 bytes)
Dataset updated
May 23, 2023
Authors
Jacob
Description
URL Classification Dataset for Malicious Traffic Detection

Dataset Overview

The Internet is a vast space that, while hosting a plethora of resources, also serves as a breeding ground for malicious activities. URLs are often leveraged as a primary tool by adversaries to conduct various types of cyber attacks. In response, the cybersecurity community has developed numerous techniques, with URL blacklisting being a prevalent method. However, this reactive approach falls short against the constantly evolving landscape of new malicious URLs.

This dataset aims to contribute to the proactive detection and categorization of URLs by analyzing their lexical features. It facilitates the development and testing of models capable of distinguishing between benign and malicious (malware) URLs. The dataset is divided into three main parts: training, validation, and testing sets, encompassing a broad spectrum of data points for comprehensive analysis.
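
As an illustration of what "lexical features" can mean in practice, the sketch below derives a few simple string-level measurements from a URL using only the Python standard library. The feature set and the function names are assumptions chosen for illustration; the dataset description does not say which lexical features its author had in mind.

```python
import math
import re
from collections import Counter
from urllib.parse import urlsplit


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits (0.0 for an empty string)."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def lexical_features(url: str) -> dict:
    """Compute a small, illustrative set of lexical features for one URL."""
    # Add a scheme when it is missing so that urlsplit can locate the host.
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.hostname or ""
    # Tokens are maximal runs of letters and digits.
    tokens = [t for t in re.split(r"[^A-Za-z0-9]+", url) if t]
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_tokens": len(tokens),
        "longest_token": max((len(t) for t in tokens), default=0),
        "entropy": shannon_entropy(url),
    }
```

Each URL becomes one row of numeric values, which can then be fed to any tabular classifier.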

Dataset Composition

train.csv: Contains 79,635 entries, a mix of benign and malware URLs, intended for training machine learning models.

valid.csv: Comprises 9,997 entries for model validation purposes, allowing for the fine-tuning of parameters and the assessment of preliminary performance.

test.csv: Includes 9,988 entries designed for the final evaluation of the model's ability to generalize to unseen data.
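
The share of each split can be worked out directly from the entry counts listed above. The snippet below is a small arithmetic sketch; the dictionary and function names are illustrative.

```python
# Entry counts as stated in the dataset description.
SPLITS = {"train.csv": 79_635, "valid.csv": 9_997, "test.csv": 9_988}


def split_shares(sizes: dict) -> dict:
    """Return each split's fraction of the total number of entries."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


if __name__ == "__main__":
    for name, share in split_shares(SPLITS).items():
        print(f"{name}: {share:.1%}")
```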

The URLs are categorized into two classes:

Benign (Good): URLs that are deemed safe and do not host any form of malicious content.

Malware (Bad): URLs associated with malicious websites, including those that distribute malware, host phishing attempts, or carry out other harmful activities.

Source of Data

The benign URLs were meticulously collected from Alexa's top-ranked websites, ensuring a representation of commonly visited and trusted domains. On the other hand, the malware URLs were curated from various sources known for listing active and dangerous URLs. Each URL underwent rigorous verification to ensure its correct classification, providing a reliable basis for model training and testing.

Application and Importance

This dataset is pivotal for researchers and cybersecurity practitioners aiming to devise effective strategies for early detection of malicious URLs. By employing lexical analysis and machine learning techniques, it is possible to identify potentially harmful URLs before they can impact users. Such proactive measures are essential in the ongoing battle against cyber threats, enhancing the overall security posture of online environments.
Website Classification
kaggle.com
zip
Updated May 5, 2021
Cite
Hetul Mehta (2021). Website Classification [Dataset]. https://www.kaggle.com/hetulmehta/website-classification
Explore at:
Available download formats: zip (2094838 bytes)
Dataset updated
May 5, 2021
Authors
Hetul Mehta
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

This dataset was created by scraping different websites and then classifying them into different categories based on the extracted text.

Content

Below are the values each column has. The column names are pretty self-explanatory.

website_url: URL link of the website.

cleaned_website_text: the cleaned text content extracted from the
Website Classification Dataset
cubig.ai
zip
Updated Feb 25, 2025
Cite
CUBIG (2025). Website Classification Dataset [Dataset]. https://cubig.ai/store/products/138/website-classification-dataset
Explore at:
Available download formats: zip
Dataset updated
Feb 25, 2025
Dataset authored and provided by
CUBIG
License
https://cubig.ai/store/terms-of-service
Measurement technique
Privacy-preserving data transformation via differential privacy, Synthetic data generation using AI techniques for model training
Description
1) Data Introduction
• The Website dataset is designed to facilitate the development of models for URL-based website classification.

2) Data Utilization
(1) Characteristics of the Website data:
• This dataset is crucial for training models that can automatically classify websites based on their URL structures.
(2) The Website data can be used for:
• Enhancing cybersecurity measures by detecting malicious websites.
• Improving content filtering systems for safer browsing experiences.
website-categorization-dataset
huggingface.co
Cite
Daniel, website-categorization-dataset [Dataset]. https://huggingface.co/datasets/Waffando/website-categorization-dataset
Explore at:
Authors
Daniel
Description
Waffando/website-categorization-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
PhishLegit Dataset: 10K URLs
kaggle.com
zip
Updated Jan 14, 2025
Cite
Evil Spirit05 (2025). PhishLegit Dataset: 10K URLs [Dataset]. https://www.kaggle.com/datasets/evilspirit05/phishing-data
Explore at:
Available download formats: zip (100207 bytes)
Dataset updated
Jan 14, 2025
Authors
Evil Spirit05
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
This dataset provides a comprehensive set of features for analyzing and detecting phishing URLs. Each feature is carefully crafted to highlight attributes that distinguish legitimate URLs from potentially malicious ones.

Features and Their Descriptions

Domain

Type: Categorical (String)

Description: Represents the domain name of the website being analyzed.

Have_IP

Type: Binary (0/1)

Description: Indicates if the URL uses an IP address instead of a domain name.

0: No IP address (likely legitimate).

1: Uses an IP address (potentially suspicious).

Have_At

Type: Binary (0/1)

Description: Checks for the presence of the "@" symbol in the URL.

0: No "@" symbol (likely legitimate).

1: "@" symbol present (potentially suspicious).

URL_Length

Type: Numerical (Integer)

Description: Indicates the length of the URL.

Suggested Update: Classify as 0 for short and 1 for long URLs.

Redirection

Type: Binary (0/1)

Description: Detects if the URL involves redirections.

0: No redirections.

1: Redirections present (potentially suspicious).

https_Domain

Type: Binary (0/1)

Description: Checks if the domain part of the URL starts with "https."

0: Domain does not use "https" (less secure).

1: Domain uses "https" (more secure).

TinyURL

Type: Binary (0/1)

Description: Indicates if the URL uses a shortening service like TinyURL.

0: Not shortened.

1: Shortened (potentially suspicious).

Prefix/Suffix

Type: Binary (0/1)

Description: Checks for special characters (e.g., "-") in the domain name.

0: No special characters.

1: Special characters detected (potentially suspicious).

DNS_Record

Type: Binary (0/1)

Description: Indicates if the domain has a valid DNS record.

0: No valid DNS record (potentially suspicious).

1: Valid DNS record exists (likely legitimate).

Web_Traffic

Type: Binary (0/1)

Description: Evaluates web traffic rank of the domain.

0: Low or no traffic (potentially suspicious).

1: Significant traffic (likely legitimate).

Domain_Age

Type: Binary (0/1)

Description: Indicates whether the domain is relatively new.

0: Age < 6 months (potentially suspicious).

1: Age > 6 months (likely legitimate).

Domain_End

Type: Binary (0/1)

Description: Checks proximity of domain's expiration date.

0: Near expiration (potentially suspicious).

1: Far from expiration (likely legitimate).

iFrame

Type: Binary (0/1)

Description: Detects iFrames in the webpage content.

0: No iFrames detected (likely legitimate).

1: iFrames detected (potentially suspicious).

Mouse_Over

Type: Binary (0/1)

Description: Checks for mouse-over tricks on the webpage.

0: No tricks detected (likely legitimate).

1: Tricks detected (potentially suspicious).

Right_Click

Type: Binary (0/1)

Description: Indicates if right-click functionality is disabled.

0: Right-click enabled (likely legitimate).

1: Right-click disabled (potentially suspicious).

Web_Forwards

Type: Binary (0/1)

Description: Checks for multiple web forwards in the URL.

0: No web forwards.

1: Web forwards detected (potentially suspicious).

Label

Type: Binary (0/1)

Description: Target variable indicating URL classification.

0: Legitimate website.

1: Phishing or malicious website.

Additional Feature: URL_Depth

Definition: Number of forward slashes (/) in the URL path after the domain name.

Observations:

Frequent Depths: Depths of 2, 3, and 1 are most common.

Rare Depths: Very deep URLs (e.g., depth of 18, 17, 20) are rare and may indicate autogenerated or suspicious URLs.
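
Several of the features described above depend only on the URL string, so they can be approximated without fetching the page. The sketch below is one possible reading of those definitions, not the dataset author's extraction code: the length threshold, the shortener list, and the function names are hypothetical, and features that need page content or external lookups (DNS_Record, Web_Traffic, Domain_Age, Domain_End, iFrame, Mouse_Over, Right_Click, Web_Forwards) are left out.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical cut-off between "short" and "long"; the description gives no value.
LONG_URL_THRESHOLD = 54
# Illustrative, incomplete list of URL-shortening hosts.
SHORTENER_HOSTS = {"tinyurl.com", "bit.ly", "goo.gl", "t.co", "ow.ly"}


def _is_ip(host: str) -> bool:
    """True when the host part is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def url_flags(url: str) -> dict:
    """Approximate the URL-only features for a single URL string."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Everything after the scheme separator, used to spot an extra "//".
    after_scheme = url.split("://", 1)[-1]
    return {
        "Have_IP": int(_is_ip(host)),
        "Have_At": int("@" in url),
        "URL_Length": int(len(url) >= LONG_URL_THRESHOLD),
        "URL_Depth": parts.path.count("/"),  # slashes in the path after the domain
        "Redirection": int("//" in after_scheme),
        "https_Domain": int("https" in host),
        "TinyURL": int(host in SHORTENER_HOSTS),
        "Prefix/Suffix": int("-" in host),
    }
```

The returned keys reuse the feature names from the description so that the output can be compared column by column with the dataset.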
Phishing URL Content Dataset
kaggle.com
zip
Updated Nov 25, 2024
Cite
Aaditey Pillai (2024). Phishing URL Content Dataset [Dataset]. https://www.kaggle.com/datasets/aaditeypillai/phishing-website-content-dataset
Explore at:
Available download formats: zip (62701 bytes)
Dataset updated
Nov 25, 2024
Authors
Aaditey Pillai
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Phishing URL Content Dataset

Executive Summary

Motivation:
Phishing attacks are one of the most significant cyber threats in today’s digital era, tricking users into divulging sensitive information like passwords, credit card numbers, and personal details. This dataset aims to support research and development of machine learning models that can classify URLs as phishing or benign.

Applications:
- Building robust phishing detection systems.
- Enhancing security measures in email filtering and web browsing.
- Training cybersecurity practitioners in identifying malicious URLs.

The dataset contains diverse features extracted from URL structures, HTML content, and website metadata, enabling deep insights into phishing behavior patterns.

Description of Data

This dataset comprises two types of URLs:
1. Phishing URLs: Malicious URLs designed to deceive users.
2. Benign URLs: Legitimate URLs posing no harm to users.

Key Features:
- URL-based features: Domain, protocol type (HTTP/HTTPS), and IP-based links.
- Content-based features: Link density, iframe presence, external/internal links, and metadata.
- Certificate-based features: SSL/TLS details like validity period and organization.
- WHOIS data: Registration details like creation and expiration dates.

Statistics:
- Total Samples: 800 (400 phishing, 400 benign).
- Features: 22 including URL, domain, link density, and SSL attributes.

Power Analysis

To ensure statistical reliability, a power analysis was conducted to determine the minimum sample size required for binary classification with 22 features. Using a medium effect size (0.15), alpha = 0.05, and power = 0.80, the analysis indicated a minimum sample size of ~325 per class. Our dataset exceeds this requirement with 400 examples per class, ensuring robust model training.

Exploratory Data Analysis (EDA)

Insights from EDA:
- Distribution Plots: Histograms and density plots for numerical features like link density, URL length, and iframe counts.
- Bar Plots: Class distribution and protocol usage trends.
- Correlation Heatmap: Highlights relationships between numerical features to identify multicollinearity or strong patterns.
- Box Plots: For SSL certificate validity and URL lengths, comparing phishing versus benign URLs.

EDA visualizations are provided in the repository.

Link to Publicly Available Data and Code

Dataset: Phishing URL Dataset

Code Repository: GitHub - Phishing Detection

The repository contains the Python code used to extract features, conduct EDA, and build the dataset.

Ethics Statement

Phishing detection datasets must balance the need for security research with the risk of misuse. This dataset:
1. Protects User Privacy: No personally identifiable information is included.
2. Promotes Ethical Use: Intended solely for academic and research purposes.
3. Avoids Reinforcement of Bias: Balanced class distribution ensures fairness in training models.

Risks:
- Misuse of the dataset for creating more deceptive phishing attacks.
- Over-reliance on outdated features as phishing tactics evolve.

Researchers are encouraged to pair this dataset with continuous updates and contextual studies of real-world phishing.

Open Source License

This dataset is shared under the MIT License, allowing free use, modification, and distribution for academic and non-commercial purposes. License details can be found here.
Requests to delist website content from Google in Finland 2020-2024, by...
statista.com
Cite
Statista, Requests to delist website content from Google in Finland 2020-2024, by category [Dataset]. https://www.statista.com/statistics/1186759/requests-to-delist-website-content-from-google-in-finland-by-category/
Explore at:
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
Sep 30, 2020 - Apr 30, 2024
Area covered
Finland
Description
Between September 2020 and April 2024, over 22 percent of the content requested for delisting from search results in Finland fell under the category "name not found", while 21.7 percent concerned professional information.
Phishing URL Dataset (URL and Label)
kaggle.com
zip
Updated Jun 10, 2025
Cite
Marry Janety (2025). Phishing URL Dataset (URL and Label) [Dataset]. https://www.kaggle.com/datasets/marryjanety/phishing-url-dataset-url-and-label/versions/1
Explore at:
Available download formats: zip (4686865 bytes)
Dataset updated
Jun 10, 2025
Authors
Marry Janety
Description
This dataset contains a collection of URLs classified as phishing or benign (legitimate) based on trusted sources such as PhishTank, OpenPhish, and others. The dataset has been condensed into only three main columns for ease of distribution and learning:

url: Full URL address

label: URL category, i.e. phishing or benign

This dataset is suitable for use in research, introductory machine learning, and development of URL-based phishing detection models.
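
A first step with a file of this shape is usually to check how many rows fall into each class. The sketch below assumes a CSV with a header row containing the `url` and `label` columns named above; the function name is illustrative and the sample rows are made up for demonstration, not taken from the dataset.

```python
import csv
import io
from collections import Counter


def label_counts(csv_file, label_column: str = "label") -> Counter:
    """Count rows per class in a CSV that has a header row."""
    reader = csv.DictReader(csv_file)
    # Normalise case and whitespace so "Benign" and "benign " count together.
    return Counter(row[label_column].strip().lower() for row in reader)


if __name__ == "__main__":
    # Made-up rows standing in for the real file.
    sample = io.StringIO(
        "url,label\n"
        "http://example.com/,benign\n"
        "http://login.example.test/verify,phishing\n"
        "http://example.org/,benign\n"
    )
    print(label_counts(sample))
```

For the real file, the same function can be called with `open(path, newline="", encoding="utf-8")` in place of the in-memory sample.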
Requests to delist website content from Google in Norway 2020-2024, by...
statista.com
Updated Nov 28, 2025
Cite
Statista (2025). Requests to delist website content from Google in Norway 2020-2024, by category [Dataset]. https://www.statista.com/statistics/1186738/requests-to-delist-website-content-from-google-in-norway-by-category/
Explore at:
Dataset updated
Nov 28, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
Sep 30, 2020 - Apr 30, 2024
Area covered
Norway
Description
Between September 2020 and April 2024, 26 percent of the content evaluated for delisting from search results in Norway fell under the category "name not found", while nearly 22 percent of the content requested for delisting was categorized as having insufficient information.
Business Website Visits Data | USA Coverage | Industry/Context...
datarade.ai
.json, .csv, .txt
Updated Jan 7, 2024
Cite
BIGDBM (2024). Business Website Visits Data | USA Coverage | Industry/Context Categorisation - Training Set for ML and AI [Dataset]. https://datarade.ai/data-products/bigdbm-website-visits-data-with-industry-context-categorizati-bigdbm
Explore at:
Available download formats: .json, .csv, .txt
Dataset updated
Jan 7, 2024
Dataset authored and provided by
BIGDBM
Area covered
United States of America
Description
Website visit data with URLs, categories, timestamps, and anonymized unique device identifiers.

Over 50 million unique devices per day. 1 billion+ raw signals per month with historical raw data available.

This data can be combined with demographic and lifestyle data to provide a richer view of the anonymous users/devices.

Intended for training ML and AI models.
🌐Phishing URLs Dataset (450K+ LINKS)
kaggle.com
zip
Updated Jan 25, 2025
Cite
Hassaan Mustafavi (2025). 🌐Phishing URLs Dataset (450K+ LINKS) [Dataset]. https://www.kaggle.com/datasets/hassaanmustafavi/phishing-urls-dataset/versions/1
Explore at:
Available download formats: zip (7923499 bytes)
Dataset updated
Jan 25, 2025
Authors
Hassaan Mustafavi
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
🔖 Overview

Phishing is a significant cybersecurity threat that deceives users into divulging sensitive information through fake websites. This dataset contains a collection of URLs, categorized as either phishing or legitimate, and is designed to support machine learning, data analysis, and cybersecurity research.

📊 Dataset Summary

Total URLs: 45k+

Categories: Phishing, Legitimate

Features: url, type

File Format: CSV

📚 Columns Description

url: 🔗 The web address to be analyzed.

type: 🎯 The classification of the URL (phishing or legitimate).

📊 Key Features

✅ Balanced Dataset: A near-equal distribution of phishing and legitimate URLs.

🌐 Real-world Examples: URLs collected from various online sources.

🔒 Anonymized Data: No personally identifiable information is included.

⚙️ Ready for Analysis: Cleaned and pre-processed for immediate use in machine learning projects.

🎯 Potential Use Cases

Building and training phishing detection algorithms.

Evaluating the performance of URL classification models.

Conducting cybersecurity research and analysis.

Fine-Tuning Pre-Trained Models

🚀 Get Started!

Ready to dive into the world of cybersecurity? Explore the dataset, build models, and contribute to creating a safer online environment.

Happy coding! ✨
Most used website types in the United Kingdom (UK) 2019, by website category...
statista.com
Updated Feb 18, 2020
Cite
Statista (2020). Most used website types in the United Kingdom (UK) 2019, by website category [Dataset]. https://www.statista.com/statistics/1099782/most-popular-website-types-in-the-uk/
Explore at:
Dataset updated
Feb 18, 2020
Dataset authored and provided by
Statista (http://statista.com/)
Area covered
United Kingdom
Description
The most popular types of websites in the United Kingdom (UK) in 2019 were social media and chat websites, according to a recent survey. Other popular website categories were news, mail, and shopping.
CompuCrawl: Full database and code
dataverse.nl
Updated Sep 23, 2025
Cite
Richard Haans (2025). CompuCrawl: Full database and code [Dataset]. http://doi.org/10.34894/OBVAOY
Explore at:
Unique identifier
https://doi.org/10.34894/OBVAOY
Dataset updated
Sep 23, 2025
Dataset provided by
DataverseNL
Authors
Richard Haans
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This folder contains the full set of code and data for the CompuCrawl database. The database contains the archived websites of publicly traded North American firms listed in the Compustat database between 1996 and 2020, representing 11,277 firms with 86,303 firm/year observations and 1,617,675 webpages in the final cleaned and selected set.

The files are ordered by moment of use in the workflow. For example, the first file in the list is the input file for code files 01 and 02, which create and update the two tracking files "scrapedURLs.csv" and "URLs_1_deeper.csv" and which write HTML files to its folder. "HTML.zip" is the resultant folder, converted to .zip for ease of sharing. Code file 03 then reads this .zip file and is therefore below it in the ordering.

The full set of files, in order of use, is as follows:

Compustat_2021.xlsx: The input file containing the URLs to be scraped and their date range.

01 Collect frontpages.py: Python script scraping the front pages of the list of URLs and generating a list of URLs one page deeper in the domains.

URLs_1_deeper.csv: List of URLs one page deeper on the main domains.

02 Collect further pages.py: Python script scraping the list of URLs one page deeper in the domains.

scrapedURLs.csv: Tracking file containing all URLs that were accessed and their scraping status.

HTML.zip: Archived version of the set of individual HTML files.

03 Convert HTML to plaintext.py: Python script converting the individual HTML pages to plaintext.

TXT_uncleaned.zip: Archived version of the converted yet uncleaned plaintext files.

input_categorization_allpages.csv: Input file for classification of pages using GPT according to their HTML title and URL.

04 GPT application.py: Python script using OpenAI’s API to classify selected pages according to their HTML title and URL.

categorization_applied.csv: Output file containing classification of selected pages.

exclusion_list.xlsx: File containing three sheets: 'gvkeys' containing the GVKEYs of duplicate observations (that need to be excluded), 'pages' containing page IDs for pages that should be removed, and 'sentences' containing (sub-)sentences to be removed.

05 Clean and select.py: Python script applying data selection and cleaning (including selection based on page category), with settings and decisions described at the top of the script. This script also combines individual pages into one combined observation per GVKEY/year.

metadata.csv: Metadata containing information on all processed HTML pages, including those not selected.

TXT_cleaned.zip: Archived version of the selected and cleaned plaintext page files. This file serves as input for the word embeddings application.

TXT_combined.zip: Archived version of the combined plaintext files at the GVKEY/year level. This file serves as input for the data description using topic modeling.

06 Topic model.R: R script that loads up the combined text data from the folder stored in "TXT_combined.zip", applies further cleaning, and estimates a 125-topic model.

TM_125.RData: RData file containing the results of the 125-topic model.

loadings125.csv: CSV file containing the loadings for all 125 topics for all GVKEY/year observations that were included in the topic model.

125_topprob.xlsx: Overview of top-loading terms for the 125-topic model.

07 Word2Vec train and align.py: Python script that loads the plaintext files in the "TXT_cleaned.zip" archive to train a series of Word2Vec models and subsequently align them in order to compare word embeddings across time periods.

Word2Vec_models.zip: Archived version of the saved Word2Vec models, both unaligned and aligned.

08 Word2Vec work with aligned models.py: Python script which loads the trained Word2Vec models to trace the development of the embeddings for the terms “sustainability” and “profitability” over time.

99 Scrape further levels down.py: Python script that can be used to generate a list of unscraped URLs from the pages that themselves were one level deeper than the front page.

URLs_2_deeper.csv: CSV file containing unscraped URLs from the pages that themselves were one level deeper than the front page.

For those only interested in downloading the final database of texts, the files "HTML.zip", "TXT_uncleaned.zip", "TXT_cleaned.zip", and "TXT_combined.zip" contain the full set of HTML pages, the processed but uncleaned texts, the selected and cleaned texts, and combined and cleaned texts at the GVKEY/year level, respectively.

The following webpage contains answers to frequently asked questions: https://haans-mertens.github.io/faq/. More information on the database and the underlying project can be found here: https://haans-mertens.github.io/ and in the following article: “The Internet Never Forgets: A Four-Step Scraping Tutorial, Codebase, and Database for Longitudinal Organizational Website Data”, by Richard F.J. Haans and Marc J. Mertens in Organizational Research Methods. The full paper can be accessed here.
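
The step of "generating a list of URLs one page deeper in the domains" can be pictured with a short sketch. This is not the authors' script (their code is in the files listed above); it is a minimal standard-library illustration with hypothetical names that takes a front page's HTML and returns the distinct links staying on the same host.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def one_level_deeper(front_page_url: str, html: str) -> list:
    """Return distinct same-host URLs linked from the front page, in page order."""
    collector = LinkCollector()
    collector.feed(html)
    home_host = urlsplit(front_page_url).hostname
    seen, result = set(), []
    for href in collector.hrefs:
        # Resolve relative links and drop fragments such as "#team".
        absolute = urljoin(front_page_url, href).split("#", 1)[0]
        if urlsplit(absolute).hostname != home_host:
            continue  # external link, mailto:, javascript:, etc.
        if absolute == front_page_url or absolute in seen:
            continue
        seen.add(absolute)
        result.append(absolute)
    return result
```

Fetching the pages and recording their scraping status, which the description attributes to the tracking files, are deliberately left out of this sketch.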
Target products dataset
crawlfeeds.com
csv, zip
Updated Sep 10, 2024
Cite
Crawl Feeds (2024). Target products dataset [Dataset]. https://crawlfeeds.com/datasets/target-products-dataset
Explore at:
Available download formats: zip, csv
Dataset updated
Sep 10, 2024
Dataset authored and provided by
Crawl Feeds
License
https://crawlfeeds.com/privacy_policy
Description
The Target Products Dataset is a robust collection in CSV format, featuring 1.3 million product records sourced from Target's online platform. This dataset contains rich details on a wide range of products, including product titles, URLs, pricing, availability, and more. It is an ideal resource for businesses, researchers, and data scientists interested in analyzing retail trends, product availability, and pricing strategies.

Key Data Fields:

Title: Name of the product.

URL: Direct link to the product page.

Brand: The brand associated with the product.

Main Image: URL of the main product image.

SKU: Unique Stock Keeping Unit identifier.

Description: A structured product description.

Raw Description: The original product description before any processing.

GTIN13: Global Trade Item Number (GTIN) in 13-digit format.

Currency: The currency in which the product is priced.

Price: Price of the product.

Availability: Availability status of the product (e.g., in stock, out of stock).

Available Delivery Method: Methods through which the product can be delivered.

Available Branch: Information on availability at specific store locations.

Primary Category: The main category to which the product belongs.

Sub Category 1, 2, 3: Further sub-categorization of the product.

Images: URLs to additional product images.

Raw Specifications: Unprocessed specifications of the product.

Specifications: Structured product specifications.

Highlights: Key highlights and features of the product.

Raw Highlights: Unstructured highlights before processing.

Uniq ID: A unique identifier for each product.

Scraped At: The timestamp indicating when the data was collected.

Use Cases:

Retail Analytics: Analyze pricing trends, brand popularity, and product availability across categories.

Product Categorization: Study the classification of products into primary and sub-categories.

E-commerce Analysis: Use this dataset for consumer behavior studies, inventory management, or competitive analysis.

Recommendation Systems: Build product recommendation engines using product features, pricing, and availability data.
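
As a rough illustration of the Retail Analytics use case, the sketch below computes the mean price and in-stock share per primary category. The sample rows are invented, and the header spellings (`Price`, `Availability`, `Primary Category`) are assumptions based on the field list above; the actual CSV export may name its columns differently.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows; the real export's header spellings may differ.
SAMPLE = """Title,Brand,Price,Currency,Availability,Primary Category
Cotton Towel,BrandA,10.00,USD,in stock,Home
Bath Mat,BrandA,20.00,USD,out of stock,Home
Toy Car,BrandB,5.00,USD,in stock,Toys
"""


def summarize(csv_file):
    """Return {category: (mean_price, in_stock_share)} for priced rows."""
    prices = defaultdict(list)
    in_stock = defaultdict(int)
    for row in csv.DictReader(csv_file):
        category = row["Primary Category"]
        try:
            price = float(row["Price"])
        except (TypeError, ValueError):
            continue  # skip rows without a parseable price
        prices[category].append(price)
        if (row["Availability"] or "").strip().lower() == "in stock":
            in_stock[category] += 1
    return {
        cat: (sum(vals) / len(vals), in_stock[cat] / len(vals))
        for cat, vals in prices.items()
    }


summary = summarize(io.StringIO(SAMPLE))
for category, (mean_price, share) in summary.items():
    print(f"{category}: mean price {mean_price:.2f}, in stock {share:.0%}")
```

For the full 1.3 million records, the same function could be passed an open file handle instead of the in-memory sample.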
URL CLASSIFICATION DMOZ
kaggle.com
zip
Updated May 9, 2019
Cite
Revanth (2019). URL CLASSIFICATION DMOZ [Dataset]. https://www.kaggle.com/revanthrex/url-classification
Available download formats: zip (19524582 bytes)
Dataset updated
May 9, 2019
Authors
Revanth
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

DMOZ is the largest, most comprehensive human-edited directory of the Web. It is constructed and maintained by a passionate, global community of volunteer editors. It was historically known as the Open Directory Project (ODP).

DMOZ was founded in the spirit of the Open Source movement and is the only major directory that is 100% free. There is not, nor will there ever be, a cost to submit a site to the directory, and/or to use the directory's data. Its data is made available for free to anyone who agrees to comply with our free use license.

Content

DMOZ is the most widely distributed database of Web content classified by humans. It serves as input to the Web's largest and most popular search engines and portals, including AOL Search, Google, Lycos, HotBot, and hundreds of others.

Acknowledgements

Reference

https://dmoz-odp.org/

Column	Description
url	🔗 The web address to be analyzed.
type	🎯 The classification of the URL (phishing or legitimate).

IAB-Taxonomy URL Content Dataset

kaggle.com

zip

Updated May 27, 2025

Cite

IshikaaaaaaThakur (2025). IAB-Taxonomy URL Content Dataset [Dataset]. https://www.kaggle.com/datasets/ishikaaaaaathakur/iab-taxonomy-url-content-dataset

Available download formats: zip (88571739 bytes)

Dataset updated

May 27, 2025

Authors

IshikaaaaaaThakur

License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically

Description

The IAB Content Taxonomy is an industry standard developed by the Interactive Advertising Bureau to classify digital content into consistent categories. It's widely used for contextual advertising, audience targeting, and content personalization. The taxonomy has multiple tiers:

Tier 1: Broad topics (e.g., Arts & Entertainment)

Tier 2–4: Increasingly granular categories (e.g., Music → Music News → Artist Interviews)

📑 Column Descriptions

Column Name	Description
`Tier 1`	The top-level IAB category (e.g., News, Technology, Sports). Broadest classification.
`Tier 2`	The subcategory under Tier 1 (e.g., Mobile Tech, Personal Finance).
`Tier 3`	More specific subcategory, available for deeper classification.
`Tier 4`	Most granular classification (optional).
`URL`	The webpage address where the content was scraped from.
`Title`	The HTML `<title>` tag of the webpage.
`Description`	The meta description summarizing the page content.
`Keywords`	Keywords extracted from metadata or inferred via NLP (optional).
`Site Name`	The domain or name of the website.
`Content`	The main textual content scraped from the page.
`License`	The applicable Creative Commons license (e.g., CC BY, CC BY-SA).
`Language`	Primary language of the content (e.g., en, fr, es).
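
A minimal sketch of how the tier columns might be combined into a single category path, assuming the file is a CSV whose headers match the column names above and that unused optional tiers are left empty. The sample rows and `example.com` URLs are invented for illustration.

```python
import csv
import io
from collections import Counter

TIER_COLUMNS = ["Tier 1", "Tier 2", "Tier 3", "Tier 4"]

# Hypothetical sample rows; only a subset of the documented columns is shown.
SAMPLE = """Tier 1,Tier 2,Tier 3,Tier 4,URL,Language
Technology,Mobile Tech,,,https://example.com/a,en
Technology,Mobile Tech,,,https://example.com/b,en
Sports,,,,https://example.com/c,fr
"""


def tier_path(row):
    """Join the non-empty tiers in order, stopping at the first empty one."""
    parts = []
    for col in TIER_COLUMNS:
        value = (row.get(col) or "").strip()
        if not value:
            break
        parts.append(value)
    return " > ".join(parts)


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
path_counts = Counter(tier_path(r) for r in rows)
for path, count in path_counts.most_common():
    print(f"{count}  {path}")
```

Stopping at the first empty tier keeps a path from skipping a level when a deeper tier is filled in but a shallower one is missing.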

USA Acura Dealership Location Information
dataandsons.com
csv, zip
Updated Apr 16, 2022
Cite
http://datamosquito.com (2022). USA Acura Dealership Location Information [Dataset]. https://www.dataandsons.com/categories/location-lists/usa-acura-dealership-location-information
Available download formats: csv, zip
Dataset updated
Apr 16, 2022
Dataset provided by
Authors
http://datamosquito.com
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
About this Dataset

Contains a complete list of all Acura dealerships in the US. Data contains Dealership Name, Address, City, State, Zipcode, Phone Number, Latitude, Longitude, and Website URL.

Category

Location Lists

Keywords

Acura,Address,URL

Row Count

273

Price

$69.00
Breakdown of independent cross-border e-commerce website from China 2021, by...
statista.com
Updated Oct 29, 2021
Cite
Statista (2021). Breakdown of independent cross-border e-commerce website from China 2021, by category [Dataset]. https://www.statista.com/statistics/1272828/china-distribution-of-independent-cross-border-e-commerce-websites-by-category/
Dataset updated
Oct 29, 2021
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
2021
Area covered
China
Description
As of 2021, nearly ********* of China's ** leading independent cross-border e-commerce retailers focused on selling consumer electronics. In comparison, apparel retailers accounted for around **** percent of the independent exporting e-commerce websites.
URL Shortener Market is Growing at Compound Annual Growth Rate of 20.00%...
cognitivemarketresearch.com
pdf,excel,csv,ppt
Updated Aug 9, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Cognitive Market Research (2025). URL Shortener Market is Growing at Compound Annual Growth Rate of 20.00% from 2023 to 2030. [Dataset]. https://www.cognitivemarketresearch.com/url-shortener-market-report
Available download formats: pdf, excel, csv, ppt
Dataset updated
Aug 9, 2025
Dataset authored and provided by
Cognitive Market Research
License
https://www.cognitivemarketresearch.com/privacy-policy
Time period covered
2021 - 2033
Area covered
Global
Description
According to Cognitive Market Research, the global URL Shortener market is estimated at USD XX billion in 2023 and will expand at a compound annual growth rate (CAGR) of 20.00% from 2023 to 2030.

The global URL Shortener market will expand at a 20.00% CAGR between 2023 and 2030. Demand for URL Shorteners is driven by the increasing usage of social media platforms and the growing adoption of mobile devices. The tools and BFSI category held the highest URL Shortener market revenue share in 2023. North America will continue to lead, whereas the Asia Pacific URL Shortener market will experience the most substantial growth until 2030.
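
To make the compounding concrete, the sketch below shows what a 20.00% CAGR implies over the seven years from 2023 to 2030. Because the base-year market size is given only as "USD XX billion", the code works with a growth multiple rather than a dollar figure; the function names are illustrative and not taken from the report.

```python
def growth_multiple(cagr, years):
    """Factor by which a value grows when compounding at `cagr` for `years`."""
    return (1.0 + cagr) ** years


def implied_cagr(start_value, end_value, years):
    """Inverse calculation: the annual rate that links two values."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


multiple = growth_multiple(0.20, 2030 - 2023)
print(f"{multiple:.2f}x")  # prints "3.58x"
```

Under these assumptions, whatever the 2023 base is, the 2030 value would be roughly 3.58 times larger.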

Increasing Usage of Social Media Platforms to Provide Viable Market Output

The global URL shortener market is experiencing a significant surge in demand due to the increasing usage of social media platforms. With the growing reliance on platforms like Twitter, Facebook, and Instagram, there is a heightened need for concise and shareable URLs. URL shorteners are pivotal in condensing lengthy links, enhancing user experience, and facilitating easy sharing across various social media channels. As businesses and individuals alike seek to optimize their online presence, the market for URL shorteners is poised for continuous growth, driven by the expanding influence of social media in disseminating information efficiently.

Growing Adoption of Mobile Devices to Propel Market Growth

The global URL shortener market is poised for growth, primarily fueled by the increasing adoption of mobile devices. As mobile usage continues to surge globally, concise and shareable URLs become paramount for seamless communication across various platforms. URL shorteners facilitate easy sharing on social media, messaging apps, and other mobile channels, enhancing user experience. The market is expected to witness significant expansion as businesses and individuals recognize the efficiency and convenience offered by URL-shortening services in the mobile-driven digital landscape, contributing to the overall growth of the URL-shortener market.

The Adoption of Digital Marketing Services Across Various Sectors Fuels the Market Growth

Market Dynamics For URL shorteners

Key Drivers for URL Shortener

Growing Use of Digital Marketing and Social Media: Demand for URL shorteners has increased with the rapid rise of character-limited platforms such as Instagram, LinkedIn, and Twitter (now X). Marketers, influencers, and companies use shorter links to conserve space, monitor performance, and improve the look of posts and bios.

Demand for Performance Monitoring and Link Analytics: Many URL shorteners now have built-in analytics. Businesses increasingly use them to track conversion rates, user geolocation, referral traffic, and click-through rates (CTR), which makes shortened URLs an essential tool for performance evaluation and digital marketing.

Key Restraints for URL Shortener

Data Security and Privacy Issues: Because shortened URLs conceal the final destination, they can be used for malicious redirects, malware delivery, and phishing. As cybersecurity concerns grow, some users and IT departments distrust shortened URLs, which could limit adoption in high-security industries.

Reliance on Outside Services: Companies that depend on third-party URL shortening services (such as Bitly or TinyURL) risk losing access to their links if the provider goes down or discontinues its service. This reliance on outside infrastructure poses a risk to long-term digital plans.

Key Trends for URL Shortener

Growth of Custom and Branded Short Domains: To improve click-through rates, brand recall, and trust, businesses are switching from generic URL shorteners to branded domains (such as "yourbrand.co/offer"). CRMs and other marketing platforms are starting to include custom URL shorteners.

Connectivity with CRM and Marketing Tools: URL shorteners are increasingly integrated into automated campaigns run through tools such as HubSpot, Salesforce, and Mailchimp. This trend coincides with the growth of data-driven outreach and marketing automation.

Impact of COVID–19 on the URL Shortener Market?

While witnessing steady growth, the global URL shortener market faced challenges d...
