51 datasets found
  1. Website Classification

    • kaggle.com
    zip
    Updated May 5, 2021
    Cite
    Hetul Mehta (2021). Website Classification [Dataset]. https://www.kaggle.com/hetulmehta/website-classification
    Explore at:
    zip (2094838 bytes)
    Dataset updated
    May 5, 2021
    Authors
    Hetul Mehta
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    This dataset was created by scraping different websites and then classifying them into different categories based on the extracted text.

    Content

    Below are the values each column holds; the column names are pretty self-explanatory.
    • website_url: URL of the website.
    • cleaned_website_text: the cleaned text content extracted from the website.
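    With a labeled text dataset like this, counting examples per class is the usual first sanity check before training. The sketch below uses only Python's standard library; the sample rows and the `Category` label column are hypothetical placeholders, since only the two columns above are confirmed by the description.

```python
import csv
import io
from collections import Counter

# Hypothetical sample mimicking the dataset's layout; the real file's
# label column is assumed here to be named "Category".
sample = io.StringIO(
    "website_url,cleaned_website_text,Category\n"
    "https://a.example,travel blog text,Travel\n"
    "https://b.example,online shop text,E-Commerce\n"
    "https://c.example,flight deals text,Travel\n"
)

# Count examples per class to spot any class imbalance early.
counts = Counter(row["Category"] for row in csv.DictReader(sample))
print(counts.most_common())
```

    For the real file, replace the in-memory sample with an `open(...)` call on the downloaded CSV.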

  2. Website Classification Dataset

    • cubig.ai
    zip
    Updated Feb 25, 2025
    Cite
    CUBIG (2025). Website Classification Dataset [Dataset]. https://cubig.ai/store/products/138/website-classification-dataset
    Explore at:
    zip
    Dataset updated
    Feb 25, 2025
    Dataset authored and provided by
    CUBIG
    License

    https://cubig.ai/store/terms-of-service

    Measurement technique
    Privacy-preserving data transformation via differential privacy, Synthetic data generation using AI techniques for model training
    Description

    1) Data Introduction
    • The Website dataset is designed to facilitate the development of models for URL-based website classification.

    2) Data Utilization
    (1) Characteristics of the website data:
    • This dataset is crucial for training models that can automatically classify websites based on their URL structures.
    (2) The website data can be used for:
    • Enhancing cybersecurity measures by detecting malicious websites.
    • Improving content filtering systems for safer browsing experiences.

  3. URL Classification Dataset for Malicious Traffic

    • kaggle.com
    zip
    Updated May 23, 2023
    Cite
    Jacob (2023). URL Classification Dataset for Malicious Traffic [Dataset]. https://www.kaggle.com/datasets/bobaaayoung/url-dataset/code
    Explore at:
    zip (2548282 bytes)
    Dataset updated
    May 23, 2023
    Authors
    Jacob
    Description

    URL Classification Dataset for Malicious Traffic Detection

    Dataset Overview

    The Internet is a vast space that, while hosting a plethora of resources, also serves as a breeding ground for malicious activities. URLs are often leveraged as a primary tool by adversaries to conduct various types of cyber attacks. In response, the cybersecurity community has developed numerous techniques, with URL blacklisting being a prevalent method. However, this reactive approach falls short against the constantly evolving landscape of new malicious URLs.

    This dataset aims to contribute to the proactive detection and categorization of URLs by analyzing their lexical features. It facilitates the development and testing of models capable of distinguishing between benign and malicious (malware) URLs. The dataset is divided into three main parts: training, validation, and testing sets, encompassing a broad spectrum of data points for comprehensive analysis.

    Dataset Composition

    • train.csv: Contains 79,635 entries, a mix of benign and malware URLs, intended for training machine learning models.
    • valid.csv: Comprises 9,997 entries for model validation purposes, allowing for the fine-tuning of parameters and the assessment of preliminary performance.
    • test.csv: Includes 9,988 entries designed for the final evaluation of the model's ability to generalize to unseen data.

    The URLs are categorized into two classes:

    • Benign (Good): URLs that are deemed safe and do not host any form of malicious content.
    • Malware (Bad): URLs associated with malicious websites, including those that distribute malware, host phishing attempts, or carry out other harmful activities.

    Source of Data

    The benign URLs were meticulously collected from Alexa's top-ranked websites, ensuring a representation of commonly visited and trusted domains. On the other hand, the malware URLs were curated from various sources known for listing active and dangerous URLs. Each URL underwent rigorous verification to ensure its correct classification, providing a reliable basis for model training and testing.

    Application and Importance

    This dataset is pivotal for researchers and cybersecurity practitioners aiming to devise effective strategies for early detection of malicious URLs. By employing lexical analysis and machine learning techniques, it is possible to identify potentially harmful URLs before they can impact users. Such proactive measures are essential in the ongoing battle against cyber threats, enhancing the overall security posture of online environments.
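    As an illustration of the lexical analysis the description refers to, the sketch below derives a few simple features from a raw URL string using only Python's standard library. The feature set is an assumption for illustration, not the dataset's actual schema or preprocessing.

```python
import re
from urllib.parse import urlparse

def lexical_features(url: str) -> dict:
    """Derive simple lexical features from a URL string.

    These features are illustrative; the dataset does not specify
    which lexical features were used.
    """
    # urlparse needs a scheme separator to populate netloc correctly.
    parsed = urlparse(url if "//" in url else "//" + url)
    host, path = parsed.netloc, parsed.path
    return {
        "url_len": len(url),        # unusually long URLs are a weak malware signal
        "host_len": len(host),
        "path_len": len(path),
        "n_digits": sum(c.isdigit() for c in url),
        "n_dots": host.count("."),  # many subdomain levels can look suspicious
        "n_hyphens": url.count("-"),
        "has_ip_host": bool(re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", host)),
    }

feats = lexical_features("http://192.168.0.1/evil-download.exe")
```

    A vector of such features per row in train.csv could then feed any off-the-shelf classifier for the benign/malware distinction.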

  4. URL CLASSIFICATION DMOZ

    • kaggle.com
    zip
    Updated May 9, 2019
    Cite
    Revanth (2019). URL CLASSIFICATION DMOZ [Dataset]. https://www.kaggle.com/revanthrex/url-classification
    Explore at:
    zip (19524582 bytes)
    Dataset updated
    May 9, 2019
    Authors
    Revanth
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    DMOZ is the largest, most comprehensive human-edited directory of the Web. It is constructed and maintained by a passionate, global community of volunteer editors. It was historically known as the Open Directory Project (ODP).

    DMOZ was founded in the spirit of the Open Source movement and is the only major directory that is 100% free. There is not, nor will there ever be, a cost to submit a site to the directory, and/or to use the directory's data. Its data is made available for free to anyone who agrees to comply with our free use license.

    Content

    DMOZ is the most widely distributed database of Web content classified by humans. It serves as input to the Web's largest and most popular search engines and portals, including AOL Search, Google, Lycos, HotBot, and hundreds of others.

    Acknowledgements

    Reference

    https://dmoz-odp.org/

  5. Web Categorization Services Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Aug 29, 2025
    Cite
    Growth Market Reports (2025). Web Categorization Services Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/web-categorization-services-market
    Explore at:
    pdf, csv, pptx
    Dataset updated
    Aug 29, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Web Categorization Services Market Outlook



    According to our latest research, the global Web Categorization Services market size reached USD 2.41 billion in 2024, demonstrating strong momentum driven by the imperative need for advanced digital content management and security solutions. The market is expected to expand at a robust CAGR of 13.6% during the forecast period, reaching a projected value of USD 7.09 billion by 2033. Growth in this sector is primarily fueled by the increasing sophistication of cyber threats, the proliferation of web-based applications, and the growing adoption of cloud-based technologies across diverse industry verticals.




    A significant growth factor for the Web Categorization Services market is the exponential rise in online content and the corresponding necessity for organizations to manage, filter, and secure web traffic efficiently. As businesses increasingly migrate operations to digital platforms, the sheer volume of web content—ranging from benign corporate resources to potentially harmful or non-compliant material—necessitates sophisticated categorization solutions. Web categorization services leverage artificial intelligence and machine learning algorithms to analyze and classify millions of web pages in real time, enabling organizations to enforce content policies, enhance productivity, and mitigate the risks associated with inappropriate or malicious sites. This capability is particularly crucial in sectors such as BFSI, healthcare, and government, where regulatory compliance and data protection are paramount.




    Another pivotal driver behind market growth is the rising emphasis on brand safety and digital advertising effectiveness. As enterprises allocate larger portions of their marketing budgets to online channels, the risk of brand advertisements appearing alongside inappropriate or harmful content has become a critical concern. Web categorization services play a vital role in ensuring brand safety by dynamically assessing the context and category of web pages where ads are displayed, thus protecting brand reputation and optimizing ad targeting. This trend is further amplified by the increasing adoption of programmatic advertising and real-time bidding platforms, which rely heavily on accurate web categorization to maximize campaign ROI and minimize reputational risk.




    The surge in remote work and the widespread adoption of bring-your-own-device (BYOD) policies have also contributed to the expansion of the Web Categorization Services market. With employees accessing corporate networks from diverse locations and devices, organizations face heightened challenges in maintaining security and compliance. Web categorization solutions enable IT departments to implement granular access controls, monitor user activity, and prevent data breaches by blocking access to malicious or unauthorized websites. The integration of web categorization with existing security information and event management (SIEM) systems further enhances threat intelligence and incident response capabilities, making these services indispensable in the modern enterprise security stack.



    In the context of securing web environments, the role of a Secure Web Gateway (SWG) is becoming increasingly pivotal. As organizations strive to protect their networks from sophisticated cyber threats, SWGs offer a comprehensive solution by filtering unwanted software and malware from user-initiated web traffic. They ensure that only safe and compliant content is accessed, thereby safeguarding sensitive data and maintaining regulatory compliance. The integration of SWGs with web categorization services enhances the overall security posture by providing real-time threat intelligence and advanced content filtering capabilities. This synergy is particularly beneficial for enterprises with distributed workforces and BYOD policies, where maintaining a secure web environment is critical to operational integrity.




    Regionally, North America continues to dominate the Web Categorization Services market, accounting for the largest share in 2024. This leadership is attributed to the early adoption of advanced cybersecurity solutions, the presence of major technology vendors, and stringent regulatory frameworks governing data privacy and digital content. However, the Asia Pacific region is anticipated to exhibit the highest CAGR

  6. Soil Series Classification Database (SC)

    • agdatacommons.nal.usda.gov
    bin
    Updated Nov 21, 2025
    Cite
    USDA Natural Resources Conservation Service, Soil Survey Staff (2025). Soil Series Classification Database (SC) [Dataset]. https://agdatacommons.nal.usda.gov/articles/dataset/Soil_Series_Classification_Database_SC_/24663174
    Explore at:
    bin
    Dataset updated
    Nov 21, 2025
    Dataset provided by
    Natural Resources Conservation Service (http://www.nrcs.usda.gov/)
    United States Department of Agriculture (http://usda.gov/)
    Authors
    USDA Natural Resources Conservation Service, Soil Survey Staff
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The USDA-NRCS Soil Series Classification Database contains the taxonomic classification of each soil series identified in the United States, Territories, Commonwealths, and Island Nations served by USDA-NRCS. Along with the taxonomic classification, the database contains other information about each soil series, such as the office of responsibility, series status, dates of origin and establishment, and geographic areas of usage. The database is maintained by the soils staff of the NRCS MLRA Soil Survey Region Offices across the country. Additions and changes are continually being made as a result of ongoing soil survey work and refinement of the soil classification system. As the database is updated, the changes are immediately available to the user, so the data retrieved is always the most current. Web access to this soil classification database provides capabilities to view the contents of individual series records, to query the database on any data element and produce a report with the selected soils, or to produce national reports with all soils in the database. The standard reports allow the user to display the soils by series name or by taxonomic classification. The SC database was migrated into the NASIS database with version 6.2.

    Resources in this dataset:
    Resource Title: Website Pointer to Soil Series Classification Database (SC). File Name: Web Page. URL: https://www.nrcs.usda.gov/wps/portal/nrcs/detail/soils/survey/class/data/?cid=nrcs142p2_053583

    Supports the following queries:

    • View Classification Data by Series Name
    • Create Report for a List of Series (with download option)
    • Create Report by Query (with download option)
    • Create National Report (with download option)
    • Soil Series Name Search

  7. Web Data Commons - Categorization Gold Standard

    • webdatacommons.org
    json
    Updated Feb 17, 2025
    Cite
    Christian Bizer; Anna Primpeli; Helene Bechtold (2025). Web Data Commons - Categorization Gold Standard [Dataset]. http://webdatacommons.org/categorization/
    Explore at:
    json
    Dataset updated
    Feb 17, 2025
    Authors
    Christian Bizer; Anna Primpeli; Helene Bechtold
    Description

    The training dataset consists of 20 million pairs of product offers referring to the same products, categorized into 25 product categories. We also created a categorization gold standard by manually verifying more than 2,000 clusters of offers belonging to the 25 product categories.

  8. Web Categorization Services Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Oct 1, 2025
    Cite
    Dataintelo (2025). Web Categorization Services Market Research Report 2033 [Dataset]. https://dataintelo.com/report/web-categorization-services-market
    Explore at:
    pptx, pdf, csv
    Dataset updated
    Oct 1, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Web Categorization Services Market Outlook



    According to our latest research, the global Web Categorization Services market size reached USD 3.42 billion in 2024, with a robust year-on-year growth trajectory. The market is expected to expand at a CAGR of 14.2% during the forecast period, projecting a value of USD 9.03 billion by 2033. This growth is primarily driven by the increasing need for advanced content filtering, regulatory compliance, and enhanced cybersecurity measures across diverse industries. The rising adoption of cloud-based solutions and the proliferation of digital content are further fueling market expansion, as organizations worldwide prioritize secure and efficient web usage management.




    One of the chief growth drivers for the Web Categorization Services market is the escalating frequency and sophistication of cyber threats. Enterprises are under mounting pressure to safeguard their digital assets and ensure compliance with evolving regulatory frameworks. Web categorization services play a pivotal role in this context by enabling organizations to classify and filter web content, thereby minimizing exposure to malicious or inappropriate sites. The surge in remote working and the widespread digital transformation initiatives have further accentuated the need for robust web categorization solutions, as businesses seek to protect their distributed workforce and sensitive data from cyber risks.




    Another significant factor propelling market growth is the increasing emphasis on brand safety and ad targeting within the digital advertising ecosystem. As brands allocate larger budgets to online advertising, the importance of ensuring that ads are displayed on appropriate and safe web pages has become paramount. Web categorization services empower advertisers and publishers to effectively filter out unsuitable or harmful content, thereby preserving brand reputation and maximizing campaign ROI. The integration of artificial intelligence and machine learning technologies into these services is enhancing their accuracy and efficiency, making them indispensable tools in the digital marketing landscape.




    The growing need for regulatory compliance across sectors such as BFSI, healthcare, and government is also bolstering the adoption of web categorization services. Stringent data protection laws and industry-specific regulations require organizations to monitor and control online activities diligently. Web categorization solutions facilitate compliance by providing granular visibility into web usage and enabling the enforcement of acceptable use policies. As regulatory landscapes become increasingly complex, the demand for automated, scalable, and customizable web categorization services is expected to surge, driving sustained market growth over the forecast period.




    From a regional perspective, North America continues to dominate the Web Categorization Services market, accounting for the largest share in 2024. This leadership is attributed to the high digital adoption rates, presence of major technology vendors, and stringent cybersecurity regulations in the region. Meanwhile, Asia Pacific is witnessing the fastest growth, fueled by rapid digitalization, expanding internet user base, and increasing investments in cybersecurity infrastructure. Europe also holds a significant market share, driven by robust regulatory frameworks such as GDPR and a strong focus on data privacy. The Middle East & Africa and Latin America are emerging as promising markets, supported by growing awareness of cybersecurity and rising adoption of digital technologies.



    Component Analysis



    The Component segment of the Web Categorization Services market is bifurcated into software and services, each playing a critical role in the overall value proposition offered to end-users. The software segment comprises standalone and integrated solutions that automate the process of classifying and filtering web content. These solutions leverage advanced algorithms, artificial intelligence, and machine learning to deliver precise and real-time categorization. The growing complexity of web content and the need for scalable, customizable solutions have spurred significant investments in software development, resulting in continuous innovation and feature enhancements.




    On the other hand, the services segment encompasses a range of offerings, including consulting, implementation, and support.

  9. Dataset for IAB text/website classification

    • kaggle.com
    zip
    Updated Dec 28, 2023
    Cite
    AKS (2023). Dataset for IAB text/website classification [Dataset]. https://www.kaggle.com/datasets/bpmtips/websiteiabcategorization
    Explore at:
    zip (7811746609 bytes)
    Dataset updated
    Dec 28, 2023
    Authors
    AKS
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the base training dataset for training an IAB-taxonomy text/website classification model. We are actively using it at https://front-page.com for domain/website classification and search query enhancement.

    Freemium APIs are available at https://www.rest-apis.com/

    If you need test data to run your model against, the counterpart dataset is https://www.kaggle.com/datasets/bpmtips/target-domains-for-iab-textwebsite-classification

    Have fun!

    Let us know if you exceed the accuracy of our model.

    A new version with only English-language sites has been uploaded.

  10. CompuCrawl: Full database and code

    • dataverse.nl
    Updated Sep 23, 2025
    Cite
    Richard Haans; Richard Haans (2025). CompuCrawl: Full database and code [Dataset]. http://doi.org/10.34894/OBVAOY
    Explore at:
    Dataset updated
    Sep 23, 2025
    Dataset provided by
    DataverseNL
    Authors
    Richard Haans; Richard Haans
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This folder contains the full set of code and data for the CompuCrawl database. The database contains the archived websites of publicly traded North American firms listed in the Compustat database between 1996 and 2020, representing 11,277 firms with 86,303 firm/year observations and 1,617,675 webpages in the final cleaned and selected set.

    The files are ordered by moment of use in the workflow. For example, the first file in the list is the input file for code files 01 and 02, which create and update the two tracking files "scrapedURLs.csv" and "URLs_1_deeper.csv" and which write HTML files to its folder. "HTML.zip" is the resultant folder, converted to .zip for ease of sharing. Code file 03 then reads this .zip file and is therefore below it in the ordering.

    The full set of files, in order of use, is as follows:
    • Compustat_2021.xlsx: The input file containing the URLs to be scraped and their date range.
    • 01 Collect frontpages.py: Python script scraping the front pages of the list of URLs and generating a list of URLs one page deeper in the domains.
    • URLs_1_deeper.csv: List of URLs one page deeper on the main domains.
    • 02 Collect further pages.py: Python script scraping the list of URLs one page deeper in the domains.
    • scrapedURLs.csv: Tracking file containing all URLs that were accessed and their scraping status.
    • HTML.zip: Archived version of the set of individual HTML files.
    • 03 Convert HTML to plaintext.py: Python script converting the individual HTML pages to plaintext.
    • TXT_uncleaned.zip: Archived version of the converted yet uncleaned plaintext files.
    • input_categorization_allpages.csv: Input file for classification of pages using GPT according to their HTML title and URL.
    • 04 GPT application.py: Python script using OpenAI's API to classify selected pages according to their HTML title and URL.
    • categorization_applied.csv: Output file containing the classification of selected pages.
    • exclusion_list.xlsx: File containing three sheets: 'gvkeys' containing the GVKEYs of duplicate observations (that need to be excluded), 'pages' containing page IDs for pages that should be removed, and 'sentences' containing (sub-)sentences to be removed.
    • 05 Clean and select.py: Python script applying data selection and cleaning (including selection based on page category), with settings and decisions described at the top of the script. This script also combines individual pages into one combined observation per GVKEY/year.
    • metadata.csv: Metadata containing information on all processed HTML pages, including those not selected.
    • TXT_cleaned.zip: Archived version of the selected and cleaned plaintext page files. This file serves as input for the word embeddings application.
    • TXT_combined.zip: Archived version of the combined plaintext files at the GVKEY/year level. This file serves as input for the data description using topic modeling.
    • 06 Topic model.R: R script that loads the combined text data from the folder stored in "TXT_combined.zip", applies further cleaning, and estimates a 125-topic model.
    • TM_125.RData: RData file containing the results of the 125-topic model.
    • loadings125.csv: CSV file containing the loadings on all 125 topics for all GVKEY/year observations that were included in the topic model.
    • 125_topprob.xlsx: Overview of top-loading terms for the 125-topic model.
    • 07 Word2Vec train and align.py: Python script that loads the plaintext files in the "TXT_cleaned.zip" archive to train a series of Word2Vec models and subsequently align them in order to compare word embeddings across time periods.
    • Word2Vec_models.zip: Archived version of the saved Word2Vec models, both unaligned and aligned.
    • 08 Word2Vec work with aligned models.py: Python script which loads the trained Word2Vec models to trace the development of the embeddings for the terms "sustainability" and "profitability" over time.
    • 99 Scrape further levels down.py: Python script that can be used to generate a list of unscraped URLs from the pages that themselves were one level deeper than the front page.
    • URLs_2_deeper.csv: CSV file containing unscraped URLs from the pages that themselves were one level deeper than the front page.

    For those only interested in downloading the final database of texts, the files "HTML.zip", "TXT_uncleaned.zip", "TXT_cleaned.zip", and "TXT_combined.zip" contain the full set of HTML pages, the processed but uncleaned texts, the selected and cleaned texts, and the combined and cleaned texts at the GVKEY/year level, respectively.

    The following webpage contains answers to frequently asked questions: https://haans-mertens.github.io/faq/. More information on the database and the underlying project can be found at https://haans-mertens.github.io/ and in the following article: "The Internet Never Forgets: A Four-Step Scraping Tutorial, Codebase, and Database for Longitudinal Organizational Website Data", by Richard F.J. Haans and Marc J. Mertens in Organizational Research Methods.
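    To illustrate the HTML-to-plaintext step in the workflow above (script 03), here is a minimal sketch using Python's standard library. It is an assumption-laden stand-in: the database's own conversion and cleaning logic is more involved.

```python
from html.parser import HTMLParser

class PlaintextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self._skip = 0      # depth inside script/style elements
        self._chunks = []   # visible text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self._chunks.append(data.strip())

def html_to_text(html: str) -> str:
    parser = PlaintextExtractor()
    parser.feed(html)
    return " ".join(parser._chunks)

text = html_to_text(
    "<html><body><h1>About us</h1>"
    "<script>var x = 1;</script>"
    "<p>We make widgets.</p></body></html>"
)
```

    Applied to each file in "HTML.zip", this kind of pass would yield the uncleaned plaintext stage that the later cleaning and selection scripts consume.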

  11. Dataplex: Reddit Data | Global Social Media Data | 2.1M+ subreddits: trends,...

    • datarade.ai
    .json, .csv
    Updated Aug 12, 2024
    + more versions
    Cite
    Dataplex (2024). Dataplex: Reddit Data | Global Social Media Data | 2.1M+ subreddits: trends, audience insights + more | Ideal for Interest-Based Segmentation [Dataset]. https://datarade.ai/data-products/dataplex-reddit-data-global-social-media-data-1-1m-mill-dataplex
    Explore at:
    .json, .csv
    Dataset updated
    Aug 12, 2024
    Dataset authored and provided by
    Dataplex
    Area covered
    Botswana, Holy See, Chile, Côte d'Ivoire, Gambia, Macao, Christmas Island, Mexico, Jersey, Martinique
    Description

    The Reddit Subreddit Dataset by Dataplex offers a comprehensive and detailed view of Reddit’s vast ecosystem, now enhanced with appended AI-generated columns that provide additional insights and categorization. This dataset includes data from over 2.1 million subreddits, making it an invaluable resource for a wide range of analytical applications, from social media analysis to market research.

    Dataset Overview:

    This dataset includes detailed information on subreddit activities, user interactions, post frequency, comment data, and more. The inclusion of AI-generated columns adds an extra layer of analysis, offering sentiment analysis, topic categorization, and predictive insights that help users better understand the dynamics of each subreddit.

    2.1 Million Subreddits with Enhanced AI Insights: The dataset covers over 2.1 million subreddits and now includes AI-enhanced columns that provide:

    • Sentiment Analysis: AI-driven sentiment scores for posts and comments, allowing users to gauge community mood and reactions.
    • Topic Categorization: Automated categorization of subreddit content into relevant topics, making it easier to filter and analyze specific types of discussions.
    • Predictive Insights: AI models that predict trends, content virality, and user engagement, helping users anticipate future developments within subreddits.

    Sourced Directly from Reddit:

    All social media data in this dataset is sourced directly from Reddit, ensuring accuracy and authenticity. The dataset is updated regularly, reflecting the latest trends and user interactions on the platform. This ensures that users have access to the most current and relevant data for their analyses.

    Key Features:

    • Subreddit Metrics: Detailed data on subreddit activity, including the number of posts, comments, votes, and user participation.
    • User Engagement: Insights into how users interact with content, including comment threads, upvotes/downvotes, and participation rates.
    • Trending Topics: Track emerging trends and viral content across the platform, helping you stay ahead of the curve in understanding social media dynamics.
    • AI-Enhanced Analysis: Utilize AI-generated columns for sentiment analysis, topic categorization, and predictive insights, providing a deeper understanding of the data.

    Use Cases:

    • Social Media Analysis: Researchers and analysts can use this dataset to study online behavior, track the spread of information, and understand how content resonates with different audiences.
    • Market Research: Marketers can leverage the dataset to identify target audiences, understand consumer preferences, and tailor campaigns to specific communities.
    • Content Strategy: Content creators and strategists can use insights from the dataset to craft content that aligns with trending topics and user interests, maximizing engagement.
    • Academic Research: Academics can explore the dynamics of online communities, studying everything from the spread of misinformation to the formation of online subcultures.

    Data Quality and Reliability:

    The Reddit Subreddit Dataset emphasizes data quality and reliability. Each record is carefully compiled from Reddit’s vast database, ensuring that the information is both accurate and up-to-date. The AI-generated columns further enhance the dataset's value, providing automated insights that help users quickly identify key trends and sentiments.

    Integration and Usability:

    The dataset is provided in a format that is compatible with most data analysis tools and platforms, making it easy to integrate into existing workflows. Users can quickly import, analyze, and utilize the data for various applications, from market research to academic studies.

    User-Friendly Structure and Metadata:

    The data is organized for easy navigation and analysis, with metadata files included to help users identify relevant subreddits and data points. The AI-enhanced columns are clearly labeled and structured, allowing users to efficiently incorporate these insights into their analyses.

    Ideal For:

    • Data Analysts: Conduct in-depth analyses of subreddit trends, user engagement, and content virality. The dataset’s extensive coverage and AI-enhanced insights make it an invaluable tool for data-driven research.
    • Marketers: Use the dataset to better understand your target audience, tailor campaigns to specific interests, and track the effectiveness of marketing efforts across Reddit.
    • Researchers: Explore the social dynamics of online communities, analyze the spread of ideas and information, and study the impact of digital media on public discourse, all while leveraging AI-generated insights.

    This dataset is an essential resource for anyone looking to understand the intricacies of Reddit's vast ecosystem, offering the data and AI-enhanced insights needed to drive informed decisions and strategies across various fields. Whether you’re tracking emerging trends, analyzing user behavior, or conduc...

  12. Company classification

    • kaggle.com
    zip
    Updated Mar 30, 2020
    Cite
    CharanPuvvala (2020). Company classification [Dataset]. https://www.kaggle.com/charanpuvvala/company-classification
    Explore at:
    zip(127819010 bytes)Available download formats
    Dataset updated
    Mar 30, 2020
    Authors
    CharanPuvvala
    Description

    Context

We often need to classify businesses and companies against a standard taxonomy. This dataset provides pre-classified companies along with data scraped from each company's website.

    Content

    The scraped data from each website includes:

    1. Category: the target label into which the company is classified
    2. website: the website of the company/business
    3. company_name: the company/business name
    4. homepage_text: visible homepage text
    5. h1: the heading 1 tags from the HTML of the home page
    6. h2: the heading 2 tags from the HTML of the home page
    7. h3: the heading 3 tags from the HTML of the home page
    8. nav_link_text: the visible titles of navigation links on the homepage (e.g. Home, Services, Product, About Us, Contact Us)
    9. meta_keywords: the meta keywords in the header of the page HTML, used for SEO (more info: https://www.w3schools.com/tags/tag_meta.asp)
    10. meta_description: the meta description in the header of the page HTML, used for SEO (more info: https://www.w3schools.com/tags/tag_meta.asp)
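    The fields above map directly onto what a simple HTML parser can pull from a homepage. A minimal stdlib sketch, purely illustrative (the dataset's actual scraping pipeline is not documented; the tag-handling logic and sample HTML below are assumptions):

```python
from html.parser import HTMLParser

class HomepageExtractor(HTMLParser):
    """Collects fields like those described above: h1-h3, nav link text, meta tags."""
    def __init__(self):
        super().__init__()
        self.fields = {"h1": [], "h2": [], "h3": [], "nav_link_text": [],
                       "meta_keywords": "", "meta_description": ""}
        self._stack = []  # tracks open tags we care about

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("h1", "h2", "h3", "a", "nav"):
            self._stack.append(tag)
        if tag == "meta":
            name = attrs.get("name", "").lower()
            if name in ("keywords", "description"):
                self.fields["meta_" + name] = attrs.get("content", "")

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text or not self._stack:
            return
        if self._stack[-1] in ("h1", "h2", "h3"):
            self.fields[self._stack[-1]].append(text)
        elif self._stack[-1] == "a" and "nav" in self._stack:
            self.fields["nav_link_text"].append(text)

# Hypothetical homepage used only to exercise the extractor.
html = """<html><head><meta name="keywords" content="widgets, gadgets">
<meta name="description" content="We sell widgets."></head>
<body><nav><a href="/">Home</a><a href="/about">About Us</a></nav>
<h1>Acme Widgets</h1><h2>Our Services</h2></body></html>"""

parser = HomepageExtractor()
parser.feed(html)
print(parser.fields["h1"], parser.fields["nav_link_text"])
```

    A real pipeline would fetch each homepage first (e.g. with requests) and add error handling; this sketch only shows how the per-column fields could be separated out of the markup.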

    Inspiration

    Your data will be in front of the world's largest data science community. What questions do you want to see answered?

  13. Protein Structural Domain Classification

    • cathdb.info
    • ec.i4cologne.com
    • +3more
    Updated Sep 30, 2024
    Cite
    (2024). Protein Structural Domain Classification [Dataset]. http://identifiers.org/MIR:00100005
    Explore at:
    Dataset updated
    Sep 30, 2024
    Description

    CATH Domain Classification List (latest release) - protein structural domains classified into CATH hierarchy.

  14. Confusion matrix for classification of web-scraped clothing data

    • ons.gov.uk
    • cy.ons.gov.uk
    csv
    Updated Sep 1, 2020
    Cite
    Office for National Statistics (2020). Confusion matrix for classification of web-scraped clothing data [Dataset]. https://www.ons.gov.uk/economy/inflationandpriceindices/datasets/confusionmatrixforclassificationofwebscrapedclothingdata
    Explore at:
    csvAvailable download formats
    Dataset updated
    Sep 1, 2020
    Dataset provided by
    Office for National Statisticshttp://www.ons.gov.uk/
    License

    Open Government Licence 3.0http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
    License information was derived automatically

    Description

    A confusion matrix can be used to compare a machine’s predictions against human classification. We can use confusion matrices to understand the consumption segments that the classifier is struggling to distinguish between. A confusion matrix for our XGBoost classification of web-scraped clothing data is available in this data download.
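    The idea can be reproduced in a few lines: count (human label, machine prediction) pairs and lay them out as a matrix whose off-diagonal cells expose the confusions. A minimal sketch with made-up clothing segments (the ONS dataset uses its own consumption segments and an XGBoost classifier; nothing here reproduces their data):

```python
from collections import Counter

def confusion_matrix(human_labels, predicted_labels, classes):
    """Rows = human (true) class, columns = machine prediction."""
    counts = Counter(zip(human_labels, predicted_labels))
    return [[counts[(t, p)] for p in classes] for t in classes]

# Hypothetical segments and labels, for illustration only.
classes = ["coats", "dresses", "shirts"]
human   = ["coats", "coats", "dresses", "shirts", "dresses"]
machine = ["coats", "dresses", "dresses", "shirts", "dresses"]

matrix = confusion_matrix(human, machine, classes)
# Off-diagonal cells show which segments the classifier confuses,
# e.g. one "coats" item misclassified as "dresses" below.
for cls, row in zip(classes, matrix):
    print(cls, row)
```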

  15. Business Website Visits Data | USA Coverage | Industry/Context...

    • datarade.ai
    .json, .csv, .txt
    Updated Jan 7, 2024
    Cite
    BIGDBM (2024). Business Website Visits Data | USA Coverage | Industry/Context Categorisation - Training Set for ML and AI [Dataset]. https://datarade.ai/data-products/bigdbm-website-visits-data-with-industry-context-categorizati-bigdbm
    Explore at:
    .json, .csv, .txtAvailable download formats
    Dataset updated
    Jan 7, 2024
    Dataset authored and provided by
    BIGDBM
    Area covered
    United States of America
    Description

    Website visit data with URLs, categories, timestamps, and anonymized unique device identifiers.

    Over 50 million unique devices per day. 1 billion+ raw signals per month with historical raw data available.

    This data can be combined with demographic and lifestyle data to provide a richer view of the anonymous users/devices.

    Intended for training ML and AI models.

  16. Web Filtering Software Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Sep 1, 2025
    Cite
    Growth Market Reports (2025). Web Filtering Software Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/web-filtering-software-market
    Explore at:
    pptx, csv, pdfAvailable download formats
    Dataset updated
    Sep 1, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Web Filtering Software Market Outlook



    According to our latest research, the global Web Filtering Software market size reached USD 5.42 billion in 2024 and is anticipated to grow at a robust CAGR of 14.1% during the forecast period. By 2033, the market is projected to achieve a value of USD 18.14 billion. This significant growth is driven by increasing concerns about cybersecurity threats, the proliferation of internet-connected devices, and the need for organizations to comply with stringent regulatory requirements. The rising adoption of cloud computing and the expansion of remote workforces are further fueling the demand for advanced web filtering solutions worldwide.
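    As a quick sanity check of the stated figures, compounding the 2024 base at the reported CAGR over the nine years to 2033 lands close to the projected value (the small gap comes from rounding in the stated 14.1% rate):

```python
# Figures from the report: USD 5.42 billion in 2024, 14.1% CAGR, USD 18.14 billion projected by 2033.
base_2024 = 5.42
cagr = 0.141
years = 2033 - 2024  # nine compounding periods

projected = base_2024 * (1 + cagr) ** years
print(round(projected, 2))  # roughly 17.8, in line with the reported 18.14
```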




    A primary growth factor for the web filtering software market is the escalating frequency and sophistication of cyberattacks targeting both enterprises and individuals. As organizations digitize their operations and employees access corporate resources remotely, the attack surface has expanded considerably. Threat actors are leveraging advanced phishing schemes, malware, and ransomware campaigns, making it essential for businesses to deploy comprehensive web filtering solutions. These tools help block access to malicious websites, prevent data exfiltration, and enforce acceptable use policies, thereby reducing the risk of breaches and safeguarding sensitive information. The increasing awareness of these threats among enterprises and the public sector is translating into higher investments in web filtering technologies.




    Another significant driver is the growing emphasis on regulatory compliance across various industries, especially those handling sensitive data such as BFSI, healthcare, and government. Regulations like the General Data Protection Regulation (GDPR), Health Insurance Portability and Accountability Act (HIPAA), and the Payment Card Industry Data Security Standard (PCI DSS) require organizations to implement robust security controls, including web filtering, to protect personal and financial data. Non-compliance can result in hefty fines and reputational damage, prompting organizations to adopt advanced web filtering software as part of their broader cybersecurity strategy. This regulatory push is particularly pronounced in developed regions, but is also gaining traction in emerging markets as digitalization accelerates.




    The surge in remote and hybrid work models has further accelerated the adoption of web filtering software. With employees accessing corporate networks from various locations and devices, traditional perimeter-based security approaches are no longer sufficient. Organizations are increasingly turning to cloud-based web filtering solutions that offer centralized policy management, real-time threat intelligence, and seamless scalability. These solutions enable businesses to maintain consistent security postures regardless of where employees are located, ensuring productivity and data protection in a highly distributed work environment. The flexibility and cost-effectiveness of cloud-based offerings are attracting both large enterprises and small and medium enterprises (SMEs).



    Web Categorization Services play a crucial role in enhancing the effectiveness of web filtering software. By classifying websites into various categories, these services enable organizations to implement more granular and precise filtering policies. This categorization helps in blocking access to inappropriate or harmful content while allowing access to legitimate and necessary resources. As the internet continues to grow and evolve, the ability to accurately categorize web content is becoming increasingly important for businesses and institutions aiming to maintain a secure and productive online environment. The integration of Web Categorization Services with web filtering solutions ensures that organizations can adapt to new threats and content trends, providing a dynamic and responsive approach to cybersecurity.




    From a regional perspective, North America currently dominates the web filtering software market, accounting for the largest revenue share in 2024. The region's leadership is attributed to the high adoption of digital technologies, a strong focus on cybersecurity, and the presence of leading market players. However, Asia Pacific is expected to witness the fastest growth during the forecast period, driven by rapid digital transformation in

  17. AIToolBuzz.com: 16K+ AI Tools Database

    • kaggle.com
    zip
    Updated Oct 25, 2025
    Cite
    devadigax (2025). AIToolBuzz.com: 16K+ AI Tools Database [Dataset]. https://www.kaggle.com/datasets/devadigax/aitoolbuzz-com-16k-ai-tools-database
    Explore at:
    zip(2258248 bytes)Available download formats
    Dataset updated
    Oct 25, 2025
    Authors
    devadigax
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    🧠 About Dataset

    Overview

    The AIToolBuzz — 16,763 AI Tools Dataset is a comprehensive collection of publicly available information on artificial intelligence tools and platforms curated from AIToolBuzz.com.
    It compiles detailed metadata about each tool, including name, description, category, founding year, technologies used, website, and operational status.

    The dataset serves as a foundation for AI trend analysis, product discovery, market research, and NLP-based categorization projects.
    It enables researchers, developers, and analysts to explore the evolution of AI tools, detect emerging sectors, and study keyword trends across industries.

    Dataset Composition

    • Total Entries: 16,763 AI tools
    • Time Period: Data collected in October 2025
    • Source: AIToolBuzz.com — a curated directory of AI products and services
    • Format: CSV (comma-separated), UTF-8 encoded
    • Columns: 13 descriptive fields covering both tool metadata and website status
    Columns and descriptions:

    • Name: the tool's official name
    • Link: URL of its page on AIToolBuzz
    • Logo: direct logo image URL
    • Category: functional domain (e.g., Communication, Marketing, Development)
    • Primary Task: main purpose or capability
    • Keywords: comma-separated tags describing tool functions and industries
    • Year Founded: year of company/tool inception
    • Short Description: concise summary of the tool
    • Country: headquarters or operating country
    • industry: industry classification
    • technologies: key technologies or frameworks associated with the tool
    • Website: official product/company website
    • Website Status: website availability (Active / Error / Not Reachable / etc.)
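    A minimal sketch of working with the CSV using the column names listed above (the two sample rows are invented for illustration and are not entries from the dataset):

```python
import csv
import io
from collections import Counter

# Inline stand-in for the dataset file; real usage would open the CSV from disk.
sample = io.StringIO(
    "Name,Category,Year Founded,Country,Website Status\n"
    "ToolA,Marketing,2021,USA,Active\n"
    "ToolB,Development,2023,Germany,Error\n"
)

reader = csv.DictReader(sample)
rows = list(reader)

# Tally tools per category and keep only those whose website is reachable.
by_category = Counter(row["Category"] for row in rows)
active = [row["Name"] for row in rows if row["Website Status"] == "Active"]
print(by_category, active)
```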

    Use Cases

    • 🧩 Market & Trend Analysis — Examine growth and patterns in AI categories, technologies, and geographies.
    • 🤖 NLP & ML Projects — Use keywords and descriptions for text clustering or embedding tasks.
    • 🏷️ Tool Discovery & Classification — Build AI tool recommenders or taxonomies.
    • 📊 Data Visualization — Create dashboards showing trends over time or by region.

    Example Entries

    • ChatGPT: Communication and Support; founded 2022; Estonia; website status Active
    • Claude: Operations and Management; founded 2023; United States; website status Active

    Provenance Summary

    • Source: AIToolBuzz.com — public web directory.
    • Collection Method: Automated web scraping via requests + BeautifulSoup, extracting metadata from each tool’s public page.
    • Date Collected: October 2025.
    • License: Derived dataset — redistribution permitted with attribution (CC BY 4.0 recommended).
    • Collector: Swathik Devadiga.
    • Frequency: Planned quarterly updates.

    Citation

    If you use this dataset, please cite as: AIToolBuzz — 16,763 AI Tools (Complete Directory with Metadata). Kaggle. https://aitoolbuzz.com

    License

    License: CC BY 4.0 — Creative Commons Attribution 4.0 International

    You are free to share and adapt the data for research or analysis with proper attribution to AIToolBuzz.com as the original source.

  18. Data from: Classification of web-based Digital Humanities projects...

    • data-staging.niaid.nih.gov
    • zenodo.org
    Updated Nov 28, 2024
    Cite
    Battisti, Tommaso (2024). Classification of web-based Digital Humanities projects leveraging information visualisation techniques [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_14192757
    Explore at:
    Dataset updated
    Nov 28, 2024
    Dataset provided by
    University of Bologna
    Authors
    Battisti, Tommaso
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description

    This dataset contains a list of 186 Digital Humanities projects leveraging information visualisation methods. Each project has been classified according to visualisation and interaction techniques, narrativity and narrative solutions, domain, methods for the representation of uncertainty and interpretation, and the employment of critical and custom approaches to visually represent humanities data.

    Classification schema: categories and columns

    The project_id column contains unique internal identifiers assigned to each project. Meanwhile, the last_access column records the most recent date (in DD/MM/YYYY format) on which each project was reviewed, based on the web address specified in the url column. The remaining columns can be grouped into descriptive categories aimed at characterising projects according to different aspects:

    Narrativity. It reports the presence of narratives employing information visualisation techniques. Here, the term narrative encompasses both author-driven linear data stories and more user-directed experiences where the narrative sequence is composed of user exploration [1]. We define 2 columns to identify projects using visualisation techniques in narrative, or non-narrative sections. Both conditions can be true for projects employing visualisations in both contexts. Columns:

    non_narrative (boolean)

    narrative (boolean)

    Domain. The humanities domain to which the project is related. We rely on [2] and the chapters of the first part of [3] to abstract a set of general domains. Column:

    domain (categorical):

    History and archaeology

    Art and art history

    Language and literature

    Music and musicology

    Multimedia and performing arts

    Philosophy and religion

    Other: both extra-list domains and cases of collections without a unique or specific thematic focus.

    Visualisation of uncertainty and interpretation. Building upon the frameworks proposed by [4] and [5], a set of categories was identified, highlighting a distinction between precise and impressional communication of uncertainty. Precise methods explicitly represent quantifiable uncertainty such as missing, unknown, or uncertain data, precisely locating and categorising it using visual variables and positioning. Two sub-categories are interactive distinction, when uncertain data is not visually distinguishable from the rest of the data but can be dynamically isolated or included/excluded categorically through interaction techniques (usually filters); and visual distinction, when uncertainty visually “emerges” from the representation by means of dedicated glyphs and spatial or visual cues and variables. On the other hand, impressional methods communicate the constructed and situated nature of data [6], exposing the interpretative layer of the visualisation and indicating more abstract and unquantifiable uncertainty using graphical aids or interpretative metrics. Two sub-categories are: ambiguation, when the use of graphical expedients (like permeable glyph boundaries or broken lines) visually conveys the ambiguity of a phenomenon; and interpretative metrics, when expressive, non-scientific, or non-punctual metrics are used to build a visualisation. Column:

    uncertainty_interpretation (categorical):

    Interactive distinction

    Visual distinction

    Ambiguation

    Interpretative metrics

    Critical adaptation. We identify projects in which, for at least one visualisation, the following criteria are fulfilled: 1) avoiding uncritical repurposing of prepackaged, generic-use, or ready-made solutions; 2) being tailored and unique to reflect the peculiarities of the phenomena at hand; 3) avoiding extreme simplifications, instead embracing and depicting complexity to promote time-spending visualisation-based inquiry. Column:

    critical_adaptation (boolean)

    Non-temporal visualisation techniques. We adopt and partially adapt the terminology and definitions from [7]. A column is defined for each type of visualisation and accounts for its presence within a project, also including stacked layouts and more complex variations. Columns and inclusion criteria:

    plot (boolean): visual representations that map data points onto a two-dimensional coordinate system.

    cluster_or_set (boolean): sets or cluster-based visualisations used to unveil possible inter-object similarities.

    map (boolean): geographical maps used to show spatial insights. While we do not specify the variants of maps (e.g., pin maps, dot density maps, flow maps, etc.), we make an exception for maps where each data point is represented by another visualisation (e.g., a map where each data point is a pie chart) by accounting for the presence of both in their respective columns.

    network (boolean): visual representations highlighting relational aspects through nodes connected by links or edges.

    hierarchical_diagram (boolean): tree-like structures such as tree diagrams, radial trees, but also dendrograms. They differ from networks for their strictly hierarchical structure and absence of closed connection loops.

    treemap (boolean): still hierarchical, but highlighting quantities expressed by means of area size. It also includes circle packing variants.

    word_cloud (boolean): clouds of words, where each instance’s size is proportional to its frequency in a related context.

    bars (boolean): includes bar charts, histograms, and variants. It coincides with “bar charts” in [7] but with a more generic term to refer to all bar-based visualisations.

    line_chart (boolean): the display of information as sequential data points connected by straight-line segments.

    area_chart (boolean): similar to a line chart but with a filled area below the segments. It also includes density plots.

    pie_chart (boolean): circular graphs divided into slices which can also use multi-level solutions.

    plot_3d (boolean): plots that use a third dimension to encode an additional variable.

    proportional_area (boolean): representations used to compare values through area size. Typically, using circle- or square-like shapes.

    other (boolean): it includes all other types of non-temporal visualisations that do not fall into the aforementioned categories.

    Temporal visualisations and encodings. In addition to non-temporal visualisations, a group of techniques to encode temporality is considered in order to enable comparisons with [7]. Columns:

    timeline (boolean): the display of a list of data points or spans in chronological order. They include timelines working either with a scale or simply displaying events in sequence. As in [7], we also include structured solutions resembling Gantt chart layouts.

    temporal_dimension (boolean): to report when time is mapped to any dimension of a visualisation, with the exclusion of timelines. We use the term “dimension” and not “axis” as in [7] as more appropriate for radial layouts or more complex representational choices.

    animation (boolean): temporality is perceived through an animation changing the visualisation according to time flow.

    visual_variable (boolean): another visual encoding strategy is used to represent any temporality-related variable (e.g., colour).

    Interaction techniques. A set of categories to assess affordable interaction techniques based on the concept of user intent [8] and user-allowed data actions [9]. The following categories roughly match the “processing”, “mapping”, and “presentation” actions from [9] and the manipulative subset of methods of the “how” an interaction is performed in the conception of [10]. Only interactions that affect the visual representation or the aspect of data points, symbols, and glyphs are taken into consideration. Columns:

    basic_selection (boolean): the demarcation of an element either for the duration of the interaction or more permanently until the occurrence of another selection.

    advanced_selection (boolean): the demarcation involves both the selected element and connected elements within the visualisation or leads to brush and link effects across views. Basic selection is tacitly implied.

    navigation (boolean): interactions that allow moving, zooming, panning, rotating, and scrolling the view but only when applied to the visualisation and not to the web page. It also includes “drill” interactions (to navigate through different levels or portions of data detail, often generating a new view that replaces or accompanies the original) and “expand” interactions generating new perspectives on data by expanding and collapsing nodes.

    arrangement (boolean): methods to organise visualisation elements (symbols, glyphs, etc.) or multi-visualisation layouts spatially through drag and drop or according to a criterion via more automatic triggers.

    change (boolean): visual encoding alterations involving different aspects of visualisation as a whole: the same content is presented with another visualisation technique; the change involves symbols or glyphs aspect (colour, size, shape, etc.); the visualisation type is unaltered, but the layout variant changes (e.g., to stacked layouts); or other changes like axes inversion and scale modifications. The presence of all the visualisation techniques involved in a change is reported.

    visualisation_filter (boolean): filters to exclude or include visualisation elements with respect to defined criteria, without reloading or generating a new visualisation. Unlike options triggering the fetch of new data to alter the visualisation content, filters seamlessly operate on existing visual elements.

    collection_filter (boolean): the interaction with visualised elements acts as a filter for a related collection or list of items (e.g., clicking a region on a map filters a list of items according to spatial metadata).

    aggregation (boolean): changes to the granularity of visual elements according to a

  19. arXiv publications dataset with simulated citation relationships

    • figshare.com
    txt
    Updated Jun 5, 2023
    Cite
    Jacek Miecznikowski; Dominik Tomaszuk (2023). arXiv publications dataset with simulated citation relationships [Dataset]. http://doi.org/10.6084/m9.figshare.6449756.v1
    Explore at:
    txtAvailable download formats
    Dataset updated
    Jun 5, 2023
    Dataset provided by
    figshare
    Figsharehttp://figshare.com/
    Authors
    Jacek Miecznikowski; Dominik Tomaszuk
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    arXiv publications dataset with simulated citation relationships.

    App: https://github.com/jacekmiecznikowski/neo4index (evaluates scientific research impact using author-level metrics, h-index and more).

    This collection contains data acquired from arXiv.org via the OAI2 protocol. arXiv does not provide citation metadata, so this data was pseudo-randomly simulated. We evaluated scientific research impact using six popular author-level metrics:

    • h-index
    • m quotient
    • e-index
    • m-index
    • r-index
    • ar-index

    Source: https://arxiv.org/help/bulk_data (downloaded 2018-03-23; over 1.3 million publications)

    Files:

    • arxiv_bulk_metadata_2018-03-23.tar.gz: downloaded using oai-harvester; contains metadata of all arXiv publications to date
    • categories.csv: categories from arXiv with the category/subcategory division
    • publications.csv: information about articles such as id, title, abstract, url, categories, and date
    • authors.csv: author data (first name, last name) and the id of the published article
    • citations.csv: simulated relationships between all publications, generated using arxivCite
    • indices.csv: the six author-level metrics, calculated on the database using neo4index

    Statistics:

    • h-index: average 3.5837, median 1.0, mode 1.0
    • m quotient: average 0.5831, median 0.4167, mode 1.0
    • e-index: average 7.9260, median 5.3852, mode 0.0
    • m-index: average 29.4368, median 17.0, mode 0.0
    • r-index: average 8.9311, median 5.831, mode 0.0
    • ar-index: average 3.5439, median 2.7928, mode 0.0
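    Of the six metrics, the h-index is the simplest to reproduce: an author has index h if h of their papers each have at least h citations. A small self-contained sketch (the citation counts here are invented, just as the dataset's citation relationships are simulated):

```python
def h_index(citations):
    """Largest h such that h papers have >= h citations each."""
    counts = sorted(citations, reverse=True)
    h = 0
    for rank, c in enumerate(counts, start=1):
        if c >= rank:
            h = rank  # the paper at this rank still has enough citations
        else:
            break
    return h

print(h_index([10, 8, 5, 4, 3]))  # 4: four papers have at least 4 citations
print(h_index([25, 8, 5, 3, 3]))  # 3: only three papers have at least 3 citations... and 4th has fewer than 4
```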

  20. lightspeed-categories

    • huggingface.co
    Updated Aug 31, 2025
    Cite
    smartfoloo (2025). lightspeed-categories [Dataset]. https://huggingface.co/datasets/smartfoloo/lightspeed-categories
    Explore at:
    Dataset updated
    Aug 31, 2025
    Authors
    smartfoloo
    License

    https://choosealicense.com/licenses/mpl-2.0/https://choosealicense.com/licenses/mpl-2.0/

    Description

    Lightspeed Categories

    This is a database of the categories used by the Lightspeed Filter. You can use it with the Lightspeed API, which returns the category number for a given URL.
