https://cubig.ai/store/terms-of-service
1) Data Introduction • The Website dataset is designed to facilitate the development of models for URL-based website classification.
2) Data Utilization (1) Characteristics of the website data: • This dataset is crucial for training models that can automatically classify websites based on their URL structures. (2) The website data can be used to: • Enhance cybersecurity measures by detecting malicious websites. • Improve content filtering systems for safer browsing experiences.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains two subsets of labeled website data, specifically created to enhance the performance of Homepage2Vec, a multi-label model for website classification. The datasets were generated using Large Language Models (LLMs) to provide more accurate and diverse topic annotations for websites, addressing a limitation of existing Homepage2Vec training data.
Key Features:
Dataset Composition:
Intended Use:
Additional Information:
Acknowledgments:
This dataset was created as part of a project at EPFL's Data Science Lab (DLab) in collaboration with Prof. Robert West and Tiziano Piccardi.
Domain: The URL itself.
Ranking: Page ranking.
isIp: Whether an IP address appears in the web link.
valid: Fetched from Google's WHOIS API, which tells us more about the current status of the URL's registration.
activeDuration: Also from the WHOIS API; gives the duration from registration up until now.
urlLen: Simply the length of the URL.
is@: If the link contains an '@' character, the value is 1.
isredirect: If the link has double dashes, there is a chance that it is a redirect; 1 means multiple dashes are present together.
haveDash: Whether there are any dashes in the domain name.
domainLen: The length of just the domain name.
noOfSubdomain: The number of subdomains present in the URL.
Labels: 0 -> legitimate website, 1 -> phishing/spam link
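A minimal sketch of how the lexical features above could be recomputed from a raw URL (illustrative only; url_features is a hypothetical helper, and the WHOIS-based fields valid and activeDuration are omitted):

import re
from urllib.parse import urlparse

def url_features(url):
    # Hypothetical helper, not part of the dataset: recompute the simple lexical fields described above.
    parsed = urlparse(url if "://" in url else "http://" + url)
    domain = parsed.netloc
    return {
        "urlLen": len(url),                              # length of the full URL
        "is@": int("@" in url),                          # 1 if an '@' character is present
        "isredirect": int("--" in url),                  # 1 if double dashes appear together
        "haveDash": int("-" in domain),                  # 1 if the domain name contains a dash
        "domainLen": len(domain),                        # length of the domain name only
        "noOfSubdomain": max(domain.count(".") - 1, 0),  # rough number of subdomains
        "isIp": int(bool(re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", domain))),  # 1 if the host is a bare IPv4 address
    }

print(url_features("http://login--secure-example.com/verify"))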
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Multi-Label Web Page Classification Dataset
Dataset Description
The Multi-Label Web Page Classification Dataset is a curated dataset containing web page titles and snippets, extracted from the CC-Meta25-1M dataset. Each entry has been automatically categorized into multiple predefined categories using ChatGPT-4o-mini. This dataset is designed for multi-label text classification tasks, making it ideal for training and evaluating machine learning models in web content… See the full description on the dataset page: https://huggingface.co/datasets/tshasan/multi-label-web-categorization.
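A minimal loading sketch, assuming the dataset can be pulled with the Hugging Face datasets library (the split and column names should be checked on the dataset page):

from datasets import load_dataset

ds = load_dataset("tshasan/multi-label-web-categorization", split="train")
print(ds[0])  # one record: page title, snippet, and the categories assigned by the LLM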
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
**************** Full Curlie dataset ****************
Curlie.org is presented as the largest human-edited directory of the Web. It contains 3M+ multilingual webpages classified in a hierarchical taxonomy that is language-specific but regroups the same 14 top-level categories. Unfortunately, the Curlie administrators do not provide a downloadable archive of this valuable content. Therefore, we decided to release our own dataset, which results from an in-depth scraping of the Curlie website. This dataset contains webpage URLs along with the category path (label) under which they are referenced in Curlie. For example, the International Ski Federation website (www.fis-ski.com) is referenced under the category path Sports/Winter_Sports/Skiing/Associations. The category path is language-specific, and we provide a mapping between English and other languages for alignment. The URLs have been filtered to only contain homepages (URLs with an empty path). Each distinct URL is indexed with a unique identifier (uid).
curlie.csv.gz > [url, uid, label, lang] x 2,275,150 samples
mapping.json.gz > [english_label, matchings] x 35,946 labels
**************** Processed Curlie dataset ****************
We provide here the ground data used to train Homepage2Vec. URLs have been further filtered: websites listed under the Regional top-category are dropped, as well as non-accessible websites. This filtering yields 933,416 valid entries. The labels are aligned across languages and reduced to the 14 top-categories (classes). There are 885,582 distinct URLs, for which the associated classes are represented with a binary class vector (a URL can belong to multiple classes). We provide the HTML content for each distinct URL. We also provide a visual encoding, obtained by forwarding a screenshot of the homepage through a ResNet deep-learning model pretrained on ImageNet. Finally, we provide the training and testing sets for reproducibility.
curlie_filtered.csv.gz > [url, uid, label, lang] x 933,416 samples
class_vector.json.gz > [uid, class_vector] x 885,582 samples
html_content.json.gz > [uid, html] x 885,582 samples
visual_encoding.json.gz > [uid, visual_encoding] x 885,582 samples
class_names.txt > [class_name] x 14 classes
train_uid.txt > [uid] x 797,023 samples
test_uid.txt > [uid] x 88,559 samples
**************** Enriched Curlie dataset ****************
Thanks to Homepage2Vec, we release an enriched version of Curlie. For each distinct URL, we provide the class probability vector (14 classes) and the latent-space embedding (100 dimensions).
outputs.json.gz > [uid, url, score, embedding] x 885,582 samples
**************** Pretrained Homepage2Vec ****************
h2v_1000_100.zip > Model pretrained on all features
h2v_1000_100_text_only.zip > Model pretrained only on textual features (no visual features from screenshots)
**************** Notes ****************
CSV files can be read with Python:
import pandas as pd
df = pd.read_csv("curlie.csv.gz", index_col=0)
JSON files have one record per line and can be read with Python:
import json
import gzip
with gzip.open("html_content.json.gz", "rt", encoding="utf-8") as file:
    for line in file:
        data = json.loads(line)
…
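A minimal reading sketch, assuming the per-line JSON records expose the uid and class_vector fields listed above; it builds a binary label matrix suitable for multi-label training:

import gzip
import json
import numpy as np

# 14 top-level class names, one per line.
with open("class_names.txt", "rt", encoding="utf-8") as f:
    class_names = [line.strip() for line in f if line.strip()]

# One JSON record per line: uid plus its binary class vector.
uids, vectors = [], []
with gzip.open("class_vector.json.gz", "rt", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        uids.append(record["uid"])
        vectors.append(record["class_vector"])

labels = np.asarray(vectors)  # expected shape: (885582, 14)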
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset provides a collection of website URLs and their corresponding cleaned text content, which have been categorised into various topics. It is designed to facilitate website classification tasks, offering valuable insights for web analytics and user experience analysis. The data was created by extracting and cleaning text from different websites, then assigning categories based on this content.
The dataset comprises 1408 rows of data. It is typically available in a CSV file format. The categories present in the dataset include 'Education' (8%), 'Business/Corporate' (8%), and 'Other' (84%), reflecting a diverse range of website types. There are 1375 unique website URLs and 1407 unique categories.
This dataset is ideal for various applications, including: * Website classification: Training models to automatically assign categories to new websites. * Website analytics: Understanding the topical distribution of websites. * User experience studies: Analysing website content for improved user engagement. * Data visualisation: Creating visual representations of website categories. * Natural Language Processing (NLP) tasks: Developing and testing NLP models for text extraction and categorisation. * Multiclass classification problems: Serving as a foundation for building complex classification algorithms.
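A minimal inspection sketch before tackling the tasks above, assuming hypothetical file and column names (the actual CSV headers may differ):

import pandas as pd

df = pd.read_csv("website_classification.csv")      # hypothetical file name
print(df["Category"].value_counts(normalize=True))  # hypothetical column name; expect Education, Business/Corporate, Other, ...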
The dataset offers global coverage, encompassing websites from various regions.
CC0
This dataset is suitable for: * Beginner data scientists and analysts looking to practice classification, NLP, and data visualisation. * Machine learning engineers developing and testing multiclass classification models. * Researchers interested in web content analysis and automatic categorisation. * Developers building applications that require website categorisation capabilities.
Original Data Source: Website Classification
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
This dataset is a collection of HTML files that includes examples of phishing websites and non-phishing websites, and it can be used to build classification models on the website content. I created this dataset as part of my practicum project for my Master's in Cybersecurity at Georgia Tech.
Cover Photo Source: Photo by Clive Kim from Pexels: https://www.pexels.com/photo/fishing-sea-dawn-landscape-5887837/
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
YouTube flows
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset moved to: https://figshare.com/articles/dataset/Curlie_Dataset_-_Language-agnostic_Website_Embedding_and_Classification/19406693
https://choosealicense.com/licenses/other/
The dataset comprises a manually curated selective archive produced by UKWA which includes the classification of sites into a two-tiered subject hierarchy.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset is a curated collection of over 800,000 URLs, designed to represent a variety of online domains. Approximately 52% of these domains are identified as legitimate entities, while the remaining 48% are categorised as phishing domains, indicating potential online threats. The dataset consists of two key columns: "url" and "status". The "status" column uses binary encoding, where 0 signifies phishing domains and 1 indicates legitimate domains. This balanced distribution between phishing and legitimate instances helps ensure the dataset's robustness for analysis and model development.
The dataset is provided in a CSV file format. It contains 808,042 unique entries. The distribution of statuses is approximately 394,982 entries flagged as phishing (0) and 427,028 entries flagged as legitimate (1). This offers an almost equal balance across the two categories.
This dataset is ideal for applications aimed at understanding, combating, and mitigating online threats. It can be used for developing models related to phishing detection, binary classification, and website analytics. It is also suitable for data cleaning exercises and projects involving Natural Language Processing (NLP) and Deep Learning.
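One possible baseline sketch, assuming the "url" and "status" columns described above (the file name is hypothetical, and this is not an official reference model): a character n-gram TF-IDF plus logistic regression phishing classifier.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("phishing_and_legitimate_urls.csv")  # hypothetical file name
X_train, X_test, y_train, y_test = train_test_split(
    df["url"], df["status"], test_size=0.2, stratify=df["status"], random_state=0)

model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(3, 5)),  # character n-grams capture URL structure
    LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))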
The data collection for this dataset is global in scope. While a specific time range for data collection is not provided, the dataset was listed on 05/06/2025.
CC0
This dataset is particularly valuable for researchers and practitioners working in the fields of AI and Machine Learning. Intended users include those looking to: * Develop and train models for identifying malicious URLs. * Analyse patterns distinguishing legitimate websites from phishing attempts. * Enhance cybersecurity measures and protect users from online threats.
Original Data Source: Phishing and Legitimate URLS
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Researchers from the Czech Republic have published a dataset for HTTPS traffic classification.
Since the data were captured mainly in a real backbone network, IP addresses and ports were omitted. The dataset consists of features calculated from bidirectional flows exported with the flow exporter ipfixprobe, which can export a sequence of packet lengths and times as well as a sequence of packet bursts and times. For more information, please visit the ipfixprobe repository.
During their research, they divided HTTPS traffic into the following categories: L -- Live Video Streaming, P -- Video Player, M -- Music Player, U -- File Upload, D -- File Download, and W -- Website and other traffic.
They chose service representatives known for particular traffic types based on the Alexa Top 1M list and Moz's list of the 500 most popular websites for each category. They also used several popular websites that primarily target a Czech audience. The identified traffic classes and their representatives are provided below:
Live Video Stream: Twitch, Czech TV, YouTube Live
Video Player: DailyMotion, Stream.cz, Vimeo, YouTube
Music Player: AppleMusic, Spotify, SoundCloud
File Upload/Download: FileSender, OwnCloud, OneDrive, Google Drive
Website and Other Traffic: websites from the Alexa Top 1M list
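A minimal preprocessing sketch (illustrative only, not the authors' pipeline): turning the variable-length per-flow packet-length sequences exported by ipfixprobe into fixed-size vectors by truncation and zero-padding.

import numpy as np

def pad_sequence(lengths, max_len=30):
    # Hypothetical helper: keep the first max_len packet lengths and pad with zeros.
    vec = np.zeros(max_len, dtype=float)
    seq = np.asarray(lengths[:max_len], dtype=float)
    vec[:seq.size] = seq
    return vec

flows = [[1500, 66, 1500, 1500], [120, 80, 80]]  # toy packet-length sequences
X = np.stack([pad_sequence(f) for f in flows])   # shape: (n_flows, 30), ready for a classifier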
The Tree of Life Web Project (ToL) is a collaborative effort of biologists and nature enthusiasts from around the world. On more than 10,000 World Wide Web pages, the project provides information about biodiversity, the characteristics of different groups of organisms, and their evolutionary history (phylogeny).
Land-water data was derived from imagery acquired at 350 feet using unmanned aerial systems (UAS) for 6 separate study locations using the Ricoh GR II camera. Three sites are healthy marsh and three sites are degraded marshes. For each study site, ground control markers were established and surveyed in using Real Time Kinematic (RTK) survey equipment. The imagery collected has been processed to produce a land-water classification dataset for scientific research. The land-water data will not only quantify how much marsh is being affected, but will also provide a spatial aspect as to where these degrading marsh fragmentations are occurring. The land-water data will be correlated with other data such as salinity, prescribed burns, flooding frequency, and flooding duration to better understand what events may be causing marsh deterioration. At low resolution, vegetation types do not cause any troubling issues with classification, but due to the high resolution of the imagery (1.18 inches / 0.03 meters) there will be inherent "noise" that causes speckling throughout the classified image. With the image resolution at such a small Ground Sample Distance (GSD), the smallest pieces of information will be visible. These small pieces of information, which we call "noise", will be introduced into our image classification and will mostly come from vegetation shadows and some water saturation. In this study, we are attempting to identify hollows, which are low areas or holes in the vegetation that may suggest a degradation of adjacent marsh. For our study analysis, a hollow is defined as an area that is 0.25 m × 0.25 m = 0.0625 m² (69 pixels) or greater. Any cluster of cells smaller than 69 pixels will be absorbed into the surrounding vegetation type. This method will help reduce noise and maintain confidence in the hollow identification.
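A minimal sketch (an assumed workflow, not the actual processing chain) of the 69-pixel minimum mapping unit described above: connected clusters smaller than 69 pixels are dropped from a binary water/hollow raster.

import numpy as np
from scipy import ndimage

water = np.random.rand(500, 500) > 0.7                    # placeholder binary raster (True = water/hollow pixel)
labeled, n = ndimage.label(water)                          # connected components of candidate hollows
sizes = ndimage.sum(water, labeled, np.arange(1, n + 1))   # pixel count of each component
small_labels = np.where(sizes < 69)[0] + 1                 # components below the 0.0625 m^2 (69 px) threshold
cleaned = water & ~np.isin(labeled, small_labels)          # small clusters absorbed into the surrounding class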
No description was included in this Dataset collected from the OSF
Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
A confusion matrix can be used to compare a machine’s predictions against human classification. We can use confusion matrices to understand the consumption segments that the classifier is struggling to distinguish between. A confusion matrix for our XGBoost classification of web-scraped clothing data is available in this data download.
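A minimal sketch (illustrative only, not the actual XGBoost pipeline) of producing such a confusion matrix from paired human labels and model predictions:

from sklearn.metrics import confusion_matrix

human   = ["coats", "dresses", "coats", "shoes", "dresses"]  # toy human classifications
machine = ["coats", "coats",   "coats", "shoes", "dresses"]  # toy model predictions
print(confusion_matrix(human, machine, labels=["coats", "dresses", "shoes"]))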
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains 2648 records of websites in different categories. The first column contains a URL for the website, while the second column contains the website category index and name. There are 7 categories in total:
0 – Business (508); 1 – Education (394); 2 – Adult (115); 3 – Games (385); 4 – Health (456); 5 – Sport (299); 6 – Travel (491).
Please note that some URLs can become unavailable over time.
We often find ourselves in a situation where we need to classify businesses and companies across a standard taxonomy. This dataset comes with pre-classified companies along with data scraped from their websites.
The scraped data from each website includes:
1. Category: The target label into which the company is classified
2. website: The website of the company / business
3. company_name: The company / business name
4. homepage_text: Visible homepage text
5. h1: The heading 1 tags from the HTML of the home page
6. h2: The heading 2 tags from the HTML of the home page
7. h3: The heading 3 tags from the HTML of the home page
8. nav_link_text: The visible titles of navigation links on the homepage (e.g. Home, Services, Product, About Us, Contact Us)
9. meta_keywords: The meta keywords in the header of the page HTML for SEO (more info: https://www.w3schools.com/tags/tag_meta.asp)
10. meta_description: The meta description in the header of the page HTML for SEO (more info: https://www.w3schools.com/tags/tag_meta.asp)
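A minimal scraping sketch (a hypothetical recreation, not the original collection script) showing how the homepage fields listed above could be gathered for a single site with requests and BeautifulSoup:

import requests
from bs4 import BeautifulSoup

def meta_content(soup, name):
    # Pull the content attribute of a <meta name=...> tag if present.
    tag = soup.find("meta", attrs={"name": name})
    return tag.get("content", "") if tag is not None else ""

html = requests.get("https://example.com", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

record = {
    "homepage_text": soup.get_text(" ", strip=True),
    "h1": [h.get_text(strip=True) for h in soup.find_all("h1")],
    "h2": [h.get_text(strip=True) for h in soup.find_all("h2")],
    "h3": [h.get_text(strip=True) for h in soup.find_all("h3")],
    "nav_link_text": [a.get_text(strip=True) for a in soup.select("nav a")],
    "meta_keywords": meta_content(soup, "keywords"),
    "meta_description": meta_content(soup, "description"),
}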
Data from the investigation published in the article "Using Machine Learning for Web Page Classification in Search Engine Optimization". Abstract of the article: This paper presents a novel approach of using machine learning algorithms based on experts' knowledge to classify web pages into three predefined classes according to the degree of content adjustment to the search engine optimization (SEO) recommendations. In this study, classifiers were built and trained to classify an unknown sample (web page) into one of the three predefined classes and to identify important factors that affect the degree of page adjustment. The data in the training set are manually labeled by domain experts. The experimental results show that machine learning can be used for predicting the degree of adjustment of web pages to the SEO recommendations: classifier accuracy ranges from 54.59% to 69.67%, which is higher than the baseline accuracy of classifying samples into the majority class (48.83%). The practical significance of the proposed approach is in providing the core for building software agents and expert systems to automatically detect web pages, or parts of web pages, that need improvement to comply with the SEO guidelines and, therefore, potentially gain higher rankings by search engines. The results of this study also contribute to the field of detecting optimal values of ranking factors that search engines use to rank web pages. Experiments in this paper suggest that important factors to take into consideration when preparing a web page are the page title, meta description, H1 tag (heading), and body text, which is aligned with the findings of previous research. Another result of this research is a new data set of manually labeled web pages that can be used in further research.
The pool of information on the internet is constantly increasing, which makes it difficult for ordinary users to sift out the important information. This dataset (hyperlink here) is meant to parse the relevant content of a website into a textual format, making it easy for users to understand the information.
The dataset contains features of various tags gathered from a number of informative websites. For more information on how the data was gathered, please visit https://github.com/swaroop-nath/Semantic-Web-Parser (development branch).