https://cubig.ai/store/terms-of-service
1) Data Introduction • The Website dataset is designed to facilitate the development of models for URL-based website classification.
2) Data Utilization (1) Characteristics of the website data: • This dataset is crucial for training models that can automatically classify websites based on their URL structures. (2) The website data can be used to: • Enhance cybersecurity measures by detecting malicious websites. • Improve content filtering systems for safer browsing experiences.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains two subsets of labeled website data, specifically created to enhance the performance of Homepage2Vec, a multi-label model for website classification. The datasets were generated using Large Language Models (LLMs) to provide more accurate and diverse topic annotations for websites, addressing a limitation of existing Homepage2Vec training data.
Key Features:
Dataset Composition:
Intended Use:
Additional Information:
Acknowledgments:
This dataset was created as part of a project at EPFL's Data Science Lab (DLab) in collaboration with Prof. Robert West and Tiziano Piccardi.
Domain: The URL itself.
Ranking: Page ranking.
isIp: Whether an IP address appears in the web link.
valid: Fetched from Google's WHOIS API, which tells us more about the current status of the URL's registration.
activeDuration: Also from the WHOIS API; gives the duration from registration up until now.
urlLen: Simply the length of the URL.
is@: If the link contains an '@' character, the value is 1.
isredirect: If the link has double dashes, there is a chance that it is a redirect; 1 means multiple dashes are present together.
haveDash: Whether there are any dashes in the domain name.
domainLen: The length of just the domain name.
noOfSubdomain: The number of subdomains present in the URL.
Labels: 0 -> legitimate website, 1 -> phishing/spam link
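A minimal sketch of how the lexical features above could be recomputed from a raw URL (illustrative only; url_features is a hypothetical helper, and the WHOIS-based fields valid and activeDuration are omitted):

import re
from urllib.parse import urlparse

def url_features(url):
    # Hypothetical helper, not part of the dataset: recompute the simple lexical fields described above.
    parsed = urlparse(url if "://" in url else "http://" + url)
    domain = parsed.netloc
    return {
        "urlLen": len(url),                              # length of the full URL
        "is@": int("@" in url),                          # 1 if an '@' character is present
        "isredirect": int("--" in url),                  # 1 if double dashes appear together
        "haveDash": int("-" in domain),                  # 1 if the domain name contains a dash
        "domainLen": len(domain),                        # length of the domain name only
        "noOfSubdomain": max(domain.count(".") - 1, 0),  # rough number of subdomains
        "isIp": int(bool(re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", domain))),  # 1 if the host is a bare IPv4 address
    }

print(url_features("http://login--secure-example.com/verify"))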
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Multi-Label Web Page Classification Dataset
Dataset Description
The Multi-Label Web Page Classification Dataset is a curated dataset containing web page titles and snippets, extracted from the CC-Meta25-1M dataset. Each entry has been automatically categorized into multiple predefined categories using ChatGPT-4o-mini. This dataset is designed for multi-label text classification tasks, making it ideal for training and evaluating machine learning models in web content… See the full description on the dataset page: https://huggingface.co/datasets/tshasan/multi-label-web-categorization.
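A minimal loading sketch, assuming the dataset can be pulled with the Hugging Face datasets library (the split and column names should be checked on the dataset page):

from datasets import load_dataset

ds = load_dataset("tshasan/multi-label-web-categorization", split="train")
print(ds[0])  # one record: page title, snippet, and the categories assigned by the LLM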
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
**************** Full Curlie dataset ****************
Curlie.org is presented as the largest human-edited directory of the Web. It contains 3M+ multilingual webpages classified in a hierarchical taxonomy that is language-specific but regroups the same 14 top-level categories. Unfortunately, the Curlie administrators do not provide a downloadable archive of this valuable content. Therefore, we decided to release our own dataset, which results from an in-depth scraping of the Curlie website. This dataset contains webpage URLs along with the category path (label) under which they are referenced in Curlie. For example, the International Ski Federation website (www.fis-ski.com) is referenced under the category path Sports/Winter_Sports/Skiing/Associations. The category path is language-specific, and we provide a mapping between English and other languages for alignment. The URLs have been filtered to only contain homepages (URLs with an empty path). Each distinct URL is indexed with a unique identifier (uid).
curlie.csv.gz > [url, uid, label, lang] x 2,275,150 samples
mapping.json.gz > [english_label, matchings] x 35,946 labels
**************** Processed Curlie dataset ****************
We provide here the ground data used to train Homepage2Vec. URLs have been further filtered: websites listed under the Regional top-category are dropped, as well as non-accessible websites. This filtering yields 933,416 valid entries. The labels are aligned across languages and reduced to the 14 top-categories (classes). There are 885,582 distinct URLs, for which the associated classes are represented with a binary class vector (a URL can belong to multiple classes). We provide the HTML content for each distinct URL. We also provide a visual encoding, obtained by forwarding a screenshot of the homepage through a ResNet deep-learning model pretrained on ImageNet. Finally, we provide the training and testing sets for reproducibility.
curlie_filtered.csv.gz > [url, uid, label, lang] x 933,416 samples
class_vector.json.gz > [uid, class_vector] x 885,582 samples
html_content.json.gz > [uid, html] x 885,582 samples
visual_encoding.json.gz > [uid, visual_encoding] x 885,582 samples
class_names.txt > [class_name] x 14 classes
train_uid.txt > [uid] x 797,023 samples
test_uid.txt > [uid] x 88,559 samples
**************** Enriched Curlie dataset ****************
Thanks to Homepage2Vec, we release an enriched version of Curlie. For each distinct URL, we provide the class probability vector (14 classes) and the latent-space embedding (100 dimensions).
outputs.json.gz > [uid, url, score, embedding] x 885,582 samples
**************** Pretrained Homepage2Vec ****************
h2v_1000_100.zip > Model pretrained on all features
h2v_1000_100_text_only.zip > Model pretrained only on textual features (no visual features from screenshots)
**************** Notes ****************
CSV files can be read with Python:
import pandas as pd
df = pd.read_csv("curlie.csv.gz", index_col=0)
JSON files have one record per line and can be read with Python:
import json
import gzip
with gzip.open("html_content.json.gz", "rt", encoding="utf-8") as file:
    for line in file:
        data = json.loads(line)
…
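A minimal reading sketch, assuming the per-line JSON records expose the uid and class_vector fields listed above; it builds a binary label matrix suitable for multi-label training:

import gzip
import json
import numpy as np

# 14 top-level class names, one per line.
with open("class_names.txt", "rt", encoding="utf-8") as f:
    class_names = [line.strip() for line in f if line.strip()]

# One JSON record per line: uid plus its binary class vector.
uids, vectors = [], []
with gzip.open("class_vector.json.gz", "rt", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        uids.append(record["uid"])
        vectors.append(record["class_vector"])

labels = np.asarray(vectors)  # expected shape: (885582, 14)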
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset provides a collection of website URLs and their corresponding cleaned text content, which have been categorised into various topics. It is designed to facilitate website classification tasks, offering valuable insights for web analytics and user experience analysis. The data was created by extracting and cleaning text from different websites, then assigning categories based on this content.
The dataset comprises 1408 rows of data. It is typically available in a CSV file format. The categories present in the dataset include 'Education' (8%), 'Business/Corporate' (8%), and 'Other' (84%), reflecting a diverse range of website types. There are 1375 unique website URLs and 1407 unique categories.
This dataset is ideal for various applications, including: * Website classification: Training models to automatically assign categories to new websites. * Website analytics: Understanding the topical distribution of websites. * User experience studies: Analysing website content for improved user engagement. * Data visualisation: Creating visual representations of website categories. * Natural Language Processing (NLP) tasks: Developing and testing NLP models for text extraction and categorisation. * Multiclass classification problems: Serving as a foundation for building complex classification algorithms.
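A minimal inspection sketch before tackling the tasks above, assuming hypothetical file and column names (the actual CSV headers may differ):

import pandas as pd

df = pd.read_csv("website_classification.csv")      # hypothetical file name
print(df["Category"].value_counts(normalize=True))  # hypothetical column name; expect Education, Business/Corporate, Other, ...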
The dataset offers global coverage, encompassing websites from various regions.
CC0
This dataset is suitable for: * Beginner data scientists and analysts looking to practice classification, NLP, and data visualisation. * Machine learning engineers developing and testing multiclass classification models. * Researchers interested in web content analysis and automatic categorisation. * Developers building applications that require website categorisation capabilities.
Original Data Source: Website Classification
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
This dataset is a collection of HTML files that includes examples of phishing websites and non-phishing websites, and it can be used to build classification models on the website content. I created this dataset as part of my practicum project for my Master's in Cybersecurity at Georgia Tech.
Cover Photo Source: Photo by Clive Kim from Pexels: https://www.pexels.com/photo/fishing-sea-dawn-landscape-5887837/
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
YouTube flows
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset moved to: https://figshare.com/articles/dataset/Curlie_Dataset_-_Language-agnostic_Website_Embedding_and_Classification/19406693
https://choosealicense.com/licenses/other/
The dataset comprises a manually curated selective archive produced by UKWA which includes the classification of sites into a two-tiered subject hierarchy.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset is a curated collection of over 800,000 URLs, designed to represent a variety of online domains. Approximately 52% of these domains are identified as legitimate entities, while the remaining 48% are categorised as phishing domains, indicating potential online threats. The dataset consists of two key columns: "url" and "status". The "status" column uses binary encoding, where 0 signifies phishing domains and 1 indicates legitimate domains. This balanced distribution between phishing and legitimate instances helps ensure the dataset's robustness for analysis and model development.
The dataset is provided in a CSV file format. It contains 808,042 unique entries. The distribution of statuses is approximately 394,982 entries flagged as phishing (0) and 427,028 entries flagged as legitimate (1). This offers an almost equal balance across the two categories.
This dataset is ideal for applications aimed at understanding, combating, and mitigating online threats. It can be used for developing models related to phishing detection, binary classification, and website analytics. It is also suitable for data cleaning exercises and projects involving Natural Language Processing (NLP) and Deep Learning.
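One possible baseline sketch, assuming the "url" and "status" columns described above (the file name is hypothetical, and this is not an official reference model): a character n-gram TF-IDF plus logistic regression phishing classifier.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("phishing_and_legitimate_urls.csv")  # hypothetical file name
X_train, X_test, y_train, y_test = train_test_split(
    df["url"], df["status"], test_size=0.2, stratify=df["status"], random_state=0)

model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(3, 5)),  # character n-grams capture URL structure
    LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))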
The data collection for this dataset is global in scope. While a specific time range for data collection is not provided, the dataset was listed on 05/06/2025.
CC0
This dataset is particularly valuable for researchers and practitioners working in the fields of AI and Machine Learning. Intended users include those looking to: * Develop and train models for identifying malicious URLs. * Analyse patterns distinguishing legitimate websites from phishing attempts. * Enhance cybersecurity measures and protect users from online threats.
Original Data Source: Phishing and Legitimate URLS
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Researchers from the Czech Republic have published a dataset for HTTPS traffic classification.
Since the data were captured mainly in a real backbone network, IP addresses and ports were omitted. The dataset consists of features calculated from bidirectional flows exported with the flow exporter ipfixprobe, which can export a sequence of packet lengths and times as well as a sequence of packet bursts and times. For more information, please visit the ipfixprobe repository.
During their research, they divided HTTPS traffic into the following categories: L -- Live Video Streaming, P -- Video Player, M -- Music Player, U -- File Upload, D -- File Download, and W -- Website and other traffic.
They chose service representatives known for particular traffic types based on the Alexa Top 1M list and Moz's list of the 500 most popular websites for each category. They also used several popular websites that primarily target a Czech audience. The identified traffic classes and their representatives are provided below:
Live Video Stream: Twitch, Czech TV, YouTube Live
Video Player: DailyMotion, Stream.cz, Vimeo, YouTube
Music Player: AppleMusic, Spotify, SoundCloud
File Upload/Download: FileSender, OwnCloud, OneDrive, Google Drive
Website and Other Traffic: websites from the Alexa Top 1M list
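A minimal preprocessing sketch (illustrative only, not the authors' pipeline): turning the variable-length per-flow packet-length sequences exported by ipfixprobe into fixed-size vectors by truncation and zero-padding.

import numpy as np

def pad_sequence(lengths, max_len=30):
    # Hypothetical helper: keep the first max_len packet lengths and pad with zeros.
    vec = np.zeros(max_len, dtype=float)
    seq = np.asarray(lengths[:max_len], dtype=float)
    vec[:seq.size] = seq
    return vec

flows = [[1500, 66, 1500, 1500], [120, 80, 80]]  # toy packet-length sequences
X = np.stack([pad_sequence(f) for f in flows])   # shape: (n_flows, 30), ready for a classifier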
The Tree of Life Web Project (ToL) is a collaborative effort of biologists and nature enthusiasts from around the world. On more than 10,000 World Wide Web pages, the project provides information about biodiversity, the characteristics of different groups of organisms, and their evolutionary history (phylogeny).
Land-water data was derived from imagery acquired at 350 feet using unmanned aerial systems (UAS) for 6 separate study locations using the Ricoh GR II camera. Three sites are healthy marsh and three sites are degraded marshes. For each study site, ground control markers were established and surveyed in using Real Time Kinematic (RTK) survey equipment. The imagery collected has been processed to produce a land-water classification dataset for scientific research. The land-water data will not only quantify how much marsh is being affected, but will also provide a spatial aspect as to where these degrading marsh fragmentations are occurring. The land-water data will be correlated with other data such as salinity, prescribed burns, flooding frequency, and flooding duration to better understand what events may be causing marsh deterioration. At low resolution, vegetation types do not cause any troubling issues with classification, but due to the high resolution of the imagery (1.18 inches / 0.03 meters) there will be inherent "noise" that causes speckling throughout the classified image. With the image resolution at such a small Ground Sample Distance (GSD), the smallest pieces of information will be visible. These small pieces of information, which we call "noise", will be introduced into our image classification and will mostly come from vegetation shadows and some water saturation. In this study, we are attempting to identify hollows, which are low areas or holes in the vegetation that may suggest a degradation of adjacent marsh. For our study analysis, a hollow is defined as an area that is 0.25 m × 0.25 m = 0.0625 m² (69 pixels) or greater. Any cluster of cells smaller than 69 pixels will be absorbed into the surrounding vegetation type. This method will help reduce noise and maintain confidence in the hollow identification.
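A minimal sketch (an assumed workflow, not the actual processing chain) of the 69-pixel minimum mapping unit described above: connected clusters smaller than 69 pixels are dropped from a binary water/hollow raster.

import numpy as np
from scipy import ndimage

water = np.random.rand(500, 500) > 0.7                    # placeholder binary raster (True = water/hollow pixel)
labeled, n = ndimage.label(water)                          # connected components of candidate hollows
sizes = ndimage.sum(water, labeled, np.arange(1, n + 1))   # pixel count of each component
small_labels = np.where(sizes < 69)[0] + 1                 # components below the 0.0625 m^2 (69 px) threshold
cleaned = water & ~np.isin(labeled, small_labels)          # small clusters absorbed into the surrounding class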
No description was included in this Dataset collected from the OSF
Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
A confusion matrix can be used to compare a machine’s predictions against human classification. We can use confusion matrices to understand the consumption segments that the classifier is struggling to distinguish between. A confusion matrix for our XGBoost classification of web-scraped clothing data is available in this data download.
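A minimal sketch (illustrative only, not the actual XGBoost pipeline) of producing such a confusion matrix from paired human labels and model predictions:

from sklearn.metrics import confusion_matrix

human   = ["coats", "dresses", "coats", "shoes", "dresses"]  # toy human classifications
machine = ["coats", "coats",   "coats", "shoes", "dresses"]  # toy model predictions
print(confusion_matrix(human, machine, labels=["coats", "dresses", "shoes"]))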
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains 2648 records of websites in different categories. The first column contains a URL for the website, while the second column contains the website category index and name. There are 7 categories in total:
0 – Business (508); 1 – Education (394); 2 – Adult (115); 3 – Games (385); 4 – Health (456); 5 – Sport (299); 6 – Travel (491).
Please note that some URLs can become unavailable over time.
We often find ourselves in a situation where we need to classify businesses and companies across a standard taxonomy. This dataset comes with pre-classified companies along with data scraped from their websites.
The scraped data from each website includes:
1. Category: The target label into which the company is classified
2. website: The website of the company / business
3. company_name: The company / business name
4. homepage_text: Visible homepage text
5. h1: The heading 1 tags from the HTML of the home page
6. h2: The heading 2 tags from the HTML of the home page
7. h3: The heading 3 tags from the HTML of the home page
8. nav_link_text: The visible titles of navigation links on the homepage (e.g. Home, Services, Product, About Us, Contact Us)
9. meta_keywords: The meta keywords in the header of the page HTML for SEO (more info: https://www.w3schools.com/tags/tag_meta.asp)
10. meta_description: The meta description in the header of the page HTML for SEO (more info: https://www.w3schools.com/tags/tag_meta.asp)
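A minimal scraping sketch (a hypothetical recreation, not the original collection script) showing how the homepage fields listed above could be gathered for a single site with requests and BeautifulSoup:

import requests
from bs4 import BeautifulSoup

def meta_content(soup, name):
    # Pull the content attribute of a <meta name=...> tag if present.
    tag = soup.find("meta", attrs={"name": name})
    return tag.get("content", "") if tag is not None else ""

html = requests.get("https://example.com", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

record = {
    "homepage_text": soup.get_text(" ", strip=True),
    "h1": [h.get_text(strip=True) for h in soup.find_all("h1")],
    "h2": [h.get_text(strip=True) for h in soup.find_all("h2")],
    "h3": [h.get_text(strip=True) for h in soup.find_all("h3")],
    "nav_link_text": [a.get_text(strip=True) for a in soup.select("nav a")],
    "meta_keywords": meta_content(soup, "keywords"),
    "meta_description": meta_content(soup, "description"),
}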
Data from the investigation published in the article "Using Machine Learning for Web Page Classification in Search Engine Optimization". Abstract of the article: This paper presents a novel approach of using machine learning algorithms based on experts' knowledge to classify web pages into three predefined classes according to the degree of content adjustment to the search engine optimization (SEO) recommendations. In this study, classifiers were built and trained to classify an unknown sample (web page) into one of the three predefined classes and to identify important factors that affect the degree of page adjustment. The data in the training set are manually labeled by domain experts. The experimental results show that machine learning can be used for predicting the degree of adjustment of web pages to the SEO recommendations: classifier accuracy ranges from 54.59% to 69.67%, which is higher than the baseline accuracy of classifying samples into the majority class (48.83%). The practical significance of the proposed approach is in providing the core for building software agents and expert systems to automatically detect web pages, or parts of web pages, that need improvement to comply with the SEO guidelines and, therefore, potentially gain higher rankings by search engines. The results of this study also contribute to the field of detecting optimal values of ranking factors that search engines use to rank web pages. Experiments in this paper suggest that important factors to take into consideration when preparing a web page are the page title, meta description, H1 tag (heading), and body text, which is aligned with the findings of previous research. Another result of this research is a new data set of manually labeled web pages that can be used in further research.
The pool of information on the internet is constantly increasing, which makes it difficult for ordinary users to sift out the important information. This dataset (hyperlink here) is meant to parse the relevant content of a website into a textual format, making it easy for users to understand the information.
The dataset contains features of various tags gathered from a number of informative websites. For more information on how the data was gathered, please visit https://github.com/swaroop-nath/Semantic-Web-Parser (development branch).