100+ datasets found
  1. Website Classification Dataset

    • cubig.ai
    Updated Feb 25, 2025
    Cite
    CUBIG (2025). Website Classification Dataset [Dataset]. https://cubig.ai/store/products/138/website-classification-dataset
    Dataset updated
    Feb 25, 2025
    Dataset authored and provided by
    CUBIG
    License

    https://cubig.ai/store/terms-of-service

    Measurement technique
    Privacy-preserving data transformation via differential privacy, Synthetic data generation using AI techniques for model training
    Description

    1) Data Introduction • The Website dataset is designed to facilitate the development of models for URL-based website classification.

    2) Data Utilization (1) Characteristics of the Website data: • This dataset is crucial for training models that can automatically classify websites based on their URL structures. (2) The Website data can be used for: • Enhancing cybersecurity measures by detecting malicious websites. • Improving content filtering systems for safer browsing experiences.

  2. Curlie Enhanced with LLM Annotations: Two Datasets for Advancing...

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Dec 21, 2023
    Cite
    Peter Nutter; Mika Senghaas; Ludek Cizinsky (2023). Curlie Enhanced with LLM Annotations: Two Datasets for Advancing Homepage2Vec's Multilingual Website Classification [Dataset]. http://doi.org/10.5281/zenodo.10413068
    Available download formats: csv
    Dataset updated
    Dec 21, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Peter Nutter; Mika Senghaas; Ludek Cizinsky
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Advancing Homepage2Vec with LLM-Generated Datasets for Multilingual Website Classification

    This dataset contains two subsets of labeled website data, specifically created to enhance the performance of Homepage2Vec, a multi-label model for website classification. The datasets were generated using Large Language Models (LLMs) to provide more accurate and diverse topic annotations for websites, addressing a limitation of existing Homepage2Vec training data.

    Key Features:

    • LLM-generated annotations: Both datasets feature website topic labels generated using LLMs, a novel approach to creating high-quality training data for website classification models.
    • Improved multi-label classification: Fine-tuning Homepage2Vec with these datasets has been shown to improve its macro F1 score from 38% to 43% evaluated on a human-labeled dataset, demonstrating their effectiveness in capturing a broader range of website topics.
    • Multilingual applicability: The datasets facilitate classification of websites in multiple languages, reflecting the inherent multilingual nature of Homepage2Vec.
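    For readers unfamiliar with the metric quoted above, macro F1 in a multi-label setting is the unweighted mean of per-class F1 scores. The following is a minimal sketch in plain Python, assuming each website is represented by a binary indicator vector. The function name and the convention of scoring an empty class as 0.0 are illustrative assumptions, not taken from the Homepage2Vec code.

```python
from typing import List, Sequence


def macro_f1(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> float:
    """Unweighted mean of per-class F1 over multi-label indicator vectors."""
    n_classes = len(y_true[0])
    scores: List[float] = []
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[c] == 1 and p[c] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[c] == 0 and p[c] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[c] == 1 and p[c] == 0)
        denom = 2 * tp + fp + fn
        # Assumed convention: a class with no true and no predicted labels scores 0.0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_classes


# Toy example with three websites and three hypothetical classes.
truth = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
guess = [[1, 0, 0], [0, 1, 0], [1, 0, 0]]
print(round(macro_f1(truth, guess), 4))
```

    Because every class counts equally, rare topics weigh as much as common ones, which is why macro F1 suits a claim about covering a broader range of topics.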

    Dataset Composition:

    • curlie-gpt3.5-10k: 10,000 websites labeled using GPT-3.5, context 2 and 1-shot
    • curlie-gpt4-10k: 10,000 websites labeled using GPT-4, context 2 and zero-shot

    Intended Use:

    • Fine-tuning and advancing Homepage2Vec or similar website classification models
    • Research on LLM-generated datasets for text classification tasks
    • Exploration of multilingual website classification

    Additional Information:

    Acknowledgments:

    This dataset was created as part of a project at EPFL's Data Science Lab (DLab) in collaboration with Prof. Robert West and Tiziano Piccardi.

  3. Phishing websites Data

    • kaggle.com
    Updated Aug 31, 2020
    Cite
    Aman Nagariya (2020). Phishing websites Data [Dataset]. https://www.kaggle.com/aman9d/phishing-data/code
    Dataset updated
    Aug 31, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Aman Nagariya
    Description

    Domain: The URL itself.
    Ranking: Page ranking.
    isIp: Whether the link contains an IP address.
    valid: Fetched from Google's whois API; describes the current status of the URL's registration.
    activeDuration: Also from the whois API; the time elapsed since registration.
    urlLen: The length of the URL.
    is@: 1 if the link contains an '@' character.
    isredirect: 1 if multiple dashes appear together in the link, which suggests it may be a redirect.
    haveDash: Whether there are any dashes in the domain name.
    domainLen: The length of the domain name alone.
    noOfSubdomain: The number of subdomains present in the URL.
    Labels: 0 -> Legitimate website, 1 -> Phishing link / spam link.
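    The lexical columns described above can be approximated directly from a URL string. The sketch below is one possible reading of those descriptions, not the dataset author's extraction code. It assumes a registrable domain always has two labels (so it miscounts suffixes such as co.uk), and it skips Ranking, valid, and activeDuration because they need external lookups.

```python
import re
from urllib.parse import urlparse


def url_features(url: str) -> dict:
    """Approximate the lexical columns of the phishing dataset from a URL."""
    # urlparse only fills in the host when a scheme is present.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""
    is_ip = re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host) is not None
    labels = host.split(".") if host else []
    return {
        "isIp": int(is_ip),
        "urlLen": len(url),
        "is@": int("@" in url),
        "isredirect": int("--" in url),   # dashes appearing together
        "haveDash": int("-" in host),
        "domainLen": len(host),
        # Assumption: the last two labels form the registrable domain.
        "noOfSubdomain": 0 if is_ip else max(len(labels) - 2, 0),
    }


print(url_features("http://login.secure-bank.example.com/a--b"))
```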

  4. multi-label-web-categorization

    • huggingface.co
    Cite
    Taimur, multi-label-web-categorization [Dataset]. https://huggingface.co/datasets/tshasan/multi-label-web-categorization
    Authors
    Taimur
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Multi-Label Web Page Classification Dataset

    Dataset Description

    The Multi-Label Web Page Classification Dataset is a curated dataset containing web page titles and snippets, extracted from the CC-Meta25-1M dataset. Each entry has been automatically categorized into multiple predefined categories using ChatGPT-4o-mini. This dataset is designed for multi-label text classification tasks, making it ideal for training and evaluating machine learning models in web content… See the full description on the dataset page: https://huggingface.co/datasets/tshasan/multi-label-web-categorization.

  5. Curlie Dataset - Language-agnostic Website Embedding and Classification

    • figshare.com
    application/gzip
    Updated May 31, 2023
    Cite
    Sylvain Lugeon; Tiziano Piccardi (2023). Curlie Dataset - Language-agnostic Website Embedding and Classification [Dataset]. http://doi.org/10.6084/m9.figshare.19406693.v5
    Available download formats: application/gzip
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Sylvain Lugeon; Tiziano Piccardi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    **************** Full Curlie dataset ****************
    Curlie.org is presented as the largest human-edited directory of the Web. It contains 3M+ multilingual webpages classified in a hierarchical taxonomy that is language-specific but shares the same 14 top-level categories. Unfortunately, the Curlie administrators do not provide a downloadable archive of this valuable content. Therefore, we decided to release our own dataset, which results from an in-depth scraping of the Curlie website. This dataset contains webpage URLs along with the category path (label) where they are referenced in Curlie. For example, the International Ski Federation website (www.fis-ski.com) is referenced under the category path Sports/Winter/Sports/Skiing/Associations. The category path is language-specific, and we provide a mapping between English and other languages for alignment. The URLs have been filtered to only contain homepages (URLs with an empty path). Each distinct URL is indexed with a unique identifier (uid).

        curlie.csv.gz > [url, uid, label, lang] x 2,275,150 samples
        mapping.json.gz > [english_label, matchings] x 35,946 labels

    **************** Processed Curlie dataset ****************
    We provide here the ground data used to train Homepage2Vec. URLs have been further filtered: websites listed under the Regional top-category are dropped, as are non-accessible websites. This filtering yields 933,416 valid entries. The labels are aligned across languages and reduced to the 14 top-categories (classes). There are 885,582 distinct URLs, for which the associated classes are represented with a binary class vector (a URL can belong to multiple classes). We provide the HTML content for each distinct URL. We also provide a visual encoding, obtained by forwarding a screenshot of the homepage through a ResNet deep-learning model pretrained on ImageNet. Finally, we provide the training and testing sets for reproducibility.

        curlie_filtered.csv.gz > [url, uid, label, lang] x 933,416 samples
        class_vector.json.gz > [uid, class_vector] x 885,582 samples
        html_content.json.gz > [uid, html] x 885,582 samples
        visual_encoding.json.gz > [uid, visual_encoding] x 885,582 samples
        class_names.txt > [class_name] x 14 classes
        train_uid.txt > [uid] x 797,023 samples
        test_uid.txt > [uid] x 88,559 samples

    **************** Enriched Curlie dataset ****************
    Thanks to Homepage2Vec, we release an enriched version of Curlie. For each distinct URL, we provide the class probability vector (14 classes) and the latent space embedding (100 dimensions).

        outputs.json.gz > [uid, url, score, embedding] x 885,582 samples

    **************** Pretrained Homepage2Vec ****************

        h2v_1000_100.zip > Model pretrained on all features
        h2v_1000_100_text_only.zip > Model pretrained only on textual features (no visual features from screenshots)

    **************** Notes ****************
    CSV files can be read with Python:

        import pandas as pd

        df = pd.read_csv("curlie.csv.gz", index_col=0)

    JSON files have one record per line and can be read with Python:

        import json
        import gzip

        with gzip.open("html_content.json.gz", "rt", encoding="utf-8") as file:
            for line in file:
                data = json.loads(line)
                …
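    The binary class vector mentioned in the processed dataset can be pictured as a multi-hot encoding of the top-level category of each label path. The sketch below illustrates the idea under stated assumptions: the three class names are a hypothetical subset (the real list ships in class_names.txt), labels are assumed to be already aligned to English, and the helper names are not from the authors' code.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical subset of classes; the real 14 names are in class_names.txt.
CLASS_NAMES = ["Arts", "Business", "Sports"]


def top_category(label_path: str) -> str:
    """Return the first component of a category path such as 'Sports/...'."""
    return label_path.split("/", 1)[0]


def class_vectors(
    rows: Iterable[Tuple[int, str]], class_names: List[str] = CLASS_NAMES
) -> Dict[int, List[int]]:
    """Build one multi-hot vector per uid from (uid, label) rows."""
    index = {name: i for i, name in enumerate(class_names)}
    vectors: Dict[int, List[int]] = {}
    for uid, label in rows:
        top = top_category(label)
        if top not in index:
            continue  # e.g. Regional entries, which the processed set drops
        vectors.setdefault(uid, [0] * len(class_names))[index[top]] = 1
    return vectors


rows = [
    (1, "Sports/Winter/Sports/Skiing/Associations"),
    (1, "Business/Retail"),
    (2, "Regional/Europe"),
]
print(class_vectors(rows))
```

    A URL listed under several categories ends up with several ones in its vector, which matches the statement that a URL can belong to multiple classes.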

  6. Website Categorisation Dataset

    • opendatabay.com
    Updated Jul 3, 2025
    Cite
    Datasimple (2025). Website Categorisation Dataset [Dataset]. https://www.opendatabay.com/data/dataset/42ebfeae-a971-4d33-af3d-41401587cd49
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Website Analytics & User Experience
    Description

    This dataset provides a collection of website URLs and their corresponding cleaned text content, which have been categorised into various topics. It is designed to facilitate website classification tasks, offering valuable insights for web analytics and user experience analysis. The data was created by extracting and cleaning text from different websites, then assigning categories based on this content.

    Columns

    • index: An identifier for each row in the dataset.
    • website_url: The URL link of the website.
    • cleaned_website_text: The cleaned text content extracted from the website URL.
    • Category: The assigned category of the URL.

    Distribution

    The dataset comprises 1408 rows of data. It is typically available in a CSV file format. The categories present in the dataset include 'Education' (8%), 'Business/Corporate' (8%), and 'Other' (84%), reflecting a diverse range of website types. There are 1375 unique website URLs and 1407 unique categories.

    Usage

    This dataset is ideal for various applications, including:

    • Website classification: Training models to automatically assign categories to new websites.
    • Website analytics: Understanding the topical distribution of websites.
    • User experience studies: Analysing website content for improved user engagement.
    • Data visualisation: Creating visual representations of website categories.
    • Natural Language Processing (NLP) tasks: Developing and testing NLP models for text extraction and categorisation.
    • Multiclass classification problems: Serving as a foundation for building complex classification algorithms.

    Coverage

    The dataset offers global coverage, encompassing websites from various regions.

    License

    CC0

    Who Can Use It

    This dataset is suitable for:

    • Beginner data scientists and analysts looking to practice classification, NLP, and data visualisation.
    • Machine learning engineers developing and testing multiclass classification models.
    • Researchers interested in web content analysis and automatic categorisation.
    • Developers building applications that require website categorisation capabilities.

    Dataset Name Suggestions

    • Website Categorisation Dataset
    • Web Content Classification
    • URL Classification Data
    • Cleaned Website Text Categories
    • Web Page Classification Repository

    Attributes

    Original Data Source: Website Classification

  7. Phishing Website HTML Classification

    • kaggle.com
    Updated Apr 14, 2022
    Cite
    Hunter Kempf (2022). Phishing Website HTML Classification [Dataset]. https://www.kaggle.com/datasets/huntingdata11/phishing-website-html-classification
    Dataset updated
    Apr 14, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Hunter Kempf
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    This dataset is a collection of HTML files that includes examples of phishing and non-phishing websites and can be used to build classification models on website content. I created this dataset as part of my practicum project for my Master's in Cybersecurity at Georgia Tech.

    Cover Photo Source: Photo by Clive Kim from Pexels: https://www.pexels.com/photo/fishing-sea-dawn-landscape-5887837/

  8. Netflix

    • ieee-dataport.org
    Updated Oct 1, 2021
    Cite
    Danil Shamsimukhametov (2021). Netflix [Dataset]. https://ieee-dataport.org/documents/youtube-netflix-web-dataset-encrypted-traffic-classification
    Dataset updated
    Oct 1, 2021
    Authors
    Danil Shamsimukhametov
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    YouTube
    Description

    YouTube flows

  9. Language-agnostic Website Embedding and Classification (Curlie dataset)

    • figshare.com
    txt
    Updated May 2, 2022
    Cite
    Tiziano Piccardi; Sylvain Lugeon (2022). Language-agnostic Website Embedding and Classification (Curlie dataset) [Dataset]. http://doi.org/10.6084/m9.figshare.16621669.v2
    Available download formats: txt
    Dataset updated
    May 2, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Tiziano Piccardi; Sylvain Lugeon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
  10. web_archive_classification

    • huggingface.co
    Updated Mar 28, 2025
    Cite
    British Library (2025). web_archive_classification [Dataset]. https://huggingface.co/datasets/TheBritishLibrary/web_archive_classification
    Dataset updated
    Mar 28, 2025
    Dataset authored and provided by
    British Library
    License

    https://choosealicense.com/licenses/other/

    Description

    The dataset comprises a manually curated selective archive produced by UKWA which includes the classification of sites into a two-tiered subject hierarchy.

  11. Phishing URL Classifier Dataset

    • opendatabay.com
    Updated Jul 3, 2025
    Cite
    Datasimple (2025). Phishing URL Classifier Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/705b35a9-e638-462d-a5e1-d9f70ff4234a
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Website Analytics & User Experience
    Description

    This dataset is a curated collection of over 800,000 URLs, designed to represent a variety of online domains. Approximately 52% of these domains are identified as legitimate entities, while the remaining 47% are categorised as phishing domains, indicating potential online threats. The dataset consists of two key columns: "url" and "status". The "status" column uses binary encoding, where 0 signifies phishing domains and 1 indicates legitimate domains. This balanced distribution between phishing and legitimate instances helps ensure the dataset's robustness for analysis and model development.

    Columns

    • url: This field contains the Uniform Resource Locators (URLs) for each domain, including both legitimate and phishing entries.
    • status: This field denotes the classification of the URL. A value of 0 represents a phishing domain, indicating a potential risk, while a value of 1 signifies a legitimate domain, offering assurance.

    Distribution

    The dataset is provided in a CSV file format. It contains 808,042 unique entries. The distribution of statuses is approximately 394,982 entries flagged as phishing (0) and 427,028 entries flagged as legitimate (1). This offers an almost equal balance across the two categories.
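    A quick way to check the class balance is to count the values of the "status" column. The sketch below assumes a CSV with the two columns named in the description; the sample rows are made up for illustration. It also sums the two counts quoted above, which come to 822,010 rather than 808,042, so the quoted figures are best treated as approximate.

```python
import csv
import io
from collections import Counter


def status_shares(csv_text: str) -> dict:
    """Return the share of each 'status' value in a url,status CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row["status"] for row in reader)
    total = sum(counts.values())
    return {status: n / total for status, n in counts.items()}


# Made-up sample rows; 0 = phishing, 1 = legitimate.
sample = "url,status\na.example,1\nb.example,0\nc.example,1\nd.example,1\n"
print(status_shares(sample))

# Counts as quoted in the listing.
stated = {"0": 394_982, "1": 427_028}
stated_total = sum(stated.values())
print(stated_total, {k: round(v / stated_total, 3) for k, v in stated.items()})
```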

    Usage

    This dataset is ideal for applications aimed at understanding, combating, and mitigating online threats. It can be used for developing models related to phishing detection, binary classification, and website analytics. It is also suitable for data cleaning exercises and projects involving Natural Language Processing (NLP) and Deep Learning.

    Coverage

    The data collection for this dataset is global in scope. While a specific time range for data collection is not provided, the dataset was listed on 05/06/2025.

    License

    CC0

    Who Can Use It

    This dataset is particularly valuable for researchers and practitioners working in the fields of AI and Machine Learning. Intended users include those looking to:

    • Develop and train models for identifying malicious URLs.
    • Analyse patterns distinguishing legitimate websites from phishing attempts.
    • Enhance cybersecurity measures and protect users from online threats.

    Dataset Name Suggestions

    • URL Phishing Detection
    • Legitimate and Malicious URLs
    • Online Threat URL Dataset
    • Phishing URL Classifier Data
    • Web Security URL Collection

    Attributes

    Original Data Source: Phishing and Legitimate URLS

  12. Data from: HTTPS traffic classification

    • kaggle.com
    Updated Mar 11, 2024
    Cite
    Đinh Ngọc Ân (2024). HTTPS traffic classification [Dataset]. https://www.kaggle.com/datasets/inhngcn/https-traffic-classification/code
    Dataset updated
    Mar 11, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Đinh Ngọc Ân
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    A team from the Czech Republic has published a dataset for HTTPS traffic classification.

    Since the data were captured mainly in the real backbone network, they omitted IP addresses and ports. The datasets consist of data calculated from bidirectional flows exported with the flow probe ipfixprobe. This exporter can export a sequence of packet lengths and times and a sequence of packet bursts and times. For more information, please visit the ipfixprobe repository.

    During the research, they divided HTTPS traffic into five categories: L (Live Video Streaming), P (Video Player), M (Music Player), U and D (File Upload and File Download), and W (Website and other traffic).

    They chose service representatives known for particular traffic types based on the Alexa Top 1M list and Moz's list of the most popular 500 websites for each category. They also used several popular websites that primarily target a Czech audience. The identified traffic classes and their representatives are provided below:

        Live Video Stream            Twitch, Czech TV, YouTube Live
        Video Player                 DailyMotion, Stream.cz, Vimeo, YouTube
        Music Player                 AppleMusic, Spotify, SoundCloud
        File Upload/Download         FileSender, OwnCloud, OneDrive, Google Drive
        Website and Other Traffic    Websites from Alexa Top 1M list

  13. Data from: Tree of Life Web Project Classification

    • gbif.org
    Updated Jul 13, 2025
    Cite
    David Maddison (2025). Tree of Life Web Project Classification [Dataset]. http://doi.org/10.15468/ibltvp
    Dataset updated
    Jul 13, 2025
    Dataset provided by
    Global Biodiversity Information Facility (https://www.gbif.org/)
    Tree of Life Web Project (http://www.tolweb.org/)
    Authors
    David Maddison
    Description

    The Tree of Life Web Project (ToL) is a collaborative effort of biologists and nature enthusiasts from around the world. On more than 10,000 World Wide Web pages, the project provides information about biodiversity, the characteristics of different groups of organisms, and their evolutionary history (phylogeny).

  14. Land-water classification for selected sites in McFaddin NWR and J.D....

    • catalog.data.gov
    • data.usgs.gov
    Updated Jul 6, 2024
    Cite
    U.S. Geological Survey (2024). Land-water classification for selected sites in McFaddin NWR and J.D. Murphree WMA [Dataset]. https://catalog.data.gov/dataset/land-water-classification-for-selected-sites-in-mcfaddin-nwr-and-j-d-murphree-wma
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    U.S. Geological Survey
    Area covered
    J.D. Murphree Wildlife Management Area
    Description

    Land-water data was derived from imagery acquired at 350 feet using unmanned aerial systems (UAS) for 6 separate study locations using the Ricoh GR II camera. Three sites are healthy marsh and three sites are degraded marshes. For each study site, ground control markers were established and surveyed in using Real Time Kinematic (RTK) survey equipment. The imagery collected has been processed to produce a land-water classification dataset for scientific research. The land-water data will not only quantify how much marsh is being affected, but the data will also provide a spatial aspect as to where these degrading marsh fragmentations are occurring. The land-water data will be correlated with other data such as salinity, prescribed burns, flooding frequency and flooding duration data to better understand what events may be causing marsh deterioration. With low resolution, vegetation types do not cause any troubling issues with classification but due to the high resolution of the imagery (1.18 inches/0.03 meters) there will be inherent “noise” that causes speckling throughout the classified image. With the image resolution at such a small Ground Sample Distance (GSD), the smallest of information will be visible. These small pieces of information that we call “noise” will be introduced into our image classification and will mostly come from vegetation shadows and some water saturation. In this study, we are attempting to identify hollows which are low areas or holes in the vegetation which may suggest a degradation of adjacent marsh. For our study analysis, a hollow is defined as an area that is .25m * .25m = 0.0625m2 (69 pixels) or greater. Any cluster of cells smaller than 69 pixels will be absorbed into the surrounding vegetation type. This method will help reduce noise and maintain confidence in the hollow identification.
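    The 69-pixel threshold follows from the stated hollow size and image resolution: a 0.25 m by 0.25 m area divided by the area of one 0.03 m pixel, rounded down. The sketch below reproduces that arithmetic and applies the threshold to a list of cluster sizes. The function names and the sample cluster sizes are illustrative assumptions, not part of the USGS workflow.

```python
import math

GSD_M = 0.03        # ground sample distance: metres per pixel side
MIN_SIDE_M = 0.25   # side length of the smallest area counted as a hollow


def min_hollow_pixels(side_m: float = MIN_SIDE_M, gsd_m: float = GSD_M) -> int:
    """Number of whole pixels covering a side_m x side_m square."""
    area_m2 = side_m * side_m          # 0.0625 m^2
    pixel_area_m2 = gsd_m * gsd_m      # about 0.0009 m^2
    return math.floor(area_m2 / pixel_area_m2)


def keep_hollows(cluster_sizes, threshold=None):
    """Keep clusters at or above the threshold; smaller ones count as noise."""
    if threshold is None:
        threshold = min_hollow_pixels()
    return [size for size in cluster_sizes if size >= threshold]


print(min_hollow_pixels())
print(keep_hollows([10, 68, 69, 250]))
```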

  15. EXTREME LEARNING MACHINE FOR CLASSIFICATION OF PHISHING WEBSITES FEATURES

    • osf.io
    Updated Apr 28, 2023
    Cite
    IAEME; Sumeshwar Singh (2023). EXTREME LEARNING MACHINE FOR CLASSIFICATION OF PHISHING WEBSITES FEATURES [Dataset]. http://doi.org/10.17605/OSF.IO/3K47V
    Dataset updated
    Apr 28, 2023
    Dataset provided by
    Center For Open Science
    Authors
    IAEME; Sumeshwar Singh
    Description

    No description was included in this Dataset collected from the OSF

  16. Confusion matrix for classification of web-scraped clothing data

    • ons.gov.uk
    • cy.ons.gov.uk
    csv
    Updated Sep 1, 2020
    Cite
    Office for National Statistics (2020). Confusion matrix for classification of web-scraped clothing data [Dataset]. https://www.ons.gov.uk/economy/inflationandpriceindices/datasets/confusionmatrixforclassificationofwebscrapedclothingdata
    Available download formats: csv
    Dataset updated
    Sep 1, 2020
    Dataset provided by
    Office for National Statistics (http://www.ons.gov.uk/)
    License

    Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
    License information was derived automatically

    Description

    A confusion matrix can be used to compare a machine’s predictions against human classification. We can use confusion matrices to understand the consumption segments that the classifier is struggling to distinguish between. A confusion matrix for our XGBoost classification of web-scraped clothing data is available in this data download.
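    As a small illustration of the idea, a confusion matrix can be built by counting (human label, machine prediction) pairs. The sketch below uses plain Python. The clothing labels and sample data are made up and are not taken from the ONS download.

```python
from collections import Counter
from typing import List, Sequence


def confusion_matrix(
    actual: Sequence[str], predicted: Sequence[str], labels: Sequence[str]
) -> List[List[int]]:
    """Rows are human labels, columns are machine predictions."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


# Hypothetical consumption segments and toy data.
labels = ["dress", "shirt"]
human = ["dress", "dress", "shirt", "shirt", "shirt"]
machine = ["dress", "shirt", "shirt", "shirt", "dress"]

for label, row in zip(labels, confusion_matrix(human, machine, labels)):
    print(label, row)
```

    Large off-diagonal cells point to the pairs of segments the classifier confuses most often.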

  17. Dataset for Web Page Classification

    • zenodo.org
    csv
    Updated Jul 7, 2025
    Cite
    Audrone Janaviciute; Agnius Liutkevicius; Nerijus Morkevicius (2025). Dataset for Web Page Classification [Dataset]. http://doi.org/10.5281/zenodo.15828693
    Available download formats: csv
    Dataset updated
    Jul 7, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Audrone Janaviciute; Agnius Liutkevicius; Nerijus Morkevicius
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jul 7, 2025
    Description

    This dataset contains 2648 records of websites in different categories. The first column contains the URL of the website, while the second column contains the website's category index and name. There are 7 categories in total:

    0 – Business (508); 1 – Education (394); 2 – Adult (115); 3 – Games (385); 4 – Health (456); 5 – Sport (299); 6 – Travel (491).

    Please note that some URLs can become unavailable over time.

  18. Company classification

    • kaggle.com
    Updated Mar 30, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    CharanPuvvala (2020). Company classification [Dataset]. https://www.kaggle.com/charanpuvvala/company-classification/activity
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 30, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    CharanPuvvala
    Description

    Context

    We often need to classify businesses and companies according to a standard taxonomy. This dataset comes with pre-classified companies along with data scraped from their websites.

    Content

    The data scraped from the website includes:

    1. Category: The target label into which the company is classified
    2. website: The website of the company / business
    3. company_name: The company / business name
    4. homepage_text: Visible homepage text
    5. h1: The heading 1 tags from the HTML of the home page
    6. h2: The heading 2 tags from the HTML of the home page
    7. h3: The heading 3 tags from the HTML of the home page
    8. nav_link_text: The visible titles of navigation links on the homepage (e.g. Home, Services, Product, About Us, Contact Us)
    9. meta_keywords: The meta keywords in the header of the page HTML for SEO (more info: https://www.w3schools.com/tags/tag_meta.asp)
    10. meta_description: The meta description in the header of the page HTML for SEO (more info: https://www.w3schools.com/tags/tag_meta.asp)
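    Several of these fields can be recovered from raw homepage HTML with Python's standard html.parser module. The sketch below covers the heading and meta fields only and is an illustration under assumptions, not the scraper used to build the dataset. The class and function names are hypothetical, and nested headings are not handled.

```python
from html.parser import HTMLParser


class HomepageFields(HTMLParser):
    """Collect h1/h2/h3 text and the keywords/description meta tags."""

    def __init__(self):
        super().__init__()
        self.fields = {"h1": [], "h2": [], "h3": [],
                       "meta_keywords": "", "meta_description": ""}
        self._open = None  # heading tag currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("h1", "h2", "h3"):
            self._open = tag
            self.fields[tag].append("")
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.fields["meta_" + name] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == self._open:
            self._open = None

    def handle_data(self, data):
        if self._open:
            self.fields[self._open][-1] += data


def extract_fields(html: str) -> dict:
    parser = HomepageFields()
    parser.feed(html)
    parser.close()
    for tag in ("h1", "h2", "h3"):
        parser.fields[tag] = [t.strip() for t in parser.fields[tag] if t.strip()]
    return parser.fields


page = ('<html><head><meta name="keywords" content="a, b">'
        '<meta name="description" content="Demo"></head>'
        '<body><h1>Hello <b>World</b></h1><h2>Sub</h2></body></html>')
print(extract_fields(page))
```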

    Inspiration

    Your data will be in front of the world's largest data science community. What questions do you want to see answered?

  19. Data set of the article: Using Machine Learning for Web Page Classification...

    • explore.openaire.eu
    Updated Jan 4, 2021
    Cite
    Goran Matošević; Jasminka Dobša; Dunja Mladenić (2021). Data set of the article: Using Machine Learning for Web Page Classification in Search Engine Optimization [Dataset]. http://doi.org/10.5281/zenodo.4416123
    Dataset updated
    Jan 4, 2021
    Authors
    Goran Matošević; Jasminka Dobša; Dunja Mladenić
    Description

    Data of investigation published in the article: "Using Machine Learning for Web Page Classification in Search Engine Optimization" Abstract of the article: This paper presents a novel approach of using machine learning algorithms based on experts’ knowledge to classify web pages into three predefined classes according to the degree of content adjustment to the search engine optimization (SEO) recommendations. In this study, classifiers were built and trained to classify an unknown sample (web page) into one of the three predefined classes and to identify important factors that affect the degree of page adjustment. The data in the training set are manually labeled by domain experts. The experimental results show that machine learning can be used for predicting the degree of adjustment of web pages to the SEO recommendations—classifier accuracy ranges from 54.59% to 69.67%, which is higher than the baseline accuracy of classification of samples in the majority class (48.83%). Practical significance of the proposed approach is in providing the core for building software agents and expert systems to automatically detect web pages, or parts of web pages, that need improvement to comply with the SEO guidelines and, therefore, potentially gain higher rankings by search engines. Also, the results of this study contribute to the field of detecting optimal values of ranking factors that search engines use to rank web pages. Experiments in this paper suggest that important factors to be taken into consideration when preparing a web page are page title, meta description, H1 tag (heading), and body text—which is aligned with the findings of previous research. Another result of this research is a new data set of manually labeled web pages that can be used in further research.
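    The baseline quoted in the abstract is the accuracy obtained by always predicting the most frequent class. A minimal sketch of that calculation follows; the class names and counts are invented for illustration and do not reproduce the article's data.

```python
from collections import Counter
from typing import Sequence


def majority_baseline_accuracy(labels: Sequence[str]) -> float:
    """Accuracy of a classifier that always predicts the most common class."""
    counts = Counter(labels)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(labels)


# Hypothetical three-class labelling of ten pages.
labels = ["low"] * 5 + ["medium"] * 3 + ["high"] * 2
print(majority_baseline_accuracy(labels))
```

    A trained classifier is only informative if its accuracy exceeds this value, which is the comparison the abstract makes.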

  20. Semantic Web Parsing Dataset Part I

    • kaggle.com
    Updated Apr 12, 2020
    Cite
    Swaroop Nath (2020). Semantic Web Parsing Dataset Part I [Dataset]. https://www.kaggle.com/swaroopnath6/semantic-web-parsing-dataset-part-i
    Dataset updated
    Apr 12, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Swaroop Nath
    Description

    Why I came up with the dataset?

    The pool of information on the internet is increasing constantly, which makes it difficult for ordinary users to pick out the important information. This dataset is meant to help parse the relevant content of a website into a textual format, making it easy for users to understand the information.

    What does it have?

    The dataset contains features of various tags that have been gathered from a variety of informative websites. For more information on how the data was gathered, please visit https://github.com/swaroop-nath/Semantic-Web-Parser (development branch).
