99 datasets found

Top Visited Websites
kaggle.com
Updated Nov 19, 2022
Cite
The Devastator (2022). Top Visited Websites [Dataset]. https://www.kaggle.com/datasets/thedevastator/the-top-websites-in-the-world/discussion
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Nov 19, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
The Devastator
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
The Top Websites in the World

How They Change Over Time

About this dataset

This dataset consists of the top 50 most visited websites in the world, as well as the category and principal country/territory for each site. The data provides insights into which sites are most popular globally, and what type of content is most popular in different parts of the world.

How to use the dataset

This dataset can be used to track the most popular websites in the world over time. It can also be used to compare website popularity between different countries and categories.

Research Ideas

To track the most popular websites in the world over time

To see how website popularity changes by region

To find out which website categories are most popular

Acknowledgements

Dataset by Alexa Internet, Inc. (2019), released on Kaggle under the Open Data Commons Public Domain Dedication and License (ODC-PDDL)

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: df_1.csv

| Column name | Description |
|:--------------------------------|:---------------------------------------------------------------------|
| Site | The name of the website. (String) |
| Domain Name | The domain name of the website. (String) |
| Category | The category of the website. (String) |
| Principal country/territory | The principal country/territory where the website is based. (String) |
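The column layout above lends itself to simple per-category and per-country tallies. The following is a minimal sketch using only the standard library; the sample rows are invented placeholders in the documented column layout, not values from df_1.csv, and `count_by_column` is a hypothetical helper name.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows in the documented df_1.csv column layout.
# The values are illustrative placeholders, not taken from the dataset.
SAMPLE_CSV = """Site,Domain Name,Category,Principal country/territory
Example Search,search.example,Search engine,Country A
Example Video,video.example,Video sharing,Country A
Example Shop,shop.example,E-commerce,Country B
Example Find,find.example,Search engine,Country B
"""


def count_by_column(csv_text, column):
    """Count how many rows share each value of the given column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[column] for row in reader)


category_counts = count_by_column(SAMPLE_CSV, "Category")
country_counts = count_by_column(SAMPLE_CSV, "Principal country/territory")
print(category_counts.most_common())
print(country_counts.most_common())
```

To run the same tally on the real file, the text of df_1.csv could be read from disk and passed in place of `SAMPLE_CSV`, assuming its header row matches the column names listed above.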
Website Statistics
data.wu.ac.at
data.europa.eu
csv, pdf
Updated Jun 11, 2018
Cite
Lincolnshire County Council (2018). Website Statistics [Dataset]. https://data.wu.ac.at/schema/data_gov_uk/M2ZkZDBjOTUtMzNhYi00YWRjLWI1OWMtZmUzMzA5NjM0ZTdk
Explore at:
csv, pdf (available download formats)
Dataset updated
Jun 11, 2018
Dataset provided by
Lincolnshire County Council (http://www.lincolnshire.gov.uk/)
License
Open Government Licence 3.0 (http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/)
License information was derived automatically
Description
This Website Statistics dataset has four resources showing usage of the Lincolnshire Open Data website. Web analytics terms used in each resource are defined in their accompanying Metadata file.

Website Usage Statistics: This document shows a statistical summary of usage of the Lincolnshire Open Data site for the latest calendar year.

Website Statistics Summary: This dataset shows a website statistics summary for the Lincolnshire Open Data site for the latest calendar year.

Webpage Statistics: This dataset shows statistics for individual Webpages on the Lincolnshire Open Data site by calendar year.

Dataset Statistics: This dataset shows cumulative totals for Datasets on the Lincolnshire Open Data site that have also been published on the national Open Data site Data.Gov.UK - see the Source link.

Note: Website and Webpage statistics (the first three resources above) show only UK users, and exclude API calls (automated requests for datasets). The Dataset Statistics are confined to users with JavaScript enabled, which excludes web crawlers and API calls.

These Website Statistics resources are updated annually in January by the Lincolnshire County Council Business Intelligence team. For any enquiries about the information contact opendata@lincolnshire.gov.uk.
🕵️ Phishing Websites Data
kaggle.com
Updated Feb 24, 2025
Cite
Sairaj Adhav (2025). 🕵️ Phishing Websites Data [Dataset]. https://www.kaggle.com/datasets/sai10py/phishing-websites-data
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Feb 24, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Sairaj Adhav
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
Phishing Websites Dataset

Overview

This dataset is designed to aid in the analysis and detection of phishing websites. It contains various features that help distinguish between legitimate and phishing websites based on their structural, security, and behavioral attributes.

Dataset Information

Total Columns: 31 (30 Features + 1 Target)

Target Variable: Result (Indicates whether a website is phishing or legitimate)

Features Description

URL-Based Features

Prefix_Suffix – Checks if the URL contains a hyphen (-), which is commonly used in phishing domains.

double_slash_redirecting – Detects if the URL redirects using //, which may indicate a phishing attempt.

having_At_Symbol – Identifies the presence of @ in the URL, which can be used to deceive users.

Shortining_Service – Indicates whether the URL uses a shortening service (e.g., bit.ly, tinyurl).

URL_Length – Measures the length of the URL; phishing URLs tend to be longer.

having_IP_Address – Checks if an IP address is used in place of a domain name, which is suspicious.
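As a rough illustration of how the URL-based checks above could be computed from a raw URL, here is a minimal sketch. It is not the dataset authors' extraction code: the function name `url_features`, the shortener list, and the boolean/integer outputs are assumptions made for this example, and the dataset's own numeric encoding of each feature is not reproduced.

```python
import ipaddress
from urllib.parse import urlsplit

# Illustrative, non-exhaustive list of shortening services (assumption).
SHORTENER_HOSTS = {"bit.ly", "tinyurl.com"}


def url_features(url):
    """Compute simple URL-based indicators, keyed by the dataset's feature names."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    try:
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False
    return {
        # Hyphen in the host name.
        "Prefix_Suffix": "-" in host,
        # A "//" that appears after the scheme separator ("http://").
        "double_slash_redirecting": url.rfind("//") > len(parts.scheme) + 1,
        "having_At_Symbol": "@" in url,
        "Shortining_Service": host in SHORTENER_HOSTS,
        "URL_Length": len(url),
        "having_IP_Address": host_is_ip,
    }


print(url_features("https://secure-pay.example.com/a//b"))
```

The thresholds that turn raw values such as `URL_Length` into the dataset's categorical codes are not given in the description, so the sketch stops at the raw indicators.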

Domain-Based Features

having_Sub_Domain – Evaluates the number of subdomains; phishing sites often have excessive subdomains.

SSLfinal_State – Indicates whether the website has a valid SSL certificate (secure connection).

Domain_registeration_length – Measures the duration of domain registration; phishing sites often have short lifespans.

age_of_domain – The age of the domain in days; older domains are usually more trustworthy.

DNSRecord – Checks if the domain has valid DNS records; phishing domains may lack these.

Webpage-Based Features

Favicon – Determines if the website uses an external favicon (which can be a sign of phishing).

port – Identifies if the site is using suspicious or non-standard ports.

HTTPS_token – Checks if "HTTPS" is included in the URL but is used deceptively.

Request_URL – Measures the percentage of external resources loaded from different domains.

URL_of_Anchor – Analyzes anchor tags (<a> links) and their trustworthiness.

Links_in_tags – Examines <meta>, <script>, and <link> tags for external links.

SFH (Server Form Handler) – Determines if form actions are handled suspiciously.

Submitting_to_email – Checks if forms submit data directly to an email instead of a web server.

Abnormal_URL – Identifies if the website’s URL structure is inconsistent with common patterns.

Redirect – Counts the number of redirects; phishing websites may have excessive redirects.

Behavior-Based Features

on_mouseover – Checks if the website changes content when hovered over (used in deceptive techniques).

RightClick – Detects if right-click functionality is disabled (phishing sites may disable it).

popUpWindow – Identifies the presence of pop-ups, which can be used to trick users.

Iframe – Checks if the website uses <iframe> tags, often used in phishing attacks.

Traffic & Search Engine Features

web_traffic – Measures the website’s Alexa ranking; phishing sites tend to have low traffic.

Page_Rank – Google PageRank score; phishing sites usually have a low PageRank.

Google_Index – Checks if the website is indexed by Google (phishing sites may not be indexed).

Links_pointing_to_page – Counts the number of backlinks pointing to the website.

Statistical_report – Uses external sources to verify if the website has been reported for phishing.

Target Variable

Result – The classification label (1: Legitimate, -1: Phishing)
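Because the label uses 1 and -1 rather than 0 and 1, it can help to decode it explicitly before checking class balance. The sketch below assumes the `Result` column has already been read into a list of values; the helper name `class_balance` and the sample labels are made up for illustration.

```python
from collections import Counter

# Encoding as stated in the dataset description.
LABEL_NAMES = {1: "legitimate", -1: "phishing"}


def class_balance(results):
    """Return the share of each class among the given Result values."""
    counts = Counter(LABEL_NAMES[int(value)] for value in results)
    total = sum(counts.values())
    return {name: counts[name] / total for name in LABEL_NAMES.values()}


# Made-up labels, not taken from the dataset.
balance = class_balance([1, -1, -1, 1, 1])
print(balance)
```

Any value other than 1 or -1 raises a `KeyError`, which is a deliberate choice here so that unexpected encodings surface early.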

Usage

This dataset is valuable for:
✅ Machine Learning Models – Developing classifiers for phishing detection.
✅ Cybersecurity Research – Understanding patterns in phishing attacks.
✅ Browser Security Extensions – Enhancing anti-phishing tools.
Website Fingerprinting Dataset of Browsing Network Traffic for Desktop and...
ieee-dataport.org
Updated Oct 21, 2024
Cite
Mohamad Amar Irsyad Mohd Aminuddin (2024). Website Fingerprinting Dataset of Browsing Network Traffic for Desktop and Mobile Webpages [Dataset]. https://ieee-dataport.org/documents/website-fingerprinting-dataset-browsing-network-traffic-desktop-and-mobile-webpages
Explore at:
Dataset updated
Oct 21, 2024
Authors
Mohamad Amar Irsyad Mohd Aminuddin
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
This is a dataset of Tor cell files extracted from browsing simulations using Tor Browser. The simulations cover both desktop and mobile webpages. The data was collected with the WFP-Collector tool (https://github.com/irsyadpage/WFP-Collector). All the necessary configuration to perform the simulation is detailed in the tool repository. The webpage URLs are the first 100 websites listed at https://dataforseo.com/free-seo-stats/top-1000-websites. Each webpage URL is visited 90 times in each of the desktop and mobile browsing modes.
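The collection design described here (100 websites, 90 visits each, in both desktop and mobile mode) implies a nominal number of visits, which the following sketch enumerates. The site names are placeholders, and whether every visit produced a usable cell file is not stated in the description, so the count is the planned total rather than a confirmed file count.

```python
from itertools import product

NUM_SITES = 100          # first 100 websites from the source list
VISITS_PER_MODE = 90     # visits per site in each browsing mode
MODES = ("desktop", "mobile")


def visit_plan(sites):
    """Enumerate every (site, mode, visit_index) combination."""
    return list(product(sites, MODES, range(VISITS_PER_MODE)))


# Placeholder site names (assumption); the real list comes from the source above.
sites = [f"site-{i:03d}.example" for i in range(NUM_SITES)]
plan = visit_plan(sites)
print(len(plan))  # 100 sites x 2 modes x 90 visits
```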
Product Page Dataset
paperswithcode.com
opendatalab.com
Updated Nov 19, 2021
Cite
Alexandra Hotti; Riccardo Sven Risuleo; Stefan Magureanu; Aref Moradi; Jens Lagergren (2021). Product Page Dataset [Dataset]. https://paperswithcode.com/dataset/product-page
Explore at:
Dataset updated
Nov 19, 2021
Authors
Alexandra Hotti; Riccardo Sven Risuleo; Stefan Magureanu; Aref Moradi; Jens Lagergren
Description
Product Page is a large-scale and realistic dataset of webpages. The dataset contains 51,701 manually labeled product pages from 8,175 real e-commerce websites. The pages can be rendered entirely in a web browser and are suitable for computer vision applications. This makes it substantially richer and more diverse than other datasets proposed for element representation learning, classification and prediction on the web.
Number of internet users worldwide 2014-2029
statista.com
Updated Apr 11, 2025
Cite
Statista Research Department (2025). Number of internet users worldwide 2014-2029 [Dataset]. https://www.statista.com/topics/1145/internet-usage-worldwide/
Explore at:
Dataset updated
Apr 11, 2025
Dataset provided by
Statista (http://statista.com/)
Authors
Statista Research Department
Area covered
World
Description
The global number of internet users was forecast to increase continuously between 2024 and 2029 by a total of 1.3 billion users (+23.66 percent). After the fifteenth consecutive year of growth, the number of users is estimated to reach 7 billion, a new peak, in 2029. Notably, the number of internet users has increased continuously over the past years. Depicted is the estimated number of individuals in the country or region at hand that use the internet. As the data source clarifies, connection quality and usage frequency are distinct aspects not taken into account here. The data shown are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information). Find more key insights for the number of internet users in regions like the Americas and Asia.
Adverse effects of using the Internet and social networking websites or apps...
open.canada.ca
www150.statcan.gc.ca
+2 more
csv, html, xml
Updated Jan 17, 2023
+ more versions
Cite
Statistics Canada (2023). Adverse effects of using the Internet and social networking websites or apps by gender and age group, inactive [Dataset]. https://open.canada.ca/data/en/dataset/80c88ac9-8ea1-4ff7-856e-560f7683d660
Explore at:
html, xml, csv (available download formats)
Dataset updated
Jan 17, 2023
Dataset provided by
Statistics Canada
License
Open Government Licence - Canada 2.0 (https://open.canada.ca/en/open-government-licence-canada)
License information was derived automatically
Description
Percentage of Internet users who have experienced selected personal effects in their life because of the Internet and the use of social networking websites or apps, during the past 12 months.
Attitudes towards the internet in Japan 2025
statista.com
Updated Apr 11, 2025
Cite
Umair Bashir (2025). Attitudes towards the internet in Japan 2025 [Dataset]. https://www.statista.com/topics/1145/internet-usage-worldwide/
Explore at:
Dataset updated
Apr 11, 2025
Dataset provided by
Statista (http://statista.com/)
Authors
Umair Bashir
Description
When asked about "Attitudes towards the internet", most Japanese respondents pick "I'm concerned that my data is being misused on the internet" as an answer. 35 percent did so in our online survey in 2025. Looking to gain valuable insights about users of internet providers worldwide? Check out our reports on consumers who use internet providers. These reports give readers a thorough picture of these customers, including their identities, preferences, opinions, and methods of communication.
Data from: A dataset on the evaluation of the accessibility of the home...
ieee-dataport.org
observatorio-cientifico.ua.es
+2 more
Updated Aug 26, 2021
+ more versions
Cite
Milton Campoverde Molina (2021). A dataset on the evaluation of the accessibility of the home pages of the web portals of Ecuadorian higher education institutions ranked in Webometrics [Dataset]. https://ieee-dataport.org/documents/dataset-evaluation-accessibility-home-pages-web-portals-ecuadorian-higher-education
Explore at:
Dataset updated
Aug 26, 2021
Authors
Milton Campoverde Molina
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
This research aims to evaluate the accessibility of the home pages of the web portals of the Ecuadorian higher education institutions ranked in Webometrics against the Web Content Accessibility Guidelines (WCAG) 2.1 of the World Wide Web Consortium.
Attitudes towards the internet in Mexico 2025
statista.com
Updated Apr 11, 2025
Cite
Umair Bashir (2025). Attitudes towards the internet in Mexico 2025 [Dataset]. https://www.statista.com/topics/1145/internet-usage-worldwide/
Explore at:
Dataset updated
Apr 11, 2025
Dataset provided by
Statista (http://statista.com/)
Authors
Umair Bashir
Description
When asked about "Attitudes towards the internet", most Mexican respondents pick "It is important to me to have mobile internet access in any place" as an answer. 56 percent did so in our online survey in 2025. Looking to gain valuable insights about users of internet providers worldwide? Check out our reports on consumers who use internet providers. These reports give readers a thorough picture of these customers, including their identities, preferences, opinions, and methods of communication.
Dataset V2 Dataset
universe.roboflow.com
zip
Updated Jan 2, 2023
Cite
UIbitz (2023). Dataset V2 Dataset [Dataset]. https://universe.roboflow.com/uibitz/dataset-v2
Explore at:
zip (available download formats)
Dataset updated
Jan 2, 2023
Dataset authored and provided by
UIbitz
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Variables measured
Ui Elements Bounding Boxes
Description
Here are a few use cases for this project:

Automated Web Design Analysis: By identifying various UI elements, dataset-v2 can help designers and developers analyze existing web designs for improvements or optimization, providing insights on the UI structure, accessibility, and user-friendliness.

Content Management System (CMS) Auto-tagging: Integrate dataset-v2 with a CMS to automatically scan and tag visual elements within web pages, simplifying asset management and organization for website developers and content creators.

Accessibility Compliance: Dataset-v2 can analyze websites to ensure proper UI elements usage, helping organizations adhere to accessibility guidelines and standards, such as the Web Content Accessibility Guidelines (WCAG).

Prototype Testing and Feedback: Dataset-v2 can help UX/UI designers evaluate prototypes by identifying UI components and their placement, offering objective feedback and highlighting areas for improvement in the design process.

Competitive Analysis and Web Scraping: Dataset-v2 can identify UI elements across multiple websites, empowering businesses to analyze competitor websites and extract valuable design patterns, best practices, and trends for UI/UX applications.
Phishing website dataset
data.niaid.nih.gov
Updated Jun 10, 2021
Cite
van Dooremaal, Bram (2021). Phishing website dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4922597
Explore at:
Dataset updated
Jun 10, 2021
Dataset provided by
van Dooremaal, Bram
Allodi, Luca
Burda, Pavlo
Zannone, Nicola
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
The dataset comprises phishing and legitimate web pages, which have been used for experiments on early phishing detection.

Detailed information on the dataset and data collection is available at

Bram van Dooremaal, Pavlo Burda, Luca Allodi, and Nicola Zannone. 2021. Combining Text and Visual Features to Improve the Identification of Cloned Webpages for Early Phishing Detection. In ARES '21: Proceedings of the 16th International Conference on Availability, Reliability and Security. ACM.
Webpages Dataset
universe.roboflow.com
zip
Updated Feb 6, 2025
+ more versions
Cite
workflowaugmentation (2025). Webpages Dataset [Dataset]. https://universe.roboflow.com/workflowaugmentation/webpages-abgy4
Explore at:
zip (available download formats)
Dataset updated
Feb 6, 2025
Dataset authored and provided by
workflowaugmentation
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Variables measured
Webpage Elements Bounding Boxes
Description
Webpages

## Overview

Webpages is a dataset for object detection tasks. It contains Webpage Elements annotations for 202 images.

## Getting Started

You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

## License

This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
CoVA Dataset
paperswithcode.com
Updated Oct 23, 2021
Cite
Anurendra Kumar; Keval Morabia; Jingjin Wang; Kevin Chen-Chuan Chang; Alexander Schwing (2021). CoVA Dataset [Dataset]. https://paperswithcode.com/dataset/cova
Explore at:
Dataset updated
Oct 23, 2021
Authors
Anurendra Kumar; Keval Morabia; Jingjin Wang; Kevin Chen-Chuan Chang; Alexander Schwing
Description
We labeled 7,740 webpage screenshots spanning 408 domains (Amazon, Walmart, Target, etc.). Each of these webpages contains exactly one labeled price, title, and image. All other web elements are labeled as background. On average, there are 90 web elements in a webpage.

Webpage screenshots and bounding boxes can be obtained here

Train-Val-Test split: We create a cross-domain split which ensures that each of the train, val and test sets contains webpages from different domains. Specifically, we construct a 3 : 1 : 1 split based on the number of distinct domains. We observed that the top-5 domains (based on number of samples) were Amazon, eBay, Walmart, Etsy, and Target. So, we created 5 different splits for 5-Fold Cross Validation such that each of the major domains is present in one of the 5 splits for test data.
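The core idea of a cross-domain split is that whole domains, not individual pages, are assigned to train, val, or test. The sketch below illustrates a single 3 : 1 : 1 split over distinct domains under that assumption. It is not the CoVA authors' code: the function name, the `(domain, item)` input shape, and the random assignment are hypothetical, and it does not reproduce the 5-fold arrangement that places each major domain in a test split.

```python
import random
from collections import defaultdict


def cross_domain_split(samples, ratios=(3, 1, 1), seed=0):
    """Assign whole domains to train/val/test so no domain spans two sets.

    samples: iterable of (domain, item) pairs.
    ratios:  relative number of distinct domains per set.
    """
    by_domain = defaultdict(list)
    for domain, item in samples:
        by_domain[domain].append(item)

    domains = sorted(by_domain)
    random.Random(seed).shuffle(domains)

    total = sum(ratios)
    n_train = len(domains) * ratios[0] // total
    n_val = len(domains) * ratios[1] // total
    groups = {
        "train": domains[:n_train],
        "val": domains[n_train:n_train + n_val],
        "test": domains[n_train + n_val:],
    }
    return {name: {d: by_domain[d] for d in ds} for name, ds in groups.items()}


# Toy input: 10 placeholder domains with 2 pages each.
toy = [(f"shop{i}.example", f"page-{i}-{j}") for i in range(10) for j in range(2)]
split = cross_domain_split(toy)
print({name: len(domains) for name, domains in split.items()})
```

Splitting on the domain count rather than the page count means the sets can differ widely in size when a few domains contribute most pages, which is presumably why the description treats the top-5 domains separately.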
Internet Advertisements Data Set
kaggle.com
Updated Sep 1, 2017
Cite
UCI Machine Learning (2017). Internet Advertisements Data Set [Dataset]. https://www.kaggle.com/uciml/internet-advertisements-data-set/kernels
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Sep 1, 2017
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
UCI Machine Learning
License
http://opendatacommons.org/licenses/dbcl/1.0/
Description
Context

The task is to predict whether an image is an advertisement ("ad") or not ("nonad").

Content

There are 1559 columns in the data. Each row represents one image, which is tagged as ad or nonad in the last column. Columns 0 to 1557 hold the numerical attributes of the images.
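Given that layout (1558 attribute columns followed by one label column), a row can be separated into features and a boolean label as sketched below. The helper name `split_row` is hypothetical. Treating `?` as a missing value and tolerating a trailing period on the label are defensive assumptions for this example, not details stated in the description.

```python
def split_row(row, n_columns=1559):
    """Split one row of string cells into (features, is_ad)."""
    if len(row) != n_columns:
        raise ValueError(f"expected {n_columns} columns, got {len(row)}")
    *raw_features, raw_label = row

    # Assumption: "?" marks a missing value and is mapped to None.
    features = [None if cell.strip() == "?" else float(cell) for cell in raw_features]

    # Assumption: the label may carry a trailing period.
    label = raw_label.strip().rstrip(".")
    if label not in ("ad", "nonad"):
        raise ValueError(f"unexpected label: {raw_label!r}")
    return features, label == "ad"


# Made-up row: 1558 attribute cells plus the label.
row = ["1"] * 1557 + ["?"] + ["nonad"]
features, is_ad = split_row(row)
print(len(features), is_ad)
```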

Acknowledgements

Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Here is a BiBTeX citation as well:

@misc{Lichman:2013,
  author = "M. Lichman",
  year = "2013",
  title = "{UCI} Machine Learning Repository",
  url = "http://archive.ics.uci.edu/ml",
  institution = "University of California, Irvine, School of Information and Computer Sciences"
}

https://archive.ics.uci.edu/ml/citation_policy.html
Most visited websites by hierachycal categories
kaggle.com
Updated Sep 18, 2020
Cite
Natanael de Souza Figueiredo (2020). Most visited websites by hierachycal categories [Dataset]. https://www.kaggle.com/natanael127/most-visited-websites-by-hierachycal-categories/code
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Sep 18, 2020
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Natanael de Souza Figueiredo
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

Alexa Internet was founded in April 1996 by Brewster Kahle and Bruce Gilliat. The company's name was chosen in homage to the Library of Alexandria of Ptolemaic Egypt, drawing a parallel between the largest repository of knowledge in the ancient world and the potential of the Internet to become a similar store of knowledge. (from Wikipedia)

The categories list was scheduled to go offline by September 17, 2020, so I wanted to save it. https://support.alexa.com/hc/en-us/articles/360051913314

This dataset was produced by this Python script (V2.0): https://github.com/natanael127/dump-alexa-ranking

Content

The sites are grouped into 17 macro categories, and the resulting tree has more than 360,000 nodes. Subjects are well organized, and each has its own ranking of most-accessed domains. So even the keys of a sub-dictionary may make a good small dataset.
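A category tree stored as nested dictionaries can be walked recursively to count nodes or to list each category's ranked domains. The sketch below assumes a layout in which each key is a category name and each value is either a sub-dictionary or a list of ranked domains. That layout, the toy tree, and the helper names are assumptions for illustration; the actual file structure is not specified in the description.

```python
def iter_leaves(tree, path=()):
    """Yield (category_path, ranked_domains) for every leaf list in the tree."""
    for name, child in tree.items():
        current = path + (name,)
        if isinstance(child, dict):
            yield from iter_leaves(child, current)
        else:
            yield "/".join(current), child


def count_nodes(tree):
    """Count every category node, whether internal or leaf."""
    return sum(
        1 + (count_nodes(child) if isinstance(child, dict) else 0)
        for child in tree.values()
    )


# Toy tree with placeholder categories and domains (assumed layout).
toy_tree = {
    "Arts": {
        "Music": ["a.example", "b.example"],
        "Movies": {"Reviews": ["c.example"]},
    },
    "Science": ["d.example"],
}

leaves = dict(iter_leaves(toy_tree))
print(count_nodes(toy_tree), sorted(leaves))
```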

Acknowledgements

Thanks to my friend André (https://github.com/andrerclaudio) for helping me with Google Colaboratory tips and the computational power to collect the data before our deadline.

Inspiration

The Alexa ranking was inspired by the Library of Alexandria. In the modern world, it may be a good starting point for AI to learn about many, many subjects.
MOIED: Magi Open Information Extraction Dataset
zenodo.org
explore.openaire.eu
+1 more
Updated Jul 22, 2024
+ more versions
Cite
Yichao Ji; Xinyang Liu; Kui Ma; Xuezhi Zhao; Qiao Sun (2024). MOIED: Magi Open Information Extraction Dataset [Dataset]. http://doi.org/10.5281/zenodo.3666039
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.3666039
Dataset updated
Jul 22, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Yichao Ji; Xinyang Liu; Kui Ma; Xuezhi Zhao; Qiao Sun
Description

Magi Open Information Extraction Dataset (MOIED) is a Chinese Open IE dataset containing 7,618,181 records extracted from plain text across 3,319,763 webpages in various domains. Each record in the dataset consists of the (subject, predicate, object) tuple, the associated confidence score, and the context information. The dataset comprises 1,427,742 distinct facts of 272,522 entities and 117,731 predicates.

A notable property of MOIED is that each distinct fact has multiple records with URLs referring to mentions in diverse contexts, which enables multiple-instance learning (MIL) and other correlative approaches.
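For multiple-instance learning, records that state the same fact are typically grouped into one bag, with each mention as an instance. The following sketch shows that grouping step only. The field names (`subject`, `predicate`, `object`, `confidence`, `url`) and the sample records are hypothetical, since the description lists the record contents but not the exact schema.

```python
from collections import defaultdict


def group_into_bags(records):
    """Group records by their (subject, predicate, object) fact."""
    bags = defaultdict(list)
    for record in records:
        fact = (record["subject"], record["predicate"], record["object"])
        bags[fact].append(record)
    return dict(bags)


# Made-up records with hypothetical field names.
records = [
    {"subject": "S1", "predicate": "P1", "object": "O1", "confidence": 0.9, "url": "https://a.example/1"},
    {"subject": "S1", "predicate": "P1", "object": "O1", "confidence": 0.7, "url": "https://b.example/2"},
    {"subject": "S2", "predicate": "P2", "object": "O2", "confidence": 0.8, "url": "https://c.example/3"},
]

bags = group_into_bags(records)
for fact, mentions in bags.items():
    print(fact, len(mentions))
```

Each bag then carries several contexts for one fact, which is the property the description highlights as enabling MIL.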

As a paragraph level Open IE dataset, at least 45.1% of the records in MOIED can only be extracted through synthesizing information from multiple sentences.

Magi is an extraction engine that continuously learns from the Internet, which combines cross-referencing, timeline analysis, and other heuristics to mitigate the inevitable false positives in the extractions. All records in MOIED were randomly sampled from a database dump of magi.com in January 2020. To provide more reliable evaluation results, human annotators examined the dataset and selected 19,161 verified records for the dev and test sets.

Disclaimers

The dataset is expected to be used in weakly supervised scenarios since the records in the training set are not human-annotated and could be imprecise or erroneous.

Records are not guaranteed to be universally correct. The correctness of extractions should be evaluated based on contexts (specified by the URLs).

The extraction was made at the time Magi visited the URL, so it is not guaranteed that the URL is still accessible or that the content has remained unmodified since the extraction.

Due to legal and regulatory issues, the webpage URLs are mostly ones accessible from Mainland China, yet, the content of certain webpages, as well as the extraction results, could be in violation of law and regulation of certain countries or regions in certain ways.
Curlie Dataset - Language-agnostic Website Embedding and Classification
figshare.com
application/gzip
Updated May 31, 2023
Cite
Sylvain Lugeon; Tiziano Piccardi (2023). Curlie Dataset - Language-agnostic Website Embedding and Classification [Dataset]. http://doi.org/10.6084/m9.figshare.19406693.v5
Explore at:
application/gzip (available download formats)
Unique identifier
https://doi.org/10.6084/m9.figshare.19406693.v5
Dataset updated
May 31, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Sylvain Lugeon; Tiziano Piccardi
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
**************** Full Curlie dataset ****************

Curlie.org is presented as the largest human-edited directory of the Web. It contains 3M+ multilingual webpages classified in a hierarchical taxonomy that is language-specific but regroups the same 14 top-level categories. Unfortunately, the Curlie administrators do not provide a downloadable archive of this valuable content. Therefore, we decided to release our own dataset, which results from an in-depth scraping of the Curlie website. This dataset contains webpage URLs along with the category path (label) where they are referenced in Curlie. For example, the International Ski Federation website (www.fis-ski.com) is referenced under the category path Sports/Winter/Sports/Skiing/Associations. The category path is language-specific, and we provide a mapping between English and other languages for alignment. The URLs have been filtered to contain only homepages (URLs with an empty path). Each distinct URL is indexed with a unique identifier (uid).

curlie.csv.gz > [url, uid, label, lang] x 2,275,150 samples
mapping.json.gz > [english_label, matchings] x 35,946 labels

**************** Processed Curlie dataset ****************

We provide here the ground data used to train Homepage2Vec. URLs have been further filtered: websites listed under the Regional top-category are dropped, as are non-accessible websites. This filtering yields 933,416 valid entries. The labels are aligned across languages and reduced to the 14 top-categories (classes). There are 885,582 distinct URLs, for which the associated classes are represented with a binary class vector (a URL can belong to multiple classes). We provide the HTML content for each distinct URL. We also provide a visual encoding, obtained by forwarding a screenshot of the homepage through a ResNet deep-learning model pretrained on ImageNet. Finally, we provide the training and testing sets for reproducibility.

curlie_filtered.csv.gz > [url, uid, label, lang] x 933,416 samples
class_vector.json.gz > [uid, class_vector] x 885,582 samples
html_content.json.gz > [uid, html] x 885,582 samples
visual_encoding.json.gz > [uid, visual_encoding] x 885,582 samples
class_names.txt > [class_name] x 14 classes
train_uid.txt > [uid] x 797,023 samples
test_uid.txt > [uid] x 88,559 samples

**************** Enriched Curlie dataset ****************

Thanks to Homepage2Vec, we release an enriched version of Curlie. For each distinct URL, we provide the class probability vector (14 classes) and the latent space embedding (100 dimensions).

outputs.json.gz > [uid, url, score, embedding] x 885,582 samples

**************** Pretrained Homepage2Vec ****************

h2v_1000_100.zip > Model pretrained on all features
h2v_1000_100_text_only.zip > Model pretrained only on textual features (no visual features from screenshots)

**************** Notes ****************

CSV files can be read with Python:

    import pandas as pd
    df = pd.read_csv("curlie.csv.gz", index_col=0)

JSON files have one record per line and can be read with Python:

    import json
    import gzip
    with gzip.open("html_content.json.gz", "rt", encoding="utf-8") as file:
        for line in file:
            data = json.loads(line)
            …
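The processed Curlie files represent each URL's classes as a binary vector over the 14 top-level categories, so a vector has to be paired with the class list to recover readable labels. Here is a minimal sketch of that decoding. The helper name and the placeholder class names are assumptions, and it further assumes that the vector positions follow the order of class_names.txt, which the description does not state explicitly.

```python
def decode_class_vector(class_vector, class_names):
    """Return the class names whose position in the binary vector is set to 1."""
    if len(class_vector) != len(class_names):
        raise ValueError("vector length must match the number of class names")
    return [name for name, flag in zip(class_names, class_vector) if flag == 1]


# Placeholder names (assumption); the real 14 names ship in class_names.txt.
class_names = [f"class_{i}" for i in range(14)]

# Made-up vector for a URL that belongs to two classes.
vector = [0] * 14
vector[2] = 1
vector[9] = 1

labels = decode_class_vector(vector, class_names)
print(labels)
```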
Korean language Web pages dataset
figshare.com
txt
Updated Jan 29, 2017
Cite
Lulwah Alkwai (2017). Korean language Web pages dataset [Dataset]. http://doi.org/10.6084/m9.figshare.4588735.v1
Explore at:
txt (available download formats)
Unique identifier
https://doi.org/10.6084/m9.figshare.4588735.v1
Dataset updated
Jan 29, 2017
Dataset provided by
Figshare (http://figshare.com/)
Authors
Lulwah Alkwai
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
This dataset contains 1,517 URIs with content determined to be in the Korean language. The URIs were collected from DMOZ. All 1,517 URIs were available on the live Web as of December 2015.

This data is used and further described in the journal article: Lulwah M. Alkwai, Michael L. Nelson, and Michele C. Weigle. 2017. Comparing the Archival Rate of Arabic, English, Danish, and Korean Language Web Pages. ACM Transactions on Information Systems (TOIS).

This work was an extension of the paper: Lulwah M. Alkwai, Michael L. Nelson, and Michele C. Weigle. 2015. How Well Are Arabic Websites Archived? In Proceedings of the 15th IEEE/ACM Joint Conference on Digital Libraries (JCDL). ACM.
Dataset of book subjects that contain XHTML in easy steps : web pages for...
workwithdata.com
Updated Nov 7, 2024
Work With Data (2024). Dataset of book subjects that contain XHTML in easy steps : web pages for the desktop and mobile internet [Dataset]. https://www.workwithdata.com/datasets/book-subjects?f=1&fcol0=j0-book&fop0=%3D&fval0=XHTML+in+easy+steps+:+web+pages+for+the+desktop+and+mobile+internet&j=1&j0=books
Dataset updated
Nov 7, 2024
Dataset authored and provided by
Work With Data
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
This dataset is about book subjects. It has 1 row and is filtered to the book "XHTML in easy steps : web pages for the desktop and mobile internet". It features 10 columns, including number of authors, number of books, earliest publication date, and latest publication date.


Top Visited Websites

A dataset of the top visited websites on the internet


71 scholarly articles cite this dataset (View in Google Scholar)

