https://creativecommons.org/publicdomain/zero/1.0/
This dataset consists of the top 50 most visited websites in the world, along with the category and principal country/territory for each site. The data provides insight into which sites are most popular globally and what type of content is most popular in different parts of the world.
This dataset can be used to track the most popular websites in the world over time. It can also be used to compare website popularity across countries and categories.
- To track the most popular websites in the world over time
- To see how website popularity changes by region
- To find out which website categories are most popular
Dataset by Alexa Internet, Inc. (2019), released on Kaggle under the Open Data Commons Public Domain Dedication and License (ODC-PDDL)
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: df_1.csv

| Column name                 | Description                                                          |
|:----------------------------|:---------------------------------------------------------------------|
| Site                        | The name of the website. (String)                                    |
| Domain Name                 | The domain name of the website. (String)                             |
| Category                    | The category of the website. (String)                                |
| Principal country/territory | The principal country/territory where the website is based. (String) |
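As a quick sketch, a file with these columns can be read with Python's standard csv module. The two sample rows below are invented for illustration and are not taken from the actual dataset:

```python
import csv
import io

# Stand-in for df_1.csv; the rows here are hypothetical examples only.
sample = io.StringIO(
    "Site,Domain Name,Category,Principal country/territory\n"
    "Example Site,example.com,Search Engines,United States\n"
    "Another Site,another.org,Social Networks,China\n"
)

rows = list(csv.DictReader(sample))

# Count sites per category, as one might when comparing categories.
by_category = {}
for row in rows:
    by_category[row["Category"]] = by_category.get(row["Category"], 0) + 1
```

Replacing the in-memory sample with `open("df_1.csv", newline="")` reads the real file the same way.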
Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
This Website Statistics dataset has four resources showing usage of the Lincolnshire Open Data website. Web analytics terms used in each resource are defined in their accompanying Metadata file.
Website Usage Statistics: This document shows a statistical summary of usage of the Lincolnshire Open Data site for the latest calendar year.
Website Statistics Summary: This dataset shows a website statistics summary for the Lincolnshire Open Data site for the latest calendar year.
Webpage Statistics: This dataset shows statistics for individual Webpages on the Lincolnshire Open Data site by calendar year.
Dataset Statistics: This dataset shows cumulative totals for Datasets on the Lincolnshire Open Data site that have also been published on the national Open Data site Data.Gov.UK - see the Source link.
Note: Website and Webpage statistics (the first three resources above) show only UK users, and exclude API calls (automated requests for datasets). The Dataset Statistics are confined to users with javascript enabled, which excludes web crawlers and API calls.
These Website Statistics resources are updated annually in January by the Lincolnshire County Council Business Intelligence team. For any enquiries about the information contact opendata@lincolnshire.gov.uk.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset is designed to aid in the analysis and detection of phishing websites. It contains various features that help distinguish between legitimate and phishing websites based on their structural, security, and behavioral attributes.
Features:
- Result – Indicates whether a website is phishing or legitimate.
- Prefix_Suffix – Checks if the URL contains a hyphen (-), which is commonly used in phishing domains.
- double_slash_redirecting – Detects if the URL redirects using //, which may indicate a phishing attempt.
- having_At_Symbol – Identifies the presence of @ in the URL, which can be used to deceive users.
- Shortining_Service – Indicates whether the URL uses a shortening service (e.g., bit.ly, tinyurl).
- URL_Length – Measures the length of the URL; phishing URLs tend to be longer.
- having_IP_Address – Checks if an IP address is used in place of a domain name, which is suspicious.
- having_Sub_Domain – Evaluates the number of subdomains; phishing sites often have excessive subdomains.
- SSLfinal_State – Indicates whether the website has a valid SSL certificate (secure connection).
- Domain_registeration_length – Measures the duration of domain registration; phishing sites often have short lifespans.
- age_of_domain – The age of the domain in days; older domains are usually more trustworthy.
- DNSRecord – Checks if the domain has valid DNS records; phishing domains may lack these.
- Favicon – Determines if the website uses an external favicon (which can be a sign of phishing).
- port – Identifies if the site is using suspicious or non-standard ports.
- HTTPS_token – Checks if "HTTPS" is included in the URL but is used deceptively.
- Request_URL – Measures the percentage of external resources loaded from different domains.
- URL_of_Anchor – Analyzes anchor (<a>) tags and their trustworthiness.
- Links_in_tags – Examines <meta>, <script>, and <link> tags for external links.
- SFH (Server Form Handler) – Determines if form actions are handled suspiciously.
- Submitting_to_email – Checks if forms submit data directly to an email instead of a web server.
- Abnormal_URL – Identifies if the website's URL structure is inconsistent with common patterns.
- Redirect – Counts the number of redirects; phishing websites may have excessive redirects.
- on_mouseover – Checks if the website changes content when hovered over (used in deceptive techniques).
- RightClick – Detects if right-click functionality is disabled (phishing sites may disable it).
- popUpWindow – Identifies the presence of pop-ups, which can be used to trick users.
- Iframe – Checks if the website uses <iframe> tags, often used in phishing attacks.
- web_traffic – Measures the website's Alexa ranking; phishing sites tend to have low traffic.
- Page_Rank – Google PageRank score; phishing sites usually have a low PageRank.
- Google_Index – Checks if the website is indexed by Google (phishing sites may not be indexed).
- Links_pointing_to_page – Counts the number of backlinks pointing to the website.
- Statistical_report – Uses external sources to verify if the website has been reported for phishing.
- Result – The classification label (1: Legitimate, -1: Phishing).

This dataset is valuable for:
- Machine Learning Models – Developing classifiers for phishing detection.
- Cybersecurity Research – Understanding patterns in phishing attacks.
- Browser Security Extensions – Enhancing anti-phishing tools.
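As a minimal illustration of how such features could feed a detector, here is a naive baseline (not a trained model): it flags a page when most of its feature values look suspicious. It assumes the common coding for this dataset family, where each feature takes a value in {-1, 0, 1} with -1 meaning phishing-like; the feature vectors below are hypothetical examples, not real rows:

```python
def naive_classify(features):
    """Return 1 (legitimate) if benign-looking features outweigh suspicious
    ones, else -1 (phishing). A baseline heuristic, not a trained classifier."""
    return 1 if sum(features) > 0 else -1

# Hypothetical feature vectors, coded as 1 (benign), 0 (suspicious), -1 (phishing-like).
legit_example = [1, 1, 1, -1, 1, 0, 1]
phish_example = [-1, -1, 0, -1, 1, -1, -1]
```

In practice one would train a real classifier (e.g., a tree ensemble) on the full feature matrix against the Result label rather than rely on a fixed threshold.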
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a dataset of Tor cell files extracted from browsing simulations using Tor Browser. The simulations cover both desktop and mobile webpages. Data collection used the WFP-Collector tool (https://github.com/irsyadpage/WFP-Collector); all the necessary configuration to perform the simulation is detailed in the tool repository. The webpage URLs were selected from the first 100 websites listed at https://dataforseo.com/free-seo-stats/top-1000-websites. Each webpage URL was visited 90 times in each of the desktop and mobile browsing modes.
Product Page is a large-scale and realistic dataset of webpages. The dataset contains 51,701 manually labeled product pages from 8,175 real e-commerce websites. The pages can be rendered entirely in a web browser and are suitable for computer vision applications. This makes it substantially richer and more diverse than other datasets proposed for element representation learning, classification and prediction on the web.
The global number of internet users was forecast to increase continuously between 2024 and 2029 by a total of 1.3 billion users (+23.66 percent). After a fifteenth consecutive year of growth, the number of users is estimated to reach 7 billion, a new peak, in 2029. Notably, the number of internet users has increased continuously over the past years. Depicted is the estimated number of individuals in the country or region at hand that use the internet. As the data source clarifies, connection quality and usage frequency are distinct aspects not taken into account here. The data shown are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press, and are processed to generate comparable datasets (see supplementary notes under details for more information). Find more key insights for the number of internet users in regions like the Americas and Asia.
Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
Percentage of Internet users who have experienced selected personal effects in their life because of the Internet and the use of social networking websites or apps, during the past 12 months.
When asked about "Attitudes towards the internet", most Japanese respondents pick "I'm concerned that my data is being misused on the internet" as an answer. 35 percent did so in our online survey in 2025. Looking to gain valuable insights about users of internet providers worldwide? Check out our reports on consumers who use internet providers. These reports give readers a thorough picture of these customers, including their identities, preferences, opinions, and methods of communication.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This research aims to evaluate the accessibility of the home pages of the web portals of the Ecuadorian higher education institutions ranked in Webometrics against the Web Content Accessibility Guidelines (WCAG) 2.1 of the World Wide Web Consortium.
When asked about "Attitudes towards the internet", most Mexican respondents pick "It is important to me to have mobile internet access in any place" as an answer. 56 percent did so in our online survey in 2025. Looking to gain valuable insights about users of internet providers worldwide? Check out our reports on consumers who use internet providers. These reports give readers a thorough picture of these customers, including their identities, preferences, opinions, and methods of communication.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here are a few use cases for this project:
Automated Web Design Analysis: By identifying various UI elements, dataset-v2 can help designers and developers analyze existing web designs for improvements or optimization, providing insights on the UI structure, accessibility, and user-friendliness.
Content Management System (CMS) Auto-tagging: Integrate dataset-v2 with a CMS to automatically scan and tag visual elements within web pages, simplifying asset management and organization for website developers and content creators.
Accessibility Compliance: Dataset-v2 can analyze websites to ensure proper UI elements usage, helping organizations adhere to accessibility guidelines and standards, such as the Web Content Accessibility Guidelines (WCAG).
Prototype Testing and Feedback: Dataset-v2 can help UX/UI designers evaluate prototypes by identifying UI components and their placement, offering objective feedback and highlighting areas for improvement in the design process.
Competitive Analysis and Web Scraping: Dataset-v2 can identify UI elements across multiple websites, empowering businesses to analyze competitor websites and extract valuable design patterns, best practices, and trends for UI/UX applications.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset comprises phishing and legitimate web pages, which have been used for experiments on early phishing detection.
Detailed information on the dataset and data collection is available at
Bram van Dooremaal, Pavlo Burda, Luca Allodi, and Nicola Zannone. 2021. Combining Text and Visual Features to Improve the Identification of Cloned Webpages for Early Phishing Detection. In ARES '21: Proceedings of the 16th International Conference on Availability, Reliability and Security. ACM.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Webpages is a dataset for object detection tasks - it contains Webpage Elements annotations for 202 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
We labeled 7,740 webpage screenshots spanning 408 domains (Amazon, Walmart, Target, etc.). Each of these webpages contains exactly one labeled price, title, and image. All other web elements are labeled as background. On average, there are 90 web elements in a webpage.
Webpage screenshots and bounding boxes can be obtained here
Train-Val-Test split: We create a cross-domain split which ensures that the train, val and test sets contain webpages from different domains. Specifically, we construct a 3:1:1 split based on the number of distinct domains. We observed that the top-5 domains (by number of samples) were Amazon, eBay, Walmart, Etsy, and Target, so we created 5 different splits for 5-fold cross-validation such that each of the major domains appears in the test data of exactly one of the 5 splits.
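The cross-domain split described above can be sketched in plain Python: assign each domain to exactly one fold, so that train and test never share a domain. The sample URLs, domain names, and fold count below are placeholders, not the dataset's actual contents:

```python
import random

# Hypothetical (url, domain) samples standing in for the labeled webpages.
samples = [
    ("amazon.com/p1", "amazon.com"), ("amazon.com/p2", "amazon.com"),
    ("ebay.com/p1", "ebay.com"), ("walmart.com/p1", "walmart.com"),
    ("etsy.com/p1", "etsy.com"), ("target.com/p1", "target.com"),
]

def cross_domain_folds(samples, n_folds=5, seed=0):
    """Partition samples into folds so that all samples from one domain
    land in the same fold (i.e., train and test never share a domain)."""
    domains = sorted({d for _, d in samples})
    rng = random.Random(seed)
    rng.shuffle(domains)
    fold_of = {d: i % n_folds for i, d in enumerate(domains)}
    folds = [[] for _ in range(n_folds)]
    for url, d in samples:
        folds[fold_of[d]].append(url)
    return folds

folds = cross_domain_folds(samples)
```

scikit-learn's `GroupKFold` implements the same idea with the domain as the group key.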
http://opendatacommons.org/licenses/dbcl/1.0/
The task is to predict whether an image is an advertisement ("ad") or not ("nonad").
There are 1,559 columns in the data. Each row represents one image, which is tagged as "ad" or "nonad" in the last column. Columns 0 to 1557 contain the numerical attributes of the images.
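To make the row layout concrete, a minimal sketch of splitting one row into its feature columns and label; the row here is a synthetic stand-in, not real data:

```python
# 1,558 numeric attribute columns (indices 0-1557) followed by the label column.
row = [0.0] * 1558 + ["ad"]  # synthetic example row

features, label = row[:-1], row[-1]
```

A classifier would be trained on the `features` arrays with `label` as the target.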
Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
Here is a BiBTeX citation as well:
@misc{Lichman:2013,
  author      = "M. Lichman",
  year        = "2013",
  title       = "{UCI} Machine Learning Repository",
  url         = "http://archive.ics.uci.edu/ml",
  institution = "University of California, Irvine, School of Information and Computer Sciences"
}

https://archive.ics.uci.edu/ml/citation_policy.html
https://creativecommons.org/publicdomain/zero/1.0/
Alexa Internet was founded in April 1996 by Brewster Kahle and Bruce Gilliat. The company's name was chosen in homage to the Library of Alexandria of Ptolemaic Egypt, drawing a parallel between the largest repository of knowledge in the ancient world and the potential of the Internet to become a similar store of knowledge. (from Wikipedia)
The categories list was being retired by September 17th, 2020, so I wanted to preserve it. https://support.alexa.com/hc/en-us/articles/360051913314
This dataset was generated by this Python script (v2.0): https://github.com/natanael127/dump-alexa-ranking
The sites are grouped into 17 macro categories, and the category tree ends up with more than 360,000 nodes. Subjects are well organized, and each has its own ranking of most-accessed domains, so even the keys of a sub-dictionary may make a good small dataset on their own.
Thanks to my friend André (https://github.com/andrerclaudio) for helping me with tips on Google Colaboratory and the computational power to get the data before our deadline.
The Alexa ranking was inspired by the Library of Alexandria. In the modern world, it may be a good starting point for AI to learn about many, many subjects of the world.
Description
Magi Open Information Extraction Dataset (MOIED) is a Chinese Open IE dataset containing 7,618,181 records extracted from plain text across 3,319,763 webpages in various domains. Each record in the dataset consists of the (subject, predicate, object) tuple, the associated confidence score, and the context information. The dataset comprises 1,427,742 distinct facts of 272,522 entities and 117,731 predicates.
A notable property of MOIED is that each distinct fact has multiple records with URLs referring to mentions in diverse contexts, which enables multiple-instance learning (MIL) and other correlative approaches.
As a paragraph-level Open IE dataset, at least 45.1% of the records in MOIED can only be extracted by synthesizing information from multiple sentences.
Magi is an extraction engine that continuously learns from the Internet, which combines cross-referencing, timeline analysis, and other heuristics to mitigate the inevitable false positives in the extractions. All records in MOIED were randomly sampled from a database dump of magi.com in January 2020. To provide more reliable evaluation results, human annotators examined the dataset and selected 19,161 verified records for the dev and test sets.
Disclaimers
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
**************** Full Curlie dataset ****************
Curlie.org is presented as the largest human-edited directory of the Web. It contains 3M+ multilingual webpages classified in a hierarchical taxonomy that is language-specific but regroups the same 14 top-level categories. Unfortunately, the Curlie administrators do not provide a downloadable archive of this valuable content. Therefore, we decided to release our own dataset, resulting from an in-depth scraping of the Curlie website. This dataset contains webpage URLs alongside the category path (label) where they are referenced in Curlie. For example, the International Ski Federation website (www.fis-ski.com) is referenced under the category path Sports/Winter/Sports/Skiing/Associations. The category path is language-specific, and we provide a mapping between English and other languages for alignment. The URLs have been filtered to only contain homepages (URLs with an empty path). Each distinct URL is indexed with a unique identifier (uid).

curlie.csv.gz > [url, uid, label, lang] x 2,275,150 samples
mapping.json.gz > [english_label, matchings] x 35,946 labels

**************** Processed Curlie dataset ****************
We provide here the ground data used to train Homepage2Vec. URLs have been further filtered: websites listed under the Regional top-category are dropped, as well as non-accessible websites. This filtering yields 933,416 valid entries. The labels are aligned across languages and reduced to the 14 top-categories (classes). There are 885,582 distinct URLs, for which the associated classes are represented with a binary class vector (a URL can belong to multiple classes). We provide the HTML content for each distinct URL. We also provide a visual encoding, obtained by forwarding a screenshot of the homepage through a ResNet deep-learning model pretrained on ImageNet. Finally, we provide the training and testing sets for reproducibility.

curlie_filtered.csv.gz > [url, uid, label, lang] x 933,416 samples
class_vector.json.gz > [uid, class_vector] x 885,582 samples
html_content.json.gz > [uid, html] x 885,582 samples
visual_encoding.json.gz > [uid, visual_encoding] x 885,582 samples
class_names.txt > [class_name] x 14 classes
train_uid.txt > [uid] x 797,023 samples
test_uid.txt > [uid] x 88,559 samples

**************** Enriched Curlie dataset ****************
Thanks to Homepage2Vec, we release an enriched version of Curlie. For each distinct URL, we provide the class probability vector (14 classes) and the latent space embedding (100 dimensions).

outputs.json.gz > [uid, url, score, embedding] x 885,582 samples

**************** Pretrained Homepage2Vec ****************
h2v_1000_100.zip > Model pretrained on all features
h2v_1000_100_text_only.zip > Model pretrained only on textual features (no visual features from screenshots)

**************** Notes ****************
The CSV file can be read with Python:

import pandas as pd
df = pd.read_csv("curlie.csv.gz", index_col=0)

The JSON files have one record per line and can be read with Python:

import json
import gzip

with gzip.open("html_content.json.gz", "rt", encoding="utf-8") as file:
    for line in file:
        data = json.loads(line)
        ...
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains 1,517 URIs with content determined to be in the Korean language. The URIs were collected from DMOZ. All 1,517 URIs were available on the live Web as of December 2015.
This data is used and further described in the journal article: Lulwah M. Alkwai, Michael L. Nelson, and Michele C. Weigle. 2017. Comparing the Archival Rate of Arabic, English, Danish, and Korean Language Web Pages. ACM Transactions on Information Systems (TOIS).
This work was an extension of the paper: Lulwah M. Alkwai, Michael L. Nelson, and Michele C. Weigle. 2015. How Well Are Arabic Websites Archived?. In Proceedings of the 15th IEEE/ACM Joint Conference on Digital Libraries (JCDL). ACM
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about book subjects. It has 1 row and is filtered to the book "XHTML in Easy Steps: Web Pages for the Desktop and Mobile Internet". It features 10 columns including number of authors, number of books, earliest publication date, and latest publication date.