https://creativecommons.org/publicdomain/zero/1.0/
This dataset consists of the top 50 most visited websites in the world, as well as the category and principal country/territory for each site. The data provides insights into which sites are most popular globally and what type of content is most popular in different parts of the world.
This dataset can be used to track the most popular websites in the world over time. It can also be used to compare website popularity between different countries and categories.
- To track the most popular websites in the world over time
- To see how website popularity changes by region
- To find out which website categories are most popular
Dataset by Alexa Internet, Inc. (2019), released on Kaggle under the Open Data Commons Public Domain Dedication and License (ODC-PDDL)
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: df_1.csv

| Column name | Description |
|:-----------------------------|:----------------------------------------------------------------------|
| Site | The name of the website. (String) |
| Domain Name | The domain name of the website. (String) |
| Category | The category of the website. (String) |
| Principal country/territory | The principal country/territory where the website is based. (String) |
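A minimal sketch of loading and summarizing the table, assuming df_1.csv is available locally with the columns listed above:

```python
import pandas as pd

# Load the top-50 websites table described above.
df = pd.read_csv("df_1.csv")

# Most common categories among the top 50 sites.
print(df["Category"].value_counts())

# Number of top-50 sites per principal country/territory.
print(df["Principal country/territory"].value_counts())
```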
We'll extract any data from any website on the Internet. You don't have to worry about buying and maintaining complex and expensive software, or hiring developers.
Some common use cases our customers use the data for: • Data Analysis • Market Research • Price Monitoring • Sales Leads • Competitor Analysis • Recruitment
We can get data from websites with pagination or scroll, with captchas, and even from behind logins. Text, images, videos, documents.
Receive data in any format you need: Excel, CSV, JSON, or any other.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset is designed to aid in the analysis and detection of phishing websites. It contains various features that help distinguish between legitimate and phishing websites based on their structural, security, and behavioral attributes.
- Result (Indicates whether a website is phishing or legitimate)
- Prefix_Suffix – Checks if the URL contains a hyphen (-), which is commonly used in phishing domains.
- double_slash_redirecting – Detects if the URL redirects using //, which may indicate a phishing attempt.
- having_At_Symbol – Identifies the presence of @ in the URL, which can be used to deceive users.
- Shortining_Service – Indicates whether the URL uses a shortening service (e.g., bit.ly, tinyurl).
- URL_Length – Measures the length of the URL; phishing URLs tend to be longer.
- having_IP_Address – Checks if an IP address is used in place of a domain name, which is suspicious.
- having_Sub_Domain – Evaluates the number of subdomains; phishing sites often have excessive subdomains.
- SSLfinal_State – Indicates whether the website has a valid SSL certificate (secure connection).
- Domain_registeration_length – Measures the duration of domain registration; phishing sites often have short lifespans.
- age_of_domain – The age of the domain in days; older domains are usually more trustworthy.
- DNSRecord – Checks if the domain has valid DNS records; phishing domains may lack these.
- Favicon – Determines if the website uses an external favicon (which can be a sign of phishing).
- port – Identifies if the site is using suspicious or non-standard ports.
- HTTPS_token – Checks if "HTTPS" is included in the URL but is used deceptively.
- Request_URL – Measures the percentage of external resources loaded from different domains.
- URL_of_Anchor – Analyzes anchor tags (<a> links) and their trustworthiness.
- Links_in_tags – Examines <meta>, <script>, and <link> tags for external links.
- SFH (Server Form Handler) – Determines if form actions are handled suspiciously.
- Submitting_to_email – Checks if forms submit data directly to an email instead of a web server.
- Abnormal_URL – Identifies if the website's URL structure is inconsistent with common patterns.
- Redirect – Counts the number of redirects; phishing websites may have excessive redirects.
- on_mouseover – Checks if the website changes content when hovered over (used in deceptive techniques).
- RightClick – Detects if right-click functionality is disabled (phishing sites may disable it).
- popUpWindow – Identifies the presence of pop-ups, which can be used to trick users.
- Iframe – Checks if the website uses <iframe> tags, often used in phishing attacks.
- web_traffic – Measures the website's Alexa ranking; phishing sites tend to have low traffic.
- Page_Rank – Google PageRank score; phishing sites usually have a low PageRank.
- Google_Index – Checks if the website is indexed by Google (phishing sites may not be indexed).
- Links_pointing_to_page – Counts the number of backlinks pointing to the website.
- Statistical_report – Uses external sources to verify if the website has been reported for phishing.
- Result – The classification label (1: Legitimate, -1: Phishing)

This dataset is valuable for:
✅ Machine Learning Models – Developing classifiers for phishing detection.
✅ Cybersecurity Research – Understanding patterns in phishing attacks.
✅ Browser Security Extensions – Enhancing anti-phishing tools.
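As a minimal sketch of how these features could feed a phishing classifier, the snippet below trains a random-forest model on the table. The file name (phishing.csv) is an assumption; the feature columns and the Result label follow the list above.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumed file name; the actual dataset file may differ.
data = pd.read_csv("phishing.csv")

X = data.drop(columns=["Result"])   # feature columns described above
y = data["Result"]                  # 1 = legitimate, -1 = phishing

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```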
The global number of internet users was forecast to increase continuously between 2024 and 2029 by a total of 1.3 billion users (+23.66 percent). After the fifteenth consecutive year of growth, the number of users is estimated to reach 7 billion, a new peak, in 2029. Notably, the number of internet users has been increasing continuously over the past years. Depicted is the estimated number of individuals in the country or region at hand who use the internet. As the data source clarifies, connection quality and usage frequency are distinct aspects not taken into account here. The shown data are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information). Find more key insights for the number of internet users in countries like the Americas and Asia.
https://creativecommons.org/publicdomain/zero/1.0/
Alexa Internet was founded in April 1996 by Brewster Kahle and Bruce Gilliat. The company's name was chosen in homage to the Library of Alexandria of Ptolemaic Egypt, drawing a parallel between the largest repository of knowledge in the ancient world and the potential of the Internet to become a similar store of knowledge. (from Wikipedia)
The categories list was being retired by September 17, 2020, so I wanted to save it. https://support.alexa.com/hc/en-us/articles/360051913314
This dataset was generated by this Python script (V2.0): https://github.com/natanael127/dump-alexa-ranking
The sites are grouped into 17 macro categories, and the resulting tree ends up having more than 360,000 nodes. Subjects are well organized, and each of them has its own ranking of most-accessed domains, so even the keys of a sub-dictionary can make a good small dataset to use (see the sketch below).
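A minimal sketch of exploring such a dump, assuming it is stored as a nested JSON dictionary of categories; the file name and exact structure below are assumptions, since the real layout depends on the dump-alexa-ranking script's output:

```python
import json

# Assumed file name; the exact structure depends on the dump-alexa-ranking output.
with open("alexa_categories.json") as f:
    tree = json.load(f)

def count_nodes(node):
    """Recursively count category keys in the nested dictionary."""
    if not isinstance(node, dict):
        return 0
    return len(node) + sum(count_nodes(child) for child in node.values())

print("Macro categories:", list(tree.keys()))
print("Total category nodes:", count_nodes(tree))
```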
Thanks to my friend André (https://github.com/andrerclaudio) for helping me with Google Colaboratory tips and computational power to get the data before our deadline.
The Alexa ranking was inspired by the Library of Alexandria. In the modern world, it may be a good starting point for AI to learn about many, many subjects of the world.
When asked about "Attitudes towards the internet", most Mexican respondents pick "It is important to me to have mobile internet access in any place at any time" as an answer. 55 percent did so in our online survey in 2024.
https://creativecommons.org/publicdomain/zero/1.0/
This is prepared data from Crunchyroll web-scraped data; using the code linked here, I extracted metadata from Crunchyroll web pages.
Each row represents a series on the Popular page. Note: some information is not updated (I suspect Crunchyroll does not update its Popular table in the database).
It also has similar features to popular.csv, but with updated data points.
Each row represents a season of its corresponding series.
Information about individual episodes of their corresponding series.
Some series have a featured music collection.
A mapping of each episode to its dubbed audio versions.
A mapping of each series to its categories, as defined by Crunchyroll.
When asked about "Attitudes towards the internet", most Chinese respondents pick "It is important to me to have mobile internet access in any place at any time" as an answer. 49 percent did so in our online survey in 2024.
Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
Percentage of Internet users who have experienced selected personal effects in their life because of the Internet and the use of social networking websites or apps, during the past 12 months.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for Mind2Web
Dataset Summary
Mind2Web is a dataset for developing and evaluating generalist agents for the web that can follow language instructions to complete complex tasks on any website. Existing datasets for web agents either use simulated websites or only cover a limited set of websites and tasks, thus not suitable for generalist web agents. With over 2,000 open-ended tasks collected from 137 websites spanning 31 domains and crowdsourced action… See the full description on the dataset page: https://huggingface.co/datasets/osunlp/Mind2Web.
The data represent web-scraping of hyperlinks from a selection of environmental stewardship organizations that were identified in the 2017 NYC Stewardship Mapping and Assessment Project (STEW-MAP) (USDA 2017). There are two data sets: 1) the original scrape containing all hyperlinks within the websites and associated attribute values (see "README" file); 2) a cleaned and reduced dataset formatted for network analysis.

For dataset 1: Organizations were selected from the 2017 NYC Stewardship Mapping and Assessment Project (STEW-MAP) (USDA 2017), a publicly available, spatial data set about environmental stewardship organizations working in New York City, USA (N = 719). To create a smaller and more manageable sample to analyze, all organizations that intersected (i.e., worked entirely within or overlapped) the NYC borough of Staten Island were selected for a geographically bounded sample. Only organizations with working websites and that the web scraper could access were retained for the study (n = 78). The websites were scraped between 09 and 17 June 2020 to a maximum search depth of ten using the snaWeb package (version 1.0.1, Stockton 2020) in the R computational language environment (R Core Team 2020).

For dataset 2: The complete scrape results were cleaned, reduced, and formatted as a standard edge-array (node1, node2, edge attribute) for network analysis. See "README" file for further details.

References: R Core Team. (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/. Version 4.0.3. Stockton, T. (2020). snaWeb Package: An R package for finding and building social networks for a website, version 1.0.1. USDA Forest Service. (2017). Stewardship Mapping and Assessment Project (STEW-MAP). New York City Data Set. Available online at https://www.nrs.fs.fed.us/STEW-MAP/data/.

This dataset is associated with the following publication: Sayles, J., R. Furey, and M. Ten Brink. How deep to dig: effects of web-scraping search depth on hyperlink network analysis of environmental stewardship organizations. Applied Network Science. Springer Nature, New York, NY, 7: 36, (2022).
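As a minimal sketch of how the cleaned edge array (node1, node2, edge attribute) could be loaded for network analysis, the snippet below builds a directed graph with networkx; the file and column names are assumptions:

```python
import networkx as nx
import pandas as pd

# Assumed file and column names for the cleaned edge array (node1, node2, edge attribute).
edges = pd.read_csv("stewmap_edges.csv")

# Build a directed hyperlink network: node1 links to node2.
G = nx.from_pandas_edgelist(edges, source="node1", target="node2",
                            edge_attr=True, create_using=nx.DiGraph)

print("Organizations (nodes):", G.number_of_nodes())
print("Hyperlinks (edges):", G.number_of_edges())

# Organizations that receive the most hyperlinks from the others.
top = sorted(nx.in_degree_centrality(G).items(), key=lambda kv: kv[1], reverse=True)[:5]
print(top)
```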
Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
Percentage of Canadians who have experienced selected personal effects in their life because of the Internet and the use of social networking websites or apps, during the past 12 months.
When asked about "Attitudes towards the internet", most Japanese respondents pick "I could no longer imagine my everyday life without the internet" as an answer. 56 percent did so in our online survey in 2024.
Netlas.io is a set of internet intelligence apps that provide accurate technical information on IP addresses, domain names, websites, web applications, IoT devices, and other online assets.
Netlas.io scans every IPv4 address and every known domain name using protocols such as HTTP, FTP, SMTP, POP3, IMAP, SMB/CIFS, SSH, Telnet, SQL and others. Collected data is enriched with additional information and made available in the Netlas.io Search Engine. Some parts of the Netlas.io database are available as downloadable datasets.
Netlas.io accumulates domain names to make internet scan coverage as wide as possible. Domain names are collected from ICANN Centralized Zone Data Service, SSL Certificates, 301 & 302 HTTP redirects (while scanning) and other sources.
This dataset contains domains and subdomains (all gTLD and ccTLD), that have at least one associated DNS registry entry (A, MX, NS, CNAME and TXT records).
The WebUI dataset contains 400K web UIs captured over a period of 3 months and cost about $500 to crawl. We grouped web pages together by their domain name, then generated training (70%), validation (10%), and testing (20%) splits. This ensured that similar pages from the same website must appear in the same split. We created four versions of the training dataset. Three of these splits were generated by randomly sampling a subset of the training split: Web-7k, Web-70k, Web-350k. We chose 70k as a baseline size, since it is approximately the size of existing UI datasets. We also generated an additional split (Web-7k-Resampled) to provide a small, higher quality split for experimentation. Web-7k-Resampled was generated using a class-balancing sampling technique, and we removed screens with possible visual defects (e.g., very small, occluded, or invisible elements). The validation and test split was always kept the same.
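A minimal sketch of the grouping logic described above, in which whole domains (rather than individual pages) are assigned to splits so that pages from the same website never cross split boundaries; this is an illustrative approximation, not the authors' actual pipeline:

```python
import random
from collections import defaultdict
from urllib.parse import urlparse

def split_by_domain(urls, train=0.7, val=0.1, seed=0):
    """Assign whole domains to train/val/test so that pages from the
    same website never cross split boundaries (an approximation of the
    splitting strategy described above)."""
    # Group page URLs by domain (assumes full URLs with a scheme, e.g. https://...).
    by_domain = defaultdict(list)
    for url in urls:
        by_domain[urlparse(url).netloc].append(url)

    domains = list(by_domain)
    random.Random(seed).shuffle(domains)

    n_train = int(len(domains) * train)
    n_val = int(len(domains) * val)
    groups = {
        "train": domains[:n_train],
        "val": domains[n_train:n_train + n_val],
        "test": domains[n_train + n_val:],
    }
    return {name: [u for d in ds for u in by_domain[d]] for name, ds in groups.items()}

# Example usage with a few dummy URLs.
splits = split_by_domain([
    "https://example.com/a", "https://example.com/b",
    "https://another.org/x", "https://third.net/y",
])
print({k: len(v) for k, v in splits.items()})
```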
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here are a few use cases for this project:
Web Accessibility Improvement: The "Web Page Object Detection" model can be used to identify and label various elements on a web page, making it easier for people with visual impairments to navigate and interact with websites using screen readers and other assistive technologies.
Web Design Analysis: The model can be employed to analyze the structure and layout of popular websites, helping web designers understand best practices and trends in web design. This information can inform the creation of new, user-friendly websites or redesigns of existing pages.
Automatic Web Page Summary Generation: By identifying and extracting key elements, such as titles, headings, content blocks, and lists, the model can assist in generating concise summaries of web pages, which can aid users in their search for relevant information.
Web Page Conversion and Optimization: The model can be used to detect redundant or unnecessary elements on a web page and suggest their removal or modification, leading to cleaner designs and faster-loading pages. This can improve user experience and, potentially, search engine rankings.
Assisting Web Developers in Debugging and Testing: By detecting web page elements, the model can help identify inconsistencies or errors in a site's code or design, such as missing or misaligned elements, allowing developers to quickly diagnose and address these issues.
When asked about "Attitudes towards the internet", most Australian respondents pick "It is important to me to have mobile internet access in any place at any time" as an answer. 53 percent did so in our online survey in 2024.
The population share with mobile internet access in North America was forecast to increase by a total of 2.9 percentage points between 2024 and 2029. This overall increase does not happen continuously, notably not in 2028 and 2029. Mobile internet penetration is estimated to reach 84.21 percent in 2029. Notably, the population share with mobile internet access has been increasing continuously over the past years. The penetration rate refers to the share of the total population having access to the internet via a mobile broadband connection. The shown data are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information). Find more key insights for the population share with mobile internet access in countries like the Caribbean and Europe.
The global number of smartphone users was forecast to increase continuously between 2024 and 2029 by a total of 1.8 billion users (+42.62 percent). After the ninth consecutive year of growth, the smartphone user base is estimated to reach 6.1 billion users, a new peak, in 2029. Notably, the number of smartphone users has been increasing continuously over the past years. Smartphone users here are limited to internet users of any age using a smartphone. The shown figures have been derived from survey data that has been processed to estimate missing demographics. The shown data are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information). Find more key insights for the number of smartphone users in countries like Australia & Oceania and Asia.
http://opendatacommons.org/licenses/dbcl/1.0/
This dataset came from a desire to stretch my web scraping skills, as well as to train an LSTM network to maybe compose some lyrics. I detailed how I obtained the data here: Scraping lyrics from Vagalume.
All the data were obtained by scraping the Brazilian website Vagalume using R.
There are two datasets, artists-data.csv and lyrics-data.csv. Originally they had data on only 6 musical genres, but in the last update I scraped all lyrics from the website.
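A minimal sketch of loading the two files and joining them, assuming a shared artist-link column (here called "ALink"); the actual column names may differ:

```python
import pandas as pd

artists = pd.read_csv("artists-data.csv")
lyrics = pd.read_csv("lyrics-data.csv")

# Hypothetical join key: a shared artist-link column, here called "ALink".
songs = lyrics.merge(artists, on="ALink", how="inner")

print(songs.shape)
print(songs.head())
```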
This data is scraped from the Vagalume website, so it depends on their endeavour of storing and sharing millions of song lyrics.
The data scraping for this dataset was inspired by the desire to analyze music data and train an LSTM to compose lyrics.