This dataset was created by DNS_dataset
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Alexa Internet was founded in April 1996 by Brewster Kahle and Bruce Gilliat. The company's name was chosen in homage to the Library of Alexandria of Ptolemaic Egypt, drawing a parallel between the largest repository of knowledge in the ancient world and the potential of the Internet to become a similar store of knowledge. (from Wikipedia)
The categories list was going out by September, 17h, 2020. So I would like to save it. https://support.alexa.com/hc/en-us/articles/360051913314
This dataset was elaborated by this python script (V2.0): https://github.com/natanael127/dump-alexa-ranking
The sites are grouped in 17 macro categories and this tree ends having more than 360.000 nodes. Subjects are very organized and each of them has its own rank of most accessed domains. So, even the keys of a sub-dictionary may be a good small dataset to use.
Thank you my friend André (https://github.com/andrerclaudio) by helping me with tips of Google Colaboratory and computational power to get the data until our deadline.
Alexa ranking was inspired by Library of Alexandria. In the modern world, it may be a good start for AI know more about many, many subjects of the world.
Evaluation of the most visited health websites in the world
This dataset is composed of the URLs of the top 1 million websites. The domains are ranked using the Alexa traffic ranking which is determined using a combination of the browsing behavior of users on the website, the number of unique visitors, and the number of pageviews. In more detail, unique visitors are the number of unique users who visit a website on a given day, and pageviews are the total number of user URL requests for the website. However, multiple requests for the same website on the same day are counted as a single pageview. The website with the highest combination of unique visitors and pageviews is ranked the highest
In November 2024, Google.com was the leading website in Colombia by unique visits, with around 52.9 million single accesses to the URL during that month. YouTube.com came in second with approximately 30.9 million unique monthly visits. Facebook ranked third with 24.2 million unique monthly visits.
U.S. Government Workshttps://www.usa.gov/government-works
License information was derived automatically
This data, exported from Google Analytics displays the most popular 50 pages on Austintexas.gov based on the following: Views: The total number of times the page was viewed. Repeated views of a single page are counted. Bounce Rate: The percentage of single-page visits (i.e. visits in which the person left your site from the entrance page without interacting with the page).
*Note: On July 1, 2023, standard Universal Analytics properties will stop processing data.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Popular Website Traffic Over Time ’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/popular-website-traffice on 13 February 2022.
--- Dataset description provided by original source is as follows ---
Background
Have you every been in a conversation and the question comes up, who uses Bing? This question comes up occasionally because people wonder if these sites have any views. For this research study, we are going to be exploring popular website traffic for many popular websites.
Methodology
The data collected originates from SimilarWeb.com.
Source
For the analysis and study, go to The Concept Center
This dataset was created by Chase Willden and contains around 0 samples along with 1/1/2017, Social Media, technical information and other features such as: - 12/1/2016 - 3/1/2017 - and more.
- Analyze 11/1/2016 in relation to 2/1/2017
- Study the influence of 4/1/2017 on 1/1/2017
- More datasets
If you use this dataset in your research, please credit Chase Willden
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a dataset of Tor cell file extracted from browsing simulation using Tor Browser. The simulations cover both desktop and mobile webpages. The data collection process was using WFP-Collector tool (https://github.com/irsyadpage/WFP-Collector). All the neccessary configuration to perform the simulation as detailed in the tool repository.The webpage URL is selected by using the first 100 website based on: https://dataforseo.com/free-seo-stats/top-1000-websites.Each webpage URL is visited 90 times for each deskop and mobile browsing mode.
The CDC Content Syndication site at https://tools.cdc.gov/syndication/ allows you to import content from CDC websites directly into your own website or application. These services are provided free of charge from CDC. The data shown in this table represent the weekly top page views from CDC.gov offered by syndication.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The numbers in the parentheses are the ranking orders according to the focus indicators.Top 15 websites with highest PageRank.
This dataset was created by Nina Luquez
This dataset contains a list of 3654 Dutch websites that we considered the most popular websites in 2015. This list served as whitelist for the Newstracker Research project in which we monitored the online web behaviour of a group of respondents.
The research project 'The Newstracker' was a subproject of the NWO-funded project 'The New News Consumer: A User-Based Innovation Project to Meet Paradigmatic Change in News Use and Media Habits'.
For the Newstracker project we aimed to understand the web behaviour of a group of respondents. We created custom-built software to monitor their web browsing behaviour on their laptops and desktops (please find the code in open access at https://github.com/NITechLabs/NewsTracker). For reasons of scale and privacy we created a whitelist with websites that were the most popular websites in 2015. We manually compiled this list by using data of DDMM, Alexa and own research. The dataset consists of 5 columns:
- the URL
- the type of website: We created a list of types of websites and each website has been manually labeled with 1 category
- Nieuws-regio: When the category was 'News', we subdivided these websites in the regional focus: International, National or Local
- Nieuws-onderwerp: Furthermore, each website under the category News was further subdivided in type of news website. For this we created an own list of news categories and manually coded each website
- Bron: For each website we noted which source we used to find this website.
The full description of the research design of the Newstracker including the set-up of this whitelist is included in the following article: Kleppe, M., Otte, M. (in print), 'Analysing & understanding news consumption patterns by tracking online user behaviour with a multimodal research design', Digital Scholarship in the Humanities, doi 10.1093/llc/fqx030.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Dataset Card for 1000 Website Screenshots with Metadata
Dataset Summary
Silatus is sharing, for free, a segment of a dataset that we are using to train a generative AI model for text-to-mockup conversions. This dataset was collected in December 2022 and early January 2023, so it contains very recent data from 1,000 of the world's most popular websites. You can get our larger 10,000 website dataset for free at: https://silatus.com/datasets This dataset includes: High-res… See the full description on the dataset page: https://huggingface.co/datasets/silatus/1k_Website_Screenshots_and_Metadata.
This dataset provides detail on how all assets on a domain are being used (e.g. views, downloads, API reads).
User activity is provided by date, asset uid, asset type, asset name, access type and user segment. Please see Site Analytics: Asset Access for more detail about these fields.
The dataset will reflect new Asset Access records within a day of when they occur.
This page serves as the backing dataset for the Top 100 HHS Websites, sorted by total page views. Please refer to the story page here for more information:https://healthdata.gov/stories/s/Top-100-HHS-Websites/d84g-3yzd
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about stocks. It has 1 row and is filtered where the company is Top Ships. It features 8 columns including stock name, company, exchange, and exchange symbol.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about stocks. It has 1 row and is filtered where the company is Top Spring International. It features 8 columns including stock name, company, exchange, and exchange symbol.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Instance of an user visiting a particular page on a website.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is available on Brisbane City Council’s open data website – data.brisbane.qld.gov.au. The site provides additional features for viewing and interacting with the data and for downloading the data in various formats.
Monthly analytics reports for the Brisbane City Council website
Information regarding the sessions for Brisbane City Council website during the month including search terms used.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Fortune 1000’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/winston56/fortune-500-data-2021 on 13 November 2021.
--- Dataset description provided by original source is as follows ---
Every year Fortune, an American Business Magazine, publishes the Fortune 500, which ranks the top 500 corporations by revenue. This dataset includes the entire Fortune 1000, as opposed to just the top 500.
The Fortune 1000 dataset is from the Fortune website, collected by the processes outlined in this notebook. It contains U.S. company data for the year 2021. The dataset is 1000 rows and 18 columns.
This dataset is made to explore the top corporations in the U.S. Answer questions such as: What percentage of companies have women ceo's? How many companies are newcomers? What percentage of companies have ceos who were also founders? What role does profitability play in ranking?
--- Original source retains full ownership of the source dataset ---
This dataset was created by DNS_dataset