96 datasets found

Traces captured by visiting the top 1500 website
kaggle.com
zip
Updated Aug 25, 2021
Cite
DNS_dataset (2021). Traces captured by visiting the top 1500 website [Dataset]. https://www.kaggle.com/jacksontang16/traces-captured-by-visiting-the-top-1500-website
Explore at:
Available download formats: zip (5852806 bytes)
Dataset updated
Aug 25, 2021
Authors
DNS_dataset
Description
Dataset

This dataset was created by DNS_dataset

Contents
Colombia: most visited websites 2024, by unique visitors
statista.com
ai-chatbox.pro
Updated Jun 4, 2025
Cite
Statista (2025). Colombia: most visited websites 2024, by unique visitors [Dataset]. https://www.statista.com/statistics/1409003/most-visited-websites-unique-visitors-colombia/
Explore at:
Dataset updated
Jun 4, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
Nov 2024
Area covered
Colombia
Description
In November 2024, Google.com was the leading website in Colombia by unique visits, with around 52.9 million single accesses to the URL during that month. YouTube.com came in second with approximately 30.9 million unique monthly visits. Facebook ranked third with 24.2 million unique monthly visits.
(Dataset) The most visited health websites in the world
narcis.nl
data.mendeley.com
Updated Jan 11, 2021
Cite
Acosta-Vargas, P (via Mendeley Data) (2021). (Dataset) The most visited health websites in the world [Dataset]. http://doi.org/10.17632/n468trh5my.1
Explore at:
Unique identifier
https://doi.org/10.17632/n468trh5my.1
Dataset updated
Jan 11, 2021
Dataset provided by
Data Archiving and Networked Services (DANS)
Authors
Acosta-Vargas, P (via Mendeley Data)
Description
Evaluation of the most visited health websites in the world
Most visited websites by hierachycal categories
kaggle.com
Updated Sep 18, 2020
Cite
Natanael de Souza Figueiredo (2020). Most visited websites by hierachycal categories [Dataset]. https://www.kaggle.com/natanael127/most-visited-websites-by-hierachycal-categories/code
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 18, 2020
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Natanael de Souza Figueiredo
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

Alexa Internet was founded in April 1996 by Brewster Kahle and Bruce Gilliat. The company's name was chosen in homage to the Library of Alexandria of Ptolemaic Egypt, drawing a parallel between the largest repository of knowledge in the ancient world and the potential of the Internet to become a similar store of knowledge. (from Wikipedia)

The categories list was due to be discontinued by September 17, 2020, so I wanted to save it. https://support.alexa.com/hc/en-us/articles/360051913314

This dataset was produced with this Python script (V2.0): https://github.com/natanael127/dump-alexa-ranking

Content

The sites are grouped into 17 macro categories, and the resulting tree has more than 360,000 nodes. Subjects are well organized, and each has its own ranking of most accessed domains. So even the keys of a sub-dictionary may make a good small dataset.
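
The mention of sub-dictionaries suggests the category tree is stored as nested dictionaries. The following is a minimal sketch under that assumption; `toy_tree`, `count_nodes`, and `subcategory_names` are hypothetical names, and the real dump's layout may differ.

```python
# Hypothetical miniature of a category tree; the real dump's layout may differ.
toy_tree = {
    "Arts": {"Music": {"Jazz": {}, "Rock": {}}, "Movies": {}},
    "Science": {"Physics": {}},
}


def count_nodes(tree):
    """Count every category node in a nested dict of sub-categories."""
    return sum(1 + count_nodes(children) for children in tree.values())


def subcategory_names(tree, path):
    """Return the sorted keys of the sub-dictionary reached by following `path`."""
    node = tree
    for key in path:
        node = node[key]
    return sorted(node)


print(count_nodes(toy_tree))                  # 7
print(subcategory_names(toy_tree, ["Arts"]))  # ['Movies', 'Music']
```

Walking the tree recursively like this is one way to confirm a node count or to pull the key list of a single branch as a small dataset.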

Acknowledgements

Thanks to my friend André (https://github.com/andrerclaudio) for helping me with tips on Google Colaboratory and with computational power to collect the data before our deadline.

Inspiration

The Alexa ranking was inspired by the Library of Alexandria. In the modern world, it may be a good starting point for AI to learn about many subjects.
‘Popular Website Traffic Over Time ’ analyzed by Analyst-2
analyst-2.ai
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com), ‘Popular Website Traffic Over Time ’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-popular-website-traffic-over-time-62e4/62549059/?iid=003-357&v=presentation
Explore at:
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘Popular Website Traffic Over Time ’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/popular-website-traffice on 13 February 2022.

--- Dataset description provided by original source is as follows ---

About this dataset

Background

Have you ever been in a conversation where the question comes up: who uses Bing? The question arises occasionally because people wonder whether these sites get any views. For this research study, we explore traffic for many popular websites.

Methodology

The data collected originates from SimilarWeb.com.

Source

For the analysis and study, go to The Concept Center

This dataset was created by Chase Willden and contains around 0 samples along with 1/1/2017, Social Media, technical information and other features such as: - 12/1/2016 - 3/1/2017 - and more.

How to use this dataset

Analyze 11/1/2016 in relation to 2/1/2017

Study the influence of 4/1/2017 on 1/1/2017

Acknowledgements

If you use this dataset in your research, please credit Chase Willden

--- Original source retains full ownership of the source dataset ---
Website Fingerprinting Dataset of Browsing Network Traffic for Desktop and...
ieee-dataport.org
Updated Oct 21, 2024
Cite
Mohamad Amar Irsyad Mohd Aminuddin (2024). Website Fingerprinting Dataset of Browsing Network Traffic for Desktop and Mobile Webpages [Dataset]. https://ieee-dataport.org/documents/website-fingerprinting-dataset-browsing-network-traffic-desktop-and-mobile-webpages
Explore at:
Dataset updated
Oct 21, 2024
Authors
Mohamad Amar Irsyad Mohd Aminuddin
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is a dataset of Tor cell files extracted from browsing simulations using Tor Browser. The simulations cover both desktop and mobile webpages. The data were collected with the WFP-Collector tool (https://github.com/irsyadpage/WFP-Collector). All the necessary configuration to perform the simulation is detailed in the tool repository. The webpage URLs were selected from the first 100 websites listed at https://dataforseo.com/free-seo-stats/top-1000-websites. Each webpage URL is visited 90 times in each of the desktop and mobile browsing modes.
Top 50 Pages By Pageviews on Austintexas.gov -
data.austintexas.gov
gimi9.com
Updated Dec 6, 2023
Cite
City of Austin, Texas - data.austintexas.gov (2023). Top 50 Pages By Pageviews on Austintexas.gov - [Dataset]. https://data.austintexas.gov/City-Government/Top-50-Pages-By-Pageviews-on-Austintexas-gov-/8yfa-b3bq
Explore at:
Available download formats: csv, xml, application/rdfxml, application/rssxml, json, tsv
Dataset updated
Dec 6, 2023
Dataset authored and provided by
City of Austin, Texas - data.austintexas.gov
License
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
Description
This data, exported from Google Analytics, displays the 50 most popular pages on Austintexas.gov based on the following:

Views: The total number of times the page was viewed. Repeated views of a single page are counted.

Bounce Rate: The percentage of single-page visits (i.e., visits in which the person left your site from the entrance page without interacting with the page).

*Note: On July 1, 2023, standard Universal Analytics properties will stop processing data.
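
The two metrics defined above can be illustrated with a small sketch. It assumes each visit is recorded as an ordered list of page paths and that a page's bounce rate is taken over the visits that entered on that page; the `sessions` data and both function names are hypothetical and are not part of the Google Analytics export.

```python
# Hypothetical sessions: each is the ordered list of pages one visit viewed.
sessions = [
    ["/home"],
    ["/home", "/jobs"],
    ["/jobs"],
    ["/home", "/home"],
]


def page_views(sessions, page):
    """Total views of `page`; repeated views within a visit are counted."""
    return sum(visit.count(page) for visit in sessions)


def bounce_rate(sessions, page):
    """Percentage of visits entering on `page` that viewed only that one page."""
    entrances = [visit for visit in sessions if visit and visit[0] == page]
    if not entrances:
        return 0.0
    bounces = sum(1 for visit in entrances if len(visit) == 1)
    return 100.0 * bounces / len(entrances)


print(page_views(sessions, "/home"))   # 4
print(bounce_rate(sessions, "/jobs"))  # 100.0
```

Here a visit that reloads the same page counts as two views and is not treated as a bounce, which is a simplification of how an analytics product decides whether a visitor interacted with the page.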
Dataset used for detecting DNS over HTTPS by Machine Learning.
data.niaid.nih.gov
zenodo.org
Updated Oct 28, 2020
Cite
Vekshin,Dmitrii (2020). Dataset used for detecting DNS over HTTPS by Machine Learning. [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3818004
Explore at:
Dataset updated
Oct 28, 2020
Dataset provided by
Vekshin,Dmitrii
Cejka,Tomas
Hynek,Karel
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
The dataset consists of three different data sources:

DoH enabled Firefox

DoH enabled Google Chrome

Cloudflared DoH proxy

The web browser data was captured using the Selenium framework, which simulated classical user browsing. The browsers received commands to visit domains taken from Alexa's top 10K most visited websites. The capture was performed on the host by listening on the network interface of the virtual machine. Overall, the dataset contains almost 5,000 web-page visits by Mozilla Firefox and 1,000 pages visited by Chrome.

The Cloudflared DoH proxy was installed on a Raspberry Pi, and the IP address of the Raspberry Pi was set as the default DNS resolver in two separate offices in our university. It continuously captured the DNS/DoH traffic created by up to 20 devices for around three months.

The dataset contains 1,128,904 flows, of which around 33,000 are labeled as DoH. We provide raw pcap data, a CSV with flow data, and a CSV file with extracted features.

The CSV with extracted features has the following data fields:

Label (1 - DoH, 0 - regular HTTPS)

Data source

Duration

Minimal Inter-Packet Delay

Maximal Inter-Packet Delay

Average Inter-Packet Delay

A variance of Incoming Packet Sizes

A variance of Outgoing Packet Sizes

A ratio of the number of Incoming and outgoing bytes

A ratio of the number of Incoming and outgoing packets

Average of Incoming Packet sizes

Average of Outgoing Packet sizes

The median value of Incoming Packet sizes

The median value of outgoing Packet sizes

The ratio of bursts and pauses

Number of bursts

Number of pauses

Autocorrelation

Transmission symmetry in the 1st third of connection

Transmission symmetry in the 2nd third of connection

Transmission symmetry in the last third of connection

The observed network traffic does not contain privacy-sensitive information.
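
To make the listed fields concrete, here is a minimal sketch that computes a few of them for one toy flow. The packet tuples, the `flow_features` helper, the dictionary keys, and the use of population variance are all assumptions for illustration; they are not the authors' extraction code and do not reflect the CSV's actual column names.

```python
from statistics import mean, median, pvariance

# Hypothetical flow: (timestamp in seconds, size in bytes, direction) per packet.
packets = [
    (0.00, 100, "out"),
    (0.10, 500, "in"),
    (0.30, 300, "in"),
    (0.60, 200, "out"),
]


def flow_features(packets):
    """Compute a few of the listed per-flow statistics for one flow."""
    times = [t for t, _, _ in packets]
    delays = [later - earlier for earlier, later in zip(times, times[1:])]
    incoming = [size for _, size, direction in packets if direction == "in"]
    outgoing = [size for _, size, direction in packets if direction == "out"]
    return {
        "duration": times[-1] - times[0],
        "min_delay": min(delays),
        "max_delay": max(delays),
        "avg_delay": mean(delays),
        "var_in_size": pvariance(incoming),   # population variance (assumption)
        "var_out_size": pvariance(outgoing),
        "bytes_ratio": sum(incoming) / sum(outgoing),
        "packets_ratio": len(incoming) / len(outgoing),
        "median_in_size": median(incoming),
    }


features = flow_features(packets)
print(features["bytes_ratio"])
```

A real extractor would also need to guard against flows with a single packet or with traffic in only one direction, where the delay list or a denominator would be empty.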

The zip file structure is:

|-- data
|   |-- extracted-features ... extracted features used in ML for DoH recognition
|   |   |-- chrome
|   |   |-- cloudflared
|   |   `-- firefox
|   |-- flows ... exported flow data
|   |   |-- chrome
|   |   |-- cloudflared
|   |   `-- firefox
|   `-- pcaps ... raw PCAP data
|       |-- chrome
|       |-- cloudflared
|       `-- firefox
|-- LICENSE
`-- README.md

When using this dataset, please cite the original work as follows:

@inproceedings{vekshin2020,
  author = {Vekshin, Dmitrii and Hynek, Karel and Cejka, Tomas},
  title = {DoH Insight: Detecting DNS over HTTPS by Machine Learning},
  year = {2020},
  isbn = {9781450388337},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3407023.3409192},
  doi = {10.1145/3407023.3409192},
  booktitle = {Proceedings of the 15th International Conference on Availability, Reliability and Security},
  articleno = {87},
  numpages = {8},
  keywords = {classification, DoH, DNS over HTTPS, machine learning, detection, datasets},
  location = {Virtual Event, Ireland},
  series = {ARES '20}
}
Most popular websites in the Netherlands 2015
datacatalogue.cessda.eu
ssh.datastations.nl
Updated Jul 4, 2023
Cite
M. Kleppe; H. Bijleveld (2023). Most popular websites in the Netherlands 2015 [Dataset]. http://doi.org/10.17026/dans-x6h-6qqt
Explore at:
Unique identifier
https://doi.org/10.17026/dans-x6h-6qqt
Dataset updated
Jul 4, 2023
Dataset provided by
Vrije Universiteit Amsterdam
Authors
M. Kleppe; H. Bijleveld
Area covered
Netherlands
Description
This dataset contains a list of 3654 Dutch websites that we considered the most popular websites in 2015. This list served as whitelist for the Newstracker Research project in which we monitored the online web behaviour of a group of respondents.
The research project 'The Newstracker' was a subproject of the NWO-funded project 'The New News Consumer: A User-Based Innovation Project to Meet Paradigmatic Change in News Use and Media Habits'.
For the Newstracker project we aimed to understand the web behaviour of a group of respondents. We created custom-built software to monitor their web browsing behaviour on their laptops and desktops (please find the code in open access at https://github.com/NITechLabs/NewsTracker). For reasons of scale and privacy we created a whitelist with websites that were the most popular websites in 2015. We manually compiled this list by using data of DDMM, Alexa and own research. The dataset consists of 5 columns:
- the URL
- the type of website: We created a list of types of websites and each website has been manually labeled with 1 category
- Nieuws-regio: When the category was 'News', we subdivided these websites in the regional focus: International, National or Local
- Nieuws-onderwerp: Furthermore, each website under the category News was further subdivided in type of news website. For this we created an own list of news categories and manually coded each website
- Bron: For each website we noted which source we used to find this website.
The full description of the research design of the Newstracker including the set-up of this whitelist is included in the following article: Kleppe, M., Otte, M. (in print), 'Analysing & understanding news consumption patterns by tracking online user behaviour with a multimodal research design', Digital Scholarship in the Humanities, doi 10.1093/llc/fqx030.
1k_Website_Screenshots_and_Metadata
huggingface.co
Updated Apr 13, 2023
Cite
Silatus (2023). 1k_Website_Screenshots_and_Metadata [Dataset]. https://huggingface.co/datasets/silatus/1k_Website_Screenshots_and_Metadata
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 13, 2023
Dataset authored and provided by
Silatus
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
Dataset Card for 1000 Website Screenshots with Metadata

Dataset Summary

Silatus is sharing, for free, a segment of a dataset that we are using to train a generative AI model for text-to-mockup conversions. This dataset was collected in December 2022 and early January 2023, so it contains very recent data from 1,000 of the world's most popular websites. You can get our larger 10,000 website dataset for free at: https://silatus.com/datasets This dataset includes: High-res… See the full description on the dataset page: https://huggingface.co/datasets/silatus/1k_Website_Screenshots_and_Metadata.
UI-Elements-Detection-Dataset
huggingface.co
Updated Nov 26, 2024
Cite
Yash Jain (2024). UI-Elements-Detection-Dataset [Dataset]. https://huggingface.co/datasets/YashJain/UI-Elements-Detection-Dataset
Explore at:
Dataset updated
Nov 26, 2024
Authors
Yash Jain
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Web UI Elements Dataset

Overview

A comprehensive dataset of web user interface elements collected from the world's most visited websites. This dataset is specifically curated for training AI models to detect and classify UI components, enabling automated UI testing, accessibility analysis, and interface design studies.

Key Features

300+ popular websites sampled

15 essential UI element classes

High-resolution screenshots (1920x1080)

Rich accessibility metadata… See the full description on the dataset page: https://huggingface.co/datasets/YashJain/UI-Elements-Detection-Dataset.
Open Data BR Site Analytics - Top 10 Assets Viewed or Downloaded
data.brla.gov
Updated Jun 28, 2025
Cite
(2025). Open Data BR Site Analytics - Top 10 Assets Viewed or Downloaded [Dataset]. https://data.brla.gov/dataset/Open-Data-BR-Site-Analytics-Top-10-Assets-Viewed-o/ie4p-gccw
Explore at:
Available download formats: tsv, application/rssxml, json, csv, application/rdfxml, xml
Dataset updated
Jun 28, 2025
Description
This dataset provides detail on how all assets on a domain are being used (e.g. views, downloads, API reads).
User activity is provided by date, asset uid, asset type, asset name, access type and user segment. Please see Site Analytics: Asset Access for more detail about these fields.
The dataset will reflect new Asset Access records within a day of when they occur.
Dataset used for HTTPS traffic classification using packet burst statistics
data.niaid.nih.gov
zenodo.org
Updated Apr 11, 2022
Cite
Cejka Tomas (2022). Dataset used for HTTPS traffic classification using packet burst statistics [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4911550
Explore at:
Dataset updated
Apr 11, 2022
Dataset provided by
Hynek Karel
Cejka Tomas
Tropkova Zdena
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We are publishing a dataset we created for the HTTPS traffic classification.

Since the data were captured mainly in the real backbone network, we omitted IP addresses and ports. The datasets consist of data calculated from bidirectional flows exported with the flow probe ipfixprobe. This exporter can export a sequence of packet lengths and times and a sequence of packet bursts and times. For more information, please visit the ipfixprobe repository (ipfixprobe).

During our research, we divided HTTPS into five categories: L -- Live Video Streaming, P -- Video Player, M -- Music Player, U -- File Upload, D -- File Download, W -- Website, and other traffic.

We have chosen the service representatives known for particular traffic types based on the Alexa Top 1M list and Moz's list of the most popular 500 websites for each category. We also used several popular websites that primarily focus on the audience in our country. The identified traffic classes and their representatives are provided below:

Live Video Stream: Twitch, Czech TV, YouTube Live

Video Player: DailyMotion, Stream.cz, Vimeo, YouTube

Music Player: AppleMusic, Spotify, SoundCloud

File Upload/Download: FileSender, OwnCloud, OneDrive, Google Drive

Website and Other Traffic: Websites from Alexa Top 1M list
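
The description mentions sequences of packet bursts without defining a burst. As a rough sketch only, the code below assumes a burst is a run of consecutive packets travelling in the same direction; this definition, the `packets` sample, and the `packet_bursts` helper are hypothetical and may not match what the flow exporter actually computes.

```python
from itertools import groupby

# Hypothetical packet sequence for one flow: (direction, length in bytes).
packets = [
    ("out", 200),
    ("out", 100),
    ("in", 1400),
    ("in", 1400),
    ("in", 600),
    ("out", 80),
]


def packet_bursts(packets):
    """Group consecutive same-direction packets into (direction, count, bytes)."""
    bursts = []
    for direction, group in groupby(packets, key=lambda packet: packet[0]):
        lengths = [length for _, length in group]
        bursts.append((direction, len(lengths), sum(lengths)))
    return bursts


for burst in packet_bursts(packets):
    print(burst)
```

Per-burst counts and byte totals of this kind could then feed a classifier that separates, for example, download-heavy flows from interactive ones.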
Most Viewed Digital Records in City Archives Digital Repository
data.boston.gov
csv
Updated Apr 12, 2019
Cite
Archives and Record Management (2019). Most Viewed Digital Records in City Archives Digital Repository [Dataset]. https://data.boston.gov/dataset/most-viewed-digital-records-in-city-archives-digital-repository
Explore at:
Available download formats: csv (7531), csv (7671), csv (7791), csv, csv (7727), csv (7393), csv (7760), csv (7735), csv (7559)
Dataset updated
Apr 12, 2019
Dataset authored and provided by
Archives and Record Management
License
ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
Description
Monthly statistics for most viewed digital records in the City Archives Digital Repository.
Artisanal mining site visits in Eastern DRC - Dataset - openAFRICA
open.africa
Updated Feb 7, 2019
Cite
(2019). Artisanal mining site visits in Eastern DRC - Dataset - openAFRICA [Dataset]. https://open.africa/dataset/artisanal-mining-site-visits-in-eastern-drc
Explore at:
Dataset updated
Feb 7, 2019
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Democratic Republic of the Congo
Description
IPIS has collected data on artisanal mining sites since 2009, and made it publicly accessible on webmaps and in analytical reports. The upgraded map presents new mining sites, bringing the total to more than 2400 sites visited as recently as December 2017. New information on the mining sites has been included. A new layer has been added displaying hundreds of roadblocks. The latest update of the map has been supported by the International Organization for Migration (IOM) in the DRC, through the USAID funded Responsible Minerals Trade (RMT) project
Greek privacy policies dataset from PCI 2023 paper: "A privacy policies...
zenodo.org
data.niaid.nih.gov
bin, csv
Updated Dec 27, 2023
Cite
Georgia Kapitsaki; Maria Papoutsoglou (2023). Greek privacy policies dataset from PCI 2023 paper: "A privacy policies dataset in Greek in the GDPR era" [Dataset]. http://doi.org/10.5281/zenodo.10435881
Explore at:
Available download formats: bin, csv
Unique identifier
https://doi.org/10.5281/zenodo.10435881
Dataset updated
Dec 27, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Georgia Kapitsaki; Maria Papoutsoglou
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A dataset of privacy policies in the Greek language, with policies coming from top visited websites in Greece with a privacy policy in the Greek language.

The dataset and the results of its analysis are included.

If you want to use this dataset, please cite the relevant conference publication:

Georgia M. Kapitsaki and Maria Papoutsoglou, "A privacy policies dataset in Greek in the GDPR era," in Proceedings of the 27th Pan-Hellenic Conference on Informatics, PCI 2023.
What social Media People like the most and why?
kaggle.com
Updated Feb 17, 2023
Cite
Nina Luquez (2023). What social Media People like the most and why? [Dataset]. https://www.kaggle.com/ninaluquez/what-social-media-people-like-the-most-and-why/discussion
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 17, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Nina Luquez
Description
Dataset

This dataset was created by Nina Luquez

Contents
News Portal User Interactions by Globo.com
kaggle.com
zip
Updated Apr 16, 2019
Cite
Gabriel Moreira (2019). News Portal User Interactions by Globo.com [Dataset]. https://www.kaggle.com/gspmoreira/news-portal-user-interactions-by-globocom
Explore at:
Available download formats: zip (377105112 bytes)
Dataset updated
Apr 16, 2019
Authors
Gabriel Moreira
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
Context

This large dataset of user interaction logs (page views) from a news portal was kindly provided by Globo.com, the most popular news portal in Brazil, for reproducibility of the experiments with CHAMELEON, a meta-architecture for contextual hybrid session-based news recommender systems. The source code was made available at GitHub.

The first version (v1) (download) of this dataset was released for reproducibility of the experiments presented in the following paper:

Gabriel de Souza Pereira Moreira, Felipe Ferreira, and Adilson Marques da Cunha. 2018. News Session-Based Recommendations using Deep Neural Networks. In 3rd Workshop on Deep Learning for Recommender Systems (DLRS 2018), October 6, 2018, Vancouver, BC, Canada. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3270323.3270328

A second version (v2) (download) of this dataset was made available for reproducibility of the experiments presented in the following paper. Compared to the v1, the only differences are:

Included four additional user contextual attributes (click_os, click_country, click_region, click_referrer_type)

Removed repeated clicks (clicks on the same article) within sessions. Sessions with fewer than two clicks (the minimum for the next-click prediction task) were removed

Gabriel de Souza Pereira Moreira, Dietmar Jannach, and Adilson Marques da Cunha. 2019. Contextual Hybrid Session-based News Recommendation with Recurrent Neural Networks. arXiv preprint arXiv:1904.10367, 49 pages
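
The v2 cleaning rule described above (drop repeated clicks within a session, then drop sessions left with fewer than two clicks) can be sketched as follows. The session dictionary, the `clean_sessions` name, and the choice to treat any repeat of an article within a session as a repeated click are assumptions; the authors' own preprocessing may differ.

```python
# Hypothetical sessions: session id -> ordered list of clicked article ids.
sessions = {
    "s1": [10, 10, 11],
    "s2": [7, 7],
    "s3": [3, 4, 3, 5],
}


def clean_sessions(sessions, min_clicks=2):
    """Drop repeated article clicks per session, then drop too-short sessions."""
    cleaned = {}
    for session_id, article_ids in sessions.items():
        # dict.fromkeys keeps the first occurrence of each id, in click order.
        unique_clicks = list(dict.fromkeys(article_ids))
        if len(unique_clicks) >= min_clicks:
            cleaned[session_id] = unique_clicks
    return cleaned


print(clean_sessions(sessions))  # {'s1': [10, 11], 's3': [3, 4, 5]}
```

Session `s2` disappears because only one distinct click remains after deduplication, which leaves nothing to predict in a next-click task.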

You are not allowed to use this dataset for commercial purposes, only with academic objectives (like education or research). If used for research, please cite the above papers.

Content

The dataset contains a sample of user interactions (page views) in G1 news portal from Oct. 1 to 16, 2017, including about 3 million clicks, distributed in more than 1 million sessions from 314,000 users who read more than 46,000 different news articles during that period.

It is composed of three files/folders:

clicks.zip - Folder with CSV files (one per hour), containing user session interactions in the news portal.

articles_metadata.csv - CSV file with metadata information about all (364047) published articles

articles_embeddings.pickle - Pickle (Python 3) of a NumPy matrix containing the Article Content Embeddings (250-dimensional vectors), trained upon articles' text and metadata by CHAMELEON's ACR module (see paper for details) for 364047 published articles.
P.s. The full text of news articles could not be provided due to license restrictions, but those embeddings can be used by Neural Networks to represent their content. See this paper for a t-SNE visualization of these embeddings, colored by category.

Acknowledgements

I would like to acknowledge Globo.com for providing this dataset for this research and for the academic community, and especially Felipe Ferreira for preparing the original Globo.com dataset.

Dataset banner photo by rawpixel on Unsplash

Inspiration

This dataset might be very useful if you want to implement and evaluate hybrid and contextual news recommender systems, using both user interactions and article content and metadata to provide recommendations. You might also use it for analytics, for example to understand how user interactions in a news portal are distributed by user, by article, or by category.

If you are interested in a dataset of user interactions on articles with the full text provided, to experiment with different text representations using NLP, you might want to take a look at this smaller dataset.
Google Trends and Wikipedia Page Views
zenodo.org
explore.openaire.eu
application/gzip
Updated Jan 24, 2020
Cite
Mitsuo Yoshida (2020). Google Trends and Wikipedia Page Views [Dataset]. http://doi.org/10.5281/zenodo.14539
Explore at:
Available download formats: application/gzip
Unique identifier
https://doi.org/10.5281/zenodo.14539
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Mitsuo Yoshida
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Abstract (our paper)

The frequency of a web search keyword generally reflects the degree of public interest in a particular subject matter. Search logs are therefore useful resources for trend analysis. However, access to search logs is typically restricted to search engine providers. In this paper, we investigate whether search frequency can be estimated from a different resource such as Wikipedia page views of open data. We found frequently searched keywords to have remarkably high correlations with Wikipedia page views. This suggests that Wikipedia page views can be an effective tool for determining popular global web search trends.

Data

personal-name.txt.gz:
The first column is the Wikipedia article id, the second column is the search keyword, the third column is the Wikipedia article title, and the fourth column is the total of page views from 2008 to 2014.

personal-name_data_google-trends.txt.gz, personal-name_data_wikipedia.txt.gz:
The first column is the period to be collected, the second column is the source (Google or Wikipedia), the third column is the Wikipedia article id, the fourth column is the search keyword, the fifth column is the date, and the sixth column is the value of search trend or page view.
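
As an illustration of the six-column layout described above, the sketch below parses a few made-up rows and computes the Pearson correlation between the Google and Wikipedia series for one article. The tab delimiter, the sample values, and the helper names `load_series` and `pearson` are assumptions; the actual files may be delimited or encoded differently.

```python
import csv
import io
from math import sqrt

# Hypothetical rows: period, source, article id, keyword, date, value.
sample = (
    "2008-2014\tGoogle\t42\texample name\t2014-01-01\t10\n"
    "2008-2014\tGoogle\t42\texample name\t2014-01-02\t20\n"
    "2008-2014\tGoogle\t42\texample name\t2014-01-03\t30\n"
    "2008-2014\tWikipedia\t42\texample name\t2014-01-01\t100\n"
    "2008-2014\tWikipedia\t42\texample name\t2014-01-02\t200\n"
    "2008-2014\tWikipedia\t42\texample name\t2014-01-03\t300\n"
)


def load_series(text):
    """Map each source to {(article id, date): value}."""
    series = {}
    for _period, source, article_id, _keyword, date, value in csv.reader(
        io.StringIO(text), delimiter="\t"
    ):
        series.setdefault(source, {})[(article_id, date)] = float(value)
    return series


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


series = load_series(sample)
shared = sorted(series["Google"].keys() & series["Wikipedia"].keys())
r = pearson([series["Google"][k] for k in shared],
            [series["Wikipedia"][k] for k in shared])
print(round(r, 6))
```

Aligning the two sources on the (article id, date) pair before correlating avoids comparing values from different days when one source has gaps.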

Publication

This data set was created for our study. If you make use of this data set, please cite:
Mitsuo Yoshida, Yuki Arase, Takaaki Tsunoda, Mikio Yamamoto. Wikipedia Page View Reflects Web Search Trend. Proceedings of the 2015 ACM Web Science Conference (WebSci '15). no.65, pp.1-2, 2015.
http://dx.doi.org/10.1145/2786451.2786495
http://arxiv.org/abs/1509.02218 (author-created version)

Note

The raw data of Wikipedia page views is available in the following page.
http://dumps.wikimedia.org/other/pagecounts-raw/
Data.sa website statistics
data.wu.ac.at
csv
Updated Oct 27, 2016
Cite
South Australian Governments (2016). Data.sa website statistics [Dataset]. https://data.wu.ac.at/odso/data_gov_au/MzAyYzA1OWQtMTk1Zi00ODYzLTk3MmMtMGI1ODQ3ZDRiNTZi
Explore at:
Available download formats: csv
Dataset updated
Oct 27, 2016
Dataset provided by
South Australian Governments
License
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Description
Data.sa.gov.au is a directory for the openly licensed datasets from South Australian Government departments. This dataset contains site statistics for data.sa, including the most viewed dataset pages, visitor browser types, device types, etc.
