64 datasets found

Hong Kong's most visited websites 2024
statista.com
Updated Feb 15, 2025
Cite
Statista (2025). Hong Kong's most visited websites 2024 [Dataset]. https://www.statista.com/statistics/1054071/hong-kong-most-popular-websites/
Dataset updated
Feb 15, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
Sep 1, 2024 - Nov 30, 2024
Area covered
Hong Kong
Description
Between September and November 2024, google.com was the most visited website in Hong Kong with 338 million average monthly visits. In terms of monthly traffic and pages per visit, international news website Yahoo.com ranked higher than the local news website hk01.com.
Google Analytics & Twitter dataset from a movies, TV series and videogames...
figshare.com
portalcientificovalencia.univeuropea.com
txt
Updated Feb 7, 2024
Cite
Víctor Yeste (2024). Google Analytics & Twitter dataset from a movies, TV series and videogames website [Dataset]. http://doi.org/10.6084/m9.figshare.16553061.v4
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.16553061.v4
Dataset updated
Feb 7, 2024
Dataset provided by
Figshare (http://figshare.com/)
Authors
Víctor Yeste
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Author: Víctor Yeste. Universitat Politècnica de Valencia.

The aim of this study is to design a cybermetric methodology for measuring the success of content published in online media and, where possible, predicting the selected success variables.

Because the study integrates data from two separate areas (web publishing, and the analysis of shares and related topics on Twitter), the data were collected programmatically through both the Google Analytics v4 reporting API and the Twitter Standard API, always respecting their limits.

The website analyzed is hellofriki.com, an online media outlet whose primary aim is to meet the need for information on topics that generate a large volume of news every day, in the form of news items as well as analysis, reports, interviews, and many other formats. All of this content falls within the sections of cinema, series, video games, literature, and comics.

This dataset contributed to the following PhD thesis:
Yeste Moreno, VM. (2021). Diseño de una metodología cibermétrica de cálculo del éxito para la optimización de contenidos web [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/176009

Data were obtained for each breaking news article published online, according to the indicators described in the doctoral thesis.

All related data are stored in a database, divided into the following tables:

tesis_followers: User ID list of media account followers.

tesis_hometimeline: data from tweets posted by the media account sharing breaking news from the web.
status_id: Tweet ID
created_at: date of publication
text: content of the tweet
path: URL extracted after processing the shortened URL in text
post_shared: Article ID in WordPress that is being shared
retweet_count: number of retweets
favorite_count: number of favorites

tesis_hometimeline_other: data from tweets posted by the media account that do not share breaking news from the web (other typologies, automatic Facebook shares, custom tweets without a link to an article, etc.). It has the same fields as tesis_hometimeline.

tesis_posts: data of articles published by the website and processed for some analysis.
stats_id: Analysis ID
post_id: Article ID in WordPress
post_date: article publication date in WordPress
post_title: title of the article
path: URL of the article on the media website
tags: Tags ID or WordPress tags related to the article
uniquepageviews: unique page views
entrancerate: entrance rate
avgtimeonpage: average visit time
exitrate: exit rate
pageviewspersession: page views per session
adsense_adunitsviewed: number of ads viewed by users
adsense_viewableimpressionpercent: ad display ratio
adsense_ctr: ad click ratio
adsense_ecpm: estimated ad revenue per 1000 page views

tesis_stats: data from a particular analysis, performed for each published breaking news item. Fields with statistical values can be computed from the data in the other tables, but total and average calculations are saved for faster and easier further processing.
id: ID of the analysis
phase: phase of the thesis in which the analysis has been carried out (right now all are 1)
time: "0" if at the time of publication, "1" if 14 days later
start_date: date and time of measurement on the day of publication
end_date: date and time when the measurement is made 14 days later
main_post_id: ID of the published article to be analysed
main_post_theme: main section of the published article to analyze
superheroes_theme: "1" if about superheroes, "0" if not
trailer_theme: "1" if trailer, "0" if not
name: empty field, possibility to add a custom name manually
notes: empty field, possibility to add personalized notes manually, for example if some tag has been removed manually for being considered too generic, even though the editor added it
num_articles: number of articles analysed
num_articles_with_traffic: number of articles analysed with traffic (which will be taken into account for traffic analysis)
num_articles_with_tw_data: number of articles with data from when they were shared on the media's Twitter account
num_terms: number of terms analyzed
uniquepageviews_total: total page views
uniquepageviews_mean: average page views
entrancerate_mean: average entrance rate
avgtimeonpage_mean: average duration of visits
exitrate_mean: average exit rate
pageviewspersession_mean: average page views per session
total: total of ads viewed
adsense_adunitsviewed_mean: average of ads viewed
adsense_viewableimpressionpercent_mean: average ad display ratio
adsense_ctr_mean: average ad click ratio
adsense_ecpm_mean: estimated ad revenue per 1000 page views
Total: total income
retweet_count_mean: average income
favorite_count_total: total of favorites
favorite_count_mean: average of favorites
terms_ini_num_tweets: total tweets on the terms on the day of publication
terms_ini_retweet_count_total: total retweets on the terms on the day of publication
terms_ini_retweet_count_mean: average retweets on the terms on the day of publication
terms_ini_favorite_count_total: total of favorites on the terms on the day of publication
terms_ini_favorite_count_mean: average of favorites on the terms on the day of publication
terms_ini_followers_talking_rate: ratio of followers of the media Twitter account who have recently published a tweet talking about the terms on the day of publication
terms_ini_user_num_followers_mean: average followers of users who have spoken of the terms on the day of publication
terms_ini_user_num_tweets_mean: average number of tweets published by users who spoke about the terms on the day of publication
terms_ini_user_age_mean: average age in days of users who have spoken of the terms on the day of publication
terms_ini_ur_inclusion_rate: URL inclusion ratio of tweets talking about terms on the day of publication
terms_end_num_tweets: total tweets on terms 14 days after publication
terms_ini_retweet_count_total: total retweets on terms 14 days after publication
terms_ini_retweet_count_mean: average retweets on terms 14 days after publication
terms_ini_favorite_count_total: total of favorites on terms 14 days after publication
terms_ini_favorite_count_mean: average of favorites on terms 14 days after publication
terms_ini_followers_talking_rate: ratio of media Twitter account followers who have recently posted a tweet talking about the terms 14 days after publication
terms_ini_user_num_followers_mean: average followers of users who have spoken of the terms 14 days after publication
terms_ini_user_num_tweets_mean: average number of tweets published by users who have spoken about the terms 14 days after publication
terms_ini_user_age_mean: average age in days of users who have spoken of the terms 14 days after publication
terms_ini_ur_inclusion_rate: URL inclusion ratio of tweets talking about terms 14 days after publication

tesis_terms: data of the terms (tags) related to the processed articles.
stats_id: Analysis ID
time: "0" if at the time of publication, "1" if 14 days later
term_id: Term ID (tag) in WordPress
name: name of the term
slug: URL of the term
num_tweets: number of tweets
retweet_count_total: total retweets
retweet_count_mean: average retweets
favorite_count_total: total of favorites
favorite_count_mean: average of favorites
followers_talking_rate: ratio of followers of the media Twitter account who have recently published a tweet talking about the term
user_num_followers_mean: average followers of users who were talking about the term
user_num_tweets_mean: average number of tweets published by users who were talking about the term
user_age_mean: average age in days of users who were talking about the term
url_inclusion_rate: URL inclusion ratio
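The tesis_stats description says its totals and averages can be recomputed from the other tables. A minimal sketch of that idea follows, assuming that stats_id in tesis_posts identifies the analysis an article belongs to. The in-memory table reproduces only three of the documented columns, and every value is invented for illustration. Whether the stored means cover all articles or only those with traffic is not stated, so this is not the author's exact calculation.

```python
import sqlite3

# Hypothetical in-memory stand-in for the tesis_posts table; only three of its
# documented columns are reproduced, and all values below are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tesis_posts (stats_id INTEGER, post_id INTEGER, uniquepageviews INTEGER)"
)
conn.executemany(
    "INSERT INTO tesis_posts VALUES (?, ?, ?)",
    [(1, 101, 120), (1, 102, 80), (2, 201, 50)],
)

# Recompute, per analysis, the kind of total and mean that tesis_stats stores
# as uniquepageviews_total and uniquepageviews_mean.
rows = conn.execute(
    """
    SELECT stats_id,
           SUM(uniquepageviews) AS uniquepageviews_total,
           AVG(uniquepageviews) AS uniquepageviews_mean
    FROM tesis_posts
    GROUP BY stats_id
    ORDER BY stats_id
    """
).fetchall()

for stats_id, total, mean in rows:
    print(stats_id, total, mean)
```

Storing these aggregates, as the dataset does, trades a little redundancy for not having to rerun the grouping on every read.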
Buzzfeednews.com average visit length per user worldwide 2022-2024
statista.com
Updated Feb 15, 2024
+ more versions
Cite
Statista (2024). Buzzfeednews.com average visit length per user worldwide 2022-2024 [Dataset]. https://www.statista.com/statistics/1477780/buzzfeednews-com-time-spent-per-visit/
Dataset updated
Feb 15, 2024
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
Apr 2022 - Jan 2024
Area covered
World
Description
In the period between its release in November 2022 and January 2024, the average duration of global visits to Buzzfeednews.com fluctuated noticeably. Despite the website's news division shutting down in April 2023, visitors worldwide spent *** seconds on average on the platform's domain in the last examined month, equating to ** minutes and ** seconds. Session length peaked in November 2023, when users worldwide spent an average of *** seconds on the site.

MIT AI news dataset

kaggle.com

zip

Updated Aug 21, 2025

Cite

Yousef Fawzi (2025). MIT AI news dataset [Dataset]. https://www.kaggle.com/datasets/losif01/mit-ai-news-dataset

Available download formats: zip (808350 bytes)

Dataset updated

Aug 21, 2025

Authors

Yousef Fawzi

License

Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically

Description

📄 Dataset Description

This dataset contains articles scraped from the Massachusetts Institute of Technology (MIT) News website, specifically focusing on topics related to Artificial Intelligence, Machine Learning, Robotics, and Emerging Technologies.

The data was collected from the MIT News topic page:
👉 https://news.mit.edu/topic/artificial-intelligence2

Each entry includes:
- Title of the article
- Author(s)
- Publication date
- Summary (dek)
- Full article body text
- URL to the original article
- Link to related research paper (e.g., Nature, Science) when available

The dataset spans multiple research domains, including:
- AI for drug discovery & healthcare
- Protein language models
- Sustainable AI and eco-driving
- Robotics and embodied intelligence
- Chemistry and materials science
- Climate and clean energy

This dataset is ideal for:
- Natural Language Processing (NLP) tasks (summarization, topic modeling, sentiment analysis)
- Trend analysis in AI and scientific research
- Text classification and information retrieval
- Educational projects and AI literacy
- Knowledge graph construction of AI research

⚠️ Important Notes

All content is copyright of MIT News and is shared under non-commercial, educational use only.
This dataset was collected respectfully, with delays between requests, in accordance with MIT’s robots.txt and ethical web scraping practices.
The full text of articles is included to enable research, but users are encouraged to cite original sources and visit the MIT News website for the latest updates.

📁 Columns

Column	Description
`title`	Article headline
`author`	Author(s) of the article
`publication_date`	Human-readable publication date
`datetime`	ISO-formatted publication timestamp
`summary`	Article summary (lead paragraph)
`body`	Full article text
`paper_link`	URL to the related research paper (e.g., Nature)
`url`	Direct link to the MIT News article
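The column table above can be exercised with a short loading sketch. The listing does not give the CSV file name or the exact timestamp format, so the two rows below are invented and the `datetime` column is assumed to hold ISO 8601 strings, as its description suggests.

```python
import csv
import io
from datetime import datetime

# Invented two-row sample that uses the documented column names; the real
# file name and timestamp format are not stated in the listing.
sample = io.StringIO(
    'title,author,publication_date,datetime,summary,body,paper_link,url\n'
    'Example A,Jane Doe,"January 5, 2025",2025-01-05T10:00:00,Dek A,Body A,,https://news.mit.edu/a\n'
    'Example B,John Roe,"March 9, 2024",2024-03-09T08:30:00,Dek B,Body B,https://example.org/paper,https://news.mit.edu/b\n'
)

articles = list(csv.DictReader(sample))

# Parse the ISO timestamp and sort the articles oldest first.
for row in articles:
    row["parsed"] = datetime.fromisoformat(row["datetime"])
articles.sort(key=lambda row: row["parsed"])

# Keep only the titles of articles that link to a research paper.
titles_in_order = [row["title"] for row in articles]
with_paper = [row["title"] for row in articles if row["paper_link"]]
print(titles_in_order, with_paper)
```

Sorting on the parsed `datetime` column rather than the human-readable `publication_date` avoids having to guess a date format.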

🔗 Source

Official Website: https://news.mit.edu
Topic Page: https://news.mit.edu/topic/artificial-intelligence2

🙌 Inspiration

Use this dataset to:
- Track how AI is being applied across scientific disciplines
- Build a news aggregator for AI research
- Train a model to predict research trends
- Create a search engine for MIT’s AI breakthroughs

✅ License

This dataset is shared under Kaggle’s Terms of Service for non-commercial, educational, and research purposes.
The original content remains the property of MIT News and should be properly attributed.

News articles and front pages from 19 Swedish news sites during the...
researchdata.se
Updated Nov 2, 2021
Cite
Peter M. Dahlgren (2021). News articles and front pages from 19 Swedish news sites during the covid-19/corona pandemic 2020–2021 [Dataset]. http://doi.org/10.5878/d18f-q220
Available download formats: (477962370), (255819)
Unique identifier
https://doi.org/10.5878/d18f-q220
Dataset updated
Nov 2, 2021
Dataset provided by
University of Gothenburg
Authors
Peter M. Dahlgren
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jan 1, 2021 - Apr 26, 2021
Area covered
Sweden
Description
This dataset contains news articles from Swedish news sites during the covid-19 corona pandemic 2020–2021. The purpose was to develop and test new methods for collection and analyses of large news corpora by computational means. In total, there are 677,151 articles collected from 19 news sites during 2020-01-01 to 2021-04-26. The articles were collected by scraping all links on the homepages and main sections of each site every two hours, day and night.

The dataset also includes about 45 million timestamps at which the articles were present on the front pages (homepages and main sections of each news site, such as domestic news, sports, editorials, etc.). This allows for detailed analysis of what articles any reader likely was exposed to when visiting a news site. The time resolution is (as stated previously) two hours, meaning that you can detect changes in which articles were on the front pages every two hours.
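As a rough illustration of how the two-hour timestamps could be used, the sketch below estimates how long each article stayed on a front page by counting the scrapes in which it appeared. The row layout (article id, timestamp) and all values are hypothetical; the actual variable names are documented in the dataset's README.

```python
from collections import Counter

SCRAPE_INTERVAL_HOURS = 2  # time resolution stated in the description

# Hypothetical observations: (article_id, timestamp of a scrape in which the
# article was present on a front page). Values are illustrative only.
observations = [
    ("a1", "2021-01-01T00:00"),
    ("a1", "2021-01-01T02:00"),
    ("a1", "2021-01-01T04:00"),
    ("a1", "2021-01-01T04:00"),  # same scrape, seen on a second section
    ("a2", "2021-01-01T02:00"),
]

# De-duplicate so an article seen on several sections in one scrape counts once.
unique_obs = set(observations)
scrapes_per_article = Counter(article_id for article_id, _ in unique_obs)

# Each sighting stands for roughly one two-hour window of front-page presence.
approx_hours = {a: n * SCRAPE_INTERVAL_HOURS for a, n in scrapes_per_article.items()}
print(approx_hours)
```

The result is only an approximation: an article may have appeared or disappeared at any point between two scrapes.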

The 19 news sites are aftonbladet.se, arbetet.se, da.se, di.se, dn.se, etc.se, expressen.se, feministisktperspektiv.se, friatider.se, gp.se, nyatider.se, nyheteridag.se, samnytt.se, samtiden.nu, svd.se, sverigesradio.se, svt.se, sydsvenskan.se and vlt.se.

Due to copyright, the full text is not available but instead transformed into a document-term matrix (in long format) which contains the frequency of all words for each article (in total, 80 million words). Each article also includes extensive metadata that was extracted from the articles themselves (URL, document title, article heading, author, publish date, edit date, language, section, tags, category) and metadata that was inferred by simple heuristic algorithms (page type, article genre, paywall).

The dataset consists of the following:

article_metadata.csv (53 MB): The file contains information about each news article, one article per row. In total, there are 677,151 observations and 17 variables.

article_text.csv (236 MB): The file contains the id of each news article and how many times (count) a specific word occurs in the news article. The file contains 80,090,784 observations and 3 variables in long format.

frontpage_timestamps.csv (175 MB): The file contains when each news article was found on the front page (homepage and main sections) of the news sites. The file contains 45,337,740 observations and 4 variables in long format.

More information about the content in the files is found in the README-file. In it you will also find the R-script for using the data.
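The long-format document-term matrix described above has one row per article and word. A minimal sketch of turning such rows back into per-article word counts and a corpus-wide frequency table is shown below. The tuple layout (article id, word, count) and the sample words are assumptions for illustration; the dataset itself ships an R script for the real files.

```python
from collections import defaultdict

# Hypothetical long-format rows: (article id, word, count). Invented values.
long_rows = [
    (1, "corona", 3),
    (1, "vaccin", 1),
    (2, "corona", 2),
    (2, "sport", 4),
]

# Rebuild one bag of words per article.
doc_term = defaultdict(dict)
for article_id, word, count in long_rows:
    doc_term[article_id][word] = count

# Aggregate word frequencies across the whole corpus.
corpus_freq = defaultdict(int)
for words in doc_term.values():
    for word, count in words.items():
        corpus_freq[word] += count

print(dict(doc_term), dict(corpus_freq))
```

With about 80 million rows in the real file, a streaming reader or a sparse-matrix library would be a more realistic choice than in-memory dictionaries.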
Most popular online news properties in Colombia 2022, by average views per...
statista.com
Updated Jun 15, 2022
Cite
Statista (2022). Most popular online news properties in Colombia 2022, by average views per visitor [Dataset]. https://www.statista.com/statistics/1251581/online-news-sites-views-per-visitor-colombia/
Dataset updated
Jun 15, 2022
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
May 2022
Area covered
Colombia
Description
In May 2022, Eltiempo.com had an average of 11 views per visitor, the highest figure among Colombia's news and information-oriented online properties with the highest number of unique users. Semana.com and Pulzo.com followed, each with an average of seven views per visitor. El Tiempo and Pulzo were also among Colombia's most popular online news brands in 2022.
bbc-news
huggingface.co
opendatalab.com
Updated Jun 28, 2022
Cite
SetFit (2022). bbc-news [Dataset]. https://huggingface.co/datasets/SetFit/bbc-news
Dataset updated
Jun 28, 2022
Dataset authored and provided by
SetFit
Description
BBC News Topic Dataset

Dataset on BBC News Topic Classification consisting of 2,225 articles published on the BBC News website during 2004-2005. Each article is labeled under one of 5 categories: business, entertainment, politics, sport or tech. Original source for this dataset:

Derek Greene, Pádraig Cunningham, “Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering,” in Proc. 23rd International Conference on Machine learning (ICML’06)… See the full description on the dataset page: https://huggingface.co/datasets/SetFit/bbc-news.
CBS News/New York Times National Surveys, 1982
icpsr.umich.edu
ascii, sas, spss
Updated Jan 12, 2006
+ more versions
Cite
Inter-university Consortium for Political and Social Research [distributor] (2006). CBS News/New York Times National Surveys, 1982 [Dataset]. http://doi.org/10.3886/ICPSR09053.v1
Available download formats: spss, ascii, sas
Unique identifier
https://doi.org/10.3886/ICPSR09053.v1
Dataset updated
Jan 12, 2006
Dataset provided by
Inter-university Consortium for Political and Social Research (https://www.icpsr.umich.edu/web/pages/)
License
https://www.icpsr.umich.edu/web/ICPSR/studies/9053/terms
Time period covered
1982
Area covered
United States
Description
This data collection is part of a continuing series of surveys that solicit public opinion on the presidency and on a range of other political and social issues. Respondents were asked to give their opinions of President Ronald Reagan and his handling of the presidency, foreign policy, and the economy, as well as their views on the Israeli-Lebanese conflict, El Salvador, and the Equal Rights Amendment. These national surveys were administered by telephone to one eligible respondent per household. The data are contained in seven files. Part 1, January 1982, includes data about the Reagan presidency and standard CBS demographic or background variables. Part 2, March 1982, contains questions on El Salvador and the policies of the Reagan Administration. Part 3, May 1982, contains questions on the nuclear freeze movement. Part 4, June 1982 (Part 1), contains a small set of background variables, and several questions about the Israeli-Lebanese conflict and Alexander Haig's resignation as Secretary of State. Part 5, June 1982 (All), contains data about the Equal Rights Amendment and women's movement. Part 6, September 1982, and Part 7, October 1982, are pre-election surveys and they include a number of questions relating to the forthcoming congressional elections, evaluation of the Reagan Administration's policies, the political parties, the impact of various issues on the elections, and the respondent's past voting behavior as well as current voting intentions. Information on demographic characteristics, such as age, sex, race, religion, income, and education, is available for each respondent.
German news headlines (politics and economics)
kaggle.com
zip
Updated Jan 7, 2022
Cite
MatthiasS (2022). German news headlines (politics and economics) [Dataset]. https://www.kaggle.com/datasets/matthiasse/german-news-headlines-politics-and-economics/discussion
Available download formats: zip (344492 bytes)
Dataset updated
Jan 7, 2022
Authors
MatthiasS
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Context

The project was started as a web-scraping exercise to get more experience, particularly with the scrapy framework. Since I check news from several sources daily, I decided to have a web scraper do the work for me and look for interesting headlines from politics and economics. The news sources have been anonymised and the licence limited to non-commercial use, since this is the prerequisite for scraping the data from those homepages.

Content

In the csv file you find around 8400 records of news headlines from 7 different sources. For each record a teaser (or sub-headline) and a headline is provided.

Acknowledgements

My thanks go to Upendra who has a great Youtube channel on webscraping (https://www.youtube.com/user/eupendras).

Inspiration

All data enthusiasts are highly welcome to use the data and make something out of it. I will try and practise topic modelling as well as translation tasks with transformer models. Any inspiration for this or comments on my notebooks (which I will publish shortly) are highly appreciated!
Market News Price Dataset
fisheries.noaa.gov
datasets.ai
+1more
Updated Aug 9, 2022
+ more versions
Cite
Northeast Fisheries Science Center (NEFSC) (2022). Market News Price Dataset [Dataset]. https://www.fisheries.noaa.gov/inport/item/26732
Dataset updated
Aug 9, 2022
Dataset provided by
Northeast Fisheries Science Center
Authors
Northeast Fisheries Science Center (NEFSC)
Time period covered
Jul 1, 2012 - Nov 22, 2125
Area covered
New York, New England, Gloucester, MA, Portland, ME, New Bedford, MA
Description
Real-time price data collected by the Boston Market News Reporter. The NOAA Fisheries' "Fishery Market News" began operations in New York City on February 14, 1938. The primary function of this joint Federal/industry program is to provide accurate and unbiased reports depicting current conditions affecting the trade in fish and fishery products. The Boston and New York Market News Reports are...
multinews_dense_oracle
huggingface.co
Updated Feb 23, 2023
+ more versions
Cite
Ai2 (2023). multinews_dense_oracle [Dataset]. https://huggingface.co/datasets/allenai/multinews_dense_oracle
Dataset updated
Feb 23, 2023
Dataset provided by
Allen Institute for AI (http://allenai.org/)
Authors
Ai2
License
https://choosealicense.com/licenses/other/
Description
This is a copy of the Multi-News dataset, except the input source documents of the train, validation, and test splits have been replaced with documents retrieved by a dense retriever. The retrieval pipeline used:

query: The summary field of each example
corpus: The union of all documents in the train, validation and test splits
retriever: facebook/contriever-msmarco via PyTerrier with default settings
top-k strategy: "oracle", i.e. the number of documents retrieved, k, is set as the original number of input documents… See the full description on the dataset page: https://huggingface.co/datasets/allenai/multinews_dense_oracle.
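The "oracle" top-k strategy can be illustrated with a toy retriever in which k is fixed to the number of documents the example originally had. The sketch below swaps in a simple word-overlap score in place of the dense retriever named above (facebook/contriever-msmarco via PyTerrier); the function names, corpus, and summary are all hypothetical.

```python
def overlap_score(query, document):
    """Toy relevance score: number of distinct query words found in the document."""
    return len(set(query.lower().split()) & set(document.lower().split()))


def retrieve_oracle(query, corpus, original_num_docs):
    """Return the best-scoring documents, with k set to the original input count."""
    ranked = sorted(corpus, key=lambda doc: overlap_score(query, doc), reverse=True)
    return ranked[:original_num_docs]


# Invented corpus standing in for the union of all source documents.
corpus = [
    "storm hits coastal town",
    "election results announced",
    "coastal town rebuilds after storm",
    "new phone released",
]

# The query is the example's summary; this example originally had 2 input documents.
summary = "a storm damaged a coastal town"
retrieved = retrieve_oracle(summary, corpus, original_num_docs=2)
print(retrieved)
```

Fixing k to the original document count keeps the retrieved input the same size as the original one, so any change in downstream summarization quality reflects which documents were retrieved rather than how many.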
ag_news_test
huggingface.co
+ more versions
Cite
SZ, ag_news_test [Dataset]. https://huggingface.co/datasets/szhuggingface/ag_news_test
Authors
SZ
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Stratified and Nested Subsets of AG News for Performance Benchmarking

Dataset Summary

This repository contains stratified and progressively smaller, nested subsets of the AG News dataset. It was specifically created to benchmark the performance (e.g., accuracy, training time, and resource usage) of fine-tuning language models on varying amounts of training data. By using stratified samples, each training subset maintains the original class distribution of the AG News… See the full description on the dataset page: https://huggingface.co/datasets/szhuggingface/ag_news_test.
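One way to build stratified, nested subsets of the kind described is to shuffle each class once and take prefixes of that ordering. The sketch below is an illustration under that assumption, not the repository's actual procedure; the function name, fractions, and toy labels are hypothetical.

```python
import random
from collections import defaultdict


def nested_stratified_subsets(labels, fractions, seed=0):
    """Return {fraction: sorted indices}; smaller subsets sit inside larger ones."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    # Shuffle each class once. Every subset takes a prefix of the same ordering,
    # which is what makes the subsets nested.
    for indices in by_class.values():
        rng.shuffle(indices)
    subsets = {}
    for fraction in fractions:
        chosen = []
        for indices in by_class.values():
            # Taking the same fraction of every class preserves the class distribution.
            chosen.extend(indices[: round(len(indices) * fraction)])
        subsets[fraction] = sorted(chosen)
    return subsets


# 40 toy examples in 4 balanced classes (AG News also has four classes).
labels = [0, 1, 2, 3] * 10
subsets = nested_stratified_subsets(labels, fractions=[0.5, 0.2])
print(len(subsets[0.5]), len(subsets[0.2]))
```

Because the smaller subset is always contained in the larger one, differences between training runs can be attributed to the added examples rather than to a different random sample.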
news-categories
huggingface.co
Updated Oct 1, 2025
Cite
Momentum AI (2025). news-categories [Dataset]. https://huggingface.co/datasets/momentum-lab/news-categories
Dataset updated
Oct 1, 2025
Authors
Momentum AI
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
English News Headline Dataset

Overview

This dataset contains 50,000 English news headlines categorized into 10 topical classes, designed for text classification and NLP studies such as news topic modeling, transfer learning, and zero‑shot evaluation. Each record includes:

title: news headline text
topic: one of ten predefined categories
genre: one of four predefined descriptors of the story style (e.g., Informational, Analysis)
source: media outlet name
date:… See the full description on the dataset page: https://huggingface.co/datasets/momentum-lab/news-categories.
ABC News Panama Poll #1, December 1989
icpsr.umich.edu
ascii, sas, spss +1
Updated Jul 3, 2007
+ more versions
Cite
ABC News (2007). ABC News Panama Poll #1, December 1989 [Dataset]. http://doi.org/10.3886/ICPSR09433.v1
Available download formats: spss, stata, sas, ascii
Unique identifier
https://doi.org/10.3886/ICPSR09433.v1
Dataset updated
Jul 3, 2007
Dataset provided by
Inter-university Consortium for Political and Social Research (https://www.icpsr.umich.edu/web/pages/)
Authors
ABC News
License
https://www.icpsr.umich.edu/web/ICPSR/studies/9433/terms
Time period covered
Dec 20, 1989
Area covered
United States
Description
This survey focused on the United States military action in Panama. Respondents were asked if they approved of the way President George Bush was handling the situation, if they approved of the United States' having sent military forces to overthrow Manuel Noriega, if they would still approve if the action resulted in a large number of Panamanian civilian casualties, if the reasons Bush had given for invading Panama were good enough to warrant the action, and if sending military forces into Panama to overthrow Noriega was legal under United States law. Other topics covered include comparisons to Viet Nam, using similar military action in Nicaragua, the level of danger to Americans in Panama, Bush's trip to Colombia to discuss the drug problem, and if the action affected the respondents' feelings of pride in the United States. Background information on respondents includes political alignment, age, sex, and state/region of residence.
Ten Thousand German News Articles Dataset
tblock.github.io
kaggle.com
csv
Updated Mar 5, 2019
Cite
T. Block (2019). Ten Thousand German News Articles Dataset [Dataset]. https://tblock.github.io/10kGNAD/
Available download formats: csv
Dataset updated
Mar 5, 2019
Authors
T. Block
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
10kGNAD - A German topic classification dataset. Visit the dataset page for more information: https://tblock.github.io/10kGNAD/
News Articles
kaggle.com
zip
Updated May 6, 2018
Cite
harishaaram (2018). News Articles [Dataset]. https://www.kaggle.com/harishcscode/all-news-articles-from-home-page-media-house
Available download formats: zip (327948548 bytes)
Dataset updated
May 6, 2018
Authors
harishaaram
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

The data were collected from the home pages of various media houses to see which news outlets share or write articles with fewer gory words.

Content

The data were downloaded from the following websites between Oct 2017 and Nov 2017:

1. "http://www.nytimes.com/"
2. "http://www.foxnews.com/"
3. "http://www.reuters.com/"
4. "http://www.cnn.com/"
5. "http://www.huffingtonpost.com/"

Each folder is named using the mmddyyyy convention, and each CSV file has the media house name as its file name (e.g., reuters.csv). The CSV has the following columns:

TITLE: the Title of the article.

SUMMARY: first few lines of the article's text.

TEXT: Full text inside the article

URL: web link to the article.

KEYWORDS: important words in the article.
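The folder and file naming convention above can be decoded with a few lines. The sketch below assumes a relative path of the form `<mmddyyyy>/<media house>.csv`; the example path, the helper name, and the sample row (including the comma-separated KEYWORDS value) are invented for illustration.

```python
import csv
import io
from datetime import datetime
from pathlib import PurePosixPath


def parse_snapshot_path(path):
    """Split '<mmddyyyy>/<media house>.csv' into (date, media house name)."""
    p = PurePosixPath(path)
    snapshot_date = datetime.strptime(p.parent.name, "%m%d%Y").date()
    return snapshot_date, p.stem


# Hypothetical path that follows the naming convention described above.
snapshot_date, outlet = parse_snapshot_path("10152017/reuters.csv")

# Invented one-row sample with the five documented columns.
sample = io.StringIO(
    'TITLE,SUMMARY,TEXT,URL,KEYWORDS\n'
    'Example headline,First lines,Full text,http://www.reuters.com/x,"markets, trade"\n'
)
rows = list(csv.DictReader(sample))
print(snapshot_date.isoformat(), outlet, rows[0]["KEYWORDS"])
```

Converting the folder name to a real date early makes it easier to sort snapshots chronologically, which the mmddyyyy string order does not do on its own.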

Acknowledgements

This dataset is under CC0: public domain license.

Inspiration

All around the world both good and bad things happen, and we get to know only those that are exposed to us. Exposing them is the primary responsibility of the media. But the bigger responsibility of these media houses is the way in which they express the content to the people.

A responsible media house’s content should be original, unbiased, free of exaggeration and very sensitive in handling the emotions of its readers and viewers. The same story can be told in different ways, and these different ways can trigger different emotions among its readers.

It is known that we become who we are by what we say and what we read. Reading a story that’s filled with positive words would make us feel more positive and vice versa. So the wording of a piece of content plays as important a role as the content itself.

This dataset serves as a sample for finding out which media house conveys the news in a more optimistic way!
AllSides : Ratings of bias in electronic media
kaggle.com
zip
Updated Sep 23, 2021
Cite
Supratim Haldar (2021). AllSides : Ratings of bias in electronic media [Dataset]. https://www.kaggle.com/datasets/supratimhaldar/allsides-ratings-of-bias-in-electronic-media
Available download formats: zip (32548 bytes)
Dataset updated
Sep 23, 2021
Authors
Supratim Haldar
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Context

Media is the 4th pillar of democracy, so it must perform its duty with integrity. While the majority does so, news articles very often get contaminated with the personal perspectives of the journalists authoring them, or the beliefs of the people running those media houses. As per the Wikipedia definition, media bias is the bias or perceived bias of journalists and news producers within the mass media in the selection of events and stories that are reported and how they are covered.

Content

https://www.allsides.com does a wonderful job of analyzing the bias of renowned media houses and showing how the same news story is presented from completely different perspectives by different media publications. Based on this analysis, each media publication is assigned a "bias" direction (left, right or neutral). Members of the public can vote to say whether they agree with the rating. The details are captured at https://www.allsides.com/media-bias/media-bias-ratings and are constantly updated based on new votes. The content of this dataset is scraped from that page and the pages that follow it.

Acknowledgements

https://www.allsides.com is the owner of this data and holds all rights to it. Many thanks to them for their effort!

Inspiration

A deeper analysis can reveal which side most media houses lean towards. The analysis can be extended by comparing news articles on the same event from different media publications and, as a final step, by building a classifier that estimates the bias of any article on the internet from its text alone. This might also help in the fight against fake news.

AllSides would love to see any work that draws insightful information from this data. Please feel free to share your work with AllSides (https://www.allsides.com/contact).

Licenses and Attribution

AllSides Media Bias Ratings by AllSides.com are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. You may use this data for research or noncommercial purposes provided you include this attribution.

For commercial use, or to request this data as a CSV or JSON file, go to www.allsides.com/contact.
Top 100 YouTube Channels - News & Politics Category
vidiq.com
Updated May 8, 2023
+ more versions
Cite
vidIQ (2023). Top 100 YouTube Channels - News & Politics Category [Dataset]. https://vidiq.com/youtube-stats/top/category/news/
Explore at:
Dataset updated
May 8, 2023
Dataset authored and provided by
vidIQ
Time period covered
Dec 2, 2025
Area covered
YouTube, Worldwide
Variables measured
rank, subscribers, total views, video count
Description
Comprehensive ranking dataset of the top 100 YouTube channels in the News & Politics category. This dataset features 100 channels with detailed statistics including subscriber counts, total video views, video count, and global rankings. The leading channel has 74,400,000 subscribers and 42,602,103,612 total views. Each entry includes comprehensive metrics to analyze channel performance, growth trends, and competitive positioning. This dataset is regularly updated to reflect the latest YouTube channel statistics and ranking changes, providing valuable insights for content creators, marketers, and researchers analyzing YouTube ecosystem trends and channel performance benchmarks.
Leading websites worldwide 2025, by monthly visits
statista.com
boostndoto.org
Updated Oct 29, 2025
+ more versions
Cite
Statista (2025). Leading websites worldwide 2025, by monthly visits [Dataset]. https://www.statista.com/statistics/1201880/most-visited-websites-worldwide/
Explore at:
Dataset updated
Oct 29, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
Aug 2025
Area covered
Worldwide
Description
In August 2025, Google.com was the most visited website worldwide, with an average of 98.2 billion monthly visits. The platform has maintained its leading position since June 2010, when it surpassed Yahoo to take first place. YouTube ranked second during the same period, recording over 48 billion monthly visits.

The internet leaders: search, social, and e-commerce

Social networks, search engines, and e-commerce websites shape the online experience as we know it. While Google leads the global online search market by far, YouTube and Facebook have become the world’s most popular websites for user-generated content, solidifying Alphabet’s and Meta’s leadership over the online landscape. Meanwhile, websites such as Amazon and eBay generate millions in profits from the sale and distribution of goods, making the e-market sector an integral part of the global retail scene.

What is next for online content?

User-generated content powers social media and websites like Reddit and Wikipedia, and it keeps the internet’s engines moving. However, the rise of generative artificial intelligence will bring significant changes to how online content is produced and handled. ChatGPT is already transforming how online search is performed, and news of Google's 2024 deal to license Reddit content for training large language models (LLMs) signals that the internet is likely to go through a new revolution. While AI's impact on the online market might bring both opportunities and challenges, effective content management will remain crucial for profitability on the web.
multinews_sparse_max
huggingface.co
Updated Jan 27, 2023
+ more versions
Cite
Ai2 (2023). multinews_sparse_max [Dataset]. https://huggingface.co/datasets/allenai/multinews_sparse_max
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Jan 27, 2023
Dataset provided by
Allen Institute for AI (http://allenai.org/)
Authors
Ai2
License
https://choosealicense.com/licenses/other/
Description
This is a copy of the Multi-News dataset, except the input source documents of its test split have been replaced by a sparse retriever. The retrieval pipeline used:

query: The summary field of each example
corpus: The union of all documents in the train, validation and test splits
retriever: BM25 via PyTerrier with default settings
top-k strategy: "max", i.e. the number of documents retrieved, k, is set as the maximum number of documents seen across examples in this dataset, in this case…

See the full description on the dataset page: https://huggingface.co/datasets/allenai/multinews_sparse_max.
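The retrieval pipeline described above can be illustrated with a minimal sketch. This is not the dataset authors' code and does not use PyTerrier. It is a small pure-Python BM25 (the Lucene-style variant with a non-negative IDF) written only to show the idea of scoring a corpus against a summary query and keeping the top k documents. The function names, the whitespace tokenizer, the `k1`/`b` values, and the `"documents"` field used by `max_k` are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption; real
    # retrieval toolkits apply stemming, stopword removal, etc.).
    return text.lower().split()


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in a non-empty corpus against the query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    query_terms = tokenize(query)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Non-negative IDF variant.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            denom = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


def retrieve(query, docs, k):
    """Return the indices of the k highest-scoring documents."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def max_k(examples):
    # "max" top-k strategy: k is the largest number of source documents
    # attached to any example. The "documents" key is hypothetical.
    return max(len(ex["documents"]) for ex in examples)
```

In this sketch, each example's summary would be passed as `query`, the union of all source documents as `docs`, and the value returned by `max_k` as `k`.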
