The statistic shows the leading news websites in the Philippines as of May 2019, ranked by daily pageviews per visitor. Punch.dagupan.com ranked first with 3.7 daily pageviews per visitor, followed by Sunstar.com.ph with 2.87 daily pageviews per visitor.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Author: Víctor Yeste, Universitat Politècnica de València.

The object of this study is the design of a cybermetric methodology that measures the success of content published in online media and explores the possible prediction of the selected success variables.

Because data from two separate areas had to be integrated (web publishing, and the analysis of shares and related topics on Twitter), the data were collected programmatically through both the Google Analytics v4 reporting API and the Twitter Standard API, always respecting the limits of each.

The website analyzed is hellofriki.com, an online media outlet whose primary aim is to meet the need for information on topics that generate a large volume of daily news. It publishes news items as well as analyses, reports, interviews, and many other formats. All of this content falls under the sections of cinema, series, video games, literature, and comics.

This dataset has contributed to the elaboration of the PhD thesis:
Yeste Moreno, VM. (2021). Diseño de una metodología cibermétrica de cálculo del éxito para la optimización de contenidos web [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/176009

Data have been obtained from each last-minute news article published online according to the indicators described in the doctoral thesis.
All related data are stored in a database, divided into the following tables:

tesis_followers: user ID list of media account followers.

tesis_hometimeline: data from tweets posted by the media account sharing breaking news from the web.
  status_id: tweet ID
  created_at: date of publication
  text: content of the tweet
  path: URL extracted after processing the shortened URL in text
  post_shared: ID of the WordPress article being shared
  retweet_count: number of retweets
  favorite_count: number of favorites

tesis_hometimeline_other: data from tweets posted by the media account that do not share breaking news from the web (other typologies, automatic Facebook shares, custom tweets without a link to an article, etc.). It has the same fields as tesis_hometimeline.

tesis_posts: data of articles published by the web and processed for some analysis.
  stats_id: analysis ID
  post_id: article ID in WordPress
  post_date: article publication date in WordPress
  post_title: title of the article
  path: URL of the article on the media website
  tags: IDs of the WordPress tags related to the article
  uniquepageviews: unique page views
  entrancerate: entrance rate
  avgtimeonpage: average time on page
  exitrate: exit rate
  pageviewspersession: page views per session
  adsense_adunitsviewed: number of ads viewed by users
  adsense_viewableimpressionpercent: ad display ratio
  adsense_ctr: ad click ratio
  adsense_ecpm: estimated ad revenue per 1000 page views

tesis_stats: data from a particular analysis, performed at each published breaking news item. Fields with statistical values can be computed from the data in the other tables, but total and average calculations are saved for faster and easier further processing.
  id: ID of the analysis
  phase: phase of the thesis in which the analysis was carried out (currently always 1)
  time: "0" if at the time of publication, "1" if 14 days later
  start_date: date and time of measurement on the day of publication
  end_date: date and time of the measurement made 14 days later
  main_post_id: ID of the published article to be analyzed
  main_post_theme: main section of the published article to be analyzed
  superheroes_theme: "1" if about superheroes, "0" if not
  trailer_theme: "1" if trailer, "0" if not
  name: empty field, available for adding a custom name manually
  notes: empty field, available for adding custom notes manually, e.g. when a tag was removed by hand for being too generic even though the editor had added it
  num_articles: number of articles analyzed
  num_articles_with_traffic: number of articles analyzed with traffic (which are taken into account for traffic analysis)
  num_articles_with_tw_data: number of articles with data from when they were shared on the media's Twitter account
  num_terms: number of terms analyzed
  uniquepageviews_total: total unique page views
  uniquepageviews_mean: average unique page views
  entrancerate_mean: average entrance rate
  avgtimeonpage_mean: average duration of visits
  exitrate_mean: average exit rate
  pageviewspersession_mean: average page views per session
  total: total of ads viewed
  adsense_adunitsviewed_mean: average of ads viewed
  adsense_viewableimpressionpercent_mean: average ad display ratio
  adsense_ctr_mean: average ad click ratio
  adsense_ecpm_mean: estimated ad revenue per 1000 page views
  Total: total income
  retweet_count_mean: average income
  favorite_count_total: total of favorites
  favorite_count_mean: average of favorites
  terms_ini_num_tweets: total tweets on the terms on the day of publication
  terms_ini_retweet_count_total: total retweets on the terms on the day of publication
  terms_ini_retweet_count_mean: average retweets on the terms on the day of publication
  terms_ini_favorite_count_total: total of favorites on the terms on the day of publication
  terms_ini_favorite_count_mean: average of favorites on the terms on the day of publication
  terms_ini_followers_talking_rate: ratio of followers of the media Twitter account who have recently published a tweet talking about the terms on the day of publication
  terms_ini_user_num_followers_mean: average followers of users who have spoken of the terms on the day of publication
  terms_ini_user_num_tweets_mean: average number of tweets published by users who spoke about the terms on the day of publication
  terms_ini_user_age_mean: average age in days of users who have spoken of the terms on the day of publication
  terms_ini_ur_inclusion_rate: URL inclusion ratio of tweets talking about the terms on the day of publication
  terms_end_num_tweets: total tweets on the terms 14 days after publication
  terms_end_retweet_count_total: total retweets on the terms 14 days after publication
  terms_end_retweet_count_mean: average retweets on the terms 14 days after publication
  terms_end_favorite_count_total: total of favorites on the terms 14 days after publication
  terms_end_favorite_count_mean: average of favorites on the terms 14 days after publication
  terms_end_followers_talking_rate: ratio of followers of the media Twitter account who have recently published a tweet talking about the terms 14 days after publication
  terms_end_user_num_followers_mean: average followers of users who have spoken of the terms 14 days after publication
  terms_end_user_num_tweets_mean: average number of tweets published by users who have spoken about the terms 14 days after publication
  terms_end_user_age_mean: average age in days of users who have spoken of the terms 14 days after publication
  terms_end_ur_inclusion_rate: URL inclusion ratio of tweets talking about the terms 14 days after publication

tesis_terms: data of the terms (tags) related to the processed articles.
  stats_id: analysis ID
  time: "0" if at the time of publication, "1" if 14 days later
  term_id: term ID (tag) in WordPress
  name: name of the term
  slug: URL of the term
  num_tweets: number of tweets
  retweet_count_total: total retweets
  retweet_count_mean: average retweets
  favorite_count_total: total of favorites
  favorite_count_mean: average of favorites
  followers_talking_rate: ratio of followers of the media Twitter account who have recently published a tweet talking about the term
  user_num_followers_mean: average followers of users who were talking about the term
  user_num_tweets_mean: average number of tweets published by users who were talking about the term
  user_age_mean: average age in days of users who were talking about the term
  url_inclusion_rate: URL inclusion ratio
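The description of tesis_stats notes that its totals and averages can be recomputed from the other tables. A minimal sketch of that idea follows, using an in-memory SQLite database. The reduced schema, the sample rows, and the choice to average over all articles (rather than only those with traffic) are assumptions for illustration, not taken from the thesis.

```python
import sqlite3

# Hypothetical, reduced schema: only three of the tesis_posts fields
# listed above are modelled, and the column types are assumed.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tesis_posts ("
    "stats_id INTEGER, post_id INTEGER, uniquepageviews INTEGER)"
)
# Made-up rows for two analyses (stats_id 1 and 2).
conn.executemany(
    "INSERT INTO tesis_posts VALUES (?, ?, ?)",
    [(1, 10, 120), (1, 11, 80), (1, 12, 0), (2, 20, 50)],
)


def summarize(connection, stats_id):
    """Recompute article counts and page-view aggregates for one analysis."""
    row = connection.execute(
        "SELECT COUNT(*), "
        "SUM(uniquepageviews > 0), "  # comparison yields 0/1 in SQLite
        "SUM(uniquepageviews), "
        "AVG(uniquepageviews) "
        "FROM tesis_posts WHERE stats_id = ?",
        (stats_id,),
    ).fetchone()
    return {
        "num_articles": row[0],
        "num_articles_with_traffic": row[1],
        "uniquepageviews_total": row[2],
        # Assumption: the mean is taken over all analysed articles.
        "uniquepageviews_mean": row[3],
    }


summary = summarize(conn, 1)
print(summary)
```

Storing these aggregates in tesis_stats, as the dataset does, avoids rerunning such queries for every analysis.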
Between September and November 2024, google.com was the most visited website in Hong Kong with 338 million average monthly visits. In terms of monthly traffic and pages per visit, international news website Yahoo.com ranked higher than the local news website hk01.com.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This large dataset of user interaction logs (page views) from a news portal was kindly provided by Globo.com, the most popular news portal in Brazil, for reproducibility of the experiments with CHAMELEON, a meta-architecture for contextual hybrid session-based news recommender systems. The source code was made available on GitHub.
The first version (v1) of this dataset was released for reproducibility of the experiments presented in the following paper:
Gabriel de Souza Pereira Moreira, Felipe Ferreira, and Adilson Marques da Cunha. 2018. News Session-Based Recommendations using Deep Neural Networks. In 3rd Workshop on Deep Learning for Recommender Systems (DLRS 2018), October 6, 2018, Vancouver, BC, Canada. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3270323.3270328
A second version (v2) of this dataset was made available for reproducibility of the experiments presented in the following paper. Compared to v1, the only differences are:
Gabriel de Souza Pereira Moreira, Dietmar Jannach, and Adilson Marques da Cunha. 2019. Contextual Hybrid Session-based News Recommendation with Recurrent Neural Networks. arXiv preprint arXiv:1904.10367, 49 pages
You are not allowed to use this dataset for commercial purposes, only with academic objectives (like education or research). If used for research, please cite the above papers.
The dataset contains a sample of user interactions (page views) on the G1 news portal from Oct. 1 to 16, 2017, including about 3 million clicks, distributed in more than 1 million sessions from 314,000 users who read more than 46,000 different news articles during that period.
It is composed of three files/folders:
I would like to acknowledge Globo.com for providing this dataset for this research and for the academic community, and especially Felipe Ferreira for preparing the original dataset at Globo.com.
This dataset might be very useful if you want to implement and evaluate hybrid and contextual news recommender systems, using both user interactions and article content and metadata to provide recommendations. You might also use it for analytics, trying to understand how user interactions in a news portal are distributed by user, by article, or by category, for example.
If you are interested in a dataset of user interactions on articles with the full text provided, to experiment with different text representations using NLP, you might want to take a look at this smaller dataset.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains news articles from Swedish news sites during the covid-19 corona pandemic 2020–2021. The purpose was to develop and test new methods for collection and analyses of large news corpora by computational means. In total, there are 677,151 articles collected from 19 news sites during 2020-01-01 to 2021-04-26. The articles were collected by scraping all links on the homepages and main sections of each site every two hours, day and night.
The dataset also includes about 45 million timestamps at which the articles were present on the front pages (homepages and main sections of each news site, such as domestic news, sports, editorials, etc.). This allows for detailed analysis of what articles any reader likely was exposed to when visiting a news site. The time resolution is (as stated previously) two hours, meaning that you can detect changes in which articles were on the front pages every two hours.
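Given the two-hour resolution, one rough way to estimate how long an article stayed on the front pages is to count the distinct scrape times at which it was seen and multiply by two hours. The sketch below illustrates this on made-up observations. The (article id, timestamp) pair layout is an assumption and not the documented structure of the timestamp file.

```python
from collections import Counter

SCRAPE_INTERVAL_HOURS = 2  # time resolution stated in the description


def approx_hours_on_front_page(observations):
    """Estimate front-page exposure per article.

    `observations` is an iterable of (article_id, timestamp) pairs, one per
    sighting. Duplicates (the same article seen on several pages during the
    same scrape) are collapsed before counting.
    """
    distinct = set(observations)
    scrapes_seen = Counter(article_id for article_id, _ in distinct)
    return {
        article_id: count * SCRAPE_INTERVAL_HOURS
        for article_id, count in scrapes_seen.items()
    }


# Made-up sightings; identifiers and timestamps are purely illustrative.
sightings = [
    ("a1", "2020-03-01T00:00"),
    ("a1", "2020-03-01T02:00"),
    ("a1", "2020-03-01T02:00"),  # same scrape, second page
    ("a2", "2020-03-01T02:00"),
]
exposure = approx_hours_on_front_page(sightings)
print(exposure)
```

The result is an upper-bound style approximation: an article seen at a scrape may have appeared or disappeared at any point within the surrounding two-hour window.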
The 19 news sites are aftonbladet.se, arbetet.se, da.se, di.se, dn.se, etc.se, expressen.se, feministisktperspektiv.se, friatider.se, gp.se, nyatider.se, nyheteridag.se, samnytt.se, samtiden.nu, svd.se, sverigesradio.se, svt.se, sydsvenskan.se and vlt.se.
Due to copyright, the full text is not available but instead transformed into a document-term matrix (in long format) which contains the frequency of all words for each article (in total, 80 million words). Each article also includes extensive metadata that was extracted from the articles themselves (URL, document title, article heading, author, publish date, edit date, language, section, tags, category) and metadata that was inferred by simple heuristic algorithms (page type, article genre, paywall).
The dataset consists of the following: article_metadata.csv (53 MB): The file contains information about each news article, one article per row. In total, there are 677,151 observations and 17 variables.
article_text.csv (236 MB): The file contains the id of each news article and how many times (count) a specific word occurs in the news article. The file contains 80,090,784 observations and 3 variables in long format.
frontpage_timestamps.csv (175 MB): The file contains when each news article was found on the front page (homepage and main sections) of the news sites. The file contains 45,337,740 observations and 4 variables in long format.
More information about the content in the files is found in the README-file. In it you will also find the R-script for using the data.
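As an alternative to the R script, the long-format document-term matrix can also be regrouped into per-article word counts in Python. The sketch below runs on a small made-up sample. The column names (id, word, count) are assumptions based on the description above and should be checked against the README.

```python
import csv
import io
from collections import Counter, defaultdict

# Made-up sample in long format: one row per (article, word) pair.
# Column names are assumed, not taken from the README.
SAMPLE = """id,word,count
a1,pandemi,3
a1,vaccin,1
a2,vaccin,2
a2,sport,4
"""


def load_term_counts(handle):
    """Group long-format rows into one word-count Counter per article id."""
    per_article = defaultdict(Counter)
    for row in csv.DictReader(handle):
        per_article[row["id"]][row["word"]] += int(row["count"])
    return per_article


counts = load_term_counts(io.StringIO(SAMPLE))

# Corpus-wide word frequencies are the sum of the per-article counters.
corpus_total = sum(counts.values(), Counter())
print(counts["a1"].most_common(1), corpus_total["vaccin"])
```

For the full 80-million-row file, streaming the rows as above avoids holding the raw CSV text in memory, although the resulting counters can still be large.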
In May 2022, Eltiempo.com had an average of 11 views per visitor, the highest figure among Colombia's news and information-oriented online properties with the highest number of unique users. Semana.com and Pulzo.com followed, each with an average of seven views per visitor. El Tiempo and Pulzo were also among Colombia's most popular online news brands in 2022.
In the period between its release in November 2022 and January 2024, Buzzfeednews.com saw the average duration of global visits to its web domain fluctuate noticeably. Despite the website's news division shutting down in April 2023, visitors worldwide spent *** seconds on average in the platform's domain in the last examined month, equating to ** minutes and ** seconds. The news website's session length peaked in November 2023, when users worldwide spent an average of *** seconds on the web page.
For more information on CDC.gov metrics please see http://www.cdc.gov/metrics/
BBC News Topic Dataset
Dataset on BBC News Topic Classification consisting of 2,225 articles published on the BBC News website during 2004-2005. Each article is labeled under one of 5 categories: business, entertainment, politics, sport or tech. Original source for this dataset:
Derek Greene, Pádraig Cunningham, “Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering,” in Proc. 23rd International Conference on Machine learning (ICML’06)… See the full description on the dataset page: https://huggingface.co/datasets/SetFit/bbc-news.
https://www.icpsr.umich.edu/web/ICPSR/studies/4494/terms
This special topic poll, fielded May 6-8, 1997, is part of a continuing series of monthly surveys that solicit public opinion on the presidency and on a range of other political and social issues. Respondents were asked to give their opinions of President Bill Clinton and his handling of the presidency. Views were sought on the events surrounding the 1996 Democratic fundraising activities and the White House's involvement in them, whether President Clinton and Vice President Gore did anything wrong or illegal, and whether Congress should investigate the matter. Respondents gave their opinions of Vice President Al Gore, Secretary of State Madeleine Albright, Speaker of the House Newt Gingrich, and how well members of the United States Congress were handling their jobs. Several questions asked how satisfied respondents were with their job, whether it was their dream job, and if not, what their dream job would be. Other questions addressed whether lying and keeping secrets was ever justified, how often respondents lied to others and were lied to, and their ability to tell a lie and detect when others were lying. Additional topics addressed the most important quality in a doctor, how concerned respondents were about germs, whether tobacco companies were telling the truth about the health risks of smoking, and whether they should be held legally responsible for smoking-related illness and deaths. Information was also collected on whether respondents smoked, whether they had a child in the ninth grade, and whether they identified themselves as multiracial. 
Demographic variables include sex, race, age, household income, education level, employment status, occupation, religious preference, frequency of religious attendance, political party affiliation, political philosophy, voter participation history and registration status, length of time living at current residence, the presence of children and teenagers in the household, and type of residential area (e.g., urban or rural).
https://www.icpsr.umich.edu/web/ICPSR/studies/2924/terms
This poll, fielded February 6-9, 2000, is part of an ongoing series of monthly surveys that solicit public opinion on the presidency and on a range of other political and social issues. Respondents were asked to give their opinions of President Bill Clinton and his handling of the presidency, foreign policy, and the economy, as well as their views on the 2000 presidential election. Other survey questions elicited opinions on government representation at the national and local levels, what the single most important problem for the government was, and whether respondents would vote for a Democrat or a Republican if voting for a House Representative today. Respondents were asked if they had favorable or unfavorable opinions of Texas governor George Bush, former New Jersey senator Bill Bradley, Arizona senator John McCain, publisher Steve Forbes, conservative commentator Pat Buchanan, and talk show host Alan Keyes. Other questions asked if respondents were following the presidential campaign, if they would vote Democratic or Republican if they were voting today, if they would be voting in a caucus, whom they would vote for and if that was their final decision, and out of various possible Democrat/Republican pairings, which of the two they would vote for in a presidential election (e.g., McCain vs. Gore, Bush vs. Bradley, etc.). Another focus of this poll was race relations and the role of national and local government in addressing this issue. Questions probed respondents' knowledge of American Black history and to what degree public schools teach Black history, who the most important Black role model was, whether computers and the Internet would improve opportunities for Blacks, and whether respondents viewed the following organizations and persons favorably: the National Association for the Advancement of Colored People (NAACP), the Nation of Islam, General Colin Powell, Jesse Jackson, and Nation of Islam leader Louis Farrakhan. 
Respondents were asked whether America was ready for a Black president, whether they would vote for a party-nominated Black presidential candidate, which political party was more likely to try to fix race relations, what they thought about South Carolina's flying the Confederate flag, and whether the presidential candidates should express their opinions on this issue. Questions were asked regarding race equality, race relations in the respondent's community, how respondents viewed the existence and persistence of racial discrimination, whether it could be cured, and how important this issue was to the future of the United States. Respondents were asked additional questions concerning race relations, including whether the government should try to improve race relations, whether it was paying the appropriate amount of attention to the needs of minorities, and if the criminal justice system was racially biased. Further questions addressed attitudes and behaviors of the police toward individuals and minorities, including the use of inappropriate language, respectful behavior, and "racial profiling," and if police were considered friends or enemies. Regarding personal experiences with racism and perceptions of its relevance in society, respondents were asked how many Black people they worked with, how many lived in their neighborhoods, attended local public schools, and shopped at their stores, whether respondents made a point to patronize minority-owned businesses, how respondents perceived the number of white people who disliked Blacks and vice versa, and if the respondent had ever felt discriminated against and why. In regard to racism in the workplace, questions were asked to gauge respondents' opinions of affirmative action, personal experience with discrimination on the job or in trying to obtain a job, and whether successful Blacks had a duty to help other Blacks. 
Respondents were also asked if there were adequate numbers of Blacks employed as teachers, professional sports players, businesspersons who owned large companies, medical doctors, and coaches and team executives. Also asked were questions about opportunities to succeed in today's world as compared to the respondents' parents' generation and future generations. A set of questions was asked to assess perceptions of the poor in America, including whether respondents believed being poor was the result of a lack of effort or circumstances,
In November 2024, Google.com was the most popular website worldwide with 136 billion average monthly visits. The online platform has held the top spot as the most popular website since June 2010, when it pulled ahead of Yahoo into first place. Second-ranked YouTube generated more than 72.8 billion monthly visits in the measured period.

The internet leaders: search, social, and e-commerce
Social networks, search engines, and e-commerce websites shape the online experience as we know it. While Google leads the global online search market by far, YouTube and Facebook have become the world's most popular websites for user-generated content, solidifying Alphabet's and Meta's leadership over the online landscape. Meanwhile, websites such as Amazon and eBay generate millions in profits from the sale and distribution of goods, making the e-market sector an integral part of the global retail scene.

What is next for online content?
Powering social media and websites like Reddit and Wikipedia, user-generated content keeps moving the internet's engines. However, the rise of generative artificial intelligence will bring significant changes to how online content is produced and handled. ChatGPT is already transforming how online search is performed, and news of Google's 2024 deal for licensing Reddit content to train large language models (LLMs) signals that the internet is likely to go through a new revolution. While AI's impact on the online market might bring both opportunities and challenges, effective content management will remain crucial for profitability on the web.
This dataset is derived from the Global News Dataset. Please refer to the original source (also cited below) and ensure that your use complies with its terms and conditions.
Webz.io News Dataset Repository
Introduction
Welcome to the Webz.io News Dataset Repository! This repository is created by Webz.io and is dedicated to providing free datasets of publicly available news articles. We release new datasets weekly, each containing around 1,000 news articles focused on… See the full description on the dataset page: https://huggingface.co/datasets/Jerry999/sds-news-rag.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
(see https://tblock.github.io/10kGNAD/ for the original dataset page)
This page introduces the 10k German News Articles Dataset (10kGNAD), a German topic classification dataset. The 10kGNAD is based on the One Million Posts Corpus and available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. You can download the dataset here.
English text classification datasets are common. Examples are the big AG News, the class-rich 20 Newsgroups and the large-scale DBpedia ontology datasets for topic classification, and the commonly used IMDb and Yelp datasets for sentiment analysis. Non-English datasets, especially German datasets, are less common. There is a collection of sentiment analysis datasets assembled by the Interest Group on German Sentiment Analysis. However, to my knowledge, no German topic classification dataset is available to the public.
Due to grammatical differences between the English and the German language, a classifier might be effective on an English dataset, but not as effective on a German dataset. The German language is more highly inflected, and long compound words are quite common compared to the English language. One would need to evaluate a classifier on multiple German datasets to get a sense of its effectiveness.
The 10kGNAD dataset is intended to solve part of this problem as the first German topic classification dataset. It consists of 10273 German-language news articles from an Austrian online newspaper categorized into nine topics. These articles are a hitherto unused part of the One Million Posts Corpus.
In the One Million Posts Corpus each article has a topic path, for example Newsroom/Wirtschaft/Wirtschaftpolitik/Finanzmaerkte/Griechenlandkrise. The 10kGNAD uses the second part of the topic path, here Wirtschaft, as class label.
As a result, the dataset can be used for multi-class classification.
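The label derivation described above can be sketched in a few lines of Python. The function name class_label is hypothetical and this is not the project's own extraction script. It only shows how the second path component becomes the class.

```python
def class_label(topic_path: str) -> str:
    """Return the second component of a '/'-separated topic path."""
    parts = topic_path.split("/")
    if len(parts) < 2:
        raise ValueError(f"topic path has no second component: {topic_path!r}")
    return parts[1]


label = class_label(
    "Newsroom/Wirtschaft/Wirtschaftpolitik/Finanzmaerkte/Griechenlandkrise"
)
print(label)  # Wirtschaft
```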
I created and used this dataset in my thesis to train and evaluate four text classifiers on the German language. By publishing the dataset I hope to support the advancement of tools and models for the German language. Additionally, this dataset can be used as a benchmark dataset for German topic classification.
As in most real-world datasets, the class distribution of the 10kGNAD is not balanced. The biggest class, Web, consists of 1678 articles, while the smallest class, Kultur, contains only 539 articles. However, articles from the Web class have on average the fewest words, while articles from the Kultur (culture) class have the second most words.
I propose a stratified split of 10% for testing and the remaining articles for training.
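To illustrate what a stratified 10% split means, here is a minimal standard-library sketch that samples 10% of each class separately. It is not the project's split script and will not reproduce the published train.csv and test.csv. The seed and the function name are assumptions, and only the two class sizes quoted above are used as toy data.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.1, seed=42):
    """Return (train_indices, test_indices) preserving class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Toy labels: only the largest and smallest classes, with the sizes
# stated in the text; the other seven classes are omitted.
labels = ["Web"] * 1678 + ["Kultur"] * 539
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Sampling within each class keeps small classes such as Kultur represented in the test set in roughly the same proportion as in the full dataset.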
To use the dataset as a benchmark dataset, please use the train.csv and test.csv files located in the project root.
Python scripts to extract the articles and split them into a train and a test set are available in the code directory of this project.
Make sure to install the requirements.
The original corpus.sqlite3 is required to extract the articles (download here (compressed) or here (uncompressed)).
This dataset is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Please consider citing the authors of the One Million Posts Corpus if you use the dataset.
https://www.icpsr.umich.edu/web/ICPSR/studies/26943/terms
This special topic poll is part of a continuing series of monthly surveys that solicits public opinion on the presidency and on a range of other political and social issues. In this poll, fielded February 2-4, 2009, respondents were asked whether they approved of the way Barack Obama was handling the presidency, foreign policy, the economy, and the campaign against terrorism. Opinions were collected about whether the country was going in the right direction, whether the condition of the economy was good, how long the recession would last, and what could be done to get the United States out of the recession. Respondents were asked their opinions of Speaker of the House Nancy Pelosi, Democrats in Congress, Republicans in Congress, and Congress as a whole. Several questions were asked about coal including questions that asked respondents whether they would approve of building plants that were powered by coal to generate electricity, whether it was a good idea to use coal to generate electricity, whether they thought doing so would contribute to global warming, whether they knew of any companies using technology to generate electricity from coal in a way that does not contribute to global warming, respondent's definition of "clean coal," and whether advertisements about "clean coal" technology had changed their opinion of whether it was possible to use coal to generate electricity in a way that was less likely to contribute to global warming. Other questions asked about the economic stimulus plan, how closely respondents had been following news about it, whether they approved of the federal government passing an economic stimulus bill, whether the bill would shorten the recession, and whether it was okay for the Democrats to pass the bill without the support of the Republicans in Congress. 
Additional topics addressed closing the United States prison in Guantanamo Bay, Cuba, abortion, job security, global warming, the concept of "nature versus nurture," and where people obtain their sense of morality. Demographic variables include sex, age, race, education level, marital status, household income, political party affiliation, political philosophy, voter registration status and participation history, religious preference, religious service attendance, and whether respondents considered themselves to be a born-again Christian.
https://www.icpsr.umich.edu/web/ICPSR/studies/8550/terms
The substantive common denominator in the surveys is a continuing evaluation of the Reagan presidency. Each survey also contains questions of topical relevance or questions about broader social issues. Respondents were queried about their attitudes towards the arms race and "Star Wars", Ronald Reagan and his domestic and foreign policies, tax reform, the federal deficit, the Vietnam War, Reagan's visit to the military cemetery in Bitburg, Central America, trade policies, the United Nations, AIDS, the Soviet Union, and religion and the Catholic church. One survey contains questions concerning race relations and public figures in New York City. Only New York City residents were interviewed for this particular survey. All surveys contain demographic information on respondents.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Newspaper segmentation dataset: Finlam
Dataset Summary
The Finlam dataset includes 149 French newspapers from the 19th to 20th centuries. Each newspaper contains multiple pages. Page images are resized to a fixed height of 2000 pixels. Each page contains multiple zones, with different information such as polygon, text, class, and order.
Split
set    images  newspapers
train  623     129
val    50      10
test   48      10
Languages
Most newspapers in… See the full description on the dataset page: https://huggingface.co/datasets/Teklia/Newspapers-finlam.
https://choosealicense.com/licenses/other/
Dataset Card for news-12factor
Dataset Description
~20k articles labeled left, right, or center by the editors of allsides.com.
Languages
The text in the dataset is in English
Dataset Structure
3 folders, with many text files in each. Each text file represents the body text of one article.
Source Data
URL data was scraped using https://github.com/mozilla/readability
Annotations
Articles were manually annotated by news editors… See the full description on the dataset page: https://huggingface.co/datasets/valurank/PoliticalBias_AllSides_Txt.
In April 2025, the news website with the most monthly visits in the United States was nytimes.com, with a total of ***** million monthly visits in that month. In second place was cnn.com with just over *** million visits, followed by foxnews.com with almost a ****** of a million.

Online news consumption in the U.S.
Americans get their news in a variety of ways, but social media is an increasingly popular option. A survey on social media news consumption revealed that ** percent of Twitter users regularly used the site for news, and Facebook and Reddit were also popular for news among their users. Interestingly though, social media is the least trusted news source in the United States.

News and trust
Trust in news sources has become increasingly important to the American news consumer amidst the spread of fake news, and the public are more vocal about whether or not they have faith in a source to report news correctly. Ongoing discussions about the credibility, accuracy and bias of news networks, anchors, TV show hosts, and news media professionals mean that those looking to keep up to date tend to be more cautious than ever before. In general, news audiences are skeptical. In 2020, just **** percent of respondents to a survey investigating the perceived objectivity of the mass media reported having a great deal of trust in the media to report news fully, accurately, and fairly.
https://choosealicense.com/licenses/other/
This is a copy of the Multi-News dataset, except the input source documents of the train, validation, and test splits have been replaced with documents returned by a dense retriever. The retrieval pipeline used:
query: the summary field of each example
corpus: the union of all documents in the train, validation and test splits
retriever: facebook/contriever-msmarco via PyTerrier with default settings
top-k strategy: "oracle", i.e. the number of documents retrieved, k, is set as the original number of input documents… See the full description on the dataset page: https://huggingface.co/datasets/allenai/multinews_dense_oracle.
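The "oracle" top-k strategy can be sketched independently of the actual retriever: given relevance scores for candidate documents, keep as many as the example originally had. The scores, document ids, and the function name oracle_top_k below are made up for illustration. No PyTerrier or Contriever call is shown or implied.

```python
def oracle_top_k(scored_docs, original_docs):
    """Keep the k highest-scoring documents, with k = len(original_docs).

    `scored_docs` is a list of (doc_id, score) pairs from any retriever.
    """
    k = len(original_docs)
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


# Made-up retriever scores for four candidates; the example originally
# had two input documents, so two are retrieved.
scores = [("d1", 0.2), ("d2", 0.9), ("d3", 0.5), ("d4", 0.7)]
retrieved = oracle_top_k(scores, original_docs=["doc_a", "doc_b"])
print(retrieved)  # ['d2', 'd4']
```

The strategy is called an oracle because k depends on information (the original document count) that would not be known in a real retrieval setting.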