Social media was by far the most popular news platform among 18 to 34-year-olds in the United States, with 47 percent of respondents to a survey held in August 2022 saying that they used social networks for news on a daily basis. By comparison, adults over 65 years old mostly used network news to keep up to date.
The decline of newspapers
In the past, the reasons to regularly go out and purchase a print newspaper were many. Used not only for news but also apartment hunting, entertainment, and job searches (among other things), newspapers once served multiple purposes. This is no longer the case, with first television and then the internet taking care of consumer needs once covered by printed papers. Indeed, the paid circulation of daily weekday newspapers in the United States has fallen dramatically since the 1980s with no sign of future improvement.
News consumption habits
A survey on news consumption by gender found that 50 percent of women use either online-only news sites or social media for news each day, and 51 percent of male respondents said the same. Social media was by far the most used daily news platform among U.S. Millennials, and the same was true of Gen Z. One appeal of online news is that it often comes at no cost to the consumer. Paying for news found via digital outlets is not yet commonplace in the United States, with only 21 percent of U.S. consumers responding to a study held in early 2021 reporting having paid for online news content in the last year.
In April 2025, the news website with the most monthly visits in the United States was nytimes.com, with ***** million visits that month. In second place was cnn.com with just over *** million visits, followed by foxnews.com with almost a ****** of a million.
Online news consumption in the U.S.
Americans get their news in a variety of ways, but social media is an increasingly popular option. A survey on social media news consumption revealed that ** percent of Twitter users regularly used the site for news, and Facebook and Reddit were also popular for news among their users. Interestingly, though, social media is the least trusted news source in the United States.
News and trust
Trust in news sources has become increasingly important to the American news consumer amidst the spread of fake news, and the public is more vocal about whether or not it has faith in a source to report news correctly. Ongoing discussions about the credibility, accuracy, and bias of news networks, anchors, TV show hosts, and news media professionals mean that those looking to keep up to date tend to be more cautious than ever before. In general, news audiences are skeptical. In 2020, just **** percent of respondents to a survey investigating the perceived objectivity of the mass media reported having a great deal of trust in the media to report news fully, accurately, and fairly.
During a 2025 survey, ** percent of respondents from Nigeria stated that they used social media as a source of news. In comparison, just ** percent of Japanese respondents said the same. Large portions of social media users around the world admit that they do not trust social platforms either as media sources or as a way to get news, and yet they continue to access such networks on a daily basis.
Social media: trust and consumption
Despite the majority of adults surveyed in each country reporting that they used social networks to keep up to date with news and current affairs, a 2018 study showed that social media is the least trusted news source in the world. Less than ** percent of adults in Europe considered social networks to be trustworthy in this respect, yet more than ** percent of adults in Portugal, Poland, Romania, Hungary, Bulgaria, Slovakia, and Croatia said that they got their news on social media. What is clear is that we live in an era where social media is such an enormous part of daily life that consumers will still use it in spite of their doubts or reservations. Concerns about fake news and propaganda on social media have not stopped billions of users from accessing their favorite networks on a daily basis. Most Millennials in the United States use social media for news every day, and younger consumers in European countries are much more likely than their older peers to use social networks for national political news. Like it or not, reading news on social media is fast becoming the norm for younger generations, and this form of news consumption will likely increase further regardless of whether consumers fully trust their chosen network.
A study held in 2025 revealed that ** percent of X (formerly known as Twitter) users regularly used X for news. By contrast, users of major platforms Instagram, TikTok, and YouTube were less inclined to get their news from those sites, though usage of TikTok for news increased to ** percent in 2025 compared to 2020, with the platform especially popular among younger audiences.
As part of a capstone project, we wanted to compare what social media users are talking about with what is going on in the world, to see if and how social media users care about news events. We scraped data from Twitter, Reddit, reliable news sources, and Google Trending Topics.
This data set includes nine tables: Twitter, news, Google Trending Topics, and six popular subreddits (news, worldnews, upliftingnews, sports, politics, television).
Twitter: trending topic, date trending, sentiment analysis scores, most common word associated with the trend, most common pairs of words associated with the trend.
News: headlines (collected from BBC News, USA Today, and the Washington Post), date the article was posted.
Google Trending Topics: trending topic, date trending.
Subreddits: post title, time, date, score (upvotes - downvotes), number of comments.
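One simple way to work with these tables is to check which trending topics overlap across platforms. The sketch below is illustrative: the topic sets are made-up stand-ins for rows of the Twitter and Google Trending Topics tables, and the normalization rule is an assumption, since hashtags and plain-text topics name the same event differently.

```python
# Minimal sketch: find trending topics shared between two platforms.
# The topic sets below are illustrative stand-ins for the Twitter and
# Google Trending Topics tables described above.
twitter_trends = {"#MarchForScience", "#Election2017", "#NBAPlayoffs"}
google_trends = {"March for Science", "NBA Playoffs", "United Airlines"}

def normalize(topic):
    """Lower-case, drop '#' and spaces so topics from different platforms match."""
    return topic.lstrip("#").replace(" ", "").lower()

normalized_google = {normalize(g) for g in google_trends}
shared = {t for t in twitter_trends if normalize(t) in normalized_google}
```

Topics in `shared` appeared on both platforms; topics only in one set hint at what users of the other platform may not be seeing.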
This data was collected as part of a semester project in the Capstone in Social Network Analytics at Virginia Tech, Spring 2017, taught by Siddharth Krishnan. The data was collected over a period of eight days in April 2017.
What do social media users care about, and in what ways do they care? What may they not know about? What types of trends appear most on each social media platform? Are people who get the majority of their news from social media able to get an accurate and comprehensive idea of what is going on? How can algorithms such as Twitter’s trending topics algorithm influence and shape what users talk about, read, and react to?
https://crawlfeeds.com/privacy_policy
The Fox News Dataset is a comprehensive collection of over 1 million news articles, offering an unparalleled resource for analyzing media narratives, public discourse, and political trends. Covering articles up to the year 2023, this dataset is a treasure trove for researchers, analysts, and businesses interested in gaining deeper insights into the topics and trends covered by Fox News.
This large dataset is ideal for:
Discover additional resources for your research needs by visiting our news dataset collection. These datasets are tailored to support diverse analytical applications, including sentiment analysis and trend modeling.
The Fox News Dataset is a must-have for anyone interested in exploring large-scale media data and leveraging it for advanced analysis. Ready to dive into this wealth of information? Download the dataset now in CSV format and start uncovering the stories behind the headlines.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset comprises news articles collected over the past few months using the NewsAPI. The primary motivation behind curating this dataset was to develop and experiment with various natural language processing (NLP) models. The dataset aims to support the creation of text summarization models, sentiment analysis models, and other NLP applications.
The data is sourced from the NewsAPI, a comprehensive and up-to-date news aggregation service. The API provides access to a wide range of news articles from various reputable sources, making it a valuable resource for constructing a diverse and informative dataset.
The data for this dataset was collected using a custom Python script; the script used for data retrieval is dailyWorker.py. It leverages the NewsAPI to gather information on news articles over a specified period.
Feel free to explore and modify the script to suit your data collection needs. If you have any questions or suggestions for improvement, please don't hesitate to reach out.
The file ratings.csv in this dataset has been labeled using the NLP model cardiffnlp/twitter-roberta-base-sentiment for sentiment classification.
This labeling was applied to facilitate sentiment-based research and analysis tasks.
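The labeling step described above can be sketched as below. In practice the classifier would be `transformers.pipeline("sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment")`, whose raw outputs are `LABEL_0`/`LABEL_1`/`LABEL_2` for negative/neutral/positive; a stub classifier stands in here so the example runs without downloading the model.

```python
# Sketch of the sentiment-labeling step applied to ratings.csv.
# LABEL_MAP follows the cardiffnlp/twitter-roberta-base-sentiment convention.
LABEL_MAP = {"LABEL_0": "negative", "LABEL_1": "neutral", "LABEL_2": "positive"}

def stub_classifier(texts):
    # Stand-in for the Hugging Face pipeline: one result dict per input text.
    return [{"label": "LABEL_2", "score": 0.98} for _ in texts]

def label_articles(texts, classifier):
    """Attach a human-readable sentiment label to each article text."""
    return [LABEL_MAP[result["label"]] for result in classifier(texts)]

labels = label_articles(["Markets rally on strong earnings"], stub_classifier)
```

Swapping `stub_classifier` for the real pipeline object requires no other changes, since both are callables returning dicts with a `label` key.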
The inspiration behind collecting this dataset stems from the growing interest in NLP applications and the need for high-quality, real-world data to train and evaluate these models effectively. By leveraging the NewsAPI, we aim to contribute to the development of robust text summarization and sentiment analysis models that can better understand and process news content.
Please refer to the NewsAPI documentation for terms of use and ensure compliance with their policies when using this dataset.
Social media was the most popular news platform amongst Americans as of February 2022 and was used most regularly by women, with 39 percent of female respondents to a survey saying that they used social networks for news on a daily basis. Meanwhile, twice as many men as women reported reading newspapers each day.
https://crawlfeeds.com/privacy_policy
Unlock real-time insights from Time Magazine's Latest News Dataset through our platform in just a few simple steps. Whether you're a researcher, marketer, or business analyst, this dataset offers comprehensive coverage of global news from one of the world’s most trusted sources. Here’s how you can get started:
Begin by signing up for an account on our platform. This gives you access to all of our data services, including the Time Magazine Latest News Dataset.
Browse our offerings and select the Time Magazine Latest News Dataset plan that fits your needs. Once you’ve made your choice, add it to your cart and proceed to the checkout page.
Complete your purchase by paying through our secure payment options. We accept multiple payment methods to ensure a smooth and easy transaction process.
Once your payment is processed, you will receive an invoice for your purchase. Our team will then provide you with immediate access to the dataset, along with the relevant download instructions and login details.
After gaining access, you’ll be able to download the Time Magazine Latest News Dataset, which includes news articles extracted as of March 2021. While this dataset is not a live feed, it offers historical articles and insights that can be used for trend analysis, research, and content aggregation.
https://creativecommons.org/publicdomain/zero/1.0/
The data was collected from the home pages of various media houses to see which news media share or write articles with fewer gory words.
The data was obtained from the following websites, downloaded over the period October to November 2017:
1. "http://www.nytimes.com/"
2. "http://www.foxnews.com/"
3. "http://www.reuters.com/"
4. "http://www.cnn.com/"
5. "http://www.huffingtonpost.com/"
Each folder is named using the mmddyyyy convention, and each CSV file is named after the media house (e.g., reuters.csv). Each CSV has the following columns:
TITLE: the title of the article.
SUMMARY: the first few lines of the article's text.
TEXT: the full text of the article.
URL: the web link to the article.
KEYWORDS: important words in the article.
This dataset is under the CC0 public domain license.
All around the world both good and bad things happen, and we get to know only those that are exposed to us; that is the primary responsibility of the media. But the bigger responsibility of these media houses is the way in which they express the content to the people.
A responsible media house's content should be original, unbiased, free of exaggeration, and very sensitive in handling the emotions of its readers and viewers. The same story can be told in different ways, and these different ways can trigger different emotions among readers.
It is known that we become who we are through what we say and what we read. Reading a story filled with positive words makes us feel more positive, and vice versa. So the wording of a piece plays just as important a role as the content itself.
This dataset stands as a sample for finding out which media house conveys the news in a more optimistic way.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
WWFND: World Wide Fake News Dataset 1. Introduction The World Wide Fake News Dataset (WWFND) has been developed with the objective of facilitating research in the domain of fake news detection. This dataset has been created using Python’s web scraping library – BeautifulSoup, and comprises news articles collected from multiple globally recognised fact-checking and media organisations. The data has been carefully compiled from reputable news and fact-verification platforms identified by the Pew Research Center, including but not limited to:
BBC News
CNN
Al Jazeera
Times of India
The Hindu
PolitiFact
NBC News
CBS News
ABC News
NDTV
The Wire
These sources have been selected for their credibility and global or national reach. News articles were collected only after ensuring that they had been clearly classified as either true or fake by these organisations.
2. Dataset Summary The dataset comprises a total of 30,616 records, which include:
15,027 records identified as true news articles
15,589 records identified as fake news articles
To further enhance the robustness and applicability of the dataset, it has been combined with another dataset titled COVID19_FNIR, available through the IEEE Dataport at the following link: https://ieee-dataport.org/open-access/covid-19-fake-news-infodemic-research-dataset-covid19-fnir-dataset
This integration was undertaken to provide a more comprehensive dataset, especially for training machine learning models in detecting misinformation during global crises such as the COVID-19 pandemic.
3. Contents of the Dataset The WWFND dataset includes the following files:
This file contains the cleaned and preprocessed version of the dataset, combining both fake and true news articles.
This file contains the raw, unprocessed fake news articles collected from the sources mentioned above.
This file contains the raw, unprocessed true news articles obtained from the verified sources.
4. Applications This dataset is suitable for various applications, including:
Training and testing models for fake news detection
Text classification and content analysis using Natural Language Processing (NLP) techniques
Research in media literacy, misinformation tracking, and credibility assessment
Academic projects and data science competitions focused on information verification
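The first application above, fake news detection, can be sketched with a standard bag-of-words classifier. The sketch is a minimal illustration, not the pipeline used by the dataset authors: the four training texts are made-up stand-ins for rows of the true and fake files, and TF-IDF plus logistic regression is just one common baseline.

```python
# Minimal baseline sketch: TF-IDF features + logistic regression on
# toy stand-in data (the real WWFND files would supply texts and labels).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "government confirms new health policy after official review",
    "officials announce verified election results today",
    "miracle cure doctors don't want you to know about",
    "shocking secret plot revealed by anonymous insider",
]
labels = ["true", "true", "fake", "fake"]

# Pipeline keeps vectorization and classification in one fit/predict object.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)
prediction = model.predict(["officials confirm policy review"])[0]
```

With the full 30,616-record dataset, one would additionally hold out a test split and report precision/recall per class rather than fitting on a handful of examples.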
5. Acknowledgements The dataset creators acknowledge the use of publicly available content solely for academic and research purposes. The COVID19_FNIR dataset has been used with reference to its source on IEEE Dataport.
6. Licensing and Usage This dataset is intended for educational and research use only. Users are advised to cite the original sources and the IEEE dataset if the WWFND dataset is used in any publication or project.
Context
Social media is a vast pool of content, and among all the content available to users, news is accessed most frequently. This news can be posted by politicians, news channels, newspaper websites, or even ordinary civilians. These posts have to be checked for their authenticity, since spreading misinformation is a real concern today, and many firms are taking steps to make people aware of the consequences of spreading misinformation. The authenticity of news posted online cannot be definitively measured, since the manual classification of news is tedious and time-consuming, and is also subject to bias.
Content
Data preprocessing has been performed on the dataset Getting Real about Fake News, and class skew has been eliminated.
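Eliminating class skew is typically done by downsampling the majority class until both labels are equally frequent. The sketch below illustrates that idea on made-up stand-in rows; the actual preprocessing applied to this dataset is not documented here, so treat this as one plausible approach.

```python
import random
from collections import Counter

# Illustrative stand-in rows: 6 "fake" vs 4 "real" (skewed).
rows = [("headline %d" % i, "fake") for i in range(6)] + \
       [("headline %d" % i, "real") for i in range(4)]

def balance(rows, seed=0):
    """Downsample every class to the size of the smallest class."""
    by_label = {}
    for text, label in rows:
        by_label.setdefault(label, []).append((text, label))
    n = min(len(items) for items in by_label.values())
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    balanced = []
    for items in by_label.values():
        balanced.extend(rng.sample(items, n))
    return balanced

balanced = balance(rows)
counts = Counter(label for _, label in balanced)
```

After balancing, both labels appear exactly `min(class sizes)` times, so a classifier trained on the result cannot exploit label frequency alone.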
https://creativecommons.org/publicdomain/zero/1.0/
Business Context
The advent of e-news, or electronic news, portals has offered us a great opportunity to quickly get updates on the day-to-day events occurring globally. The information on these portals is retrieved electronically from online databases, processed using a variety of software, and then transmitted to the users. There are multiple advantages of transmitting news electronically, such as faster access to the content and the ability to use different technologies like audio, graphics, video, and other interactive elements that are either not used or not yet common in traditional newspapers.
E-news Express, an online news portal, aims to expand its business by acquiring new subscribers. With every visitor to the website taking certain actions based on their interest, the company plans to analyze these actions to understand user interests and determine how to drive better engagement. The executives at E-news Express are of the opinion that there has been a decline in new monthly subscribers compared to the past year because the current webpage is not designed well enough in terms of the outline & recommended content to keep customers engaged long enough to make a decision to subscribe.
[Companies often analyze user responses to two variants of a product to decide which of the two variants is more effective. This experimental technique, known as A/B testing, is used to determine whether a new feature attracts users based on a chosen metric.]
Objective
The design team of the company has researched and created a new landing page that has a new outline and more relevant content compared to the old page. In order to test the effectiveness of the new landing page in gathering new subscribers, the Data Science team conducted an experiment by randomly selecting 100 users and dividing them equally into two groups. The existing landing page was served to the first group (control group) and the new landing page to the second group (treatment group). Data regarding the interaction of users in both groups with the two versions of the landing page was collected. As a data scientist at E-news Express, you have been asked to explore the data and perform a statistical analysis (at a significance level of 5%) to determine the effectiveness of the new landing page in gathering new subscribers for the news portal by answering the following questions:
Do the users spend more time on the new landing page than on the existing landing page?
Is the conversion rate (the proportion of users who visit the landing page and get converted) for the new page greater than the conversion rate for the old page?
Does the converted status depend on the preferred language?
Is the time spent on the new page the same for users of different languages?
Data Dictionary
The data contains information regarding the interaction of users in both groups with the two versions of the landing page.
user_id - Unique user ID of the person visiting the website
group - Whether the user belongs to the first group (control) or the second group (treatment)
landing_page - Whether the landing page is new or old
time_spent_on_the_page - Time (in minutes) spent by the user on the landing page
converted - Whether the user gets converted to a subscriber of the news portal or not
language_preferred - Language chosen by the user to view the landing page
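The first two questions map onto standard one-sided tests: a two-sample t-test on time spent and a two-proportion z-test on conversion. The sketch below runs both at the 5% level on made-up stand-in numbers (the real analysis would load the collected experiment data for the 100 users).

```python
import math
from scipy import stats

# Q1: do users spend more time on the new page? One-sided two-sample t-test
# on time spent (minutes); the samples below are illustrative stand-ins.
time_old = [3.2, 4.1, 2.8, 3.9, 3.5, 4.0, 3.1, 3.6]
time_new = [5.1, 4.8, 5.6, 4.9, 5.3, 5.0, 4.7, 5.4]
t_stat, p_two_sided = stats.ttest_ind(time_new, time_old)
p_time = p_two_sided / 2  # halve for the one-sided alternative: new > old

# Q2: is the new page's conversion rate higher? One-sided two-proportion z-test.
def prop_z_test(conv_a, n_a, conv_b, n_b):
    """One-sided p-value for H1: p_a > p_b, using the pooled-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))  # P(Z > z) for Z ~ N(0,1)

p_conv = prop_z_test(conv_a=33, n_a=50, conv_b=21, n_b=50)  # stand-in counts

new_page_better = p_time < 0.05 and p_conv < 0.05
```

The remaining two questions (conversion vs. preferred language, time spent across languages) would follow the same pattern with a chi-square test of independence and a one-way ANOVA, respectively.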
The dataset contains detailed information on some of the most popular English-language media channels on YouTube, from channel overviews to statistics for the top 50 videos of each channel. Below is a description of all the columns of the two datasets.
Mainstream Media Statistics
Top50_viewed_video_from_each_channels
Inspirations
Data was scraped using the YouTube API; feel free to use the data as long as it complies with YouTube's terms of use. With this dataset you could, for example, analyze which news topics interest people most, or watch some of the most viewed news videos in the world to stay in touch with society.
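One small example of the kind of analysis suggested above is ranking channels by the average view count of their top videos. The records and field names below are made-up stand-ins, not the actual column names of the Top50_viewed_video_from_each_channels table.

```python
# Illustrative sketch: rank channels by average views of their top videos.
# The records and field names are hypothetical stand-ins for dataset rows.
videos = [
    {"channel": "BBC News", "views": 12_000_000},
    {"channel": "BBC News", "views": 8_000_000},
    {"channel": "CNN", "views": 5_000_000},
    {"channel": "CNN", "views": 7_000_000},
]

def average_views_by_channel(videos):
    """Aggregate (sum, count) per channel, then return per-channel averages."""
    totals = {}
    for v in videos:
        total, count = totals.get(v["channel"], (0, 0))
        totals[v["channel"]] = (total + v["views"], count + 1)
    return {ch: total / count for ch, (total, count) in totals.items()}

ranked = sorted(average_views_by_channel(videos).items(),
                key=lambda kv: kv[1], reverse=True)
```

The same aggregation pattern applies to likes, comments, or any other per-video statistic in the table.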
Despite eroding consensus about credible political news sources, much of the public still trusts local media. We assess the emergence, sources and implications of the trust advantage local news holds over national media. We argue the public now uses a news outlet's local orientation as a shortcut to assess its credibility. In survey experiments, we find unfamiliar news outlets are trusted more when they have a local cue in their name. In surveys where people evaluate digital sources covering their community, this heuristic leads the public to trust unreliable information providers that signal a local focus more than high-quality sources that do not. Our findings position local media as unique, broadly trusted communicators while also illustrating a logic behind recent efforts to disseminate biased political information by packaging it as local news. More broadly, we show the challenges that arise when the public applies once-reliable heuristics in changing political circumstances.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
As we all know, fake news has become a centre of attention worldwide because of its hazardous impact on our society. One recent example is the spread of fake news related to COVID-19 cures, precautions, and symptoms, and you must understand by now how dangerous this bogus information can be. Distorted information propagated at election time to achieve a political agenda is not hidden from anyone.
Fake news is quickly becoming an epidemic, and it alarms and angers me how often and how rapidly totally fabricated stories circulate. Why? In the first place, the deceptive effect: the fact that if a lie is repeated enough times, you’ll begin to believe it’s true.
You understand by now that fake news and other types of false information can take on various appearances. They can likewise have significant effects, because information shapes our world view: we make important decisions based on information. We form an idea about people or a situation by obtaining information. So if the information we saw on the Web is invented, false, exaggerated or distorted, we won’t make good decisions.
Hence, there is a dire need to do something about it, and it is a big data problem to which data scientists can contribute from their end in the fight against fake news.
Although fighting fake news is a big data problem, I have created this small dataset of approximately 10,000 news articles and metadata, scraped from approximately 600 web pages of the Politifact website, to analyse it using data science skills, get some insights into how we can stop the spread of misinformation at a broader level, and determine which approach gives better accuracy in doing so.
This dataset has 6 attributes, among which News_Headline is the most important for classifying news as FALSE or TRUE. If you look at the Label attribute closely, there are 6 classes specified in it. So it is totally up to you whether you want to use the dataset for multi-class classification or convert the class labels into FALSE or TRUE and then perform binary classification. For your convenience, I will write a notebook on how to convert this dataset from multi-class to binary-class. To deal with the text data, you need good hands-on practice with NLP and data-mining concepts.
News_Headline - contains the piece of information that has to be analysed.
Link_Of_News - contains the URL of the news headline given in the first column.
Source - contains the name of the author who posted the information on Facebook, Instagram, Twitter, or another social media platform.
Stated_On - the date when the information was posted by the author on the social media platform.
Date - the date when this piece of information was analysed by the Politifact team of fact-checkers in order to label it as FAKE or REAL.
Label - contains six class labels: True, Mostly-True, Half-True, Barely-True, False, Pants on Fire. You can either perform multi-class classification, or convert Mostly-True, Half-True, and Barely-True to True, drop Pants on Fire, and perform binary classification.
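The multi-class to binary conversion suggested above is a small mapping step; a minimal sketch on stand-in rows:

```python
# Collapse the Politifact labels to binary: Mostly-True, Half-True and
# Barely-True become TRUE, and rows labeled "Pants on Fire" are dropped.
BINARY_MAP = {
    "True": "TRUE", "Mostly-True": "TRUE",
    "Half-True": "TRUE", "Barely-True": "TRUE",
    "False": "FALSE",
}

def to_binary(rows):
    """rows: (News_Headline, Label) pairs; returns binary-labeled pairs."""
    return [(headline, BINARY_MAP[label])
            for headline, label in rows
            if label != "Pants on Fire"]

# Illustrative stand-in rows, not actual dataset entries.
rows = [("claim A", "Mostly-True"), ("claim B", "Pants on Fire"),
        ("claim C", "False")]
binary = to_binary(rows)
```

Note that collapsing Half-True and Barely-True into TRUE is one debatable choice; an alternative is to map them to FALSE or drop them, depending on how strict a definition of truth the task needs.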
A very big thanks to the fact-checking team of the Politifact.com website, as they provide correct labels through hard manual work, so that we data science people can take advantage of such labels to train our models and build better ones. These are some research papers that will help you get started with the project and clear up your fundamentals.
"https://journals.sagepub.com/doi/full/10.1177/2053951719843310">Big Data and quality data for fake news and misinformation detection by Fatemeh Torabi Asr, Maite Taboada
"https://asistdl.onlinelibrary.wiley.com/doi/full/10.1002/pra2.2015.145052010082">Automatic deception detection: Methods for finding fake news by Nadia K. Conroy Victoria L. Rubin Yimin Chen
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present Qbias, two novel datasets that promote the investigation of bias in online news search as described in
Fabian Haak and Philipp Schaer. 2023. 𝑄𝑏𝑖𝑎𝑠 - A Dataset on Media Bias in Search Queries and Query Suggestions. In Proceedings of ACM Web Science Conference (WebSci’23). ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3578503.3583628.
Dataset 1: AllSides Balanced News Dataset (allsides_balanced_news_headlines-texts.csv)
The dataset contains 21,747 news articles collected from AllSides balanced news headline roundups in November 2022 as presented in our publication. The AllSides balanced news feature three expert-selected U.S. news articles from sources of different political views (left, right, center), often featuring spin, slant, and other forms of non-neutral reporting on political news. All articles are tagged with a bias label by four expert annotators based on the expressed political partisanship: left, right, or neutral. The AllSides balanced news aims to offer multiple political perspectives on important news stories, educate users on biases, and provide multiple viewpoints. Collected data further includes headlines, dates, news texts, topic tags (e.g., "Republican party", "coronavirus", "federal jobs"), and the publishing news outlet. We also include AllSides' neutral description of the topic of the articles. Overall, the dataset contains 10,273 articles tagged as left, 7,222 as right, and 4,252 as center.
To provide easier access to the most recent and complete version of the dataset for future research, we provide a scraping tool and a regularly updated version of the dataset at https://github.com/irgroup/Qbias. The repository also contains regularly updated more recent versions of the dataset with additional tags (such as the URL to the article). We chose to publish the version used for fine-tuning the models on Zenodo to enable the reproduction of the results of our study.
Dataset 2: Search Query Suggestions (suggestions.csv)
The second dataset we provide consists of 671,669 search query suggestions for root queries based on tags of the AllSides biased news dataset. We collected search query suggestions from Google and Bing for the 1,431 topic tags, that have been used for tagging AllSides news at least five times, approximately half of the total number of topics. The topic tags include names, a wide range of political terms, agendas, and topics (e.g., "communism", "libertarian party", "same-sex marriage"), cultural and religious terms (e.g., "Ramadan", "pope Francis"), locations and other news-relevant terms. On average, the dataset contains 469 search queries for each topic. In total, 318,185 suggestions have been retrieved from Google and 353,484 from Bing.
The file contains a "root_term" column based on the AllSides topic tags. The "query_input" column contains the search term submitted to the search engine ("search_engine"). "query_suggestion" and "rank" represent the search query suggestions and their positions as returned by the search engines at the given time of search ("datetime"). The data was scraped from a US server, whose location is saved in "location".
We retrieved ten search query suggestions provided by the Google and Bing search autocomplete systems for the input of each of these root queries, without performing a search. Furthermore, we extended the root queries by the letters a to z (e.g., "democrats" (root term) >> "democrats a" (query input) >> "democrats and recession" (query suggestion)) to simulate a user's input during information search and generate a total of up to 270 query suggestions per topic and search engine. The dataset we provide contains columns for root term, query input, and query suggestion for each suggested query. The location from which the search is performed is the location of the Google servers running Colab, in our case Iowa in the United States of America, which is added to the dataset.
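Given the columns described above, reading and filtering suggestions.csv is straightforward; the sketch below uses an inline two-row sample as a stand-in for the real file (which additionally carries the "datetime" and "location" columns).

```python
import csv
import io

# Two illustrative stand-in rows mirroring the suggestions.csv columns.
SAMPLE = """root_term,query_input,query_suggestion,rank,search_engine
democrats,democrats a,democrats and recession,1,google
democrats,democrats a,democrats agenda,2,bing
"""

def suggestions_by_engine(csv_text, engine):
    """Return the rank-ordered suggestions collected from one search engine."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [r for r in reader if r["search_engine"] == engine]
    return [r["query_suggestion"]
            for r in sorted(rows, key=lambda r: int(r["rank"]))]

google_suggestions = suggestions_by_engine(SAMPLE, "google")
```

For the real file, replace the `io.StringIO` wrapper with `open("suggestions.csv")`; grouping by "root_term" then lets you compare Google and Bing suggestions per topic.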
AllSides Scraper
At https://github.com/irgroup/Qbias, we provide a scraping tool that allows for the automatic retrieval of all available articles from the AllSides balanced news headlines.
We want to provide an easy means of retrieving the news and all corresponding information. For many tasks it is relevant to have the most recent documents available. Thus, we provide this Python-based scraper, which scrapes all available AllSides news articles and gathers the available information. By providing the scraper, we facilitate access to a recent version of the dataset for other researchers.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
There are several works based on natural language processing of newspaper reports. Mining opinions from headlines [ 1 ] using Stanford NLP and SVM, Rameshbhai et al. compared several algorithms on a small and a large dataset. Rubin et al., in their paper [ 2 ], created a mechanism to differentiate fake news from real news by building a set of characteristics of news according to type. The purpose was to contribute to the low-resource data available for training machine learning algorithms. Doumit et al. in [ 3 ] implemented LDA, a topic modeling approach, to study bias present in online news media.
However, there is not much NLP research invested in studying COVID-19. Most applications involve classification of chest X-rays and CT scans to detect the presence of pneumonia in the lungs [ 4 ], a consequence of the virus. Other research areas include studying the genome sequence of the virus [ 5 ][ 6 ][ 7 ] and replicating its structure to fight it and find a vaccine. This research is crucial in battling the pandemic. The few NLP-based research publications include sentiment classification of online tweets by Samuel et al. [ 8 ] to understand the fear persisting in people due to the virus. Similar work has been done using an LSTM network to classify sentiments from online discussion forums by Jelodar et al. [ 9 ]. To the best of our knowledge, the NKK dataset is the first study on a comparatively larger dataset of newspaper reports on COVID-19, contributing to awareness of the virus.
2 Dataset Introduction
2.1 Data Collection
We accumulated 1,000 online newspaper reports from the United States of America (USA) on COVID-19 and named the collection “Covid-News-USA-NNK”; the newspapers include The Washington Post (USA) and StarTribune (USA). We also accumulated 50 online newspaper reports from Bangladesh on the issue and named that collection “Covid-News-BD-NNK”; the newspapers include The Daily Star (BD) and Prothom Alo (BD). All of these newspapers are among the top providers and most-read papers in their respective countries. The collection was done manually by 10 human data collectors of age group 23- with university degrees. Manual collection was preferable to automation for ensuring the news was highly relevant to the subject: the newspapers’ online sites serve dynamic content with advertisements in no particular order, so automated scrapers ran a high risk of collecting inaccurate news reports. One challenge in collecting the data was the subscription requirement; each newspaper charged $1 per subscription. The criteria provided as guidelines to the human data collectors were as follows:
The headline must have one or more words directly or indirectly related to COVID-19.
The content of each news must have 5 or more keywords directly or indirectly related to COVID-19.
The genre of the news can be anything as long as it is relevant to the topic; political, social, and economic genres are to be prioritized.
Avoid taking duplicate reports.
Maintain a consistent time frame across the above-mentioned newspapers.
To collect these data we used a Google Form for both the USA and BD collections. Two human editors went through each entry to check for spam or troll entries.
2.2 Data Pre-processing and Statistics
Some pre-processing steps performed on the newspaper report dataset are as follows:
Remove hyperlinks.
Remove non-English alphanumeric characters.
Remove stop words.
Lemmatize text.
While more pre-processing could have been applied, we tried to keep the data as unchanged as possible, since altering sentence structures could cause a loss of valuable information. Although this was done with the help of a script, we also assigned the same human collectors to cross-check each report against the above-mentioned criteria.
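The script-based steps listed above can be sketched as follows. This is a minimal, dependency-free sketch, not the authors' actual script: the stop-word list is a small illustrative stand-in for a full list, and lemmatization (e.g. via NLTK's WordNetLemmatizer) is noted but omitted:

```python
import re

# Small illustrative stop-word list; a real pipeline would use a full
# list such as NLTK's.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "on"}

def preprocess(text: str) -> str:
    # 1. Remove hyperlinks.
    text = re.sub(r"https?://\S+|www\.\S+", "", text)
    # 2. Remove non-English alphanumeric characters (keep ASCII letters,
    #    digits, and whitespace).
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    # 3. Remove stop words.
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    # 4. Lemmatization would be applied to each token here (omitted to
    #    keep the sketch dependency-free).
    return " ".join(tokens)

result = preprocess("The outbreak is spreading: see https://example.com/covid!")
```

Each report's headline and body would be passed through such a function before computing the statistics below.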
The primary data statistics of the two datasets are shown in Tables 1 and 2.
Table 1: Covid-News-USA-NNK data statistics
No. of words per headline: 7 to 20
No. of words per body content: 150 to 2,100

Table 2: Covid-News-BD-NNK data statistics
No. of words per headline: 10 to 20
No. of words per body content: 100 to 1,500
2.3 Dataset Repository
We used GitHub as our primary data repository, under the account name NKK^1. There we created two repositories, USA-NKK^2 and BD-NNK^3. The dataset is available in both CSV and JSON formats. We regularly update the CSV files and regenerate the JSON using a Python script, and we provide a Python script for essential operations. We welcome all outside collaboration to enrich the dataset.
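The CSV-to-JSON regeneration step can be sketched with the standard library as below; the file names are hypothetical placeholders, and the actual repository script may differ:

```python
import csv
import json

def csv_to_json(csv_path: str, json_path: str) -> None:
    """Regenerate a JSON file from a CSV file (sketch, not the repo script)."""
    # Read every CSV row into a dict keyed by the header row.
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    # Write the rows back out as a single JSON array.
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)

# Hypothetical usage; adjust paths to the actual repository layout:
# csv_to_json("Covid-News-USA-NNK.csv", "Covid-News-USA-NNK.json")
```

Running such a script after each CSV update keeps the two formats in sync.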
3 Literature Review
Natural Language Processing (NLP) deals with text (also known as categorical) data in computer science, using numerous diverse methods such as one-hot encoding and word embeddings that transform text into a machine-readable form, which can then be fed to machine learning and deep learning algorithms.
Some well-known applications of NLP include fraud detection on online media sites [10], authorship attribution in fallback authentication systems [11], intelligent conversational agents or chatbots [12], and the machine translation used by Google Translate [13]. While these are all downstream tasks, several exciting developments have been made in algorithms for Natural Language Processing itself. The two most prominent are BERT [14], which uses a bidirectional Transformer encoder and achieves strong results on classification and prediction tasks, and the GPT-3 models released by OpenAI [15], which can generate almost human-like text. These are typically used as pre-trained models, since training them carries a huge computational cost.

Information extraction is the general concept of retrieving information from a dataset. Information extraction from an image could mean retrieving vital feature spaces or targeted portions of the image; information extraction from speech could mean retrieving names, places, etc. [16]; and information extraction from text could mean identifying named entities, locations, or other essential data. Topic modeling is a sub-task of NLP and also a form of information extraction: it clusters words and phrases from the same context into groups. Topic modeling is an unsupervised learning method that gives a brief overview of a set of texts. One commonly used topic model is Latent Dirichlet Allocation (LDA) [17].
Keyword extraction is another form of information extraction and a sub-task of NLP: extracting essential words and phrases from a text. TextRank [18] is an efficient keyword extraction technique that uses a graph to compute a weight for each word and picks the words with the highest weights.
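The graph-based idea can be sketched in pure Python: build co-occurrence edges between words within a small window, then score nodes by power-iteration PageRank. The window size, damping factor, and iteration count below are the usual illustrative defaults, not values from the paper:

```python
from collections import defaultdict

def textrank_keywords(text, window=2, top_k=5, damping=0.85, iters=50):
    """Rank words by PageRank over a co-occurrence graph (sketch)."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    words = [w for w in words if w.isalpha()]

    # Build an undirected co-occurrence graph: words within `window`
    # positions of each other share an edge.
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + 1 + window, len(words))):
            if words[j] != w:
                neighbors[w].add(words[j])
                neighbors[words[j]].add(w)

    # Power iteration of the PageRank update rule.
    nodes = list(neighbors)
    score = {n: 1.0 for n in nodes}
    for _ in range(iters):
        score = {
            n: (1 - damping)
            + damping * sum(score[m] / len(neighbors[m]) for m in neighbors[n])
            for n in nodes
        }
    return sorted(score, key=score.get, reverse=True)[:top_k]

top = textrank_keywords("virus spreads virus impacts economy as virus mutates",
                        top_k=2)
```

Because a word that co-occurs with many distinct words accumulates rank, frequent central terms float to the top of the list.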
Word clouds are a great visualization technique for understanding the overall ‘talk of the topic’: the clustered words give a quick understanding of the content.
4 Our Experiments and Result Analysis
We used the wordcloud library^4 to create the word clouds. Figures 1 and 3 present the word clouds of the Covid-News-USA-NNK dataset by month, from February to May. From Figures 1, 2, and 3 we can note the following:
In February, both newspapers talked about China and the source of the outbreak.
StarTribune emphasized Minnesota as the most affected state; in April its concern appeared to grow.
Both newspapers discussed the virus impacting the economy, i.e., banks, elections, administrations, and markets.
Washington Post discussed global issues more than StarTribune.
In February, StarTribune mentioned the first precautionary measure, wearing masks, and the uncontrolled spread of the virus throughout the nation.
While both newspapers mentioned the outbreak in China in February, the spread within the United States is weighted more heavily from March through May, reflecting the critical impact of the virus.
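A word cloud is drawn from simple word-frequency counts; the counting step can be sketched as below. The stop-word set and sample sentence are illustrative only:

```python
import re
from collections import Counter

# Illustrative stop words; a real pipeline would use a full list.
STOP_WORDS = frozenset({"the", "of", "in", "and", "a", "to"})

def word_frequencies(text):
    """Count word occurrences, the raw input a word cloud is sized from."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

freq = word_frequencies("The virus spread in China and the virus hit markets")
# wordcloud.WordCloud().generate_from_frequencies(freq) would then render
# the cloud, sizing each word in proportion to its count.
```

Computing one such counter per newspaper per month yields the monthly word clouds shown in the figures.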
We used a script to extract from the news reports all numbers associated with certain keywords such as ‘Deaths’, ‘Infected’, ‘Died’, ‘Infections’, ‘Quarantined’, ‘Lock-down’, and ‘Diagnosed’, and compiled case counts for both newspapers. Figure 4 shows the statistics of this series. From this extraction we can observe that April was the peak month for COVID-19 cases, after a gradual rise from February. Both newspapers clearly show that the rise in cases from February to March was slower than the rise from March to April; this is an important indicator of possible recklessness in preparations to battle the virus. However, the steep fall from April to May also shows a positive response against the outbreak.

We used VADER sentiment analysis to extract the sentiment of the headlines and bodies. On average, the sentiments ranged from -0.5 to -0.9; the VADER sentiment scale runs from -1 (highly negative) to 1 (highly positive). There were some cases where the sentiment scores of the headline and body contradicted each other, i.e., the sentiment of the headline was negative but the sentiment of the body was slightly positive. Overall, sentiment analysis can help us sort the most concerning (most negative) news from the positive, from which we can learn more about the indicators related to COVID-19 and the serious impact it caused. Moreover, sentiment analysis can also tell us how a state or country is reacting to the pandemic.

We used the PageRank algorithm to extract keywords from the headlines as well as the body content; PageRank efficiently highlights the important, relevant keywords in a text. Some frequently occurring important keywords extracted from both datasets are: ‘China’, ‘Government’, ‘Masks’, ‘Economy’, ‘Crisis’, ‘Theft’, ‘Stock market’, ‘Jobs’, ‘Election’, ‘Missteps’, ‘Health’, and ‘Response’. Keyword extraction acts as a filter, allowing quick searches for indicators in case of locating situations of the economy,
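The number-extraction step described above can be sketched as a regular-expression pass over each report. This is a sketch of the idea, not the authors' exact script; the keyword list mirrors the one named in the text, and the context window size is an illustrative assumption:

```python
import re

KEYWORDS = ("deaths", "infected", "died", "infections",
            "quarantined", "lock-down", "diagnosed")

def extract_case_numbers(text, window=30):
    """Return numbers appearing within `window` characters of a keyword.

    Sketch only: scans for each keyword and grabs any digit group in
    the nearby context.
    """
    text = text.lower()
    hits = []
    for kw in KEYWORDS:
        for m in re.finditer(re.escape(kw), text):
            # Take a slice of text around the keyword occurrence.
            context = text[max(0, m.start() - window): m.end() + window]
            # Collect every digit group (allowing thousands separators).
            hits.extend(int(n.replace(",", ""))
                        for n in re.findall(r"\d[\d,]*", context))
    return hits

nums = extract_case_numbers("Officials reported 1,200 deaths and 54 quarantined.")
```

Aggregating these hits per newspaper per day produces the case-count series plotted in Figure 4.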
Techsalerator’s News Event Data in Latin America offers a detailed and extensive dataset designed to provide businesses, analysts, journalists, and researchers with an in-depth view of significant news events across the Latin American region. This dataset captures and categorizes key events reported from a wide array of news sources, including press releases, industry news sites, blogs, and PR platforms, offering valuable insights into regional developments, economic changes, political shifts, and cultural events.
Key Features of the Dataset:

Comprehensive Coverage: The dataset aggregates news events from numerous sources such as company press releases, industry news outlets, blogs, PR sites, and traditional news media. This broad coverage ensures a wide range of information from multiple reporting channels.

Categorization of Events: News events are categorized into various types including business and economic updates, political developments, technological advancements, legal and regulatory changes, and cultural events. This categorization helps users quickly locate and analyze information relevant to their interests or sectors.

Real-Time Updates: The dataset is updated regularly to include the most recent events, ensuring users have access to the latest news and can stay informed about current developments.

Geographic Segmentation: Events are tagged with their respective countries and regions within Latin America. This geographic segmentation allows users to filter and analyze news events based on specific locations, facilitating targeted research and analysis.

Event Details: Each event entry includes comprehensive details such as the date of occurrence, source of the news, a description of the event, and relevant keywords. This thorough detailing helps in understanding the context and significance of each event.

Historical Data: The dataset includes historical news event data, enabling users to track trends and perform comparative analysis over time. This feature supports longitudinal studies and provides insights into how news events evolve.

Advanced Search and Filter Options: Users can search and filter news events based on criteria such as date range, event type, location, and keywords. This functionality allows for precise and efficient retrieval of relevant information.

Latin American Countries Covered:

South America: Argentina, Bolivia, Brazil, Chile, Colombia, Ecuador, Guyana, Paraguay, Peru, Suriname, Uruguay, Venezuela

Central America: Belize, Costa Rica, El Salvador, Guatemala, Honduras, Nicaragua, Panama

Caribbean: Cuba, Dominican Republic, Haiti (note: primarily French-speaking but included due to geographic and cultural ties), Jamaica, Trinidad and Tobago

Benefits of the Dataset:

Strategic Insights: Businesses and analysts can use the dataset to gain insights into significant regional developments, economic conditions, and political changes, aiding in strategic decision-making and market analysis.

Market and Industry Trends: The dataset provides valuable information on industry-specific trends and events, helping users understand market dynamics and emerging opportunities.

Media and PR Monitoring: Journalists and PR professionals can track relevant news across Latin America, enabling them to monitor media coverage, identify emerging stories, and manage public relations efforts effectively.

Academic and Research Use: Researchers can utilize the dataset for longitudinal studies, trend analysis, and academic research on various topics related to Latin American news and events.

Techsalerator’s News Event Data in Latin America is a crucial resource for accessing and analyzing significant news events across the region. By providing detailed, categorized, and up-to-date information, it supports effective decision-making, research, and media monitoring across diverse sectors.
Techsalerator’s News Event Data in Asia offers a detailed and expansive dataset designed to provide businesses, analysts, journalists, and researchers with comprehensive insights into significant news events across the Asian continent. This dataset captures and categorizes major events reported from a diverse range of news sources, including press releases, industry news sites, blogs, and PR platforms, offering valuable perspectives on regional developments, economic shifts, political changes, and cultural occurrences.
Key Features of the Dataset:

Extensive Coverage: The dataset aggregates news events from a wide range of sources such as company press releases, industry-specific news outlets, blogs, PR sites, and traditional media. This broad coverage ensures a diverse array of information from multiple reporting channels.

Categorization of Events: News events are categorized into various types including business and economic updates, political developments, technological advancements, legal and regulatory changes, and cultural events. This categorization helps users quickly find and analyze information relevant to their interests or sectors.

Real-Time Updates: The dataset is updated regularly to include the most current events, ensuring users have access to the latest news and can stay informed about recent developments as they happen.

Geographic Segmentation: Events are tagged with their respective countries and regions within Asia. This geographic segmentation allows users to filter and analyze news events based on specific locations, facilitating targeted research and analysis.

Event Details: Each event entry includes comprehensive details such as the date of occurrence, source of the news, a description of the event, and relevant keywords. This thorough detailing helps users understand the context and significance of each event.

Historical Data: The dataset includes historical news event data, enabling users to track trends and perform comparative analysis over time. This feature supports longitudinal studies and provides insights into the evolution of news events.

Advanced Search and Filter Options: Users can search and filter news events based on criteria such as date range, event type, location, and keywords. This functionality allows for precise and efficient retrieval of relevant information.

Asian Countries and Territories Covered:

Central Asia: Kazakhstan, Kyrgyzstan, Tajikistan, Turkmenistan, Uzbekistan

East Asia: China, Hong Kong (Special Administrative Region of China), Japan, Mongolia, North Korea, South Korea, Taiwan

South Asia: Afghanistan, Bangladesh, Bhutan, India, Maldives, Nepal, Pakistan, Sri Lanka

Southeast Asia: Brunei, Cambodia, East Timor (Timor-Leste), Indonesia, Laos, Malaysia, Myanmar (Burma), Philippines, Singapore, Thailand, Vietnam

Western Asia (Middle East): Armenia, Azerbaijan, Bahrain, Cyprus, Georgia, Iraq, Israel, Jordan, Kuwait, Lebanon, Oman, Palestine, Qatar, Saudi Arabia, Syria, Turkey (partly in Europe, but often included in Asia contextually), United Arab Emirates, Yemen

Benefits of the Dataset:

Strategic Insights: Businesses and analysts can use the dataset to gain insights into significant regional developments, economic conditions, and political changes, aiding in strategic decision-making and market analysis.

Market and Industry Trends: The dataset provides valuable information on industry-specific trends and events, helping users understand market dynamics and identify emerging opportunities.

Media and PR Monitoring: Journalists and PR professionals can track relevant news across Asia, enabling them to monitor media coverage, identify emerging stories, and manage public relations efforts effectively.

Academic and Research Use: Researchers can utilize the dataset for longitudinal studies, trend analysis, and academic research on various topics related to Asian news and events.

Techsalerator’s News Event Data in Asia is a crucial resource for accessing and analyzing significant news events across the continent. By offering detailed, categorized, and up-to-date information, it supports effective decision-making, research, and media monitoring across diverse sectors.