This dataset was created by Qerav23
2.7 million news articles and essays
Dataset Description
2.7 million news articles and essays from 27 American publications. Includes date, title, publication, article text, publication name, year, month, and URL (for some). Articles mostly span from 2016 to early 2020.
Type: CSV
Size: 3.4 GB compressed, 8.8 GB uncompressed
Created by: Andrew Thompson
Date added: 4/3/2020
Date modified: 4/3/2020
Source: Component one Datasets 2.7 Millions
Date of Download and processed:…
See the full description on the dataset page: https://huggingface.co/datasets/rjac/all-the-news-2-1-Component-one.
https://creativecommons.org/publicdomain/zero/1.0/
The CNN / DailyMail Dataset is an English-language dataset containing just over 300k unique news articles as written by journalists at CNN and the Daily Mail. The current version supports both extractive and abstractive summarization, though the original version was created for machine reading and comprehension and abstractive question answering.
The BCP-47 code for English as generally spoken in the United States is en-US and the BCP-47 code for English as generally spoken in the United Kingdom is en-GB. It is unknown if other varieties of English are represented in the data.
For each instance, there is a string for the article, a string for the highlights, and a string for the id. See the CNN / Daily Mail dataset viewer to explore more examples.
{'id': '0054d6d30dbcad772e20b22771153a2a9cbeaf62',
 'article': "(CNN) -- An American woman died aboard a cruise ship that docked at Rio de Janeiro on Tuesday, the same ship on which 86 passengers previously fell ill, according to the state-run Brazilian news agency, Agencia Brasil. The American tourist died aboard the MS Veendam, owned by cruise operator Holland America. Federal Police told Agencia Brasil that forensic doctors were investigating her death. The ship's doctors told police that the woman was elderly and suffered from diabetes and hypertension, according the agency. The other passengers came down with diarrhea prior to her death during an earlier part of the trip, the ship's doctors said. The Veendam left New York 36 days ago for a South America tour.",
 'highlights': "The elderly woman suffered from diabetes and hypertension, ship's doctors say .\nPreviously, 86 passengers had fallen ill on the ship, Agencia Brasil says ."}
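The `id` in the instance above is 40 hexadecimal characters long, which is consistent with the field description of a hexadecimal SHA-1 hash of the story URL. The following is a minimal sketch of how such an identifier could be derived. It assumes the URL is encoded as UTF-8 before hashing; the `make_id` helper and the example URL are hypothetical, and the dataset's actual preprocessing may differ.

```python
import hashlib


def make_id(url: str) -> str:
    """Return the hexadecimal SHA-1 digest of a URL (assumes UTF-8 encoding)."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


# Hypothetical URL, used only for illustration.
example_id = make_id("https://example.com/news/story.html")
print(len(example_id))  # SHA-1 hex digests are always 40 characters long
```

Because the hash is deterministic, the same URL always yields the same identifier, which makes the field usable as a stable key.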
The average token counts for the articles and the highlights are provided below:
| Feature | Mean Token Count |
|---|---|
| Article | 781 |
| Highlights | 56 |
id: a string containing the hexadecimal-formatted SHA-1 hash of the URL where the story was retrieved from
article: a string containing the body of the news article
highlights: a string containing the highlight of the article as written by the article author

The CNN/DailyMail dataset has 3 splits: train, validation, and test. Below are the statistics for Version 3.0.0 of the dataset.
| Dataset Split | Number of Instances in Split |
|---|---|
| Train | 287,113 |
| Validation | 13,368 |
| Test | 11,490 |
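To put the split sizes in proportion, the short sketch below computes each split's share of the total. The counts are copied from the table above; the variable names are illustrative only.

```python
# Split sizes for Version 3.0.0, copied from the table above.
split_sizes = {"train": 287_113, "validation": 13_368, "test": 11_490}

total = sum(split_sizes.values())
shares = {name: count / total for name, count in split_sizes.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The three splits sum to 311,971 instances, of which the training split accounts for roughly 92%.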
Version 1.0.0 aimed to support supervised neural methodologies for machine reading and question answering with a large amount of real natural language training data and released about 313k unique articles and nearly 1M Cloze style questions to go with the articles. Versions 2.0.0 and 3.0.0 changed the structure of the dataset to support summarization rather than question answering. Version 3.0.0 provided a non-anonymized version of the data, whereas both the previous versions were preprocessed to replace named entities with unique identifier labels.
The data consists of news articles and...
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset tabulates the population of Newport News by gender across 18 age groups. It lists the male and female population in each age group along with the gender ratio for Newport News. The dataset can be used to understand the population distribution of Newport News by gender and age. For example, using this dataset, we can identify the largest age group for both men and women in Newport News. It can also be used to see how the gender ratio changes from birth to the senior-most age group, and to compare the male-to-female ratio across age groups.
Key observations
Largest age group (population): Male # 20-24 years (8,018) | Female # 30-34 years (7,684). Source: U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.
Age groups:
Scope of gender:
Please note that the American Community Survey asks a question about the respondent's current sex, but not about gender, sexual orientation, or sex at birth. The question is intended to capture data for biological sex, not gender. Respondents are supposed to answer either Male or Female. Our research and this dataset mirror the data reported as Male and Female for gender distribution analysis.
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for a research project, report, or presentation, you can contact our research staff at research@neilsberg.com to assess the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research Team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights is made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for Newport News Population by Gender. You can refer to the same here
Many Americans consume aligned partisan media, which scholars worry contributes to polarization. Many propose encouraging these Americans to consume cross-cutting media to moderate their attitudes. However, motivated reasoning theory posits that exposure to cross-cutting media could backfire, exacerbating polarization. Building on theories that sustained exposure to novel information can overcome motivated reasoning and that partisan sources on opposite sides cover distinct information, we argue that sustained consumption of cross-cutting media leads voters to learn uncongenial information and moderate their attitudes in covered domains. To test this argument, we used data on actual TV viewership to recruit a sample of regular Fox News viewers and incentivized a randomized treatment group to watch CNN instead for a month. Contrary to predictions from motivated reasoning, watching CNN caused substantial learning and moderated participants' attitudes in covered domains. We close by discussing challenges partisan media may pose for democracy.
During a 2024 survey, 77 percent of respondents from Nigeria stated that they used social media as a source of news. In comparison, just 23 percent of Japanese respondents said the same. Large portions of social media users around the world admit that they do not trust social platforms either as media sources or as a way to get news, and yet they continue to access such networks on a daily basis.
Social media: trust and consumption
Despite the majority of adults surveyed in each country reporting that they used social networks to keep up to date with news and current affairs, a 2018 study showed that social media is the least trusted news source in the world. Less than 35 percent of adults in Europe considered social networks to be trustworthy in this respect, yet more than 50 percent of adults in Portugal, Poland, Romania, Hungary, Bulgaria, Slovakia and Croatia said that they got their news on social media.
What is clear is that we live in an era where social media is such an enormous part of daily life that consumers will still use it in spite of their doubts or reservations. Concerns about fake news and propaganda on social media have not stopped billions of users accessing their favorite networks on a daily basis.
Most Millennials in the United States use social media for news every day, and younger consumers in European countries are much more likely to use social networks for national political news than their older peers.
Like it or not, reading news on social media is fast becoming the norm for younger generations, and this form of news consumption will likely increase further regardless of whether consumers fully trust their chosen network.
The Chronicling America project has several APIs that make pulling down data easy for digitized offerings, but for this list in particular you still have to crawl in order to get the full list.
Various metadata on a large list of newspapers published in the United States of America from 1690 through today.
Library of Congress' Chronicling America project.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set contains frequency counts of target words in 16 million news and opinion articles from 10 popular news media outlets in the United Kingdom: The Guardian, The Times, The Independent, The Daily Mirror, BBC, Financial Times, Metro, Telegraph, The and The Daily Mail plus a few additional American-based outlets used for comparison reference. The target words are listed in the associated manuscript and are mostly words that denote some type of prejudice, social justice related terms or counterreaction to it. A few additional words are also available since they are used in the manuscript for illustration purposes.
The textual content of news and opinion articles from the outlets listed in Figure 3 of the main manuscript is available in the outlets' online domains and/or public cache repositories such as Google cache (https://webcache.googleusercontent.com), The Internet Wayback Machine (https://archive.org/web/web.php), and Common Crawl (https://commoncrawl.org). We derived relative frequency counts from these sources. Textual content included in our analysis is limited to article headlines and the main body of text of the articles and does not include other article elements such as figure captions.
Targeted textual content was located in raw HTML data using outlet-specific XPath expressions. Tokens were lowercased prior to estimating frequency counts. To prevent outlets with sparse text content for a year from distorting aggregate frequency counts, we only include outlet frequency counts from years for which there are at least 1 million words of article content from an outlet.
The yearly usage frequency of a target word in an outlet was estimated by dividing the total number of occurrences of the target word in all of that outlet's articles for a given year by the total number of words in all of those articles. This method of estimating frequency accounts for variable volume of total article output over time.
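The estimation described above can be sketched as follows. This is a minimal illustration, not the authors' analysis script: the function name, the outlet labels, and all counts are hypothetical, and only the division and the 1-million-word inclusion threshold come from the text.

```python
from typing import Dict, Optional, Tuple

MIN_WORDS_PER_YEAR = 1_000_000  # inclusion threshold stated in the text


def relative_frequency(target_count: int, total_words: int) -> Optional[float]:
    """Occurrences of a target word divided by all words for one outlet-year.

    Returns None when the outlet-year has fewer than 1 million words,
    mirroring the exclusion rule described above.
    """
    if total_words < MIN_WORDS_PER_YEAR:
        return None
    return target_count / total_words


# Hypothetical counts keyed by (outlet, year): (target word count, total words).
counts: Dict[Tuple[str, int], Tuple[int, int]] = {
    ("Outlet A", 2015): (1_200, 40_000_000),
    ("Outlet B", 2015): (30, 500_000),  # too sparse, so it is excluded
}

frequencies = {
    key: relative_frequency(target, total)
    for key, (target, total) in counts.items()
}
print(frequencies)
```

Dividing by the total word count, rather than reporting raw occurrences, keeps years with very different article volumes comparable.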
The compressed files in this data set are:
- analysisScripts.rar contains the analysis scripts used in the main manuscript
- targetWordsInArticlesCounts.rar contains counts of target words in outlets' articles as well as total counts of words in articles
- targetWordsInArticlesCountsGuardianExampleWords contains counts of target words in outlets' articles as well as total counts of words in articles for illustrative Figure 1 in the main manuscript
Usage Notes
In a small percentage of articles, outlet-specific XPath expressions can fail to properly capture the content of the article due to the heterogeneity of HTML elements and CSS styling combinations with which article text is arranged in outlets' online domains. As a result, the total and target word count metrics for a small subset of articles are not precise. In a random sample of articles and outlets, manual estimation of target word counts overlapped with the automatically derived counts for over 90% of the articles.
Most of the incorrect frequency counts were minor deviations from the actual counts. One example is counting the word "Facebook" in an article footnote that encourages readers to follow the journalist’s Facebook profile, which the XPath expression mistakenly included as part of the article's main text. In a data analysis of 16 million articles, we cannot manually check the correctness of frequency counts for every single article, and one hundred percent accuracy at capturing articles’ content is elusive due to the small number of difficult-to-detect boundary cases such as incorrect HTML markup syntax in online domains. Overall, however, we are confident that our frequency metrics are representative of word prevalence in print news media content (see Figure 1 of main manuscript for supporting evidence).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Consumer Spending in the United States increased to 16445.70 USD Billion in the second quarter of 2025 from 16345.80 USD Billion in the first quarter of 2025. This dataset provides the latest reported value for - United States Consumer Spending - plus previous releases, historical high and low, short-term forecast and long-term prediction, economic calendar, survey consensus and news.
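As a quick worked check of the quarter-over-quarter change quoted above, the sketch below computes the absolute and percentage difference between the two reported values. The helper name is illustrative; the two figures are taken from the text.

```python
def percent_change(previous: float, current: float) -> float:
    """Percentage change from `previous` to `current`."""
    return (current - previous) / previous * 100.0


q1_2025 = 16345.80  # USD Billion, first quarter of 2025
q2_2025 = 16445.70  # USD Billion, second quarter of 2025

change = percent_change(q1_2025, q2_2025)
print(f"{q2_2025 - q1_2025:.2f} USD Billion, {change:.2f}%")
```

The increase is 99.90 USD Billion, or about 0.61 percent quarter over quarter.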
https://borealisdata.ca/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.5683/SP3/BPJP9U
Introduction
Topic Detection and Tracking (TDT) refers to automatic techniques for finding topically related material in streams of data such as newswire and broadcast news. The TDT3 corpus was created to support three TDT3 tasks: to find topically homogeneous sections (segmentation), to detect the occurrence of new events (detection) and to track the reoccurrence of old or new events (tracking).
Data
TDT3 Multilanguage Text Corpus Version 2.0 is the first general release of this collection (Version 1.0 was made available only to participants in the TDT 1999 and 2000 evaluation tests). It contains data from the same nine sources found in TDT2, plus two additional English television sources. Like TDT2, it provides both manually-created and automatically-generated text for most sources. For TDT3, the daily collection took place over a period of three months (October - December 1998). The sources and approximate number of stories per source are as follows:
| English sources | Thousands of stories |
|---|---|
| New York Times Newswire Service | 6.9 |
| Associated Press Worldstream Service | 7.3 |
| Cable News Network, "Headline News" | 9.0 |
| American Broadcasting Co., "World News Tonight" | 1.0 |
| Public Radio International, "The World" | 1.6 |
| Voice of America, English news programs | 3.9 |
| MS-NBC, "News with Brian Williams" | 0.7 |
| National Broadcasting Co., "NBC Nightly News" | 0.8 |
| Total English stories | 31.2 |

| Mandarin sources | Thousands of stories |
|---|---|
| Xinhua News Agency | 5.2 |
| Zaobao News Agency | 3.8 |
| Voice of America, Mandarin Chinese news programs | 3.8 |
| Total Mandarin stories | 12.8 |

The goal of Topic Detection and Tracking - Phase 3 (TDT3) is to create core technology to monitor multiple streams of news in multiple languages and media (newswire, radio, television, web sites or some future combination or innovation), segmenting the streams into individual stories, detecting new topics and tracking all stories discussing them.
In addition to the TDT2 tasks of segmentation, detection and tracking, TDT3 adds the tasks of first story detection and story-link detection. The goal of the latter is to detect links between stories that discuss the same topic even though the topic has not been defined in advance.
There are two types of files in this publication:
asr_sgm -- text data output from automatic speech recognition (ASR) systems in English and Mandarin, formatted in "TIPSTER-style" SGML, derived from the audio recordings of radio and TV broadcasts.
tkn_sgm -- reference text data (newswire, closed captions and manual transcripts), formatted in "TIPSTER-style" SGML
Samples
Please view this asr_sgm sample and tkn_sgm sample.
Updates
07/21/16 - Topic tables added.
07/01/16 - Topic tables updated to v4.0.
Copyright
Portions © 1998 American Broadcasting Company, The Associated Press, Cable News Network, LP, LLLP, National Broadcasting Company, Inc., New York Times, Public Radio International, SPH AsiaOne Ltd, Xinhua News Agency, © 1998-2001 Trustees of the University of Pennsylvania
The World is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston.
The New York Times is releasing a series of data files with cumulative counts of coronavirus cases in the United States, at the state and county level, over time. We are compiling this time series data from state and local governments and health departments in an attempt to provide a complete record of the ongoing outbreak.
Since late January, The Times has tracked cases of coronavirus in real time as they were identified after testing. Because of the widespread shortage of testing, however, the data is necessarily limited in the picture it presents of the outbreak.
We have used this data to power our maps and reporting tracking the outbreak, and it is now being made available to the public in response to requests from researchers, scientists and government officials who would like access to the data to better understand the outbreak.
The data begins with the first reported coronavirus case in Washington State on Jan. 21, 2020. We will publish regular updates to the data in this repository.
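Because the files hold cumulative counts, a common first step is to difference them into per-day increments. The sketch below shows that step on a plain list. The numbers are hypothetical and the function name is illustrative; the actual file layout is not described in this text, so loading the data is left out.

```python
from typing import List


def daily_new_cases(cumulative: List[int]) -> List[int]:
    """Convert a cumulative time series into per-day increments.

    The first increment equals the first cumulative value, since the
    series is assumed to start from zero before the first report.
    """
    daily = []
    previous = 0
    for total in cumulative:
        daily.append(total - previous)
        previous = total
    return daily


# Hypothetical cumulative counts for one location on consecutive days.
cumulative_cases = [1, 1, 2, 5, 5, 9]
print(daily_new_cases(cumulative_cases))
```

If a source later revises a cumulative total downward, the corresponding increment comes out negative; this sketch leaves such values untouched rather than guessing at a correction.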
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Unemployment Rate in the United States increased to 4.40 percent in September from 4.30 percent in August of 2025. This dataset provides the latest reported value for - United States Unemployment Rate - plus previous releases, historical high and low, short-term forecast and long-term prediction, economic calendar, survey consensus and news.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Consumer Confidence in the United States decreased to 51 points in November from 53.60 points in October of 2025. This dataset provides the latest reported value for - United States Consumer Sentiment - plus previous releases, historical high and low, short-term forecast and long-term prediction, economic calendar, survey consensus and news.
A global survey conducted in the third quarter of 2024 found that the main reason for using social media was to keep in touch with friends and family, with 50.8 percent of social media users saying this was their main reason for using online networks. Overall, 39 percent of social media users said that filling spare time was their main reason for using social media platforms, whilst 34.5 percent of respondents said they used it to read news stories. Fewer than one in five users were on social platforms to follow celebrities and influencers.
The most popular social network
Facebook dominates the social media landscape. The world's most popular social media platform turned 20 in February 2024, and it continues to lead the way in terms of user numbers. As of February 2025, the social network had over three billion global users. YouTube, Instagram, and WhatsApp follow, but none of these well-known brands can surpass Facebook’s audience size.
Moreover, as of the final quarter of 2023, there were almost four billion Meta product users.
Ever-evolving social media usage
Using social media remains largely free of charge; however, companies have been encouraging users to become paid subscribers to reduce dependence on advertising profits. Meta Verified entices users by offering a blue verification badge and proactive account protection, among other things. X (formerly Twitter), Snapchat, and Reddit also offer users the chance to upgrade their social media accounts for a monthly fee.
This dataset was created by Qerav23