100+ datasets found

Science and tech news dataset
ieee-dataport.org
Updated Oct 27, 2021
Cite
Rajat Thakur (2021). Science and tech news dataset [Dataset]. https://ieee-dataport.org/documents/science-and-tech-news-dataset
Dataset updated
Oct 27, 2021
Authors
Rajat Thakur
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains world news related to science and technology, along with each article's available metadata.
Spanish Fake News Dataset
produccioncientifica.ucm.es
zenodo.org
Updated 2025
Cite
Tretiakov, Arsenii; D'Antonio Maceiras, Sergio; Martín, Alejandro (2025). Spanish Fake News Dataset [Dataset]. https://produccioncientifica.ucm.es/documentos/685699246364e456d3a66786
Dataset updated
2025
Authors
Tretiakov, Arsenii; D'Antonio Maceiras, Sergio; Martín, Alejandro
Description
Spanish Fake News Dataset

This dataset contains a structured and annotated collection of false news items in Spanish (Castilian), gathered and processed for academic research on misinformation.

Dataset Scope

The dataset represents most of the recorded false news messages and their variations up to 01.02.2021.

Content Description

The dataset includes samples of false information in various formats:

News articles and headlines

Tweets and Facebook/Instagram/Telegram posts

YouTube video captions

WhatsApp text and voice message transcripts

Transcribed video/audio fragments with false claims

Fake government documents

Captions from photos and memes

Text extracted from images using OCR

Only Spanish (Castilian) texts were used, excluding regional variants (e.g., Catalan, Basque, Galician) for consistency.

Sources

The data was collected from the following verified fact-checking initiatives:

Maldito Bulo

Newtral

AFP Factual

Fact-checkers from these organizations provide detailed articles identifying and explaining falsehoods, often including:

General context of the event

Quotes or links to false claims

Analysis and explanation of why the claims are false

Verified information or corrections

Collection Method

The dataset was built using both manual extraction (e.g., identifying and quoting false statements) and automated parsing:

MyNews service: an archive of Spanish mass media

Custom scripts: for parsing and extracting structured data

OCR tools: for extracting text from images (e.g., memes and screenshots)

Fields Description

| Column Name | Description |
| --- | --- |
| Topic | The thematic category of the news item (e.g., Politics, Health, COVID-19, Crime). Normalized and translated to English. |
| Link source | URL to the original news piece, fact-check report, or source of the claim. Invalid links were removed. |
| Media | The platform or outlet where the false claim appeared (e.g., Facebook, YouTube, WhatsApp). Normalized for consistent spelling and language. |
| Date | Publication or verification date of the news item, in YYYY-MM-DD format. |
| Author | (Optional) Author of the news or platform source, if available. May be empty. |
| Headlines | Title or summary of the news item or article containing the false information. |
| Fake statement | Quoted false claim or misinformation as cited in the verification article. |
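
The field list above can be checked mechanically once the data is downloaded. The following is a minimal sketch, assuming the dataset ships as a CSV file whose header row uses exactly the column names listed above (the listing does not confirm the file format or the header spelling). The two sample rows are placeholders invented for illustration and are not taken from the dataset.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Column names as listed in the field description; whether the published
# file uses exactly these header strings is an assumption.
EXPECTED_COLUMNS = [
    "Topic", "Link source", "Media", "Date",
    "Author", "Headlines", "Fake statement",
]


def load_rows(handle):
    """Read rows from a CSV handle and check that every expected column exists."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


def has_valid_date(row):
    """Return True when the Date field parses as YYYY-MM-DD."""
    try:
        datetime.strptime(row["Date"], "%Y-%m-%d")
        return True
    except ValueError:
        return False


def topic_counts(rows):
    """Count rows per Topic value."""
    return Counter(row["Topic"] for row in rows)


# Placeholder rows invented for illustration; the second has an invalid month.
SAMPLE = (
    "Topic,Link source,Media,Date,Author,Headlines,Fake statement\n"
    "Health,https://example.org/a,WhatsApp,2020-03-15,,Example headline,Example claim\n"
    "Politics,https://example.org/b,Facebook,2020-13-01,,Another headline,Another claim\n"
)

rows = load_rows(io.StringIO(SAMPLE))
valid_rows = [r for r in rows if has_valid_date(r)]
print(topic_counts(rows))
print(f"{len(valid_rows)} of {len(rows)} rows have a well-formed date")
```

To run this against the real file, replace `io.StringIO(SAMPLE)` with an opened file handle; the empty Author values in the sample reflect the note that this field may be empty.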

⚠️ Notes

The dataset was preprocessed to remove duplicates, invalid links, and non-textual clutter.

Field values were normalized to support multilingual and cross-platform analysis.

Only Castilian Spanish was retained for consistency and clarity.

📚 License & Use

This dataset is intended for non-commercial academic and research purposes. Please cite the original fact-checking organizations and this dataset if used in publications or analysis.
CT-FAN: A Multilingual dataset for Fake News Detection
data.niaid.nih.gov
Updated Oct 23, 2022
Cite
Michael Wiegand (2022). CT-FAN: A Multilingual dataset for Fake News Detection [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4714516
Dataset updated
Oct 23, 2022
Dataset provided by
Gautam Kishore Shahi
Juliane Köhler
Thomas Mandl
Michael Wiegand
Melanie Siegel
Julia Maria Struß
Description
By downloading the data, you agree with the terms & conditions mentioned below:

Data Access: The data in the research collection may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so you must be careful to use them only for research purposes.

Summaries, analyses and interpretations of the linguistic properties of the information may be derived and published, provided it is impossible to reconstruct the information from these summaries. You may not try identifying the individuals whose texts are included in this dataset. You may not try to identify the original entry on the fact-checking site. You are not permitted to publish any portion of the dataset besides summary statistics or share it with anyone else.

We grant you the right to access the collection's content as described in this agreement. You may not otherwise make unauthorised commercial use of, reproduce, prepare derivative works, distribute copies, perform, or publicly display the collection or parts of it. You are responsible for keeping and storing the data in a way that others cannot access. The data is provided free of charge.

Citation

Please cite our work as

@InProceedings{clef-checkthat:2022:task3, author = {K{\"o}hler, Juliane and Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Wiegand, Michael and Siegel, Melanie and Mandl, Thomas}, title = "Overview of the {CLEF}-2022 {CheckThat}! Lab Task 3 on Fake News Detection", year = {2022}, booktitle = "Working Notes of CLEF 2022---Conference and Labs of the Evaluation Forum", series = {CLEF~'2022}, address = {Bologna, Italy},}

@article{shahi2021overview, title={Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection}, author={Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas}, journal={Working Notes of CLEF}, year={2021} }

Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English and German.

Task 3: Multi-class fake news detection of news articles (English). Sub-task A frames fake news detection as a four-class classification problem. Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. The training data will be released in batches and comprises roughly 1,264 English-language articles with their respective labels. Our definitions for the categories are as follows:

False - The main claim made in an article is untrue.

Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.

True - This rating indicates that the primary elements of the main claim are demonstrably true.

Other - An article that cannot be categorised as true, false, or partially false due to a lack of evidence about its claims. This category includes articles in dispute and unproven articles.

Cross-Lingual Task (German)

Along with the multi-class task for the English language, we have introduced a task for a low-resource language. We will provide the test data in German. The idea of the task is to use the English data and transfer learning to build a classification model for German.

Input Data

The data will be provided with the columns ID, title, text, rating, and domain. The columns are described as follows:

ID - Unique identifier of the news article

Title - Title of the news article

text - Text of the news article

our rating - Class of the news article: false, partially false, true, or other

Output data format

public_id - Unique identifier of the news article

predicted_rating - Predicted class

Sample File

public_id, predicted_rating
1, false
2, true
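
A submission file in the output format above could be produced along these lines. This is a minimal sketch, not the organisers' tooling: the function name `write_predictions` is hypothetical, the label set is taken from the four categories defined earlier, and writing the fields without a space after the comma is an assumption, since the sample file does not make the exact delimiter rules clear.

```python
import csv
import io

# The four classes named in the task description, lower-cased as in the sample file.
ALLOWED_RATINGS = {"false", "partially false", "true", "other"}


def write_predictions(predictions, handle):
    """Write (public_id, predicted_rating) pairs as CSV with the sample file's header."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["public_id", "predicted_rating"])
    for public_id, rating in predictions:
        label = rating.strip().lower()
        # Reject anything outside the four documented classes before writing.
        if label not in ALLOWED_RATINGS:
            raise ValueError(f"unknown rating for {public_id}: {rating!r}")
        writer.writerow([public_id, label])


# Hypothetical predictions mirroring the two rows of the sample file.
buffer = io.StringIO()
write_predictions([(1, "false"), (2, "True")], buffer)
output = buffer.getvalue()
print(output)
```

Normalising the label before writing keeps a stray capital letter or trailing space from turning into an unrecognised class at scoring time.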

IMPORTANT!

We have used data from 2010 to 2022, and the fake news content covers a mix of topics such as elections and COVID-19.

Baseline: For this task, we have created a baseline system. The baseline system can be found at https://zenodo.org/record/6362498

Related Work

Shahi GK. AMUSED: An Annotation Framework of Multi-modal Social Media Data. arXiv preprint arXiv:2010.00502. 2020 Oct 1. https://arxiv.org/pdf/2010.00502.pdf

G. K. Shahi and D. Nandini, “FakeCovid – a multilingual cross-domain fact check news dataset for covid-19,” in workshop Proceedings of the 14th International AAAI Conference on Web and Social Media, 2020. http://workshop-proceedings.icwsm.org/abstract?id=2020_14

Shahi, G. K., Dirkson, A., & Majchrzak, T. A. (2021). An exploratory study of covid-19 misinformation on twitter. Online Social Networks and Media, 22, 100104. doi: 10.1016/j.osnem.2020.100104

Shahi, G. K., Struß, J. M., & Mandl, T. (2021). Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection. Working Notes of CLEF.

Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeño, A., Míguez, R., Shaar, S., ... & Mandl, T. (2021, March). The CLEF-2021 CheckThat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news. In European Conference on Information Retrieval (pp. 639-649). Springer, Cham.

Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeño, A., Míguez, R., Shaar, S., ... & Kartal, Y. S. (2021, September). Overview of the CLEF–2021 CheckThat! Lab on Detecting Check-Worthy Claims, Previously Fact-Checked Claims, and Fake News. In International Conference of the Cross-Language Evaluation Forum for European Languages (pp. 264-291). Springer, Cham.
cc_news
huggingface.co
Updated Jul 3, 2018
Cite
Vladimir Blagojevic (2018). cc_news [Dataset]. https://huggingface.co/datasets/vblagoje/cc_news
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 3, 2018
Authors
Vladimir Blagojevic
License
https://choosealicense.com/licenses/unknown/
Description
Dataset Card for CC-News

Dataset Summary

The CC-News dataset contains news articles from news sites all over the world. The data is available on AWS S3 in the Common Crawl bucket at /crawl-data/CC-NEWS/. This version of the dataset has been prepared using news-please, an integrated web crawler and information extractor for news. It contains 708,241 English-language news articles published between January 2017 and December 2019. It represents a small portion of the English… See the full description on the dataset page: https://huggingface.co/datasets/vblagoje/cc_news.
all-the-news
huggingface.co
Updated Aug 17, 2019
Cite
TabMaven (2019). all-the-news [Dataset]. https://huggingface.co/datasets/TabMaven/all-the-news
Dataset updated
Aug 17, 2019
Dataset authored and provided by
TabMaven
Description
TabMaven/all-the-news dataset hosted on Hugging Face and contributed by the HF Datasets community
All-Daily-News
huggingface.co
Updated Sep 3, 2024
Cite
Papers With Backtest (2024). All-Daily-News [Dataset]. https://huggingface.co/datasets/paperswithbacktest/All-Daily-News
Dataset updated
Sep 3, 2024
Dataset authored and provided by
Papers With Backtest
License
https://choosealicense.com/licenses/other/
Description
Dataset Information

This dataset includes news data for various instruments.

Instruments Included

Stocks, ETFs, Forex, Cryptocurrencies, Commodities and more.

Dataset Columns

symbols: The symbols in the news, typically representing stock tickers or other financial instruments mentioned in the article.

datetime: The date and time when the news article was published, formatted as a string.

title: The title of the news article, providing a brief and descriptive… See the full description on the dataset page: https://huggingface.co/datasets/paperswithbacktest/All-Daily-News.
All The News text
kaggle.com
Updated Mar 29, 2020
+ more versions
Cite
Alexey Voytsekhovskiy (2020). All The News text [Dataset]. https://www.kaggle.com/datasets/alexvoy/all-the-news-text
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Mar 29, 2020
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Alexey Voytsekhovskiy
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Dataset

This dataset was created by Alexey Voytsekhovskiy

Released under CC0: Public Domain

Contents
Covid-19 and vaccine news dataset
ieee-dataport.org
Updated Oct 27, 2021
+ more versions
Cite
Rajat Thakur (2021). Covid-19 and vaccine news dataset [Dataset]. https://ieee-dataport.org/documents/covid-19-and-vaccine-news-dataset
Dataset updated
Oct 27, 2021
Authors
Rajat Thakur
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains world news related to Covid-19 and vaccines, along with each article's available metadata.
hausa_voa_topics
huggingface.co
Updated Mar 25, 2025
Cite
LSV @ Saarland University (2025). hausa_voa_topics [Dataset]. https://huggingface.co/datasets/UdS-LSV/hausa_voa_topics
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Mar 25, 2025
Dataset authored and provided by
LSV @ Saarland University
License
https://choosealicense.com/licenses/unknown/
Description
Dataset Card for Hausa VOA News Topic Classification dataset (hausa_voa_topics)

Dataset Summary

A news headline topic classification dataset, similar to AG-news, for Hausa. The news headlines were collected from VOA Hausa.

Supported Tasks and Leaderboards

[More Information Needed]

Languages

Hausa (ISO 639-1: ha)

Dataset Structure

Data Instances

An instance consists of a news title sentence and the corresponding topic label.… See the full description on the dataset page: https://huggingface.co/datasets/UdS-LSV/hausa_voa_topics.
Dataset - Male in the news
workwithdata.com
Updated Jun 5, 2025
Cite
(2025). Dataset - Male in the news [Dataset]. https://www.workwithdata.com/news?pk=Male
Dataset updated
Jun 5, 2025
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset - Male in the news
Data from: News sentiment
kaggle.com
Updated Mar 11, 2021
Cite
Kaushik Soni (2021). News sentiment [Dataset]. https://kaggle.com/kaushiksoni10/news-sentiment
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Mar 11, 2021
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Kaushik Soni
Description
Content

This data contains news titles and headlines from different sources on different topics. The columns are described below:

| Column | DataType | Description |
| --- | --- | --- |
| IDLink | str | Unique identifier of the row |
| Title | str | Title of the news |
| Headline | str | Headline of the news |
| Source | str | Newspaper/news-source |
| Topic | str | News-topic (values : obama, economy, microsoft, palestine) |
| PublishDate | Timestamp | publish date & time |
| Facebook | int | facebook rating |
| GooglePlus | int | google plus rating |
| LinkedIn | int | linkedin rating |

Inspiration

One of the main tasks this dataset supports is sentiment analysis: find the sentiment scores for each title and headline of the test data by applying regression analysis.
Checking the news on weekdays in the U.S. 2018, by daypart
statista.com
Updated Jul 10, 2025
Cite
Statista (2025). Checking the news on weekdays in the U.S. 2018, by daypart [Dataset]. https://www.statista.com/statistics/816469/check-news-typical-weekday-us-by-daypart/
Dataset updated
Jul 10, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
Jan 17, 2018 - Jan 23, 2018
Area covered
United States
Description
This graph displays the time of day when consumers check the news on a typical weekday in the United States as of ************. During the survey, it was found that ** percent of consumers check the news in the early morning of a typical weekday.
Media Coding Dataset for News Content Analysis
zenodo.org
Updated Jun 29, 2025
Cite
Stavros Doropoulos; Elisavet Karapalidou; Polychronis Charitidis; Sophia Karakeva; Stavros Vologiannidis (2025). Media Coding Dataset for News Content Analysis [Dataset]. http://doi.org/10.5281/zenodo.15767938
Unique identifier
https://doi.org/10.5281/zenodo.15767938
Dataset updated
Jun 29, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Stavros Doropoulos; Elisavet Karapalidou; Polychronis Charitidis; Sophia Karakeva; Stavros Vologiannidis
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset accompanies the study Beyond Manual Media Coding: Evaluating Large Language Models and Agents for News Content Analysis.

It provides a reproducible benchmark for evaluating automated content analysis methods against human-annotated ground truth.

The dataset includes:

articles.csv
Contains the 200 news articles collected for this study, each with:

id: unique identifier

url: source URL of the original article

content: full text of the news article

codebook.json
A structured JSON file defining the 26-question analysis codebook used for annotation.
Each question entry specifies:

questionId: question ID (e.g., Q1)

prompt: annotation question text

questionAnswerType: type (SINGLE_CHOICE or MULTI_CHOICE)

eligibleQuestionAnswers: list of possible tags / codes

annotations.json
Contains the complete human annotation data.
For each article id, it provides the list of responses to all 26 codebook questions as determined by an expert annotator, establishing the ground truth labels.
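
The relationship between codebook.json and annotations.json can be illustrated with a small consistency check. This is a sketch under stated assumptions: the codebook field names follow the description above, but the listing does not give the exact layout of annotations.json, so the mapping of article id to `{questionId: [answers]}` used here is a guess. The two-question codebook, the answers, and the function name `find_problems` are all invented for illustration.

```python
import json

# Two-question codebook invented for illustration; field names follow the
# codebook.json description (the real codebook has 26 questions).
CODEBOOK_JSON = """
[
  {"questionId": "Q1", "prompt": "Placeholder single-choice question",
   "questionAnswerType": "SINGLE_CHOICE", "eligibleQuestionAnswers": ["A", "B"]},
  {"questionId": "Q2", "prompt": "Placeholder multi-choice question",
   "questionAnswerType": "MULTI_CHOICE", "eligibleQuestionAnswers": ["X", "Y", "Z"]}
]
"""

# ASSUMPTION: annotations map article id -> {questionId: [answers]}.
# The actual structure of annotations.json may differ.
ANNOTATIONS_JSON = """
{"1": {"Q1": ["A"], "Q2": ["X", "Z"]},
 "2": {"Q1": ["A", "B"], "Q2": ["W"]}}
"""


def find_problems(codebook, annotations):
    """Return (article_id, question_id, message) tuples for inconsistent answers."""
    questions = {q["questionId"]: q for q in codebook}
    problems = []
    for article_id, responses in annotations.items():
        for question_id, question in questions.items():
            answers = responses.get(question_id)
            if answers is None:
                problems.append((article_id, question_id, "missing"))
                continue
            # A single-choice question should carry exactly one answer.
            if question["questionAnswerType"] == "SINGLE_CHOICE" and len(answers) != 1:
                problems.append((article_id, question_id, "expected exactly one answer"))
            # Every answer should be one of the eligible codes.
            unknown = [a for a in answers if a not in question["eligibleQuestionAnswers"]]
            if unknown:
                problems.append((article_id, question_id, f"unknown answers: {unknown}"))
    return problems


problems = find_problems(json.loads(CODEBOOK_JSON), json.loads(ANNOTATIONS_JSON))
for problem in problems:
    print(problem)
```

The same check could be pointed at model-generated annotations before scoring them against the human labels, so that malformed outputs are separated from genuine disagreements.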

Intended use

Designed for research purposes including natural language understanding, content classification, and LLM evaluation.

Please request access with your academic email.
Websites using News
webtechsurvey.com
csv
Updated Apr 22, 2024
+ more versions
Cite
WebTechSurvey (2024). Websites using News [Dataset]. https://webtechsurvey.com/technology/news
Available download formats: csv
Dataset updated
Apr 22, 2024
Dataset authored and provided by
WebTechSurvey
License
https://webtechsurvey.com/terms
Time period covered
2025
Area covered
Global
Description
A complete list of live websites using the News technology, compiled through global website indexing conducted by WebTechSurvey.
Leading news websites in the U.S. 2025, by monthly visits
tokrwards.com
Updated Jun 25, 2025
Cite
The citation is currently not available for this dataset.
Dataset updated
Jun 25, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
Jul 2024
Area covered
United States
Description
In April 2025, the news website with the most monthly visits in the United States was nytimes.com, with a total of ***** million monthly visits in that month. In second place was cnn.com with just over *** million visits, followed by foxnews.com with almost a ****** of a million.

Online news consumption in the U.S.

Americans get their news in a variety of ways, but social media is an increasingly popular option. A survey on social media news consumption revealed that ** percent of Twitter users regularly used the site for news, and Facebook and Reddit were also popular for news among their users. Interestingly though, social media is the least trusted news source in the United States.

News and trust

Trust in news sources has become increasingly important to the American news consumer amidst the spread of fake news, and the public are more vocal about whether or not they have faith in a source to report news correctly. Ongoing discussions about the credibility, accuracy and bias of news networks, anchors, TV show hosts, and news media professionals mean that those looking to keep up to date tend to be more cautious than ever before. In general, news audiences are skeptical. In 2020, just **** percent of respondents to a survey investigating the perceived objectivity of the mass media reported having a great deal of trust in the media to report news fully, accurately, and fairly.
Tweets – PAP News Dataset - Dataset - LDM
service.tib.eu
Updated Dec 3, 2024
+ more versions
Cite
(2024). Tweets – PAP News Dataset - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/tweets---pap-news-dataset
Dataset updated
Dec 3, 2024
Description
New annotated datasets linking tweets and articles, including Tweets – PAP News Dataset, Tweets – BBC News Dataset, Cascades – PAP News Dataset, and Cascades – BBC News Dataset.
Global News Index and Extracted Features Repository (v.1.2.0)
databank.illinois.edu
Updated Mar 5, 2025
+ more versions
Cite
(2025). Global News Index and Extracted Features Repository (v.1.2.0) [Dataset]. http://doi.org/10.13012/B2IDB-5649852_V5
Unique identifier
https://doi.org/10.13012/B2IDB-5649852_V5
Dataset updated
Mar 5, 2025
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Cline Center Global News Index is a searchable database of textual features extracted from millions of news stories, specifically designed to provide comprehensive coverage of events around the world. In addition to searching documents for keywords, users can query metadata and features such as named entities extracted using Natural Language Processing (NLP) methods and variables that measure sentiment and emotional valence.

Archer is a web application purpose-built by the Cline Center to enable researchers to access data from the Global News Index. Archer provides a user-friendly interface for querying the Global News Index (with the back-end indexing still handled by Solr). By default, queries are built using icons and drop-down menus. More technically-savvy users can use Lucene/Solr query syntax via a ‘raw query’ option. Archer allows users to save and iterate on their queries, and to visualize faceted query results, which can be helpful for users as they refine their queries.

Additional Resources:

- Access to Archer and the Global News Index is limited to account-holders. If you are interested in signing up for an account, please fill out the Archer Access Request Form so we can determine if you are eligible for access or not.
- Current users who would like to provide feedback, such as reporting a bug or requesting a feature, can fill out the Archer User Feedback Form.
- The Cline Center sends out periodic email newsletters to the Archer Users Group. Please fill out this form to subscribe to it.

Citation Guidelines:

1) To cite the GNI codebook (or any other documentation associated with the Global News Index and Archer) please use the following citation: Cline Center for Advanced Social Research. 2023. Global News Index and Extracted Features Repository [codebook], v1.2.0. Champaign, IL: University of Illinois. June. XX. doi:10.13012/B2IDB-5649852_V5

2) To cite data from the Global News Index (accessed via Archer or otherwise) please use the following citation (filling in the correct date of access): Cline Center for Advanced Social Research. 2023. Global News Index and Extracted Features Repository [database], v1.2.0. Champaign, IL: University of Illinois. Jun. XX. Accessed Month, DD, YYYY. doi:10.13012/B2IDB-5649852_V5

*NOTE: V4 is suppressed and V5 is replacing V4 with updated ‘Archer’ documents.
BBC Datasets
brightdata.com
.json, .csv, .xlsx
Updated Nov 8, 2017
Cite
Bright Data (2017). BBC Datasets [Dataset]. https://brightdata.com/products/datasets/bbc
Available download formats: .json, .csv, .xlsx
Dataset updated
Nov 8, 2017
Dataset authored and provided by
Bright Data (https://brightdata.com/)
License
https://brightdata.com/license
Area covered
Worldwide
Description
Unlock the full potential of BBC broadcast data with our comprehensive dataset featuring transcripts, program schedules, headlines, topics, and multimedia resources. This all-in-one dataset is designed to empower media analysts, researchers, journalists, and advocacy groups with actionable insights for media analysis, transparency studies, and editorial assessments.

Dataset Features

Transcripts: Access detailed broadcast transcripts, including headlines, content, author details, and publication dates. Perfect for analyzing media framing, topic frequency, and news narratives across various programs.

Program Schedules: Explore program schedules with accurate timing, show names, and related metadata to track news coverage patterns and identify trends.

Topics and Keywords: Analyze categorized topics and keywords to understand content diversity, editorial focus, and recurring themes in news broadcasts.

Multimedia Content: Gain access to videos, images, and related articles linked to each broadcast for a holistic understanding of the news presentation.

Metadata: Includes critical data points like publication dates, last updates, content URLs, and unique IDs for easier referencing and cross-analysis.

Customizable Subsets for Specific Needs

Our BBC dataset is fully customizable to match your research or analytical goals. Focus on transcripts for in-depth media framing analysis, extract multimedia for content visualization studies, or dive into program schedules for broadcast trend analysis. Tailor the dataset to ensure it aligns with your objectives for maximum efficiency and relevance.

Popular Use Cases

Media Analysis: Evaluate news framing, content diversity, and topic coverage to assess editorial direction and media focus.

Transparency Studies: Analyze journalistic standards, corrections, and retractions to assess media integrity and accountability.

Audience Engagement: Identify recurring topics and trends in news content to understand audience preferences and behavior.

Market Analysis: Track media coverage of key industries, companies, and topics to analyze public sentiment and industry relevance.

Journalistic Integrity: Use transcripts and metadata to evaluate adherence to reporting practices, fairness, and transparency in news coverage.

Research and Scholarly Studies: Leverage transcripts and multimedia to support academic studies in journalism, media criticism, and political discourse analysis.

Whether you are evaluating transparency, conducting media criticism, or tracking broadcast trends, our BBC dataset provides you with the tools and insights needed for in-depth research and strategic analysis. Customize your access to focus on the most relevant data points for your unique needs.
NEWS-COPY-eval
huggingface.co
Updated Mar 18, 2024
Cite
Chenghao Mou (2024). NEWS-COPY-eval [Dataset]. https://huggingface.co/datasets/chenghao/NEWS-COPY-eval
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Mar 18, 2024
Authors
Chenghao Mou
License
https://choosealicense.com/licenses/unknown/
Description
NEWS COPY

This dataset contains the evaluation and test sets for the NEWS COPY dataset. Original source can be found at Github. The license is unclear. It contains the following data:

Historical Newspapers

Training datasets can be found at chenghao/NEWS-COPY-train.

Citation

@inproceedings{silcock-etal-2020-noise, title = "Noise-Robust De-Duplication at Scale", author = "Silcock, Emily and D'Amico-Wong, Luca and Yang, Jinglin and Dell, Melissa", booktitle =… See the full description on the dataset page: https://huggingface.co/datasets/chenghao/NEWS-COPY-eval.
Replication Data for: How the News Media Activates Public Expression and...
dataverse.harvard.edu
search.dataone.org
Updated Nov 13, 2017
Cite
Gary King; Benjamin Schneer (2017). Replication Data for: How the News Media Activates Public Expression and Influences National Agendas [Dataset]. http://doi.org/10.7910/DVN/1EMHTK
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/1EMHTK
Dataset updated
Nov 13, 2017
Dataset provided by
Harvard Dataverse
Authors
Gary King; Benjamin Schneer
License
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/1EMHTK
Description
We demonstrate that the news media causes Americans to take public stands on issues, join national policy conversations, and express themselves publicly more often than they would otherwise --- all key components of democratic politics. We recruited 48 mostly small media outlets that allowed us to choose groups of outlets to write and publish articles, on subjects we approved, and dates we randomly assigned. We estimate the causal effect on proximal measures, such as website pageviews and Twitter discussion of the articles' specific subjects, and distal ones, such as national Twitter conversation in broad policy areas. Our intervention increased discussion in each broad policy area by $\approx$ 62.7% (relative to a day's volume), accounting for 13,166 additional posts, with similar effects across population subgroups.