https://brightdata.com/license
Stay ahead with our comprehensive News Dataset, designed for businesses, analysts, and researchers to track global events, monitor media trends, and extract valuable insights from news sources worldwide.
Dataset Features
News Articles: Access structured news data, including headlines, summaries, full articles, publication dates, and source details. Ideal for media monitoring and sentiment analysis.
Publisher & Source Information: Extract details about news publishers, including domain, region, and credibility indicators.
Sentiment & Topic Classification: Analyze news sentiment, categorize articles by topic, and track emerging trends in real time.
Historical & Real-Time Data: Retrieve historical archives or access continuously updated news feeds for up-to-date insights.
Customizable Subsets for Specific Needs
Our News Dataset is fully customizable, allowing you to filter data based on publication date, region, topic, sentiment, or specific news sources. Whether you need broad coverage for trend analysis or focused data for competitive intelligence, we tailor the dataset to your needs.
Popular Use Cases
Media Monitoring & Reputation Management: Track brand mentions, analyze media coverage, and assess public sentiment.
Market & Competitive Intelligence: Monitor industry trends, competitor activity, and emerging market opportunities.
AI & Machine Learning Training: Use structured news data to train AI models for sentiment analysis, topic classification, and predictive analytics.
Financial & Investment Research: Analyze news impact on stock markets, commodities, and economic indicators.
Policy & Risk Analysis: Track regulatory changes, geopolitical events, and crisis developments in real time.
Whether you're analyzing market trends, monitoring brand reputation, or training AI models, our News Dataset provides the structured data you need. Get started today and customize your dataset to fit your business objectives.
This is the Newspaper Collection of the National Library of the Netherlands (KB). "The KB promotes the visibility, usability and longevity of the Dutch Library Collection, defined as the collective holdings of all publicly funded libraries in the Netherlands" (KB mission statement). The following figures answer common questions about the composition of this collection:

What part of the collection is included in the Media Suite? The Media Suite gives access to the KB's newspaper "basic collection". "The basic collection contains approximately 11 million newspaper pages from the Netherlands, the Dutch East Indies, the Antilles, America and Surinam from 1618 to 1995. This is about 15% of all newspapers that have ever been published in the Netherlands" (KB "wat zit er in Delpher?" [what is available via Delpher?]).

What years does the archive cover? The KB newspaper basic collection includes newspapers from 1618 to 1995. The Media Suite harvested all the available items and integrated them into the Media Suite in May 2018.

Figure 1: Number of newspaper articles in the collection over time

How and how often is the data updated in the Media Suite? The collection's metadata and their OCR enrichments are made available to the CLARIAH Media Suite by the KB via their OAI-PMH harvesting endpoint. The latest update to the Media Suite's data from this collection was done in May 2018.

What kind of media is included? The collection includes newspaper content of different types: articles, advertisements, illustrations with captions, and obituaries.

Figure 2: Types of content in the KB newspaper basic collection

What portion of the collection is digital? A large part of the KB newspaper basic collection is digital, and the KB is progressively digitising more newspapers (KB "wat zit er in Delpher?"). Via the Media Suite, users can access the digitised newspapers in the KB Delpher search engine.

Does the collection include enrichments? This collection has undergone optical character recognition (OCR). The OCR output is available via the Media Suite for searching purposes only. To read the OCR, users are redirected to the KB Delpher search engine.

Figure 3: Proportion of OCR-ed content in the KB newspaper basic collection

Where to find more information? KB newspaper collection site (in English) (in Dutch); KB Delpher (newspapers) search engine; KB information about "what is available via Delpher?"
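The harvesting setup itself is not part of this description, but as a rough sketch, an OAI-PMH endpoint of this kind can be harvested in Python with the Sickle library. The endpoint URL and metadata prefix below are placeholders, not the KB's actual values.

from sickle import Sickle  # pip install sickle

# Placeholder endpoint and metadata prefix -- the real values are not given above.
sickle = Sickle("https://example.org/oai")
records = sickle.ListRecords(metadataPrefix="oai_dc", ignore_deleted=True)

for record in records:
    # Each record carries a header (identifier, datestamp) and a metadata dict.
    print(record.header.identifier, record.metadata.get("title"))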
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Largest Bengali Newspaper Dataset for news type classification.
Abstract:
Knowledge is central to human and scientific developments. Natural Language Processing (NLP) allows automated analysis and creation of knowledge. Data is a crucial NLP and machine learning ingredient. The scarcity of open datasets is a well-known problem in machine and deep learning research. This is very much the case for textual NLP datasets in English and other major world languages. For the Bangla language, the situation is even more challenging and the number of large datasets for NLP research is practically nil. We hereby present Potrika, a large single-label Bangla news article textual dataset curated for NLP research from six popular online news portals in Bangladesh (Jugantor, Jaijaidin, Ittefaq, Kaler Kontho, Inqilab, and Somoyer Alo) for the period 2014-2020. The articles are classified into eight distinct categories (National, Sports, International, Entertainment, Economy, Education, Politics, and Science & Technology) providing five attributes (News Article, Category, Headline, Publication Date, and Newspaper Source). The raw dataset contains 185.51 million words and 12.57 million sentences contained in 664,880 news articles. Moreover, using NLP augmentation techniques, we create from the raw (unbalanced) dataset another (balanced) dataset comprising 320,000 news articles with 40,000 articles in each of the eight news categories. Potrika contains both datasets (raw and balanced) to suit a wide range of NLP research. To the best of our knowledge, Potrika is by far the largest and most extensive dataset for news classification.
cite:
@misc{ahmad2022potrika,
title={Potrika: Raw and Balanced Newspaper Datasets in the Bangla Language with Eight Topics and Five Attributes},
author={Istiak Ahmad and Fahad AlQurashi and Rashid Mehmood},
year={2022},
eprint={2210.09389},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Dataset Source - Here
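As a quick illustration of working with the dataset, here is a minimal pandas sketch; the file name and column labels are assumptions based on the five attributes listed in the abstract, not documented names.

import pandas as pd

# Hypothetical file name; columns assumed to follow the five attributes in the abstract.
df = pd.read_csv("potrika_raw.csv")

# Distribution over the eight single-label news categories.
print(df["Category"].value_counts())

# Articles from one source newspaper, e.g. Jugantor, sorted by publication date.
jugantor = df[df["Source"] == "Jugantor"].sort_values("Publication Date")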
The data file includes running text from a representative sample of two German newspapers, Die Welt and Süddeutsche Zeitung, covering the period 1 November 1966 to 30 October 1967.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Introducing the English Newspaper, Books, and Magazine Image Dataset - a diverse and comprehensive collection of images meticulously curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the English language.
Dataset Content & Diversity: Containing a total of 5000 images, this English OCR dataset offers an equal distribution across newspapers, books, and magazines. Within, you'll find a diverse collection of content, including articles, advertisements, cover pages, headlines, callouts, and author sections from a variety of newspapers, books, and magazines. Images in this dataset showcase distinct fonts, writing formats, colors, designs, and layouts.
To ensure the diversity of the dataset and to support building robust text recognition models, we allow only a limited number (fewer than five) of unique images from a single resource. Stringent measures have been taken to exclude any personally identifiable information (PII), and in each image at least 80% of the space contains visible English text.
Images have been captured under varying lighting conditions – both day and night – along with different capture angles and backgrounds, further enhancing dataset diversity. The collection features images in portrait and landscape modes.
All these images were captured by native English speakers to ensure text quality and to avoid toxic content and PII text. We used recent iOS and Android mobile devices with cameras above 5 MP to capture all these images and maintain image quality. In this training dataset, images are available in both JPEG and HEIC formats.
Metadata: Along with the image data you will also receive detailed structured metadata in CSV format. For each image it includes metadata such as device information, source type (newspaper, magazine, or book image), and image type (portrait or landscape). Each image is properly renamed to correspond with the metadata.
The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of English text recognition models.
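As a small illustration, the metadata can be inspected with pandas; the file and column names here are assumptions based on the description above, not the dataset's documented schema.

import pandas as pd

# Hypothetical file and column names inferred from the description.
meta = pd.read_csv("metadata.csv")

# Distribution across newspapers, books and magazines, and across orientations.
print(meta["source_type"].value_counts())
print(meta["image_type"].value_counts())

# Example: select images captured on iOS devices for a held-out evaluation split.
ios_images = meta[meta["device_information"].str.contains("iOS", na=False)]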
Update & Custom Collection: We're committed to expanding this dataset by continuously adding more images with the assistance of our native English-speaking crowd community.
If you require a custom dataset tailored to your guidelines or specific device distribution, feel free to contact us. We're equipped to curate specialized data to meet your unique needs.
Furthermore, using our crowd community we can annotate the images with bounding boxes or transcribe the text in the images to align with your specific requirements.
License: This image dataset, created by FutureBeeAI, is now available for commercial use.
Conclusion: Leverage the power of this image dataset to elevate the training and performance of text recognition, text detection, and optical character recognition models within the realm of the English language. Your journey to enhanced language understanding and processing starts here.
RealNews is a large corpus of news articles from Common Crawl. Data is scraped from Common Crawl, limited to the 5000 news domains indexed by Google News. The authors used the Newspaper Python library to extract the body and metadata from each article. News from Common Crawl dumps from December 2016 through March 2019 was used as training data; articles published in April 2019 from the April 2019 dump were used for evaluation. After deduplication, RealNews is 120 gigabytes without compression.
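For reference, the extraction step can be reproduced with the same Newspaper (newspaper3k) library; a minimal sketch with a placeholder URL:

from newspaper import Article  # pip install newspaper3k

# Placeholder URL; RealNews applied this kind of extraction to pages from Common Crawl dumps.
article = Article("https://example.com/some-news-story")
article.download()
article.parse()

print(article.title)          # headline
print(article.publish_date)   # metadata
print(article.text[:200])     # body text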
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
newspapers with articles published after 1954
Current version: v1.7
Due to copyright restrictions, most of the digitised newspaper articles on Trove were published before 1955. However, some articles published after 1954 have been made available. This repository provides data about digitised newspapers in Trove that have articles available from after 1954 (the 'copyright cliff of death').
The data was extracted from the Trove API using this notebook from the Trove newspapers section of the GLAM Workbench.
The data is available as a CSV file entitled newspapers_post_54.csv and contains the following fields:
title – the full title of the newspaper
state – the state in which the newspaper was published
id – Trove's unique identifier for this newspaper
startDate – the earliest date of articles from this newspaper available in Trove
endDate – the latest date of articles from this newspaper available in Trove
issn – ISSN
number_of_articles – the number of articles from this newspaper published after 1954 available in Trove
troveUrl – link to more information about this newspaper
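As a quick illustration, a minimal pandas sketch for exploring the file, using the field names listed above:

import pandas as pd

df = pd.read_csv("newspapers_post_54.csv")

# Newspapers with the most post-1954 articles available in Trove.
top = df.sort_values("number_of_articles", ascending=False)
print(top[["title", "state", "number_of_articles"]].head(10))

# Number of titles per state.
print(df["state"].value_counts())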
This repository is part of the GLAM Workbench. If you think this project is worthwhile, you might like to sponsor me on GitHub.
URLs for the 2,353,652 news articles covered in the study, as collected from the main feeds of the online news outlets.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Online newspapers are becoming increasingly popular and may pose a threat to the traditional print newspaper market. To investigate this market, this study aims to assess how motivational variables (i.e., human values and social axioms) and affective and rational judgments comparatively influence the use of print and online newspapers. To that end, we have applied the Consumer Cultural Influence Model (CCIM) to this subject. Our research investigates print and online newspaper usage in two ways: one is exploratory, designed to identify newspaper attributes through 11 interviews; the other uses an online survey (N=498) to evaluate the relationships between the model's constructs. The analyses, conducted using structural equation modeling, demonstrate that the usage of each type of newspaper is different. Print newspapers involve affective judgment and the establishment of an emotional attachment to the product for those who prefer print newspapers. For online newspapers, the relationship is rational for those who prefer them. The originality of this research lies in its examination of the perspective of the newspaper consumer and its identification of opposing idiosyncrasies associated with these differing preferences. It also applies comparative models to this market, dealing not only with newspaper attributes, but also with subjective aspects linked to newspaper consumption.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset produced during the project Viral Culture in Early Nineteenth-Century Europe.
The project traced text reuse by analysing large OCR'd newspaper collections using a BLAST-based algorithm, which produces text clusters.
This dataset contains two cluster datasets produced from two different data collections.
For the first, based on the Austrian ANNO newspaper collection, the dataset contains metadata describing the newspapers used.
For the second, based on German-language newspapers in the Europeana collection, the dataset contains project-produced metadata describing the newspapers used by the project, as well as the OCR'd content for these newspaper issues. The OCR was produced with Tesseract OCR from digital page images downloaded from the Europeana services.
A 41-year textual digital archive of nytimes.com, consisting of all available articles (approximately 4,000,000) published by The New York Times, including but not limited to news, lifestyle, opinion, and The New York Times Magazine, and excluding reader comments, paid obituaries, and the kids section. Article data is available from 1980 to 2021.
The New York Times TDM Archive was originally received as NITF-encoded (News Industry Text Format) XML objects. See 'Bulk Data Access' (below) for more information.
The XML was transformed into tabular data (for inclusion in Redivis) using xmltotabular with the accompanying configuration file.
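The xmltotabular configuration is not reproduced here; the sketch below only illustrates the general shape of such an XML-to-tabular transformation using the Python standard library, with NITF-style element names chosen for illustration.

import csv
import xml.etree.ElementTree as ET

# Illustrative NITF-style paths; the actual mapping is defined by the
# configuration file mentioned above.
root = ET.parse("article.xml").getroot()
pubdata = root.find(".//pubdata")

row = {
    "headline": root.findtext(".//hedline/hl1", default=""),
    "pubdate": pubdata.get("date.publication") if pubdata is not None else "",
    "body": " ".join(p.text or "" for p in root.iter("p")),
}

with open("articles.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=row.keys())
    writer.writeheader()
    writer.writerow(row)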
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
NOTE: This is a badly rendered version of the README within the archive.
A data-driven approach to studying changing vocabularies in historical newspaper collections
Simon Hengchen,* Ruben Ros,** Jani Marjanen,*** Mikko Tolonen***
*Språkbanken Text, University of Gothenburg, Sweden and iguanodon.ai, Belgium: firstname.lastname@gu.se
**Centre for Contemporary and Digital History (C2DH), University of Luxembourg: firstname.lastname@uni.lu
***COMHIS, University of Helsinki: firstname.lastname@helsinki.fi;
These are the supplementary materials for the DH2019 paper A data-driven approach to the changing vocabulary of the 'nation' in English, Dutch, Swedish and Finnish newspapers, 1750-1950, as well as the 2021 Digital Scholarship in the Humanities publication available in Open Access: https://academic.oup.com/dsh/article/36/Supplement_2/ii109/6421793. If you end up using the whole or parts of this resource, please use one (or both) of the following BibTeX entries:
@inproceedings{hengchen2019nation,
title="A data-driven approach to the changing vocabulary of the 'nation' in {E}nglish, {D}utch, {S}wedish and {F}innish newspapers, 1750-1950.",
author={Hengchen, Simon and Ros, Ruben and Marjanen, Jani},
year={2019},
address = "Utrecht, The Netherlands",
booktitle={Proceedings of the Digital Humanities (DH) conference 2019}
}
@article{hengchen2021data,
title={A data-driven approach to studying changing vocabularies in historical newspaper collections},
author={Hengchen, Simon and Ros, Ruben and Marjanen, Jani and Tolonen, Mikko},
journal={Digital Scholarship in the Humanities},
volume={36},
number={Supplement\_2},
pages={ii109--ii126},
year={2021},
publisher={Oxford University Press}
}
Files
This archive contains two folders -- one per diachronic representation method -- as well as this README. Each folder contains four folders, which contain the models for their respective languages. As can be inferred from the small data sizes, most of the earlier models are not reliable and should not be used, but they are still made available. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Source material
Finnish:
The models were created with data from the Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland (National Library of Finland, 2011). We used everything in the corpus.
Filesizes:
[simon@taito-login3 SGNS]$ du -h fi*
12M fi_1820_SGNS_corpus_file.gensim
89M fi_1840_SGNS_corpus_file.gensim
797M fi_1860_SGNS_corpus_file.gensim
7.0G fi_1880_SGNS_corpus_file.gensim
22G fi_1900_SGNS_corpus_file.gensim
Swedish:
The models were created with data from the Kubhist 2 corpus (Språkbanken) -- more precisely, the data dumps available at https://spraakbanken.gu.se. After a manual evaluation of Swedish embeddings trained without pre-processing showed that the embeddings were of low quality, we retrained the models, keeping only sentences that were at least 10 tokens long and consisted of at least 50% lemmas as per the KORP processing pipeline (Borin et al, 2012).
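A minimal sketch of that filtering rule, assuming each sentence arrives as a list of (token, lemma) pairs from the KORP pipeline; the data structure is illustrative, not the pipeline's actual output format.

def keep_sentence(tagged_sentence, min_tokens=10, min_lemma_ratio=0.5):
    # tagged_sentence: list of (token, lemma) pairs; lemma is empty/None when
    # the pipeline could not lemmatise the token. Illustrative structure only.
    if len(tagged_sentence) < min_tokens:
        return False
    lemmatised = sum(1 for _, lemma in tagged_sentence if lemma)
    return lemmatised / len(tagged_sentence) >= min_lemma_ratio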
Filesizes:
[simon@taito-login3 SGNS]$ du -h sv*
1.6M sv_1740_SGNS_corpus_file.gensim
44M sv_1760_SGNS_corpus_file.gensim
124M sv_1780_SGNS_corpus_file.gensim
228M sv_1800_SGNS_corpus_file.gensim
678M sv_1820_SGNS_corpus_file.gensim
1.6G sv_1840_SGNS_corpus_file.gensim
4.5G sv_1860_SGNS_corpus_file.gensim
6.5G sv_1880_SGNS_corpus_file.gensim
113M sv_1900_SGNS_corpus_file.gensim
Dutch:
The models were created with data from the Delpher newspaper archive (Royal Dutch Library, 2017), through data dumps for newspapers up to and including 1876, and through API hits for articles from 1877 to 1899 (inclusive). We only kept articles with nl or NL as a language tag, and removed articles that did not contain the word de. Our assumption was that most articles should contain de at least once, and that those that didn't were too short to be deemed interesting. A subsequent study showed that was not exactly the case, but we were reassured by the fact that left-out articles were probably "shipping or financial reports" (thanks go to Melvin Wevers). We also did not include the colonial newspapers for our embeddings. This is motivated by our research questions. A list of removed newspapers is available on request.
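A sketch of those two filters as described (language tag, then the presence of de); the record format here is hypothetical.

def keep_article(record):
    # record is assumed to be a dict with "language" and "text" keys (illustrative).
    if record.get("language") not in ("nl", "NL"):
        return False
    # Articles not containing "de" at least once were deemed too short to keep.
    return "de" in record.get("text", "").lower().split()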
Filesizes:
[simon@taito-login3 SGNS]$ du -h nl*
6.8M nl_1620_SGNS_corpus_file.gensim
7.9M nl_1640_SGNS_corpus_file.gensim
43M nl_1660_SGNS_corpus_file.gensim
78M nl_1680_SGNS_corpus_file.gensim
138M nl_1700_SGNS_corpus_file.gensim
243M nl_1720_SGNS_corpus_file.gensim
287M nl_1740_SGNS_corpus_file.gensim
431M nl_1760_SGNS_corpus_file.gensim
825M nl_1780_SGNS_corpus_file.gensim
1.2G nl_1800_SGNS_corpus_file.gensim
1.8G nl_1820_SGNS_corpus_file.gensim
3.1G nl_1840_SGNS_corpus_file.gensim
5.2G nl_1860_SGNS_corpus_file.gensim
13G nl_1880_SGNS_corpus_file.gensim
English:
The models were created with data from the British Library Newspapers collection (link), the Nichols collection (link), and the Burney collection (link). We used everything in the corpora. For English, only SGNS_ALIGN models are available. We thank Gale Cengage for their help with this project.
Filesizes:
[simon@taito-login3 SGNS]$ du -h en*
4.3M en_1620_SGNS_corpus_file.gensim
11M en_1640_SGNS_corpus_file.gensim
11M en_1660_SGNS_corpus_file.gensim
106M en_1680_SGNS_corpus_file.gensim
409M en_1700_SGNS_corpus_file.gensim
1.7G en_1720_SGNS_corpus_file.gensim
834M en_1740_SGNS_corpus_file.gensim
2.4G en_1760_SGNS_corpus_file.gensim
5.3G en_1780_SGNS_corpus_file.gensim
5.5G en_1800_SGNS_corpus_file.gensim
15G en_1820_SGNS_corpus_file.gensim
42G en_1840_SGNS_corpus_file.gensim
65G en_1860_SGNS_corpus_file.gensim
88G en_1880_SGNS_corpus_file.gensim
26G en_1900_SGNS_corpus_file.gensim
21G en_1920_SGNS_corpus_file.gensim
6.3G en_1940_SGNS_corpus_file.gensim
Word embeddings
For every language, we train diachronic embeddings as follows. We divide the data into 20-year time bins. We train SGNS_UPDATE and SGNS_ALIGN models. Current research on German (Schlechtweg et al, 2019) and English (Shoemark et al, 2019) indicates you should use the SGNS_ALIGN models. For EN, FI, NL, no tokens (including punctuation) were removed or altered, aside from lowercasing. For SV, see above. Parameters are as follows: SGNS architecture (Mikolov et al 2013), window size of 5, frequency threshold of 100, 5 epochs, 300 dimensions (or 100 for EN).
For SGNS_UPDATE, we first train a model on the earliest time bin t. To train the model for t+1, we use the t model to initialise the vectors for t+1, set the learning rate to correspond to the end learning rate of t, and continue training. This approach, closely following Kim et al (2014), has the advantage of avoiding the need for post-training vector space alignment. The Python snippet below, which makes use of gensim (Rehurek and Sojka, 2010), illustrates the approach. Special thanks go to Sara Budts.
import os
import gensim

## dict_files is a dictionary with double decades as keys and a corresponding
## corpus file in LineSentence format as value:
## https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.LineSentence
## lang and data_path_final are assumed to be defined earlier.
count = 0
for key in sorted(list(dict_files.keys())):
    if count == 0:  ## This is the first model: train it from scratch.
        model = gensim.models.Word2Vec(corpus_file=dict_files[key], min_count=100,
                                       sg=1, size=300, workers=64, seed=1830, iter=5)
    else:  ## Subsequent models (Kim et al, 2014): continue from the previous time bin.
        print("model for double decade starting in", str(key))
        ## Reconstructed continuation (the original snippet was truncated here):
        ## extend the vocabulary with the new bin, then keep training from the
        ## end learning rate of the previous model.
        model.build_vocab(corpus_file=dict_files[key], update=True)
        model.train(corpus_file=dict_files[key], total_examples=model.corpus_count,
                    total_words=model.corpus_total_words, epochs=model.epochs,
                    start_alpha=model.min_alpha)
    model.save(os.path.join(data_path_final, "KIM", lang + "_" + str(key) + ".w2v"))
    print("Model saved, on to the next")
    count += 1
Techsalerator’s News Event Data in Asia offers a detailed and expansive dataset designed to provide businesses, analysts, journalists, and researchers with comprehensive insights into significant news events across the Asian continent. This dataset captures and categorizes major events reported from a diverse range of news sources, including press releases, industry news sites, blogs, and PR platforms, offering valuable perspectives on regional developments, economic shifts, political changes, and cultural occurrences.
Key Features of the Dataset:

Extensive Coverage: The dataset aggregates news events from a wide range of sources such as company press releases, industry-specific news outlets, blogs, PR sites, and traditional media. This broad coverage ensures a diverse array of information from multiple reporting channels.
Categorization of Events: News events are categorized into various types including business and economic updates, political developments, technological advancements, legal and regulatory changes, and cultural events. This categorization helps users quickly find and analyze information relevant to their interests or sectors.
Real-Time Updates: The dataset is updated regularly to include the most current events, ensuring users have access to the latest news and can stay informed about recent developments as they happen.
Geographic Segmentation: Events are tagged with their respective countries and regions within Asia. This geographic segmentation allows users to filter and analyze news events based on specific locations, facilitating targeted research and analysis.
Event Details: Each event entry includes comprehensive details such as the date of occurrence, source of the news, a description of the event, and relevant keywords. This thorough detailing helps users understand the context and significance of each event.
Historical Data: The dataset includes historical news event data, enabling users to track trends and perform comparative analysis over time. This feature supports longitudinal studies and provides insights into the evolution of news events.
Advanced Search and Filter Options: Users can search and filter news events based on criteria such as date range, event type, location, and keywords. This functionality allows for precise and efficient retrieval of relevant information.

Asian Countries and Territories Covered:

Central Asia: Kazakhstan, Kyrgyzstan, Tajikistan, Turkmenistan, Uzbekistan
East Asia: China, Hong Kong (Special Administrative Region of China), Japan, Mongolia, North Korea, South Korea, Taiwan
South Asia: Afghanistan, Bangladesh, Bhutan, India, Maldives, Nepal, Pakistan, Sri Lanka
Southeast Asia: Brunei, Cambodia, East Timor (Timor-Leste), Indonesia, Laos, Malaysia, Myanmar (Burma), Philippines, Singapore, Thailand, Vietnam
Western Asia (Middle East): Armenia, Azerbaijan, Bahrain, Cyprus, Georgia, Iraq, Israel, Jordan, Kuwait, Lebanon, Oman, Palestine, Qatar, Saudi Arabia, Syria, Turkey (partly in Europe, but often included in Asia contextually), United Arab Emirates, Yemen

Benefits of the Dataset:

Strategic Insights: Businesses and analysts can use the dataset to gain insights into significant regional developments, economic conditions, and political changes, aiding in strategic decision-making and market analysis.
Market and Industry Trends: The dataset provides valuable information on industry-specific trends and events, helping users understand market dynamics and identify emerging opportunities.
Media and PR Monitoring: Journalists and PR professionals can track relevant news across Asia, enabling them to monitor media coverage, identify emerging stories, and manage public relations efforts effectively.
Academic and Research Use: Researchers can utilize the dataset for longitudinal studies, trend analysis, and academic research on various topics related to Asian news and events.

Techsalerator's News Event Data in Asia is a crucial resource for accessing and analyzing significant news events across the continent. By offering detailed, categorized, and up-to-date information, it supports effective decision-making, research, and media monitoring across diverse sectors.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Introducing the Kannada Newspaper, Books, and Magazine Image Dataset - a diverse and comprehensive collection of images meticulously curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Kannada language.
Dataset Content & Diversity: Containing a total of 5000 images, this Kannada OCR dataset offers an equal distribution across newspapers, books, and magazines. Within, you'll find a diverse collection of content, including articles, advertisements, cover pages, headlines, callouts, and author sections from a variety of newspapers, books, and magazines. Images in this dataset showcase distinct fonts, writing formats, colors, designs, and layouts.
To ensure the diversity of the dataset and to support building robust text recognition models, we allow only a limited number (fewer than five) of unique images from a single resource. Stringent measures have been taken to exclude any personally identifiable information (PII), and in each image at least 80% of the space contains visible Kannada text.
Images have been captured under varying lighting conditions – both day and night – along with different capture angles and backgrounds, further enhancing dataset diversity. The collection features images in portrait and landscape modes.
All these images were captured by native Kannada speakers to ensure text quality and to avoid toxic content and PII text. We used recent iOS and Android mobile devices with cameras above 5 MP to capture all these images and maintain image quality. In this training dataset, images are available in both JPEG and HEIC formats.
Metadata: Along with the image data you will also receive detailed structured metadata in CSV format. For each image it includes metadata such as device information, source type (newspaper, magazine, or book image), and image type (portrait or landscape). Each image is properly renamed to correspond with the metadata.
The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Kannada text recognition models.
Update & Custom Collection: We're committed to expanding this dataset by continuously adding more images with the assistance of our native Kannada crowd community.
If you require a custom dataset tailored to your guidelines or specific device distribution, feel free to contact us. We're equipped to curate specialized data to meet your unique needs.
Furthermore, using our crowd community we can annotate the images with bounding boxes or transcribe the text in the images to align with your specific requirements.
License: This image dataset, created by FutureBeeAI, is now available for commercial use.
Conclusion: Leverage the power of this image dataset to elevate the training and performance of text recognition, text detection, and optical character recognition models within the realm of the Kannada language. Your journey to enhanced language understanding and processing starts here.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This is a sample dataset containing reference relationships among various information sources such as TV, newspapers, and online articles.
There are two datasets.
1. InputFileEdges.csv contains the information about the edges between nodes. The fields in this dataset are as follows:
(i) from: source (or starting) node id of the edge
(ii) to: target (or ending) node id of the edge
(iii) weight: the number of times the nodes were connected or referenced each other
(iv) type: the type of the link (hyperlink or mention) between these nodes
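A minimal sketch of loading the edge file into a directed, weighted graph with networkx, using the column names listed above:

import pandas as pd
import networkx as nx

edges = pd.read_csv("InputFileEdges.csv")

# "weight" and "type" (hyperlink or mention) are kept as edge attributes.
G = nx.from_pandas_edgelist(edges, source="from", target="to",
                            edge_attr=["weight", "type"],
                            create_using=nx.DiGraph)

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")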
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Many people believe that news media they dislike are biased, while their favorite news source isn't. Can we move beyond such subjectivity and measure media bias objectively, from data alone? The auto-generated figure below answers this question with a resounding "yes", showing left-leaning media on the left, right-leaning media on the right, establishment-critical media at the bottom, etc.
Media bias landscape: https://space.mit.edu/home/tegmark/phrasebias.jpg
Our algorithm analyzed over a million articles from over a hundred newspapers. It first auto-identifies phrases that help predict which newspaper a given article is from (e.g. "undocumented immigrant" vs. "illegal immigrant"). It then analyzes the frequencies of such phrases across newspapers and topics, producing the media bias landscape shown above. This means that although news bias is inherently political, its measurement need not be.
Here's our paper: arXiv:2109.00024. Our Kaggle data set contains the discriminative phrases and phrase counts needed to reproduce all the plots in our paper. The files contain the following data:
- The directory phrase_selection contains tables such as immigration_phrases.csv that you can open with Microsoft Excel. They contain the phrases that our method found most informative for predicting which newspaper an article is from, sorted by decreasing utility. Our analysis uses only the phrases passing all our screenings, i.e., those with ones in columns D, E and F.
- The directory counts contains tables such as immigration_counts.csv, listing the number of times that each phrase occurs in each newspaper's coverage of that topic.
- The file blacklist.csv contains journalist names and other phrases that were discarded because they helped reveal the identity of a newspaper without reflecting any political bias.
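As a rough sketch of how the counts tables can be used to compare outlets (assuming one row per phrase and one column per newspaper, which is how the description reads; the exact layout may differ):

import pandas as pd

# Assumed layout: one row per phrase, one column per newspaper.
counts = pd.read_csv("counts/immigration_counts.csv", index_col=0)

# Normalise to per-newspaper frequencies so outlets of different sizes are comparable.
freqs = counts / counts.sum(axis=0)

# Relative use of two example phrases across outlets.
for phrase in ["undocumented immigrant", "illegal immigrant"]:
    if phrase in freqs.index:
        print(phrase)
        print(freqs.loc[phrase].sort_values(ascending=False).head())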
If you have questions, please contact Samantha at sdalonzo@mit.edu or Max at tegmark@mit.edu.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Newspaper articles relating to Sir Ross Smith and the 1919 epic flight from England to Australia. Datasets are divided into the themes of the prelude to the epic flight, the epic flight itself, and the death and funeral of Sir Ross Smith. Articles are sourced from the South Australian newspapers The Advertiser, Daily Herald, The Observer and The Register.
Citizens' economic perceptions can shape their political and economic behavior, making those perceptions' origins an important question. Research commonly posits that media coverage is a central source. Here, we test that prospect while considering the alternative hypothesis that media coverage instead echoes public perceptions. This paper applies a straightforward automated measure of the tone of economic coverage to 490,039 articles from 24 national and local media outlets over more than three decades. By matching the 245,947 survey respondents in the Survey of Consumer Attitudes and Behavior to measures of contemporaneous media coverage, we can assess the sequencing of changes in media coverage and public perceptions. Together, these data illustrate that newspaper coverage does not systematically precede public perceptions of the economy, a finding which analyses of television transcripts reinforce. Neither national nor local newspapers appear to strongly influence economic perceptions.
Techsalerator’s News Event Data in Latin America offers a detailed and extensive dataset designed to provide businesses, analysts, journalists, and researchers with an in-depth view of significant news events across the Latin American region. This dataset captures and categorizes key events reported from a wide array of news sources, including press releases, industry news sites, blogs, and PR platforms, offering valuable insights into regional developments, economic changes, political shifts, and cultural events.
Key Features of the Dataset:

Comprehensive Coverage: The dataset aggregates news events from numerous sources such as company press releases, industry news outlets, blogs, PR sites, and traditional news media. This broad coverage ensures a wide range of information from multiple reporting channels.
Categorization of Events: News events are categorized into various types including business and economic updates, political developments, technological advancements, legal and regulatory changes, and cultural events. This categorization helps users quickly locate and analyze information relevant to their interests or sectors.
Real-Time Updates: The dataset is updated regularly to include the most recent events, ensuring users have access to the latest news and can stay informed about current developments.
Geographic Segmentation: Events are tagged with their respective countries and regions within Latin America. This geographic segmentation allows users to filter and analyze news events based on specific locations, facilitating targeted research and analysis.
Event Details: Each event entry includes comprehensive details such as the date of occurrence, source of the news, a description of the event, and relevant keywords. This thorough detailing helps in understanding the context and significance of each event.
Historical Data: The dataset includes historical news event data, enabling users to track trends and perform comparative analysis over time. This feature supports longitudinal studies and provides insights into how news events evolve.
Advanced Search and Filter Options: Users can search and filter news events based on criteria such as date range, event type, location, and keywords. This functionality allows for precise and efficient retrieval of relevant information.

Latin American Countries Covered:

South America: Argentina, Bolivia, Brazil, Chile, Colombia, Ecuador, Guyana, Paraguay, Peru, Suriname, Uruguay, Venezuela
Central America: Belize, Costa Rica, El Salvador, Guatemala, Honduras, Nicaragua, Panama
Caribbean: Cuba, Dominican Republic, Haiti (primarily French-speaking but included due to geographic and cultural ties), Jamaica, Trinidad and Tobago

Benefits of the Dataset:

Strategic Insights: Businesses and analysts can use the dataset to gain insights into significant regional developments, economic conditions, and political changes, aiding in strategic decision-making and market analysis.
Market and Industry Trends: The dataset provides valuable information on industry-specific trends and events, helping users understand market dynamics and emerging opportunities.
Media and PR Monitoring: Journalists and PR professionals can track relevant news across Latin America, enabling them to monitor media coverage, identify emerging stories, and manage public relations efforts effectively.
Academic and Research Use: Researchers can utilize the dataset for longitudinal studies, trend analysis, and academic research on various topics related to Latin American news and events.

Techsalerator's News Event Data in Latin America is a crucial resource for accessing and analyzing significant news events across the region. By providing detailed, categorized, and up-to-date information, it supports effective decision-making, research, and media monitoring across diverse sectors.
The newspaper with the highest print circulation in the United States in the six months running to September 2023 was The Wall Street Journal, with an average weekday print circulation of 555.2 thousand. Ranking second was The New York Times, followed by The New York Post. The paper in the ranking with the highest year-over-year drop in circulation was The Denver Post, with a decline of 25 percent (although Buffalo News recorded a higher drop, its data does not refer to September 2022 to September 2023; see notes).