https://brightdata.com/license
Stay ahead with our comprehensive News Dataset, designed for businesses, analysts, and researchers to track global events, monitor media trends, and extract valuable insights from news sources worldwide.
Dataset Features
News Articles: Access structured news data, including headlines, summaries, full articles, publication dates, and source details. Ideal for media monitoring and sentiment analysis.
Publisher & Source Information: Extract details about news publishers, including domain, region, and credibility indicators.
Sentiment & Topic Classification: Analyze news sentiment, categorize articles by topic, and track emerging trends in real time.
Historical & Real-Time Data: Retrieve historical archives or access continuously updated news feeds for up-to-date insights.
Customizable Subsets for Specific Needs
Our News Dataset is fully customizable, allowing you to filter data based on publication date, region, topic, sentiment, or specific news sources. Whether you need broad coverage for trend analysis or focused data for competitive intelligence, we tailor the dataset to your needs.
Popular Use Cases
Media Monitoring & Reputation Management: Track brand mentions, analyze media coverage, and assess public sentiment.
Market & Competitive Intelligence: Monitor industry trends, competitor activity, and emerging market opportunities.
AI & Machine Learning Training: Use structured news data to train AI models for sentiment analysis, topic classification, and predictive analytics.
Financial & Investment Research: Analyze news impact on stock markets, commodities, and economic indicators.
Policy & Risk Analysis: Track regulatory changes, geopolitical events, and crisis developments in real time.
Whether you're analyzing market trends, monitoring brand reputation, or training AI models, our News Dataset provides the structured data you need. Get started today and customize your dataset to fit your business objectives.
This is the Newspaper Collection of the National Library of the Netherlands (KB). "The KB promotes the visibility, usability and longevity of the Dutch Library Collection, defined as the collective holdings of all publicly funded libraries in the Netherlands" (KB mission statement). The following figures answer common questions about the composition of this collection:

What part of the collection is included in the Media Suite? The Media Suite gives access to the KB's newspaper "basic collection". "The basic collection contains approximately 11 million newspaper pages from the Netherlands, the Dutch East Indies, the Antilles, America and Surinam from 1618 to 1995. This is about 15% of all newspapers that have ever been published in the Netherlands" (KB "wat zit er in Delpher?" [what is available via Delpher?]).

What years does the archive cover? The KB newspaper basic collection includes newspapers from 1618 to 1995. The Media Suite harvested all the available items and integrated them into the Media Suite in May 2018.

Figure 1: Number of newspaper articles in the collection over time

How and how often is the data updated in the Media Suite? The collection's metadata and their OCR enrichments are made available to the CLARIAH Media Suite by the KB via their OAI-PMH harvesting endpoint. The latest update to the Media Suite's data from this collection was done in May 2018.

What kind of media is included? The collection includes newspaper content of different types: articles, advertisements, illustrations with captions, and obituaries.

Figure 2: Types of content in the KB newspaper basic collection

What portion of the collection is digital? A large part of the KB newspaper basic collection is digital, and the KB is progressively digitising more newspapers (KB "wat zit er in Delpher?"). Via the Media Suite, users can access the digitised newspapers in the KB Delpher search engine.

Does the collection include enrichments? This collection has undergone optical character recognition (OCR). The OCR output is available via the Media Suite for searching purposes only. To read the OCR, users are redirected to the KB Delpher search engine.

Figure 3: Proportion of OCR-ed content in the KB newspaper basic collection

Where to find more information? KB newspaper collection site (in English) (in Dutch); KB Delpher (newspapers) search engine; KB information about "what is available via Delpher?"
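The harvesting setup itself is not part of this description, but as a rough sketch, an OAI-PMH endpoint of this kind can be harvested in Python with the Sickle library. The endpoint URL and metadata prefix below are placeholders, not the KB's actual values.

from sickle import Sickle  # pip install sickle

# Placeholder endpoint and metadata prefix -- the real values are not given above.
sickle = Sickle("https://example.org/oai")
records = sickle.ListRecords(metadataPrefix="oai_dc", ignore_deleted=True)

for record in records:
    # Each record carries a header (identifier, datestamp) and a metadata dict.
    print(record.header.identifier, record.metadata.get("title"))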
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Largest Bengali Newspaper Dataset for news type classification.
Abstract:
Knowledge is central to human and scientific developments. Natural Language Processing (NLP) allows automated analysis and creation of knowledge. Data is a crucial NLP and machine learning ingredient. The scarcity of open datasets is a well-known problem in machine and deep learning research. This is very much the case for textual NLP datasets in English and other major world languages. For the Bangla language, the situation is even more challenging and the number of large datasets for NLP research is practically nil. We hereby present Potrika, a large single-label Bangla news article textual dataset curated for NLP research from six popular online news portals in Bangladesh (Jugantor, Jaijaidin, Ittefaq, Kaler Kontho, Inqilab, and Somoyer Alo) for the period 2014-2020. The articles are classified into eight distinct categories (National, Sports, International, Entertainment, Economy, Education, Politics, and Science & Technology) providing five attributes (News Article, Category, Headline, Publication Date, and Newspaper Source). The raw dataset contains 185.51 million words and 12.57 million sentences contained in 664,880 news articles. Moreover, using NLP augmentation techniques, we create from the raw (unbalanced) dataset another (balanced) dataset comprising 320,000 news articles with 40,000 articles in each of the eight news categories. Potrika contains both datasets (raw and balanced) to suit a wide range of NLP research. To the best of our knowledge, Potrika is by far the largest and most extensive dataset for news classification.
cite:
@misc{ahmad2022potrika,
title={Potrika: Raw and Balanced Newspaper Datasets in the Bangla Language with Eight Topics and Five Attributes},
author={Istiak Ahmad and Fahad AlQurashi and Rashid Mehmood},
year={2022},
eprint={2210.09389},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Dataset Source - Here
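As a quick illustration of working with the dataset, here is a minimal pandas sketch; the file name and column labels are assumptions based on the five attributes listed in the abstract, not documented names.

import pandas as pd

# Hypothetical file name; columns assumed to follow the five attributes in the abstract.
df = pd.read_csv("potrika_raw.csv")

# Distribution over the eight single-label news categories.
print(df["Category"].value_counts())

# Articles from one source newspaper, e.g. Jugantor, sorted by publication date.
jugantor = df[df["Source"] == "Jugantor"].sort_values("Publication Date")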
The data file includes running text from a representative sample of two German newspapers, Die Welt and Süddeutsche Zeitung, covering the period 1 November 1966 to 30 October 1967.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Introducing the English Newspaper, Books, and Magazine Image Dataset - a diverse and comprehensive collection of images meticulously curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the English language.
Dataset Content & Diversity: Containing a total of 5000 images, this English OCR dataset offers an equal distribution across newspapers, books, and magazines. Within, you'll find a diverse collection of content, including articles, advertisements, cover pages, headlines, callouts, and author sections from a variety of newspapers, books, and magazines. Images in this dataset showcase distinct fonts, writing formats, colors, designs, and layouts.
To ensure the diversity of the dataset and to support building robust text recognition models, we allow only a limited number (fewer than five) of unique images from a single resource. Stringent measures have been taken to exclude any personally identifiable information (PII), and in each image at least 80% of the space contains visible English text.
Images have been captured under varying lighting conditions – both day and night – along with different capture angles and backgrounds, further enhancing dataset diversity. The collection features images in portrait and landscape modes.
All these images were captured by native English speakers to ensure text quality and to avoid toxic content and PII text. We used recent iOS and Android mobile devices with cameras above 5 MP to capture all these images and maintain image quality. In this training dataset, images are available in both JPEG and HEIC formats.
Metadata: Along with the image data you will also receive detailed structured metadata in CSV format. For each image it includes metadata such as device information, source type (newspaper, magazine, or book image), and image type (portrait or landscape). Each image is properly renamed to correspond with the metadata.
The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of English text recognition models.
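As a small illustration, the metadata can be inspected with pandas; the file and column names here are assumptions based on the description above, not the dataset's documented schema.

import pandas as pd

# Hypothetical file and column names inferred from the description.
meta = pd.read_csv("metadata.csv")

# Distribution across newspapers, books and magazines, and across orientations.
print(meta["source_type"].value_counts())
print(meta["image_type"].value_counts())

# Example: select images captured on iOS devices for a held-out evaluation split.
ios_images = meta[meta["device_information"].str.contains("iOS", na=False)]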
Update & Custom Collection: We're committed to expanding this dataset by continuously adding more images with the assistance of our native English-speaking crowd community.
If you require a custom dataset tailored to your guidelines or specific device distribution, feel free to contact us. We're equipped to curate specialized data to meet your unique needs.
Furthermore, using our crowd community we can annotate the images with bounding boxes or transcribe the text in the images to align with your specific requirements.
License: This image dataset, created by FutureBeeAI, is now available for commercial use.
Conclusion: Leverage the power of this image dataset to elevate the training and performance of text recognition, text detection, and optical character recognition models within the realm of the English language. Your journey to enhanced language understanding and processing starts here.
RealNews is a large corpus of news articles from Common Crawl. Data is scraped from Common Crawl, limited to the 5000 news domains indexed by Google News. The authors used the Newspaper Python library to extract the body and metadata from each article. News from Common Crawl dumps from December 2016 through March 2019 was used as training data; articles published in April 2019 from the April 2019 dump were used for evaluation. After deduplication, RealNews is 120 gigabytes without compression.
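For reference, the extraction step can be reproduced with the same Newspaper (newspaper3k) library; a minimal sketch with a placeholder URL:

from newspaper import Article  # pip install newspaper3k

# Placeholder URL; RealNews applied this kind of extraction to pages from Common Crawl dumps.
article = Article("https://example.com/some-news-story")
article.download()
article.parse()

print(article.title)          # headline
print(article.publish_date)   # metadata
print(article.text[:200])     # body text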
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
newspapers with articles published after 1954
Current version: v1.7
Due to copyright restrictions, most of the digitised newspaper articles on Trove were published before 1955. However, some articles published after 1954 have been made available. This repository provides data about digitised newspapers in Trove that have articles available from after 1954 (the 'copyright cliff of death').
The data was extracted from the Trove API using this notebook from the Trove newspapers section of the GLAM Workbench.
The data is available as a CSV file entitled newspapers_post_54.csv and contains the following fields:
title – the full title of the newspaper
state – the state in which the newspaper was published
id – Trove's unique identifier for this newspaper
startDate – the earliest date of articles from this newspaper available in Trove
endDate – the latest date of articles from this newspaper available in Trove
issn – ISSN
number_of_articles – the number of articles from this newspaper published after 1954 available in Trove
troveUrl – link to more information about this newspaper
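As a quick illustration, a minimal pandas sketch for exploring the file, using the field names listed above:

import pandas as pd

df = pd.read_csv("newspapers_post_54.csv")

# Newspapers with the most post-1954 articles available in Trove.
top = df.sort_values("number_of_articles", ascending=False)
print(top[["title", "state", "number_of_articles"]].head(10))

# Number of titles per state.
print(df["state"].value_counts())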
This repository is part of the GLAM Workbench. If you think this project is worthwhile, you might like to sponsor me on GitHub.
URLs for the 2,353,652 news articles covered in the study, as collected from the main feeds of the online news outlets.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Online newspapers are becoming increasingly popular and may pose a threat to the traditional print newspaper market. To investigate this market, this study aims to assess how motivational variables (i.e., human values and social axioms) and affective and rational judgments comparatively influence the use of print and online newspapers. To that end, we have applied the Consumer Cultural Influence Model (CCIM) to this subject. Our research investigates print and online newspaper usage in two ways: one is exploratory, designed to identify newspaper attributes through 11 interviews; the other uses an online survey (N=498) to evaluate the relationships between the model's constructs. The analyses, conducted using structural equation modeling, demonstrate that the usage of each type of newspaper is different. Print newspapers involve affective judgment and the establishment of an emotional attachment to the product for those who prefer print newspapers. For online newspapers, the relationship is rational for those who prefer them. The originality of this research lies in its examination of the perspective of the newspaper consumer and its identification of opposing idiosyncrasies associated with these differing preferences. It also applies comparative models to this market, dealing not only with newspaper attributes, but also with subjective aspects linked to newspaper consumption.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset produced during the project Viral Culture in Early Nineteenth-Century Europe.
The project traced text reuse by analysing large OCR'd newspaper collections using a BLAST-based algorithm, which produces text clusters.
This dataset contains two cluster datasets produced from two different data collections.
For the first, based on the Austrian ANNO newspaper collection, the dataset contains metadata describing the newspapers used.
For the second, based on German-language newspapers in the Europeana collection, the dataset contains project-produced metadata describing the newspapers used by the project, as well as the OCR'd content for these newspaper issues. The OCR was produced with Tesseract OCR from digital page images downloaded from the Europeana services.
A 41-year textual digital archive of nytimes.com, consisting of all available articles (approximately 4,000,000) published by The New York Times, including but not limited to news, lifestyle, opinion, and The New York Times Magazine, and excluding reader comments, paid obituaries, and the kids section. Article data is available from 1980 to 2021.
The New York Times TDM Archive was originally received as NITF-encoded (News Industry Text Format) XML objects. See 'Bulk Data Access' (below) for more information.
The XML was transformed into tabular data (for inclusion in Redivis) using xmltotabular with the accompanying configuration file.
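The xmltotabular configuration is not reproduced here; the sketch below only illustrates the general shape of such an XML-to-tabular transformation using the Python standard library, with NITF-style element names chosen for illustration.

import csv
import xml.etree.ElementTree as ET

# Illustrative NITF-style paths; the actual mapping is defined by the
# configuration file mentioned above.
root = ET.parse("article.xml").getroot()
pubdata = root.find(".//pubdata")

row = {
    "headline": root.findtext(".//hedline/hl1", default=""),
    "pubdate": pubdata.get("date.publication") if pubdata is not None else "",
    "body": " ".join(p.text or "" for p in root.iter("p")),
}

with open("articles.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=row.keys())
    writer.writeheader()
    writer.writerow(row)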
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
NOTE: This is a badly rendered version of the README within the archive.
A data-driven approach to studying changing vocabularies in historical newspaper collections
Simon Hengchen,* Ruben Ros,** Jani Marjanen,*** Mikko Tolonen***
*Språkbanken Text, University of Gothenburg, Sweden and iguanodon.ai, Belgium: firstname.lastname@gu.se
**Centre for Contemporary and Digital History (C2DH), University of Luxembourg: firstname.lastname@uni.lu
***COMHIS, University of Helsinki: firstname.lastname@helsinki.fi;
These are the supplementary materials for the DH2019 paper A data-driven approach to the changing vocabulary of the 'nation' in English, Dutch, Swedish and Finnish newspapers, 1750-1950, as well as the 2021 Digital Scholarship in the Humanities publication available in Open Access: https://academic.oup.com/dsh/article/36/Supplement_2/ii109/6421793. If you end up using the whole or parts of this resource, please use one (or both) of the following BibTeX entries:
@inproceedings{hengchen2019nation,
title="A data-driven approach to the changing vocabulary of the 'nation' in {E}nglish, {D}utch, {S}wedish and {F}innish newspapers, 1750-1950.",
author={Hengchen, Simon and Ros, Ruben and Marjanen, Jani},
year={2019},
address = "Utrecht, The Netherlands",
booktitle={Proceedings of the Digital Humanities (DH) conference 2019}
}
@article{hengchen2021data,
title={A data-driven approach to studying changing vocabularies in historical newspaper collections},
author={Hengchen, Simon and Ros, Ruben and Marjanen, Jani and Tolonen, Mikko},
journal={Digital Scholarship in the Humanities},
volume={36},
number={Supplement\_2},
pages={ii109--ii126},
year={2021},
publisher={Oxford University Press}
}
Files
This archive contains two folders -- one per diachronic representation method -- as well as this README. Each folder contains four folders, which contain the models for their respective languages. As can be inferred from the small data sizes, most of the earlier models are not reliable and should not be used, but they are still made available. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Source material
Finnish:
The models were created with data from the Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland (National Library of Finland, 2011). We used everything in the corpus.
Filesizes:
[simon@taito-login3 SGNS]$ du -h fi*
12M fi_1820_SGNS_corpus_file.gensim
89M fi_1840_SGNS_corpus_file.gensim
797M fi_1860_SGNS_corpus_file.gensim
7.0G fi_1880_SGNS_corpus_file.gensim
22G fi_1900_SGNS_corpus_file.gensim
Swedish:
The models were created with data from the Kubhist 2 corpus (Språkbanken) -- more precisely, the data dumps available at https://spraakbanken.gu.se. After a manual evaluation of Swedish embeddings trained without pre-processing showed that the embeddings were of low quality, we retrained the models, keeping only sentences that were at least 10 tokens long and consisted of at least 50% lemmas as per the KORP processing pipeline (Borin et al, 2012).
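A minimal sketch of that filtering rule, assuming each sentence arrives as a list of (token, lemma) pairs from the KORP pipeline; the data structure is illustrative, not the pipeline's actual output format.

def keep_sentence(tagged_sentence, min_tokens=10, min_lemma_ratio=0.5):
    # tagged_sentence: list of (token, lemma) pairs; lemma is empty/None when
    # the pipeline could not lemmatise the token. Illustrative structure only.
    if len(tagged_sentence) < min_tokens:
        return False
    lemmatised = sum(1 for _, lemma in tagged_sentence if lemma)
    return lemmatised / len(tagged_sentence) >= min_lemma_ratio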
Filesizes:
[simon@taito-login3 SGNS]$ du -h sv*
1.6M sv_1740_SGNS_corpus_file.gensim
44M sv_1760_SGNS_corpus_file.gensim
124M sv_1780_SGNS_corpus_file.gensim
228M sv_1800_SGNS_corpus_file.gensim
678M sv_1820_SGNS_corpus_file.gensim
1.6G sv_1840_SGNS_corpus_file.gensim
4.5G sv_1860_SGNS_corpus_file.gensim
6.5G sv_1880_SGNS_corpus_file.gensim
113M sv_1900_SGNS_corpus_file.gensim
Dutch:
The models were created with data from the Delpher newspaper archive (Royal Dutch Library, 2017), through data dumps for newspapers up to and including 1876, and through API hits for articles from 1877 to 1899 (inclusive). We only kept articles with nl or NL as a language tag, and removed articles that did not contain the word de. Our assumption was that most articles should contain de at least once, and that those that didn't were too short to be deemed interesting. A subsequent study showed that was not exactly the case, but we were reassured by the fact that left-out articles were probably "shipping or financial reports" (thanks go to Melvin Wevers). We also did not include the colonial newspapers for our embeddings. This is motivated by our research questions. A list of removed newspapers is available on request.
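A sketch of those two filters as described (language tag, then the presence of de); the record format here is hypothetical.

def keep_article(record):
    # record is assumed to be a dict with "language" and "text" keys (illustrative).
    if record.get("language") not in ("nl", "NL"):
        return False
    # Articles not containing "de" at least once were deemed too short to keep.
    return "de" in record.get("text", "").lower().split()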
Filesizes:
[simon@taito-login3 SGNS]$ du -h nl*
6.8M nl_1620_SGNS_corpus_file.gensim
7.9M nl_1640_SGNS_corpus_file.gensim
43M nl_1660_SGNS_corpus_file.gensim
78M nl_1680_SGNS_corpus_file.gensim
138M nl_1700_SGNS_corpus_file.gensim
243M nl_1720_SGNS_corpus_file.gensim
287M nl_1740_SGNS_corpus_file.gensim
431M nl_1760_SGNS_corpus_file.gensim
825M nl_1780_SGNS_corpus_file.gensim
1.2G nl_1800_SGNS_corpus_file.gensim
1.8G nl_1820_SGNS_corpus_file.gensim
3.1G nl_1840_SGNS_corpus_file.gensim
5.2G nl_1860_SGNS_corpus_file.gensim
13G nl_1880_SGNS_corpus_file.gensim
English:
The models were created with data from the British Library Newspapers collection (link), the Nichols collection (link), and the Burney collection (link). We used everything in the corpora. For English, only SGNS_ALIGN models are available. We thank Gale Cengage for their help with this project.
Filesizes:
[simon@taito-login3 SGNS]$ du -h en*
4.3M en_1620_SGNS_corpus_file.gensim
11M en_1640_SGNS_corpus_file.gensim
11M en_1660_SGNS_corpus_file.gensim
106M en_1680_SGNS_corpus_file.gensim
409M en_1700_SGNS_corpus_file.gensim
1.7G en_1720_SGNS_corpus_file.gensim
834M en_1740_SGNS_corpus_file.gensim
2.4G en_1760_SGNS_corpus_file.gensim
5.3G en_1780_SGNS_corpus_file.gensim
5.5G en_1800_SGNS_corpus_file.gensim
15G en_1820_SGNS_corpus_file.gensim
42G en_1840_SGNS_corpus_file.gensim
65G en_1860_SGNS_corpus_file.gensim
88G en_1880_SGNS_corpus_file.gensim
26G en_1900_SGNS_corpus_file.gensim
21G en_1920_SGNS_corpus_file.gensim
6.3G en_1940_SGNS_corpus_file.gensim
Word embeddings
For every language, we train diachronic embeddings as follows. We divide the data into 20-year time bins. We train SGNS_UPDATE and SGNS_ALIGN models. Current research on German (Schlechtweg et al, 2019) and English (Shoemark et al, 2019) indicates you should use the SGNS_ALIGN models. For EN, FI, NL, no tokens (including punctuation) were removed or altered, aside from lowercasing. For SV, see above. Parameters are as follows: SGNS architecture (Mikolov et al 2013), window size of 5, frequency threshold of 100, 5 epochs, 300 dimensions (or 100 for EN).
For SGNS_UPDATE, we first train a model on the earliest time bin t. To train the model for t+1, we use the t model to initialise the vectors for t+1, set the learning rate to correspond to the end learning rate of t, and continue training. This approach, closely following Kim et al (2014), has the advantage of avoiding the need for post-training vector space alignment. The Python snippet below, which makes use of gensim (Rehurek and Sojka, 2010), illustrates the approach. Special thanks go to Sara Budts.
import os
import gensim

## dict_files is a dictionary with double decades as keys and a corresponding
## corpus file in LineSentence format as value:
## https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.LineSentence
## lang and data_path_final are assumed to be defined earlier.
count = 0
for key in sorted(list(dict_files.keys())):
    if count == 0:  ## This is the first model: train it from scratch.
        model = gensim.models.Word2Vec(corpus_file=dict_files[key], min_count=100,
                                       sg=1, size=300, workers=64, seed=1830, iter=5)
    else:  ## Subsequent models (Kim et al, 2014): continue from the previous time bin.
        print("model for double decade starting in", str(key))
        ## Reconstructed continuation (the original snippet was truncated here):
        ## extend the vocabulary with the new bin, then keep training from the
        ## end learning rate of the previous model.
        model.build_vocab(corpus_file=dict_files[key], update=True)
        model.train(corpus_file=dict_files[key], total_examples=model.corpus_count,
                    total_words=model.corpus_total_words, epochs=model.epochs,
                    start_alpha=model.min_alpha)
    model.save(os.path.join(data_path_final, "KIM", lang + "_" + str(key) + ".w2v"))
    print("Model saved, on to the next")
    count += 1
Techsalerator’s News Event Data in Asia offers a detailed and expansive dataset designed to provide businesses, analysts, journalists, and researchers with comprehensive insights into significant news events across the Asian continent. This dataset captures and categorizes major events reported from a diverse range of news sources, including press releases, industry news sites, blogs, and PR platforms, offering valuable perspectives on regional developments, economic shifts, political changes, and cultural occurrences.
Key Features of the Dataset:

Extensive Coverage: The dataset aggregates news events from a wide range of sources such as company press releases, industry-specific news outlets, blogs, PR sites, and traditional media. This broad coverage ensures a diverse array of information from multiple reporting channels.
Categorization of Events: News events are categorized into various types including business and economic updates, political developments, technological advancements, legal and regulatory changes, and cultural events. This categorization helps users quickly find and analyze information relevant to their interests or sectors.
Real-Time Updates: The dataset is updated regularly to include the most current events, ensuring users have access to the latest news and can stay informed about recent developments as they happen.
Geographic Segmentation: Events are tagged with their respective countries and regions within Asia. This geographic segmentation allows users to filter and analyze news events based on specific locations, facilitating targeted research and analysis.
Event Details: Each event entry includes comprehensive details such as the date of occurrence, source of the news, a description of the event, and relevant keywords. This thorough detailing helps users understand the context and significance of each event.
Historical Data: The dataset includes historical news event data, enabling users to track trends and perform comparative analysis over time. This feature supports longitudinal studies and provides insights into the evolution of news events.
Advanced Search and Filter Options: Users can search and filter news events based on criteria such as date range, event type, location, and keywords. This functionality allows for precise and efficient retrieval of relevant information.

Asian Countries and Territories Covered:

Central Asia: Kazakhstan, Kyrgyzstan, Tajikistan, Turkmenistan, Uzbekistan
East Asia: China, Hong Kong (Special Administrative Region of China), Japan, Mongolia, North Korea, South Korea, Taiwan
South Asia: Afghanistan, Bangladesh, Bhutan, India, Maldives, Nepal, Pakistan, Sri Lanka
Southeast Asia: Brunei, Cambodia, East Timor (Timor-Leste), Indonesia, Laos, Malaysia, Myanmar (Burma), Philippines, Singapore, Thailand, Vietnam
Western Asia (Middle East): Armenia, Azerbaijan, Bahrain, Cyprus, Georgia, Iraq, Israel, Jordan, Kuwait, Lebanon, Oman, Palestine, Qatar, Saudi Arabia, Syria, Turkey (partly in Europe, but often included in Asia contextually), United Arab Emirates, Yemen

Benefits of the Dataset:

Strategic Insights: Businesses and analysts can use the dataset to gain insights into significant regional developments, economic conditions, and political changes, aiding in strategic decision-making and market analysis.
Market and Industry Trends: The dataset provides valuable information on industry-specific trends and events, helping users understand market dynamics and identify emerging opportunities.
Media and PR Monitoring: Journalists and PR professionals can track relevant news across Asia, enabling them to monitor media coverage, identify emerging stories, and manage public relations efforts effectively.
Academic and Research Use: Researchers can utilize the dataset for longitudinal studies, trend analysis, and academic research on various topics related to Asian news and events.

Techsalerator's News Event Data in Asia is a crucial resource for accessing and analyzing significant news events across the continent. By offering detailed, categorized, and up-to-date information, it supports effective decision-making, research, and media monitoring across diverse sectors.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Introducing the Kannada Newspaper, Books, and Magazine Image Dataset - a diverse and comprehensive collection of images meticulously curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Kannada language.
Dataset Content & Diversity: Containing a total of 5000 images, this Kannada OCR dataset offers an equal distribution across newspapers, books, and magazines. Within, you'll find a diverse collection of content, including articles, advertisements, cover pages, headlines, callouts, and author sections from a variety of newspapers, books, and magazines. Images in this dataset showcase distinct fonts, writing formats, colors, designs, and layouts.
To ensure the diversity of the dataset and to support building robust text recognition models, we allow only a limited number (fewer than five) of unique images from a single resource. Stringent measures have been taken to exclude any personally identifiable information (PII), and in each image at least 80% of the space contains visible Kannada text.
Images have been captured under varying lighting conditions – both day and night – along with different capture angles and backgrounds, further enhancing dataset diversity. The collection features images in portrait and landscape modes.
All these images were captured by native Kannada speakers to ensure text quality and to avoid toxic content and PII text. We used recent iOS and Android mobile devices with cameras above 5 MP to capture all these images and maintain image quality. In this training dataset, images are available in both JPEG and HEIC formats.
Metadata: Along with the image data you will also receive detailed structured metadata in CSV format. For each image it includes metadata such as device information, source type (newspaper, magazine, or book image), and image type (portrait or landscape). Each image is properly renamed to correspond with the metadata.
The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Kannada text recognition models.
Update & Custom Collection: We're committed to expanding this dataset by continuously adding more images with the assistance of our native Kannada crowd community.
If you require a custom dataset tailored to your guidelines or specific device distribution, feel free to contact us. We're equipped to curate specialized data to meet your unique needs.
Furthermore, using our crowd community we can annotate the images with bounding boxes or transcribe the text in the images to align with your specific requirements.
License: This image dataset, created by FutureBeeAI, is now available for commercial use.
Conclusion: Leverage the power of this image dataset to elevate the training and performance of text recognition, text detection, and optical character recognition models within the realm of the Kannada language. Your journey to enhanced language understanding and processing starts here.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This is a sample dataset containing reference relationships among various information sources such as TV, newspapers, and online articles.
There are two datasets.
1. InputFileEdges.csv contains the information about the edges between nodes. The fields in this dataset are as follows:
(i) from: source (or starting) node id of the edge
(ii) to: target (or ending) node id of the edge
(iii) weight: the number of times the nodes were connected or referenced each other
(iv) type: the type of the link (hyperlink or mention) between these nodes
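A minimal sketch of loading the edge file into a directed, weighted graph with networkx, using the column names listed above:

import pandas as pd
import networkx as nx

edges = pd.read_csv("InputFileEdges.csv")

# "weight" and "type" (hyperlink or mention) are kept as edge attributes.
G = nx.from_pandas_edgelist(edges, source="from", target="to",
                            edge_attr=["weight", "type"],
                            create_using=nx.DiGraph)

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")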
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Many people believe that news media they dislike are biased, while their favorite news source isn't. Can we move beyond such subjectivity and measure media bias objectively, from data alone? The auto-generated figure below answers this question with a resounding "yes", showing left-leaning media on the left, right-leaning media on the right, establishment-critical media at the bottom, etc.
Media bias landscape: https://space.mit.edu/home/tegmark/phrasebias.jpg
Our algorithm analyzed over a million articles from over a hundred newspapers. It first auto-identifies phrases that help predict which newspaper a given article is from (e.g. "undocumented immigrant" vs. "illegal immigrant"). It then analyzes the frequencies of such phrases across newspapers and topics, producing the media bias landscape shown above. This means that although news bias is inherently political, its measurement need not be.
Here's our paper: arXiv:2109.00024. Our Kaggle data set contains the discriminative phrases and phrase counts needed to reproduce all the plots in our paper. The files contain the following data:
- The directory phrase_selection contains tables such as immigration_phrases.csv that you can open with Microsoft Excel. They contain the phrases that our method found most informative for predicting which newspaper an article is from, sorted by decreasing utility. Our analysis uses only the phrases passing all our screenings, i.e., those with ones in columns D, E and F.
- The directory counts contains tables such as immigration_counts.csv, listing the number of times that each phrase occurs in each newspaper's coverage of that topic.
- The file blacklist.csv contains journalist names and other phrases that were discarded because they helped reveal the identity of a newspaper without reflecting any political bias.
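As a rough sketch of how the counts tables can be used to compare outlets (assuming one row per phrase and one column per newspaper, which is how the description reads; the exact layout may differ):

import pandas as pd

# Assumed layout: one row per phrase, one column per newspaper.
counts = pd.read_csv("counts/immigration_counts.csv", index_col=0)

# Normalise to per-newspaper frequencies so outlets of different sizes are comparable.
freqs = counts / counts.sum(axis=0)

# Relative use of two example phrases across outlets.
for phrase in ["undocumented immigrant", "illegal immigrant"]:
    if phrase in freqs.index:
        print(phrase)
        print(freqs.loc[phrase].sort_values(ascending=False).head())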
If you have questions, please contact Samantha at sdalonzo@mit.edu or Max at tegmark@mit.edu.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Newspaper articles relating to Sir Ross Smith and the 1919 epic flight from England to Australia. Datasets are divided into the themes of the prelude to the epic flight, the epic flight itself, and the death and funeral of Sir Ross Smith. Articles are sourced from the South Australian newspapers The Advertiser, Daily Herald, The Observer and The Register.
Citizens' economic perceptions can shape their political and economic behavior, making those perceptions' origins an important question. Research commonly posits that media coverage is a central source. Here, we test that prospect while considering the alternative hypothesis that media coverage instead echoes public perceptions. This paper applies a straightforward automated measure of the tone of economic coverage to 490,039 articles from 24 national and local media outlets over more than three decades. By matching the 245,947 survey respondents in the Survey of Consumer Attitudes and Behavior to measures of contemporaneous media coverage, we can assess the sequencing of changes in media coverage and public perceptions. Together, these data illustrate that newspaper coverage does not systematically precede public perceptions of the economy, a finding which analyses of television transcripts reinforce. Neither national nor local newspapers appear to strongly influence economic perceptions.
Techsalerator’s News Event Data in Latin America offers a detailed and extensive dataset designed to provide businesses, analysts, journalists, and researchers with an in-depth view of significant news events across the Latin American region. This dataset captures and categorizes key events reported from a wide array of news sources, including press releases, industry news sites, blogs, and PR platforms, offering valuable insights into regional developments, economic changes, political shifts, and cultural events.
Key Features of the Dataset:

Comprehensive Coverage: The dataset aggregates news events from numerous sources such as company press releases, industry news outlets, blogs, PR sites, and traditional news media. This broad coverage ensures a wide range of information from multiple reporting channels.
Categorization of Events: News events are categorized into various types including business and economic updates, political developments, technological advancements, legal and regulatory changes, and cultural events. This categorization helps users quickly locate and analyze information relevant to their interests or sectors.
Real-Time Updates: The dataset is updated regularly to include the most recent events, ensuring users have access to the latest news and can stay informed about current developments.
Geographic Segmentation: Events are tagged with their respective countries and regions within Latin America. This geographic segmentation allows users to filter and analyze news events based on specific locations, facilitating targeted research and analysis.
Event Details: Each event entry includes comprehensive details such as the date of occurrence, source of the news, a description of the event, and relevant keywords. This thorough detailing helps in understanding the context and significance of each event.
Historical Data: The dataset includes historical news event data, enabling users to track trends and perform comparative analysis over time. This feature supports longitudinal studies and provides insights into how news events evolve.
Advanced Search and Filter Options: Users can search and filter news events based on criteria such as date range, event type, location, and keywords. This functionality allows for precise and efficient retrieval of relevant information.

Latin American Countries Covered:

South America: Argentina, Bolivia, Brazil, Chile, Colombia, Ecuador, Guyana, Paraguay, Peru, Suriname, Uruguay, Venezuela
Central America: Belize, Costa Rica, El Salvador, Guatemala, Honduras, Nicaragua, Panama
Caribbean: Cuba, Dominican Republic, Haiti (primarily French-speaking but included due to geographic and cultural ties), Jamaica, Trinidad and Tobago

Benefits of the Dataset:

Strategic Insights: Businesses and analysts can use the dataset to gain insights into significant regional developments, economic conditions, and political changes, aiding in strategic decision-making and market analysis.
Market and Industry Trends: The dataset provides valuable information on industry-specific trends and events, helping users understand market dynamics and emerging opportunities.
Media and PR Monitoring: Journalists and PR professionals can track relevant news across Latin America, enabling them to monitor media coverage, identify emerging stories, and manage public relations efforts effectively.
Academic and Research Use: Researchers can utilize the dataset for longitudinal studies, trend analysis, and academic research on various topics related to Latin American news and events.

Techsalerator's News Event Data in Latin America is a crucial resource for accessing and analyzing significant news events across the region. By providing detailed, categorized, and up-to-date information, it supports effective decision-making, research, and media monitoring across diverse sectors.
The newspaper with the highest print circulation in the United States in the six months running to September 2023 was The Wall Street Journal, with an average weekday print circulation of 555.2 thousand. Ranking second was The New York Times, followed by The New York Post. The paper in the ranking with the highest year-over-year drop in circulation was The Denver Post, with a decline of 25 percent (although Buffalo News recorded a higher drop, its data does not refer to September 2022 to September 2023; see notes).