Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here are a few use cases for this project:
Digital Archiving: Libraries, museums or any organization maintaining a collection of old newspapers can use this model for digitizing their archive. By recognizing articles, headlines, and other key features, they can create searchable digital records of historical newspapers.
Media Monitoring: Companies, PR agencies, or researchers can use the model to track coverage of specific topics or entities. For instance, they can analyse how often certain topics are covered, who talks about them, and what is being said.
Educational Tools: Educators can use the model in teaching resources for history, journalism, or literature classes. This could help students explore certain eras or events by quickly analyzing articles from historic newspapers related to those periods.
News Trend Analysis: The model could be used by news analysts or data scientists to understand trends in news coverage over time. For instance, one could analyze coverage of climate change over the years to identify shifts in attitudes or language.
Accessibility: For people with visual impairments or reading disabilities, the model can assist in converting visual newspaper data into formats they can engage with, such as input for text-to-speech software, improving their access to such resources.
This is the Newspaper collection of The National Library of The Netherlands (KB). "The KB promotes the visibility, usability and longevity of the Dutch Library Collection, defined as the collective holdings of all publicly funded libraries in the Netherlands" (KB mission statement). The following figures answer common questions about the composition of this collection.

What part of the collection is included in the Media Suite? The Media Suite gives access to the KB's newspaper "basic collection". "The basic collection contains approximately 11 million newspaper pages from the Netherlands, the Dutch East Indies, the Antilles, America and Surinam from 1618 to 1995. This is about 15% of all newspapers that have ever been published in the Netherlands" (KB "wat zit er in Delpher?").

What years does the archive cover? The KB newspaper basic collection includes newspapers from 1618 to 1995. The Media Suite harvested all the items available and integrated them into the Media Suite in May 2018. Figure 1: Number of newspaper articles in the collection over time.

How and how often is the data updated in the Media Suite? The collection's metadata and their OCR enrichments are made available to the CLARIAH Media Suite by the KB via their harvesting endpoint (OAI-PMH); a minimal harvesting sketch appears at the end of this description. The latest update to the Media Suite's data from this collection was made in May 2018.

What kind of media is included? The collection includes newspaper content of different types: articles, advertisements, illustrations with captions, and obituaries. Figure 2: Types of content in the KB newspaper basic collection.

What portion of the collection is digital? A large part of the KB newspaper basic collection is digital, and the KB is progressively digitizing more newspapers (KB "wat zit er in Delpher?"). Via the Media Suite, users can access the digitized newspapers in the KB Delpher search engine.

Does the collection include enrichments? This collection has undergone optical character recognition (OCR). The OCR output is available via the Media Suite for searching purposes only; to read the OCR, users are redirected to the KB Delpher search engine. Figure 3: Proportion of OCR-ed content in the KB newspaper basic collection.

Where to find more information? KB newspaper collection site (in English) (in Dutch), KB Delpher (newspapers) search engine, and KB information about "what is available via Delpher?".
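OAI-PMH, the harvesting protocol named above, is a standard HTTP interface. The sketch below shows a minimal ListRecords harvest; it is an illustration, not the Media Suite's actual pipeline. The endpoint URL is a placeholder (the KB's real endpoint is not given in this description), and oai_dc is the baseline metadata prefix every OAI-PMH repository must support; the KB may expose richer formats.

```python
import xml.etree.ElementTree as ET
import requests

# Placeholder endpoint: substitute the KB's actual OAI-PMH base URL.
ENDPOINT = "https://example.org/oai"

def list_records(endpoint: str, metadata_prefix: str = "oai_dc"):
    """Yield <record> elements, following OAI-PMH resumption tokens."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    ns = {"oai": "http://www.openarchives.org/OAI/2.0/"}
    while True:
        root = ET.fromstring(requests.get(endpoint, params=params, timeout=30).content)
        yield from root.iterfind(".//oai:record", ns)
        token = root.find(".//oai:resumptionToken", ns)
        if token is None or not (token.text or "").strip():
            break  # no more pages to fetch
        # Follow-up requests carry only the verb and the resumption token.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

for record in list_records(ENDPOINT):
    header = record.find("{http://www.openarchives.org/OAI/2.0/}header")
    print(header.findtext("{http://www.openarchives.org/OAI/2.0/}identifier"))
```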
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset presents the content curation practices of four Spanish newspapers that are representative of the global media ecosystem: two legacy media outlets (El País and La Vanguardia) and two digital-native platforms (elDiario.es and El Español). The study focuses on how these newspapers employ content curation in their front-page news, analyzing both the content and curation dimensions. The primary objective is to examine content curation practices in the front-page news of digital press through a sample of Spanish newspapers.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset for Logical-Layout Analysis on French Historical Newspapers
This is a dataset for training and testing logical-layout analysis and recognition systems on French historical documents. The original data is part of the "Fond régional: Franche Comté", which is curated by Gallica, the digital portal of the Bibliothèque nationale de France (BnF).
Description
This dataset is divided into a train and a test set. Both have been designed to cover, as far as possible, the variety of layouts that exist in the "Fond régional: Franche Comté" dataset. To do so, we have divided the documents into three layout types:
1c: documents where the text is displayed in one column, as in books;
2c: documents where the text is displayed in two columns;
3c+: documents where there are at least 3 columns of text, as in newspapers.
Each of these folders contains subfolders starting with the letters ‘cb’. These are the identifiers of newspaper collections such as « Le Petit Semeur ». Each such folder contains an XML file describing the collection, which is not relevant to the logical-layout analysis task. These folders also contain subfolders starting with the letters ‘bpt’, which contain the following files:
XXX.xml: the original XML file as gathered from Gallica.
truelabels_block: a CSV file giving the true label for each TextBlock tag. Each line contains the page, the block_id, the first and last line of text of the block, and its label.
truelabels_line: a CSV file giving the true label for each TextLine tag. Each line contains the page, the line_id, the text of the line, and its label.
XXX_docbook.xml: the document after processing by a logical-layout recognition system.
A minimal loading sketch for the two CSV files follows.
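As a quick illustration, here is a minimal sketch for loading the two label files with pandas. The column names are inferred from the description above, and the assumption that the files carry no header row should be checked against a sample file (note also that the file names vary slightly across this description, e.g. truelabels_block vs. truelabel_blocks.csv).

```python
import pandas as pd

# Column layouts inferred from the dataset description; if the files
# have their own header row, drop the `names=` arguments.
blocks = pd.read_csv(
    "truelabels_block.csv",
    names=["page", "block_id", "first_line", "last_line", "label"],
)
lines = pd.read_csv(
    "truelabels_line.csv",
    names=["page", "line_id", "text", "label"],
)

print(blocks["label"].value_counts())  # distribution of block labels
print(lines.head())                    # first few labelled text lines
```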
The original XML contains a range of information about the document, in particular metadata (described using the Dublin Core schema), the page numbering, and the OCR output, which is described in the XML ALTO format. As such, the files already provide the physical-layout analysis and the reading order of the documents.
The XML ALTO format provides the text content and physical layout of documents in the following manner. The OCR output for the whole document is contained in a PrintSpace tag. Lines of text are contained in TextLine tags, which in turn contain String tags for words and SP tags for spaces. TextLine tags are grouped into blocks in TextBlock tags. Sometimes, TextBlock tags are also grouped into ComposedBlock tags. TextBlock and TextLine tags have the following attributes:
Id: the tag's identifier.
Height, Width: the height and width of the text.
Vpos: the vertical position of the text on the page; the higher the value, the lower the text sits on the page.
Hpos: the horizontal position of the text on the page; the higher the value, the further to the right the text sits on the page.
Language: the language of the text (TextBlock tags only).
The blocks of text are labelled either as Text, Title, Header or Other. The lines of text are labelled either as Text, Firstline (to indicate the first line of a paragraph), Title, Header or Other. These labels are used in the truelabels_line.csv, truelabels_block.csv and XXX_docbook.xml files. A parsing sketch for the ALTO structure described above follows.
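To make the ALTO structure concrete, here is a minimal sketch that walks a file and prints each TextBlock with its position and text. The ALTO namespace URI and attribute casing differ between ALTO versions (attributes are often uppercased, e.g. ID, VPOS, HPOS, CONTENT), so the sketch matches tags by local name and looks attributes up case-insensitively; adjust to the files at hand.

```python
import xml.etree.ElementTree as ET

def local(tag: str) -> str:
    """Strip the XML namespace: '{ns}TextBlock' -> 'TextBlock'."""
    return tag.rsplit("}", 1)[-1]

def attr(elem: ET.Element, name: str) -> str:
    """Case-insensitive attribute lookup (Id vs. ID, Vpos vs. VPOS, ...)."""
    for key, value in elem.attrib.items():
        if key.lower() == name.lower():
            return value
    return ""

root = ET.parse("XXX.xml").getroot()
for block in root.iter():
    if local(block.tag) != "TextBlock":
        continue
    # Gather the word content of the block from its String tags.
    words = [
        attr(string, "CONTENT")
        for string in block.iter()
        if local(string.tag) == "String"
    ]
    print(attr(block, "ID"), attr(block, "VPOS"), attr(block, "HPOS"),
          " ".join(words)[:60])
```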
You can access the original scan of every document on the Gallica website. To do so, append the document's id (the 'bpt' folder name, e.g. bpt6k76208717) to the following base URL: https://gallica.bnf.fr/ark:/12148/ — giving, for the example id, https://gallica.bnf.fr/ark:/12148/bpt6k76208717.
This is a listing of Indigenous periodicals (newspapers, newsletters, magazines, and journals), arranged by title. It primarily includes material published in Canada, but also encompasses some titles from American states bordering Canada. The scope aims to include publications by Indigenous communities and organizations, and to exclude known material produced by governments and non-Indigenous organizations. The inventory represents known publications across Canada based on sources from OCLC, and known listings of these publications within the community. All items in the list are held in Canadian libraries, archives, and museums. The accuracy of these lists is unknown and not validated by Indigenous communities to our knowledge. The source data lists reflect the work of academic institutions describing the materials in their holdings. Indigenous communities may be listed as the primary creator, but this can only be validated upon investigation with the source materials and with Indigenous communities. The intent is threefold: to promote a list of Indigenous publications, and where they can be consulted or searched; to track digitization work by Canadian institutions and groups and facilitate digitization efforts in collaboration with relevant Indigenous communities; and to enable easy additions to, and corrections of, the list. It is important to note that this is not a search tool for the contents of the publications, but merely an inventory of titles, along with locations of the print and digital holdings. Data headings are Title, Title Family, In Scope, Status, Source of Information, Publisher/Issuing Org., Place of Publication, Province/State, Country, Print Run/Holdings, Notes, ISSN, OCLC Identifiers, Online, Format, Digitization Status, Canadian Repository Holdings, Language. For definitions of the headings, see The Dataset Document Workbook. This list stems from efforts by the Indigenous Historical Publications Working Group, working on behalf of the Council of Prairie and Pacific University Libraries (COPPUL). Input by Indigenous individuals, communities, organizations and publishers, as well as all researchers, libraries, archives, and museums is eagerly sought and welcomed. Please contact us for more information, comments, or to provide updates.
Monthly usage figures for online resources, including databases and e-book platforms where available, for January 2005 to present.

Additional information: a blank means no data is available. In 2020, all library buildings closed from 19 March (inclusive) due to the coronavirus outbreak.

Resources included, in the format name : description {minimum dates of subscription} : what the figure measures.

19th Century British Library Newspapers : digital newspaper archive {May 2007 - present} : Number of sessions
Access to Research : online journals {April 2014 - present} : Number of pages viewed
Ancestry : family history {October 2008 - present} : Number of sessions until May 2015; number of content pages viewed from June 2015
Britannica Online : encyclopedia {January 2005? - present} : Number of searches conducted until June 2014; number of sessions from July 2014
British Standards {March 2005 - April 2017; November 2017 - present} : Number of content pages viewed
British Way of Life : information to help asylum seekers, refugees and migrants get settled in the UK {October 2016 - January 2023} : Number of sessions - subscription ceased January 2023
Citizens Advice Notes : UK law made understandable {March 2007 - March 2016} : Number of pages viewed
COBRA : business information fact sheets and business sector profiles {October 2005 - present} : Number of pages viewed
Corporate researcher / Market IQ : company information database {January 2008 - 2015} : Number of "reports viewed"
EISODOS : information for foreigners coming to live in the UK {October 2008 - October 2013} : Information on meaning of figure lost
Enquire : "ask a librarian" online chat service {2005 - March 2016} : Number of chats started by users in the Newcastle area
Find my past : family history {April 2011 - present} : Number of sessions (or so we seem to remember when we had access to usage figures)
Go Citizen : citizenship test preparation for UK citizenship (replaces Life in Great Britain) {September 2023 - present} : Number of tests taken
IBISWorld : market research {January 2017 - present} : Number of pages viewed
Key Note : company information and market research {April 2011 - October 2018} : Number of reports viewed
Kompass : business information {2006 - July 2011} : Information on meaning of figure lost
Know UK : current reference information {January 2007 - June 2011} : Information on meaning of figure lost
Life in Great Britain : self-learn course to prepare for the Life in the UK citizenship test {January 2010 - January 2023} : Number of sessions - subscription ceased January 2023
Local Data Online : business (retail sector) information {November 2013 - July 2015?} : Number of queries per month - no longer receive stats on this as of July 2024
Mint UK & Mint Global : company information databases {March 2014 - 2015} : Information on meaning of figure lost
Mintel : market reports {2006? - April 2010; June 2013 - present} : Number of reports viewed
Newsstand : online newspapers {January 2011 - March 2014} : Information on meaning of figure lost
Onesource / Avention : company information database (changed name over the years) {March 2012 - October 2013; July 2015 - present} : Number of searches conducted - subscription ceased June 2024
News UK : newspaper articles {January 2007 - October 2010?} : Information on meaning of figure lost
Oxford English Dictionary {May 2006 - present} : Number of sessions
Oxford Art Online {March 2006 - present} : Number of sessions
Oxford Dictionaries {February 2015 - present} : Number of sessions
Oxford Dictionary of National Biography {January 2006 - present} : Number of sessions
Oxford Music Online {March 2006 - present} : Number of sessions
Oxford Reference Online {March 2006 - present} : Number of sessions
Safari Select : online books (to read online, as opposed to the e-books you can download and read offline) {May 2009 - March 2014} : Number of books viewed
Times Digital Archive : digitised newspapers {January 2005 - present} : Number of sessions
Theory Test Pro : practice questions for the driving theory test {August 2010 - present} : Number of sessions
Transparent language online / Byki : language courses {January 2011 - November 2012} : Number of courses accessed
Universal Skills : learn basic computer skills and how to use Universal Job Match {November 2014 - present} : Number of users
Newcastle Library App (devices) : number of devices the app is on {2013 - present}
Newcastle Library App (launches) : number of times the app has been used {2013 - present}
Bibliotheca Cloud Library : e-books and e-audiobooks {February 2016 - March 2018} : Number of items borrowed
Bolinda : e-audiobooks collection {2012 - February 2016} : Number of items borrowed (figures only from April 2015)
Bolinda BorrowBox e-books {February 2018 - present} : Number of items borrowed
Bolinda BorrowBox e-audiobooks {February 2018 - present} : Number of items borrowed
ComicsPlus : e-comic books {March 2017} : Number of items borrowed - no longer recorded, not sure when subscription ceased
OneClick / RB Digital (e-audiobooks) : e-audiobooks collection (became RB Digital in 2017?) {May 2015} : Number of items borrowed - no longer recorded, not sure when subscription ceased
Overdrive (e-audiobooks) {2011 - May 2016} : Number of items borrowed (figures only from April 2015) - subscription ceased January 2023
Overdrive : Number of items borrowed (figures only from April 2015) - subscription ceased March 2023 (name and dates of this second Overdrive subscription lost)
Public Library Online : e-books collection {April 2016 - February 2018} : Number of items borrowed
Zinio / RB Digital (magazines) : digital magazines (the Zinio service became integrated with the other RB Digital content in 2017) {May 2015 - present} : Number of magazines downloaded (figures only from January 2016)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ECPO joins several important digital collections of the early Chinese press and puts them into a single overarching framework. In the first phase, several databases on early women’s periodicals and entertainment publishing were created: “Chinese Women’s Magazines in the Late Qing and Early Republican Period” (WoMag), “Chinese Entertainment Newspapers” (Xiaobao), and databases hosted at the Academia Sinica in Taiwan. These systems approach the material in two ways: in the intensive approach we record all articles, images, advertisements, and related agents and assign them to a complete set of scanned pages, while in the extensive approach we record the main characteristic features of publications. ECPO is distinguished from other existing databases of Chinese periodicals in that it not only provides image scans but also preserves materials often excluded in reprint, microfilm, or digital (even full-text) editions, such as advertising inserts and illustrations. In addition, it aims at incorporating metadata in both English and Chinese, including keywords and biographical information on editors, authors and individuals represented in illustrations and advertisements in the journals. As the material basis of the database consists mostly of image scans, the project has been running experiments on one Republican newspaper to explore approaches toward full-text generation. Computer-aided processing of image scans of historical periodicals is still challenging with the current state of technology, in particular because processing standards for Latin-script newspapers do not apply to the Chinese context. Only new approaches in machine learning have made it possible to process material that was inaccessible just a few years ago. However, many challenges remain. Extremely complex layouts, which make reliable automatic page segmentation difficult, have prevented full-text generation for these newspapers even within China. The application of artificial intelligence requires a ground truth data set: error-free, manually corrected text with structural information, used for evaluation and training of software models for text and layout recognition. In the fall of 2021, the project successfully implemented OCR on a sample of the newspaper 晶報 Jing bao (The Crystal) with a character error rate (CER) below 3% (Henke 2021); a sketch of how this metric is conventionally computed follows below. On that basis, the project is now expanding and generalizing its approach. With additional funding recently received from the Research Council Cultural Dynamics in Globalized Worlds for the first half of 2022, the project is currently producing a new data set. The project’s aim is to offer a solution for automatically producing full text from Republican newspapers using neural networks and machine learning. The project’s current work will further develop its original aims and contribute to the field of research as a whole. With the disclosure of the project’s network models and data sets, its results can be reproduced and evaluated, and others can adopt its approaches in the field. Although processing non-Latin scripts is still a challenge in many cases, the project hopes its work may serve as a good-practice example for such initiatives. The data set provides a first and complete extract of all metadata edited by the project so far. Future versions will also incorporate the full text produced in our OCR pipeline.
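For readers unfamiliar with the metric, character error rate is conventionally defined as the Levenshtein (edit) distance between the OCR output and the ground-truth text, divided by the length of the ground truth. The sketch below implements that standard definition; it is an illustration, not the project's own evaluation code.

```python
def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit_distance(reference, hypothesis) / len(reference)."""
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1] / max(len(reference), 1)

# One wrong character out of four gives a CER of 0.25.
print(character_error_rate("晶報創刊", "晶報創利"))
```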
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract: Online journalism grows every day; many news agencies, newspapers, and magazines publish digitally on the global network. Documents published online are available to users, who find them through search engines. To deliver documents relevant to a search, the documents must be indexed and classified. Given the vast number of documents published online every day, much research has been carried out on ways to facilitate automatic document classification. The objective of the present study is to describe an experimental approach to the automatic classification of journalistic documents published on the Internet, using the Vector Space Model for document representation; a minimal sketch of such a pipeline follows below. The model was tested on a real journalism database, using algorithms that have been widely reported in the literature. This article also describes the metrics used to assess the performance of these algorithms and their required configurations. The results obtained show the efficiency of the method used and justify further research into facilitating the automatic classification of documents.
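The abstract names the Vector Space Model but not a specific toolkit or classifier. The sketch below is one common realization of the idea (TF-IDF weighted term vectors plus a linear classifier in scikit-learn); the toy documents and category labels are stand-ins for the paper's journalism database, not its actual data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the paper's journalism corpus and its categories.
documents = [
    "stock markets fell after the central bank raised rates",
    "the striker scored twice in the cup final",
    "parliament debated the new budget proposal",
    "the team announced a new head coach",
]
labels = ["economy", "sports", "politics", "sports"]

# Vector Space Model: each document becomes a TF-IDF weighted term vector,
# and a linear classifier is trained on those vectors.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
pipeline.fit(documents, labels)

# Shares vocabulary with the sports documents, so 'sports' is expected.
print(pipeline.predict(["the coach praised the striker after the final"]))
```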
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Lucie Training Dataset Card
The Lucie Training Dataset is a curated collection of text data in English, French, German, Spanish and Italian, culled from a variety of sources including web data, video subtitles, academic papers, digital books, newspapers, and magazines, some of which were processed by Optical Character Recognition (OCR). It also contains samples of diverse programming languages. The Lucie Training Dataset was used to pretrain Lucie-7B, a foundation LLM with strong… See the full description on the dataset page: https://huggingface.co/datasets/OpenLLM-France/Lucie-Training-Dataset. A minimal loading sketch follows.
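Since the dataset is hosted on the Hugging Face Hub, it should be loadable with the standard datasets library; the sketch below streams a few records rather than downloading the full multi-source corpus. Whether a configuration name or a specific split is required is an assumption to verify on the dataset page.

```python
from datasets import load_dataset

# Stream so the corpus is not downloaded in full. A config name
# (e.g. a language or source subset) may be required; check the
# dataset page if the default load fails.
dataset = load_dataset(
    "OpenLLM-France/Lucie-Training-Dataset",
    split="train",
    streaming=True,
)

for i, example in enumerate(dataset):
    print(example.keys())  # field names may vary by source subset
    if i >= 2:
        break
```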
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Here are a few use cases for this project:
"Digital Archiving": This model can be very useful in digital archiving libraries where large numbers of newspaper or magazines need to be categorized based on columns, rows, or articles. This drastically simplifies the indexing process, allowing users to easily find desired information.
"Academic Research": Researchers examining historical newspapers or periodicals can use the tool for better extracting and organizing relevant data. By identifying column, row and article layouts, they can more efficiently sift through vast amounts of textual content.
"News Aggregation Services": Online platforms that amass articles from various periodicals can use this computer vision model to automate the classification of their article database, making news discovery faster and more user-friendly for readers.
"Forensic Journalism": Professionally examining old newspapers or articles for investigative reporting will be streamlined with this model. It may easily identify content sequence or similar layouts to uncover hidden connections or patterns among articles.
"Layout Design Training": Designers or students learning about page layout could use this model to study patterns or trends in column, row, or article structure across a wide range of publications.
This dataset contains references to newspaper articles relating to what is now described as child sexual abuse, 1918-1970, collected through keyword searches of British newspapers that are available in digitised form. The dataset was created as part of the ESRC-funded project ES/M009750/1 ‘Historicising “historical child sexual abuse” cases: social, political and criminal justice contexts’. The purpose of this specific element of the project was to identify patterns in newspaper coverage across time.
The historical sexual abuse of children has become a central focal point of political, social and legal concern. On 7 July 2014 Home Secretary Theresa May announced a public inquiry into how complaints of sexual abuse have been dealt with by public bodies over the last 40 years; the inquiry will produce an interim report by May 2015, with a full report to follow at a later stage. A 10-week investigation has also been launched into allegations relating to Whitehall politicians. These announcements follow the NHS and Department of Health Investigations into Matters Relating to Jimmy Savile (published on 26 June 2014); a second report is due in 2015. The inquiries will hear important evidence from witnesses and examine files associated with the bodies under scrutiny. As yet, however, our knowledge of the broader history of sexual abuse in the twentieth century is extremely partial, with some incidents well charted and others ignored. A full understanding of the wider historical circumstances that have shaped social, legal and political responses to child sexual abuse (or their lack) is urgently needed to provide missing information to contextualise and complement these public inquiries.
This research project will carry out rapid desk-based research, using very significant sets of online sources that are already available in digital form, but whose potential for research into the history of child sexual abuse has not been realised. It will cover four significant areas:
We will construct quantitative profiles of the extent of the reporting and convictions of sexual offences from 1918 to 1990, making use of the published Criminal Justice Statistics for England and Wales.
We will carry out a qualitative longitudinal study of the role of the national and local newspaper press in reporting cases of child sexual abuse, and in shaping social attitudes towards young people and sexuality in the period 1918-1990. The newspaper press was a crucial arena through which public opinion was shaped and shifting moralities were discussed and debated for much of the twentieth century. Whilst the press cannot be viewed as an unproblematic barometer of opinion, it provides historians with an important lens through which to access a range of viewpoints and to chart dominant tropes and narratives. A survey of the newspaper press also enables us to access reports of the decisions that were made in the court-room and thus to further explain the trends for reporting and conviction that analysis of the criminal justice statistics reveals.
We will examine the shifting viewpoints of key professional groups, including social workers and lawyers, by undertaking a survey of publications associated with these occupational groups.
We will begin a mapping of organisations, bodies and associations who have commented on and campaigned around issues relating to children and sexuality across the broad period 1918-1990. This initial mapping will involve research into the availability of archival and manuscripts sources (including those held in the National Archives and local repositories) and will form the basis of a further funding application.
Our timetable is designed to coincide with the undertaking of the public inquiries and the preparation of the further report relating to the NHS and Department of Health Investigations. We will run seminars/workshops for civil servants, lawyers and other professionals involved in these investigations, and make our findings available in a free and easily accessible format as briefings on the History & Policy website. Our project will thus provide essential knowledge to shape discussion and debate, and inform the final public inquiry reports.