https://creativecommons.org/publicdomain/zero/1.0/
By Hugging Face Hub [source]
The Reuters-21578 dataset, a widely used collection of newswire articles from the Reuters financial newswire service, is a standard benchmark for text categorization research. It covers topics frequently reported on by financial publications and is available in multiple train/test splits for machine learning experiments.
The dataset's columns include text (the full body of the article), text_type (whether the article belongs to the training or test set), topics (the topic labels assigned to the document), lewis_split and cgis_split (two alternative train/test assignments used in prior research), the places, people, orgs, and exchanges mentioned in the article, plus date and title. Separate files contain the Reuters-21578 articles not used in specific splits (ModApte_unused.csv and ModLewis_unused.csv). Together these fields provide a rich basis for building and evaluating financial news categorization models.
The Reuters-21578 dataset is a great resource for uncovering valuable insights in financial news. With its wide range of topics and data splits, it is well suited as a benchmark dataset for text categorization research. Here are some tips on how to get the most out of it:
1. Familiarize yourself with the columns: Before getting started, make sure you understand what each column in the dataset means, and identify which ones are essential for your research project.
2. Use an appropriate split: Depending on your research goals, you may need different training and test sets from those provided (ModHayes_train/test or ModLewis_train/test). You can also create custom splits from the 'ModApte_unused' set contained within this collection.
3. Explore other methods: While text categorization is the usual task for this data, methods such as topic modelling or sentiment analysis can also uncover useful information.
4. Leverage related packages: In Python, NLTK ships a copy of this corpus (nltk.corpus.reuters, the ApteMod version) and Keras provides a preprocessed Reuters newswire dataset (keras.datasets.reuters), while scikit-learn's vectorizers (e.g. CountVectorizer, TfidfVectorizer) transform article text into feature vectors for models such as Naive Bayes or Random Forest classifiers.
5. Tackle low-level preprocessing tasks: Before building models, remember that input data benefits greatly from being cleaned up first, particularly removing invalid characters and symbols from languages other than English, which can hurt model accuracy. Minor steps such as stopword removal and stemming words to their root form can also improve overall performance.
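The vectorize-then-classify workflow in the tips above can be sketched end to end. This is a minimal illustration, not a recipe tied to this dataset's files: the tiny inline corpus and its 'gold'/'grain' labels are invented stand-ins for the dataset's article texts and topic labels.

```python
# Minimal sketch: TF-IDF features + Multinomial Naive Bayes topic
# classifier. The four toy documents below are invented examples that
# stand in for the dataset's article text and topic columns.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "gold prices rose on the bullion market today",
    "wheat and grain exports fell sharply this quarter",
    "gold futures climbed as the dollar weakened",
    "grain shipments to asia increased",
]
labels = ["gold", "grain", "gold", "grain"]

# English stopword removal mirrors the preprocessing tip above
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["bullion and gold markets steadied"])[0])  # gold
```

On the real dataset the same pipeline would be fit on the article bodies of a chosen training split, with the topic column as the label.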
- Automated text classification - Using the data from the Reuters-21578 dataset, machine learning algorithms can be trained to automatically classify and categorize newswire articles into their appropriate topics. This not only saves time, but also ensures reliable results with minimal human intervention.
- Sentiment analysis - By analyzing the sentiment of individual news articles in the Reuters-21578 dataset, one could gain valuable insight into how people generally perceive financial news and use this information to make more informed investing decisions.
- Stock market predictions - By applying data mining techniques to the content of news articles in this dataset, correlations between the topics or exchanges mentioned in an article and their effects on stock prices can be identified and used in algorithmic trading strategies aimed at predicting short-term stock price movements.
If you use this dataset in your research, please credit the orig...
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset contains the Reuters-21578 text categorization collection, a widely used benchmark for text classification tasks. The data consists of news articles from the Reuters newswire in 1987, categorized into various topics. This upload provides the dataset in its raw Standard Generalized Markup Language (SGML) format, allowing users maximum flexibility in parsing and preprocessing the text.
Folder Structure:
The main downloaded folder (reuters21578) contains the following files:
- all-exchanges-strings.lc.txt: A text file listing all the exchange-related categories present in the dataset.
- all-orgs-strings.lc.txt: A text file listing all the organization-related categories.
- all-people-strings.lc.txt: A text file listing all the people-related categories.
- all-places-strings.lc.txt: A text file listing all the place-related categories.
- all-topics-strings.lc.txt: Crucially, this file lists all the topic categories used for classifying the news articles. This is the primary set of labels for the text classification task.
- cat-descriptions_120396.txt: A text file providing descriptions for some of the categories.
- feldman-cia-worldfactbook-data.txt: This file appears to contain data related to the CIA World Factbook and might not be directly relevant to the Reuters article classification task.
- lewis.dtd: A Document Type Definition (DTD) file, which defines the structure and rules for the SGML files in the dataset. It is essential for correctly parsing the SGML files.
- README.txt (within the main folder and potentially within the reuters21578 subfolder): These files contain important information about the dataset, its origin, and usage. Users should definitely read these files to understand the dataset in detail.
- reut2-000.sgm to reut2-021.sgm (and potentially more): These are the core files of the dataset. Each .sgm file contains multiple Reuters news articles marked up in SGML format. These files include the article text, metadata, and the assigned topic labels.

Content of the Data:
The primary data for classification resides within the .sgm files. Each .sgm file contains one or more <REUTERS> blocks, representing individual news articles. Within these blocks, you will find:
- <TEXT>: Contains the main body of the news article, often including <TITLE> and <BODY> tags.
- <TOPICS>: Contains the topic labels assigned to the article, enclosed within <D> tags. An article can have multiple topics.
- <DATE>: The date of the news article.
- <LEWISSPLIT>, <CGISPLIT>, <OLDID>, <NEWID>: Metadata related to how the dataset has been split in different research contexts.

The all-topics-strings.lc.txt file provides the vocabulary of the topic labels you will be trying to predict.
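The tag layout described above can be illustrated with a quick extraction sketch. This is for illustration only, run on an invented inline snippet that mimics the real markup; a production pipeline should use an SGML-aware parser (e.g. Beautiful Soup) guided by lewis.dtd rather than regular expressions.

```python
# Illustrative only: regex extraction from one invented <REUTERS> block
# shaped like the tags described above. Real parsing should use an
# SGML-aware parser driven by lewis.dtd.
import re

sgml = """<REUTERS LEWISSPLIT="TRAIN" NEWID="1">
<DATE>26-FEB-1987 15:01:01.79</DATE>
<TOPICS><D>cocoa</D><D>coffee</D></TOPICS>
<TEXT>
<TITLE>BAHIA COCOA REVIEW</TITLE>
<BODY>Showers continued throughout the week ...</BODY>
</TEXT>
</REUTERS>"""

topics = re.findall(r"<D>(.*?)</D>", sgml)                       # multi-label topics
title = re.search(r"<TITLE>(.*?)</TITLE>", sgml, re.S).group(1)  # article title
body = re.search(r"<BODY>(.*?)</BODY>", sgml, re.S).group(1).strip()

print(topics)  # ['cocoa', 'coffee']
print(title)   # BAHIA COCOA REVIEW
```

Note that topics come back as a list: the multi-label nature of the collection means a classifier over this data must handle more than one topic per article.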
How to Use This Dataset:
1. Read the README.txt files to get a comprehensive understanding of the dataset and its conventions.
2. Use an SGML-capable parsing library (e.g., Python's legacy sgmllib, or Beautiful Soup with an SGML parser) to process the .sgm files. You will need to understand the lewis.dtd to correctly interpret the SGML structure.
3. Extract the article text from the <TEXT> tags for each article.
4. Extract the topic labels from the <TOPICS> and <D> tags. Be aware that an article can have multiple labels.
5. Consult the all-topics-strings.lc.txt file to understand the possible output classes for your classification model.

Citation:
Please cite the original source of the Reuters-21578 dataset:
David D. Lewis. Reuters-21578 Text Categorization Test Collection. Distribution 1.0, 1991.
Data Contribution:
Thank you for uploading this raw SGML version of the Reuters-21578 dataset. By providing the data in its original format, you offer the Kaggle community the opportunity to work with the data at its most fundamental level, allowing for diverse approaches to parsing, preprocessing, and feature engineering in text classification tasks.
If you find this description helpful and the dataset well-represented, please consider giving it an upvote after downloading. Your feedback is valuable!
This dataset was created by Paladugula Lakshmi Snigdha
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Reuters News Articles
An open-source dataset designed for information retrieval and natural language processing tasks.
Abstract
This dataset is a processed version of the Reuters-21578 dataset.
Reuters-21578 text categorization test collection, Distribution 1.0 (v 1.2), 26 September 1997. David D. Lewis, AT&T Labs - Research, lewis@research.att.com
Profile
The dataset was processed as part of our work on the reuters-search-engine project, where it was my primary… See the full description on the dataset page: https://huggingface.co/datasets/IsmaelMousa/reuters.
Multi-label dataset. A subset of the Reuters dataset includes 2000 observations for text classification.
Susant-Achary/reuters-articles dataset hosted on Hugging Face and contributed by the HF Datasets community
http://www.daviddlewis.com/resources/testcollections/reuters21578/readme.txt
Reuters-21578 text categorization test collection
Distribution 1.0
README file (v 1.3)
14 May 2004
David D. Lewis
David D. Lewis Consulting and Ornarose, Inc.
www.daviddlewis.com
I. Introduction
[Note: There's much that could be improved in this document, but given that Reuters-21578 is being superseded by RCV1, I'm not likely to make those improvements myself. Anyone who would like to create a revised version of this document is invited to contact me.]
This README describes Distribution 1.0 of the Reuters-21578 text categorization test collection, a resource for research in information retrieval, machine learning, and other corpus-based research.
II. Copyright & Notification
The copyright for the text of newswire articles and Reuters
annotations in the Reuters-21578 collection resides with Reuters Ltd.
Reuters Ltd. and Carnegie Group, Inc. have agreed to allow the free
distribution of this data for research purposes only.
If you publish results based on this data set, please acknowledge
its use, refer to the data set by the name "Reuters-21578,
Distribution 1.0", and inform your readers of the current location of
the data set (see "Availability & Questions").
III. Availability & Questions
The Reuters-21578, Distribution 1.0 test collection is available from http://www.daviddlewis.com/resources/testcollections/reuters21578
Besides this README file, the collection consists of 22 data files, an SGML DTD file describing the data file format, and six files describing the categories used to index the data. (See Sections VI and VII for more details.) Some additional files, which are not part of the collection but have been contributed by other researchers as useful resources are also included. All files are available uncompressed, and in addition a single gzipped Unix tar archive of the entire distribution is available as reuters21578.tar.gz.
The text categorization mailing list, DDLBETA, is a good place to send questions about this collection and other text categorization issues. You may join the list by writing David Lewis at ddlbeta-request@daviddlewis.com.
IV. History & Acknowledgements
The documents in the Reuters-21578 collection appeared on the Reuters newswire in 1987. The documents were assembled and indexed with categories by personnel from Reuters Ltd. (Sam Dobbins, Mike Topliss, Steve Weinstein) and Carnegie Group, Inc. (Peggy Andersen, Monica Cellio, Phil Hayes, Laura Knecht, Irene Nirenburg) in 1987.
In 1990, the documents were made available by Reuters and CGI for research purposes to the Information Retrieval Laboratory (W. Bruce Croft, Director) of the Computer and Information Science Department at the University of Massachusetts at Amherst. Formatting of the documents and production of associated data files was done in 1990 by David D. Lewis and Stephen Harding at the Information Retrieval Laboratory.
Further formatting and data file production was done in 1991 and 1992 by David D. Lewis and Peter Shoemaker at the Center for Information and Language Studies, University of Chicago. This version of the data was made available for anonymous FTP as "Reuters-22173, Distribution 1.0" in January 1993. From 1993 through 1996, Distribution 1.0 was hosted at a succession of FTP sites maintained by the Center for Intelligent Information Retrieval (W. Bruce Croft, Director) of the Computer Science Department at the University of Massachusetts at Amherst.
At the ACM SIGIR '96 conference in August, 1996 a group of text categorization researchers discussed how published results on Reuters-22173 could be made more comparable across studies. It was decided that a new version of the collection should be produced with less ambiguous formatting, and including documentation carefully spelling out standard methods of using the collection. The opportunity would also be used to correct a variety of typographical and other errors in the categorization and formatting of the collection.
Steve Finch and David D. Lewis did this cleanup of the collection September through November of 1996, relying heavily on Finch's SGML-tagged version of the collection from an earlier study. One result of the re-examination of the collection was the removal of 595 documents which were exact duplicates (based on identity of timestamps down to the second) of other documents in the collection. The new collection therefore has only 21,578 documents, and thus is called the Reuters-21578 collection. This README describes version 1.0 of this new collection, which we refer to as "Reuters-21578, Distribution 1.0".
In preparing the collection...
shashverma05/reuters dataset hosted on Hugging Face and contributed by the HF Datasets community
The Reuters-21578 dataset is a collection of documents containing news articles. Originally, the corpus comprises 10,369 documents and has a vocabulary of 29,930 unique words.
An additional challenge arises when the labels of the training instances are provided by noisy, heterogeneous crowdworkers of unknown quality. As a starting point, assuming the labels come from a perfect source can help in modeling the problem effectively.
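The simplest baseline for aggregating such noisy crowd labels is a per-document majority vote, which ignores annotator quality entirely. The sketch below uses invented document IDs and votes purely for illustration.

```python
# Sketch of the simplest crowd-label aggregation: per-document majority
# vote over annotator labels. Annotator qualities are ignored, and the
# votes below are invented for illustration.
from collections import Counter

crowd_labels = {
    "doc1": ["earn", "earn", "acq"],
    "doc2": ["grain", "wheat", "grain"],
}

# most_common(1) returns the single most frequent vote per document
aggregated = {
    doc: Counter(votes).most_common(1)[0][0]
    for doc, votes in crowd_labels.items()
}

print(aggregated)  # {'doc1': 'earn', 'doc2': 'grain'}
```

More refined approaches (e.g. weighting annotators by estimated reliability) build on this same per-document aggregation step.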
Source of data: https://paperswithcode.com/dataset/reuters-21578
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset from Kaggle. The split is done on the training set using iterative_train_test_split from scikit-multilearn. There are the following 90 labels: 'interest', 'groundnut-oil', 'potato', 'palmkernel', 'sun-meal', 'lei', 'cotton-oil', 'sunseed', 'sorghum', 'barley', 'dlr', 'groundnut', 'wpi', 'strategic-metal', 'livestock', 'l-cattle', 'lin-oil', 'gold', 'fuel', 'nzdlr', 'oat', 'soybean', 'hog', 'tin', 'lumber', 'bop', 'soy-oil', 'dfl', 'nkr', 'gas', 'carcass'… See the full description on the dataset page: https://huggingface.co/datasets/KushT/reuters-21578-train-val-test.
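A multi-label splitter such as scikit-multilearn's iterative_train_test_split, mentioned above, expects the labels as a binary indicator matrix. One common way to build that matrix is scikit-learn's MultiLabelBinarizer; the three articles below are invented, using topic names from the 90-label list.

```python
# Sketch: turning per-article topic lists into the 0/1 indicator matrix
# that multi-label splitters and classifiers expect. The three articles
# are invented; topic names come from the label list above.
from sklearn.preprocessing import MultiLabelBinarizer

article_topics = [
    ["interest", "dlr"],     # e.g. interest rates and the dollar
    ["gold"],
    ["barley", "sorghum"],
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(article_topics)

print(list(mlb.classes_))  # ['barley', 'dlr', 'gold', 'interest', 'sorghum']
print(Y.tolist())          # one 0/1 row per article, one column per topic
```

On the full dataset, fitting the binarizer over all 90 topic strings yields a 90-column matrix that can be passed directly to iterative_train_test_split together with the feature matrix.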
Results – Reuters.
The Thomson Reuters IPSOS Primary Consumer Sentiment Index (PCSI) in Japan measures consumer confidence by aggregating data on personal financial conditions, economic expectations, investment climate, and employment outlook.
Traffic analytics, rankings, and competitive metrics for reuters.com as of January 2026
Jukaboo/Reuters dataset hosted on Hugging Face and contributed by the HF Datasets community
https://sem1.heaventechit.com/company/legal/terms-of-service/
reuters.com is ranked #261 in US with 78.33M Traffic. Categories: Finance, Newspapers.
Quarterly and annual financial metrics, earnings history, and company performance data for Thomson Reuters.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This English corpus is based on the well-known Reuters-21578 corpus, which contains economic news articles. In particular, we chose 128 articles containing at least one NE. Compared to the News-100 corpus, the documents of Reuters-128 are significantly shorter and thus carry a smaller context.
To create the annotation of NEs with URIs, we implemented a supporting judgement tool. The input for the tool was a subset of more than 150 Reuters-21578 news articles sampled randomly. First, FOX (Ngonga Ngomo et al., 2011) was used to recognize an initial set of NEs, which reduced the manual work to a feasible amount given the size of this dataset. Afterwards, domain experts corrected FOX's mistakes manually using the annotation tool, which highlighted the entities in the texts and added initial URI candidates via simple string-matching algorithms. Two scientists determined the correct URI for each named entity manually, with an initial voter agreement of 74%. This low initial agreement rate hints at the difficulty of the disambiguation task. In some cases the judges did not agree initially, but came to an agreement shortly after reviewing the cases. While annotating, we left out ticker symbols of companies (e.g., GOOG for Google Inc.), abbreviations, and job descriptions, because those are always preceded by the full company name or a person's name, respectively.
The workforce of Thomson Reuters declined significantly between 2009 and 2023. In 2023, however, their workforce grew slightly by approximately *** employees.
Thomson Reuters
The Thomson Reuters Corporation is a multinational mass media and information company headquartered in Toronto, Canada. Outside of professional circles, the company is perhaps most associated with the provision of unaffiliated news content to media outlets under the Reuters name, including stories and photos for publication in newspapers. When broken down by business line, however, these services constituted a small share of the revenue generated by the company. The majority of revenue came from the provision of information services to corporations and governments, covering legal, tax and accounting, and policy-making more broadly. Of these services, the provision of legal information to law firms was the largest source of revenue.
Reason for decline in employee numbers
As with its employee numbers, the revenue of Thomson Reuters saw a major decline between 2011 and 2018, but has somewhat recovered since then. This decline was primarily due to the sale of the company's stake in its financial and risk division. Formerly this division comprised a majority of the company's revenue, and the sharp drop in revenue for 2017 reflects the removal of this division's revenue from Thomson Reuters' balance sheet. Despite this loss of gross revenue, the company's net income remained relatively unaffected.
The revenue of Thomson Reuters, headquartered in Canada, amounted to ************* U.S. dollars in 2024. The reported fiscal year ends on December 31. Compared to 2020, this marks an increase of approximately ************* U.S. dollars, and the increase from 2020 to 2024 was continuous.