100+ datasets found
  1. Reuters-21578 (Text Categorization)

    • kaggle.com
    zip
    Updated Dec 2, 2022
    Cite
    The Devastator (2022). Reuters-21578 (Text Categorization) [Dataset]. https://www.kaggle.com/datasets/thedevastator/uncovering-financial-insights-with-the-reuters-2
    Available download formats: zip (18703298 bytes)
    Dataset updated
    Dec 2, 2022
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Reuters-21578 (Text Categorization)

    Reuters financial newswire service in 1987

    By Huggingface Hub [source]

    About this dataset

    The Reuters-21578 dataset is one of the most influential and widely used collections of newswire articles from the Reuters financial newswire service and a standard benchmark for text categorization research. It covers topics frequently reported by financial publications and is available in multiple train/test splits.

    The dataset's columns include text (the full body of the article), text_type (in the original collection, the TYPE attribute of the <TEXT> element, which describes the article's text structure rather than its split), topics (the topics associated with the document), lewis_split and cgis_split (the train/test assignments recorded in the original collection's LEWISSPLIT and CGISPLIT attributes, the latter referring to the split used in the Carnegie Group, Inc. experiments), the places, people, orgs and exchanges mentioned in the article, date, and title. Separate files contain the Reuters-21578 articles that were not used in specific splits (ModApte_unused.csv & ModLewis_unused.csv). Together these fields support building and evaluating models for financial news categorization.
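
    As a minimal sketch of how the split column might be used, the snippet below groups CSV rows by their lewis_split value using only the standard library. The sample rows, the plain-string topics format, and the split values TRAIN and TEST are assumptions for illustration; check the actual files for the real column contents.

```python
import csv
import io
from collections import defaultdict


def group_by_split(csv_file, split_column="lewis_split"):
    """Group rows of an open CSV file by the value of the split column."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[row[split_column]].append(row)
    return dict(groups)


# Hypothetical sample rows; the real files may format these fields differently.
SAMPLE_CSV = (
    "title,text,topics,lewis_split\n"
    "A,first body,cocoa,TRAIN\n"
    "B,second body,grain,TEST\n"
    "C,third body,earn,TRAIN\n"
)

groups = group_by_split(io.StringIO(SAMPLE_CSV))
for split_name, rows in groups.items():
    print(split_name, len(rows))
```

    For a real file, pass an open file handle (for example `open(path, newline="")`) instead of the in-memory sample.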

    How to use the dataset

    The Reuters-21578 dataset is a great resource for uncovering valuable insights in financial news. With its wide range of topics and data splits, it is well-suited to be used as a benchmark dataset for text categorization research. Here are some tips and tricks on how to get the most out of this dataset:

    1. Familiarize yourself with the columns: Before getting started, make sure to familiarize yourself with all of the columns included in the dataset. This includes understanding what each column means, as well as identifying which are essential for your research project.

    2. Use an appropriate split: Depending on your research goals, you may prefer a different training and test set from those provided in this dataset (ModHayes_train/test or ModLewis_train/test). You can also build custom splits that draw on the ‘ModApte_unused’ set included in this collection.

    3. Explore other methods: While text categorization is often used with this type of data, you may also want to explore other methods that can help uncover useful information such as topic modelling or sentiment analysis.

    4. Leverage related packages: Python and R both have packages for working with textual data. In Python, NLTK ships a Reuters corpus reader (nltk.corpus.reuters) and Keras a preprocessed copy (keras.datasets.reuters). scikit-learn has no dedicated Reuters-21578 module, but its vectorizers (e.g., CountVectorizer, TfidfVectorizer) transform words into feature vectors for ML models such as Naive Bayes or Random Forest classifiers. In R, general text-mining packages such as tm serve a similar purpose.

    5. Tackle low-level preprocessing tasks: Before building models, clean the input data, in particular by removing invalid characters and stray symbols from languages other than English, which can hurt model accuracy. Stopword removal and stemming words to their root form may also improve performance.
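
    The preprocessing tip above can be sketched with the standard library alone. This is an illustration under stated assumptions, not a recommended pipeline: the stopword list is a tiny hand-picked sample, and strip_suffix is a crude, hypothetical stand-in for a real stemmer such as Porter's.

```python
import re

# Tiny illustrative stopword list (an assumption; real projects use a fuller list).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on"}


def strip_suffix(token):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters of the word remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Drop non-ASCII characters, lowercase, tokenize, remove stopwords, strip suffixes."""
    ascii_only = text.encode("ascii", errors="ignore").decode("ascii").lower()
    tokens = re.findall(r"[a-z]+", ascii_only)
    return [strip_suffix(t) for t in tokens if t not in STOPWORDS]


tokens = preprocess("The prices of cocoa futures rallied in London…")
print(tokens)
```

    Dropping every non-ASCII character is a blunt choice; it also removes legitimate accented letters, so a gentler normalization may suit some projects better.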

    Research Ideas

    • Automated text classification - Using the data from the Reuters-21578 dataset, machine learning algorithms can be trained to automatically classify and categorize newswire articles into their appropriate topics. This not only saves time, but also ensures reliable results with minimal human intervention.
    • Sentiment analysis - By analyzing the sentiment of individual news articles in the Reuters-21578 dataset, one could gain insight into how financial news is generally perceived and use this information to make more informed investing decisions.
    • Stock market predictions - By applying data mining techniques to the content of news articles in this dataset, correlations between the topics or exchanges mentioned in an article and subsequent stock prices can be identified and used in algorithmic trading strategies that aim to predict short-term stock price movements.

    Acknowledgements

    If you use this dataset in your research, please credit the orig...

  2. Reuters newswire classification dataset

    • kaggle.com
    zip
    Updated May 15, 2025
    Cite
    Ghanshyam Saini (2025). Reuters newswire classification dataset [Dataset]. https://www.kaggle.com/datasets/ghnshymsaini/reuters-newswire-classification-dataset
    Available download formats: zip (8272593 bytes)
    Dataset updated
    May 15, 2025
    Authors
    Ghanshyam Saini
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Reuters newswire classification dataset

    This dataset contains the Reuters-21578 text categorization collection, a widely used benchmark for text classification tasks. The data consists of news articles from the Reuters newswire in 1987, categorized into various topics. This upload provides the dataset in its raw Standard Generalized Markup Language (SGML) format, allowing users maximum flexibility in parsing and preprocessing the text.

    Folder Structure:

    The main downloaded folder (reuters21578) contains the following files:

    • all-exchanges-strings.lc.txt: A text file listing all the exchange-related categories present in the dataset.
    • all-orgs-strings.lc.txt: A text file listing all the organization-related categories.
    • all-people-strings.lc.txt: A text file listing all the people-related categories.
    • all-places-strings.lc.txt: A text file listing all the place-related categories.
    • all-topics-strings.lc.txt: Crucially, this file lists all the topic categories used for classifying the news articles. This is the primary set of labels for the text classification task.
    • cat-descriptions_120396.txt: A text file providing descriptions for some of the categories.
    • feldman-cia-worldfactbook-data.txt: This file appears to contain data related to the CIA World Factbook and might not be directly relevant to the Reuters article classification task.
    • lewis.dtd: This file is a Document Type Definition (DTD) file, which defines the structure and rules for the SGML files in the dataset. It's essential for correctly parsing the SGML files.
    • README.txt (within the main folder and potentially within the reuters21578 subfolder): These files contain important information about the dataset, its origin, and usage. Users should definitely read these files to understand the dataset in detail.
    • reut2-000.sgm to reut2-021.sgm (and potentially more): These are the core files of the dataset. Each .sgm file contains multiple Reuters news articles marked up in SGML format. These files include the article text, metadata, and the assigned topic labels.

    Content of the Data:

    The primary data for classification resides within the .sgm files. Each .sgm file contains one or more <REUTERS> blocks, representing individual news articles. Within these blocks, you will find:

    • <TEXT>: Contains the main body of the news article, often including <TITLE> and <BODY> tags.
    • <TOPICS>: Contains the topic labels assigned to the article, enclosed within <D> tags. An article can have multiple topics.
    • <DATE>: The date of the news article.
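    • LEWISSPLIT, CGISPLIT, OLDID, NEWID: Attributes of the <REUTERS> tag itself rather than separate elements. LEWISSPLIT and CGISPLIT record how the article was assigned in earlier train/test splits; OLDID and NEWID are the article's identifiers in the older and current versions of the collection.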

    The all-topics-strings.lc.txt file provides the vocabulary of the topic labels you will be trying to predict.
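
    One way to turn that vocabulary into model targets is a multi-hot vector per article. The sketch below assumes the vocabulary file holds one topic string per line; the four topics in the sample are a small illustrative subset, and load_vocabulary and multi_hot are hypothetical helper names.

```python
def load_vocabulary(lines):
    """Map each non-empty topic string to a column index, in file order."""
    topics = [line.strip() for line in lines if line.strip()]
    return {topic: index for index, topic in enumerate(topics)}


def multi_hot(article_topics, vocab):
    """Return a 0/1 vector with a 1 for every topic assigned to the article."""
    vector = [0] * len(vocab)
    for topic in article_topics:
        if topic in vocab:  # silently skip labels missing from the vocabulary
            vector[vocab[topic]] = 1
    return vector


# Stand-in for iterating over the lines of all-topics-strings.lc.txt.
vocab = load_vocabulary(["acq\n", "cocoa\n", "earn\n", "grain\n", "\n"])
print(multi_hot(["grain", "acq"], vocab))
```

    An article with no topics maps to an all-zero vector, which is one reason to decide early whether unlabeled articles belong in the training data.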

    How to Use This Dataset:

    1. Download the entire folder.
    2. Read the README.txt files to get a comprehensive understanding of the dataset and its conventions.
    3. Utilize an SGML-tolerant parser in Python (e.g., Beautiful Soup with the built-in html.parser; sgmllib exists only in Python 2) to process the .sgm files. Consult lewis.dtd to interpret the SGML structure correctly.
    4. Extract the text content from the <TEXT> tags for each article.
    5. Extract the topic labels from the <TOPICS> and <D> tags. Be aware that an article can have multiple labels.
    6. You will likely need to perform significant data cleaning and preprocessing on the extracted text.
    7. Use the all-topics-strings.lc.txt file to understand the possible output classes for your classification model.
    8. Consider how you want to handle multi-label classification if an article has multiple topics.
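
    Steps 3 to 5 above can be sketched with regular expressions from the standard library. This is a rough illustration, not a full SGML parser: it assumes well-formed <REUTERS> blocks like the hand-written sample below, and the output field names (new_id, lewis_split, and so on) are hypothetical.

```python
import html
import re

REUTERS_RE = re.compile(r"<REUTERS(?P<attrs>[^>]*)>(?P<inner>.*?)</REUTERS>", re.DOTALL)
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')
TOPICS_RE = re.compile(r"<TOPICS>(.*?)</TOPICS>", re.DOTALL)
D_RE = re.compile(r"<D>(.*?)</D>", re.DOTALL)
TITLE_RE = re.compile(r"<TITLE>(.*?)</TITLE>", re.DOTALL)
BODY_RE = re.compile(r"<BODY>(.*?)</BODY>", re.DOTALL)


def _first(pattern, text):
    """Return the first match with entities such as &lt; decoded, or ''."""
    match = pattern.search(text)
    return html.unescape(match.group(1)).strip() if match else ""


def parse_sgm(sgml_text):
    """Extract split attributes, topics, title, and body from each article."""
    articles = []
    for block in REUTERS_RE.finditer(sgml_text):
        attrs = dict(ATTR_RE.findall(block.group("attrs")))
        inner = block.group("inner")
        topics_match = TOPICS_RE.search(inner)
        topics = D_RE.findall(topics_match.group(1)) if topics_match else []
        articles.append({
            "new_id": attrs.get("NEWID"),
            "lewis_split": attrs.get("LEWISSPLIT"),
            "topics": topics,  # may hold zero, one, or several labels
            "title": _first(TITLE_RE, inner),
            "body": _first(BODY_RE, inner),
        })
    return articles


# Hand-written sample in the style of the .sgm files (not copied from the data).
SAMPLE = (
    '<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="1" NEWID="1">\n'
    "<DATE>26-FEB-1987 15:01:01.79</DATE>\n"
    "<TOPICS><D>cocoa</D><D>grain</D></TOPICS>\n"
    "<TEXT>\n<TITLE>COCOA REVIEW</TITLE>\n<BODY>Body one.</BODY>\n</TEXT>\n"
    "</REUTERS>\n"
    '<REUTERS TOPICS="NO" LEWISSPLIT="TEST" CGISPLIT="TRAINING-SET" OLDID="2" NEWID="2">\n'
    "<TOPICS></TOPICS>\n"
    "<TEXT>\n<TITLE>&lt;ACME CORP> SETS DIVIDEND</TITLE>\n<BODY>Body two.</BODY>\n</TEXT>\n"
    "</REUTERS>\n"
)

articles = parse_sgm(SAMPLE)
print(len(articles), articles[0]["topics"])
```

    When reading the real files, opening them with a permissive single-byte encoding such as `encoding="latin-1"` is a reasonable assumption to start from, since the files are not guaranteed to be valid UTF-8.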

    Citation:

    Please cite the original source of the Reuters-21578 dataset:

    David D. Lewis. Reuters-21578 Text Categorization Test Collection. Distribution 1.0, 1991.

    Data Contribution:

    Thank you for uploading this raw SGML version of the Reuters-21578 dataset. By providing the data in its original format, you offer the Kaggle community the opportunity to work with the data at its most fundamental level, allowing for diverse approaches to parsing, preprocessing, and feature engineering in text classification tasks.

    If you find this description helpful and the dataset well-represented, please consider giving it an upvote after downloading. Your feedback is valuable!

  3. reuters-dataset

    • kaggle.com
    zip
    Updated Mar 19, 2023
    Cite
    Paladugula Lakshmi Snigdha (2023). reuters-dataset [Dataset]. https://www.kaggle.com/datasets/snigdhapaladugula/reuters-dataset
    Available download formats: zip (26560 bytes)
    Dataset updated
    Mar 19, 2023
    Authors
    Paladugula Lakshmi Snigdha
    Description

    Dataset

    This dataset was created by Paladugula Lakshmi Snigdha

    Contents

  4. reuters

    • huggingface.co
    Cite
    Ismael, reuters [Dataset]. https://huggingface.co/datasets/IsmaelMousa/reuters
    Authors
    Ismael
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Reuters News Articles

    An open-source dataset designed for information retrieval and natural language processing tasks.

      Abstract
    

    This dataset is the processed version of reuters-21578 dataset.

    Reuters-21578 text categorization test collection Distribution 1.0 (v 1.2) 26 September 1997 David D. Lewis AT&T Labs - Research lewis@research.att.com

      Profile
    

    The dataset was processed as part of our work on the reuters-search-engine project, where it was my primary… See the full description on the dataset page: https://huggingface.co/datasets/IsmaelMousa/reuters.

  5. reuters

    • openml.org
    • api.openml.org
    Updated Feb 16, 2017
    Cite
    (2017). reuters [Dataset]. https://www.openml.org/d/40594
    Dataset updated
    Feb 16, 2017
    Description

    Multi-label dataset. A subset of the reuters dataset includes 2000 observations for text classification.

  6. reuters-articles

    • huggingface.co
    Updated Sep 21, 2024
    Cite
    Susant Achary (2024). reuters-articles [Dataset]. https://huggingface.co/datasets/Susant-Achary/reuters-articles
    Dataset updated
    Sep 21, 2024
    Authors
    Susant Achary
    Description

    Susant-Achary/reuters-articles dataset hosted on Hugging Face and contributed by the HF Datasets community

  7. Dataset Reuters newswire topics in keras

    • kaggle.com
    zip
    Updated Aug 4, 2021
    Cite
    nor (2021). Dataset Reuters newswire topics in keras [Dataset]. https://www.kaggle.com/wordroid/dataset-reuters-newswire-topics-in-keras
    Available download formats: zip (2340745 bytes)
    Dataset updated
    Aug 4, 2021
    Authors
    nor
    Description

    http://www.daviddlewis.com/resources/testcollections/reuters21578/readme.txt

       Reuters-21578 text categorization test collection
                Distribution 1.0
               README file (v 1.3)
                 14 May 2004
    
                David D. Lewis
          David D. Lewis Consulting and Ornarose, Inc. 
               www.daviddlewis.com
    

    I. Introduction

    [Note: There's much that could be improved in this document, but given that Reuters-21578 is being superceded by RCV1, I'm not likely to make those improvements myself. Anyone who would like to create a revised version of this document is invited to contact me.]

    This README describes Distribution 1.0 of the Reuters-21578 text categorization test collection, a resource for research in information retrieval, machine learning, and other corpus-based research.

    II. Copyright & Notification

    The copyright for the text of newswire articles and Reuters annotations in the Reuters-21578 collection resides with Reuters Ltd. Reuters Ltd. and Carnegie Group, Inc. have agreed to allow the free distribution of this data for research purposes only.
    If you publish results based on this data set, please acknowledge its use, refer to the data set by the name "Reuters-21578, Distribution 1.0", and inform your readers of the current location of the data set (see "Availability & Questions").

    III. Availability & Questions

    The Reuters-21578, Distribution 1.0 test collection is available from http://www.daviddlewis.com/resources/testcollections/reuters21578

    Besides this README file, the collection consists of 22 data files, an SGML DTD file describing the data file format, and six files describing the categories used to index the data. (See Sections VI and VII for more details.) Some additional files, which are not part of the collection but have been contributed by other researchers as useful resources are also included. All files are available uncompressed, and in addition a single gzipped Unix tar archive of the entire distribution is available as reuters21578.tar.gz.

    The text categorization mailing list, DDLBETA, is a good place to send questions about this collection and other text categorization issues. You may join the list by writing David Lewis at ddlbeta-request@daviddlewis.com.

    IV. History & Acknowledgements

    The documents in the Reuters-21578 collection appeared on the Reuters newswire in 1987. The documents were assembled and indexed with categories by personnel from Reuters Ltd. (Sam Dobbins, Mike Topliss, Steve Weinstein) and Carnegie Group, Inc. (Peggy Andersen, Monica Cellio, Phil Hayes, Laura Knecht, Irene Nirenburg) in 1987.

    In 1990, the documents were made available by Reuters and CGI for research purposes to the Information Retrieval Laboratory (W. Bruce Croft, Director) of the Computer and Information Science Department at the University of Massachusetts at Amherst. Formatting of the documents and production of associated data files was done in 1990 by David D. Lewis and Stephen Harding at the Information Retrieval Laboratory.

    Further formatting and data file production was done in 1991 and 1992 by David D. Lewis and Peter Shoemaker at the Center for Information and Language Studies, University of Chicago. This version of the data was made available for anonymous FTP as "Reuters-22173, Distribution 1.0" in January 1993. From 1993 through 1996, Distribution 1.0 was hosted at a succession of FTP sites maintained by the Center for Intelligent Information Retrieval (W. Bruce Croft, Director) of the Computer Science Department at the University of Massachusetts at Amherst.

    At the ACM SIGIR '96 conference in August, 1996 a group of text categorization researchers discussed how published results on Reuters-22173 could be made more comparable across studies. It was decided that a new version of collection should be produced with less ambiguous formatting, and including documentation carefully spelling out standard methods of using the collection. The opportunity would also be used to correct a variety of typographical and other errors in the categorization and formatting of the collection.

    Steve Finch and David D. Lewis did this cleanup of the collection September through November of 1996, relying heavily on Finch's SGML-tagged version of the collection from an earlier study. One result of the re-examination of the collection was the removal of 595 documents which were exact duplicates (based on identity of timestamps down to the second) of other documents in the collection. The new collection therefore has only 21,578 documents, and thus is called the Reuters-21578 collection. This README describes version 1.0 of this new collection, which we refer to as "Reuters-21578, Distribution 1.0".

    In preparing the collection...

  8. reuters

    • huggingface.co
    Updated Jul 8, 2024
    Cite
    Shashank Verma (2024). reuters [Dataset]. https://huggingface.co/datasets/shashverma05/reuters
    Dataset updated
    Jul 8, 2024
    Authors
    Shashank Verma
    Description

    shashverma05/reuters dataset hosted on Hugging Face and contributed by the HF Datasets community

  9. reuters

    • huggingface.co
    Updated Feb 1, 2001
    Cite
    yangchao (2001). reuters [Dataset]. https://huggingface.co/datasets/pinglsl/reuters
    Dataset updated
    Feb 1, 2001
    Authors
    yangchao
    Description

    pinglsl/reuters dataset hosted on Hugging Face and contributed by the HF Datasets community

  10. reuters21578

    • kaggle.com
    zip
    Updated Apr 28, 2024
    Cite
    Menna Allah Saeed (2024). reuters21578 [Dataset]. https://www.kaggle.com/datasets/mennaallahsaed/reuters21578
    Available download formats: zip (8255999 bytes)
    Dataset updated
    Apr 28, 2024
    Authors
    Menna Allah Saeed
    Description

    The Reuters-21578 dataset is a collection of documents containing news articles. Originally, the corpus comprises 10,369 documents and has a vocabulary of 29,930 unique words.

    An additional challenge arises when the labels of the training instances are provided by noisy, heterogeneous crowdworkers with unknown qualities. Initially, assuming labels from a perfect source can help in modeling the problem effectively.

    Source of data https://paperswithcode.com/dataset/reuters-21578

  11. reuters-21578-train-val-test

    • huggingface.co
    Updated Feb 26, 1987
    Cite
    Kushagri Tandon (1987). reuters-21578-train-val-test [Dataset]. https://huggingface.co/datasets/KushT/reuters-21578-train-val-test
    Dataset updated
    Feb 26, 1987
    Authors
    Kushagri Tandon
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset from Kaggle. The split is done on the training set using iterative_train_test_split from scikit-multilearn. There are the following 90 labels: 'interest', 'groundnut-oil', 'potato', 'palmkernel', 'sun-meal', 'lei', 'cotton-oil', 'sunseed', 'sorghum', 'barley', 'dlr', 'groundnut', 'wpi', 'strategic-metal', 'livestock', 'l-cattle', 'lin-oil', 'gold', 'fuel', 'nzdlr', 'oat', 'soybean', 'hog', 'tin', 'lumber', 'bop', 'soy-oil', 'dfl', 'nkr', 'gas', 'carcass'… See the full description on the dataset page: https://huggingface.co/datasets/KushT/reuters-21578-train-val-test.

  12. Results – Reuters.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Apr 29, 2013
    Cite
    Last, Mark; Howard, Newton; Argamon, Shlomo; Frieder, Ophir; Assaf, Dan; Neuman, Yair; Cohen, Yohai (2013). Results – Reuters. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001735882
    Dataset updated
    Apr 29, 2013
    Authors
    Last, Mark; Howard, Newton; Argamon, Shlomo; Frieder, Ophir; Assaf, Dan; Neuman, Yair; Cohen, Yohai
    Description

    Results – Reuters.

  13. Thomson Reuters IPSOS PCSI

    • tipranks.com
    Cite
    TipRanks, Thomson Reuters IPSOS PCSI [Dataset]. https://www.tipranks.com/calendars/economic/thomson-reuters-ipsos-pcsi-103347
    Dataset authored and provided by
    TipRanks (http://www.tipranks.com/)
    Time period covered
    Jan 9, 2025 - Jan 15, 2026
    Area covered
    jp
    Description

    The Thomson Reuters IPSOS Primary Consumer Sentiment Index (PCSI) in Japan measures consumer confidence by aggregating data on personal financial conditions, economic expectations, investment climate, and employment outlook.

  14. reuters.com Traffic Analytics Data

    • analytics.explodingtopics.com
    Updated Jan 1, 2026
    Cite
    (2026). reuters.com Traffic Analytics Data [Dataset]. https://analytics.explodingtopics.com/website/reuters.com
    Dataset updated
    Jan 1, 2026
    Variables measured
    Global Rank, Monthly Visits, Authority Score, US Country Rank, Mass Media Category Rank
    Description

    Traffic analytics, rankings, and competitive metrics for reuters.com as of January 2026

  15. Reuters

    • huggingface.co
    Cite
    Jukabo, Reuters [Dataset]. https://huggingface.co/datasets/Jukaboo/Reuters
    Authors
    Jukabo
    Description

    Jukaboo/Reuters dataset hosted on Hugging Face and contributed by the HF Datasets community

  16. reuters.com Website Traffic, Ranking, Analytics [February 2026]

    • sem1.heaventechit.com
    Updated Mar 12, 2026
    Cite
    Semrush (2026). reuters.com Website Traffic, Ranking, Analytics [February 2026] [Dataset]. https://sem1.heaventechit.com/website/reuters.com/overview/
    Dataset updated
    Mar 12, 2026
    Dataset authored and provided by
    Semrush (https://fr.semrush.com/)
    License

    https://sem1.heaventechit.com/company/legal/terms-of-service/

    Time period covered
    Mar 12, 2026
    Area covered
    Worldwide
    Variables measured
    visits, backlinks, bounceRate, pagesPerVisit, authorityScore, organicKeywords, avgVisitDuration, referringDomains, trafficByCountry, paidSearchTraffic, and 3 more
    Measurement technique
    Semrush Traffic Analytics; Click-stream data
    Description

    reuters.com is ranked #261 in US with 78.33M Traffic. Categories: Finance, Newspapers. Learn more about website traffic, market share, and more!

  17. Thomson Reuters financial metrics and earnings dataset

    • capyfin.com
    Updated Oct 29, 2025
    Cite
    CapyFin (2025). Thomson Reuters financial metrics and earnings dataset [Dataset]. https://capyfin.com/s/nasdaq/TRI
    Dataset updated
    Oct 29, 2025
    Dataset authored and provided by
    CapyFin
    Variables measured
    Adjusted EBITDA, Adjusted Earnings, Adjusted EBITDA Margin, Diluted Weighted-Average Shares, Net Cash from Operating Activities
    Description

    Quarterly and annual financial metrics, earnings history, and company performance data for Thomson Reuters.

  18. Reuters-128 NIF NER Corpus

    • data.wu.ac.at
    pdf, ttl
    Updated Oct 29, 2014
    Cite
    AKSW (2014). Reuters-128 NIF NER Corpus [Dataset]. https://data.wu.ac.at/odso/datahub_io/Y2UwODlhOTUtYTgxZC00NTk5LTlkOTgtODE4ZWUwMDAzMjM3
    Available download formats: ttl, pdf
    Dataset updated
    Oct 29, 2014
    Dataset provided by
    AKSW
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This English corpus is based on the well known Reuters-21578 corpus which contains economic news articles. In particular, we chose 128 articles containing at least one NE. Compared to the News-100 corpus the documents of Reuters-128 are significantly shorter and thus carry a smaller context.

    To annotate the NEs with URIs, we implemented a supporting judgement tool. The input for the tool was a randomly sampled subset of more than 150 Reuters-21578 news articles. First, FOX (Ngonga Ngomo et al., 2011) was used to recognize an initial set of NEs, which reduced the amount of manual work to a feasible portion given the size of this dataset. Afterwards, domain experts corrected FOX's mistakes manually using the annotation tool. To support this, the tool highlighted the entities in the texts and added initial URI candidates via simple string matching algorithms. Two scientists determined the correct URI for each named entity manually, with an initial voter agreement of 74%. This low initial agreement rate hints at the difficulty of the disambiguation task. In some cases the judges did not agree initially but came to an agreement shortly after reviewing the cases. While annotating, we left out ticker symbols of companies (e.g., GOOG for Google Inc.), abbreviations, and job descriptions, because those are always preceded by the full company name or a person’s name, respectively.

  19. Number of Thomson Reuters employees worldwide by region 2009-2023

    • statista.com
    Updated Nov 24, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista (2025). Number of Thomson Reuters employees worldwide by region 2009-2023 [Dataset]. https://www.statista.com/statistics/292546/thomson-reuters-employees-by-region/
    Dataset updated
    Nov 24, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    Worldwide
    Description

    The workforce of Thomson Reuters declined significantly between 2009 and 2023. In 2023, however, their workforce grew slightly by approximately *** employees.

    Thomson Reuters

    The Thomson Reuters Corporation is a multinational mass media and information company headquartered in Toronto, Canada. Outside of professional circles, the company is perhaps most associated with the provision of unaffiliated news content to media outlets under the Reuters name, including stories and photos for publication in newspapers. When broken down by business line, however, these services constituted a small amount of revenue generated by the company. The majority of revenue was generated by the provision of information services to corporations and governments, covering legal, tax and accounting, and policy-making more broadly. Of these services, the provision of legal information to law firms was their largest source of revenue.

    Reason for decline in employee numbers

    As with their employee numbers, the revenue of Thomson Reuters saw a major decline between 2011 and 2018, but has somewhat recovered since then. This decline was primarily due to the sale of the company’s stake in their financial and risk division. Formerly this division comprised a majority of the company’s revenue, with the sharp drop in revenue for 2017 reflecting the removal of this division’s revenue from Thomson Reuters’ balance sheet. Despite this loss of gross revenue, the company’s net income has remained relatively unaffected.

  20. Thomson Reuters revenue 2020-2024

    • statista.com
    Updated Mar 16, 2026
    Cite
    Statista (2026). Thomson Reuters revenue 2020-2024 [Dataset]. https://www.statista.com/statistics/225359/thomson-reuters-revenue/
    Dataset updated
    Mar 16, 2026
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    Canada
    Description

    The revenue of Thomson Reuters, headquartered in Canada, amounted to ************* U.S. dollars in 2024. The reported fiscal year ends on December 31. Compared to 2020, this marks an increase of approximately ************* U.S. dollars. The trend from 2020 to 2024 also shows that this increase was continuous.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
The Devastator (2022). Reuters-21578 (Text Categorization) [Dataset]. https://www.kaggle.com/datasets/thedevastator/uncovering-financial-insights-with-the-reuters-2
Organization logo

Reuters-21578 (Text Categorization)

Ruters financial newswire service in 1987

Explore at:
zip(18703298 bytes)Available download formats
Dataset updated
Dec 2, 2022
Authors
The Devastator
License

https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

Description

Reuters-21578 (Text Categorization)

Reuters financial newswire service in 1987

By Huggingface Hub [source]

About this dataset

The Reuters-21578 dataset, one of the most influential and widely used collections of newswire articles from the Reuters financial newswire service, is a standard benchmark for text categorization research. It offers insight into the topics that financial newswires covered and is available in multiple train/test splits for machine learning experiments.

Within this dataset, users will find columns such as text (the full body of the article), text_type (the type of the article body as recorded in the original collection), topics (the topic categories assigned to the document), lewis_split (the document's assignment in the split used in Lewis's experiments), cgis_split (its training/test assignment in the earlier experiments by Hayes and colleagues at Carnegie Group, Inc.), places/people/orgs/exchanges mentioned within it, date and title. In addition, there are separate files containing Reuters-21578 articles that were not used in specific splits (ModApte_unused.csv & ModLewis_unused.csv). Together, these fields let you build and compare financial news categorization models across many topic categories.
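As a rough illustration of working with these columns, the sketch below reads rows with Python's standard csv module and keeps only the documents that carry at least one topic. The sample rows are made up, the names SAMPLE_CSV and load_rows are hypothetical, and both the list-as-string encoding of topics and the TRAIN/TEST/NOT-USED values of lewis_split are assumptions about the file layout, so check an actual file before relying on them.

```python
import ast
import csv
import io

# Made-up rows standing in for one of the split CSV files. Storing `topics`
# as a Python-style list string is an assumption about the file layout.
SAMPLE_CSV = """text,topics,lewis_split,title
"Profit rose in the first quarter.","['earn']",TRAIN,CO REPORTS Q1
"Wheat exports fell last month.","['grain', 'wheat']",TEST,WHEAT EXPORTS
"No topic was assigned here.",[],NOT-USED,UNTITLED
"""


def load_rows(handle):
    """Read rows from an open CSV file and parse `topics` into a list."""
    rows = []
    for row in csv.DictReader(handle):
        row["topics"] = ast.literal_eval(row["topics"])
        rows.append(row)
    return rows


rows = load_rows(io.StringIO(SAMPLE_CSV))
# Keep only documents that carry at least one topic label.
labeled = [row for row in rows if row["topics"]]
# Select the training portion according to the (assumed) split values.
train = [row for row in labeled if row["lewis_split"] == "TRAIN"]
print(len(rows), len(labeled), len(train))
```

To read a real file instead of the in-memory sample, pass an open file handle (for example, one opened with newline="" as the csv module recommends) to load_rows.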

How to use the dataset

The Reuters-21578 dataset covers a wide range of topics and comes with several data splits, which makes it well suited as a benchmark for text categorization research. Here are some tips on getting the most out of it:

  1. Familiarize yourself with the columns: Before getting started, make sure to familiarize yourself with all of the columns included in the dataset. This includes understanding what each column means, as well as identifying which are essential for your research project.

  2. Use an appropriate split: Depending on your research goals, you may need training and test sets that differ from those provided in this dataset (ModHayes_train/test or ModLewis_train/test). If desired, you can also create custom splits from the ‘ModApte_unused’ set contained in this collection.

  3. Explore other methods: While text categorization is often used with this type of data, you may also want to explore other methods that can help uncover useful information such as topic modelling or sentiment analysis.

  4. Leverage related packages: If you’re using Python or R, general-purpose text-processing packages work well with Reuters-21578. In Python, scikit-learn provides vectorizers (CountVectorizer, TfidfVectorizer) that transform words into feature vectors for ML models such as Naive Bayes or Random Forest classifiers, and NLTK includes a Reuters corpus reader. In R, the tm package can read the original Reuters-21578 XML files.

  5. Tackle low-level preprocessing tasks: Before building models with ML algorithms, clean the input text first, in particular by removing invalid characters and any symbols from languages other than English, which could hurt model accuracy. Minor steps such as stopword removal and stemming words to their root form may also improve overall performance.
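The preprocessing and vectorizing steps above can be sketched with the standard library alone. This is a minimal illustration rather than a recommended pipeline: the stopword list is deliberately tiny, crude_stem is a hypothetical stand-in for a real stemmer such as Porter's, and the helper names preprocess and vectorize are invented for this example.

```python
import re
from collections import Counter

# A deliberately tiny stopword list for illustration; real projects usually
# take a fuller list from a library.
STOPWORDS = {"a", "an", "and", "in", "of", "the", "to"}


def crude_stem(token):
    """Strip a few common English suffixes. This is not the Porter stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, keep runs of ASCII letters, drop stopwords, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(tok) for tok in tokens if tok not in STOPWORDS]


def vectorize(docs):
    """Return a sorted vocabulary and one term-count vector per document."""
    tokenized = [preprocess(doc) for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


docs = ["The company reported rising profits.", "Profits of the bank rose."]
vocab, vectors = vectorize(docs)
print(vocab)
print(vectors)
```

Keeping only runs of ASCII letters discards digits, punctuation, and non-English characters in one pass, which is convenient here but would lose figures such as prices and dates that some tasks need.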

Research Ideas

  • Automated text classification - Using the data from the Reuters-21578 dataset, machine learning algorithms can be trained to automatically classify newswire articles into their appropriate topics. This saves time and reduces the need for manual labeling.
  • Sentiment analysis - By analyzing the sentiment of individual news articles in the Reuters-21578 dataset, one could gain insight into the tone of financial news coverage and use this information to support investing decisions.
  • Stock market predictions - By applying data mining techniques to the content of news articles in this dataset, correlations between certain topics or exchanges mentioned in an article and their effects on stock prices can be identified and used for algorithmic trading strategies aimed at predicting short-term stock price movements.
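To make the automated text classification idea concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, written from scratch so that it runs without third-party packages. The class name TinyNaiveBayes and the four training documents are invented for illustration, and the sketch assumes one topic per document, whereas Reuters-21578 articles can carry several topics or none.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing, one label per document."""

    def fit(self, token_lists, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(token_lists, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {tok for tokens in token_lists for tok in tokens}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Start from the log prior, then add smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                if tok in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up, already tokenized training documents with a single topic each.
train_docs = [
    ("profit rose quarter dividend".split(), "earn"),
    ("net profit loss shares".split(), "earn"),
    ("wheat harvest export tonnes".split(), "grain"),
    ("corn wheat crop shipment".split(), "grain"),
]
model = TinyNaiveBayes().fit(
    [tokens for tokens, _ in train_docs], [label for _, label in train_docs]
)
print(model.predict("wheat export rose".split()))
```

For the multi-topic case, one common approach is to train a separate yes/no classifier per topic; a library implementation would normally replace this hand-written class in real experiments.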

Acknowledgements

If you use this dataset in your research, please credit the orig...
