100+ datasets found
  1. Text Document Classification Dataset

    • kaggle.com
    zip
    Updated Dec 4, 2023
    Cite
    sunil thite (2023). Text Document Classification Dataset [Dataset]. https://www.kaggle.com/datasets/sunilthite/text-document-classification-dataset
    Available download formats: zip (1941393 bytes)
    Dataset updated
    Dec 4, 2023
    Authors
    sunil thite
    Description

    This text document classification dataset contains 2225 documents in five categories: politics, sport, tech, entertainment and business. It can be used for document classification and document clustering.

    About Dataset

    • The dataset contains two features: text and label.
    • No. of Rows: 2225
    • No. of Columns: 2

    Text: the document text, drawn from the five categories.
    Label: the category label, one of 0, 1, 2, 3, 4.

    1. Politics = 0
    2. Sport = 1
    3. Technology = 2
    4. Entertainment = 3
    5. Business = 4
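
A minimal sketch of how the integer labels listed above could be decoded when working with this dataset. The names `LABEL_NAMES`, `decode_label` and `encode_label` are illustrative assumptions and are not shipped with the dataset.

```python
# Label ids as listed in the dataset description.
LABEL_NAMES = {
    0: "Politics",
    1: "Sport",
    2: "Technology",
    3: "Entertainment",
    4: "Business",
}

# Reverse lookup, built once from the mapping above.
LABEL_IDS = {name: label_id for label_id, name in LABEL_NAMES.items()}


def decode_label(label_id: int) -> str:
    """Return the category name for an integer label (hypothetical helper)."""
    return LABEL_NAMES[label_id]


def encode_label(name: str) -> int:
    """Return the integer label for a category name (hypothetical helper)."""
    return LABEL_IDS[name]


if __name__ == "__main__":
    for label_id in sorted(LABEL_NAMES):
        print(label_id, decode_label(label_id))
```

Keeping the mapping in one place avoids scattering the numeric ids through the rest of a classification or clustering script.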
  2. Ecommerce Text Classification

    • kaggle.com
    zip
    Updated Oct 9, 2023
    + more versions
    Cite
    Saurabh Shahane (2023). Ecommerce Text Classification [Dataset]. https://www.kaggle.com/datasets/saurabhshahane/ecommerce-text-classification
    Available download formats: zip (8236809 bytes)
    Dataset updated
    Oct 9, 2023
    Authors
    Saurabh Shahane
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a classification-oriented e-commerce text dataset with 4 categories: "Electronics", "Household", "Books" and "Clothing & Accessories". Together these cover almost 80% of any e-commerce website.

    The dataset is in ".csv" format with two columns: the first column is the class name and the second is the data point of that class. Each data point is the product name and description from the e-commerce website.
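
As a rough sketch of the two-column layout described above, the rows could be read with Python's standard `csv` module. The sample rows below are made up for illustration, and the assumption that the file has no header row is not confirmed by the description; `read_rows` is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Made-up sample rows in the described layout: class name, then product text.
SAMPLE = '''Household,"Wooden wall shelf, set of 3"
Books,"An introductory statistics textbook"
Electronics,"USB-C charging cable, 1 m"
Books,"A paperback mystery novel"
'''


def read_rows(handle):
    """Yield (label, text) pairs from a two-column CSV (assumed headerless)."""
    for row in csv.reader(handle):
        if len(row) != 2:
            continue  # skip lines that do not match the two-column layout
        label, text = row
        yield label, text


rows = list(read_rows(io.StringIO(SAMPLE)))

# Count how many data points fall into each class.
counts = Counter(label for label, _ in rows)

if __name__ == "__main__":
    print(counts)
```

With the real file, `io.StringIO(SAMPLE)` would be replaced by an opened file handle, for example `open(path, newline="", encoding="utf-8")`.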

    The dataset has the following features :

    Data Set Characteristics: Multivariate

    Number of Instances: 50425

    Number of classes: 4

    Area: Computer science

    Attribute Characteristics: Real

    Number of Attributes: 1

    Associated Tasks: Classification

    Missing Values? No

    Gautam. (2019). E commerce text dataset (version - 2) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.3355823

  3. PubMed MultiLabel Text Classification Dataset MeSH

    • kaggle.com
    zip
    Updated Jul 2, 2022
    + more versions
    Cite
    Owais Ahmad (2022). PubMed MultiLabel Text Classification Dataset MeSH [Dataset]. https://www.kaggle.com/datasets/owaiskhan9654/pubmed-multilabel-text-classification
    Available download formats: zip (47291350 bytes)
    Dataset updated
    Jul 2, 2022
    Authors
    Owais Ahmad
    License

    CC0 1.0 Universal https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset consists of approximately 50k research articles from the PubMed repository. The documents were originally annotated manually by biomedical experts with MeSH labels, and each article is described by 10-15 MeSH labels. The dataset has a huge number of labels present as MeSH major, which raises the issues of an extremely large output space and severe label sparsity. To address this, the dataset has been processed and each label mapped to its root, as shown in the following images:

    • Mapping: https://gitlab.com/Owaiskhan9654/Gene-Sequence-Primer/-/raw/main/Capture111.PNG
    • Tree structure: https://gitlab.com/Owaiskhan9654/Gene-Sequence-Primer/-/raw/main/Capture22.PNG

  4. Legal Citation Text Classification

    • kaggle.com
    zip
    Updated Nov 11, 2021
    Cite
    Shivam Bansal (2021). Legal Citation Text Classification [Dataset]. https://www.kaggle.com/datasets/shivamb/legal-citation-text-classification
    Available download formats: zip (15646328 bytes)
    Dataset updated
    Nov 11, 2021
    Authors
    Shivam Bansal
    License

    CC0 1.0 Universal https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    This dataset contains Australian legal cases from the Federal Court of Australia (FCA). The cases were downloaded from AustLII. All cases from the years 2006, 2007, 2008 and 2009 are included. For each document, catchphrases, citation sentences, citation catchphrases, and citation classes are captured. Citation classes are marked in the document and indicate the type of treatment given to the cases cited by the present case.

    Exploration Ideas

    • Create a model to perform text classification on legal data
    • EDA to identify top keywords related to every type of case category

    Acknowledgements

    Credits: Filippo Galgani (galganif '@' cse.unsw.edu.au), School of Computer Science and Engineering, The University of New South Wales, Australia

  5. Facebook Text classification

    • kaggle.com
    zip
    Updated Aug 23, 2025
    Cite
    kalashnikov1405 (2025). Facebook Text classification [Dataset]. https://www.kaggle.com/datasets/kalashnikov1405/facebook-text-classification
    Available download formats: zip (110384 bytes)
    Dataset updated
    Aug 23, 2025
    Authors
    kalashnikov1405
    License

    CC0 1.0 Universal https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The Facebook Text Classification Dataset consists of 5,000 social media posts designed for text analytics and machine learning applications. Each entry represents a Facebook post enriched with attributes such as post content, timestamp, language, engagement metrics, and labels for category, sentiment, and spam detection. The dataset covers ten diverse categories, including personal updates, news, events, promotions, memes, sports, politics, and health-related content, making it suitable for multi-class classification tasks. Sentiment labels (positive, neutral, negative) enable sentiment analysis, while the is_spam field supports spam detection models. Engagement features such as likes, comments, and shares allow exploration of user interaction patterns and predictive modeling of content popularity. With multilingual posts in English, Hindi, Spanish, French, and German, the dataset is ideal for NLP research, including topic classification, polarity detection, engagement forecasting, and multilingual processing, making it a versatile resource for social media analytics.

  6. Medical Text Classification Dataset

    • kaggle.com
    zip
    Updated May 13, 2025
    Cite
    IceTea (2025). Medical Text Classification Dataset [Dataset]. https://www.kaggle.com/datasets/adityak12/medical-text-classification-dataset
    Available download formats: zip (23076 bytes)
    Dataset updated
    May 13, 2025
    Authors
    IceTea
    License

    CC0 1.0 Universal https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by IceTea

    Released under CC0: Public Domain


  7. Text classification-Heathcare

    • kaggle.com
    zip
    Updated Dec 31, 2017
    Cite
    Shwet Prakash (2017). Text classification-Heathcare [Dataset]. https://www.kaggle.com/datasets/shwetp/text-classificationheathcare
    Available download formats: zip (14291782 bytes)
    Dataset updated
    Dec 31, 2017
    Authors
    Shwet Prakash
    License

    CC0 1.0 Universal https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Shwet Prakash

    Released under CC0: Public Domain


  8. Text Classification

    • kaggle.com
    zip
    Updated Jan 4, 2024
    Cite
    COMSYS (2024). Text Classification [Dataset]. https://www.kaggle.com/datasets/comsys/text-classification
    Available download formats: zip (22541 bytes)
    Dataset updated
    Jan 4, 2024
    Authors
    COMSYS
    Description

    If you use this dataset, please cite it as follows: Paul, A., Mittal, O., Ghosh, S., Dasgupta, S., Bhattacharjee, D., Sarkar, R. (2024). COMSYS Hackathon-1 2023: Igniting Machine Learning Marvels. In: Kole, D.K., Roy Chowdhury, S., Basu, S., Plewczynski, D., Bhattacharjee, D. (eds) Proceedings of 4th International Conference on Frontiers in Computing and Systems. COMSYS 2023. Lecture Notes in Networks and Systems, vol 974. Springer, Singapore. https://doi.org/10.1007/978-981-97-2611-0_29

  9. BBC Full Text Document Classification

    • kaggle.com
    zip
    Updated Apr 4, 2024
    Cite
    Al Fath Terry (2024). BBC Full Text Document Classification [Dataset]. https://www.kaggle.com/datasets/alfathterry/bbc-full-text-document-classification
    Available download formats: zip (1929885 bytes)
    Dataset updated
    Apr 4, 2024
    Authors
    Al Fath Terry
    License

    MIT License https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This is the cleaned CSV version of the original dataset (link_to_the_original_Data). You can use this data to practice your NLP skills.

  10. Text Classification - Supervised Learning

    • kaggle.com
    zip
    Updated Jul 20, 2018
    Cite
    Subba Reddy Jinugu (2018). Text Classification - Supervised Learning [Dataset]. https://www.kaggle.com/datasets/jsreddy79/text-classification-supervised-learning
    Available download formats: zip (14291782 bytes)
    Dataset updated
    Jul 20, 2018
    Authors
    Subba Reddy Jinugu
    Description

    Dataset

    This dataset was created by Subba Reddy Jinugu


  11. News Topic Classification

    • kaggle.com
    Updated Jan 23, 2024
    Cite
    Vrinda Kallu (2024). News Topic Classification [Dataset]. https://www.kaggle.com/datasets/vrindakallu/ag-news-topic-classification
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 23, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Vrinda Kallu
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Context

    AG is a collection of more than 1 million news articles, gathered from more than 2000 news sources by ComeToMyHead in more than 1 year of activity. ComeToMyHead is an academic news search engine which has been running since July 2004. More information can be found at http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html.

    Content

    These datasets consist of news article headlines. Each headline is labelled 0, 1, 2 or 3; these values correspond to four news topics: 'World', 'Sports', 'Business' and 'Sci/Tech'.

    Acknowledgements

    I obtained the AG's news topic classification training dataset from the Hugging Face datasets library. The AG's news topic classification training dataset was constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the AG's corpus of news articles. It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).

  12. Text Classification Datasets

    • kaggle.com
    zip
    Updated Aug 15, 2023
    + more versions
    Cite
    Fabio Fontana (2023). Text Classification Datasets [Dataset]. https://www.kaggle.com/datasets/fabiofontana97/text-classification-datasets
    Available download formats: zip (481783 bytes)
    Dataset updated
    Aug 15, 2023
    Authors
    Fabio Fontana
    Description

    Dataset

    This dataset was created by Fabio Fontana


  13. Email Spam Text Classification Dataset

    • kaggle.com
    zip
    Updated Aug 1, 2023
    Cite
    KUCEV ROMAN (2023). Email Spam Text Classification Dataset [Dataset]. https://www.kaggle.com/datasets/tapakah68/email-spam-classification
    Available download formats: zip (30878 bytes)
    Dataset updated
    Aug 1, 2023
    Authors
    KUCEV ROMAN
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Email Spam Classification, Text Classification Dataset

    The dataset consists of a collection of emails categorized into two major classes: spam and not spam. It is designed to facilitate the development and evaluation of spam detection or email filtering systems.

    💴 For Commercial Usage: To discuss your requirements, learn about the price and buy the dataset, leave a request on roman@kucev.com to buy the dataset

    The spam emails in the dataset are typically unsolicited and unwanted messages that aim to promote products or services, spread malware, or deceive recipients for various malicious purposes. These emails often contain misleading subject lines, excessive use of advertisements, unauthorized links, or attempts to collect personal information.

    The non-spam emails in the dataset are genuine and legitimate messages sent by individuals or organizations. They may include personal or professional communication, newsletters, transaction receipts, or any other non-malicious content.

    The dataset encompasses emails of varying lengths, languages, and writing styles, reflecting the inherent heterogeneity of email communication. This diversity aids in training algorithms that can generalize well to different types of emails, making them robust against different spammer tactics and variations in non-spam email content.

    The dataset's possible applications:

    • spam detection
    • fraud detection
    • email filtering systems
    • customer support automation
    • natural language processing

    💴 Buy the Dataset: This is just an example of the data. Leave a request on roman@kucev.com to discuss your requirements, learn about the price and buy the dataset.

    The .csv file includes the following information:

    • title: title of the email,
    • text: text of the email,
    • type: type of the email

    Email spam might be collected in accordance with your requirements.

    keywords: spam mails dataset, email spam classification, spam or not-spam, spam e-mail database, spam detection system, email spamming data set, spam filtering system, spambase, feature extraction, spam ham email dataset, classifier, machine learning algorithms, cybersecurity, text dataset, sentiment analysis, llm dataset, language modeling, large language models, text classification, text mining dataset, natural language texts, nlp, nlp open-source dataset, text data

  14. Text Classification

    • kaggle.com
    zip
    Updated Jul 14, 2023
    + more versions
    Cite
    𝔄ℌ𝔐𝔈𝔇 𝔄𝔖ℌℜ𝔄𝔉 (2023). Text Classification [Dataset]. https://www.kaggle.com/datasets/ahmedashrafahmed/text-classification
    Available download formats: zip (1921881 bytes)
    Dataset updated
    Jul 14, 2023
    Authors
    𝔄ℌ𝔐𝔈𝔇 𝔄𝔖ℌℜ𝔄𝔉
    Description

    Dataset

    This dataset was created by 𝔄ℌ𝔐𝔈𝔇 𝔄𝔖ℌℜ𝔄𝔉


  15. Text Classification labeled and unlabeled datasets

    • kaggle.com
    zip
    Updated Jan 7, 2024
    Cite
    Anna Jazayeri (2024). Text Classification labeled and unlabeled datasets [Dataset]. https://www.kaggle.com/datasets/annajazayeri/text-classification-labeled-and-unlabeled-datasets
    Available download formats: zip (27499 bytes)
    Dataset updated
    Jan 7, 2024
    Authors
    Anna Jazayeri
    License

    MIT License https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Anna Jazayeri

    Released under MIT


  16. Bangla News text classification dataset

    • kaggle.com
    zip
    Updated Sep 4, 2021
    Cite
    Marjia Ahmed (2021). Bangla News text classification dataset [Dataset]. https://www.kaggle.com/marjiaahmed/bangla-news-text-classification-dataset
    Available download formats: zip (111445 bytes)
    Dataset updated
    Sep 4, 2021
    Authors
    Marjia Ahmed
    Description

    Dataset

    This dataset was created by Marjia Ahmed


  17. Fake News Classification

    • kaggle.com
    zip
    Updated Oct 8, 2023
    Cite
    Saurabh Shahane (2023). Fake News Classification [Dataset]. https://www.kaggle.com/datasets/saurabhshahane/fake-news-classification
    Available download formats: zip (96615040 bytes)
    Dataset updated
    Oct 8, 2023
    Authors
    Saurabh Shahane
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    WELFake is a dataset of 72,134 news articles, of which 35,028 are real and 37,106 are fake. To build it, the authors merged four popular news datasets (Kaggle, McIntire, Reuters, BuzzFeed Political) to prevent over-fitting of classifiers and to provide more text data for better ML training.

    The dataset contains four columns: Serial number (starting from 0); Title (the news heading); Text (the news content); and Label (0 = fake and 1 = real).

    The CSV file has 78,098 data entries, of which only 72,134 are accessed as per the data frame.
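
The counts quoted above can be cross-checked with a few lines of arithmetic. This is only a sanity check on the published figures, not a reading of the actual file; the variable names are illustrative.

```python
# Figures as stated in the dataset description.
REAL_ARTICLES = 35_028
FAKE_ARTICLES = 37_106
ACCESSED_ENTRIES = 72_134
CSV_ENTRIES = 78_098

# The two classes should add up to the number of accessed entries.
classes_match_total = REAL_ARTICLES + FAKE_ARTICLES == ACCESSED_ENTRIES

# Entries present in the CSV file but not accessed in the data frame.
unused_entries = CSV_ENTRIES - ACCESSED_ENTRIES

# Share of fake articles among the accessed entries (class balance).
fake_share = FAKE_ARTICLES / ACCESSED_ENTRIES

if __name__ == "__main__":
    print("classes add up:", classes_match_total)
    print("entries not accessed:", unused_entries)
    print(f"fake share: {fake_share:.3f}")
```

The two classes are close to balanced, so plain accuracy is a reasonable first metric for a classifier trained on this data.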

    Published in: IEEE Transactions on Computational Social Systems: pp. 1-13 (doi: 10.1109/TCSS.2021.3068519).

  18. StackOverflow_text_classification

    • kaggle.com
    zip
    Updated Jul 7, 2025
    Cite
    YAZAN ALSHUAIBI (2025). StackOverflow_text_classification [Dataset]. https://www.kaggle.com/datasets/yazanalshuaibi/stackoverflow-text-classification
    Available download formats: zip (5304468 bytes)
    Dataset updated
    Jul 7, 2025
    Authors
    YAZAN ALSHUAIBI
    Description

    Dataset

    This dataset was created by YAZAN ALSHUAIBI


  19. Spanish News Classification

    • kaggle.com
    zip
    Updated Nov 26, 2022
    Cite
    Kevin Morgado (2022). Spanish News Classification [Dataset]. https://www.kaggle.com/datasets/kevinmorgado/spanish-news-classification
    Available download formats: zip (1443296 bytes)
    Dataset updated
    Nov 26, 2022
    Authors
    Kevin Morgado
    License

    CC0 1.0 Universal https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset was built with a web scraping tool for Bancolombia's Dataton 2022, to train supervised models for a news recommendation system covering the following categories:

    1. Macroeconomics
    2. Sustainability
    3. Innovation
    4. Regulations
    5. Alliances
    6. Reputation
    7. Other

    Columns

    This CSV document consists of the following columns:

    1. Url: source of the information, but it could be unavailable in some months
    2. News: text of the news - used for classification
    3. Type: Label of type of news.
  20. AI vs Human Text Classification Dataset

    • kaggle.com
    zip
    Updated Sep 4, 2025
    Cite
    gulfan (2025). AI vs Human Text Classification Dataset [Dataset]. https://www.kaggle.com/gulfan/ai-vs-human-text-classification-dataset
    Available download formats: zip (41560 bytes)
    Dataset updated
    Sep 4, 2025
    Authors
    gulfan
    License

    CC0 1.0 Universal https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains text samples labeled as either human-written or AI-generated. It is designed for binary text classification tasks in Natural Language Processing (NLP). The dataset includes 1299 text samples with accompanying basic features such as word count, character count, average word length, and punctuation density.

    The AI-generated texts were collected from multiple LLMs (e.g., ChatGPT, Gemini, Claude). Exact model attribution for each sample is not preserved.
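
A minimal sketch of how basic features like those named above (word count, character count, average word length, punctuation density) might be computed. The exact definitions used by the dataset author are not given, so the ones below are assumptions: words are split on whitespace, word length includes attached punctuation, and density is punctuation characters divided by total characters. `basic_features` is a hypothetical helper.

```python
import string


def basic_features(text: str) -> dict:
    """Compute simple surface features for one text sample (assumed definitions)."""
    words = text.split()  # whitespace tokenisation
    word_count = len(words)
    char_count = len(text)  # counts every character, including spaces
    avg_word_length = (
        sum(len(word) for word in words) / word_count if word_count else 0.0
    )
    punctuation_chars = sum(1 for ch in text if ch in string.punctuation)
    punctuation_density = punctuation_chars / char_count if char_count else 0.0
    return {
        "word_count": word_count,
        "char_count": char_count,
        "avg_word_length": avg_word_length,
        "punctuation_density": punctuation_density,
    }


if __name__ == "__main__":
    print(basic_features("Hello, world! This is a test."))
```

Features of this kind can be fed to a simple baseline classifier before trying models that use the text itself.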

Text Document Classification Dataset

Text Document Classification Dataset for Classification and Clustering

Explore at:
2 scholarly articles cite this dataset (View in Google Scholar)