This is a text document classification dataset containing 2225 documents in five categories: politics, sport, tech, entertainment and business. It can be used for document classification and document clustering.
About Dataset
- The dataset contains two features: text and label.
- No. of Rows: 2225
- No. of Columns: 2
Text: contains the text data from the different categories.
Label: contains the labels for the five categories: 0, 1, 2, 3, 4.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a classification-oriented e-commerce text dataset with 4 categories - "Electronics", "Household", "Books" and "Clothing & Accessories" - which together cover almost 80% of any e-commerce website.
The dataset is in ".csv" format with two columns - the first column is the class name and the second one is the datapoint of that class. The data point is the product and description from the e-commerce website.
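As a minimal sketch of reading such a two-column file, the snippet below parses (class name, description) pairs with Python's standard csv module. It assumes the file has no header row, which the description does not state, and the sample rows are hypothetical stand-ins rather than real records from the dataset.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows standing in for the real file (not actual data).
SAMPLE = '''Electronics,"USB cable, 1 m, braided"
Books,"An introductory text on statistics"
Household,"Set of 4 ceramic mugs"
Books,"A travel guide"
'''

def read_rows(handle):
    """Yield (label, text) pairs from a two-column CSV.

    Assumption: there is no header row; column 1 is the class name and
    column 2 is the product description.
    """
    for row in csv.reader(handle):
        if len(row) != 2:
            continue  # skip malformed lines rather than guess at their layout
        label, text = row
        yield label.strip(), text.strip()

rows = list(read_rows(io.StringIO(SAMPLE)))
class_counts = Counter(label for label, _ in rows)
print(class_counts)
```

For the real file, the same function could be given an open file handle (opened with newline="") instead of the in-memory sample.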
The dataset has the following features :
Data Set Characteristics: Multivariate
Number of Instances: 50425
Number of classes: 4
Area: Computer science
Attribute Characteristics: Real
Number of Attributes: 1
Associated Tasks: Classification
Missing Values? No
Gautam. (2019). E commerce text dataset (version - 2) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.3355823
https://creativecommons.org/publicdomain/zero/1.0/
This dataset consists of a collection of approximately 50k research articles from the PubMed repository. The documents were originally annotated manually by biomedical experts with MeSH labels, and each article is described by 10-15 MeSH labels. The dataset has a huge number of labels present as MeSH major, which raises the issues of an extremely large output space and severe label sparsity. To address this, the dataset has been processed and each label mapped to its root, as described below.
[Image: https://gitlab.com/Owaiskhan9654/Gene-Sequence-Primer/-/raw/main/Capture111.PNG]
[Image: Tree Structure, https://gitlab.com/Owaiskhan9654/Gene-Sequence-Primer/-/raw/main/Capture22.PNG]
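The exact mapping procedure is shown only in the images, so the following is a rough sketch under a stated assumption: that each label is available as a MeSH tree number (such as "C04.557") and that its "root" is the leading category letter. The tree numbers used here are illustrative and are not taken from the dataset.

```python
def mesh_roots(tree_numbers):
    """Collapse MeSH tree numbers to the sorted set of their leading letters.

    Assumption: the root of a label is the top-level category letter of
    its tree number; the dataset's own processing may differ.
    """
    return sorted({t.strip()[0].upper() for t in tree_numbers if t.strip()})

# Illustrative tree numbers for one hypothetical article.
article_tree_numbers = ["C04.557.470", "C04.588", "D12.776", "A11.251"]
roots = mesh_roots(article_tree_numbers)
print(roots)
```

Collapsing to roots in this way turns thousands of fine-grained labels into a small, dense label set, which is the motivation the description gives for the mapping.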
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains Australian legal cases from the Federal Court of Australia (FCA). The cases were downloaded from AustLII. All cases from the years 2006, 2007, 2008 and 2009 are included. For each document, catchphrases, citation sentences, citation catchphrases, and citation classes are captured. Citation classes are indicated in the document and describe the type of treatment given to the cases cited by the present case.
Credits: Filippo Galgani galganif '@' cse.unsw.edu.au School of Computer Science and Engineering The University of New South Wales, Australia
https://creativecommons.org/publicdomain/zero/1.0/
The Facebook Text Classification Dataset consists of 5,000 social media posts designed for text analytics and machine learning applications. Each entry represents a Facebook post enriched with attributes such as post content, timestamp, language, engagement metrics, and labels for category, sentiment, and spam detection. The dataset covers ten diverse categories, including personal updates, news, events, promotions, memes, sports, politics, and health-related content, making it suitable for multi-class classification tasks. Sentiment labels (positive, neutral, negative) enable sentiment analysis, while the is_spam field supports spam detection models. Engagement features such as likes, comments, and shares allow exploration of user interaction patterns and predictive modeling of content popularity. With multilingual posts in English, Hindi, Spanish, French, and German, the dataset is ideal for NLP research, including topic classification, polarity detection, engagement forecasting, and multilingual processing, making it a versatile resource for social media analytics.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by IceTea
Released under CC0: Public Domain
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Shwet Prakash
Released under CC0: Public Domain
If you use this dataset, please cite it as follows: Paul, A., Mittal, O., Ghosh, S., Dasgupta, S., Bhattacharjee, D., Sarkar, R. (2024). COMSYS Hackathon-1 2023: Igniting Machine Learning Marvels. In: Kole, D.K., Roy Chowdhury, S., Basu, S., Plewczynski, D., Bhattacharjee, D. (eds) Proceedings of 4th International Conference on Frontiers in Computing and Systems. COMSYS 2023. Lecture Notes in Networks and Systems, vol 974. Springer, Singapore. https://doi.org/10.1007/978-981-97-2611-0_29
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This is a cleaned CSV version of this dataset: link_to_the_original_Data. You can use this data to practice your NLP skills.
This dataset was created by Subba Reddy Jinugu
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
AG is a collection of more than 1 million news articles. The articles were gathered from more than 2000 news sources by ComeToMyHead over more than 1 year of activity. ComeToMyHead is an academic news search engine that has been running since July 2004. More information can be found at http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html .
These datasets consist of news article headlines. Each headline is labelled 0, 1, 2 or 3; these values correspond to 4 types of news topics: 'World', 'Sports', 'Business' and 'Sci/Tech'.
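The label scheme above can be sketched as a small lookup table. The mapping follows the order given in the description (0 to 3 matched to the four topics as listed); the function name and the example headlines are hypothetical.

```python
# Label ids mapped to topic names, in the order stated in the description.
AG_NEWS_LABELS = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}

def label_name(label_id: int) -> str:
    """Return the topic name for a label id, rejecting unknown ids."""
    try:
        return AG_NEWS_LABELS[label_id]
    except KeyError:
        raise ValueError(f"unexpected label id: {label_id!r}") from None

# Hypothetical (label, headline) pairs, not taken from the dataset.
examples = [(2, "Markets close higher"), (1, "Home side wins the final")]
decoded = [(label_name(label), headline) for label, headline in examples]
print(decoded)
```

Raising on unknown ids, rather than returning a default, makes label drift visible early when files from different sources are combined.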
I downloaded the AG's news topic classification training dataset, which is available from the Hugging Face datasets library. The AG's news topic classification training dataset was constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the AG's corpus of news articles. It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).
This dataset was created by Fabio Fontana
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The dataset consists of a collection of emails categorized into two major classes: spam and not spam. It is designed to facilitate the development and evaluation of spam detection or email filtering systems.
The spam emails in the dataset are typically unsolicited and unwanted messages that aim to promote products or services, spread malware, or deceive recipients for various malicious purposes. These emails often contain misleading subject lines, excessive use of advertisements, unauthorized links, or attempts to collect personal information.
The non-spam emails in the dataset are genuine and legitimate messages sent by individuals or organizations. They may include personal or professional communication, newsletters, transaction receipts, or any other non-malicious content.
The dataset encompasses emails of varying lengths, languages, and writing styles, reflecting the inherent heterogeneity of email communication. This diversity aids in training algorithms that can generalize well to different types of emails, making them robust against different spammer tactics and variations in non-spam email content.
[Image: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F618942%2F4d1fdedb2827152696dd0c0af05fd8da%2Ff.png?generation=1690286497115141&alt=media]
includes the following information:
keywords: spam mails dataset, email spam classification, spam or not-spam, spam e-mail database, spam detection system, email spamming data set, spam filtering system, spambase, feature extraction, spam ham email dataset, classifier, machine learning algorithms, cybersecurity, text dataset, sentiment analysis, llm dataset, language modeling, large language models, text classification, text mining dataset, natural language texts, nlp, nlp open-source dataset, text data
This dataset was created by 𝔄ℌ𝔐𝔈𝔇 𝔄𝔖ℌℜ𝔄𝔉
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by Anna Jazayeri
Released under MIT
This dataset was created by Marjia Ahmed
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
WELFake is a dataset of 72,134 news articles, of which 35,028 are real and 37,106 are fake. To build it, the authors merged four popular news datasets (Kaggle, McIntire, Reuters, and BuzzFeed Political) to prevent over-fitting of classifiers and to provide more text data for better ML training.
Dataset contains four columns: Serial number (starting from 0); Title (about the text news heading); Text (about the news content); and Label (0 = fake and 1 = real).
The CSV file contains 78,098 data entries, of which only 72,134 are accessed as per the data frame.
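The figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers and the label convention stated in the description (0 = fake, 1 = real); the helper name and the sample label list are hypothetical.

```python
from collections import Counter

# Figures as stated in the dataset description.
REAL, FAKE = 35_028, 37_106
TOTAL_ACCESSED = 72_134
RAW_ENTRIES = 78_098

# The two class counts should add up to the stated total.
assert REAL + FAKE == TOTAL_ACCESSED

# Entries present in the CSV but not accessed in the data frame.
not_accessed = RAW_ENTRIES - TOTAL_ACCESSED

# Share of fake articles among the accessed entries.
fake_share = FAKE / TOTAL_ACCESSED

LABEL_NAMES = {0: "fake", 1: "real"}

def label_distribution(labels):
    """Count labels by name using the stated 0 = fake, 1 = real convention."""
    return dict(Counter(LABEL_NAMES[label] for label in labels))

# Hypothetical label column, for illustration only.
distribution = label_distribution([0, 1, 1, 0, 0])
print(not_accessed, round(fake_share, 3), distribution)
```

The classes are close to balanced, so plain accuracy is a reasonable first metric here, though that choice is a suggestion rather than something the description prescribes.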
Published in: IEEE Transactions on Computational Social Systems: pp. 1-13 (doi: 10.1109/TCSS.2021.3068519).
This dataset was created by YAZAN ALSHUAIBI
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was built with a web scraping tool for the Bancolombia Dataton 2022, to train supervised models for news recommendation in the following categories:
This CSV document consists of the following columns:
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains text samples labeled as either human-written or AI-generated. It is designed for binary text classification tasks in Natural Language Processing (NLP). The dataset includes 1299 text samples with accompanying basic features such as word count, character count, average word length, and punctuation density.
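The description names the basic features but not how they were computed, so the sketch below uses one plausible set of definitions as an assumption: words are whitespace-separated tokens, character count is the full string length, and punctuation density is the fraction of characters that are ASCII punctuation. The dataset's own definitions may differ.

```python
import string

def basic_features(text: str) -> dict:
    """Compute simple surface features of a text sample.

    Assumed definitions (not confirmed by the dataset description):
    whitespace tokenisation, len(text) as character count, and ASCII
    punctuation characters divided by total characters as density.
    """
    words = text.split()
    char_count = len(text)
    punct_count = sum(1 for ch in text if ch in string.punctuation)
    return {
        "word_count": len(words),
        "char_count": char_count,
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "punctuation_density": punct_count / char_count if char_count else 0.0,
    }

features = basic_features("Hello, world! This is a test.")
print(features)
```

Note that under these definitions punctuation attached to a word counts toward its length; stripping punctuation before measuring word length would be an equally defensible choice.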
The AI-generated texts were collected from multiple LLMs (e.g., ChatGPT, Gemini, Claude). Exact model attribution for each sample is not preserved.