33 datasets found

Text Classification labeled and unlabeled datasets
kaggle.com
zip
Updated Jan 7, 2024
Cite
Anna Jazayeri (2024). Text Classification labeled and unlabeled datasets [Dataset]. https://www.kaggle.com/datasets/annajazayeri/text-classification-labeled-and-unlabeled-datasets/suggestions
Explore at:
zip (27499 bytes)
Dataset updated
Jan 7, 2024
Authors
Anna Jazayeri
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
Dataset

This dataset was created by Anna Jazayeri

Released under MIT

Contents
Text classification
kaggle.com
Updated Apr 6, 2024
Cite
Fairea -san (2024). Text classification [Dataset]. https://www.kaggle.com/datasets/faireasan/text-classification/discussion
Explore at:
Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Dataset updated
Apr 6, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Fairea -san
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
Dataset

This dataset was created by Fairea -san

Released under MIT

Contents
Amazon product reviews (mock dataset)
zenodo.org
csv
Updated Jun 18, 2022
Cite
Yury Kashnitsky (2022). Amazon product reviews (mock dataset) [Dataset]. http://doi.org/10.5281/zenodo.6657410
Explore at:
csv
Unique identifier
https://doi.org/10.5281/zenodo.6657410
Dataset updated
Jun 18, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Yury Kashnitsky
License
Attribution 1.0 (CC BY 1.0): https://creativecommons.org/licenses/by/1.0/
License information was derived automatically
Description
About

This is a mock dataset with Amazon product reviews. Classes are structured: 6 "level 1" classes, 64 "level 2" classes, and 510 "level 3" classes.

3 files are shared:

train_40k.csv - training 40k Amazon product reviews

valid_10k.csv - 10k reviews left for validation

unlabeled_150k.csv - 150k raw Amazon product reviews, which can be used for language model fine-tuning.

Level 1 classes are: health personal care, toys games, beauty, pet supplies, baby products, and grocery gourmet food.
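
As a rough illustration of the three-level class hierarchy described above, the sketch below counts the distinct labels per level in a CSV file. The column names `level_1`, `level_2`, `level_3` and the sample rows are assumptions made for illustration; the actual headers and labels in train_40k.csv may differ.

```python
import csv
import io

# Hypothetical sample: column names and rows are assumptions, not taken from the dataset.
SAMPLE = """text,level_1,level_2,level_3
Great shampoo,beauty,hair care,shampoos
Fun puzzle,toys games,puzzles,jigsaw puzzles
Nice conditioner,beauty,hair care,conditioners
"""

def count_classes(csv_file, level_columns=("level_1", "level_2", "level_3")):
    """Return the number of distinct labels seen in each level column."""
    seen = {col: set() for col in level_columns}
    for row in csv.DictReader(csv_file):
        for col in level_columns:
            seen[col].add(row[col])
    return {col: len(labels) for col, labels in seen.items()}

counts = count_classes(io.StringIO(SAMPLE))
print(counts)
```

Run against the real training file (with the correct column names substituted), the counts could be compared with the 6 / 64 / 510 figures quoted in the description.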

Dataset originally from https://www.kaggle.com/datasets/kashnitsky/hierarchical-text-classification
text_classification
kaggle.com
zip
Updated May 3, 2024
+ more versions
Cite
MaWenhui (2024). text_classification [Dataset]. https://www.kaggle.com/datasets/mawenhui/text-classification
Explore at:
zip (29752912 bytes)
Dataset updated
May 3, 2024
Authors
MaWenhui
Description
Dataset

This dataset was created by MaWenhui

Contents
Text classification-Heathcare
kaggle.com
zip
Updated Dec 31, 2017
Cite
Shwet Prakash (2017). Text classification-Heathcare [Dataset]. https://www.kaggle.com/datasets/shwetp/text-classificationheathcare/data
Explore at:
zip (14291782 bytes)
Dataset updated
Dec 31, 2017
Authors
Shwet Prakash
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Dataset

This dataset was created by Shwet Prakash

Released under CC0: Public Domain

Contents
SAT Questions and Answers for LLM 🏛️
kaggle.com
Updated Oct 16, 2023
+ more versions
Cite
Training Data (2023). SAT Questions and Answers for LLM 🏛️ [Dataset]. https://www.kaggle.com/datasets/trainingdatapro/sat-history-questions-and-answers/code
Explore at:
Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Dataset updated
Oct 16, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Training Data
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Description
SAT History Questions and Answers 🏛️ - Text Classification Dataset

This dataset contains a collection of questions and answers for the SAT Subject Tests in World History and US History. Each question is accompanied by its answer options and the correct response.

The dataset includes questions from various topics, time periods, and regions on both World History and US History.

💴 For Commercial Usage: To discuss your requirements, learn about the price and buy the dataset, leave a request on TrainingData to buy the dataset

OTHER DATASETS FOR THE TEXT ANALYSIS:

Google Play Messengers - 6,000 Reviews ⭐️

20,000 Customers Reviews on Banks ⭐️

Amazon Reviews Dataset

Content

For each question, we extracted:
- id: number of the question
- subject: SAT subject (World History or US History)
- prompt: text of the question
- A: answer A
- B: answer B
- C: answer C
- D: answer D
- E: answer E
- answer: letter of the correct answer to the question
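
The record layout above could be modelled as a small data class. This is only a sketch: the class name `SatQuestion`, the helper `correct_text`, and the placeholder values are hypothetical, while the field names follow the list above.

```python
from dataclasses import dataclass

@dataclass
class SatQuestion:
    """One record with the fields listed in the dataset description."""
    id: int
    subject: str   # "World History" or "US History"
    prompt: str
    A: str
    B: str
    C: str
    D: str
    E: str
    answer: str    # letter of the correct option

    def correct_text(self) -> str:
        """Return the text of the option named by `answer`."""
        if self.answer not in ("A", "B", "C", "D", "E"):
            raise ValueError(f"unexpected answer letter: {self.answer!r}")
        return getattr(self, self.answer)

# Placeholder record (not real dataset content).
question = SatQuestion(
    id=1, subject="US History", prompt="placeholder prompt",
    A="option A", B="option B", C="option C", D="option D", E="option E",
    answer="C",
)
print(question.correct_text())
```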

💴 Buy the Dataset: This is just an example of the data. Leave a request on https://trainingdata.pro/datasets to discuss your requirements, learn about the price and buy the dataset

TrainingData provides high-quality data annotation tailored to your needs

keywords: answer questions, sat, gpa, university, school, exam, college, web scraping, parsing, online database, text dataset, sentiment analysis, llm dataset, language modeling, large language models, text classification, text mining dataset, natural language texts, nlp, nlp open-source dataset, text data, machine learning
MNAD Dataset
paperswithcode.com
Updated May 16, 2023
+ more versions
Cite
(2023). MNAD Dataset [Dataset]. https://paperswithcode.com/dataset/mnad
Explore at:
Dataset updated
May 16, 2023
Description
About the MNAD Dataset

The MNAD corpus is a collection of over 1 million Moroccan news articles written in modern Arabic language. These news articles have been gathered from 11 prominent electronic news sources. The dataset is made available to the academic community for research purposes, such as data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), and other non-commercial activities.

Dataset Fields

- Title: The title of the article
- Body: The body of the article
- Category: The category of the article
- Source: The electronic newspaper source of the article

About Version 1 of the Dataset (MNAD.v1)

Version 1 of the dataset comprises 418,563 articles classified into 19 categories. The data was collected from well-known electronic news sources, namely Akhbarona.ma, Hespress.ma, Hibapress.com, and Le360.com. The articles were stored in four separate CSV files, each corresponding to the news website source. Each CSV file contains three fields: Title, Body, and Category of the news article.

The dataset is rich in Arabic vocabulary, with approximately 906,125 unique words. It has been utilized as a benchmark in the research paper "A Moroccan News Articles Dataset (MNAD) For Arabic Text Categorization", published in the 2021 International Conference on Decision Aid Sciences and Application (DASA).

This dataset is available for download from the following sources:
- Kaggle Datasets: MNADv1
- Huggingface Datasets: MNADv1

About Version 2 of the Dataset (MNAD.v2)

Version 2 of the MNAD dataset includes an additional 653,901 articles, bringing the total number of articles to over 1 million (1,069,489), classified into the same 19 categories as in version 1. The new documents were collected from seven additional prominent Moroccan news websites, namely al3omk.com, medi1news.com, alayam24.com, anfaspress.com, alyaoum24.com, barlamane.com, and SnrtNews.com.

The newly collected articles have been merged with the articles from the previous version into a single CSV file named MNADv2.csv. This file includes an additional column called "Source" to indicate the source of each news article.

Furthermore, MNAD.v2 incorporates improved pre-processing techniques and data cleaning methods. These enhancements involve removing duplicates, eliminating multiple spaces, discarding rows with NaN values, replacing new lines with " ", excluding very long and very short articles, and removing non-Arabic articles. These additions and improvements aim to enhance the usability and value of the MNAD dataset for researchers and practitioners in the field of Arabic text analysis.
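
The cleaning steps listed above can be sketched roughly as follows. This is not the authors' pipeline: the function name `clean_articles`, the length thresholds, and the Arabic-letter ratio used to detect non-Arabic text are all assumptions chosen for illustration.

```python
import re

# Basic Arabic Unicode block; a simplification of "is this text Arabic?".
ARABIC_CHAR = re.compile(r"[\u0600-\u06FF]")

def clean_articles(rows, min_chars=20, max_chars=20000, min_arabic_ratio=0.5):
    """Apply the listed cleaning steps to dicts with Title, Body, Category keys.

    The thresholds are illustrative assumptions, not values from the dataset.
    """
    seen_bodies = set()
    cleaned = []
    for row in rows:
        # Discard rows with missing values.
        if not row.get("Title") or not row.get("Body") or not row.get("Category"):
            continue
        # Replace new lines with " " and collapse multiple spaces.
        body = row["Body"].replace("\n", " ")
        body = re.sub(r" {2,}", " ", body).strip()
        # Exclude very short and very long articles.
        if not (min_chars <= len(body) <= max_chars):
            continue
        # Remove non-Arabic articles (share of Arabic letters too low).
        letters = [ch for ch in body if ch.isalpha()]
        if not letters:
            continue
        arabic = sum(1 for ch in letters if ARABIC_CHAR.match(ch))
        if arabic / len(letters) < min_arabic_ratio:
            continue
        # Remove duplicates, compared on the cleaned body.
        if body in seen_bodies:
            continue
        seen_bodies.add(body)
        cleaned.append({**row, "Body": body})
    return cleaned

word = "\u062e\u0628\u0631"  # an Arabic word, written with escapes for portability
rows = [
    {"Title": "t1", "Body": f"{word}  {word}\n{word}", "Category": "c"},
    {"Title": "t2", "Body": f"{word} {word} {word}", "Category": "c"},  # duplicate once cleaned
    {"Title": "t3", "Body": "English only text", "Category": "c"},      # non-Arabic
    {"Title": "t4", "Body": None, "Category": "c"},                     # missing value
]
result = clean_articles(rows, min_chars=5)
print(len(result))
```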

This dataset is available for download from the following sources:
- Kaggle Datasets: MNADv2
- Huggingface Datasets: MNADv2

Citation

If you use our data, please cite the following paper:

@inproceedings{MNAD2021,
  author = {Mourad Jbene and Smail Tigani and Rachid Saadane and Abdellah Chehri},
  title = {A Moroccan News Articles Dataset ({MNAD}) For Arabic Text Categorization},
  year = {2021},
  publisher = {{IEEE}},
  booktitle = {2021 International Conference on Decision Aid Sciences and Application ({DASA})},
  doi = {10.1109/dasa53625.2021.9682402},
  url = {https://doi.org/10.1109/dasa53625.2021.9682402},
}
CT-FAN-21 corpus: A dataset for Fake News Detection
zenodo.org
Updated Oct 23, 2022
Cite
Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl (2022). CT-FAN-21 corpus: A dataset for Fake News Detection [Dataset]. http://doi.org/10.5281/zenodo.4714517
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.4714517
Dataset updated
Oct 23, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl
Description
Data Access: The data in the research collection provided may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so you must be careful to use it only for research purposes. Due to these restrictions, the collection is not open data. Please download the Data Sharing Agreement and send the signed form to fakenewstask@gmail.com.

Citation

Please cite our work as

@article{shahi2021overview,
  title={Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection},
  author={Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas},
  journal={Working Notes of CLEF},
  year={2021}
}

Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English.

Subtask 3A: Multi-class fake news detection of news articles (English)

Subtask 3A frames fake news detection as a four-class classification problem. The training data will be released in batches and comprises roughly 900 articles with their respective labels. Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. Our definitions for the categories are as follows:

False - The main claim made in an article is untrue.

Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.

True - This rating indicates that the primary elements of the main claim are demonstrably true.

Other - An article that cannot be categorised as true, false, or partially false due to lack of evidence about its claims. This category includes articles in dispute and unproven articles.

Subtask 3B: Topical Domain Classification of News Articles (English)

Fact-checkers require background expertise to identify the truthfulness of an article. The categorisation will help to automate the sampling process from a stream of data. Given the text of a news article, determine the topical domain of the article (English). This is a classification problem. The task is to categorise fake news articles into six topical categories such as health, election, crime, climate, and education. This task will be offered for a subset of the data of Subtask 3A.

Input Data

The data will be provided in the format of Id, title, text, rating, the domain; the description of the columns is as follows:

Task 3a

ID - Unique identifier of the news article

Title - Title of the news article

text - Text mentioned inside the news article

our rating - Class of the news article: false, partially false, true, or other

Task 3b

public_id - Unique identifier of the news article

Title - Title of the news article

text - Text mentioned inside the news article

domain - Domain of the given news article (applicable only for task B)

Output data format

Task 3a

public_id - Unique identifier of the news article

predicted_rating - Predicted class

Sample file

public_id, predicted_rating
1, false
2, true

Task 3b

public_id - Unique identifier of the news article

predicted_domain - Predicted domain

Sample file

public_id, predicted_domain
1, health
2, crime
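
A prediction file in the two-column layout shown above could be produced with Python's standard `csv` module, as in this minimal sketch. The helper name `write_predictions` is hypothetical, and whether the evaluation expects a space after the comma (as in the samples) is not stated, so the delimiter handling here is an assumption.

```python
import csv
import io

def write_predictions(rows, header, out_file):
    """Write a header line followed by (public_id, label) rows as CSV."""
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)

# In-memory buffer for demonstration; a real run would open a file instead.
buffer = io.StringIO()
write_predictions(
    [(1, "false"), (2, "true")],
    ("public_id", "predicted_rating"),
    buffer,
)
print(buffer.getvalue())
```

The same helper covers Task 3b by passing `("public_id", "predicted_domain")` as the header.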

Additional data for Training

To train your model, participants can use additional data with a similar format; some datasets are available on the web. We don't provide the ground truth for those datasets. For testing, we will not use any articles from other datasets. Some possible sources:

Fakenews Classification Datasets

Fake News Detection Challenge KDD 2020

FakeNewsNet

IMPORTANT!

The fake news articles used for task 3b are a subset of those used for task 3a.

We have used the data from 2010 to 2021, and the content of fake news is mixed up with several topics like election, COVID-19 etc.

Evaluation Metrics

This task is evaluated as a classification task. We will use the F1-macro measure for the ranking of teams. There is a limit of 5 runs (total and not per day), and only one person from a team is allowed to submit runs.
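
For reference, macro-averaged F1 is the unweighted mean of the per-class F1 scores. The sketch below computes it in plain Python; the function name `macro_f1` is hypothetical, and scoring a class with no true or predicted instances as 0.0 is an assumed convention, not something the task description specifies.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 over the given labels."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denominator = 2 * tp + fp + fn
        # Assumed convention: a class that never appears scores 0.0.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)

labels = ["false", "partially false", "true", "other"]
y_true = ["false", "false", "true", "other"]
y_pred = ["false", "true", "true", "other"]
score = macro_f1(y_true, y_pred, labels)
print(round(score, 4))
```

Because every class contributes equally, a rare class that is never predicted correctly pulls the score down as much as a frequent one would.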

Submission Link: https://competitions.codalab.org/competitions/31238

Related Work

Shahi GK. AMUSED: An Annotation Framework of Multi-modal Social Media Data. arXiv preprint arXiv:2010.00502. 2020 Oct 1. https://arxiv.org/pdf/2010.00502.pdf

G. K. Shahi and D. Nandini, “FakeCovid – a multilingual cross-domain fact check news dataset for covid-19,” in Workshop Proceedings of the 14th International AAAI Conference on Web and Social Media, 2020. http://workshop-proceedings.icwsm.org/abstract?id=2020_14

Shahi, G. K., Dirkson, A., & Majchrzak, T. A. (2021). An exploratory study of covid-19 misinformation on twitter. Online Social Networks and Media, 22, 100104. doi: 10.1016/j.osnem.2020.100104
Tweet_Classification
kaggle.com
Updated Jun 4, 2024
Cite
Shantanu (2024). Tweet_Classification [Dataset]. https://www.kaggle.com/datasets/darkknigh88/tweet-classification/code
Explore at:
Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Dataset updated
Jun 4, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Shantanu
Description
Dataset

This dataset was created by Shantanu

Contents
Performance comparison of LastBERT, DistilBERT, and ClinicalBERT on ADHD...
plos.figshare.com
xls
Updated Feb 6, 2025
Cite
Ahmed Akib Jawad Karim; Kazi Hafiz Md. Asad; Md. Golam Rabiul Alam (2025). Performance comparison of LastBERT, DistilBERT, and ClinicalBERT on ADHD dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0315829.t004
Explore at:
xls
Unique identifier
https://doi.org/10.1371/journal.pone.0315829.t004
Dataset updated
Feb 6, 2025
Dataset provided by
PLOS ONE
Authors
Ahmed Akib Jawad Karim; Kazi Hafiz Md. Asad; Md. Golam Rabiul Alam
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Performance comparison of LastBERT, DistilBERT, and ClinicalBERT on ADHD dataset.
spam and ham dataset
kaggle.com
zip
Updated Jun 22, 2019
+ more versions
Cite
Rushikesh Shivaji Patil (2019). spam and ham dataset [Dataset]. https://www.kaggle.com/rushirdx/spam-and-ham-dataset
Explore at:
zip (215934 bytes)
Dataset updated
Jun 22, 2019
Authors
Rushikesh Shivaji Patil
Description
Dataset

This dataset was created by Rushikesh Shivaji Patil

Contents
Configuration of the LastBERT model.
plos.figshare.com
xls
Updated Feb 6, 2025
+ more versions
Cite
Ahmed Akib Jawad Karim; Kazi Hafiz Md. Asad; Md. Golam Rabiul Alam (2025). Configuration of the LastBERT model. [Dataset]. http://doi.org/10.1371/journal.pone.0315829.t001
Explore at:
xls
Unique identifier
https://doi.org/10.1371/journal.pone.0315829.t001
Dataset updated
Feb 6, 2025
Dataset provided by
PLOS ONE
Authors
Ahmed Akib Jawad Karim; Kazi Hafiz Md. Asad; Md. Golam Rabiul Alam
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This work focuses on the efficiency of the knowledge distillation approach in generating a lightweight yet powerful BERT-based model for natural language processing (NLP) applications. After the model creation, we applied the resulting model, LastBERT, to a real-world task: classifying severity levels of Attention Deficit Hyperactivity Disorder (ADHD)-related concerns from social media text data. With LastBERT, a customized student BERT model, we significantly lowered the parameter count from 110 million (BERT base) to 29 million, resulting in a model approximately 73.64% smaller. On the General Language Understanding Evaluation (GLUE) benchmark, comprising paraphrase identification, sentiment analysis, and text classification, the student model maintained strong performance across many tasks despite this reduction. The model was also used on a real-world ADHD dataset with an accuracy of 85%, F1 score of 85%, precision of 85%, and recall of 85%. When compared to DistilBERT (66 million parameters) and ClinicalBERT (110 million parameters), LastBERT demonstrated comparable performance, with DistilBERT slightly outperforming it at 87%, and ClinicalBERT achieving 86% across the same metrics. These findings highlight the LastBERT model’s capacity to classify degrees of ADHD severity properly, offering a useful tool for mental health professionals to assess and comprehend material produced by users on social networking platforms. The study emphasizes the potential of knowledge distillation to produce effective models fit for use in resource-limited conditions, hence advancing NLP and mental health diagnosis. Furthermore, the considerable decrease in model size without appreciable performance loss lowers the computational resources needed for training and deployment, facilitating greater applicability, especially with readily available computational tools like Google Colab and Kaggle Notebooks. This study shows the accessibility and usefulness of advanced NLP methods in practical real-world applications.
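
The size-reduction figure quoted in the description follows directly from the two parameter counts. A quick check, using the rounded counts from the text (110 million and 29 million) and a hypothetical helper name:

```python
def size_reduction_percent(teacher_params, student_params):
    """Percentage by which the student is smaller than the teacher."""
    return 100.0 * (teacher_params - student_params) / teacher_params

# Rounded parameter counts as stated in the description.
reduction = size_reduction_percent(110_000_000, 29_000_000)
print(f"{reduction:.2f}%")  # 73.64%
```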
Tensorflow Official Text Datasets
kaggle.com
zip
Updated Jul 6, 2021
Cite
Moore (2021). Tensorflow Official Text Datasets [Dataset]. https://www.kaggle.com/imoore/tensorflow-official-text-datasets
Explore at:
zip (143196 bytes)
Dataset updated
Jul 6, 2021
Authors
Moore
Description
TFDS provides a collection of ready-to-use datasets for use with TensorFlow, Jax, and other Machine Learning frameworks.

source

https://www.tensorflow.org/datasets/overview
CT-FAN-22 corpus: A Multilingual dataset for Fake News Detection
zenodo.org
data.niaid.nih.gov
Updated Jan 6, 2022
+ more versions
Cite
Shahi Gautam Kishore; Struß Julia Maria; Thomas Mandl (2022). CT-FAN-22 corpus: A Multilingual dataset for Fake News Detection [Dataset]. http://doi.org/10.5281/zenodo.5775508
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.5775508
Dataset updated
Jan 6, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Shahi Gautam Kishore; Struß Julia Maria; Thomas Mandl
Description
Data Access: The data in the research collection provided may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so you must be careful to use it only for research purposes. Due to these restrictions, the collection is not open data. Please download the Data Sharing Agreement and send the signed form to fakenewstask@gmail.com.

Citation

Please cite our work as

@article{shahi2021overview,
  title={Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection},
  author={Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas},
  journal={Working Notes of CLEF},
  year={2021}
}

Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English and German.

Subtask 3: Multi-class fake news detection of news articles (English)

This subtask frames fake news detection as a four-class classification problem. The training data will be released in batches and comprises roughly 900 articles with their respective labels. Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. Our definitions for the categories are as follows:

False - The main claim made in an article is untrue.

Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.

True - This rating indicates that the primary elements of the main claim are demonstrably true.

Other - An article that cannot be categorised as true, false, or partially false due to lack of evidence about its claims. This category includes articles in dispute and unproven articles.

Input Data

The data will be provided in the format of Id, title, text, rating, the domain; the description of the columns is as follows:

Task 3

ID - Unique identifier of the news article

Title - Title of the news article

text - Text mentioned inside the news article

our rating - Class of the news article: false, partially false, true, or other

Output data format

Task 3

public_id - Unique identifier of the news article

predicted_rating - Predicted class

Sample file

public_id, predicted_rating
1, false
2, true

Sample file

public_id, predicted_domain
1, health
2, crime

Additional data for Training

To train your model, participants can use additional data with a similar format; some datasets are available on the web. We don't provide the ground truth for those datasets. For testing, we will not use any articles from other datasets. Some possible sources:

Fakenews Classification Datasets

Fake News Detection Challenge KDD 2020

FakeNewsNet

IMPORTANT!

We have used the data from 2010 to 2021, and the content of fake news is mixed up with several topics like election, COVID-19 etc.

Evaluation Metrics

This task is evaluated as a classification task. We will use the F1-macro measure for the ranking of teams. There is a limit of 5 runs (total and not per day), and only one person from a team is allowed to submit runs.

Submission Link: Coming soon

Related Work

Shahi, G. K., Struß, J. M., & Mandl, T. (2021). Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection. Working Notes of CLEF.

Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeño, A., Míguez, R., Shaar, S., ... & Mandl, T. (2021, March). The CLEF-2021 CheckThat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news. In European Conference on Information Retrieval (pp. 639-649). Springer, Cham.

Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeño, A., Míguez, R., Shaar, S., ... & Kartal, Y. S. (2021, September). Overview of the CLEF–2021 CheckThat! Lab on Detecting Check-Worthy Claims, Previously Fact-Checked Claims, and Fake News. In International Conference of the Cross-Language Evaluation Forum for European Languages (pp. 264-291). Springer, Cham.

Shahi GK. AMUSED: An Annotation Framework of Multi-modal Social Media Data. arXiv preprint arXiv:2010.00502. 2020 Oct 1. https://arxiv.org/pdf/2010.00502.pdf

G. K. Shahi and D. Nandini, “FakeCovid – a multilingual cross-domain fact check news dataset for covid-19,” in Workshop Proceedings of the 14th International AAAI Conference on Web and Social Media, 2020. http://workshop-proceedings.icwsm.org/abstract?id=2020_14

Shahi, G. K., Dirkson, A., & Majchrzak, T. A. (2021). An exploratory study of covid-19 misinformation on twitter. Online Social Networks and Media, 22, 100104. doi: 10.1016/j.osnem.2020.100104
Language_Indentification_v2
huggingface.co
Updated Mar 18, 2025
+ more versions
Cite
ProcessVenue (2025). Language_Indentification_v2 [Dataset]. https://huggingface.co/datasets/Process-Venue/Language_Indentification_v2
Explore at:
Dataset updated
Mar 18, 2025
Dataset authored and provided by
ProcessVenue
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Dataset Card for Language Identification Dataset

Sample Notebook:

https://www.kaggle.com/code/rishabhbhartiya/indian-language-classification-smote-resampled

Kaggle Dataset link:

https://www.kaggle.com/datasets/processvenue/indian-language-identification

Dataset Summary

A comprehensive dataset for Indian language identification and text classification. The dataset contains text samples across 18 major Indian languages, making it suitable for… See the full description on the dataset page: https://huggingface.co/datasets/Process-Venue/Language_Indentification_v2.
Naim Mhedhbi Tunisian Dialect Corpus v1
kaggle.com
zip
Updated Jan 12, 2021
Cite
Naim Mhedhbi (2021). Naim Mhedhbi Tunisian Dialect Corpus v1 [Dataset]. https://www.kaggle.com/naim99/tunisian-texts
Explore at:
zip (8185122 bytes)
Dataset updated
Jan 12, 2021
Authors
Naim Mhedhbi
Area covered
Tunisia
Description
Dataset

This dataset was created by Naim Mhedhbi

Released under Data files © Original Authors

Contents
PORT.hu reviews
kaggle.com
zip
Updated Oct 29, 2021
Cite
Balázs Csomor (2021). PORT.hu reviews [Dataset]. https://www.kaggle.com/datasets/csomorbalazs/porthu-reviews
Explore at:
zip (33591896 bytes)
Dataset updated
Oct 29, 2021
Authors
Balázs Csomor
Description
Dataset

This dataset was created by Balázs Csomor

Contents
BBC datasets for sentiment analysis
kaggle.com
Updated Dec 15, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Alan Turner (2024). BBC datasets for sentiment analysis [Dataset]. https://www.kaggle.com/datasets/amunsentom/article-dataset-2/suggestions
Explore at:
Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Dataset updated
Dec 15, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Alan Turner
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset Name: BBC Articles Sentiment Analysis Dataset

Source: BBC News

Description: This dataset consists of articles from the BBC News website, containing a diverse range of topics such as business, politics, entertainment, technology, sports, and more. The dataset includes articles from various time periods and categories, along with labels representing the sentiment of the article. The sentiment labels indicate whether the tone of the article is positive, negative, or neutral, making it suitable for sentiment analysis tasks.

Number of Instances: [Specify the number of articles in the dataset, for example, 2,225 articles]

Number of Features:
1. Article Text: The content of the article (string).
2. Sentiment Label: The sentiment classification of the article. The possible labels are:
   - Positive
   - Negative
   - Neutral

Data Fields:
- id: Unique identifier for each article.
- category: The category or topic of the article (e.g., business, politics, sports).
- title: The title of the article.
- content: The full text of the article.
- sentiment: The sentiment label (positive, negative, or neutral).

Example:

| id | category | title | content | sentiment |
|----|----------|-------|---------|-----------|
| 1 | Business | "Stock Market Surge" | "The stock market has surged to new highs, driven by strong earnings..." | Positive |
| 2 | Politics | "Election Results" | "The election results were a mixed bag, with some surprises along the way." | Neutral |
| 3 | Sports | "Team Wins Championship" | "The team won the championship after a thrilling final match." | Positive |
| 4 | Technology | "New Smartphone Release" | "The new smartphone release has received mixed reactions from users." | Negative |

Preprocessing Notes:
- The text has been preprocessed to remove special characters and any HTML tags that might have been included in the original articles.
- Tokenization or further text cleaning (e.g., lowercasing, stopword removal) may be necessary depending on the model and method used for sentiment classification.
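
The further cleaning mentioned in the notes (lowercasing, tokenization, stopword removal) might look like the sketch below. The function name `preprocess`, the token pattern, and the very small stopword list are illustrative assumptions; a real pipeline would more likely use the tokenizer and stopword list that match the chosen model.

```python
import re

# Tiny illustrative stopword list (an assumption, not part of the dataset).
STOPWORDS = frozenset({"the", "a", "an", "to", "of", "has", "by"})

def preprocess(text, stopwords=STOPWORDS):
    """Lowercase, split into simple word tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [token for token in tokens if token not in stopwords]

tokens = preprocess("The stock market has surged to new highs")
print(tokens)
```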

Use Case: This dataset is ideal for training and evaluating machine learning models for sentiment classification, where the goal is to predict the sentiment (positive, negative, or neutral) based on the article's text.
Indonesia False News(Hoax) Dataset
kaggle.com
zip
Updated Dec 11, 2020
Cite
Muhammad Ghazi Muharam (2020). Indonesia False News(Hoax) Dataset [Dataset]. https://www.kaggle.com/muhammadghazimuharam/indonesiafalsenews
Explore at:
zip (560920 bytes)
Dataset updated
Dec 11, 2020
Authors
Muhammad Ghazi Muharam
Description
Dataset

This dataset was created by Muhammad Ghazi Muharam

Contents
Meta-Banking Adoption
kaggle.com
data.mendeley.com
Updated Feb 10, 2025
+ more versions
Cite
Jocelyn Dumlao (2025). Meta-Banking Adoption [Dataset]. https://www.kaggle.com/datasets/jocelyndumlao/meta-banking-adoption
Explore at:
Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Dataset updated
Feb 10, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Jocelyn Dumlao
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset contains survey responses from 312 participants with AR/VR experience, collected via social media in 2024, exploring factors influencing perceived value and adoption intention of meta-banking services.

Categories

Banking, Marketing
