100+ datasets found
  1. LLM - Detect AI Generated Text Dataset

    • kaggle.com
    Updated Nov 8, 2023
    Cite
    sunil thite (2023). LLM - Detect AI Generated Text Dataset [Dataset]. https://www.kaggle.com/datasets/sunilthite/llm-detect-ai-generated-text-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    sunil thite
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This dataset contains both AI-generated and human-written essays for training purposes. The challenge is to develop a machine learning model that can accurately detect whether an essay was written by a student or by an LLM. The competition dataset comprises a mix of student-written essays and essays generated by a variety of LLMs.

    The dataset contains more than 28,000 essays, both student-written and AI-generated.

    Features: 1. text: the essay text. 2. generated: the target label (0 = Human-Written Essay, 1 = AI-Generated Essay).
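    A minimal sketch of reading the two columns described above with Python's csv module; the inline sample below is illustrative, not data from the dataset:

```python
import csv
import io

def label_counts(csv_file):
    """Tally human-written (0) vs AI-generated (1) essays from the
    two-column layout described above."""
    counts = {"0": 0, "1": 0}
    for row in csv.DictReader(csv_file):
        counts[row["generated"]] += 1
    return counts

# Inline sample mirroring the described schema, for illustration only.
sample = io.StringIO("text,generated\nA student essay,0\nAn LLM essay,1\nAnother student essay,0\n")
print(label_counts(sample))  # {'0': 2, '1': 1}
```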

  2. Text Classification Dataset

    • opendatabay.com
    Updated Jun 6, 2025
    Cite
    Opendatabay (2025). Text Classification Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/1775ad0d-be0d-49c9-bbc1-f94a8a5c8355
    Explore at:
    Available download formats
    Dataset updated
    Jun 6, 2025
    Dataset provided by
    Buy & Sell Data | Opendatabay - AI & Synthetic Data Marketplace
    Authors
    Opendatabay
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Area covered
    Education & Learning Analytics
    Description

    A curated dataset of 241,000+ English-language comments labeled for sentiment (negative, neutral, positive). Ideal for training and evaluating NLP models in sentiment analysis.

    Dataset Features

    1. text: Contains individual English-language comments or posts sourced from various online platforms.

    2. label: Represents the sentiment classification assigned to each comment. It uses the following encoding:

    0 = Negative sentiment
    1 = Neutral sentiment
    2 = Positive sentiment

    Distribution

    • Format: CSV (Comma-Separated Values)
    • 2 columns: text (the comment content) and label (sentiment classification: 0 = Negative, 1 = Neutral, 2 = Positive)
    • File Size: Approximately 23.9 MB
    • Structure: Each row contains a single comment and its corresponding sentiment label.
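    A small helper for decoding the documented label encoding, sketched against the two-column CSV layout described above (the inline sample is illustrative, not dataset content):

```python
import csv
import io

# Label encoding documented above.
SENTIMENT = {"0": "negative", "1": "neutral", "2": "positive"}

def decoded_rows(csv_file):
    """Yield (text, sentiment_name) pairs from the two-column CSV."""
    for row in csv.DictReader(csv_file):
        yield row["text"], SENTIMENT[row["label"]]

sample = io.StringIO("text,label\nGreat product,2\nMeh,1\nTerrible,0\n")
print(list(decoded_rows(sample)))
# [('Great product', 'positive'), ('Meh', 'neutral'), ('Terrible', 'negative')]
```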

    Usage

    This dataset is ideal for a variety of applications:

    • 1. Sentiment Analysis Model Training: Train machine learning or deep learning models to classify text as positive, negative, or neutral.

    • 2. Text Classification Projects: Use as a labeled dataset for supervised learning in text classification tasks.

    • 3. Customer Feedback Analysis: Train models to automatically interpret user reviews, support tickets, or survey responses.

    Coverage

    • Geographic Coverage: Primarily English-language content from global online platforms

    • Time Range: The exact time range of data collection is unspecified; however, the dataset reflects contemporary online language patterns and sentiment trends typically observed in the 2010s to early 2020s.

    • Demographics: Specific demographic information (e.g., age, gender, location, industry) is not included in the dataset, as the focus is purely on textual sentiment rather than user profiling.

    License

    CC0

    Who Can Use It

    • Data Scientists: For training machine learning models.
    • Researchers: For academic or scientific studies.
    • Businesses: For analysis, insights, or AI development.
  3. A collection of nine multi-label text classification datasets

    • ieee-dataport.org
    Updated Nov 4, 2024
    Cite
    Yiming Wang (2024). A collection of nine multi-label text classification datasets [Dataset]. https://ieee-dataport.org/documents/collection-nine-multi-label-text-classification-datasets
    Explore at:
    Dataset updated
    Nov 4, 2024
    Authors
    Yiming Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    RCV1

  4. Data from: A Neural Approach for Text Extraction from Scholarly Figures

    • data.uni-hannover.de
    zip
    Updated Jan 20, 2022
    Cite
    TIB (2022). A Neural Approach for Text Extraction from Scholarly Figures [Dataset]. https://data.uni-hannover.de/dataset/a-neural-approach-for-text-extraction-from-scholarly-figures
    Explore at:
    zip (798357692). Available download formats
    Dataset updated
    Jan 20, 2022
    Dataset authored and provided by
    TIB
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    A Neural Approach for Text Extraction from Scholarly Figures

    This is the readme for the supplemental data for our ICDAR 2019 paper.

    You can read our paper via IEEE here: https://ieeexplore.ieee.org/document/8978202

    If you found this dataset useful, please consider citing our paper:

    @inproceedings{DBLP:conf/icdar/MorrisTE19,
     author  = {David Morris and
            Peichen Tang and
            Ralph Ewerth},
     title   = {A Neural Approach for Text Extraction from Scholarly Figures},
     booktitle = {2019 International Conference on Document Analysis and Recognition,
            {ICDAR} 2019, Sydney, Australia, September 20-25, 2019},
     pages   = {1438--1443},
     publisher = {{IEEE}},
     year   = {2019},
     url    = {https://doi.org/10.1109/ICDAR.2019.00231},
     doi    = {10.1109/ICDAR.2019.00231},
     timestamp = {Tue, 04 Feb 2020 13:28:39 +0100},
     biburl  = {https://dblp.org/rec/conf/icdar/MorrisTE19.bib},
     bibsource = {dblp computer science bibliography, https://dblp.org}
    }
    

    This work was financially supported by the German Federal Ministry of Education and Research (BMBF) and European Social Fund (ESF) (InclusiveOCW project, no. 01PE17004).

    Datasets

    We used different sources of data for testing, validation, and training. Our testing set was assembled from the work by Böschen et al. that we cited. We excluded the DeGruyter dataset from it and used that as our validation dataset.

    Testing

    These datasets contain a readme with license information. Further information about the associated project can be found in the authors' published work we cited: https://doi.org/10.1007/978-3-319-51811-4_2

    Validation

    The DeGruyter dataset does not include the labeled images due to license restrictions. As of writing, the images can still be downloaded from DeGruyter via the links in the readme. Note that depending on what program you use to strip the images out of the PDF they are provided in, you may have to re-number the images.

    Training

    We used label_generator's generated dataset, which the author made available on a requester-pays amazon s3 bucket. We also used the Multi-Type Web Images dataset, which is mirrored here.

    Code

    We have made our code available in code.zip. We will upload code, announce further news, and field questions via the github repo.

    Our text detection network is adapted from Argman's EAST implementation. The EAST/checkpoints/ours subdirectory contains the trained weights we used in the paper.

    We used a Tesseract script to run text extraction on detected text rows. This is included in our code archive (code.tar) as text_recognition_multipro.py.

    We used a Java program provided by Falk Böschen, adapted to our file structure. We included this as evaluator.jar.

    Parameter sweeps are automated by param_sweep.rb. This file also shows how to invoke all of these components.

  5. Kieli NLP Data - Fully-labelled Audio & Text Dataset for Machine Learning &...

    • datarade.ai
    Updated Mar 20, 2021
    Cite
    Kieli (2021). Kieli NLP Data - Fully-labelled Audio & Text Dataset for Machine Learning & AI platforms [Dataset]. https://datarade.ai/data-products/a-fully-labelled-dataset-for-machine-learning-and-ai-platforms-kieli
    Explore at:
    .json, .xml, .csv, .xls, .sql, .txt. Available download formats
    Dataset updated
    Mar 20, 2021
    Dataset authored and provided by
    Kieli
    Area covered
    Djibouti, Fiji, Venezuela (Bolivarian Republic of), Antigua and Barbuda, Denmark, Uruguay, Tajikistan, Anguilla, Mauritius, Ethiopia
    Description

    Kieli labels audio, speech, image, video, and text data, including semantic segmentation, named entity recognition (NER), and POS tagging. Kieli transforms unstructured data into high-quality training data for the refinement of Artificial Intelligence and Machine Learning platforms. For over a decade, hundreds of organizations have relied on Kieli to deliver secure, high-quality training data and model validation for machine learning. At Kieli, we believe that accurate data is the most important factor in production learning models. We are committed to delivering the best-quality data for the most enterprising organizations and helping you make strides in Artificial Intelligence. At Kieli, we're passionately dedicated to serving the Arabic, English, and French markets. We work in all areas of industry: healthcare, technology, and retail.

  6. An Amharic News Text classification Dataset

    • paperswithcode.com
    Updated Mar 9, 2021
    Cite
    Israel Abebe Azime; Nebil Mohammed (2021). An Amharic News Text classification Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/an-amharic-news-text-classification-dataset
    Explore at:
    Dataset updated
    Mar 9, 2021
    Authors
    Israel Abebe Azime; Nebil Mohammed
    Description

    In NLP, text classification is one of the primary problems we try to solve and its uses in language analyses are indisputable. The lack of labeled training data made it harder to do these tasks in low resource languages like Amharic. The task of collecting, labeling, annotating, and making valuable this kind of data will encourage junior researchers, schools, and machine learning practitioners to implement existing classification models in their language. In this short paper, we aim to introduce the Amharic text classification dataset that consists of more than 50k news articles that were categorized into 6 classes. This dataset is made available with easy baseline performances to encourage studies and better performance experiments.

  7. Balinese Story Texts Dataset - Characters, Aliases, and their Classification...

    • data.mendeley.com
    Updated Mar 25, 2024
    Cite
    I Made Satria Bimantara (2024). Balinese Story Texts Dataset - Characters, Aliases, and their Classification [Dataset]. http://doi.org/10.17632/h2tf5ymcp9.3
    Explore at:
    Dataset updated
    Mar 25, 2024
    Authors
    I Made Satria Bimantara
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset consists of 120 Balinese story texts (also known as Satua Bali) annotated for narrative text analysis, including character identification, alias clustering, and character classification into protagonist or antagonist. The labeling involved two Balinese native speakers fluent in understanding Balinese story texts, one of whom is an expert in sociolinguistics and macrolinguistics. Reliability and level of agreement in the dataset are measured by Cohen's kappa coefficient, the Jaccard similarity coefficient, and F1-score, all of which show almost perfect agreement (>0.81). There are four main folders, each used for a different narrative text analysis purpose:
    1. First Dataset (charsNamedEntity): 89,917 annotated tokens with five character named entity labels (ANM, ADJ, PNAME, GODS, OBJ) for character named entity recognition
    2. Second Dataset (charsExtraction): 6,634 annotated sentences for character identification at the sentence level
    3. Third Dataset (charsAliasClustering): 930 lists of character groups from 120 story texts for alias clustering
    4. Fourth Dataset (charsClassification): 848 lists of character groups classified into two groups (Protagonist and Antagonist)
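    Cohen's kappa, used above to measure inter-annotator agreement, can be computed directly from two annotators' label sequences; a pure-Python sketch (the labels below are illustrative, not taken from the dataset):

```python
from collections import Counter

def cohens_kappa(ann_a, ann_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), where p_o is observed
    agreement and p_e is agreement expected by chance."""
    assert len(ann_a) == len(ann_b) and ann_a
    n = len(ann_a)
    p_o = sum(a == b for a, b in zip(ann_a, ann_b)) / n
    ca, cb = Counter(ann_a), Counter(ann_b)
    p_e = sum(ca[lbl] * cb[lbl] for lbl in set(ca) | set(cb)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Two hypothetical annotators labeling Protagonist/Antagonist.
print(cohens_kappa(["P", "P", "A", "A"], ["P", "P", "A", "P"]))  # 0.5
```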

  8. LLM Text Generation Dataset

    • unidata.pro
    csv
    Updated Feb 26, 2025
    Cite
    Unidata L.L.C-FZ (2025). LLM Text Generation Dataset [Dataset]. https://unidata.pro/datasets/llm-text-generation/
    Explore at:
    csv. Available download formats
    Dataset updated
    Feb 26, 2025
    Dataset authored and provided by
    Unidata L.L.C-FZ
    Description

    LLM Text Generation dataset offers multilingual text samples from large language models, enriching AI’s natural language understanding

  9. IPATH Dataset: 45,609 Curated Image-Text Pairs for Histopathology...

    • zenodo.org
    Updated Apr 23, 2025
    Cite
    Seyederfan Mirhosseini; Taran Rai; Pablo Jose Diaz Santana; Roberto La Ragione; Nicholas Bacon; Kevin Wells (2025). IPATH Dataset: 45,609 Curated Image-Text Pairs for Histopathology Applications [Dataset]. http://doi.org/10.5281/zenodo.14278846
    Explore at:
    Dataset updated
    Apr 23, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Seyederfan Mirhosseini; Taran Rai; Pablo Jose Diaz Santana; Roberto La Ragione; Nicholas Bacon; Kevin Wells
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Recent advancements in artificial intelligence (AI) have enabled the identification of patterns in pathology images, improving diagnostic accuracy and decision support systems. However, progress has been limited due to the lack of publicly available medical images. To address this scarcity, we explore Instagram as a novel source of pathology images with expert annotations. We curated the IPATH dataset from Instagram, comprising 45,609 pathology image-text pairs, using a combination of classifiers, large language models, and manual filtering. To demonstrate the value of this dataset, we developed a multimodal AI model called IP-CLIP by fine-tuning the pre-trained CLIP model using the IPATH dataset. IP-CLIP outperforms the original CLIP model in classifying new pathology images on two downstream tasks—zero-shot classification and linear probing—using two external histopathology datasets. These results surpass the CLIP baseline model and demonstrate the effectiveness of the IPATH dataset, highlighting the potential of social media data to advance AI models for medical image classification.

  10. Amharic text dataset extracted from memes for hate speech detection or...

    • data.mendeley.com
    Updated Jun 8, 2023
    Cite
    Mequanent Degu (2023). Amharic text dataset extracted from memes for hate speech detection or classification [Dataset]. http://doi.org/10.17632/gw3fdtw5v7.2
    Explore at:
    Dataset updated
    Jun 8, 2023
    Authors
    Mequanent Degu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset was collected from social media platforms such as Facebook and Telegram and then further processed. The collection has three variants: orginal_cleaned (neither stemmed nor stopword-removed), stopword_removed (stopwords removed but not stemmed), and stemmed (both stemmed and stopword-removed). Stemming was done using HornMorpho, developed by Michael Gesser (available at https://github.com/hltdi/HornMorpho). All datasets are normalized and free from noise such as punctuation marks and emojis.

  11. Date-Dataset

    • kaggle.com
    Updated Aug 18, 2021
    Cite
    nishtha kukreti (2021). Date-Dataset [Dataset]. https://www.kaggle.com/nishthakukreti/datedataset/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 18, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    nishtha kukreti
    Description

    Context

    This is a random date dataset that I generated using a Python script, for creating a machine learning model that tags dates in any given document.

    Content

    This dataset labels whether a given word or group of words is a date or not.


    Inspiration

    Implement a machine learning or deep learning model, or train a custom spaCy pipeline, to tag dates and other parts of speech.
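    As a trivial baseline for the tagging task described here (not the author's model), one could flag tokens that match common numeric date patterns; the regex and tokens below are illustrative assumptions:

```python
import re

# Matches simple numeric dates like 18/08/2021, 18-08-21, or 2021-08-18.
DATE_RE = re.compile(r"^(\d{1,2}[/-]\d{1,2}[/-]\d{2,4}|\d{4}-\d{2}-\d{2})$")

def tag_dates(tokens):
    """Return (token, is_date) pairs for a pre-tokenized document."""
    return [(tok, bool(DATE_RE.match(tok))) for tok in tokens]

print(tag_dates(["Signed", "on", "18/08/2021"]))
# [('Signed', False), ('on', False), ('18/08/2021', True)]
```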

  12. Trained models for multi-task multi-dataset learning for text classification...

    • databank.illinois.edu
    Cite
    Shubhanshu Mishra, Trained models for multi-task multi-dataset learning for text classification in tweets [Dataset]. http://doi.org/10.13012/B2IDB-1917934_V1
    Explore at:
    Authors
    Shubhanshu Mishra
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Trained models for multi-task multi-dataset learning for text classification in tweets. Classification tasks include sentiment prediction, abusive content, sarcasm, and veridicality. Models were trained using: https://github.com/socialmediaie/SocialMediaIE/blob/master/SocialMediaIE/scripts/multitask_multidataset_classification.py See https://github.com/socialmediaie/SocialMediaIE and https://socialmediaie.github.io for details. If you are using this data, please also cite the related article: Shubhanshu Mishra. 2019. Multi-dataset-multi-task Neural Sequence Tagging for Information Extraction from Tweets. In Proceedings of the 30th ACM Conference on Hypertext and Social Media (HT '19). ACM, New York, NY, USA, 283-284. DOI: https://doi.org/10.1145/3342220.3344929

  13. A kiswahili Dataset for Development of Text-To-Speech System

    • data.mendeley.com
    Updated Nov 30, 2021
    Cite
    Kiptoo Rono (2021). A kiswahili Dataset for Development of Text-To-Speech System [Dataset]. http://doi.org/10.17632/vbvj6j6pm9.1
    Explore at:
    Dataset updated
    Nov 30, 2021
    Authors
    Kiptoo Rono
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains Kiswahili text and audio files: 7,108 text files and corresponding audio files. It was created from open-source, non-copyrighted material, a Kiswahili audio Bible, whose authors permit use for non-profit, educational, and public benefit purposes. The downloaded audio files were longer than 12.5 s, so they were programmatically split into short clips based on silence and then recombined to random lengths such that each eventual audio file lies between 1 and 12.5 s. This was done using Python 3. The audio files were saved as single-channel, 16-bit PCM WAVE files with a sampling rate of 22.05 kHz. The dataset contains approximately 106,000 Kiswahili words, transcribed at a mean of 14.96 words per text file and saved in CSV format. Each text file is divided into three parts: a unique ID, the transcribed words, and the normalized words. The unique ID is a number assigned to each text file; the transcribed words are the text spoken by a reader; normalized texts expand abbreviations and numbers into full words. Each audio file split was assigned the same unique ID as its text file.
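    The recombination step described above (silence-split clips merged back into files of at most 12.5 s) can be sketched as a simple packer; the greedy strategy here is an illustrative assumption, since the authors describe combining clips to random lengths:

```python
def pack_clips(durations, max_len=12.5):
    """Greedily combine short clip durations (in seconds) so each
    output file stays within the 12.5 s ceiling described above."""
    files, current = [], 0.0
    for d in durations:
        if current + d > max_len and current > 0:
            files.append(current)
            current = 0.0
        current += d
    if current:
        files.append(current)
    return files

print(pack_clips([4.0, 5.0, 6.0, 2.0]))  # [9.0, 8.0]
```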

  14. Data from: Code4ML: a Large-scale Dataset of annotated Machine Learning Code...

    • zenodo.org
    csv
    Updated Sep 15, 2023
    Cite
    Anonymous authors; Anonymous authors (2023). Code4ML: a Large-scale Dataset of annotated Machine Learning Code [Dataset]. http://doi.org/10.5281/zenodo.6607065
    Explore at:
    csv. Available download formats
    Dataset updated
    Sep 15, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous authors
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We present Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle.

    The data is organized in a table structure. Code4ML includes several main objects: competition information, raw code blocks collected from Kaggle, and manually marked-up snippets. Each table is in .csv format.

    Each competition has the text description and metadata, reflecting competition and used dataset characteristics as well as evaluation metrics (competitions.csv). The corresponding datasets can be loaded using Kaggle API and data sources.

    The code blocks and their metadata are organized into data frames by the publishing year of the initial kernels. The current version of the corpus includes two code block files: snippets from kernels up to 2020 (сode_blocks_upto_20.csv) and those from 2021 (сode_blocks_21.csv), with corresponding metadata. The corpus consists of 2,743,615 ML code blocks collected from 107,524 Jupyter notebooks.

    Marked up code blocks have the following metadata: anonymized id, the format of the used data (for example, table or audio), the id of the semantic type, a flag for the code errors, the estimated relevance to the semantic class (from 1 to 5), the id of the parent notebook, and the name of the competition. The current version of the corpus has ~12 000 labeled snippets (markup_data_20220415.csv).

    As marked up code blocks data contains the numeric id of the code block semantic type, we also provide a mapping from this number to semantic type and subclass (actual_graph_2022-06-01.csv).

    The dataset can help solve various problems, including code synthesis from a prompt in natural language, code autocompletion, and semantic code classification.
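    Joining the marked-up snippets to the semantic-type mapping described above might look like this (the column names type_id and semantic_type are hypothetical, not the corpus schema):

```python
def attach_semantic_names(markup_rows, mapping_rows):
    """Attach human-readable semantic-type names to marked-up code
    blocks via their numeric type id (column names are hypothetical)."""
    id_to_name = {r["type_id"]: r["semantic_type"] for r in mapping_rows}
    return [dict(row, semantic_type=id_to_name.get(row["type_id"], "unknown"))
            for row in markup_rows]

# Illustrative rows, not real corpus content.
mapping = [{"type_id": "3", "semantic_type": "data_import"}]
snippets = [{"code_block": "pd.read_csv(...)", "type_id": "3"}]
print(attach_semantic_names(snippets, mapping))
```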

  15. Yektanet Persian Web Text Classification Dataset

    • opendatabay.com
    Updated Jul 6, 2025
    Cite
    Datasimple (2025). Yektanet Persian Web Text Classification Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/886a3949-9499-4647-9038-b7e8caa26cfc
    Explore at:
    Available download formats
    Dataset updated
    Jul 6, 2025
    Dataset authored and provided by
    Datasimple
    Area covered
    Data Science and Analytics
    Description

    The Yektanet Dataset is a real Persian web data collection, meticulously refined and gathered by the Yektanet platform. Its primary purpose is to serve as an industrial case study for applying machine learning in Natural Language Processing (NLP) [1]. This dataset enables the development of machine learning models capable of predicting the categorical topic of a document based on its text features, such as the title, description, and full text content [1]. It provides a valuable resource for training and evaluating machine learning models in document categorisation and topic prediction [1].

    Columns

    The dataset consists of multiple instances, each containing various features that provide information about the documents [1]. The main target variable is the category column, which indicates the topic or category of the content [1]. Additional features include: * Description: This column provides a description of the document [1]. * Text_content: This column holds the complete text content of the document [1]. * Title: This column represents the title of the document [1]. * h1 and h2: These columns contain content found within the HTML tags h1 and h2, respectively [1]. * URL: This column specifies the link address associated with the document [1]. * Domain: This column indicates the domain or website from which the document originates [1]. * Id: This column represents the unique identifier for each link [1].

    Distribution

    The Yektanet dataset comprises multiple instances, with approximately 5206 records based on the distribution of category labels [1, 2]. The dataset includes unique values for columns such as ID (4786 unique values), text content (4720 unique values), title (4614 unique values), and description (4399 unique values) [3]. Data files are typically provided in CSV format [4].

    Usage

    This dataset is ideally suited for developing and evaluating machine learning models for document categorisation and topic prediction tasks [1]. It can be used for applications involving Natural Language Processing (NLP), such as: * Training machine learning models to predict document topics [1]. * Developing text classification systems [1]. * Research into real-world web data analysis [1]. * Exploring feature engineering for NLP tasks [1].
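    Concatenating the textual columns listed above into one feature string for a topic classifier could be sketched as follows (lower-case dictionary keys are an assumption about the file's header, and the sample row is illustrative):

```python
def document_text(row):
    """Merge the document's textual fields into a single string that a
    bag-of-words or transformer classifier can consume."""
    parts = (row.get(k, "") for k in ("title", "h1", "h2", "description", "text_content"))
    return " ".join(p for p in parts if p)

# Hypothetical row with Persian content, mirroring the described columns.
doc = {"title": "سلامت", "text_content": "متن کامل سند", "h1": ""}
print(document_text(doc))
```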

    Coverage

    The Yektanet dataset is a real Persian web data collection [1]. Its region of coverage is global [5]. It includes content across various topics, with dominant categories such as 'سلامت' (health) at 13% and 'ورزش' (sports) at 11% [3]. The data availability is not restricted to specific groups or years beyond being a current web data collection [1].

    License

    CC By

    Who Can Use It

    The dataset is primarily intended for researchers and practitioners in the fields of machine learning and Natural Language Processing (NLP) [1]. Ideal users include data scientists, AI/ML engineers, academics, and anyone interested in document classification, topic modelling, or working with Persian text data [1].

    Dataset Name Suggestions

    • Yektanet Persian Web Text Classification Dataset
    • Persian Document Topic Prediction Data
    • Yektanet NLP Classification Corpus
    • Web Text Categorisation Dataset (Persian)
    • Yektanet Machine Learning Text Dataset

    Attributes

    Original Data Source: Yektanet (Dataset for Text Classification)

  16. Event-Dataset: Temporal information retrieval and text classification...

    • ieee-dataport.org
    Updated Nov 6, 2020
    Cite
    Muhammad Islam (2020). Event-Dataset: Temporal information retrieval and text classification dataset [Dataset]. https://ieee-dataport.org/documents/event-dataset-temporal-information-retrieval-and-text-classification-dataset
    Explore at:
    Dataset updated
    Nov 6, 2020
    Authors
    Muhammad Islam
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    2018

  17. Trained models for multi-task multi-dataset learning for text classification...

    • databank.illinois.edu
    Updated Aug 4, 2020
    Cite
    Shubhanshu Mishra (2020). Trained models for multi-task multi-dataset learning for text classification as well as sequence tagging in tweets [Dataset]. http://doi.org/10.13012/B2IDB-1094364_V1
    Explore at:
    Dataset updated
    Aug 4, 2020
    Authors
    Shubhanshu Mishra
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Trained models for multi-task multi-dataset learning for text classification as well as sequence tagging in tweets. Classification tasks include sentiment prediction, abusive content, sarcasm, and veridicality. Sequence tagging tasks include POS, NER, Chunking, and SuperSenseTagging. Models were trained using: https://github.com/socialmediaie/SocialMediaIE/blob/master/SocialMediaIE/scripts/multitask_multidataset_classification_tagging.py See https://github.com/socialmediaie/SocialMediaIE and https://socialmediaie.github.io for details. If you are using this data, please also cite the related article: Shubhanshu Mishra. 2019. Multi-dataset-multi-task Neural Sequence Tagging for Information Extraction from Tweets. In Proceedings of the 30th ACM Conference on Hypertext and Social Media (HT '19). ACM, New York, NY, USA, 283-284. DOI: https://doi.org/10.1145/3342220.3344929

  18. WELFake dataset for fake news detection in text data

    • zenodo.org
    csv
    Updated Apr 9, 2021
    Cite
    Pawan Kumar Verma; Prateek Agrawal; Radu Prodan (2021). WELFake dataset for fake news detection in text data [Dataset]. http://doi.org/10.5281/zenodo.4561253
    Explore at:
    csv. Available download formats
    Dataset updated
    Apr 9, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Pawan Kumar Verma; Prateek Agrawal; Radu Prodan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We designed a larger and more generic Word Embedding over Linguistic Features for Fake News Detection (WELFake) dataset of 72,134 news articles: 35,028 real and 37,106 fake. For this, we merged four popular news datasets (i.e. Kaggle, McIntire, Reuters, BuzzFeed Political) to prevent over-fitting of classifiers and to provide more text data for better ML training.

    The dataset contains four columns: Serial number (starting from 0); Title (the news headline); Text (the news content); and Label (0 = fake, 1 = real).

    The CSV file contains 78,098 entries, of which only 72,134 are accessible when loaded as a data frame.

    This dataset is a part of our ongoing research on "Fake News Prediction on Social Media Website" as a doctoral degree program of Mr. Pawan Kumar Verma and is partially supported by the ARTICONF project funded by the European Union’s Horizon 2020 research and innovation program.
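    As a sketch of how the four-column layout above might be consumed, the fake/real balance can be checked with pandas. The lowercase column names and the CSV file name in the comment are assumptions; check the actual header of the downloaded file.

    ```python
    import pandas as pd

    def label_balance(df, label_col="label"):
        """Count fake (0) and real (1) entries under the WELFake labeling."""
        counts = df[label_col].value_counts().to_dict()
        return {"fake": counts.get(0, 0), "real": counts.get(1, 0)}

    # Toy frame mimicking the schema; for the real data, load the CSV
    # with pd.read_csv(...) using the actual file name and column names.
    df = pd.DataFrame({
        "title": ["Verified report", "Hoax headline", "Fabricated story"],
        "text": ["...", "...", "..."],
        "label": [1, 0, 0],
    })
    print(label_balance(df))  # {'fake': 2, 'real': 1}
    ```
    
    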

  19. Processed twitter sentiment Dataset | Added Tokens

    • kaggle.com
    Updated Aug 21, 2024
    Cite
    Halemo GPA (2024). Processed twitter sentiment Dataset | Added Tokens [Dataset]. http://doi.org/10.34740/kaggle/ds/5568348
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 21, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Halemo GPA
    License

    Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    This dataset is a processed version of the Sentiment140 corpus, containing 1.6 million tweets with binary sentiment labels. The original data has been cleaned, tokenized, and prepared for natural language processing (NLP) and machine learning tasks. It provides a rich resource for sentiment analysis, text classification, and other NLP applications. The dataset includes the full processed corpus (train-processed.csv) and a smaller sample of 10,000 tweets (train-processed-sample.csv) for quick experimentation and model prototyping. Key Features:

    - 1.6 million labeled tweets
    - Binary sentiment classification (0 for negative, 1 for positive)
    - Preprocessed and tokenized text
    - Balanced class distribution
    - Suitable for various NLP tasks and model architectures

    Citation: If you use this dataset in your research or project, please cite the original Sentiment140 dataset: Go, A., Bhayani, R. and Huang, L., 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(2009), p.12.
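    For quick experimentation along the lines of the bundled 10,000-tweet sample, a balanced subsample can be drawn with pandas. This is a sketch on a toy corpus; the `sentiment` column name is an assumption, not necessarily the header used in train-processed.csv.

    ```python
    import pandas as pd

    def balanced_sample(df, n_per_class, label_col="sentiment", seed=0):
        """Draw the same number of tweets from each sentiment class."""
        return (df.groupby(label_col)
                  .sample(n=n_per_class, random_state=seed)
                  .reset_index(drop=True))

    # Toy corpus standing in for the full processed file.
    df = pd.DataFrame({
        "text": [f"tweet {i}" for i in range(10)],
        "sentiment": [0, 1] * 5,
    })
    sample = balanced_sample(df, n_per_class=2)
    print(len(sample))  # 4
    ```

    Because the full corpus is already balanced, sampling an equal count per class preserves the 50/50 label split at any subsample size.
    
    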

  20. Machine Learning Materials Datasets

    • figshare.com
    txt
    Updated Sep 11, 2018
    Cite
    Dane Morgan (2018). Machine Learning Materials Datasets [Dataset]. http://doi.org/10.6084/m9.figshare.7017254.v5
    Explore at:
    txt (available download format)
    Dataset updated
    Sep 11, 2018
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Dane Morgan
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Three datasets are intended for exploring machine learning applications in materials science. They are formatted simply, in particular for easy input into the MAterials Simulation Toolkit - Machine Learning (MAST-ML) package (see https://github.com/uw-cmg/MAST-ML). Each dataset is a materials property of interest with associated descriptors. For detailed information, please see the attached README text file.

    The first dataset, for dilute solute diffusion, can be used to predict an effective diffusion barrier for a solute element moving through another host element. The dataset was calculated with DFT methods.

    The second dataset, for perovskite stability, gives energies of compositions of potential perovskite materials relative to the convex hull, calculated with DFT. The perovskite dataset also includes columns with information about the A site, B site, and X site in the perovskite structure, in order to perform more advanced grouping of the data.

    The third dataset is a metallic glasses dataset with values of reduced glass transition temperature (Trg) for a variety of metallic alloys. An additional column gives the majority element for each alloy, which can be an interesting property to group on during tests.
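    The grouping on the majority-element column that the description suggests for the metallic glasses data might look as follows with pandas. Column names (`majority_element`, `Trg`) and the toy rows are assumptions for illustration, not the file's actual headers or values.

    ```python
    import pandas as pd

    def mean_trg_by_majority_element(df, group_col="majority_element",
                                     target_col="Trg"):
        """Average reduced glass transition temperature per majority element."""
        return df.groupby(group_col)[target_col].mean()

    # Toy rows standing in for the metallic glasses dataset.
    df = pd.DataFrame({
        "alloy": ["Cu55Zr45", "Cu60Zr40", "Fe80B20"],
        "majority_element": ["Cu", "Cu", "Fe"],
        "Trg": [0.62, 0.58, 0.55],
    })
    print(mean_trg_by_majority_element(df))
    ```

    The same pattern applies to the perovskite dataset's A-site, B-site, and X-site columns when grouping stability energies.
    
    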
