Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset contains both AI-generated and human-written essays for training purposes. The challenge is to develop a machine learning model that can accurately detect whether an essay was written by a student or by an LLM. The competition dataset comprises a mix of student-written essays and essays generated by a variety of LLMs.
The dataset contains more than 28,000 essays, both student-written and AI-generated.
Features:
1. text: the essay text.
2. generated: the target label. 0 - Human Written Essay, 1 - AI Generated Essay.
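A minimal baseline sketch for this task, assuming the data sits in a CSV file (the filename train_essays.csv is hypothetical) with the text and generated columns described above:

```python
# Baseline sketch: TF-IDF features + logistic regression for AI/human essay detection.
# The CSV filename is hypothetical; "text" and "generated" are the columns described above.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("train_essays.csv")  # columns: text, generated
X_train, X_val, y_train, y_val = train_test_split(
    df["text"], df["generated"], test_size=0.2, random_state=42, stratify=df["generated"]
)

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), max_features=50_000),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

# ROC AUC is a common metric for this kind of binary detection task.
val_scores = model.predict_proba(X_val)[:, 1]
print("Validation ROC AUC:", roc_auc_score(y_val, val_scores))
```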
Total-Text is a text detection dataset that consists of 1,555 images with a variety of text types including horizontal, multi-oriented, and curved text instances. The training split and testing split have 1,255 images and 300 images, respectively.
In NLP, text classification is one of the primary problems we try to solve, and its uses in language analysis are indisputable. The lack of labeled training data has made these tasks harder in low-resource languages like Amharic. Collecting, labeling, annotating, and making this kind of data valuable will encourage junior researchers, schools, and machine learning practitioners to apply existing classification models in their language. In this short paper, we introduce an Amharic text classification dataset that consists of more than 50k news articles categorized into 6 classes. The dataset is made available with simple baseline performances to encourage further studies and better-performing experiments.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
RCV1
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
AI & Human Generated Text
I am using this dataset for AI text detection for https://exnrt.com.
The original dataset's GitHub repository is here: https://github.com/panagiotisanagnostou/AI-GA
Description
The AI-GA dataset, short for Artificial Intelligence Generated Abstracts, comprises abstracts and titles. Half of these abstracts are generated by AI, while the remaining half are original. Primarily intended for research and experimentation in natural language… See the full description on the dataset page: https://huggingface.co/datasets/Ateeqq/AI-and-Human-Generated-Text.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset is a curated collection of fashion product images paired with their titles and descriptions, designed for training and fine-tuning multimodal AI models. Originally derived from Param Aggarwal's "Fashion Product Images Dataset", it has undergone extensive preprocessing to improve usability and efficiency.
Preprocessing steps include:
1. Resizing all images to a median size of 1080 x 1440 px, preserving their original aspect ratio.
2. Streamlining the reference CSV file to retain only essential fields: image file name, display name, product description, and category.
3. Removing redundant style JSON files to minimize dataset complexity.
These optimizations have reduced the dataset size by 73%, making it lighter and faster to use without compromising data quality. This refined dataset is ideal for research and applications in multimodal AI, including tasks like product recommendation, image-text matching, and domain-specific fine-tuning.
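A minimal sketch of the resizing step described above, assuming Pillow and local image folders (all paths are hypothetical):

```python
# Sketch: downscale images to fit within 1080 x 1440 px while preserving aspect ratio.
# This illustrates the preprocessing described above, not the exact script used to build the dataset.
from pathlib import Path
from PIL import Image

SRC = Path("images_raw")      # hypothetical input folder
DST = Path("images_resized")  # hypothetical output folder
DST.mkdir(exist_ok=True)

TARGET = (1080, 1440)  # median width x height reported for the dataset

for img_path in SRC.glob("*.jpg"):
    with Image.open(img_path) as img:
        # thumbnail() resizes in place, keeping the aspect ratio and never exceeding TARGET.
        img.thumbnail(TARGET, Image.LANCZOS)
        img.save(DST / img_path.name, quality=90)
```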
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The fastText-Trendi-Topics model is a text classification model for categorizing news texts with one of 13 topic labels. It was trained on a set of approx. 36,000 Slovene texts from various Slovene news sources included in the Trendi Monitor Corpus of Slovene (http://hdl.handle.net/11356/1590) such as "rtvslo.si", "sta.si", "delo.si", "dnevnik.si", "vecer.com", "24ur.com", "siol.net", "gorenjskiglas.si", etc.
The texts were semi-automatically categorized into 13 categories based on the sections under which they were published (i.e. URLs). The set of labels was developed in accordance with related categorization schemas used in other corpora and comprises the following topics: "črna kronika" (crime and accidents), "gospodarstvo, posel, finance" (economy, business, finance), "izobraževanje" (education), "okolje" (environment), "prosti čas" (free time), "šport" (sport), "umetnost, kultura" (art, culture), "vreme" (weather), "zabava" (entertainment), "zdravje" (health), "znanost in tehnologija" (science and technology), "politika" (politics), and "družba" (society). The categorization process is explained in more detail in Kosem et al. (2022): https://nl.ijs.si/jtdh22/pdf/JTDH2022_Kosem-et-al_Spremljevalni-korpus-Trendi.pdf
The model was trained on the labeled texts using the word embeddings CLARIN.SI-embed.sl 1.0 (http://hdl.handle.net/11356/1204) and validated on a development set of 1,293 texts using the fastText library, 1000 epochs, and default values for the rest of the hyperparameters (see https://github.com/TajaKuzman/FastText-Classification-SLED for the full code).
The model achieves a macro-F1-score of 0.85 on a test set of 1,295 texts (best for "vreme" at 0.97, worst for "prosti čas" at 0.67).
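For reference, a minimal training sketch with the fastText Python bindings, mirroring the setup described above (1000 epochs, pretrained CLARIN.SI-embed.sl vectors, default values otherwise). File names and the embedding dimension are assumptions; the exact training code is in the linked repository.

```python
# Sketch of supervised fastText training as described above; not the released training script.
# train.txt is assumed to hold one document per line in fastText format, with the topic label
# prefixed as "__label__", e.g.:  __label__šport  <text of the news article>
import fasttext

model = fasttext.train_supervised(
    input="train.txt",                  # hypothetical path to the labeled training texts
    epoch=1000,                         # as reported for the released model
    pretrainedVectors="embed.sl.vec",   # hypothetical path to the CLARIN.SI-embed.sl vectors
    dim=100,                            # must match the pretrained vector dimensionality (assumed here)
)

print(model.predict("<text of a news article>"))  # -> (('__label__...',), array([probability]))
model.save_model("fasttext-trendi-topics.bin")
```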
Please note that the SloBERTa-Trendi-Topics 1.0 text classification model (http://hdl.handle.net/11356/1709) is also available; it achieves higher classification accuracy but is slower and computationally more demanding.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
A curated dataset of 241,000+ English-language comments labeled for sentiment (negative, neutral, positive). Ideal for training and evaluating NLP models in sentiment analysis.
1. text: Contains individual English-language comments or posts sourced from various online platforms.
2. label: Represents the sentiment classification assigned to each comment. It uses the following encoding:
0 — Negative sentiment
1 — Neutral sentiment
2 — Positive sentiment
This dataset is ideal for a variety of applications:
1. Sentiment Analysis Model Training: Train machine learning or deep learning models to classify text as positive, negative, or neutral.
2. Text Classification Projects: Use as a labeled dataset for supervised learning in text classification tasks.
3. Customer Feedback Analysis: Train models to automatically interpret user reviews, support tickets, or survey responses.
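A small baseline sketch for sentiment classification with this dataset, assuming the data is available as a CSV (the filename is hypothetical) with the text and label columns described above:

```python
# Sketch: three-class sentiment baseline using the 0/1/2 label encoding described above.
# The filename and the split/metric choices here are illustrative.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

LABEL_NAMES = {0: "negative", 1: "neutral", 2: "positive"}

df = pd.read_csv("sentiment_comments.csv")  # columns: text, label
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.1, random_state=0, stratify=df["label"]
)

clf = make_pipeline(TfidfVectorizer(min_df=2), LinearSVC())
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
print(classification_report(y_test, pred, target_names=[LABEL_NAMES[i] for i in sorted(LABEL_NAMES)]))
```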
Geographic Coverage: Primarily English-language content from global online platforms
Time Range: The exact time range of data collection is unspecified; however, the dataset reflects contemporary online language patterns and sentiment trends typically observed in the 2010s to early 2020s.
Demographics: Specific demographic information (e.g., age, gender, location, industry) is not included in the dataset, as the focus is purely on textual sentiment rather than user profiling.
CC0
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Currently, text-driven generation models are booming in video editing with their compelling results. However, for face-centric text-to-video generation, challenges remain severe as a suitable dataset with high-quality videos and highly relevant texts is lacking. In this work, we present a large-scale, high-quality, and diverse facial text-video dataset, CelebV-Text, to facilitate research on facial text-to-video generation tasks. CelebV-Text contains 70,000 in-the-wild face video clips covering diverse visual content. Each video clip is paired with 20 texts generated by the proposed semi-automatic text generation strategy, which is able to describe both static and dynamic attributes precisely. We conduct a comprehensive statistical analysis of the videos, texts, and text-video relevance of CelebV-Text, verifying its superiority over other datasets. We also conduct extensive self-evaluations to show the effectiveness and potential of CelebV-Text. Furthermore, a benchmark is constructed with representative methods to standardize the evaluation of the facial text-to-video generation task.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The English & Chinese Special Angle Text Dataset contains images of text displayed at various angles and orientations in both English and Chinese. It includes text from sources like signs, advertisements, and documents that are not presented in standard horizontal formats. This dataset is used for training and evaluating text detection and recognition models, particularly those capable of handling text in non-traditional orientations and perspectives.
MassiveText is a collection of large English-language text datasets from multiple sources: web pages, books, news articles, and code. The data pipeline includes text quality filtering, removal of repetitious text, deduplication of similar documents, and removal of documents with significant test-set overlap. MassiveText contains 2.35 billion documents or about 10.5 TB of text.
Usage: Gopher is trained on 300B tokens (12.8% of the tokens in the dataset), so the authors sub-sample from MassiveText with sampling proportions specified per subset (books, news, etc.). These sampling proportions are tuned to maximize downstream performance. The largest sampling subset is the curated web-text corpus MassiveWeb, which is found to improve downstream performance relative to existing web-text datasets such as C4 (Raffel et al., 2020).
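A toy sketch of the per-subset sampling described above; the proportions and documents below are made-up placeholders, not the values used for Gopher.

```python
# Illustration only: draw training documents from named subsets according to fixed proportions.
# Subset names echo those mentioned above; the proportions and documents are placeholders.
import random

subsets = {
    "massiveweb": ["web doc 1", "web doc 2"],
    "books": ["book doc 1", "book doc 2"],
    "news": ["news doc 1"],
    "code": ["code doc 1"],
}
proportions = {"massiveweb": 0.5, "books": 0.3, "news": 0.1, "code": 0.1}  # placeholder values

names = list(proportions)
weights = [proportions[n] for n in names]

def sample_document(rng: random.Random) -> str:
    """Pick a subset according to its sampling proportion, then a document from it."""
    subset = rng.choices(names, weights=weights, k=1)[0]
    return rng.choice(subsets[subset])

rng = random.Random(0)
print([sample_document(rng) for _ in range(5)])
```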
Find Datasheets in the Gopher paper.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for AI Text Detection Pile
Dataset Summary
This is a large-scale dataset intended for AI text detection tasks, geared toward long-form text and essays. It contains samples of both human-written text and AI-generated text from GPT-2, GPT-3, ChatGPT, and GPT-J. Here is the (tentative) breakdown:
Human Text
| Dataset | Num Samples | Link |
| --- | --- | --- |
| Reddit WritingPrompts | 570k | Link |
| OpenAI WebText | 260k | Link |
| HC3 (Human Responses) | 58k | Link |
| ivypanda-essays | TODO | TODO |

… See the full description on the dataset page: https://huggingface.co/datasets/artem9k/ai-text-detection-pile.
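A minimal sketch for loading this dataset with the Hugging Face datasets library; the dataset ID is taken from the URL above, and the splits and column names should be inspected before use.

```python
# Sketch: load the AI text detection pile from the Hugging Face Hub.
# The dataset ID comes from the URL above; inspect splits and features before relying on them.
from datasets import load_dataset

ds = load_dataset("artem9k/ai-text-detection-pile")
print(ds)  # shows available splits, row counts, and column names
```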
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This is the readme for the supplemental data for our ICDAR 2019 paper.
You can read our paper via IEEE here: https://ieeexplore.ieee.org/document/8978202
If you found this dataset useful, please consider citing our paper:
@inproceedings{DBLP:conf/icdar/MorrisTE19,
author = {David Morris and
Peichen Tang and
Ralph Ewerth},
title = {A Neural Approach for Text Extraction from Scholarly Figures},
booktitle = {2019 International Conference on Document Analysis and Recognition,
{ICDAR} 2019, Sydney, Australia, September 20-25, 2019},
pages = {1438--1443},
publisher = {{IEEE}},
year = {2019},
url = {https://doi.org/10.1109/ICDAR.2019.00231},
doi = {10.1109/ICDAR.2019.00231},
timestamp = {Tue, 04 Feb 2020 13:28:39 +0100},
biburl = {https://dblp.org/rec/conf/icdar/MorrisTE19.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
This work was financially supported by the German Federal Ministry of Education and Research (BMBF) and European Social Fund (ESF) (InclusiveOCW project, no. 01PE17004).
We used different sources of data for testing, validation, and training. Our testing set was assembled from the work by Böschen et al. that we cited. We excluded the DeGruyter dataset from it and used it as our validation dataset instead.
These datasets contain a readme with license information. Further information about the associated project can be found in the authors' published work we cited: https://doi.org/10.1007/978-3-319-51811-4_2
The DeGruyter dataset does not include the labeled images due to license restrictions. As of writing, the images can still be downloaded from DeGruyter via the links in the readme. Note that depending on what program you use to strip the images out of the PDF they are provided in, you may have to re-number the images.
We used label_generator's generated dataset, which the author made available on a requester-pays amazon s3 bucket. We also used the Multi-Type Web Images dataset, which is mirrored here.
We have made our code available in code.zip. We will upload code, announce further news, and field questions via the GitHub repo.
Our text detection network is adapted from Argman's EAST implementation. The EAST/checkpoints/ours subdirectory contains the trained weights we used in the paper.
We used a Tesseract script to run text extraction on the detected text rows. This is inside our code archive (code.tar) as text_recognition_multipro.py.
We used a Java program provided by Falk Böschen, adapted to our file structure. We included this as evaluator.jar.
Parameter sweeps are automated by param_sweep.rb. This file also shows how to invoke all of these components.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A set of 3 datasets for Hierarchical Text Classification (HTC), with samples divided into training and testing splits. The hierarchies of labels within all datasets have depth 2.
Datasets are published in JSONL format, where each line is a JSON object structured as in the schematic example below (placeholder values shown in angle brackets):
{ "text": "<document text>", "labels": ["<level-1 label>", "<level-2 label>"] }
The hierarchical structure of labels in each dataset is documented in this repository.
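A small sketch of reading one of these JSONL files in Python (the filename is hypothetical):

```python
# Sketch: read an HTC dataset file in JSONL format, one JSON object per line.
# The filename is hypothetical; "text" and "labels" follow the schema sketched above.
import json

with open("train.jsonl", encoding="utf-8") as f:
    samples = [json.loads(line) for line in f if line.strip()]

print(len(samples), "samples")
print(samples[0]["text"][:80], samples[0]["labels"])
```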
These datasets have been presented in this paper:
Some of these datasets have also been used in:
These datasets are partially derived from previous work, namely:
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for my-distiset-b845cf19
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/davidberenstein1957/my-distiset-b845cf19/raw/main/pipeline.yaml"
or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/argilla/synthetic-domain-text-classification.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This is a large multilingual toxicity dataset with 3M rows of text data from 55 natural languages, all of which are written/sent by humans, not machine translation models. The preprocessed training data alone consists of 2,880,667 rows of comments, tweets, and messages. Among these rows, 416,529 are classified as toxic, while the remaining 2,463,773 are considered neutral. Below is a table to illustrate the data composition:
| | Toxic | Neutral | Total |
| --- | --- | --- | --- |
| multilingual-train-deduplicated.csv | 416,529 | 2,463,773 | 2,880,667 |

… See the full description on the dataset page: https://huggingface.co/datasets/FredZhang7/toxi-text-3M.
WebText is an internal OpenAI corpus created by scraping web pages with an emphasis on document quality. The authors scraped all outbound links from Reddit that received at least 3 karma, using this threshold as a heuristic indicator of whether other users found the link interesting, educational, or just funny.
WebText contains the text subset of these 45 million links. It consists of over 8 million documents for a total of 40 GB of text. All Wikipedia documents were removed from WebText since it is a common data source for other datasets.
This data asset contains data files of text extracted from PDF reports on the Development Experience Clearinghouse (DEC) for the years 2011 to 2021 (as of July 2021). It includes three specific "Document types" identified by the DEC: Final Contractor/Grantee Report, Final Evaluation Report, and Special Evaluation. Each PDF document labeled as one of these three document types and labeled with a publication year from 2011 to 2021 was downloaded from the DEC in July 2021. The dataset includes text data files from 2,579 Final Contractor/Grantee Reports, 1,299 Final Evaluation reports, and 1,323 Special Evaluation reports. Raw text from each of these PDFs was extracted and saved as individual csv files, the names of which correspond to the Document ID of the PDF document on the DEC. Within each csv file, the raw text is split into paragraphs and corresponding sentences. In addition, to enable Natural Language Processing of the data, the sentences are cleaned by removing unnecessary special characters, punctuation, and numbers, and each word is stemmed to its root to remove inflections (e.g. pluralization and conjugation). This data could be used to analyze trends in USAID's programming approaches and terminology. This data was compiled for USAID/PPL/LER with the Program Cycle Mechanism.
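A rough sketch of the kind of cleaning and stemming described above; the regex rules and the Porter stemmer here are illustrative choices, not necessarily the exact pipeline used to produce the files.

```python
# Sketch: strip special characters, punctuation, and numbers, then stem each word to its root.
# This illustrates the preprocessing described above; the actual rules used for the dataset may differ.
import re

from nltk.stem import PorterStemmer  # pip install nltk

stemmer = PorterStemmer()

def clean_and_stem(sentence: str) -> str:
    # Keep letters and whitespace only, lowercase, then collapse runs of whitespace.
    text = re.sub(r"[^A-Za-z\s]", " ", sentence).lower()
    tokens = text.split()
    # Stem each token to remove inflections such as pluralization and conjugation.
    return " ".join(stemmer.stem(tok) for tok in tokens)

print(clean_and_stem("USAID funded 3 programs, focusing on evaluations and training."))
```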
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
DataSET For Text Localization is a dataset for object detection tasks - it contains Text annotations for 386 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Text is prevalent in natural scenes around us in the form of road signs, billboards, house numbers and place names. Text labels are also an integral part of cadastral maps and floor plans. Extracting this text can provide additional context and details about the places the text describes and the information it conveys. This deep learning model is based on the PaddleOCR model and uses optical character recognition (OCR) technology to detect text in images. This model was trained on a large dataset of different types and styles of text with diverse backgrounds and contexts, allowing for precise text extraction. It can be applied to various tasks such as automatically detecting and reading text from billboards, sign boards, scanned maps, etc., thereby converting images containing text to actionable data.
Using the model: Follow the guide to use the model. Before using this model, ensure that the supported deep learning libraries are installed. For more details, check Deep Learning Libraries Installer for ArcGIS. The PaddleOCR library is additionally required by this model. PaddleOCR can be installed using the following command in the ArcGIS Python Command Prompt: conda install paddleocr -c esri
Fine-tuning the model: This model cannot be fine-tuned using ArcGIS tools.
Input: High-resolution, 3-band street-level imagery/oriented imagery or scanned maps, with medium to large size text.
Output: A feature layer with the recognized text and a bounding box around it.
Model architecture: This model is based on the open-source PaddleOCR model by PaddlePaddle.
Sample results: Here are a few results from the model.
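For reference, a minimal sketch of the underlying open-source PaddleOCR library; this illustrates the library itself, not the ArcGIS workflow, and the image path is hypothetical.

```python
# Sketch: run text detection + recognition with the open-source PaddleOCR library.
# This is generic PaddleOCR usage, not the ArcGIS deep learning package workflow.
from paddleocr import PaddleOCR  # installed via: conda install paddleocr -c esri (or pip install paddleocr)

ocr = PaddleOCR(use_angle_cls=True, lang="en")  # angle classifier helps with rotated text
result = ocr.ocr("street_scene.jpg")            # hypothetical image path

for box, (text, confidence) in result[0]:
    print(text, round(confidence, 3), box)
```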