Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset contains both AI-generated and human-written essays for training purposes. The challenge is to develop a machine learning model that can accurately detect whether an essay was written by a student or by an LLM. The competition dataset comprises a mix of student-written essays and essays generated by a variety of LLMs.
The dataset contains more than 28,000 essays, both student-written and AI-generated.
Features:
1. text: the essay text.
2. generated: the target label. 0 - Human Written Essay, 1 - AI Generated Essay.
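A minimal baseline sketch for this task, assuming the data sits in a CSV file (the filename train_essays.csv is hypothetical) with the text and generated columns described above:

```python
# Baseline sketch: TF-IDF features + logistic regression for AI/human essay detection.
# The CSV filename is hypothetical; "text" and "generated" are the columns described above.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("train_essays.csv")  # columns: text, generated
X_train, X_val, y_train, y_val = train_test_split(
    df["text"], df["generated"], test_size=0.2, random_state=42, stratify=df["generated"]
)

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), max_features=50_000),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

# ROC AUC is a common metric for this kind of binary detection task.
val_scores = model.predict_proba(X_val)[:, 1]
print("Validation ROC AUC:", roc_auc_score(y_val, val_scores))
```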
Total-Text is a text detection dataset that consists of 1,555 images with a variety of text types including horizontal, multi-oriented, and curved text instances. The training split and testing split have 1,255 images and 300 images, respectively.
In NLP, text classification is one of the primary problems we try to solve, and its uses in language analysis are indisputable. The lack of labeled training data has made these tasks harder in low-resource languages like Amharic. Collecting, labeling, annotating, and making this kind of data valuable will encourage junior researchers, schools, and machine learning practitioners to apply existing classification models in their language. In this short paper, we introduce an Amharic text classification dataset that consists of more than 50k news articles categorized into 6 classes. The dataset is made available with simple baseline performances to encourage further studies and better-performing experiments.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
RCV1
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
AI & Human Generated Text
I am using this dataset for AI text detection for https://exnrt.com.
The original dataset's GitHub repository is here: https://github.com/panagiotisanagnostou/AI-GA
Description
The AI-GA dataset, short for Artificial Intelligence Generated Abstracts, comprises abstracts and titles. Half of these abstracts are generated by AI, while the remaining half are original. Primarily intended for research and experimentation in natural language… See the full description on the dataset page: https://huggingface.co/datasets/Ateeqq/AI-and-Human-Generated-Text.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset is a curated collection of fashion product images paired with their titles and descriptions, designed for training and fine-tuning multimodal AI models. Originally derived from Param Aggarwal's "Fashion Product Images Dataset", it has undergone extensive preprocessing to improve usability and efficiency.
Preprocessing steps include:
1. Resizing all images to a median size of 1080 x 1440 px, preserving their original aspect ratio.
2. Streamlining the reference CSV file to retain only essential fields: image file name, display name, product description, and category.
3. Removing redundant style JSON files to minimize dataset complexity.
These optimizations have reduced the dataset size by 73%, making it lighter and faster to use without compromising data quality. This refined dataset is ideal for research and applications in multimodal AI, including tasks like product recommendation, image-text matching, and domain-specific fine-tuning.
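A minimal sketch of the resizing step described above, assuming Pillow and local image folders (all paths are hypothetical):

```python
# Sketch: downscale images to fit within 1080 x 1440 px while preserving aspect ratio.
# This illustrates the preprocessing described above, not the exact script used to build the dataset.
from pathlib import Path
from PIL import Image

SRC = Path("images_raw")      # hypothetical input folder
DST = Path("images_resized")  # hypothetical output folder
DST.mkdir(exist_ok=True)

TARGET = (1080, 1440)  # median width x height reported for the dataset

for img_path in SRC.glob("*.jpg"):
    with Image.open(img_path) as img:
        # thumbnail() resizes in place, keeping the aspect ratio and never exceeding TARGET.
        img.thumbnail(TARGET, Image.LANCZOS)
        img.save(DST / img_path.name, quality=90)
```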
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The fastText-Trendi-Topics model is a text classification model for categorizing news texts with one of 13 topic labels. It was trained on a set of approx. 36,000 Slovene texts from various Slovene news sources included in the Trendi Monitor Corpus of Slovene (http://hdl.handle.net/11356/1590) such as "rtvslo.si", "sta.si", "delo.si", "dnevnik.si", "vecer.com", "24ur.com", "siol.net", "gorenjskiglas.si", etc.
The texts were semi-automatically categorized into 13 categories based on the sections under which they were published (i.e. URLs). The set of labels was developed in accordance with related categorization schemas used in other corpora and comprises the following topics: "črna kronika" (crime and accidents), "gospodarstvo, posel, finance" (economy, business, finance), "izobraževanje" (education), "okolje" (environment), "prosti čas" (free time), "šport" (sport), "umetnost, kultura" (art, culture), "vreme" (weather), "zabava" (entertainment), "zdravje" (health), "znanost in tehnologija" (science and technology), "politika" (politics), and "družba" (society). The categorization process is explained in more detail in Kosem et al. (2022): https://nl.ijs.si/jtdh22/pdf/JTDH2022_Kosem-et-al_Spremljevalni-korpus-Trendi.pdf
The model was trained on the labeled texts using the word embeddings CLARIN.SI-embed.sl 1.0 (http://hdl.handle.net/11356/1204) and validated on a development set of 1,293 texts using the fastText library, 1000 epochs, and default values for the rest of the hyperparameters (see https://github.com/TajaKuzman/FastText-Classification-SLED for the full code).
The model achieves a macro-F1-score of 0.85 on a test set of 1,295 texts (best for "vreme" at 0.97, worst for "prosti čas" at 0.67).
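For reference, a minimal training sketch with the fastText Python bindings, mirroring the setup described above (1000 epochs, pretrained CLARIN.SI-embed.sl vectors, default values otherwise). File names and the embedding dimension are assumptions; the exact training code is in the linked repository.

```python
# Sketch of supervised fastText training as described above; not the released training script.
# train.txt is assumed to hold one document per line in fastText format, with the topic label
# prefixed as "__label__", e.g.:  __label__šport  <text of the news article>
import fasttext

model = fasttext.train_supervised(
    input="train.txt",                  # hypothetical path to the labeled training texts
    epoch=1000,                         # as reported for the released model
    pretrainedVectors="embed.sl.vec",   # hypothetical path to the CLARIN.SI-embed.sl vectors
    dim=100,                            # must match the pretrained vector dimensionality (assumed here)
)

print(model.predict("<text of a news article>"))  # -> (('__label__...',), array([probability]))
model.save_model("fasttext-trendi-topics.bin")
```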
Please note that the SloBERTa-Trendi-Topics 1.0 text classification model (http://hdl.handle.net/11356/1709) is also available; it achieves higher classification accuracy but is slower and computationally more demanding.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
A curated dataset of 241,000+ English-language comments labeled for sentiment (negative, neutral, positive). Ideal for training and evaluating NLP models in sentiment analysis.
1. text: Contains individual English-language comments or posts sourced from various online platforms.
2. label: Represents the sentiment classification assigned to each comment. It uses the following encoding:
0 — Negative sentiment
1 — Neutral sentiment
2 — Positive sentiment
This dataset is ideal for a variety of applications:
1. Sentiment Analysis Model Training: Train machine learning or deep learning models to classify text as positive, negative, or neutral.
2. Text Classification Projects: Use as a labeled dataset for supervised learning in text classification tasks.
3. Customer Feedback Analysis: Train models to automatically interpret user reviews, support tickets, or survey responses.
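A small baseline sketch for sentiment classification with this dataset, assuming the data is available as a CSV (the filename is hypothetical) with the text and label columns described above:

```python
# Sketch: three-class sentiment baseline using the 0/1/2 label encoding described above.
# The filename and the split/metric choices here are illustrative.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

LABEL_NAMES = {0: "negative", 1: "neutral", 2: "positive"}

df = pd.read_csv("sentiment_comments.csv")  # columns: text, label
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.1, random_state=0, stratify=df["label"]
)

clf = make_pipeline(TfidfVectorizer(min_df=2), LinearSVC())
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
print(classification_report(y_test, pred, target_names=[LABEL_NAMES[i] for i in sorted(LABEL_NAMES)]))
```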
Geographic Coverage: Primarily English-language content from global online platforms
Time Range: The exact time range of data collection is unspecified; however, the dataset reflects contemporary online language patterns and sentiment trends typically observed in the 2010s to early 2020s.
Demographics: Specific demographic information (e.g., age, gender, location, industry) is not included in the dataset, as the focus is purely on textual sentiment rather than user profiling.
CC0
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Currently, text-driven generation models are booming in video editing with their compelling results. However, for face-centric text-to-video generation, challenges remain severe as a suitable dataset with high-quality videos and highly relevant texts is lacking. In this work, we present a large-scale, high-quality, and diverse facial text-video dataset, CelebV-Text, to facilitate research on facial text-to-video generation tasks. CelebV-Text contains 70,000 in-the-wild face video clips covering diverse visual content. Each video clip is paired with 20 texts generated by the proposed semi-automatic text generation strategy, which is able to describe both static and dynamic attributes precisely. We conduct a comprehensive statistical analysis of the videos, texts, and text-video relevance of CelebV-Text, verifying its superiority over other datasets. We also conduct extensive self-evaluations to show the effectiveness and potential of CelebV-Text. Furthermore, a benchmark is constructed with representative methods to standardize the evaluation of the facial text-to-video generation task.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The English & Chinese Special Angle Text Dataset contains images of text displayed at various angles and orientations in both English and Chinese. It includes text from sources like signs, advertisements, and documents that are not presented in standard horizontal formats. This dataset is used for training and evaluating text detection and recognition models, particularly those capable of handling text in non-traditional orientations and perspectives.
MassiveText is a collection of large English-language text datasets from multiple sources: web pages, books, news articles, and code. The data pipeline includes text quality filtering, removal of repetitious text, deduplication of similar documents, and removal of documents with significant test-set overlap. MassiveText contains 2.35 billion documents or about 10.5 TB of text.
Usage: Gopher is trained on 300B tokens (12.8% of the tokens in the dataset), so the authors sub-sample from MassiveText with sampling proportions specified per subset (books, news, etc.). These sampling proportions are tuned to maximize downstream performance. The largest sampling subset is the curated web-text corpus MassiveWeb, which is found to improve downstream performance relative to existing web-text datasets such as C4 (Raffel et al., 2020).
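A toy sketch of the per-subset sampling described above; the proportions and documents below are made-up placeholders, not the values used for Gopher.

```python
# Illustration only: draw training documents from named subsets according to fixed proportions.
# Subset names echo those mentioned above; the proportions and documents are placeholders.
import random

subsets = {
    "massiveweb": ["web doc 1", "web doc 2"],
    "books": ["book doc 1", "book doc 2"],
    "news": ["news doc 1"],
    "code": ["code doc 1"],
}
proportions = {"massiveweb": 0.5, "books": 0.3, "news": 0.1, "code": 0.1}  # placeholder values

names = list(proportions)
weights = [proportions[n] for n in names]

def sample_document(rng: random.Random) -> str:
    """Pick a subset according to its sampling proportion, then a document from it."""
    subset = rng.choices(names, weights=weights, k=1)[0]
    return rng.choice(subsets[subset])

rng = random.Random(0)
print([sample_document(rng) for _ in range(5)])
```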
Find Datasheets in the Gopher paper.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for AI Text Detection Pile
Dataset Summary
This is a large-scale dataset intended for AI text detection tasks, geared toward long-form text and essays. It contains samples of both human-written text and AI-generated text from GPT-2, GPT-3, ChatGPT, and GPT-J. Here is the (tentative) breakdown:
Human Text
| Dataset | Num Samples | Link |
| --- | --- | --- |
| Reddit WritingPrompts | 570k | Link |
| OpenAI WebText | 260k | Link |
| HC3 (Human Responses) | 58k | Link |
| ivypanda-essays | TODO | TODO |

… See the full description on the dataset page: https://huggingface.co/datasets/artem9k/ai-text-detection-pile.
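A minimal sketch for loading this dataset with the Hugging Face datasets library; the dataset ID is taken from the URL above, and the splits and column names should be inspected before use.

```python
# Sketch: load the AI text detection pile from the Hugging Face Hub.
# The dataset ID comes from the URL above; inspect splits and features before relying on them.
from datasets import load_dataset

ds = load_dataset("artem9k/ai-text-detection-pile")
print(ds)  # shows available splits, row counts, and column names
```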
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This is the readme for the supplemental data for our ICDAR 2019 paper.
You can read our paper via IEEE here: https://ieeexplore.ieee.org/document/8978202
If you found this dataset useful, please consider citing our paper:
@inproceedings{DBLP:conf/icdar/MorrisTE19,
author = {David Morris and
Peichen Tang and
Ralph Ewerth},
title = {A Neural Approach for Text Extraction from Scholarly Figures},
booktitle = {2019 International Conference on Document Analysis and Recognition,
{ICDAR} 2019, Sydney, Australia, September 20-25, 2019},
pages = {1438--1443},
publisher = {{IEEE}},
year = {2019},
url = {https://doi.org/10.1109/ICDAR.2019.00231},
doi = {10.1109/ICDAR.2019.00231},
timestamp = {Tue, 04 Feb 2020 13:28:39 +0100},
biburl = {https://dblp.org/rec/conf/icdar/MorrisTE19.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
This work was financially supported by the German Federal Ministry of Education and Research (BMBF) and European Social Fund (ESF) (InclusiveOCW project, no. 01PE17004).
We used different sources of data for testing, validation, and training. Our testing set was assembled from the work by Böschen et al. that we cited. We excluded the DeGruyter dataset from it and used it as our validation dataset instead.
These datasets contain a readme with license information. Further information about the associated project can be found in the authors' published work we cited: https://doi.org/10.1007/978-3-319-51811-4_2
The DeGruyter dataset does not include the labeled images due to license restrictions. As of writing, the images can still be downloaded from DeGruyter via the links in the readme. Note that depending on what program you use to strip the images out of the PDF they are provided in, you may have to re-number the images.
We used label_generator's generated dataset, which the author made available on a requester-pays amazon s3 bucket. We also used the Multi-Type Web Images dataset, which is mirrored here.
We have made our code available in code.zip. We will upload code, announce further news, and field questions via the GitHub repo.
Our text detection network is adapted from Argman's EAST implementation. The EAST/checkpoints/ours subdirectory contains the trained weights we used in the paper.
We used a Tesseract script to run text extraction on the detected text rows. This is inside our code archive (code.tar) as text_recognition_multipro.py.
We used a Java program provided by Falk Böschen, adapted to our file structure. We included this as evaluator.jar.
Parameter sweeps are automated by param_sweep.rb. This file also shows how to invoke all of these components.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A set of 3 datasets for Hierarchical Text Classification (HTC), with samples divided into training and testing splits. The hierarchies of labels within all datasets have depth 2.
Datasets are published in JSONL format, where each line is a JSON object structured as in the schematic example below (placeholder values shown in angle brackets):
{ "text": "<document text>", "labels": ["<level-1 label>", "<level-2 label>"] }
The hierarchical structure of labels in each dataset is documented in this repository.
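A small sketch of reading one of these JSONL files in Python (the filename is hypothetical):

```python
# Sketch: read an HTC dataset file in JSONL format, one JSON object per line.
# The filename is hypothetical; "text" and "labels" follow the schema sketched above.
import json

with open("train.jsonl", encoding="utf-8") as f:
    samples = [json.loads(line) for line in f if line.strip()]

print(len(samples), "samples")
print(samples[0]["text"][:80], samples[0]["labels"])
```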
These datasets have been presented in this paper:
Some of these datasets have also been used in:
These datasets are partially derived from previous work, namely:
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for my-distiset-b845cf19
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/davidberenstein1957/my-distiset-b845cf19/raw/main/pipeline.yaml"
or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/argilla/synthetic-domain-text-classification.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This is a large multilingual toxicity dataset with 3M rows of text data from 55 natural languages, all of which are written/sent by humans, not machine translation models. The preprocessed training data alone consists of 2,880,667 rows of comments, tweets, and messages. Among these rows, 416,529 are classified as toxic, while the remaining 2,463,773 are considered neutral. Below is a table to illustrate the data composition:
| | Toxic | Neutral | Total |
| --- | --- | --- | --- |
| multilingual-train-deduplicated.csv | 416,529 | 2,463,773 | 2,880,667 |

… See the full description on the dataset page: https://huggingface.co/datasets/FredZhang7/toxi-text-3M.
WebText is an internal OpenAI corpus created by scraping web pages with an emphasis on document quality. The authors scraped all outbound links from Reddit that received at least 3 karma, using this threshold as a heuristic indicator of whether other users found the link interesting, educational, or just funny.
WebText contains the text subset of these 45 million links. It consists of over 8 million documents for a total of 40 GB of text. All Wikipedia documents were removed from WebText since it is a common data source for other datasets.
This data asset contains data files of text extracted from PDF reports on the Development Experience Clearinghouse (DEC) for the years 2011 to 2021 (as of July 2021). It includes three specific "Document types" identified by the DEC: Final Contractor/Grantee Report, Final Evaluation Report, and Special Evaluation. Each PDF document labeled as one of these three document types and labeled with a publication year from 2011 to 2021 was downloaded from the DEC in July 2021. The dataset includes text data files from 2,579 Final Contractor/Grantee Reports, 1,299 Final Evaluation reports, and 1,323 Special Evaluation reports. Raw text from each of these PDFs was extracted and saved as individual csv files, the names of which correspond to the Document ID of the PDF document on the DEC. Within each csv file, the raw text is split into paragraphs and corresponding sentences. In addition, to enable Natural Language Processing of the data, the sentences are cleaned by removing unnecessary special characters, punctuation, and numbers, and each word is stemmed to its root to remove inflections (e.g. pluralization and conjugation). This data could be used to analyze trends in USAID's programming approaches and terminology. This data was compiled for USAID/PPL/LER with the Program Cycle Mechanism.
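A rough sketch of the kind of cleaning and stemming described above; the regex rules and the Porter stemmer here are illustrative choices, not necessarily the exact pipeline used to produce the files.

```python
# Sketch: strip special characters, punctuation, and numbers, then stem each word to its root.
# This illustrates the preprocessing described above; the actual rules used for the dataset may differ.
import re

from nltk.stem import PorterStemmer  # pip install nltk

stemmer = PorterStemmer()

def clean_and_stem(sentence: str) -> str:
    # Keep letters and whitespace only, lowercase, then collapse runs of whitespace.
    text = re.sub(r"[^A-Za-z\s]", " ", sentence).lower()
    tokens = text.split()
    # Stem each token to remove inflections such as pluralization and conjugation.
    return " ".join(stemmer.stem(tok) for tok in tokens)

print(clean_and_stem("USAID funded 3 programs, focusing on evaluations and training."))
```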
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
DataSET For Text Localization is a dataset for object detection tasks - it contains Text annotations for 386 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Text is prevalent in natural scenes around us in the form of road signs, billboards, house numbers and place names. Text labels are also an integral part of cadastral maps and floor plans. Extracting this text can provide additional context and details about the places the text describes and the information it conveys. This deep learning model is based on the PaddleOCR model and uses optical character recognition (OCR) technology to detect text in images. This model was trained on a large dataset of different types and styles of text with diverse backgrounds and contexts, allowing for precise text extraction. It can be applied to various tasks such as automatically detecting and reading text from billboards, sign boards, scanned maps, etc., thereby converting images containing text to actionable data.
Using the model: Follow the guide to use the model. Before using this model, ensure that the supported deep learning libraries are installed. For more details, check Deep Learning Libraries Installer for ArcGIS. The PaddleOCR library is additionally required by this model. PaddleOCR can be installed using the following command in the ArcGIS Python Command Prompt: conda install paddleocr -c esri
Fine-tuning the model: This model cannot be fine-tuned using ArcGIS tools.
Input: High-resolution, 3-band street-level imagery/oriented imagery or scanned maps, with medium to large size text.
Output: A feature layer with the recognized text and a bounding box around it.
Model architecture: This model is based on the open-source PaddleOCR model by PaddlePaddle.
Sample results: Here are a few results from the model.
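For reference, a minimal sketch of the underlying open-source PaddleOCR library; this illustrates the library itself, not the ArcGIS workflow, and the image path is hypothetical.

```python
# Sketch: run text detection + recognition with the open-source PaddleOCR library.
# This is generic PaddleOCR usage, not the ArcGIS deep learning package workflow.
from paddleocr import PaddleOCR  # installed via: conda install paddleocr -c esri (or pip install paddleocr)

ocr = PaddleOCR(use_angle_cls=True, lang="en")  # angle classifier helps with rotated text
result = ocr.ocr("street_scene.jpg")            # hypothetical image path

for box, (text, confidence) in result[0]:
    print(text, round(confidence, 3), box)
```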