100+ datasets found
  1. LLM - Detect AI Generated Text Dataset

    • kaggle.com
    Updated Nov 8, 2023
    Cite
    sunil thite (2023). LLM - Detect AI Generated Text Dataset [Dataset]. https://www.kaggle.com/datasets/sunilthite/llm-detect-ai-generated-text-dataset
    Explore at:
    Croissant — a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    sunil thite
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This dataset contains both AI-generated and human-written essays for training purposes. The challenge is to develop a machine learning model that can accurately detect whether an essay was written by a student or by an LLM. The competition dataset comprises a mix of student-written essays and essays generated by a variety of LLMs.

    The dataset contains more than 28,000 essays, both student-written and AI-generated.

    Features:
    1. text: the essay text
    2. generated: the target label (0 = human-written essay, 1 = AI-generated essay)
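    The two columns described above can be read with nothing beyond the standard library. A minimal sketch (the function names and the idea of passing CSV text directly are illustrative, not part of the dataset):

```python
import csv
import io

def load_essays(csv_text):
    """Parse the dataset CSV (columns: text, generated) into a list of dicts.

    `generated` is read as an int: 0 = human-written, 1 = AI-generated.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{"text": row["text"], "generated": int(row["generated"])} for row in reader]

def label_counts(rows):
    """Count human (0) vs. AI-generated (1) essays."""
    counts = {0: 0, 1: 0}
    for row in rows:
        counts[row["generated"]] += 1
    return counts
```

    In practice you would pass the contents of the downloaded CSV file; the same two-column shape also drops straight into most text-classification pipelines.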

  2. Total-Text Dataset

    • paperswithcode.com
    • datasetninja.com
    • +2more
    Updated Sep 6, 2023
    Cite
    Chee Kheng Chng; Chee Seng Chan (2023). Total-Text Dataset [Dataset]. https://paperswithcode.com/dataset/total-text
    Explore at:
    Dataset updated
    Sep 6, 2023
    Authors
    Chee Kheng Chng; Chee Seng Chan
    Description

    Total-Text is a text detection dataset that consists of 1,555 images with a variety of text types including horizontal, multi-oriented, and curved text instances. The training split and testing split have 1,255 images and 300 images, respectively.

  3. An Amharic News Text classification Dataset

    • paperswithcode.com
    Updated Mar 9, 2021
    + more versions
    Cite
    Israel Abebe Azime; Nebil Mohammed (2021). An Amharic News Text classification Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/an-amharic-news-text-classification-dataset
    Explore at:
    Dataset updated
    Mar 9, 2021
    Authors
    Israel Abebe Azime; Nebil Mohammed
    Description

    In NLP, text classification is one of the primary problems we try to solve, and its uses in language analysis are indisputable. The lack of labeled training data makes these tasks harder in low-resource languages like Amharic. The work of collecting, labeling, annotating, and making this kind of data valuable will encourage junior researchers, schools, and machine learning practitioners to implement existing classification models in their language. In this short paper, we introduce an Amharic text classification dataset consisting of more than 50k news articles categorized into 6 classes. This dataset is made available with easy baseline performances to encourage studies and better-performing experiments.

  4. A collection of nine multi-label text classification datasets

    • ieee-dataport.org
    Updated Nov 4, 2024
    Cite
    Yiming Wang (2024). A collection of nine multi-label text classification datasets [Dataset]. https://ieee-dataport.org/documents/collection-nine-multi-label-text-classification-datasets
    Explore at:
    Dataset updated
    Nov 4, 2024
    Authors
    Yiming Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    RCV1

  5. AI-and-Human-Generated-Text

    • huggingface.co
    Updated Mar 3, 2025
    Cite
    Ateeq Azam (2025). AI-and-Human-Generated-Text [Dataset]. https://huggingface.co/datasets/Ateeqq/AI-and-Human-Generated-Text
    Explore at:
    Croissant
    Dataset updated
    Mar 3, 2025
    Authors
    Ateeq Azam
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    AI & Human Generated Text

      I am using this dataset for AI text detection for https://exnrt.com.
    

    The original dataset's GitHub repository is here: https://github.com/panagiotisanagnostou/AI-GA

      Description
    

    The AI-GA dataset, short for Artificial Intelligence Generated Abstracts, comprises abstracts and titles. Half of these abstracts are generated by AI, while the remaining half are original. Primarily intended for research and experimentation in natural language… See the full description on the dataset page: https://huggingface.co/datasets/Ateeqq/AI-and-Human-Generated-Text.

  6. Fashion Product Images and Text Dataset

    • kaggle.com
    Updated Nov 12, 2024
    + more versions
    Cite
    Nirmal Sankalana (2024). Fashion Product Images and Text Dataset [Dataset]. https://www.kaggle.com/datasets/nirmalsankalana/fashion-product-text-images-dataset
    Explore at:
    Croissant
    Dataset updated
    Nov 12, 2024
    Dataset provided by
    Kaggle
    Authors
    Nirmal Sankalana
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    This dataset is a curated collection of fashion product images paired with their titles and descriptions, designed for training and fine-tuning multimodal AI models. Originally derived from Param Aggarwal's "Fashion Product Images Dataset," it has undergone extensive preprocessing to improve usability and efficiency.

    Preprocessing steps include:
    1. Resizing all images to a median size of 1080 x 1440 px, preserving their original aspect ratio.
    2. Streamlining the reference CSV file to retain only essential fields: image file name, display name, product description, and category.
    3. Removing redundant style JSON files to minimize dataset complexity.

    These optimizations have reduced the dataset size by 73%, making it lighter and faster to use without compromising data quality. This refined dataset is ideal for research and applications in multimodal AI, including tasks like product recommendation, image-text matching, and domain-specific fine-tuning.
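    As a rough illustration of step 1 of the preprocessing list above, fitting an image within 1080 x 1440 px while preserving its aspect ratio comes down to a single scale factor. This is a sketch of the arithmetic only, not the authors' actual resizing code:

```python
def fit_within(width, height, max_w=1080, max_h=1440):
    """Scale (width, height) to fit inside max_w x max_h, preserving aspect ratio.

    The min() picks the tighter constraint; the 1.0 term prevents upscaling.
    """
    scale = min(max_w / width, max_h / height, 1.0)
    return round(width * scale), round(height * scale)
```

    For example, a 2160 x 2880 photo scales down by exactly 0.5 to 1080 x 1440, while a very tall 1000 x 4000 banner is constrained by the height limit instead.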

  7. Data from: Text classification model fastText-Trendi-Topics 1.0

    • live.european-language-grid.eu
    Updated Oct 27, 2022
    Cite
    (2022). Text classification model fastText-Trendi-Topics 1.0 [Dataset]. https://live.european-language-grid.eu/catalogue/tool-service/20819
    Explore at:
    Dataset updated
    Oct 27, 2022
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    The fastText-Trendi-Topics model is a text classification model for categorizing news texts with one of 13 topic labels. It was trained on a set of approx. 36,000 Slovene texts from various Slovene news sources included in the Trendi Monitor Corpus of Slovene (http://hdl.handle.net/11356/1590) such as "rtvslo.si", "sta.si", "delo.si", "dnevnik.si", "vecer.com", "24ur.com", "siol.net", "gorenjskiglas.si", etc.

    The texts were semi-automatically categorized into 13 categories based on the sections under which they were published (i.e. URLs). The set of labels was developed in accordance with related categorization schemas used in other corpora and comprises the following topics: "črna kronika" (crime and accidents), "gospodarstvo, posel, finance" (economy, business, finance), "izobraževanje" (education), "okolje" (environment), "prosti čas" (free time), "šport" (sport), "umetnost, kultura" (art, culture), "vreme" (weather), "zabava" (entertainment), "zdravje" (health), "znanost in tehnologija" (science and technology), "politika" (politics), and "družba" (society). The categorization process is explained in more detail in Kosem et al. (2022): https://nl.ijs.si/jtdh22/pdf/JTDH2022_Kosem-et-al_Spremljevalni-korpus-Trendi.pdf

    The model was trained on the labeled texts using the word embeddings CLARIN.SI-embed.sl 1.0 (http://hdl.handle.net/11356/1204) and validated on a development set of 1,293 texts using the fastText library, 1000 epochs, and default values for the rest of the hyperparameters (see https://github.com/TajaKuzman/FastText-Classification-SLED for the full code).
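    fastText's supervised mode expects one example per line, with the topic encoded as a `__label__` prefix token followed by the text. A minimal sketch of preparing such lines (the helper and its whitespace handling are illustrative; the model's actual preprocessing is in the linked repository):

```python
def to_fasttext_line(label, text):
    """Format one training example in fastText's supervised input format.

    Spaces in the label are replaced with hyphens so the label stays a
    single token; newlines in the text are collapsed to single spaces.
    """
    token = "__label__" + label.replace(" ", "-")
    return token + " " + " ".join(text.split())
```

    A file of such lines can then be fed to fastText's supervised trainer with the hyperparameters mentioned above (1000 epochs, pretrained embeddings).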

    The model achieves a macro-F1-score of 0.85 on a test set of 1,295 texts (best for "vreme" at 0.97, worst for "prosti čas" at 0.67).

    Please note that the SloBERTa-Trendi-Topics 1.0 text classification model is also available (http://hdl.handle.net/11356/1709); it achieves higher classification accuracy but is slower and computationally more demanding.

  8. Text Classification Dataset

    • opendatabay.com
    .csv
    Updated Jun 6, 2025
    Cite
    Opendatabay (2025). Text Classification Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/1775ad0d-be0d-49c9-bbc1-f94a8a5c8355
    Explore at:
    Available download formats: .csv
    Dataset updated
    Jun 6, 2025
    Dataset authored and provided by
    Opendatabay
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Area covered
    Education & Learning Analytics
    Description

    A curated dataset of 241,000+ English-language comments labeled for sentiment (negative, neutral, positive). Ideal for training and evaluating NLP models in sentiment analysis.

    Dataset Features

    1. text: Contains individual English-language comments or posts sourced from various online platforms.

    2. label: Represents the sentiment classification assigned to each comment. It uses the following encoding:

    0 = Negative sentiment
    1 = Neutral sentiment
    2 = Positive sentiment

    Distribution

    • Format: CSV (Comma-Separated Values)
    • 2 Columns: text (the comment content) and label (sentiment classification: 0 = Negative, 1 = Neutral, 2 = Positive)
    • File Size: Approximately 23.9 MB
    • Structure: Each row contains a single comment and its corresponding sentiment label.
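    Given the two-column layout above, reading the file and decoding the labels needs only the standard library. A minimal sketch (the function name is illustrative, and the CSV text is passed directly for brevity):

```python
import csv
import io

# Label encoding as documented for this dataset.
SENTIMENTS = {0: "negative", 1: "neutral", 2: "positive"}

def read_sentiment_rows(csv_text):
    """Return (comment, sentiment-name) pairs from the two-column CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["text"], SENTIMENTS[int(row["label"])]) for row in reader]
```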

    Usage

    This dataset is ideal for a variety of applications:

    1. Sentiment Analysis Model Training: Train machine learning or deep learning models to classify text as positive, negative, or neutral.

    2. Text Classification Projects: Use as a labeled dataset for supervised learning in text classification tasks.

    3. Customer Feedback Analysis: Train models to automatically interpret user reviews, support tickets, or survey responses.

    Coverage

    • Geographic Coverage: Primarily English-language content from global online platforms

    • Time Range: The exact time range of data collection is unspecified; however, the dataset reflects contemporary online language patterns and sentiment trends typically observed in the 2010s to early 2020s.

    • Demographics: Specific demographic information (e.g., age, gender, location, industry) is not included in the dataset, as the focus is purely on textual sentiment rather than user profiling.

    License

    CC0

    Who Can Use It

    • Data Scientists: For training machine learning models.
    • Researchers: For academic or scientific studies.
    • Businesses: For analysis, insights, or AI development.
  9. Celeb-VText

    • kaggle.com
    Updated Mar 24, 2024
    Cite
    Saba Hesaraki (2024). Celeb-VText [Dataset]. https://www.kaggle.com/datasets/sabahesaraki/celeb-vtext
    Explore at:
    Croissant
    Dataset updated
    Mar 24, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Saba Hesaraki
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Currently, text-driven generation models are booming in video editing with their compelling results. However, for the face-centric text-to-video generation, challenges remain severe as a suitable dataset with high-quality videos and highly-relevant texts is lacking. In this work, we present a large-scale, high-quality, and diverse facial text-video dataset, CelebV-Text, to facilitate the research of facial text-to-video generation tasks. CelebV-Text contains 70,000 in-the-wild face video clips covering diverse visual content. Each video clip is paired with 20 texts generated by the proposed semi-auto text generation strategy, which is able to describe both the static and dynamic attributes precisely. We make comprehensive statistical analysis on videos, texts, and text-video relevance of CelebV-Text, verifying its superiority over other datasets. Also, we conduct extensive self-evaluations to show the effectiveness and potential of CelebV-Text. Furthermore, a benchmark is constructed with representative methods to standardize the evaluation of the facial text-to-video generation task.

  10. English & Chinese Special Angle Text Dataset

    • so.shaip.com
    • ro.shaip.com
    • +81more
    json
    Updated Dec 25, 2024
    + more versions
    Cite
    Shaip (2024). English & Chinese Special Angle Text Dataset [Dataset]. https://so.shaip.com/offerings/language-text-datasets/
    Explore at:
    Available download formats: json
    Dataset updated
    Dec 25, 2024
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    The English & Chinese Special Angle Text Dataset contains images of text displayed at various angles and orientations in both English and Chinese. It includes text from sources like signs, advertisements, and documents that are not presented in standard horizontal formats. This dataset is used for training and evaluating text detection and recognition models, particularly those capable of handling text in non-traditional orientations and perspectives.

  11. MassiveText Dataset

    • paperswithcode.com
    • library.toponeai.link
    Updated May 23, 2025
    Cite
    Jack W. Rae; Sebastian Borgeaud; Trevor Cai; Katie Millican; Jordan Hoffmann; Francis Song; John Aslanides; Sarah Henderson; Roman Ring; Susannah Young; Eliza Rutherford; Tom Hennigan; Jacob Menick; Albin Cassirer; Richard Powell; George van den Driessche; Lisa Anne Hendricks; Maribeth Rauh; Po-Sen Huang; Amelia Glaese; Johannes Welbl; Sumanth Dathathri; Saffron Huang; Jonathan Uesato; John Mellor; Irina Higgins; Antonia Creswell; Nat McAleese; Amy Wu; Erich Elsen; Siddhant Jayakumar; Elena Buchatskaya; David Budden; Esme Sutherland; Karen Simonyan; Michela Paganini; Laurent SIfre; Lena Martens; Xiang Lorraine Li; Adhiguna Kuncoro; Aida Nematzadeh; Elena Gribovskaya; Domenic Donato; Angeliki Lazaridou; Arthur Mensch; Jean-Baptiste Lespiau; Maria Tsimpoukelli; Nikolai Grigorev; Doug Fritz; Thibault Sottiaux; Mantas Pajarskas; Toby Pohlen; Zhitao Gong; Daniel Toyama; Cyprien de Masson d'Autume; Yujia Li; Tayfun Terzi; Vladimir Mikulik; Igor Babuschkin; Aidan Clark; Diego de Las Casas; Aurelia Guy; Chris Jones; James Bradbury; Matthew Johnson; Blake Hechtman; Laura Weidinger; Iason Gabriel; William Isaac; Ed Lockhart; Simon Osindero; Laura Rimell; Chris Dyer; Oriol Vinyals; Kareem Ayoub; Jeff Stanway; Lorrayne Bennett; Demis Hassabis; Koray Kavukcuoglu; Geoffrey Irving (2025). MassiveText Dataset [Dataset]. https://paperswithcode.com/dataset/massivetext
    Explore at:
    Dataset updated
    May 23, 2025
    Authors
    Jack W. Rae; Sebastian Borgeaud; Trevor Cai; Katie Millican; Jordan Hoffmann; Francis Song; John Aslanides; Sarah Henderson; Roman Ring; Susannah Young; Eliza Rutherford; Tom Hennigan; Jacob Menick; Albin Cassirer; Richard Powell; George van den Driessche; Lisa Anne Hendricks; Maribeth Rauh; Po-Sen Huang; Amelia Glaese; Johannes Welbl; Sumanth Dathathri; Saffron Huang; Jonathan Uesato; John Mellor; Irina Higgins; Antonia Creswell; Nat McAleese; Amy Wu; Erich Elsen; Siddhant Jayakumar; Elena Buchatskaya; David Budden; Esme Sutherland; Karen Simonyan; Michela Paganini; Laurent SIfre; Lena Martens; Xiang Lorraine Li; Adhiguna Kuncoro; Aida Nematzadeh; Elena Gribovskaya; Domenic Donato; Angeliki Lazaridou; Arthur Mensch; Jean-Baptiste Lespiau; Maria Tsimpoukelli; Nikolai Grigorev; Doug Fritz; Thibault Sottiaux; Mantas Pajarskas; Toby Pohlen; Zhitao Gong; Daniel Toyama; Cyprien de Masson d'Autume; Yujia Li; Tayfun Terzi; Vladimir Mikulik; Igor Babuschkin; Aidan Clark; Diego de Las Casas; Aurelia Guy; Chris Jones; James Bradbury; Matthew Johnson; Blake Hechtman; Laura Weidinger; Iason Gabriel; William Isaac; Ed Lockhart; Simon Osindero; Laura Rimell; Chris Dyer; Oriol Vinyals; Kareem Ayoub; Jeff Stanway; Lorrayne Bennett; Demis Hassabis; Koray Kavukcuoglu; Geoffrey Irving
    Description

    MassiveText is a collection of large English-language text datasets from multiple sources: web pages, books, news articles, and code. The data pipeline includes text quality filtering, removal of repetitious text, deduplication of similar documents, and removal of documents with significant test-set overlap. MassiveText contains 2.35 billion documents or about 10.5 TB of text.

    Usage: Gopher is trained on 300B tokens (12.8% of the tokens in the dataset), so the authors sub-sample from MassiveText with sampling proportions specified per subset (books, news, etc.). These sampling proportions are tuned to maximize downstream performance. The largest sampling subset is the curated web-text corpus MassiveWeb, which is found to improve downstream performance relative to existing web-text datasets such as C4 (Raffel et al., 2020).
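    The sub-sampling step can be pictured as splitting a token budget across subsets by their sampling proportions. The proportions below are made-up placeholders for illustration, not Gopher's actual tuned values:

```python
def tokens_per_subset(total_tokens, proportions):
    """Split a token budget across subsets by sampling proportion.

    `proportions` maps subset name -> fraction; fractions must sum to 1.
    """
    assert abs(sum(proportions.values()) - 1.0) < 1e-9
    return {name: round(total_tokens * frac) for name, frac in proportions.items()}
```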

    Find Datasheets in the Gopher paper.

  12. ai-text-detection-pile

    • huggingface.co
    Updated Feb 24, 2023
    Cite
    Artem Yatsenko (2023). ai-text-detection-pile [Dataset]. https://huggingface.co/datasets/artem9k/ai-text-detection-pile
    Explore at:
    Croissant
    Dataset updated
    Feb 24, 2023
    Authors
    Artem Yatsenko
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset Card for AI Text Detection Pile

      Dataset Summary
    

    This is a large scale dataset intended for AI Text Detection tasks, geared toward long-form text and essays. It contains samples of both human text and AI-generated text from GPT2, GPT3, ChatGPT, GPTJ. Here is the (tentative) breakdown:

      Human Text
    

    Dataset and number of samples:

    • Reddit WritingPrompts: 570k

    • OpenAI WebText: 260k

    • HC3 (Human Responses): 58k

    • ivypanda-essays: TODO… See the full description on the dataset page: https://huggingface.co/datasets/artem9k/ai-text-detection-pile.

  13. Data from: A Neural Approach for Text Extraction from Scholarly Figures

    • data.uni-hannover.de
    zip
    Updated Jan 20, 2022
    Cite
    TIB (2022). A Neural Approach for Text Extraction from Scholarly Figures [Dataset]. https://data.uni-hannover.de/dataset/a-neural-approach-for-text-extraction-from-scholarly-figures
    Explore at:
    Available download formats: zip (798357692)
    Dataset updated
    Jan 20, 2022
    Dataset authored and provided by
    TIB
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    A Neural Approach for Text Extraction from Scholarly Figures

    This is the readme for the supplemental data for our ICDAR 2019 paper.

    You can read our paper via IEEE here: https://ieeexplore.ieee.org/document/8978202

    If you found this dataset useful, please consider citing our paper:

    @inproceedings{DBLP:conf/icdar/MorrisTE19,
     author  = {David Morris and
            Peichen Tang and
            Ralph Ewerth},
     title   = {A Neural Approach for Text Extraction from Scholarly Figures},
     booktitle = {2019 International Conference on Document Analysis and Recognition,
            {ICDAR} 2019, Sydney, Australia, September 20-25, 2019},
     pages   = {1438--1443},
     publisher = {{IEEE}},
     year   = {2019},
     url    = {https://doi.org/10.1109/ICDAR.2019.00231},
     doi    = {10.1109/ICDAR.2019.00231},
     timestamp = {Tue, 04 Feb 2020 13:28:39 +0100},
     biburl  = {https://dblp.org/rec/conf/icdar/MorrisTE19.bib},
     bibsource = {dblp computer science bibliography, https://dblp.org}
    }
    

    This work was financially supported by the German Federal Ministry of Education and Research (BMBF) and European Social Fund (ESF) (InclusiveOCW project, no. 01PE17004).

    Datasets

    We used different sources of data for testing, validation, and training. Our testing set was assembled from the datasets used in the cited work by Böschen et al. We excluded the DeGruyter dataset and used it as our validation dataset.

    Testing

    These datasets contain a readme with license information. Further information about the associated project can be found in the authors' published work we cited: https://doi.org/10.1007/978-3-319-51811-4_2

    Validation

    The DeGruyter dataset does not include the labeled images due to license restrictions. As of writing, the images can still be downloaded from DeGruyter via the links in the readme. Note that depending on what program you use to strip the images out of the PDF they are provided in, you may have to re-number the images.

    Training

    We used label_generator's generated dataset, which the author made available on a requester-pays amazon s3 bucket. We also used the Multi-Type Web Images dataset, which is mirrored here.

    Code

    We have made our code available in code.zip. We will upload code, announce further news, and field questions via the github repo.

    Our text detection network is adapted from Argman's EAST implementation. The EAST/checkpoints/ours subdirectory contains the trained weights we used in the paper.

    We used a tesseract script to run text extraction from detected text rows. It is included in our code archive as text_recognition_multipro.py.

    We used a Java tool provided by Falk Böschen and adapted it to our file structure. We included it as evaluator.jar.

    Parameter sweeps are automated by param_sweep.rb. This file also shows how to invoke all of these components.

  14. Hierarchical Text Classification corpora

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated Mar 26, 2024
    Cite
    Alessandro Zangari; Alessandro Zangari; Matteo Marcuzzo; Matteo Marcuzzo; Matteo Rizzo; Matteo Rizzo; Andrea Albarelli; Andrea Albarelli; Andrea Gasparetto; Andrea Gasparetto (2024). Hierarchical Text Classification corpora [Dataset]. http://doi.org/10.5281/zenodo.7319519
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Mar 26, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alessandro Zangari; Alessandro Zangari; Matteo Marcuzzo; Matteo Marcuzzo; Matteo Rizzo; Matteo Rizzo; Andrea Albarelli; Andrea Albarelli; Andrea Gasparetto; Andrea Gasparetto
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A set of 3 datasets for Hierarchical Text Classification (HTC), with samples divided into training and testing splits. The hierarchies of labels within all datasets have depth 2.

    • The Amazon5x5 dataset contains 500,000 user reviews tagged with the reviewed product's categories. There are 5 product categories with 100,000 examples each, and each category has 5 sub-categories.
    • The Bugs dataset contains 30,050 bugs of the Linux kernel, labeled with exactly two categories identifying the affected component.
    • Finally, the Web Of Science dataset contains 46,960 abstracts of scientific papers, labeled with the article's domain (see the original repo for more details).

    Datasets are published in JSONL format, where each line is a JSON-encoded record, as in the example below.

    { "text": "...", "labels": ["...", "..."] }

    The hierarchical structure of labels in each dataset is documented in this repository.
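    Because each line of a split is an independent JSON record with `text` and `labels` fields, reading one needs only the `json` module. A minimal sketch (field contents below are placeholders):

```python
import json

def read_jsonl(lines):
    """Parse JSONL records of the form {"text": ..., "labels": [...]},
    skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]
```

    In practice `lines` would be an open file handle over one of the published split files.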

    These datasets have been presented in this paper:

    Some of these datasets have also been used in:

    • "Ticket Automation: an Insight into Current Research with Applications to Multi-level Classification Scenarios" - DOI: 10.1016/j.eswa.2023.119984
    • "A multi-level approach for hierarchical Ticket Classification", accepted at WNUT 2022 - link

    These datasets are partially derived from previous work, namely:

    • [Amazon] J. Ni, J. Li, J. McAuley, "Justifying Recommendations using Distantly-Labeled Reviews and Fine-Grained Aspects", EMNLP 2019, doi: 10.18653/v1/D19-1018
    • [WOS] K. Kowsari, D. E. Brown, M. Heidarysafa, K. Jafari Meimandi, M. S. Gerber and L. E. Barnes, "HDLTex: Hierarchical Deep Learning for Text Classification," 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), 2017, pp. 364-371, doi: 10.1109/ICMLA.2017.0-134
    • [Linux Bugs] V. Lyubinets, T. Boiko and D. Nicholas, "Automated Labeling of Bugs and Tickets Using Attention-Based Mechanisms in Recurrent Neural Networks," 2018 IEEE Second International Conference on Data Stream Mining & Processing (DSMP), 2018, pp. 271-275, doi: 10.1109/DSMP.2018.8478511
  15. synthetic-domain-text-classification

    • huggingface.co
    Updated Jan 7, 2025
    + more versions
    Cite
    Argilla (2025). synthetic-domain-text-classification [Dataset]. https://huggingface.co/datasets/argilla/synthetic-domain-text-classification
    Explore at:
    Croissant
    Dataset updated
    Jan 7, 2025
    Dataset authored and provided by
    Argilla
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset Card for my-distiset-b845cf19

    This dataset has been created with distilabel.

      Dataset Summary
    

    This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel, using the distilabel CLI:

    distilabel pipeline run --config "https://huggingface.co/datasets/davidberenstein1957/my-distiset-b845cf19/raw/main/pipeline.yaml"

    or explore the configuration:

    distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/argilla/synthetic-domain-text-classification.

  16. toxi-text-3M

    • huggingface.co
    Updated Dec 12, 2023
    Cite
    Fred Zhang (2023). toxi-text-3M [Dataset]. https://huggingface.co/datasets/FredZhang7/toxi-text-3M
    Explore at:
    Dataset updated
    Dec 12, 2023
    Authors
    Fred Zhang
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This is a large multilingual toxicity dataset with 3M rows of text data from 55 natural languages, all of which are written/sent by humans, not machine translation models. The preprocessed training data alone consists of 2,880,667 rows of comments, tweets, and messages. Among these rows, 416,529 are classified as toxic, while the remaining 2,463,773 are considered neutral. Below is a table to illustrate the data composition:

    Toxic Neutral Total

    multilingual-train-deduplicated.csv… See the full description on the dataset page: https://huggingface.co/datasets/FredZhang7/toxi-text-3M.
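    With roughly 416k toxic rows against 2.46M neutral ones, a classifier trained on this data typically needs class weighting to avoid collapsing to the majority class. One common inverse-frequency sketch (an illustration, not something the dataset itself prescribes):

```python
def class_weights(counts):
    """Inverse-frequency class weights, normalized so the most frequent
    class gets weight 1.0 and rarer classes get proportionally more."""
    total = sum(counts.values())
    raw = {label: total / n for label, n in counts.items()}
    low = min(raw.values())
    return {label: w / low for label, w in raw.items()}
```

    Applied to the counts quoted above, the toxic class would be weighted roughly 6x relative to neutral.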

  17. Data from: WebText Dataset

    • paperswithcode.com
    Updated May 22, 2023
    Cite
    Alec Radford; Jeffrey Wu; Rewon Child; David Luan; Dario Amodei; Ilya Sutskever (2023). WebText Dataset [Dataset]. https://paperswithcode.com/dataset/webtext
    Explore at:
    Dataset updated
    May 22, 2023
    Authors
    Alec Radford; Jeffrey Wu; Rewon Child; David Luan; Dario Amodei; Ilya Sutskever
    Description

    WebText is an internal OpenAI corpus created by scraping web pages with emphasis on document quality. The authors scraped all outbound links from Reddit which received at least 3 karma. The authors used the approach as a heuristic indicator for whether other users found the link interesting, educational, or just funny.

    WebText contains the text subset of these 45 million links. It consists of over 8 million documents for a total of 40 GB of text. All Wikipedia documents were removed from WebText since it is a common data source for other datasets.
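    The karma heuristic described above amounts to a simple filter over (url, karma) pairs. A sketch of the idea only, not OpenAI's actual scraping code:

```python
def webtext_candidates(links, min_karma=3):
    """Keep outbound links whose submissions received at least `min_karma`
    karma, mirroring the quality heuristic described for WebText."""
    return [url for url, karma in links if karma >= min_karma]
```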

  18. Natural Language Processing Text Data from Final Contractor/Grantee Reports...

    • catalog.data.gov
    Updated Jul 12, 2024
    Cite
    data.usaid.gov (2024). Natural Language Processing Text Data from Final Contractor/Grantee Reports and Evaluation Reports (2011-2021) [Dataset]. https://catalog.data.gov/dataset/natural-language-processing-text-data-from-final-contractor-grantee-reports-and-evalu-2011
    Explore at:
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    United States Agency for International Development (https://usaid.gov/)
    Description

    This data asset contains data files of text extracted from pdf reports on the Development Experience Clearinghouse (DEC) for the years 2011 to 2021 (as of July 2021). It includes three specific "Document types" identified by the DEC: Final Contractor/Grantee Report, Final Evaluation Report, and Special Evaluation. Each PDF document labeled as one of these three document types and labeled with a publication year from 2011 to 2021 was downloaded from the DEC in July 2021. The dataset includes text data files from 2,579 Final Contractor/Grantee Reports, 1,299 Final Evaluation reports, and 1,323 Special Evaluation reports. Raw text from each of these PDFs was extracted and saved as individual csv files, the names of which correspond to the Document ID of the PDF document on the DEC. Within each csv file, the raw text is split into paragraphs and corresponding sentences. In addition, to enable Natural Language Processing of the data, the sentences are cleaned by removing unnecessary special characters, punctuation, and numbers, and each word is stemmed to its root to remove inflections (e.g. pluralization and conjugation). This data could be used to analyze trends in USAID's programming approaches and terminology. This data was compiled for USAID/PPL/LER with the Program Cycle Mechanism.
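    The cleaning described above (stripping special characters and numbers, then stemming words to their roots) can be approximated with the standard library. The naive suffix-stripper below only gestures at real stemming (e.g. the Porter algorithm) and is not the pipeline actually used:

```python
import re

def clean_sentence(sentence):
    """Lowercase, replace anything other than letters and whitespace with a
    space, and collapse runs of whitespace."""
    text = re.sub(r"[^a-z\s]", " ", sentence.lower())
    return " ".join(text.split())

def naive_stem(word):
    """Very rough suffix stripping (NOT a real stemmer such as Porter's)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word
```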

  19. Dataset For Text Localization Dataset

    • universe.roboflow.com
    zip
    Updated Dec 17, 2022
    Cite
    Practise Dataset (2022). Dataset For Text Localization Dataset [Dataset]. https://universe.roboflow.com/practise-dataset/dataset-for-text-localization
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 17, 2022
    Dataset authored and provided by
    Practise Dataset
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Variables measured
    Text Bounding Boxes
    Description

    DataSET For Text Localization

    ## Overview
    
    DataSET For Text Localization is a dataset for object detection tasks - it contains Text annotations for 386 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  20. Scene Text Parsing

    • hub.arcgis.com
    Updated Mar 16, 2022
    Cite
    Esri (2022). Scene Text Parsing [Dataset]. https://hub.arcgis.com/content/d0989c3375194406b291dae18857b407
    Explore at:
    Dataset updated
    Mar 16, 2022
    Dataset authored and provided by
    Esri (http://esri.com/)
    Description

    Text is prevalent in natural scenes around us in the form of road signs, billboards, house numbers, and place names. Text labels are also an integral part of cadastral maps and floor plans. Extracting this text can provide additional context and details about the places the text describes and the information it conveys.

    This deep learning model is based on the PaddleOCR model and uses optical character recognition (OCR) technology to detect text in images. The model was trained on a large dataset of different types and styles of text with diverse backgrounds and contexts, allowing for precise text extraction. It can be applied to various tasks such as automatically detecting and reading text from billboards, sign boards, scanned maps, etc., thereby converting images containing text into actionable data.

    Using the model: Follow the guide to use the model. Before using this model, ensure that the supported deep learning libraries are installed. For more details, check Deep Learning Libraries Installer for ArcGIS. The PaddleOCR library is additionally required by this model and can be installed with the following command in the ArcGIS Python Command Prompt: conda install paddleocr -c esri

    Fine-tuning the model: This model cannot be fine-tuned using ArcGIS tools.

    Input: High-resolution, 3-band street-level imagery/oriented imagery or scanned maps, with medium to large size text.

    Output: A feature layer with the recognized text and a bounding box around it.

    Model architecture: This model is based on the open-source PaddleOCR model by PaddlePaddle.

    Sample results: (figures omitted)
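    To illustrate the model's output shape, here is a small sketch of turning OCR detections into feature-layer-style records (text, confidence, bounding box). It assumes detections shaped like PaddleOCR's result format, a `(polygon, (text, confidence))` pair per detection with the polygon given as four (x, y) corner points; the function and threshold names are illustrative, not part of the Esri tooling.

    ```python
    def detections_to_features(detections, min_confidence=0.5):
        """Return one record per detection: recognized text plus an axis-aligned bbox."""
        features = []
        for polygon, (text, confidence) in detections:
            if confidence < min_confidence:
                continue  # drop low-confidence reads
            xs = [p[0] for p in polygon]
            ys = [p[1] for p in polygon]
            features.append({
                "text": text,
                "confidence": confidence,
                "bbox": (min(xs), min(ys), max(xs), max(ys)),  # envelope of the polygon
            })
        return features

    sample = [
        ([(10, 10), (90, 10), (90, 30), (10, 30)], ("MAIN ST", 0.94)),
        ([(5, 50), (40, 50), (40, 70), (5, 70)], ("blur", 0.31)),
    ]
    print(detections_to_features(sample))
    # → [{'text': 'MAIN ST', 'confidence': 0.94, 'bbox': (10, 10, 90, 30)}]
    ```

    Collapsing the detection polygon to an axis-aligned envelope is a simplification; for rotated or curved scene text, keeping the original quadrilateral preserves more geometric information.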
