100+ datasets found
  1. LLM - Detect AI Generated Text Dataset

    • kaggle.com
    Updated Nov 8, 2023
    Cite
    sunil thite (2023). LLM - Detect AI Generated Text Dataset [Dataset]. https://www.kaggle.com/datasets/sunilthite/llm-detect-ai-generated-text-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    sunil thite
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This dataset contains both AI-generated and human-written essays for training purposes. The challenge is to develop a machine learning model that can accurately detect whether an essay was written by a student or by an LLM. The competition dataset comprises a mix of student-written essays and essays generated by a variety of LLMs.

    The dataset contains more than 28,000 essays, both student-written and AI-generated.

    Features: 1. text: the essay text. 2. generated: the target label (0 = Human-Written Essay, 1 = AI-Generated Essay).
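    A minimal sketch of reading the two columns described above with Python's csv module; the inline sample below is illustrative, not data from the dataset:

```python
import csv
import io

def label_counts(csv_file):
    """Tally human-written (0) vs AI-generated (1) essays from the
    two-column layout described above."""
    counts = {"0": 0, "1": 0}
    for row in csv.DictReader(csv_file):
        counts[row["generated"]] += 1
    return counts

# Inline sample mirroring the described schema, for illustration only.
sample = io.StringIO("text,generated\nA student essay,0\nAn LLM essay,1\nAnother student essay,0\n")
print(label_counts(sample))  # {'0': 2, '1': 1}
```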

  2. Text Classification Dataset

    • opendatabay.com
    Updated Jun 6, 2025
    Cite
    Opendatabay (2025). Text Classification Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/1775ad0d-be0d-49c9-bbc1-f94a8a5c8355
    Explore at:
    Available download formats
    Dataset updated
    Jun 6, 2025
    Dataset provided by
    Buy & Sell Data | Opendatabay - AI & Synthetic Data Marketplace
    Authors
    Opendatabay
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Area covered
    Education & Learning Analytics
    Description

    A curated dataset of 241,000+ English-language comments labeled for sentiment (negative, neutral, positive). Ideal for training and evaluating NLP models in sentiment analysis.

    Dataset Features

    1. text: Contains individual English-language comments or posts sourced from various online platforms.

    2. label: Represents the sentiment classification assigned to each comment. It uses the following encoding:

    0 = Negative sentiment
    1 = Neutral sentiment
    2 = Positive sentiment

    Distribution

    • Format: CSV (Comma-Separated Values)
    • 2 columns: text (the comment content) and label (sentiment classification: 0 = Negative, 1 = Neutral, 2 = Positive)
    • File Size: Approximately 23.9 MB
    • Structure: Each row contains a single comment and its corresponding sentiment label.
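    A small helper for decoding the documented label encoding, sketched against the two-column CSV layout described above (the inline sample is illustrative, not dataset content):

```python
import csv
import io

# Label encoding documented above.
SENTIMENT = {"0": "negative", "1": "neutral", "2": "positive"}

def decoded_rows(csv_file):
    """Yield (text, sentiment_name) pairs from the two-column CSV."""
    for row in csv.DictReader(csv_file):
        yield row["text"], SENTIMENT[row["label"]]

sample = io.StringIO("text,label\nGreat product,2\nMeh,1\nTerrible,0\n")
print(list(decoded_rows(sample)))
# [('Great product', 'positive'), ('Meh', 'neutral'), ('Terrible', 'negative')]
```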

    Usage

    This dataset is ideal for a variety of applications:

    • 1. Sentiment Analysis Model Training: Train machine learning or deep learning models to classify text as positive, negative, or neutral.

    • 2. Text Classification Projects: Use as a labeled dataset for supervised learning in text classification tasks.

    • 3. Customer Feedback Analysis: Train models to automatically interpret user reviews, support tickets, or survey responses.

    Coverage

    • Geographic Coverage: Primarily English-language content from global online platforms

    • Time Range: The exact time range of data collection is unspecified; however, the dataset reflects contemporary online language patterns and sentiment trends typically observed in the 2010s to early 2020s.

    • Demographics: Specific demographic information (e.g., age, gender, location, industry) is not included in the dataset, as the focus is purely on textual sentiment rather than user profiling.

    License

    CC0

    Who Can Use It

    • Data Scientists: For training machine learning models.
    • Researchers: For academic or scientific studies.
    • Businesses: For analysis, insights, or AI development.
  3. A collection of nine multi-label text classification datasets

    • ieee-dataport.org
    Updated Nov 4, 2024
    Cite
    Yiming Wang (2024). A collection of nine multi-label text classification datasets [Dataset]. https://ieee-dataport.org/documents/collection-nine-multi-label-text-classification-datasets
    Explore at:
    Dataset updated
    Nov 4, 2024
    Authors
    Yiming Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    RCV1

  4. Data from: A Neural Approach for Text Extraction from Scholarly Figures

    • data.uni-hannover.de
    zip
    Updated Jan 20, 2022
    Cite
    TIB (2022). A Neural Approach for Text Extraction from Scholarly Figures [Dataset]. https://data.uni-hannover.de/dataset/a-neural-approach-for-text-extraction-from-scholarly-figures
    Explore at:
    zip (798357692). Available download formats
    Dataset updated
    Jan 20, 2022
    Dataset authored and provided by
    TIB
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    A Neural Approach for Text Extraction from Scholarly Figures

    This is the readme for the supplemental data for our ICDAR 2019 paper.

    You can read our paper via IEEE here: https://ieeexplore.ieee.org/document/8978202

    If you found this dataset useful, please consider citing our paper:

    @inproceedings{DBLP:conf/icdar/MorrisTE19,
     author  = {David Morris and
            Peichen Tang and
            Ralph Ewerth},
     title   = {A Neural Approach for Text Extraction from Scholarly Figures},
     booktitle = {2019 International Conference on Document Analysis and Recognition,
            {ICDAR} 2019, Sydney, Australia, September 20-25, 2019},
     pages   = {1438--1443},
     publisher = {{IEEE}},
     year   = {2019},
     url    = {https://doi.org/10.1109/ICDAR.2019.00231},
     doi    = {10.1109/ICDAR.2019.00231},
     timestamp = {Tue, 04 Feb 2020 13:28:39 +0100},
     biburl  = {https://dblp.org/rec/conf/icdar/MorrisTE19.bib},
     bibsource = {dblp computer science bibliography, https://dblp.org}
    }
    

    This work was financially supported by the German Federal Ministry of Education and Research (BMBF) and European Social Fund (ESF) (InclusiveOCW project, no. 01PE17004).

    Datasets

    We used different sources of data for testing, validation, and training. Our testing set was assembled from the work by Böschen et al. that we cited. We excluded the DeGruyter dataset from it and used that as our validation dataset.

    Testing

    These datasets contain a readme with license information. Further information about the associated project can be found in the authors' published work we cited: https://doi.org/10.1007/978-3-319-51811-4_2

    Validation

    The DeGruyter dataset does not include the labeled images due to license restrictions. As of writing, the images can still be downloaded from DeGruyter via the links in the readme. Note that depending on what program you use to strip the images out of the PDF they are provided in, you may have to re-number the images.

    Training

    We used label_generator's generated dataset, which the author made available on a requester-pays amazon s3 bucket. We also used the Multi-Type Web Images dataset, which is mirrored here.

    Code

    We have made our code available in code.zip. We will upload code, announce further news, and field questions via the github repo.

    Our text detection network is adapted from Argman's EAST implementation. The EAST/checkpoints/ours subdirectory contains the trained weights we used in the paper.

    We used a Tesseract script to run text extraction on detected text rows. This is included in our code archive (code.tar) as text_recognition_multipro.py.

    We used a Java program provided by Falk Böschen, adapted to our file structure. We included this as evaluator.jar.

    Parameter sweeps are automated by param_sweep.rb. This file also shows how to invoke all of these components.

  5. Kieli NLP Data - Fully-labelled Audio & Text Dataset for Machine Learning &...

    • datarade.ai
    Updated Mar 20, 2021
    Cite
    Kieli (2021). Kieli NLP Data - Fully-labelled Audio & Text Dataset for Machine Learning & AI platforms [Dataset]. https://datarade.ai/data-products/a-fully-labelled-dataset-for-machine-learning-and-ai-platforms-kieli
    Explore at:
    .json, .xml, .csv, .xls, .sql, .txt. Available download formats
    Dataset updated
    Mar 20, 2021
    Dataset authored and provided by
    Kieli
    Area covered
    Djibouti, Fiji, Venezuela (Bolivarian Republic of), Antigua and Barbuda, Denmark, Uruguay, Tajikistan, Anguilla, Mauritius, Ethiopia
    Description

    Kieli labels audio, speech, image, video, and text data, including semantic segmentation, named entity recognition (NER), and POS tagging. Kieli transforms unstructured data into high-quality training data for the refinement of Artificial Intelligence and Machine Learning platforms. For over a decade, hundreds of organizations have relied on Kieli to deliver secure, high-quality training data and model validation for machine learning. At Kieli, we believe that accurate data is the most important factor in production learning models. We are committed to delivering the best-quality data for the most enterprising organizations and helping you make strides in Artificial Intelligence. At Kieli, we're passionately dedicated to serving the Arabic, English, and French markets. We work in all areas of industry: healthcare, technology, and retail.

  6. An Amharic News Text classification Dataset

    • paperswithcode.com
    Updated Mar 9, 2021
    Cite
    Israel Abebe Azime; Nebil Mohammed (2021). An Amharic News Text classification Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/an-amharic-news-text-classification-dataset
    Explore at:
    Dataset updated
    Mar 9, 2021
    Authors
    Israel Abebe Azime; Nebil Mohammed
    Description

    In NLP, text classification is one of the primary problems we try to solve and its uses in language analyses are indisputable. The lack of labeled training data made it harder to do these tasks in low resource languages like Amharic. The task of collecting, labeling, annotating, and making valuable this kind of data will encourage junior researchers, schools, and machine learning practitioners to implement existing classification models in their language. In this short paper, we aim to introduce the Amharic text classification dataset that consists of more than 50k news articles that were categorized into 6 classes. This dataset is made available with easy baseline performances to encourage studies and better performance experiments.

  7. Balinese Story Texts Dataset - Characters, Aliases, and their Classification...

    • data.mendeley.com
    Updated Mar 25, 2024
    Cite
    I Made Satria Bimantara (2024). Balinese Story Texts Dataset - Characters, Aliases, and their Classification [Dataset]. http://doi.org/10.17632/h2tf5ymcp9.3
    Explore at:
    Dataset updated
    Mar 25, 2024
    Authors
    I Made Satria Bimantara
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset consists of 120 Balinese story texts (also known as Satua Bali) annotated for narrative text analysis, including character identification, alias clustering, and character classification into protagonist or antagonist. The labeling involved two Balinese native speakers fluent in understanding Balinese story texts, one of whom is an expert in sociolinguistics and macrolinguistics. Reliability and level of agreement in the dataset are measured by Cohen's kappa coefficient, the Jaccard similarity coefficient, and F1-score, all of which show almost perfect agreement (>0.81). There are four main folders, each used for a different narrative text analysis purpose:
    1. First Dataset (charsNamedEntity): 89,917 annotated tokens with five character named entity labels (ANM, ADJ, PNAME, GODS, OBJ) for character named entity recognition
    2. Second Dataset (charsExtraction): 6,634 annotated sentences for character identification at the sentence level
    3. Third Dataset (charsAliasClustering): 930 lists of character groups from 120 story texts for alias clustering
    4. Fourth Dataset (charsClassification): 848 lists of character groups classified into two groups (Protagonist and Antagonist)
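    Cohen's kappa, used above to measure inter-annotator agreement, can be computed directly from two annotators' label sequences; a pure-Python sketch (the labels below are illustrative, not taken from the dataset):

```python
from collections import Counter

def cohens_kappa(ann_a, ann_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), where p_o is observed
    agreement and p_e is agreement expected by chance."""
    assert len(ann_a) == len(ann_b) and ann_a
    n = len(ann_a)
    p_o = sum(a == b for a, b in zip(ann_a, ann_b)) / n
    ca, cb = Counter(ann_a), Counter(ann_b)
    p_e = sum(ca[lbl] * cb[lbl] for lbl in set(ca) | set(cb)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Two hypothetical annotators labeling Protagonist/Antagonist.
print(cohens_kappa(["P", "P", "A", "A"], ["P", "P", "A", "P"]))  # 0.5
```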

  8. LLM Text Generation Dataset

    • unidata.pro
    csv
    Updated Feb 26, 2025
    Cite
    Unidata L.L.C-FZ (2025). LLM Text Generation Dataset [Dataset]. https://unidata.pro/datasets/llm-text-generation/
    Explore at:
    csv. Available download formats
    Dataset updated
    Feb 26, 2025
    Dataset authored and provided by
    Unidata L.L.C-FZ
    Description

    LLM Text Generation dataset offers multilingual text samples from large language models, enriching AI’s natural language understanding

  9. IPATH Dataset: 45,609 Curated Image-Text Pairs for Histopathology...

    • zenodo.org
    Updated Apr 23, 2025
    Cite
    Seyederfan Mirhosseini; Taran Rai; Pablo Jose Diaz Santana; Roberto La Ragione; Nicholas Bacon; Kevin Wells (2025). IPATH Dataset: 45,609 Curated Image-Text Pairs for Histopathology Applications [Dataset]. http://doi.org/10.5281/zenodo.14278846
    Explore at:
    Dataset updated
    Apr 23, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Seyederfan Mirhosseini; Taran Rai; Pablo Jose Diaz Santana; Roberto La Ragione; Nicholas Bacon; Kevin Wells
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Recent advancements in artificial intelligence (AI) have enabled the identification of patterns in pathology images, improving diagnostic accuracy and decision support systems. However, progress has been limited due to the lack of publicly available medical images. To address this scarcity, we explore Instagram as a novel source of pathology images with expert annotations. We curated the IPATH dataset from Instagram, comprising 45,609 pathology image-text pairs, using a combination of classifiers, large language models, and manual filtering. To demonstrate the value of this dataset, we developed a multimodal AI model called IP-CLIP by fine-tuning the pre-trained CLIP model using the IPATH dataset. IP-CLIP outperforms the original CLIP model in classifying new pathology images on two downstream tasks—zero-shot classification and linear probing—using two external histopathology datasets. These results surpass the CLIP baseline model and demonstrate the effectiveness of the IPATH dataset, highlighting the potential of social media data to advance AI models for medical image classification.

  10. Amharic text dataset extracted from memes for hate speech detection or...

    • data.mendeley.com
    Updated Jun 8, 2023
    Cite
    Mequanent Degu (2023). Amharic text dataset extracted from memes for hate speech detection or classification [Dataset]. http://doi.org/10.17632/gw3fdtw5v7.2
    Explore at:
    Dataset updated
    Jun 8, 2023
    Authors
    Mequanent Degu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset was collected from social media platforms such as Facebook and Telegram and then further processed. The collection has three variants: orginal_cleaned (neither stemmed nor stopword-removed), stopword_removed (stopwords removed but not stemmed), and stemmed (both stemmed and stopword-removed). Stemming was done using HornMorpho, developed by Michael Gesser (available at https://github.com/hltdi/HornMorpho). All datasets are normalized and free from noise such as punctuation marks and emojis.

  11. Date-Dataset

    • kaggle.com
    Updated Aug 18, 2021
    Cite
    nishtha kukreti (2021). Date-Dataset [Dataset]. https://www.kaggle.com/nishthakukreti/datedataset/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 18, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    nishtha kukreti
    Description

    Context

    This is a random date dataset that I generated using a Python script, for creating a machine learning model that tags dates in any given document.

    Content

    This dataset labels whether a given word or group of words is a date or not.


    Inspiration

    Implement a machine learning or deep learning model, or train a custom spaCy pipeline, to tag dates and other parts of speech.
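    As a trivial baseline for the tagging task described here (not the author's model), one could flag tokens that match common numeric date patterns; the regex and tokens below are illustrative assumptions:

```python
import re

# Matches simple numeric dates like 18/08/2021, 18-08-21, or 2021-08-18.
DATE_RE = re.compile(r"^(\d{1,2}[/-]\d{1,2}[/-]\d{2,4}|\d{4}-\d{2}-\d{2})$")

def tag_dates(tokens):
    """Return (token, is_date) pairs for a pre-tokenized document."""
    return [(tok, bool(DATE_RE.match(tok))) for tok in tokens]

print(tag_dates(["Signed", "on", "18/08/2021"]))
# [('Signed', False), ('on', False), ('18/08/2021', True)]
```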

  12. Trained models for multi-task multi-dataset learning for text classification...

    • databank.illinois.edu
    Cite
    Shubhanshu Mishra, Trained models for multi-task multi-dataset learning for text classification in tweets [Dataset]. http://doi.org/10.13012/B2IDB-1917934_V1
    Explore at:
    Authors
    Shubhanshu Mishra
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Trained models for multi-task multi-dataset learning for text classification in tweets. Classification tasks include sentiment prediction, abusive content, sarcasm, and veridicality. Models were trained using: https://github.com/socialmediaie/SocialMediaIE/blob/master/SocialMediaIE/scripts/multitask_multidataset_classification.py See https://github.com/socialmediaie/SocialMediaIE and https://socialmediaie.github.io for details. If you are using this data, please also cite the related article: Shubhanshu Mishra. 2019. Multi-dataset-multi-task Neural Sequence Tagging for Information Extraction from Tweets. In Proceedings of the 30th ACM Conference on Hypertext and Social Media (HT '19). ACM, New York, NY, USA, 283-284. DOI: https://doi.org/10.1145/3342220.3344929

  13. A kiswahili Dataset for Development of Text-To-Speech System

    • data.mendeley.com
    Updated Nov 30, 2021
    Cite
    Kiptoo Rono (2021). A kiswahili Dataset for Development of Text-To-Speech System [Dataset]. http://doi.org/10.17632/vbvj6j6pm9.1
    Explore at:
    Dataset updated
    Nov 30, 2021
    Authors
    Kiptoo Rono
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains Kiswahili text and audio files: 7,108 text files and corresponding audio files. It was created from open-source, non-copyrighted material, a Kiswahili audio Bible, whose authors permit use for non-profit, educational, and public benefit purposes. The downloaded audio files were longer than 12.5 s, so they were programmatically split into short clips based on silence and then recombined to random lengths such that each eventual audio file lies between 1 and 12.5 s. This was done using Python 3. The audio files were saved as single-channel, 16-bit PCM WAVE files with a sampling rate of 22.05 kHz. The dataset contains approximately 106,000 Kiswahili words, transcribed at a mean of 14.96 words per text file and saved in CSV format. Each text file is divided into three parts: a unique ID, the transcribed words, and the normalized words. The unique ID is a number assigned to each text file; the transcribed words are the text spoken by a reader; normalized texts expand abbreviations and numbers into full words. Each audio file split was assigned the same unique ID as its text file.
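    The recombination step described above (silence-split clips merged back into files of at most 12.5 s) can be sketched as a simple packer; the greedy strategy here is an illustrative assumption, since the authors describe combining clips to random lengths:

```python
def pack_clips(durations, max_len=12.5):
    """Greedily combine short clip durations (in seconds) so each
    output file stays within the 12.5 s ceiling described above."""
    files, current = [], 0.0
    for d in durations:
        if current + d > max_len and current > 0:
            files.append(current)
            current = 0.0
        current += d
    if current:
        files.append(current)
    return files

print(pack_clips([4.0, 5.0, 6.0, 2.0]))  # [9.0, 8.0]
```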

  14. Data from: Code4ML: a Large-scale Dataset of annotated Machine Learning Code...

    • zenodo.org
    csv
    Updated Sep 15, 2023
    Cite
    Anonymous authors; Anonymous authors (2023). Code4ML: a Large-scale Dataset of annotated Machine Learning Code [Dataset]. http://doi.org/10.5281/zenodo.6607065
    Explore at:
    csv. Available download formats
    Dataset updated
    Sep 15, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous authors
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We present Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle.

    The data is organized in a table structure. Code4ML includes several main objects: competition information, raw code blocks collected from Kaggle, and manually marked-up snippets. Each table is in .csv format.

    Each competition has the text description and metadata, reflecting competition and used dataset characteristics as well as evaluation metrics (competitions.csv). The corresponding datasets can be loaded using Kaggle API and data sources.

    The code blocks and their metadata are organized into data frames by the publishing year of the initial kernels. The current version of the corpus includes two code block files: snippets from kernels up to 2020 (сode_blocks_upto_20.csv) and those from 2021 (сode_blocks_21.csv), with corresponding metadata. The corpus consists of 2,743,615 ML code blocks collected from 107,524 Jupyter notebooks.

    Marked up code blocks have the following metadata: anonymized id, the format of the used data (for example, table or audio), the id of the semantic type, a flag for the code errors, the estimated relevance to the semantic class (from 1 to 5), the id of the parent notebook, and the name of the competition. The current version of the corpus has ~12 000 labeled snippets (markup_data_20220415.csv).

    As marked up code blocks data contains the numeric id of the code block semantic type, we also provide a mapping from this number to semantic type and subclass (actual_graph_2022-06-01.csv).

    The dataset can help solve various problems, including code synthesis from a prompt in natural language, code autocompletion, and semantic code classification.
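    Joining the marked-up snippets to the semantic-type mapping described above might look like this (the column names type_id and semantic_type are hypothetical, not the corpus schema):

```python
def attach_semantic_names(markup_rows, mapping_rows):
    """Attach human-readable semantic-type names to marked-up code
    blocks via their numeric type id (column names are hypothetical)."""
    id_to_name = {r["type_id"]: r["semantic_type"] for r in mapping_rows}
    return [dict(row, semantic_type=id_to_name.get(row["type_id"], "unknown"))
            for row in markup_rows]

# Illustrative rows, not real corpus content.
mapping = [{"type_id": "3", "semantic_type": "data_import"}]
snippets = [{"code_block": "pd.read_csv(...)", "type_id": "3"}]
print(attach_semantic_names(snippets, mapping))
```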

  15. Yektanet Persian Web Text Classification Dataset

    • opendatabay.com
    Updated Jul 6, 2025
    Cite
    Datasimple (2025). Yektanet Persian Web Text Classification Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/886a3949-9499-4647-9038-b7e8caa26cfc
    Explore at:
    Available download formats
    Dataset updated
    Jul 6, 2025
    Dataset authored and provided by
    Datasimple
    Area covered
    Data Science and Analytics
    Description

    The Yektanet Dataset is a real Persian web data collection, meticulously refined and gathered by the Yektanet platform. Its primary purpose is to serve as an industrial case study for applying machine learning in Natural Language Processing (NLP) [1]. This dataset enables the development of machine learning models capable of predicting the categorical topic of a document based on its text features, such as the title, description, and full text content [1]. It provides a valuable resource for training and evaluating machine learning models in document categorisation and topic prediction [1].

    Columns

    The dataset consists of multiple instances, each containing various features that provide information about the documents [1]. The main target variable is the category column, which indicates the topic or category of the content [1]. Additional features include: * Description: This column provides a description of the document [1]. * Text_content: This column holds the complete text content of the document [1]. * Title: This column represents the title of the document [1]. * h1 and h2: These columns contain content found within the HTML tags h1 and h2, respectively [1]. * URL: This column specifies the link address associated with the document [1]. * Domain: This column indicates the domain or website from which the document originates [1]. * Id: This column represents the unique identifier for each link [1].

    Distribution

    The Yektanet dataset comprises multiple instances, with approximately 5206 records based on the distribution of category labels [1, 2]. The dataset includes unique values for columns such as ID (4786 unique values), text content (4720 unique values), title (4614 unique values), and description (4399 unique values) [3]. Data files are typically provided in CSV format [4].

    Usage

    This dataset is ideally suited for developing and evaluating machine learning models for document categorisation and topic prediction tasks [1]. It can be used for applications involving Natural Language Processing (NLP), such as: * Training machine learning models to predict document topics [1]. * Developing text classification systems [1]. * Research into real-world web data analysis [1]. * Exploring feature engineering for NLP tasks [1].
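    Concatenating the textual columns listed above into one feature string for a topic classifier could be sketched as follows (lower-case dictionary keys are an assumption about the file's header, and the sample row is illustrative):

```python
def document_text(row):
    """Merge the document's textual fields into a single string that a
    bag-of-words or transformer classifier can consume."""
    parts = (row.get(k, "") for k in ("title", "h1", "h2", "description", "text_content"))
    return " ".join(p for p in parts if p)

# Hypothetical row with Persian content, mirroring the described columns.
doc = {"title": "سلامت", "text_content": "متن کامل سند", "h1": ""}
print(document_text(doc))
```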

    Coverage

    The Yektanet dataset is a real Persian web data collection [1]. Its region of coverage is global [5]. It includes content across various topics, with dominant categories such as 'سلامت' (health) at 13% and 'ورزش' (sports) at 11% [3]. The data availability is not restricted to specific groups or years beyond being a current web data collection [1].

    License

    CC By

    Who Can Use It

    The dataset is primarily intended for researchers and practitioners in the fields of machine learning and Natural Language Processing (NLP) [1]. Ideal users include data scientists, AI/ML engineers, academics, and anyone interested in document classification, topic modelling, or working with Persian text data [1].

    Dataset Name Suggestions

    • Yektanet Persian Web Text Classification Dataset
    • Persian Document Topic Prediction Data
    • Yektanet NLP Classification Corpus
    • Web Text Categorisation Dataset (Persian)
    • Yektanet Machine Learning Text Dataset

    Attributes

    Original Data Source: Yektanet (Dataset for Text Classification)

  16. Event-Dataset: Temporal information retrieval and text classification...

    • ieee-dataport.org
    Updated Nov 6, 2020
    Cite
    Muhammad Islam (2020). Event-Dataset: Temporal information retrieval and text classification dataset [Dataset]. https://ieee-dataport.org/documents/event-dataset-temporal-information-retrieval-and-text-classification-dataset
    Explore at:
    Dataset updated
    Nov 6, 2020
    Authors
    Muhammad Islam
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    2018

  17. Trained models for multi-task multi-dataset learning for text classification...

    • databank.illinois.edu
    Updated Aug 4, 2020
    Cite
    Shubhanshu Mishra (2020). Trained models for multi-task multi-dataset learning for text classification as well as sequence tagging in tweets [Dataset]. http://doi.org/10.13012/B2IDB-1094364_V1
    Explore at:
    Dataset updated
    Aug 4, 2020
    Authors
    Shubhanshu Mishra
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Trained models for multi-task multi-dataset learning for text classification as well as sequence tagging in tweets. Classification tasks include sentiment prediction, abusive content, sarcasm, and veridicality. Sequence tagging tasks include POS, NER, Chunking, and SuperSenseTagging. Models were trained using: https://github.com/socialmediaie/SocialMediaIE/blob/master/SocialMediaIE/scripts/multitask_multidataset_classification_tagging.py See https://github.com/socialmediaie/SocialMediaIE and https://socialmediaie.github.io for details. If you are using this data, please also cite the related article: Shubhanshu Mishra. 2019. Multi-dataset-multi-task Neural Sequence Tagging for Information Extraction from Tweets. In Proceedings of the 30th ACM Conference on Hypertext and Social Media (HT '19). ACM, New York, NY, USA, 283-284. DOI: https://doi.org/10.1145/3342220.3344929

  18. WELFake dataset for fake news detection in text data

    • zenodo.org
    csv
    Updated Apr 9, 2021
    Cite
    Pawan Kumar Verma; Prateek Agrawal; Radu Prodan (2021). WELFake dataset for fake news detection in text data [Dataset]. http://doi.org/10.5281/zenodo.4561253
    Explore at:
    csv. Available download formats
    Dataset updated
    Apr 9, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Pawan Kumar Verma; Prateek Agrawal; Radu Prodan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We designed a larger and more generic Word Embedding over Linguistic Features for Fake News Detection (WELFake) dataset of 72,134 news articles: 35,028 real and 37,106 fake. For this, we merged four popular news datasets (i.e. Kaggle, McIntire, Reuters, BuzzFeed Political) to prevent over-fitting of classifiers and to provide more text data for better ML training.

    The dataset contains four columns: Serial number (starting from 0); Title (the news headline); Text (the news content); and Label (0 = fake, 1 = real).

    The CSV file contains 78,098 entries, of which only 72,134 are accessible when loaded as a data frame.

    This dataset is a part of our ongoing research on "Fake News Prediction on Social Media Website" as a doctoral degree program of Mr. Pawan Kumar Verma and is partially supported by the ARTICONF project funded by the European Union’s Horizon 2020 research and innovation program.
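    As a sketch of how the four-column layout above might be consumed, the fake/real balance can be checked with pandas. The lowercase column names and the CSV file name in the comment are assumptions; check the actual header of the downloaded file.

    ```python
    import pandas as pd

    def label_balance(df, label_col="label"):
        """Count fake (0) and real (1) entries under the WELFake labeling."""
        counts = df[label_col].value_counts().to_dict()
        return {"fake": counts.get(0, 0), "real": counts.get(1, 0)}

    # Toy frame mimicking the schema; for the real data, load the CSV
    # with pd.read_csv(...) using the actual file name and column names.
    df = pd.DataFrame({
        "title": ["Verified report", "Hoax headline", "Fabricated story"],
        "text": ["...", "...", "..."],
        "label": [1, 0, 0],
    })
    print(label_balance(df))  # {'fake': 2, 'real': 1}
    ```
    
    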

  19. Processed twitter sentiment Dataset | Added Tokens

    • kaggle.com
    Updated Aug 21, 2024
    Cite
    Halemo GPA (2024). Processed twitter sentiment Dataset | Added Tokens [Dataset]. http://doi.org/10.34740/kaggle/ds/5568348
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 21, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Halemo GPA
    License

    Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    This dataset is a processed version of the Sentiment140 corpus, containing 1.6 million tweets with binary sentiment labels. The original data has been cleaned, tokenized, and prepared for natural language processing (NLP) and machine learning tasks. It provides a rich resource for sentiment analysis, text classification, and other NLP applications. The dataset includes the full processed corpus (train-processed.csv) and a smaller sample of 10,000 tweets (train-processed-sample.csv) for quick experimentation and model prototyping. Key Features:

    - 1.6 million labeled tweets
    - Binary sentiment classification (0 for negative, 1 for positive)
    - Preprocessed and tokenized text
    - Balanced class distribution
    - Suitable for various NLP tasks and model architectures

    Citation: If you use this dataset in your research or project, please cite the original Sentiment140 dataset: Go, A., Bhayani, R. and Huang, L., 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(2009), p.12.
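    For quick experimentation along the lines of the bundled 10,000-tweet sample, a balanced subsample can be drawn with pandas. This is a sketch on a toy corpus; the `sentiment` column name is an assumption, not necessarily the header used in train-processed.csv.

    ```python
    import pandas as pd

    def balanced_sample(df, n_per_class, label_col="sentiment", seed=0):
        """Draw the same number of tweets from each sentiment class."""
        return (df.groupby(label_col)
                  .sample(n=n_per_class, random_state=seed)
                  .reset_index(drop=True))

    # Toy corpus standing in for the full processed file.
    df = pd.DataFrame({
        "text": [f"tweet {i}" for i in range(10)],
        "sentiment": [0, 1] * 5,
    })
    sample = balanced_sample(df, n_per_class=2)
    print(len(sample))  # 4
    ```

    Because the full corpus is already balanced, sampling an equal count per class preserves the 50/50 label split at any subsample size.
    
    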

  20. Machine Learning Materials Datasets

    • figshare.com
    txt
    Updated Sep 11, 2018
    Cite
    Dane Morgan (2018). Machine Learning Materials Datasets [Dataset]. http://doi.org/10.6084/m9.figshare.7017254.v5
    Explore at:
    txt (available download format)
    Dataset updated
    Sep 11, 2018
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Dane Morgan
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Three datasets are intended for exploring machine learning applications in materials science. They are formatted simply, in particular for easy input into the MAterials Simulation Toolkit - Machine Learning (MAST-ML) package (see https://github.com/uw-cmg/MAST-ML). Each dataset is a materials property of interest with associated descriptors. For detailed information, please see the attached README text file.

    The first dataset, for dilute solute diffusion, can be used to predict an effective diffusion barrier for a solute element moving through another host element. The dataset was calculated with DFT methods.

    The second dataset, for perovskite stability, gives energies of compositions of potential perovskite materials relative to the convex hull, calculated with DFT. The perovskite dataset also includes columns with information about the A site, B site, and X site in the perovskite structure, in order to perform more advanced grouping of the data.

    The third dataset is a metallic glasses dataset with values of reduced glass transition temperature (Trg) for a variety of metallic alloys. An additional column gives the majority element for each alloy, which can be an interesting property to group on during tests.
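    The grouping on the majority-element column that the description suggests for the metallic glasses data might look as follows with pandas. Column names (`majority_element`, `Trg`) and the toy rows are assumptions for illustration, not the file's actual headers or values.

    ```python
    import pandas as pd

    def mean_trg_by_majority_element(df, group_col="majority_element",
                                     target_col="Trg"):
        """Average reduced glass transition temperature per majority element."""
        return df.groupby(group_col)[target_col].mean()

    # Toy rows standing in for the metallic glasses dataset.
    df = pd.DataFrame({
        "alloy": ["Cu55Zr45", "Cu60Zr40", "Fe80B20"],
        "majority_element": ["Cu", "Cu", "Fe"],
        "Trg": [0.62, 0.58, 0.55],
    })
    print(mean_trg_by_majority_element(df))
    ```

    The same pattern applies to the perovskite dataset's A-site, B-site, and X-site columns when grouping stability energies.
    
    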
