CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset provides a comprehensive corpus for natural language processing tasks, specifically for validating OpenAI's reward models for text summarization. It contains summaries of text from the TL;DR, CNN, and Daily Mail datasets, along with the choices workers made when summarizing the text, batch information that differentiates summaries created by different workers, and dataset split attributes. Together, this data allows users to train summarization systems on real-world data to produce reliable, concise summaries from long-form text, and to benchmark results directly against human-generated summaries.
How to use the dataset: This dataset provides a comprehensive corpus of human-generated summaries for text from the TL;DR, CNN, and Daily Mail datasets, intended for training and evaluating natural language processing models. The dataset contains training and validation splits.
To use this dataset for summarization tasks:
1. Gather information about the text you would like to summarize by looking at the info column entries in the two .csv files (train and validation).
2. Choose the summary you want from the choice column of either file, based on your preference for worker- or batch-type summarization.
3. Review the selected summary's corresponding summaries columns for alternatives with similar content but different wording or style that you may prefer over the original choice.
4. Look through the split, worker, and batch information for context on each choice before selecting the summary that best fits your needs for accuracy and clarity.

Research Ideas
Training a natural language processing model to automatically generate summaries of text, using the summary and choice data from this dataset.
Evaluating OpenAI's reward model on the validation data in order to improve accuracy and performance.
Analyzing the worker and batch information to identify trends among workers or batches that could indicate bias or other issues affecting summarization accuracy.
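A minimal sketch of the selection steps above, assuming the choice column is an integer index into a list-valued summaries column (the exact column layout is inferred from the description, not verified against the files):

```python
# Hypothetical sketch: pick the worker-chosen summary for one row.
# Column names (info, summaries, choice, worker, batch) are assumptions
# taken from the dataset description above.
import json

def pick_summary(row):
    """Return the summary the worker chose: `choice` indexes into `summaries`."""
    summaries = row["summaries"]
    if isinstance(summaries, str):          # CSV cells may store the list as JSON text
        summaries = json.loads(summaries)
    return summaries[row["choice"]]

row = {
    "info": "original post text ...",
    "summaries": ["short version", "longer version"],
    "choice": 1,
    "worker": "w17",
    "batch": "batch3",
}
print(pick_summary(row))  # -> longer version
```

In practice you would load the train/validation .csv files (e.g. with `csv.DictReader` or pandas) and apply the same selection per row.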
Original Data Source: OpenAI Summarization Corpus
This archive contains the summarization corpus generated as a result of the filtering stages (trials-final.csv), the ROUGE scores for the generated summaries (rouge-results-parsed.csv), the data and results of the human evaluation (evaluation/ subfolder), and the code used to generate the corpus (extract.r, filter.r, and determine_similarity_threshold.r). The summaries were generated using the summarize_all.py script.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
A text summarization task aims to convert a longer text into a shorter one while preserving the essential information of the source. In general, there are two approaches to text summarization: the extractive approach selects the most important sentences or parts of the text verbatim, whereas the abstractive approach rewrites the content and is more similar to human-made summaries. We release 5 models that cover extractive, abstractive, and hybrid types:
Metamodel: a neural model based on the Doc2Vec document representation that suggests the best summarizer.
Graph-based model: an unsupervised graph-based extractive approach that returns the N most relevant sentences.
Headline model: a supervised abstractive approach (T5 architecture) that returns headline-like abstracts.
Article model: a supervised abstractive approach (T5 architecture) that returns short summaries.
Hybrid-long model: an unsupervised hybrid (graph-based and transformer-based) approach that returns short summaries of long texts.
Details and instructions to run and train the models are available at https://github.com/clarinsi/SloSummarizer.
The web service with a demo is available at https://slovenscina.eu/povzemanje.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset is intended for text summarization and can be used with extractive or abstractive approaches. The data spans 2018-2022 and has three categories: attraction, hotel, and restaurant. Each category consists of 100 different objects, for a total of 300 objects across all categories. Each object has 5 reviews and 1 ground truth. The ground truth is a reference summary created by 3 experts: 2 hold bachelor's degrees in Indonesian Language and Literature Education and have worked as teachers for more than 2 years, and the third holds a bachelor's degree in Indonesian Literature and has 2 years of experience as an NLP annotator. Each category folder, such as the 'attraction' folder, contains 4 subfolders, each of which holds 25 objects.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The dataset consists of several columns that provide essential information for each entry. These columns include:
instruction: This column denotes the specific instruction given to the model for generating a response.
responses: The model-generated responses to the given instruction are stored in this column.
next_response: This column indicates the subsequent response generated by the model after each previous response.
answer: The correct answer to the question asked in the instruction is provided in this column.
is_human_response: This boolean column indicates whether a particular response was generated by a human or by an AI model.

By analyzing this rich and diverse dataset, researchers and practitioners can gain valuable insights into question-answering tasks using AI models. It allows developers to train their models effectively while also supporting rigorous evaluation methodologies.
Please note that this dataset description does not include specific dates; it focuses on describing the dataset's content and purpose.
How to use the dataset Understanding the Columns: This dataset contains several columns that provide important information for each entry:
instruction: The instruction given to the model for generating a response.
responses: The model-generated responses to the given instruction.
next_response: The next response generated by the model after the previous response.
answer: The correct answer to the question asked in the instruction.
is_human_response: Indicates whether a response was generated by a human or the model.

Training Data (train.csv): Use the train.csv file in this dataset as training data. It contains a large number of examples for training your question-answering models or algorithms.
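A minimal sketch of reading these columns, assuming the column names above and using an in-memory sample in place of the real train.csv (file paths and values are placeholders):

```python
# Hypothetical sketch: load rows and separate human-written from
# model-generated responses via the is_human_response column.
import csv
import io

sample = io.StringIO(
    "instruction,responses,next_response,answer,is_human_response\n"
    "Name the capital of France.,Paris,Correct!,Paris,True\n"
    "Name the largest planet.,Saturn,Not quite.,Jupiter,False\n"
)
rows = list(csv.DictReader(sample))  # swap in open("train.csv") for the real file
human = [r for r in rows if r["is_human_response"] == "True"]
print(len(human))  # -> 1
```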
Testing Data (test.csv): Use the test.csv file in this dataset as testing data. It lets you evaluate how well your models or algorithms perform on unseen questions and instructions.

Create Machine Learning Models: Use the dataset's instructional components (instruction, responses, next_response, and human-generated answers), together with labels such as is_human_response (True/False), to train machine learning models designed for question-answering tasks.

Evaluate Model Performance: After training your model on the provided training data, test its performance on unseen questions from the test.csv file by comparing its predicted responses with the actual human-generated answers.
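The comparison step above can be sketched as a simple exact-match accuracy over the answer column; the toy lists below stand in for real predictions and for answers loaded from test.csv:

```python
# Minimal sketch of evaluating predicted responses against gold answers.
# Exact match (case- and whitespace-insensitive) is one simple metric;
# real evaluations often use softer metrics such as token overlap.
def exact_match_accuracy(predictions, answers):
    matches = sum(
        p.strip().lower() == a.strip().lower()
        for p, a in zip(predictions, answers)
    )
    return matches / len(answers)

preds = ["Paris", "blue whale", "1969"]
gold = ["paris", "Blue Whale", "1970"]
print(exact_match_accuracy(preds, gold))  # 2 of 3 match
```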
Data Augmentation: You can also augment this existing data in various ways such as paraphrasing existing instructions or generating alternative responses based on similar contexts within each example.
Build Conversational Agents: This dataset can be useful for training conversational agents or chatbots by leveraging the instruction-response pairs.
Remember, this dataset provides a valuable resource for building and evaluating question-answering models. Have fun exploring the data and discovering new insights!
Research Ideas Language Understanding: This dataset can be used to train models for question-answering tasks. Models can learn to understand and generate responses based on given instructions and previous responses.
Chatbot Development: With this dataset, developers can create chatbots that provide accurate and relevant answers to user questions. The models can be trained on various topics and domains, allowing the chatbot to answer a wide range of questions.
Educational Materials: This dataset can be used to develop educational materials, such as interactive quizzes or study guides. The models trained on this dataset can provide instant feedback and answers to students' questions, enhancing their learning experience.
Information Retrieval Systems: By training models on this dataset, information retrieval systems can be developed that help users find specific answers or information from large datasets or knowledge bases.
Customer Support: This dataset can be used in training customer support chatbots or virtual assistants that can provide quick and accurate responses to customer inquiries.
Language Generation Research: Researchers studying natural language generation (NLG) techniques could use this dataset for developing novel algorithms for generating coherent and contextually appropriate responses in question-answering scenarios.
Automatic Summarization Systems: Using the instruction-response pairs, automatic summarization systems could be trained that generate concise summaries of lengthy texts by understanding the main content of the text through answering questions.
Dialogue Systems Evaluation: The instruction-response pairs in this dataset could serve as a benchmark for evaluating the performance of dialogue systems in terms of response quality, relevance, coherence, etc.
Machine Learning Training Data Augmentation: The instruction-response pairs can also serve as seed data for augmenting other training sets, for example by paraphrasing instructions or generating alternative responses.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Autism spectrum disorder (ASD) is characterized by highly heterogeneous abnormalities in functional brain connectivity affecting social behavior. Significant progress has been made in the last decade in understanding the molecular and genetic basis of ASD using multi-omics approaches. Mining this large volume of biomedical literature for insights requires a considerable amount of manual curation. Machine learning and artificial intelligence are advancing toward simplifying data mining from unstructured text. Here, we demonstrate a literature mining pipeline that accelerates the path from data to insights. Using topic modeling and generative AI techniques, we present a pipeline that classifies scientific literature into thematic clusters and can support a wide array of applications such as knowledgebase creation, conversational virtual assistants, and summarization. Employing this pipeline, we explored the ASD literature, specifically multi-omics studies, to understand the molecular interplay underlying the autism brain.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a repository that contains datasets for SnipGEN replication. Each testbed is contained in a JSON file. There are three JSON files with curated data tuned for a software engineering (SE) task. The tar.gz archive contains the raw data collected after mining GitHub repositories.

The first testbed is the curated summarization task. This file contains the name of the repository the snippet comes from, the file name, the commit message, the snippet, and the linked documentation. This testbed is used for completing code from a code description and from the combination of docstring and code.

The code completion file contains the prompts used for the control, Treatment 1 (T1), and Treatment 2 (T2) conditions, and the predicted outcome from ChatGPT.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The QA4MRE dataset offers a rich collection of passages with connected questions and answers. It was the basis of the CLEF 2011, 2012, and 2013 Shared Tasks, with training datasets available for the main track as well as documents ready to be used in two pilot studies related to Alzheimer's disease and entrance exams. Its breadth makes it a versatile resource for question-answering and reading-comprehension research across many fields.
How to use the dataset: The QA4MRE (Question Answering for Machine Reading Evaluation) dataset is a valuable resource for researchers exploring reading comprehension. It provides several versions of training and development data in the form of passages with accompanying questions and answers, plus gold-standard documents for two pilot studies related to Alzheimer's disease and entrance exams. The following is a guide to making the most of this dataset:
Analyze Data Structures - Once you've downloaded all necessary materials, analyze the structure each file follows so you can access its contents accordingly. Knowing what each column holds helps refine your searching process, as some files go beyond questions and answers, for example by providing the topic name associated with each passage. The table below gives a basic overview of each column provided in both the train and dev variants of this dataset:
Column Name | Description | Datatype |
---|---|---|
Topic name | Name of the topic the passage represents | String |
Refine Data Searching Process - Lastly, if you plan to develop an automated system or algorithm to uncover precise content from the passages, refine your established search process accordingly.
Research Ideas
Creating an automated question-answering system capable of engaging in conversation with a user. This could serve as a teaching assistant helping students study for exams, or as a virtual assistant for customer service.
Developing a summarization tool dedicated to the QA4MRE dataset that extracts key information from each passage and outputs concise summaries, with confidence scores indicating how likely each summary is to be faithful to the original text.
Utilizing natural language processing techniques to analyze questions related to Alzheimer's disease and building machine learning models that accurately predict patient responses to various sets of questions about their condition, aiding early diagnosis.
CC0
Original Data Source: QA4MRE (Reading Comprehension Q&A)
https://crawlfeeds.com/privacy_policy
Unlock the power of real customer feedback with Crawlfeeds' comprehensive Trustpilot Reviews Dataset, featuring over 2 million reviews collected from verified users across diverse industries and businesses worldwide.
This dataset is ideal for sentiment analysis, NLP training, brand reputation monitoring, and other AI/ML use cases that require authentic, high-volume user-generated content.
Records: 2,000,000+ user reviews
Language: English
Coverage: Global companies in e-commerce, SaaS, travel, finance, and more
Train and benchmark sentiment analysis models
Fine-tune large language models (LLMs) on real-world opinion data
Conduct brand reputation research or consumer trust studies
Build review summarization or opinion mining tools
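As a toy illustration of the opinion-mining use case above, a naive lexicon-based polarity score over review text (real work would use a trained sentiment model; the word lists here are invented for the example):

```python
# Hypothetical sketch: score a review as (positive hits - negative hits)
# against small hand-picked lexicons. Purely illustrative.
POS = {"great", "excellent", "fast", "helpful"}
NEG = {"slow", "broken", "rude", "refund"}

def polarity(review):
    words = {w.strip(".,!?").lower() for w in review.split()}
    return len(words & POS) - len(words & NEG)

print(polarity("Great service, fast delivery!"))    # -> 2
print(polarity("Rude support and a slow refund."))  # -> -3
```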
Looking for a smaller sample?
Explore our 20K Trustpilot reviews set on Hugging Face:
👉 Trustpilot Reviews Dataset – 20K Sample
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This corpus contains preprocessed posts from the Reddit dataset. The dataset consists of 3,848,330 posts with an average length of 270 words for content, and 28 words for the summary.
Features include the strings: author, body, normalizedBody, content, summary, subreddit, subreddit_id. The content field is used as the document and the summary field as the summary.
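A minimal sketch of turning records with the fields listed above into (document, summary) training pairs, skipping posts without a summary; the toy records below stand in for the real corpus:

```python
# Illustrative sketch: build (content, summary) pairs for summarization
# training from dict records with the fields listed above.
def to_pairs(records):
    return [(r["content"], r["summary"]) for r in records if r.get("summary")]

records = [
    {"author": "u1", "subreddit": "tifu", "content": "long post ...", "summary": "tl;dr short"},
    {"author": "u2", "subreddit": "askreddit", "content": "another post", "summary": ""},
]
print(to_pairs(records))  # only the first record has a non-empty summary
```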
https://www.marketresearchintellect.com/de/privacy-policy
The market size and share are categorized by Clinical Documentation (Automated Transcription, Data Extraction, Clinical Decision Support, Patient Summarization, Coding Assistance), Patient Interaction (Chatbots, Sentiment Analysis, Virtual Health Assistants, Appointment Scheduling, Patient Feedback Analysis), Drug Discovery (Literature Mining, Biomarker Identification, Clinical Trial Data Analysis, Adverse Event Detection, Patient Stratification), Healthcare Operations (Revenue Cycle Management, Fraud Detection, Supply Chain Management, Predictive Analytics, Resource Management), Research and Development (Clinical Research, Data Mining, Patient Cohort Identification, Real-World Evidence Generation, Market Access and Value Demonstration), and geographic region (North America, Europe, Asia-Pacific, South America, Middle East & Africa).