This dataset is for summarization of video transcripts. It contains transcripts of videos from 26 different categories, e.g. Sports, Education, Medical, News, etc.
Context and Inspiration
This dataset was developed for a project focused on enhancing Amharic text summarization using parameter-efficient fine-tuning of the mT5-small model. Recognizing a gap in standardization within Amharic text summarization, a part of the project's goal was to establish a benchmark dataset to facilitate future research and evaluation, thereby advancing Amharic NLP. The dataset's creation entailed comprehensive processes of data collection, aggregation, cleaning, and preprocessing.
Sources
The datasets are based on the following key sources:
[1] Amharic Abstractive Text Summarization by Amr M. Zaki, Mahmoud I. Khalil, and Hazem M. Abbas, 2020.
[2] XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages by Tahmid Hasan et al., ACL-IJCNLP 2021.
[3] An Amharic News Text Classification Dataset by Israel Abebe Azime and Nebil Mohammed, 2021.
Data Collection
The dataset compiles publicly available Amharic text from the studies listed above. These source datasets were available in CSV and JSONL formats.
Data Aggregation and Cleaning
The Amharic datasets from these sources were aggregated, ensuring uniform column names for consistency. This process formed the basis for the following steps:
- Initial Cleaning: Removing duplicates and NaN entries, and filtering out non-Amharic texts.
- Character-Level Normalization: Standardizing variations in the Amharic script.
- Removal of Non-Amharic Elements: Eliminating non-Amharic characters and specific punctuations.
- Size Optimization: Setting token count thresholds to manage entry sizes, considering computing resource limitations.
Dataset Versions
Three versions of the dataset emerged from these processes:
- Amharic-1: Formed after initial cleaning.
- Amharic-2: Developed by adjusting entry lengths and establishing bounds for text and summary lengths based on fine-tuning outcomes.
- Amharic-3: Further refined from Amharic-2 using additional cleaning steps informed by common Amharic text preprocessing practices.
Each version of the dataset was split into training, validation, and test sets in an 80%, 10%, and 10% distribution, respectively, to support effective model training and evaluation.
Columns
- text: The main body of the Amharic text.
- summary: Condensed version or summary of the text.
- is_non_amharic_text: Indicator of whether the text is non-Amharic (True/False).
- is_non_amharic_summary: Indicator of whether the summary is non-Amharic (True/False).
- text_length: Length of the text in characters.
- summary_length: Length of the summary in characters.
- text_word_count: Number of words in the text.
- summary_word_count: Number of words in the summary.
- text_token_count: Count of tokens in the text.
- summary_token_count: Count of tokens in the summary.
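As a quick orientation, the sketch below loads one split and filters entries by token count, mirroring the "Size Optimization" step described above. The file path and the token threshold are assumptions for illustration; they are not specified by the dataset card.

```python
import pandas as pd

# Hypothetical file name; the actual split files may be organized differently.
train = pd.read_csv("amharic-3/train.csv")

# Keep entries within an assumed token budget (illustrative value only).
MAX_TOKENS = 512
train = train[train["text_token_count"] <= MAX_TOKENS]

print(train[["text", "summary"]].head())
```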
For an in-depth understanding of the dataset and the methodologies used, please consult the final report: Link to Final Report
For access to the specific coding details of the modifications, refer to the following notebook: Link to Notebook.
This dataset aims to serve as a valuable resource for researchers and practitioners focusing on low-resource languages like Amharic.
The Email Thread Dataset consists of two main files: email_thread_details and email_thread_summaries. These files collectively offer a comprehensive compilation of email thread information alongside human-generated summaries.
The email_thread_details file provides a detailed perspective on individual email threads, encompassing crucial information such as subject, timestamp, sender, recipients, and the content of the email.
- thread_id: A unique identifier for each email thread.
- subject: Subject of the email thread.
- timestamp: Timestamp indicating when the message was sent.
- from: Sender of the email.
- to: List of recipients of the email.
- body: Content of the email message.

The "to" column is available in both CSV and Pickle (pkl) formats, facilitating convenient access to recipient information as a column of lists of strings.
The email_thread_summaries file contains concise summaries crafted by human annotators for each email thread, offering a high-level overview of the content.
- thread_id: A unique identifier for each email thread.
- summary: A concise summary of the email thread.

The dataset is organized into threads and emails. There are a total of 4,167 threads and 21,684 emails, providing a rich source of information for analysis and research purposes.
JSON Files:
**JSON File Features Description**

email_thread_details.json:
[
{
"thread_id": [unique identifier],
"subject": "[email thread subject]",
"timestamp": [timestamp in milliseconds],
"from": "[sender's name and identifier]",
"to": [
"[recipient 1]",
"[recipient 2]",
"[recipient 3]",
...
],
"body": "[email content]"
},
...
]
email_thread_summaries.json:
[
{
"thread_id": [unique identifier],
"summary": "[summary content]"
},
...
]
- Dataset
├── CSV
│ ├── email_thread_details.csv
│ └── email_thread_summaries.csv
├── Pickle
│ ├── email_thread_details.pkl
│ └── email_thread_summaries.pkl
└── JSON
├── email_thread_details.json
└── email_thread_summaries.json
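For example, a hedged sketch (file paths taken from the layout above) that joins the JSON details with their summaries on thread_id to build text/summary pairs for a summarization task:

```python
import json
import pandas as pd

# Load both JSON files from the layout shown above.
with open("Dataset/JSON/email_thread_details.json", encoding="utf-8") as f:
    details = pd.DataFrame(json.load(f))
with open("Dataset/JSON/email_thread_summaries.json", encoding="utf-8") as f:
    summaries = pd.DataFrame(json.load(f))

# Concatenate each thread's emails in time order, then attach its summary.
threads = (
    details.sort_values("timestamp")
    .groupby("thread_id")["body"]
    .apply("\n".join)
    .reset_index(name="thread_text")
)
pairs = threads.merge(summaries, on="thread_id")
print(pairs.head())
```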
This dataset is provided under the MIT License.
The dataset has been anonymized and sanitized to ensure privacy and confidentiality.
https://choosealicense.com/licenses/pddl/
Content
This dataset was created from three datasets:
- BBC News Summary
- CNN-DailyMail News Text Summarization
- Generated text from LLM models

From these, the following Kaggle dataset was then created:
Text for summarize NLP/LLM task (En)
The dataset was filtered, shuffled, and divided into parts before being saved to Hugging Face.
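A minimal loading sketch with the 🤗 datasets library; the repository id below is a placeholder, since the card does not state the exact Hugging Face path of the published dataset.

```python
from datasets import load_dataset

# Placeholder repository id -- substitute the actual Hugging Face path of this dataset.
dataset = load_dataset("your-namespace/text-for-summarize-nlp-llm-task-en")

# Inspect the splits produced by the filtering/shuffling/splitting step.
print(dataset)
```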
Acknowledgements
The dataset was created with the support of resources from SV metal spol. s r.o.
https://creativecommons.org/publicdomain/zero/1.0/
This work is accepted at TCCE-2020. The paper is available at springer AISC proceedings: https://doi.org/10.1007/978-981-33-4673-4_4 arxiv: https://arxiv.org/pdf/2012.01747.pdf
Nowadays, news and text summarization have become very popular in the NLP field. Both the extractive and abstractive approaches to summarization have been implemented in different languages. A significant amount of data is a primary need for any summarization system, yet for the Bengali language only a few datasets are available. Our dataset is made for Bengali Abstractive News Summarization (BANS) purposes. Because abstractive summarization is largely neural-network based, it needs large amounts of data to perform well. We therefore built a standard Bengali abstractive summarization dataset by crawling the online Bengali news portal bangla.bdnews24.com. We crawled more than 19k articles and summaries and standardized the data.
List of files:
1. article.txt
2. summary.txt
| Description | Data Info. |
|---|---|
| Total no. of articles | 19096 |
| Total no. of summaries | 19096 |
| Maximum no. of words in an article | 76 |
| Maximum no. of words in a summary | 12 |
| Minimum no. of words in an article | 5 |
| Minimum no. of words in a summary | 3 |
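A hedged loading sketch, assuming article.txt and summary.txt are parallel, line-aligned files (one article per line matched to one summary per line); the card lists the two files but does not state the alignment explicitly.

```python
# Assumption: line i of article.txt pairs with line i of summary.txt.
with open("article.txt", encoding="utf-8") as f:
    articles = [line.strip() for line in f]
with open("summary.txt", encoding="utf-8") as f:
    summaries = [line.strip() for line in f]

assert len(articles) == len(summaries)  # both should contain 19,096 entries
print(articles[0])
print(summaries[0])
```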
We would like to thank Shahjalal University of Science and Technology (SUST) research center and SUST NLP research group for their support.
@InProceedings{10.1007/978-981-33-4673-4_4,
  author    = {Bhattacharjee, Prithwiraj and Mallick, Avi and Saiful Islam, Md. and Marium-E-Jannat},
  editor    = {Kaiser, M. Shamim and Bandyopadhyay, Anirban and Mahmud, Mufti and Ray, Kanad},
  title     = {Bengali Abstractive News Summarization (BANS): A Neural Attention Approach},
  booktitle = {Proceedings of International Conference on Trends in Computational and Cognitive Engineering},
  year      = {2021},
  publisher = {Springer Singapore},
  address   = {Singapore},
  pages     = {41--51},
  abstract  = {Abstractive summarization is the process of generating novel sentences based on the information extracted from the original text document while retaining the context. Due to abstractive summarization's underlying complexities, most of the past research work has been done on the extractive summarization approach. Nevertheless, with the triumph of the sequence-to-sequence (seq2seq) model, abstractive summarization becomes more viable. Although a significant number of notable research has been done in the English language based on abstractive summarization, only a couple of works have been done on Bengali abstractive news summarization (BANS). In this article, we presented a seq2seq based Long Short-Term Memory (LSTM) network model with attention at encoder-decoder. Our proposed system deploys a local attention-based model that produces a long sequence of words with lucid and human-like generated sentences with noteworthy information of the original document. We also prepared a dataset of more than 19 k articles and corresponding human-written summaries collected from bangla.bdnews24.com (https://bangla.bdnews24.com/), which is till now the most extensive dataset for Bengali news document summarization and publicly published in Kaggle (https://www.kaggle.com/prithwirajsust/bengali-news-summarization-dataset). We evaluated our model qualitatively and quantitatively and compared it with other published results. It showed significant improvement in terms of human evaluation scores with state-of-the-art approaches for BANS.},
  isbn      = {978-981-33-4673-4}
}
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The dataset contains a total of 25000 legal cases in the form of text documents. Each document has been annotated with catchphrases, citations sentences, citation catchphrases, and citation classes. Citation classes indicate the type of treatment given to the cases cited by the present case.
Context: The g3-sum dataset was created for the purpose of developing and evaluating models for abstractive summarization of Urdu talk show scripts. It provides a structured dataset to facilitate research and application in natural language processing (NLP) for the Urdu language.
Sources: The training data is collected from YouTube scripts of Urdu talk shows, capturing dialogues and discussions from various programs. The dataset includes both scripts and their corresponding human-written summaries.
Inspiration: The inspiration behind this dataset is to improve the accessibility and understanding of lengthy Urdu talk show content by creating concise and accurate summaries. This project aims to enhance the usability of Urdu text data for both researchers and end-users by leveraging advanced summarization techniques.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains 20,000 pieces of text collected from Wikipedia, Gutenberg, and CNN/DailyMail. The text was cleaned by replacing symbols such as (.*?/) with a white space using automatic scripts and regex.
The data was collected from these sources to ensure the highest level of integrity against AI-generated text.
- Wikipedia: The 20220301 dataset was chosen to minimize the chance of including articles generated or heavily edited by AI.
- Gutenberg: Books from this source are guaranteed to be written by real humans and span various genres and time periods.
- CNN/DailyMail: These news articles were written by professional journalists and cover a variety of topics, ensuring diversity in writing style and subject matter.
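A hedged sketch of the kind of cleaning step described above; the authors' exact pattern and pipeline are not published in this card, so the regex below only illustrates the idea of replacing stray symbols with whitespace.

```python
import re

def clean_text(text: str) -> str:
    # Illustrative pattern: replace symbols like . * ? / with a space,
    # then collapse repeated whitespace. The actual regex used may differ.
    text = re.sub(r"[.*?/]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("Some*raw?text//with .symbols"))
```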
The dataset consists of 5 CSV files:
1. CNN_DailyMail.csv: Contains all processed news articles.
2. Gutenberg.csv: Contains all processed books.
3. Wikipedia.csv: Contains all processed Wikipedia articles.
4. Human.csv: Combines all three datasets in order.
5. Shuffled_Human.csv: The randomly shuffled version of Human.csv.
Each file has 2 columns:
- Title: The title of the item.
- Text: The content of the item.
This dataset is suitable for a wide range of NLP tasks, including:
- Training models to distinguish between human-written and AI-generated text (Human/AI classifiers).
- Training LSTMs or Transformers for chatbots, summarization, or topic modeling.
- Sentiment analysis, genre classification, or linguistic research.
Although the data was collected from these sources, it may not be entirely free of AI-generated text. Wikipedia articles may reflect systemic biases in contributor demographics, and CNN/DailyMail articles may focus on specific news topics or regions.
For details on how the dataset was created, click here to view the Kaggle notebook used.
This dataset is published under the MIT License, allowing free use for both personal and commercial purposes. Attribution is encouraged but not required.
https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
This dataset contains the text of a collection of short stories known as the TinyStories Corpus. With over 2,000 annotated stories, it spans diverse styles and genres from multiple sources. The corpus is enriched with annotations across each narrative, making it a valuable resource for narrative text classification. The text field in each row contains the full story and can be used to identify plots, characters, and other features of storytelling technique. Through this collection, users gain insight into a wide range of narratives that can be used to build machine learning models for narrative text classification.
In this dataset, each row contains a short story along with its associated labels for narrative text classification tasks. The data consists of:
- text: The story text itself (string).
- train.csv: Contains the text of short stories used for narrative text classification.
- validation.csv: Contains a set of short stories for validation.
The data contained in both files can be used for various machine learning tasks related to narrative text classification, including but not limited to determining story genres, predicting user reactions, and sentiment analysis.
To get started, download the validation and train CSV files from the Kaggle datasets page and save them to your local environment. You may then need to preprocess both files by cleaning up malformed values and removing duplicate entries, since these can compromise the accuracy of later experiments.

Next, load the two files into pandas DataFrames so they can be manipulated and analyzed with common NLP tooling. This only requires a few lines using pandas functions such as read_csv() and concat(), after which the data can be used in Jupyter notebooks or with machine learning libraries such as scikit-learn for more complex tasks (see the sketch below).

With the data loaded, you can explore connections between different narratives or character traits using supervised machine learning models such as a Naive Bayes classifier, and uncover patterns underlying the richly annotated TinyStories corpus.
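A minimal sketch of that loading step; the file names follow the card, and the concatenation is optional, shown only to illustrate the pandas calls mentioned above.

```python
import pandas as pd

# Load the two splits described in the card.
train = pd.read_csv("train.csv")
validation = pd.read_csv("validation.csv")

# Optionally combine them for corpus-level exploration.
corpus = pd.concat([train, validation], ignore_index=True)
print(corpus["text"].str.len().describe())  # quick look at story lengths
```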
- Creating a text classification algorithm to automatically categorize short stories by genre.
- Developing an AI-based summarization tool to quickly summarize the main points in a story.
- Developing an AI-based story generator that can generate new stories based on existing ones in the dataset
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: validation.csv
| Column name | Description |
|:--------------|:--------------------------------|
| text | The text of the story. (String) |
File: train.csv
| Column name | Description |
|:--------------|:----------------------------...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Summary: Review data and Listing ID (to facilitate time-based analytics and visualizations linked to a listing). This dataset can be used for NLP use cases, e.g. Exploratory Data Analysis (EDA), text summarization, sentiment analysis, intent analysis, and many more.
http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
Stopwords are the words in any language which do not add much meaning to a sentence. They can safely be ignored without sacrificing the meaning of the sentence. For some search engines, these are some of the most common, short function words, such as "the", "is", "at", "which", and "on". In such cases, stop words can cause problems when searching for phrases that include them, particularly in names such as "The Who" or "Take That".
If we have a task of text classification or sentiment analysis, then we should remove stop words, as they do not provide any information to our model; this keeps unwanted words out of our corpus. But if we have a task of language translation, then stopwords are useful, as they have to be translated along with the other words.
There is no hard and fast rule on when to remove stop words. But I would suggest removing stop words if our task to be performed is one of Language Classification, Spam Filtering, Caption Generation, Auto-Tag Generation, Sentiment analysis, or something that is related to text classification. On the other hand, if our task is one of Machine Translation, Question-Answering problems, Text Summarization, Language Modeling, it’s better not to remove the stop words as they are a crucial part of these applications.
One of the first things that we ask ourselves is what are the pros and cons of any task we perform. Let’s look at some of the pros and cons of stop word removal in NLP.
Improper selection and removal of stop words can change the meaning of our text, so we have to be careful in choosing our stop words. For example: "This movie is not good." If we remove "not" in the pre-processing step, the sentence becomes "this movie is good", which reads as positive and is wrongly interpreted.
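A short illustration of that pitfall using NLTK's English stop word list (one common choice; other lists behave similarly, since negation words like "not" are typically included in them):

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))

sentence = "This movie is not good"
tokens = sentence.lower().split()

# Naive removal drops "not" and flips the apparent sentiment.
filtered = [w for w in tokens if w not in stop_words]
print(filtered)  # ['movie', 'good']

# Keeping negation words avoids the problem for sentiment-style tasks.
keep = {"not", "no", "nor"}
filtered_safe = [w for w in tokens if w not in stop_words or w in keep]
print(filtered_safe)  # ['movie', 'not', 'good']
```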
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset contains questions and answers related to injection molding, focusing on topics such as 'Materials', 'Techniques', 'Machinery', 'Troubleshooting', 'Safety', 'Design', 'Maintenance', 'Manufacturing', 'Development', and 'R&D'. The dataset is provided in CSV format with two columns: Questions and Answers.
Researchers, practitioners, and enthusiasts in the field of injection molding can utilize this dataset for tasks such as:
import pandas as pd
# Load the dataset
dataset = pd.read_csv('injection_molds_dataset.csv')
# Display the first few rows
print(dataset.head())
from datasets import load_dataset
# Load the dataset
dataset = load_dataset("mustafakeser/injection-molding-QA")
# Display dataset info
print(dataset)
# Accessing the first few examples
print(dataset['train'][:5])
#or
dataset['train'].to_pandas()
If you use this dataset in your work, please consider citing it as:
@misc{injectionmold_dataset,
author = {Your Name},
title = {Injection Molds Dataset},
year = {2024},
publisher = {Hugging Face},
journal = {Hugging Face Datasets},
howpublished = {\url{link to the dataset}},
}
https://huggingface.co/datasets/mustafakeser/injection-molding-QA mustafakeser/injection-molding-QA
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains information about different books that are sold over the Internet. It can be used for multiple NLP tasks, such as sentiment analysis, text classification, summary generation, recommendation system building, and many more.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The Walmart Customer Reviews Dataset offers a wealth of insights into consumer sentiment and product feedback related to one of the world's largest retail giants, Walmart. This dataset contains a vast collection of customer reviews, star ratings, and other relevant information that has been gathered through web scraping and data compilation.
Key Features:
- Customer Reviews: Detailed textual reviews provide firsthand accounts of shopping experiences and product satisfaction.
- Star Ratings: Each review is accompanied by a star rating, allowing for sentiment analysis and product rating assessment.
- Review Dates: The dataset includes review submission dates, facilitating temporal analysis and trend detection.
- Product Identification: For some reviews, product identification details such as SKU numbers or product categories are provided.
Use Cases:
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset contains news headlines collected from 20 major Indonesian news portals through web scraping conducted on February 23, 2025. The dataset is structured into three key components: the source of the news, the headline title, and the date of publication. By compiling headlines from multiple sources, this dataset provides a comprehensive snapshot of trending topics across different media outlets in Indonesia. It can be utilized for various analytical and research purposes, such as trending topic analysis, sentiment analysis, and natural language processing (NLP) applications. Researchers can use this dataset to track public sentiment, identify recurring themes in news coverage, and train machine learning models for text-based tasks such as classification, keyword extraction, and summarization.
With 1,174 rows and 3 columns, this dataset contains no missing values, ensuring its usability for data analysis and modeling. The three available variables are: source, which represents the name of the news portal where the headline was published; title, which contains the actual headline of the news article; and date, which indicates the publication date of each news piece. These variables make it possible to conduct media monitoring, study media bias, and compare how different news platforms report on similar topics. Additionally, the dataset is valuable for time-series analysis, allowing users to observe how news trends evolve over time.
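A brief sketch of the kind of trend analysis mentioned above; the file name is a placeholder, and the column names source, title, and date are taken from the description.

```python
import pandas as pd

# Placeholder file name -- adjust to the actual download.
df = pd.read_csv("indonesian_news_headlines.csv", parse_dates=["date"])

# Headlines per portal, e.g. for simple media-monitoring comparisons.
print(df.groupby("source")["title"].count().sort_values(ascending=False))

# Headlines per day, as a starting point for time-series trend analysis.
print(df.groupby(df["date"].dt.date).size())
```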