This dataset is for summarization of video transcripts. It contains transcripts of videos from 26 different categories, e.g. Sports, Education, Medical, News, etc.
Context and Inspiration
This dataset was developed for a project focused on enhancing Amharic text summarization using parameter-efficient fine-tuning of the mT5-small model. Recognizing a gap in standardization within Amharic text summarization, a part of the project's goal was to establish a benchmark dataset to facilitate future research and evaluation, thereby advancing Amharic NLP. The dataset's creation entailed comprehensive processes of data collection, aggregation, cleaning, and preprocessing.
Sources
The datasets are based on the following key sources:
[1] Amharic Abstractive Text Summarization by Amr M. Zaki, Mahmoud I. Khalil, and Hazem M. Abbas, 2020.
[2] XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages by Tahmid Hasan et al., ACL-IJCNLP 2021.
[3] An Amharic News Text Classification Dataset by Israel Abebe Azime and Nebil Mohammed, 2021.
Data Collection
The dataset compiles publicly available Amharic text from the studies listed above. These source datasets were available in CSV and JSONL formats.
Data Aggregation and Cleaning
The Amharic datasets from these sources were aggregated, ensuring uniform column names for consistency. This process formed the basis for the following steps:
- Initial Cleaning: Removing duplicates and NaN entries, and filtering out non-Amharic texts.
- Character-Level Normalization: Standardizing variations in the Amharic script.
- Removal of Non-Amharic Elements: Eliminating non-Amharic characters and specific punctuations.
- Size Optimization: Setting token count thresholds to manage entry sizes, considering computing resource limitations.
Dataset Versions
Three versions of the dataset emerged from these processes:
- Amharic-1: Formed after initial cleaning.
- Amharic-2: Developed by adjusting entry lengths and establishing bounds for text and summary lengths based on fine-tuning outcomes.
- Amharic-3: Further refined from Amharic-2 using additional cleaning steps informed by common Amharic text preprocessing practices.
Each version of the dataset was split into training, validation, and test sets in an 80%, 10%, and 10% distribution, respectively, to support effective model training and evaluation.
Columns
- text: The main body of the Amharic text.
- summary: Condensed version or summary of the text.
- is_non_amharic_text: Indicator of whether the text is non-Amharic (True/False).
- is_non_amharic_summary: Indicator of whether the summary is non-Amharic (True/False).
- text_length: Length of the text in characters.
- summary_length: Length of the summary in characters.
- text_word_count: Number of words in the text.
- summary_word_count: Number of words in the summary.
- text_token_count: Count of tokens in the text.
- summary_token_count: Count of tokens in the summary.
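As a quick orientation, the sketch below loads one split and filters entries by token count, mirroring the "Size Optimization" step described above. The file path and the token threshold are assumptions for illustration; they are not specified by the dataset card.

```python
import pandas as pd

# Hypothetical file name; the actual split files may be organized differently.
train = pd.read_csv("amharic-3/train.csv")

# Keep entries within an assumed token budget (illustrative value only).
MAX_TOKENS = 512
train = train[train["text_token_count"] <= MAX_TOKENS]

print(train[["text", "summary"]].head())
```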
For an in-depth understanding of the dataset and the methodologies used, please consult the final report: Link to Final Report
For access to the specific coding details of the modifications, refer to the following notebook: Link to Notebook.
This dataset aims to serve as a valuable resource for researchers and practitioners focusing on low-resource languages like Amharic.
The Email Thread Dataset consists of two main files: email_thread_details and email_thread_summaries. These files collectively offer a comprehensive compilation of email thread information alongside human-generated summaries.
The email_thread_details file provides a detailed perspective on individual email threads, encompassing crucial information such as subject, timestamp, sender, recipients, and the content of the email.
- thread_id: A unique identifier for each email thread.
- subject: Subject of the email thread.
- timestamp: Timestamp indicating when the message was sent.
- from: Sender of the email.
- to: List of recipients of the email.
- body: Content of the email message.

The "to" column is available in both CSV and Pickle (pkl) formats, facilitating convenient access to recipient information as a column of lists of strings.
The email_thread_summaries file contains concise summaries crafted by human annotators for each email thread, offering a high-level overview of the content.
- thread_id: A unique identifier for each email thread.
- summary: A concise summary of the email thread.

The dataset is organized into threads and emails. There are a total of 4,167 threads and 21,684 emails, providing a rich source of information for analysis and research purposes.
JSON Files:
**JSON File Features Description**

email_thread_details.json:
[
{
"thread_id": [unique identifier],
"subject": "[email thread subject]",
"timestamp": [timestamp in milliseconds],
"from": "[sender's name and identifier]",
"to": [
"[recipient 1]",
"[recipient 2]",
"[recipient 3]",
...
],
"body": "[email content]"
},
...
]
email_thread_summaries.json:
[
{
"thread_id": [unique identifier],
"summary": "[summary content]"
},
...
]
- Dataset
├── CSV
│ ├── email_thread_details.csv
│ └── email_thread_summaries.csv
├── Pickle
│ ├── email_thread_details.pkl
│ └── email_thread_summaries.pkl
└── JSON
├── email_thread_details.json
└── email_thread_summaries.json
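For example, a hedged sketch (file paths taken from the layout above) that joins the JSON details with their summaries on thread_id to build text/summary pairs for a summarization task:

```python
import json
import pandas as pd

# Load both JSON files from the layout shown above.
with open("Dataset/JSON/email_thread_details.json", encoding="utf-8") as f:
    details = pd.DataFrame(json.load(f))
with open("Dataset/JSON/email_thread_summaries.json", encoding="utf-8") as f:
    summaries = pd.DataFrame(json.load(f))

# Concatenate each thread's emails in time order, then attach its summary.
threads = (
    details.sort_values("timestamp")
    .groupby("thread_id")["body"]
    .apply("\n".join)
    .reset_index(name="thread_text")
)
pairs = threads.merge(summaries, on="thread_id")
print(pairs.head())
```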
This dataset is provided under the MIT License.
The dataset has been anonymized and sanitized to ensure privacy and confidentiality.
https://choosealicense.com/licenses/pddl/
Content
This dataset was created from three datasets:
- BBC News Summary
- CNN-DailyMail News Text Summarization
- Generated text from LLM models

From these, the following Kaggle dataset was then created:
Text for summarize NLP/LLM task (En)
The dataset was filtered, shuffled, and divided into parts before being saved to Hugging Face.
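A minimal loading sketch with the 🤗 datasets library; the repository id below is a placeholder, since the card does not state the exact Hugging Face path of the published dataset.

```python
from datasets import load_dataset

# Placeholder repository id -- substitute the actual Hugging Face path of this dataset.
dataset = load_dataset("your-namespace/text-for-summarize-nlp-llm-task-en")

# Inspect the splits produced by the filtering/shuffling/splitting step.
print(dataset)
```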
Acknowledgements
The dataset was created with the support of resources from SV metal spol. s r.o.
https://creativecommons.org/publicdomain/zero/1.0/
This work is accepted at TCCE-2020. The paper is available at springer AISC proceedings: https://doi.org/10.1007/978-981-33-4673-4_4 arxiv: https://arxiv.org/pdf/2012.01747.pdf
Nowadays, news and text summarization have become very popular in the NLP field. Both the extractive and abstractive approaches to summarization have been implemented in different languages. A significant amount of data is a primary need for any summarization system, yet for the Bengali language only a few datasets are available. Our dataset is made for Bengali Abstractive News Summarization (BANS) purposes. Because abstractive summarization is largely neural-network based, it needs large amounts of data to perform well. We therefore built a standard Bengali abstractive summarization dataset by crawling the online Bengali news portal bangla.bdnews24.com. We crawled more than 19k articles and summaries and standardized the data.
List of files:
1. article.txt
2. summary.txt
| Description | Data Info. |
|---|---|
| Total no. of articles | 19096 |
| Total no. of summaries | 19096 |
| Maximum no. of words in an article | 76 |
| Maximum no. of words in a summary | 12 |
| Minimum no. of words in an article | 5 |
| Minimum no. of words in a summary | 3 |
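A hedged loading sketch, assuming article.txt and summary.txt are parallel, line-aligned files (one article per line matched to one summary per line); the card lists the two files but does not state the alignment explicitly.

```python
# Assumption: line i of article.txt pairs with line i of summary.txt.
with open("article.txt", encoding="utf-8") as f:
    articles = [line.strip() for line in f]
with open("summary.txt", encoding="utf-8") as f:
    summaries = [line.strip() for line in f]

assert len(articles) == len(summaries)  # both should contain 19,096 entries
print(articles[0])
print(summaries[0])
```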
We would like to thank Shahjalal University of Science and Technology (SUST) research center and SUST NLP research group for their support.
@InProceedings{10.1007/978-981-33-4673-4_4,
  author    = {Bhattacharjee, Prithwiraj and Mallick, Avi and Saiful Islam, Md. and Marium-E-Jannat},
  editor    = {Kaiser, M. Shamim and Bandyopadhyay, Anirban and Mahmud, Mufti and Ray, Kanad},
  title     = {Bengali Abstractive News Summarization (BANS): A Neural Attention Approach},
  booktitle = {Proceedings of International Conference on Trends in Computational and Cognitive Engineering},
  year      = {2021},
  publisher = {Springer Singapore},
  address   = {Singapore},
  pages     = {41--51},
  abstract  = {Abstractive summarization is the process of generating novel sentences based on the information extracted from the original text document while retaining the context. Due to abstractive summarization's underlying complexities, most of the past research work has been done on the extractive summarization approach. Nevertheless, with the triumph of the sequence-to-sequence (seq2seq) model, abstractive summarization becomes more viable. Although a significant number of notable research has been done in the English language based on abstractive summarization, only a couple of works have been done on Bengali abstractive news summarization (BANS). In this article, we presented a seq2seq based Long Short-Term Memory (LSTM) network model with attention at encoder-decoder. Our proposed system deploys a local attention-based model that produces a long sequence of words with lucid and human-like generated sentences with noteworthy information of the original document. We also prepared a dataset of more than 19 k articles and corresponding human-written summaries collected from bangla.bdnews24.com (https://bangla.bdnews24.com/), which is till now the most extensive dataset for Bengali news document summarization and publicly published in Kaggle (https://www.kaggle.com/prithwirajsust/bengali-news-summarization-dataset). We evaluated our model qualitatively and quantitatively and compared it with other published results. It showed significant improvement in terms of human evaluation scores with state-of-the-art approaches for BANS.},
  isbn      = {978-981-33-4673-4}
}
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The dataset contains a total of 25000 legal cases in the form of text documents. Each document has been annotated with catchphrases, citations sentences, citation catchphrases, and citation classes. Citation classes indicate the type of treatment given to the cases cited by the present case.
Context: The g3-sum dataset was created for the purpose of developing and evaluating models for abstractive summarization of Urdu talk show scripts. It provides a structured dataset to facilitate research and application in natural language processing (NLP) for the Urdu language.
Sources: The training data is collected from YouTube scripts of Urdu talk shows, capturing dialogues and discussions from various programs. The dataset includes both scripts and their corresponding human-written summaries.
Inspiration: The inspiration behind this dataset is to improve the accessibility and understanding of lengthy Urdu talk show content by creating concise and accurate summaries. This project aims to enhance the usability of Urdu text data for both researchers and end-users by leveraging advanced summarization techniques.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains 20,000 pieces of text collected from Wikipedia, Gutenberg, and CNN/DailyMail. The text was cleaned by replacing symbols such as (.*?/) with a white space using automatic scripts and regex.
The data was collected from these sources to ensure the highest level of integrity against AI-generated text.
- Wikipedia: The 20220301 dataset was chosen to minimize the chance of including articles generated or heavily edited by AI.
- Gutenberg: Books from this source are guaranteed to be written by real humans and span various genres and time periods.
- CNN/DailyMail: These news articles were written by professional journalists and cover a variety of topics, ensuring diversity in writing style and subject matter.
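A hedged sketch of the kind of cleaning step described above; the authors' exact pattern and pipeline are not published in this card, so the regex below only illustrates the idea of replacing stray symbols with whitespace.

```python
import re

def clean_text(text: str) -> str:
    # Illustrative pattern: replace symbols like . * ? / with a space,
    # then collapse repeated whitespace. The actual regex used may differ.
    text = re.sub(r"[.*?/]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("Some*raw?text//with .symbols"))
```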
The dataset consists of 5 CSV files:
1. CNN_DailyMail.csv: Contains all processed news articles.
2. Gutenberg.csv: Contains all processed books.
3. Wikipedia.csv: Contains all processed Wikipedia articles.
4. Human.csv: Combines all three datasets in order.
5. Shuffled_Human.csv: The randomly shuffled version of Human.csv.
Each file has 2 columns:
- Title: The title of the item.
- Text: The content of the item.
This dataset is suitable for a wide range of NLP tasks, including:
- Training models to distinguish between human-written and AI-generated text (Human/AI classifiers).
- Training LSTMs or Transformers for chatbots, summarization, or topic modeling.
- Sentiment analysis, genre classification, or linguistic research.
Although the data was collected from these sources, it may not be entirely free of AI-generated text. Wikipedia articles may reflect systemic biases in contributor demographics, and CNN/DailyMail articles may focus on specific news topics or regions.
For details on how the dataset was created, click here to view the Kaggle notebook used.
This dataset is published under the MIT License, allowing free use for both personal and commercial purposes. Attribution is encouraged but not required.
https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
This dataset contains the text of a collection of short stories known as the TinyStories Corpus. With over 2,000 annotated stories, it spans diverse styles and genres from multiple sources. The corpus is enriched with annotations across each narrative, making it a valuable resource for narrative text classification. The text field in each row contains the full story and can be used to identify plots, characters, and other features of storytelling technique. Through this collection, users gain insight into a wide range of narratives that can be used to build machine learning models for narrative text classification.
In this dataset, each row contains a short story along with its associated labels for narrative text classification tasks. The data consists of:
- text: The story text itself (string).
- train.csv: Contains the text of short stories used for narrative text classification.
- validation.csv: Contains a set of short stories for validation.
The data contained in both files can be used for various machine learning tasks related to narrative text classification, including but not limited to determining story genres, predicting user reactions, and sentiment analysis.
To get started, download the validation and train CSV files from the Kaggle datasets page and save them to your local environment. You may then need to preprocess both files by cleaning up malformed values and removing duplicate entries, since these can compromise the accuracy of later experiments.

Next, load the two files into pandas DataFrames so they can be manipulated and analyzed with common NLP tooling. This only requires a few lines using pandas functions such as read_csv() and concat(), after which the data can be used in Jupyter notebooks or with machine learning libraries such as scikit-learn for more complex tasks (see the sketch below).

With the data loaded, you can explore connections between different narratives or character traits using supervised machine learning models such as a Naive Bayes classifier, and uncover patterns underlying the richly annotated TinyStories corpus.
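A minimal sketch of that loading step; the file names follow the card, and the concatenation is optional, shown only to illustrate the pandas calls mentioned above.

```python
import pandas as pd

# Load the two splits described in the card.
train = pd.read_csv("train.csv")
validation = pd.read_csv("validation.csv")

# Optionally combine them for corpus-level exploration.
corpus = pd.concat([train, validation], ignore_index=True)
print(corpus["text"].str.len().describe())  # quick look at story lengths
```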
- Creating a text classification algorithm to automatically categorize short stories by genre.
- Developing an AI-based summarization tool to quickly summarize the main points in a story.
- Developing an AI-based story generator that can generate new stories based on existing ones in the dataset
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: validation.csv
| Column name | Description |
|:--------------|:--------------------------------|
| text | The text of the story. (String) |
File: train.csv
| Column name | Description |
|:--------------|:----------------------------...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Summary: Review data and Listing ID (to facilitate time-based analytics and visualizations linked to a listing). This dataset can be used for NLP use cases, e.g. Exploratory Data Analysis (EDA), text summarization, sentiment analysis, intent analysis, and many more.
http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
Stopwords are the words in any language which do not add much meaning to a sentence. They can safely be ignored without sacrificing the meaning of the sentence. For some search engines, these are some of the most common, short function words, such as "the", "is", "at", "which", and "on". In such cases, stop words can cause problems when searching for phrases that include them, particularly in names such as "The Who" or "Take That".
If we have a task of text classification or sentiment analysis, then we should remove stop words, as they do not provide any information to our model; this keeps unwanted words out of our corpus. But if we have a task of language translation, then stopwords are useful, as they have to be translated along with the other words.
There is no hard and fast rule on when to remove stop words. But I would suggest removing stop words if our task to be performed is one of Language Classification, Spam Filtering, Caption Generation, Auto-Tag Generation, Sentiment analysis, or something that is related to text classification. On the other hand, if our task is one of Machine Translation, Question-Answering problems, Text Summarization, Language Modeling, it’s better not to remove the stop words as they are a crucial part of these applications.
One of the first things that we ask ourselves is what are the pros and cons of any task we perform. Let’s look at some of the pros and cons of stop word removal in NLP.
Improper selection and removal of stop words can change the meaning of our text, so we have to be careful in choosing our stop words. For example: "This movie is not good." If we remove "not" in the pre-processing step, the sentence becomes "this movie is good", which reads as positive and is wrongly interpreted.
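A short illustration of that pitfall using NLTK's English stop word list (one common choice; other lists behave similarly, since negation words like "not" are typically included in them):

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))

sentence = "This movie is not good"
tokens = sentence.lower().split()

# Naive removal drops "not" and flips the apparent sentiment.
filtered = [w for w in tokens if w not in stop_words]
print(filtered)  # ['movie', 'good']

# Keeping negation words avoids the problem for sentiment-style tasks.
keep = {"not", "no", "nor"}
filtered_safe = [w for w in tokens if w not in stop_words or w in keep]
print(filtered_safe)  # ['movie', 'not', 'good']
```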
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset contains questions and answers related to injection molding, focusing on topics such as 'Materials', 'Techniques', 'Machinery', 'Troubleshooting', 'Safety', 'Design', 'Maintenance', 'Manufacturing', 'Development', and 'R&D'. The dataset is provided in CSV format with two columns: Questions and Answers.
Researchers, practitioners, and enthusiasts in the field of injection molding can utilize this dataset for tasks such as:
import pandas as pd
# Load the dataset
dataset = pd.read_csv('injection_molds_dataset.csv')
# Display the first few rows
print(dataset.head())
from datasets import load_dataset
# Load the dataset
dataset = load_dataset("mustafakeser/injection-molding-QA")
# Display dataset info
print(dataset)
# Accessing the first few examples
print(dataset['train'][:5])
#or
dataset['train'].to_pandas()
If you use this dataset in your work, please consider citing it as:
@misc{injectionmold_dataset,
author = {Your Name},
title = {Injection Molds Dataset},
year = {2024},
publisher = {Hugging Face},
journal = {Hugging Face Datasets},
howpublished = {\url{link to the dataset}},
}
https://huggingface.co/datasets/mustafakeser/injection-molding-QA mustafakeser/injection-molding-QA
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains information about different books that are sold over the Internet. It can be used for multiple NLP tasks, such as sentiment analysis, text classification, summary generation, recommendation system building, and many more.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The Walmart Customer Reviews Dataset offers a wealth of insights into consumer sentiment and product feedback related to one of the world's largest retail giants, Walmart. This dataset contains a vast collection of customer reviews, star ratings, and other relevant information that has been gathered through web scraping and data compilation.
Key Features:
- Customer Reviews: Detailed textual reviews provide firsthand accounts of shopping experiences and product satisfaction.
- Star Ratings: Each review is accompanied by a star rating, allowing for sentiment analysis and product rating assessment.
- Review Dates: The dataset includes review submission dates, facilitating temporal analysis and trend detection.
- Product Identification: For some reviews, product identification details such as SKU numbers or product categories are provided.
Use Cases:
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset contains news headlines collected from 20 major Indonesian news portals through web scraping conducted on February 23, 2025. The dataset is structured into three key components: the source of the news, the headline title, and the date of publication. By compiling headlines from multiple sources, this dataset provides a comprehensive snapshot of trending topics across different media outlets in Indonesia. It can be utilized for various analytical and research purposes, such as trending topic analysis, sentiment analysis, and natural language processing (NLP) applications. Researchers can use this dataset to track public sentiment, identify recurring themes in news coverage, and train machine learning models for text-based tasks such as classification, keyword extraction, and summarization.
With 1,174 rows and 3 columns, this dataset contains no missing values, ensuring its usability for data analysis and modeling. The three available variables are: source, which represents the name of the news portal where the headline was published; title, which contains the actual headline of the news article; and date, which indicates the publication date of each news piece. These variables make it possible to conduct media monitoring, study media bias, and compare how different news platforms report on similar topics. Additionally, the dataset is valuable for time-series analysis, allowing users to observe how news trends evolve over time.
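A brief sketch of the kind of trend analysis mentioned above; the file name is a placeholder, and the column names source, title, and date are taken from the description.

```python
import pandas as pd

# Placeholder file name -- adjust to the actual download.
df = pd.read_csv("indonesian_news_headlines.csv", parse_dates=["date"])

# Headlines per portal, e.g. for simple media-monitoring comparisons.
print(df.groupby("source")["title"].count().sort_values(ascending=False))

# Headlines per day, as a starting point for time-series trend analysis.
print(df.groupby(df["date"].dt.date).size())
```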