15 datasets found
  1. DATASET FOR VIDEO TRANSCRIPT SUMMARIZATION

    • kaggle.com
    Updated Mar 1, 2023
    Cite
    Prasad Magdum (2023). DATASET FOR VIDEO TRANSCRIPT SUMMARIZATION [Dataset]. https://www.kaggle.com/datasets/prasadmagdum/nlpproject
    Dataset updated
    Mar 1, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Prasad Magdum
    Description

    This dataset is intended for summarizing video transcripts. It contains transcripts of videos from 26 categories, e.g. Sports, Education, Medical, and News.

  2. Benchmark Dataset for Amharic Text Summarization

    • kaggle.com
    Updated Jan 1, 2024
    Cite
    Daniel Mekuriaw (2024). Benchmark Dataset for Amharic Text Summarization [Dataset]. https://www.kaggle.com/datasets/danielmekuriaw/benchmark-dataset-for-amharic-text-summarization
    Dataset updated
    Jan 1, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Daniel Mekuriaw
    Description

    Context and Inspiration
    This dataset was developed for a project focused on enhancing Amharic text summarization using parameter-efficient fine-tuning of the mT5-small model. Recognizing a gap in standardization within Amharic text summarization, a part of the project's goal was to establish a benchmark dataset to facilitate future research and evaluation, thereby advancing Amharic NLP. The dataset's creation entailed comprehensive processes of data collection, aggregation, cleaning, and preprocessing.

    Sources
    The datasets are based on the following key sources:
    [1] Amharic Abstractive Text Summarization by Amr M. Zaki, Mahmoud I. Khalil, and Hazem M. Abbas, 2020.
    [2] XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages by Tahmid Hasan et al., ACL-IJCNLP 2021.
    [3] An Amharic News Text Classification Dataset by Israel Abebe Azime and Nebil Mohammed, 2021.

    Data Collection
    The dataset compiles Amharic text from the publicly available studies listed above. These source datasets were distributed in CSV and JSONL formats.

    Data Aggregation and Cleaning
    The Amharic datasets from these sources were aggregated, ensuring uniform column names for consistency. This process formed the basis for the following steps:
    - Initial Cleaning: Removing duplicates and NaN entries, and filtering out non-Amharic texts.
    - Character-Level Normalization: Standardizing variations in the Amharic script.
    - Removal of Non-Amharic Elements: Eliminating non-Amharic characters and specific punctuation.
    - Size Optimization: Setting token count thresholds to manage entry sizes, considering computing resource limitations.
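    As an illustration of the initial cleaning step only, the following is a minimal sketch and not the author's actual code. It assumes that "non-Amharic" can be approximated by the share of letters outside the Ethiopic Unicode block (U+1200 to U+137F); the function names and the 0.5 threshold are hypothetical.

```python
import pandas as pd

# Ethiopic Unicode block (U+1200-U+137F); Amharic is written in this script.
ETHIOPIC_START, ETHIOPIC_END = 0x1200, 0x137F


def ethiopic_ratio(text: str) -> float:
    """Fraction of alphabetic characters that fall in the Ethiopic block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hits = sum(ETHIOPIC_START <= ord(ch) <= ETHIOPIC_END for ch in letters)
    return hits / len(letters)


def initial_clean(df: pd.DataFrame, min_ratio: float = 0.5) -> pd.DataFrame:
    """Drop NaN rows, duplicates, and rows that are mostly non-Ethiopic (assumed rule)."""
    df = df.dropna(subset=["text", "summary"])
    df = df.drop_duplicates(subset=["text", "summary"])
    keep = (df["text"].map(ethiopic_ratio) >= min_ratio) & (
        df["summary"].map(ethiopic_ratio) >= min_ratio
    )
    return df[keep].reset_index(drop=True)


# Toy rows: one valid pair, its duplicate, an English pair, and a row with missing text.
raw = pd.DataFrame(
    {
        "text": ["ሰላም ለዓለም", "ሰላም ለዓለም", "hello world", None],
        "summary": ["ሰላም", "ሰላም", "hello", "ሰላም"],
    }
)
cleaned = initial_clean(raw)
print(len(cleaned))
```

    Character-level normalization and the token-count thresholds are left out because the listing does not specify the mappings or limits that were used.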

    Dataset Versions
    Three versions of the dataset emerged from these processes:
    - Amharic-1: Formed after initial cleaning.
    - Amharic-2: Developed by adjusting entry lengths and establishing bounds for text and summary lengths based on fine-tuning outcomes.
    - Amharic-3: Further refined from Amharic-2 using additional cleaning steps informed by common Amharic text preprocessing practices.

    Each version of the dataset was split into training, validation, and test sets in an 80%, 10%, and 10% distribution, respectively, to support effective model training and evaluation.
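    An 80/10/10 split of this kind can be sketched with pandas alone. This is an assumed procedure (shuffle, then slice); the listing does not say how the published splits were actually drawn, and `split_80_10_10` and the seed are hypothetical.

```python
import pandas as pd


def split_80_10_10(df: pd.DataFrame, seed: int = 0):
    """Shuffle the rows, then cut them into 80% / 10% / 10% partitions."""
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_train = int(len(shuffled) * 0.8)
    n_val = int(len(shuffled) * 0.1)
    train = shuffled.iloc[:n_train]
    val = shuffled.iloc[n_train : n_train + n_val]
    test = shuffled.iloc[n_train + n_val :]  # remainder goes to the test set
    return train, val, test


# Toy frame with the two core columns named in the listing.
toy = pd.DataFrame(
    {"text": [f"t{i}" for i in range(50)], "summary": [f"s{i}" for i in range(50)]}
)
train, val, test = split_80_10_10(toy)
print(len(train), len(val), len(test))
```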

    Columns
    - text: The main body of the Amharic text.
    - summary: Condensed version or summary of the text.
    - is_non_amharic_text: Indicator if the text is non-Amharic (True/False).
    - is_non_amharic_summary: Indicator if the summary is non-Amharic (True/False).
    - text_length: Length of the text in characters.
    - summary_length: Length of the summary in characters.
    - text_word_count: Number of words in the text.
    - summary_word_count: Number of words in the summary.
    - text_token_count: Count of tokens in the text.
    - summary_token_count: Count of tokens in the summary.
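    The character and word count columns can be derived from `text` and `summary` directly. The sketch below assumes words are whitespace-separated; the token count columns are omitted because they depend on a specific tokenizer that is not reproduced here.

```python
import pandas as pd

# One toy row; the real data would come from the dataset files.
df = pd.DataFrame({"text": ["ሰላም ለዓለም ሁሉ"], "summary": ["ሰላም"]})

for col in ("text", "summary"):
    df[f"{col}_length"] = df[col].str.len()  # length in characters
    df[f"{col}_word_count"] = df[col].str.split().str.len()  # whitespace-separated words

print(df[["text_length", "summary_length", "text_word_count", "summary_word_count"]])
```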

    For an in-depth understanding of the dataset and the methodologies used, please consult the final report: Link to Final Report

    For access to the specific coding details of the modifications, refer to the following notebook: Link to Notebook.

    This dataset aims to serve as a valuable resource for researchers and practitioners focusing on low-resource languages like Amharic.

  3. Email Thread Summary Dataset

    • kaggle.com
    Updated Sep 28, 2023
    Cite
    Marawan Mamdouh (2023). Email Thread Summary Dataset [Dataset]. https://www.kaggle.com/datasets/marawanxmamdouh/email-thread-summary-dataset
    Dataset updated
    Sep 28, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Marawan Mamdouh
    Description

    Email Thread Summary Dataset

    Overview:

    The Email Thread Dataset consists of two main files: email_thread_details and email_thread_summaries. These files collectively offer a comprehensive compilation of email thread information alongside human-generated summaries.

    Email Thread Details:

    Description:

    The email_thread_details file provides a detailed perspective on individual email threads, encompassing crucial information such as subject, timestamp, sender, recipients, and the content of the email.

    Columns:

    • thread_id: A unique identifier for each email thread.
    • subject: Subject of the email thread.
    • timestamp: Timestamp indicating when the message was sent.
    • from: Sender of the email.
    • to: List of recipients of the email.
    • body: Content of the email message.

    Additional Information:

    The "to" column is available in both CSV and Pickle (pkl) formats, facilitating convenient access to recipient information as a column of lists of strings.
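    The difference matters because a CSV round trip turns list cells into strings. The sketch below uses a toy frame in place of email_thread_details.csv and assumes the CSV stores each recipient list as a Python-style literal (which is what pandas writes by default); whether the published CSV follows that convention is not confirmed by the listing.

```python
import ast
import io

import pandas as pd

# Toy stand-in for email_thread_details.csv; the real file is not read here.
details = pd.DataFrame(
    {
        "thread_id": [1, 1],
        "from": ["Alice", "Bob"],
        "to": [["Bob", "Carol"], ["Alice"]],
    }
)

csv_text = details.to_csv(index=False)

# Without a converter, the "to" column comes back as plain strings.
as_text = pd.read_csv(io.StringIO(csv_text))

# ast.literal_eval turns "['Bob', 'Carol']" back into a Python list.
as_lists = pd.read_csv(io.StringIO(csv_text), converters={"to": ast.literal_eval})

print(type(as_text.loc[0, "to"]).__name__, type(as_lists.loc[0, "to"]).__name__)
```

    With the Pickle files, `pd.read_pickle` would return the lists directly, but pickles should only be loaded from sources you trust.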

    Email Thread Summaries:

    Description:

    The email_thread_summaries file contains concise summaries crafted by human annotators for each email thread, offering a high-level overview of the content.

    Columns:

    • thread_id: A unique identifier for each email thread.
    • summary: A concise summary of the email thread.
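    Because both files share `thread_id`, thread-level (document, summary) pairs for summarization can be built with a group-and-merge. This is a sketch on toy data; joining message bodies with newlines in timestamp order is an assumed choice, not something the dataset prescribes.

```python
import pandas as pd

# Toy rows using the column names listed above.
details = pd.DataFrame(
    {
        "thread_id": [1, 1, 2],
        "timestamp": [2000, 1000, 3000],
        "body": ["Reply", "Hi", "Solo message"],
    }
)
summaries = pd.DataFrame(
    {"thread_id": [1, 2], "summary": ["A greeting and a reply.", "One message."]}
)

# Order messages within each thread, then concatenate their bodies.
threads = (
    details.sort_values(["thread_id", "timestamp"])
    .groupby("thread_id")["body"]
    .agg("\n".join)
    .reset_index()
)

# One row per thread: the full thread text next to its human-written summary.
pairs = threads.merge(summaries, on="thread_id", how="inner", validate="one_to_one")
print(pairs)
```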

    Dataset Structure:

    The dataset is organized into threads and emails. There are a total of 4,167 threads and 21,684 emails, providing a rich source of information for analysis and research purposes.

    • Threads: 4,167 threads
    • Emails: 21,684 emails

    Language:

    • Languages: English (en)

    Use Cases:

    1. Natural Language Processing (NLP) Research:
      • Analyze email thread contents and human-generated summaries for advancements in NLP tasks.
    2. Text Summarization Models:
      • Train and evaluate text summarization models using the provided email threads and summaries.
    3. Email Analytics:
      • Gain insights into communication patterns, sender-receiver relationships, and content analysis.

    File Formats:

    • CSV Files:
      • Easily importable into various data analysis tools.
    • Pickle (pkl) Files:
      • Facilitates direct reading of the "to" column as a column of lists of strings.
    • JSON Files:
      • Offers compatibility with JSON data structures, providing an additional option for users who prefer or require this widely-used format in their analytical workflows.
      • JSON File Features Description

        [
          {
            "thread_id": [unique identifier],
            "subject": "[email thread subject]",
            "timestamp": [timestamp in milliseconds],
            "from": "[sender's name and identifier]",
            "to": [
              "[recipient 1]",
              "[recipient 2]",
              "[recipient 3]",
              ...
            ],
            "body": "[email content]"
          },
          ...
        ]
        
        [
          {
            "thread_id": [unique identifier],
            "summary": "[summary content]"
          },
          ...
        ]
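    A record shaped like the first template can be parsed as follows. The sketch assumes the millisecond timestamps count from the Unix epoch, which the description does not state explicitly; the sample record is invented for illustration.

```python
import json

import pandas as pd

# Invented record following the email_thread_details.json template.
raw = """[
  {
    "thread_id": 1,
    "subject": "Lunch",
    "timestamp": 1000000000000,
    "from": "Alice",
    "to": ["Bob"],
    "body": "Noon?"
  }
]"""

emails = pd.DataFrame(json.loads(raw))

# Interpret the integer as milliseconds since the Unix epoch (assumption).
emails["sent_at"] = pd.to_datetime(emails["timestamp"], unit="ms", utc=True)

print(emails.loc[0, "sent_at"])
```

    Unlike the CSV route, the `to` field arrives as a real list here without any extra conversion.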
        

    Files Structure:

    - Dataset
     ├── CSV
     │  ├── email_thread_details.csv
     │  └── email_thread_summaries.csv
     ├── Pickle
     │  ├── email_thread_details.pkl
     │  └── email_thread_summaries.pkl
     └── JSON
       ├── email_thread_details.json
       └── email_thread_summaries.json
    

    License:

    This dataset is provided under the MIT License.

    Disclaimer:

    The dataset has been anonymized and sanitized to ensure privacy and confidentiality.

  4. summary_dataset_en

    • huggingface.co
    Updated Mar 23, 2009
    Cite
    Radim Közl (2009). summary_dataset_en [Dataset]. https://huggingface.co/datasets/KRadim/summary_dataset_en
    Dataset updated
    Mar 23, 2009
    Authors
    Radim Közl
    License

    https://choosealicense.com/licenses/pddl/

    Description

    Content

    This dataset was created from three datasets:

    • BBC News Summary
    • CNN-DailyMail News Text Summarization
    • Generated text from LLM models

    These were then combined into a Kaggle dataset:

    Text for summarize NLP/LLM task (En)

    The dataset was filtered, shuffled, and divided into parts before being saved to Hugging Face.

    Acknowledgements

    The dataset was created with the support of resources from SV metal spol. s r.o.

  5. Bengali News Summarization Dataset

    • kaggle.com
    Updated Sep 30, 2020
    Cite
    PrithwirajSust (2020). Bengali News Summarization Dataset [Dataset]. https://www.kaggle.com/prithwirajsust/bengali-news-summarization-dataset/code
    Dataset updated
    Sep 30, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    PrithwirajSust
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    BANSData: A Dataset for Bengali Abstractive News Summarization

    This work was accepted at TCCE-2020. The paper is available in the Springer AISC proceedings (https://doi.org/10.1007/978-981-33-4673-4_4) and on arXiv (https://arxiv.org/pdf/2012.01747.pdf).

    Abstract

    News and text summarization has become very popular in the NLP field. Both extractive and abstractive approaches to summarization have been implemented in different languages. A significant amount of data is a primary need for any summarization system, but only a few datasets are available for the Bengali language. Our dataset is made for Bengali Abstractive News Summarization (BANS). Because abstractive summarization is largely neural network-based, it needs large amounts of data to perform well. We therefore built a standard Bengali abstractive summarization dataset by crawling the online Bengali news portal bangla.bdnews24.com. We crawled more than 19k articles and summaries and standardized the data.

    List of files:
    1. article.txt
    2. summary.txt

    Dataset Description

    Description                              Data Info.
    Total no. of articles                    19096
    Total no. of summaries                   19096
    Maximum no. of words in an article       76
    Maximum no. of words in a summary        12
    Minimum no. of words in an article       5
    Minimum no. of words in a summary        3

    Acknowledgement

    We would like to thank Shahjalal University of Science and Technology (SUST) research center and SUST NLP research group for their support.

    Bibtex for Citation

    @InProceedings{10.1007/978-981-33-4673-4_4,
     author="Bhattacharjee, Prithwiraj and Mallick, Avi and Saiful Islam, Md. and Marium-E-Jannat",
     editor="Kaiser, M. Shamim and Bandyopadhyay, Anirban and Mahmud, Mufti and Ray, Kanad",
     title="Bengali Abstractive News Summarization (BANS): A Neural Attention Approach",
     booktitle="Proceedings of International Conference on Trends in Computational and Cognitive Engineering",
     year="2021",
     publisher="Springer Singapore",
     address="Singapore",
     pages="41--51",
     abstract="Abstractive summarization is the process of generating novel sentences based on the information extracted from the original text document while retaining the context. Due to abstractive summarization's underlying complexities, most of the past research work has been done on the extractive summarization approach. Nevertheless, with the triumph of the sequence-to-sequence (seq2seq) model, abstractive summarization becomes more viable. Although a significant number of notable research has been done in the English language based on abstractive summarization, only a couple of works have been done on Bengali abstractive news summarization (BANS). In this article, we presented a seq2seq based Long Short-Term Memory (LSTM) network model with attention at encoder-decoder. Our proposed system deploys a local attention-based model that produces a long sequence of words with lucid and human-like generated sentences with noteworthy information of the original document. We also prepared a dataset of more than 19 k articles and corresponding human-written summaries collected from bangla.bdnews24.com (https://bangla.bdnews24.com/) which is till now the most extensive dataset for Bengali news document summarization and publicly published in Kaggle (https://www.kaggle.com/prithwirajsust/bengali-news-summarization-dataset) We evaluated our model qualitatively and quantitatively and compared it with other published results. It showed significant improvement in terms of human evaluation scores with state-of-the-art approaches for BANS.",
     isbn="978-981-33-4673-4"
    }

  6. Legal Text Classification Dataset

    • kaggle.com
    Updated Oct 17, 2023
    Cite
    A.Mohan kumar (2023). Legal Text Classification Dataset [Dataset]. https://www.kaggle.com/datasets/amohankumar/legal-text-classification-dataset
    Dataset updated
    Oct 17, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    A.Mohan kumar
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    The dataset contains a total of 25,000 legal cases in the form of text documents. Each document has been annotated with catchphrases, citation sentences, citation catchphrases, and citation classes. Citation classes indicate the type of treatment given to the cases cited by the present case.

  7. g3_sum

    • kaggle.com
    Updated May 25, 2024
    Cite
    Messam Naqvi (2024). g3_sum [Dataset]. https://www.kaggle.com/datasets/messamnaqvi/g3-sum
    Dataset updated
    May 25, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Messam Naqvi
    Description

    Context: The g3-sum dataset was created for developing and evaluating models for abstractive summarization of Urdu talk show scripts. It provides a structured dataset to facilitate research and application in natural language processing (NLP) for the Urdu language.

    Sources: The training data is collected from YouTube scripts of Urdu talk shows, capturing dialogues and discussions from various programs. The dataset includes both scripts and their corresponding human-written summaries.

    Inspiration: The inspiration behind this dataset is to improve the accessibility and understanding of lengthy Urdu talk show content by creating concise and accurate summaries. This project aims to enhance the usability of Urdu text data for both researchers and end-users by leveraging advanced summarization techniques.

  8. Human Written Text

    • kaggle.com
    Updated May 13, 2025
    Cite
    Youssef Elebiary (2025). Human Written Text [Dataset]. https://www.kaggle.com/datasets/youssefelebiary/human-written-text
    Dataset updated
    May 13, 2025
    Dataset provided by
    Kaggle
    Authors
    Youssef Elebiary
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Overview

    This dataset contains 20,000 pieces of text collected from Wikipedia, Gutenberg, and CNN/DailyMail. The text was cleaned by replacing symbols such as (.*?/) with white space using automated scripts and regular expressions.

    Data Source Distribution

    1. 10,000 Wikipedia Articles: From the 20220301 dump.
    2. 3,000 Gutenberg Books: Via the GutenDex API.
    3. 7,000 CNN/DailyMail News Articles: From the CNN/DailyMail 3.0.0 dataset.

    Why These Sources

    The data was collected from these sources to ensure the highest level of integrity against AI-generated text.
    - Wikipedia: The 20220301 dump was chosen to minimize the chance of including articles generated or heavily edited by AI.
    - Gutenberg: Books from this source are guaranteed to be written by real humans and span various genres and time periods.
    - CNN/DailyMail: These news articles were written by professional journalists and cover a variety of topics, ensuring diversity in writing style and subject matter.

    Dataset Structure

    The dataset consists of 5 CSV files:
    1. CNN_DailyMail.csv: Contains all processed news articles.
    2. Gutenberg.csv: Contains all processed books.
    3. Wikipedia.csv: Contains all processed Wikipedia articles.
    4. Human.csv: Combines all three datasets in order.
    5. Shuffled_Human.csv: The randomly shuffled version of Human.csv.

    Each file has 2 columns:
    - Title: The title of the item.
    - Text: The content of the item.
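    The relationship between the per-source files and the two combined files can be sketched with pandas. The toy frames stand in for the real CSVs, and both the concatenation order and the random seed are assumptions; the listing only says the sources are combined "in order" and then shuffled.

```python
import pandas as pd

# Toy stand-ins for CNN_DailyMail.csv, Gutenberg.csv and Wikipedia.csv.
cnn = pd.DataFrame({"Title": ["news-1"], "Text": ["A news article."]})
gutenberg = pd.DataFrame({"Title": ["book-1"], "Text": ["A public-domain book."]})
wikipedia = pd.DataFrame({"Title": ["wiki-1"], "Text": ["An encyclopedia entry."]})

# Human.csv equivalent: the three sources stacked one after another.
human = pd.concat([cnn, gutenberg, wikipedia], ignore_index=True)

# Shuffled_Human.csv equivalent: same rows in random order (seed is arbitrary).
shuffled_human = human.sample(frac=1.0, random_state=42).reset_index(drop=True)

print(len(human), len(shuffled_human))
```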

    Uses

    This dataset is suitable for a wide range of NLP tasks, including:
    - Training models to distinguish between human-written and AI-generated text (Human/AI classifiers).
    - Training LSTMs or Transformers for chatbots, summarization, or topic modeling.
    - Sentiment analysis, genre classification, or linguistic research.

    Disclaimer

    Although the data was collected from these sources, it may not be entirely free of AI-generated text. Wikipedia articles may reflect systemic biases in contributor demographics. CNN/DailyMail articles may focus on specific news topics or regions.

    For details on how the dataset was created, click here to view the Kaggle notebook used.

    Licensing

    This dataset is published under the MIT License, allowing free use for both personal and commercial purposes. Attribution is encouraged but not required.

  9. TinyStories

    • kaggle.com
    • opendatalab.com
    • +1more
    Updated Nov 24, 2023
    Cite
    The Devastator (2023). TinyStories [Dataset]. https://www.kaggle.com/datasets/thedevastator/tinystories-narrative-classification
    Dataset updated
    Nov 24, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    TinyStories

    A Diverse, Richly Annotated Corpus of Short-Form Stories

    By Huggingface Hub [source]

    About this dataset

    This dataset contains the text of a collection of short stories known as the TinyStories Corpus. With over 2,000 annotated stories, it covers a range of styles and genres from multiple sources. The corpus is described as annotated at the narrative level, making it a resource for narrative text classification. The text field in each row holds the entire story, which can be used to identify plots, characters, and other features of storytelling technique. The collection gives users a broad view of narratives that can be used to train machine learning models for narrative text classification.

    How to use the dataset

    In this dataset, each row contains a short story for narrative text classification tasks. The data consists of the following column and files:
    - text: The story text itself (string)
    - validation.csv: Contains a set of short stories for validation (dataframe)
    - train.csv: Contains the text of short stories used for narrative text classification (dataframe)

    The data contained in both files can be used for various types of machine learning tasks related to narrative text classification. These include but are not limited to experiments such as determining story genres, predicting user reactions, sentiment analysis etc.

    To get started, download the validation and train CSV files from the Kaggle dataset page and save them locally. Before any analysis, you may need to clean both files by removing malformed values and duplicate entries, since these can noticeably affect the accuracy of your results.

    Next, load the two files into pandas DataFrames so they can be manipulated and analyzed with common Natural Language Processing (NLP) tooling. This takes a few lines using pandas functions such as read_csv() and concat() (DataFrame.append() was removed in pandas 2.0), whether you work in a Jupyter Notebook or with a machine learning framework such as scikit-learn for anything more complex than simple NLP operations.
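    The loading and cleaning steps described above can be sketched as follows. In-memory strings stand in for train.csv and validation.csv so the example runs without the download; `load_clean` and the added `split` column are hypothetical helpers, not part of the dataset.

```python
import io

import pandas as pd

# In-memory stand-ins for train.csv and validation.csv (single "text" column).
train_csv = io.StringIO("text\nOnce upon a time.\nOnce upon a time.\nThe end.\n")
validation_csv = io.StringIO('text\nA cat sat.\n""\n')


def load_clean(source) -> pd.DataFrame:
    """Read a CSV, then drop rows with missing text and exact duplicate stories."""
    df = pd.read_csv(source)
    df = df.dropna(subset=["text"]).drop_duplicates(subset=["text"])
    return df.reset_index(drop=True)


train = load_clean(train_csv)
validation = load_clean(validation_csv)

# Stack both splits into one frame, remembering where each row came from.
combined = pd.concat(
    [train.assign(split="train"), validation.assign(split="validation")],
    ignore_index=True,
)
print(combined["split"].value_counts())
```

    With the real files, the `io.StringIO` objects would be replaced by the paths of the downloaded CSVs.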

    With the data loaded, you can move on to your application, for example exploring connections between narratives or character traits with supervised machine learning models such as a Naive Bayes classifier, which may reveal patterns across the texts.

    Research Ideas

    • Creating a text classification algorithm to automatically categorize short stories by genre.
    • Developing an AI-based summarization tool to quickly summarize the main points in a story.
    • Developing an AI-based story generator that can generate new stories based on existing ones in the dataset

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0), Public Domain Dedication. No Copyright: you can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: validation.csv

    | Column name | Description                     |
    |:------------|:--------------------------------|
    | text        | The text of the story. (String) |

    File: train.csv | Column name | Description | |:--------------|:----------------------------...

  10. AirBNB reviews Dataset

    • kaggle.com
    Updated Jan 11, 2023
    Cite
    Muhammad Ahmed Ansari (2023). AirBNB reviews Dataset [Dataset]. https://www.kaggle.com/datasets/muhammadahmedansari/airbnb-dataset
    Dataset updated
    Jan 11, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Muhammad Ahmed Ansari
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Summary review data and listing IDs (to facilitate time-based analytics and visualizations linked to a listing). This dataset can be used for NLP use cases such as exploratory data analysis (EDA), text summarization, sentiment analysis, intent analysis, and more.

  11. Stop words in 28 languages

    • kaggle.com
    Updated Sep 30, 2020
    Cite
    Heeral Dedhia (2020). Stop words in 28 languages [Dataset]. https://www.kaggle.com/heeraldedhia/stop-words-in-28-languages/activity
    Dataset updated
    Sep 30, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Heeral Dedhia
    License

    http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    STOPWORDS

    Stopwords are words in a language that do not add much meaning to a sentence. They can safely be ignored without sacrificing the meaning of the sentence. For some search engines, these are some of the most common, short function words, such as the, is, at, which, and on. In this case, stop words can cause problems when searching for phrases that include them, particularly in names such as “The Who” or “Take That”.

    When to remove stop words?

    If the task is text classification or sentiment analysis, we should remove stop words because they provide little information to the model, i.e. we keep unwanted words out of the corpus. If the task is language translation, however, stop words are useful, as they have to be translated along with the other words.

    There is no hard and fast rule on when to remove stop words. But I would suggest removing stop words if our task to be performed is one of Language Classification, Spam Filtering, Caption Generation, Auto-Tag Generation, Sentiment analysis, or something that is related to text classification. On the other hand, if our task is one of Machine Translation, Question-Answering problems, Text Summarization, Language Modeling, it’s better not to remove the stop words as they are a crucial part of these applications.

    Pros and Cons:

    One of the first things that we ask ourselves is what are the pros and cons of any task we perform. Let’s look at some of the pros and cons of stop word removal in NLP.

    Pros:

    • Stop words are often removed from the text before training deep learning and machine learning models since stop words occur in abundance, hence providing little to no unique information that can be used for classification or clustering.
    • On removing stopwords, dataset size decreases, and the time to train the model also decreases without a huge impact on the accuracy of the model.
    • Stopword removal can potentially help in improving performance, as there are fewer and only significant tokens left. Thus, the classification accuracy could be improved.

    Cons:

    Improper selection and removal of stop words can change the meaning of the text, so stop words must be chosen carefully. Example: “This movie is not good.” If “not” is removed during pre-processing, the remaining sentence (“this movie is good”) reads as positive, which is the wrong interpretation.
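    The effect on the example sentence can be reproduced in a few lines of plain Python. The two stop word sets below are tiny hypothetical lists made up for this illustration; they are not taken from the dataset's files.

```python
# Hypothetical stop word lists: one that includes the negation, one that keeps it.
STOP_WORDS_NAIVE = {"this", "is", "not", "a", "the"}
STOP_WORDS_SAFE = STOP_WORDS_NAIVE - {"not"}


def remove_stop_words(sentence: str, stop_words: set) -> str:
    """Lowercase, strip full stops, split on whitespace, and drop stop words."""
    tokens = sentence.lower().replace(".", "").split()
    return " ".join(token for token in tokens if token not in stop_words)


sentence = "This movie is not good."
print(remove_stop_words(sentence, STOP_WORDS_NAIVE))  # movie good
print(remove_stop_words(sentence, STOP_WORDS_SAFE))   # movie not good
```

    Keeping negations out of the stop word list preserves the polarity that a sentiment model depends on.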

    Available languages

    • Arabic
    • Bulgarian
    • Catalan
    • Czech
    • Danish
    • Dutch
    • English
    • Finnish
    • French
    • German
    • Gujarati
    • Hindi
    • Hebrew
    • Hungarian
    • Indonesian
    • Malaysian
    • Italian
    • Norwegian
    • Polish
    • Portuguese
    • Romanian
    • Russian
    • Slovak
    • Spanish
    • Swedish
    • Turkish
    • Ukrainian
    • Vietnamese

  12. injection-molding-QA

    • kaggle.com
    • huggingface.co
    Updated Apr 2, 2024
    Cite
    Mustafa Keser (2024). injection-molding-QA [Dataset]. https://www.kaggle.com/datasets/mustafakeser4/injection-molding-qa
    Dataset updated
    Apr 2, 2024
    Dataset provided by
    Kaggle
    Authors
    Mustafa Keser
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    injection-molding-QA

    Description

    This dataset contains questions and answers related to injection molding, focusing on topics such as 'Materials', 'Techniques', 'Machinery', 'Troubleshooting', 'Safety', 'Design', 'Maintenance', 'Manufacturing', 'Development', and 'R&D'. The dataset is provided in CSV format with two columns: Questions and Answers.

    Usage

    Researchers, practitioners, and enthusiasts in the field of injection molding can utilize this dataset for tasks such as:

    • Natural Language Processing (NLP) tasks such as question answering, text generation, and summarization.
    • Training and evaluation of machine learning models for understanding and generating responses related to injection molding.

    Example pandas

    import pandas as pd
    
    # Load the dataset
    dataset = pd.read_csv('injection_molds_dataset.csv')
    
    # Display the first few rows
    print(dataset.head())
    

    Example datasets

    from datasets import load_dataset
    
    # Load the dataset
    dataset = load_dataset("mustafakeser/injection-molding-QA")
    
    # Display dataset info
    print(dataset)
    
    # Accessing the first few examples
    print(dataset['train'][:5])
    # or convert the split to a pandas DataFrame
    dataset['train'].to_pandas()
    
    

    Columns

    1. Questions: Contains questions related to injection molding.
    2. Answers: Provides detailed answers to the corresponding questions.

    Citation

    If you use this dataset in your work, please consider citing it as:

    @misc{injectionmold_dataset,
     author = {Your Name},
     title = {Injection Molds Dataset},
     year = {2024},
     publisher = {Hugging Face},
     journal = {Hugging Face Datasets},
     howpublished = {\url{link to the dataset}},
    }
    

    Huggingface

    https://huggingface.co/datasets/mustafakeser/injection-molding-QA

    Notes

    This dataset was curated with gemini-1.0-pro.

    license: apache-2.0

  13. Data from: Books Information Dataset

    • kaggle.com
    Updated Aug 26, 2024
    Pranav Jadhav (2024). Books Information Dataset [Dataset]. https://www.kaggle.com/datasets/jadhavpranav/books-information-dataset
    Dataset updated
    Aug 26, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Pranav Jadhav
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains information about books sold over the Internet. It can be used for NLP tasks such as sentiment analysis, text classification, summary generation, and building recommendation systems.

  14. Walmart Reviews Dataset

    • kaggle.com
    Updated Sep 16, 2023
    Harshal H (2023). Walmart Reviews Dataset [Dataset]. https://www.kaggle.com/harshalhonde/walmart-reviews-dataset/discussion
    Dataset updated
    Sep 16, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Harshal H
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
    License information was derived automatically

    Description

    The Walmart Customer Reviews Dataset offers a wealth of insights into consumer sentiment and product feedback related to one of the world's largest retail giants, Walmart. This dataset contains a vast collection of customer reviews, star ratings, and other relevant information that has been gathered through web scraping and data compilation.

    Key Features:

    • Customer Reviews: Detailed textual reviews provide firsthand accounts of shopping experiences and product satisfaction.
    • Star Ratings: Each review is accompanied by a star rating, allowing for sentiment analysis and product rating assessment.
    • Review Dates: The dataset includes review submission dates, facilitating temporal analysis and trend detection.
    • Product Identification: For some reviews, product identification details such as SKU numbers or product categories are provided.

    Use Cases:

    • Sentiment Analysis: Researchers and data analysts can perform sentiment analysis to understand customer sentiment toward specific products, categories, or Walmart as a whole.
    • Product Quality Assessment: Assess the quality and performance of Walmart products based on customer feedback.
    • Market Research: Gain insights into consumer preferences and trends in the retail industry.
    • NLP Applications: Utilize the textual reviews for natural language processing (NLP) tasks such as text classification, summarization, and topic modelling.
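
    One simple way to bootstrap the sentiment analysis use case is to derive coarse labels from the star ratings. The sketch below assumes a 1 to 5 star scale and uses a common heuristic (4 and above positive, 2 and below negative, otherwise neutral); the thresholds, the field names `review` and `rating`, and the sample rows are all assumptions for illustration and are not taken from the dataset itself.

```python
def rating_to_sentiment(stars):
    """Map a 1-5 star rating to a coarse sentiment label (heuristic thresholds)."""
    if not 1 <= stars <= 5:
        raise ValueError(f"star rating out of range: {stars}")
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    return "neutral"


# Hypothetical placeholder reviews; the field names are assumptions.
reviews = [
    {"review": "Placeholder review text A", "rating": 5},
    {"review": "Placeholder review text B", "rating": 3},
    {"review": "Placeholder review text C", "rating": 1},
]

# Attach a derived sentiment label to every review.
labeled = [{**r, "sentiment": rating_to_sentiment(r["rating"])} for r in reviews]
for row in labeled:
    print(row["rating"], row["sentiment"])
```

    Labels derived this way are only a proxy for the sentiment expressed in the text, so they are best treated as weak supervision.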
  15. Indonesia News Portal Headlines Dataset

    • kaggle.com
    Updated Feb 23, 2025
    Mayesq Prameswari (2025). Indonesia News Portal Headlines Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/10831968
    Dataset updated
    Feb 23, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mayesq Prameswari
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
    License information was derived automatically

    Description

    This dataset contains news headlines collected from 20 major Indonesian news portals through web scraping conducted on February 23, 2025. The dataset is structured into three key components: the source of the news, the headline title, and the date of publication. By compiling headlines from multiple sources, this dataset provides a comprehensive snapshot of trending topics across different media outlets in Indonesia. It can be utilized for various analytical and research purposes, such as trending topic analysis, sentiment analysis, and natural language processing (NLP) applications. Researchers can use this dataset to track public sentiment, identify recurring themes in news coverage, and train machine learning models for text-based tasks such as classification, keyword extraction, and summarization.

    With 1,174 rows and 3 columns, this dataset contains no missing values, ensuring its usability for data analysis and modeling. The three available variables are: source, which represents the name of the news portal where the headline was published; title, which contains the actual headline of the news article; and date, which indicates the publication date of each news piece. These variables make it possible to conduct media monitoring, study media bias, and compare how different news platforms report on similar topics. Additionally, the dataset is valuable for time-series analysis, allowing users to observe how news trends evolve over time.
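
    The media-monitoring and time-series uses described above can start from simple counts per source and per date. The sketch below is a minimal example using the three column names given in the description (source, title, date); the sample rows, the portal names, the ISO date format, and the helper `count_by` are hypothetical placeholders.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for the real file. Only the column names
# (source, title, date) come from the dataset description; the date
# format shown here is an assumption.
SAMPLE_CSV = """source,title,date
Portal A,Placeholder headline one,2025-02-23
Portal B,Placeholder headline two,2025-02-23
Portal A,Placeholder headline three,2025-02-22
"""


def count_by(csv_text, column):
    """Count how many headlines share each value of the given column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[column] for row in reader)


per_source = count_by(SAMPLE_CSV, "source")  # headlines per news portal
per_date = count_by(SAMPLE_CSV, "date")      # headlines per publication date

print(per_source.most_common())
print(sorted(per_date.items()))
```

    Comparing the per-source counts gives a first view of how coverage is distributed across portals, and the per-date counts are a starting point for trend analysis.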

