4 datasets found
  1. datasheet1_AVA: A Financial Service Chatbot Based on Deep Bidirectional...

    • frontiersin.figshare.com
    pdf
    Updated May 30, 2023
    Cite
    Shi Yu; Yuxin Chen; Hussain Zaidi (2023). datasheet1_AVA: A Financial Service Chatbot Based on Deep Bidirectional Transformers.pdf [Dataset]. http://doi.org/10.3389/fams.2021.604842.s001
    Available download formats: pdf
    Dataset updated
    May 30, 2023
    Dataset provided by
    Frontiers
    Authors
    Shi Yu; Yuxin Chen; Hussain Zaidi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We develop a chatbot using deep bidirectional transformer (BERT) models to handle client questions in financial investment customer service. The bot can recognize 381 intents, decide when to say "I don't know," and escalate uncertain questions to human operators. Our main novel contribution is the discussion of uncertainty measures for BERT, where three different approaches are systematically compared on real problems. We investigated two uncertainty metrics in BERT, information entropy and variance of dropout sampling, followed by mixed-integer programming to optimize decision thresholds. Another novel contribution is the use of BERT as a language model in automatic spelling correction. Inputs with accidental spelling errors can significantly decrease intent classification performance. The proposed approach combines probabilities from a masked language model with word edit distances to find the best corrections for misspelled words. The chatbot and the entire conversational AI system are developed using open-source tools and deployed within our company's intranet. The proposed approach can be useful for industries seeking similar in-house solutions in their specific business domains. We share all our code and a sample chatbot built on a public data set on GitHub.
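
    The entropy-based "I don't know" decision mentioned in the description can be illustrated with a minimal sketch. This is not the authors' code: the function names and the fixed `threshold` are assumptions made for illustration, and the mixed-integer programming step that the paper uses to tune thresholds is not reproduced. The dropout-variance metric is also left out.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def route(logits, threshold):
    """Return ("answer", intent_index) or ("escalate", None).

    `threshold` is a hypothetical, hand-picked cut-off used only here.
    """
    probs = softmax(logits)
    if entropy(probs) > threshold:
        # The predicted distribution is too flat: hand over to a human.
        return ("escalate", None)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ("answer", best)
```

    A peaked score vector such as `[10.0, 0.0, 0.0]` has near-zero entropy and is answered, while a flat one such as `[1.0, 1.0, 1.0]` has entropy ln 3 and is escalated for any threshold below that value.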

  2. DistillChat v1: Mixture of Conversations

    • kaggle.com
    Updated Dec 2, 2023
    Cite
    The Devastator (2023). DistillChat v1: Mixture of Conversations [Dataset]. https://www.kaggle.com/datasets/thedevastator/distillchat-v1-mixture-of-conversations-dataset/data
    Dataset updated
    Dec 2, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    DistillChat v1: Mixture of Conversations Dataset

    Conversational Dataset with Diverse Sources

    By fanqiwan (from Hugging Face)

    About this dataset

    The Mixture of Conversations Dataset is a collection of conversations gathered from various sources. Each conversation is represented as a list of messages, where each message is a string. This dataset provides a valuable resource for studying and analyzing conversations in different contexts.

    The conversations in this dataset are diverse, covering a wide range of topics and scenarios. They include casual chats between friends, customer support interactions, online forum discussions, and more. The dataset aims to capture the natural flow of conversation and includes both structured and unstructured dialogues.

    Each conversation entry in the dataset is associated with metadata information such as the name or identifier of the model that generated it and the corresponding dataset it belongs to. This information helps to keep track of the source and origin of each conversation.

    The train.csv file provided in this dataset specifically serves as training data for various machine learning models. It contains an assortment of conversations that can be used to train chatbot systems, dialogue generation models, sentiment analysis algorithms, or any other conversational AI application.

    Researchers, practitioners, developers, and enthusiasts can use this Mixture of Conversations Dataset to analyze patterns in human communication, explore language understanding capabilities, test dialogue strategies, or develop novel AI-powered conversational systems. Its versatility makes it useful for various NLP tasks such as text classification, intent recognition, sentiment analysis, and language modeling.

    By exploring this collection of conversational data across different domains and platforms, you can gain insights into how people communicate using textual input. The breadth and depth of the dataset provide opportunities for studies related to language understanding, recommendation systems, and other research areas involving human-computer interaction.

    How to use the dataset

    Overview of the Dataset

    The dataset consists of conversational data represented as a list of messages. Each conversation is represented as a list of strings, where each string corresponds to a message in the conversation. The dataset also includes information about the model that generated the conversations and the name or identifier of the dataset itself.

    Accessing the Dataset

    Understanding Column Information

    This dataset has several columns:

    • conversations: A list representing each conversation; each conversation is further represented as a list containing individual messages.
    • dataset: The name or identifier of the dataset that these conversations belong to.
    • model: The name or identifier of the model that generated these conversations.
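
    A minimal sketch of reading these three columns with the Python standard library follows. It assumes, without confirmation from the dataset page, that each `conversations` cell is stored as a list literal such as `['hi', 'hello']`; the sample rows and the `read_conversations` helper are hypothetical stand-ins for the real train.csv.

```python
import ast
import csv
import io

def read_conversations(handle):
    """Yield (messages, dataset, model) for each row of a train.csv-style file."""
    for row in csv.DictReader(handle):
        # Assumption: the cell holds a list literal, so literal_eval can
        # parse it safely without executing arbitrary code.
        messages = ast.literal_eval(row["conversations"])
        yield messages, row["dataset"], row["model"]

# Hypothetical two-row sample standing in for the real file.
SAMPLE = io.StringIO('''conversations,dataset,model
"['Hi', 'Hello, how can I help?']",demo_set,demo_model
"['Reset my password', 'Sure.', 'Thanks']",demo_set,demo_model
''')

rows = list(read_conversations(SAMPLE))
for messages, dataset, model in rows:
    print(dataset, model, len(messages), "messages")
```

    For the real file, pass `open("train.csv", newline="", encoding="utf-8")` instead of `SAMPLE`. If the cells turn out to be JSON rather than Python literals, `json.loads` would be the natural replacement for `ast.literal_eval`.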

    Utilizing Conversations

    To make use

    Research Ideas

    • Chatbot Training: This dataset can be used to train chatbot models by providing a diverse range of conversations for the model to learn from. The conversations can cover various topics and scenarios, helping the chatbot to generate more accurate and relevant responses.
    • Customer Support Training: The dataset can be used to train customer support models to handle different types of customer queries and provide appropriate solutions or responses. By exposing the model to a variety of conversation patterns, it can learn how to effectively address customer concerns.
    • Conversation Analysis: Researchers or linguists may use this dataset to analyze conversational patterns and language usage or to study social interactions within conversations. The dataset's mixture of conversations from different sources can provide insights into how people communicate in different settings or domains.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    License: CC0 1.0 Universal (CC0 1.0), Public Domain Dedication. No copyright: you can copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: train.csv | Column name | Description ...

  3. Multi-turn Prompts Dataset

    • kaggle.com
    Updated Oct 25, 2024
    Cite
    SoftAge.AI (2024). Multi-turn Prompts Dataset [Dataset]. https://www.kaggle.com/datasets/softageai/multi-turn-prompts-dataset
    Dataset updated
    Oct 25, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    SoftAge.AI
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset consists of 400 text-only fine-tuned versions of multi-turn conversations in the English language based on 10 categories and 19 use cases. It has been generated with ethically sourced human-in-the-loop data methods and aligned with supervised fine-tuning, direct preference optimization, and reinforcement learning from human feedback.

    The human-annotated data is focused on data quality and precision to enhance the generative response of models used for AI chatbots, thereby improving their recall memory and recognition ability for continued assistance.

    Key Features

    • Prompts focus on user intent and were devised using natural language processing techniques.
    • Multi-turn prompts with up to 5 turns, intended to enhance the responsive memory of large language models for pretraining.
    • Conversational interactions for queries related to writing, coding, knowledge assistance, data manipulation, reasoning, and classification.

    Dataset Source

    Subject matter expert annotators @SoftAgeAI have annotated the data at simple and complex levels, focusing on quality factors such as content accuracy, clarity, coherence, grammar, depth of information, and overall usefulness.

    Structure & Fields

    The dataset is organized into the following columns:

    • P1, R1, P2, R2, P3, R3, P4, R4, P5 (object): The sequence of prompts (P) and responses (R) within a single interaction. Each interaction can have up to 5 prompts and 5 corresponding responses, capturing the flow of a conversation. The prompts are user inputs, and the responses are the model's outputs.
    • Use Case (object): The primary application or scenario for which the interaction is designed, such as "Q&A helper" or "Writing assistant." This classification helps identify the purpose of the dialogue.
    • Type (object): The complexity of the interaction, with entries labeled as "Complex" in this dataset. This denotes that the dialogues involve more intricate and multi-layered exchanges.
    • Category (object): A broad category for the interaction, such as "Open-ended QA" or "Writing." This provides context on the nature of the conversation, whether it is for generating creative content, providing detailed answers, or engaging in complex problem-solving.

    Intended Use Cases

    The dataset can improve query-assistance models for shopping, coding, creative writing, travel assistance, marketing, citation, academic writing, language assistance, research topics, specialized knowledge, reasoning, and STEM-based topics. It is intended to aid generative models for e-commerce, customer assistance, marketing, education, suggestive user queries, and generic chatbots. It can be used to train large language models with supervised, annotated fine-tuning data and to support retrieval-augmented generative models. The dataset is free of violence-based interactions that can lead to harm, conflict, discrimination, brutality, or misinformation.

    Potential Limitations & Biases

    This is a static dataset, so the information is dated May 2024.
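
    The P/R column layout described under Structure & Fields can be flattened into an ordered list of chat turns. The sketch below is one way to do it, not something the dataset ships with: the `row_to_turns` helper, the "user"/"assistant" role labels, and the example record are all assumptions made for illustration.

```python
def row_to_turns(row, max_turns=5):
    """Turn one record with P1..P5 / R1..R4-style keys into ordered chat turns.

    Missing, blank, or non-string cells (for example NaN from a
    spreadsheet loader) are skipped, so shorter conversations work too.
    """
    turns = []
    for i in range(1, max_turns + 1):
        # Prompts are user inputs; responses are the model's outputs.
        for prefix, role in (("P", "user"), ("R", "assistant")):
            text = row.get(f"{prefix}{i}")
            if isinstance(text, str) and text.strip():
                turns.append({"role": role, "content": text.strip()})
    return turns

# Hypothetical record; real rows also carry Use Case, Type, and Category.
example = {
    "P1": "Suggest a title.",
    "R1": "How about 'Dawn'?",
    "P2": "Make it longer.",
    "R2": "",
    "Category": "Writing",
}
turns = row_to_turns(example)
```

    Skipping blank cells rather than raising keeps the helper usable on interactions that stop before the fifth turn.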

    Note

    If you have any questions related to our data annotation and human review services for large language model training and fine-tuning, please contact SoftAge Information Technology Limited at info@softage.ai.

  4. GCD-Government_Compliants_Dataset

    • huggingface.co
    Updated Mar 18, 2025
    Cite
    Munavvara Nayab (2025). GCD-Government_Compliants_Dataset [Dataset]. https://huggingface.co/datasets/Munavvara-17/GCD-Government_Compliants_Dataset
    Dataset updated
    Mar 18, 2025
    Authors
    Munavvara Nayab
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    📢 Government Complaint Audio Dataset (Hindi & English)

    This dataset contains bilingual audio recordings of government-related customer complaints in Hindi and English, generated using Text-to-Speech (TTS). It is designed to help in research and development of speech recognition, intent classification, sentiment analysis, and multilingual voice-based chatbots for public service platforms.

    📂 Dataset Structure

    GCD-Government_Complaints_Dataset/
    ├── english/
    │   ├──…

    See the full description on the dataset page: https://huggingface.co/datasets/Munavvara-17/GCD-Government_Compliants_Dataset.
