31 datasets found
  1. Pull Request Review Comments Dataset

    • zenodo.org
    application/gzip, bin
    Updated Apr 24, 2025
    Cite
    Akshay Sinha; Akshay Sinha (2025). Pull Request Review Comments Dataset [Dataset]. http://doi.org/10.5281/zenodo.4773068
    Explore at:
Available download formats: application/gzip, bin
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Akshay Sinha; Akshay Sinha
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Pull Request Review Comments (PRRC) Datasets

    Two datasets have been created from the gharchive website. The Pull Request Review Comment Event was selected from the set of available GitHub events. This dataset has been created for CARA: Chatbot for Automating Repairnator Actions as part of a master's thesis at KTH, Stockholm.

    First, a source dataset was downloaded from gharchive. It ranges from January 2015 to December 2019, consists of 37,358,242 PRRCs, and is over 12 gigabytes in size. It took over 100 hours to download all the data files and extract the PRRCs from them. From this source dataset, two subsets were derived:

    1. Pull Request Review Comments Dataset: This is the dataset of the comments from the first 100,000 threads in the source dataset from gharchive.
    2. Pull Request Review Threads Dataset: This is the dataset in which comments from the same thread were concatenated together.

    Description

    The dataset is stored in the JSONLines format, as was the source dataset from gharchive.

    For PRRC events, the source dataset contains the fields `comment_id`, `commit_id`, `url`, `author`, `created_at`, and `body`.

    • `comment_id` is the field which specifies the ID GitHub uses for that comment.
    • `commit_id` is the field which specifies the ID of the commit proposed in the pull request.
    • `url` is the field which specifies the URL of the comment in a pull request thread.
    • `author` is the field which lists the username of the author of the pull request.
    • `created_at` is the field which specifies the time at which the pull request comment was created.
    • `body` is the field which describes the contents of the PRRC.

    The threads dataset contains the fields `url` and `body`, which contain similar information as described above. However, the body field differs: it is a concatenation of all the PRRCs in a pull request thread. The comments dataset contains the fields `comment_id`, `commit_id`, `url`, `author`, `created_at`, and `body`. These are the same fields as in the source dataset.
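
    Since both subsets are JSONLines files, a record can be inspected with a few lines of Python. A minimal sketch, assuming the comments file is a gzipped JSONL named prrc_comments.jsonl.gz (a hypothetical name; the actual archive name on Zenodo may differ):

    import gzip
    import json

    # Hypothetical file name; check the actual archive name on Zenodo.
    with gzip.open("prrc_comments.jsonl.gz", "rt", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            # Each record carries the six fields described above.
            print(record["comment_id"], record["author"], record["created_at"])
            print(record["url"])
            print(record["body"][:80])  # first 80 characters of the comment
            break  # inspect only the first record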

    Construction

    We used the fastText language-identification model published by Facebook to detect the language of each PRRC. Only PRRCs in English were preserved. We also removed any PRRC or thread whose size exceeded 128 kilobytes.
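
    A sketch of the kind of filter this describes, assuming fastText's published lid.176.bin language-identification model; this is an illustration, not the thesis's actual code:

    import fasttext

    # lid.176.bin is the language-identification model published by Facebook:
    # https://fasttext.cc/docs/en/language-identification.html
    model = fasttext.load_model("lid.176.bin")

    MAX_SIZE = 128 * 1024  # the 128-kilobyte cap described above

    def keep(body: str) -> bool:
        if len(body.encode("utf-8")) > MAX_SIZE:
            return False
        # fastText expects a single line of text
        labels, _ = model.predict(body.replace("\n", " "))
        return labels[0] == "__label__en"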

  2. baize_chatbot

    • huggingface.co
    Cite
    Mining Tasky Data, baize_chatbot [Dataset]. https://huggingface.co/datasets/taskydata/baize_chatbot
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset authored and provided by
    Mining Tasky Data
    Description
  3. ai-medical-chatbot

    • huggingface.co
    Updated Feb 16, 2024
    Cite
    Ruslan Magana Vsevolodovna (2024). ai-medical-chatbot [Dataset]. https://huggingface.co/datasets/ruslanmv/ai-medical-chatbot
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 16, 2024
    Authors
    Ruslan Magana Vsevolodovna
    Description

    AI Medical Chatbot Dataset

    This is an experimental dataset designed to run a medical chatbot. It contains at least 250k dialogues between a patient and a doctor.

    Playground chatbot: ruslanmv/AI-Medical-Chatbot. For further information, visit the project here: https://github.com/ruslanmv/ai-medical-chatbot

  4. WHO data: frequently asked questions about COVID-19

    • uvaauas.figshare.com
    Updated Sep 12, 2023
    Cite
    Emma E. Schreurs (2023). WHO data: frequently asked questions about COVID-19 [Dataset]. http://doi.org/10.21942/uva.17129309.v1
    Explore at:
    Dataset updated
    Sep 12, 2023
    Dataset provided by
    University of Amsterdam / Amsterdam University of Applied Sciences
    Authors
    Emma E. Schreurs
    License

    https://rdx.lab.surf.nl/license

    Description

    In connection with these unfortunate corona times, I have composed a relevant dataset, namely one from the World Health Organization (WHO). There is a page on the WHO website which provides answers to frequently asked questions about the coronavirus. This is exactly the type of data which is suitable for a chatbot. The dataset we collected contains 86 possible answers on various topics regarding the coronavirus (including “What is a coronavirus” and “What are the symptoms of COVID-19”). The relevant dataset and code are available at the corresponding GitHub page.

    • The link to this GitHub page can be found under references.
    • The link to the blog about the chatbot based on this dataset can also be found under references.

  5. Chatbot Dataset Topical Chat

    • kaggle.com
    Updated Dec 20, 2020
    Cite
    Arnav Sharma AS (2020). Chatbot Dataset Topical Chat [Dataset]. https://www.kaggle.com/arnavsharmaas/chatbot-dataset-topical-chat/metadata
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 20, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Arnav Sharma AS
    Description

    This is a Topical Chat dataset from Amazon! It consists of over 8000 conversations and over 184000 messages!

    Within each message, there is a conversation id, which indicates which conversation the message takes place in. Each message is either the start of a conversation or a reply to the previous message. There is also a sentiment, which represents the emotion of the person who sent the message. There are 8 sentiments: Angry, Curious to Dive Deeper, Disgusted, Fearful, Happy, Sad, Surprised, and Neutral.

    This dataset can be used in machine learning to simulate a conversation or to make a chatbot. It can also be used for data visualization; for example, you could visualize word usage across the different emotions.
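
    As a minimal sketch of such a load for visualization, assuming the CSV is named topical_chat.csv and has per-message conversation id, message, and sentiment columns (the exact file and column names are assumptions):

    import pandas as pd

    # Column names are assumptions based on the description above.
    df = pd.read_csv("topical_chat.csv")
    counts = df.groupby("sentiment")["message"].count()
    print(counts.sort_values(ascending=False))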

    PS: If you cannot download the dataset, download it from here: https://docs.google.com/spreadsheets/d/1dFdlvgmyXfN3SriVn5Byv_BNtyroICxdgrQKBzuMA1U/edit?usp=sharing

    Original github dataset: https://github.com/alexa/Topical-Chat

  6. READ-IT Chatbot Reading Experiences

    • ordo.open.ac.uk
    json
    Updated Jun 1, 2023
    Cite
    Alessio Antonini (2023). READ-IT Chatbot Reading Experiences [Dataset]. http://doi.org/10.21954/ou.rd.13251584.v3
    Explore at:
    Available download formats: json
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    The Open University
    Authors
    Alessio Antonini
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset (answers.jsonld) has been collected through the READ-IT chatbots in English, French, Dutch, Italian and Czech (see the list on the contribute page of the READ-IT website: https://readit-project.eu/contact/contribute-to-read-it/). Data is in JSON-LD format. The documentation about the schema used is on the READ-IT GitHub page. Data is also available for researchers on the READ-IT contribution portal: https://read-it.in-two.com/
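
    Since the data is JSON-LD, it can be loaded directly into an RDF graph for querying. A minimal sketch with rdflib (version 6+ bundles the JSON-LD parser); the file name answers.jsonld is taken from the description above:

    from rdflib import Graph

    g = Graph()
    g.parse("answers.jsonld", format="json-ld")
    print(len(g), "triples loaded")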

  7. Replication Data for: Are chatbots reliable text annotators? Sometimes

    • search.dataone.org
    Updated Sep 25, 2024
    Cite
    Kristensen-McLachlan, Ross Deans; Canavan, Miceal; Kardos, Márton; Jacobsen, Mia; Aarøe, Lene (2024). Replication Data for: Are chatbots reliable text annotators? Sometimes [Dataset]. http://doi.org/10.7910/DVN/TM7ZKD
    Explore at:
    Dataset updated
    Sep 25, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Kristensen-McLachlan, Ross Deans; Canavan, Miceal; Kardos, Márton; Jacobsen, Mia; Aarøe, Lene
    Description

    NB: To reproduce the figures found in the article in the simplest manner, clone the GitHub repository and run the plotting script directly after installing the necessary requirements. The corresponding files used for making these plots can also be found here under "output.zip".

    Classifying tweets with large language models with zero- and few-shot learning, with custom and generic prompts, as well as supervised learning algorithms for comparison. The full GitHub repository for this data can be found at this URL or by following the link under the "metadata" tab. This GitHub repo contains an extensive README file explaining how to run the code and reproduce the results and plots found in the article.

    The present Dataverse repository contains all code and prompts used to generate predictions on the human-annotated data, as well as the code book used by human annotators on this data. Due to data sharing policies at X (formerly Twitter), we are unable to share full texts from the Tweets used in our study. Instead, we have provided Tweet IDs, unique identifiers which can be used to re-scrape the data if desired.

  8. Raw data from experiments in winter 2023 / 2024

    • figshare.com
    txt
    Updated Apr 11, 2024
    Cite
    Alexander Rachmann (2024). Raw data from experiments in winter 2023 / 2024 [Dataset]. http://doi.org/10.6084/m9.figshare.25585266.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Apr 11, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Alexander Rachmann
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Generated data from experiments with the Six Thinking Chatbot, see https://github.com/rachmann-alexander/sixthinkingchatbots

  9. baize-chat-data

    • huggingface.co
    Updated Jul 20, 2023
    Cite
    Link An Jarad (2023). baize-chat-data [Dataset]. https://huggingface.co/datasets/linkanjarad/baize-chat-data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 20, 2023
    Authors
    Link An Jarad
    Description

    Dataset Description

    Original Repository: https://github.com/project-baize/baize-chatbot/tree/main/data. This is a dataset of the training data used to train the Baize family of models. This dataset is used for instruction fine-tuning of LLMs, particularly in "chat" format. Human and AI messages are marked by [|Human|] and [|AI|] tags respectively. The data from the original repo consists of 4 datasets (alpaca, medical, quora, stackoverflow), and this dataset combines all four into… See the full description on the dataset page: https://huggingface.co/datasets/linkanjarad/baize-chat-data.
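
    A minimal sketch for splitting such a transcript into (speaker, text) turns, relying only on the [|Human|]/[|AI|] convention described above:

    import re

    def split_turns(transcript: str):
        """Split a Baize-style transcript into (speaker, text) pairs."""
        parts = re.split(r"\[\|(Human|AI)\|\]", transcript)
        # re.split keeps the captured speaker names at the odd indices.
        return [
            (speaker, text.strip())
            for speaker, text in zip(parts[1::2], parts[2::2])
            if text.strip()
        ]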

  10. chatbot-ai-agent

    • huggingface.co
    Cite
    DeepNLP, chatbot-ai-agent [Dataset]. https://huggingface.co/datasets/DeepNLP/chatbot-ai-agent
    Explore at:
    Authors
    DeepNLP
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Chatbot Agent Meta and Traffic Dataset in AI Agent Marketplace | AI Agent Directory | AI Agent Index from DeepNLP

    This dataset is collected from the AI Agent Marketplace Index and Directory at http://www.deepnlp.org, which contains AI agents' meta information such as the agent's name, website, and description, as well as monthly updated web performance metrics, including Google and Bing average search ranking positions, GitHub stars, Arxiv references, etc. The dataset is helpful for AI… See the full description on the dataset page: https://huggingface.co/datasets/DeepNLP/chatbot-ai-agent.

  11. datasheet1_AVA: A Financial Service Chatbot Based on Deep Bidirectional...

    • frontiersin.figshare.com
    pdf
    Updated May 30, 2023
    Cite
    Shi Yu; Yuxin Chen; Hussain Zaidi (2023). datasheet1_AVA: A Financial Service Chatbot Based on Deep Bidirectional Transformers.pdf [Dataset]. http://doi.org/10.3389/fams.2021.604842.s001
    Explore at:
    Available download formats: pdf
    Dataset updated
    May 30, 2023
    Dataset provided by
    Frontiers
    Authors
    Shi Yu; Yuxin Chen; Hussain Zaidi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We develop a chatbot using deep bidirectional transformer (BERT) models to handle client questions in financial investment customer service. The bot can recognize 381 intents, decide when to say “I don’t know,” and escalate uncertain questions to human operators. Our main novel contribution is the discussion of uncertainty measures for BERT, where three different approaches are systematically compared on real problems. We investigated two uncertainty metrics, information entropy and variance of dropout sampling, in BERT, followed by mixed-integer programming to optimize decision thresholds. Another novel contribution is the usage of BERT as a language model in automatic spelling correction. Inputs with accidental spelling errors can significantly decrease intent classification performance. The proposed approach combines probabilities from the masked language model and word edit distances to find the best corrections for misspelled words. The chatbot and the entire conversational AI system are developed using open-source tools and deployed within our company’s intranet. The proposed approach can be useful for industries seeking similar in-house solutions in their specific business domains. We share all our code and a sample chatbot built on a public data set on GitHub.
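
    The two uncertainty metrics named above are standard and can be sketched generically. The following assumes a Hugging Face-style classifier whose output exposes .logits; it is an illustration, not the paper's code:

    import torch

    def predictive_entropy(logits: torch.Tensor) -> torch.Tensor:
        """Information entropy of the softmax distribution over intents."""
        probs = torch.softmax(logits, dim=-1)
        return -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)

    def mc_dropout_variance(model, inputs, n_samples: int = 20) -> torch.Tensor:
        """Variance of softmax outputs under dropout sampling (MC dropout)."""
        model.train()  # keep dropout layers active at inference time
        with torch.no_grad():
            samples = torch.stack([
                torch.softmax(model(**inputs).logits, dim=-1)
                for _ in range(n_samples)
            ])
        return samples.var(dim=0).mean(dim=-1)  # mean per-class variance

    Inputs with high entropy or high dropout variance would then fall past the optimized decision thresholds and be answered with “I don’t know” or escalated to a human operator.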

  12. Data from: CoAct Citizen Science chatbot explores social support networks in...

    • zenodo.org
    zip
    Updated Mar 1, 2023
    Cite
    Franziska Peter; Franziska Peter; Isabelle Bonhoure; Isabelle Bonhoure; Anna Cigarini; Anna Cigarini; Josep Perelló; Josep Perelló (2023). Data from: CoAct Citizen Science chatbot explores social support networks in mental health based on lived experiences [Dataset]. http://doi.org/10.5281/zenodo.7443141
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 1, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Franziska Peter; Franziska Peter; Isabelle Bonhoure; Isabelle Bonhoure; Anna Cigarini; Anna Cigarini; Josep Perelló; Josep Perelló
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A data set on lived experiences in the context of social support in mental health, created within a Citizen Social Science project.


    Societies around the world increasingly encounter wicked and complex problems, such as those related to mental health, environmental justice, and youth employment. CoAct, an EU-funded global effort, addresses these problems by deploying Citizen Social Science.

    Citizen Social Science is understood here as participatory research co-designed and directly driven by citizen groups sharing a social concern. This methodology aims to give citizen groups an equal ‘seat at the table’ through active participation in research, from the design to the interpretation of the results and their transformation into concrete actions. Citizens thus act as co-researchers and are recognised as competent in-the-field experts.

    In Barcelona, a group of 32 co-researchers works together with the OpenSystems group, Universitat de Barcelona, the Catalan Federation of Mental Health (Federació Salut Mental Catalunya), and with the help of many others on a better understanding of informal social support networks in mental health in the project CoActuem per la Salut Mental (lit. “We act together for mental health”). The co-researchers, who either have a personal history of mental health problems or are family members of someone who does, contributed their personal experiences related to social support in the form of 222 micro-stories, each shorter than 400 characters and most accompanied by an illustration by Pau Badia.

    Those micro-stories form the heart of the first co-created Citizen Science chatbot, the code of which is open at https://github.com/Chaotique/CoActuem_per_la_Salut_Mental_Chatbot.git. The Telegram chatbot sends them to participants on a daily basis over the course of a year and asks them either whether they and/or their close surroundings have lived this experience too (stories of type C), or how they would react or would have reacted in the presented situation (stories of type T). The answers of each participant can be contrasted with that participant's answers to a 32-question socio-demographic survey. Further, the timing of the messages is included to allow for a broader analysis.

    The chatbot is still running, hence this data set will still be updated. For further information on the project CoAct, see https://coactproject.eu/. For further details on the co-creation process and purpose of the chatbot CoActuem per la Salut Mental, take a look at https://coactuem.ub.edu/. Please direct your questions regarding the data set to coactuem[at]ub.edu.

    Acknowledgements

    The CoAct project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement number 873048. We especially thank the co-researchers for the passion and time invested.

  13. Vision Language Models for Bio-image Data Science

    • meta4ds.fokus.fraunhofer.de
    pdf, unknown
    Updated Jun 25, 2025
    Cite
    Zenodo (2025). Vision Language Models for Bio-image Data Science [Dataset]. https://meta4ds.fokus.fraunhofer.de/datasets/oai-zenodo-org-15735577?locale=en
    Explore at:
    Available download formats: unknown (38093729), pdf (4259680)
    Dataset updated
    Jun 25, 2025
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In this talk, I demonstrate potential use-cases for vision-language models (VLMs) in bio-image data science, focusing on how to analyse microscopy image data. It covers these use-cases:

    • cell counting
    • bounding-box segmentation
    • image descriptions
    • VLMs guessing which algorithm to use for processing
    • data analysis code generation
    • answering GitHub issues

    The talk also points at a number of VLM-based open-source tools which are starting to reshape the scientific bio-image data science domain: bia-bob, unprompted, git-bob, napari-chatgpt, bioimage.io chatbot.

  14. sales-conversations-2

    • huggingface.co
    Updated Oct 4, 2023
    Cite
    ENGEL (2023). sales-conversations-2 [Dataset]. https://huggingface.co/datasets/goendalf666/sales-conversations-2
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 4, 2023
    Authors
    ENGEL
    Description

    Dataset Card for "sales-conversations-2"

      Dataset Card for "sales-conversations"
    

    This dataset was created for the purpose of training a sales agent chatbot that can convince people. The initial idea came from "Textbooks Are All You Need" (https://arxiv.org/abs/2306.11644). gpt-3.5-turbo was used for the generation. See the main model or GitHub for more information. salesGPT_v2: https://huggingface.co/goendalf666/salesGPT_v2; github: https://github.com/tom813/salesGPT_foundation… See the full description on the dataset page: https://huggingface.co/datasets/goendalf666/sales-conversations-2.

  15. Quora Chat Dutch

    • zenodo.org
    • huggingface.co
    bin
    Updated Jun 20, 2023
    Cite
    Bram Vanroy; Bram Vanroy (2023). Quora Chat Dutch [Dataset]. http://doi.org/10.57967/hf/0524
    Explore at:
    Available download formats: bin
    Dataset updated
    Jun 20, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Bram Vanroy; Bram Vanroy
    License

    https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html

    Description

    This dataset contains 54,444 conversations between an AI assistant and a (fake, generated) "Human", in Dutch. They are translations of Baize's machine-generated answers to the Quora dataset.

    Data Instances

    {
     'id': 10,
     'input': 'Het gesprek tussen de mens en de AI-assistent.\n'
          '[|Human|] Ik vroeg me af wanneer de Moto G3 de Android 7.0-update zal krijgen?\n'
          '[|AI|] De Moto G3 (Moto G 3rd gen 2015) komt in aanmerking voor de Android 7.0 (Nougat) update.\n'
          '[|Human|] Geweldig! Weet je wanneer de update beschikbaar zal zijn?\n'
          '[|AI|] De update werd uitgebracht in december 2016 voor de Moto G4 en G4 Plus. Er is echter nog geen woord over wanneer de Moto G3 de update zal ontvangen.\n'
          '[|Human|] Is er nieuws of het binnenkort de update zal ontvangen?\n'
          '[|AI|] Het spijt me, ik heb momenteel geen informatie daarover. Het is het beste om uit te kijken naar updates van Motorola.\n'
          '[|Human|] ',
     'topic': 'Wanneer krijgt Moto G3 (Moto G 3e gen 2015) de Android 7.0 (Nougat) update?'
    }

    Data Fields

    • id: the ID of the item. The following IDs are not included because they could not be translated: [1352, 14599, 25985, 27167, 28254, 33731, 41550, 44364, 47293, 49060, 51383, 53740]
    • input: the machine-generated conversation between AI and "Human". Always starts with Het gesprek tussen de mens en de AI-assistent. and has at least one occurrence of both [|AI|] and [|Human|].
    • topic: the topic description

    Dataset Creation

    Both the conversations and the topics were translated with OpenAI's API for gpt-3.5-turbo, with max_tokens=1024 and temperature=0 as parameters.

    The prompt template to translate the input is:

    CONVERSATION_TRANSLATION_PROMPT = """You are asked to translate a conversation between an AI assistant and a human from {src_lang} into {tgt_lang}.
    
    Here are the requirements that you should adhere to:
    1. maintain the format: the conversation consists of the AI (marked as `[|AI|]`) and the human (`[|Human|]`) talking in turns and responding to each other;
    2. do not translate the speaker identifiers `[|AI|]` and `[|Human|]` but always copy them into the translation in appropriate places;
    3. ensure accurate translation and keep the correctness of the conversation;
    4. make sure that text is fluent to read and does not contain grammatical errors. Use standard {tgt_lang} without regional bias;
    5. translate the human's text using informal, but standard, language;
    6. make sure to avoid biases (such as gender bias, grammatical bias, social bias);
    7. if the human asks to correct grammar mistakes or spelling mistakes then you have to generate a similar mistake in {tgt_lang}, and then also generate a corrected output version for the AI in {tgt_lang};
    8. if the human asks to translate text from one to another language, then you only translate the human's question to {tgt_lang} but you keep the translation that the AI provides in the language that the human requested;
    9. do not translate code fragments but copy them as they are. If there are English examples, variable names or definitions in code fragments, keep them in English.
    
    Now translate the following conversation with the requirements set out above. Do not provide an explanation and do not add anything else.
    
    """
    

    The prompt to translate the topic is:

    TOPIC_TRANSLATION_PROMPT = "Translate the following title of a conversation from {src_lang} to {tgt_lang} in a succinct," \
                  " summarizing manner. Translate accurately and formally. Do not provide any explanation" \
                  " about the translation and do not include the original title.\n\n"

    The system message was:

    You are a helpful assistant that translates English to Dutch to the requirements that are given to you.
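
    A minimal sketch of how such a call could look with the current openai Python client (the dataset was built in 2023, so the original code likely used an older client; CONVERSATION_TRANSLATION_PROMPT refers to the template quoted above):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def translate_conversation(conversation: str) -> str:
        # CONVERSATION_TRANSLATION_PROMPT is the template quoted above.
        prompt = CONVERSATION_TRANSLATION_PROMPT.format(
            src_lang="English", tgt_lang="Dutch"
        )
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            max_tokens=1024,
            temperature=0,
            messages=[
                {"role": "system", "content": "You are a helpful assistant that "
                 "translates English to Dutch to the requirements that are given to you."},
                {"role": "user", "content": prompt + conversation},
            ],
        )
        return response.choices[0].message.content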
    

    Note that 12 items (0.02%) were not successfully translated. The translation was missing the AI identifier [|AI|] and/or the human one [|Human|]. The IDs for the missing items are [1352, 14599, 25985, 27167, 28254, 33731, 41550, 44364, 47293, 49060, 51383, 53740].

    The translation quality has not been verified. Use at your own risk!

    Licensing Information

    Licensing info for quora is listed as unknown. Baize is listed as GPLv3. So this dataset is also listed as such. A specific restriction is in place: as per OpenAI's terms of use, this dataset cannot be used to build a commercial system that competes with OpenAI's services.

    This text was generated (either in part or in full) with GPT-3 (gpt-3.5-turbo), OpenAI’s large-scale language-generation model. Upon generating draft language, the author reviewed, edited, and revised the language to their own liking and takes ultimate responsibility for the content of this publication.

    If you use this dataset, you must also follow the Sharing and Usage policies.

    As clearly stated in their Terms of Use, specifically 2c.iii, "[you may not] use output from the Services to develop models that compete with OpenAI". That means that you cannot use this dataset to build models that are intended to commercially compete with OpenAI. As far as I am aware, that is a specific restriction that should serve as an addendum to the current license.

    This dataset is also available on the Hugging Face hub which is the canonical repository for this dataset. See the link for more info.

  16. MApp-KG: Constructing an RDF-based Knowledge Graph from Mobile App...

    • figshare.com
    txt
    Updated Jun 17, 2025
    Cite
    Quim Motger (2025). MApp-KG: Constructing an RDF-based Knowledge Graph from Mobile App Repositories [Dataset]. http://doi.org/10.6084/m9.figshare.24760032.v4
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 17, 2025
    Dataset provided by
    figshare
    Authors
    Quim Motger
    License

    https://www.gnu.org/licenses/gpl-3.0.html

    Description

    Please cite this resource as follows: Motger, Q., Franch, X., Marco, J. (2024). MApp-KG: Mobile App Knowledge Graph for Document-Based Feature Knowledge Generation. In: Islam, S., Sturm, A. (eds) Intelligent Information Systems. CAiSE 2024. Lecture Notes in Business Information Processing, vol 520. Springer, Cham. https://doi.org/10.1007/978-3-031-61000-4_15

    This complementary package is the official description of MApp-KG, a resource and software-based process for generating an RDF-based knowledge graph tailored for a catalogue of mobile apps and their associated natural language documents, with a special focus on the task of feature extraction (i.e., extraction of functionalities or capabilities of a given mobile app satisfying a distinct purpose or need). Our key contributions include: (i) an end-to-end mechanism for the construction of a domain-specific knowledge graph of mobile apps and related documents; (ii) a schema-based extension to support deductive knowledge for explicit modelling of app features; (iii) a continuously updated instance of MApp-KG publicly available for further research in the field, augmented with app feature mentions; and (iv) a data snapshot of MApp-KG to facilitate reuse in further studies.

    In addition to the self-contained contents included in this replication package, we also refer to software resources developed within the context of the development of MApp-KG, stored and evolved in a GitHub repository:

    • AppDataScannerService. Python-based data collection service integrating access to multiple, decentralized, heterogeneous mobile app repositories. The service covers two main data collection mechanisms: web scraping and API consumption. Available at: https://github.com/gessi-chatbots/app_data_scanner_service.
    • KnowledgeGraphRepository. Java-based service acting as a repository component between the data collection service and the database management system. It encompasses all CRUD operations for the entities of the data model, as well as some advanced queries based on inductive knowledge generation on the knowledge graph instance. The database is developed using GraphDB, an RDF graph database compliant with W3C standards. Available at: https://github.com/gessi-chatbots/app_data_repository.

    Both software resources are configured as standalone web services exposing a REST API for HTTP-based communication using JSON for data transfer. Instructions for configuring, building and running these services are detailed in their respective README files. MApp-KG is available as a free, public-access repository (read-only). More details and usage instructions for MApp-KG are available in the README file in this replication package (including SPARQL query examples to consume the public instance of MApp-KG).
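
    As a sketch of how such a public instance might be queried from Python with SPARQLWrapper; the endpoint URL and the schema.org terms below are placeholders and assumptions, and the real endpoint and example queries are in the package's README:

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Placeholder endpoint URL; the real one is given in the README.
    sparql = SPARQLWrapper("https://example.org/repositories/mapp-kg")
    sparql.setQuery("""
        PREFIX schema: <https://schema.org/>
        SELECT ?app ?name WHERE {
          ?app a schema:MobileApplication ;
               schema:name ?name .
        }
        LIMIT 10
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    for row in results["results"]["bindings"]:
        print(row["app"]["value"], row["name"]["value"])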

  17. ‘Disease Symptom Prediction’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 28, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Disease Symptom Prediction’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-disease-symptom-prediction-154b/335de7fc/?iid=006-830&v=presentation
    Explore at:
    Dataset updated
    Jan 28, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Disease Symptom Prediction’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/itachi9604/disease-symptom-description-dataset on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    A dataset to provide students with a source to create a healthcare-related system. A project on the same using double Decision Tree Classification is available at: https://github.com/itachi9604/healthcare-chatbot

    A get_dummies-processed file will be available at https://www.kaggle.com/rabisingh/symptom-checker?select=Training.csv

    Content

    There are columns containing diseases, their symptoms, precautions to be taken, and their weights. This dataset can be easily cleaned using file handling in any language. The user only needs to understand how the rows and columns are arranged.

    Acknowledgements

    I created this dataset with the help of a friend, Pratik Rathod, as the existing dataset of this kind was difficult to clean.

    Query

    uchihaitachi9604@gmail.com

    --- Original source retains full ownership of the source dataset ---

  18. Disease Symptom Prediction

    • kaggle.com
    Updated May 24, 2020
    Cite
    Pranay Patil (2020). Disease Symptom Prediction [Dataset]. https://www.kaggle.com/datasets/itachi9604/disease-symptom-description-dataset/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 24, 2020
    Dataset provided by
    Kaggle
    Authors
    Pranay Patil
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Context

    A dataset to provide students with a source to create a healthcare-related system. A project on the same using double Decision Tree Classification is available at: https://github.com/itachi9604/healthcare-chatbot

    A get_dummies-processed file will be available at https://www.kaggle.com/rabisingh/symptom-checker?select=Training.csv

    Content

    There are columns containing diseases, their symptoms, precautions to be taken, and their weights. This dataset can be easily cleaned using file handling in any language. The user only needs to understand how the rows and columns are arranged.

    The data in this CSV sheet is for reference and training purposes only, and actual data may vary.
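
    A minimal cleaning sketch along those lines with pandas, reproducing the kind of get_dummies-processed file mentioned above; the file and column names (dataset.csv, Disease, Symptom_*) are assumptions:

    import pandas as pd

    # File and column names are assumptions; the dataset stores one row per
    # disease occurrence with its symptoms spread across Symptom_* columns.
    df = pd.read_csv("dataset.csv")
    symptom_cols = [c for c in df.columns if c.lower().startswith("symptom")]

    # Melt the wide symptom columns into one column, then one-hot encode,
    # similar to the "get_dummies processed" file mentioned above.
    long = df.melt(id_vars=["Disease"], value_vars=symptom_cols,
                   value_name="symptom").dropna(subset=["symptom"])
    onehot = pd.get_dummies(long["symptom"].str.strip())
    features = onehot.groupby(long["Disease"]).max()
    print(features.head())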

    Acknowledgements

    I created this dataset with the help of a friend, Pratik Rathod, as the existing dataset of this kind was difficult to clean.

  19. python-act

    • huggingface.co
    Updated Jun 24, 2025
    Cite
    Mesolitica (2025). python-act [Dataset]. https://huggingface.co/datasets/mesolitica/python-act
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 24, 2025
    Dataset authored and provided by
    Mesolitica
    Description

    Python Act

    Python Code Interpreter agent, originally from https://huggingface.co/datasets/xingyaoww/code-act, but run on filtered https://huggingface.co/datasets/teknium/OpenHermes-2.5 and filtered https://huggingface.co/datasets/m-a-p/Code-Feedback. Source code at https://github.com/mesolitica/malaysian-dataset/tree/master/chatbot/python-act

      How does it work?
    

    When the first question is sent to the LLM, the LLM will generate a problem description and code to execute inside… See the full description on the dataset page: https://huggingface.co/datasets/mesolitica/python-act.

  20. synthetic-multiturn-multimodal

    • huggingface.co
    Updated Jan 28, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Mesolitica (2024). synthetic-multiturn-multimodal [Dataset]. https://huggingface.co/datasets/mesolitica/synthetic-multiturn-multimodal
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 28, 2024
    Dataset authored and provided by
    Mesolitica
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Multiturn Multimodal

    We want to generate synthetic data that is able to capture the position of and relationships between multiple images and multiple audio clips; an example is shown below. All notebooks are at https://github.com/mesolitica/malaysian-dataset/tree/master/chatbot/multiturn-multimodal

      multi-images
    

    synthetic-multi-images-relationship.jsonl, 100000 rows, 109MB. Images at https://huggingface.co/datasets/mesolitica/translated-LLaVA-Pretrain/tree/main

      Example data
    

    {'filename':… See the full description on the dataset page: https://huggingface.co/datasets/mesolitica/synthetic-multiturn-multimodal.
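
    A minimal sketch for peeking at the data with the Hugging Face datasets library, assuming the JSONL files load under the default train split:

    from datasets import load_dataset

    # Streaming avoids downloading the full JSONL files up front.
    ds = load_dataset(
        "mesolitica/synthetic-multiturn-multimodal",
        split="train",
        streaming=True,
    )
    print(next(iter(ds)))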
