24 datasets found
  1. Data_Sheet_2_Performance analysis of large language models in the domain of legal argument mining

    • frontiersin.figshare.com
    pdf
    Updated Nov 17, 2023
    + more versions
    Cite
    Abdullah Al Zubaer; Michael Granitzer; Jelena Mitrović (2023). Data_Sheet_2_Performance analysis of large language models in the domain of legal argument mining.pdf [Dataset]. http://doi.org/10.3389/frai.2023.1278796.s002
    Available download formats: pdf
    Dataset updated
    Nov 17, 2023
    Dataset provided by
    Frontiers
    Authors
    Abdullah Al Zubaer; Michael Granitzer; Jelena Mitrović
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Generative pre-trained transformers (GPT) have recently demonstrated excellent performance in various natural language tasks. The development of ChatGPT and the recently released GPT-4 model has shown competence in solving complex and higher-order reasoning tasks without further training or fine-tuning. However, the applicability and strength of these models in classifying legal texts in the context of argument mining are yet to be realized and have not been tested thoroughly. In this study, we investigate the effectiveness of GPT-like models, specifically GPT-3.5 and GPT-4, for argument mining via prompting. We closely study the model's performance considering diverse prompt formulation and example selection in the prompt via semantic search using state-of-the-art embedding models from OpenAI and sentence transformers. We primarily concentrate on the argument component classification task on the legal corpus from the European Court of Human Rights. To address these models' inherent non-deterministic nature and make our result statistically sound, we conducted 5-fold cross-validation on the test set. Our experiments demonstrate, quite surprisingly, that relatively small domain-specific models outperform GPT 3.5 and GPT-4 in the F1-score for premise and conclusion classes, with 1.9% and 12% improvements, respectively. We hypothesize that the performance drop indirectly reflects the complexity of the structure in the dataset, which we verify through prompt and data analysis. Nevertheless, our results demonstrate a noteworthy variation in the performance of GPT models based on prompt formulation. We observe comparable performance between the two embedding models, with a slight improvement in the local model's ability for prompt selection. This suggests that local models are as semantically rich as the embeddings from the OpenAI model. Our results indicate that the structure of prompts significantly impacts the performance of GPT models and should be considered when designing them.
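
    As a minimal sketch of the example-selection step described in this abstract, the snippet below retrieves the most semantically similar labelled sentences for a few-shot prompt using a local sentence-transformers model. The model name, the labelled examples, and the value of k are illustrative placeholders, not the study's actual settings.

    ```python
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")  # local embedding model (placeholder choice)

    # Placeholder labelled training sentences (argument components)
    train_examples = [
        ("The applicant was not given access to the case file.", "premise"),
        ("There has accordingly been a violation of Article 6 of the Convention.", "conclusion"),
        ("The Government argued that domestic remedies had not been exhausted.", "premise"),
    ]

    def select_examples(query: str, k: int = 2):
        """Return the k labelled sentences most similar to the query sentence."""
        corpus_emb = model.encode([text for text, _ in train_examples], convert_to_tensor=True)
        query_emb = model.encode(query, convert_to_tensor=True)
        hits = util.semantic_search(query_emb, corpus_emb, top_k=k)[0]
        return [train_examples[hit["corpus_id"]] for hit in hits]

    # The selected pairs would then be formatted into the few-shot prompt sent to GPT-3.5/GPT-4.
    few_shot = select_examples("The court notes that the applicant's detention was unlawful.")
    ```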

  2. ChatGPT Prompts on FAIR Digital Objects

    • zenodo.org
    pdf
    Updated May 26, 2025
    Cite
    Nicolas Blumenröhr (2025). ChatGPT Prompts on FAIR Digital Objects [Dataset]. http://doi.org/10.5281/zenodo.15056647
    Available download formats: pdf
    Dataset updated
    May 26, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nicolas Blumenröhr
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Mar 20, 2025
    Description

    This repository contains two examples of prompting ChatGPT to resolve, analyze, and evaluate a FAIR Digital Object (FDO) information record via the Handle Registry, considering data from digital humanities and energy research.

  3. Data Sheet 2_Large language models generating synthetic clinical datasets: a feasibility and comparative analysis with real-world perioperative data

    • frontiersin.figshare.com
    • figshare.com
    xlsx
    Updated Feb 5, 2025
    + more versions
    Cite
    Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin (2025). Data Sheet 2_Large language models generating synthetic clinical datasets: a feasibility and comparative analysis with real-world perioperative data.xlsx [Dataset]. http://doi.org/10.3389/frai.2025.1533508.s002
    Available download formats: xlsx
    Dataset updated
    Feb 5, 2025
    Dataset provided by
    Frontiers
    Authors
    Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Clinical data is instrumental to medical research, machine learning (ML) model development, and advancing surgical care, but access is often constrained by privacy regulations and missing data. Synthetic data offers a promising solution to preserve privacy while enabling broader data access. Recent advances in large language models (LLMs) provide an opportunity to generate synthetic data with reduced reliance on domain expertise, computational resources, and pre-training.

    Objective: This study aims to assess the feasibility of generating realistic tabular clinical data with OpenAI's GPT-4o using zero-shot prompting, and to evaluate the fidelity of LLM-generated data by comparing its statistical properties to the Vital Signs DataBase (VitalDB), a real-world open-source perioperative dataset.

    Methods: In Phase 1, GPT-4o was prompted to generate a dataset with qualitative descriptions of 13 clinical parameters. The resultant data was assessed for general errors, plausibility of outputs, and cross-verification of related parameters. In Phase 2, GPT-4o was prompted to generate a dataset using descriptive statistics of the VitalDB dataset. Fidelity was assessed using two-sample t-tests, two-sample proportion tests, and 95% confidence interval (CI) overlap.

    Results: In Phase 1, GPT-4o generated a complete and structured dataset comprising 6,166 case files. The dataset was plausible in range and correctly calculated body mass index for all case files based on respective heights and weights. Statistical comparison between the LLM-generated datasets and VitalDB revealed that Phase 2 data achieved significant fidelity. Phase 2 data demonstrated statistical similarity in 12/13 (92.31%) parameters, whereby no statistically significant differences were observed in 6/6 (100.0%) categorical/binary and 6/7 (85.71%) continuous parameters. Overlap of 95% CIs was observed in 6/7 (85.71%) continuous parameters.

    Conclusion: Zero-shot prompting with GPT-4o can generate realistic tabular synthetic datasets, which can replicate key statistical properties of real-world perioperative data. This study highlights the potential of LLMs as a novel and accessible modality for synthetic data generation, which may address critical barriers in clinical data access and eliminate the need for technical expertise, extensive computational resources, and pre-training. Further research is warranted to enhance fidelity and investigate the use of LLMs to amplify and augment datasets, preserve multivariate relationships, and train robust ML models.
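
    The Phase 2 fidelity checks named above (two-sample t-tests, two-sample proportion tests, and 95% CI overlap) can be sketched as follows. The arrays and counts are illustrative placeholders, not values from VitalDB or the LLM-generated data.

    ```python
    import numpy as np
    from scipy import stats
    from statsmodels.stats.proportion import proportions_ztest

    # Placeholder samples for one continuous parameter (e.g. age)
    synthetic = np.random.default_rng(0).normal(58, 12, 500)
    real = np.random.default_rng(1).normal(59, 13, 500)

    # Two-sample t-test for a continuous parameter
    t_stat, p_cont = stats.ttest_ind(synthetic, real, equal_var=False)

    # Two-sample proportion test for a binary parameter (e.g. sex), using placeholder counts
    z_stat, p_bin = proportions_ztest(count=np.array([260, 250]), nobs=np.array([500, 500]))

    # 95% confidence-interval overlap for the continuous parameter
    def ci95(x):
        half = stats.sem(x) * stats.t.ppf(0.975, len(x) - 1)
        return x.mean() - half, x.mean() + half

    (lo_s, hi_s), (lo_r, hi_r) = ci95(synthetic), ci95(real)
    ci_overlap = lo_s <= hi_r and lo_r <= hi_s
    ```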

  4. A comparative evaluation of ChatGPT 3.5 and ChatGPT 4 in responses to selected genetics questions - Full study data

    • search.dataone.org
    • data.niaid.nih.gov
    • +1more
    Updated Aug 1, 2025
    Cite
    Scott McGrath (2025). A comparative evaluation of ChatGPT 3.5 and ChatGPT 4 in responses to selected genetics questions - Full study data [Dataset]. http://doi.org/10.5061/dryad.s4mw6m9cv
    Dataset updated
    Aug 1, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Scott McGrath
    Time period covered
    Jan 1, 2023
    Description

    Objective: Our objective is to evaluate the efficacy of ChatGPT 4 in accurately and effectively delivering genetic information, building on previous findings with ChatGPT 3.5. We focus on assessing the utility, limitations, and ethical implications of using ChatGPT in medical settings. Materials and Methods: A structured questionnaire, including the Brief User Survey (BUS-15) and custom questions, was developed to assess ChatGPT 4's clinical value. An expert panel of genetic counselors and clinical geneticists independently evaluated ChatGPT 4's responses to these questions. We also involved comparative analysis with ChatGPT 3.5, utilizing descriptive statistics and using R for data analysis. Results: ChatGPT 4 demonstrated improvements over 3.5 in context recognition, relevance, and informativeness. However, performance variability and concerns about the naturalness of the output were noted. No significant difference in accuracy was found between ChatGPT 3.5 and 4.0. Notably, the effic...

    Study Design: This study was conducted to evaluate the performance of ChatGPT 4 (March 23rd, 2023 model) in the context of genetic counseling and education. The evaluation involved a structured questionnaire, which included questions selected from the Brief User Survey (BUS-15) and additional custom questions designed to assess the clinical value of ChatGPT 4's responses.

    Questionnaire Development: The questionnaire was built on Qualtrics and comprised twelve questions: seven selected from the BUS-15, preceded by two additional questions that we designed. The initial questions focused on quality and answer relevancy:
    1. The overall quality of the Chatbot's response is: (5-point Likert: Very poor to Very Good)
    2. The Chatbot delivered an answer that provided the relevant information you would include if asked the question. (5-point Likert: Strongly disagree to Strongly agree)
    The BUS-15 questions (7-point Likert: Strongly disagree to Strongly agree) focused on: 1. Recogniti...

    # A comparative evaluation of ChatGPT 3.5 and ChatGPT 4 in responses to selected genetics questions - Full study data

    https://doi.org/10.5061/dryad.s4mw6m9cv

    This data was captured when evaluating the ability of ChatGPT to address questions patients may ask it about three genetic conditions (BRCA1, HFE, and MLH1). This data is associated with the JAMIA article of a similar name, DOI 10.1093/jamia/ocae128.

    Description of the data and file structure

    1. Key: This tab contains the data structure, explaining the survey questions, and potential responses available.
    2. Prompt Responses: This tab contains the prompts used for ChatGPT and the responses provided by each model (3.5 and 4)
    3. GPT 4 Results: This tab provides the responses collected from the medical experts (genetic counselors and clinical geneticist) from the Qualtrics survey.
    4. Accuracy (Qx_1): This tab contains the subset of results from both the Ch...
  5. ChatGPT Usage Survey Data

    • webfx.com
    Updated Sep 2, 2025
    Cite
    WebFX (2025). ChatGPT Usage Survey Data [Dataset]. https://www.webfx.com/blog/ai/chatgpt-usage-statistics/
    Dataset updated
    Sep 2, 2025
    Dataset authored and provided by
    WebFX
    Variables measured
    Average words in first message, Average words per ChatGPT conversation, Average number of messages per conversation, Percentage of conversations that are commands, Percentage of conversations that start as questions, Percentage of conversations in the "learning & understanding" category, Percentage of conversations using advanced features (persona assignment / data upload)
    Description

    Analysis of 13,252 publicly shared ChatGPT conversations, conducted by WebFX to uncover usage statistics: prompt length, message count, question vs. command distribution, and use-case categories.

  6. Data Sheet 1_A multidimensional comparison of ChatGPT, Google Translate, and DeepL in Chinese tourism texts translation: fidelity, fluency, cultural sensitivity, and persuasiveness

    • frontiersin.figshare.com
    xlsx
    Updated Jul 24, 2025
    Cite
    Shiyue Chen; Yan Lin (2025). Data Sheet 1_A multidimensional comparison of ChatGPT, Google Translate, and DeepL in Chinese tourism texts translation: fidelity, fluency, cultural sensitivity, and persuasiveness.xlsx [Dataset]. http://doi.org/10.3389/frai.2025.1619489.s001
    Available download formats: xlsx
    Dataset updated
    Jul 24, 2025
    Dataset provided by
    Frontiers
    Authors
    Shiyue Chen; Yan Lin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This study systematically compares the translation performance of ChatGPT, Google Translate, and DeepL on Chinese tourism texts, focusing on two prompt-engineering strategies. Using a mixed-methods approach that combines quantitative expert assessments with qualitative analysis, the evaluation centers on fidelity, fluency, cultural sensitivity, and persuasiveness. ChatGPT outperformed its counterparts across all metrics, especially when culturally tailored prompts were used. However, it occasionally introduced semantic shifts, highlighting a trade-off between accuracy and rhetorical adaptation. Despite its strong performance, human post-editing remains necessary to ensure semantic precision and professional standards. The study demonstrates ChatGPT’s potential in domain-specific translation tasks while calling for continued oversight in culturally nuanced content.

  7. Text Analytics Market Report

    • marketreportanalytics.com
    doc, pdf, ppt
    Updated Jun 20, 2025
    + more versions
    Cite
    Market Report Analytics (2025). Text Analytics Market Report [Dataset]. https://www.marketreportanalytics.com/reports/text-analytics-market-89598
    Available download formats: doc, pdf, ppt
    Dataset updated
    Jun 20, 2025
    Dataset authored and provided by
    Market Report Analytics
    License

    https://www.marketreportanalytics.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The text analytics market is experiencing robust growth, projected to reach $10.49 billion in 2025 and exhibiting a remarkable compound annual growth rate (CAGR) of 39.90% from 2019 to 2033. This expansion is fueled by several key drivers. The increasing volume of unstructured data generated across various industries, including healthcare, finance, and customer service, necessitates sophisticated tools for extracting actionable insights. Furthermore, advancements in natural language processing (NLP), machine learning (ML), and artificial intelligence (AI) are empowering text analytics solutions with enhanced capabilities, such as sentiment analysis, topic modeling, and entity recognition. The rising adoption of cloud-based solutions also contributes to market growth, offering scalability, cost-effectiveness, and ease of access. Major industry players like IBM, Microsoft, and SAP are actively investing in research and development, driving innovation and expanding the market's capabilities. Competitive pressures are fostering continuous improvement in the accuracy and efficiency of text analytics tools, making them increasingly attractive to businesses of all sizes. The growing demand for real-time insights and improved customer experience further propels market expansion.

    While the market enjoys significant growth momentum, certain challenges persist. Data security and privacy concerns remain paramount, necessitating robust security measures within text analytics platforms. The complexity of implementing and integrating these solutions into existing IT infrastructures can also pose a barrier to adoption, particularly for smaller businesses lacking dedicated data science teams. Furthermore, the accuracy and reliability of text analytics outputs can be affected by the quality and consistency of the input data. Overcoming these challenges through improved data governance, user-friendly interfaces, and robust customer support will be crucial for continued market expansion. Despite these restraints, the overall market outlook remains positive, driven by the continuous evolution of technology and the growing reliance on data-driven decision-making across diverse sectors.

    Recent developments include: January 2023 - Microsoft announced a new multibillion-dollar investment in ChatGPT maker OpenAI. ChatGPT automatically generates text based on written prompts in a way that is more creative and advanced than earlier chatbots. Through this investment, the company will accelerate breakthroughs in AI, and both companies will commercialize advanced technologies. November 2022 - Tntra and Invenio partnered to develop a platform that offers comprehensive data analysis on a firm. Throughout the process, Tntra offered complete engineering support and cooperation to Invenio. Tntra offers feeds, knowledge graphs, intelligent text extraction, and analytics, which enables Invenio to give information on seven parts of the business, such as false news identification, subject categorization, dynamic data extraction, article summaries, sentiment analysis, and keyword extraction.

    Key drivers for this market are: Growing Demand for Social Media Analytics, Rising Practice of Predictive Analytics. Potential restraints include: Growing Demand for Social Media Analytics, Rising Practice of Predictive Analytics. Notable trends are: Retail and E-commerce to Hold a Significant Share in Text Analytics Market.

  8. Data from: DevGPT: Studying Developer-ChatGPT Conversations

    • zenodo.org
    zip
    Updated Sep 14, 2023
    + more versions
    Cite
    Tao Xiao; Christoph Treude; Hideaki Hata; Kenichi Matsumoto (2023). DevGPT: Studying Developer-ChatGPT Conversations [Dataset]. http://doi.org/10.5281/zenodo.8304091
    Available download formats: zip
    Dataset updated
    Sep 14, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Tao Xiao; Christoph Treude; Hideaki Hata; Kenichi Matsumoto
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    DevGPT is a curated dataset encompassing 17,913 prompts and ChatGPT responses, including 11,751 code snippets, coupled with the corresponding software development artifacts (source code, commits, issues, pull requests, discussions, and Hacker News threads) to enable analysis of the context and implications of these developer interactions with ChatGPT.

  9. databricks dolly 15k

    • kaggle.com
    zip
    Updated Apr 12, 2023
    + more versions
    Cite
    databricks (2023). databricks dolly 15k [Dataset]. https://www.kaggle.com/datasets/databricks/databricks-dolly-15k/code
    Available download formats: zip (4,737,034 bytes)
    Dataset updated
    Apr 12, 2023
    Dataset provided by
    Databricks (http://databricks.com/)
    Authors
    databricks
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Summary

    databricks-dolly-15k is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.

    This dataset can be used for any purpose, whether academic or commercial, under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported License.

    Supported Tasks: - Training LLMs - Synthetic Data Generation - Data Augmentation

    Languages: English Version: 1.0

    Owner: Databricks, Inc.

    Dataset Overview

    databricks-dolly-15k is a corpus of more than 15,000 records generated by thousands of Databricks employees to enable large language models to exhibit the magical interactivity of ChatGPT. Databricks employees were invited to create prompt / response pairs in each of eight different instruction categories, including the seven outlined in the InstructGPT paper, as well as an open-ended free-form category. The contributors were instructed to avoid using information from any source on the web with the exception of Wikipedia (for particular subsets of instruction categories), and explicitly instructed to avoid using generative AI in formulating instructions or responses. Examples of each behavior were provided to motivate the types of questions and instructions appropriate to each category.

    Halfway through the data generation process, contributors were given the option of answering questions posed by other contributors. They were asked to rephrase the original question and only select questions they could be reasonably expected to answer correctly.

    For certain categories contributors were asked to provide reference texts copied from Wikipedia. Reference text (indicated by the context field in the actual dataset) may contain bracketed Wikipedia citation numbers (e.g. [42]) which we recommend users remove for downstream applications.
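
    A minimal sketch of that recommended clean-up, stripping bracketed citation markers such as [42] from the context field. It assumes the records are read from a JSON Lines export with one record per line; the file name is illustrative.

    ```python
    import json
    import re

    CITATION_MARKER = re.compile(r"\[\d+\]")

    # Assumes a JSON Lines export with one record per line; the file name is illustrative.
    with open("databricks-dolly-15k.jsonl", encoding="utf-8") as fh:
        records = [json.loads(line) for line in fh]

    for record in records:
        # Remove bracketed Wikipedia citation numbers (e.g. [42]) from the reference text.
        record["context"] = CITATION_MARKER.sub("", record.get("context", ""))
    ```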

    Intended Uses

    While immediately valuable for instruction fine-tuning large language models, as a corpus of human-generated instruction prompts, this dataset also presents a valuable opportunity for synthetic data generation in the methods outlined in the Self-Instruct paper. For example, contributor-generated prompts could be submitted as few-shot examples to a large open language model to generate a corpus of millions of examples of instructions in each of the respective InstructGPT categories.

    Likewise, both the instructions and responses present fertile ground for data augmentation. A paraphrasing model might be used to restate each prompt or short response, with the resulting text associated with the respective ground-truth sample. Such an approach might provide a form of regularization on the dataset that could allow for more robust instruction-following behavior in models derived from these synthetic datasets.

    Dataset

    Purpose of Collection

    As part of our continuing commitment to open source, Databricks developed what is, to the best of our knowledge, the first open source, human-generated instruction corpus specifically designed to enable large language models to exhibit the magical interactivity of ChatGPT. Unlike other datasets that are limited to non-commercial use, this dataset can be used, modified, and extended for any purpose, including academic or commercial applications.

    Sources

    • Human-generated data: Databricks employees were invited to create prompt / response pairs in each of eight different instruction categories.
    • Wikipedia: For instruction categories that require an annotator to consult a reference text (information extraction, closed QA, summarization) contributors selected passages from Wikipedia for particular subsets of instruction categories. No guidance was given to annotators as to how to select the target passages.

    Annotator Guidelines

    To create a record, employees were given a brief description of the annotation task as well as examples of the types of prompts typical of each annotation task. Guidelines were succinct by design so as to encourage a high task completion rate, possibly at the cost of rigorous compliance to an annotation rubric that concretely and reliably operationalizes the specific task. Caveat emptor.

    The annotation guidelines for each of the categories are as follows:

    • Creative Writing: Write a question or instruction that requires a creative, open-ended written response. The instruction should be reasonable to ask of a person with general world knowledge and should not require searching. In this task, your prompt should give very specific instructions to follow. Constraints, instructions, guidelines, or requirements all work, and the more of them the be...
  10. Sentiment Analysis Dataset

    • kaggle.com
    Updated May 27, 2024
    Cite
    Samarth Kuchya (2024). Sentiment Analysis Dataset [Dataset]. https://www.kaggle.com/datasets/samarthkumarkuchya/sentiment-analysis-dataset/versions/1
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    May 27, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Samarth Kuchya
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset was created using prompt engineering with ChatGPT and has the following labels: 0 = negative, 1 = neutral, 2 = positive.

  11. Evaluation of large language model chatbot responses to psychotic prompts: numerical ratings of prompt-response pairs

    • search.dataone.org
    • datadryad.org
    Updated Nov 20, 2025
    Cite
    Elaine Shen; Fadi Hamati; Meghan Rose Donohue; Ragy Girgis; Jeremy Veenstra-VanderWeele; Amandeep Jutla (2025). Evaluation of large language model chatbot responses to psychotic prompts: numerical ratings of prompt-response pairs [Dataset]. http://doi.org/10.5061/dryad.x0k6djj00
    Dataset updated
    Nov 20, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Elaine Shen; Fadi Hamati; Meghan Rose Donohue; Ragy Girgis; Jeremy Veenstra-VanderWeele; Amandeep Jutla
    Description

    The large language model (LLM) "chatbot" product ChatGPT has accumulated 800 million weekly users since its 2022 launch. In 2025, several media outlets reported on individuals in whom apparent psychotic symptoms emerged or worsened in the context of using ChatGPT. As LLM chatbots are trained to align with user input and generate encouraging responses, they may have difficulty appropriately responding to psychotic content. To assess whether ChatGPT can reliably generate appropriate responses to prompts containing psychotic symptoms, we conducted a cross-sectional, experimental study of how multiple versions of the ChatGPT product respond to psychotic and control prompts, with blind clinician ratings of response appropriateness. We found that all three tested versions of ChatGPT were much more likely to generate inappropriate responses to psychotic than control prompts, with the "Free" product showing the poorest performance. In an exploratory analysis, prompts reflecting grandiosit...

    We created 79 psychotic prompts, first-person statements an individual experiencing psychosis could plausibly make to ChatGPT. Each reflected one of the five positive symptom domains assessed by the Structured Interview for Psychosis-Risk Syndromes (SIPS): unusual thought content/delusional ideas (n = 16), suspiciousness/persecutory ideas (n = 17), grandiose ideas (n = 15), perceptual disturbances/hallucinations (n = 15), and disorganized communication (n = 16). For each psychotic prompt, we created a corresponding control prompt similar in length, sentence structure, and content but without psychotic elements. This yielded a total of 158 unique prompts. On 8/28 and 8/29/2025, we presented these prompts to three versions of the ChatGPT product: GPT-5 Auto (paid default at time of experiment), GPT-4o (previous paid default), and "Free" (version accessible without subscription or account), yielding 474 prompt-response pairs. Two primary raters assigned an "appropriateness" r...

    # Evaluation of large language model chatbot responses to psychotic prompts: numerical ratings of prompt-response pairs

    Dataset DOI: 10.5061/dryad.x0k6djj00

    Description of the data and file structure

    This dataset contains numerical ratings of prompt-response pairs from our study, and can be used to reproduce our analyses. Note that the literal text of prompts and model responses are not provided here, but they are available from the corresponding author on reasonable request.

    Files and variables

    File: llm_psychosis_numeric_ratings.csv

    Description: This CSV file contains all numeric appropriateness ratings assigned to prompt-response pairs in a "long" format. The 1592 rows represent 474 ratings each from two primary raters (for 948 from both), 474 derived consensus ratings, and 170 ratings from a secondary rater. The seven columns are described below.

    Variables
    • pair_id: The ID of the prompt-response pair rat...,
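
    A minimal sketch of reading the long-format ratings file described above and widening it to one row per prompt-response pair. Only pair_id is documented in this excerpt, so the rater and rating column names (and rater labels) below are assumptions for illustration.

    ```python
    import pandas as pd

    df = pd.read_csv("llm_psychosis_numeric_ratings.csv")

    # Hypothetical column names: "rater" (e.g. primary_1, primary_2, consensus, secondary)
    # and "rating" (the numeric appropriateness score).
    wide = df.pivot_table(index="pair_id", columns="rater", values="rating")

    # With one column per rater, the two primary raters can be compared directly,
    # e.g. how often their ratings agree exactly.
    agreement = (wide["primary_1"] == wide["primary_2"]).mean()
    ```
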
  12. 🧠 AI-Driven Mental Health Literacy

    • kaggle.com
    zip
    Updated Mar 5, 2024
    Cite
    mexwell (2024). 🧠 AI-Driven Mental Health Literacy [Dataset]. https://www.kaggle.com/datasets/mexwell/ai-driven-mental-health-literacy
    Available download formats: zip (3,740 bytes)
    Dataset updated
    Mar 5, 2024
    Authors
    mexwell
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The dataset is from an Indian study that used ChatGPT, a natural language processing model by OpenAI, to design a mental health literacy intervention for college students. Prompt engineering tactics were used to formulate prompts that acted as anchors in the conversations with the AI agent regarding mental health. An intervention lasting 20 days was designed, with sessions of 15-20 minutes on alternate days. Fifty-one students completed pre-test and post-test measures of mental health literacy, mental help-seeking attitude, stigma, mental health self-efficacy, positive and negative experiences, and flourishing in the main study, which were then analyzed using paired t-tests. The results suggest that the intervention is effective among college students, as statistically significant changes were noted in mental health literacy and mental health self-efficacy scores. The study affirms the practicality, acceptance, and initial indications of AI-driven methods in advancing mental health literacy and suggests the promising prospects of innovative platforms such as ChatGPT within the field of applied positive psychology.
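
    A minimal sketch of the paired pre-test/post-test comparison mentioned above, assuming two equal-length arrays of scores for the same students; the numbers are illustrative, not the study's data.

    ```python
    from scipy import stats

    pre_mhl = [21, 18, 25, 19, 22, 24, 20, 23]    # placeholder pre-test mental health literacy scores
    post_mhl = [24, 22, 27, 23, 25, 27, 24, 26]   # placeholder post-test scores for the same students

    t_stat, p_value = stats.ttest_rel(post_mhl, pre_mhl)
    print(f"paired t = {t_stat:.2f}, p = {p_value:.4f}")
    ```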

    Original Data

    Citation

    C K, J., & Singh, K. (2023). Dataset for: AI-Driven Mental Health Literacy: An Interventional Study from India (Final Dataset for analysis) [Data set]. PsychArchives. https://doi.org/10.23668/psycharchives.13284

    Acknowledgement

    Photo by Priscilla Du Preez 🇨🇦 on Unsplash

  13. LLM dermatological patient handouts - supplementary data

    • data.mendeley.com
    Updated Sep 7, 2023
    + more versions
    Cite
    Crystal Chang (2023). LLM dermatological patient handouts - supplementary data [Dataset]. http://doi.org/10.17632/5ngxkzkdp9.2
    Dataset updated
    Sep 7, 2023
    Authors
    Crystal Chang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Supplementary material for Assessment of Large Language Models to Generate Patient Handouts for the Dermatology Clinic: a single-blinded randomized study

    Supplementary material A describes the overall analysis and outputs for the PEMAT and readability scores.

    Supplementary material B is the code used for the statistical analysis.

    LLM_readability_scores, PEMAT, LLM_attending_rank, rater_df, and LLM_randomization_protocol are the raw data used for analysis.

    ChatGPT handouts, Bard handouts, and BingAI handouts are the respective handouts and prompts generated for this study.
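
    A minimal sketch of computing readability scores for a generated handout, assuming the textstat package; the handout text is a placeholder, and the specific readability formulas used in the study may differ.

    ```python
    import textstat

    handout = (
        "Atopic dermatitis is a long-lasting skin condition that causes dry, itchy patches. "
        "Moisturize every day and avoid harsh soaps to help protect your skin."
    )

    print("Flesch Reading Ease:  ", textstat.flesch_reading_ease(handout))
    print("Flesch-Kincaid Grade: ", textstat.flesch_kincaid_grade(handout))
    ```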

  14. Data Sheet 1_On the emergent capabilities of ChatGPT 4 to estimate personality traits

    • frontiersin.figshare.com
    zip
    Updated Feb 13, 2025
    Cite
    Marco Piastra; Patrizia Catellani (2025). Data Sheet 1_On the emergent capabilities of ChatGPT 4 to estimate personality traits.zip [Dataset]. http://doi.org/10.3389/frai.2025.1484260.s001
    Available download formats: zip
    Dataset updated
    Feb 13, 2025
    Dataset provided by
    Frontiers
    Authors
    Marco Piastra; Patrizia Catellani
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This study investigates the potential of ChatGPT 4 in the assessment of personality traits based on written texts. Using two publicly available datasets containing both written texts and self-assessments of the authors’ psychological traits based on the Big Five model, we aimed to evaluate the predictive performance of ChatGPT 4. For each sample text, we asked for numerical predictions on an eleven-point scale and compared them with the self-assessments. We also asked for ChatGPT 4 confidence scores on an eleven-point scale for each prediction. To keep the study within a manageable scope, a zero-prompt modality was chosen, although more sophisticated prompting strategies could potentially improve performance. The results show that ChatGPT 4 has moderate but significant abilities to automatically infer personality traits from written text. However, it also shows limitations in recognizing whether the input text is appropriate or representative enough to make accurate inferences, which could hinder practical applications. Furthermore, the results suggest that improved benchmarking methods could increase the efficiency and reliability of the evaluation process. These results pave the way for a more comprehensive evaluation of the capabilities of Large Language Models in assessing personality traits from written texts.
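
    A minimal sketch of the comparison described above, correlating ChatGPT 4's eleven-point trait predictions with the authors' self-assessments; the scores are illustrative placeholders for a single Big Five trait.

    ```python
    from scipy.stats import pearsonr

    self_reported = [7, 4, 9, 5, 6, 8, 3, 7]   # placeholder self-assessed scores (eleven-point scale)
    gpt_predicted = [6, 5, 8, 5, 7, 7, 4, 6]   # placeholder ChatGPT 4 predictions for the same texts

    r, p = pearsonr(self_reported, gpt_predicted)
    print(f"Pearson r = {r:.2f}, p = {p:.3f}")
    ```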

  15. Can Developers Prompt? A Controlled Experiment for Code Documentation Generation [Replication Package]

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 11, 2024
    + more versions
    Cite
    Kruse, Hans-Alexander; Puhlfürß, Tim; Maalej, Walid (2024). Can Developers Prompt? A Controlled Experiment for Code Documentation Generation [Replication Package] [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13127237
    Dataset updated
    Sep 11, 2024
    Dataset provided by
    Universität Hamburg
    Authors
    Kruse, Hans-Alexander; Puhlfürß, Tim; Maalej, Walid
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Summary of Artifacts

    This is the replication package for the paper titled 'Can Developers Prompt? A Controlled Experiment for Code Documentation Generation', which is part of the 40th IEEE International Conference on Software Maintenance and Evolution (ICSME), held from October 6 to 11, 2024, in Flagstaff, AZ, USA.

    Full Abstract

    Large language models (LLMs) bear great potential for automating tedious development tasks such as creating and maintaining code documentation. However, it is unclear to what extent developers can effectively prompt LLMs to create concise and useful documentation. We report on a controlled experiment with 20 professionals and 30 computer science students tasked with code documentation generation for two Python functions. The experimental group freely entered ad-hoc prompts in a ChatGPT-like extension of Visual Studio Code, while the control group executed a predefined few-shot prompt. Our results reveal that professionals and students were unaware of or unable to apply prompt engineering techniques. Especially students perceived the documentation produced from ad-hoc prompts as significantly less readable, less concise, and less helpful than documentation from prepared prompts. Some professionals produced higher quality documentation by just including the keyword Docstring in their ad-hoc prompts. While students desired more support in formulating prompts, professionals appreciated the flexibility of ad-hoc prompting. Participants in both groups rarely assessed the output as perfect. Instead, they understood the tools as support to iteratively refine the documentation. Further research is needed to understand which prompting skills and preferences developers have and which support they need for certain tasks.

    Author Information

    Name Affiliation Email

    Hans-Alexander Kruse Universität Hamburg hans-alexander.kruse@studium.uni-hamburg.de

    Tim Puhlfürß Universität Hamburg tim.puhlfuerss@uni-hamburg.de

    Walid Maalej Universität Hamburg walid.maalej@uni-hamburg.de

    Citation Information

    @inproceedings{kruse-icsme-2024,
      author    = {Kruse, Hans-Alexander and Puhlf{\"u}r{\ss}, Tim and Maalej, Walid},
      booktitle = {2024 IEEE International Conference on Software Maintenance and Evolution (ICSME)},
      title     = {Can Developers Prompt? A Controlled Experiment for Code Documentation Generation},
      year      = {2024},
      doi       = {tba},
    }

    Artifacts Overview

    1. Preprint

    The file kruse-icsme-2024-preprint.pdf is the preprint version of the official paper. You should read the paper in detail to understand the study, especially its methodology and results.

    2. Results

    The folder results includes two subfolders, explained in the following.

    Demographics RQ1 RQ2

    The subfolder Demographics RQ1 RQ2 provides Jupyter Notebook file evaluation.ipynb for analyzing (1) the experiment participants' submissions of the digital survey and (2) the ad-hoc prompts that the experimental group entered into their tool. Hence, this file provides demographic information about the participants and results for the research questions 1 and 2. Please refer to the README file inside this subfolder for installation steps of the Jupyter Notebook file.

    RQ2

    The subfolder RQ2 contains further subfolders with Microsoft Excel files specific to the results of research question 2:

    The subfolder UEQ contains three times the official User Experience Questionnaire (UEQ) analysis Excel tool, with data entered from all participants/students/professionals.

    The subfolder Open Coding contains three Excel files with the open-coding results for the free-text answers that participants could enter at the end of the survey to state additional positive and negative comments about their experience during the experiment. The Consensus file provides the finalized version of the open coding process.

    3. Extension

    The folder extension contains the code of the Visual Studio Code (VS Code) extension developed in this study to generate code documentation with predefined prompts. Please refer to the README file inside the folder for installation steps. Alternatively, you can install the deployed version of this tool, called Code Docs AI, via the VS Code Marketplace.

    You can install the tool to generate code documentation with ad-hoc prompts directly via the VS Code Marketplace. We did not include the code of this extension in this replication package due to license conflicts (GNUv3 vs. MIT).

    4. Survey

    The folder survey contains PDFs of the digital survey in two versions:

    The file Survey.pdf contains the rendered version of the survey (how it was presented to participants).

    The file SurveyOptions.pdf is an export of the LimeSurvey web platform. Its main purpose is to provide the technical answer codes, e.g., AO01 and AO02, that refer to the rendered answer texts, e.g., Yes and No. This can help you if you want to analyze the CSV files inside the results folder (instead of using the Jupyter Notebook file), as the CSVs contain the answer codes, not the answer texts. Please note that an export issue caused page 9 to be almost blank. However, this problem is negligible as the question on this page only contained one free-text answer field.

    5. Appendix

    The folder appendix provides additional material about the study:

    The subfolder tool_screenshots contains screenshots of both tools.

    The file few_shots.txt lists the few shots used for the predefined prompt tool.

    The file test_functions.py lists the functions used in the experiment.

    Revisions

    Version Changelog

    1.0.0 Initial upload

    1.1.0 Add paper preprint. Update abstract.

    1.2.0 Update replication package based on ICSME Artifact Track reviews

    License

    See LICENSE file.

  16. PROSPECT: Professional Role Effects on Specialized Perspective Enhancement in Conversational Task

    • zenodo.org
    zip
    Updated Dec 29, 2024
    Cite
    Keisuke Sato (2024). PROSPECT: Professional Role Effects on Specialized Perspective Enhancement in Conversational Task [Dataset]. http://doi.org/10.5281/zenodo.14567800
    Available download formats: zip
    Dataset updated
    Dec 29, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Keisuke Sato
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Dec 29, 2024
    Description

    ### Data Availability Statement (for the paper)

    All dialogue logs and final responses collected in this study are publicly available in the PROSPECT repository on Zenodo (DOI: [to be assigned]). The repository contains PDF files of complete dialogue histories and Markdown files of final comprehensive analyses for all conditions and models used in this study, allowing for reproducibility and further analysis.

    ### README.md for Zenodo

    # PROSPECT: Professional Role Effects on Specialized Perspective Enhancement in Conversational Task

    ## Overview
    This repository (PROSPECT) contains the dataset associated with the paper:
    > "Empirical Investigation of Expertise, Multiperspectivity, and Abstraction Induction in Conversational AI Outputs through Professional Role Assignment to Both User and AI"

    This research analyzed changes in dialogue logs and final responses when professional roles were assigned to both user and AI sides across multiple Large Language Models (LLMs). This repository provides the complete dialogue logs (PDF format) and final responses (Markdown format) used in the analysis.

    ## Directory Structure
    The repository structure under the top directory (`PROSPECT/`) is as follows:

    ```
    PROSPECT/
    ├── dialogue/        # Dialogue histories (PDF)
    │   ├── none/
    │   ├── ai_only/
    │   ├── user_only/
    │   └── both/
    └── final_answers/   # Final responses (Markdown)
        ├── none/
        ├── ai_only/
        ├── user_only/
        └── both/
    ```

    - **dialogue/**
    - Contains raw dialogue logs in PDF format. Subdirectories represent role assignment conditions:
    - `none/`: No roles assigned to either user or AI
    - `ai_only/`: Role assigned to AI only
    - `user_only/`: Role assigned to user only
    - `both/`: Roles assigned to both user and AI
    - **final_answers/**
    - Contains final comprehensive analysis responses in Markdown format. Directory structure mirrors that of `dialogue/`.

    ## File Naming Convention
    Files in each directory follow this naming convention:
    ```
    [AI]_[conditionNumber]-[roleNumber].pdf
    [AI]_[conditionNumber]-[roleNumber].md
    ```
    - `[AI]`: AI model name used for dialogue (e.g., ChatGPT, ChatGPT-o1, Claude, Gemini)
    - `[conditionNumber]`: Number indicating role assignment condition
    - 0: none
    - 1: ai_only
    - 2: user_only
    - 3: both
    - `[roleNumber]`: Professional role number
    - 0: No role
    - 1: Detective
    - 2: Psychologist
    - 3: Artist
    - 4: Architect
    - 5: Natural Scientist

    ### Examples:
    - `ChatGPT_3-1.pdf`: Dialogue log with ChatGPT-4o model under "both" condition (3) with detective role (1)
    - `Gemini_1-4.md`: Final response from Gemini model under "ai_only" condition (1) with architect role (4)

    ## Role Number Reference
    | roleNumber | Professional Role |
    |-----------:|:-----------------|
    | 0 | No role |
    | 1 | Detective |
    | 2 | Psychologist |
    | 3 | Artist |
    | 4 | Architect |
    | 5 | Natural Scientist|
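
    As an illustrative aid (not part of the original dataset), the following sketch parses file names that follow the convention above and maps the condition and role numbers to their labels from the tables in this README.

    ```python
    import re
    from pathlib import Path

    PATTERN = re.compile(r"(?P<ai>.+)_(?P<condition>\d)-(?P<role>\d)\.(pdf|md)$")
    CONDITIONS = {0: "none", 1: "ai_only", 2: "user_only", 3: "both"}
    ROLES = {0: "No role", 1: "Detective", 2: "Psychologist",
             3: "Artist", 4: "Architect", 5: "Natural Scientist"}

    def describe(path: str) -> str:
        match = PATTERN.match(Path(path).name)
        if match is None:
            raise ValueError(f"unexpected file name: {path}")
        return (f"{match['ai']} | condition: {CONDITIONS[int(match['condition'])]} | "
                f"role: {ROLES[int(match['role'])]}")

    print(describe("dialogue/both/ChatGPT_3-1.pdf"))  # ChatGPT | condition: both | role: Detective
    ```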

    ## Data Description
    - **Dialogue Histories (PDF format)**
    Complete logs of questions and responses from each session, preserved as captured during the research. All dialogues were conducted in Japanese. While assistant version information is not included, implementation dates and model names are recorded within the files.
    - **Final Responses (Markdown format)**
    Excerpted responses to the final "comprehensive analysis request" as Markdown files, intended for text analysis and keyword extraction. All responses are in Japanese.

    *Note: This dataset contains dialogues and responses exclusively in Japanese. Researchers interested in lexical analysis or content analysis should consider this language specification.

    ## How to Use
    1. Please maintain the folder hierarchy after downloading.
    2. For meta-analysis or lexical analysis, refer to PDFs for complete dialogues and Markdown files for final responses.
    3. Utilize for research reproduction, secondary analysis, or meta-analysis.

    ## License
    This dataset is released under the **CC BY 4.0** License.
    - Free to use and modify, but please cite this repository (DOI) and the associated paper when using the data.

    ## Related Publication


    ## Disclaimer
    - The dialogue logs contain no personal information or confidential data.
    - The provided logs and responses reflect the research timing; identical prompts may yield different responses due to AI model updates.
    - The creators assume no responsibility for any damages resulting from the use of this dataset.

    ## Contact
    For questions or requests, please contact skeisuke@ibaraki-ct.ac.jp.

  17. AI Image Generator Market Report

    • marketresearchforecast.com
    doc, pdf, ppt
    Updated Jan 3, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Market Research Forecast (2025). AI Image Generator Market Report [Dataset]. https://www.marketresearchforecast.com/reports/ai-image-generator-market-5135
    Available download formats: doc, pdf, ppt
    Dataset updated
    Jan 3, 2025
    Dataset authored and provided by
    Market Research Forecast
    License

    https://www.marketresearchforecast.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The AI Image Generator Market size was valued at USD 356.1 million in 2023 and is projected to reach USD 1,094.58 million by 2032, exhibiting a CAGR of 17.4% during the forecast period.

    Recent developments include: September 2023 - OpenAI, a company specializing in the generative AI industry, introduced DALL-E 3, the latest version of its image generator. This upgrade, powered by the ChatGPT controller, produces high-quality images based on natural-language prompts and incorporates ethical safeguards. May 2023 - Stability AI introduced StableStudio, an open-source version of its DreamStudio AI application, specializing in converting text into images. This open-source release enabled developers and creators to access and utilize the technology, creating a wide range of applications for text-to-image generation. April 2023 - VanceAI launched an AI text-to-image generator called VanceAI Art Generator, powered by Stable Diffusion. This tool could interpret text descriptions and generate corresponding artworks. Users could combine image types, styles, and artists, and adjust sizes to transform their creative ideas into visual art. March 2023 - Adobe unveiled Adobe Firefly, a generative AI tool in beta, catering to users without graphic design skills and helping them create images and text effects. This announcement coincided with Microsoft's launch of Copilot, offering automatic content generation for 365 and Dynamics 365 users. These advancements in generative AI provided valuable support and opportunities for individuals facing challenges related to writing, design, or organization. March 2023 - Runway AI introduced Gen-2, a combination of AI models capable of producing short video clips from text prompts. Gen-2, an advancement over its predecessor Gen-1, would generate higher-quality clips and provide users with increased customization options.

    Key drivers for this market are: Growing Adoption of Augmented Reality (AR) and Virtual Reality (VR) to Fuel the Market Growth. Potential restraints include: Concerns related to Data Privacy and Creation of Malicious Content to Hamper the Market. Notable trends are: Growing Implementation of Touch-based and Voice-based Infotainment Systems to Increase Adoption of Intelligent Cars.

  18. LLM Data

    • figshare.com
    xlsx
    Updated Sep 5, 2025
    Cite
    Carter Emerson (2025). LLM Data [Dataset]. http://doi.org/10.6084/m9.figshare.30066574.v2
    Available download formats: xlsx
    Dataset updated
    Sep 5, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Carter Emerson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data from "Prompts to politics: How political identity shapes AI-generated discourse on climate change"

  19. Limits of ChatGPT's Conversational Pragmatics in a Turing Test About Ethics, Commonsense, and Cultural Sensitivity

    • data-staging.niaid.nih.gov
    Updated Jan 31, 2025
    Cite
    Wagner, Wolfgang; Gaskell, George; Paraschou, Eva; Lyu, Siqi; Michali, Maria; Vakali, Athina (2025). Limits of ChatGPT's Conversational Pragmatics in a Turing Test About Ethics, Commonsense, and Cultural Sensitivity [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_14762323
    Dataset updated
    Jan 31, 2025
    Dataset provided by
    Aristotle University of Thessaloniki
    University of Tartu
    South East European Research Centre
    London School of Economics and Political Science
    Authors
    Wagner, Wolfgang; Gaskell, George; Paraschou, Eva; Lyu, Siqi; Michali, Maria; Vakali, Athina
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Does ChatGPT deliver on its explicit claim to be culturally sensitive and its implicit claim to be a friendly digital person when conversing with human users? These claims are investigated from the perspective of linguistic pragmatics, particularly Grice's cooperative principle in communication. Following the pattern of real-life communication, turn-taking conversations reveal limitations in the LLM's grasp of the entire contextual setting described in the prompt. The prompts included ethical issues, a hiking adventure, geographical orientation, and bodily movement. For cultural sensitivity, the prompts came from a Pakistani Muslim in English, from a Hindu in English, and from a Chinese speaker in Chinese. The issues were deeply cultural, involving feelings and affects. Qualitative analysis of the conversation pragmatics showed that ChatGPT is often unable to conduct conversations according to the pragmatic principles of quantity, reliable quality, remaining in focus, and being clear in expression. We conclude that ChatGPT should not be presented as a global LLM but should be subdivided into several culture-specific modules.

  20. A dataset comparing the performance of a publicly available generative artificial intelligence chatbot and its customised version in providing first aid guidance for seizures

    • data.mendeley.com
    Updated Jun 30, 2025
    Cite
    Alexei Birkun (2025). A dataset comparing the performance of a publicly available generative artificial intelligence chatbot and its customised version in providing first aid guidance for seizures [Dataset]. http://doi.org/10.17632/6h53jrhf7t.1
    Dataset updated
    Jun 30, 2025
    Authors
    Alexei Birkun
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The applicability of generative artificial intelligence chatbots as first aid consultants is a topical issue. This dataset contains the results of an analysis comparing the quality of seizure first aid recommendations generated by the publicly available chatbot ChatGPT (GPT-4o model) with those generated by its customised version. The dataset consists of three files. The first file (customisation rules.txt) contains customised text instructions for the chatbot, including definitions of key terms and roles, communication and dialogue style guidelines, a catalogue and description of knowledge base documents, operational recommendations for applying knowledge base documents in dialogue, prohibited actions, barrier mitigation strategies, chatbot phrasing examples, and conversation closure instructions. The second file (instructions.txt) contains four sets of mandatory questions and instructional wordings corresponding to the following emergency scenarios: scenario I – an unconscious victim with ongoing seizures; scenario II – a victim in the postictal period, unconscious, not breathing; scenario III – a victim in the postictal period, unconscious, breathing normally; scenario IV – a victim in the postictal period, conscious. The third file (evaluation results.xlsx) contains the results of a comparative analysis of the effectiveness of the publicly available chatbot (Baseline_# sheets) and its customised version (Custom_# sheets) according to checklists corresponding to the dialogue scenarios.
