Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Clinical data is instrumental to medical research, machine learning (ML) model development, and advancing surgical care, but access is often constrained by privacy regulations and missing data. Synthetic data offers a promising solution to preserve privacy while enabling broader data access. Recent advances in large language models (LLMs) provide an opportunity to generate synthetic data with reduced reliance on domain expertise, computational resources, and pre-training.
Objective: This study aims to assess the feasibility of generating realistic tabular clinical data with OpenAI’s GPT-4o using zero-shot prompting, and to evaluate the fidelity of LLM-generated data by comparing its statistical properties to the Vital Signs DataBase (VitalDB), a real-world open-source perioperative dataset.
Methods: In Phase 1, GPT-4o was prompted to generate a dataset from qualitative descriptions of 13 clinical parameters. The resultant data was assessed for general errors, plausibility of outputs, and cross-verification of related parameters. In Phase 2, GPT-4o was prompted to generate a dataset using descriptive statistics of the VitalDB dataset. Fidelity was assessed using two-sample t-tests, two-sample proportion tests, and 95% confidence interval (CI) overlap.
Results: In Phase 1, GPT-4o generated a complete and structured dataset comprising 6,166 case files. The dataset was plausible in range, and body mass index was correctly calculated for all case files from the respective heights and weights. Statistical comparison between the LLM-generated datasets and VitalDB showed that the Phase 2 data achieved high fidelity: it was statistically similar in 12/13 (92.31%) parameters, with no statistically significant differences observed in 6/6 (100.0%) categorical/binary and 6/7 (85.71%) continuous parameters. Overlap of 95% CIs was observed in 6/7 (85.71%) continuous parameters.
Conclusion: Zero-shot prompting with GPT-4o can generate realistic tabular synthetic datasets that replicate key statistical properties of real-world perioperative data. This study highlights the potential of LLMs as a novel and accessible modality for synthetic data generation, which may address critical barriers in clinical data access and eliminate the need for technical expertise, extensive computational resources, and pre-training. Further research is warranted to enhance fidelity and to investigate the use of LLMs to amplify and augment datasets, preserve multivariate relationships, and train robust ML models.
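To make the fidelity checks described in the Methods concrete, the sketch below shows how such comparisons could be computed for one continuous and one binary parameter using common Python statistics libraries (pandas, SciPy, statsmodels). The file names and column names are hypothetical placeholders; this is an illustration, not the study's actual analysis code.

```python
# Illustrative sketch of the fidelity checks described above; file and column
# names are hypothetical placeholders, not the study's actual code or schema.
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.proportion import proportions_ztest

real = pd.read_csv("vitaldb_cases.csv")              # hypothetical file names
synthetic = pd.read_csv("gpt4o_synthetic_cases.csv")

# Two-sample t-test for a continuous parameter (e.g., age)
t_stat, p_cont = stats.ttest_ind(
    real["age"].dropna(), synthetic["age"].dropna(), equal_var=False
)

# Two-sample proportion test for a binary parameter (e.g., an emergency flag)
counts = np.array([real["emergency"].sum(), synthetic["emergency"].sum()])
nobs = np.array([real["emergency"].count(), synthetic["emergency"].count()])
z_stat, p_bin = proportions_ztest(counts, nobs)

# 95% confidence interval overlap for the continuous parameter's mean
def mean_ci(series, conf=0.95):
    x = series.dropna().to_numpy()
    half = stats.sem(x) * stats.t.ppf((1 + conf) / 2, len(x) - 1)
    return x.mean() - half, x.mean() + half

ci_real, ci_syn = mean_ci(real["age"]), mean_ci(synthetic["age"])
ci_overlap = ci_real[0] <= ci_syn[1] and ci_syn[0] <= ci_real[1]

print(f"t-test p={p_cont:.3f}, proportion-test p={p_bin:.3f}, CI overlap={ci_overlap}")
```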
GNU General Public License v3.0 (GPL-3.0) https://www.gnu.org/licenses/gpl-3.0.html
Emotion analysis from app reviews - Replication package
Full paper accepted at the 33rd IEEE International Requirements Engineering 2025 conference (Research Track).

📚 Summary of artifact
This artifact supports the replication of the study presented in the paper "What About Emotions? Guiding Fine-Grained Emotion Extraction from Mobile App Reviews", accepted at the 33rd IEEE International Requirements Engineering 2025 conference. It provides a comprehensive framework for conducting fine-grained emotion analysis from mobile app reviews using both human and large language model (LLM)-based annotations.
The artifact includes:
- Input: A dataset of user reviews, emotion annotation guidelines, and ground truth annotations from human annotators.
- Process: Scripts for generating emotion annotations via LLMs (GPT-4o, Mistral Large 2, and Gemini 2.0 Flash), splitting annotations into iterations, computing agreement metrics (e.g., Cohen’s Kappa), and evaluating correctness and cost-efficiency.
- Output: Annotated datasets (human and LLM-generated), agreement analyses, emotion statistics, and evaluation metrics including accuracy, precision, recall, and F1 score.
The artifact was developed to ensure transparency, reproducibility, and extensibility of the experimental pipeline. It enables researchers to replicate, validate, or extend the emotion annotation process across different LLMs and configurations, contributing to the broader goal of integrating emotional insights into requirements engineering practices.

🔎 Artifact Location
The artifact is available at https://doi.org/10.6084/m9.figshare.28548638. Find how to cite this replication package and author information at the end of this README file.

📂 Description of Artifact
- Literature review: results from the literature review on opinion mining and emotion analysis within the context of software-based reviews.
- Data: data used in the study, including user reviews (input), human annotations (ground truth), and LLM-based annotations (generated by the assistants).
- Code: code used in the study, including the generative annotation, data processing, and evaluation.

📖 Literature review
Study selection and results are available in the literature_review/study-selection.xlsx file. This file contains the following sheets:
- iteration_1_IC_analysis: results from the first iteration of the inclusion criteria analysis.
- iteration_1_feature_extraction: results from the first iteration of the feature extraction analysis.
- iteration_2_IC_analysis: results from the second iteration of the inclusion criteria analysis.
- iteration_2_feature_extraction: results from the second iteration of the feature extraction analysis.
- iteration_3_IC_analysis: results from the third iteration of the inclusion criteria analysis.
- iteration_3_feature_extraction: results from the third iteration of the feature extraction analysis.
- emotions: statistical analysis of emotions covered by emotion taxonomies in the selected studies.

🗃️ Data
The data root folder contains the following files:
- reviews.json contains the reviews used in the study.
- guidelines.txt contains a .txt version of the annotation guidelines.
- ground-truth.xlsx contains the ground truth (human agreement) annotations for the reviews.
In addition, the data root folder contains the following subfolders:
- assistants contains the IDs of the assistants used for the generative annotation (see LLM-based annotation).
- annotations contains the results of the human and LLM-based annotation:
  -- iterations contains both human and LLM-based annotations for each iteration.
  -- llm-annotations contains the LLM-based annotations for each assistant, including results for various temperature values: low (0), medium (0.5), and high (1) (see LLM-based annotation).
- agreements contains the results of the agreement analysis between the human and LLM-based annotations (see Data Processing).
- evaluation contains the results of the evaluation of the LLM-based annotations (see Evaluation), including statistics, Cohen's Kappa, correctness, and cost-efficiency analysis, which includes token usage and reported human annotation times.

⚙️ System Requirements
All artifacts in this replication package are runnable on any operating system with the following requirements:
- OS: Linux-based OS, macOS, or Windows with a Unix-like shell (for example, the Git Bash CLI)
- Python 3.10
Additionally, you will need at least one API key for OpenAI, Mistral, or Gemini. See Step 1 in Usage Instructions & Steps to reproduce.

💻 Installation Instructions
⚙️ Install requirements
Create a virtual environment:
python -m venv venv
Activate the virtual environment. For Linux-based OS or macOS:
source venv/bin/activate
For Windows with Unix-like shells (for example, the Git Bash CLI):
source venv/Scripts/activate
Install the Python dependencies by running the following command:
pip install -r requirements.txt
Now you're ready to start the annotation process!

💻 Usage Instructions & Steps to reproduce
We structure the code available in this replication package based on the stages involved in the LLM-based annotation process.

🤖 LLM-based annotation
The llm_annotation folder contains the code used to generate the LLM-based annotations. There are two main scripts:
- create_assistant.py is used to create a new assistant with a particular provider and model. This class includes the definition of a common system prompt across all agents, using the data/guidelines.txt file as the basis.
- annotate_emotions.py is used to annotate a set of emotions using a previously created assistant. This script includes the assessment of the output format, as well as some common metrics for cost-efficiency analysis and output file generation.
Our research includes LLM-based annotation experiments with 3 LLMs: GPT-4o, Mistral Large 2, and Gemini 2.0 Flash. To illustrate the usage of the code, in this README we refer to the code execution for generating annotations using GPT-4o. However, full code is provided for all LLMs.

🔑 Step 1: Add your API key
If you haven't done this already, add your API key to the .env file in the root folder. For instance, for OpenAI, you can add the following:
OPENAI_API_KEY=sk-proj-...

🛠️ Step 2: Create an assistant
Create an assistant using the create_assistant.py script. For instance, for GPT-4o, you can run the following command:
python ./code/llm_annotation/create_assistant_openai.py --guidelines ./data/guidelines.txt --model gpt-4o
This will create an assistant loading the data/guidelines.txt file and using the GPT-4o model.

📝 Step 3: Annotate emotions
Annotate emotions using the annotate_emotions.py script.
For instance, for GPT-4o, you can run the following command using a small subset of 100 reviews from the ground truth as an example:
python ./code/llm_annotation/annotate_emotions_openai.py --input ./data/ground-truth-small.xlsx --output ./data/annotations/llm/temperature-00/ --batch_size 10 --model gpt-4o --temperature 0 --sleep_time 10
For annotating the whole dataset, run the following command (IMPORTANT: this will take more than 60 minutes due to OpenAI, Mistral and Gemini consumption times!):
python ./code/llm_annotation/annotate_emotions_openai.py --input ./data/ground-truth.xlsx --output ./data/annotations/llm/temperature-00/ --batch_size 10 --model gpt-4o --temperature 0 --sleep_time 10
Parameters include:
- input: path to the input file containing the set of reviews to annotate (e.g., data/ground-truth.xlsx).
- output: path to the output folder where annotations will be saved (e.g., data/annotations/llm/temperature-00/).
- batch_size: number of reviews to annotate in each user request (e.g., 10).
- model: model to use for the annotation (e.g., gpt-4o).
- temperature: temperature for the model responses (e.g., 0).
- sleep_time: time to wait between batches, in seconds (e.g., 10).
This will annotate the emotions using the assistant created in the previous step, creating a new file with the same format as the data/ground-truth.xlsx file.

🔄 Data processing
In this stage, we refactor all files into iterations and consolidate the agreement between multiple annotators or LLM runs. This logic serves both human and LLM annotations. Parameters can be updated to include more annotators or LLM runs.

✂️ Step 4: Split annotations into iterations
We split the annotations into iterations based on the number of annotators or LLM runs. For instance, for GPT-4o (run 0), we can run the following command:
python code/data_processing/split_annotations.py --input_file data/annotations/llm/temperature-00/gpt-4o-0-annotations.xlsx --output_dir data/annotations/iterations/
This facilitates the Kappa analysis and agreement in alignment with each human iteration.

🤝 Step 5: Analyse agreement
We consolidate the agreement between multiple annotators or LLM runs. For instance, for GPT-4o, we can run the following command to use the run from Step 3 (run 0) and three additional annotations (runs 1, 2, and 3) already available in the replication package (NOTE: we simplify the process to speed up the analysis and avoid delays in annotation):
python code/evaluation/agreement.py --input-folder data/annotations/iterations/ --output-folder data/agreements/ --annotators gpt-4o-0 gpt-4o-1 gpt-4o-2 gpt-4o-3
For replicating our original study, run the following:
python code/evaluation/agreement.py --input-folder data/annotations/iterations/ --output-folder data/agreements/ --annotators gpt-4o-1 gpt-4o-2 gpt-4o-3

📊 Evaluation
After consolidating agreements, we can evaluate both the Cohen's Kappa agreement and the correctness between the human and LLM-based annotations. Our code allows any combination of annotators and LLM runs.

📈 Step 6: Emotion statistics
We evaluate the statistics of the emotions in the annotations, including emotion frequency, distribution, and correlation between emotions. For instance, for GPT-4o and the example in this README file, we can run the following command:
python code/evaluation/emotion_statistics.py --input-file
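As an orientation to the agreement analysis described in Step 5, the sketch below shows one way a pairwise Cohen's Kappa could be computed with scikit-learn; the file paths and column names are hypothetical placeholders, and the agreement.py script in this artifact remains the authoritative implementation.

```python
# Minimal sketch of a pairwise Cohen's Kappa computation between two annotators.
# File paths and column names are hypothetical; see code/evaluation/agreement.py
# for the replication package's actual logic.
import pandas as pd
from sklearn.metrics import cohen_kappa_score

human = pd.read_excel("data/annotations/iterations/human-iteration-1.xlsx")      # hypothetical path
gpt4o = pd.read_excel("data/annotations/iterations/gpt-4o-0-iteration-1.xlsx")   # hypothetical path

# Assume one emotion label per review in an "emotion" column, aligned by a review id.
merged = human.merge(gpt4o, on="review_id", suffixes=("_human", "_llm"))
kappa = cohen_kappa_score(merged["emotion_human"], merged["emotion_llm"])
print(f"Cohen's Kappa (human vs. GPT-4o run 0): {kappa:.3f}")
```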
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview:
This collection contains three synthetic datasets produced by gpt-4o-mini for sentiment analysis and PDT (Product Desirability Toolkit) testing. Each dataset contains 1,000 hypothetical software product reviews, with the aim of producing a diversity of sentiment and text. The datasets were created as part of the research described in:
Hastings, J.D., Weitl-Harms, S., Doty, J., Myers, Z. L., and Thompson, W., “Utilizing Large Language Models to Synthesize Product Desirability Datasets,” in Proceedings of the 2024 IEEE International Conference
on Big Data (BigData-24), Workshop on Large Language and Foundation Models (WLLFM-24), Dec. 2024.
https://arxiv.org/abs/2411.13485.
Briefly, each row in the datasets was produced as follows:
1) Word+Review: The LLM selected a word and synthesized a review that would align with a random target sentiment.
2) Review+Word: The LLM produced a review to align with the target sentiment score, and then selected a word appropriate for the review.
3) Supply-Word: A word was supplied to the LLM, which scored it and then produced a review to align with that score.
For sentiment analysis and PDT testing, the two columns of main interest across the datasets are likely 'Selected Word' and 'Hypothetical Review'.
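As a quick-start illustration of working with those two columns, the sketch below loads one of the datasets and scores the reviews with a generic off-the-shelf sentiment analyzer (NLTK's VADER); the CSV file name is a placeholder and the scorer is only an example, not part of the original study.

```python
# Quick-start sketch: score the 'Selected Word' / 'Hypothetical Review' data with
# VADER and summarize by word. The file name is a hypothetical placeholder.
import pandas as pd
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

df = pd.read_csv("word_review_dataset.csv")  # hypothetical file name
df["review_sentiment"] = df["Hypothetical Review"].apply(
    lambda text: sia.polarity_scores(text)["compound"]
)

# Inspect how review sentiment varies across the selected PDT words.
print(df.groupby("Selected Word")["review_sentiment"].describe().head(10))
```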
License:
This data is licensed under the CC Attribution 4.0 international license, and may be taken and used freely with credit given. Cite as:
Hastings, J., Weitl-Harms, S., Doty, J., Myers, Z., & Thompson, W. (2024). Synthetic Product Desirability Datasets for Sentiment Analysis Testing (1.0.0). Zenodo. https://doi.org/10.5281/zenodo.14188456
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
NEAR Cortex-1 Market Analysis Dataset
Dataset Summary
This dataset contains blockchain market analyses combining historical and real-time data with chain-of-thought reasoning. The dataset includes examples from Ethereum, Bitcoin, and NEAR chains, demonstrating high-quality market analysis with explicit calculations, numerical citations, and actionable insights. The dataset has been enhanced with examples generated by GPT-4o and Claude 3.7 Sonnet, providing diverse… See the full description on the dataset page: https://huggingface.co/datasets/Jarrodbarnes/cortex-1-market-analysis.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Synthetic QA Dataset for Biomedical Paper Analysis (GPT-4o Generated)
This dataset consists of synthetically generated question-answer pairs designed to simulate the process of answering high-level research questions about biomedical papers. It was created using OpenAI's GPT-4o model and is tailored for fine-tuning or evaluating models on tasks such as biomedical reading comprehension, information extraction, and reasoning.
Dataset Structure
Each data sample is a JSON… See the full description on the dataset page: https://huggingface.co/datasets/AbrehamT/classified_papers.
Artificial Relationships in Fiction

Dataset Description
Artificial Relationships in Fiction (ARF) is a synthetically annotated dataset for Relation Extraction (RE) in fiction, created from a curated selection of literary texts sourced from Project Gutenberg. The dataset captures the rich, implicit relationships within fictional narratives using a novel ontology and GPT-4o for annotation. ARF is the first large-scale RE resource designed specifically for literary texts, advancing both NLP model training and computational literary analysis.
Dataset Configurations and Features

Configurations
- fiction_books: Metadata-rich corpus of 6,322 public domain fiction books (1850–1950) with inferred author gender and thematic categorization.
- fiction_books_in_chunks: Books segmented into 5-sentence chunks (5.96M total), preserving narrative coherence via 1-sentence overlap.
- fiction_books_with_relations: A subset of 95,475 text chunks annotated with 128,000+ relationships using GPT-4o and a fiction-specific ontology.
fiction_books
Description: Contains the full text and metadata of 6,322 English-language fiction books from Project Gutenberg.
Features:
- book_id: Unique Project Gutenberg ID.
- title: Title of the book.
- author: Author name.
- author_birth_year / author_death_year: Author lifespan.
- release_date: PG release date.
- subjects: List of thematic topics (mapped to 51 standardized themes).
- gender: Inferred author gender (via GPT-4o).
- text: Cleaned full book text.
Use Case: Supports thematic and demographic analysis of literary texts.
fiction_books_in_chunks
Description: Each book is segmented into overlapping five-sentence text chunks to enable granular NLP analysis.
Features:
- book_id, chunk_index: Book and chunk identifiers.
- text_chunk: Five-sentence excerpt from the book.
Use Case: Facilitates sequence-level tasks like coreference resolution or narrative progression modeling.
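The segmentation scheme above (five-sentence windows sharing one sentence) can be reconstructed with a short sketch like the following; it uses NLTK's sentence tokenizer and is a plausible illustration, not the authors' preprocessing code.

```python
# Plausible reconstruction of five-sentence chunking with a one-sentence overlap;
# not the authors' code.
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)  # needed by newer NLTK releases

def chunk_book(text: str, size: int = 5, overlap: int = 1) -> list[str]:
    sentences = sent_tokenize(text)
    step = size - overlap  # advance by 4 so consecutive chunks share 1 sentence
    return [
        " ".join(sentences[i:i + size])
        for i in range(0, max(len(sentences) - overlap, 1), step)
    ]

sample = "One. Two. Three. Four. Five. Six. Seven. Eight. Nine."
for idx, chunk in enumerate(chunk_book(sample)):
    print(idx, chunk)
```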
fiction_books_with_relations
Description: This subset corresponds to the Artificial Relationships in Fiction (ARF) dataset proposed in the LaTeCH-CLfL 2025 paper "Artificial Relationships in Fiction: A Dataset for Advancing NLP in Literary Domains".
Features:
- book_id, chunk_index: Identifiers.
- text_chunk: Five-sentence text segment.
- relations: A list of structured relation annotations, each containing:
  - entity1, entity2: Text spans.
  - entity1Type, entity2Type: Entity types based on ontology.
  - relation: Relationship type.
Use Case: Ideal for training and evaluating RE models in fictional narratives, studying character networks, and generating structured data from literary texts.
ARF Dataset Structure (config 'synthetic_relations_in_fiction_books')
Each annotated relation is formatted as:
{
  "entity1": "Head Entity text",
  "entity2": "Tail Entity text",
  "entity1Type": "Head entity type",
  "entity2Type": "Tail entity type",
  "relation": "Relation type"
}
Example:
{
  "entity1": "Vortigern",
  "entity2": "castle",
  "entity1Type": "PER",
  "entity2Type": "FAC",
  "relation": "owns"
}
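To show how these annotations can be consumed downstream, the sketch below aggregates person-to-person relations into simple character-network edges, one of the use cases named for this configuration. The in-memory chunks list mimics the documented record structure, and the second relation entry uses placeholder names purely for illustration.

```python
# Sketch: aggregate PER-PER relations into character-network edges. The chunks
# list mimics the documented structure; "Alice"/"Bob" are placeholder names,
# not dataset content.
from collections import Counter

chunks = [
    {
        "book_id": 1,
        "chunk_index": 0,
        "text_chunk": "...",
        "relations": [
            {"entity1": "Vortigern", "entity2": "castle",
             "entity1Type": "PER", "entity2Type": "FAC", "relation": "owns"},
            {"entity1": "Alice", "entity2": "Bob",
             "entity1Type": "PER", "entity2Type": "PER", "relation": "friend_of"},
        ],
    },
]

edges = Counter()
for chunk in chunks:
    for rel in chunk["relations"]:
        if rel["entity1Type"] == "PER" and rel["entity2Type"] == "PER":
            edges[(rel["entity1"], rel["relation"], rel["entity2"])] += 1

for (head, rel_type, tail), count in edges.most_common():
    print(f"{head} --{rel_type}--> {tail}: {count}")
```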
Entity Types (11)

| Entity Type | Description |
|-------------|-------------|
| PER | Person or group of people |
| FAC | Facility – man-made structures for human use |
| LOC | Location – natural or loosely defined geographic regions |
| WTHR | Weather – atmospheric or celestial phenomena |
| VEH | Vehicle – transport devices (e.g., ship, carriage) |
| ORG | Organization – formal groups or institutions |
| EVNT | Event – significant occurrences in narrative |
| TIME | Time – chronological or historical expressions |
| OBJ | Object – tangible items in the text |
| SENT | Sentiment – emotional states or feelings |
| CNCP | Concept – abstract ideas or motifs |
Relation Types (48)

| Relation Type | Entity 1 Type | Entity 2 Type | Description |
|---------------|---------------|---------------|-------------|
| parent_father_of | PER | PER | Father relationship |
| parent_mother_of | PER | PER | Mother relationship |
| child_of | PER | PER | Child to parent |
| sibling_of | PER | PER | Sibling relationship |
| spouse_of | PER | PER | Spousal relationship |
| relative_of | PER | PER | Extended family relationship |
| adopted_by | PER | PER | Adopted by another person |
| companion_of | PER | PER | Companionship or ally |
| friend_of | PER | PER | Friendship |
| lover_of | PER | PER | Romantic relationship |
| rival_of | PER | PER | Rivalry |
| enemy_of | PER/ORG | PER/ORG | Hostile or antagonistic relationship |
| inspires | PER | PER | Inspires or motivates |
| sacrifices_for | PER | PER | Makes a sacrifice for |
| mentor_of | PER | PER | Mentorship or guidance |
| teacher_of | PER | PER | Formal teaching relationship |
| protector_of | PER | PER | Provides protection to |
| employer_of | PER | PER | Employment relationship |
| leader_of | PER | ORG | Leader of an organization |
| member_of | PER | ORG | Membership in an organization |
| lives_in | PER | FAC/LOC | Lives in a location |
| lived_in | PER | TIME | Historically lived in |
| visits | PER | FAC | Visits a facility |
| travel_to | PER | LOC | Travels to a location |
| born_in | PER | LOC | Birthplace |
| travels_by | PER | VEH | Travels by a vehicle |
| participates_in | PER | EVNT | Participates in an event |
| causes | PER | EVNT | Causes an event |
| owns | PER | OBJ | Owns an object |
| believes_in | PER | CNCP | Believes in a concept |
| embodies | PER | CNCP | Embodies a concept |
| located_in | FAC | LOC | Located in a place |
| part_of | FAC/LOC/ORG | FAC/LOC/ORG | Part of a larger entity |
| owned_by | FAC/VEH | PER | Owned by someone |
| occupied_by | FAC | PER | Occupied by someone |
| used_by | FAC | ORG | Used by an organization |
| affects | WTHR | LOC/EVNT | Weather affects location or event |
| experienced_by | WTHR | PER | Weather experienced by someone |
| travels_in | VEH | LOC | Vehicle travels in a location |
| based_in | ORG | LOC | Organization based in a location |
| attended_by | EVNT | PER | Event attended by person |
| ends_in | EVNT | TIME | Event ends at a time |
| occurs_in | EVNT | LOC/TIME | Event occurs in a place or time |
| features | EVNT | OBJ | Event features an object |
| stored_in | OBJ | LOC/FAC | Object stored in a place |
| expressed_by | SENT | PER | Sentiment expressed by person |
| used_by | OBJ | PER | Object used by person |
| associated_with | CNCP | EVNT | Concept associated with event |
Dataset Statistics

| Metric | Value |
|--------|-------|
| Books | 96 |
| Authors | 91 |
| Gender Ratio (M/F) | 55% / 45% |
| Subgenres | 51 |
| Annotated Chunks | 95,475 |
| Relations per Chunk | 1.34 avg |
| Chunks with No Relations | 35,230 |
| Total Relations | ~128,000 |
Methodology
- Source Texts: English-language fiction from PG bookshelves: Fiction, Children & YA, Crime/Mystery.
- Annotation Model: GPT-4o via custom prompt integrating strict ontologies.
- Sampling: Balanced author gender and thematic distributions.
- Ontology Adherence: <0.05% deviation for entities; 2.95% for relations.
- Format: Structured JSON, optimized for NLP pipelines.
Applications
- Fine-tuning RE Models: Adapt models to literary domains with implicit, evolving relationships.
- Computational Literary Studies: Analyze character networks, thematic evolution, and genre patterns.
- Creative AI: Enhance AI-driven storytelling, character consistency, and world-building tools.
Citation If you use this dataset in your research, please cite:
@inproceedings{christou-tsoumakas-2025-artificial,
  title     = "Artificial Relationships in Fiction: A Dataset for Advancing {NLP} in Literary Domains",
  author    = "Christou, Despina and Tsoumakas, Grigorios",
  editor    = "Kazantseva, Anna and Szpakowicz, Stan and Degaetano-Ortlieb, Stefania and Bizzoni, Yuri and Pagel, Janis",
  booktitle = "Proceedings of the 9th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2025)",
  month     = may,
  year      = "2025",
  address   = "Albuquerque, New Mexico",
  publisher = "Association for Computational Linguistics",
  url       = "https://aclanthology.org/2025.latechclfl-1.13/",
  pages     = "130--147",
  ISBN      = "979-8-89176-241-1"
}
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This pilot study is the first phase of a broader project aimed at developing an explainable artificial intelligence (AI) tool to support the ethical evaluation of Japanese-language clinical research documents. The tool is explicitly not intended to assist document drafting. We assessed the baseline performance of generative AI—Generative Pre-trained Transformer (GPT)-4 and GPT-4o—in analyzing clinical research protocols and informed consent forms (ICFs). The goal was to determine whether these models could accurately and consistently extract ethically relevant information, including the research objectives and background, research design, and participant-related risks and benefits. First, we compared the performance of GPT-4 and GPT-4o using custom agents developed via OpenAI’s Custom GPT functionality (hereafter “GPTs”). Then, using GPT-4o alone, we compared outputs generated by GPTs optimized with customized Japanese prompts to those generated by standard prompts. GPT-4o achieved 80% agreement in extracting research objectives and background and 100% in extracting research design, while both models demonstrated high reproducibility across ten trials. GPTs with customized prompts produced more accurate and consistent outputs than standard prompts. This study suggests the potential utility of generative AI in pre-institutional review board (IRB) review tasks; it also provides foundational data for future validation and standardization efforts involving retrieval-augmented generation and fine-tuning. Importantly, this tool is intended not to automate ethical review but rather to support IRB decision-making. Limitations include the absence of gold standard reference data, reliance on a single evaluator, lack of convergence and inter-rater reliability analysis, and the inability of AI to substitute for in-person elements such as site visits.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset supports the findings in the preprint 'Academic collaboration on large language model studies increases overall but varies across disciplines.' The study aims to explore the application of large language models (LLMs) in scientific disciplines and their implications for interdisciplinary collaboration.
To build the LLM paper group, we start with a broad search using general terms related to LLMs and popular models based on the MMLU benchmark, spanning from October 2018 to September 2024. We apply this search to the title and abstract to avoid excessive noise in the dataset, and then apply a series of filtering steps to enhance relevance and remove duplicates. The resulting dataset contains 59,293 papers.
In addition to the paper group on the topic of LLMs, we establish two control groups. The first control group focuses on machine learning (ML) papers. We select ML as a control because it is a well-established field from which LLMs emerged as a subfield. To construct this group, we collect a random sample of 70,945 papers containing the phrase "machine learning" in either their title or abstract. To provide an even broader perspective beyond AI-related fields, we create a second control group consisting of a random sample of 73,110 papers from all other research categories, specifically papers that belong neither to the ML nor the LLM categories.
The three files below contain the cleaned samples collected from OpenAlex, which are derived from the original files.
LLM: llm-cleaned-samples.csv
ML: ml-cleaned-samples.csv
Non-LLM/ML: non-llm-cleaned-samples.csv
The three zip files below contain author affiliation information (including departmental discipline) extracted by GPT-4o-mini to support the departmental analysis in the paper:
LLM: llm-author-affiliations.zip
ML: ml-author-affiliations.zip
Non-LLM/ML: non-llm-author-affiliations.zip
The three files below contain the paper information used to support all the analysis in our paper:
LLM: llm-information-entropy.csv
ML: ml-information-entropy.csv
Non-LLM/ML: non-llm-information-entropy.csv
If you have any additional questions, please feel free to contact lingyaol@umich.edu or lydinh@usf.edu.
Project Overview
This dataset supports the paper titled "Self-Help AI Psychological Counseling System Based on Large Language Models and Its Effectiveness Evaluation". It includes all configuration files, source code, and experimental data used in the two-stage study. Study 1 focused on system construction and prompt optimization, while Study 2 evaluated the mental health effects of the AI counseling system through a randomized controlled trial (RCT).

Directory Structure
1. BOT_SYSTEM_SET/ – AI System Configuration
- LLMs_SET.json: Core API parameter settings for GPT-4o, including temperature, max_tokens, top_p, frequency_penalty, and presence_penalty.
- SYSTEM_CoT_PROMPT.json: Prompt engineering file containing the System Prompt and Chain-of-Thought strategy for guiding the LLM to act as a professional psychological counselor.
2. DATA_AND_CODE/ – Experimental Data and Analysis Scripts
Study 1: System construction and prompt optimization
- E1_analysis_code.py: Python script for analyzing model evaluation data.
- E1_results.txt: Comparison results of model performance before and after prompt engineering.
- E1_scb.csv: Evaluation data of baseline model using simple role prompts.
- E1_scp.csv: Evaluation data of baseline model using CoT-enhanced prompts.
Study 2: Randomized Controlled Trial (RCT) for effectiveness evaluation
- E2_analysis_code.r: R script for linear mixed model (LMM) analysis.
- E2_results.txt: Output from the statistical modeling.
- E2_data.RData: Long-format RData file containing all participants’ measurements at T1, T2, and T3.

Notes
All participant data have been anonymized and approved by the institutional ethics committee. Model configurations reflect the GPT-4o API settings as of June 2024. For reproducibility, please follow the script execution order precisely.

Contact
For further information or collaboration inquiries, please contact the corresponding author.
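As an illustration of how the configuration described above could be wired together, the sketch below loads the documented API parameters and sends a single GPT-4o chat request; the JSON key names, file layout, and system-prompt field are assumptions, not the authors' released code.

```python
# Illustrative sketch only: load the parameters documented for LLMs_SET.json and
# call the GPT-4o chat API. Key names and the system-prompt field are assumptions.
import json
from openai import OpenAI

with open("BOT_SYSTEM_SET/LLMs_SET.json") as f:
    cfg = json.load(f)  # expected keys: temperature, max_tokens, top_p, frequency_penalty, presence_penalty

with open("BOT_SYSTEM_SET/SYSTEM_CoT_PROMPT.json") as f:
    system_prompt = json.load(f)["system_prompt"]  # hypothetical key name

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "I've been feeling overwhelmed lately."},
    ],
    temperature=cfg["temperature"],
    max_tokens=cfg["max_tokens"],
    top_p=cfg["top_p"],
    frequency_penalty=cfg["frequency_penalty"],
    presence_penalty=cfg["presence_penalty"],
)
print(response.choices[0].message.content)
```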
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is a synthetic version inspired by the original "Stroke Prediction Dataset" on Kaggle. It contains anonymized, artificially generated data intended for research and model training on healthcare-related stroke prediction. The dataset generated using GPT-4o contains 50,000 records and 12 features. The target variable is stroke, a binary classification where 1 represents stroke occurrence and 0 represents no stroke. The dataset includes both numerical and categorical features, requiring preprocessing steps before analysis. A small portion of the entries includes intentionally introduced missing values to allow users to practice various data preprocessing techniques such as imputation, missing data analysis, and cleaning. The dataset is suitable for educational and research purposes, particularly in machine learning tasks related to classification, healthcare analytics, and data cleaning. No real-world patient information was used in creating this dataset.
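As a preprocessing illustration of the missing-value handling the dataset is designed for, a minimal sketch follows; the CSV file name is a placeholder, and only the documented binary stroke target is assumed to exist by name.

```python
# Minimal preprocessing sketch for the synthetic stroke dataset described above.
# The file name is a placeholder; only the "stroke" target column is documented.
import pandas as pd

df = pd.read_csv("synthetic_stroke_prediction.csv")  # hypothetical file name

y = df["stroke"]                 # documented binary target (1 = stroke, 0 = no stroke)
X = df.drop(columns=["stroke"])  # the remaining features

# Simple imputation: median for numeric columns, mode for categorical columns.
numeric_cols = X.select_dtypes(include="number").columns
categorical_cols = X.select_dtypes(exclude="number").columns
X[numeric_cols] = X[numeric_cols].fillna(X[numeric_cols].median())
for col in categorical_cols:
    X[col] = X[col].fillna(X[col].mode().iloc[0])

# One-hot encode categoricals so the table is ready for most classifiers.
X = pd.get_dummies(X, columns=list(categorical_cols), drop_first=True)
print(X.shape, y.value_counts(normalize=True))
```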
VLM²-Bench: Benchmarking Vision-Language Models on Visual Cue Matching

Description
VLM²-Bench is the first comprehensive benchmark designed to evaluate vision-language models' (VLMs) ability to visually link matching cues across multi-image sequences and videos. The benchmark consists of 9 subtasks with over 3,000 test cases, focusing on fundamental visual linking capabilities that humans use daily. A key example is identifying the same person across different photos without prior knowledge of their identity.
Through extensive evaluation of eight open-source VLMs and GPT-4o using various prompting techniques, we uncover significant challenges in visual cue linking. Even the best-performing model, GPT-4o, falls 34.80% below human-level performance. Our analysis highlights critical areas for improvement:
1. Enhancing core visual understanding with reduced reliance on prior knowledge.
2. Better integration of language reasoning within visual tasks.
3. Developing training approaches that improve independent visual relationship inference.
Dataset Characteristics
- Size: 3,000+ test cases
- Modalities: Text, image, video
- Question Types: True/False, multiple-choice, numerical, open-ended
- Generation Process: Semi-automated with human verification
- Structure: Organized into three primary categories:
  - General Cue (GC): Evaluates visual element tracking and matching.
  - Object-centric Cue (OC): Focuses on object comparison, counting, and grouping.
  - Person-centric Cue (PC): Measures the ability to compare, count, group, and describe individuals across frames.
Potential Use Cases
- Benchmarking vision-language models (VLMs) for real-world multi-modal reasoning.
- Evaluating visual linking abilities and spatial awareness in large models.
- Analyzing weaknesses in object permanence and relational inference.
- Providing insights for improving next-generation vision-language architectures.
Paper & Code
📄 Paper: VLM²-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues
📂 Code Repository: GitHub - vlm2-bench/VLM2-Bench
BibTeX Citation

@misc{zhang2025vlm2benchcloserlookvlms,
  title         = {VLM$^2$-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues},
  author        = {Jianshu Zhang and Dongyu Yao and Renjie Pi and Paul Pu Liang and Yi R. Fung},
  year          = {2025},
  eprint        = {2502.12084},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2502.12084}
}
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The advent of Large Language Models (LLMs) has created new opportunities for the automation of scientific research spanning both experimental processes and computational simulations. This study explores the feasibility of constructing an autonomous simulation agent (ASA) powered by LLMs through prompt engineering and automated program design to automate the entire simulation research process according to a human-provided research plan. This process includes experimental design, remote upload and simulation execution, data analysis, and report compilation. Using a well-studied simulation problem of polymer chain conformations as a test case, we assessed the long-task completion and reliability of ASAs powered by different LLMs, including GPT-4o, Claude-3.5, etc. Our findings revealed that ASA-GPT-4o achieved near-flawless execution on designated research missions, underscoring the potential of methods like ASA to achieve automation in simulation research processes to enhance research efficiency. The outlined automation can be iteratively performed for up to 20 cycles without human intervention, illustrating the potential of ASA for long-task workflow automation. Additionally, we discussed the intrinsic traits of ASA in managing extensive tasks, focusing on self-validation mechanisms, and the balance between local attention and global oversight.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
A dataset of 76 Python programs taken from real Python open source projects (top 100 on GitHub), where each program is a file that has exactly 1 vulnerability as detected by a particular static analyzer (Semgrep), used in the paper Patched MOA: optimizing inference for diverse software development tasks. OpenAI used the synth-vuln-fixes dataset to fine-tune a new version of gpt-4o, which is now the SOTA on this benchmark. More details and code are available from their repo.
More details on the benchmark… See the full description on the dataset page: https://huggingface.co/datasets/patched-codes/static-analysis-eval.
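For a first look at the benchmark, the short sketch below loads it with the Hugging Face datasets library and prints the schema before anything else is assumed about its fields; the "train" split name is an assumption.

```python
# Load the benchmark and inspect its schema; the split name is an assumption.
from datasets import load_dataset

ds = load_dataset("patched-codes/static-analysis-eval", split="train")
print(ds)               # number of rows and column names
print(ds.column_names)  # inspect the schema before relying on any field
```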
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset consists of 10 000 simulated observations. It is utilized to explore and apply Large language models for data analysis, via the analytical framework Analysis of Individual Heterogeneity and Discriminatory Accuracy (AIHDA). The dataset is based on a previous study's aggregated results (Öberg J, Khalaf K, Perez Vicente R, Johnell K, Fastbom J, J. M. Geographic and socioeconomic differences in potentially inappropriate medication among older adults – Applying a simplified analysis of individual heterogeneity and discriminatory accuracy (AIHDA) for basic comparisons of healthcare quality. BMC Health Services Research. 2024 (Under peer-review). Empirical patient data must be analyzed within a secure IT environment to ensure confidentiality. By utilizing simulated patient data, we can apply a cloud-based GPT to our analysis, thereby gaining access to computational power and LLM capabilities that would otherwise be inaccessible to us via local LLMs. For the purposes of our study, a simulated database is a suitable solution. The simulated database was created by ChatGPT 4o based on the previous publication already referenced. By doing so, we can illustrate the application of GPT-based analysis in a real-world example of a healthcare quality indicator. The quality indicator, known as potentially inappropriate medication among older adults, is managed by the Swedish National Board of Health and Welfare (NBHW).
Dataset Summary
Synthetic Persian Chatbot Conversational SA – Friendship is a Persian (Farsi) dataset created for the Classification task, with a focus on detecting the emotion "friendship" in chatbot conversations. It is part of the FaMTEB (Farsi Massive Text Embedding Benchmark). The dataset was synthetically generated using GPT-4o-mini and is derived from the broader Synthetic Persian Chatbot Conversational Sentiment Analysis dataset.
Language(s): Persian (Farsi)
Task(s):… See the full description on the dataset page: https://huggingface.co/datasets/MCINext/synthetic-persian-chatbot-conversational-sentiment-analysis-friendship.
Dataset Summary
Synthetic Persian Chatbot Conversational SA – Love is a Persian (Farsi) dataset for the Classification task, specifically focused on detecting the emotion "love" in user-chatbot conversations. It is part of the FaMTEB (Farsi Massive Text Embedding Benchmark). This dataset was synthetically generated using GPT-4o-mini and is a subset of the broader Synthetic Persian Chatbot Conversational Sentiment Analysis collection.
Language(s): Persian (Farsi)
Task(s):… See the full description on the dataset page: https://huggingface.co/datasets/MCINext/synthetic-persian-chatbot-conversational-sentiment-analysis-love.
Dataset Summary
Synthetic Persian Chatbot Conversational SA – Surprise is a Persian (Farsi) dataset for the Classification task, focused on detecting the expression of "surprise" in user-chatbot conversations. It is part of the FaMTEB (Farsi Massive Text Embedding Benchmark). The dataset was synthetically generated using GPT-4o-mini and is a subset of the Synthetic Persian Chatbot Conversational Sentiment Analysis dataset.
Language(s): Persian (Farsi)
Task(s): Classification… See the full description on the dataset page: https://huggingface.co/datasets/MCINext/synthetic-persian-chatbot-conversational-sentiment-analysis-surprise.
Dataset Summary
Synthetic Persian Chatbot Conversational SA – Happiness is a Persian (Farsi) dataset for the Classification task, specifically focused on detecting the emotion "happiness" in user-chatbot conversations. It is part of the FaMTEB (Farsi Massive Text Embedding Benchmark). This dataset was synthetically generated using GPT-4o-mini, and is a subset of the broader Synthetic Persian Chatbot Conversational Sentiment Analysis collection.
Language(s): Persian (Farsi)… See the full description on the dataset page: https://huggingface.co/datasets/MCINext/synthetic-persian-chatbot-conversational-sentiment-analysis-happiness.
Dataset Summary
Synthetic Persian Chatbot Conversational SA – Sadness is a Persian (Farsi) dataset for the Classification task, focused on detecting the expression of "sadness" in user-chatbot conversations. It is part of the FaMTEB (Farsi Massive Text Embedding Benchmark). This dataset was synthetically generated using GPT-4o-mini and is a subset of the broader Synthetic Persian Chatbot Conversational Sentiment Analysis dataset.
Language(s): Persian (Farsi)
Task(s):… See the full description on the dataset page: https://huggingface.co/datasets/MCINext/synthetic-persian-chatbot-conversational-sentiment-analysis-sadness.