Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset Card for "alpaca-gpt4"
This dataset contains English instruction-following data generated by GPT-4 using Alpaca prompts, intended for fine-tuning LLMs. The dataset was originally shared in this repository: https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM. This is just a wrapper for compatibility with Hugging Face's datasets library.
Dataset structure
It contains 52K instruction-following examples generated by GPT-4 using the same prompts as in Alpaca. The dataset has… See the full description on the dataset page: https://huggingface.co/datasets/vicgalle/alpaca-gpt4.
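As a hedged illustration, the wrapper can be loaded with the Hugging Face datasets library (assuming the library is installed; the field names are assumptions based on the standard Alpaca schema):

```python
# Minimal loading sketch for the alpaca-gpt4 wrapper.
from datasets import load_dataset

ds = load_dataset("vicgalle/alpaca-gpt4", split="train")
print(ds)      # ~52K instruction-following examples
print(ds[0])   # fields assumed to follow the Alpaca schema (instruction/input/output)
```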
GPT-3's water consumption for the training phase was estimated at roughly 4.8 billion liters of water, assuming the model was trained in Microsoft's Iowa data center (OpenAI has disclosed that the data center was used for training parts of the GPT-4 model). If the model had been fully trained in the Washington data center, water consumption could have been as high as 15 billion liters. That would have amounted to more than Microsoft's total water withdrawals in 2023.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The study provides a comprehensive review of OpenAI's Generative Pre-trained Transformer 4 (GPT-4) technical report, with an emphasis on applications in high-risk settings like healthcare. A diverse team, including experts in artificial intelligence (AI), natural language processing, public health, law, policy, social science, healthcare research, and bioethics, analyzed the report against established peer review guidelines. The GPT-4 report shows a significant commitment to transparent AI research, particularly in creating a systems card for risk assessment and mitigation. However, it reveals limitations such as restricted access to training data, inadequate confidence and uncertainty estimations, and concerns over privacy and intellectual property rights. Key strengths identified include the considerable time and economic investment in transparent AI research and the creation of a comprehensive systems card. On the other hand, the lack of clarity in training processes and data raises concerns about encoded biases and interests in GPT-4. The report also lacks confidence and uncertainty estimations, crucial in high-risk areas like healthcare, and fails to address potential privacy and intellectual property issues. Furthermore, this study emphasizes the need for diverse, global involvement in developing and evaluating large language models (LLMs) to ensure broad societal benefits and mitigate risks. The paper presents recommendations such as improving data transparency, developing accountability frameworks, establishing confidence standards for LLM outputs in high-risk settings, and enhancing industry research review processes. It concludes that while GPT-4's report is a step towards open discussions on LLMs, more extensive interdisciplinary reviews are essential for addressing bias, harm, and risk concerns, especially in high-risk domains. The review aims to expand the understanding of LLMs in general and highlights the need for new forms of reflection on how LLMs are reviewed, the data required for effective evaluation, and addressing critical issues like bias and risk.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview: This study evaluates the diagnostic accuracy of a multimodal large language model (LLM), ChatGPT-4, in recognizing glaucoma using color fundus photographs (CFPs) with a benchmark dataset and without prior training or fine-tuning.
Methods: The publicly accessible Retinal Fundus Glaucoma Challenge "REFUGE" dataset was utilized for the analyses. The input data consisted of the entire 400-image testing set. The task involved classifying fundus images into either "Likely Glaucomatous" or "Likely Non-Glaucomatous". We constructed a confusion matrix to visualize the results of predictions from ChatGPT-4, focusing on the accuracy of binary classifications (glaucoma vs non-glaucoma).
Results: ChatGPT-4 demonstrated an accuracy of 90% with a 95% confidence interval (CI) of 87.06%-92.94%. The sensitivity was found to be 50% (95% CI: 34.51%-65.49%), while the specificity was 94.44% (95% CI: 92.08%-96.81%). The precision was recorded at 50% (95% CI: 34.51%-65.49%), and the F1 score was 0.50.
Conclusion: ChatGPT-4 achieved relatively high diagnostic accuracy without prior fine-tuning on CFPs. Considering the scarcity of data in specialized medical fields, including ophthalmology, the use of advanced AI techniques, such as LLMs, might require less data for training compared to other forms of AI, with potential savings in time and financial resources. It may also pave the way for the development of innovative tools to support specialized medical care, particularly those dependent on multimodal data for diagnosis and follow-up, irrespective of resource constraints.
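To make the reported rates easy to check, here is a small sketch relating them to a 2x2 confusion matrix; the cell counts below are inferred from the reported accuracy, sensitivity, and specificity on the 400-image test set, not quoted from the study:

```python
# Sketch: binary-classification metrics from a confusion matrix (counts inferred, not reported).
tp, fn, fp, tn = 20, 20, 20, 340  # glaucoma is the positive class

accuracy    = (tp + tn) / (tp + tn + fp + fn)                          # 0.90
sensitivity = tp / (tp + fn)                                           # 0.50
specificity = tn / (tn + fp)                                           # 0.9444
precision   = tp / (tp + fp)                                           # 0.50
f1          = 2 * precision * sensitivity / (precision + sensitivity)  # 0.50

print(accuracy, sensitivity, specificity, precision, f1)
```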
Attribution-NonCommercial 2.0 (CC BY-NC 2.0): https://creativecommons.org/licenses/by-nc/2.0/
License information was derived automatically
Data Description
Here, we release the full long SFT training dataset of ChatQA2. It consists of two parts: long_sft and NarrativeQA_131072. The long_sft dataset is built and derived from existing datasets: LongAlpaca12k, GPT-4 samples from Open Orca, and Long Data Collections. The NarrativeQA_131072 dataset is synthetically generated from NarrativeQA by adding related paragraphs to the given ground-truth summary. For the first two training steps of ChatQA-2, we follow ChatQA 1.5.… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/ChatQA2-Long-SFT-data.
Dataset Card for "ScaleBiO-Train-Open-Orca-1million-gpt-4"
More Information needed
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Advancing Homepage2Vec with LLM-Generated Datasets for Multilingual Website Classification
This dataset contains two subsets of labeled website data, specifically created to enhance the performance of Homepage2Vec, a multi-label model for website classification. The datasets were generated using Large Language Models (LLMs) to provide more accurate and diverse topic annotations for websites, addressing a limitation of existing Homepage2Vec training data.
Key Features:
LLM-generated annotations: Both datasets feature website topic labels generated using LLMs, a novel approach to creating high-quality training data for website classification models.
Improved multi-label classification: Fine-tuning Homepage2Vec with these datasets has been shown to improve its macro F1 score from 38% to 43% when evaluated on a human-labeled dataset, demonstrating their effectiveness in capturing a broader range of website topics (a small macro-F1 computation sketch follows this list).
Multilingual applicability: The datasets facilitate classification of websites in multiple languages, reflecting the inherent multilingual nature of Homepage2Vec.
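A minimal macro-F1 sketch (referenced in the list above) using scikit-learn's multi-label f1_score; the label matrices are toy placeholders, not real Homepage2Vec outputs:

```python
# Sketch: macro-averaged F1 for multi-label website classification (toy data only).
import numpy as np
from sklearn.metrics import f1_score

# Rows = websites, columns = topics (multi-hot encoding); placeholder values.
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]])

print(f1_score(y_true, y_pred, average="macro"))  # per-topic F1 scores averaged equally
```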
Dataset Composition:
curlie-gpt3.5-10k: 10,000 websites labeled using GPT-3.5, context 2 and 1-shot
curlie-gpt4-10k: 10,000 websites labeled using GPT-4, context 2 and zero-shot
Intended Use:
Fine-tuning and advancing Homepage2Vec or similar website classification models
Research on LLM-generated datasets for text classification tasks
Exploration of multilingual website classification
Additional Information:
Project and report repository: https://github.com/CS-433/ml-project-2-mlp
Acknowledgments:
This dataset was created as part of a project at EPFL's Data Science Lab (DLab) in collaboration with Prof. Robert West and Tiziano Piccardi.
GPL: https://choosealicense.com/licenses/gpl/
Dataset Overview
The Open Orca Enhanced Dataset is meticulously designed to improve the performance of automated essay grading models using deep learning techniques. This dataset integrates robust data instances from the FLAN collection, augmented with responses generated by GPT-3.5 or GPT-4, creating a diverse and context-rich resource for training models.
Dataset Structure
The dataset is structured in a tabular format, with the following key fields: id: A unique identifier for each data… See the full description on the dataset page: https://huggingface.co/datasets/mohamedemam/Essay-quetions-auto-grading-arabic.
Large-scale text analysis has grown rapidly as a method in political science and beyond. To date, text-as-data methods rely on large volumes of human-annotated training examples, which places a premium on researcher resources. However, advances in large language models (LLMs) may make automated annotation increasingly viable. This paper tests the performance of GPT-4 across a range of scenarios relevant for analysis of political text. We compare GPT-4 coding with human expert coding of tweets and news articles across four variables (whether text is political, its negativity, its sentiment, and its ideology) and across four countries (the United States, Chile, Germany, and Italy). GPT-4 coding is highly accurate, especially for shorter texts such as tweets, correctly classifying texts up to 95% of the time. Performance drops for longer news articles, and very slightly for non-English text. We introduce a "hybrid" coding approach, in which disagreements of multiple GPT-4 runs are adjudicated by a human expert, which boosts accuracy. Finally, we explore downstream effects, finding that transformer models trained on hand-coded or GPT-4-coded data yield almost identical outcomes. Our results suggest that LLM-assisted coding is a viable and cost-efficient approach, although consideration should be given to task complexity.
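As a rough illustration of the described hybrid approach (not the authors' code), the sketch below runs a labeling step several times and flags disagreements for human adjudication; classify_with_gpt4 is a hypothetical placeholder for an actual GPT-4 API call:

```python
# Sketch of the "hybrid" coding idea: multiple GPT-4 runs, a human resolves disagreements.
from collections import Counter

def classify_with_gpt4(text: str) -> str:
    # Hypothetical placeholder for a real GPT-4 classification call.
    raise NotImplementedError

def hybrid_code(texts, n_runs=3):
    decisions = []
    for text in texts:
        labels = [classify_with_gpt4(text) for _ in range(n_runs)]
        counts = Counter(labels)
        if len(counts) == 1:                       # all runs agree: accept the label
            decisions.append(labels[0])
        else:                                      # disagreement: route to a human expert
            decisions.append(("NEEDS_HUMAN_REVIEW", dict(counts)))
    return decisions
```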
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository includes the materials of the work entitled "GPT-Powered Elicitation Interview Script Generator for Requirements Engineering Training", by Binnur Görer and Fatma Başak Aydemir, which has been accepted for presentation in the RE@Next! track of RE'24.
Files to create the Custom GPT: We provide three domain-specific files to augment GPT-4.
common_mistakes.txt: includes the details about the common mistakes encountered in requirements elicitation interviews.
example_conversation_simplified.txt: includes a sample requirements elicitation interview script.
guidelines_for_interviewers_questions_short.docx: includes best practices of requirements elicitation interviewing.
Experiment Output: We provide generated interview scripts for four different domains:
Social housing app: Interview Script Social Housing App.docx
Digital health tracking app: Interview Script for Digital Health Tracking App.docx
Food delivery app: Interview Script for Food Delivery App.docx
Meeting scheduler system: Interview Script for Meeting Scheduler System.docx
gpt_conversation_links.txt includes the ChatGPT links for each generated interview script.
Evaluation: The grading rubric used in the expert study is provided in REI Grading - Expert Study.xlsx.
GPTFuzzer is a project that explores red teaming of large language models (LLMs) using auto-generated jailbreak prompts.
Project Overview: GPTFuzzer aims to assess the security and robustness of LLMs by crafting prompts that can potentially lead to harmful or unintended behavior.
The project targets models such as ChatGPT, Llama-2, and Vicuna.
Datasets:
The datasets used in GPTFuzzer include:
Harmful Questions: sampled from public datasets like llm-jailbreak-study and hh-rlhf.
Human-Written Templates: collected from llm-jailbreak-study.
Responses: gathered by querying models like Vicuna-7B, ChatGPT, and Llama-2-7B-chat.
Models:
The judgment model is a finetuned RoBERTa-large model. The training code and data are available in the repository.
During fuzzing experiments, the model is automatically downloaded and cached.
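A hedged sketch of how a fine-tuned RoBERTa-large judgment classifier could be used with Hugging Face transformers; the checkpoint path and label meaning below are placeholders, see the GPTFuzz repository for the actual model and training code:

```python
# Sketch: scoring an LLM response with a fine-tuned RoBERTa judgment classifier.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

ckpt = "path/to/finetuned-roberta-large"  # placeholder checkpoint path
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt)

response = "Some model response to a jailbreak prompt..."
inputs = tokenizer(response, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))  # e.g., probability that the response is a successful jailbreak
```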
Updates:
The project has received recognition and awards at conferences like Geekcon 2023. The team continues to improve the codebase and aims to build a general black-box fuzzing framework for LLMs.
Sources: the GPTFuzz repository and README (https://github.com/sherdencooper/GPTFuzz) and the paper (https://arxiv.org/pdf/2309.10253.pdf).
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Bing x GPT-4 Synthetic Query Dataset
This dataset was used in the paper GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning. Refer to https://arxiv.org/abs/2402.16829 for details. The code for generating the data is available at https://github.com/avsolatorio/GISTEmbed.
Citation
@article{solatorio2024gistembed, title={GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning}… See the full description on the dataset page: https://huggingface.co/datasets/avsolatorio/covid-bing-query-gpt4.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
OpenAI declared that GPT-4 performed better in academic and certain specialty areas. Medical licensing exams assess the clinical competence of doctors. We aimed to investigate for the first time how ChatGPT would perform in the Turkish Neurology Proficiency Exam. The GPT-4 version of ChatGPT was used in the study due to the presence of image-based questions. The multiple-choice sections of the Turkish Neurology Proficiency Exams conducted by the Turkish Neurology Association (TND) in 2021, 2022 and 2023 were applied to ChatGPT-4. Questions and multiple-choice answers were used in their original Turkish forms, following the official national examination standards. The success rate in all three exams ranged from 79% to 82%. There were common and different mistakes across the two trials. When the incorrect answers were re-evaluated, the correct answers were obtained. This is the first study to investigate the performance of ChatGPT on the real Neurology Proficiency Examination. The success rate was shown to be above that of GPT-3.5. Furthermore, this study showed that translating questions from the original language into English did not affect the performance of GPT-4 in medical licensing exams, unlike GPT-3.5. It is therefore very important that the information obtained is accurate and verifiable. ChatGPT-4's ability to find the correct answer after feedback on questions that it initially answered incorrectly may be due to the model's ability to generate flexible and adaptive answers. These models should be used carefully and consciously, knowing that they will not always give the correct answer.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Knowledge-Enhanced Winograd Schema Challenge: KE-WSC is an upgraded version of the original WSC dataset that includes several extensions.
The dataset can be used to study knowledge explanation in models and enables knowledge-enhanced machine learning. It can be used to train classification or generative models. It comprises 601 training samples, 200 validation samples, and 200 test samples, and is released in a tabular TSV format. The README.txt file contains a description of the attributes. The test set labels are private, as the dataset is integrated into the SloBENCH evaluation framework (https://slobench.cjvt.si/). If you use the dataset to train your models, please consider submitting the test set predictions to SloBENCH to get the evaluation score and see how it compares to others.
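A minimal reading sketch for the TSV splits with pandas; the file names used here are assumptions, and the actual attribute names are documented in the included README.txt:

```python
# Sketch: load the KE-WSC tabular TSV splits (file names are assumptions; see README.txt).
import pandas as pd

train = pd.read_csv("train.tsv", sep="\t")
val = pd.read_csv("val.tsv", sep="\t")

print(len(train), len(val))    # expected 601 and 200 rows per the description above
print(train.columns.tolist())  # attribute names are described in README.txt
```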
References: Levesque, H., Davis, E., & Morgenstern, L. (2012, May). The Winograd Schema Challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains scripts for collecting data from TikTok, including videos, metadata, and visual insights. The data collection process involves several steps, each outlined below along with instructions for setup and usage.
Dependencies: pyktok, opencv-python, requests
The collection scripts use the pyktok library to download TikTok videos based on video IDs extracted during metadata collection.
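A hedged usage sketch of the pyktok download step; the exact function names and arguments can differ between pyktok versions, so treat this as illustrative and check the pyktok documentation:

```python
# Sketch: download one TikTok video plus metadata with pyktok (arguments may vary by version).
import pyktok as pyk

# Some pyktok versions expect a browser whose cookies are used for requests.
pyk.specify_browser("chrome")

video_url = "https://www.tiktok.com/@someuser/video/1234567890123456789"  # hypothetical URL
# save_tiktok(url, save_video, metadata_csv): saves the .mp4 and appends metadata rows to the CSV.
pyk.save_tiktok(video_url, True, "tiktok_metadata.csv")
```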
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset accompanies the research paper entitled:
Harnessing ChatGPT and GPT-4 for Evaluating the Rheumatology Questions of the Spanish Access Exam to Specialized Medical Training.
Alfredo Madrid-García, Zulema Rosales-Rosado, Dalifer Dayanira Freites-Núñez, Inés Pérez-Sancristobal, Esperanza Pato-Cour, Chamaida Plasencia-Rodríguez, Luis Cabeza-Osorio, Lydia Abasolo Alcazar, Leticia Leon Mateos, Benjamín Fernández-Gutiérrez, Luis Rodríguez-Rodríguez
medRxiv 2023.07.21.23292821; doi: https://doi.org/10.1101/2023.07.21.23292821
The dataset contains 145 rheumatology-related questions extracted from the Spanish MIR exams held between the academic years 2009-2010 to 2022-2023. The questions are evaluated by ChatGPT, GPT-4, BARD and CLAUDE. Six rheumatologists assess the clinical reasoning of ChatGPT and GPT-4.
A more detailed description of the dataset can be found in "Dataset Description" Sheet
DART-Math
Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving
Paper@arXiv | Datasets&Models@HF | Code@GitHub
Thread@X (Twitter) | Chinese Blog@Zhihu | Leaderboard@PapersWithCode | BibTeX
Datasets: DART-Math
DART-Math datasets are the state-of-the-art and data-efficient open-source instruction tuning datasets for mathematical reasoning.
DART-Math-Hard contains ~585k mathematical QA pair samples constructed by applying DARS-Prop2Diff to the query set from the MATH and GSM8K training sets, and achieves SOTA on many challenging mathematical reasoning benchmarks. It introduces a deliberate bias towards hard queries, opposite to vanilla rejection sampling.
Performance produced by DART-Math-Hard is usually but not necessarily slightly better (~1% in absolute terms) than DART-Math-Uniform, which contains ~591k samples constructed by applying DARS-Uniform.
Comparison between Mathematical Instruction Tuning Datasets
Most previous datasets are constructed with ChatGPT, and many of them are not open-source, especially the best-performing ones.
| Math SFT Dataset | # of Samples | MATH | GSM8K | College | Synthesis Agent(s) | Open-Source |
|---|---|---|---|---|---|---|
| WizardMath | 96k | 32.3 | 80.4 | 23.1 | GPT-4 | ✗ |
| MetaMathQA | 395k | 29.8 | 76.5 | 19.3 | GPT-3.5 | ✓ |
| MMIQC | 2294k | 37.4 | 75.4 | 28.5 | GPT-4+GPT-3.5+Human | ✓ |
| Orca-Math | 200k | -- | -- | -- | GPT-4 | ✓ |
| Xwin-Math-V1.1 | 1440k | 45.5 | 84.9 | 27.6 | GPT-4 | ✗ |
| KPMath-Plus | 1576k | 46.8 | 82.1 | -- | GPT-4 | ✗ |
| MathScaleQA | 2021k | 35.2 | 74.8 | 21.8 | GPT-3.5+Human | ✗ |
| DART-Math-Uniform | 591k | 43.5 | 82.6 | 26.9 | DeepSeekMath-7B-RL | ✓ |
| DART-Math-Hard | 585k | 45.5 | 81.1 | 29.4 | DeepSeekMath-7B-RL | ✓ |
MATH and GSM8K are in-domain, while College (Math) is out-of-domain. Performance numbers here are for models fine-tuned from Mistral-7B, except for Xwin-Math-V1.1, which is based on Llama2-7B. Bold/italic marks the best/second-best score.
Dataset Construction: DARS (Difficulty-Aware Rejection Sampling)
Previous works usually synthesize data from proprietary models to augment existing datasets, followed by instruction tuning to achieve top-tier results. However, our analysis of these datasets reveals severe biases towards easy queries, with frequent failures to generate any correct response for the most challenging queries.
Motivated by the observation above, we propose Difficulty-Aware Rejection Sampling (DARS) to collect more responses for more difficult queries. Specifically, we introduce two strategies to increase the number of correct responses for difficult queries:
1) Uniform, which involves sampling responses for each query until each query accumulates $k_u$ correct responses, where $k_u$ is a preset hyperparameter determined by the desired size of the synthetic dataset; 2) Prop2Diff, where we continue sampling responses until the number of correct responses for each query is proportional to its difficulty score. The most challenging queries will receive $k_p$ responses, where $k_p$ is a hyperparameter. This method introduces a deliberate bias in the opposite direction to vanilla rejection sampling, towards more difficult queries, inspired by previous works that demonstrate that difficult samples can be more effective in enhancing model capabilities (Sorscher et al., 2022; Liu et al., 2024b).
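To make the two strategies concrete, below is a simplified sketch (not the released implementation); sample_response and is_correct are hypothetical stand-ins for the DeepSeekMath-7B-RL sampler and the answer checker:

```python
# Simplified sketch of Difficulty-Aware Rejection Sampling (DARS); not the official code.
def sample_response(query):
    # Hypothetical stand-in: one generation from the synthesis agent (e.g., DeepSeekMath-7B-RL).
    raise NotImplementedError

def is_correct(query, response):
    # Hypothetical stand-in: compare the extracted final answer against the reference answer.
    raise NotImplementedError

def dars(queries, difficulty, mode="prop2diff", k_u=4, k_p=8, max_tries=256):
    """difficulty[q] in [0, 1]; higher means harder (e.g., fail rate of a probe model)."""
    dataset = []
    for q in queries:
        # Uniform: every query targets k_u correct responses.
        # Prop2Diff: the target grows with difficulty, up to k_p for the hardest queries.
        target = k_u if mode == "uniform" else max(1, round(k_p * difficulty[q]))
        correct, tries = [], 0
        while len(correct) < target and tries < max_tries:
            r = sample_response(q)
            tries += 1
            if is_correct(q, r):
                correct.append(r)
        dataset.extend((q, r) for r in correct)
    return dataset
```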
See Figure 1 (Right) for examples of DART-Math-Uniform by DARS-Uniform and DART-Math-Hard by DARS-Prop2Diff.
Citation
If you find our data, model or code useful for your work, please kindly cite our paper:
@article{tong2024dartmath,
  title={DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving},
  author={Yuxuan Tong and Xiwen Zhang and Rui Wang and Ruidong Wu and Junxian He},
  year={2024},
  eprint={2407.13690},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2407.13690},
}
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The importance of drug toxicity assessment lies in ensuring the safety and efficacy of pharmaceutical compounds. Predicting toxicity is crucial in drug development and risk assessment. This study compares the performance of GPT-4 and GPT-4o with traditional deep-learning and machine-learning models, WeaveGNN, MorganFP-MLP, SVC, and KNN, in predicting molecular toxicity, focusing on bone, neuro, and reproductive toxicity. The results indicate that GPT-4 is comparable to deep-learning and machine-learning models in certain areas. We utilized GPT-4 combined with molecular docking techniques to study the cardiotoxicity of three specific targets, examining traditional Chinese medicinal materials listed as both food and medicine. This approach aimed to explore the potential cardiotoxicity and mechanisms of action. The study found that components in Black Sesame, Ginger, Perilla, Sichuan Pagoda Tree Fruit, Galangal, Turmeric, Licorice, Chinese Yam, Amla, and Nutmeg exhibit toxic effects on the cardiac target Cav1.2. The docking results indicated significant binding affinities, supporting the hypothesis of potential cardiotoxic effects. This research highlights the potential of ChatGPT in predicting molecular properties and its significance in medicinal chemistry, demonstrating that it facilitates a new research paradigm: with a dataset, high-accuracy learning models can be generated without requiring computational knowledge or coding skills, making the approach accessible and easy to use.
The OCW dataset is for evaluating creative problem solving tasks by curating the problems and human performance results from the popular British quiz show Only Connect.
The OCW dataset contains 618 connecting wall puzzles and solutions in total from 15 seasons of the show. Each show episode has two walls.
The dataset has two tasks: Task 1 (Grouping) and Task 2 (Connections), which are identical to the quiz show's human participant tasks.
Task 1 (Groupings) is evaluated via six metrics: number of solved walls, number of correct groups (max. four per wall), Adjusted Mutual Information (AMI), Adjusted Rand Index (ARI), Fowlkes-Mallows Score (FMS), and Wasserstein Distance (WD), normalized to the (0, 1) range, between predicted and ground-truth labels.
Task 2 (Connections) is evaluated with three metrics: exact string matching, ROUGE-1 F1, and BERTScore F1.
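A minimal sketch of the Task 1 clustering metrics using scikit-learn and SciPy; the group labels below are toy placeholders for one 16-clue wall, not real puzzle data:

```python
# Sketch: Task 1 (Groupings) metrics for one wall (toy labels only).
from scipy.stats import wasserstein_distance
from sklearn.metrics import (adjusted_mutual_info_score, adjusted_rand_score,
                             fowlkes_mallows_score)

# 16 clues per wall, 4 groups of 4: ground-truth vs. predicted group ids.
y_true = [0] * 4 + [1] * 4 + [2] * 4 + [3] * 4
y_pred = [0, 0, 0, 1, 1, 1, 1, 0, 2, 2, 3, 2, 3, 3, 2, 3]

print("AMI:", adjusted_mutual_info_score(y_true, y_pred))
print("ARI:", adjusted_rand_score(y_true, y_pred))
print("FMS:", fowlkes_mallows_score(y_true, y_pred))
print("WD :", wasserstein_distance(y_true, y_pred))  # the paper normalizes WD to (0, 1)
```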
Baseline results with pre-trained language models and with few-shot In-context Learning (ICL) with LLMs such as GPT-4 are available here:
"Large Language Models are Fixated by Red Herrings: Exploring Creative Problem Solving and Einstellung Effect using the Only Connect Wall Dataset" Saeid Alavi Naeini, Raeid Saqur, Mozhgan Saeidi, John Giorgi, Babak Taati. 2023 https://neurips.cc/virtual/2023/poster/73547
MATH is a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations.