29 datasets found
  1. alpaca-gpt4

    • huggingface.co
    • opendatalab.com
    Updated Apr 14, 2023
    + more versions
    Cite
    alpaca-gpt4 [Dataset]. https://huggingface.co/datasets/vicgalle/alpaca-gpt4
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 14, 2023
    Authors
    Victor Gallego
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Dataset Card for "alpaca-gpt4"

    This dataset contains English instruction-following data generated by GPT-4 using Alpaca prompts, for fine-tuning LLMs. The dataset was originally shared in this repository: https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM. This is just a wrapper for compatibility with Hugging Face's datasets library.

      Dataset structure
    

    It contains 52K instruction-following examples generated by GPT-4 using the same prompts as in Alpaca. The dataset has
 See the full description on the dataset page: https://huggingface.co/datasets/vicgalle/alpaca-gpt4.
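
    As a quick illustration of the wrapper, here is a minimal sketch of loading the dataset with Hugging Face's datasets library (the single "train" split is an assumption; check the dataset page if it errors):

      from datasets import load_dataset

      # Load the 52K GPT-4 instruction-following examples.
      ds = load_dataset("vicgalle/alpaca-gpt4", split="train")
      print(ds)          # row count and column names
      print(ds[0])       # first instruction/response record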

  2. Estimated water consumption for training GPT-3 2023

    • statista.com
    Updated Nov 19, 2024
    Cite
    Statista (2024). Estimated water consumption for training GPT-3 2023 [Dataset]. https://www.statista.com/statistics/1536925/gpt-3-estimated-water-consumption-training/
    Explore at:
    Dataset updated
    Nov 19, 2024
    Dataset authored and provided by
    Statista, http://statista.com/
    Time period covered
    Jul 2023
    Area covered
    Worldwide
    Description

    GPT-3's water consumption for the training phase was estimated at roughly 4.8 billion liters of water, assuming the model was trained in Microsoft's Iowa data center (OpenAI has disclosed that this data center was used for training parts of the GPT-4 model). If the model had been fully trained in the Washington data center, water consumption could have been as high as 15 billion liters. That would have amounted to more than Microsoft's total water withdrawals in 2023.

  3. Summary of GPT-4 TR review.

    • plos.figshare.com
    xls
    Updated Jan 18, 2024
    Cite
    Jack Gallifant; Amelia Fiske; Yulia A. Levites Strekalova; Juan S. Osorio-Valencia; Rachael Parke; Rogers Mwavu; Nicole Martinez; Judy Wawira Gichoya; Marzyeh Ghassemi; Dina Demner-Fushman; Liam G. McCoy; Leo Anthony Celi; Robin Pierce (2024). Summary of GPT-4 TR review. [Dataset]. http://doi.org/10.1371/journal.pdig.0000417.t001
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Jan 18, 2024
    Dataset provided by
    PLOS Digital Health
    Authors
    Jack Gallifant; Amelia Fiske; Yulia A. Levites Strekalova; Juan S. Osorio-Valencia; Rachael Parke; Rogers Mwavu; Nicole Martinez; Judy Wawira Gichoya; Marzyeh Ghassemi; Dina Demner-Fushman; Liam G. McCoy; Leo Anthony Celi; Robin Pierce
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The study provides a comprehensive review of OpenAI’s Generative Pre-trained Transformer 4 (GPT-4) technical report, with an emphasis on applications in high-risk settings like healthcare. A diverse team, including experts in artificial intelligence (AI), natural language processing, public health, law, policy, social science, healthcare research, and bioethics, analyzed the report against established peer review guidelines. The GPT-4 report shows a significant commitment to transparent AI research, particularly in creating a systems card for risk assessment and mitigation. However, it reveals limitations such as restricted access to training data, inadequate confidence and uncertainty estimations, and concerns over privacy and intellectual property rights. Key strengths identified include the considerable time and economic investment in transparent AI research and the creation of a comprehensive systems card. On the other hand, the lack of clarity in training processes and data raises concerns about encoded biases and interests in GPT-4. The report also lacks confidence and uncertainty estimations, crucial in high-risk areas like healthcare, and fails to address potential privacy and intellectual property issues. Furthermore, this study emphasizes the need for diverse, global involvement in developing and evaluating large language models (LLMs) to ensure broad societal benefits and mitigate risks. The paper presents recommendations such as improving data transparency, developing accountability frameworks, establishing confidence standards for LLM outputs in high-risk settings, and enhancing industry research review processes. It concludes that while GPT-4’s report is a step towards open discussions on LLMs, more extensive interdisciplinary reviews are essential for addressing bias, harm, and risk concerns, especially in high-risk domains. The review aims to expand the understanding of LLMs in general and highlights the need for new reflection forms on how LLMs are reviewed, the data required for effective evaluation, and addressing critical issues like bias and risk.

  4. Data Sheet 1_Evaluating the strengths and limitations of multimodal...

    • frontiersin.figshare.com
    docx
    Updated Jun 7, 2024
    + more versions
    Cite
    Saif Aldeen AlRyalat; Ayman Mohammed Musleh; Malik Y. Kahook (2024). Data Sheet 1_Evaluating the strengths and limitations of multimodal ChatGPT-4 in detecting glaucoma using fundus images.docx [Dataset]. http://doi.org/10.3389/fopht.2024.1387190.s001
    Explore at:
    docxAvailable download formats
    Dataset updated
    Jun 7, 2024
    Dataset provided by
    Frontiers
    Authors
    Saif Aldeen AlRyalat; Ayman Mohammed Musleh; Malik Y. Kahook
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview: This study evaluates the diagnostic accuracy of a multimodal large language model (LLM), ChatGPT-4, in recognizing glaucoma using color fundus photographs (CFPs) with a benchmark dataset and without prior training or fine-tuning.
    Methods: The publicly accessible Retinal Fundus Glaucoma Challenge "REFUGE" dataset was utilized for analyses. The input data consisted of the entire 400-image testing set. The task involved classifying fundus images into either 'Likely Glaucomatous' or 'Likely Non-Glaucomatous'. We constructed a confusion matrix to visualize the results of predictions from ChatGPT-4, focusing on the accuracy of binary classifications (glaucoma vs non-glaucoma).
    Results: ChatGPT-4 demonstrated an accuracy of 90% with a 95% confidence interval (CI) of 87.06%-92.94%. The sensitivity was 50% (95% CI: 34.51%-65.49%), while the specificity was 94.44% (95% CI: 92.08%-96.81%). The precision was 50% (95% CI: 34.51%-65.49%), and the F1 score was 0.50.
    Conclusion: ChatGPT-4 achieved relatively high diagnostic accuracy without prior fine-tuning on CFPs. Considering the scarcity of data in specialized medical fields, including ophthalmology, the use of advanced AI techniques such as LLMs might require less data for training compared to other forms of AI, with potential savings in time and financial resources. It may also pave the way for the development of innovative tools to support specialized medical care, particularly those dependent on multimodal data for diagnosis and follow-up, irrespective of resource constraints.
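
    For readers who want to reproduce this style of evaluation, here is a minimal sketch of deriving the reported metrics from a confusion matrix with scikit-learn (the label arrays are illustrative, not the REFUGE data):

      from sklearn.metrics import confusion_matrix

      y_true = [1, 0, 0, 1, 0, 0, 1, 0]   # 1 = glaucoma, 0 = non-glaucoma
      y_pred = [1, 0, 0, 0, 0, 0, 1, 1]   # hypothetical ChatGPT-4 answers
      tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

      accuracy    = (tp + tn) / (tp + tn + fp + fn)
      sensitivity = tp / (tp + fn)        # recall on the glaucoma class
      specificity = tn / (tn + fp)
      precision   = tp / (tp + fp)
      f1          = 2 * precision * sensitivity / (precision + sensitivity)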

  5. ChatQA2-Long-SFT-data

    • huggingface.co
    Updated Mar 18, 2025
    Cite
    NVIDIA (2025). ChatQA2-Long-SFT-data [Dataset]. https://huggingface.co/datasets/nvidia/ChatQA2-Long-SFT-data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 18, 2025
    Dataset provided by
    NVIDIA, http://nvidia.com/
    Authors
    NVIDIA
    License

    Attribution-NonCommercial 2.0 (CC BY-NC 2.0), https://creativecommons.org/licenses/by-nc/2.0/
    License information was derived automatically

    Description

    Data Description

    Here, we release the full long SFT training dataset of ChatQA2. It consists of two parts: long_sft and NarrativeQA_131072. The long_sft dataset is built and derived from existing datasets: LongAlpaca12k, GPT-4 samples from Open Orca, and Long Data Collections. The NarrativeQA_131072 dataset is synthetically generated from NarrativeQA by adding related paragraphs to the given ground-truth summary. For the first two training steps of ChatQA-2, we follow ChatQA-1.5.
 See the full description on the dataset page: https://huggingface.co/datasets/nvidia/ChatQA2-Long-SFT-data.
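
    A minimal sketch of pulling the two parts with Hugging Face's datasets library; the configuration names are assumed to mirror the part names above, so list them first:

      from datasets import get_dataset_config_names, load_dataset

      configs = get_dataset_config_names("nvidia/ChatQA2-Long-SFT-data")
      print(configs)   # expected to include names like long_sft and NarrativeQA_131072
      ds = load_dataset("nvidia/ChatQA2-Long-SFT-data", configs[0], split="train")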

  6. ScaleBiO-Train-Open-Orca-1million-gpt-4

    • huggingface.co
    Updated Nov 22, 2024
    Cite
    ScaleBiO (2024). ScaleBiO-Train-Open-Orca-1million-gpt-4 [Dataset]. https://huggingface.co/datasets/ScaleBiO/ScaleBiO-Train-Open-Orca-1million-gpt-4
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 22, 2024
    Dataset authored and provided by
    ScaleBiO
    Description

    Dataset Card for "ScaleBiO-Train-Open-Orca-1million-gpt-4"

    More Information needed

  7. Curlie Enhanced with LLM Annotations: Two Datasets for Advancing...

    • data.niaid.nih.gov
    Updated Dec 21, 2023
    Cite
    Curlie Enhanced with LLM Annotations: Two Datasets for Advancing Homepage2Vec's Multilingual Website Classification [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10413067
    Explore at:
    Dataset updated
    Dec 21, 2023
    Dataset provided by
    Senghaas, Mika
    Cizinsky, Ludek
    Nutter, Peter
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Advancing Homepage2Vec with LLM-Generated Datasets for Multilingual Website Classification

    This dataset contains two subsets of labeled website data, specifically created to enhance the performance of Homepage2Vec, a multi-label model for website classification. The datasets were generated using Large Language Models (LLMs) to provide more accurate and diverse topic annotations for websites, addressing a limitation of existing Homepage2Vec training data.

    Key Features:

    LLM-generated annotations: Both datasets feature website topic labels generated using LLMs, a novel approach to creating high-quality training data for website classification models.

    Improved multi-label classification: Fine-tuning Homepage2Vec with these datasets has been shown to improve its macro F1 score from 38% to 43%, evaluated on a human-labeled dataset, demonstrating their effectiveness in capturing a broader range of website topics (a minimal computation sketch follows this list).

    Multilingual applicability: The datasets facilitate classification of websites in multiple languages, reflecting the inherent multilingual nature of Homepage2Vec.
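
    A minimal sketch of the macro F1 computation behind the 38% to 43% figure, using scikit-learn on illustrative multi-label indicator arrays (the real label set is much larger):

      import numpy as np
      from sklearn.metrics import f1_score

      # Rows = websites, columns = topic labels.
      y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
      y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]])
      print(f1_score(y_true, y_pred, average="macro", zero_division=0))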

    Dataset Composition:

    curlie-gpt3.5-10k: 10,000 websites labeled using GPT-3.5, context 2 and 1-shot

    curlie-gpt4-10k: 10,000 websites labeled using GPT-4, context 2 and zero-shot

    Intended Use:

    Fine-tuning and advancing Homepage2Vec or similar website classification models

    Research on LLM-generated datasets for text classification tasks

    Exploration of multilingual website classification

    Additional Information:

    Project and report repository: https://github.com/CS-433/ml-project-2-mlp

    Acknowledgments:

    This dataset was created as part of a project at EPFL's Data Science Lab (DLab) in collaboration with Prof. Robert West and Tiziano Piccardi.

  8. Essay-quetions-auto-grading-arabic

    • huggingface.co
    Updated Jun 28, 2024
    + more versions
    Cite
    emam (2024). Essay-quetions-auto-grading-arabic [Dataset]. https://huggingface.co/datasets/mohamedemam/Essay-quetions-auto-grading-arabic
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 28, 2024
    Authors
    emam
    License

    GNU GPL, https://choosealicense.com/licenses/gpl/

    Description

    Dataset Overview: The Open Orca Enhanced Dataset is meticulously designed to improve the performance of automated essay grading models using deep learning techniques. This dataset integrates robust data instances from the FLAN collection, augmented with responses generated by GPT-3.5 or GPT-4, creating a diverse and context-rich resource for training models.
    Dataset Structure: The dataset is structured in a tabular format, with the following key fields: id: a unique identifier for each data
 See the full description on the dataset page: https://huggingface.co/datasets/mohamedemam/Essay-quetions-auto-grading-arabic.

  9. Replication Data for: Large Language Models as a Substitute for Human...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Mar 6, 2024
    Cite
    Replication Data for: Large Language Models as a Substitute for Human Experts in Annotating Political Text [Dataset]. https://search.dataone.org/view/sha256:e5cec1392761939dcd26b8f7739b7f1222e17e59d4fd23d65d168f240d539ec0
    Explore at:
    Dataset updated
    Mar 6, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Heseltine, Michael
    Description

    Large-scale text analysis has grown rapidly as a method in political science and beyond. To date, text-as-data methods rely on large volumes of human-annotated training examples, which places a premium on researcher resources. However, advances in large language models (LLMs) may make automated annotation increasingly viable. This paper tests the performance of GPT-4 across a range of scenarios relevant for analysis of political text. We compare GPT-4 coding with human expert coding of tweets and news articles across four variables (whether text is political, its negativity, its sentiment, and its ideology) and across four countries (the United States, Chile, Germany, and Italy). GPT-4 coding is highly accurate, especially for shorter texts such as tweets, correctly classifying texts up to 95% of the time. Performance drops for longer news articles, and very slightly for non-English text. We introduce a "hybrid" coding approach, in which disagreements of multiple GPT-4 runs are adjudicated by a human expert, which boosts accuracy. Finally, we explore downstream effects, finding that transformer models trained on hand-coded or GPT-4-coded data yield almost identical outcomes. Our results suggest that LLM-assisted coding is a viable and cost-efficient approach, although consideration should be given to task complexity.
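
    A minimal sketch of the hybrid coding idea, assuming repeated GPT-4 runs per text and a human fallback on disagreement (the function and label names are hypothetical):

      from collections import Counter

      def hybrid_code(runs):
          """runs: labels from repeated GPT-4 passes over the same text."""
          label, n = Counter(runs).most_common(1)[0]
          if n == len(runs):               # unanimous across runs
              return label
          return "NEEDS_HUMAN_REVIEW"      # route to expert adjudication

      print(hybrid_code(["political", "political", "political"]))
      print(hybrid_code(["political", "not_political", "political"]))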

  10. Experimental materials for the work entitled "GPT-Powered Elicitation...

    • figshare.com
    docx
    Updated May 7, 2024
    Cite
    binnur gorer (2024). Experimental materials for the work entitled "GPT-Powered Elicitation Interview Script Generator for Requirements Engineering Training" [Dataset]. http://doi.org/10.6084/m9.figshare.25193657.v6
    Explore at:
    docxAvailable download formats
    Dataset updated
    May 7, 2024
    Dataset provided by
    Figshare, http://figshare.com/
    Authors
    binnur gorer
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository includes the materials of the work entitled "GPT-Powered Elicitation Interview Script Generator for Requirements Engineering Training", by Binnur Görer and Fatma Başak Aydemir, accepted for presentation in the RE@Next! track of RE'24.

    Files to create Custom GPT: We provide three domain-specific files to augment GPT-4.
    • common_mistakes.txt: details the common mistakes encountered in requirements elicitation interviews.
    • example_conversation_simplified.txt: a sample requirements elicitation interview script.
    • guidelines_for_interviewers_questions_short.docx: best practices of requirements elicitation interviewing.

    Experiment Output: We provide generated interview scripts for four different domains.
    • Social housing app: Interview Script Social Housing App.docx
    • Digital health tracking app: Interview Script for Digital Health Tracking App.docx
    • Food delivery app: Interview Script for Food Delivery App.docx
    • Meeting scheduler system: Interview Script for Meeting Scheduler System.docx
    gpt_conversation_links.txt includes the ChatGPT links for each generated interview script.

    Evaluation: The grading rubric used in the expert study is provided in REI Grading - Expert Study.xlsx.

  11. GPTFuzzer Dataset

    • paperswithcode.com
    Updated Jan 21, 2025
    Cite
    Jiahao Yu; Xingwei Lin; Zheng Yu; Xinyu Xing (2025). GPTFuzzer Dataset [Dataset]. https://paperswithcode.com/dataset/gptfuzzer
    Explore at:
    Dataset updated
    Jan 21, 2025
    Authors
    Jiahao Yu; Xingwei Lin; Zheng Yu; Xinyu Xing
    Description

    GPTFuzzer is a project that explores red teaming of large language models (LLMs) using auto-generated jailbreak prompts.

    Project Overview: GPTFuzzer aims to assess the security and robustness of LLMs by crafting prompts that can potentially lead to harmful or unintended behavior.

    The project focuses on GPT-3 and similar models.

    Datasets:

    The datasets used in GPTFuzzer include:

    • Harmful Questions: sampled from public datasets like llm-jailbreak-study and hh-rlhf.
    • Human-Written Templates: collected from llm-jailbreak-study.
    • Responses: gathered by querying models like Vicuna-7B, ChatGPT, and Llama-2-7B-chat.

    Models:

    The judgment model is a fine-tuned RoBERTa-large model. The training code and data are available in the repository.

    During fuzzing experiments, the model is automatically downloaded and cached.
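
    A hedged sketch of querying such a judgment model with Hugging Face transformers; the checkpoint path and label convention below are placeholders, and the real weights ship with the repository:

      import torch
      from transformers import RobertaForSequenceClassification, RobertaTokenizer

      ckpt = "path/to/gptfuzz-roberta-judge"   # placeholder checkpoint path
      tokenizer = RobertaTokenizer.from_pretrained(ckpt)
      model = RobertaForSequenceClassification.from_pretrained(ckpt)

      inputs = tokenizer("model response to judge", return_tensors="pt", truncation=True)
      with torch.no_grad():
          logits = model(**inputs).logits
      jailbroken = logits.argmax(dim=-1).item() == 1   # assumed label mapping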

    Updates:

    The project has received recognition and awards at conferences like Geekcon 2023. The team continues to improve the codebase and aims to build a general black-box fuzzing framework for LLMs.

    Source: GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts. Repository: https://github.com/sherdencooper/GPTFuzz. Paper: https://arxiv.org/pdf/2309.10253.pdf.

  12. covid-bing-query-gpt4

    • huggingface.co
    Updated Dec 15, 2019
    + more versions
    Cite
    Aivin Solatorio (2019). covid-bing-query-gpt4 [Dataset]. https://huggingface.co/datasets/avsolatorio/covid-bing-query-gpt4
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 15, 2019
    Authors
    Aivin Solatorio
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Bing x GPT-4 Synthetic Query Dataset

    This dataset was used in the paper GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning. Refer to https://arxiv.org/abs/2402.16829 for details. The code for generating the data is available at https://github.com/avsolatorio/GISTEmbed.

      Citation
    

    @article{solatorio2024gistembed, title={GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning}
 See the full description on the dataset page: https://huggingface.co/datasets/avsolatorio/covid-bing-query-gpt4.

  13. Data from: Is artificial intelligence successful in the Turkish neurology...

    • tandf.figshare.com
    pdf
    Updated Mar 21, 2025
    Cite
    Ayse Betul Acar; Ece Yanik; Emine Altin; Ozlem Kurtkaya Kocak (2025). Is artificial intelligence successful in the Turkish neurology board exam? [Dataset]. http://doi.org/10.6084/m9.figshare.28637251.v1
    Explore at:
    pdfAvailable download formats
    Dataset updated
    Mar 21, 2025
    Dataset provided by
    Taylor & Francis
    Authors
    Ayse Betul Acar; Ece Yanik; Emine Altin; Ozlem Kurtkaya Kocak
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    OpenAI declared that GPT-4 performed better in academic and certain specialty areas. Medical licensing exams assess the clinical competence of doctors. We aimed to investigate for the first time how ChatGPT would perform in the Turkish Neurology Proficiency Exam. The GPT-4 version of ChatGPT was used in the study due to the presence of image-based questions. The multiple-choice sections of the Turkish Neurology Proficiency Exams conducted by the Turkish Neurology Association (TND) in 2021, 2022 and 2023 were applied to ChatGPT-4. Questions and multiple-choice answers were used in their original Turkish forms, per the official national examination standards. The success rate across all three exams ranged from 79% to 82%. There were common and different mistakes in the two trials. When the incorrect answers were re-evaluated, the correct answers were obtained. This is the first study to investigate the performance of ChatGPT on the real Neurology Proficiency Examination. The success rate was shown to be above GPT-3.5. Furthermore, this study showed that translating questions from the original language into English did not affect the performance of GPT-4 in medical licensing exams, unlike GPT-3.5. It is therefore very important that the information obtained is accurate and verifiable. ChatGPT-4's ability to find the correct answer with feedback on questions that it initially answered incorrectly may be due to the model's ability to generate flexible and adaptive answers. These models should be used carefully and consciously, knowing that they will not always give the correct answer.

  14. Data from: Knowledge-Enhanced Winograd Schema Challenge KE-WSC 1.0

    • live.european-language-grid.eu
    binary format
    Updated Nov 14, 2024
    + more versions
    Cite
    (2024). Knowledge-Enhanced Winograd Schema Challenge KE-WSC 1.0 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/23730
    Explore at:
    binary formatAvailable download formats
    Dataset updated
    Nov 14, 2024
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Knowledge-Enhanced Winograd Schema Challenge KE-WSC is an upgraded version of the original WSC dataset. It includes the following extensions:

    • Annotation of semantically or syntactically solvable examples: Some samples from the original dataset can be solved without deeper semantic processing, due to the morphological richness of Slovene. For example, the sentence "Riba je pojedla črva. Bila je lačna." ("The fish ate the worm. It was hungry.") requires only knowledge of grammatical gender, not deep semantic processing, to infer that the fish was hungry and not the worm. To obtain a representative set of syntactic samples, we created 197 new examples by modifying existing ones.
    • Two-Level Knowledge Ontology: We developed a hierarchical scheme to categorize the knowledge required to successfully solve a problem. In our analysis, we detected 9 high-level knowledge categories (social knowledge, psychological knowledge, etc.) and 37 lower-level, more nuanced knowledge categories (physical laws/the laws of nature, social roles, causal relationships, etc.).
    • Semi-Automatic Explanation Generation: Textual explanations were generated using GPT-4, then verified and corrected by human annotators to ensure accuracy and clarity. For instance, the textual explanation for the sentence "Pokal ne gre v rjav kovček, ker je prevelik." ("The trophy does not go into the brown suitcase because it is too big.") is "Če je nekaj preveliko, se ne prilega v manjši prostor." ("If something is too big, it does not fit into a smaller space.").
    • Translation to English: The finalized explanations were translated into English using a trained translator, enabling broader applicability.
    • SPO Triplet Generation: Subject-Predicate-Object triplets were extracted using GPT-4 to highlight key semantic relationships within each example.

    The dataset can be used to study knowledge explanation in models and enables knowledge-enhanced machine learning. It can be used to train classification or generative models. It comprises 601 training samples, 200 validation samples, and 200 test samples, and is released in a tabular TSV format. The README.txt file contains a description of the attributes. The test set labels are private, as the dataset is integrated into the SloBENCH evaluation framework (https://slobench.cjvt.si/). If you use the dataset to train your models, please consider submitting test set predictions to SloBENCH to get an evaluation score and see how it compares to others.
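
    A minimal sketch of reading the released TSV splits with pandas; the file names are assumptions based on the split description, and README.txt documents the actual attributes:

      import pandas as pd

      train = pd.read_csv("train.tsv", sep="\t")   # 601 samples per the description
      val = pd.read_csv("val.tsv", sep="\t")       # 200 samples
      print(train.columns.tolist())                # attributes described in README.txt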

    References: Levesque, H., Davis, E., & Morgenstern, L. (2012, May). The winograd schema challenge. In Thirteenth international conference on the principles of knowledge representation and reasoning.

  15. Data from: Generative artificial intelligence and machine learning methods...

    • zenodo.org
    zip
    Updated Dec 5, 2024
    Cite
    Zenodo (2024). Generative artificial intelligence and machine learning methods to screen social media content [Dataset]. http://doi.org/10.5281/zenodo.14285107
    Explore at:
    zipAvailable download formats
    Dataset updated
    Dec 5, 2024
    Dataset provided by
    Zenodo, http://zenodo.org/
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    TikTok Pregnancy-Vape Data Collection README

    Introduction

    This repository contains scripts for collecting data from TikTok, including videos, metadata, and visual insights. The data collection process involves several steps, each outlined below along with instructions for setup and usage.

    Requirements

    • Python 3.x
    • Required Python packages: pyktok, opencv-python, requests
    • Oracle Cloud Vision API credentials (for visual insights extraction)

    Data Collection Steps

    1. Metadata Extraction

    2. Deduplication

    • Combine metadata from multiple hashtag pairs into a single dataset.
    • Use the provided Python script for deduplication based on unique video IDs.
    • Run combineHashtagMetadata.py

    3. Video Download

    • Utilize the pyktok library to download TikTok videos based on video IDs extracted during metadata collection.
    • Modify the script to specify the desired video download settings and output directory.
    • Run pyktokVideoCollection.py

    4. Transcript Generation

    • Extract text overlays and descriptions from videos.
    • Transcribe files using OpenAI's Whisper.
    • Set up Whisper according to https://github.com/openai/whisper
    • Run whisperTranscriptGenerator

    5. Object and Text Detection

    • Extract image frames from videos at regular intervals.
    • Run framesGeneration.py
    • Analyze frames using Oracle Cloud Vision API to identify objects, text, faces, and other visual elements.
    • Record attributes such as detected faces, errors, image classification labels, and object detection model versions.
    • This step's script is oracleFramefeatureExtractor.py

    Usage

    • Run each script sequentially, following the instructions provided within the scripts.
    • Customize script parameters as needed to suit specific data collection requirements.
    • Ensure proper API credentials and permissions, and set up the Oracle Cloud Vision API according to Oracle documentation.
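
    As an illustration of step 5, a minimal frame-extraction sketch with OpenCV; the repository's framesGeneration.py is the authoritative version, and the names and defaults here are assumptions:

      import cv2

      def extract_frames(video_path, out_dir="frames", every_n_seconds=1.0):
          """Save one frame every `every_n_seconds`; assumes out_dir exists."""
          cap = cv2.VideoCapture(video_path)
          fps = cap.get(cv2.CAP_PROP_FPS) or 30.0     # fall back if FPS is missing
          step = max(1, int(fps * every_n_seconds))
          saved = idx = 0
          while True:
              ok, frame = cap.read()
              if not ok:
                  break
              if idx % step == 0:
                  cv2.imwrite(f"{out_dir}/frame_{idx:06d}.jpg", frame)
                  saved += 1
              idx += 1
          cap.release()
          return saved
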
  16. RheumaMIR

    • zenodo.org
    bin
    Updated Nov 27, 2023
    + more versions
    Cite
    Alfredo Madrid García; Zulema Rosales Rosado; Dalifer Freites Núñez; Inés Pérez San Cristobal; Esperanza Pato Cour; Chamaida Plasencia Rodríguez; Luis Cabeza Osorio (2023). RheumaMIR [Dataset]. http://doi.org/10.5281/zenodo.10204293
    Explore at:
    binAvailable download formats
    Dataset updated
    Nov 27, 2023
    Dataset provided by
    Zenodo, http://zenodo.org/
    Authors
    Alfredo Madrid García; Zulema Rosales Rosado; Dalifer Freites Núñez; Inés Pérez San Cristobal; Esperanza Pato Cour; Chamaida Plasencia Rodríguez; Luis Cabeza Osorio
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset accompanies the research paper entitled:

    Harnessing ChatGPT and GPT-4 for Evaluating the Rheumatology Questions of the Spanish Access Exam to Specialized Medical Training.

    Alfredo Madrid-García, Zulema Rosales-Rosado, Dalifer Dayanira Freites-Núñez, Inés Pérez-Sancristobal, Esperanza Pato-Cour, Chamaida Plasencia-Rodríguez, Luis Cabeza-Osorio, Lydia Abasolo Alcazar, Leticia Leon Mateos, Benjamín Fernández-Gutiérrez, Luis Rodríguez-Rodríguez

    medRxiv 2023.07.21.23292821; doi: https://doi.org/10.1101/2023.07.21.23292821

    The dataset contains 145 rheumatology-related questions extracted from the Spanish MIR exams held between the academic years 2009-2010 and 2022-2023. The questions are evaluated by ChatGPT, GPT-4, Bard, and Claude. Six rheumatologists assess the clinical reasoning of ChatGPT and GPT-4.

    A more detailed description of the dataset can be found in the "Dataset Description" sheet.

  17. DART-Math-Uniform Dataset

    • paperswithcode.com
    Updated Jun 17, 2024
    + more versions
    Cite
    DART-Math-Uniform Dataset [Dataset]. https://paperswithcode.com/dataset/dart-math-uniform
    Explore at:
    Dataset updated
    Jun 17, 2024
    Description

    🎯 DART-Math

    Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving

    📝 Paper@arXiv | 🤗 Datasets&Models@HF | 🐱 Code@GitHub

    🐦 Thread@X(Twitter) | 🐶 Chinese Blog@Zhihu | 📊 Leaderboard@PapersWithCode | 📑 BibTeX

    Datasets: DART-Math
    DART-Math datasets are state-of-the-art, data-efficient open-source instruction tuning datasets for mathematical reasoning.

    DART-Math-Hard contains ~585k mathematical QA pair samples constructed by applying DARS-Prop2Diff to the query sets from the MATH and GSM8K training sets, and achieves SOTA on many challenging mathematical reasoning benchmarks. It introduces a deliberate bias towards hard queries, opposite to vanilla rejection sampling.

    Performance produced by DART-Math-Hard is usually but not necessarily slightly better (~1% absolute) than DART-Math-Uniform, which contains ~591k samples constructed by applying DARS-Uniform.

    Comparison between Mathematical Instruction Tuning Datasets
    Most previous datasets are constructed with ChatGPT, and many of them are not open-source, especially the best-performing ones.

    Math SFT Dataset    # of Samples   MATH   GSM8K   College   Synthesis Agent(s)    Open-Source
    WizardMath          96k            32.3   80.4    23.1      GPT-4                 ✗
    MetaMathQA          395k           29.8   76.5    19.3      GPT-3.5               ✓
    MMIQC               2294k          37.4   75.4    28.5      GPT-4+GPT-3.5+Human   ✓
    Orca-Math           200k           --     --      --        GPT-4                 ✓
    Xwin-Math-V1.1      1440k          45.5   84.9    27.6      GPT-4                 ✗
    KPMath-Plus         1576k          46.8   82.1    --        GPT-4                 ✗
    MathScaleQA         2021k          35.2   74.8    21.8      GPT-3.5+Human         ✗
    DART-Math-Uniform   591k           43.5   82.6    26.9      DeepSeekMath-7B-RL    ✓
    DART-Math-Hard      585k           45.5   81.1    29.4      DeepSeekMath-7B-RL    ✓

    MATH and GSM8K are in-domain, while College (Math) is out-of-domain. Performance figures are for models fine-tuned from Mistral-7B, except for Xwin-Math-V1.1, which is based on Llama2-7B. Bold/italic marks the best/second-best score here.

    Dataset Construction: DARS (Difficulty-Aware Rejection Sampling)
    Previous works usually synthesize data from proprietary models to augment existing datasets, followed by instruction tuning to achieve top-tier results. However, our analysis of these datasets reveals severe biases towards easy queries, with frequent failures to generate any correct response for the most challenging queries.

    Motivated by the observation above, we propose Difficulty-Aware Rejection Sampling (DARS) to collect more responses for more difficult queries. Specifically, we introduce two strategies to increase the number of correct responses for difficult queries:

    1) Uniform, which involves sampling responses for each query until each query accumulates $k_u$ correct responses, where $k_u$ is a preset hyperparameter determined by the desired size of the synthetic dataset; 2) Prop2Diff, where we continue sampling responses until the number of correct responses for each query is proportional to its difficulty score. The most challenging queries will receive $k_p$ responses, where $k_p$ is a hyperparameter. This method introduces a deliberate bias in the opposite direction to vanilla rejection sampling, towards more difficult queries, inspired by previous works demonstrating that difficult samples can be more effective at enhancing model capabilities (Sorscher et al., 2022; Liu et al., 2024b). A code sketch of both strategies is given below.

    See Figure 1 (Right) for examples of DART-Math-Uniform by DARS-Uniform and DART-Math-Hard by DARS-Prop2Diff.
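
    A minimal sketch of the two strategies, where sample(q) stands in for querying the synthesis model and verifying answer correctness; the names and the attempt cap are assumptions, not the paper's released code:

      def dars_uniform(queries, k_u, sample, max_tries=10_000):
          """Sample until every query has k_u correct responses (or tries run out)."""
          kept = {q: [] for q in queries}
          for q in queries:
              tries = 0
              while len(kept[q]) < k_u and tries < max_tries:
                  tries += 1
                  resp, correct = sample(q)
                  if correct:
                      kept[q].append(resp)
          return kept

      def dars_prop2diff(queries, difficulty, k_p, sample, max_tries=10_000):
          """Correct-response targets proportional to difficulty; hardest query gets k_p."""
          max_d = max(difficulty[q] for q in queries)
          target = {q: max(1, round(k_p * difficulty[q] / max_d)) for q in queries}
          kept = {q: [] for q in queries}
          for q in queries:
              tries = 0
              while len(kept[q]) < target[q] and tries < max_tries:
                  tries += 1
                  resp, correct = sample(q)
                  if correct:
                      kept[q].append(resp)
          return kept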

    Citation
    If you find our data, model or code useful for your work, please kindly cite our paper:

    @article{tong2024dartmath,
      title={DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving},
      author={Yuxuan Tong and Xiwen Zhang and Rui Wang and Ruidong Wu and Junxian He},
      year={2024},
      eprint={2407.13690},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2407.13690},
    }

  18. Data from: Large Language Models as Tools for Molecular Toxicity Prediction:...

    • acs.figshare.com
    zip
    Updated Feb 21, 2025
    Cite
    Hengzheng Yang; Jian Xiu; Weiqi yan; Kaifeng Liu; Huizi Cui; Zhibang Wang; Qizheng He; Yilin Gao; Weiwei Han (2025). Large Language Models as Tools for Molecular Toxicity Prediction: AI Insights into Cardiotoxicity [Dataset]. http://doi.org/10.1021/acs.jcim.4c01371.s001
    Explore at:
    zipAvailable download formats
    Dataset updated
    Feb 21, 2025
    Dataset provided by
    ACS Publications
    Authors
    Hengzheng Yang; Jian Xiu; Weiqi yan; Kaifeng Liu; Huizi Cui; Zhibang Wang; Qizheng He; Yilin Gao; Weiwei Han
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The importance of drug toxicity assessment lies in ensuring the safety and efficacy of pharmaceutical compounds. Predicting toxicity is crucial in drug development and risk assessment. This study compares the performance of GPT-4 and GPT-4o with traditional deep-learning and machine-learning models, WeaveGNN, MorganFP-MLP, SVC, and KNN, in predicting molecular toxicity, focusing on bone, neuro, and reproductive toxicity. The results indicate that GPT-4 is comparable to deep-learning and machine-learning models in certain areas. We utilized GPT-4 combined with molecular docking techniques to study the cardiotoxicity of three specific targets, examining traditional Chinese medicinal materials listed as both food and medicine. This approach aimed to explore the potential cardiotoxicity and mechanisms of action. The study found that components in Black Sesame, Ginger, Perilla, Sichuan Pagoda Tree Fruit, Galangal, Turmeric, Licorice, Chinese Yam, Amla, and Nutmeg exhibit toxic effects on the cardiac target Cav1.2. The docking results indicated significant binding affinities, supporting the hypothesis of potential cardiotoxic effects. This research highlights the potential of ChatGPT in predicting molecular properties and its significance in medicinal chemistry, demonstrating its facilitation of a new research paradigm: with a data set, high-accuracy learning models can be generated without requiring computational knowledge or coding skills, making it accessible and easy to use.

  19. OCW Dataset

    • paperswithcode.com
    • library.toponeai.link
    Cite
    Saeid Naeini; Raeid Saqur; Mozhgan Saeidi; John Giorgi; Babak Taati, OCW Dataset [Dataset]. https://paperswithcode.com/dataset/only-connect-wall-ocw-dataset
    Explore at:
    Authors
    Saeid Naeini; Raeid Saqur; Mozhgan Saeidi; John Giorgi; Babak Taati
    Description

    The OCW dataset is for evaluating creative problem-solving, curating puzzles and human performance results from the popular British quiz show Only Connect.

    The OCW dataset contains 618 connecting wall puzzles and solutions in total from 15 seasons of the show. Each show episode has two walls.

    The dataset has two tasks: Task 1 (Grouping) and Task 2 (Connections), which are identical to the quiz show's human-participant tasks.

    Task 1 (Groupings) is evaluated via six metrics: number of solved walls, number of correct groups (max. four per wall), Adjusted Mutual Information (AMI), Adjusted Rand Index (ARI), Fowlkes-Mallows Score (FMS), and Wasserstein Distance (WD), normalized to the (0, 1) range, between predicted and ground-truth labels.

    Task 2 (Connections) is evaluated with three metrics: exact string matching, ROUGE-1 F1, and BERTScore F1.
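
    A minimal sketch of Task 1's clustering metrics on a single wall (16 clues in four groups of four; the predicted grouping is illustrative):

      from sklearn.metrics import (adjusted_mutual_info_score,
                                   adjusted_rand_score, fowlkes_mallows_score)

      truth = [0]*4 + [1]*4 + [2]*4 + [3]*4            # ground-truth groups
      pred  = [0, 0, 0, 1, 1, 1, 1, 0, 2, 2, 2, 3, 3, 3, 3, 2]

      print(adjusted_mutual_info_score(truth, pred))   # AMI
      print(adjusted_rand_score(truth, pred))          # ARI
      print(fowlkes_mallows_score(truth, pred))        # FMS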

    Baseline results with pre-trained language models and with few-shot In-context Learning (ICL) with LLMs such as GPT-4 are available here:

    "Large Language Models are Fixated by Red Herrings: Exploring Creative Problem Solving and Einstellung Effect using the Only Connect Wall Dataset" Saeid Alavi Naeini, Raeid Saqur, Mozhgan Saeidi, John Giorgi, Babak Taati. 2023 https://neurips.cc/virtual/2023/poster/73547

  20. MATH Dataset

    • paperswithcode.com
    • opendatalab.com
    • +2 more
    Updated Jan 10, 2025
    + more versions
    Cite
    Dan Hendrycks; Collin Burns; Saurav Kadavath; Akul Arora; Steven Basart; Eric Tang; Dawn Song; Jacob Steinhardt (2025). MATH Dataset [Dataset]. https://paperswithcode.com/dataset/math
    Explore at:
    Dataset updated
    Jan 10, 2025
    Authors
    Dan Hendrycks; Collin Burns; Saurav Kadavath; Akul Arora; Steven Basart; Eric Tang; Dawn Song; Jacob Steinhardt
    Description

    MATH is a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations.
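
    A minimal sketch of loading MATH from the Hugging Face Hub; the repository id and field names are assumptions (the dataset is also distributed via the authors' GitHub, hendrycks/math):

      from datasets import load_dataset

      math_ds = load_dataset("hendrycks/competition_math", split="train")
      ex = math_ds[0]
      print(ex["problem"])    # competition problem statement
      print(ex["solution"])   # full step-by-step solution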
