30 datasets found
  1. h

    gsm8k

    • huggingface.co
    Updated Aug 11, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    OpenAI (2022). gsm8k [Dataset]. https://huggingface.co/datasets/openai/gsm8k
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 11, 2022
    Dataset authored and provided by
    OpenAIhttp://openai.com/
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for GSM8K

      Dataset Summary
    

    GSM8K (Grade School Math 8K) is a dataset of 8.5K high quality linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning.

    These problems take between 2 and 8 steps to solve. Solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − ×÷) to reach the… See the full description on the dataset page: https://huggingface.co/datasets/openai/gsm8k.

  2. h

    MMMLU

    • huggingface.co
    Updated Sep 17, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    OpenAI (2024). MMMLU [Dataset]. https://huggingface.co/datasets/openai/MMMLU
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 17, 2024
    Dataset authored and provided by
    OpenAIhttp://openai.com/
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Multilingual Massive Multitask Language Understanding (MMMLU)

    The MMLU is a widely recognized benchmark of general knowledge attained by AI models. It covers a broad range of topics from 57 different categories, covering elementary-level knowledge up to advanced professional subjects like law, physics, history, and computer science. We translated the MMLU’s test set into 14 languages using professional human translators. Relying on human translators for this evaluation increases… See the full description on the dataset page: https://huggingface.co/datasets/openai/MMMLU.

  3. h

    openai_humaneval

    • huggingface.co
    Updated Jan 1, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    OpenAI (2022). openai_humaneval [Dataset]. https://huggingface.co/datasets/openai/openai_humaneval
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 1, 2022
    Dataset authored and provided by
    OpenAIhttp://openai.com/
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for OpenAI HumanEval

      Dataset Summary
    

    The HumanEval dataset released by OpenAI includes 164 programming problems with a function sig- nature, docstring, body, and several unit tests. They were handwritten to ensure not to be included in the training set of code generation models.

      Supported Tasks and Leaderboards
    
    
    
    
    
    
    
      Languages
    

    The programming problems are written in Python and contain English natural text in comments and… See the full description on the dataset page: https://huggingface.co/datasets/openai/openai_humaneval.

  4. Engagement with OpenAI and ChatGPT in Italy 2022-2023

    • statista.com
    Updated Apr 25, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista (2023). Engagement with OpenAI and ChatGPT in Italy 2022-2023 [Dataset]. https://www.statista.com/statistics/1379705/italy-openai-chatgpt-engagement/
    Explore at:
    Dataset updated
    Apr 25, 2023
    Dataset authored and provided by
    Statistahttp://statista.com/
    Time period covered
    Oct 2022 - Jan 2023
    Area covered
    Italy
    Description

    In January 2023, ChatGPT registered over nine million interactions from users in Italy, up by over 300 percent compare to the previous month. By comparison, the OpenAI website registered 1.2 million actions performed by Italian users. At the end of March 2023, the main national privacy regulator in Italy prompted OpenAI to provide information on how and why the company collects user data, if the company wanted to avoid seeing its access to the Italian market blocked.

  5. f

    Data Sheet 2_Large language models generating synthetic clinical datasets: a...

    • frontiersin.figshare.com
    xlsx
    Updated Feb 5, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin (2025). Data Sheet 2_Large language models generating synthetic clinical datasets: a feasibility and comparative analysis with real-world perioperative data.xlsx [Dataset]. http://doi.org/10.3389/frai.2025.1533508.s002
    Explore at:
    xlsxAvailable download formats
    Dataset updated
    Feb 5, 2025
    Dataset provided by
    Frontiers
    Authors
    Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    BackgroundClinical data is instrumental to medical research, machine learning (ML) model development, and advancing surgical care, but access is often constrained by privacy regulations and missing data. Synthetic data offers a promising solution to preserve privacy while enabling broader data access. Recent advances in large language models (LLMs) provide an opportunity to generate synthetic data with reduced reliance on domain expertise, computational resources, and pre-training.ObjectiveThis study aims to assess the feasibility of generating realistic tabular clinical data with OpenAI’s GPT-4o using zero-shot prompting, and evaluate the fidelity of LLM-generated data by comparing its statistical properties to the Vital Signs DataBase (VitalDB), a real-world open-source perioperative dataset.MethodsIn Phase 1, GPT-4o was prompted to generate a dataset with qualitative descriptions of 13 clinical parameters. The resultant data was assessed for general errors, plausibility of outputs, and cross-verification of related parameters. In Phase 2, GPT-4o was prompted to generate a dataset using descriptive statistics of the VitalDB dataset. Fidelity was assessed using two-sample t-tests, two-sample proportion tests, and 95% confidence interval (CI) overlap.ResultsIn Phase 1, GPT-4o generated a complete and structured dataset comprising 6,166 case files. The dataset was plausible in range and correctly calculated body mass index for all case files based on respective heights and weights. Statistical comparison between the LLM-generated datasets and VitalDB revealed that Phase 2 data achieved significant fidelity. Phase 2 data demonstrated statistical similarity in 12/13 (92.31%) parameters, whereby no statistically significant differences were observed in 6/6 (100.0%) categorical/binary and 6/7 (85.71%) continuous parameters. Overlap of 95% CIs were observed in 6/7 (85.71%) continuous parameters.ConclusionZero-shot prompting with GPT-4o can generate realistic tabular synthetic datasets, which can replicate key statistical properties of real-world perioperative data. This study highlights the potential of LLMs as a novel and accessible modality for synthetic data generation, which may address critical barriers in clinical data access and eliminate the need for technical expertise, extensive computational resources, and pre-training. Further research is warranted to enhance fidelity and investigate the use of LLMs to amplify and augment datasets, preserve multivariate relationships, and train robust ML models.

  6. Energy consumption when training LLMs in 2022 (in MWh)

    • statista.com
    Updated Sep 10, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista (2024). Energy consumption when training LLMs in 2022 (in MWh) [Dataset]. https://www.statista.com/statistics/1384401/energy-use-when-training-llm-models/
    Explore at:
    Dataset updated
    Sep 10, 2024
    Dataset authored and provided by
    Statistahttp://statista.com/
    Time period covered
    2022
    Area covered
    Worldwide
    Description

    Energy consumption of artificial intelligence (AI) models in training is considerable, with both GPT-3, the original release of the current iteration of OpenAI's popular ChatGPT, and Gopher consuming well over a thousand-megawatt hours of energy simply for training. As this is only for the training model it is likely that the energy consumption for the entire usage and lifetime of GPT-3 and other large language models (LLMs) is significantly higher. The largest consumer of energy, GPT-3, consumed roughly the equivalent of 200 Germans in 2022. While not a staggering amount, it is a considerable use of energy.

    Energy savings through AI

    While it is undoubtedly true that training LLMs takes a considerable amount of energy, the energy savings are also likely to be substantial. Any AI model that improves processes by minute numbers might save hours on shipment, liters of fuel, or dozens of computations. Each one of these uses energy as well and the sum of energy saved through a LLM might vastly outperform its energy cost. A good example is mobile phone operators, of which a third expect that AI might reduce power consumption by ten to fifteen percent. Considering that much of the world uses mobile phones this would be a considerable energy saver.

    Emissions are considerable

    The amount of CO2 emissions from training LLMs is also considerable, with GPT-3 producing nearly 500 tonnes of CO2. This again could be radically changed based on the types of energy production creating the emissions. Most data center operators for instance would prefer to have nuclear energy play a key role, a significantly low-emission energy producer.

  7. Z

    Stack Overflow Chat Dutch

    • data.niaid.nih.gov
    • huggingface.co
    • +1more
    Updated Jun 20, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Vanroy, Bram (2023). Stack Overflow Chat Dutch [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8052379
    Explore at:
    Dataset updated
    Jun 20, 2023
    Dataset authored and provided by
    Vanroy, Bram
    License

    http://www.apache.org/licenses/LICENSE-2.0http://www.apache.org/licenses/LICENSE-2.0

    Description

    This dataset contains 56,964 conversations between een AI assistant and a (fake) "Human" (generated) in Dutch, specifically in the domain of programming (Stack Overflow). They are translations of Baize's machine-generated answers to the Stack Overflow dataset.

    Data Instances

    {'id': 9, 'input': 'Het gesprek tussen de mens en de AI-assistent. ' '[|Human|] Ik heb wat hulp nodig met Spark SQL union en het ordenen ' 'van kolommen. ' '[|AI|] Natuurlijk, ik help je graag. Waar heb je specifiek hulp ' 'bij nodig? ' '[|Human|] Ik wil de output van een UNION-query ordenen op een ' 'kolom die niet aanwezig is in de SELECT-instructie. Is dat ' 'mogelijk? ' '[|AI|] Ja, het is mogelijk met Spark SQL. Je kunt de ordinale ' 'positie van de kolom in de ORDER BY-clausule gebruiken in plaats ' 'van de kolomnaam. Als je bijvoorbeeld wilt ordenen op de derde ' 'kolom in de SELECT-instructie, kun je ORDER BY 3 gebruiken. ' '[|Human|] Dat is nuttig. Kan ik ook een alias gebruiken in de ' 'ORDER BY-clausule? ' '[|AI|] Ja, je kunt een alias gebruiken in de ORDER BY-clausule. De ' 'alias moet worden gedefinieerd in de SELECT-instructie. Als je ' 'bijvoorbeeld de derde kolom als "col3" alias geeft, kun je ordenen ' 'met ORDER BY col3. ' '[|Human|] Oké, ik denk dat ik het begrijp. Kan ik dezelfde alias ' 'gebruiken voor kolommen uit beide tabellen in de UNION? ' '[|AI|] Nee, je kunt dezelfde alias niet gebruiken voor kolommen ' 'uit beide tabellen in de UNION. Elke tabel in de UNION moet een ' 'unieke alias hebben voor de kolommen. ' '[|Human|] ', 'topic': 'Spark SQL UNION - ORDER BY kolom niet in SELECT'},

    Data Fields

    id: the ID of the item. The following 82 IDs are not included because they could not be translated: [1713, 1937, 1960, 4326, 4356, 8357, 8542, 8827, 9137, 9782, 11560, 11961, 12244, 12362, 12488, 13259, 13621, 14445, 14835, 15006, 17746, 18808, 19285, 19426, 19491, 21270, 21661, 22098, 23352, 23840, 23869, 25148, 25928, 27102, 27856, 28387, 29942, 30041, 30251, 32396, 32742, 32941, 33628, 34116, 34648, 34859, 35977, 35987, 36035, 36456, 37028, 37238, 37640, 38107, 38735, 39015, 40984, 41115, 41567, 42397, 43219, 43783, 44599, 44980, 45239, 47676, 48922, 49534, 50282, 50683, 50804, 50919, 51076, 51211, 52000, 52183, 52489, 52595, 53884, 54726, 55795, 56992]

    input: the machine-generated conversation between AI and "Human". Always starts with Het gesprek tussen de mens en de AI-assistent. and has at least one occurrence of both [|AI|] and [|Human|].

    topic: the topic description

    Dataset Creation

    Both the translations and the topics were translated with OpenAI's API for gpt-3.5-turbo. max_tokens=1024, temperature=0 as parameters.

    The prompt template to translate the input is (where src_lang was English and tgt_lang Dutch):

    CONVERSATION_TRANSLATION_PROMPT = """You are asked to translate a conversation between an AI assistant and a human from {src_lang} into {tgt_lang}.

    Here are the requirements that you should adhere to: 1. maintain the format: the conversation consists of the AI (marked as [|AI|]) and the human ([|Human|]) talking in turns and responding to each other; 2. do not translate the speaker identifiers [|AI|] and [|Human|] but always copy them into the translation in appropriate places; 3. ensure accurate translation and keep the correctness of the conversation; 4. make sure that text is fluent to read and does not contain grammatical errors. Use standard {tgt_lang} without regional bias; 5. translate the human's text using informal, but standard, language; 6. make sure to avoid biases (such as gender bias, grammatical bias, social bias); 7. if the human asks to correct grammar mistakes or spelling mistakes then you have to generate a similar mistake in {tgt_lang}, and then also generate a corrected output version for the AI in {tgt_lang}; 8. if the human asks to translate text from one to another language, then you only translate the human's question to {tgt_lang} but you keep the translation that the AI provides in the language that the human requested; 9. do not translate code fragments but copy them as they are. If there are English examples, variable names or definitions in code fragments, keep them in English.

    Now translate the following conversation with the requirements set out above. Do not provide an explanation and do not add anything else.

    """

    The prompt to translate the topic is:

    TOPIC_TRANSLATION_PROMPT = "Translate the following title of a conversation from {src_lang} to {tgt_lang} in a succinct,"
    " summarizing manner. Translate accurately and formally. Do not provide any explanation"
    " about the translation and do not include the original title.

    "

    The system message was:

    You are a helpful assistant that translates English to Dutch to the requirements that are given to you.

    Note that 82 items (0.1%) were not successfully translated. The translation was missing the AI identifier [|AI|] and/or the human one [|Human|]. The IDs for the missing items are [1713, 1937, 1960, 4326, 4356, 8357, 8542, 8827, 9137, 9782, 11560, 11961, 12244, 12362, 12488, 13259, 13621, 14445, 14835, 15006, 17746, 18808, 19285, 19426, 19491, 21270, 21661, 22098, 23352, 23840, 23869, 25148, 25928, 27102, 27856, 28387, 29942, 30041, 30251, 32396, 32742, 32941, 33628, 34116, 34648, 34859, 35977, 35987, 36035, 36456, 37028, 37238, 37640, 38107, 38735, 39015, 40984, 41115, 41567, 42397, 43219, 43783, 44599, 44980, 45239, 47676, 48922, 49534, 50282, 50683, 50804, 50919, 51076, 51211, 52000, 52183, 52489, 52595, 53884, 54726, 55795, 56992].

    The translation quality has not been verified. Use at your own risk!

    Licensing Information

    Licensing info for Stack Overflow Questions is listed as Apache 2.0. If you use the current dataset, you should also adhere to the original license.

    This text was generated (either in part or in full) with GPT-3 (gpt-3.5-turbo), OpenAI’s large-scale language-generation model. Upon generating draft language, the author reviewed, edited, and revised the language to their own liking and takes ultimate responsibility for the content of this publication.

    If you use this dataset, you must also follow the Sharing and Usage policies.

    As clearly stated in their Terms of Use, specifically 2c.iii, "[you may not] use output from the Services to develop models that compete with OpenAI". That means that you cannot use this dataset to build models that are intended to commercially compete with OpenAI. As far as I am aware, that is a specific restriction that should serve as an addendum to the current license.

    This dataset is also available on the Hugging Face hub with the same DOI and license. See that README for more info.

  8. f

    Summary of GPT-4 TR review.

    • plos.figshare.com
    xls
    Updated Jan 18, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jack Gallifant; Amelia Fiske; Yulia A. Levites Strekalova; Juan S. Osorio-Valencia; Rachael Parke; Rogers Mwavu; Nicole Martinez; Judy Wawira Gichoya; Marzyeh Ghassemi; Dina Demner-Fushman; Liam G. McCoy; Leo Anthony Celi; Robin Pierce (2024). Summary of GPT-4 TR review. [Dataset]. http://doi.org/10.1371/journal.pdig.0000417.t001
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Jan 18, 2024
    Dataset provided by
    PLOS Digital Health
    Authors
    Jack Gallifant; Amelia Fiske; Yulia A. Levites Strekalova; Juan S. Osorio-Valencia; Rachael Parke; Rogers Mwavu; Nicole Martinez; Judy Wawira Gichoya; Marzyeh Ghassemi; Dina Demner-Fushman; Liam G. McCoy; Leo Anthony Celi; Robin Pierce
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The study provides a comprehensive review of OpenAI’s Generative Pre-trained Transformer 4 (GPT-4) technical report, with an emphasis on applications in high-risk settings like healthcare. A diverse team, including experts in artificial intelligence (AI), natural language processing, public health, law, policy, social science, healthcare research, and bioethics, analyzed the report against established peer review guidelines. The GPT-4 report shows a significant commitment to transparent AI research, particularly in creating a systems card for risk assessment and mitigation. However, it reveals limitations such as restricted access to training data, inadequate confidence and uncertainty estimations, and concerns over privacy and intellectual property rights. Key strengths identified include the considerable time and economic investment in transparent AI research and the creation of a comprehensive systems card. On the other hand, the lack of clarity in training processes and data raises concerns about encoded biases and interests in GPT-4. The report also lacks confidence and uncertainty estimations, crucial in high-risk areas like healthcare, and fails to address potential privacy and intellectual property issues. Furthermore, this study emphasizes the need for diverse, global involvement in developing and evaluating large language models (LLMs) to ensure broad societal benefits and mitigate risks. The paper presents recommendations such as improving data transparency, developing accountability frameworks, establishing confidence standards for LLM outputs in high-risk settings, and enhancing industry research review processes. It concludes that while GPT-4’s report is a step towards open discussions on LLMs, more extensive interdisciplinary reviews are essential for addressing bias, harm, and risk concerns, especially in high-risk domains. The review aims to expand the understanding of LLMs in general and highlights the need for new reflection forms on how LLMs are reviewed, the data required for effective evaluation, and addressing critical issues like bias and risk.

  9. Z

    Alpaca Cleaned Dutch

    • data.niaid.nih.gov
    • huggingface.co
    • +1more
    Updated Jun 20, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Vanroy, Bram (2023). Alpaca Cleaned Dutch [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8052362
    Explore at:
    Dataset updated
    Jun 20, 2023
    Dataset authored and provided by
    Vanroy, Bram
    License

    Attribution-NonCommercial 2.0 (CC BY-NC 2.0)https://creativecommons.org/licenses/by-nc/2.0/
    License information was derived automatically

    Description

    This dataset contains 51,712 conversations between een AI assistant and a (fake) "Human" (generated) in Dutch. They are translations of Alpaca Cleaned Dataset.

    Data Instances

    { 'id': 7, 'instruction': 'Leg uit waarom de volgende breuk gelijk is aan 1/4', 'input': '4/16', 'output': 'De breuk 4/16 is gelijk aan 1/4 omdat zowel de teller als de ' 'noemer deelbaar zijn door 4. Door zowel de teller als de noemer ' 'door 4 te delen, krijgen we de breuk 1/4.' }

    Data Fields

    id: the ID of the item. The following ID is not included because they could not be translated: [23019]

    instruction: the given instruction input: optional input to accompany the instruction. Can be empty.

    output: the "answer" to the instruction

    Dataset Creation

    The instructions, inputs and outputs were translated with OpenAI's API for gpt-3.5-turbo. max_tokens=1024, temperature=0 as parameters.

    The prompt template to translate is (where src_lang is English and tgt_lang is Dutch):

    TRANSLATION_PROMPT = """You are asked to translate a task's instruction, optional input to the task, and the output of the task, from {src_lang} into {tgt_lang}.

    Here are the requirements that you should adhere to: 1. maintain the format: the task consists of a task instruction (marked instruction:), optional input to the task (marked input:) and output for the task marked with output:; 2. do not translate the identifiers instruction:, input:, and output: but instead copy them to your output; 3. make sure that text is fluent to read and does not contain grammatical errors. Use standard {tgt_lang} without regional bias; 4. translate the instruction and input text using informal, but standard, language; 5. make sure to avoid biases (such as gender bias, grammatical bias, social bias); 6. if the instruction is to correct grammar mistakes or spelling mistakes then you have to generate a similar mistake in the input in {tgt_lang}, and then also generate a corrected output version in the output in {tgt_lang}; 7. if the instruction is to translate text from one language to another, then you do not translate the text that needs to be translated in the instruction or the input, nor the translation in the output (just copy them as-is); 8. do not translate code fragments but copy them to your output. If there are English examples, variable names or definitions in code fragments, keep them in English.

    Now translate the following task with the requirements set out above. Do not provide an explanation and do not add anything else.

    """

    This prompt is concatenated with the instruction, optionally the input, and the output. In code, that last part looks like this:

    text = f'instruction: "{instruction}"

    ' if inputstr: text += f'input: "{inputstr}"

    ' text += f'output: "{outputstr}"'

    The system message was:

    You are a helpful assistant that translates English to Dutch to the requirements that are given to you.

    Note that 1 item (0.0001%) was not successfully translated. The translation was missing the input, instruction, or output keywords where those were expected. The ID for the missing item is [23019].

    Initial data creation of the English dataset by Tatsu lab and cleaned by Yahma.

    Also available on HuggingFace hub (with a more extensive README).

    Licensing Information

    As per OpenAI's terms of use, this dataset cannot be used to build a commercial system that competes with OpenAI's services. Similar to the original Alpaca dataset, this dataset is released under CC NC 4.0.

    This text was generated (either in part or in full) with GPT-3 (gpt-3.5-turbo), OpenAI’s large-scale language-generation model. Upon generating draft language, the author reviewed, edited, and revised the language to their own liking and takes ultimate responsibility for the content of this publication.

    If you use this dataset, you must also follow the Sharing and Usage policies.

    As clearly stated in their Terms of Use, specifically 2c.iii, "[you may not] use output from the Services to develop models that compete with OpenAI". That means that you cannot use this dataset to build models that are intended to commercially compete with OpenAI. As far as I am aware, that is a specific restriction that should serve as an addendum to the current license.

  10. h

    openwebtext

    • huggingface.co
    • paperswithcode.com
    • +4more
    Updated Sep 28, 2020
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Aaron Gokaslan (2020). openwebtext [Dataset]. https://huggingface.co/datasets/Skylion007/openwebtext
    Explore at:
    Dataset updated
    Sep 28, 2020
    Authors
    Aaron Gokaslan
    License

    https://choosealicense.com/licenses/cc0-1.0/https://choosealicense.com/licenses/cc0-1.0/

    Description

    An open-source replication of the WebText dataset from OpenAI.

  11. Dolly 15k Dutch

    • zenodo.org
    • huggingface.co
    • +1more
    bin
    Updated Jun 20, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Bram Vanroy; Bram Vanroy (2023). Dolly 15k Dutch [Dataset]. http://doi.org/10.57967/hf/0785
    Explore at:
    binAvailable download formats
    Dataset updated
    Jun 20, 2023
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Bram Vanroy; Bram Vanroy
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0)https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    This dataset contains 14,934 instructions, contexts and responses, in several natural language categories such as classification, closed QA, generation, etc. The English original dataset was created by @databricks, who crowd-sourced the data creation via its employees. The current dataset is a translation of that dataset through ChatGPT (gpt-3.5-turbo).

    Data Instances

    {
     "id": 14963,
     "instruction": "Wat zijn de duurste steden ter wereld?",
     "context": "",
     "response": "Dit is een uitgebreide lijst van de duurste steden: Singapore, Tel Aviv, New York, Hong Kong, Los Angeles, Zurich, Genève, San Francisco, Parijs en Sydney.",
     "category": "brainstorming"
    }
    

    Data Fields

    • id: the ID of the item. The following 77 IDs are not included because they could not be translated (or were too long): [1502, 1812, 1868, 4179, 4541, 6347, 8851, 9321, 10588, 10835, 11257, 12082, 12319, 12471, 12701, 12988, 13066, 13074, 13076, 13181, 13253, 13279, 13313, 13346, 13369, 13446, 13475, 13528, 13546, 13548, 13549, 13558, 13566, 13600, 13603, 13657, 13668, 13733, 13765, 13775, 13801, 13831, 13906, 13922, 13923, 13957, 13967, 13976, 14028, 14031, 14045, 14050, 14082, 14083, 14089, 14110, 14155, 14162, 14181, 14187, 14200, 14221, 14222, 14281, 14473, 14475, 14476, 14587, 14590, 14667, 14685, 14764, 14780, 14808, 14836, 14891, 1 4966]
    • instruction: the instruction (question)
    • context: additional context that the AI can use to answer the question
    • response: the AI's expected response
    • category: the category of this type of question (see Dolly for more info)

    Dataset Creation

    Both the translations and the topics were translated with OpenAI's API for gpt-3.5-turbo. max_tokens=1024, temperature=0 as parameters.

    The prompt template to translate the input is (where src_lang was English and tgt_lang Dutch):

    CONVERSATION_TRANSLATION_PROMPT = """You are asked to translate a task's instruction, optional context to the task, and the response to the task, from {src_lang} to {tgt_lang}.
    
    Here are the requirements that you should adhere to:
    1. maintain the format: the task consists of a task instruction (marked `instruction: `), optional context to the task (marked `context: `) and response for the task marked with `response: `;
    2. do not translate the identifiers `instruction: `, `context: `, and `response: ` but instead copy them to your output;
    3. make sure that text is fluent to read and does not contain grammatical errors. Use standard {tgt_lang} without regional bias;
    4. translate the instruction and context text using informal, but standard, language;
    5. make sure to avoid biases (such as gender bias, grammatical bias, social bias);
    6. if the instruction is to correct grammar mistakes or spelling mistakes then you have to generate a similar mistake in the context in {tgt_lang}, and then also generate a corrected output version in the output in {tgt_lang};
    7. if the instruction is to translate text from one language to another, then you do not translate the text that needs to be translated in the instruction or the context, nor the translation in the response (just copy them as-is);
    8. do not translate code fragments but copy them to your output. If there are English examples, variable names or definitions in code fragments, keep them in English.
    
    Now translate the following task with the requirements set out above. Do not provide an explanation and do not add anything else.
    
    """
    

    The system message was:

    You are a helpful assistant that translates English to Dutch according to the requirements that are given to you.
    

    Note that 77 items (0.5%) were not successfully translated. This can either mean that the prompt was too long for the given limit (max_tokens=1024) or that the generated translation could not be parsed into instruction, context and response fields. The missing IDs are [1502, 1812, 1868, 4179, 4541, 6347, 8851, 9321, 10588, 10835, 11257, 12082, 12319, 12471, 12701, 12988, 13066, 13074, 13076, 13181, 13253, 13279, 13313, 13346, 13369, 13446, 13475, 13528, 13546, 13548, 13549, 13558, 13566, 13600, 13603, 13657, 13668, 13733, 13765, 13775, 13801, 13831, 13906, 13922, 13923, 13957, 13967, 13976, 14028, 14031, 14045, 14050, 14082, 14083, 14089, 14110, 14155, 14162, 14181, 14187, 14200, 14221, 14222, 14281, 14473, 14475, 14476, 14587, 14590, 14667, 14685, 14764, 14780, 14808, 14836, 14891, 1 4966].

    Initial Data Collection and Normalization

    Initial data collection by databricks. See their repository for more information about this dataset.

    Considerations for Using the Data

    Note that the translations in this new dataset have not been verified by humans! Use at your own risk, both in terms of quality and biases.

    Discussion of Biases

    As with any machine-generated texts, users should be aware of potential biases that are included in this dataset. Although the prompt specifically includes make sure to avoid biases (such as gender bias, grammatical bias, social bias), of course the impact of such command is not known. It is likely that biases remain in the dataset so use with caution.

    Other Known Limitations

    The translation quality has not been verified. Use at your own risk!

    Licensing Information

    This repository follows the original databricks license, which is CC BY-SA 3.0 but see below for a specific restriction.

    This text was generated (either in part or in full) with GPT-3 (gpt-3.5-turbo), OpenAI’s large-scale language-generation model. Upon generating draft language, the author reviewed, edited, and revised the language to their own liking and takes ultimate responsibility for the content of this publication.

    If you use this dataset, you must also follow the Sharing and Usage policies.

    As clearly stated in their Terms of Use, specifically 2c.iii, "[you may not] use output from the Services to develop models that compete with OpenAI". That means that you cannot use this dataset to build models that are intended to commercially compete with OpenAI. As far as I am aware, that is a specific restriction that should serve as an addendum to the current license.

    This dataset is also available on the Hugging Face hub, its canonical repository.

  12. Number of companies using ChatGPT within their business 2023, by industry

    • statista.com
    Updated Aug 12, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista (2024). Number of companies using ChatGPT within their business 2023, by industry [Dataset]. https://www.statista.com/statistics/1384323/industries-using-chatgpt-in-business/
    Explore at:
    Dataset updated
    Aug 12, 2024
    Dataset authored and provided by
    Statistahttp://statista.com/
    Time period covered
    2023
    Area covered
    Worldwide
    Description

    ChatGPT, an artificial intelligence (AI) powered chatbot, is most used by companies in the technical and education industries, with over 200 companies using it in 2023. It is perhaps unsurprising that the technical field has embraced the use of ChatGPT, but it is interesting that so many educational institutes have begun to use it. While other industries do utilize the OpenAI-made chatbot, there are less than a 100 institutions and companies that use ChatGPT in other industries. This is especially true of agriculture, cultural, and legal industries, where only a single company is using ChatGPT in 2023.

  13. a

    Context Window Tokens by Models Model

    • artificialanalysis.ai
    Updated Feb 15, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Artificial Analysis (2025). Context Window Tokens by Models Model [Dataset]. https://artificialanalysis.ai/models
    Explore at:
    Dataset updated
    Feb 15, 2025
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison of Context Window: Tokens Limit; Higher is better by Model

  14. a

    Coding Index by Models Model

    • artificialanalysis.ai
    Updated Feb 15, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Artificial Analysis (2025). Coding Index by Models Model [Dataset]. https://artificialanalysis.ai/models
    Explore at:
    Dataset updated
    Feb 15, 2025
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison of Represents the average of coding benchmarks in the Artificial Analysis Intelligence Index (LiveCodeBench & SciCode) by Model

  15. a

    Intelligence vs. Context Window by Models Model

    • artificialanalysis.ai
    Updated Feb 15, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Artificial Analysis (2025). Intelligence vs. Context Window by Models Model [Dataset]. https://artificialanalysis.ai/models
    Explore at:
    Dataset updated
    Feb 15, 2025
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison of Artificial Analysis Intelligence Index vs. Context Window (Tokens) by Model

  16. P

    MATH Dataset

    • paperswithcode.com
    • opendatalab.com
    • +2more
    Updated Jan 10, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Dan Hendrycks; Collin Burns; Saurav Kadavath; Akul Arora; Steven Basart; Eric Tang; Dawn Song; Jacob Steinhardt (2025). MATH Dataset [Dataset]. https://paperswithcode.com/dataset/math
    Explore at:
    Dataset updated
    Jan 10, 2025
    Authors
    Dan Hendrycks; Collin Burns; Saurav Kadavath; Akul Arora; Steven Basart; Eric Tang; Dawn Song; Jacob Steinhardt
    Description

    MATH is a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations.

  17. a

    Latency vs. Output Speed by Models Model

    • artificialanalysis.ai
    Updated Feb 15, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Artificial Analysis (2025). Latency vs. Output Speed by Models Model [Dataset]. https://artificialanalysis.ai/models
    Explore at:
    Dataset updated
    Feb 15, 2025
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison of Latency (Time to First Token) vs. Output Speed (Output Tokens per Second) by Model

  18. a

    Pricing: Image Input Pricing by Models Model

    • artificialanalysis.ai
    Updated Feb 15, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Artificial Analysis (2025). Pricing: Image Input Pricing by Models Model [Dataset]. https://artificialanalysis.ai/models
    Explore at:
    Dataset updated
    Feb 15, 2025
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison of Image Input Price: USD per 1k images at 1MP (1024x1024) by Model

  19. a

    Intelligence vs. End-to-End Seconds to Output 100 Tokens by Models Model

    • artificialanalysis.ai
    Updated Feb 15, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Artificial Analysis (2025). Intelligence vs. End-to-End Seconds to Output 100 Tokens by Models Model [Dataset]. https://artificialanalysis.ai/models
    Explore at:
    Dataset updated
    Feb 15, 2025
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison of Artificial Analysis Intelligence Index vs. End-to-End Seconds to Output 100 Tokens by Model

  20. a

    Math Index by Models Model

    • artificialanalysis.ai
    Updated Feb 15, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Artificial Analysis (2025). Math Index by Models Model [Dataset]. https://artificialanalysis.ai/models
    Explore at:
    Dataset updated
    Feb 15, 2025
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison of Represents the average of math benchmarks in the Artificial Analysis Intelligence Index (AIME 2024 & Math-500) by Model

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
OpenAI (2022). gsm8k [Dataset]. https://huggingface.co/datasets/openai/gsm8k

gsm8k

openai/gsm8k

Grade School Math 8K

Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 11, 2022
Dataset authored and provided by
OpenAIhttp://openai.com/
License

MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically

Description

Dataset Card for GSM8K

  Dataset Summary

GSM8K (Grade School Math 8K) is a dataset of 8.5K high quality linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning.

These problems take between 2 and 8 steps to solve. Solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − ×÷) to reach the… See the full description on the dataset page: https://huggingface.co/datasets/openai/gsm8k.

Search
Clear search
Close search
Google apps
Main menu