91 datasets found
  1. Data from: OpenAI Gym Dataset

    • paperswithcode.com
    Updated Feb 2, 2021
    Cite
    Greg Brockman; Vicki Cheung; Ludwig Pettersson; Jonas Schneider; John Schulman; Jie Tang; Wojciech Zaremba (2021). OpenAI Gym Dataset [Dataset]. https://paperswithcode.com/dataset/openai-gym
    Explore at:
    Dataset updated
    Feb 2, 2021
    Authors
    Greg Brockman; Vicki Cheung; Ludwig Pettersson; Jonas Schneider; John Schulman; Jie Tang; Wojciech Zaremba
    Description

    OpenAI Gym is a toolkit for developing and comparing reinforcement learning algorithms. It includes environments such as Algorithmic, Atari, Box2D, Classic Control, MuJoCo, Robotics, and Toy Text.

  2. allyarc_oai_format

    • huggingface.co
    Updated Apr 7, 2024
    Cite
    AllyArc (2024). allyarc_oai_format [Dataset]. https://huggingface.co/datasets/AllyArc/allyarc_oai_format
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 7, 2024
    Dataset authored and provided by
    AllyArc
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for AllyArc/allyarc_oai_format

    This dataset card provides a structured overview of the AllyArc/allyarc_oai_format dataset, designed for training conversational AI models tailored for educational purposes, with a special focus on supporting students with diverse learning needs, including those in Special Educational Needs (SEN) education.

      Dataset Details
    
    
    
    
    
      Dataset Description
    

    The AllyArc/allyarc_oai_format dataset is comprised of conversational… See the full description on the dataset page: https://huggingface.co/datasets/AllyArc/allyarc_oai_format.

  3. AIMO-24: Model (openai-community/gpt2-large)

    • kaggle.com
    zip
    Updated Apr 7, 2024
    Cite
    Dinh Thoai Tran @ randrise.com (2024). AIMO-24: Model (openai-community/gpt2-large) [Dataset]. https://www.kaggle.com/datasets/dinhttrandrise/aimo-24-model-openai-community-gpt2-large
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Apr 7, 2024
    Authors
    Dinh Thoai Tran @ randrise.com
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    language: en

    license: mit

    GPT-2 Large

    Table of Contents

    Model Details

    Model Description: GPT-2 Large is the 774M parameter version of GPT-2, a transformer-based language model created and released by OpenAI. The model is a pretrained model on English language using a causal language modeling (CLM) objective.

    How to Get Started with the Model

    Use the code below to get started with the model. You can use this model directly with a pipeline for text generation. Since the generation relies on some randomness, we set a seed for reproducibility:

    >>> from transformers import pipeline, set_seed
    >>> generator = pipeline('text-generation', model='gpt2-large')
    >>> set_seed(42)
    >>> generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)
    
    [{'generated_text': "Hello, I'm a language model, I can do language modeling. In fact, this is one of the reasons I use languages. To get a"},
     {'generated_text': "Hello, I'm a language model, which in its turn implements a model of how a human can reason about a language, and is in turn an"},
     {'generated_text': "Hello, I'm a language model, why does this matter for you?\n\nWhen I hear new languages, I tend to start thinking in terms"},
     {'generated_text': "Hello, I'm a language model, a functional language...\n\nI don't need to know anything else. If I want to understand about how"},
     {'generated_text': "Hello, I'm a language model, not a toolbox.\n\nIn a nutshell, a language model is a set of attributes that define how"}]
    

    Here is how to use this model to get the features of a given text in PyTorch:

    from transformers import GPT2Tokenizer, GPT2Model
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')
    model = GPT2Model.from_pretrained('gpt2-large')
    text = "Replace me by any text you'd like."
    encoded_input = tokenizer(text, return_tensors='pt')
    output = model(**encoded_input)
    

    and in TensorFlow:

    from transformers import GPT2Tokenizer, TFGPT2Model
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')
    model = TFGPT2Model.from_pretrained('gpt2-large')
    text = "Replace me by any text you'd like."
    encoded_input = tokenizer(text, return_tensors='tf')
    output = model(encoded_input)
    

    Uses

    Direct Use

    In their model card about GPT-2, OpenAI wrote:

    The primary intended users of these models are AI researchers and practitioners.

    We primarily imagine these language models will be used by researchers to better understand the behaviors, capabilities, biases, and constraints of large-scale generative language models.

    Downstream Use

    In their model card about GPT-2, OpenAI wrote:

    Here are some secondary use cases we believe are likely:

    • Writing assistance: Grammar assistance, autocompletion (for normal prose or code)
    • Creative writing and art: exploring the generation of creative, fictional texts; aiding creation of poetry and other literary art.
    • Entertainment: Creation of games, chat bots, and amusing generations.

    Misuse and Out-of-scope Use

    In their model card about GPT-2, OpenAI wrote:

    Because large-scale language models like GPT-2 ...

  4. openai_humaneval

    • huggingface.co
    Updated Jan 1, 2022
    Cite
    OpenAI (2022). openai_humaneval [Dataset]. https://huggingface.co/datasets/openai/openai_humaneval
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 1, 2022
    Dataset authored and provided by
    OpenAI: https://openai.com/
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for OpenAI HumanEval

      Dataset Summary
    

    The HumanEval dataset released by OpenAI includes 164 programming problems, each with a function signature, docstring, body, and several unit tests. They were handwritten to ensure that they would not appear in the training sets of code generation models.
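
    As a rough illustration of how a problem of this shape can be scored, the sketch below runs a candidate body against unit tests. The toy problem is hypothetical and not taken from the dataset, and the field names `prompt`, `test`, and `entry_point` are assumptions about the record layout that should be checked against the dataset page.

```python
# Hypothetical toy problem in the shape described above: a signature plus
# docstring (the prompt), a body (the completion), and unit tests.
problem = {
    "prompt": 'def add(a, b):\n    """Return the sum of a and b."""\n',
    "test": (
        "def check(candidate):\n"
        "    assert candidate(1, 2) == 3\n"
        "    assert candidate(-1, 1) == 0\n"
    ),
    "entry_point": "add",
}


def passes(problem: dict, completion: str) -> bool:
    """Run prompt + completion, then the unit tests; True if all pass."""
    namespace: dict = {}
    try:
        # exec runs arbitrary code: only use it on trusted or sandboxed input.
        exec(problem["prompt"] + completion, namespace)
        exec(problem["test"], namespace)
        namespace["check"](namespace[problem["entry_point"]])
    except Exception:
        return False
    return True


good = passes(problem, "    return a + b\n")
bad = passes(problem, "    return a - b\n")
print(good, bad)
```

    A real evaluation harness would isolate execution in a separate process with time limits; this sketch only shows the pass/fail logic.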

      Supported Tasks and Leaderboards
    
    
    
    
    
      Languages
    

    The programming problems are written in Python and contain English natural text in comments and docstrings.… See the full description on the dataset page: https://huggingface.co/datasets/openai/openai_humaneval.

  5. Data from: WebText Dataset

    • paperswithcode.com
    Updated Feb 22, 2021
    Cite
    Alec Radford; Jeffrey Wu; Rewon Child; David Luan; Dario Amodei; Ilya Sutskever (2022). WebText Dataset [Dataset]. https://paperswithcode.com/dataset/webtext
    Explore at:
    Dataset updated
    Feb 22, 2021
    Authors
    Alec Radford; Jeffrey Wu; Rewon Child; David Luan; Dario Amodei; Ilya Sutskever
    Description

    WebText is an internal OpenAI corpus created by scraping web pages with emphasis on document quality. The authors scraped all outbound links from Reddit which received at least 3 karma. The authors used the approach as a heuristic indicator for whether other users found the link interesting, educational, or just funny.

    WebText contains the text subset of these 45 million links. It consists of over 8 million documents for a total of 40 GB of text. All Wikipedia documents were removed from WebText since it is a common data source for other datasets.

  6. Implications for future LLM research.

    • plos.figshare.com
    xls
    Updated Jan 18, 2024
    Cite
    Jack Gallifant; Amelia Fiske; Yulia A. Levites Strekalova; Juan S. Osorio-Valencia; Rachael Parke; Rogers Mwavu; Nicole Martinez; Judy Wawira Gichoya; Marzyeh Ghassemi; Dina Demner-Fushman; Liam G. McCoy; Leo Anthony Celi; Robin Pierce (2024). Implications for future LLM research. [Dataset]. http://doi.org/10.1371/journal.pdig.0000417.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Jan 18, 2024
    Dataset provided by
    PLOS Digital Health
    Authors
    Jack Gallifant; Amelia Fiske; Yulia A. Levites Strekalova; Juan S. Osorio-Valencia; Rachael Parke; Rogers Mwavu; Nicole Martinez; Judy Wawira Gichoya; Marzyeh Ghassemi; Dina Demner-Fushman; Liam G. McCoy; Leo Anthony Celi; Robin Pierce
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The study provides a comprehensive review of OpenAI’s Generative Pre-trained Transformer 4 (GPT-4) technical report, with an emphasis on applications in high-risk settings like healthcare. A diverse team, including experts in artificial intelligence (AI), natural language processing, public health, law, policy, social science, healthcare research, and bioethics, analyzed the report against established peer review guidelines. The GPT-4 report shows a significant commitment to transparent AI research, particularly in creating a systems card for risk assessment and mitigation. However, it reveals limitations such as restricted access to training data, inadequate confidence and uncertainty estimations, and concerns over privacy and intellectual property rights. Key strengths identified include the considerable time and economic investment in transparent AI research and the creation of a comprehensive systems card. On the other hand, the lack of clarity in training processes and data raises concerns about encoded biases and interests in GPT-4. The report also lacks confidence and uncertainty estimations, crucial in high-risk areas like healthcare, and fails to address potential privacy and intellectual property issues. Furthermore, this study emphasizes the need for diverse, global involvement in developing and evaluating large language models (LLMs) to ensure broad societal benefits and mitigate risks. The paper presents recommendations such as improving data transparency, developing accountability frameworks, establishing confidence standards for LLM outputs in high-risk settings, and enhancing industry research review processes. It concludes that while GPT-4’s report is a step towards open discussions on LLMs, more extensive interdisciplinary reviews are essential for addressing bias, harm, and risk concerns, especially in high-risk domains. 
    The review aims to expand the understanding of LLMs in general and highlights the need for new forms of reflection on how LLMs are reviewed, what data is required for effective evaluation, and how critical issues like bias and risk are addressed.

  7. AI Summarisation Model Evaluation Dataset

    • opendatabay.com
    Updated Jul 4, 2025
    Cite
    Datasimple (2025). AI Summarisation Model Evaluation Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/f95cfdab-cfe3-46a3-91a4-e5d8f15dcf15
    Dataset updated
    Jul 4, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Data Science and Analytics
    Description

    This dataset provides a unique corpus for natural language processing tasks, specifically designed for text summarisation tools and for validating reward models from OpenAI. It includes text summaries sourced from the TL;DR, CNN, and Daily Mail datasets. The collection also contains essential supplementary information such as choices made by workers during the summarisation process, batch details to distinguish between different worker-generated summaries, and dataset attribute splits. This allows users to train state-of-the-art natural language processing systems with real-world data, facilitating the creation of reliable, concise summaries from longer texts. It enables developers to explore cutting-edge summarisation research whilst directly assessing against human-generated results.

    Columns

    • info: Provides contextual information about the original text to be summarised, including an ID, title, site, and the full article content.
    • summary: Contains the generated summaries of text from the source datasets.
    • worker: Denotes the specific worker who produced a given summary, useful for analysing worker-specific trends or biases.
    • batch: Indicates the batch identifier for summaries, helping to differentiate groups of summaries created by workers.
    • split: Specifies the dataset attribute split (e.g., training, validation) for machine learning tasks.

    Distribution

    The dataset is primarily available in CSV file format. It includes separate files for training, validation, and testing purposes, such as train.csv, validation.csv, and axis_test.csv. Specific numbers for the total rows or records across all files are not explicitly detailed in the provided information.
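
    A minimal sketch of reading a file with the columns listed above and tallying summaries per worker and per split. The rows are made-up stand-ins for train.csv, held in memory so the example runs without the download; the real `info` column is described as holding richer content than shown here.

```python
import csv
import io
from collections import Counter

# Toy rows standing in for train.csv; every value here is invented.
sample_csv = io.StringIO(
    "info,summary,worker,batch,split\n"
    "post-1,Short summary A,w1,batch3,train\n"
    "post-1,Short summary B,w2,batch3,train\n"
    "post-2,Short summary C,w1,batch4,validation\n"
)

rows = list(csv.DictReader(sample_csv))

# Count summaries per worker and per split.
per_worker = Counter(row["worker"] for row in rows)
per_split = Counter(row["split"] for row in rows)

print(per_worker)
print(per_split)
```

    For an actual file, `io.StringIO(...)` would be replaced by `open("train.csv", newline="")`.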

    Usage

    This dataset is ideal for:

    • Training natural language processing models to automatically generate text summaries.
    • Evaluating OpenAI's reward model for natural language processing, aiming to enhance its accuracy and performance.
    • Analysing worker and batch information to identify trends that might indicate bias or other issues impacting summarisation accuracy.
    • Developing machine learning models that understand and evaluate natural language processing.

    Coverage

    The dataset's content is derived from existing news and article sources like TL;DR, CNN, and Daily Mail, providing broad topical coverage. Its geographic scope is global. A specific time range for the original articles is not stated, but the dataset itself was listed on 11/06/2025. There are no explicit demographic notes on data availability.

    License

    CC0

    Who Can Use It

    • Data scientists and machine learning engineers developing and refining NLP models.
    • AI researchers focusing on text summarisation and generative AI.
    • Developers looking to integrate high-quality summarisation capabilities into their applications.
    • Academics and students studying natural language processing and model evaluation.

    Dataset Name Suggestions

    • OpenAI Text Summarisation Corpus
    • AI Summarisation Model Evaluation Dataset
    • NLP Human-Generated Summaries
    • Machine Learning Summarisation Benchmark
    • Text Summary Reward Model Data

    Attributes

    Original Data Source: OpenAI Summarization Corpus

  8. openai-humaneval

    • opendatalab.com
    zip
    Updated Dec 16, 2023
    Cite
    Zipline (2023). openai-humaneval [Dataset]. https://opendatalab.com/OpenDataLab/openai-humaneval
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 16, 2023
    Dataset provided by
    OpenAI: https://openai.com/
    Anthropic: https://anthropic.com/
    Zipline
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The HumanEval dataset released by OpenAI includes 164 programming problems, each with a function signature, docstring, body, and several unit tests. They were handwritten to ensure that they would not appear in the training sets of code generation models.

  9. gsm8k

    • huggingface.co
    Updated Aug 11, 2022
    + more versions
    Cite
    OpenAI (2022). gsm8k [Dataset]. https://huggingface.co/datasets/openai/gsm8k
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 11, 2022
    Dataset authored and provided by
    OpenAI: https://openai.com/
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for GSM8K

      Dataset Summary
    

    GSM8K (Grade School Math 8K) is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning.

    These problems take between 2 and 8 steps to solve. Solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − × ÷) to reach the… See the full description on the dataset page: https://huggingface.co/datasets/openai/gsm8k.
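
    A small sketch of pulling the final numeric answer out of a multi-step solution. It assumes solutions end with a `#### <number>` line, which is not stated in the excerpt above and should be confirmed on the dataset page; the solution text and the helper name `final_answer` are hypothetical.

```python
import re
from typing import Optional

# Hypothetical two-step solution; not copied from the dataset.
solution = (
    "Each box holds 12 eggs, so 3 boxes hold 3 * 12 = 36 eggs.\n"
    "After using 4 eggs, 36 - 4 = 32 remain.\n"
    "#### 32"
)


def final_answer(text: str) -> Optional[str]:
    """Return the number after the last '####' marker, or None if absent."""
    matches = re.findall(r"####\s*(-?[\d,]*\.?\d+)", text)
    # Strip thousands separators so "1,000" compares equal to "1000".
    return matches[-1].replace(",", "") if matches else None


answer = final_answer(solution)
print(answer)
```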

  10. Seaquest - OpenAI Gym Dataset

    • paperswithcode.com
    Cite
    Marc G. Bellemare; Yavar Naddaf; Joel Veness; Michael Bowling, Seaquest - OpenAI Gym Dataset [Dataset]. https://paperswithcode.com/dataset/seaquest-openai-gym
    Explore at:
    Authors
    Marc G. Bellemare; Yavar Naddaf; Joel Veness; Michael Bowling
    Description

    Dataset: The experiments are conducted using the Seaquest environment from the OpenAI Gym framework, which simulates the Atari 2600 game Seaquest. The dataset consists of RGB frames (210x160x3) generated dynamically during training. These frames are preprocessed by converting to grayscale, resizing to 84x84 pixels, and stacking four consecutive frames to form a 4x84x84 tensor, capturing temporal dynamics of the game state. No external or pre-collected dataset is used; the data is produced through real-time interaction with the Gym environment.
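
    The preprocessing described above can be sketched in plain Python as follows. Synthetic frames stand in for emulator output, so no Gym installation is needed. The luminance weights and the nearest-neighbour resize are assumptions, since the description does not say which grayscale conversion or resampling method was used.

```python
from collections import deque

HEIGHT, WIDTH, SIZE, STACK = 210, 160, 84, 4


def to_gray(frame):
    """Convert an HxWx3 RGB frame (nested lists) to HxW luminance values."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for r, g, b in row] for row in frame]


def resize_nearest(gray, size=SIZE):
    """Nearest-neighbour resize to size x size (stand-in for a real resampler)."""
    h, w = len(gray), len(gray[0])
    return [[gray[i * h // size][j * w // size] for j in range(size)]
            for i in range(size)]


def preprocess(frame):
    """Grayscale, then downsample to 84x84."""
    return resize_nearest(to_gray(frame))


# Keep only the four most recent preprocessed frames.
frames = deque(maxlen=STACK)
for t in range(6):
    # Synthetic constant-colour frame with pixel value t in every channel.
    rgb = [[(t, t, t) for _ in range(WIDTH)] for _ in range(HEIGHT)]
    frames.append(preprocess(rgb))

state = list(frames)  # 4 x 84 x 84, oldest frame first
shape = (len(state), len(state[0]), len(state[0][0]))
print(shape)
```

    In practice the same steps are usually done with array libraries for speed; the nested lists here only make the shapes explicit.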

  11. Synthetic Data Generation Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Jun 16, 2025
    + more versions
    Cite
    Data Insights Market (2025). Synthetic Data Generation Report [Dataset]. https://www.datainsightsmarket.com/reports/synthetic-data-generation-1124388
    Explore at:
    Available download formats: doc, pdf, ppt
    Dataset updated
    Jun 16, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The synthetic data generation market is experiencing explosive growth, driven by the increasing need for high-quality data in various applications, including AI/ML model training, data privacy compliance, and software testing. The market, currently estimated at $2 billion in 2025, is projected to experience a Compound Annual Growth Rate (CAGR) of 25% from 2025 to 2033, reaching an estimated $10 billion by 2033. This significant expansion is fueled by several key factors. Firstly, the rising adoption of artificial intelligence and machine learning across industries demands large, high-quality datasets, often unavailable due to privacy concerns or data scarcity. Synthetic data provides a solution by generating realistic, privacy-preserving datasets that mirror real-world data without compromising sensitive information. Secondly, stringent data privacy regulations like GDPR and CCPA are compelling organizations to explore alternative data solutions, making synthetic data a crucial tool for compliance. Finally, the advancements in generative AI models and algorithms are improving the quality and realism of synthetic data, expanding its applicability in various domains. Major players like Microsoft, Google, and AWS are actively investing in this space, driving further market expansion. The market segmentation reveals a diverse landscape with numerous specialized solutions. While large technology firms dominate the broader market, smaller, more agile companies are making significant inroads with specialized offerings focused on specific industry needs or data types. The geographical distribution is expected to be skewed towards North America and Europe initially, given the high concentration of technology companies and early adoption of advanced data technologies. However, growing awareness and increasing data needs in other regions are expected to drive substantial market growth in Asia-Pacific and other emerging markets in the coming years. 
The competitive landscape is characterized by a mix of established players and innovative startups, leading to continuous innovation and expansion of market applications. This dynamic environment indicates sustained growth in the foreseeable future, driven by an increasing recognition of synthetic data's potential to address critical data challenges across industries.
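
    The growth figures quoted in this description can be checked with a short compound-growth calculation. This sketch takes the stated $2 billion base and 25% CAGR at face value and assumes 2025 to 2033 means eight compounding periods; the function names are illustrative.

```python
def project(value: float, rate: float, years: int) -> float:
    """Compound `value` at annual `rate` for `years` years."""
    return value * (1 + rate) ** years


def implied_cagr(start: float, end: float, years: int) -> float:
    """Annual growth rate that takes `start` to `end` in `years` years."""
    return (end / start) ** (1 / years) - 1


years = 2033 - 2025  # 8 compounding periods (assumption)
projected = project(2.0, 0.25, years)          # billions of dollars
rate_for_10b = implied_cagr(2.0, 10.0, years)  # rate needed to reach $10B

print(f"{projected:.2f}")     # 11.92
print(f"{rate_for_10b:.3f}")  # 0.223
```

    Under these assumptions a 25% rate gives roughly $11.9 billion, while $10 billion corresponds to about 22.3% a year, so the quoted figures appear to be rounded.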

  12. gpt3

    • tensorflow.org
    Updated Dec 19, 2023
    Cite
    (2023). gpt3 [Dataset]. https://www.tensorflow.org/datasets/catalog/gpt3
    Explore at:
    Dataset updated
    Dec 19, 2023
    Description

    Synthetic datasets for word scramble and arithmetic tasks described in the GPT3 paper.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('gpt3', split='train')
    for ex in ds.take(4):
        print(ex)
    

    See the guide for more information on tensorflow_datasets.

  13. The Information’s AI Data Center Database

    • theinformation.com
    csv
    Updated Sep 3, 2024
    Cite
    The Information (2024). The Information’s AI Data Center Database [Dataset]. https://www.theinformation.com/projects/ai-data-center-database
    Explore at:
    Available download formats: csv
    Dataset updated
    Sep 3, 2024
    Dataset authored and provided by
    The Information
    Area covered
    Worldwide
    Dataset funded by
    The Information
    Description

    Top artificial intelligence firms are racing to build the biggest and most powerful Nvidia server chip clusters to win in AI. Below, we mapped the biggest completed and planned server clusters. Check back often, as we'll update the list when we confirm more data.

  14. ScienticDatasetArxiv-openAI-FormatV4

    • huggingface.co
    Updated Aug 4, 2024
    + more versions
    Cite
    Edison Bejarano Sepulveda (2024). ScienticDatasetArxiv-openAI-FormatV4 [Dataset]. https://huggingface.co/datasets/ejbejaranos/ScienticDatasetArxiv-openAI-FormatV4
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 4, 2024
    Authors
    Edison Bejarano Sepulveda
    Description

    📚 Scientific Dataset Arxiv OpenAI Format Version 4

    This dataset contains scientific data transformed for use with OpenAI models. It includes detailed descriptions and structures designed for machine learning applications. The original data was taken from:

    from datasets import load_dataset
    dataset = load_dataset("taesiri/arxiv_qa")

      📂 Dataset Structure
    

    The dataset is organized into a training split with comprehensive features tailored for scientific document… See the full description on the dataset page: https://huggingface.co/datasets/ejbejaranos/ScienticDatasetArxiv-openAI-FormatV4.

  15. LLM - Detect AI Datamix

    • kaggle.com
    Updated Feb 2, 2024
    Cite
    Raja Biswas (2024). LLM - Detect AI Datamix [Dataset]. https://www.kaggle.com/datasets/conjuring92/ai-mix-v26/discussion
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 2, 2024
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Raja Biswas
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This is the datamix created by Team 🔍 📝 🕵️‍♂️ 🤖 during the LLM - Detect AI Generated Text competition. This dataset helped us to win the competition. It facilitates a text-classification task to separate LLM-generated essays from student-written ones.

    It was developed in an incremental way focusing on size, diversity and complexity. For each datamix iteration, we attempted to plug blindspots of the previous generation models while maintaining robustness.

    To maximally leverage in-domain human texts, we used the entire Persuade corpus comprising all 15 prompts. We also included diverse human texts from sources such as the OpenAI GPT2 output dataset, ELLIPSE corpus, NarrativeQA, Wikipedia, NLTK Brown corpus and IMDB movie reviews.

    Sources for our generated essays can be grouped under four categories:

    • Proprietary LLMs (gpt-3.5, gpt-4, claude, cohere, gemini, palm)
    • Open source LLMs (llama, falcon, mistral, mixtral)
    • Existing LLM generated text datasets
      • Synthetic dataset made by T5
      • DAIGT V2 subset
      • OUTFOX
      • Ghostbuster
      • gpt-2-output-dataset
    • Fine-tuned open-source LLMs (mistral, llama, falcon, deci-lm, t5, pythia, OPT, BLOOM, GPT2). For LLM fine-tuning, we leveraged the PERSUADE corpus in different ways:
      • Instruction tuning: Instructions were composed of different metadata e.g. prompt name, holistic essay score, ELL status and grade level. Responses were the corresponding student essays.
      • One topic held out: LLMs fine-tuned on PERSUADE essays with one prompt held out. When generating, only the held out prompt essays were generated. This was done to encourage new writing styles.
      • Span wise generation: Generate one span (discourse) at a time conditioned on the remaining essay.

    We used a wide variety of generation configs and prompting strategies to promote diversity and complexity in the data. Generated essays leveraged a combination of the following:

    • Contrastive search
    • Use of guidance scale, typical_p, suppress_tokens
    • High temperature and large values of top-k
    • Prompting to fill in the blank: randomly masking words in an essay and asking the LLM to reconstruct the original essay (similar to MLM)
    • Prompting without source texts
    • Prompting with source texts
    • Prompting to rewrite existing essays

    Finally, we incorporated augmented essays to make our models aware of typical attacks on LLM content detection systems and obfuscations present in the provided training data. We mainly used a combination of the following augmentations on a random subset of essays:

    • Spelling correction
    • Deletion/insertion/swapping of characters
    • Replacement with synonym
    • Introduce obfuscations
    • Back translation
    • Random capitalization
    • Swap sentence
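
    A few of the character-level augmentations listed above can be sketched as follows. This is an illustration only, not the team's actual pipeline: the function names, the probability `p`, and the choice of one augmentation per essay are all assumptions.

```python
import random


def swap_chars(text: str, rng: random.Random) -> str:
    """Swap one randomly chosen pair of adjacent characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def delete_char(text: str, rng: random.Random) -> str:
    """Delete one randomly chosen character."""
    if not text:
        return text
    i = rng.randrange(len(text))
    return text[:i] + text[i + 1:]


def random_capitalize(text: str, rng: random.Random, p: float = 0.1) -> str:
    """Upper-case each character independently with probability p."""
    return "".join(c.upper() if rng.random() < p else c for c in text)


def augment(text: str, seed: int = 0) -> str:
    """Apply one randomly chosen augmentation; seeded for reproducibility."""
    rng = random.Random(seed)
    op = rng.choice([swap_chars, delete_char, random_capitalize])
    return op(text, rng)


essay = "students should be allowed to choose their own projects."
augmented = augment(essay, seed=1)
print(augmented)
```

    Seeding each call keeps the augmented set reproducible, which helps when comparing detector versions on the same perturbed essays.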

  16. Detailed breakdown of 19 questions, provided answers from LLM, three...

    • plos.figshare.com
    xlsx
    Updated Aug 21, 2024
    Cite
    David Soong; Sriram Sridhar; Han Si; Jan-Samuel Wagner; Ana Caroline Costa Sá; Christina Y. Yu; Kubra Karagoz; Meijian Guan; Sanyam Kumar; Hisham Hamadeh; Brandon W. Higgs (2024). Detailed breakdown of 19 questions, provided answers from LLM, three reviewer scores for accuracy, relevance, and readability per question, notes from reviewers (where relevant) explaining rationale for provided score, reviewer name, and annotation (1 = yes, 0 = no) for whether a hallucination was observed with an answer. [Dataset]. http://doi.org/10.1371/journal.pdig.0000568.s002
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Aug 21, 2024
    Dataset provided by
    PLOS Digital Health
    Authors
    David Soong; Sriram Sridhar; Han Si; Jan-Samuel Wagner; Ana Caroline Costa Sá; Christina Y. Yu; Kubra Karagoz; Meijian Guan; Sanyam Kumar; Hisham Hamadeh; Brandon W. Higgs
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This set of information is provided for each of the four LLMs tested, one per worksheet. Also provided in a separate worksheet is the question grouping used to categorize questions in S1 Fig. The last worksheet contains details of answers provided by the RAG model when varying the number of answers (k) the model used to generate a final answer. Additional information on the papers used to generate answers and intermediate answers the model used to generate the final output are also given. (XLSX)

  17. Number of questions scoring at least 2.5 or more per metric (Accuracy,...

    • plos.figshare.com
    xls
    Updated Aug 21, 2024
    Cite
    David Soong; Sriram Sridhar; Han Si; Jan-Samuel Wagner; Ana Caroline Costa Sá; Christina Y. Yu; Kubra Karagoz; Meijian Guan; Sanyam Kumar; Hisham Hamadeh; Brandon W. Higgs (2024). Number of questions scoring at least 2.5 or more per metric (Accuracy, Relevance, Readability). [Dataset]. http://doi.org/10.1371/journal.pdig.0000568.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Aug 21, 2024
    Dataset provided by
    PLOS Digital Health
    Authors
    David Soong; Sriram Sridhar; Han Si; Jan-Samuel Wagner; Ana Caroline Costa Sá; Christina Y. Yu; Kubra Karagoz; Meijian Guan; Sanyam Kumar; Hisham Hamadeh; Brandon W. Higgs
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Number of questions scoring at least 2.5 or more per metric (Accuracy, Relevance, Readability).

  18. IMDB 50K Movie Reviews (TEST your BERT)

    • kaggle.com
    Updated Dec 19, 2019
    Cite
    Atul Anand {Jha} (2019). IMDB 50K Movie Reviews (TEST your BERT) [Dataset]. https://www.kaggle.com/atulanandjha/imdb-50k-movie-reviews-test-your-bert/discussion
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 19, 2019
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Atul Anand {Jha}
    License

    http://www.gnu.org/licenses/lgpl-3.0.html

    Description

    Context

    Large Movie Review Dataset v1.0 . 😃

    This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. It provides a set of 25,000 highly polar movie reviews for training and 25,000 for testing. There is additional unlabeled data for use as well. Raw text and already-processed bag-of-words formats are provided.

    In the entire collection, no more than 30 reviews are allowed for any given movie, because reviews for the same movie tend to have correlated ratings. Further, the train and test sets contain disjoint sets of movies, so no significant performance can be gained by memorising movie-unique terms and their association with observed labels. In the labelled train/test sets, a negative review has a score <= 4 out of 10, and a positive review has a score >= 7 out of 10. Reviews with more neutral ratings are therefore not included in the train/test sets. In the unsupervised set, reviews of any rating are included, and there is an equal number of reviews with scores > 5 and <= 5.
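The labelling rule described above can be sketched in a few lines of Python. This is only an illustration of the stated thresholds: the function name `label_from_rating` is hypothetical, and the `pos`/`neg` strings are assumed from the tags mentioned later in this description, not taken from any dataset tooling.

```python
from typing import Optional


def label_from_rating(rating: int) -> Optional[str]:
    """Map a 1-10 review score to a sentiment tag.

    Returns "neg" for scores <= 4, "pos" for scores >= 7, and None for
    the neutral scores (5 and 6) that the labelled train/test sets exclude.
    """
    if not 1 <= rating <= 10:
        raise ValueError(f"rating must be between 1 and 10, got {rating}")
    if rating <= 4:
        return "neg"
    if rating >= 7:
        return "pos"
    return None  # neutral review: not part of the labelled sets


if __name__ == "__main__":
    for score in (2, 5, 9):
        print(score, label_from_rating(score))
```

Returning `None` for scores 5 and 6 keeps the excluded neutral band explicit instead of forcing those reviews into either class.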

    Reference: http://ai.stanford.edu/~amaas/data/sentiment/

    NOTE

    A starter kernel is here: https://www.kaggle.com/atulanandjha/bert-testing-on-imdb-dataset-starter-kernel

    A kernel to expose Dataset collection :

    Content

    Now let’s understand the task at hand: given a movie review, predict whether it’s positive or negative.

    The dataset we use is 50,000 IMDB reviews (25K for train and 25K for test) from the PyTorch-NLP library.

    Each review is tagged pos or neg.

    Both the train and test sets contain 50% positive reviews and 50% negative reviews.

    Columns:

    text : Reviews from people.

    Sentiment : Negative or Positive tag on the review/feedback (Boolean).

    Acknowledgements

    When using this dataset, please cite this ACL paper:

    @InProceedings{maas-EtAl:2011:ACL-HLT2011,
      author    = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},
      title     = {Learning Word Vectors for Sentiment Analysis},
      booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
      month     = {June},
      year      = {2011},
      address   = {Portland, Oregon, USA},
      publisher = {Association for Computational Linguistics},
      pages     = {142--150},
      url       = {http://www.aclweb.org/anthology/P11-1015}
    }

    Link to ref Dataset: https://pytorchnlp.readthedocs.io/en/latest/_modules/torchnlp/datasets/imdb.html

    https://www.samyzaf.com/ML/imdb/imdb.html

    Inspiration

    BERT and other Transformer-architecture models have attracted a lot of attention recently, thanks to the breakthrough of bringing transfer learning to NLP. So let's use this simple yet effective dataset to test these models and compare our results with theirs. I also invite fellow researchers to try their state-of-the-art algorithms on this dataset.

  19. A

    Artificial Intelligence Model Service Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Apr 12, 2025
    Cite
    Data Insights Market (2025). Artificial Intelligence Model Service Report [Dataset]. https://www.datainsightsmarket.com/reports/artificial-intelligence-model-service-1960466
    Explore at:
    Available download formats: doc, ppt, pdf
    Dataset updated
    Apr 12, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Artificial Intelligence (AI) Model Service market is experiencing explosive growth, driven by the increasing adoption of AI across various industries. While precise market size figures for 2025 are unavailable, considering the rapid advancements and investments in AI, a reasonable estimate places the market value at $50 billion. This substantial figure reflects the high demand for pre-trained and customizable AI models, eliminating the need for companies to develop these complex systems from scratch. Key drivers include the decreasing cost of cloud computing, the rising availability of large datasets for training, and the growing need for automation and improved efficiency across sectors like healthcare, finance, and manufacturing. The market is segmented by application (e.g., image recognition, natural language processing, predictive analytics) and model type (e.g., generative, discriminative).

    Leading players like OpenAI, Google, Amazon Web Services, and Microsoft are heavily investing in research and development, leading to continuous innovation and improvements in model accuracy and performance. Trends such as the increasing use of edge AI and the growing adoption of AI in small and medium-sized enterprises (SMEs) further contribute to the market's expansion. However, challenges remain, including concerns about data privacy, ethical implications of AI, and the need for skilled professionals to manage and deploy these sophisticated models effectively. Despite these restraints, the overall market outlook is overwhelmingly positive, with a projected Compound Annual Growth Rate (CAGR) suggesting a substantial increase in market value over the forecast period (2025-2033).

    The competitive landscape is dynamic, with established tech giants competing with innovative startups. The geographic distribution of the market shows strong growth in North America and Asia Pacific, driven by the presence of major technology hubs and early adoption of AI solutions. Europe and other regions are also experiencing significant growth, albeit at a potentially slightly slower pace. The forecast period (2025-2033) anticipates continued market expansion, fueled by technological breakthroughs, increased investment, and wider industry adoption. The market's evolution will be significantly shaped by ongoing research into explainable AI, improved model security, and the development of more efficient training techniques. Companies will likely focus on developing specialized AI models tailored to specific industry needs, offering customized solutions to further accelerate market growth. The ongoing development of more accessible and user-friendly AI tools is expected to widen the adoption across different segments, leading to continuous expansion throughout the forecast period.

  20. O

    Open Source Deep Learning Platform Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Jun 1, 2025
    Cite
    Data Insights Market (2025). Open Source Deep Learning Platform Report [Dataset]. https://www.datainsightsmarket.com/reports/open-source-deep-learning-platform-494147
    Explore at:
    Available download formats: pdf, ppt, doc
    Dataset updated
    Jun 1, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The open-source deep learning platform market is experiencing robust growth, projected to reach a substantial size driven by several key factors. The market's Compound Annual Growth Rate (CAGR) of 15.3% from 2019 to 2024 indicates a significant upward trajectory. This growth is fueled by the increasing adoption of deep learning across various sectors, including healthcare, finance, and autonomous vehicles. The accessibility and flexibility of open-source platforms, coupled with the vibrant community support and continuous innovation, are major contributors to this expansion. Leading technology companies like Google, Meta, Microsoft, NVIDIA, and OpenAI are actively involved in developing and supporting open-source deep learning frameworks, further boosting the market's momentum. The availability of pre-trained models and tools simplifies the development process, lowering the barrier to entry for both individuals and organizations. This democratization of AI development is accelerating the pace of innovation and driving wider adoption.

    Looking ahead to 2033, the market is expected to continue its impressive growth trajectory. The expanding data volume, the rising need for advanced analytics, and the increasing demand for customized AI solutions will be key drivers. The continued evolution of deep learning algorithms and hardware capabilities will further enhance the capabilities of open-source platforms. While potential restraints such as security concerns and the need for specialized expertise exist, the overall market outlook remains highly positive, promising substantial expansion and transformation across various industries. The $5,887 million market size in 2025 provides a solid baseline for projecting future growth based on the 15.3% CAGR. This suggests a substantial market value within the forecast period (2025-2033).
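The projection the description alludes to is ordinary compound growth, V(t) = V(0) * (1 + r)^t. The sketch below applies it to the two figures quoted above (a 2025 base of 5887 million USD and a 15.3% CAGR). Note the assumptions: the report states the 15.3% rate for 2019 to 2024, so carrying it forward to 2033 is purely illustrative, and the function name `project_market_size` is hypothetical.

```python
def project_market_size(base_value: float, cagr: float, years: int) -> float:
    """Compound a base value forward: base_value * (1 + cagr) ** years."""
    if years < 0:
        raise ValueError("years must be non-negative")
    return base_value * (1.0 + cagr) ** years


# Figures quoted in the description above (millions of USD).
BASE_YEAR = 2025
BASE_VALUE_MUSD = 5887.0
# Assumption: the report's 2019-2024 CAGR also holds over 2025-2033.
ASSUMED_CAGR = 0.153

if __name__ == "__main__":
    for year in range(BASE_YEAR, 2034):
        value = project_market_size(BASE_VALUE_MUSD, ASSUMED_CAGR, year - BASE_YEAR)
        print(f"{year}: {value:,.0f} million USD")
```

Under that assumption the value roughly triples over the eight years to 2033. A different rate for the forecast period would change the result substantially, which is why the rate is kept as an explicit parameter.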
