OpenAI Gym is a toolkit for developing and comparing reinforcement learning algorithms. It includes environments such as Algorithmic, Atari, Box2D, Classic Control, MuJoCo, Robotics, and Toy Text.
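As a minimal sketch of the Gym interface (assuming the classic gym API and a Classic Control environment such as CartPole-v1; newer gymnasium releases change the reset/step signatures slightly):
import gym

# Run one episode of CartPole with a random policy (classic Gym API).
env = gym.make("CartPole-v1")
observation = env.reset()
done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()  # sample a random action
    observation, reward, done, info = env.step(action)
    total_reward += reward
env.close()
print("episode return:", total_reward)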
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for AllyArc/allyarc_oai_format
This dataset card provides a structured overview of the AllyArc/allyarc_oai_format dataset, designed for training conversational AI models tailored for educational purposes, with a special focus on supporting students with diverse learning needs, including those in Special Educational Needs (SEN) education.
Dataset Details
Dataset Description
The AllyArc/allyarc_oai_format dataset comprises conversational… See the full description on the dataset page: https://huggingface.co/datasets/AllyArc/allyarc_oai_format.
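A minimal sketch for loading the dataset with the Hugging Face datasets library (the "train" split name is an assumption; check the dataset page for the actual splits and fields):
from datasets import load_dataset

# Load the conversational dataset from the Hugging Face Hub.
ds = load_dataset("AllyArc/allyarc_oai_format")
print(ds)              # inspect the available splits
print(ds["train"][0])  # assumes a "train" split exists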
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
language: en
Model Description: GPT-2 Large is the 774M-parameter version of GPT-2, a transformer-based language model created and released by OpenAI. It is pretrained on English text with a causal language modeling (CLM) objective.
Use the code below to get started with the model. You can use this model directly with a pipeline for text generation. Since the generation relies on some randomness, we set a seed for reproducibility:
>>> from transformers import pipeline, set_seed
>>> generator = pipeline('text-generation', model='gpt2-large')
>>> set_seed(42)
>>> generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)
[{'generated_text': "Hello, I'm a language model, I can do language modeling. In fact, this is one of the reasons I use languages. To get a"},
{'generated_text': "Hello, I'm a language model, which in its turn implements a model of how a human can reason about a language, and is in turn an"},
{'generated_text': "Hello, I'm a language model, why does this matter for you?
When I hear new languages, I tend to start thinking in terms"},
{'generated_text': "Hello, I'm a language model, a functional language...
I don't need to know anything else. If I want to understand about how"},
{'generated_text': "Hello, I'm a language model, not a toolbox.
In a nutshell, a language model is a set of attributes that define how"}]
Here is how to use this model to get the features of a given text in PyTorch:
from transformers import GPT2Tokenizer, GPT2Model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')
model = GPT2Model.from_pretrained('gpt2-large')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
and in TensorFlow:
from transformers import GPT2Tokenizer, TFGPT2Model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')
model = TFGPT2Model.from_pretrained('gpt2-large')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
In their model card about GPT-2, OpenAI wrote:
The primary intended users of these models are AI researchers and practitioners.
We primarily imagine these language models will be used by researchers to better understand the behaviors, capabilities, biases, and constraints of large-scale generative language models.
In their model card about GPT-2, OpenAI wrote:
Here are some secondary use cases we believe are likely:
- Writing assistance: Grammar assistance, autocompletion (for normal prose or code)
- Creative writing and art: exploring the generation of creative, fictional texts; aiding creation of poetry and other literary art.
- Entertainment: Creation of games, chat bots, and amusing generations.
In their model card about GPT-2, OpenAI wrote:
Because large-scale language models like GPT-2 ...
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for OpenAI HumanEval
Dataset Summary
The HumanEval dataset released by OpenAI includes 164 programming problems, each with a function signature, docstring, body, and several unit tests. They were handwritten to ensure they would not be included in the training sets of code generation models.
Supported Tasks and Leaderboards
Languages
The programming problems are written in Python and contain English natural text in comments and docstrings. … See the full description on the dataset page: https://huggingface.co/datasets/openai/openai_humaneval.
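A minimal loading sketch with the datasets library (HumanEval ships a single test split; the field names below follow the dataset page):
from datasets import load_dataset

# HumanEval has a single "test" split with 164 problems.
humaneval = load_dataset("openai/openai_humaneval", split="test")
problem = humaneval[0]
print(problem["task_id"])
print(problem["prompt"])              # function signature and docstring
print(problem["canonical_solution"])  # reference function body
print(problem["test"])                # unit tests for the entry point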
WebText is an internal OpenAI corpus created by scraping web pages with an emphasis on document quality. The authors scraped all outbound links from Reddit that received at least 3 karma, using this threshold as a heuristic indicator of whether other users found the link interesting, educational, or just funny.
WebText contains the text subset of these 45 million links. It consists of over 8 million documents for a total of 40 GB of text. All Wikipedia documents were removed from WebText since it is a common data source for other datasets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The study provides a comprehensive review of OpenAI's Generative Pre-trained Transformer 4 (GPT-4) technical report, with an emphasis on applications in high-risk settings like healthcare. A diverse team, including experts in artificial intelligence (AI), natural language processing, public health, law, policy, social science, healthcare research, and bioethics, analyzed the report against established peer review guidelines. The GPT-4 report shows a significant commitment to transparent AI research, particularly in creating a systems card for risk assessment and mitigation. However, it reveals limitations such as restricted access to training data, inadequate confidence and uncertainty estimations, and concerns over privacy and intellectual property rights. Key strengths identified include the considerable time and economic investment in transparent AI research and the creation of a comprehensive systems card. On the other hand, the lack of clarity in training processes and data raises concerns about encoded biases and interests in GPT-4. The report also lacks confidence and uncertainty estimations, crucial in high-risk areas like healthcare, and fails to address potential privacy and intellectual property issues. Furthermore, this study emphasizes the need for diverse, global involvement in developing and evaluating large language models (LLMs) to ensure broad societal benefits and mitigate risks. The paper presents recommendations such as improving data transparency, developing accountability frameworks, establishing confidence standards for LLM outputs in high-risk settings, and enhancing industry research review processes. It concludes that while GPT-4's report is a step towards open discussions on LLMs, more extensive interdisciplinary reviews are essential for addressing bias, harm, and risk concerns, especially in high-risk domains. The review aims to expand the understanding of LLMs in general and highlights the need for new forms of reflection on how LLMs are reviewed, the data required for effective evaluation, and addressing critical issues like bias and risk.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset provides a unique corpus for natural language processing tasks, specifically designed for text summarisation tools and for validating reward models from OpenAI. It includes text summaries sourced from the TL;DR, CNN, and Daily Mail datasets. The collection also contains essential supplementary information such as choices made by workers during the summarisation process, batch details to distinguish between different worker-generated summaries, and dataset attribute splits. This allows users to train state-of-the-art natural language processing systems with real-world data, facilitating the creation of reliable, concise summaries from longer texts. It enables developers to explore cutting-edge summarisation research whilst directly assessing against human-generated results.
The dataset is primarily available in CSV format. It includes separate files for training, validation, and testing, such as train.csv, validation.csv, and axis_test.csv. Specific numbers for the total rows or records across all files are not explicitly detailed in the provided information.
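A minimal sketch for inspecting the CSV splits with pandas (file names follow the description above; the column schema is not documented here, so inspect it before use):
import pandas as pd

# Load the provided CSV splits; inspect the columns to discover the schema
# (summary text, worker choices, batch details, etc.).
train_df = pd.read_csv("train.csv")
valid_df = pd.read_csv("validation.csv")
axis_df = pd.read_csv("axis_test.csv")

print(train_df.shape)
print(train_df.columns.tolist())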
This dataset is ideal for: * Training natural language processing models to automatically generate text summaries. * Evaluating OpenAI's reward model for natural language processing, aiming to enhance its accuracy and performance. * Analysing worker and batch information to identify trends that might indicate bias or other issues impacting summarisation accuracy. * Developing machine learning models that understand and evaluate natural language processing.
The dataset's content is derived from existing news and article sources like TL;DR, CNN, and Daily Mail, providing broad topical coverage. Its geographic scope is global. A specific time range for the original articles is not stated, but the dataset itself was listed on 11/06/2025. There are no explicit demographic notes on data availability.
CC0
Original Data Source: OpenAI Summarization Corpus
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The HumanEval dataset released by OpenAI includes 164 programming problems, each with a function signature, docstring, body, and several unit tests. They were handwritten to ensure they would not be included in the training sets of code generation models.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for GSM8K
Dataset Summary
GSM8K (Grade School Math 8K) is a dataset of 8.5K high quality linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning.
These problems take between 2 and 8 steps to solve. Solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − × ÷) to reach the… See the full description on the dataset page: https://huggingface.co/datasets/openai/gsm8k.
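A minimal loading sketch with the datasets library (GSM8K provides "main" and "socratic" configurations; field names follow the dataset page):
from datasets import load_dataset

# Load the "main" configuration; "socratic" adds intermediate sub-questions.
gsm8k = load_dataset("openai/gsm8k", "main", split="train")
example = gsm8k[0]
print(example["question"])
print(example["answer"])  # worked solution ending in "#### <final answer>"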
Dataset: The experiments are conducted using the Seaquest environment from the OpenAI Gym framework, which simulates the Atari 2600 game Seaquest. The dataset consists of RGB frames (210x160x3) generated dynamically during training. These frames are preprocessed by converting to grayscale, resizing to 84x84 pixels, and stacking four consecutive frames to form a 4x84x84 tensor, capturing temporal dynamics of the game state. No external or pre-collected dataset is used; the data is produced through real-time interaction with the Gym environment.
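A sketch of the described preprocessing pipeline (grayscale conversion, 84x84 resize, four-frame stacking), assuming OpenCV and the classic Gym Atari API; the exact wrappers and environment id used in the experiments are not specified:
import collections

import cv2
import gym
import numpy as np


def preprocess(frame):
    # Convert a 210x160x3 RGB frame to an 84x84 grayscale image.
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    return cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)


env = gym.make("Seaquest-v4")  # environment id is an assumption
frames = collections.deque([preprocess(env.reset())] * 4, maxlen=4)

for _ in range(100):
    obs, reward, done, info = env.step(env.action_space.sample())
    frames.append(preprocess(obs))
    state = np.stack(frames, axis=0)  # shape (4, 84, 84), fed to the agent
    if done:
        frames = collections.deque([preprocess(env.reset())] * 4, maxlen=4)
env.close()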
https://www.datainsightsmarket.com/privacy-policy
The synthetic data generation market is experiencing explosive growth, driven by the increasing need for high-quality data in various applications, including AI/ML model training, data privacy compliance, and software testing. The market, currently estimated at $2 billion in 2025, is projected to experience a Compound Annual Growth Rate (CAGR) of 25% from 2025 to 2033, reaching an estimated $10 billion by 2033. This significant expansion is fueled by several key factors. Firstly, the rising adoption of artificial intelligence and machine learning across industries demands large, high-quality datasets, often unavailable due to privacy concerns or data scarcity. Synthetic data provides a solution by generating realistic, privacy-preserving datasets that mirror real-world data without compromising sensitive information. Secondly, stringent data privacy regulations like GDPR and CCPA are compelling organizations to explore alternative data solutions, making synthetic data a crucial tool for compliance. Finally, the advancements in generative AI models and algorithms are improving the quality and realism of synthetic data, expanding its applicability in various domains. Major players like Microsoft, Google, and AWS are actively investing in this space, driving further market expansion. The market segmentation reveals a diverse landscape with numerous specialized solutions. While large technology firms dominate the broader market, smaller, more agile companies are making significant inroads with specialized offerings focused on specific industry needs or data types. The geographical distribution is expected to be skewed towards North America and Europe initially, given the high concentration of technology companies and early adoption of advanced data technologies. However, growing awareness and increasing data needs in other regions are expected to drive substantial market growth in Asia-Pacific and other emerging markets in the coming years. The competitive landscape is characterized by a mix of established players and innovative startups, leading to continuous innovation and expansion of market applications. This dynamic environment indicates sustained growth in the foreseeable future, driven by an increasing recognition of synthetic data's potential to address critical data challenges across industries.
Synthetic datasets for word scramble and arithmetic tasks described in the GPT3 paper.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('gpt3', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Top artificial intelligence firms are racing to build the biggest and most powerful Nvidia server chip clusters to win in AI. Below, we mapped the biggest completed and planned server clusters. Check back often, as we'll update the list when we confirm more data.
Scientific Dataset Arxiv OpenAI Format Version 4
This dataset contains scientific data transformed for use with OpenAI models. It includes detailed descriptions and structures designed for machine learning applications. The original data was taken from:
from datasets import load_dataset
dataset = load_dataset("taesiri/arxiv_qa")
Dataset Structure
The dataset is organized into a training split with comprehensive features tailored for scientific document… See the full description on the dataset page: https://huggingface.co/datasets/ejbejaranos/ScienticDatasetArxiv-openAI-FormatV4.
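A minimal loading sketch for the transformed dataset (the "train" split is an assumption based on the description above; inspect the features for the actual schema):
from datasets import load_dataset

# Load the OpenAI-format version and inspect its schema.
ds = load_dataset("ejbejaranos/ScienticDatasetArxiv-openAI-FormatV4", split="train")
print(ds.features)
print(ds[0])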
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This is the datamix created by our team during the LLM - Detect AI Generated Text competition. This dataset helped us win the competition. It facilitates a text-classification task to separate LLM-generated essays from student-written ones.
It was developed incrementally, focusing on size, diversity, and complexity. For each datamix iteration, we attempted to plug blind spots of the previous generation of models while maintaining robustness.
To maximally leverage in-domain human texts, we used the entire Persuade corpus comprising all 15 prompts. We also included diverse human texts from sources such as the OpenAI GPT-2 output dataset, the ELLIPSE corpus, NarrativeQA, Wikipedia, the NLTK Brown corpus, and IMDB movie reviews.
Sources for our generated essays can be grouped under four categories:
- Proprietary LLMs (gpt-3.5, gpt-4, claude, cohere, gemini, palm)
- Open source LLMs (llama, falcon, mistral, mixtral)
- Existing LLM-generated text datasets: DAIGT V2 subset, OUTFOX, Ghostbuster, gpt-2-output-dataset
- Synthetic dataset made by T5
We used a wide variety of generation configs and prompting strategies to promote diversity and complexity in the data; a configuration sketch follows this list. Generated essays leveraged a combination of the following:
- Contrastive search
- Use of guidance scale, typical_p, suppress_tokens
- High temperature and large values of top-k
- Prompting to fill in the blank: randomly mask words in an essay and ask the LLM to reconstruct the original essay (similar to MLM)
- Prompting without source texts
- Prompting with source texts
- Prompting to rewrite existing essays
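As an illustration, several of the sampling options above map onto Hugging Face generate() arguments; the model and values below are arbitrary examples, not the settings used for the datamix:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small model used purely for illustration.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Write a persuasive essay about school uniforms."
inputs = tokenizer(prompt, return_tensors="pt")

# Sampling with high temperature, large top-k, typical_p, and suppressed tokens.
sampled = model.generate(
    **inputs,
    do_sample=True,
    temperature=1.3,
    top_k=500,
    typical_p=0.9,
    suppress_tokens=[tokenizer.eos_token_id],
    max_new_tokens=200,
)

# Contrastive search uses penalty_alpha with a small top_k instead of sampling.
contrastive = model.generate(**inputs, penalty_alpha=0.6, top_k=4, max_new_tokens=200)
print(tokenizer.decode(sampled[0], skip_special_tokens=True))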
Finally, we incorporated augmented essays to make our models aware of typical attacks on LLM content-detection systems and of obfuscations present in the provided training data; a toy augmentation sketch follows this list. We mainly used a combination of the following augmentations on a random subset of essays:
- Spelling correction
- Deletion/insertion/swapping of characters
- Replacement with synonyms
- Introduced obfuscations
- Back translation
- Random capitalization
- Sentence swapping
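A toy sketch of the character-level and capitalization augmentations (rates and alphabet are illustrative; the actual augmentation code is not part of this description):
import random
import string


def augment_characters(text, rate=0.02):
    # Randomly delete, insert, or swap characters in the text.
    chars, out, i = list(text), [], 0
    while i < len(chars):
        op = random.random()
        if op < rate:                                 # delete this character
            i += 1
        elif op < 2 * rate:                           # insert a random letter before it
            out.extend([random.choice(string.ascii_lowercase), chars[i]])
            i += 1
        elif op < 3 * rate and i + 1 < len(chars):    # swap with the next character
            out.extend([chars[i + 1], chars[i]])
            i += 2
        else:
            out.append(chars[i])
            i += 1
    return "".join(out)


def random_capitalization(text, rate=0.05):
    # Randomly flip the case of individual characters.
    return "".join(c.swapcase() if random.random() < rate else c for c in text)


print(augment_characters(random_capitalization("An example student essay about school uniforms.")))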
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This set of information is provided for each of the LLMs tested, one per worksheet. Also provided in a separate worksheet is the question grouping used to categorize questions in S1 Fig. The last worksheet contains details of answers provided by the RAG model when varying the number of answers (k) the model used to generate a final answer. Additional information on the papers used to generate answers and the intermediate answers the model used to generate the final output is also given. (XLSX)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Number of questions scoring at least 2.5 per metric (Accuracy, Relevance, Readability).
http://www.gnu.org/licenses/lgpl-3.0.html
Large Movie Review Dataset v1.0
This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. It provides 25,000 highly polar movie reviews for training and 25,000 for testing, plus additional unlabeled data. Raw text and an already-processed bag-of-words format are provided.
In the entire collection, no more than 30 reviews are allowed for any given movie, because reviews for the same movie tend to have correlated ratings. Further, the train and test sets contain a disjoint set of movies, so no significant performance can be gained by memorising movie-unique terms and their association with observed labels. In the labelled train/test sets, a negative review has a score <= 4 out of 10, and a positive review has a score >= 7 out of 10; reviews with more neutral ratings are not included. In the unsupervised set, reviews of any rating are included, and there are equal numbers of reviews with scores > 5 and <= 5.
Reference:
http://ai.stanford.edu/~amaas/data/sentiment/
NOTE
A starter kernel is here :
https://www.kaggle.com/atulanandjha/bert-testing-on-imdb-dataset-starter-kernel
A kernel to expose Dataset collection :
Now let's understand the task at hand: given a movie review, predict whether it's positive or negative.
The dataset we use is 50,000 IMDB reviews (25K for train and 25K for test) from the PyTorch-NLP library.
Each review is tagged pos or neg. There are 50% positive reviews and 50% negative reviews in both the train and test sets.
text: reviews from people.
Sentiment: Negative or Positive tag on the review/feedback (Boolean).
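A minimal loading sketch using the PyTorch-NLP loader referenced below (the text/sentiment keys follow the library's documented format; verify against the linked docs):
from torchnlp.datasets import imdb_dataset

# Download and load the 25K train and 25K test reviews.
train, test = imdb_dataset(train=True, test=True)

example = train[0]
print(example["text"][:200])  # raw review text
print(example["sentiment"])   # "pos" or "neg"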
When using this dataset, please cite this ACL paper:
@InProceedings{maas-EtAl:2011:ACL-HLT2011,
author = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},
title = {Learning Word Vectors for Sentiment Analysis},
booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
month = {June},
year = {2011},
address = {Portland, Oregon, USA},
publisher = {Association for Computational Linguistics},
pages = {142--150},
}
Link to ref Dataset: https://pytorchnlp.readthedocs.io/en/latest/_modules/torchnlp/datasets/imdb.html
https://www.samyzaf.com/ML/imdb/imdb.html
BERT and other Transformer-architecture models have received a great deal of attention recently thanks to the breakthrough of transfer learning in NLP. So let's use this simple yet efficient dataset to test these models and compare our results with theirs. I also invite fellow researchers to try out their state-of-the-art algorithms on this dataset.
https://www.datainsightsmarket.com/privacy-policy
The Artificial Intelligence (AI) Model Service market is experiencing explosive growth, driven by the increasing adoption of AI across various industries. While precise market size figures for 2025 are unavailable, considering the rapid advancements and investments in AI, a reasonable estimate places the market value at $50 billion. This substantial figure reflects the high demand for pre-trained and customizable AI models, eliminating the need for companies to develop these complex systems from scratch. Key drivers include the decreasing cost of cloud computing, the rising availability of large datasets for training, and the growing need for automation and improved efficiency across sectors like healthcare, finance, and manufacturing. The market is segmented by application (e.g., image recognition, natural language processing, predictive analytics) and model type (e.g., generative, discriminative). Leading players like OpenAI, Google, Amazon Web Services, and Microsoft are heavily investing in research and development, leading to continuous innovation and improvements in model accuracy and performance. Trends such as the increasing use of edge AI and the growing adoption of AI in small and medium-sized enterprises (SMEs) further contribute to the market's expansion. However, challenges remain, including concerns about data privacy, ethical implications of AI, and the need for skilled professionals to manage and deploy these sophisticated models effectively. Despite these restraints, the overall market outlook is overwhelmingly positive, with a projected Compound Annual Growth Rate (CAGR) suggesting a substantial increase in market value over the forecast period (2025-2033). The competitive landscape is dynamic, with established tech giants competing with innovative startups. The geographic distribution of the market shows strong growth in North America and Asia Pacific, driven by the presence of major technology hubs and early adoption of AI solutions. Europe and other regions are also experiencing significant growth, albeit at a potentially slightly slower pace. The forecast period (2025-2033) anticipates continued market expansion, fueled by technological breakthroughs, increased investment, and wider industry adoption. The market's evolution will be significantly shaped by ongoing research into explainable AI, improved model security, and the development of more efficient training techniques. Companies will likely focus on developing specialized AI models tailored to specific industry needs, offering customized solutions to further accelerate market growth. The ongoing development of more accessible and user-friendly AI tools is expected to widen the adoption across different segments, leading to continuous expansion throughout the forecast period.
https://www.datainsightsmarket.com/privacy-policy
The open-source deep learning platform market is experiencing robust growth, projected to reach a substantial size driven by several key factors. The market's Compound Annual Growth Rate (CAGR) of 15.3% from 2019 to 2024 indicates a significant upward trajectory. This growth is fueled by the increasing adoption of deep learning across various sectors, including healthcare, finance, and autonomous vehicles. The accessibility and flexibility of open-source platforms, coupled with the vibrant community support and continuous innovation, are major contributors to this expansion. Leading technology companies like Google, Meta, Microsoft, NVIDIA, and OpenAI are actively involved in developing and supporting open-source deep learning frameworks, further boosting the market's momentum. The availability of pre-trained models and tools simplifies the development process, lowering the barrier to entry for both individuals and organizations. This democratization of AI development is accelerating the pace of innovation and driving wider adoption. Looking ahead to 2033, the market is expected to continue its impressive growth trajectory. The expanding data volume, the rising need for advanced analytics, and the increasing demand for customized AI solutions will be key drivers. The continued evolution of deep learning algorithms and hardware capabilities will further enhance the capabilities of open-source platforms. While potential restraints such as security concerns and the need for specialized expertise exist, the overall market outlook remains highly positive, promising substantial expansion and transformation across various industries. The $5887 million market size in 2025 provides a solid baseline for projecting future growth based on the 15.3% CAGR. This suggests a substantial market value within the forecast period (2025-2033).