MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
language: en
Model Description: GPT-2 Large is the 774M-parameter version of GPT-2, a transformer-based language model created and released by OpenAI. The model is pretrained on English text using a causal language modeling (CLM) objective.
Use the code below to get started with the model. You can use this model directly with a pipeline for text generation. Since the generation relies on some randomness, we set a seed for reproducibility:
>>> from transformers import pipeline, set_seed
>>> generator = pipeline('text-generation', model='gpt2-large')
>>> set_seed(42)
>>> generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)
[{'generated_text': "Hello, I'm a language model, I can do language modeling. In fact, this is one of the reasons I use languages. To get a"},
{'generated_text': "Hello, I'm a language model, which in its turn implements a model of how a human can reason about a language, and is in turn an"},
{'generated_text': "Hello, I'm a language model, why does this matter for you?
When I hear new languages, I tend to start thinking in terms"},
{'generated_text': "Hello, I'm a language model, a functional language...
I don't need to know anything else. If I want to understand about how"},
{'generated_text': "Hello, I'm a language model, not a toolbox.
In a nutshell, a language model is a set of attributes that define how"}]
Here is how to use this model to get the features of a given text in PyTorch:
from transformers import GPT2Tokenizer, GPT2Model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')
model = GPT2Model.from_pretrained('gpt2-large')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
and in TensorFlow:
from transformers import GPT2Tokenizer, TFGPT2Model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')
model = TFGPT2Model.from_pretrained('gpt2-large')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
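In both frameworks, the "features" of the input text are the model's final hidden states, exposed on the returned output object. As a minimal illustration (using the standard Transformers output attribute):
features = output.last_hidden_state  # tensor of shape (batch_size, sequence_length, hidden_size)
print(features.shape)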
In their model card about GPT-2, OpenAI wrote:
The primary intended users of these models are AI researchers and practitioners.
We primarily imagine these language models will be used by researchers to better understand the behaviors, capabilities, biases, and constraints of large-scale generative language models.
In their model card about GPT-2, OpenAI wrote:
Here are some secondary use cases we believe are likely:
- Writing assistance: Grammar assistance, autocompletion (for normal prose or code)
- Creative writing and art: exploring the generation of creative, fictional texts; aiding creation of poetry and other literary art.
- Entertainment: Creation of games, chat bots, and amusing generations.
In their model card about GPT-2, OpenAI wrote:
Because large-scale language models like GPT-2 ...
Dataset Card for predicted_labels
These photos are used in the FiftyOne getting-started webinar. Each image has a prediction label that was generated by self-supervised classification with an OpenCLIP model (https://github.com/thesteve0/fiftyone-getting-started/blob/main/5_generating_labels.py). The labels were then manually cleaned to produce the ground-truth label (https://github.com/thesteve0/fiftyone-getting-started/blob/main/6_clean_labels.md). They are 300 public domain photos… See the full description on the dataset page: https://huggingface.co/datasets/Voxel51/getting-started-labeled-photos.
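To browse these photos locally, FiftyOne can pull datasets directly from the Hugging Face Hub. The snippet below is a minimal sketch and assumes a FiftyOne release that includes the Hub integration:
import fiftyone as fo
import fiftyone.utils.huggingface as fouh

# Download the labeled photos from the Hub and open them in the FiftyOne App
dataset = fouh.load_from_hub("Voxel51/getting-started-labeled-photos")
session = fo.launch_app(dataset)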
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Huggingface 42 Lerobot is a dataset for object detection tasks - it contains Tokens annotations for 1,411 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
The dataset consists of 59166 jsonl files and is ~895GB compressed. It is a cleaned and deduplicated version of Together's RedPajama. Check out our blog post explaining our methods, our code on GitHub, and join the discussion on the Cerebras Discord.
Getting Started
You can download the dataset using Hugging Face datasets:
from datasets import load_dataset
ds = load_dataset("cerebras/SlimPajama-627B")
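Given the ~895GB compressed size, you may prefer to stream examples instead of downloading everything; a minimal sketch using the standard datasets streaming mode (the train split name is assumed):
from datasets import load_dataset

# Stream records rather than materializing the full corpus on disk
ds = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)
print(next(iter(ds)))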
Background
Today we are releasing SlimPajama – the largest… See the full description on the dataset page: https://huggingface.co/datasets/cerebras/SlimPajama-627B.
Academic Free License 3.0 (AFL-3.0): https://choosealicense.com/licenses/afl-3.0/
Dataset Card for ImageCoDe
To get started quickly, load descriptions via:
from datasets import load_dataset
examples = load_dataset('BennoKrojer/ImageCoDe')
And download image_sets.zip for all image sets (each directory consisting of 10 images).
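One way to fetch the image archive programmatically is the huggingface_hub client; this is a sketch, and the exact filename/location of the archive inside the repository is an assumption based on the description above:
from huggingface_hub import hf_hub_download

# Download image_sets.zip from the dataset repository (path within the repo assumed)
zip_path = hf_hub_download(repo_id="BennoKrojer/ImageCoDe", filename="image_sets.zip", repo_type="dataset")
print(zip_path)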
Dataset Summary
We introduce ImageCoDe, a vision-and-language benchmark that requires contextual language understanding in the form of pragmatics, temporality, long descriptions and visual nuances. The task: Given a detailed… See the full description on the dataset page: https://huggingface.co/datasets/BennoKrojer/ImageCoDe.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
👋 Overview
🚀 Introduction
📋 Getting Started with the data
🔥 problem_and_human_solutions_list.jsonl
🔥 complexity_labels_light.jsonl
🔥 complexity_labels_full.jsonl
🔥 time_complexity_test_set.jsonl
🔥 space_complexity_test_set.jsonl
License
📝 Citation
🚀 Introduction
BigO(Bench) is a benchmark of ~300 code problems to be solved in Python, along with 3,105 coding problems… See the full description on the dataset page: https://huggingface.co/datasets/facebook/BigOBench.
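Since the files listed above are JSON Lines, one way to read a single split is to download it and parse it line by line; this is a sketch, and the exact path of the file inside the repository is an assumption:
from huggingface_hub import hf_hub_download
import json

# Fetch one of the JSONL files listed above and load its records
path = hf_hub_download(repo_id="facebook/BigOBench", filename="complexity_labels_light.jsonl", repo_type="dataset")
with open(path) as f:
    records = [json.loads(line) for line in f]
print(len(records), list(records[0].keys()))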
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Card for H4 Stack Exchange Preferences Dataset
Dataset Summary
This dataset contains questions and answers from the Stack Overflow Data Dump for the purpose of preference model training. Importantly, the questions have been filtered to fit the following criterion for preference models (following Askell et al. 2021): each question has >=2 answers. This data could also be used for instruction fine-tuning and language model training. The questions are grouped with… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences.
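A quick way to inspect the grouped question/answer structure without downloading everything is to stream a single record (no field names are assumed here, they are simply printed):
from datasets import load_dataset

# Stream one record to see how answers and preference scores are grouped per question
ds = load_dataset("HuggingFaceH4/stack-exchange-preferences", split="train", streaming=True)
example = next(iter(ds))
print(example.keys())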
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This is the blind eval dataset of high-quality, diverse, human-written instructions with demonstrations. We will be using this for step 3 evaluations in our RLHF pipeline.
ftopal/huggingface-models-raw dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for huggingface-smol-course-instruction-tuning-dataset
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce, in distilabel, the pipeline that generated it, using the distilabel CLI:
distilabel pipeline run --config "https://huggingface.co/datasets/Neooooo/huggingface-smol-course-instruction-tuning-dataset/raw/main/pipeline.yaml"
or explore the configuration: distilabel… See the full description on the dataset page: https://huggingface.co/datasets/Neooooo/huggingface-smol-course-instruction-tuning-dataset.
Open Data Commons Attribution License (ODC-By): https://choosealicense.com/licenses/odc-by/
📚 FineWeb-Edu
1.3 trillion tokens of the finest educational data the 🌐 web has to offer
Paper: https://arxiv.org/abs/2406.17557
What is it?
The 📚 FineWeb-Edu dataset consists of 1.3T tokens (with a larger 5.4T-token variant, FineWeb-Edu-score-2) of educational web pages filtered from the 🍷 FineWeb dataset. This is the 1.3 trillion token version. To enhance FineWeb's quality, we developed an educational quality classifier using annotations generated by Llama3-70B-Instruct. We then… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu.
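Because the full corpus is very large, it is often practical to stream one of the smaller sample configurations; the config name below is an assumption, so check the dataset page for the available subsets:
from datasets import load_dataset

# Stream an (assumed) 10B-token sample configuration instead of the full 1.3T-token corpus
fw = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT", split="train", streaming=True)
print(next(iter(fw))["text"][:200])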
Common Corpus
Full data paper
Common Corpus is the largest open and permissively licensed text dataset, comprising 2 trillion tokens (1,998,647,168,282 tokens). It is a diverse dataset, consisting of books, newspapers, scientific articles, government and legal documents, code, and more. Common Corpus has been created by Pleias in association with several partners and contributed in-kind to the Current AI initiative. Common Corpus differs from existing open datasets in that it is:… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/common_corpus.
Other license: https://choosealicense.com/licenses/other/
StarCoder Training Dataset
Dataset description
This is the dataset used for training StarCoder and StarCoderBase. It contains 783GB of code in 86 programming languages, and includes 54GB of GitHub Issues, 13GB of Jupyter notebooks (as scripts and text-code pairs), and 32GB of GitHub commits, amounting to approximately 250 billion tokens.
Dataset creation
The creation and filtering of The Stack is explained in the original dataset card; we additionally decontaminate and… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/starcoderdata.
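Because the data is organized per language, a single language subset can be loaded on its own; this is a sketch, the data_dir value is an assumption, and access may require accepting the dataset's terms on the Hub:
from datasets import load_dataset

# Load (stream) only the Python portion of the training data
ds = load_dataset("bigcode/starcoderdata", data_dir="python", split="train", streaming=True)
print(next(iter(ds)).keys())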
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Summary
databricks-dolly-15k is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. This dataset can be used for any purpose, whether academic or commercial, under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported… See the full description on the dataset page: https://huggingface.co/datasets/databricks/databricks-dolly-15k.
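A minimal way to load the records with Hugging Face datasets (the field names in the comment reflect the published schema, but verify against the dataset card):
from datasets import load_dataset

# databricks-dolly-15k ships as a single train split of ~15k instruction-following records
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")
print(dolly[0])  # expected keys include instruction, context, response, category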
Gretel Synthetic Safety Alignment Dataset
This dataset is a synthetically generated collection of prompt-response-safe_response triplets that can be used for aligning language models. It was created with Gretel Navigator's AI Data Designer, using small language models such as ibm-granite/granite-3.0-8b, ibm-granite/granite-3.0-8b-instruct, Qwen/Qwen2.5-7B, Qwen/Qwen2.5-7B-instruct, and mistralai/Mistral-Nemo-Instruct-2407.
Dataset Statistics
Total Records: 8,361 Total… See the full description on the dataset page: https://huggingface.co/datasets/gretelai/gretel-safety-alignment-en-v1.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Dataset Card for Boolq
Dataset Summary
BoolQ is a question answering dataset for yes/no questions containing 15,942 examples. These questions are naturally occurring: they are generated in unprompted and unconstrained settings. Each example is a triplet of (question, passage, answer), with the title of the page as optional additional context. The text-pair classification setup is similar to existing natural language inference tasks.
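To see the (question, passage, answer) triplets concretely, a minimal sketch:
from datasets import load_dataset

# BoolQ provides train and validation splits; each row is a yes/no question with a supporting passage
boolq = load_dataset("google/boolq", split="train")
print(boolq[0])  # expected keys: question, passage, answer (a boolean)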
Supported Tasks and… See the full description on the dataset page: https://huggingface.co/datasets/google/boolq.
MELD-ST: An Emotion-aware Speech Translation Dataset
Paper: https://arxiv.org/abs/2405.13233
Overview
This emotion-aware speech translation dataset is a multi-language dataset extracted from the TV show "Friends." It includes English, Japanese, and German subtitles along with corresponding timestamps. This dataset is designed for natural language processing tasks.
Contents
The dataset is partitioned into train, test, and development subsets to streamline… See the full description on the dataset page: https://huggingface.co/datasets/ku-nlp/MELD-ST.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for MedMCQA
Dataset Summary
MedMCQA is a large-scale Multiple-Choice Question Answering (MCQA) dataset designed to address real-world medical entrance exam questions. More than 194k high-quality AIIMS & NEET PG entrance exam MCQs covering 2.4k healthcare topics and 21 medical subjects were collected, with an average token length of 12.77 and high topical diversity. Each sample contains a question, correct answer(s), and other options which require… See the full description on the dataset page: https://huggingface.co/datasets/openlifescienceai/medmcqa.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Plot2Code Benchmark
The Plot2Code benchmark is now open-sourced on Hugging Face (ARC Lab) and GitHub. More information can be found in our paper.
Why do we need Plot2Code?
🧐 While MLLMs have demonstrated potential in visual contexts, their capabilities in visual coding tasks have not been thoroughly evaluated. Plot2Code offers a platform for comprehensive assessment of these models.
🤗 To enable individuals to ascertain the proficiency of AI assistants in generating code that… See the full description on the dataset page: https://huggingface.co/datasets/TencentARC/Plot2Code.
Other license: https://choosealicense.com/licenses/other/
Dataset Card for "imdb"
Dataset Summary
Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.
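The splits described above can be loaded directly; as a minimal sketch (the unsupervised split holds the additional unlabeled reviews):
from datasets import load_dataset

# 25,000 labeled training reviews, 25,000 labeled test reviews, plus unlabeled text
imdb = load_dataset("stanfordnlp/imdb")
print(imdb)  # expected splits: train, test, unsupervised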
Supported Tasks and Leaderboards
More Information Needed
Languages
More Information Needed
Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/imdb.