100+ datasets found
  1. AIMO-24: Model (openai-community/gpt2-large)

    • kaggle.com
    zip
    Updated Apr 7, 2024
    Cite
    Dinh Thoai Tran @ randrise.com (2024). AIMO-24: Model (openai-community/gpt2-large) [Dataset]. https://www.kaggle.com/datasets/dinhttrandrise/aimo-24-model-openai-community-gpt2-large
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Apr 7, 2024
    Authors
    Dinh Thoai Tran @ randrise.com
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    language: en

    license: mit

    GPT-2 Large

    Table of Contents

    Model Details

    Model Description: GPT-2 Large is the 774M-parameter version of GPT-2, a transformer-based language model created and released by OpenAI. It is pretrained on English text using a causal language modeling (CLM) objective.

    How to Get Started with the Model

    Use the code below to get started with the model. You can use this model directly with a pipeline for text generation. Since the generation relies on some randomness, we set a seed for reproducibility:

    >>> from transformers import pipeline, set_seed
    >>> generator = pipeline('text-generation', model='gpt2-large')
    >>> set_seed(42)
    >>> generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)
    
    [{'generated_text': "Hello, I'm a language model, I can do language modeling. In fact, this is one of the reasons I use languages. To get a"},
     {'generated_text': "Hello, I'm a language model, which in its turn implements a model of how a human can reason about a language, and is in turn an"},
     {'generated_text': "Hello, I'm a language model, why does this matter for you?
    
    When I hear new languages, I tend to start thinking in terms"},
     {'generated_text': "Hello, I'm a language model, a functional language...
    
    I don't need to know anything else. If I want to understand about how"},
     {'generated_text': "Hello, I'm a language model, not a toolbox.
    
    In a nutshell, a language model is a set of attributes that define how"}]
    

    Here is how to use this model to get the features of a given text in PyTorch:

    from transformers import GPT2Tokenizer, GPT2Model
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')
    model = GPT2Model.from_pretrained('gpt2-large')
    text = "Replace me by any text you'd like."
    encoded_input = tokenizer(text, return_tensors='pt')
    output = model(**encoded_input)
    

    and in TensorFlow:

    from transformers import GPT2Tokenizer, TFGPT2Model
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')
    model = TFGPT2Model.from_pretrained('gpt2-large')
    text = "Replace me by any text you'd like."
    encoded_input = tokenizer(text, return_tensors='tf')
    output = model(encoded_input)
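
    The PyTorch example returns per-token hidden states in output.last_hidden_state. A minimal sketch of pooling them into a single text embedding (the mean-pooling step is our own illustration, not part of the model card; gpt2-large's hidden size is 1280):

    import torch
    from transformers import GPT2Tokenizer, GPT2Model

    # Same setup as the PyTorch snippet above
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')
    model = GPT2Model.from_pretrained('gpt2-large')
    encoded_input = tokenizer("Replace me by any text you'd like.", return_tensors='pt')
    with torch.no_grad():
        output = model(**encoded_input)

    # Mean-pool the per-token hidden states into one vector per text
    embedding = output.last_hidden_state.mean(dim=1)  # shape: (1, 1280)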
    

    Uses

    Direct Use

    In their model card about GPT-2, OpenAI wrote:

    The primary intended users of these models are AI researchers and practitioners.

    We primarily imagine these language models will be used by researchers to better understand the behaviors, capabilities, biases, and constraints of large-scale generative language models.

    Downstream Use

    In their model card about GPT-2, OpenAI wrote:

    Here are some secondary use cases we believe are likely:

    • Writing assistance: Grammar assistance, autocompletion (for normal prose or code)
    • Creative writing and art: exploring the generation of creative, fictional texts; aiding creation of poetry and other literary art.
    • Entertainment: Creation of games, chat bots, and amusing generations.

    Misuse and Out-of-scope Use

    In their model card about GPT-2, OpenAI wrote:

    Because large-scale language models like GPT-2 ...

  2. getting-started-labeled-photos

    • huggingface.co
    Updated Jan 3, 2025
    Cite
    Voxel51 (2025). getting-started-labeled-photos [Dataset]. https://huggingface.co/datasets/Voxel51/getting-started-labeled-photos
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 3, 2025
    Dataset authored and provided by
    Voxel51
    Description

    Dataset Card for predicted_labels

    These photos are used in the FiftyOne getting started webinar. Each image has a prediction label that was generated by self-supervised classification with an OpenCLIP model (https://github.com/thesteve0/fiftyone-getting-started/blob/main/5_generating_labels.py) and then manually cleaned to produce the ground-truth label (https://github.com/thesteve0/fiftyone-getting-started/blob/main/6_clean_labels.md). They are 300 public domain photos… See the full description on the dataset page: https://huggingface.co/datasets/Voxel51/getting-started-labeled-photos.
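
    One way to browse these photos and compare predictions against the ground truth is FiftyOne's Hugging Face Hub integration. A minimal sketch, assuming a recent fiftyone release that ships fiftyone.utils.huggingface:

    import fiftyone as fo
    from fiftyone.utils.huggingface import load_from_hub

    # Download the dataset from the Hub into a local FiftyOne dataset
    dataset = load_from_hub("Voxel51/getting-started-labeled-photos")

    # Open the FiftyOne app to inspect prediction vs. ground-truth labels
    session = fo.launch_app(dataset)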

  3. Huggingface 42 Lerobot Dataset

    • universe.roboflow.com
    zip
    Updated Jun 19, 2025
    Cite
    huggingface42lerobot (2025). Huggingface 42 Lerobot Dataset [Dataset]. https://universe.roboflow.com/huggingface42lerobot/huggingface-42-lerobot/dataset/9
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 19, 2025
    Dataset authored and provided by
    huggingface42lerobot
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Tokens (bounding boxes)
    Description

    Huggingface 42 Lerobot

    ## Overview
    
    Huggingface 42 Lerobot is a dataset for object detection tasks - it contains Tokens annotations for 1,411 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License

    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
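
    One way to pull this dataset into a Python project is the roboflow package. A minimal sketch, assuming you have a Roboflow API key; the workspace, project, and version identifiers are read off the citation URL above, and the export format is our own choice:

    from roboflow import Roboflow

    rf = Roboflow(api_key="YOUR_API_KEY")  # placeholder; use your own key
    project = rf.workspace("huggingface42lerobot").project("huggingface-42-lerobot")
    dataset = project.version(9).download("coco")  # export format is our choice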
    
  4. SlimPajama-627B

    • huggingface.co
    • opendatalab.com
    Updated Oct 2, 2012
    Cite
    Cerebras (2012). SlimPajama-627B [Dataset]. https://huggingface.co/datasets/cerebras/SlimPajama-627B
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 2, 2012
    Dataset authored and provided by
    Cerebras
    Description

    The dataset consists of 59166 jsonl files and is ~895GB compressed. It is a cleaned and deduplicated version of Together's RedPajama. Check out our blog post explaining our methods, our code on GitHub, and join the discussion on the Cerebras Discord.

      Getting Started
    

    You can download the dataset using Hugging Face datasets:

    from datasets import load_dataset
    ds = load_dataset("cerebras/SlimPajama-627B")
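
    Given the ~895GB compressed size, streaming may be preferable to a full download. A minimal sketch using the standard streaming mode of Hugging Face datasets (the "text" field name follows the RedPajama-style jsonl schema and is worth verifying against the files):

    from datasets import load_dataset

    ds = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)
    for example in ds:
        print(example["text"][:200])  # peek at the first document
        break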

      Background
    

    Today we are releasing SlimPajama – the largest… See the full description on the dataset page: https://huggingface.co/datasets/cerebras/SlimPajama-627B.

  5. ImageCoDe

    • huggingface.co
    Updated Aug 15, 2022
    Cite
    Benno Krojer (2022). ImageCoDe [Dataset]. https://huggingface.co/datasets/BennoKrojer/ImageCoDe
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 15, 2022
    Authors
    Benno Krojer
    License

    Academic Free License v3.0 (AFL-3.0): https://choosealicense.com/licenses/afl-3.0/

    Description

    Dataset Card for ImageCoDe

    To get started quickly, load the descriptions via:

    from datasets import load_dataset
    examples = load_dataset('BennoKrojer/ImageCoDe')

    And download image_sets.zip for all image sets (each directory consisting of 10 images).

      Dataset Summary
    

    We introduce ImageCoDe, a vision-and-language benchmark that requires contextual language understanding in the form of pragmatics, temporality, long descriptions and visual nuances. The task: Given a detailed… See the full description on the dataset page: https://huggingface.co/datasets/BennoKrojer/ImageCoDe.

  6. BigOBench

    • huggingface.co
    Updated Mar 20, 2025
    Cite
    AI at Meta (2025). BigOBench [Dataset]. https://huggingface.co/datasets/facebook/BigOBench
    Explore at:
    Dataset updated
    Mar 20, 2025
    Dataset authored and provided by
    AI at Meta
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    👋 Overview

    🚀 Introduction

    📋 Getting Started with the data

    🔥 problem_and_human_solutions_list.jsonl

    🔥 complexity_labels_light.jsonl

    🔥 complexity_labels_full.jsonl

    🔥 time_complexity_test_set.jsonl

    🔥 space_complexity_test_set.jsonl

    License

    📝 Citation

      🚀 Introduction
    

    BigO(Bench) is a benchmark of ~300 code problems to be solved in Python, along with 3,105 coding problems… See the full description on the dataset page: https://huggingface.co/datasets/facebook/BigOBench.
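
    To peek at one of the JSONL files listed above without materializing the whole dataset, huggingface_hub can fetch individual files. A minimal sketch; the repo-root location of problem_and_human_solutions_list.jsonl is an assumption and may need adjusting to the actual repo layout:

    import json
    from huggingface_hub import hf_hub_download

    path = hf_hub_download(
        repo_id="facebook/BigOBench",
        filename="problem_and_human_solutions_list.jsonl",  # assumed path
        repo_type="dataset",
    )
    with open(path) as f:
        first_record = json.loads(f.readline())
    print(first_record.keys())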

  7. stack-exchange-preferences

    • huggingface.co
    • opendatalab.com
    Cite
    Hugging Face H4, stack-exchange-preferences [Dataset]. https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset provided by
    Hugging Face: https://huggingface.co/
    Authors
    Hugging Face H4
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for H4 Stack Exchange Preferences Dataset

      Dataset Summary
    

    This dataset contains questions and answers from the Stack Overflow Data Dump for the purpose of preference model training. Importantly, the questions have been filtered to fit the following criterion for preference models (following closely from Askell et al. 2021): each question has >=2 answers. This data could also be used for instruction fine-tuning and language model training. The questions are grouped with… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences.
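
    Since the stated filter is questions with at least two answers, reproducing that criterion over a raw dump might look like the sketch below; the "answers" field name is an assumption about the schema, so check the dataset viewer first:

    from datasets import load_dataset

    ds = load_dataset("HuggingFaceH4/stack-exchange-preferences",
                      split="train", streaming=True)

    # Keep only questions with >= 2 answers (assumed "answers" list field)
    multi_answer = (ex for ex in ds if len(ex["answers"]) >= 2)
    print(next(multi_answer).keys())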

  8. instruction-dataset

    • huggingface.co
    • opendatalab.com
    Updated Feb 10, 2023
    Cite
    Hugging Face H4 (2023). instruction-dataset [Dataset]. https://huggingface.co/datasets/HuggingFaceH4/instruction-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 10, 2023
    Dataset provided by
    Hugging Face: https://huggingface.co/
    Authors
    Hugging Face H4
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This is the blind eval dataset of high-quality, diverse, human-written instructions with demonstrations. We will be using this for step 3 evaluations in our RLHF pipeline.

  9. huggingface-models-raw

    • huggingface.co
    Updated Mar 2, 2022
    + more versions
    Cite
    Fazil (2022). huggingface-models-raw [Dataset]. https://huggingface.co/datasets/ftopal/huggingface-models-raw
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 2, 2022
    Authors
    Fazil
    Description

    The ftopal/huggingface-models-raw dataset is hosted on Hugging Face and was contributed by the HF Datasets community.

  10. huggingface-smol-course-instruction-tuning-dataset

    • huggingface.co
    Updated Apr 25, 2025
    + more versions
    Cite
    Hu (2025). huggingface-smol-course-instruction-tuning-dataset [Dataset]. https://huggingface.co/datasets/Neooooo/huggingface-smol-course-instruction-tuning-dataset
    Explore at:
    Dataset updated
    Apr 25, 2025
    Authors
    Hu
    Description

    Dataset Card for huggingface-smol-course-instruction-tuning-dataset

    This dataset has been created with distilabel.

      Dataset Summary
    

    This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel, using the distilabel CLI:

    distilabel pipeline run --config "https://huggingface.co/datasets/Neooooo/huggingface-smol-course-instruction-tuning-dataset/raw/main/pipeline.yaml"

    or explore the configuration: distilabel… See the full description on the dataset page: https://huggingface.co/datasets/Neooooo/huggingface-smol-course-instruction-tuning-dataset.

  11. fineweb-edu

    • huggingface.co
    Updated Jan 3, 2025
    + more versions
    Cite
    FineData (2025). fineweb-edu [Dataset]. http://doi.org/10.57967/hf/2497
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 3, 2025
    Dataset authored and provided by
    FineData
    License

    Open Data Commons Attribution License (ODC-By): https://choosealicense.com/licenses/odc-by/

    Description

    📚 FineWeb-Edu

    1.3 trillion tokens of the finest educational data the 🌐 web has to offer

    Paper: https://arxiv.org/abs/2406.17557

      What is it?
    

    The 📚 FineWeb-Edu dataset consists of 1.3T tokens (with a 5.4T-token variant, FineWeb-Edu-score-2) of educational web pages filtered from the 🍷 FineWeb dataset; this is the 1.3-trillion-token version. To enhance FineWeb's quality, we developed an educational quality classifier using annotations generated by Llama3-70B-Instruct. We then… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu.
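
    Rather than pulling the full 1.3T-token dump, smaller sample configurations of FineWeb-Edu can be streamed. A minimal sketch; the "sample-10BT" config name is an assumption to verify against the repo:

    from datasets import load_dataset

    fw = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT",
                      split="train", streaming=True)
    print(next(iter(fw))["text"][:200])  # "text" field assumed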

  12. common_corpus

    • huggingface.co
    Updated Nov 13, 2024
    Cite
    PleIAs (2024). common_corpus [Dataset]. https://huggingface.co/datasets/PleIAs/common_corpus
    Explore at:
    Dataset updated
    Nov 13, 2024
    Dataset authored and provided by
    PleIAs
    Description

    Common Corpus

    Full data paper

    Common Corpus is the largest open and permissibly licensed text dataset, comprising 2 trillion tokens (1,998,647,168,282 tokens). It is a diverse dataset, consisting of books, newspapers, scientific articles, government and legal documents, code, and more. Common Corpus has been created by Pleias in association with several partners and contributed in-kind to the Current AI initiative. Common Corpus differs from existing open datasets in that it is:… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/common_corpus.

  13. starcoderdata

    • huggingface.co
    + more versions
    Cite
    BigCode, starcoderdata [Dataset]. https://huggingface.co/datasets/bigcode/starcoderdata
    Explore at:
    Dataset authored and provided by
    BigCode
    License

    Other: https://choosealicense.com/licenses/other/

    Description

    StarCoder Training Dataset

      Dataset description
    

    This is the dataset used for training StarCoder and StarCoderBase. It contains 783GB of code in 86 programming languages, and includes 54GB of GitHub Issues + 13GB of Jupyter notebooks in scripts and text-code pairs, and 32GB of GitHub commits, which is approximately 250 billion tokens.

      Dataset creation
    

    The creation and filtering of The Stack is explained in the original dataset card; we additionally decontaminate and… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/starcoderdata.
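
    Because the data is organized per programming language, a single language can be loaded on its own. A minimal sketch, assuming a per-language directory layout (the "python" directory and "content" field names are assumptions to verify against the repo):

    from datasets import load_dataset

    ds = load_dataset("bigcode/starcoderdata", data_dir="python",
                      split="train", streaming=True)
    print(next(iter(ds))["content"][:200])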

  14. databricks-dolly-15k

    • huggingface.co
    + more versions
    Cite
    Databricks, databricks-dolly-15k [Dataset]. https://huggingface.co/datasets/databricks/databricks-dolly-15k
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset authored and provided by
    Databricks: http://databricks.com/
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Summary

    databricks-dolly-15k is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. This dataset can be used for any purpose, whether academic or commercial, under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported… See the full description on the dataset page: https://huggingface.co/datasets/databricks/databricks-dolly-15k.
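
    Each record pairs an instruction with an optional context and a response. A minimal loading sketch; the field names (instruction, context, response, category) reflect our understanding of the schema and should be verified in the dataset viewer:

    from datasets import load_dataset

    dolly = load_dataset("databricks/databricks-dolly-15k", split="train")
    row = dolly[0]
    print(row["category"])                  # e.g. brainstorming, closed QA, ...
    print(row["instruction"], "->", row["response"][:100])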

  15. gretel-safety-alignment-en-v1

    • huggingface.co
    Updated Dec 14, 2024
    Cite
    gretel-safety-alignment-en-v1 [Dataset]. https://huggingface.co/datasets/gretelai/gretel-safety-alignment-en-v1
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 14, 2024
    Dataset provided by
    Gretel.ai
    Description

    Gretel Synthetic Safety Alignment Dataset

    This dataset is a synthetically generated collection of prompt-response-safe_response triplets that can be used for aligning language models. It was created with Gretel Navigator's AI Data Designer, using small language models such as ibm-granite/granite-3.0-8b, ibm-granite/granite-3.0-8b-instruct, Qwen/Qwen2.5-7B, Qwen/Qwen2.5-7B-instruct, and mistralai/Mistral-Nemo-Instruct-2407.

      Dataset Statistics
    

    Total Records: 8,361 Total… See the full description on the dataset page: https://huggingface.co/datasets/gretelai/gretel-safety-alignment-en-v1.

  16. boolq

    • huggingface.co
    Updated Dec 15, 2014
    Cite
    Google (2014). boolq [Dataset]. https://huggingface.co/datasets/google/boolq
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 15, 2014
    Dataset authored and provided by
    Google: http://google.com/
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Dataset Card for Boolq

      Dataset Summary
    

    BoolQ is a question answering dataset for yes/no questions containing 15,942 examples. These questions are naturally occurring; they are generated in unprompted and unconstrained settings. Each example is a triplet of (question, passage, answer), with the title of the page as optional additional context. The text-pair classification setup is similar to existing natural language inference tasks.

      Supported Tasks and… See the full description on the dataset page: https://huggingface.co/datasets/google/boolq.
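
    The (question, passage, answer) triplet described above maps directly onto the loaded schema. A minimal sketch, assuming those field names carry over unchanged:

    from datasets import load_dataset

    boolq = load_dataset("google/boolq", split="train")
    ex = boolq[0]
    print(ex["question"])
    print(ex["answer"])  # a yes/no boolean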
    
  17. MELD-ST

    • huggingface.co
    Updated May 26, 2024
    Cite
    Language Media Processing Lab at Kyoto University (2024). MELD-ST [Dataset]. https://huggingface.co/datasets/ku-nlp/MELD-ST
    Explore at:
    Dataset updated
    May 26, 2024
    Dataset authored and provided by
    Language Media Processing Lab at Kyoto University
    Description

    MELD-ST: An Emotion-aware Speech Translation Dataset

    Paper: https://arxiv.org/abs/2405.13233

      Overview
    

    This emotion-aware speech translation dataset is a multi-language dataset extracted from the TV show "Friends." It includes English, Japanese, and German subtitles along with corresponding timestamps. This dataset is designed for natural language processing tasks.

      Contents
    

    The dataset is partitioned into train, test, and development subsets to streamline… See the full description on the dataset page: https://huggingface.co/datasets/ku-nlp/MELD-ST.

  18. medmcqa

    • huggingface.co
    Updated May 22, 2022
    + more versions
    Cite
    Open Life Science AI (2022). medmcqa [Dataset]. https://huggingface.co/datasets/openlifescienceai/medmcqa
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 22, 2022
    Dataset authored and provided by
    Open Life Science AI
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for MedMCQA

      Dataset Summary
    

    MedMCQA is a large-scale Multiple-Choice Question Answering (MCQA) dataset designed to address real-world medical entrance exam questions. It contains more than 194k high-quality AIIMS & NEET PG entrance exam MCQs covering 2.4k healthcare topics and 21 medical subjects, collected with an average token length of 12.77 and high topical diversity. Each sample contains a question, correct answer(s), and other options which require… See the full description on the dataset page: https://huggingface.co/datasets/openlifescienceai/medmcqa.
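
    Each sample bundles a question with its answer options. A minimal sketch; the option and answer field names (opa..opd, with cop as the correct-option index) are assumptions about the schema, so check the dataset viewer:

    from datasets import load_dataset

    med = load_dataset("openlifescienceai/medmcqa", split="train")
    q = med[0]
    options = [q["opa"], q["opb"], q["opc"], q["opd"]]  # assumed field names
    print(q["question"], "->", options[q["cop"]])       # cop assumed 0-indexed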

  19. Plot2Code

    • huggingface.co
    Updated May 12, 2024
    Cite
    ARC Lab, Tencent PCG (2024). Plot2Code [Dataset]. https://huggingface.co/datasets/TencentARC/Plot2Code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 12, 2024
    Dataset authored and provided by
    ARC Lab, Tencent PCG
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Plot2Code Benchmark

    The Plot2Code benchmark is now open-sourced on Hugging Face (ARC Lab) and GitHub. More information can be found in our paper.

      Why do we need Plot2Code?
    

    🧐 While MLLMs have demonstrated potential in visual contexts, their capabilities in visual coding tasks have not been thoroughly evaluated. Plot2Code offers a platform for comprehensive assessment of these models.

    🤗 To enable individuals to ascertain the proficiency of AI assistants in generating code that… See the full description on the dataset page: https://huggingface.co/datasets/TencentARC/Plot2Code.

  20. Data from: imdb

    • huggingface.co
    Updated Aug 3, 2003
    + more versions
    Cite
    Stanford NLP (2003). imdb [Dataset]. https://huggingface.co/datasets/stanfordnlp/imdb
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 3, 2003
    Dataset authored and provided by
    Stanford NLP
    License

    Other: https://choosealicense.com/licenses/other/

    Description

    Dataset Card for "imdb"

      Dataset Summary
    

    Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.

      Supported Tasks and Leaderboards
    

    More Information Needed

      Languages
    

    More Information Needed

      Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/imdb.
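
    The 25,000/25,000 train/test split described above loads directly with Hugging Face datasets; the "text" and "label" field names are the standard ones for this dataset:

    from datasets import load_dataset

    imdb = load_dataset("stanfordnlp/imdb")
    sample = imdb["train"][0]
    print(sample["label"], sample["text"][:120])  # label: 0 = negative, 1 = positive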
    