14 datasets found
  1. CanItEdit

    • huggingface.co
    Updated Feb 21, 2024
    Cite
    Northeastern University Programming Research Lab (2024). CanItEdit [Dataset]. https://huggingface.co/datasets/nuprl/CanItEdit
    Dataset updated
    Feb 21, 2024
    Dataset authored and provided by
    Northeastern University Programming Research Lab
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Can It Edit? Evaluating the Ability of Large Language Models to Follow Code Editing Instructions

    CanItEdit is a benchmark for evaluating LLMs on instructional code editing, the task of updating a program given a natural language instruction. The benchmark contains 105 hand-crafted Python programs with before and after code blocks, two types of natural language instructions (descriptive and lazy), and a hidden test suite. The dataset’s dual natural language instructions test model… See the full description on the dataset page: https://huggingface.co/datasets/nuprl/CanItEdit.

  2. PIPE_Masks

    • huggingface.co
    Updated Jul 1, 2024
    Cite
    paint-by-inpaint (2024). PIPE_Masks [Dataset]. https://huggingface.co/datasets/paint-by-inpaint/PIPE_Masks
    Dataset updated
    Jul 1, 2024
    Dataset authored and provided by
    paint-by-inpaint
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Dataset Card for PIPE Masks Dataset

      Dataset Summary
    

    The PIPE (Paint by InPaint Edit) dataset is designed to enhance the efficacy of mask-free, instruction-following image editing models by providing a large-scale collection of image pairs and diverse object addition instructions. Here, we provide the masks used for the inpainting process to generate the source image for the PIPE dataset for both the train and test sets. Further details can be found in our project page… See the full description on the dataset page: https://huggingface.co/datasets/paint-by-inpaint/PIPE_Masks.

  3. VectorEdits

    • huggingface.co
    Updated Jun 7, 2025
    + more versions
    Cite
    MikronAI (2025). VectorEdits [Dataset]. https://huggingface.co/datasets/mikronai/VectorEdits
    Dataset updated
    Jun 7, 2025
    Dataset authored and provided by
    MikronAI
    Description

    VectorEdits: A Dataset and Benchmark for Instruction-Based Editing of Vector Graphics

    NOTE: Currently, only the test set has generated labels; the other sets will have them soon.

    Paper (Soon)

    We introduce a large-scale dataset for instruction-guided vector image editing, consisting of over 270,000 pairs of SVG images with natural language edit instructions. Our dataset enables training and evaluation of models that modify vector graphics based on textual commands. We describe the data… See the full description on the dataset page: https://huggingface.co/datasets/mikronai/VectorEdits.

  4. MagicBrush

    • huggingface.co
    Updated Jun 18, 2023
    + more versions
    Cite
    OSU NLP Group (2023). MagicBrush [Dataset]. https://huggingface.co/datasets/osunlp/MagicBrush
    Dataset updated
    Jun 18, 2023
    Dataset authored and provided by
    OSU NLP Group
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Dataset Card for MagicBrush

      Dataset Summary
    

    MagicBrush is the first large-scale, manually annotated, instruction-guided image editing dataset covering diverse scenarios: single-turn, multi-turn, mask-provided, and mask-free editing. MagicBrush comprises 10K (source image, instruction, target image) triples, which is sufficient to train large-scale image editing models. Please check our website to explore more visual results.

      Dataset Structure
    

    "img_id" (str):… See the full description on the dataset page: https://huggingface.co/datasets/osunlp/MagicBrush.

  5. Data from: ImageNet Dataset

    • paperswithcode.com
    Updated Apr 15, 2024
    Cite
    Jia Deng; Wei Dong; Richard Socher; Li-Jia Li; Kai Li; Fei-Fei Li (2021). ImageNet Dataset [Dataset]. https://paperswithcode.com/dataset/imagenet
    Dataset updated
    Apr 15, 2024
    Authors
    Jia Deng; Wei Dong; Richard Socher; Li-Jia Li; Kai Li; Fei-Fei Li
    Description

    The ImageNet dataset contains 14,197,122 images annotated according to the WordNet hierarchy. Since 2010, the dataset has been used in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), a benchmark in image classification and object detection. The publicly released dataset contains a set of manually annotated training images. A set of test images is also released, with the manual annotations withheld. ILSVRC annotations fall into one of two categories: (1) image-level annotation of a binary label for the presence or absence of an object class in the image, e.g., “there are cars in this image” but “there are no tigers,” and (2) object-level annotation of a tight bounding box and class label around an object instance in the image, e.g., “there is a screwdriver centered at position (20,25) with width of 50 pixels and height of 30 pixels”. The ImageNet project does not own the copyright of the images; therefore, only thumbnails and URLs of images are provided.

    Total number of non-empty WordNet synsets: 21,841
    Total number of images: 14,197,122
    Number of images with bounding box annotations: 1,034,908
    Number of synsets with SIFT features: 1,000
    Number of images with SIFT features: 1.2 million

  6. coedit

    • huggingface.co
    Updated Aug 19, 2023
    Cite
    Grammarly (2023). coedit [Dataset]. https://huggingface.co/datasets/grammarly/coedit
    Dataset updated
    Aug 19, 2023
    Dataset authored and provided by
    Grammarly (http://grammarly.com/)
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset Card for CoEdIT: Text Editing via Instruction Tuning

      Paper: CoEdIT: Text Editing by Task-Specific Instruction Tuning
      Authors: Vipul Raheja, Dhruv Kumar, Ryan Koo, Dongyeop Kang
      Project Repo: https://github.com/vipulraheja/coedit

      Dataset Summary
    

    This is the dataset that was used to train the CoEdIT text editing models. Full details of the dataset can be found in our paper.

      Dataset Structure
    

    The dataset is in JSON format.… See the full description on the dataset page: https://huggingface.co/datasets/grammarly/coedit.

  7. OpenOrca

    • huggingface.co
    • opendatalab.com
    Updated Jun 29, 2023
    + more versions
    Cite
    OpenOrca (2023). OpenOrca [Dataset]. https://huggingface.co/datasets/Open-Orca/OpenOrca
    Dataset updated
    Jun 29, 2023
    Dataset authored and provided by
    OpenOrca
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    🐋 The OpenOrca Dataset! 🐋

    We are thrilled to announce the release of the OpenOrca dataset! This rich collection of augmented FLAN data aligns, as best as possible, with the distributions outlined in the Orca paper. It has been instrumental in generating high-performing model checkpoints and serves as a valuable resource for all NLP researchers and developers!

      Official Models

      Mistral-7B-OpenOrca

    Our latest model, the first 7B to score better overall than all… See the full description on the dataset page: https://huggingface.co/datasets/Open-Orca/OpenOrca.

  8. dolly_hhrlhf

    • huggingface.co
    • opendatalab.com
    Updated Oct 15, 1997
    Cite
    Mosaic ML, Inc. (1997). dolly_hhrlhf [Dataset]. https://huggingface.co/datasets/mosaicml/dolly_hhrlhf
    Dataset updated
    Oct 15, 1997
    Dataset authored and provided by
    Mosaic ML, Inc.
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0) (https://creativecommons.org/licenses/by-sa/3.0/)
    License information was derived automatically

    Description

    Dataset Card for "dolly_hhrlhf"

    This dataset is a combination of Databricks' dolly-15k dataset and a filtered subset of Anthropic's HH-RLHF. It also includes a test split, which was missing in the original dolly set. That test set is composed of 200 randomly selected samples from dolly + 4,929 of the test set samples from HH-RLHF which made it through the filtering process. The train set contains 59,310 samples; 15,014 - 200 = 14,814 from Dolly, and the remaining 44,496 from… See the full description on the dataset page: https://huggingface.co/datasets/mosaicml/dolly_hhrlhf.
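
The split arithmetic quoted in this description can be checked with a short sketch. The counts are copied from the text above; the function and key names are hypothetical, the attribution of the 44,496 remainder to HH-RLHF is an assumption (the description is truncated at that point), and the test-split total is derived here rather than stated in the description.

```python
# Counts quoted in the dolly_hhrlhf description; names are illustrative only.
TRAIN_TOTAL = 59_310   # stated size of the train split
DOLLY_TOTAL = 15_014   # stated size of the original dolly set
DOLLY_TEST = 200       # dolly samples moved into the test split
HH_RLHF_TEST = 4_929   # filtered HH-RLHF test samples


def split_breakdown():
    """Recompute the per-source sizes implied by the quoted counts."""
    dolly_train = DOLLY_TOTAL - DOLLY_TEST
    # Assumption: everything in train that is not from dolly comes from HH-RLHF.
    hh_rlhf_train = TRAIN_TOTAL - dolly_train
    # Derived value, not stated in the description.
    test_total = DOLLY_TEST + HH_RLHF_TEST
    return {
        "dolly_train": dolly_train,
        "hh_rlhf_train": hh_rlhf_train,
        "test_total": test_total,
    }


if __name__ == "__main__":
    for name, count in split_breakdown().items():
        print(f"{name}: {count:,}")
```

The recomputed remainder matches the 44,496 figure given in the description, so the quoted numbers are internally consistent.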

  9. Data from: imdb

    • huggingface.co
    Updated Aug 3, 2003
    + more versions
    Cite
    Stanford NLP (2003). imdb [Dataset]. https://huggingface.co/datasets/stanfordnlp/imdb
    Dataset updated
    Aug 3, 2003
    Dataset authored and provided by
    Stanford NLP
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for "imdb"

      Dataset Summary
    

    Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.

      Supported Tasks and Leaderboards
    

    More Information Needed

      Languages
    

    More Information Needed

      Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/imdb.
    
  10. toxic-chat

    • huggingface.co
    Updated Jan 25, 2024
    Cite
    Large Model Systems Organization (2024). toxic-chat [Dataset]. https://huggingface.co/datasets/lmsys/toxic-chat
    Dataset updated
    Jan 25, 2024
    Dataset authored and provided by
    Large Model Systems Organization
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
    License information was derived automatically

    Description

    Update

    [01/31/2024] We update the OpenAI Moderation API results for ToxicChat (0124) based on their updated moderation model on Jan 25, 2024.
    [01/28/2024] We release an official T5-Large model trained on ToxicChat (toxicchat0124). Go and check it for your baseline comparison!
    [01/19/2024] We have a new version of ToxicChat (toxicchat0124)!

      Content
    

    This dataset contains toxicity annotations on 10K user prompts collected from the Vicuna online demo. We utilize a human-AI… See the full description on the dataset page: https://huggingface.co/datasets/lmsys/toxic-chat.

  11. commit-msg-edits

    • huggingface.co
    Updated May 1, 2024
    Cite
    JetBrains Research (2024). commit-msg-edits [Dataset]. https://huggingface.co/datasets/JetBrains-Research/commit-msg-edits
    Dataset updated
    May 1, 2024
    Dataset provided by
    JetBrains (http://jetbrains.com/)
    Authors
    JetBrains Research
    Description

    ✍️ Commit Message Edits Dataset

    This dataset is a collection of expert-labeled commit message edits contributed via the Commit Message Editing app presented in Towards Realistic Evaluation of Commit Message Generation by Matching Online and Offline Settings. Labelers were presented with GPT-4-generated messages for 15 commits from the CMG benchmark from Long Code Arena and asked to manually edit them to be of good enough quality to submit to a VCS. You can check Manual tab in our… See the full description on the dataset page: https://huggingface.co/datasets/JetBrains-Research/commit-msg-edits.

  12. IWSLT_2016

    • huggingface.co
    + more versions
    Cite
    FrancophonIA, IWSLT_2016 [Dataset]. https://huggingface.co/datasets/FrancophonIA/IWSLT_2016
    Dataset provided by
    Francophonia
    Authors
    FrancophonIA
    Description

    Dataset origin: https://live.european-language-grid.eu/catalogue/corpus/709/

      Description
    

    The human evaluation (HE) dataset created for English to German (EnDe) and English to French (EnFr) MT tasks was a subset of one of the official test sets of the IWSLT 2016 evaluation campaign. The resulting HE sets are composed of 600 segments for both EnDe and EnFr, each corresponding to around 10,000 words. Human evaluation was based on Post-Editing, i.e. the manual correction of the MT… See the full description on the dataset page: https://huggingface.co/datasets/FrancophonIA/IWSLT_2016.

  13. commonsense_qa

    • huggingface.co
    • paperswithcode.com
    • +1more
    Updated May 18, 2022
    Cite
    Tel Aviv University (2022). commonsense_qa [Dataset]. https://huggingface.co/datasets/tau/commonsense_qa
    Dataset updated
    May 18, 2022
    Dataset authored and provided by
    Tel Aviv University
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset Card for "commonsense_qa"

      Dataset Summary
    

    CommonsenseQA is a new multiple-choice question answering dataset that requires different types of commonsense knowledge to predict the correct answers. It contains 12,102 questions with one correct answer and four distractor answers. The dataset is provided in two major training/validation/testing set splits: "Random split", which is the main evaluation split, and "Question token split"; see the paper for details.… See the full description on the dataset page: https://huggingface.co/datasets/tau/commonsense_qa.

  14. atomic

    • huggingface.co
    • paperswithcode.com
    • +2more
    Updated May 25, 2024
    Cite
    Ai2 (2024). atomic [Dataset]. https://huggingface.co/datasets/allenai/atomic
    Dataset updated
    May 25, 2024
    Dataset provided by
    Allen Institute for AI (http://allenai.org/)
    Authors
    Ai2
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset provides the template sentences and relationships defined in the ATOMIC common sense dataset. There are three splits - train, test, and dev.

    From the authors.

    Disclaimer/Content warning: the events in atomic have been automatically extracted from blogs, stories and books written at various times. The events might depict violent or problematic actions, which we left in the corpus for the sake of learning the (probably negative but still important) commonsense implications associated with the events. We removed a small set of truly out-dated events, but might have missed some so please email us (msap@cs.washington.edu) if you have any concerns.


nuprl/CanItEdit

12 scholarly articles cite this dataset (View in Google Scholar)