100+ datasets found
  1.

    CodeContests Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Jun 3, 2025
    Cite
    (2025). CodeContests Dataset [Dataset]. https://paperswithcode.com/dataset/codecontests
    Description

    CodeContests is a competitive programming dataset for machine learning. This dataset was used when training AlphaCode.

    It consists of programming problems from a variety of sources.

    Problems include test cases in the form of paired inputs and outputs, as well as both correct and incorrect human solutions in a variety of languages.
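
    The paired input/output test cases described above can be used to tell correct solutions from incorrect ones. The following is a minimal sketch of that idea, not the dataset's actual format or tooling: the toy problem, the test cases, and the function names are all hypothetical, and a solution is modelled as a Python callable from input text to output text.

```python
def passes_all_tests(solution, test_cases):
    """Return True if `solution` reproduces the expected output for every paired input."""
    for given_input, expected_output in test_cases:
        # Compare after stripping surrounding whitespace (an assumption of this sketch).
        if solution(given_input).strip() != expected_output.strip():
            return False
    return True


# Hypothetical toy problem: print the sum of two integers given on one line.
test_cases = [("1 2\n", "3\n"), ("10 -4\n", "6\n")]


def correct_solution(text):
    a, b = map(int, text.split())
    return f"{a + b}\n"


def incorrect_solution(text):
    a, b = map(int, text.split())
    return f"{a - b}\n"  # bug: subtracts instead of adding
```

    Under these assumptions, `passes_all_tests(correct_solution, test_cases)` is true and the same call with `incorrect_solution` is false.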

  2.

    paperswithcode

    • huggingface.co
    Updated Jul 15, 2023
    Cite
    Jonas Wilinski (2023). paperswithcode [Dataset]. https://huggingface.co/datasets/J0nasW/paperswithcode
    Authors
    Jonas Wilinski
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    A cleaned dataset from paperswithcode.com

    Last dataset update: July 2023. This is a cleaned-up dataset obtained from paperswithcode.com through their API service. It comprises around 56K papers carefully categorized into 3K tasks and 16 areas. The papers contain arXiv and NIPS IDs as well as title, abstract, and other meta information. It can be used for training text classifiers that concentrate on the use of specific AI and ML methods and frameworks.

      Contents… See the full description on the dataset page: https://huggingface.co/datasets/J0nasW/paperswithcode.
    
  3.

    Something-Something V2 Dataset

    • paperswithcode.com
    Cite
    Raghav Goyal; Samira Ebrahimi Kahou; Vincent Michalski; Joanna Materzyńska; Susanne Westphal; Heuna Kim; Valentin Haenel; Ingo Fruend; Peter Yianilos; Moritz Mueller-Freitag; Florian Hoppe; Christian Thurau; Ingo Bax; Roland Memisevic, Something-Something V2 Dataset [Dataset]. https://paperswithcode.com/dataset/something-something-v2
    Authors
    Raghav Goyal; Samira Ebrahimi Kahou; Vincent Michalski; Joanna Materzyńska; Susanne Westphal; Heuna Kim; Valentin Haenel; Ingo Fruend; Peter Yianilos; Moritz Mueller-Freitag; Florian Hoppe; Christian Thurau; Ingo Bax; Roland Memisevic
    Description

    The 20BN-SOMETHING-SOMETHING V2 dataset is a large collection of labeled video clips that show humans performing pre-defined basic actions with everyday objects. The dataset was created by a large number of crowd workers. It allows machine learning models to develop fine-grained understanding of basic actions that occur in the physical world. It contains 220,847 videos, with 168,913 in the training set, 24,777 in the validation set and 27,157 in the test set. There are 174 labels.

  4.

    MNIST Dataset

    • paperswithcode.com
    Cite
    Y. LeCun; L. Bottou; Y. Bengio; P. Haffner, MNIST Dataset [Dataset]. https://paperswithcode.com/dataset/mnist
    Authors
    Y. LeCun; L. Bottou; Y. Bengio; P. Haffner
    Description

    The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of the larger NIST Special Database 3 (digits written by employees of the United States Census Bureau) and Special Database 1 (digits written by high school students), which contain monochrome images of handwritten digits. The digits have been size-normalized and centered in a fixed-size image. The original black and white (bilevel) images from NIST were size-normalized to fit in a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. The images were centered in a 28x28 image by computing the center of mass of the pixels and translating the image so as to position this point at the center of the 28x28 field.
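
    The centering step described above (compute the center of mass of the pixels, then translate so that point sits at the center of the 28x28 field) could be sketched as follows. This is an illustration under stated assumptions, not the original preprocessing code: the function name is hypothetical, the glyph is a plain list of rows of grey levels, and the shift is rounded to whole pixels.

```python
def center_by_mass(glyph, size=28):
    """Place a small grey-level glyph (list of rows) on a size x size canvas
    so that its intensity center of mass lands near the canvas center."""
    total = sum(sum(row) for row in glyph)
    if total == 0:
        raise ValueError("glyph has no ink")
    # Intensity-weighted mean row and column of the glyph.
    cy = sum(r * v for r, row in enumerate(glyph) for v in row) / total
    cx = sum(c * v for row in glyph for c, v in enumerate(row)) / total
    # Whole-pixel shift that moves the center of mass to the canvas center
    # (rounding to integers is an assumption of this sketch).
    center = (size - 1) / 2
    dy = round(center - cy)
    dx = round(center - cx)
    canvas = [[0] * size for _ in range(size)]
    for r, row in enumerate(glyph):
        for c, v in enumerate(row):
            rr, cc = r + dy, c + dx
            if 0 <= rr < size and 0 <= cc < size:
                canvas[rr][cc] = v
    return canvas
```

    For a 20x20 glyph whose ink is a 2x2 block in the top-left corner, this sketch moves the block to rows and columns 13 and 14, so its center of mass sits at (13.5, 13.5), the middle of the 28x28 field.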

  5.

    Data from: ImageNet Dataset

    • paperswithcode.com
    Updated Feb 2, 2021
    Cite
    Jia Deng; Wei Dong; Richard Socher; Li-Jia Li; Kai Li; Fei-Fei Li (2021). ImageNet Dataset [Dataset]. https://paperswithcode.com/dataset/imagenet
    Authors
    Jia Deng; Wei Dong; Richard Socher; Li-Jia Li; Kai Li; Fei-Fei Li
    Description

    The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset has been used in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), a benchmark in image classification and object detection. The publicly released dataset contains a set of manually annotated training images. A set of test images is also released, with the manual annotations withheld. ILSVRC annotations fall into one of two categories: (1) image-level annotation of a binary label for the presence or absence of an object class in the image, e.g., “there are cars in this image” but “there are no tigers,” and (2) object-level annotation of a tight bounding box and class label around an object instance in the image, e.g., “there is a screwdriver centered at position (20,25) with width of 50 pixels and height of 30 pixels”. The ImageNet project does not own the copyright of the images, so only thumbnails and URLs of images are provided.

    Total number of non-empty WordNet synsets: 21,841
    Total number of images: 14,197,122
    Number of images with bounding box annotations: 1,034,908
    Number of synsets with SIFT features: 1,000
    Number of images with SIFT features: 1.2 million

  6.

    COCO (Common Objects in Context) Dataset

    • paperswithcode.com
    Updated Oct 4, 2023
    Cite
    (2023). COCO (Common Objects in Context) Dataset [Dataset]. https://paperswithcode.com/dataset/coco
    Description

    The COCO (Common Objects in Context) dataset is a large-scale object detection, segmentation, and captioning dataset. It is designed to encourage research on a wide variety of object categories and is commonly used for benchmarking computer vision models. It is an essential dataset for researchers and developers working on object detection, segmentation, and pose estimation tasks.

  7.

    LIAR Dataset

    • paperswithcode.com
    Updated Jan 10, 2021
    Cite
    William Yang Wang (2021). LIAR Dataset [Dataset]. https://paperswithcode.com/dataset/liar
    Authors
    William Yang Wang
    Description

    LIAR is a publicly available dataset for fake news detection. It comprises 12.8K manually labeled short statements collected over a decade in various contexts from POLITIFACT.COM, which provides a detailed analysis report and links to source documents for each case. This dataset can be used for fact-checking research as well. Notably, it is an order of magnitude larger than the previously largest public fake news datasets of similar type. The LIAR dataset includes 12.8K human-labeled short statements from POLITIFACT.COM’s API, and each statement is evaluated by a POLITIFACT.COM editor for its truthfulness.

  8.

    Malimg Dataset

    • paperswithcode.com
    Updated Nov 8, 2022
    Cite
    Nataraj L.; Karthikeyan S.; Jacob G.; Manjunath B. S. (2022). Malimg Dataset [Dataset]. https://paperswithcode.com/dataset/malimg
    Authors
    Nataraj L.; Karthikeyan S.; Jacob G.; Manjunath B. S.
    Description

    The Malimg Dataset contains 9,339 malware byteplot images from 25 different families.

  9.

    Cora Dataset

    • paperswithcode.com
    • huggingface.co
    • +1 more
    Updated Feb 2, 2021
    Cite
    Andrew McCallum; Kamal Nigam; Jason Rennie; Kristie Seymore (2020). Cora Dataset [Dataset]. https://paperswithcode.com/dataset/cora
    Authors
    Andrew McCallum; Kamal Nigam; Jason Rennie; Kristie Seymore
    Description

    The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words.
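
    The 0/1-valued word vector described above marks, for each dictionary word, whether it is present in a publication. A minimal sketch of that encoding follows; the function name and the three-word dictionary are hypothetical stand-ins (the real dictionary has 1433 words), and this is not the dataset's own preprocessing code.

```python
def binary_word_vector(document_words, dictionary):
    """Return a 0/1 list with one entry per dictionary word:
    1 if the word occurs in the document, 0 otherwise."""
    present = set(document_words)
    return [1 if word in present else 0 for word in dictionary]


# Hypothetical three-word dictionary standing in for the 1433-word one.
toy_dictionary = ["graph", "neural", "kernel"]
toy_vector = binary_word_vector(["neural", "graph", "graph"], toy_dictionary)
```

    Repeated words do not change the result, since the vector records only absence or presence, not counts.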

  10.

    CIFAR-10 Dataset

    • paperswithcode.com
    • universe.roboflow.com
    Updated Jun 14, 2009
    Cite
    Krizhevsky (2009). CIFAR-10 Dataset [Dataset]. https://paperswithcode.com/dataset/cifar-10
    Authors
    Krizhevsky
    Description

    The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.
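
    The counts quoted above are mutually consistent, which a few lines of arithmetic can confirm. The sketch below only re-derives the figures stated in the description; the function and key names are illustrative.

```python
def cifar10_counts():
    """Re-derive the CIFAR-10 split sizes from the figures in the description."""
    num_classes = 10
    images_per_class = 6000
    batch_size = 10000
    train_batches, test_batches = 5, 1
    test_per_class = 1000

    return {
        "total": num_classes * images_per_class,                 # all images
        "train": train_batches * batch_size,                     # five training batches
        "test": test_batches * batch_size,                       # one test batch
        "train_per_class": images_per_class - test_per_class,    # left over per class
    }


counts = cifar10_counts()
```

    Here `counts["total"]` equals `counts["train"] + counts["test"]`, and the per-class training count times the number of classes equals the training total.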

  11.

    CodeSearchNet Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Dec 30, 2024
    Cite
    Hamel Husain; Ho-Hsiang Wu; Tiferet Gazit; Miltiadis Allamanis; Marc Brockschmidt (2024). CodeSearchNet Dataset [Dataset]. https://paperswithcode.com/dataset/codesearchnet
    Authors
    Hamel Husain; Ho-Hsiang Wu; Tiferet Gazit; Miltiadis Allamanis; Marc Brockschmidt
    Description

    The CodeSearchNet Corpus is a large dataset of functions with associated documentation written in Go, Java, JavaScript, PHP, Python, and Ruby from open source projects on GitHub. The CodeSearchNet Corpus includes:
    • Six million methods overall
    • Two million of which have associated documentation (docstrings, JavaDoc, and more)
    • Metadata that indicates the original location (repository or line number, for example) where the data was found

  12.

    LOL Dataset

    • paperswithcode.com
    Updated Feb 2, 2021
    Cite
    Chen Wei; Wenjing Wang; Wenhan Yang; Jiaying Liu (2021). LOL Dataset [Dataset]. https://paperswithcode.com/dataset/lol
    Authors
    Chen Wei; Wenjing Wang; Wenhan Yang; Jiaying Liu
    Description

    The LOL dataset is composed of 500 low-light and normal-light image pairs and divided into 485 training pairs and 15 testing pairs. The low-light images contain noise produced during the photo capture process. Most of the images are indoor scenes. All the images have a resolution of 400×600.

  13.

    ImageNet-A Dataset

    • paperswithcode.com
    Updated Dec 20, 2023
    Cite
    Dan Hendrycks; Kevin Zhao; Steven Basart; Jacob Steinhardt; Dawn Song (2023). ImageNet-A Dataset [Dataset]. https://paperswithcode.com/dataset/imagenet-a
    Authors
    Dan Hendrycks; Kevin Zhao; Steven Basart; Jacob Steinhardt; Dawn Song
    Description

    The ImageNet-A dataset consists of real-world, unmodified, and naturally occurring examples that are misclassified by ResNet models.

  14.

    Cityscapes Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated May 19, 2020
    Cite
    Marius Cordts; Mohamed Omran; Sebastian Ramos; Timo Rehfeld; Markus Enzweiler; Rodrigo Benenson; Uwe Franke; Stefan Roth; Bernt Schiele (2020). Cityscapes Dataset [Dataset]. https://paperswithcode.com/dataset/cityscapes
    Authors
    Marius Cordts; Mohamed Omran; Sebastian Ramos; Timo Rehfeld; Markus Enzweiler; Rodrigo Benenson; Uwe Franke; Stefan Roth; Bernt Schiele
    Description

    Cityscapes is a large-scale database which focuses on semantic understanding of urban street scenes. It provides semantic, instance-wise, and dense pixel annotations for 30 classes grouped into 8 categories (flat surfaces, humans, vehicles, constructions, objects, nature, sky, and void). The dataset consists of around 5,000 finely annotated images and 20,000 coarsely annotated ones. Data was captured in 50 cities over several months, in daytime and good weather conditions. It was originally recorded as video, so the frames were manually selected to have the following features: a large number of dynamic objects, varying scene layout, and varying background.

  15.

    GSM8K Dataset

    • paperswithcode.com
    • tensorflow.org
    • +2 more
    Updated Dec 31, 2024
    Cite
    Karl Cobbe; Vineet Kosaraju; Mohammad Bavarian; Mark Chen; Heewoo Jun; Lukasz Kaiser; Matthias Plappert; Jerry Tworek; Jacob Hilton; Reiichiro Nakano; Christopher Hesse; John Schulman (2024). GSM8K Dataset [Dataset]. https://paperswithcode.com/dataset/gsm8k
    Authors
    Karl Cobbe; Vineet Kosaraju; Mohammad Bavarian; Mark Chen; Heewoo Jun; Lukasz Kaiser; Matthias Plappert; Jerry Tworek; Jacob Hilton; Reiichiro Nakano; Christopher Hesse; John Schulman
    Description

    GSM8K is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems created by human problem writers. The dataset is segmented into 7.5K training problems and 1K test problems. These problems take between 2 and 8 steps to solve, and solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − × ÷) to reach the final answer. A bright middle school student should be able to solve every problem. It can be used for multi-step mathematical reasoning.

  16.

    APPS Dataset

    • paperswithcode.com
    Updated Feb 28, 2024
    Cite
    Dan Hendrycks; Steven Basart; Saurav Kadavath; Mantas Mazeika; Akul Arora; Ethan Guo; Collin Burns; Samir Puranik; Horace He; Dawn Song; Jacob Steinhardt (2024). APPS Dataset [Dataset]. https://paperswithcode.com/dataset/apps
    Authors
    Dan Hendrycks; Steven Basart; Saurav Kadavath; Mantas Mazeika; Akul Arora; Ethan Guo; Collin Burns; Samir Puranik; Horace He; Dawn Song; Jacob Steinhardt
    Description

    The APPS dataset consists of problems collected from different open-access coding websites such as Codeforces, Kattis, and more. The APPS benchmark attempts to mirror how human programmers are evaluated by posing coding problems in unrestricted natural language and evaluating the correctness of solutions. The problems range in difficulty from introductory to collegiate competition level and measure coding ability as well as problem-solving.

    The Automated Programming Progress Standard, abbreviated APPS, consists of 10,000 coding problems in total, with 131,836 test cases for checking solutions and 232,444 ground-truth solutions written by humans. Problems can be complicated, as the average length of a problem is 293.2 words. The data are split evenly into training and test sets, with 5,000 problems each. In the test set, every problem has multiple test cases, and the average number of test cases is 21.2. Each test case is specifically designed for the corresponding problem, enabling us to rigorously evaluate program functionality.

  17.

    ARC (AI2 Reasoning Challenge) Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Feb 15, 2021
    Cite
    Peter Clark; Isaac Cowhey; Oren Etzioni; Tushar Khot; Ashish Sabharwal; Carissa Schoenick; Oyvind Tafjord (2021). ARC (AI2 Reasoning Challenge) Dataset [Dataset]. https://paperswithcode.com/dataset/arc
    Authors
    Peter Clark; Isaac Cowhey; Oren Etzioni; Tushar Khot; Ashish Sabharwal; Carissa Schoenick; Oyvind Tafjord
    Description

    The AI2 Reasoning Challenge (ARC) dataset is a multiple-choice question-answering dataset containing questions from science exams from grade 3 to grade 9. The dataset is split into two partitions, Easy and Challenge, where the latter contains the more difficult questions that require reasoning. Most of the questions have 4 answer choices, with <1% of all the questions having either 3 or 5 answer choices. ARC includes a supporting KB of 14.3M unstructured text passages.

  18.

    LSUN Dataset

    • paperswithcode.com
    • tensorflow.org
    • +1 more
    Updated Jan 27, 2021
    Cite
    Fisher Yu; Ari Seff; Yinda Zhang; Shuran Song; Thomas Funkhouser; Jianxiong Xiao (2021). LSUN Dataset [Dataset]. https://paperswithcode.com/dataset/lsun
    Authors
    Fisher Yu; Ari Seff; Yinda Zhang; Shuran Song; Thomas Funkhouser; Jianxiong Xiao
    Description

    The Large-scale Scene Understanding (LSUN) challenge aims to provide a different benchmark for large-scale scene classification and understanding. The LSUN classification dataset contains 10 scene categories, such as dining room, bedroom, kitchen, outdoor church, and so on. For training data, each category contains a huge number of images, ranging from around 120,000 to 3,000,000. The validation data includes 300 images, and the test data has 1000 images for each category.

  19.

    Pinterest Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Feb 20, 2021
    Cite
    Xue Geng; Hanwang Zhang; Jingwen Bian; Tat-Seng Chua (2021). Pinterest Dataset [Dataset]. https://paperswithcode.com/dataset/pinterest
    Authors
    Xue Geng; Hanwang Zhang; Jingwen Bian; Tat-Seng Chua
    Description

    The Pinterest dataset contains more than 1 million images associated with the Pinterest users who have “pinned” them.

  20.

    SIPaKMeD Dataset

    • paperswithcode.com
    Cite
    SIPaKMeD Dataset [Dataset]. https://paperswithcode.com/dataset/sipakmed

CodeContests Dataset

238 scholarly articles cite this dataset (View in Google Scholar)