2 datasets found
  1. The reddit self-post classification task

    Classify reddit self-posts into over 1000 carefully selected categories

    • kaggle.com
    Updated Oct 29, 2018
    Cite
    Mike Swarbrick Jones (2018). The reddit self-post classification task [Dataset]. https://www.kaggle.com/mswarbrickjones/reddit-selfposts/code
    6 scholarly articles cite this dataset (view in Google Scholar)
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 29, 2018
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mike Swarbrick Jones
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Introduction

    Welcome to the Reddit Self-Post Classification Task (RSPCT)!

    The aim of this dataset is to provide an interesting, large text classification problem with many classes that does not suffer from the label sparsity common to most datasets of its type. See the blog post for a more detailed write-up, or the paper, for details. The task is to classify each self-post into the subreddit in which it was posted. A great deal of effort has gone into selecting a ‘good’ set of subreddits to minimise overlap in content.

    We recommend you look at the blog-post write-up for this dataset before continuing. There is also a rough draft of a paper if you have more detailed questions.

    Data

    The data consists of 1.013M self-posts drawn from 1,013 subreddits (1,000 examples per class). For each post we give the subreddit, the title, and the content of the self-post.

    We have also included a manual annotation of about 3,000 subreddits considered during the creation of this dataset, in subreddit_info.csv; this was the main criterion for selecting which subreddits went into the dataset. We include a top-level category and subcategory for each subreddit, and a reason for exclusion if it does not appear in the data.

    Recommendations

    We recommend splitting out the last 20% of the data as a test set (we have organised the data so that this is a random, stratified sample of the whole). In our experiments, we have been optimising the precision-at-K metric for K = {1, 3, 5}.
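
    For a single-label problem like this one, precision-at-K reduces to the fraction of posts whose true subreddit appears among the model's K highest-scoring predictions. A minimal sketch (the function and toy data below are illustrative, not part of the dataset):

    ```python
    # Precision-at-K for single-label classification: a prediction counts as
    # a hit if the true label is among the K top-scoring candidate labels.
    def precision_at_k(true_labels, scores, k):
        """true_labels: list of labels; scores: list of {label: score} dicts."""
        hits = 0
        for label, post_scores in zip(true_labels, scores):
            top_k = sorted(post_scores, key=post_scores.get, reverse=True)[:k]
            hits += label in top_k
        return hits / len(true_labels)

    # Toy example: three posts, three candidate subreddits.
    y_true = ["cats", "dogs", "birds"]
    y_scores = [
        {"cats": 0.7, "dogs": 0.2, "birds": 0.1},  # correct at K=1
        {"cats": 0.5, "dogs": 0.4, "birds": 0.1},  # correct only at K>=2
        {"cats": 0.6, "dogs": 0.3, "birds": 0.1},  # correct only at K=3
    ]
    print(precision_at_k(y_true, y_scores, 1))  # 1/3
    print(precision_at_k(y_true, y_scores, 3))  # 3/3
    ```

    Averaging a hit indicator over posts like this is what makes the metric forgiving of near-misses at larger K.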

    Questions that we think would be interesting to answer

    • Can sequential models (e.g. LSTMs) be trained to be competitive with, or outperform, bag-of-words approaches?
    • Does transfer learning (e.g. OpenAI, ULMFiT) help on this problem? You may want to look at the GitHub page (https://github.com/mikesj-public/rspct-dataset/tree/master) to get hold of an unsupervised training set.
    • Can you leverage a hierarchy (such as the one detailed in subreddit_info.csv) to improve accuracy?
    • Can you use techniques from XML (extreme multi-class) machine learning to get a better score on this dataset?
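
    As a starting point for the bag-of-words side of the first question, a minimal TF-IDF + logistic regression baseline can be sketched with scikit-learn. The tiny in-line corpus below is a stand-in: in practice you would load the posts (e.g. title and self-post content concatenated) and subreddit labels from the dataset files.

    ```python
    # Bag-of-words baseline sketch: TF-IDF features over the post text,
    # logistic regression over the subreddit labels.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Stand-in corpus; replace with real posts and subreddit labels.
    texts = [
        "my cat sleeps all day on the sofa",
        "adopted a kitten, cat food recommendations?",
        "my dog loves fetch in the park",
        "best leash for a large dog breed",
    ]
    labels = ["cats", "cats", "dogs", "dogs"]

    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(texts, labels)
    print(model.predict(["cat food for my kitten"])[0])
    ```

    With 1,013 classes, the interesting comparison is how far this linear baseline gets on precision-at-K before sequential or transfer-learning models overtake it.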
  2. openai_humaneval

    • huggingface.co
    Updated Jan 1, 2022
    Cite
    OpenAI (2022). openai_humaneval [Dataset]. https://huggingface.co/datasets/openai/openai_humaneval
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 1, 2022
    Dataset authored and provided by
    OpenAI (https://openai.com/)
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset Card for OpenAI HumanEval

      Dataset Summary
    

    The HumanEval dataset released by OpenAI includes 164 programming problems, each with a function signature, docstring, body, and several unit tests. They were handwritten to ensure they would not be included in the training sets of code generation models.

      Supported Tasks and Leaderboards

      Languages

    The programming problems are written in Python and contain English natural text in comments and docstrings.… See the full description on the dataset page: https://huggingface.co/datasets/openai/openai_humaneval.
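
    HumanEval results are conventionally reported as pass@k. A sketch of the unbiased estimator from the paper that introduced the benchmark: given n generated samples per problem, of which c pass the unit tests, it computes the probability that at least one of k samples drawn without replacement passes.

    ```python
    # Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k), where the second
    # term is the probability that a size-k draw contains no passing sample.
    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        if n - c < k:  # every size-k draw must contain a passing sample
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    print(pass_at_k(n=10, c=3, k=1))  # 0.3
    ```

    The per-problem values are then averaged over all 164 problems to give the benchmark score.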

