2 datasets found
  1. The reddit self-post classification task

    Classify reddit self-posts into over 1000 carefully selected categories

    • kaggle.com
    Updated Oct 29, 2018
    Cite
    Mike Swarbrick Jones (2018). The reddit self-post classification task [Dataset]. https://www.kaggle.com/mswarbrickjones/reddit-selfposts/code
    6 scholarly articles cite this dataset (view in Google Scholar)
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 29, 2018
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mike Swarbrick Jones
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Introduction

    Welcome to the Reddit Self-Post Classification Task (RSPCT)!

    The aim of this dataset is to provide an interesting, large text classification problem with many classes that does not suffer from the label sparsity common to most datasets of its type. See the blog post for a more detailed write-up, or the paper, for details. The task is to classify each self-post into the subreddit in which it was posted. A great deal of effort has gone into selecting a ‘good’ set of subreddits to minimise overlap in content.

    We recommend you look at the blog-post write-up for this dataset before continuing. There is also a rough draft of a paper if you have more detailed questions.

    Data

    The data consists of 1.013M self-posts drawn from 1,013 subreddits (1,000 examples per class). For each post we give the subreddit, the title, and the content of the self-post.

    We have also included a manual annotation of about 3,000 subreddits considered during the creation of this dataset, in subreddit_info.csv; this was the main criterion for selecting which subreddits went into the dataset. We include a top-level category and subcategory for each subreddit, and a reason for exclusion if it does not appear in the data.

    Recommendations

    We recommend splitting out the last 20% of the data as a test set (we have organised the data so that this is a random, stratified sample of the whole). In our experiments, we have been optimising the precision-at-K metric for K = {1, 3, 5}.
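
    For a single-label problem like this one, precision-at-K reduces to the fraction of posts whose true subreddit appears among the model's K highest-scoring predictions. A minimal sketch (the function and toy data below are illustrative, not part of the dataset):

    ```python
    # Precision-at-K for single-label classification: a prediction counts as
    # a hit if the true label is among the K top-scoring candidate labels.
    def precision_at_k(true_labels, scores, k):
        """true_labels: list of labels; scores: list of {label: score} dicts."""
        hits = 0
        for label, post_scores in zip(true_labels, scores):
            top_k = sorted(post_scores, key=post_scores.get, reverse=True)[:k]
            hits += label in top_k
        return hits / len(true_labels)

    # Toy example: three posts, three candidate subreddits.
    y_true = ["cats", "dogs", "birds"]
    y_scores = [
        {"cats": 0.7, "dogs": 0.2, "birds": 0.1},  # correct at K=1
        {"cats": 0.5, "dogs": 0.4, "birds": 0.1},  # correct only at K>=2
        {"cats": 0.6, "dogs": 0.3, "birds": 0.1},  # correct only at K=3
    ]
    print(precision_at_k(y_true, y_scores, 1))  # 1/3
    print(precision_at_k(y_true, y_scores, 3))  # 3/3
    ```

    Averaging a hit indicator over posts like this is what makes the metric forgiving of near-misses at larger K.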

    Questions that we think would be interesting to answer

    • Can sequential models (e.g. LSTMs) be trained to be competitive with, or outperform, bag-of-words approaches?
    • Does transfer learning (e.g. OpenAI, ULMFiT) help on this problem? You may want to look at the GitHub page (https://github.com/mikesj-public/rspct-dataset/tree/master) to get hold of an unsupervised training set.
    • Can you leverage a hierarchy (such as the one detailed in subreddit_info.csv) to improve accuracy?
    • Can you use techniques from XML (extreme multi-class) machine learning to get a better score on this dataset?
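
    As a starting point for the bag-of-words side of the first question, a minimal TF-IDF + logistic regression baseline can be sketched with scikit-learn. The tiny in-line corpus below is a stand-in: in practice you would load the posts (e.g. title and self-post content concatenated) and subreddit labels from the dataset files.

    ```python
    # Bag-of-words baseline sketch: TF-IDF features over the post text,
    # logistic regression over the subreddit labels.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Stand-in corpus; replace with real posts and subreddit labels.
    texts = [
        "my cat sleeps all day on the sofa",
        "adopted a kitten, cat food recommendations?",
        "my dog loves fetch in the park",
        "best leash for a large dog breed",
    ]
    labels = ["cats", "cats", "dogs", "dogs"]

    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(texts, labels)
    print(model.predict(["cat food for my kitten"])[0])
    ```

    With 1,013 classes, the interesting comparison is how far this linear baseline gets on precision-at-K before sequential or transfer-learning models overtake it.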
  2. openai_humaneval

    • huggingface.co
    Updated Jan 1, 2022
    Cite
    OpenAI (2022). openai_humaneval [Dataset]. https://huggingface.co/datasets/openai/openai_humaneval
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 1, 2022
    Dataset authored and provided by
    OpenAI (https://openai.com/)
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset Card for OpenAI HumanEval

      Dataset Summary
    

    The HumanEval dataset released by OpenAI includes 164 programming problems, each with a function signature, docstring, body, and several unit tests. They were handwritten to ensure they would not be included in the training sets of code generation models.

      Supported Tasks and Leaderboards

      Languages

    The programming problems are written in Python and contain English natural text in comments and docstrings.… See the full description on the dataset page: https://huggingface.co/datasets/openai/openai_humaneval.
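
    HumanEval results are conventionally reported as pass@k. A sketch of the unbiased estimator from the paper that introduced the benchmark: given n generated samples per problem, of which c pass the unit tests, it computes the probability that at least one of k samples drawn without replacement passes.

    ```python
    # Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k), where the second
    # term is the probability that a size-k draw contains no passing sample.
    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        if n - c < k:  # every size-k draw must contain a passing sample
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    print(pass_at_k(n=10, c=3, k=1))  # 0.3
    ```

    The per-problem values are then averaged over all 164 problems to give the benchmark score.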

