Copy of Kaggle dataset, adding to Huggingface for ease of use.
Description from Kaggle:
Context
Generating humor is a complex task in the domain of machine learning, and it requires the models to understand the deep semantic meaning of a joke in order to generate new ones. Such problems, however, are difficult to solve due to a number of reasons, one of which is the lack of a database that gives an elaborate list of jokes. Thus, a large corpus of over 0.2 million jokes has been collected by scraping several websites containing funny and short jokes.
Visit my Github repository for more information regarding collection of data and the scripts used.
Content
This dataset is in the form of a csv file containing 231,657 jokes. Length of jokes ranges from 10 to 200 characters. Each line in the file contains a unique ID and joke.
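As a rough sketch of how a file like this might be read and checked against the stated 10 to 200 character range: the column names `ID` and `Joke` are assumptions based on the description above, and the two sample rows are made up for illustration, not taken from the dataset.

```python
import csv
import io

def load_jokes(csv_file):
    """Read rows into a list of (id, joke) tuples.

    Assumes a header row with columns named "ID" and "Joke".
    """
    reader = csv.DictReader(csv_file)
    return [(int(row["ID"]), row["Joke"]) for row in reader]

def out_of_range(jokes, min_len=10, max_len=200):
    """Return the jokes whose character length falls outside the stated range."""
    return [(i, j) for i, j in jokes if not (min_len <= len(j) <= max_len)]

# Tiny in-memory stand-in for the real file (made-up rows).
sample = io.StringIO(
    'ID,Joke\n'
    '1,"Why did the chicken cross the road?"\n'
    '2,"Too short"\n'
)
jokes = load_jokes(sample)
print(len(jokes), out_of_range(jokes))
```

For the real file, the `io.StringIO` object would be replaced by something like `open(path, newline="", encoding="utf-8")`.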
Disclaimer
An attempt has been made to keep the jokes as clean as possible. Because the data was collected by scraping websites, a few jokes may be inappropriate or offensive to some people.
License: https://creativecommons.org/publicdomain/zero/1.0/
Get ready to groan and giggle your way through 'Grin and Dad Joke It,' a dataset that's all about those classic, eye-roll-inducing dad jokes. This pun-tastic collection brings together a treasure trove of one-liners, puns, and witty quips that dads everywhere love to share. Whether you're a dad joke aficionado or just looking to add some humor to your day, this dataset is your go-to source for timeless, family-friendly humor. From cheesy wordplay to clever punchlines, 'Grin and Dad Joke It' has you covered, ensuring that a chuckle is just a punchline away.
And the fun never stops! With 200 new jokes added daily, 'Grin and Dad Joke It' keeps the laughter flowing and your pun tolerance growing. It's a never-ending source of dad-approved humor that's always fresh and ready to make you smile.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Context
Generating humor is a complex task in the domain of machine learning, and it requires the models to understand the deep semantic meaning of a joke in order to generate new ones. Such problems, however, are difficult to solve due to a number of reasons, one of which is the lack of a database that gives an elaborate list of jokes. Thus, a large corpus of over 0.2 million jokes has been collected by scraping several websites containing funny and short jokes. You can visit the Github… See the full description on the dataset page: https://huggingface.co/datasets/ysharma/short_jokes.
This dataset was created by Yaroslav-siberia
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for one-million-reddit-jokes
Dataset Summary
This corpus contains a million posts from /r/jokes. Posts are annotated with their score.
Languages
Mainly English.
Dataset Structure
Data Instances
A data point is a Reddit post.
Data Fields
'type': the type of the data point. Can be 'post' or 'comment'.
'id': the base-36 Reddit ID of the data point. Unique when combined with type.
'subreddit.id': the base-36 Reddit ID… See the full description on the dataset page: https://huggingface.co/datasets/SocialGrep/one-million-reddit-jokes.
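To illustrate what a base-36 ID and the "unique when combined with type" rule could look like in practice, here is a minimal sketch. The helper names and the sample records are hypothetical; only the field names `type` and `id` come from the description above.

```python
def reddit_id_to_int(base36_id):
    """Decode a base-36 ID (digits 0-9 and letters a-z) into an integer."""
    return int(base36_id, 36)

def unique_key(record):
    """Build a key from type and id, since the id alone is not stated to be unique."""
    return (record["type"], record["id"])

# Made-up records: a post and a comment that happen to share the same id.
records = [
    {"type": "post", "id": "abc"},
    {"type": "comment", "id": "abc"},
]
keys = {unique_key(r) for r in records}
print(len(keys), reddit_id_to_int("abc"))
```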
License: http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
My goal with this dataset is to create the largest and most organized dataset of jokes.
Tools for this dataset are on my Github
Question-Answer Jokes by Jiri Roznovjak
Short Jokes by Abhinav Moudgil
Humor is one of the most difficult domains of natural language processing.
If you want to help rate the jokes based on funniness and/or vulgarity, download the .csv and make new column(s) with your rating(s). Email that to bfinan@iastate.edu, and I'll add your ratings as part of the dataset.
Programming Jokes Dataset
Dataset Summary
This dataset contains programming-related jokes scraped from the website Punny Funny. The jokes are organized into different categories based on the structure of the original webpage. The dataset is intended for use in natural language processing tasks, such as fine-tuning language models to generate humor or analyze textual content in the programming domain.
Number of Jokes: [220]
Usage
This dataset is suitable for… See the full description on the dataset page: https://huggingface.co/datasets/asfandyarazhar/programming-jokes-dataset.
rayhanti/jokes-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
License: https://creativecommons.org/publicdomain/zero/1.0/
By Fraser Greenlee (from Hugging Face)
This dataset offers a valuable resource for various applications such as natural language processing, sentiment analysis, joke generation algorithms, or simply for entertainment purposes. Whether you're a data scientist looking to analyze humor patterns or an individual seeking some quick comedic relief, this dataset has got you covered.
By utilizing this dataset, researchers can explore different aspects of humor and study the linguistic features that make these short jokes amusing. Moreover, it provides an opportunity for developing computer models capable of generating similar humorous content based on learned patterns.
Understanding the Columns:
- text: This column contains the text of the short joke.
Exploring the Jokes:
- Start by exploring the text column, which contains the actual jokes. You can read through them and have a good laugh!
Analyzing the Jokes:
- To gain insights from this dataset, you can perform various analyses:
- Sentiment Analysis: Use Natural Language Processing techniques to analyze the sentiment of each joke.
- Categorization: Group jokes based on common themes or subjects, such as animals, professions, etc.
- Length Distribution: Analyze and visualize the distribution of joke lengths.
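As one possible starting point for the length-distribution analysis listed above, the following sketch buckets joke lengths into fixed-width bins. The function name and the 20-character bucket size are arbitrary choices for illustration, and the input strings are placeholders rather than real jokes.

```python
from collections import Counter

def length_histogram(jokes, bucket_size=20):
    """Count jokes per length bucket; keys are the lower bound of each bucket."""
    counts = Counter((len(joke) // bucket_size) * bucket_size for joke in jokes)
    return dict(sorted(counts.items()))

# Placeholder strings of known lengths (5, 25, 39 and 40 characters).
texts = ["a" * 5, "b" * 25, "c" * 39, "d" * 40]
for lower, count in length_histogram(texts).items():
    print(f"{lower:>3}-{lower + 19:<3} {'#' * count}")
```

The returned dictionary can be passed to any plotting library if a visual histogram is wanted.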
Creating New Content or Applications: Since this dataset provides a large collection of short jokes, you can utilize it creatively:
- Generating Random Jokes: Develop an algorithm that generates new jokes based on patterns found in this dataset.
- Humor Classification: Build a model that predicts if a given piece of text is funny or not using machine learning techniques.
Sharing Your Findings: If you make interesting discoveries or create unique applications using this dataset, consider sharing them with others in the Kaggle community.
Please note that no information regarding dates is available in train.csv; therefore, any temporal analysis or date-based insights won't be feasible with this specific file.
- Analyzing humor patterns: This dataset can be used to analyze different types of humor and identify patterns or common elements in jokes that make them funny. Researchers and linguists can use this dataset to gain insights into the structure, wordplay, or comedic techniques used in short jokes.
- Natural language processing: With the text data available in this dataset, it can be used for training models in natural language processing (NLP) tasks such as sentiment analysis, joke generation, or understanding humor from written text. NLP researchers and developers can utilize this dataset to build and improve algorithms for detecting or generating funny content.
- Social media analysis: Short jokes are popular on social media platforms like Twitter or Reddit where users frequently share humorous content. This dataset can be valuable for analyzing the reception and impact of these jokes on social media platforms. By examining trends, engagement metrics, or user reactions to specific jokes from the dataset, marketers or social media analysts can gain insights into what type of humor resonates with different online communities.
Overall, this dataset provides a rich resource for exploring various aspects related to humor analysis and NLP tasks while offering opportunities for sociocultural studies related to online comedy culture.
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0), Public Domain Dedication. No copyright: you can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.
File: train.csv
| Column name | Description |
|:--------------|:----------------------------------------------|
| text | The actual content of the short jokes. (Text) |
If you use this dataset in your research, please credit Fraser Greenlee (from Hugging Face).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 1 row and is filtered where the book is Jokes, jests and jollies. It features 7 columns including author, publication date, language, and book publisher.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 3 rows and is filtered where the book is Monster jokes. It features 7 columns including author, publication date, language, and book publisher.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset includes a collection of Hungarian Internet humour connected to the Covid-19 pandemic. The collection includes 344 items (mostly jokes and memes) that were collected online during the first wave of Covid in Hungary, between January and June 2020.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
kentfoong/jokes dataset hosted on Hugging Face and contributed by the HF Datasets community
This dataset contains the predicted prices of the asset dad jokes over the next 16 years. This data is calculated initially using a default 5 percent annual growth rate, and after page load, it features a sliding scale component where the user can then further adjust the growth rate to their own positive or negative projections. The maximum positive adjustable growth rate is 100 percent, and the minimum adjustable growth rate is -100 percent.
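The description does not say how the growth rate is applied, so the sketch below assumes yearly compounding, which is one common reading. The function name and the starting price of 100 are hypothetical; only the 5 percent default, the 16-year horizon, and the -100 to 100 percent bounds come from the text above.

```python
def project_prices(start_price, annual_rate=0.05, years=16):
    """Project a price forward, assuming the rate compounds once per year.

    annual_rate is a fraction (0.05 means 5 percent) and must lie in [-1.0, 1.0],
    matching the adjustable range described for the page.
    """
    if not -1.0 <= annual_rate <= 1.0:
        raise ValueError("annual_rate must be between -1.0 and 1.0")
    return [start_price * (1 + annual_rate) ** year for year in range(1, years + 1)]

# Hypothetical starting price, default 5 percent growth.
prices = project_prices(100.0)
print(round(prices[0], 2), round(prices[-1], 2))
```

Under this assumption the price after n years is start_price × (1 + rate)^n, so a rate of -100 percent drives every projected value to zero.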
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This is a corpus of 1915 "jokes of the day" ("šala dneva") published by the Slovenian news portal 24ur.com. The jokes were scraped from their archive on September 18th, 2024. The initial list is lightly curated: shorter texts found in the original collection were removed from the corpus since they appear to be illustration captions without the accompanying illustrations.
Readers of the news portal vote on the jokes themselves with thumbs up and thumbs down buttons. The voting results are included as metadata with each joke. Several jokes have been published more than once. Each joke (distinguished based on exact text matches) is identified by a hash of its text and presents a list of voting results for every instance of its publication. The normalised_text field contains text with punctuation corrections. For now, this is limited to replacing '' (two consecutive apostrophes U+0027) with " (a single straight/dumb/vertical quotation mark U+0022). The former (two apostrophes) is consistently used in place of the latter in the original corpus.
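The normalisation and duplicate handling described above can be sketched roughly as follows. The hash function is not named in the description, so SHA-256 is used here purely as a stand-in, and the function names and the sample sentence are hypothetical.

```python
import hashlib

def normalise(text):
    """Replace two consecutive apostrophes (U+0027 U+0027) with one quotation mark (U+0022)."""
    return text.replace("''", '"')

def text_hash(text):
    """Hash the exact original text; SHA-256 is an assumption, not the corpus's documented choice."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def group_publications(publications):
    """Merge exact-text duplicates; publications is an iterable of (text, votes_for, votes_against)."""
    grouped = {}
    for text, votes_for, votes_against in publications:
        entry = grouped.setdefault(
            text_hash(text),
            {"original_text": text, "normalised_text": normalise(text), "votes": []},
        )
        entry["votes"].append({"for": votes_for, "against": votes_against})
    return grouped

# Made-up example: the same text published twice with different vote counts.
sample = [("''Dober dan,'' je rekel.", 10, 2), ("''Dober dan,'' je rekel.", 7, 1)]
for key, entry in group_publications(sample).items():
    print(key[:8], entry["normalised_text"], entry["votes"])
```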
Based on the name ("Šala dneva", i.e. "Joke of the day") and the observed frequency of posting during September 2024, we assume each entry corresponds to a day, starting from the day of data collection and counting backwards. Each voting event has an associated estimated publication date calculated with the above algorithm.
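The date estimate described above amounts to subtracting one day per entry from the collection date. A minimal sketch, assuming the entries are ordered newest first and that there is exactly one entry per day (the function name is hypothetical):

```python
from datetime import date, timedelta

COLLECTION_DATE = date(2024, 9, 18)  # scraping date stated in the description

def estimate_dates(n_entries, collection_date=COLLECTION_DATE):
    """Assign entry i (0 = newest) the date collection_date minus i days."""
    return [collection_date - timedelta(days=i) for i in range(n_entries)]

for position, estimated in enumerate(estimate_dates(3)):
    print(position, estimated.isoformat())
```

Any gap or double posting in the real archive would shift every earlier estimate, which is why the dates are described as estimated.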
The jokes are linguistically annotated with CLASSLA-Stanza (https://github.com/clarinsi/classla), using the models for standard Slovenian.
The JSONL file contains entries representing individual jokes containing:
- a hash of the original joke text used for duplicate identification (key: hash)
- original scraped text (key: original_text)
- normalised text (key: normalised_text)
- linguistically annotated normalised text in CoNLL-U format (key: processed_text)
- a list of vote objects containing joke vote metadata (key: votes)
- votes for (key: votes.for)
- votes against (key: votes.against)
- estimated dates of joke publication and voting (key: estimated_date)
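A JSONL file of this kind can be read one line at a time. The sketch below sums the votes per joke; it assumes each vote object carries the keys `for` and `against` (inferred from the `votes.for` and `votes.against` paths), and the two sample lines are invented for illustration rather than taken from the corpus.

```python
import json

def vote_totals(jsonl_lines):
    """Map each joke hash to (total votes for, total votes against)."""
    totals = {}
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        joke = json.loads(line)
        votes = joke.get("votes", [])
        totals[joke["hash"]] = (
            sum(v.get("for", 0) for v in votes),
            sum(v.get("against", 0) for v in votes),
        )
    return totals

# Invented sample lines following the assumed structure.
sample_lines = [
    '{"hash": "h1", "original_text": "a", "votes": [{"for": 10, "against": 2}, {"for": 5, "against": 1}]}',
    '{"hash": "h2", "original_text": "b", "votes": [{"for": 3, "against": 4}]}',
]
print(vote_totals(sample_lines))
```

With a real file, `sample_lines` would be replaced by an open file handle, since iterating over a file yields one line at a time.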
The corpus contains 16658 sentences, 129063 tokens, and 662 recognised named entities.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract: Jokes between countries are useful for revealing the ethnotypes that exist in each of them, and representing them in cartographic form makes it possible to perceive their distribution and the spatial projection of mockery: who are we laughing at, and who are the scapegoats for the inhabitants of each country? Based on the analysis of an ad hoc database covering more than 60% of the countries and territories of the world and 90% of its population, the text shows that these jokes are social constructions, have a temporality, and fall basically into two categories: from top to bottom and from bottom to top.
License: https://choosealicense.com/licenses/gpl-2.0/
Short Jokes Punchline
This dataset contains information about jokes, visitors, labels, and label segments used in a joke labeling application. The data is stored in four CSV files: joke.csv, visitor.csv, label.csv, and label_segment.csv.
Files
joke.csv
This file contains 200 jokes randomly sampled from the Kaggle dataset "Short Jokes." Each row represents a joke with the following columns:
id: The unique identifier for the joke.
text: The text content of the… See the full description on the dataset page: https://huggingface.co/datasets/Timxjl/short-jokes-punchline.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about book subjects. It has 4 rows and is filtered where the book is Jokes my father never taught me : life, love, and loss with Richard Pryor. It features 10 columns including number of authors, number of books, earliest publication date, and latest publication date.
This dataset was created by Mohamed Amine DHIAB
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
notsobad9527/chinese-joke dataset hosted on Hugging Face and contributed by the HF Datasets community