Copy of a Kaggle dataset, added to Hugging Face for ease of use.
Description from Kaggle:
Context
Generating humor is a complex task in the domain of machine learning, and it requires the models to understand the deep semantic meaning of a joke in order to generate new ones. Such problems, however, are difficult to solve due to a number of reasons, one of which is the lack of a database that gives an elaborate list of jokes. Thus, a large corpus of over 0.2 million jokes has been collected by scraping several websites containing funny and short jokes.
Visit my GitHub repository for more information on how the data was collected and the scripts used.
Content
This dataset is a CSV file containing 231,657 jokes. Joke length ranges from 10 to 200 characters. Each line in the file contains a unique ID and a joke.
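As a minimal sketch of reading a file with this layout, the snippet below parses an ID/joke CSV and keeps only jokes within the stated 10 to 200 character range. The column names `ID` and `Joke`, the helper `load_jokes`, and the two sample rows are assumptions made for illustration; check the header of the real file before relying on them.

```python
import csv
import io

# Hypothetical two-row sample; the real file's header names may differ.
SAMPLE = """ID,Joke
1,"Why did the invisible man quit his new job? He just couldn't see himself doing it."
2,Too short
"""

def load_jokes(handle, min_len=10, max_len=200):
    """Return (id, joke) pairs whose joke length lies in [min_len, max_len]."""
    reader = csv.DictReader(handle)
    jokes = []
    for row in reader:
        text = row["Joke"].strip()  # assumed column name
        if min_len <= len(text) <= max_len:
            jokes.append((int(row["ID"]), text))  # assumed column name
    return jokes

jokes = load_jokes(io.StringIO(SAMPLE))
print(len(jokes), "joke(s) kept")
```

To run this against the actual download, pass an open file handle (for example `open(path, newline="", encoding="utf-8")`) instead of the in-memory sample.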
Disclaimer
An attempt has been made to keep the jokes as clean as possible. Since the data was collected by scraping websites, a few jokes may be inappropriate or offensive to some people.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Dad Jokes dataset
This dataset is generated from the Kaggle Reddit Dad Jokes by Oktay Ozturk, with the following modifications:
Only jokes with 5+ votes were sampled. Less upvoted jokes are too cringe.
With a set of heuristics, each joke was split into two parts: base and the punchline.
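The two modifications above can be sketched roughly as follows. The description does not say which heuristics were used, so the split rule here (cut at the first question mark) is a hypothetical stand-in, and the function names and vote counts in the sample rows are made up for illustration.

```python
def split_joke(text):
    """Split at the first '?' (a hypothetical heuristic, not the dataset's own).

    Returns (base, punchline), or None when no clean split is found.
    """
    head, sep, tail = text.partition("?")
    if not sep or not tail.strip():
        return None
    return head.strip() + "?", tail.strip()

def prepare(rows, min_votes=5):
    """Keep jokes with at least min_votes votes that split into two parts."""
    pairs = []
    for text, votes in rows:
        if votes < min_votes:
            continue  # drop low-voted jokes
        parts = split_joke(text)
        if parts is not None:
            pairs.append(parts)
    return pairs

# Made-up sample rows: (joke text, vote count).
rows = [
    ("Why did the invisible man quit his new job? He just couldn't see himself doing it.", 12),
    ("What do you call a fake noodle? An impasta.", 2),
    ("I only know 25 letters of the alphabet.", 40),
]
pairs = prepare(rows)
print(pairs)
```

A single rule like this will miss jokes without a question mark, which is presumably why the original pipeline combined several heuristics.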
Format
The dataset is formatted as a CSV, and is split into train/test parts:
train: 52000 samples
test: 1400 samples
"question","response" "I asked my priest how he gets holy… See the full description on the dataset page: https://huggingface.co/datasets/shuttie/dadjokes.
This dataset was created by Mohamed Amine DHIAB
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Context Generating humor is a complex task in the domain of machine learning, and it requires the models to understand the deep semantic meaning of a joke in order to generate new ones. Such problems, however, are difficult to solve due to a number of reasons, one of which is the lack of a database that gives an elaborate list of jokes. Thus, a large corpus of over 0.2 million jokes has been collected by scraping several websites containing funny and short jokes. You can visit the Github… See the full description on the dataset page: https://huggingface.co/datasets/ysharma/short_jokes.
6.5 million anonymous ratings of jokes by users of the Jester Joke Recommender System.
https://choosealicense.com/licenses/creativeml-openrail-m/
Amirkid/jokes dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset full of jokes ranked by score.
Some examples:
— Why did the invisible man quit his new job? — He just couldn't see himself doing it.
— A man sees a pregnant woman laughing — He asks the woman, she replies "Nothing, it's an inside joke!"
https://github.com/taivop/joke-dataset
Pungas, Taivo
This dataset was created by Yaroslav-siberia
http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
My goal with this dataset is to create the largest and most organized dataset of jokes.
Tools for this dataset are on my GitHub.
Question-Answer Jokes by Jiri Roznovjak
Short Jokes by Abhinav Moudgil
Humor is one of the most difficult domains of natural language processing.
If you want to help rate the jokes based on funniness and/or vulgarity, download the .csv and make new column(s) with your rating(s). Email that to bfinan@iastate.edu, and I'll add your ratings as part of the dataset.
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Hailuo AI Jokes Dataset 🎤
A curated collection of high-quality voice recordings with corresponding transcriptions and phoneme analysis. This dataset is designed for speech recognition, text-to-speech, and voice analysis tasks.
🎙️ Dataset Content
The dataset contains a diverse set of synthetic voice recordings generated by Hailuo AI Audio. The texts are sourced from a variety of public domain jokes and humorous anecdotes. Each audio sample is accompanied by the… See the full description on the dataset page: https://huggingface.co/datasets/unlimitedbytes/hailuo-ai-jokes.
Can an automated system recommend a funny joke? Jester is an online joke recommender system developed by Ken Goldberg and the team at UC Berkeley. Users are presented jokes through an HTML client interface and allowed to rate jokes. Once a user rates all jokes in the gauge set, the system recommends new jokes to the user.
See the research paper below for more on how the dataset can be used.
Eigentaste: A Constant Time Collaborative Filtering Algorithm
https://creativecommons.org/publicdomain/zero/1.0/
This dataset is a good way to practice text classification. Try to predict the theme of a joke from its text, or to identify the more highly rated jokes.
The archived data consist of jokes, anecdotes and other humorous texts distributed through email messages. The researcher sent the request for email humour to the staff members of the Department of Sociology and Social Psychology at the University of Tampere, Finland, in February 2003. The staff members in their turn distributed the request further. Texts were received from university staff members and students as well as from outsiders. Data collection continued till the year 2004. The total number of email messages received was 217, some of which contained more than one joke or anecdote. The jokes/anecdotes were mostly in Finnish, but approximately 20% were in English. The themes of the email messages varied greatly. Many were connected to current events, for instance, the Iraq war, September 11 terrorist attacks in the USA, and the doping scandal of Finnish skiers in 2001. Other recurring themes included sexuality, gender and ethnicity stereotypes, and professional jokes. As is typical in email humor, the original creator of the jokes/anecdotes often remained unknown. The dataset is only available in the original languages.
http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
This dataset contains over a million jokes and the rating given to each joke by users
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books and is filtered to the book "Joke-tionary jokes: more than 444 jokes for kids!". It has 7 columns, including author, BNB id, book, book publisher, and ISBN. The data is ordered by publication date (descending).
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This is a corpus of 1915 "jokes of the day" ("šala dneva") published by the Slovenian news portal 24ur.com. The jokes were scraped from their archive on September 18th, 2024. The initial list is lightly curated: shorter texts found in the original collection were removed from the corpus since they appear to be illustration captions without the accompanying illustrations.
Readers of the news portal vote on the jokes themselves with thumbs up and thumbs down buttons. The voting results are included as metadata with each joke. Several jokes have been published more than once. Each joke (distinguished based on exact text matches) is identified by a hash of its text and presents a list of voting results for every instance of its publication.
The normalised_text field contains text with punctuation corrections. For now, this is limited to replacing '' (two consecutive apostrophes U+0027) with " (a single straight/dumb/vertical quotation mark U+0022). The former (two apostrophes) is consistently used in place of the latter in the original corpus. Based on the name ("Šala dneva", i.e. "Joke of the day") and the observed frequency of posting during September 2024, we assume each entry corresponds to a day, starting from the day of data collection and counting backwards. Each voting event has an associated estimated publication date calculated with the above algorithm. The jokes are linguistically annotated with CLASSLA-Stanza (https://github.com/clarinsi/classla), using the models for standard Slovenian.
The JSONL file contains entries representing individual jokes containing:
- a hash of the original joke text used for duplicate identification (key: hash)
- original scraped text (key: original_text)
- normalised text (key: normalised_text)
- linguistically annotated normalised text in CoNLL-U format (key: processed_text)
- a list of vote objects containing joke vote metadata (key: votes)
- votes for (key: votes.for)
- votes against (key: votes.against)
- estimated dates of joke publication and voting (key: estimated_date)
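The normalisation and date-estimation steps described for this corpus can be sketched as below. This is an illustration under several assumptions, not the corpus authors' code: SHA-256 is assumed as the hash function, `estimated_date` is assumed to sit inside each vote object, the helper names are hypothetical, and the sample text and vote counts are made up. The `processed_text` field is omitted because producing it requires CLASSLA-Stanza.

```python
import hashlib
import json
from datetime import date, timedelta

COLLECTION_DATE = date(2024, 9, 18)  # scrape date stated in the description

def normalise(text):
    """Replace two consecutive apostrophes (U+0027) with one quotation mark (U+0022)."""
    return text.replace("''", '"')

def estimated_date(position):
    """Date of the entry `position` places back from the newest (0 = collection day)."""
    return COLLECTION_DATE - timedelta(days=position)

def build_entry(original_text, publications):
    """Build one JSONL record; publications is a list of (position, votes_for, votes_against)."""
    return {
        # Assumption: the description only says "a hash", not which one.
        "hash": hashlib.sha256(original_text.encode("utf-8")).hexdigest(),
        "original_text": original_text,
        "normalised_text": normalise(original_text),
        "votes": [
            {"for": f, "against": a, "estimated_date": estimated_date(p).isoformat()}
            for p, f, a in publications
        ],
    }

# Made-up sample: one text published twice, 30 archive positions apart.
entry = build_entry("Janez pravi: ''Dober dan!''", [(0, 12, 3), (30, 7, 1)])
line = json.dumps(entry, ensure_ascii=False)  # one line of the JSONL file
print(line)
```

Keying duplicates on a hash of the exact original text matches the description's "exact text matches" rule, so two publications that differ by even one character would become separate entries.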
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books and is filtered to the book "Best teenage jokes". It has 7 columns, including book, author, ISBN, BNB id, and language. The data is ordered by publication date (descending).
This dataset was created by Kirill Ovcharenko
Programming Jokes Dataset
Dataset Summary
This dataset contains programming-related jokes scraped from the website Punny Funny. The jokes are organized into different categories based on the structure of the original webpage. The dataset is intended for use in natural language processing tasks, such as fine-tuning language models to generate humor or analyze textual content in the programming domain.
Number of jokes: 220
Usage
This dataset is… See the full description on the dataset page: https://huggingface.co/datasets/asfandyarazhar/programming-jokes-dataset.
rayhanti/jokes-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community