Copy of Kaggle dataset, adding to Huggingface for ease of use.
Description from Kaggle:
Context
Generating humor is a complex task in the domain of machine learning, and it requires the models to understand the deep semantic meaning of a joke in order to generate new ones. Such problems, however, are difficult to solve due to a number of reasons, one of which is the lack of a database that gives an elaborate list of jokes. Thus, a large corpus of over 0.2 million jokes has been collected by scraping several websites containing funny and short jokes.
Visit my GitHub repository for more information regarding the collection of the data and the scripts used.
Content
This dataset is a CSV file containing 231,657 jokes. Joke length ranges from 10 to 200 characters. Each line in the file contains a unique ID and a joke.
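As a minimal sketch of working with that layout, the following builds a tiny in-memory sample and applies the stated length filter. The column names ("ID", "Joke") are assumptions, since the description only says each line holds a unique ID and a joke.

```python
import csv
import io

# A tiny in-memory sample mimicking the described CSV layout; the column
# names ("ID", "Joke") are assumed, not confirmed by the description.
sample = io.StringIO(
    "ID,Joke\n"
    "1,Why did the chicken cross the road? To get to the other side.\n"
    "2,I told my computer a joke. It didn't laugh.\n"
)
rows = list(csv.DictReader(sample))

# The description states all jokes fall in the 10-200 character range.
jokes = [r["Joke"] for r in rows if 10 <= len(r["Joke"]) <= 200]
print(len(jokes))  # 2
```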
Disclaimer
An effort has been made to keep the jokes as clean as possible. Since the data was collected by scraping websites, a few jokes may still be inappropriate or offensive to some people.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for one-million-reddit-jokes
Dataset Summary
This corpus contains a million posts from /r/jokes. Posts are annotated with their score.
Languages
Mainly English.
Dataset Structure
Data Instances
A data point is a Reddit post.
Data Fields
- 'type': the type of the data point; can be 'post' or 'comment'.
- 'id': the base-36 Reddit ID of the data point; unique when combined with type.
- 'subreddit.id': the base-36 Reddit ID… See the full description on the dataset page: https://huggingface.co/datasets/SocialGrep/one-million-reddit-jokes.
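As a small sketch of how these fields might be used, the snippet below filters posts by their score annotation (mentioned in the summary). All rows and values here are invented for illustration; only the field names come from the description above.

```python
# Hypothetical rows mirroring the fields described above; the "score"
# field comes from the summary ("posts are annotated with their score"),
# and every value here is invented.
posts = [
    {"type": "post", "id": "abc123", "subreddit.id": "2qh72", "score": 512},
    {"type": "post", "id": "def456", "subreddit.id": "2qh72", "score": 3},
]

# Keep only well-received posts, using the score annotation.
popular = [p for p in posts if p["type"] == "post" and p["score"] >= 100]
print(len(popular))  # 1
```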
rayhanti/programming-jokes-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
My goal with this dataset is to create the largest and most organized dataset of jokes.
Tools for this dataset are on my GitHub.
Question-Answer Jokes by Jiri Roznovjak
Short Jokes by Abhinav Moudgil
Humor is one of the most difficult domains of natural language processing.
If you want to help rate the jokes based on funniness and/or vulgarity, download the .csv and make new column(s) with your rating(s). Email that to bfinan@iastate.edu, and I'll add your ratings as part of the dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This dataset is about books. It has 1 row and is filtered where the book is Joke-tionary jokes : more than 444 jokes for kids!. It features 7 columns including author, publication date, language, and book publisher.
The archived data consist of jokes, anecdotes and other humorous texts distributed through email messages. The researcher sent the request for email humour to the staff members of the Department of Sociology and Social Psychology at the University of Tampere, Finland, in February 2003. The staff members in their turn distributed the request further. Texts were received from university staff members and students as well as from outsiders. Data collection continued until 2004. The total number of email messages received was 217, some of which contained more than one joke or anecdote. The jokes/anecdotes were mostly in Finnish, but approximately 20% were in English. The themes of the email messages varied greatly. Many were connected to current events, for instance the Iraq war, the September 11 terrorist attacks in the USA, and the doping scandal of Finnish skiers in 2001. Other recurring themes included sexuality, gender and ethnicity stereotypes, and professional jokes. As is typical of email humour, the original creator of the jokes/anecdotes often remained unknown. The dataset is only available in the original languages.
https://choosealicense.com/licenses/gpl-2.0/
Short Jokes Punchline
This dataset contains information about jokes, visitors, labels, and label segments used in a joke labeling application. The data is stored in four CSV files: joke.csv, visitor.csv, label.csv, and label_segment.csv.
Files
joke.csv
This file contains 200 jokes randomly sampled from the Kaggle dataset "Short Jokes." Each row represents a joke with the following columns:
- id: the unique identifier for the joke.
- text: the text content of the… See the full description on the dataset page: https://huggingface.co/datasets/Timxjl/short-jokes-punchline.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset is a good way to practice text classification. Try to predict the theme of a joke from its text, or to predict which jokes are rated more highly.
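A toy sketch of the suggested exercise: predict a joke's theme from keyword overlap. The themes and keyword lists here are invented examples, not labels from the dataset.

```python
# Invented example themes and keywords for the classification exercise.
THEMES = {
    "animals": {"dog", "cat", "chicken"},
    "tech": {"computer", "programmer", "bug"},
}

def predict_theme(joke: str) -> str:
    words = set(joke.lower().split())
    # Pick the theme whose keyword set overlaps the joke's words the most.
    return max(THEMES, key=lambda t: len(THEMES[t] & words))

print(predict_theme("Why did the chicken cross the road?"))  # animals
```

A real solution would replace the keyword sets with a trained classifier, but the input/output shape stays the same.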
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This dataset is about books. It has 1 row and is filtered where the book is Jokes, jests and jollies. It features 7 columns including author, publication date, language, and book publisher.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
This is a corpus of 1915 "jokes of the day" ("šala dneva") published by the Slovenian news portal 24ur.com. The jokes were scraped from their archive on September 18th, 2024. The initial list is lightly curated: shorter texts found in the original collection were removed from the corpus, since they appear to be illustration captions without the accompanying illustrations.

Readers of the news portal vote on the jokes with thumbs-up and thumbs-down buttons; the voting results are included as metadata with each joke. Several jokes have been published more than once. Each joke (distinguished based on exact text matches) is identified by a hash of its text and carries a list of voting results for every instance of its publication.

The normalised_text field contains text with punctuation corrections. For now, this is limited to replacing '' (two consecutive apostrophes, U+0027) with " (a single straight/dumb/vertical quotation mark, U+0022); the former is consistently used in place of the latter in the original corpus.

Based on the name ("Šala dneva", i.e. "Joke of the day") and the observed frequency of posting during September 2024, we assume each entry corresponds to one day, counting backwards from the day of data collection. Each voting event has an associated estimated publication date calculated with this algorithm. The jokes are linguistically annotated with CLASSLA-Stanza (https://github.com/clarinsi/classla), using the models for standard Slovenian.
The JSONL file contains entries representing individual jokes, each containing:
- a hash of the original joke text used for duplicate identification (key: hash)
- original scraped text (key: original_text)
- normalised text (key: normalised_text)
- linguistically annotated normalised text in CoNLL-U format (key: processed_text)
- a list of vote objects containing joke vote metadata (key: votes)
  - votes for (key: votes.for)
  - votes against (key: votes.against)
- estimated dates of joke publication and voting (key: estimated_date)
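A minimal sketch of reading one such JSONL entry and computing a net score across all publications of a joke. All values are invented, and placing estimated_date inside each vote object is an assumption based on the per-vote dating described above.

```python
import json

# One invented entry following the described key layout; nesting
# estimated_date inside each vote object is an assumption.
entry = json.loads(json.dumps({
    "hash": "d41d8cd9",
    "original_text": "''Šala dneva''",
    "normalised_text": '"Šala dneva"',
    "processed_text": "# CoNLL-U annotation here",
    "votes": [
        {"for": 12, "against": 3, "estimated_date": "2024-09-18"},
        {"for": 7, "against": 1, "estimated_date": "2024-06-02"},
    ],
}))

# Net approval across every publication of this joke.
net = sum(v["for"] - v["against"] for v in entry["votes"])
print(net)  # 15
```

In a real pass over the file, each line would be fed to `json.loads` in turn.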
This dataset was created by Jacopo Le Pera
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This dataset is about books. It has 1 row and is filtered where the book is Dirty jokes every man should know. It features 7 columns including author, publication date, language, and book publisher.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
This dataset includes a collection of Hungarian Internet humour connected to the Covid-19 pandemic. The collection includes 344 items (mostly jokes and memes) that were collected online during the first wave of Covid in Hungary, between January and June 2020.
This data is collected from various sources. One file contains user IDs, joke IDs, and ratings; another file contains the jokes with their IDs.
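A small sketch of joining the two files on the joke ID. The column names ("user_id", "joke_id", "rating", "joke") are inferred from the description; the actual files may name them differently.

```python
import csv
import io

# In-memory stand-ins for the two described files; column names assumed.
ratings_csv = io.StringIO("user_id,joke_id,rating\n1,10,4.5\n2,10,3.0\n")
jokes_csv = io.StringIO("joke_id,joke\n10,Why do programmers prefer dark mode?\n")

# Index jokes by ID for the join.
joke_text = {row["joke_id"]: row["joke"] for row in csv.DictReader(jokes_csv)}

# Attach each rating row to its joke text via joke_id.
merged = [
    {**row, "joke": joke_text[row["joke_id"]]}
    for row in csv.DictReader(ratings_csv)
]
print(len(merged))  # 2
```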
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
kentfoong/jokes dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
This dataset was created by Avneet Singh
Released under Apache 2.0
This dataset was created by Kirill Ovcharenko
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Abstract: Jokes between countries are useful for revealing the ethnotypes existing in each of them; representing them in cartographic form makes it possible to perceive their distribution and the spatial projection of mockery: who are we laughing at, and who are the scapegoats for the inhabitants of each country? Based on the analysis of an ad hoc database covering more than 60% of the world's countries and territories and 90% of its population, the text shows that these jokes are social constructions, have a temporality, and fall basically into two categories: from top to bottom and from bottom to top.
This dataset contains the predicted prices of the asset "dad jokes" over the next 16 years. The data is initially calculated using a default 5 percent annual growth rate; after page load, a sliding-scale component lets the user adjust the growth rate to their own positive or negative projections. The maximum adjustable growth rate is 100 percent, and the minimum is -100 percent.
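The projection described above can be sketched as simple annual compounding. The starting price of 1.00 is an assumed placeholder; the page's actual base value is not stated in the description.

```python
# Sketch of the described projection; the 1.00 base price is assumed.
def project(price: float, rate: float, years: int) -> float:
    """Compound `price` annually at `rate` (0.05 = 5%), clamped to ±100%."""
    rate = max(-1.0, min(1.0, rate))  # the page limits adjustments to ±100%
    for _ in range(years):
        price *= 1 + rate
    return price

# Default 5% annual growth over the stated 16-year horizon.
print(round(project(1.00, 0.05, 16), 4))  # 2.1829
```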