Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for one-million-reddit-jokes
Dataset Summary
This corpus contains a million posts from /r/jokes. Posts are annotated with their score.
Languages
Mainly English.
Dataset Structure
Data Instances
A data point is a Reddit post.
Data Fields
'type': the type of the data point. Can be 'post' or 'comment'. 'id': the base-36 Reddit ID of the data point. Unique when combined with type. 'subreddit.id': the base-36 Reddit ID… See the full description on the dataset page: https://huggingface.co/datasets/SocialGrep/one-million-reddit-jokes.
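The base-36 IDs described above decode to plain integers with Python's built-in `int(s, 36)`; a minimal sketch (the sample IDs are made up for illustration):

```python
# Reddit stores object IDs as base-36 strings; the 'type' field above is
# kept separately, so the ID alone decodes directly.

def decode_base36(reddit_id: str) -> int:
    """Decode a base-36 Reddit ID (without any type prefix) to an integer."""
    return int(reddit_id, 36)

print(decode_base36("zz"))  # 35*36 + 35 = 1295
print(decode_base36("10"))  # 36
```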
Programming Jokes Dataset
Dataset Summary
This dataset contains programming-related jokes scraped from the website Punny Funny. The jokes are organized into different categories based on the structure of the original webpage. The dataset is intended for use in natural language processing tasks, such as fine-tuning language models to generate humor or analyze textual content in the programming domain. Number of jokes: 220
Usage
This dataset is suitable for… See the full description on the dataset page: https://huggingface.co/datasets/asfandyarazhar/programming-jokes-dataset.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Context Generating humor is a complex task in the domain of machine learning, and it requires the models to understand the deep semantic meaning of a joke in order to generate new ones. Such problems, however, are difficult to solve due to a number of reasons, one of which is the lack of a database that gives an elaborate list of jokes. Thus, a large corpus of over 0.2 million jokes has been collected by scraping several websites containing funny and short jokes. You can visit the Github… See the full description on the dataset page: https://huggingface.co/datasets/ysharma/short_jokes.
rayhanti/jokes-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 1 row and is filtered where the book is Joke-tionary jokes : more than 444 jokes for kids!. It features 7 columns including author, publication date, language, and book publisher.
The archived data consist of jokes, anecdotes and other humorous texts distributed through email messages. The researcher sent the request for email humour to the staff members of the Department of Sociology and Social Psychology at the University of Tampere, Finland, in February 2003. The staff members in turn distributed the request further. Texts were received from university staff members and students as well as from outsiders. Data collection continued until 2004. The total number of email messages received was 217, some of which contained more than one joke or anecdote. The jokes and anecdotes were mostly in Finnish, but approximately 20% were in English. The themes of the email messages varied greatly. Many were connected to current events, for instance the Iraq war, the September 11 terrorist attacks in the USA, and the doping scandal of Finnish skiers in 2001. Other recurring themes included sexuality, gender and ethnicity stereotypes, and professional jokes. As is typical of email humour, the original creator of the jokes and anecdotes often remained unknown. The dataset is only available in the original languages.
This dataset contains 38,269 jokes of the question-answer form, obtained from the r/Jokes subreddit. The dataset contains a csv file, where a row contains a question ("Why did the chicken cross the road"), the corresponding answer ("To get to the other side") and a unique ID.
The data spans from 2008 to the end of 2016. Entries with higher IDs were submitted earlier.
An example of what one might do with the data is build a sequence-to-sequence model where the input is a question and the output is an answer. Then, given a question, the model should generate a funny answer. This is what I did as the final project for my fall 2016 machine learning class. The project page can be viewed here.
Disclaimer: The dataset contains jokes that some may find inappropriate.
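As a sketch of working with such a file, the following loads (question, answer) pairs from CSV text. The column names ID, Question, and Answer are assumptions, since the description does not give the exact header:

```python
import csv
import io

# Hypothetical sample mirroring the described layout: an ID, a question,
# and the corresponding answer.
sample = """ID,Question,Answer
1,Why did the chicken cross the road?,To get to the other side.
"""

def load_pairs(csv_text):
    """Return (question, answer) pairs from the joke CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["Question"], row["Answer"]) for row in reader]

pairs = load_pairs(sample)
print(pairs[0][1])  # To get to the other side.
```

Each pair can then be fed to a sequence-to-sequence model as (input, target), as described above.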
Released under reddit's API terms
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The funniness of a joke is highly subjective. With ratings from more than 70,000 users, can an algorithm be written to identify the universally funny joke?
The dataset is associated with the research paper below.
Eigentaste: A Constant Time Collaborative Filtering Algorithm. Ken Goldberg, Theresa Roeder, Dhruv Gupta, and Chris Perkins. Information Retrieval, 4(2), 133-151. July 2001.
More information and datasets can be found at http://eigentaste.berkeley.edu/dataset/
Since funniness is highly subjective, it will be interesting to see whether data science can reveal what makes something funny.
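As a toy illustration of the rating-prediction setup (not the Eigentaste algorithm itself, which clusters users via principal component analysis), a mean-rating baseline over a made-up user-joke matrix might look like:

```python
# Toy user-by-joke rating matrix (all numbers hypothetical);
# absent keys mean the user has not rated that joke.
ratings = {
    "alice": {"j1": 4.0, "j2": -2.0},
    "bob":   {"j1": 3.0, "j3": 5.0},
    "carol": {"j2": -1.0, "j3": 4.0},
}

def predict(joke_id):
    """Predict a joke's rating as the mean of its observed ratings --
    a baseline, not Eigentaste's constant-time collaborative filtering."""
    scores = [r[joke_id] for r in ratings.values() if joke_id in r]
    return sum(scores) / len(scores)

print(predict("j1"))  # (4.0 + 3.0) / 2 = 3.5
```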
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 1 row and is filtered where the book is Jokes, jests and jollies. It features 7 columns including author, publication date, language, and book publisher.
This dataset was created by Jacopo Le Pera
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 1 row and is filtered where the book is Dirty jokes every man should know. It features 7 columns including author, publication date, language, and book publisher.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about book series. It has 1 row and is filtered where the books is Jokes and fun. It features 10 columns including number of authors, number of books, earliest publication date, and latest publication date.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 3 rows and is filtered where the book is Monster jokes. It features 7 columns including author, publication date, language, and book publisher.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Avneet Singh
Released under Apache 2.0
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This is a corpus of 1915 "jokes of the day" ("šala dneva") published by the Slovenian news portal 24ur.com. The jokes were scraped from their archive on September 18th, 2024. The initial list is lightly curated: shorter texts found in the original collection were removed from the corpus since they appear to be illustration captions without the accompanying illustrations.
Readers of the news portal vote on the jokes themselves with thumbs up and thumbs down buttons. The voting results are included as metadata with each joke. Several jokes have been published more than once. Each joke (distinguished based on exact text matches) is identified by a hash of its text and presents a list of voting results for every instance of its publication. The normalised_text field contains text with punctuation corrections. For now, this is limited to replacing '' (two consecutive apostrophes U+0027) with " (a single straight/dumb/vertical quotation mark U+0022). The former (two apostrophes) is consistently used in place of the latter in the original corpus.
Based on the name ("Šala dneva", i.e. "Joke of the day") and the observed posting frequency during September 2024, we assume each entry corresponds to one day, counting backwards from the day of data collection. Each voting event has an associated estimated publication date calculated with this assumption.
The jokes are linguistically annotated with CLASSLA-Stanza (https://github.com/clarinsi/classla), using the models for standard Slovenian. The JSONL file contains one entry per joke, with:
- a hash of the original joke text used for duplicate identification (key: hash)
- the original scraped text (key: original_text)
- the normalised text (key: normalised_text)
- the linguistically annotated normalised text in CoNLL-U format (key: processed_text)
- a list of vote objects containing joke vote metadata (key: votes)
  - votes for (key: votes.for)
  - votes against (key: votes.against)
- estimated dates of joke publication and voting (key: estimated_date)
The corpus contains 16658 sentences, 129063 tokens, and 662 recognised named entities.
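A minimal sketch of reading one such JSONL entry and applying the documented apostrophe normalisation. The sample entry is fabricated, and the exact shape of the vote objects (keys for/against) is an assumption based on the votes.for/votes.against naming:

```python
import json

# Fabricated entry using the documented keys; the vote-object layout is assumed.
line = json.dumps({
    "hash": "abc123",
    "original_text": "''Joke of the day''",
    "votes": [{"for": 10, "against": 2}, {"for": 3, "against": 1}],
})

entry = json.loads(line)

# Normalisation rule from the description: two consecutive apostrophes (U+0027)
# are replaced with one straight double quotation mark (U+0022).
normalised = entry["original_text"].replace("''", '"')

# Sum the votes across every publication of this joke.
total_for = sum(v["for"] for v in entry["votes"])
total_against = sum(v["against"] for v in entry["votes"])

print(normalised)                # "Joke of the day"
print(total_for, total_against)  # 13 3
```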
GNU Free Documentation License 1.3: http://www.gnu.org/licenses/fdl-1.3.html
Generated by ChatGPT, this dataset is a compilation of humorous jokes and their corresponding responses. With a wide range of topics, from animals to everyday situations, these light-hearted interactions showcase the playful side of AI. Whether you're in need of a quick chuckle or a mood boost, let ChatGPT's witty banter provide the laughter you seek.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for Bangla Jokes Dataset
The Bangla Jokes Dataset is a collection of humorous text samples written in Bengali (Bangla). This dataset is intended for NLP research and model training, especially in the area of Bangla-language humor generation, sentiment, or cultural studies. It is one of the first attempts to gather a sizable dataset of jokes in Bangla for open-source use.
Dataset Details
Curated by: Adnan1837
Funded by: No one
Shared by: Adnan, Md.… See the full description on the dataset page: https://huggingface.co/datasets/adnan1837/Bangla_jokes.