MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Context Generating humor is a complex task in the domain of machine learning, and it requires the models to understand the deep semantic meaning of a joke in order to generate new ones. Such problems, however, are difficult to solve due to a number of reasons, one of which is the lack of a database that gives an elaborate list of jokes. Thus, a large corpus of over 0.2 million jokes has been collected by scraping several websites containing funny and short jokes. You can visit the Github… See the full description on the dataset page: https://huggingface.co/datasets/ysharma/short_jokes.
https://creativecommons.org/publicdomain/zero/1.0/
A web-scraped database of dad jokes: signature one-liners that a dad might say and chuckle at by himself while the rest of the family facepalms!
The dataset was created by collecting one-liner dad jokes from icanhazdadjokes.
Future work includes cleaning Reddit data and extracting jokes from popular books published in this genre.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for Dataset Name
Dataset Summary
A corpus for testing whether your LLM can explain a joke well. It is a rather small dataset; if someone can point to a larger one, that would be very nice.
Languages
English
Dataset Structure
Data Fields
url : link to the explanation
joke : the original joke
explaination : the explanation of the joke
Data Splits
Since it's so small, there are no splits, just like gsm8k.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
This dataset was developed for the CELSA research project 'Humour and Conflict in the Public Sphere: Communication styles, humour controversies and contested freedoms in contemporary Europe'. The project sets out to conduct an interdisciplinary analysis of the interrelatedness between digital humor and social conflict. The dataset contains data for 550 items of digitally mediated humor (e.g. online memes, cartoons, videos, posts) created in the context of specific cases of socio-political conflict in four European countries (i.e. Belgium, Belarus, Estonia and Poland). The dataset offers coding of linguistic markers such as genre, humor mechanisms and communication style, as well as a mapping of the discourse which the humorous items spark on social media. Here, comments made in reaction to the humor on social media platforms are coded for types of response (e.g. positive, negative, humorous, non-humorous) as well as the incidence of meta-comments (comments on comments) and other linguistic metrics for analysis (e.g. types of speech used in audience reactions). The data was coded independently by four researchers, each with a background in the respective country, in 2023-2024. This dataset can be used, for example, to analyse audience reception of digitally mediated humor, or to allow the (cross-national) analysis of the impact of different humorous genres in digital public spheres.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This is a corpus of 1915 "jokes of the day" ("šala dneva") published by the Slovenian news portal 24ur.com. The jokes were scraped from their archive on September 18th, 2024. The initial list is lightly curated: shorter texts found in the original collection were removed from the corpus since they appear to be illustration captions without the accompanying illustrations.
Readers of the news portal vote on the jokes themselves with thumbs up and thumbs down buttons. The voting results are included as metadata with each joke. Several jokes have been published more than once. Each joke (distinguished based on exact text matches) is identified by a hash of its text and presents a list of voting results for every instance of its publication. The normalised_text field contains text with punctuation corrections. For now, this is limited to replacing '' (two consecutive apostrophes U+0027) with " (a single straight/dumb/vertical quotation mark U+0022). The former (two apostrophes) is consistently used in place of the latter in the original corpus.
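The normalisation and duplicate-identification steps described above can be sketched in a few lines. This is a minimal illustration, not the corpus authors' code: the function names are hypothetical, the example string is invented, and the choice of SHA-256 is an assumption, since the description does not say which hash function is used.

```python
import hashlib


def normalise_text(text: str) -> str:
    """Replace two consecutive apostrophes (U+0027 U+0027) with one quotation mark (U+0022)."""
    return text.replace("''", '"')


def joke_hash(original_text: str) -> str:
    """Identify a joke by a hash of its exact original text.

    Assumption: SHA-256 over the UTF-8 bytes; the corpus does not name its hash function.
    """
    return hashlib.sha256(original_text.encode("utf-8")).hexdigest()


# Invented example text, used only to show the replacement.
original = "''Zakaj?'' je vprašal."
print(normalise_text(original))
print(joke_hash(original))
```

Because the hash is computed over the exact text, two publications of the same joke collapse into one entry only when their texts match character for character.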
Based on the name ("Šala dneva", i.e. "Joke of the day") and the observed frequency of posting during September 2024, we assume each entry corresponds to one day, counting backwards from the day of data collection. Each voting event has an associated estimated publication date calculated with the above algorithm.
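The date estimate amounts to subtracting one day per archive entry from the collection date. The sketch below assumes that the newest entry (position 0) was published on the collection day itself; the description does not state where the count starts, and the function name is hypothetical.

```python
from datetime import date, timedelta

# Collection date stated in the corpus description.
COLLECTION_DATE = date(2024, 9, 18)


def estimate_publication_date(position: int) -> date:
    """Estimate the publication date of the entry at a given archive position.

    Assumption: position 0 is the newest entry, published on the collection day,
    and every further entry is exactly one day older.
    """
    if position < 0:
        raise ValueError("position must be non-negative")
    return COLLECTION_DATE - timedelta(days=position)


print(estimate_publication_date(0))
print(estimate_publication_date(18))
```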
The jokes are linguistically annotated with CLASSLA-Stanza (https://github.com/clarinsi/classla), using the models for standard Slovenian. The JSONL file contains entries representing individual jokes containing:
- a hash of the original joke text used for duplicate identification (key: hash)
- original scraped text (key: original_text)
- normalised text (key: normalised_text)
- linguistically annotated normalised text in CoNLL-U format (key: processed_text)
- a list of vote objects containing joke vote metadata (key: votes)
- votes for (key: votes.for)
- votes against (key: votes.against)
- estimated dates of joke publication and voting (key: estimated_date)
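A record with the keys listed above could be read as follows. This is a sketch under stated assumptions: the sample record is invented, and it assumes that each vote object nests the keys "for", "against" and "estimated_date", which the description suggests but does not spell out. The helper approval_ratio is hypothetical.

```python
import json

# Invented record: key names follow the description, all values are made up.
sample_line = json.dumps({
    "hash": "0000",
    "original_text": "''Primer.''",
    "normalised_text": "\"Primer.\"",
    "processed_text": "# text = \"Primer.\"",
    "votes": [
        # Assumption: one object per publication, holding its own counts and date.
        {"for": 12, "against": 3, "estimated_date": "2024-09-18"},
        {"for": 5, "against": 5, "estimated_date": "2024-01-02"},
    ],
})


def approval_ratio(record: dict) -> float:
    """Share of thumbs-up votes across all publications of one joke."""
    total_for = sum(vote["for"] for vote in record["votes"])
    total_against = sum(vote["against"] for vote in record["votes"])
    total = total_for + total_against
    return total_for / total if total else 0.0


# Each line of a JSONL file is one JSON document, so it is parsed on its own.
record = json.loads(sample_line)
print(record["hash"], len(record["votes"]), approval_ratio(record))
```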
The corpus contains 16658 sentences, 129063 tokens, and 662 recognised named entities.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
This dataset was developed for the research project “Humour Scandals (Hu-Sca): A cross-national analysis of humour controversies in Europe”, funded by KU Leuven, Una Europa Research Acceleration Fund. The data contains an overview of humor scandals, i.e. public controversies originating from humor and dealing with the boundaries of transgressive humor in public debate, and their reception in legacy media for eight European countries between 1990 and 2022. The data contains quantitatively coded descriptive markers of each humor scandal (e.g. nature of norm transgression, actors involved, duration, timespan) as well as a qualitative analysis of the way the humor scandal was either justified or condemned in national legacy media. This data can be used for the analysis of the role of humor in socio-political conflict and the role of media in the creation and mediation of humor-related controversies.
The archived data consist of jokes, anecdotes and other humorous texts distributed through email messages. The researcher sent the request for email humour to the staff members of the Department of Sociology and Social Psychology at the University of Tampere, Finland, in February 2003. The staff members in their turn distributed the request further. Texts were received from university staff members and students as well as from outsiders. Data collection continued till the year 2004. The total number of email messages received was 217, some of which contained more than one joke or anecdote. The jokes/anecdotes were mostly in Finnish, but approximately 20% were in English. The themes of the email messages varied greatly. Many were connected to current events, for instance, the Iraq war, September 11 terrorist attacks in the USA, and the doping scandal of Finnish skiers in 2001. Other recurring themes included sexuality, gender and ethnicity stereotypes, and professional jokes. As is typical in email humor, the original creator of the jokes/anecdotes often remained unknown. The dataset is only available in the original languages.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for one-million-reddit-jokes
Dataset Summary
This corpus contains a million posts from /r/jokes. Posts are annotated with their score.
Languages
Mainly English.
Dataset Structure
Data Instances
A data point is a Reddit post.
Data Fields
'type': the type of the data point. Can be 'post' or 'comment'. 'id': the base-36 Reddit ID of the data point. Unique when combined with type. 'subreddit.id': the base-36 Reddit ID… See the full description on the dataset page: https://huggingface.co/datasets/SocialGrep/one-million-reddit-jokes.
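Base-36 IDs such as the ones described above use the digits 0-9 followed by the letters a-z. A minimal sketch of converting between such an ID and an integer is shown below; the function names are hypothetical and the sample values are arbitrary, not taken from the dataset.

```python
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def reddit_id_to_int(base36_id: str) -> int:
    """Decode a base-36 ID string into an integer."""
    return int(base36_id, 36)


def int_to_reddit_id(number: int) -> str:
    """Encode a non-negative integer as a lowercase base-36 string."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return "0"
    out = []
    while number > 0:
        number, remainder = divmod(number, 36)
        out.append(DIGITS[remainder])
    return "".join(reversed(out))


print(reddit_id_to_int("10"))
print(int_to_reddit_id(35))
```

Since an ID is only unique when combined with its type, a (type, id) pair is the safer dictionary key when posts and comments are mixed.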
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The funniness of a joke is very subjective. With more than 70,000 users rating jokes, can an algorithm be written to identify the universally funny joke?
The dataset is associated with the research paper below.
Eigentaste: A Constant Time Collaborative Filtering Algorithm. Ken Goldberg, Theresa Roeder, Dhruv Gupta, and Chris Perkins. Information Retrieval, 4(2), 133-151. July 2001.
More information and datasets can be found at http://eigentaste.berkeley.edu/dataset/
Since funniness is a very subjective matter, it will be very interesting to see if data science can bring out the details on what makes something funny.
Statistical distribution of social media creators and influencers in the Humor category
Subscribers can find out export and import data of 23 countries by HS code or product’s name. This demo is helpful for market analysis.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the task dataset for SemEval-2020 Task 7: Assessing Humor in Edited News Headlines.
The task’s dataset contains news headlines in which short edits were applied to make them funny, and the funniness of these edited headlines was rated using crowdsourcing. This task includes two subtasks, the first of which is to estimate the funniness of headlines on a humor scale in the interval 0-3. The second subtask is to predict, for a pair of edited versions of the same original headline, which is the funnier version.
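The two subtasks can be illustrated with a small scoring sketch. Everything here is an assumption made for illustration: the description above does not name the evaluation metrics, so root-mean-square error is used as a common choice for a regression on a 0-3 scale, and the label encoding for the pairwise subtask (1, 2, or 0 for a tie) is hypothetical.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between predicted and rated funniness (0-3 scale)."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


def funnier_label(grade_1: float, grade_2: float) -> int:
    """Label a pair of edited headlines by which one has the higher rating.

    Assumed encoding: 1 = first is funnier, 2 = second is funnier, 0 = tie.
    """
    if grade_1 > grade_2:
        return 1
    if grade_2 > grade_1:
        return 2
    return 0


# Invented ratings, used only to exercise the two functions.
print(rmse([1.0, 2.0], [1.0, 2.0]))
print(funnier_label(1.2, 0.8))
```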
CodaLab page hosting the competition: https://competitions.codalab.org/competitions/20970
Starter Github code (scripts for running baseline and evaluation): https://github.com/n-hossain/semeval-2020-task-7-humicroedit
Task mailing list:
Folders:
- subtask-1: Dataset for the funniness regression subtask.
- subtask-2: Dataset for the "Funnier of the Two" classification subtask.
Files:
- {train, dev, test}.csv: the task's dataset including labels
- train_funlines.csv: additional training data gathered from the FunLines competition (https://funlines.co)
- baseline.zip: contains a csv file which is the output of the BASELINE system. This is a template of the output format that can be submitted to CodaLab for scoring.
Reference
Please cite the task paper when using this dataset:
Nabil Hossain, John Krumm, Michael Gamon and Henry Kautz. 2020. Semeval-2020 Task 7: Assessing Humor in Edited News Headlines. In Proceedings of International Workshop on Semantic Evaluation (SemEval-2020).
BIBTEX:
@InProceedings{hossainSemEval2020Task7,
  author = {Hossain, Nabil and Krumm, John and Gamon, Michael and Kautz, Henry},
  title = {SemEval-2020 {T}ask 7: {A}ssessing Humor in Edited News Headlines},
  booktitle = {Proceedings of the 14th International Workshop on Semantic Evaluation ({S}em{E}val-2020)},
  address = {Barcelona, Spain},
  year = {2020}
}
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
United States Imports: Festve, excl Chrtms;Crnivl Magic Trk Joke Art;Pt& Accs data was reported at 98.281 USD mn in Jan 2025, an increase from the previous figure of 80.876 USD mn for Dec 2024. The series is updated monthly, averaging 52.122 USD mn (median) from Jan 2002 to Jan 2025, with 277 observations. The data reached an all-time high of 418.819 USD mn in Jul 2022 and a record low of 10.093 USD mn in Mar 2002. The series remains in active status in CEIC and is reported by the U.S. Census Bureau. The data is categorized under Global Database’s United States – Table US.JA136: Imports: by Commodity: 6 Digit HS Code: HS 85 to 99.
Quantitative data concerning self-repair jokes in the comedies by Plautus.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
United States Exports: Festve, excl Chrtms;Crnivl Magic Trk Joke Art;Pt&Accs data was reported at 8.068 USD mn in Jan 2025, an increase from the previous figure of 6.434 USD mn for Dec 2024. The series is updated monthly, averaging 8.757 USD mn (median) from Jan 2002 to Jan 2025, with 277 observations. The data reached an all-time high of 27.291 USD mn in Sep 2011 and a record low of 2.349 USD mn in Jan 2003. The series remains in active status in CEIC and is reported by the U.S. Census Bureau. The data is categorized under Global Database’s United States – Table US.JA027: Exports: by Commodity: 6 Digit HS Code: HS 85 to 98.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Objectives: To gather pilot data on the use of humor in the raising of children.
Methodology: We developed and field-tested a 10-item survey to measure people’s experiences being raised with humor, and their views regarding humor as a parenting tool. Responses were aggregated into Disagree, Indeterminate, and Agree, and analyzed using standard statistical methods.
Results: Of the 312 respondents, most identified as male (63.6%) and white (76.6%); and 11.3% reported being 18-25 years old, 49.4% 26-35 years old, and 39.4% 36-45 years old. The majority reported that: the people who raised them used humor in their parenting (55.2%); humor could be an effective parenting tool (71.8%); humor as a parenting tool has more potential benefit than harm (63.3%); they either use or plan to use humor in parenting their own children (61.8%); and they would value a course on how to utilize humor in parenting (69.7%).
Conclusions: In this pilot study, respondents of child-bearing/rearing age reported positive views about humor as a parenting tool.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
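The aggregation step in the methodology can be sketched as follows. The study does not describe its response scale, so this example assumes a 5-point scale in which 1-2 count as Disagree, 3 as Indeterminate, and 4-5 as Agree; the function names and the sample responses are invented for illustration.

```python
from collections import Counter

CATEGORIES = ("Disagree", "Indeterminate", "Agree")


def aggregate(response: int) -> str:
    """Map one survey response to an aggregated category.

    Assumption: a 5-point scale (1-2 Disagree, 3 Indeterminate, 4-5 Agree).
    """
    if response not in (1, 2, 3, 4, 5):
        raise ValueError("response must be an integer from 1 to 5")
    if response <= 2:
        return "Disagree"
    if response == 3:
        return "Indeterminate"
    return "Agree"


def category_shares(responses):
    """Percentage of responses falling into each aggregated category."""
    counts = Counter(aggregate(r) for r in responses)
    total = len(responses)
    return {label: 100 * counts[label] / total for label in CATEGORIES}


# Invented responses to a single survey item.
print(category_shares([1, 3, 5, 5]))
```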
License information was derived automatically
This document outlines the scope and key features of the 'Humour Practices and European Attitudes Toward Democracy' dataset, including the procedure followed for its production.
The database collects and classifies relevant studies on humour practices, attitudes toward democracy, and modes of civic engagement in the six countries of the DELIAH consortium: Belgium, Estonia, Germany, the Netherlands, Slovakia, and Spain.
The dataset pursues two goals:
1) it contributes to a subsequent meta-analysis of humour studies, democratic participation, and civic engagement in online and offline spaces across Europe, which will be carried out by the DELIAH consortium, and
2) it serves as a collective resource for additional DELIAH project tasks, including the design of focus groups and surveys.
More broadly, the dataset has also been designed to appeal to scholars outside of the DELIAH consortium who work at the intersection of humour and democracy.
Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
This dataset was created by Nikuson
Released under ODC Attribution License (ODC-By)
RwanAshraf/humor-labeled-data dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These are the data sets for a study on audience segmentation to predict receptivity to humorous persuasive messages