100+ datasets found
  1. short-jokes

    • huggingface.co
    • kaggle.com
    Updated Mar 9, 2021
    Cite
    Fraser Greenlee (2021). short-jokes [Dataset]. https://huggingface.co/datasets/Fraser/short-jokes
    Dataset updated
    Mar 9, 2021
    Authors
    Fraser Greenlee
    Description

    Copy of Kaggle dataset, adding to Huggingface for ease of use.

    Description from Kaggle:

    Context

    Generating humor is a complex task in the domain of machine learning, and it requires the models to understand the deep semantic meaning of a joke in order to generate new ones. Such problems, however, are difficult to solve due to a number of reasons, one of which is the lack of a database that gives an elaborate list of jokes. Thus, a large corpus of over 0.2 million jokes has been collected by scraping several websites containing funny and short jokes.

    Visit my Github repository for more information regarding collection of data and the scripts used.

    Content

    This dataset is in the form of a csv file containing 231,657 jokes. Length of jokes ranges from 10 to 200 characters. Each line in the file contains a unique ID and joke.
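The layout described above (one unique ID and one joke per line, jokes of 10 to 200 characters) could be checked with a short script. This is only a sketch: the header names `ID` and `Joke` and the two sample rows are assumptions made for illustration, not values taken from the actual file.

```python
import csv
import io

# Hypothetical sample in the described layout; header names are assumed.
SAMPLE = (
    "ID,Joke\n"
    "1,Why did the chicken cross the road? To get to the other side.\n"
    "2,I told my computer a joke. It did not laugh.\n"
)


def load_jokes(handle):
    """Return a dict mapping each ID to its joke, rejecting duplicate IDs."""
    jokes = {}
    for row in csv.DictReader(handle):
        joke_id = int(row["ID"])
        if joke_id in jokes:
            raise ValueError(f"duplicate ID: {joke_id}")
        jokes[joke_id] = row["Joke"]
    return jokes


def out_of_range(jokes, low=10, high=200):
    """Return IDs whose joke length falls outside the stated 10 to 200 characters."""
    return [i for i, text in jokes.items() if not low <= len(text) <= high]


jokes = load_jokes(io.StringIO(SAMPLE))
print(len(jokes), out_of_range(jokes))
```

For the real file, `io.StringIO(SAMPLE)` would be replaced by an open file handle, and the column names adjusted to whatever the header actually contains.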

    Disclaimer

    An attempt has been made to keep the jokes as clean as possible. Because the data was collected by scraping websites, a few jokes may be inappropriate or offensive to some people.

  2. dadjokes

    • huggingface.co
    Updated Oct 11, 2023
    Cite
    Roman Grebennikov (2023). dadjokes [Dataset]. https://huggingface.co/datasets/shuttie/dadjokes
    Dataset updated
    Oct 11, 2023
    Authors
    Roman Grebennikov
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dad Jokes dataset

    This dataset is generated from the Kaggle Reddit Dad Jokes by Oktay Ozturk, with the following modifications:

    • Only jokes with 5+ votes were sampled. Less upvoted jokes are too cringe.
    • With a set of heuristics, each joke was split into two parts: base and the punchline.

    Format

    The dataset is formatted as a CSV, and is split into train/test parts:

    • train: 52000 samples
    • test: 1400 samples

    "question","response" "I asked my priest how he gets holy… See the full description on the dataset page: https://huggingface.co/datasets/shuttie/dadjokes.
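A quoted two-column CSV like the one described can be read with Python's standard `csv` module. The sketch below uses placeholder rows rather than real dataset content, and `keep_upvoted` is a hypothetical helper that only illustrates the "5+ votes" threshold; the actual splitting heuristics and the source column names are not given in the description.

```python
import csv
import io

# Placeholder rows in the "question","response" layout (not real dataset content).
SAMPLE = (
    '"question","response"\n'
    '"Example setup one?","Example punchline one."\n'
    '"Example setup two?","Example punchline two."\n'
)


def read_pairs(handle):
    """Return a list of (question, response) tuples from a CSV handle."""
    return [(row["question"], row["response"]) for row in csv.DictReader(handle)]


def keep_upvoted(jokes, min_votes=5):
    """Hypothetical filter: keep (text, votes) pairs with at least min_votes votes."""
    return [(text, votes) for text, votes in jokes if votes >= min_votes]


pairs = read_pairs(io.StringIO(SAMPLE))
print(pairs[0])
```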

  3. Jester Jokes Dataset v4

    • kaggle.com
    zip
    Updated Jun 22, 2023
    Cite
    Mohamed Amine DHIAB (2023). Jester Jokes Dataset v4 [Dataset]. https://www.kaggle.com/datasets/mohamedaminedhiab/jester-jokes-dataset-v4
    Available download formats: zip (1419440 bytes)
    Dataset updated
    Jun 22, 2023
    Authors
    Mohamed Amine DHIAB
    Description

    Dataset

    This dataset was created by Mohamed Amine DHIAB

    Contents

  4. short_jokes

    • huggingface.co
    Updated Feb 22, 2024
    + more versions
    Cite
    yuvraj sharma (2024). short_jokes [Dataset]. https://huggingface.co/datasets/ysharma/short_jokes
    Dataset updated
    Feb 22, 2024
    Authors
    yuvraj sharma
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Context

    Generating humor is a complex task in the domain of machine learning, and it requires the models to understand the deep semantic meaning of a joke in order to generate new ones. Such problems, however, are difficult to solve due to a number of reasons, one of which is the lack of a database that gives an elaborate list of jokes. Thus, a large corpus of over 0.2 million jokes has been collected by scraping several websites containing funny and short jokes. You can visit the Github… See the full description on the dataset page: https://huggingface.co/datasets/ysharma/short_jokes.

  5. Jester (Jokes) Dataset

    • paperswithcode.com
    + more versions
    Cite
    Kenneth Y. Goldberg; Theresa Roeder; Dhruv Gupta; Chris Perkins, Jester (Jokes) Dataset [Dataset]. https://paperswithcode.com/dataset/jester
    Authors
    Kenneth Y. Goldberg; Theresa Roeder; Dhruv Gupta; Chris Perkins
    Description

    6.5 million anonymous ratings of jokes by users of the Jester Joke Recommender System.

  6. jokes

    • huggingface.co
    Updated Apr 6, 2023
    Cite
    Kid (2023). jokes [Dataset]. https://huggingface.co/datasets/Amirkid/jokes
    Dataset updated
    Apr 6, 2023
    Authors
    Kid
    License

    https://choosealicense.com/licenses/creativeml-openrail-m/

    Description

    Amirkid/jokes dataset hosted on Hugging Face and contributed by the HF Datasets community

  7. 100500+ Reddit jokes

    • kaggle.com
    Updated Jan 17, 2020
    Cite
    Sergei Averkiev (2020). 100500+ Reddit jokes [Dataset]. https://www.kaggle.com/averkij/reddit-jokes-dataset/metadata
    Dataset updated
    Jan 17, 2020
    Dataset provided by
    Kaggle
    Authors
    Sergei Averkiev
    Description

    Introduction

    Dataset full of jokes ranked by score.

    Content

    Some examples:

    — Why did the invisible man quit his new job? — He just couldn't see himself doing it.

    — A man sees a pregnant woman laughing — He asks the woman, she replies "Nothing, it's an inside joke!"

    Sources

    https://github.com/taivop/joke-dataset

    Author

    Pungas, Taivo

  8. jokes dataset

    • kaggle.com
    zip
    Updated Jan 18, 2022
    Cite
    Yaroslav-siberia (2022). jokes dataset [Dataset]. https://www.kaggle.com/datasets/yaroslav62/jokes-dataset/data
    Available download formats: zip (7775133 bytes)
    Dataset updated
    Jan 18, 2022
    Authors
    Yaroslav-siberia
    Description

    Dataset

    This dataset was created by Yaroslav-siberia

    Contents

  9. Joke Dataset

    • kaggle.com
    Updated Feb 10, 2018
    Cite
    Brendan Finan (2018). Joke Dataset [Dataset]. https://www.kaggle.com/datasets/bfinan/jokes-question-and-answer
    Dataset updated
    Feb 10, 2018
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Brendan Finan
    License

    http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    Context

    My goal with this dataset is to create the largest and most organized dataset of jokes.

    Tools for this dataset are on my Github

    Content

    • Jokes reduced to only the Question and the Answer.
    • Duplicates NOT removed
    • Offensive jokes NOT removed

    Acknowledgements

    Question-Answer Jokes by Jiri Roznovjak

    Short Jokes by Abhinav Moudgil

    Inspiration

    Humor is one of the most difficult domains of natural language processing.

    Contribute

    If you want to help rate the jokes based on funniness and/or vulgarity, download the .csv and make new column(s) with your rating(s). Email that to bfinan@iastate.edu, and I'll add your ratings as part of the dataset.

  10. hailuo-ai-jokes

    • huggingface.co
    Updated Feb 16, 2025
    + more versions
    Cite
    Christian Peterson (2025). hailuo-ai-jokes [Dataset]. https://huggingface.co/datasets/unlimitedbytes/hailuo-ai-jokes
    Dataset updated
    Feb 16, 2025
    Authors
    Christian Peterson
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Hailuo AI Jokes Dataset 🎤

    A curated collection of high-quality voice recordings with corresponding transcriptions and phoneme analysis. This dataset is designed for speech recognition, text-to-speech, and voice analysis tasks.

    🎙️ Dataset Content

    The dataset contains a diverse set of synthetic voice recordings generated by Hailuo AI Audio. The texts are sourced from a variety of public domain jokes and humorous anecdotes. Each audio sample is accompanied by the… See the full description on the dataset page: https://huggingface.co/datasets/unlimitedbytes/hailuo-ai-jokes.

  11. Jester 1.7M jokes ratings dataset

    • kaggle.com
    Updated Nov 10, 2019
    Cite
    vikashrajluhaniwal (2019). Jester 1.7M jokes ratings dataset [Dataset]. https://www.kaggle.com/vikashrajluhaniwal/jester-17m-jokes-ratings-dataset/metadata
    Dataset updated
    Nov 10, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    vikashrajluhaniwal
    Description

    Context

    Can an automated system recommend a funny joke? Jester is an online joke recommender system developed by Ken Goldberg and the team at UC Berkeley. Users are presented jokes through an HTML client interface and allowed to rate jokes. Once a user rates all jokes in the gauge set, the system recommends new jokes to the user.

    Content

    • The dataset contains over 1.7 million continuous ratings (-10.00 to +10.00) of 150 jokes from 59,132 users.
    • The dataset was collected between November 2006 and May 2009.
    • The complete dataset has two CSV files:
      • jester_ratings.csv: Each row is formatted as [User ID] [Item ID] [Rating]
      • jester_items.csv: Maps item IDs to jokes
    • The ratings are real values ranging from -10.00 to +10.00.
    • As of May 2009, the jokes {7, 8, 13, 15, 16, 17, 18, 19} are the "gauge set".
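To make the ratings layout concrete, here is a minimal Python sketch that loads [User ID] [Item ID] [Rating] rows, checks the stated -10.00 to +10.00 range, averages ratings per joke, and tests whether a user has rated the whole gauge set. The header names (`user_id`, `item_id`, `rating`), the comma delimiter, and the sample rows are assumptions for illustration; the real jester_ratings.csv may be laid out differently.

```python
import csv
import io
from collections import defaultdict

# Gauge set as listed in the dataset description.
GAUGE_SET = {7, 8, 13, 15, 16, 17, 18, 19}

# Hypothetical rows; header names and delimiter are assumed.
SAMPLE = "user_id,item_id,rating\n1,7,-9.28\n1,8,3.50\n2,7,4.75\n"


def load_ratings(handle):
    """Return {user_id: {item_id: rating}}, validating the stated rating range."""
    ratings = defaultdict(dict)
    for row in csv.DictReader(handle):
        rating = float(row["rating"])
        if not -10.0 <= rating <= 10.0:
            raise ValueError(f"rating out of range: {rating}")
        ratings[int(row["user_id"])][int(row["item_id"])] = rating
    return ratings


def mean_rating_per_item(ratings):
    """Return {item_id: mean rating} across all users."""
    by_item = defaultdict(list)
    for user_ratings in ratings.values():
        for item_id, rating in user_ratings.items():
            by_item[item_id].append(rating)
    return {item_id: sum(vals) / len(vals) for item_id, vals in by_item.items()}


def finished_gauge_set(user_ratings):
    """True if the user has rated every joke in the gauge set."""
    return GAUGE_SET.issubset(user_ratings)


ratings = load_ratings(io.StringIO(SAMPLE))
means = mean_rating_per_item(ratings)
print(means, finished_gauge_set(ratings[1]))
```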

    Acknowledgements

    Inspiration

    Refer to the research paper below for more ideas about the usefulness of the dataset.

    Eigentaste: A Constant Time Collaborative Filtering Algorithm

  12. Russian Jokes

    • kaggle.com
    Updated Nov 6, 2021
    Cite
    Konstantin Albul (2021). Russian Jokes [Dataset]. https://www.kaggle.com/konstantinalbul/russian-jokes/discussion
    Dataset updated
    Nov 6, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Konstantin Albul
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Russia
    Description

    Context

    This dataset is a good way to practice text classification. Try to predict the theme of a joke from its text, or to identify the more highly rated joke.

    Links

  13. Email Jokes 1998-2004

    • services.fsd.tuni.fi
    • datacatalogue.cessda.eu
    zip
    Updated Jan 9, 2025
    Cite
    Aro, Jari (2025). Email Jokes 1998-2004 [Dataset]. http://doi.org/10.60686/t-fsd1271
    Available download formats: zip
    Dataset updated
    Jan 9, 2025
    Dataset provided by
    Finnish Social Science Data Archive
    Authors
    Aro, Jari
    Description

    The archived data consist of jokes, anecdotes and other humorous texts distributed through email messages. The researcher sent the request for email humour to the staff members of the Department of Sociology and Social Psychology at the University of Tampere, Finland, in February 2003. The staff members in their turn distributed the request further. Texts were received from university staff members and students as well as from outsiders. Data collection continued till the year 2004. The total number of email messages received was 217, some of which contained more than one joke or anecdote. The jokes/anecdotes were mostly in Finnish, but approximately 20% were in English. The themes of the email messages varied greatly. Many were connected to current events, for instance, the Iraq war, September 11 terrorist attacks in the USA, and the doping scandal of Finnish skiers in 2001. Other recurring themes included sexuality, gender and ethnicity stereotypes, and professional jokes. As is typical in email humor, the original creator of the jokes/anecdotes often remained unknown. The dataset is only available in the original languages.

  14. Jester Jokes Dataset

    • kaggle.com
    Updated May 23, 2019
    Cite
    Sameer Dev (2019). Jester Jokes Dataset [Dataset]. https://www.kaggle.com/sameerdev7/joke-rating/discussion
    Dataset updated
    May 23, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sameer Dev
    License

    http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    This dataset contains over a million jokes and the rating given to each joke by users

  15. Books called Joke-tionary jokes : more than 444 jokes for kids!

    • workwithdata.com
    Updated Oct 11, 2024
    Cite
    Work With Data (2024). Books called Joke-tionary jokes : more than 444 jokes for kids! [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=Joke-tionary+jokes+%3A+more+than+444+jokes+for+kids%21
    Dataset updated
    Oct 11, 2024
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about books and is filtered where the book is Joke-tionary jokes : more than 444 jokes for kids!. It has 7 columns such as author, BNB id, book, book publisher, and ISBN. The data is ordered by publication date (descending).

  16. Corpus of daily jokes from the 24ur.com portal Šale24 1.0 - Dataset - B2FIND...

    • b2find.dkrz.de
    Updated Jan 15, 2025
    + more versions
    Cite
    (2025). Corpus of daily jokes from the 24ur.com portal Šale24 1.0 - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/c29f095d-fa29-59aa-b494-c85caa0622c4
    Dataset updated
    Jan 15, 2025
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This is a corpus of 1915 "jokes of the day" ("šala dneva") published by the Slovenian news portal 24ur.com. The jokes were scraped from the portal's archive on September 18th, 2024. The initial list is lightly curated: shorter texts found in the original collection were removed from the corpus since they appear to be illustration captions without the accompanying illustrations.

    Readers of the news portal vote on the jokes themselves with thumbs up and thumbs down buttons. The voting results are included as metadata with each joke. Several jokes have been published more than once. Each joke (distinguished based on exact text matches) is identified by a hash of its text and presents a list of voting results for every instance of its publication.

    The normalised_text field contains text with punctuation corrections. For now, this is limited to replacing '' (two consecutive apostrophes U+0027) with " (a single straight/dumb/vertical quotation mark U+0022). The former (two apostrophes) is consistently used in place of the latter in the original corpus.

    Based on the name ("Šala dneva", i.e. "Joke of the day") and the observed frequency of posting during September 2024, we assume each entry corresponds to a day, counting backwards from the day of data collection. Each voting event has an associated estimated publication date calculated with the above algorithm.

    The jokes are linguistically annotated with CLASSLA-Stanza (https://github.com/clarinsi/classla), using the models for standard Slovenian.

    The JSONL file contains entries representing individual jokes containing:

    • a hash of the original joke text used for duplicate identification (key: hash)
    • original scraped text (key: original_text)
    • normalised text (key: normalised_text)
    • linguistically annotated normalised text in CoNLL-U format (key: processed_text)
    • a list of vote objects containing joke vote metadata (key: votes)
    • votes for (key: votes.for)
    • votes against (key: votes.against)
    • estimated dates of joke publication and voting (key: estimated_date)
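The normalisation rule and the per-joke vote lists described above can be illustrated with a short Python sketch. The sample entry is invented for illustration (placeholder hash, English text, made-up vote counts), and it assumes each vote object carries `for` and `against` keys as the dotted key names suggest; it does not reproduce the corpus's hashing or date-estimation logic.

```python
import io
import json

# Invented entry for illustration; only the key names follow the description.
sample_entry = {
    "hash": "placeholder-hash",
    "original_text": "He said: ''Good day.''",
    "votes": [{"for": 12, "against": 3}, {"for": 7, "against": 1}],
}
sample_jsonl = json.dumps(sample_entry) + "\n"


def normalise(text):
    """Replace two consecutive apostrophes (U+0027) with one quotation mark (U+0022)."""
    return text.replace("''", '"')


def read_jsonl(handle):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in handle if line.strip()]


def vote_totals(entry):
    """Sum votes for and against over every publication of a joke."""
    total_for = sum(vote["for"] for vote in entry["votes"])
    total_against = sum(vote["against"] for vote in entry["votes"])
    return total_for, total_against


entries = read_jsonl(io.StringIO(sample_jsonl))
print(normalise(entries[0]["original_text"]), vote_totals(entries[0]))
```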

  17. Books called Best teenage jokes

    • workwithdata.com
    Updated Oct 8, 2024
    + more versions
    Cite
    Work With Data (2024). Books called Best teenage jokes [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=Best+teenage+jokes
    Dataset updated
    Oct 8, 2024
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about books and is filtered where the book is Best teenage jokes. It has 7 columns such as book, author, ISBN, BNB id, and language. The data is ordered by publication date (descending).

  18. Dataset of Russian jokes

    • kaggle.com
    zip
    Updated Feb 3, 2023
    Cite
    Kirill Ovcharenko (2023). Dataset of Russian jokes [Dataset]. https://www.kaggle.com/datasets/kovcharenko51/dataset-of-russian-jokes
    Available download formats: zip (1569809 bytes)
    Dataset updated
    Feb 3, 2023
    Authors
    Kirill Ovcharenko
    Area covered
    Russia
    Description

    Dataset

    This dataset was created by Kirill Ovcharenko

    Contents

  19. programming-jokes-dataset

    • huggingface.co
    Updated Aug 3, 2024
    Cite
    Asfandyar Azhar (2024). programming-jokes-dataset [Dataset]. https://huggingface.co/datasets/asfandyarazhar/programming-jokes-dataset
    Dataset updated
    Aug 3, 2024
    Authors
    Asfandyar Azhar
    Description

    Programming Jokes Dataset

    Dataset Summary

    This dataset contains programming-related jokes scraped from the website Punny Funny. The jokes are organized into different categories based on the structure of the original webpage. The dataset is intended for use in natural language processing tasks, such as fine-tuning language models to generate humor or analyze textual content in the programming domain. Number of Jokes: [220]

    Usage

    This dataset is… See the full description on the dataset page: https://huggingface.co/datasets/asfandyarazhar/programming-jokes-dataset.

  20. jokes-dataset

    • huggingface.co
    Updated Feb 7, 2025
    + more versions
    Cite
    Rayhana Rafiai (2025). jokes-dataset [Dataset]. https://huggingface.co/datasets/rayhanti/jokes-dataset
    Dataset updated
    Feb 7, 2025
    Authors
    Rayhana Rafiai
    Description

    rayhanti/jokes-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
