100+ datasets found
  1. one-million-reddit-jokes

    • huggingface.co
    Updated Nov 1, 2021
    SocialGrep (2021). one-million-reddit-jokes [Dataset]. https://huggingface.co/datasets/SocialGrep/one-million-reddit-jokes
    Authors
    SocialGrep
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Card for one-million-reddit-jokes

      Dataset Summary
    

    This corpus contains a million posts from /r/jokes. Posts are annotated with their score.

      Languages
    

    Mainly English.

      Dataset Structure
    
    
    
    
    
      Data Instances
    

    A data point is a Reddit post.

      Data Fields
    

    • 'type': the type of the data point. Can be 'post' or 'comment'.
    • 'id': the base-36 Reddit ID of the data point. Unique when combined with type.
    • 'subreddit.id': the base-36 Reddit ID… See the full description on the dataset page: https://huggingface.co/datasets/SocialGrep/one-million-reddit-jokes.
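    As a minimal sketch of how the 'type' and 'id' fields described above might be used: Reddit IDs are base-36 strings, so Python's built-in int() can decode them, and a (type, id) pair can serve as a unique key. The sample records and helper names below are invented for illustration and are not taken from the dataset.

```python
# Hypothetical records using the 'type' and 'id' fields named in the dataset card.
records = [
    {"type": "post", "id": "abc123"},
    {"type": "comment", "id": "abc123"},  # same id, different type
    {"type": "post", "id": "abc123"},     # exact duplicate of the first record
]


def decode_reddit_id(base36_id: str) -> int:
    """Convert a base-36 Reddit ID string to an integer."""
    return int(base36_id, 36)


def unique_key(record: dict) -> tuple:
    """Build a key that is unique per data point: the type combined with the id."""
    return (record["type"], record["id"])


# Later duplicates overwrite earlier ones, leaving one record per (type, id).
unique_records = {unique_key(r): r for r in records}

print(decode_reddit_id("abc123"))
print(len(unique_records))
```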

  2. programming-jokes-dataset

    • huggingface.co
    Updated Aug 3, 2024
    Asfandyar Azhar (2024). programming-jokes-dataset [Dataset]. https://huggingface.co/datasets/asfandyarazhar/programming-jokes-dataset
    Authors
    Asfandyar Azhar
    Description

    Programming Jokes Dataset

      Dataset Summary
    

    This dataset contains programming-related jokes scraped from the website Punny Funny. The jokes are organized into different categories based on the structure of the original webpage. The dataset is intended for use in natural language processing tasks, such as fine-tuning language models to generate humor or analyze textual content in the programming domain. Number of Jokes: [220]

      Usage
    

    This dataset is suitable for… See the full description on the dataset page: https://huggingface.co/datasets/asfandyarazhar/programming-jokes-dataset.

  3. short_jokes

    • huggingface.co
    Updated Feb 22, 2024
    yuvraj sharma (2024). short_jokes [Dataset]. https://huggingface.co/datasets/ysharma/short_jokes
    Authors
    yuvraj sharma
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Context: Generating humor is a complex task in the domain of machine learning, and it requires models to understand the deep semantic meaning of a joke in order to generate new ones. Such problems, however, are difficult to solve for a number of reasons, one of which is the lack of a database that gives an elaborate list of jokes. Thus, a large corpus of over 0.2 million jokes has been collected by scraping several websites containing funny and short jokes. You can visit the Github… See the full description on the dataset page: https://huggingface.co/datasets/ysharma/short_jokes.

  4. jokes-dataset

    • huggingface.co
    Updated Feb 7, 2025
    Rayhana Rafiai (2025). jokes-dataset [Dataset]. https://huggingface.co/datasets/rayhanti/jokes-dataset
    Authors
    Rayhana Rafiai
    Description

    rayhanti/jokes-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. Dataset of books called Joke-tionary jokes : more than 444 jokes for kids!

    • workwithdata.com
    Updated Apr 17, 2025
    Work With Data (2025). Dataset of books called Joke-tionary jokes : more than 444 jokes for kids! [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=Joke-tionary+jokes+%3A+more+than+444+jokes+for+kids%21
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about books. It has 1 row and is filtered where the book is Joke-tionary jokes : more than 444 jokes for kids!. It features 7 columns including author, publication date, language, and book publisher.

  6. Email Jokes 1998-2004

    • services.fsd.tuni.fi
    zip
    Updated Jan 9, 2025
    Aro, Jari (2025). Email Jokes 1998-2004 [Dataset]. http://doi.org/10.60686/t-fsd1271
    Dataset provided by
    Finnish Social Science Data Archive
    Authors
    Aro, Jari
    Description

    The archived data consist of jokes, anecdotes and other humorous texts distributed through email messages. The researcher sent the request for email humour to the staff members of the Department of Sociology and Social Psychology at the University of Tampere, Finland, in February 2003. The staff members in turn distributed the request further. Texts were received from university staff members and students as well as from outsiders. Data collection continued until 2004. The total number of email messages received was 217, some of which contained more than one joke or anecdote. The jokes/anecdotes were mostly in Finnish, but approximately 20% were in English. The themes of the email messages varied greatly. Many were connected to current events, for instance, the Iraq war, the September 11 terrorist attacks in the USA, and the doping scandal of Finnish skiers in 2001. Other recurring themes included sexuality, gender and ethnicity stereotypes, and professional jokes. As is typical in email humour, the original creator of the jokes/anecdotes often remained unknown. The dataset is only available in the original languages.

  7. Question-Answer Jokes

    • kaggle.com
    Updated Jan 5, 2017
    Jiri Roznovjak (2017). Question-Answer Jokes [Dataset]. https://www.kaggle.com/jiriroz/qa-jokes/notebooks
    Dataset provided by
    Kaggle
    Authors
    Jiri Roznovjak
    Description

    This dataset contains 38,269 jokes of the question-answer form, obtained from the r/Jokes subreddit. The dataset contains a csv file, where a row contains a question ("Why did the chicken cross the road"), the corresponding answer ("To get to the other side") and a unique ID.

    The data spans from 2008 to the end of 2016. Entries with a higher ID were submitted earlier.

    An example of what one might do with the data is build a sequence-to-sequence model where the input is a question and the output is an answer. Then, given a question, the model should generate a funny answer. This is what I did as the final project for my fall 2016 machine learning class. The project page can be viewed here.
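    A minimal sketch of preparing such (question, answer) pairs for a sequence-to-sequence model, using only the standard library. The column names ("ID", "Question", "Answer") and the second sample row are assumptions made for illustration; the description above only states that each row holds a question, an answer and a unique ID.

```python
import csv
import io

# Invented sample in the described layout; the column names are assumptions.
SAMPLE_CSV = """ID,Question,Answer
1,Why did the chicken cross the road,To get to the other side
2,What do you call a fake noodle,An impasta
"""


def load_pairs(csv_text: str) -> list[tuple[str, str]]:
    """Read the CSV text and return (question, answer) pairs as input/target strings."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["Question"].strip(), row["Answer"].strip()) for row in reader]


pairs = load_pairs(SAMPLE_CSV)
for question, answer in pairs:
    print(f"input: {question!r} -> target: {answer!r}")
```

    With a real file, the same function could be fed the file's contents, after checking the actual header names.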

    Disclaimer: The dataset contains jokes that some may find inappropriate.

    License

    Released under reddit's API terms

  8. Jester Collaborative Filtering Dataset

    • kaggle.com
    Updated Jun 15, 2017
    Aakaash Jois (2017). Jester Collaborative Filtering Dataset [Dataset]. https://www.kaggle.com/aakaashjois/jester-collaborative-filtering-dataset/code
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Aakaash Jois
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Context

    The funniness of a joke is very subjective. With more than 70,000 users having rated jokes, can an algorithm be written to identify a universally funny joke?

    Content

    • The data files are in .csv format.
    • The complete dataset is 100 rows and 73422 columns.
    • The complete dataset is split into 3 .csv files.
    • JokeText.csv contains the Id of the joke and the complete joke string.
    • UserRatings1.csv contains the ratings provided by the first 36710 users.
    • UserRatings2.csv contains the ratings provided by the last 36711 users.
    • The dataset is arranged such that the initial users have rated a higher number of jokes than the later users.
    • The rating is a real value between -10.0 and +10.0.
    • The empty values indicate that the user has not provided any rating for that particular joke.
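    The layout described in the list above (one row per joke, one column per user, ratings between -10.0 and +10.0, empty cells for missing ratings) can be sketched as follows. The header names and the miniature sample are invented for illustration; the real UserRatings files may be laid out differently.

```python
import csv
import io

# Invented miniature of a ratings file: one row per joke, one column per user.
# The header names ("JokeId", "User1", ...) are assumptions for illustration.
SAMPLE_RATINGS = """JokeId,User1,User2,User3
0,5.1,-8.79,
1,,3.0,-2.0
2,,,
"""


def mean_ratings(csv_text: str) -> dict:
    """Return {joke_id: mean rating, or None if unrated}, skipping empty cells."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    means = {}
    for row in reader:
        joke_id = int(row[0])
        # Empty cells mean the user did not rate this joke, so they are ignored.
        ratings = [float(cell) for cell in row[1:] if cell.strip() != ""]
        for rating in ratings:
            if not -10.0 <= rating <= 10.0:
                raise ValueError(f"rating {rating} outside [-10.0, 10.0]")
        means[joke_id] = sum(ratings) / len(ratings) if ratings else None
    return means


means = mean_ratings(SAMPLE_RATINGS)
print(means)
```

    Treating empty cells as missing rather than as zero matters here, because 0.0 is itself a valid rating on this scale.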

    Acknowledgements

    The dataset is associated with the research paper below.

    Eigentaste: A Constant Time Collaborative Filtering Algorithm. Ken Goldberg, Theresa Roeder, Dhruv Gupta, and Chris Perkins. Information Retrieval, 4(2), 133-151. July 2001.

    More information and datasets can be found at http://eigentaste.berkeley.edu/dataset/

    Inspiration

    Since funniness is a very subjective matter, it will be very interesting to see if data science can bring out the details on what makes something funny.

  9. Dataset of books called Jokes, jests and jollies

    • workwithdata.com
    Updated Apr 17, 2025
    Work With Data (2025). Dataset of books called Jokes, jests and jollies [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=Jokes%2C+jests+and+jollies
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about books. It has 1 row and is filtered where the book is Jokes, jests and jollies. It features 7 columns including author, publication date, language, and book publisher.

  10. Rated short jokes

    • kaggle.com
    Updated Jul 18, 2025
    Jacopo Le Pera (2025). Rated short jokes [Dataset]. https://www.kaggle.com/datasets/jacopolepera/rated-short-jokes
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Jacopo Le Pera
    Description

    Dataset

    This dataset was created by Jacopo Le Pera

  11. Dataset of books called Dirty jokes every man should know

    • workwithdata.com
    Updated Apr 17, 2025
    Work With Data (2025). Dataset of books called Dirty jokes every man should know [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=Dirty+jokes+every+man+should+know
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about books. It has 1 row and is filtered where the book is Dirty jokes every man should know. It features 7 columns including author, publication date, language, and book publisher.

  12. Corpus of daily jokes from the 24ur.com portal Šale24 1.0

    • b2find.eudat.eu
    Updated Jul 28, 2025
    (2025). Corpus of daily jokes from the 24ur.com portal Šale24 1.0 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/c29f095d-fa29-59aa-b494-c85caa0622c4
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This is a corpus of 1915 "jokes of the day" ("šala dneva") published by the Slovenian news portal 24ur.com. The jokes were scraped from their archive on September 18th, 2024. The initial list is lightly curated: shorter texts found in the original collection were removed from the corpus since they appear to be illustration captions without the accompanying illustrations.

    Readers of the news portal vote on the jokes themselves with thumbs up and thumbs down buttons. The voting results are included as metadata with each joke. Several jokes have been published more than once. Each joke (distinguished based on exact text matches) is identified by a hash of its text and presents a list of voting results for every instance of its publication. The normalised_text field contains text with punctuation corrections. For now, this is limited to replacing '' (two consecutive apostrophes U+0027) with " (a single straight/dumb/vertical quotation mark U+0022). The former (two apostrophes) is consistently used in place of the latter in the original corpus.

    Based on the name ("Šala dneva" i.e. "Joke of the day") and observed frequency of posting during September 2024 we assume each entry corresponds to a day starting from the day of data collection counting backwards. Each voting event has an associated estimated publication date calculated with the above algorithm.

    The jokes are linguistically annotated with CLASSLA-Stanza (https://github.com/clarinsi/classla), using the models for standard Slovenian. The JSONL file contains entries representing individual jokes containing:

    - a hash of the original joke text used for duplicate identification (key: hash)
    - original scraped text (key: original_text)
    - normalised text (key: normalised_text)
    - linguistically annotated normalised text in CoNLL-U format (key: processed_text)
    - a list of vote objects containing joke vote metadata (key: votes)
    - votes for (key: votes.for)
    - votes against (key: votes.against)
    - estimated dates of joke publication and voting (key: estimated_date)
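    The punctuation normalisation and hash-based duplicate identification described above can be sketched roughly as follows. The corpus description does not name the hash function, so SHA-256 over the UTF-8 text is an assumption here, and the sample texts and function names are invented.

```python
import hashlib


def normalise(text: str) -> str:
    """Replace two consecutive apostrophes (U+0027) with one quotation mark (U+0022)."""
    return text.replace("''", '"')


def text_hash(text: str) -> str:
    # Assumption: SHA-256 of the UTF-8 bytes; the corpus may use another hash.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Invented sample texts; the first and third are exact duplicates.
texts = [
    "''Hello,'' she said.",
    "A different joke.",
    "''Hello,'' she said.",
]

# Jokes are distinguished by exact text match, so identical texts share one hash.
unique = {}
for text in texts:
    unique.setdefault(text_hash(text), text)

for digest, text in unique.items():
    print(digest[:8], normalise(text))
```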

  13. Dataset of books series that contain Jokes and fun

    • workwithdata.com
    Updated Nov 25, 2024
    Work With Data (2024). Dataset of books series that contain Jokes and fun [Dataset]. https://www.workwithdata.com/datasets/book-series?f=1&fcol0=j0-book&fop0=%3D&fval0=Jokes+and+fun&j=1&j0=books
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about book series. It has 1 row and is filtered where the book is Jokes and fun. It features 10 columns including number of authors, number of books, earliest publication date, and latest publication date.

  14. Dataset of books called Monster jokes

    • workwithdata.com
    Updated Apr 17, 2025
    Work With Data (2025). Dataset of books called Monster jokes [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=Monster+jokes
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about books. It has 3 rows and is filtered where the book is Monster jokes. It features 7 columns including author, publication date, language, and book publisher.

  15. Offense Classification Jokes

    • kaggle.com
    Updated Oct 10, 2024
    Avneet Singh (2024). Offense Classification Jokes [Dataset]. https://www.kaggle.com/datasets/avneets2103/offense-classification-jokes/suggestions?status=pending&yourSuggestions=true
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Avneet Singh
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Avneet Singh

    Released under Apache 2.0

  16. Data from: Corpus of daily jokes from the 24ur.com portal Šale24 1.0

    • live.european-language-grid.eu
    binary format
    Updated Oct 2, 2024
    (2024). Corpus of daily jokes from the 24ur.com portal Šale24 1.0 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/23698
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This is a corpus of 1915 "jokes of the day" ("šala dneva") published by the Slovenian news portal 24ur.com. The jokes were scraped from their archive on September 18th, 2024. The initial list is lightly curated: shorter texts found in the original collection were removed from the corpus since they appear to be illustration captions without the accompanying illustrations.

    Readers of the news portal vote on the jokes themselves with thumbs up and thumbs down buttons. The voting results are included as metadata with each joke. Several jokes have been published more than once. Each joke (distinguished based on exact text matches) is identified by a hash of its text and presents a list of voting results for every instance of its publication. The normalised_text field contains text with punctuation corrections. For now, this is limited to replacing '' (two consecutive apostrophes U+0027) with " (a single straight/dumb/vertical quotation mark U+0022). The former (two apostrophes) is consistently used in place of the latter in the original corpus.

    Based on the name ("Šala dneva" i.e. "Joke of the day") and observed frequency of posting during September 2024 we assume each entry corresponds to a day starting from the day of data collection counting backwards. Each voting event has an associated estimated publication date calculated with the above algorithm.

    The jokes are linguistically annotated with CLASSLA-Stanza (https://github.com/clarinsi/classla), using the models for standard Slovenian. The JSONL file contains entries representing individual jokes containing:

    - a hash of the original joke text used for duplicate identification (key: hash)
    - original scraped text (key: original_text)
    - normalised text (key: normalised_text)
    - linguistically annotated normalised text in CoNLL-U format (key: processed_text)
    - a list of vote objects containing joke vote metadata (key: votes)
    - votes for (key: votes.for)
    - votes against (key: votes.against)
    - estimated dates of joke publication and voting (key: estimated_date)

    The corpus contains 16658 sentences, 129063 tokens, and 662 recognised named entities.
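    A minimal sketch of reading one entry in the JSONL layout described above and aggregating its votes across publications. The key names come from the description; the nesting of "for", "against" and "estimated_date" inside each vote object is an assumption, and the sample line and its values are invented.

```python
import json

# One invented JSONL line using the keys listed in the description. Placing
# "for", "against" and "estimated_date" inside each vote object is an assumption.
SAMPLE_JSONL = (
    '{"hash": "h1", "original_text": "sample", "normalised_text": "sample", '
    '"processed_text": "sample", "votes": ['
    '{"for": 12, "against": 3, "estimated_date": "2024-09-18"}, '
    '{"for": 8, "against": 2, "estimated_date": "2024-01-05"}]}\n'
)


def approval_ratio(entry: dict):
    """Share of thumbs-up votes over all publications of a joke, or None if no votes."""
    total_for = sum(vote["for"] for vote in entry["votes"])
    total_against = sum(vote["against"] for vote in entry["votes"])
    total = total_for + total_against
    return total_for / total if total else None


# JSONL holds one JSON object per line; blank lines are skipped.
entries = [json.loads(line) for line in SAMPLE_JSONL.splitlines() if line.strip()]

for entry in entries:
    print(entry["hash"], len(entry["votes"]), approval_ratio(entry))
```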

  17. Million Jokes

    • kaggle.com
    Updated Oct 9, 2024
    Avneet Singh (2024). Million Jokes [Dataset]. https://www.kaggle.com/datasets/avneets2103/million-jokes/discussion?sort=undefined
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Avneet Singh
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Avneet Singh

    Released under Apache 2.0

  18. Plaintext Jokes

    • marketplace.sshopencloud.eu
    Updated Sep 10, 2018
    (2018). Plaintext Jokes [Dataset]. https://marketplace.sshopencloud.eu/dataset/nCeh4z
    Description

    Approximately 208,000 jokes scraped from various websites

  19. ChatGPT Tell Me A Joke

    • kaggle.com
    Updated Aug 12, 2023
    Naveen Rajan (2023). ChatGPT Tell Me A Joke [Dataset]. https://www.kaggle.com/datasets/navirocker/chatgpt-tell-me-a-joke
    Dataset provided by
    Kaggle
    Authors
    Naveen Rajan
    License

    GNU Free Documentation License 1.3: http://www.gnu.org/licenses/fdl-1.3.html

    Description

    Generated by ChatGPT, this dataset is a compilation of humorous jokes and their corresponding responses. With a wide range of topics, from animals to everyday situations, these light-hearted interactions showcase the playful side of AI. Whether you're in need of a quick chuckle or a mood boost, let ChatGPT's witty banter provide the laughter you seek.

  20. Bangla_jokes

    • huggingface.co
    Updated Apr 8, 2025
    adnan chowdhury (2025). Bangla_jokes [Dataset]. https://huggingface.co/datasets/adnan1837/Bangla_jokes
    Authors
    adnan chowdhury
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for Bangla Jokes Dataset

    The Bangla Jokes Dataset is a collection of humorous text samples written in Bengali (Bangla). This dataset is intended for NLP research and model training, especially in the area of Bangla-language humor generation, sentiment, or cultural studies. It is one of the first attempts to gather a sizable dataset of jokes in Bangla for open-source use.

      Dataset Details
    

    Curated by: Adnan1837
    Funded by: No one
    Shared by: Adnan, Md.… See the full description on the dataset page: https://huggingface.co/datasets/adnan1837/Bangla_jokes.

one-million-reddit-jokes (SocialGrep/one-million-reddit-jokes): 2 scholarly articles cite this dataset (per Google Scholar).
