4 datasets found
  1. AI-generated fake podcasts (mostly via NotebookLM)

    • kaggle.com
    zip
    Updated Nov 5, 2025
    Cite
    Listen Notes (2025). AI-generated fake podcasts (mostly via NotebookLM) [Dataset]. https://www.kaggle.com/datasets/listennotes/ai-generated-fake-podcasts-spams/code
    Explore at:
    Available download formats: zip (1,788,541 bytes)
    Dataset updated
    Nov 5, 2025
    Dataset authored and provided by
    Listen Notes
    License

    http://www.gnu.org/licenses/agpl-3.0.html

    Description

    Podcasting is a unique space where people can share their voices, ideas, and stories freely. Unlike platforms controlled by a single company (like YouTube or Instagram), podcasting supports true freedom of expression. However, this openness is now being threatened by AI tools, such as Notebook LM, which make it easy to produce fake, low-quality podcasts. Unfortunately, many of these AI-generated shows are created by spammers, scammers, or blackhat SEOs, and they are harming both listeners and genuine podcast creators.

    At Listen Notes, the leading podcast search engine and podcast API, we believe that creating a quality podcast takes real effort. Listeners can tell when a show has been crafted with care, and that’s why we are committed to stopping the spread of fake, AI-generated podcasts on our platform.

    This dataset represents a small subset of AI-generated fake podcasts that were flagged during attempts to add them to the Listen Notes podcast database. These "podcasts" were predominantly created using Notebook LM and are not designed for human consumption.

    The goal of sharing this dataset is to support the AI community in developing more effective tools to combat spam. While it may not be possible to eliminate spam entirely, we can work together to minimize its impact and contribute to making the digital world a better place.

    If you're building a podcast app for discovering human-made shows, PodcastAPI.com is your best bet. Apple Podcasts and Spotify are increasingly flooded with AI-generated fakes.

  2. Data from: Why people listen: Motivations and outcomes of podcast listening

    • researchdata.edu.au
    Updated May 5, 2022
    + more versions
    Cite
    Dr Stephanie Tobin; Dr Stephanie Tobin (2022). Why people listen: Motivations and outcomes of podcast listening [Dataset]. https://researchdata.edu.au/why-people-listen-podcast-listening/1944842
    Explore at:
    Dataset updated
    May 5, 2022
    Dataset provided by
    Queensland University of Technology
    Authors
    Dr Stephanie Tobin; Dr Stephanie Tobin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Time period covered
    Jul 31, 2019 - Aug 3, 2019
    Description

    This SPSS dataset is from a 2019 survey conducted via . There are 323 participants in the file, 306 with complete data for the key measures. Measures include the Big Five Inventory, the Interest/Deprivation Curiosity Scale, the Need for Cognition Scale, the Need to Belong Scale, the Basic Psychological Need Satisfaction Scale, the General Belongingness Scale, the Meaning in Life Questionnaire, the Mindful Attention Awareness Scale, the Smartphone Addiction Scale, and some questions about listening to podcasts.

    In relation to podcasts, participants were first asked if they had ever listened to a podcast. Those who said yes (N = 240) were asked questions related to amount of listening, categories and format of podcasts, setting of listening, device used, social engagement around podcasts, and parasocial relationships with their favourite podcast host. Participants also indicated their age, gender, and country of residence.

    The datafile contains item ratings and scale scores for all measures. Item wording and response labels are provided in the variable view tab of the downloaded file. Other files available on the OSF site include a syntax file related to the analyses reported in a published paper and a copy of the survey.
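    Scale scores in files like this are typically computed as the mean of a measure's item ratings, with reverse-keyed items flipped before averaging. A minimal stdlib sketch of that convention (the item names and scoring key below are hypothetical, not taken from this datafile):

```python
# Hypothetical items from a 5-point Likert measure; True marks a
# reverse-keyed item, flipped as (scale_max + 1 - rating) before averaging.
SCALE_MAX = 5
ITEMS = {"nfc_1": False, "nfc_2": True, "nfc_3": False}

def scale_score(ratings, items=ITEMS, scale_max=SCALE_MAX):
    """Mean of item ratings after reverse-keying flagged items."""
    total = 0
    for item, reverse in items.items():
        r = ratings[item]
        total += (scale_max + 1 - r) if reverse else r
    return total / len(items)

# A participant who answered 5, 1, 5: the reverse-keyed middle item
# becomes 6 - 1 = 5, so the scale score is (5 + 5 + 5) / 3 = 5.0.
print(scale_score({"nfc_1": 5, "nfc_2": 1, "nfc_3": 5}))
```

    The actual item wording and scoring keys are documented in the variable view of the downloaded SPSS file and in the syntax file on the OSF site.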

  3. Stuttering Events in Podcasts Dataset (SEP-28k)

    • kaggle.com
    zip
    Updated Jun 21, 2022
    Cite
    Krishna Basak (2022). Stuttering Events in Podcasts Dataset (SEP-28k) [Dataset]. https://www.kaggle.com/datasets/ikrbasak/sep-28k/code
    Explore at:
    Available download formats: zip (2,330,237,947 bytes)
    Dataset updated
    Jun 21, 2022
    Authors
    Krishna Basak
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The SEP-28k dataset contains stuttering event annotations for approximately 28,000 3-second clips. In addition, it includes stutter event annotations for about 4,000 3-second clips from the FluencyBank dataset. Audio files are not part of this release but may be downloaded using the URLs provided in the *_episodes.csv files. Original copyright remains with the podcast owners.

    Annotation Descriptions

    Each 3-second clip was annotated with the following labels by three annotators who were not clinicians but did have training on how to identify each type of stuttering event. Label files contain counts (out of three) corresponding to how many reviewers selected a given label. Multiple labels may be selected for a given clip.

    Stuttering event labels:

    • Prolongation: Elongated syllable (e.g., M[mmm]ommy)
    • Block: Gasps for air or stuttered pauses
    • Sound Repetition: Repeated syllables (e.g., I [pr-pr-pr-]prepared dinner)
    • Word Repetition: The same word or phrase is repeated (e.g., I made [made] dinner)
    • No Stuttered Words: Confirmation that none of the above is true.
    • Interjection: Common filler words such as "um" or "uh" or person-specific filler words that individuals use to cope with their stutter (e.g., some users frequently say "you know" as a filler).

    Additional labels:

    • Unsure: An annotator selects this if they are not confident in their labeling.
    • Poor Audio Quality: It is difficult to understand due to, for example, microphone quality.
    • Difficult To Understand: It is difficult to understand the speech.
    • Natural Pause: There is a pause in speech that is not considered a block or other disfluency.
    • Music: There is background music playing (only in SEP-28k)
    • No Speech: There is no speech in this clip. It is either silent or there is just background noise.

    Data Files:

    • Links to media files: Podcast/FluencyBank names, audio urls, and keycodes used with the annotation labels (SEP-28k_episodes.csv and fluencybank_episodes.csv).
    • Annotations: Clips used within each audio file with corresponding start/stop time and fluency labels (SEP-28k_labels.csv and fluencybank_labels.csv).
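    Since each label column holds a count of how many of the three annotators selected it, a common first step is to keep only labels with majority agreement (at least 2 of 3). A stdlib sketch of that aggregation (the sample rows and exact column names below are illustrative, not copied from the released CSVs):

```python
import csv
import io

# Hypothetical rows in the shape of SEP-28k_labels.csv: one row per clip,
# with each label column holding the number of annotators (0-3) who chose it.
sample = """Show,EpId,ClipId,Start,Stop,Prolongation,Block,SoundRep,WordRep,Interjection,NoStutteredWords
StutterTalk,540,0,62.1,65.1,0,2,0,0,1,1
StutterTalk,540,1,65.1,68.1,0,0,0,0,0,3
HeStutters,5,0,10.0,13.0,3,0,1,0,0,0
"""

STUTTER_LABELS = ["Prolongation", "Block", "SoundRep", "WordRep", "Interjection"]

def majority_labels(row, threshold=2):
    """Return the stutter labels chosen by at least `threshold` annotators."""
    return [lab for lab in STUTTER_LABELS if int(row[lab]) >= threshold]

clips = list(csv.DictReader(io.StringIO(sample)))
agreed = {f'{r["Show"]}/{r["ClipId"]}': majority_labels(r) for r in clips}
```

    The same row-by-row reduction works on the real SEP-28k_labels.csv and fluencybank_labels.csv once they are downloaded; keep in mind that multiple labels can reach majority for a single clip.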

    If you find the SEP-28k dataset or this code useful in your research, please cite the following paper:

    @inproceedings{stuttering-event-detection,
      title  = {SEP-28k: A Dataset for Stuttering Event Detection from Podcasts with People Who Stutter},
      author = {Colin Lea and Vikramjit Mitra and Aparna Joshi and Sachin Kajarekar and Jeffrey Bigham},
      year   = {2021},
      URL    = {https://arxiv.org/pdf/2102.12394.pdf}
    }

    The SEP-28k dataset is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). To view a copy of this license, visit https://creativecommons.org/licenses/by-nc/4.0/.

    This dataset is provided by Apple and does not, by any means, belong to me. Find this dataset on GitHub.

  4. Podcast metadata by full-text keyword search

    • listennotes.com
    csv
    Updated Sep 21, 2020
    Cite
    Listen Notes, Inc. (2020). Podcast metadata by full-text keyword search [Dataset]. https://www.listennotes.com/podcast-datasets/keyword/
    Explore at:
    Available download formats: csv
    Dataset updated
    Sep 21, 2020
    Dataset provided by
    Listen Notes
    Authors
    Listen Notes, Inc.
    License

    https://www.listennotes.com/podcast-datasets/keyword/#terms

    Description

    Batch export all podcasts or episodes matching a full-text keyword search, e.g., for people, brands, or topics.
