22 datasets found
  1. Spotify Podcast Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Oct 16, 2024
    Cite
    Ann Clifton; Aasish Pappu; Sravana Reddy; Yongze Yu; Jussi Karlgren; Ben Carterette; Rosie Jones (2024). Spotify Podcast Dataset [Dataset]. https://paperswithcode.com/dataset/spotify-podcast
    Explore at:
    Dataset updated
    Oct 16, 2024
    Authors
    Ann Clifton; Aasish Pappu; Sravana Reddy; Yongze Yu; Jussi Karlgren; Ben Carterette; Rosie Jones
    Description

    A set of approximately 100K podcast episodes comprising raw audio files along with accompanying ASR transcripts. This represents over 47,000 hours of transcribed audio and is an order of magnitude larger than previous speech-to-text corpora.

  2. Data from: PodcastMix - a dataset for separating music and speech in...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Sep 12, 2022
    Cite
    Nicolas Schmidt; Jordi Pons; Marius Miron (2022). PodcastMix - a dataset for separating music and speech in podcasts [Dataset]. http://doi.org/10.5281/zenodo.5597047
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 12, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nicolas Schmidt; Jordi Pons; Marius Miron
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Note: due to Zenodo file-size limitations, we host solely the metadata here. The whole dataset can be found at: https://drive.google.com/drive/u/0/folders/1tpg9WXkl4L0zU84AwLQjrFqnP-jw1t7z

    We introduce PodcastMix, a dataset formalizing the task of separating background music and foreground speech in podcasts. It contains audio files at 44.1 kHz and the corresponding metadata. For further details, check the associated paper and GitHub repository.

    This dataset contains four parts. Due to Zenodo's file-size limitation, we host the training dataset on Google Drive. We highlight the content of the Zenodo archives within brackets:

    • [metadata] PodcastMix-synth train: a large and diverse training set that is programmatically generated (with a validation partition). The mixtures are created programmatically with music from Jamendo and speech from the VCTK dataset.
    • [metadata] PodcastMix-synth test: a programmatically generated test set with reference stems to compute evaluation metrics. The mixtures are created programmatically with music from Jamendo and speech from the VCTK dataset.
    • [audio and metadata] PodcastMix-real with-reference: a test set of real podcasts with reference stems to compute evaluation metrics. The podcasts are recorded by one of the authors and the source of the music is the FMA dataset.
    • [audio and metadata] PodcastMix-real no-reference: a test set of real podcasts with only the podcast mixes, for subjective evaluation. The podcasts are compiled from the internet.

    The training dataset, PodcastMix-synth, may be found at our Google Drive repository: https://drive.google.com/drive/folders/1tpg9WXkl4L0zU84AwLQjrFqnP-jw1t7z?usp=sharing . The archive comprises 450 GB of audio and metadata with the following structure:

    • [metadata and audio] PodcastMix-synth train: a large and diverse training set that is programmatically generated (with a validation partition). The mixtures are created programmatically with music from Jamendo and speech from the VCTK dataset.
    • [metadata and audio] PodcastMix-synth test: a programmatically generated test set with reference stems to compute evaluation metrics. The mixtures are created programmatically with music from Jamendo and speech from the VCTK dataset.

    Make sure you maintain the folder structure of the original dataset when you uncompress these files.
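
    To make the mixing idea concrete, here is a minimal, hypothetical sketch of how a speech-plus-music mixture like those in PodcastMix-synth can be constructed. This is illustrative only, not the authors' generation code; the file names and gain value are placeholders:

    # Illustrative only: construct a synthetic podcast-style mixture.
    import soundfile as sf

    speech, sr = sf.read("vctk_speech.wav")     # foreground speech (placeholder file)
    music, sr_m = sf.read("jamendo_music.wav")  # background music (placeholder file)
    assert sr == sr_m == 44100                  # dataset audio is 44.1 kHz

    n = min(len(speech), len(music))            # trim to a common length; assumes same channel layout
    music_gain = 0.3                            # hypothetical background level
    mixture = speech[:n] + music_gain * music[:n]

    # The two addends are the reference stems that separation metrics are computed against.
    sf.write("podcast_mix.wav", mixture, sr)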


    This dataset is created by Nicolas Schmidt, Marius Miron (Music Technology Group - Universitat Pompeu Fabra, Barcelona), and Jordi Pons. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0).


    Please acknowledge PodcastMix in academic research. When the present dataset is used for academic research, we would highly appreciate it if authors cite the following publications:

    • N. Schmidt, J. Pons, M. Miron, "PodcastMix - a dataset for separating music and speech in podcasts", Interspeech (2022)
    • N. Schmidt, "PodcastMix - a dataset for separating music and speech in podcasts", Masters thesis, MTG, UPF (2021) https://zenodo.org/record/5554790#.YXLHvNlByWA


    The dataset and its contents are made available on an “as is” basis and without warranties of any kind, including without limitation satisfactory quality and conformity, merchantability, fitness for a particular purpose, accuracy or completeness, or absence of errors. Subject to any liability that may not be excluded or limited by law, the UPF is not liable for, and expressly excludes, all liability for loss or damage however and whenever caused to anyone by any use of the dataset or any part of it.


    PURPOSES. The data is processed for the general purpose of carrying out research, development, and innovation studies, works, or projects. In particular, but without limitation, the data is processed for the purpose of communicating with the Licensee regarding administrative and legal/judicial matters.

  3. The "Podcast" ECoG dataset

    • openneuro.org
    Updated Feb 17, 2025
    Cite
    Zaid Zada; Samuel A. Nastase; Bobbi Aubrey; Itamar Jalon; Ariel Goldstein; Sebastian Michelmann; Haocheng Wang; Liat Hasenfratz; Werner Doyle; Daniel Friedman; Patricia Dugan; Lucia Melloni; Sasha Devore; Orrin Devinsky; Adeen Flinker; Uri Hasson (2025). The "Podcast" ECoG dataset [Dataset]. http://doi.org/10.18112/openneuro.ds005574.v1.0.2
    Explore at:
    Dataset updated
    Feb 17, 2025
    Dataset provided by
    OpenNeuro (https://openneuro.org/)
    Authors
    Zaid Zada; Samuel A. Nastase; Bobbi Aubrey; Itamar Jalon; Ariel Goldstein; Sebastian Michelmann; Haocheng Wang; Liat Hasenfratz; Werner Doyle; Daniel Friedman; Patricia Dugan; Lucia Melloni; Sasha Devore; Orrin Devinsky; Adeen Flinker; Uri Hasson
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The "Podcast" ECoG dataset for modeling neural activity during natural story listening.

    We introduce the “Podcast” electrocorticography (ECoG) dataset for modeling neural activity supporting natural narrative comprehension. This dataset combines the exceptional spatiotemporal resolution of human intracranial electrophysiology with a naturalistic experimental paradigm for language comprehension. In addition to the raw data, we provide a minimally preprocessed version in the high-gamma spectral band to showcase a simple pipeline and to make it easier to use. Furthermore, we include the auditory stimuli, an aligned word-level transcript, and linguistic features ranging from low-level acoustic properties to large language model (LLM) embeddings. We also include tutorials that replicate previous findings and serve as a pedagogical resource and a springboard for new research. The dataset comprises 9 participants with 1,330 electrodes, including grid, depth, and strip electrodes. The participants listened to a 30-minute story with over 5,000 words. By using a natural story with high-fidelity, invasive neural recordings, this dataset offers a unique opportunity to investigate language comprehension.
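
    For orientation, a minimal sketch of loading one participant's iEEG recording from the BIDS layout with the MNE-BIDS library; the subject label and task name below are assumptions, so check the dataset's metadata for the actual values:

    from mne_bids import BIDSPath, read_raw_bids

    bids_path = BIDSPath(
        root="ds005574",   # local copy of the OpenNeuro dataset
        subject="01",      # hypothetical subject label
        task="podcast",    # hypothetical task name
        datatype="ieeg",   # ECoG recordings live under the iEEG datatype in BIDS
    )
    raw = read_raw_bids(bids_path=bids_path)
    print(raw.info)        # channels, sampling rate, etc.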

  4. PodcastFillers

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, zip
    Updated Oct 9, 2022
    + more versions
    Cite
    Ge Zhu; Juan-Pablo Caceres; Justin Salamon (2022). PodcastFillers [Dataset]. http://doi.org/10.5281/zenodo.7121457
    Explore at:
    Available download formats: zip, bin, csv
    Dataset updated
    Oct 9, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ge Zhu; Juan-Pablo Caceres; Justin Salamon
    License

    Attribution-NonCommercial 2.0 (CC BY-NC 2.0): https://creativecommons.org/licenses/by-nc/2.0/
    License information was derived automatically

    Description

    OVERVIEW:
    The PodcastFillers dataset consists of 199 full-length podcast episodes in English with manually annotated filler words and automatically generated transcripts. The podcast audio recordings, sourced from SoundCloud (www.soundcloud.com), are CC-licensed, gender-balanced, and total 145 hours of audio from over 350 speakers. The annotations are provided under a non-commercial license and consist of 85,803 manually annotated audio events including approximately 35,000 filler words (“uh” and “um”) and 50,000 non-filler events such as breaths, music, laughter, repeated words, and noise. The annotated events are also provided as pre-processed 1-second audio clips. The dataset also includes automatically generated speech transcripts from a speech-to-text system. A detailed description is provided below.

    The PodcastFillers dataset homepage: PodcastFillers.github.io
    The preprocessing utility functions and code repository for reproducing our experimental results: PodcastFillersUtils

    LICENSE:

    The PodcastFillers dataset has separate licenses for the audio data and for the metadata. The metadata includes all annotations, speech-to-text transcriptions, and model outputs including VAD activations and FillerNet classification predictions.

    Note: PodcastFillers is provided for research purposes only. The metadata license prohibits commercial use, which in turn prohibits deploying technology developed using the PodcastFillers metadata (such as the CSV annotations or audio clips extracted based on these annotations) in commercial applications.

    ## License for PodcastFillers Dataset metadata

    This license agreement (the “License”) between Adobe Inc., having a place of business at 345 Park Avenue, San Jose, California 95110-2704 (“Adobe”), and you, the individual or entity exercising rights under this License (“you” or “your”), sets forth the terms for your use of certain research materials that are owned by Adobe (the “Licensed Materials”). By exercising rights under this License, you accept and agree to be bound by its terms. If you are exercising rights under this License on behalf of an entity, then “you” means you and such entity, and you (personally) represent and warrant that you (personally) have all necessary authority to bind that entity to the terms of this License.

    1. GRANT OF LICENSE.
    1.1 Adobe grants you a nonexclusive, worldwide, royalty-free, revocable, fully paid license to (A) reproduce, use, modify, and publicly display the Licensed Materials for noncommercial research purposes only; and (B) redistribute the Licensed Materials, and modifications or derivative works thereof, for noncommercial research purposes only, provided that you give recipients a copy of this License upon redistribution.
    1.2 You may add your own copyright statement to your modifications and/or provide additional or different license terms for use, reproduction, modification, public display, and redistribution of your modifications and derivative works, provided that such license terms limit the use, reproduction, modification, public display, and redistribution of such modifications and derivative works to noncommercial research purposes only.
    1.3 For purposes of this License, noncommercial research purposes include academic research and teaching only. Noncommercial research purposes do not include commercial licensing or distribution, development of commercial products, or any other activity that results in commercial gain.
    2. OWNERSHIP AND ATTRIBUTION. Adobe and its licensors own all right, title, and interest in the Licensed Materials. You must retain all copyright notices and/or disclaimers in the Licensed Materials.
    3. DISCLAIMER OF WARRANTIES. THE LICENSED MATERIALS ARE PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND. THE ENTIRE RISK AS TO THE USE, RESULTS, AND PERFORMANCE OF THE LICENSED MATERIALS IS ASSUMED BY YOU. ADOBE DISCLAIMS ALL WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, WITH REGARD TO YOUR USE OF THE LICENSED MATERIALS, INCLUDING, BUT NOT LIMITED TO, NONINFRINGEMENT OF THIRD-PARTY RIGHTS.
    4. LIMITATION OF LIABILITY. IN NO EVENT WILL ADOBE BE LIABLE FOR ANY ACTUAL, INCIDENTAL, SPECIAL OR CONSEQUENTIAL DAMAGES, INCLUDING WITHOUT LIMITATION, LOSS OF PROFITS OR OTHER COMMERCIAL LOSS, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THE LICENSED MATERIALS, EVEN IF ADOBE HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
    5. TERM AND TERMINATION.
    5.1 The License is effective upon acceptance by you and will remain in effect unless terminated earlier in accordance with Section 5.2.
    5.2 Any breach of any material provision of this License will automatically terminate the rights granted herein.
    5.3 Sections 2 (Ownership and Attribution), 3 (Disclaimer of Warranties), 4 (Limitation of Liability) will survive termination of this License.
    ## License for PodcastFillers Dataset audio files

    All of the podcast episode audio files come from SoundCloud. Please see podcast_episode_license.csv (included in the dataset) for detailed license info for each episode. The licenses include CC BY 3.0, CC BY-SA 3.0, and CC BY-ND 3.0.

    ACKNOWLEDGEMENT:
    Please cite the following paper in work that makes use of this dataset:

    Filler Word Detection and Classification: A Dataset and Benchmark
    Ge Zhu, Juan-Pablo Caceres and Justin Salamon
    In 23rd Annual Cong. of the Int. Speech Communication Association (INTERSPEECH), Incheon, Korea, Sep. 2022.

    Bibtex

    @inproceedings{Zhu:FillerWords:INTERSPEECH:22,
     title = {Filler Word Detection and Classification: A Dataset and Benchmark},
     booktitle = {23rd Annual Cong.~of the Int.~Speech Communication Association (INTERSPEECH)},
     address = {Incheon, Korea}, 
     month = {Sep.},
     url = {https://arxiv.org/abs/2203.15135},
     author = {Zhu, Ge and Caceres, Juan-Pablo and Salamon, Justin},
     year = {2022},
    }

    ANNOTATIONS:
    The annotations include 85,803 manually annotated audio events covering common English filler-word and non-filler-word events. We also provide automatically-generated speech transcripts from a speech-to-text system, which do not contain the manually annotated events.
    Full label vocabulary
    Each of the 85,803 manually annotated events is labeled as one of 5 filler classes or 8 non-filler classes (label: number of events).

    Fillers
    - Uh: 17,907
    - Um: 17,078
    - You know: 668
    - Other: 315
    - Like: 157

    Non-fillers
    - Words: 12,709
    - Repetitions: 9,024
    - Breath: 8,288
    - Laughter: 6,623
    - Music: 5,060
    - Agree (agreement sounds, e.g., “mm-hmm”, “ah-ha”): 3,755
    - Noise: 2,735
    - Overlap (overlapping speakers): 1,484

    Total: 85,803
    Consolidated label vocabulary
    76,689 of the audio events are also labeled with a smaller, consolidated vocabulary with 6 classes. The consolidated vocabulary was obtained by removing classes with fewer than 5,000 annotations (like, you know, other, agreement sounds, overlapping speakers, noise) and grouping “repetitions” and “words” into “words”.

    - Words: 21,733
    - Uh: 17,907
    - Um: 17,078
    - Breath: 8,288
    - Laughter: 6,623
    - Music: 5,060

    - Total: 76,689

    The consolidated vocabulary was used to train FillerNet.

    For a detailed description of how the dataset was created, please see our paper.
    Data Split for Machine Learning:
    To facilitate machine learning experiments, the audio data in this dataset (full-length recordings and preprocessed 1-sec clips) are pre-arranged into “train”, “validation”, and “test” folders. This split ensures that episodes from the same podcast show are always in the same subset (train, validation, or test), to prevent speaker leakage. We also ensured that each subset in this split remains gender balanced, same as the complete dataset.

    We strongly recommend using this split in your experiments. It will ensure your results are not inflated due to overfitting, and that they are comparable to the results published in the FillerNet paper.

    AUDIO FILES:

    1. Full-length podcast episodes (MP3)
    199 audio files of the full-length podcast episode recordings in mp3 format, stereo channels, 44.1 kHz sample rate and 32 bit depth. Filename format: [show name]_[episode name].mp3.

    2. Pre-processed full-length podcast episodes (WAV)
    199 audio files of the full-length podcast episode recordings in wav format, mono channel, 16 kHz sample rate and 32 bit depth. The files are split into train, validation and test partitions (folders), see Data Split for Machine Learning above. Filename format: [show name]_[episode name].wav

    3. Pre-processed WAV clips
    Pre-processed 1-second audio clips of the annotated events, where each clip is centered on the center of the event. Annotated events longer than 1 second are truncated around the center to 1 second. The clips are in the same format as the pre-processed full-length podcast episodes: wav format, mono channel, 16 kHz sample rate and 32 bit depth.

    The clips that have consolidated vocabulary labels (76,689) are split into “train”, “validation” and “test” partitions (folders), see Data Split for Machine Learning above. The remainder of the clips (9,114) are placed in an “extra” folder.
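
    As an illustration of how the 1-second clips relate to the full-length recordings, here is a hypothetical sketch that cuts a window centered on an annotated event from a pre-processed 16 kHz episode; the file names and CSV column names are placeholders, so consult the dataset's annotation files for the real schema:

    import pandas as pd
    import soundfile as sf

    SR = 16000
    episode, sr = sf.read("train/show_episode.wav")  # pre-processed episode (placeholder path)
    assert sr == SR

    annotations = pd.read_csv("show_episode.csv")    # placeholder annotation file
    for _, row in annotations.iterrows():
        center = (row["event_start_s"] + row["event_end_s"]) / 2  # hypothetical columns
        start = max(int(center * SR) - SR // 2, 0)   # 0.5 s before the event center
        clip = episode[start:start + SR]             # 1-second clip
        # clips at file edges may come out short and would need zero-padding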

  5. MSP-Podcast Dataset

    • paperswithcode.com
    Updated Feb 11, 2021
    Cite
    (2021). MSP-Podcast Dataset [Dataset]. https://paperswithcode.com/dataset/msp-podcast
    Explore at:
    Dataset updated
    Feb 11, 2021
    Description

    The MSP-Podcast corpus contains speech segments from podcast recordings which are perceptually annotated using crowdsourcing. The collection of this corpus is an ongoing process. Version 1.7 of the corpus has 62,140 speaking turns (100 hours).

    Key features of this corpus:

    • We download available audio recordings with common licenses. We only use the podcasts that have less restrictive licenses, so we can modify, sell, and distribute the corpus (you can use it for commercial products!).
    • Most of the segments in a regular podcast are neutral, so we use machine learning techniques trained with available data to retrieve candidate segments. This approach allows us to spend our resources on speech segments that are likely to convey emotions.
    • These candidate segments are emotionally annotated with crowdsourcing.
    • We annotate categorical emotions and attribute-based labels at the speaking-turn level.
    • This is an ongoing effort; we currently have 62,140 speaking turns (100 hours) and collect approximately 10,000-13,000 new speaking turns per year. Our goal is to reach 400 hours.

  6. Podcast Database - Complete Podcast Metadata, All Countries & Languages

    • datarade.ai
    .csv, .sql, .json
    Updated May 27, 2025
    Cite
    Listen Notes (2025). Podcast Database - Complete Podcast Metadata, All Countries & Languages [Dataset]. https://datarade.ai/data-products/podcast-database-complete-podcast-metadata-all-countries-listen-notes
    Explore at:
    Available download formats: .csv, .sql, .json
    Dataset updated
    May 27, 2025
    Dataset authored and provided by
    Listen Notes
    Area covered
    Zambia, Turkey, Colombia, Indonesia, Bosnia and Herzegovina, Guinea-Bissau, Slovenia, Anguilla, Gibraltar, Iran (Islamic Republic of)
    Description

    == Quick facts ==

    • The most up-to-date and comprehensive podcast database available
    • All languages & all countries
    • Includes over 3,500,000 podcasts
    • Features 35+ data fields, such as basic metadata, global rank, RSS feed (with audio URLs), Spotify links, and more
    • Delivered in SQLite format
    • Learn how we build a high-quality podcast database: https://www.listennotes.help/article/105-high-quality-podcast-database-from-listen-notes

    == Use Cases ==

    • AI training, including speech recognition, generative AI, voice cloning / synthesis, and news analysis
    • Alternative data for investment research, such as sentiment analysis of executive interviews, market research, and tracking investment themes
    • PR and marketing, including social monitoring, content research, outreach, and guest booking
    • ...

    == Data Attributes ==

    See the full list of data attributes on this page: https://www.listennotes.com/podcast-datasets/fields/?filter=podcast_only

    How to access podcast audio files: Our dataset includes RSS feed URLs for all podcasts. You can retrieve audio for over 170 million episodes directly from these feeds. With access to the raw audio, you’ll have high-quality podcast speech data ideal for AI training and related applications.
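
    As a sketch of that workflow, one way to pull episode audio URLs from a single RSS feed with the feedparser library; the feed URL is a placeholder standing in for a value from the dataset's RSS field:

    import feedparser

    feed = feedparser.parse("https://example.com/podcast/rss")  # placeholder feed URL
    for entry in feed.entries:
        # Podcast audio is conventionally attached as an RSS enclosure.
        for enclosure in entry.get("enclosures", []):
            print(entry.get("title"), enclosure.get("href"))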

    == Custom Offers ==

    We can provide custom datasets based on your needs, such as language-specific data, daily/weekly/monthly update frequency, or one-time purchases.

    We also provide a RESTful API at PodcastAPI.com

    Contact us: hello@listennotes.com

    == Need Help? ==

    If you have any questions about our products, feel free to reach out to hello@listennotes.com

    == About Listen Notes, Inc. ==

    Since 2017, Listen Notes, Inc. has provided the leading podcast search engine and podcast database.

  7. This American Life Podcast Transcript Dataset

    • kaggle.com
    Updated Dec 18, 2023
    Cite
    The Devastator (2023). This American Life Podcast Transcript Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/this-american-life-podcast-transcript-dataset/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 18, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    Description

    This American Life Podcast Transcript Dataset

    This American Life Podcast Transcripts with Speaker Information and Timestamps

    By Chris Jewell [source]

    About this dataset

    This dataset provides a comprehensive collection of the transcripts for every episode of the popular podcast This American Life since its inception in November 1995. The dataset includes detailed speaker information, timestamps, and act or segment names for each line spoken throughout the episodes.

    With a focus on web scraping using Python and utilizing the powerful BeautifulSoup library, this dataset was meticulously created to offer researchers and enthusiasts an invaluable resource for various analytical purposes. Whether it be sentiment analysis, linguistic studies, or other forms of textual analysis, these transcripts provide a rich mine of data waiting to be explored.

    The informative columns in this dataset include episode number, radio date (when each episode was aired), title (of each episode), act name (or segment title within an episode), line text (the spoken text by speakers), and speaker class (categorizing speakers into different roles such as host, guest, narrator). The timestamp column further enhances the precision by indicating when each line was spoken during an episode.

    In summary, this comprehensive collection showcases years' worth of captivating storytelling and insightful discussions from This American Life.

    How to use the dataset

    • Exploring Episode Information:

      • The episode_number column represents the number assigned to each episode of the podcast. You can use this column to identify and filter specific episodes based on their number.
      • The title column contains the title of each episode. You can utilize it to search for episodes related to specific topics or themes.
      • The radio_date column indicates when an episode was aired on the radio. It helps in understanding chronological order and exploring episodes released during specific time periods.
    • Analyzing Speaker Information:

      • The speaker_class column classifies speakers into different categories such as host, guest, or narrator. You can analyze speakers based on their roles or categories throughout various episodes.
      • By examining individual speakers' lines using the line_text column, you can explore patterns in speech or track conversations involving specific individuals.
    • Understanding Act/Segment Details:

      • Some episodes may have multiple acts or segments that cover different stories within a single episode. The act_name column provides insight into these act titles or segment names.
    • Utilizing Timestamps:

      • Each line spoken by a speaker is associated with a timestamp represented in the timestamp field. This enables mapping spoken lines to specific points within an episode.

    • Performing Textual Analysis:

      • Perform sentiment analysis by analyzing text-based sentiments expressed by different speakers across various episodes.
      • Conduct topic modeling techniques like Latent Dirichlet Allocation (LDA) to identify recurring themes or topics discussed in This American Life episodes.
      • Utilize natural language processing techniques to understand linguistic patterns, word frequencies, and sentiment changes over time or across different speakers.

    Please note: ensure you have basic knowledge of data manipulation, analysis, and visualization techniques; consider preprocessing the text data by cleaning punctuation and stopwords and normalizing words for optimal analysis results; and feel free to combine this dataset with external sources, such as additional transcripts, for comprehensive analysis.
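
    As a starting point for the analyses above, a brief sketch assuming the transcripts ship as a single CSV with the columns described (the file name is a placeholder):

    import pandas as pd

    df = pd.read_csv("this_american_life_transcripts.csv")  # placeholder file name

    # Lines per speaker class (host, guest, narrator, ...) across all episodes.
    print(df["speaker_class"].value_counts())

    # All lines from one episode, in spoken order.
    episode = df[df["episode_number"] == 1].sort_values("timestamp")

    # Average line length per speaker class: a crude proxy for who talks most.
    df["n_words"] = df["line_text"].str.split().str.len()
    print(df.groupby("speaker_class")["n_words"].mean())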

    Research Ideas

    • Sentiment Analysis: With the transcript data and speaker information, this dataset can be used to perform sentiment analysis on each line spoken by different speakers in the podcast episodes. This can provide insights into the overall tone and sentiment of the podcast episodes.
    • Speaker Analysis: By analyzing the speaker information and their respective lines, this dataset can be used to analyze patterns in terms of who speaks more or less frequently, which speakers are more prominent or influential in certain episodes or acts, and how different speakers contribute to the narrative structure of each episode.
    • Topic Modeling: By using natural language processing techniques, this dataset can be used for topic modeling analysis to identify recurring themes or topics discussed in This American Life episodes. This can help uncover patterns or track how certain topics have evolved over time throughout the podcast's history.

    Acknowledgements

    If yo...

  8. Podcast PR Contacts - Self-Service CSV Batch Export

    • datarade.ai
    .csv, .xls
    Updated May 27, 2025
    Cite
    Listen Notes (2025). Podcast PR Contacts - Self-Service CSV Batch Export [Dataset]. https://datarade.ai/data-products/podcast-pr-contacts-self-service-csv-batch-export-listen-notes
    Explore at:
    Available download formats: .csv, .xls
    Dataset updated
    May 27, 2025
    Dataset authored and provided by
    Listen Notes
    Area covered
    Algeria, Costa Rica, Israel, Gibraltar, Kuwait, Bulgaria, French Polynesia, Congo, Benin, Dominican Republic
    Description

    == Quick starts ==

    Batch export podcast metadata to CSV files:

    1) Export by search keyword: https://www.listennotes.com/podcast-datasets/keyword/

    2) Export by category: https://www.listennotes.com/podcast-datasets/category/

    == Quick facts ==

    • The most up-to-date and comprehensive podcast database available
    • All languages & all countries
    • Includes over 3,500,000 podcasts
    • Features 35+ data fields, such as basic metadata, global rank, RSS feed (with audio URLs), Spotify links, and more
    • Delivered in CSV format

    == Data Attributes ==

    See the full list of data attributes on this page: https://www.listennotes.com/podcast-datasets/fields/?filter=podcast_only

    How to access podcast audio files: Our dataset includes RSS feed URLs for all podcasts. You can retrieve audio for over 170 million episodes directly from these feeds. With access to the raw audio, you’ll have high-quality podcast speech data ideal for AI training and related applications.
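
    As a sketch, one hypothetical way to load such a CSV export with pandas and collect the feed URLs; the file name and the "rss" column name are placeholders (see the data attributes page above for the real field list):

    import pandas as pd

    podcasts = pd.read_csv("listennotes_export.csv")  # placeholder file name
    print(len(podcasts), "podcasts in this export")

    # RSS feed URLs, from which episode audio can then be retrieved.
    feed_urls = podcasts["rss"].dropna().tolist()     # hypothetical column name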

    == Custom Offers ==

    We can provide custom datasets based on your needs, such as language-specific data, daily/weekly/monthly update frequency, or one-time purchases.

    We also provide a RESTful API at PodcastAPI.com

    Contact us: hello@listennotes.com

    == Need Help? ==

    If you have any questions about our products, feel free to reach out to hello@listennotes.com

    == About Listen Notes, Inc. ==

    Since 2017, Listen Notes, Inc. has provided the leading podcast search engine and podcast database.

  9. Lex Fridman Conversations Dataset

    • opendatabay.com
    Updated Jul 6, 2025
    Cite
    Datasimple (2025). Lex Fridman Conversations Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/7ac3f8a4-bf56-46c6-b743-4a5e246f7940
    Explore at:
    Dataset updated
    Jul 6, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Data Science and Analytics
    Description

    Dive deep into the fascinating world of conversations between Lex Fridman and his esteemed guests with this dataset of podcast transcripts. It features discussions with thought leaders from diverse fields such as technology, science, philosophy, and art, offering a treasure trove of insights and wisdom. Researchers, data scientists, and enthusiasts can explore the nuances of each conversation, uncover emerging trends, and gain valuable knowledge through text analysis, enabling a deeper understanding of human knowledge and curiosity. Each entry includes details such as the guest's name, episode title, and the transcript text, providing a rich resource for analysis and exploration.

    Columns

    • id: Episode ID
    • guest: Name of the guest that appeared in the episode
    • title: Title of the episode
    • text: Transcript of the episode

    Distribution

    This dataset is typically provided in a CSV file format. Specific numbers for the total rows or records are not detailed in the available information, though unique values for guests (317) and titles (318) are noted. A sample file will be uploaded separately to the platform.
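
    A quick-look sketch assuming the CSV format described above; the file name is a placeholder, and the uniqueness counts should roughly match the 317/318 figures noted:

    import pandas as pd

    df = pd.read_csv("lex_fridman_transcripts.csv")  # placeholder file name
    print(df["guest"].nunique(), "unique guests")    # expected: ~317
    print(df["title"].nunique(), "unique titles")    # expected: ~318

    # Transcript length per episode, in words.
    df["n_words"] = df["text"].str.split().str.len()
    print(df[["title", "n_words"]].sort_values("n_words", ascending=False).head())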

    Usage

    This dataset is ideal for:

    • Uncovering trends and extracting key insights from podcast discussions.
    • Gaining a deeper understanding of topics discussed on the podcast.
    • Conducting sentiment analysis on conversations.
    • Performing topic modelling to identify key themes.
    • Any other text analysis tasks involving in-depth human conversations.

    Coverage

    The dataset's content is global in its scope, reflecting the international reach of the podcast and its guests. Specific time ranges for the podcast episodes themselves are not provided in the available details.

    License

    CC0

    Who Can Use It

    • Researchers: For academic studies on discourse, knowledge propagation, or specific domain analysis.
    • Data Scientists: To perform advanced text analysis, build Natural Language Processing (NLP) models, or extract structured insights from unstructured text.
    • Enthusiasts: Individuals interested in exploring the discussions of prominent figures in technology, science, philosophy, and art.

    Dataset Name Suggestions

    • Lex Fridman Podcast Transcripts
    • Lex Fridman Conversations Dataset
    • Lex Fridman Show Transcripts
    • Thought Leader Podcast Transcripts

    Attributes

    Original Data Source: Lex Fridman Podcast Transcript

  10. SPoRC

    • huggingface.co
    Updated Nov 8, 2024
    Cite
    Ben Litterer (2024). SPoRC [Dataset]. https://huggingface.co/datasets/blitt/SPoRC
    Explore at:
    Dataset updated
    Nov 8, 2024
    Authors
    Ben Litterer
    Description

    SPORC: the Structured Podcast Open Research Corpus (V 1.0)

    SPORC is a large multimodal dataset for the study of the podcast ecosystem. Included in our data are podcast metadata, transcripts, speaker-turn labels, speaker-role labels, and speaker audio features. For more information on the collection and processing of this data, alongside an initial analysis of the podcast ecosystem, please refer to our paper or our GitHub repositories for analysis and data processing. Our dataset… See the full description on the dataset page: https://huggingface.co/datasets/blitt/SPoRC.

  11. Data from: Why people listen: Motivations and outcomes of podcast listening

    • researchdata.edu.au
    Updated May 5, 2022
    + more versions
    Cite
    Dr Stephanie Tobin (2022). Why people listen: Motivations and outcomes of podcast listening [Dataset]. https://researchdata.edu.au/why-people-listen-podcast-listening/1944842
    Explore at:
    Dataset updated
    May 5, 2022
    Dataset provided by
    Queensland University of Technology
    Authors
    Dr Stephanie Tobin
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Attribution 3.0 (CC BY 3.0)https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Time period covered
    Jul 31, 2019 - Aug 3, 2019
    Description

    This SPSS dataset is from a 2019 survey. There are 323 participants in the file, 306 with complete data for the key measures. Measures include the Big Five Inventory, the Interest/Deprivation Curiosity Scale, the Need for Cognition Scale, the Need to Belong Scale, the Basic Psychological Need Satisfaction Scale, the General Belongingness Scale, the Meaning in Life Questionnaire, the Mindful Attention Awareness Scale, the Smartphone Addiction Scale, and some questions about listening to podcasts.

    In relation to podcasts, participants were first asked if they had ever listened to a podcast. Those who said yes (N = 240) were asked questions related to amount of listening, categories and format of podcasts, setting of listening, device used, social engagement around podcasts, and parasocial relationships with their favourite podcast host. Participants also indicated their age, gender, and country of residence.

    The datafile contains item ratings and scale scores for all measures. Item wording and response labels are provided in the variable view tab of the downloaded file. Other files available on the OSF site include a syntax file related to the analyses reported in a published paper and a copy of the survey.

  12. Deezer Podcast Dataset for Topic Modeling

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 11, 2022
    Cite
    Epure, Elena (2022). Deezer Podcast Dataset for Topic Modeling [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5834060
    Explore at:
    Dataset updated
    Jan 11, 2022
    Dataset provided by
    Baranes, Marion
    Epure, Elena
    Valero, Francisco B.
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We release a new dataset consisting of podcast metadata (title and description) for 29,539 shows. This dataset can be used to reproduce the experiments from the article Topic Modeling on Podcast Short-Text Metadata, accepted at the ECIR 2022 conference.

    More information about this data and how it should be used in experiments can be found in our paper and GitHub repository.

    Please cite our paper if you use the code or data.

  13. 80,000 hours podcast all transcripts

    • kaggle.com
    Updated Mar 21, 2020
    Cite
    Andreas Gravrok (2020). 80,000 hours podcast all transcripts [Dataset]. https://www.kaggle.com/andreasgravrok/80000-hours-podcast-all-transcripts/activity
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 21, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Andreas Gravrok
    License

    Community Data License Agreement - Permissive 1.0: https://cdla.io/permissive-1-0/

    Description

    Context

    In my quest to create an effective altruism chatbot, I decided that I needed plenty of EA lingo to train on. So here are the transcripts of the first 70-ish episodes of the 80,000 Hours podcast. I used them together with GPT-2.

    Content

    The data is scraped from the transcripts on the 80,000 Hours podcast site, so there are a number of timestamps, all surrounded by brackets. Almost every paragraph has either Robert Wiblin's name or a guest's name in front of it.
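
    Given that structure, a hypothetical cleanup sketch that strips bracketed timestamps and splits off a leading speaker name; the regular expressions assume paragraphs shaped like "Robert Wiblin: ... [01:23:45] ..." and may need tuning to the actual scrape:

    import re

    def clean_paragraph(paragraph):
        # Remove bracketed timestamps such as [01:23:45] or [12:34].
        text = re.sub(r"\[\d{1,2}:\d{2}(?::\d{2})?\]", "", paragraph)
        # Split off a leading "Speaker Name:" if present.
        match = re.match(r"([A-Z][\w.\- ]+):\s*(.*)", text.strip(), re.DOTALL)
        if match:
            return match.group(1), match.group(2)
        return None, text.strip()

    speaker, text = clean_paragraph("Robert Wiblin: Welcome [00:01:02] to the show.")
    # speaker == "Robert Wiblin"; text keeps a double space where the stamp was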

    Acknowledgements

    Find all the podcasts here https://80000hours.org/podcast/

    Inspiration

    A lot of potential for NLP analysis. How do the transcripts score on Flesch–Kincaid readability tests? How positive are they about certain topics? How does their vocabulary compare to the rest of the internet?

  14. Podcast metadata by category

    • listennotes.com
    csv
    Updated Nov 21, 2019
    Cite
    Listen Notes, Inc. (2019). Podcast metadata by category [Dataset]. https://www.listennotes.com/podcast-datasets/category/
    Explore at:
    csvAvailable download formats
    Dataset updated
    Nov 21, 2019
    Dataset provided by
    Listen Notes
    Authors
    Listen Notes, Inc.
    License

    https://www.listennotes.com/podcast-datasets/category/#terms

    Description

    Batch export all podcasts in specific countries, languages or genres.

  15. Personal Events in Dialogue Corpus Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated May 22, 2020
    Cite
    Joshua Eisenberg; Michael Sheriff (2020). Personal Events in Dialogue Corpus Dataset [Dataset]. https://paperswithcode.com/dataset/personal-events-in-dialogue-corpus
    Explore at:
    Dataset updated
    May 22, 2020
    Authors
    Joshua Eisenberg; Michael Sheriff
    Description

    The PEDC is a corpus of 14 episodes of This American Life podcast transcripts that have been annotated for events. The corpus contains the excerpts from these episodes (listed in Table 1) that are dialogue. The granularity of annotation in this corpus is the token: each token is annotated as either an event or a nonevent. For more information, please download the corpus, see the annotation guide for specifics on how we define events, and see the README for how the annotations are encoded. Much more information regarding the corpus and its use is in the paper Automatic extraction of personal events from dialogue.
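
    To make the annotation granularity concrete, a toy illustration of token-level event/nonevent labels; this is not the PEDC file format (the README documents the actual encoding), just the shape of the annotation:

    tokens = ["I", "drove", "to", "the", "station", "yesterday"]
    labels = ["nonevent", "event", "nonevent", "nonevent", "nonevent", "nonevent"]

    # One binary label per token, as in the corpus.
    for token, label in zip(tokens, labels):
        print(f"{token}\t{label}")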

  16. Data from: Joe Rogan Experience 1169 - Elon Musk

    • kaggle.com
    Updated Jul 30, 2020
    Cite
    Christian Lillelund (2020). Joe Rogan Experience 1169 - Elon Musk [Dataset]. https://www.kaggle.com/christianlillelund/joe-rogan-experience-1169-elon-musk/activity
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 30, 2020
    Dataset provided by
    Kaggle
    Authors
    Christian Lillelund
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description


    Joe Rogan Experience #1169 - Elon Musk

    Elon Musk was a guest on Rogan's podcast in September 2018, which turned out to be one of the most epic episodes ever, with Musk talking for roughly two and a half hours about topics such as artificial intelligence, the possibility of being in a computer simulation, and fancy Japanese swords.

    This is the full interview between the two and lasts for two and a half hours.

    Original transcript: https://sonix.ai/resources/full-transcript-joe-rogan-experience-elon-musk/

    Data Dictionary

    Variable     Definition
    Timestamp    When the phrase was said.
    Speaker      Name of the person who speaks.
    Text         The actual phrase.

    A few examples from the dataset:

    [00:00:00] Joe Rogan Ah, ha, ha, ha. Four, three, two, one, boom. Thank you. Thanks for doing this, man. Really appreciate it.

    [00:02:29] Joe Rogan How many did you make?

    [00:48:49] Joe Rogan Are you a proponent of the multi-universe's theory? Do you believe that there are many, many universes, and that even if this one fades out that there's other ones that are starting fresh right now, and there's an infinite number of them, and they're just constantly in a never-ending cycle of birth and death?
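
    If you work from raw transcript lines like those above rather than the CSV columns, a hypothetical parsing sketch (the speaker pattern assumes one or two capitalized words and may need adjusting):

    import re

    LINE = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\]\s+([A-Z][a-z]+(?: [A-Z][a-z]+)?)\s+(.*)$")

    def parse_line(line):
        match = LINE.match(line.strip())
        if match:
            return match.group(1), match.group(2), match.group(3)  # timestamp, speaker, text
        return None

    print(parse_line("[00:02:29] Joe Rogan How many did you make?"))
    # ('00:02:29', 'Joe Rogan', 'How many did you make?')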

    Video

    Watch the interview here: https://www.youtube.com/watch?v=ycPr5-27vSI

  17. singaporean-podcast-youtube

    • huggingface.co
    Updated Nov 4, 2024
    Cite
    Malaysia AI (2024). singaporean-podcast-youtube [Dataset]. https://huggingface.co/datasets/malaysia-ai/singaporean-podcast-youtube
    Explore at:
    Dataset updated
    Nov 4, 2024
    Dataset authored and provided by
    Malaysia AI
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Area covered
    Singapore, YouTube
    Description

    Crawl of Singaporean Podcasts on YouTube

    In total: 3,451 audio files, 1,254.6 hours of audio.

    how to download

    huggingface-cli download --repo-type dataset \
      --include '*.z*' \
      --local-dir './' \
      malaysia-ai/singaporean-podcast-youtube

    wget https://www.7-zip.org/a/7z2301-linux-x64.tar.xz
    tar -xf 7z2301-linux-x64.tar.xz
    ~/7zz x sg-podcast.zip -y -mmt40

    Licensing

    All the videos, songs, images, and graphics used in the video belong to their respective owners and I do not… See the full description on the dataset page: https://huggingface.co/datasets/malaysia-ai/singaporean-podcast-youtube.

  18. talk_tuah_podcasts

    • huggingface.co
    Updated Feb 19, 2025
    Cite
    Elijah Kurien (2025). talk_tuah_podcasts [Dataset]. https://huggingface.co/datasets/elijah0528/talk_tuah_podcasts
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 19, 2025
    Authors
    Elijah Kurien
    Description

    Talk Tuah 1

    This file is the dataset containing every Talk Tuah podcast transcript. Talk-Tuah-1 is an 80 million parameter GPT trained on all of Hailey Welch's inspirational podcast 'Talk Tuah'. This SOTA frontier model is trained on 13 hours of 'Talk Tuah'. The rationale was that the discourse in the 'Talk Tuah' podcast is the most enlightened media that any human has created. Therefore, it should outperform any other LLM on any benchmark. With sufficient training and additional compute… See the full description on the dataset page: https://huggingface.co/datasets/elijah0528/talk_tuah_podcasts.

  19. hawk-tuah

    • huggingface.co
    Updated Dec 18, 2024
    Cite
    Sleeping AI (2024). hawk-tuah [Dataset]. https://huggingface.co/datasets/sleeping-ai/hawk-tuah
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 18, 2024
    Dataset authored and provided by
    Sleeping AI
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Talk Tuah Podcast Dataset: Exploring Memes and Crypto Influence

      Overview
    

    The Talk Tuah Podcast Dataset delves into the significant role podcasts play in shaping internet trends and crypto meme culture. Focused on the Talk Tuah Podcast, this dataset examines how the podcast contributed to the viral rise of the Hawk Tuah meme, its eventual association with crypto ventures, and the financial impacts on individuals.
    The dataset captures insights into the podcast’s… See the full description on the dataset page: https://huggingface.co/datasets/sleeping-ai/hawk-tuah.

  20. malaysian-podcast-youtube

    • huggingface.co
    Updated Oct 30, 2024
    Cite
    Malaysia AI (2024). malaysian-podcast-youtube [Dataset]. https://huggingface.co/datasets/malaysia-ai/malaysian-podcast-youtube
    Explore at:
    Dataset updated
    Oct 30, 2024
    Dataset authored and provided by
    Malaysia AI
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Area covered
    Malaysia, YouTube
    Description

    Crawl of Malaysian Podcasts on YouTube

    In total: 19,092 audio files, 2,233.8 hours of audio.

    how to download

    huggingface-cli download --repo-type dataset \
      --include '*.z*' \
      --local-dir './' \
      malaysia-ai/malaysian-podcast-youtube

    wget https://www.7-zip.org/a/7z2301-linux-x64.tar.xz
    tar -xf 7z2301-linux-x64.tar.xz
    ~/7zz x malaysian-podcast.zip -y -mmt40

    Licensing

    All the videos, songs, images, and graphics used in the video belong to their respective owners and I do… See the full description on the dataset page: https://huggingface.co/datasets/malaysia-ai/malaysian-podcast-youtube.
