A set of approximately 100K podcast episodes comprising raw audio files and accompanying ASR transcripts. This represents over 47,000 hours of transcribed audio and is an order of magnitude larger than previous speech-to-text corpora.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Note: due to Zenodo limitations, we host only the metadata here. The whole dataset can be found at: https://drive.google.com/drive/u/0/folders/1tpg9WXkl4L0zU84AwLQjrFqnP-jw1t7z
We introduce PodcastMix, a dataset formalizing the task of separating background music and foreground speech in podcasts. It contains audio files at 44.1kHz and the corresponding metadata. For further details check the following paper and the associated GitHub repository:
This dataset contains four parts. Due to Zenodo's file size limitation, we host the training dataset on Google Drive. We highlight the content of the Zenodo archives within brackets:
The training dataset, PodcastMix-synth, may be found at our Google Drive repository: https://drive.google.com/drive/folders/1tpg9WXkl4L0zU84AwLQjrFqnP-jw1t7z?usp=sharing. The archive comprises 450GB of audio and metadata with the following structure:
Make sure you maintain the folder structure of the original dataset when you uncompress these files.
This dataset is created by Nicolas Schmidt, Marius Miron, Music Technology Group - Universitat Pompeu Fabra (Barcelona) and Jordi Pons. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0).
Please acknowledge PodcastMix in academic research. When the present dataset is used for academic research, we would highly appreciate it if the authors cite the following publications:
The dataset and its contents are made available on an “as is” basis and without warranties of any kind, including without limitation satisfactory quality and conformity, merchantability, fitness for a particular purpose, accuracy or completeness, or absence of errors. Subject to any liability that may not be excluded or limited by law, the UPF is not liable for, and expressly excludes, all liability for loss or damage however and whenever caused to anyone by any use of the dataset or any part of it.
PURPOSES. The data is processed for the general purpose of carrying out research development and innovation studies, works or projects. In particular, but without limitation, the data is processed for the purpose of communicating with Licensee regarding any administrative and legal / judicial purposes.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The "Podcast" ECoG dataset for modeling neural activity during natural story listening.
We introduce the “Podcast” electrocorticography (ECoG) dataset for modeling neural activity supporting natural narrative comprehension. This dataset combines the exceptional spatiotemporal resolution of human intracranial electrophysiology with a naturalistic experimental paradigm for language comprehension. In addition to the raw data, we provide a minimally preprocessed version in the high-gamma spectral band to showcase a simple pipeline and to make it easier to use. Furthermore, we include the auditory stimuli, an aligned word-level transcript, and linguistic features ranging from low-level acoustic properties to large language model (LLM) embeddings. We also include tutorials replicating previous findings and serve as a pedagogical resource and a springboard for new research. The dataset comprises 9 participants with 1,330 electrodes, including grid, depth, and strip electrodes. The participants listened to a 30-minute story with over 5,000 words. By using a natural story with high-fidelity, invasive neural recordings, this dataset offers a unique opportunity to investigate language comprehension.
Attribution-NonCommercial 2.0 (CC BY-NC 2.0): https://creativecommons.org/licenses/by-nc/2.0/
License information was derived automatically
OVERVIEW:
The PodcastFillers dataset consists of 199 full-length podcast episodes in English with manually annotated filler words and automatically generated transcripts. The podcast audio recordings, sourced from SoundCloud (www.soundcloud.com), are CC-licensed, gender-balanced, and total 145 hours of audio from over 350 speakers. The annotations are provided under a non-commercial license and consist of 85,803 manually annotated audio events including approximately 35,000 filler words (“uh” and “um”) and 50,000 non-filler events such as breaths, music, laughter, repeated words, and noise. The annotated events are also provided as pre-processed 1-second audio clips. The dataset also includes automatically generated speech transcripts from a speech-to-text system. A detailed description is provided below.
The PodcastFillers dataset homepage: PodcastFillers.github.io
The preprocessing utility functions and code repository for reproducing our experimental results: PodcastFillersUtils
LICENSE:
The PodcastFillers dataset has separate licenses for the audio data and for the metadata. The metadata includes all annotations, speech-to-text transcriptions, and model outputs including VAD activations and FillerNet classification predictions.
Note: PodcastFillers is provided for research purposes only. The metadata license prohibits commercial use, which in turn prohibits deploying technology developed using the PodcastFillers metadata (such as the CSV annotations or audio clips extracted based on these annotations) in commercial applications.
## License for PodcastFillers Dataset metadata
This license agreement (the “License”) between Adobe Inc., having a place of business at 345 Park Avenue, San Jose, California 95110-2704 (“Adobe”), and you, the individual or entity exercising rights under this License (“you” or “your”), sets forth the terms for your use of certain research materials that are owned by Adobe (the “Licensed Materials”). By exercising rights under this License, you accept and agree to be bound by its terms. If you are exercising rights under this License on behalf of an entity, then “you” means you and such entity, and you (personally) represent and warrant that you (personally) have all necessary authority to bind that entity to the terms of this License.
1. GRANT OF LICENSE.
1.1 Adobe grants you a nonexclusive, worldwide, royalty-free, revocable, fully paid license to (A) reproduce, use, modify, and publicly display the Licensed Materials for noncommercial research purposes only; and (B) redistribute the Licensed Materials, and modifications or derivative works thereof, for noncommercial research purposes only, provided that you give recipients a copy of this License upon redistribution.
1.2 You may add your own copyright statement to your modifications and/or provide additional or different license terms for use, reproduction, modification, public display, and redistribution of your modifications and derivative works, provided that such license terms limit the use, reproduction, modification, public display, and redistribution of such modifications and derivative works to noncommercial research purposes only.
1.3 For purposes of this License, noncommercial research purposes include academic research and teaching only. Noncommercial research purposes do not include commercial licensing or distribution, development of commercial products, or any other activity that results in commercial gain.
2. OWNERSHIP AND ATTRIBUTION. Adobe and its licensors own all right, title, and interest in the Licensed Materials. You must retain all copyright notices and/or disclaimers in the Licensed Materials.
3. DISCLAIMER OF WARRANTIES. THE LICENSED MATERIALS ARE PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND. THE ENTIRE RISK AS TO THE USE, RESULTS, AND PERFORMANCE OF THE LICENSED MATERIALS IS ASSUMED BY YOU. ADOBE DISCLAIMS ALL WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, WITH REGARD TO YOUR USE OF THE LICENSED MATERIALS, INCLUDING, BUT NOT LIMITED TO, NONINFRINGEMENT OF THIRD-PARTY RIGHTS.
4. LIMITATION OF LIABILITY. IN NO EVENT WILL ADOBE BE LIABLE FOR ANY ACTUAL, INCIDENTAL, SPECIAL OR CONSEQUENTIAL DAMAGES, INCLUDING WITHOUT LIMITATION, LOSS OF PROFITS OR OTHER COMMERCIAL LOSS, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THE LICENSED MATERIALS, EVEN IF ADOBE HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
5. TERM AND TERMINATION.
5.1 The License is effective upon acceptance by you and will remain in effect unless terminated earlier in accordance with Section 5.2.
5.2 Any breach of any material provision of this License will automatically terminate the rights granted herein.
5.3 Sections 2 (Ownership and Attribution), 3 (Disclaimer of Warranties), 4 (Limitation of Liability) will survive termination of this License.
## License for PodcastFillers Dataset audio files
All of the podcast episode audio files come from SoundCloud. Please see podcast_episode_license.csv (included in the dataset) for detailed license information for each episode. The licenses include CC BY 3.0, CC BY-SA 3.0, and CC BY-ND 3.0.
ACKNOWLEDGEMENT:
Please cite the following paper in work that makes use of this dataset:
Filler Word Detection and Classification: A Dataset and Benchmark
Ge Zhu, Juan-Pablo Caceres and Justin Salamon
In 23rd Annual Conference of the International Speech Communication Association (INTERSPEECH), Incheon, Korea, Sep. 2022.
Bibtex
@inproceedings{Zhu:FillerWords:INTERSPEECH:22,
title = {Filler Word Detection and Classification: A Dataset and Benchmark},
booktitle = {23rd Annual Conf.~of the Int.~Speech Communication Association (INTERSPEECH)},
address = {Incheon, Korea},
month = {Sep.},
url = {https://arxiv.org/abs/2203.15135},
author = {Zhu, Ge and Caceres, Juan-Pablo and Salamon, Justin},
year = {2022},
}
ANNOTATIONS:
The annotations include 85,803 manually annotated audio events covering common English filler-word and non-filler-word events. We also provide automatically-generated speech transcripts from a speech-to-text system, which do not contain the manually annotated events.
Full label vocabulary
Each of the 85,803 manually annotated events is labeled as one of 5 filler classes or 8 non-filler classes (label: number of events).
Fillers
- Uh: 17,907
- Um: 17,078
- You know: 668
- Other: 315
- Like: 157
Non-fillers
- Words: 12,709
- Repetitions: 9,024
- Breath: 8,288
- Laughter: 6,623
- Music: 5,060
- Agree (agreement sounds, e.g., “mm-hmm”, “ah-ha”): 3,755
- Noise: 2,735
- Overlap (overlapping speakers): 1,484
Total: 85,803
Consolidated label vocabulary
76,689 of the audio events are also labeled with a smaller, consolidated vocabulary of 6 classes. The consolidated vocabulary was obtained by removing classes with fewer than 5,000 annotations (like, you know, other, agreement sounds, overlapping speakers, noise) and grouping “repetitions” and “words” into “words”.
- Words: 21,733
- Uh: 17,907
- Um: 17,078
- Breath: 8,288
- Laughter: 6,623
- Music: 5,060
Total: 76,689
The consolidated vocabulary was used to train FillerNet.
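The consolidation rule described above (drop classes with fewer than 5,000 annotations, merge “repetitions” into “words”) can be sketched as a simple label mapping. The dictionaries below are reconstructed from the counts listed in this description, not taken from any official PodcastFillers tooling:

```python
# Full 13-class vocabulary with annotation counts, as listed above.
full_counts = {"Uh": 17907, "Um": 17078, "You know": 668, "Other": 315,
               "Like": 157, "Words": 12709, "Repetitions": 9024,
               "Breath": 8288, "Laughter": 6623, "Music": 5060,
               "Agree": 3755, "Noise": 2735, "Overlap": 1484}

# Mapping to the consolidated vocabulary; labels absent from the mapping
# (fewer than 5,000 annotations) are dropped.
consolidate = {"Words": "Words", "Repetitions": "Words", "Uh": "Uh",
               "Um": "Um", "Breath": "Breath", "Laughter": "Laughter",
               "Music": "Music"}

consolidated = {}
for label, n in full_counts.items():
    target = consolidate.get(label)  # None => class dropped
    if target:
        consolidated[target] = consolidated.get(target, 0) + n

# "Words" absorbs "Repetitions": 12,709 + 9,024 = 21,733, and the six
# consolidated classes sum to 76,689, matching the counts listed above.
assert sum(consolidated.values()) == 76689
```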
For a detailed description of how the dataset was created, please see our paper.
Data Split for Machine Learning:
To facilitate machine learning experiments, the audio data in this dataset (full-length recordings and preprocessed 1-second clips) is pre-arranged into “train”, “validation”, and “test” folders. This split ensures that episodes from the same podcast show are always in the same subset (train, validation, or test), to prevent speaker leakage. We also ensured that each subset remains gender-balanced, like the complete dataset.
We strongly recommend using this split in your experiments. It ensures your results are not inflated by overfitting, and that they are comparable to the results published in the FillerNet paper.
AUDIO FILES:
1. Full-length podcast episodes (MP3)
199 audio files of the full-length podcast episode recordings in MP3 format: stereo, 44.1 kHz sample rate, 32-bit depth. Filename format: [show name]_[episode name].mp3.
2. Pre-processed full-length podcast episodes (WAV)
199 audio files of the full-length podcast episode recordings in WAV format: mono, 16 kHz sample rate, 32-bit depth. The files are split into train, validation, and test partitions (folders); see Data Split for Machine Learning above. Filename format: [show name]_[episode name].wav
3. Pre-processed WAV clips
Pre-processed 1-second audio clips of the annotated events, where each clip is centered on the center of the event. Events longer than 1 second are truncated to their central 1 second. The clips are in the same format as the pre-processed full-length podcast episodes: WAV format, mono, 16 kHz sample rate, 32-bit depth.
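The clip-extraction rule above (a 1-second window centered on the event's midpoint) can be sketched as follows; the function name, signature, and clamping at the start of the file are illustrative, not part of the released preprocessing code:

```python
# Sketch: cut a 1-second clip centered on an annotated event's midpoint,
# matching the pre-processed audio format (16 kHz mono).
SR = 16000  # sample rate of the pre-processed WAV files

def extract_clip(samples, start_s, end_s, sr=SR, clip_s=1.0):
    """Return clip_s seconds of `samples` centered on the event's midpoint."""
    center = (start_s + end_s) / 2.0
    lo = max(0, int(round((center - clip_s / 2.0) * sr)))
    return samples[lo:lo + int(round(clip_s * sr))]

# Example: a 0.4 s "um" event from t=2.0 to t=2.4 in a 5 s recording.
audio = [0.0] * (5 * SR)
clip = extract_clip(audio, 2.0, 2.4)
assert len(clip) == SR  # exactly one second of samples
```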
The clips that have consolidated vocabulary labels (76,689) are split into “train”, “validation”, and “test” partitions (folders); see Data Split for Machine Learning above. The remaining clips (9,114) are placed in an “extra” folder.
The MSP-Podcast corpus contains speech segments from podcast recordings which are perceptually annotated using crowdsourcing. The collection of this corpus is an ongoing process. Version 1.7 of the corpus has 62,140 speaking turns (100hrs).
Key features of this corpus:
- We download available audio recordings with Creative Commons licenses, and only use podcasts with less restrictive licenses, so we can modify, sell, and distribute the corpus (you can use it for commercial products!).
- Most segments in a regular podcast are neutral, so we use machine learning models trained on available data to retrieve candidate segments, which are then emotionally annotated with crowdsourcing. This approach allows us to spend our resources on speech segments that are likely to convey emotions.
- We annotate categorical emotions and attribute-based labels at the speaking-turn level.
- This is an ongoing effort; we currently have 62,140 speaking turns (100 hours) and collect approximately 10,000-13,000 new speaking turns per year. Our goal is to reach 400 hours.
== Quick facts ==
- The most up-to-date and comprehensive podcast database available
- All languages & all countries
- Includes over 3,500,000 podcasts
- Features 35+ data fields, such as basic metadata, global rank, RSS feed (with audio URLs), Spotify links, and more
- Delivered in SQLite format

Learn how we build a high-quality podcast database: https://www.listennotes.help/article/105-high-quality-podcast-database-from-listen-notes
== Use Cases ==
- AI training, including speech recognition, generative AI, voice cloning / synthesis, and news analysis
- Alternative data for investment research, such as sentiment analysis of executive interviews, market research, and tracking investment themes
- PR and marketing, including social monitoring, content research, outreach, and guest booking
- ...
== Data Attributes ==
See the full list of data attributes on this page: https://www.listennotes.com/podcast-datasets/fields/?filter=podcast_only
How to access podcast audio files: Our dataset includes RSS feed URLs for all podcasts. You can retrieve audio for over 170 million episodes directly from these feeds. With access to the raw audio, you’ll have high-quality podcast speech data ideal for AI training and related applications.
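As noted above, episode audio sits behind each show's RSS feed as enclosure URLs. A minimal sketch of pulling those URLs out of feed XML with the Python standard library; the feed snippet here is an invented example, not data from this product:

```python
# RSS enclosures carry the episode audio URL in <enclosure url="...">.
import xml.etree.ElementTree as ET

feed_xml = """<rss><channel>
  <item><title>Ep 1</title>
    <enclosure url="https://example.com/ep1.mp3" type="audio/mpeg"/>
  </item>
</channel></rss>"""

root = ET.fromstring(feed_xml)
# Collect the audio URL from every <item> in the feed.
audio_urls = [item.find("enclosure").attrib["url"] for item in root.iter("item")]
```

In practice you would fetch the feed content from the RSS URL field included in the dataset before parsing it.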
== Custom Offers ==
We can provide custom datasets based on your needs, such as language-specific data, daily/weekly/monthly update frequency, or one-time purchases.
We also provide a RESTful API at PodcastAPI.com
Contact us: hello@listennotes.com
== Need Help? ==
If you have any questions about our products, feel free to reach out hello@listennotes.com
== About Listen Notes, Inc. ==
Since 2017, Listen Notes, Inc. has provided the leading podcast search engine and podcast database.
By Chris Jewell [source]
This dataset provides a comprehensive collection of the transcripts for every episode of the popular podcast This American Life since its inception in November 1995. The dataset includes detailed speaker information, timestamps, and act or segment names for each line spoken throughout the episodes.
With a focus on web scraping using Python and utilizing the powerful BeautifulSoup library, this dataset was meticulously created to offer researchers and enthusiasts an invaluable resource for various analytical purposes. Whether it be sentiment analysis, linguistic studies, or other forms of textual analysis, these transcripts provide a rich mine of data waiting to be explored.
The informative columns in this dataset include episode number, radio date (when each episode was aired), title (of each episode), act name (or segment title within an episode), line text (the spoken text by speakers), and speaker class (categorizing speakers into different roles such as host, guest, narrator). The timestamp column further enhances the precision by indicating when each line was spoken during an episode.
In summary, this comprehensive collection showcases years' worth of captivating storytelling and insightful discussions from This American Life.
Exploring Episode Information:
- The `episode_number` column represents the number assigned to each episode of the podcast. You can use this column to identify and filter specific episodes based on their number.
- The `title` column contains the title of each episode. You can utilize it to search for episodes related to specific topics or themes.
- The `radio_date` column indicates when an episode was aired on the radio. It helps in understanding chronological order and exploring episodes released during specific time periods.

Analyzing Speaker Information:
- The `speaker_class` column classifies speakers into different categories such as host, guest, or narrator. You can analyze speakers based on their roles or categories throughout various episodes.
- By examining individual speakers' lines using the `line_text` column, you can explore patterns in speech or track conversations involving specific individuals.

Understanding Act/Segment Details:
- Some episodes may have multiple acts or segments that cover different stories within a single episode. The `act_name` column provides insight into these act titles or segment names.

Utilizing Timestamps:
- Each line spoken by a speaker is associated with a timestamp represented in the `timestamp` field. This enables mapping spoken lines to specific points within an episode.

Textual Analysis:
- Perform sentiment analysis by analyzing text-based sentiments expressed by different speakers across various episodes.
- Conduct topic modeling techniques like Latent Dirichlet Allocation (LDA) to identify recurring themes or topics discussed in This American Life episodes.
- Utilize natural language processing techniques to understand linguistic patterns, word frequencies, and sentiment changes over time or across different speakers.
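Given the column schema described above, a minimal sketch of filtering transcript lines by speaker role with Python's standard csv module; the sample rows and exact header spellings are illustrative, not taken from the actual file:

```python
# Tiny in-memory stand-in for the transcript CSV, using the columns
# described above (values invented for illustration).
import csv
import io

sample = io.StringIO(
    "episode_number,radio_date,title,act_name,line_text,speaker_class,timestamp\n"
    "1,1995-11-17,New Beginnings,Prologue,Welcome to the show.,host,00:00:05\n"
    "1,1995-11-17,New Beginnings,Act One,It started in Chicago.,narrator,00:02:10\n"
)
rows = list(csv.DictReader(sample))

# e.g. collect every line spoken by the host
host_lines = [r["line_text"] for r in rows if r["speaker_class"] == "host"]
```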
Please note:
- Ensure you have basic knowledge of data manipulation, analysis, and visualization techniques.
- Consider preprocessing the text data by cleaning punctuation, removing stopwords, and normalizing words for optimal analysis results.
- Feel free to combine this dataset with external sources, such as additional transcripts, for more comprehensive analysis.
- Sentiment Analysis: With the transcript data and speaker information, this dataset can be used to perform sentiment analysis on each line spoken by different speakers in the podcast episodes. This can provide insights into the overall tone and sentiment of the podcast episodes.
- Speaker Analysis: By analyzing the speaker information and their respective lines, this dataset can be used to analyze patterns in terms of who speaks more or less frequently, which speakers are more prominent or influential in certain episodes or acts, and how different speakers contribute to the narrative structure of each episode.
- Topic Modeling: By using natural language processing techniques, this dataset can be used for topic modeling analysis to identify recurring themes or topics discussed in This American Life episodes. This can help uncover patterns or track how certain topics have evolved over time throughout the podcast's history.
== Quick starts ==
Batch export podcast metadata to CSV files:
1) Export by search keyword: https://www.listennotes.com/podcast-datasets/keyword/
2) Export by category: https://www.listennotes.com/podcast-datasets/category/
== Quick facts ==
- The most up-to-date and comprehensive podcast database available
- All languages & all countries
- Includes over 3,500,000 podcasts
- Features 35+ data fields, such as basic metadata, global rank, RSS feed (with audio URLs), Spotify links, and more
- Delivered in CSV format
== Data Attributes ==
See the full list of data attributes on this page: https://www.listennotes.com/podcast-datasets/fields/?filter=podcast_only
How to access podcast audio files: Our dataset includes RSS feed URLs for all podcasts. You can retrieve audio for over 170 million episodes directly from these feeds. With access to the raw audio, you’ll have high-quality podcast speech data ideal for AI training and related applications.
== Custom Offers ==
We can provide custom datasets based on your needs, such as language-specific data, daily/weekly/monthly update frequency, or one-time purchases.
We also provide a RESTful API at PodcastAPI.com
Contact us: hello@listennotes.com
== Need Help? ==
If you have any questions about our products, feel free to reach out hello@listennotes.com
== About Listen Notes, Inc. ==
Since 2017, Listen Notes, Inc. has provided the leading podcast search engine and podcast database.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Dive deep into the fascinating world of conversations between Lex Fridman and his esteemed guests with this dataset of podcast transcripts. It features discussions with thought leaders from diverse fields such as technology, science, philosophy, and art, offering a treasure trove of insights and wisdom. Researchers, data scientists, and enthusiasts can explore the nuances of each conversation, uncover emerging trends, and gain valuable knowledge through text analysis, enabling a deeper understanding of human knowledge and curiosity. Each entry includes details such as the guest's name, episode title, and the transcript text, providing a rich resource for analysis and exploration.
This dataset is typically provided in CSV file format. Specific totals for rows or records are not detailed in the available information, though unique values for guests (317) and titles (318) are noted. A sample file will be uploaded separately to the platform.
This dataset is ideal for:
- Uncovering trends and extracting key insights from podcast discussions.
- Gaining a deeper understanding of topics discussed on the podcast.
- Conducting sentiment analysis on conversations.
- Performing topic modelling to identify key themes.
- Any other text analysis tasks involving in-depth human conversations.
The dataset's content is global in its scope, reflecting the international reach of the podcast and its guests. Specific time ranges for the podcast episodes themselves are not provided in the available details.
CC0
Original Data Source: Lex Fridman Podcast Transcript
SPORC: the Structured Podcast Open Research Corpus (V 1.0)
SPORC is a large multimodal dataset for the study of the podcast ecosystem. Included in our data are podcast metadata, transcripts, speaker-turn labels, speaker-role labels, and speaker audio features. For more information on the collection and processing of this data, alongside an initial analysis of the podcast ecosystem, please refer to our paper or our GitHub repositories for analysis and data processing. Our dataset… See the full description on the dataset page: https://huggingface.co/datasets/blitt/SPoRC.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This SPSS dataset is from a 2019 survey conducted via . There are 323 participants in the file, 306 with complete data for the key measures. Measures include the Big Five Inventory, the Interest/Deprivation Curiosity Scale, the Need for Cognition Scale, the Need to Belong Scale, the Basic Psychological Need Satisfaction Scale, the General Belongingness Scale, the Meaning in Life Questionnaire, the Mindful Attention Awareness Scale, the Smartphone Addiction Scale, and some questions about listening to podcasts.
In relation to podcasts, participants were first asked if they had ever listened to a podcast. Those who said yes (N = 240) were asked questions related to amount of listening, categories and format of podcasts, setting of listening, device used, social engagement around podcasts, and parasocial relationships with their favourite podcast host. Participants also indicated their age, gender, and country of residence.
The datafile contains item ratings and scale scores for all measures. Item wording and response labels are provided in the variable view tab of the downloaded file. Other files available on the OSF site include a syntax file related to the analyses reported in a published paper and a copy of the survey.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We release a new dataset consisting of podcast metadata (title and description) for 29,539 shows. This dataset can be used to reproduce the experiments from the article Topic Modeling on Podcast Short-Text Metadata, accepted at the ECIR 2022 conference.
More information about this data and how it should be used in experiments can be found in our paper and GitHub repository.
Please cite our paper if you use the code or data.
https://cdla.io/permissive-1-0/
In my quest to create an effective altruism chatbot, I decided that I needed plenty of EA lingo to train on. So here are the transcripts of the first 70 or so podcasts from 80,000 Hours. I used it together with GPT-2.
The data is scraped from the transcripts on the 80,000 Hours podcast site, so there are a number of timestamps, all surrounded by brackets. Almost every paragraph has either Robert Wiblin's or a guest's name in front of it.
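The bracketed timestamps mentioned above can be stripped with a regex. The pattern and the sample line below are assumptions based on the description, not verified against the raw files:

```python
import re

# Invented sample line in the described "[timestamp] Speaker: text" shape.
line = "[01:23:45] Robert Wiblin: So what drew you to this problem?"

# Remove a leading bracketed H:MM or HH:MM:SS timestamp plus trailing spaces.
cleaned = re.sub(r"\[\d{1,2}:\d{2}(:\d{2})?\]\s*", "", line)
# cleaned == "Robert Wiblin: So what drew you to this problem?"
```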
Find all the podcasts here https://80000hours.org/podcast/
There is a lot of potential for NLP analysis. How does it score on Flesch–Kincaid readability tests? How positive are the speakers toward certain topics? How does their vocabulary compare to the rest of the internet?
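For the readability question above, the Flesch–Kincaid grade level is 0.39 × (words/sentences) + 11.8 × (syllables/words) − 15.59. A rough sketch follows; the syllable counter is a crude vowel-group heuristic, so treat any score it produces as approximate:

```python
import re

def syllables(word):
    # Heuristic: count runs of vowels (incl. y); every word gets at least 1.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text):
    """Approximate Flesch-Kincaid grade level of `text`."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syl = sum(syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syl / len(words)) - 15.59
```

Very simple text scores below grade 0, while long sentences with polysyllabic words push the grade up.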
https://www.listennotes.com/podcast-datasets/category/#terms
Batch export all podcasts in specific countries, languages or genres.
The PEDC is a corpus of 14 episodes of This American Life podcast transcripts that have been annotated for events. The corpus contains the dialogue excerpts from these episodes (listed in Table 1). The granularity of annotation in this corpus is the token; each token is annotated as either an event or a nonevent. For more information, please download the corpus; see the annotation guide for specifics on how we define an event, and the README for how the annotations are encoded. Much more information regarding the corpus and its use is in the paper Automatic extraction of personal events from dialogue.
https://creativecommons.org/publicdomain/zero/1.0/
Elon Musk was a guest on Rogan's podcast in September 2018, which turned out to be one of the most epic episodes ever, with Musk talking for roughly two and a half hours about topics such as artificial intelligence, the possibility of being in a computer simulation, and fancy Japanese swords.
This is the full interview between the two and lasts for two and a half hours.
Original transcript: https://sonix.ai/resources/full-transcript-joe-rogan-experience-elon-musk/
| Variable | Definition |
| --- | --- |
| Timestamp | When the phrase was said. |
| Speaker | Name of the person who speaks. |
| Text | The actual phrase. |
A few examples from the dataset:
[00:00:00] Joe Rogan Ah, ha, ha, ha. Four, three, two, one, boom. Thank you. Thanks for doing this, man. Really appreciate it.
[00:02:29] Joe Rogan How many did you make?
[00:48:49] Joe Rogan Are you a proponent of the multi-universe's theory? Do you believe that there are many, many universes, and that even if this one fades out that there's other ones that are starting fresh right now, and there's an infinite number of them, and they're just constantly in a never-ending cycle of birth and death?
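The examples above concatenate the Timestamp, Speaker, and Text fields into lines of the form `[HH:MM:SS] Speaker Text`. A sketch of recovering the timestamp in seconds with a regex; splitting the speaker name from the text is ambiguous in this flattened form, so only the timestamp is parsed:

```python
import re

line = "[00:02:29] Joe Rogan How many did you make?"
m = re.match(r"\[(\d{2}):(\d{2}):(\d{2})\]\s+(.*)", line)
h, mnt, s = (int(m.group(i)) for i in (1, 2, 3))
seconds = h * 3600 + mnt * 60 + s  # 149 seconds into the episode
rest = m.group(4)  # speaker name plus phrase, as shown in the examples
```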
Watch the interview here: https://www.youtube.com/watch?v=ycPr5-27vSI
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Crawl Youtube Singaporean Podcast
A total of 3,451 audio files, totaling 1,254.6 hours.
How to download:

```shell
huggingface-cli download --repo-type dataset \
  --include '*.z*' \
  --local-dir './' \
  malaysia-ai/singaporean-podcast-youtube
```

Then extract with 7-Zip:

```shell
wget https://www.7-zip.org/a/7z2301-linux-x64.tar.xz
tar -xf 7z2301-linux-x64.tar.xz
~/7zz x sg-podcast.zip -y -mmt40
```
Licensing
All the videos, songs, images, and graphics used in the video belong to their respective owners, and I do not… See the full description on the dataset page: https://huggingface.co/datasets/malaysia-ai/singaporean-podcast-youtube.
Talk Tuah 1
This file is the dataset containing every Talk Tuah podcast transcript. Talk-Tuah-1 is an 80 million parameter GPT trained on all of Hailey Welch's inspirational podcast 'Talk Tuah'. This SOTA frontier model is trained on 13 hours of 'Talk Tuah'. The rationale was the discourse in the 'Talk Tuah' podcast is the most enlightened media that any human has created. Therefore, it should outperform any other LLM on any benchmark. With sufficient training and additional compute… See the full description on the dataset page: https://huggingface.co/datasets/elijah0528/talk_tuah_podcasts.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Talk Tuah Podcast Dataset: Exploring Memes and Crypto Influence
Overview
The Talk Tuah Podcast Dataset delves into the significant role podcasts play in shaping internet trends and crypto meme culture. Focused on the Talk Tuah Podcast, this dataset examines how the podcast contributed to the viral rise of the Hawk Tuah meme, its eventual association with crypto ventures, and the financial impacts on individuals.
The dataset captures insights into the podcast’s… See the full description on the dataset page: https://huggingface.co/datasets/sleeping-ai/hawk-tuah.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Crawl Youtube Malaysian Podcast
A total of 19,092 audio files, totaling 2,233.8 hours.
How to download:

```shell
huggingface-cli download --repo-type dataset \
  --include '*.z*' \
  --local-dir './' \
  malaysia-ai/malaysian-podcast-youtube
```

Then extract with 7-Zip:

```shell
wget https://www.7-zip.org/a/7z2301-linux-x64.tar.xz
tar -xf 7z2301-linux-x64.tar.xz
~/7zz x malaysian-podcast.zip -y -mmt40
```
Licensing
All the videos, songs, images, and graphics used in the video belong to their respective owners, and I do… See the full description on the dataset page: https://huggingface.co/datasets/malaysia-ai/malaysian-podcast-youtube.