A set of approximately 100K podcast episodes comprising raw audio files and accompanying ASR transcripts. This represents over 47,000 hours of transcribed audio and is an order of magnitude larger than previous speech-to-text corpora.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Note: due to Zenodo limitations, we host only the metadata here. The whole dataset can be found at: https://drive.google.com/drive/u/0/folders/1tpg9WXkl4L0zU84AwLQjrFqnP-jw1t7z
We introduce PodcastMix, a dataset formalizing the task of separating background music and foreground speech in podcasts. It contains audio files at 44.1kHz and the corresponding metadata. For further details check the following paper and the associated GitHub repository:
This dataset contains four parts. Due to Zenodo's file size limitation, we host the training dataset on Google Drive. We highlight the content of the Zenodo archives within brackets:
The training dataset, PodcastMix-synth, may be found at our Google Drive repository: https://drive.google.com/drive/folders/1tpg9WXkl4L0zU84AwLQjrFqnP-jw1t7z?usp=sharing. The archive comprises 450GB of audio and metadata with the following structure:
Make sure you maintain the folder structure of the original dataset when you uncompress these files.
This dataset is created by Nicolas Schmidt, Marius Miron, Music Technology Group - Universitat Pompeu Fabra (Barcelona) and Jordi Pons. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0).
Please acknowledge PodcastMix in academic research. When the present dataset is used for academic research, we would highly appreciate it if the authors cite the following publications:
The dataset and its contents are made available on an “as is” basis and without warranties of any kind, including without limitation satisfactory quality and conformity, merchantability, fitness for a particular purpose, accuracy or completeness, or absence of errors. Subject to any liability that may not be excluded or limited by law, the UPF is not liable for, and expressly excludes, all liability for loss or damage however and whenever caused to anyone by any use of the dataset or any part of it.
PURPOSES. The data is processed for the general purpose of carrying out research development and innovation studies, works or projects. In particular, but without limitation, the data is processed for the purpose of communicating with Licensee regarding any administrative and legal / judicial purposes.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The "Podcast" ECoG dataset for modeling neural activity during natural story listening.
We introduce the “Podcast” electrocorticography (ECoG) dataset for modeling neural activity supporting natural narrative comprehension. This dataset combines the exceptional spatiotemporal resolution of human intracranial electrophysiology with a naturalistic experimental paradigm for language comprehension. In addition to the raw data, we provide a minimally preprocessed version in the high-gamma spectral band to showcase a simple pipeline and to make it easier to use. Furthermore, we include the auditory stimuli, an aligned word-level transcript, and linguistic features ranging from low-level acoustic properties to large language model (LLM) embeddings. We also include tutorials replicating previous findings and serve as a pedagogical resource and a springboard for new research. The dataset comprises 9 participants with 1,330 electrodes, including grid, depth, and strip electrodes. The participants listened to a 30-minute story with over 5,000 words. By using a natural story with high-fidelity, invasive neural recordings, this dataset offers a unique opportunity to investigate language comprehension.
Attribution-NonCommercial 2.0 (CC BY-NC 2.0): https://creativecommons.org/licenses/by-nc/2.0/
License information was derived automatically
OVERVIEW:
The PodcastFillers dataset consists of 199 full-length podcast episodes in English with manually annotated filler words and automatically generated transcripts. The podcast audio recordings, sourced from SoundCloud (www.soundcloud.com), are CC-licensed, gender-balanced, and total 145 hours of audio from over 350 speakers. The annotations are provided under a non-commercial license and consist of 85,803 manually annotated audio events including approximately 35,000 filler words (“uh” and “um”) and 50,000 non-filler events such as breaths, music, laughter, repeated words, and noise. The annotated events are also provided as pre-processed 1-second audio clips. The dataset also includes automatically generated speech transcripts from a speech-to-text system. A detailed description is provided below.
The PodcastFillers dataset homepage: PodcastFillers.github.io
The preprocessing utility functions and code repository for reproducing our experimental results: PodcastFillersUtils
LICENSE:
The PodcastFillers dataset has separate licenses for the audio data and for the metadata. The metadata includes all annotations, speech-to-text transcriptions, and model outputs including VAD activations and FillerNet classification predictions.
Note: PodcastFillers is provided for research purposes only. The metadata license prohibits commercial use, which in turn prohibits deploying technology developed using the PodcastFillers metadata (such as the CSV annotations or audio clips extracted based on these annotations) in commercial applications.
## License for PodcastFillers Dataset metadata
This license agreement (the “License”) between Adobe Inc., having a place of business at 345 Park Avenue, San Jose, California 95110-2704 (“Adobe”), and you, the individual or entity exercising rights under this License (“you” or “your”), sets forth the terms for your use of certain research materials that are owned by Adobe (the “Licensed Materials”). By exercising rights under this License, you accept and agree to be bound by its terms. If you are exercising rights under this License on behalf of an entity, then “you” means you and such entity, and you (personally) represent and warrant that you (personally) have all necessary authority to bind that entity to the terms of this License.
1. GRANT OF LICENSE.
1.1 Adobe grants you a nonexclusive, worldwide, royalty-free, revocable, fully paid license to (A) reproduce, use, modify, and publicly display the Licensed Materials for noncommercial research purposes only; and (B) redistribute the Licensed Materials, and modifications or derivative works thereof, for noncommercial research purposes only, provided that you give recipients a copy of this License upon redistribution.
1.2 You may add your own copyright statement to your modifications and/or provide additional or different license terms for use, reproduction, modification, public display, and redistribution of your modifications and derivative works, provided that such license terms limit the use, reproduction, modification, public display, and redistribution of such modifications and derivative works to noncommercial research purposes only.
1.3 For purposes of this License, noncommercial research purposes include academic research and teaching only. Noncommercial research purposes do not include commercial licensing or distribution, development of commercial products, or any other activity that results in commercial gain.
2. OWNERSHIP AND ATTRIBUTION. Adobe and its licensors own all right, title, and interest in the Licensed Materials. You must retain all copyright notices and/or disclaimers in the Licensed Materials.
3. DISCLAIMER OF WARRANTIES. THE LICENSED MATERIALS ARE PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND. THE ENTIRE RISK AS TO THE USE, RESULTS, AND PERFORMANCE OF THE LICENSED MATERIALS IS ASSUMED BY YOU. ADOBE DISCLAIMS ALL WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, WITH REGARD TO YOUR USE OF THE LICENSED MATERIALS, INCLUDING, BUT NOT LIMITED TO, NONINFRINGEMENT OF THIRD-PARTY RIGHTS.
4. LIMITATION OF LIABILITY. IN NO EVENT WILL ADOBE BE LIABLE FOR ANY ACTUAL, INCIDENTAL, SPECIAL OR CONSEQUENTIAL DAMAGES, INCLUDING WITHOUT LIMITATION, LOSS OF PROFITS OR OTHER COMMERCIAL LOSS, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THE LICENSED MATERIALS, EVEN IF ADOBE HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
5. TERM AND TERMINATION.
5.1 The License is effective upon acceptance by you and will remain in effect unless terminated earlier in accordance with Section 5.2.
5.2 Any breach of any material provision of this License will automatically terminate the rights granted herein.
5.3 Sections 2 (Ownership and Attribution), 3 (Disclaimer of Warranties), 4 (Limitation of Liability) will survive termination of this License.
## License for PodcastFillers Dataset audio files
All of the podcast episode audio files come from SoundCloud. Please see podcast_episode_license.csv (included in the dataset) for detailed license information for each episode. The licenses include CC BY 3.0, CC BY-SA 3.0, and CC BY-ND 3.0.
ACKNOWLEDGEMENT:
Please cite the following paper in work that makes use of this dataset:
Filler Word Detection and Classification: A Dataset and Benchmark
Ge Zhu, Juan-Pablo Caceres and Justin Salamon
In 23rd Annual Conference of the International Speech Communication Association (INTERSPEECH), Incheon, Korea, Sep. 2022.
Bibtex
@inproceedings{Zhu:FillerWords:INTERSPEECH:22,
title = {Filler Word Detection and Classification: A Dataset and Benchmark},
booktitle = {23rd Annual Conf.~of the Int.~Speech Communication Association (INTERSPEECH)},
address = {Incheon, Korea},
month = {Sep.},
url = {https://arxiv.org/abs/2203.15135},
author = {Zhu, Ge and Caceres, Juan-Pablo and Salamon, Justin},
year = {2022},
}
ANNOTATIONS:
The annotations include 85,803 manually annotated audio events covering common English filler-word and non-filler-word events. We also provide automatically-generated speech transcripts from a speech-to-text system, which do not contain the manually annotated events.
Full label vocabulary
Each of the 85,803 manually annotated events is labeled as one of 5 filler classes or 8 non-filler classes (label: number of events).
Fillers
- Uh: 17,907
- Um: 17,078
- You know: 668
- Other: 315
- Like: 157
Non-fillers
- Words: 12,709
- Repetitions: 9,024
- Breath: 8,288
- Laughter: 6,623
- Music: 5,060
- Agree (agreement sounds, e.g., “mm-hmm”, “ah-ha”): 3,755
- Noise: 2,735
- Overlap (overlapping speakers): 1,484
Total: 85,803
Consolidated label vocabulary
76,689 of the audio events are also labeled with a smaller, consolidated vocabulary of 6 classes. The consolidated vocabulary was obtained by removing classes with fewer than 5,000 annotations (like, you know, other, agreement sounds, overlapping speakers, noise) and grouping “repetitions” and “words” into “words”.
- Words: 21,733
- Uh: 17,907
- Um: 17,078
- Breath: 8,288
- Laughter: 6,623
- Music: 5,060
Total: 76,689
The consolidated vocabulary was used to train FillerNet.
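The consolidation rule described above (drop classes with fewer than 5,000 annotations, merge “repetitions” into “words”) can be sketched as a simple label mapping. The dictionaries below are reconstructed from the counts listed in this description, not taken from any official PodcastFillers tooling:

```python
# Full 13-class vocabulary with annotation counts, as listed above.
full_counts = {"Uh": 17907, "Um": 17078, "You know": 668, "Other": 315,
               "Like": 157, "Words": 12709, "Repetitions": 9024,
               "Breath": 8288, "Laughter": 6623, "Music": 5060,
               "Agree": 3755, "Noise": 2735, "Overlap": 1484}

# Mapping to the consolidated vocabulary; labels absent from the mapping
# (fewer than 5,000 annotations) are dropped.
consolidate = {"Words": "Words", "Repetitions": "Words", "Uh": "Uh",
               "Um": "Um", "Breath": "Breath", "Laughter": "Laughter",
               "Music": "Music"}

consolidated = {}
for label, n in full_counts.items():
    target = consolidate.get(label)  # None => class dropped
    if target:
        consolidated[target] = consolidated.get(target, 0) + n

# "Words" absorbs "Repetitions": 12,709 + 9,024 = 21,733, and the six
# consolidated classes sum to 76,689, matching the counts listed above.
assert sum(consolidated.values()) == 76689
```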
For a detailed description of how the dataset was created, please see our paper.
Data Split for Machine Learning:
To facilitate machine learning experiments, the audio data in this dataset (full-length recordings and preprocessed 1-second clips) is pre-arranged into “train”, “validation”, and “test” folders. This split ensures that episodes from the same podcast show are always in the same subset (train, validation, or test), to prevent speaker leakage. We also ensured that each subset remains gender-balanced, like the complete dataset.
We strongly recommend using this split in your experiments. It ensures your results are not inflated by overfitting, and that they are comparable to the results published in the FillerNet paper.
AUDIO FILES:
1. Full-length podcast episodes (MP3)
199 audio files of the full-length podcast episode recordings in MP3 format: stereo, 44.1 kHz sample rate, 32-bit depth. Filename format: [show name]_[episode name].mp3.
2. Pre-processed full-length podcast episodes (WAV)
199 audio files of the full-length podcast episode recordings in WAV format: mono, 16 kHz sample rate, 32-bit depth. The files are split into train, validation, and test partitions (folders); see Data Split for Machine Learning above. Filename format: [show name]_[episode name].wav
3. Pre-processed WAV clips
Pre-processed 1-second audio clips of the annotated events, where each clip is centered on the center of the event. Events longer than 1 second are truncated to their central 1 second. The clips are in the same format as the pre-processed full-length podcast episodes: WAV format, mono, 16 kHz sample rate, 32-bit depth.
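The clip-extraction rule above (a 1-second window centered on the event's midpoint) can be sketched as follows; the function name, signature, and clamping at the start of the file are illustrative, not part of the released preprocessing code:

```python
# Sketch: cut a 1-second clip centered on an annotated event's midpoint,
# matching the pre-processed audio format (16 kHz mono).
SR = 16000  # sample rate of the pre-processed WAV files

def extract_clip(samples, start_s, end_s, sr=SR, clip_s=1.0):
    """Return clip_s seconds of `samples` centered on the event's midpoint."""
    center = (start_s + end_s) / 2.0
    lo = max(0, int(round((center - clip_s / 2.0) * sr)))
    return samples[lo:lo + int(round(clip_s * sr))]

# Example: a 0.4 s "um" event from t=2.0 to t=2.4 in a 5 s recording.
audio = [0.0] * (5 * SR)
clip = extract_clip(audio, 2.0, 2.4)
assert len(clip) == SR  # exactly one second of samples
```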
The clips that have consolidated vocabulary labels (76,689) are split into “train”, “validation”, and “test” partitions (folders); see Data Split for Machine Learning above. The remaining clips (9,114) are placed in an “extra” folder.
The MSP-Podcast corpus contains speech segments from podcast recordings which are perceptually annotated using crowdsourcing. The collection of this corpus is an ongoing process. Version 1.7 of the corpus has 62,140 speaking turns (100hrs).
Key features of this corpus:
- We download available audio recordings with Creative Commons licenses, and only use podcasts with less restrictive licenses, so we can modify, sell, and distribute the corpus (you can use it for commercial products!).
- Most segments in a regular podcast are neutral, so we use machine learning models trained on available data to retrieve candidate segments, which are then emotionally annotated with crowdsourcing. This approach allows us to spend our resources on speech segments that are likely to convey emotions.
- We annotate categorical emotions and attribute-based labels at the speaking-turn level.
- This is an ongoing effort; we currently have 62,140 speaking turns (100 hours) and collect approximately 10,000-13,000 new speaking turns per year. Our goal is to reach 400 hours.
== Quick facts ==
- The most up-to-date and comprehensive podcast database available
- All languages & all countries
- Includes over 3,500,000 podcasts
- Features 35+ data fields, such as basic metadata, global rank, RSS feed (with audio URLs), Spotify links, and more
- Delivered in SQLite format

Learn how we build a high-quality podcast database: https://www.listennotes.help/article/105-high-quality-podcast-database-from-listen-notes
== Use Cases ==
- AI training, including speech recognition, generative AI, voice cloning / synthesis, and news analysis
- Alternative data for investment research, such as sentiment analysis of executive interviews, market research, and tracking investment themes
- PR and marketing, including social monitoring, content research, outreach, and guest booking
- ...
== Data Attributes ==
See the full list of data attributes on this page: https://www.listennotes.com/podcast-datasets/fields/?filter=podcast_only
How to access podcast audio files: Our dataset includes RSS feed URLs for all podcasts. You can retrieve audio for over 170 million episodes directly from these feeds. With access to the raw audio, you’ll have high-quality podcast speech data ideal for AI training and related applications.
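As noted above, episode audio sits behind each show's RSS feed as enclosure URLs. A minimal sketch of pulling those URLs out of feed XML with the Python standard library; the feed snippet here is an invented example, not data from this product:

```python
# RSS enclosures carry the episode audio URL in <enclosure url="...">.
import xml.etree.ElementTree as ET

feed_xml = """<rss><channel>
  <item><title>Ep 1</title>
    <enclosure url="https://example.com/ep1.mp3" type="audio/mpeg"/>
  </item>
</channel></rss>"""

root = ET.fromstring(feed_xml)
# Collect the audio URL from every <item> in the feed.
audio_urls = [item.find("enclosure").attrib["url"] for item in root.iter("item")]
```

In practice you would fetch the feed content from the RSS URL field included in the dataset before parsing it.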
== Custom Offers ==
We can provide custom datasets based on your needs, such as language-specific data, daily/weekly/monthly update frequency, or one-time purchases.
We also provide a RESTful API at PodcastAPI.com
Contact us: hello@listennotes.com
== Need Help? ==
If you have any questions about our products, feel free to reach out hello@listennotes.com
== About Listen Notes, Inc. ==
Since 2017, Listen Notes, Inc. has provided the leading podcast search engine and podcast database.
By Chris Jewell [source]
This dataset provides a comprehensive collection of the transcripts for every episode of the popular podcast This American Life since its inception in November 1995. The dataset includes detailed speaker information, timestamps, and act or segment names for each line spoken throughout the episodes.
With a focus on web scraping using Python and utilizing the powerful BeautifulSoup library, this dataset was meticulously created to offer researchers and enthusiasts an invaluable resource for various analytical purposes. Whether it be sentiment analysis, linguistic studies, or other forms of textual analysis, these transcripts provide a rich mine of data waiting to be explored.
The informative columns in this dataset include episode number, radio date (when each episode was aired), title (of each episode), act name (or segment title within an episode), line text (the spoken text by speakers), and speaker class (categorizing speakers into different roles such as host, guest, narrator). The timestamp column further enhances the precision by indicating when each line was spoken during an episode.
In summary, this comprehensive collection showcases years' worth of captivating storytelling and insightful discussions from This American Life.
Exploring Episode Information:
- The `episode_number` column represents the number assigned to each episode of the podcast. You can use this column to identify and filter specific episodes based on their number.
- The `title` column contains the title of each episode. You can utilize it to search for episodes related to specific topics or themes.
- The `radio_date` column indicates when an episode was aired on the radio. It helps in understanding chronological order and exploring episodes released during specific time periods.

Analyzing Speaker Information:
- The `speaker_class` column classifies speakers into different categories such as host, guest, or narrator. You can analyze speakers based on their roles or categories throughout various episodes.
- By examining individual speakers' lines using the `line_text` column, you can explore patterns in speech or track conversations involving specific individuals.

Understanding Act/Segment Details:
- Some episodes may have multiple acts or segments that cover different stories within a single episode. The `act_name` column provides insight into these act titles or segment names.

Utilizing Timestamps:
- Each line spoken by a speaker is associated with a timestamp represented in the `timestamp` field. This enables mapping spoken lines to specific points within an episode.

Textual Analysis:
- Perform sentiment analysis by analyzing text-based sentiments expressed by different speakers across various episodes.
- Conduct topic modeling techniques like Latent Dirichlet Allocation (LDA) to identify recurring themes or topics discussed in This American Life episodes.
- Utilize natural language processing techniques to understand linguistic patterns, word frequencies, and sentiment changes over time or across different speakers.
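Given the column schema described above, a minimal sketch of filtering transcript lines by speaker role with Python's standard csv module; the sample rows and exact header spellings are illustrative, not taken from the actual file:

```python
# Tiny in-memory stand-in for the transcript CSV, using the columns
# described above (values invented for illustration).
import csv
import io

sample = io.StringIO(
    "episode_number,radio_date,title,act_name,line_text,speaker_class,timestamp\n"
    "1,1995-11-17,New Beginnings,Prologue,Welcome to the show.,host,00:00:05\n"
    "1,1995-11-17,New Beginnings,Act One,It started in Chicago.,narrator,00:02:10\n"
)
rows = list(csv.DictReader(sample))

# e.g. collect every line spoken by the host
host_lines = [r["line_text"] for r in rows if r["speaker_class"] == "host"]
```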
Please note:
- Ensure you have basic knowledge of data manipulation, analysis, and visualization techniques.
- Consider preprocessing the text data by cleaning punctuation, removing stopwords, and normalizing words for optimal analysis results.
- Feel free to combine this dataset with external sources, such as additional transcripts, for more comprehensive analysis.
- Sentiment Analysis: With the transcript data and speaker information, this dataset can be used to perform sentiment analysis on each line spoken by different speakers in the podcast episodes. This can provide insights into the overall tone and sentiment of the podcast episodes.
- Speaker Analysis: By analyzing the speaker information and their respective lines, this dataset can be used to analyze patterns in terms of who speaks more or less frequently, which speakers are more prominent or influential in certain episodes or acts, and how different speakers contribute to the narrative structure of each episode.
- Topic Modeling: By using natural language processing techniques, this dataset can be used for topic modeling analysis to identify recurring themes or topics discussed in This American Life episodes. This can help uncover patterns or track how certain topics have evolved over time throughout the podcast's history.
== Quick starts ==
Batch export podcast metadata to CSV files:
1) Export by search keyword: https://www.listennotes.com/podcast-datasets/keyword/
2) Export by category: https://www.listennotes.com/podcast-datasets/category/
== Quick facts ==
- The most up-to-date and comprehensive podcast database available
- All languages & all countries
- Includes over 3,500,000 podcasts
- Features 35+ data fields, such as basic metadata, global rank, RSS feed (with audio URLs), Spotify links, and more
- Delivered in CSV format
== Data Attributes ==
See the full list of data attributes on this page: https://www.listennotes.com/podcast-datasets/fields/?filter=podcast_only
How to access podcast audio files: Our dataset includes RSS feed URLs for all podcasts. You can retrieve audio for over 170 million episodes directly from these feeds. With access to the raw audio, you’ll have high-quality podcast speech data ideal for AI training and related applications.
== Custom Offers ==
We can provide custom datasets based on your needs, such as language-specific data, daily/weekly/monthly update frequency, or one-time purchases.
We also provide a RESTful API at PodcastAPI.com
Contact us: hello@listennotes.com
== Need Help? ==
If you have any questions about our products, feel free to reach out hello@listennotes.com
== About Listen Notes, Inc. ==
Since 2017, Listen Notes, Inc. has provided the leading podcast search engine and podcast database.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Dive deep into the fascinating world of conversations between Lex Fridman and his esteemed guests with this dataset of podcast transcripts. It features discussions with thought leaders from diverse fields such as technology, science, philosophy, and art, offering a treasure trove of insights and wisdom. Researchers, data scientists, and enthusiasts can explore the nuances of each conversation, uncover emerging trends, and gain valuable knowledge through text analysis, enabling a deeper understanding of human knowledge and curiosity. Each entry includes details such as the guest's name, episode title, and the transcript text, providing a rich resource for analysis and exploration.
This dataset is typically provided in CSV file format. Specific totals for rows or records are not detailed in the available information, though unique values for guests (317) and titles (318) are noted. A sample file will be uploaded separately to the platform.
This dataset is ideal for:
- Uncovering trends and extracting key insights from podcast discussions.
- Gaining a deeper understanding of topics discussed on the podcast.
- Conducting sentiment analysis on conversations.
- Performing topic modelling to identify key themes.
- Any other text analysis tasks involving in-depth human conversations.
The dataset's content is global in its scope, reflecting the international reach of the podcast and its guests. Specific time ranges for the podcast episodes themselves are not provided in the available details.
CC0
Original Data Source: Lex Fridman Podcast Transcript
SPORC: the Structured Podcast Open Research Corpus (V 1.0)
SPORC is a large multimodal dataset for the study of the podcast ecosystem. Included in our data are podcast metadata, transcripts, speaker-turn labels, speaker-role labels, and speaker audio features. For more information on the collection and processing of this data, alongside an initial analysis of the podcast ecosystem, please refer to our paper or our GitHub repositories for analysis and data processing. Our dataset… See the full description on the dataset page: https://huggingface.co/datasets/blitt/SPoRC.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This SPSS dataset is from a 2019 survey conducted via . There are 323 participants in the file, 306 with complete data for the key measures. Measures include the Big Five Inventory, the Interest/Deprivation Curiosity Scale, the Need for Cognition Scale, the Need to Belong Scale, the Basic Psychological Need Satisfaction Scale, the General Belongingness Scale, the Meaning in Life Questionnaire, the Mindful Attention Awareness Scale, the Smartphone Addiction Scale, and some questions about listening to podcasts.
In relation to podcasts, participants were first asked if they had ever listened to a podcast. Those who said yes (N = 240) were asked questions related to amount of listening, categories and format of podcasts, setting of listening, device used, social engagement around podcasts, and parasocial relationships with their favourite podcast host. Participants also indicated their age, gender, and country of residence.
The datafile contains item ratings and scale scores for all measures. Item wording and response labels are provided in the variable view tab of the downloaded file. Other files available on the OSF site include a syntax file related to the analyses reported in a published paper and a copy of the survey.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We release a new dataset consisting of podcast metadata (title and description) for 29,539 shows. This dataset can be used to reproduce the experiments from the article Topic Modeling on Podcast Short-Text Metadata, accepted at the ECIR 2022 conference.
More information about this data and how it should be used in experiments can be found in our paper and GitHub repository.
Please cite our paper if you use the code or data.
https://cdla.io/permissive-1-0/
In my quest to create an effective altruism chatbot, I decided that I needed plenty of EA lingo to train on. So here are the transcripts of the first 70 or so podcasts from 80,000 Hours. I used it together with GPT-2.
The data is scraped from the transcripts on the 80,000 Hours podcast site, so there are a number of timestamps, all surrounded by brackets. Almost every paragraph has either Robert Wiblin's or a guest's name in front of it.
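The bracketed timestamps mentioned above can be stripped with a regex. The pattern and the sample line below are assumptions based on the description, not verified against the raw files:

```python
import re

# Invented sample line in the described "[timestamp] Speaker: text" shape.
line = "[01:23:45] Robert Wiblin: So what drew you to this problem?"

# Remove a leading bracketed H:MM or HH:MM:SS timestamp plus trailing spaces.
cleaned = re.sub(r"\[\d{1,2}:\d{2}(:\d{2})?\]\s*", "", line)
# cleaned == "Robert Wiblin: So what drew you to this problem?"
```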
Find all the podcasts here https://80000hours.org/podcast/
There is a lot of potential for NLP analysis. How does it score on Flesch–Kincaid readability tests? How positive are the speakers toward certain topics? How does their vocabulary compare to the rest of the internet?
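For the readability question above, the Flesch–Kincaid grade level is 0.39 × (words/sentences) + 11.8 × (syllables/words) − 15.59. A rough sketch follows; the syllable counter is a crude vowel-group heuristic, so treat any score it produces as approximate:

```python
import re

def syllables(word):
    # Heuristic: count runs of vowels (incl. y); every word gets at least 1.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text):
    """Approximate Flesch-Kincaid grade level of `text`."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syl = sum(syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syl / len(words)) - 15.59
```

Very simple text scores below grade 0, while long sentences with polysyllabic words push the grade up.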
https://www.listennotes.com/podcast-datasets/category/#terms
Batch export all podcasts in specific countries, languages or genres.
The PEDC is a corpus of 14 episodes of This American Life podcast transcripts that have been annotated for events. The corpus contains the dialogue excerpts from these episodes (listed in Table 1). The granularity of annotation in this corpus is the token; each token is annotated as either an event or a nonevent. For more information, please download the corpus; see the annotation guide for specifics on how we define an event, and the README for how the annotations are encoded. Much more information regarding the corpus and its use is in the paper Automatic extraction of personal events from dialogue.
https://creativecommons.org/publicdomain/zero/1.0/
Elon Musk was a guest on Rogan's podcast in September 2018, which turned out to be one of the most epic episodes ever, with Musk talking for roughly two and a half hours about topics such as artificial intelligence, the possibility of being in a computer simulation, and fancy Japanese swords.
This is the full interview between the two and lasts for two and a half hours.
Original transcript: https://sonix.ai/resources/full-transcript-joe-rogan-experience-elon-musk/
| Variable | Definition |
| --- | --- |
| Timestamp | When the phrase was said. |
| Speaker | Name of the person who speaks. |
| Text | The actual phrase. |
A few examples from the dataset:
[00:00:00] Joe Rogan Ah, ha, ha, ha. Four, three, two, one, boom. Thank you. Thanks for doing this, man. Really appreciate it.
[00:02:29] Joe Rogan How many did you make?
[00:48:49] Joe Rogan Are you a proponent of the multi-universe's theory? Do you believe that there are many, many universes, and that even if this one fades out that there's other ones that are starting fresh right now, and there's an infinite number of them, and they're just constantly in a never-ending cycle of birth and death?
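The examples above concatenate the Timestamp, Speaker, and Text fields into lines of the form `[HH:MM:SS] Speaker Text`. A sketch of recovering the timestamp in seconds with a regex; splitting the speaker name from the text is ambiguous in this flattened form, so only the timestamp is parsed:

```python
import re

line = "[00:02:29] Joe Rogan How many did you make?"
m = re.match(r"\[(\d{2}):(\d{2}):(\d{2})\]\s+(.*)", line)
h, mnt, s = (int(m.group(i)) for i in (1, 2, 3))
seconds = h * 3600 + mnt * 60 + s  # 149 seconds into the episode
rest = m.group(4)  # speaker name plus phrase, as shown in the examples
```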
Watch the interview here: https://www.youtube.com/watch?v=ycPr5-27vSI
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Crawl Youtube Singaporean Podcast
A total of 3,451 audio files, totaling 1,254.6 hours.
How to download:

```shell
huggingface-cli download --repo-type dataset \
  --include '*.z*' \
  --local-dir './' \
  malaysia-ai/singaporean-podcast-youtube
```

Then extract with 7-Zip:

```shell
wget https://www.7-zip.org/a/7z2301-linux-x64.tar.xz
tar -xf 7z2301-linux-x64.tar.xz
~/7zz x sg-podcast.zip -y -mmt40
```
Licensing
All the videos, songs, images, and graphics used in the video belong to their respective owners, and I do not… See the full description on the dataset page: https://huggingface.co/datasets/malaysia-ai/singaporean-podcast-youtube.
Talk Tuah 1
This file is the dataset containing every Talk Tuah podcast transcript. Talk-Tuah-1 is an 80 million parameter GPT trained on all of Hailey Welch's inspirational podcast 'Talk Tuah'. This SOTA frontier model is trained on 13 hours of 'Talk Tuah'. The rationale was the discourse in the 'Talk Tuah' podcast is the most enlightened media that any human has created. Therefore, it should outperform any other LLM on any benchmark. With sufficient training and additional compute… See the full description on the dataset page: https://huggingface.co/datasets/elijah0528/talk_tuah_podcasts.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Talk Tuah Podcast Dataset: Exploring Memes and Crypto Influence
Overview
The Talk Tuah Podcast Dataset delves into the significant role podcasts play in shaping internet trends and crypto meme culture. Focused on the Talk Tuah Podcast, this dataset examines how the podcast contributed to the viral rise of the Hawk Tuah meme, its eventual association with crypto ventures, and the financial impacts on individuals.
The dataset captures insights into the podcast’s… See the full description on the dataset page: https://huggingface.co/datasets/sleeping-ai/hawk-tuah.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Crawl Youtube Malaysian Podcast
A total of 19,092 audio files, totaling 2,233.8 hours.
How to download:

```shell
huggingface-cli download --repo-type dataset \
  --include '*.z*' \
  --local-dir './' \
  malaysia-ai/malaysian-podcast-youtube
```

Then extract with 7-Zip:

```shell
wget https://www.7-zip.org/a/7z2301-linux-x64.tar.xz
tar -xf 7z2301-linux-x64.tar.xz
~/7zz x malaysian-podcast.zip -y -mmt40
```
Licensing
All the videos, songs, images, and graphics used in the video belong to their respective owners, and I do… See the full description on the dataset page: https://huggingface.co/datasets/malaysia-ai/malaysian-podcast-youtube.