9 datasets found
  1. Audience of Apple and Spotify podcasts in the U.S. 2020-2025

    • statista.com
    Updated May 29, 2024
    Cite
    Statista (2024). Audience of Apple and Spotify podcasts in the U.S. 2020-2025 [Dataset]. https://www.statista.com/statistics/1303252/apple-spotify-podcast-listeners-united-states/
    Dataset updated
    May 29, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    United States
    Description

    According to data from February 2022, an estimated 32.5 million people in the United States listened to podcasts on Spotify that year, while Apple had 28.5 million podcast listeners. Spotify's and Apple's figures were projected to reach 42.4 million and 29.2 million by 2025, respectively.

  2. Metadata of all public podcasts

    • listennotes.com
    sqlite
    Updated Mar 23, 2022
    Cite
    Listen Notes, Inc. (2022). Metadata of all public podcasts [Dataset]. https://www.listennotes.com/podcast-datasets/solutions/
    Available download formats: sqlite
    Dataset updated
    Mar 23, 2022
    Dataset provided by
    Listen Notes
    Authors
    Listen Notes, Inc.
    License

    https://www.listennotes.com/podcast-datasets/solutions/#terms

    Description

    Batch export all publicly accessible podcasts to a SQLite file.

  3. PodcastFillers

    • data.niaid.nih.gov
    • explore.openaire.eu
    • +1 more
    Updated Oct 9, 2022
    Cite
    Juan-Pablo Caceres (2022). PodcastFillers [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6609214
    Dataset updated
    Oct 9, 2022
    Dataset provided by
    Juan-Pablo Caceres
    Justin Salamon
    Ge Zhu
    License

    Attribution-NonCommercial 2.0 (CC BY-NC 2.0): https://creativecommons.org/licenses/by-nc/2.0/
    License information was derived automatically

    Description

    OVERVIEW: The PodcastFillers dataset consists of 199 full-length podcast episodes in English with manually annotated filler words and automatically generated transcripts. The podcast audio recordings, sourced from SoundCloud (www.soundcloud.com), are CC-licensed, gender-balanced, and total 145 hours of audio from over 350 speakers. The annotations are provided under a non-commercial license and consist of 85,803 manually annotated audio events including approximately 35,000 filler words (“uh” and “um”) and 50,000 non-filler events such as breaths, music, laughter, repeated words, and noise. The annotated events are also provided as pre-processed 1-second audio clips. The dataset also includes automatically generated speech transcripts from a speech-to-text system. A detailed description is provided below.

    The PodcastFillers dataset homepage: PodcastFillers.github.io
    The preprocessing utility functions and code repository for reproducing our experimental results: PodcastFillersUtils

    LICENSE:

    The PodcastFillers dataset has separate licenses for the audio data and for the metadata. The metadata includes all annotations, speech-to-text transcriptions, and model outputs including VAD activations and FillerNet classification predictions.

    Note: PodcastFillers is provided for research purposes only. The metadata license prohibits commercial use, which in turn prohibits deploying technology developed using the PodcastFillers metadata (such as the CSV annotations or audio clips extracted based on these annotations) in commercial applications.

    License for PodcastFillers Dataset metadata

    This license agreement (the “License”) between Adobe Inc., having a place of business at 345 Park Avenue, San Jose, California 95110-2704 (“Adobe”), and you, the individual or entity exercising rights under this License (“you” or “your”), sets forth the terms for your use of certain research materials that are owned by Adobe (the “Licensed Materials”). By exercising rights under this License, you accept and agree to be bound by its terms. If you are exercising rights under this License on behalf of an entity, then “you” means you and such entity, and you (personally) represent and warrant that you (personally) have all necessary authority to bind that entity to the terms of this License.

    1. GRANT OF LICENSE. 1.1 Adobe grants you a nonexclusive, worldwide, royalty-free, revocable, fully paid license to (A) reproduce, use, modify, and publicly display the Licensed Materials for noncommercial research purposes only; and (B) redistribute the Licensed Materials, and modifications or derivative works thereof, for noncommercial research purposes only, provided that you give recipients a copy of this License upon redistribution. 1.2 You may add your own copyright statement to your modifications and/or provide additional or different license terms for use, reproduction, modification, public display, and redistribution of your modifications and derivative works, provided that such license terms limit the use, reproduction, modification, public display, and redistribution of such modifications and derivative works to noncommercial research purposes only. 1.3 For purposes of this License, noncommercial research purposes include academic research and teaching only. Noncommercial research purposes do not include commercial licensing or distribution, development of commercial products, or any other activity that results in commercial gain.
    2. OWNERSHIP AND ATTRIBUTION. Adobe and its licensors own all right, title, and interest in the Licensed Materials. You must retain all copyright notices and/or disclaimers in the Licensed Materials.
    3. DISCLAIMER OF WARRANTIES. THE LICENSED MATERIALS ARE PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND. THE ENTIRE RISK AS TO THE USE, RESULTS, AND PERFORMANCE OF THE LICENSED MATERIALS IS ASSUMED BY YOU. ADOBE DISCLAIMS ALL WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, WITH REGARD TO YOUR USE OF THE LICENSED MATERIALS, INCLUDING, BUT NOT LIMITED TO, NONINFRINGEMENT OF THIRD-PARTY RIGHTS.
    4. LIMITATION OF LIABILITY. IN NO EVENT WILL ADOBE BE LIABLE FOR ANY ACTUAL, INCIDENTAL, SPECIAL OR CONSEQUENTIAL DAMAGES, INCLUDING WITHOUT LIMITATION, LOSS OF PROFITS OR OTHER COMMERCIAL LOSS, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THE LICENSED MATERIALS, EVEN IF ADOBE HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
    5. TERM AND TERMINATION. 5.1 The License is effective upon acceptance by you and will remain in effect unless terminated earlier in accordance with Section 5.2. 5.2 Any breach of any material provision of this License will automatically terminate the rights granted herein. 5.3 Sections 2 (Ownership and Attribution), 3 (Disclaimer of Warranties), 4 (Limitation of Liability) will survive termination of this License.

    License for PodcastFillers Dataset audio files

    All of the podcast episode audio files come from SoundCloud. Please see podcast_episode_license.csv (included in the dataset) for detailed license information for each episode. The licenses include CC-BY-3.0, CC-BY-SA 3.0 and CC-BY-ND-3.0.

    ACKNOWLEDGEMENT: Please cite the following paper in work that makes use of this dataset:

    Filler Word Detection and Classification: A Dataset and Benchmark
    Ge Zhu, Juan-Pablo Caceres and Justin Salamon
    In 23rd Annual Cong. of the Int. Speech Communication Association (INTERSPEECH), Incheon, Korea, Sep. 2022.

    Bibtex

    @inproceedings{Zhu:FillerWords:INTERSPEECH:22,
      title = {Filler Word Detection and Classification: A Dataset and Benchmark},
      booktitle = {23rd Annual Cong.~of the Int.~Speech Communication Association (INTERSPEECH)},
      address = {Incheon, Korea},
      month = {Sep.},
      url = {https://arxiv.org/abs/2203.15135},
      author = {Zhu, Ge and Caceres, Juan-Pablo and Salamon, Justin},
      year = {2022},
    }

    ANNOTATIONS: The annotations include 85,803 manually annotated audio events covering common English filler-word and non-filler-word events. We also provide automatically generated speech transcripts from a speech-to-text system, which do not contain the manually annotated events.

    Full label vocabulary

    Each of the 85,803 manually annotated events is labeled as one of 5 filler classes or 8 non-filler classes (label: number of events).

    Fillers
    • Uh: 17,907
    • Um: 17,078
    • You know: 668
    • Other: 315
    • Like: 157

    Non-fillers
    • Words: 12,709
    • Repetitions: 9,024
    • Breath: 8,288
    • Laughter: 6,623
    • Music: 5,060
    • Agree (agreement sounds, e.g., “mm-hmm”, “ah-ha”): 3,755
    • Noise: 2,735
    • Overlap (overlapping speakers): 1,484

    Total: 85,803

    Consolidated label vocabulary

    76,689 of the audio events are also labeled with a smaller, consolidated vocabulary with 6 classes. The consolidated vocabulary was obtained by removing classes with fewer than 5,000 annotations (like, you know, other, agreement sounds, overlapping speakers, noise), and grouping “repetitions” and “words” into “words”.

    • Words: 21,733
    • Uh: 17,907
    • Um: 17,078
    • Breath: 8,288
    • Laughter: 6,623
    • Music : 5,060

    • Total: 76,689

    The consolidated vocabulary was used to train FillerNet.
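
    As a rough illustration of the consolidation rule described above, the sketch below recomputes the 6-class counts from the 13-class counts. The label strings, the constant names and the consolidate helper are illustrative assumptions, not identifiers from the dataset or its utilities; only the counts and the 5,000 threshold come from the description.

```python
from collections import Counter

# Event counts per label, copied from the full label vocabulary listed above.
FULL_COUNTS = {
    "Uh": 17907, "Um": 17078, "You know": 668, "Other": 315, "Like": 157,
    "Words": 12709, "Repetitions": 9024, "Breath": 8288, "Laughter": 6623,
    "Music": 5060, "Agree": 3755, "Noise": 2735, "Overlap": 1484,
}

MIN_EVENTS = 5000                 # classes with fewer events are dropped
MERGE = {"Repetitions": "Words"}  # "repetitions" are grouped into "words"


def consolidate(counts, min_events=MIN_EVENTS, merge=MERGE):
    """Drop classes below the threshold, then merge the survivors per `merge`."""
    consolidated = Counter()
    for label, n in counts.items():
        if n < min_events:
            continue  # removed from the consolidated vocabulary
        consolidated[merge.get(label, label)] += n
    return dict(consolidated)


consolidated = consolidate(FULL_COUNTS)
for label, n in sorted(consolidated.items(), key=lambda kv: -kv[1]):
    print(f"{label}: {n:,}")
print(f"Total: {sum(consolidated.values()):,}")
```

    Under these assumptions the totals agree with the figures quoted in the description: 76,689 consolidated events, leaving 9,114 events outside the consolidated vocabulary.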

    For a detailed description of how the dataset was created, please see our paper.

    Data Split for Machine Learning: To facilitate machine learning experiments, the audio data in this dataset (full-length recordings and preprocessed 1-sec clips) are pre-arranged into “train”, “validation”, and “test” folders. This split ensures that episodes from the same podcast show are always in the same subset (train, validation, or test), to prevent speaker leakage. We also ensured that each subset in this split remains gender balanced, like the complete dataset.

    We strongly recommend using this split in your experiments. It will ensure your results are not inflated due to overfitting, and that they are comparable to the results published in the FillerNet paper.
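
    One way to sanity-check the no-leakage property of a show-level split is sketched below. It assumes you have already built a list of (show name, subset) pairs for the episodes; how to derive the show name from a filename is not specified here, and the find_leaky_shows helper and the example data are hypothetical.

```python
from collections import defaultdict


def find_leaky_shows(episodes):
    """Return shows whose episodes appear in more than one subset.

    `episodes` is an iterable of (show_name, subset) pairs, where subset is
    expected to be "train", "validation" or "test".
    """
    subsets_by_show = defaultdict(set)
    for show, subset in episodes:
        subsets_by_show[show].add(subset)
    return {
        show: sorted(subsets)
        for show, subsets in subsets_by_show.items()
        if len(subsets) > 1
    }


# Made-up example: show_c is deliberately placed in two subsets.
example = [
    ("show_a", "train"),
    ("show_a", "train"),
    ("show_b", "validation"),
    ("show_c", "test"),
    ("show_c", "train"),
]
leaks = find_leaky_shows(example)
print(leaks)
```

    An empty result means every show is confined to a single subset, which is the property the provided split is described as guaranteeing.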

    AUDIO FILES:

    1. Full-length podcast episodes (MP3): 199 audio files of the full-length podcast episode recordings in mp3 format, stereo channels, 44.1 kHz sample rate and 32 bit depth. Filename format: [show name]_[episode name].mp3.

    2. Pre-processed full-length podcast episodes (WAV): 199 audio files of the full-length podcast episode recordings in wav format, mono channel, 16 kHz sample rate and 32 bit depth. The files are split into train, validation and test partitions (folders), see Data Split for Machine Learning above. Filename format: [show name]_[episode name].wav.

    3. Pre-processed WAV clips: Pre-processed 1-second audio clips of the annotated events, where each clip is centered on the center of the event. Annotated events longer than 1 second are truncated around their center to 1 second. The clips are in the same format as the pre-processed full-length podcast episodes: wav format, mono channel, 16 kHz sample rate and 32 bit depth.

    The clips that have consolidated vocabulary labels (76,689) are split into “train”, “validation” and “test” partitions (folders), see Data Split for Machine Learning above. The remainder of the clips (9,114) are placed in an “extra” folder.

    Filename format: [pfID].wav where:

    [pfID] = the PodcastFillers ID of the audio clip (see metadata below)
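
    The clip-centering rule for the pre-processed WAV clips can be sketched as follows. This is an illustration of the stated rule only, not the dataset's own preprocessing code: the function names are hypothetical, event times are assumed to be in seconds, and clips that would extend past the start or end of a recording are not handled because the description does not say how they are treated.

```python
SAMPLE_RATE = 16_000  # Hz, as stated for the pre-processed WAV files
CLIP_SECONDS = 1.0    # clip length stated in the description


def clip_window(event_start, event_end, clip_seconds=CLIP_SECONDS):
    """Return (start, end) in seconds of a clip centered on the event center.

    Short events get surrounding context; events longer than the clip are
    truncated to the central `clip_seconds`. Both cases use the same formula.
    """
    center = (event_start + event_end) / 2.0
    half = clip_seconds / 2.0
    return center - half, center + half


def to_samples(seconds, sample_rate=SAMPLE_RATE):
    """Convert a time in seconds to a sample index at `sample_rate`."""
    return int(round(seconds * sample_rate))


short_event = clip_window(10.0, 10.5)  # 0.5 s event
long_event = clip_window(20.0, 23.0)   # 3 s event, truncated around its center
print(short_event, long_event)
print(to_samples(short_event[0]), to_samples(short_event[1]))
```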

    METADATA:

    1. Speech-to-text podcast transcripts: Speech transcript in JSON format for each podcast episode, generated using the SpeechMatics STT. Filename format: [show name]_[episode name].json.

    Each word in the transcript is annotated as a dictionary: {“confidence”:(float), “duration”:(int), “offset”:(int), “text”:(string)} where “confidence” indicates the STT confidence in the prediction, “duration” (unit: microseconds, i.e. 1e-6 second) is the duration of the transcribed word, and “offset” (unit: microseconds, i.e. 1e-6 second) is the start time of the transcribed word in the full-length recording.

    2.
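
    The per-word dictionaries described under the speech-to-text transcripts above can be converted to times in seconds along these lines. This is a minimal sketch: the top-level layout of the real JSON files is not described here, so the example assumes you already have a plain list of word dictionaries, and the word_times helper and the sample values are made up.

```python
import json

MICROSECONDS_PER_SECOND = 1_000_000


def word_times(words):
    """Convert word dicts to (text, start_s, end_s, confidence) tuples.

    Each dict is expected to carry "offset" and "duration" in microseconds,
    plus "text" and "confidence", as in the description above.
    """
    rows = []
    for word in words:
        start_s = word["offset"] / MICROSECONDS_PER_SECOND
        end_s = (word["offset"] + word["duration"]) / MICROSECONDS_PER_SECOND
        rows.append((word["text"], start_s, end_s, word["confidence"]))
    return rows


# Made-up sample in the documented per-word format (not taken from the dataset).
sample = json.loads(
    '[{"confidence": 0.98, "duration": 250000, "offset": 1500000, "text": "hello"}]'
)
rows = word_times(sample)
print(rows)
```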

  4. Spotify’s Podcasts

    • searchlogistics.com
    Updated Mar 24, 2025
    Cite
    (2025). Spotify’s Podcasts [Dataset]. https://www.searchlogistics.com/learn/statistics/spotify-statistics/
    Dataset updated
    Mar 24, 2025
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    There are currently more than 4 million podcast titles on the platform.

  5. Books called How to get your message out fast & free using podcasts :...

    • workwithdata.com
    Updated Aug 6, 2024
    Cite
    Work With Data (2024). Books called How to get your message out fast & free using podcasts : everything you need to know about podcasting explained simply [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=How+to+get+your+message+out+fast+%26+free+using+podcasts+%3A+everything+you+need+to+know+about+podcasting+explained+simply
    Dataset updated
    Aug 6, 2024
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about books and is filtered where the book is How to get your message out fast & free using podcasts : everything you need to know about podcasting explained simply, featuring 7 columns including author, BNB id, book, book publisher, and ISBN. The preview is ordered by publication date (descending).

  6. Podcast metadata by category

    • listennotes.com
    csv
    Updated Nov 21, 2019
    Cite
    Listen Notes, Inc. (2019). Podcast metadata by category [Dataset]. https://www.listennotes.com/podcast-datasets/category/
    Available download formats: csv
    Dataset updated
    Nov 21, 2019
    Dataset provided by
    Listen Notes
    Authors
    Listen Notes, Inc.
    License

    https://www.listennotes.com/podcast-datasets/category/#terms

    Description

    Batch export all podcasts in specific countries, languages or genres.

  7. Digital audio purchases in Argentina 2024

    • statista.com
    Updated Feb 28, 2025
    Cite
    Umair Bashir (2025). Digital audio purchases in Argentina 2024 [Dataset]. https://www.statista.com/topics/9770/podcast-consumption-in-latin-america/
    Dataset updated
    Feb 28, 2025
    Dataset provided by
    Statista (http://statista.com/)
    Authors
    Umair Bashir
    Description

    When asked about "Digital audio purchases", 12 percent of Argentinian respondents answered "Yes, on downloads". This online survey was conducted in 2024 among 1,045 consumers. As an element of Statista Consumer Insights, the Consumer Insights Global survey offers up-to-date market research data from over 50 countries and territories worldwide.

  8. gigaspeech

    • huggingface.co
    • paperswithcode.com
    • +1 more
    Cite
    SpeechColab, gigaspeech [Dataset]. https://huggingface.co/datasets/speechcolab/gigaspeech
    Dataset authored and provided by
    SpeechColab
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    GigaSpeech is an evolving, multi-domain English speech recognition corpus with 10,000 hours of high quality labeled audio suitable for supervised training, and 40,000 hours of total audio suitable for semi-supervised and unsupervised training. Around 40,000 hours of transcribed audio is first collected from audiobooks, podcasts and YouTube, covering both read and spontaneous speaking styles, and a variety of topics, such as arts, science, sports, etc. A new forced alignment and segmentation pipeline is proposed to create sentence segments suitable for speech recognition training, and to filter out segments with low-quality transcription. For system training, GigaSpeech provides five subsets of different sizes, 10h, 250h, 1000h, 2500h, and 10000h. For our 10,000-hour XL training subset, we cap the word error rate at 4% during the filtering/validation stage, and for all our other smaller training subsets, we cap it at 0%. The DEV and TEST evaluation sets, on the other hand, are re-processed by professional human transcribers to ensure high transcription quality.

  9. Podcast metadata by Apple Podcasts IDs (iTunes IDs)

    • listennotes.com
    csv
    Updated Apr 14, 2021
    Cite
    Listen Notes, Inc. (2021). Podcast metadata by Apple Podcasts IDs (iTunes IDs) [Dataset]. https://www.listennotes.com/podcast-datasets/faq/
    Available download formats: csv
    Dataset updated
    Apr 14, 2021
    Dataset provided by
    Listen Notes
    Authors
    Listen Notes, Inc.
    License

    https://www.listennotes.com/podcast-datasets/faq/#terms

    Description

    Batch export all podcasts by Apple Podcasts IDs (iTunes IDs).
