27 datasets found
  1. DEEP-VOICE: DeepFake Voice Recognition

    • kaggle.com
    Updated Aug 24, 2023
    Cite
    Jordan J. Bird (2023). DEEP-VOICE: DeepFake Voice Recognition [Dataset]. https://www.kaggle.com/datasets/birdy654/deep-voice-deepfake-voice-recognition
    Explore at:
    Croissant (Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 24, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Jordan J. Bird
    Description

    DEEP-VOICE: Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion

    This dataset contains examples of real human speech alongside DeepFake versions of that speech, created using Retrieval-based Voice Conversion.

    Can machine learning be used to detect when speech is AI-generated?

    Introduction

    There are growing implications surrounding generative AI in the speech domain that enable voice cloning and real-time voice conversion from one individual to another. This technology poses a significant ethical threat and could lead to breaches of privacy and misrepresentation, thus there is an urgent need for real-time detection of AI-generated speech for DeepFake Voice Conversion.

    To address the above emerging issues, we are introducing the DEEP-VOICE dataset. DEEP-VOICE comprises real human speech from eight well-known figures, together with their speech converted to one another using Retrieval-based Voice Conversion.

    For each recording, the accompaniment ("background noise") was removed before conversion using RVC. The original accompaniment was then added back to the DeepFake speech:

    (Figure: Overview of the Retrieval-based Voice Conversion process to generate DeepFake speech, with Ryan Gosling's speech converted to Margot Robbie. Conversion is run on the extracted vocals before being layered on the original background ambience.)

    Dataset

    The dataset is made available in two forms.

    First, the raw audio can be found in the "AUDIO" directory. They are arranged within "REAL" and "FAKE" class directories. The audio filenames note which speakers provided the real speech, and which voices they were converted to. For example "Obama-to-Biden" denotes that Barack Obama's speech has been converted to Joe Biden's voice.

    Second, the extracted features can be found in the "DATASET-balanced.csv" file. This is the data used in the study below. Each feature is extracted from one-second windows of audio, and the classes are balanced through random sampling.

    **Note:** All experimental data is found within the "KAGGLE" directory. The "DEMONSTRATION" directory is used for playing cropped and compressed demos in notebooks due to Kaggle's limitations on file size.
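
    As a hedged illustration (not the authors' code), the extracted-feature CSV can be loaded and used to train a simple baseline classifier along the following lines. The file path and the label column name ("LABEL") are assumptions and should be checked against the actual file.

    # Baseline sketch for the extracted-feature CSV. The path and the "LABEL"
    # column (REAL vs. FAKE) are assumed; verify them against the dataset.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("KAGGLE/DATASET-balanced.csv")
    X = df.drop(columns=["LABEL"])   # one row = features of one one-second window
    y = df["LABEL"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0, stratify=y
    )
    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
    print("held-out accuracy:", accuracy_score(y_test, clf.predict(X_test)))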

    A successful system could potentially be used as follows:

    (Figure: Usage of the real-time system. The end user is notified when the machine learning model has processed the speech audio (e.g. a phone or conference call) and predicted that audio chunks contain AI-generated speech.)

    Papers with Code

    The dataset and all studies using it are linked on the dataset's Papers with Code page.

    Attribution

    This dataset was produced from the study "Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion".

    Bird, J.J. and Lotfi, A., 2023. Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion. arXiv preprint arXiv:2308.12734.

    The preprint, "Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion", is available on arXiv: https://arxiv.org/abs/2308.12734

    License

    This dataset is provided under the MIT License:

    Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

    The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

    *THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT H...

  2. Data from: WaveFake: A data set to facilitate audio DeepFake detection

    • data.niaid.nih.gov
    Updated Jul 18, 2024
    + more versions
    Cite
    Schönherr, Lea (2024). WaveFake: A data set to facilitate audio DeepFake detection [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4904578
    Explore at:
    Dataset updated
    Jul 18, 2024
    Dataset provided by
    Frank, Joel
    Schönherr, Lea
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The main purpose of this data set is to facilitate research into audio DeepFakes. We hope that this work helps in finding new detection methods to prevent such attempts. These generated media files have been increasingly used to commit impersonation attempts or online harassment.

    The data set consists of 104,885 generated audio clips (16-bit PCM wav). We examine multiple networks trained on two reference data sets. First, the LJSpeech data set consisting of 13,100 short audio clips (on average 6 seconds each; roughly 24 hours total) read by a female speaker. It features passages from 7 non-fiction books and the audio was recorded on a MacBook Pro microphone. Second, we include samples based on the JSUT data set, specifically, basic5000 corpus. This corpus consists of 5,000 sentences covering all basic kanji of the Japanese language (4.8 seconds on average; roughly 6.7 hours total). The recordings were performed by a female native Japanese speaker in an anechoic room. Finally, we include samples from a full text-to-speech pipeline (16,283 phrases; 3.8s on average; roughly 17.5 hours total). Thus, our data set consists of approximately 175 hours of generated audio files in total. Note that we do not redistribute the reference data.

    We included a range of architectures in our data set:

    MelGAN

    Parallel WaveGAN

    Multi-Band MelGAN

    Full-Band MelGAN

    WaveGlow

    Additionally, we examined a bigger version of MelGAN and include samples from a full TTS-pipeline consisting of a conformer and parallel WaveGAN model.

    Collection Process

    For WaveGlow, we utilize the official implementation (commit 8afb643) in conjunction with the official pre-trained network on PyTorch Hub. We use a popular implementation available on GitHub (commit 12c677e) for the remaining networks. The repository also offers pre-trained models. We used the pre-trained networks to generate samples that are similar to their respective training distributions, LJ Speech and JSUT. When sampling the data set, we first extract Mel spectrograms from the original audio files, using the pre-processing scripts of the corresponding repositories. We then feed these Mel spectrograms to the respective models to obtain the data set. For sampling the full TTS results, we use the ESPnet project. To make sure the generated phrases do not overlap with the training set, we downloaded the Common Voice dataset and extracted 16,285 phrases from it.
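
    A rough, hedged sketch of that recipe is shown below: extract a mel spectrogram from a reference utterance, then re-synthesise it with a pre-trained vocoder. The spectrogram parameters are illustrative, and load_pretrained_vocoder is a hypothetical stand-in for the repository-specific loaders.

    # Conceptual sketch of the collection process: mel-spectrogram extraction
    # followed by neural vocoding. The vocoder loader is a placeholder.
    import torch
    import torchaudio

    waveform, sr = torchaudio.load("LJ001-0001.wav")   # assumed LJSpeech reference clip
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sr, n_fft=1024, hop_length=256, n_mels=80
    )(waveform)

    # vocoder = load_pretrained_vocoder("parallel_wavegan_ljspeech")  # hypothetical
    # with torch.no_grad():
    #     fake = vocoder(mel)                       # re-synthesised ("fake") waveform
    # torchaudio.save("LJ001-0001_generated.wav", fake, sr)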

    This data set is licensed with a CC-BY-SA 4.0 license.

    This work was supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy -- EXC-2092 CaSa -- 390781972.

  3. DEepfake CROss-lingual evaluation dataset (DECRO)

    • zenodo.org
    zip
    Updated Sep 12, 2023
    Cite
    Zhongjie Ba; Qing Wen; Peng Cheng; Yuwei Wang; Feng Lin; Li Lu; Zhenguang Liu; Kui Ren (2023). DEepfake CROss-lingual evaluation dataset (DECRO) [Dataset]. http://doi.org/10.5281/zenodo.7603208
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 12, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Zhongjie Ba; Qing Wen; Peng Cheng; Yuwei Wang; Feng Lin; Li Lu; Zhenguang Liu; Kui Ren
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Deepfake cross-lingual evaluation dataset (DECRO) is constructed to evaluate the influence of language differences on deepfake detection.

    If you use the DECRO dataset for deepfake detection, please cite the paper "Transferring Audio Deepfake Detection Capability across Languages", published at WWW '23.

  4. deepfake-detection-demo

    • huggingface.co
    Updated Aug 2, 2025
    Cite
    Behavioral Signals (2025). deepfake-detection-demo [Dataset]. https://huggingface.co/datasets/behavioralsignals/deepfake-detection-demo
    Explore at:
    Dataset updated
    Aug 2, 2025
    Dataset provided by
    Behavioral Signal Technologies Inc.
    Authors
    Behavioral Signals
    Description

    Deepfake Detection Demo

    This is a demo evaluation dataset for the task of Deepfake Detection on human speech. This dataset has been created to demonstrate the capabilities of the Behavioral Signals API.

      Information
    

    The dataset contains 22 utterances, with an equal number of genuine ("bonafide") and fake ("spoofed") utterances. Utterances from the "bonafide" class have been sourced from the test set of the CommonVoice-17.0 corpus. The "deepfake" utterances have been cloned… See the full description on the dataset page: https://huggingface.co/datasets/behavioralsignals/deepfake-detection-demo.

  5. CVoiceFake-Full ("SafeEar: Content Privacy-Preserving Audio Deepfake...

    • zenodo.org
    bin
    Updated Oct 14, 2024
    Cite
    Xinfeng Li; Xinfeng Li (2024). CVoiceFake-Full ("SafeEar: Content Privacy-Preserving Audio Deepfake Detection", ACM CCS 2024) [Dataset]. http://doi.org/10.5281/zenodo.11229569
    Explore at:
    Available download formats: bin
    Dataset updated
    Oct 14, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Xinfeng Li; Xinfeng Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    May 7, 2024
    Description

    Introduction:

    CVoiceFake (Full) encompasses five common languages (English, Chinese, German, French, and Italian) and utilizes multiple advanced and classical voice cloning techniques (Parallel WaveGAN, Multi-band MelGAN, Style MelGAN, Griffin-Lim, WORLD, and DiffWave) to produce audio samples that bear a high resemblance to authentic audio.

    1. Parallel WaveGAN: As a non-autoregressive vocoder-based model, Parallel WaveGAN produces high-fidelity audio rapidly, ideal for efficient and quality deepfake generation.
    2. Multi-band MelGAN: Multi-band MelGAN is a variant of MelGAN that divides the frequency spectrum into sub-bands for faster and more stable multi-lingual vocoder training, enhancing the robustness and scalability of the dataset.
    3. Style MelGAN: Style MelGAN is designed to capture fine prosodic and stylistic nuances of speech, making it particularly compelling for deepfake applications that require high levels of expressivity and variation in speech synthesis.
    4. Griffin-Lim: This algorithm reconstructs waveforms from spectrograms using an iterative phase estimation method. Though lower-fidelity than neural vocoders, it serves as a traditional baseline for comparing deepfake generation (see the sketch after this list).
    5. WORLD: WORLD is a statistical parameter-based voice synthesis system that offers fine control over the spectral and prosodic features of the synthesized audio. Its fine manipulation is useful for crafting the nuanced variations needed in deepfake datasets.
    6. DiffWave: DiffWave is a diffusion probabilistic model for waveform generation. It converts a white-noise signal into a structured waveform through a Markov chain and is capable of both conditional and unconditional generation, representing an advanced synthesis method thanks to its fast synthesis speed and high quality. We have also built this state-of-the-art diffusion-based deepfake audio; please contact the author at xinfengli@zju.edu.cn if you are interested in the dataset, particularly the DiffWave portion. Any additional discussion is welcome.
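
    As a hedged illustration of the Griffin-Lim baseline mentioned in item 4, iterative phase estimation can be reproduced with librosa; the parameters and file paths below are illustrative, not those used to build CVoiceFake.

    # Griffin-Lim sketch: keep only the magnitude spectrogram of a clip and
    # recover a waveform by iterative phase estimation.
    import numpy as np
    import librosa
    import soundfile as sf

    y, sr = librosa.load("sample.wav", sr=None)              # any reference clip
    S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))  # magnitude only
    y_gl = librosa.griffinlim(S, n_iter=32, hop_length=256, n_fft=1024)
    sf.write("sample_griffinlim.wav", y_gl, sr)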

    Full Dataset & Project Page:

    A sampled small version of the dataset is available as CVoiceFake Small. Please also refer to the project page: SafeEar Website.

    Citation:

    If you find our paper/code/dataset helpful, please kindly consider citing this work with the following reference:

    @inproceedings{li2024safeear,
      author = {Li, Xinfeng and Li, Kai and Zheng, Yifan and Yan, Chen and Ji, Xiaoyu and Xu, Wenyuan},
      title = {{SafeEar: Content Privacy-Preserving Audio Deepfake Detection}},
      booktitle = {Proceedings of the 2024 {ACM} {SIGSAC} Conference on Computer and Communications Security (CCS)},
      year = {2024},
    }
  6. ASVspoof 5: Design, Collection and Validation of Resources for Spoofing,...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Feb 14, 2025
    Cite
    Yamagishi, Junichi (2025). ASVspoof 5: Design, Collection and Validation of Resources for Spoofing, Deepfake, and Adversarial Attack Detection Using Crowdsourced Speech [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_14498690
    Explore at:
    Dataset updated
    Feb 14, 2025
    Dataset provided by
    Zhu, Ge
    Lee, Kong Aik
    Kinnunen, Tomi
    Zhang, Wangyou
    Muller, Nicolas
    Todisco, Massimiliano
    Kukanov, Ivan
    Wang, Xin
    Sun, Chengzhe
    Yamagishi, Junichi
    Evans, Nicholas
    Sahidullah, Md
    Hou, Shuwei
    Le Maguer, Sebastien
    Liu, Xuechen
    Jeong, Myeonghun
    Zang, Yongyi
    Shim, Hyejin
    Jung, Jee-weon
    Guo, Hanjie
    Lyu, Siwei
    Chen, Liping
    Gong, Cheng
    Singh, Vishwanath
    Lux, Florian
    Maiti, Soumi
    Delgado, HΓ©ctor
    Zhang, Neil
    Tak, Hemlata
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    This is the Zenodo repository for the ASVspoof 5 database. ASVspoof 5 is the fifth edition in a series of challenges which promote the study of speech spoofing and deepfake attacks, and the design of detection solutions. Compared to previous challenges, the ASVspoof 5 database is built from crowdsourced data collected from around 2,000 speakers in diverse acoustic conditions. More than 20 attacks, also crowdsourced, are generated and optionally tested using surrogate detection models, while seven adversarial attacks are incorporated for the first time.

    Please check README.txt and LICENSE.txt before downloading the database.

    Database paper (to be submitted): https://arxiv.org/abs/2502.08857

    Please consider citing the reference listed at the bottom of this page.

    It is highly recommended to follow the rules and instructions in the ASVspoof 5 challenge evaluation plan (phase 2, https://www.asvspoof.org/), if you want to produce results comparable with the literature.

    Latest work using the ASVspoof 5 database can be found in the Automatic Speaker Verification Spoofing Countermeasures Workshop proceeding: https://www.isca-archive.org/asvspoof_2024/index.html

    If you are interested in creating spoofed data for research purposes using the ASVspoof 5 protocol, please send a request to info@asvspoof.org

  7. TIMIT-TTS: a Text-to-Speech Dataset for Synthetic Speech Detection

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 21, 2022
    Cite
    Brian Hosler (2022). TIMIT-TTS: a Text-to-Speech Dataset for Synthetic Speech Detection [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6560158
    Explore at:
    Dataset updated
    Sep 21, 2022
    Dataset provided by
    Paolo Bestagini
    Davide Salvi
    Brian Hosler
    Matthew C. Stamm
    Stefano Tubaro
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    With the rapid development of deep learning techniques, the generation and counterfeiting of multimedia material are becoming increasingly straightforward to perform. At the same time, sharing fake content on the web has become so simple that malicious users can create unpleasant situations with minimal effort. Also, forged media are getting more and more complex, with manipulated videos (e.g., deepfakes where both the visual and audio contents can be counterfeited) taking over the scene from still images. The multimedia forensic community has addressed the possible threats that this situation could imply by developing detectors that verify the authenticity of multimedia objects. However, the vast majority of these tools only analyze one modality at a time. This was not a problem as long as still images were considered the most widely edited media, but now, since manipulated videos are becoming customary, performing monomodal analyses could be reductive. Nonetheless, there is a lack in the literature regarding multimodal detectors (systems that consider both audio and video components). This is due to the difficulty of developing them but also to the scarcity of datasets containing forged multimodal data to train and test the designed algorithms.

    In this paper we focus on the generation of an audio-visual deepfake dataset. First, we present a general pipeline for synthesizing speech deepfake content from a given real or fake video, facilitating the creation of counterfeit multimodal material. The proposed method uses Text-to-Speech (TTS) and Dynamic Time Warping (DTW) techniques to achieve realistic speech tracks. Then, we use the pipeline to generate and release TIMIT-TTS, a synthetic speech dataset containing the most cutting-edge methods in the TTS field. This can be used as a standalone audio dataset, or combined with DeepfakeTIMIT and VidTIMIT video datasets to perform multimodal research. Finally, we present numerous experiments to benchmark the proposed dataset in both monomodal (i.e., audio) and multimodal (i.e., audio and video) conditions. This highlights the need for multimodal forensic detectors and more multimodal deepfake data.
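
    As a rough conceptual sketch (not the authors' released pipeline), the DTW step can be thought of as aligning the synthetic track to the original audio by warping over MFCC features; file paths and parameters below are assumptions.

    # Align TTS output to the original speech with dynamic time warping over MFCCs.
    import librosa

    ref, sr = librosa.load("original.wav", sr=16000)
    tts, _ = librosa.load("tts_output.wav", sr=16000)

    mfcc_ref = librosa.feature.mfcc(y=ref, sr=sr, n_mfcc=13)
    mfcc_tts = librosa.feature.mfcc(y=tts, sr=sr, n_mfcc=13)

    # D: accumulated cost matrix; wp: warping path of (ref_frame, tts_frame) pairs
    D, wp = librosa.sequence.dtw(X=mfcc_ref, Y=mfcc_tts, metric="euclidean")
    print("alignment cost:", D[-1, -1], "path length:", len(wp))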

    For the initial version of TIMIT-TTS (v1.0):

    Arxiv: https://arxiv.org/abs/2209.08000

    TIMIT-TTS Database v1.0: https://zenodo.org/record/6560159

  8. EchoFake

    • huggingface.co
    Cite
    EchoFake, EchoFake [Dataset]. https://huggingface.co/datasets/EchoFake/EchoFake
    Explore at:
    Authors
    EchoFake
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    EchoFake: A Replay-Aware Dataset for Practical Speech Deepfake Detection

    Code for baseline models is available at https://github.com/EchoFake/EchoFake. Auto-recording tools are available at https://github.com/EchoFake/EchoFake/tree/main/tools.

      Abstract
    

    The growing prevalence of speech deepfakes has raised serious concerns, particularly in real-world scenarios such as telephone fraud and identity theft. While many anti-spoofing systems have demonstrated promising performance… See the full description on the dataset page: https://huggingface.co/datasets/EchoFake/EchoFake.

  9. The Fake-or-Real (FoR) Dataset (deepfake audio)

    • kaggle.com
    Updated Apr 16, 2024
    Cite
    Mohammed Abdeldayem (2024). The Fake-or-Real (FoR) Dataset (deepfake audio) [Dataset]. https://www.kaggle.com/datasets/mohammedabdeldayem/the-fake-or-real-dataset/code
    Explore at:
    Croissant (Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 16, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mohammed Abdeldayem
    License

    GNU LGPL 3.0: http://www.gnu.org/licenses/lgpl-3.0.html

    Description

    The Fake-or-Real (FoR) dataset is a collection of more than 195,000 utterances of real human and computer-generated speech. The dataset can be used to train classifiers to detect synthetic speech.

    The dataset aggregates data from the latest TTS solutions (such as Deep Voice 3 and Google Wavenet TTS) as well as a variety of real human speech, including the Arctic Dataset (http://festvox.org/cmu_arctic/), LJSpeech Dataset (https://keithito.com/LJ-Speech-Dataset/), VoxForge Dataset (http://www.voxforge.org) and our own speech recordings.

    The dataset is published in four versions: for-original, for-norm, for-2sec and for-rerec.

    The first version, named for-original, contains the files as collected from the speech sources, without any modification (balanced version).

    The second version, called for-norm, contains the same files, but balanced in terms of gender and class and normalized in terms of sample rate, volume and number of channels.

    The third one, named for-2sec is based on the second one, but with the files truncated at 2 seconds.

    The last version, named for-rerec, is a rerecorded version of the for-2sec dataset, intended to simulate a scenario where an attacker sends an utterance through a voice channel (e.g., a phone call or a voice message).
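
    A hedged sketch of the kind of normalization and truncation described for for-norm and for-2sec follows; the target sample rate and normalization method are assumptions, not the dataset authors' documented settings.

    # Illustrative pre-processing in the spirit of for-norm / for-2sec:
    # resample, mix down to mono, peak-normalise, truncate to 2 seconds.
    import numpy as np
    import librosa
    import soundfile as sf

    TARGET_SR = 16000
    y, _ = librosa.load("input.wav", sr=TARGET_SR, mono=True)  # resample + mono
    y = y / (np.max(np.abs(y)) + 1e-9)                          # peak-normalise volume
    y = y[: 2 * TARGET_SR]                                      # truncate at 2 seconds
    sf.write("output_2sec.wav", y, TARGET_SR)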

  10. SINE

    • huggingface.co
    Updated Jun 8, 2025
    Cite
    Peaceful Data (2025). SINE [Dataset]. https://huggingface.co/datasets/PeacefulData/SINE
    Explore at:
    Dataset updated
    Jun 8, 2025
    Dataset authored and provided by
    Peaceful Data
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    SINE Dataset

      Overview
    

    The Speech INfilling Edit (SINE) dataset is a comprehensive collection for speech deepfake detection and audio authenticity verification. This dataset contains ~87GB of audio data distributed across 32 splits, featuring both authentic and synthetically manipulated speech samples.

      Dataset Statistics
    

    Total Size: ~87GB
    Number of Splits: 32 (split-0.tar.gz to split-31.tar.gz)
    Audio Format: WAV files
    Source: Speech edited from LibriLight… See the full description on the dataset page: https://huggingface.co/datasets/PeacefulData/SINE.
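
    A small sketch for unpacking the 32 split archives once downloaded; the archive names follow the split-N.tar.gz pattern stated above, while the local directory layout is an assumption.

    # Extract split-0.tar.gz ... split-31.tar.gz into a local folder,
    # assuming the archives sit in the current directory.
    import tarfile
    from pathlib import Path

    out_dir = Path("SINE_extracted")
    out_dir.mkdir(exist_ok=True)

    for i in range(32):
        archive = Path(f"split-{i}.tar.gz")
        if archive.exists():
            with tarfile.open(archive, "r:gz") as tar:
                tar.extractall(out_dir)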

  11. Confusion matrix for the binary group responses.

    • plos.figshare.com
    bin
    Updated Aug 2, 2023
    + more versions
    Cite
    Kimberly T. Mai; Sergi Bray; Toby Davies; Lewis D. Griffin (2023). Confusion matrix for the binary group responses. [Dataset]. http://doi.org/10.1371/journal.pone.0285333.t004
    Explore at:
    Available download formats: bin
    Dataset updated
    Aug 2, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Kimberly T. Mai; Sergi Bray; Toby Davies; Lewis D. Griffin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Speech deepfakes are artificial voices generated by machine learning models. Previous literature has highlighted deepfakes as one of the biggest security threats arising from progress in artificial intelligence due to their potential for misuse. However, studies investigating human detection capabilities are limited. We presented genuine and deepfake audio to n = 529 individuals and asked them to identify the deepfakes. We ran our experiments in English and Mandarin to understand if language affects detection performance and decision-making rationale. We found that detection capability is unreliable. Listeners only correctly spotted the deepfakes 73% of the time, and there was no difference in detectability between the two languages. Increasing listener awareness by providing examples of speech deepfakes only improves results slightly. As speech synthesis algorithms improve and become more realistic, we can expect the detection task to become harder. The difficulty of detecting speech deepfakes confirms their potential for misuse and signals that defenses against this threat are needed.
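
    For readers working with the released table, the usual summary metrics follow directly from a 2x2 confusion matrix; the counts below are placeholders, not the study's values.

    # Derive accuracy, precision and recall from a binary confusion matrix
    # laid out as [[TN, FP], [FN, TP]]. Counts are placeholders.
    import numpy as np

    cm = np.array([[70, 30],   # true genuine: correctly accepted / wrongly flagged
                   [27, 73]])  # true deepfake: missed / correctly detected
    tn, fp = cm[0]
    fn, tp = cm[1]

    accuracy = (tp + tn) / cm.sum()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")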

  12. CodecDeepfakeDetection

    • huggingface.co
    Updated Sep 20, 2025
    Cite
    Florian Lux (2025). CodecDeepfakeDetection [Dataset]. https://huggingface.co/datasets/Flux9665/CodecDeepfakeDetection
    Explore at:
    Dataset updated
    Sep 20, 2025
    Authors
    Florian Lux
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    In this spoof detection dataset, the bonafide speech is resynthesized using various popular neural audio codecs, which are used for compression and low-bandwidth transmission of speech signals. The spoofed speech samples we provide are generated with a selection of popular and well performing language model based speech synthesis methods, which utilize the same codecs as the bonafide audios to obtain discretized speech tokens. This takes the artifacts of the codecs out of the equation and lets… See the full description on the dataset page: https://huggingface.co/datasets/Flux9665/CodecDeepfakeDetection.

  13. ADD 2023 Challenge Track 1.1 Evaluation Dataset

    • zenodo.org
    application/gzip
    Updated Jul 26, 2024
    Cite
    Jiangyan Yi; Chu Yuan Zhang; Jiangyan Yi; Chu Yuan Zhang (2024). ADD 2023 Challenge Track 1.1 Evaluation Dataset [Dataset]. http://doi.org/10.5281/zenodo.12145773
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Jul 26, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jiangyan Yi; Chu Yuan Zhang; Jiangyan Yi; Chu Yuan Zhang
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description
    Audio deepfake detection is an emerging topic in the artificial intelligence community. The second Audio Deepfake Detection Challenge (ADD 2023) aims to spur researchers around the world to build new innovative technologies that can further accelerate and foster research on detecting and analyzing deepfake speech utterances. Different from previous challenges (e.g. ADD 2022), ADD 2023 focuses on surpassing the constraints of binary real/fake classification, and actually localizing the manipulated intervals in a partially fake speech as well as pinpointing the source responsible for generating any fake audio. Furthermore, ADD 2023 includes more rounds of evaluation for the fake audio game sub-challenge. The ADD 2023 challenge (http://addchallenge.cn/add2023) includes three subchallenges: audio fake game (FG), manipulation region location (RL) and deepfake algorithm recognition (AR). This paper describes the datasets, evaluation metrics, and protocols. Some findings are also reported in audio deepfake detection tasks.


    The ADD 2023 dataset is publicly available.

    This data set is licensed with a CC BY-NC-ND 4.0 license.

    If you use this dataset, please cite the following paper:
    Jiangyan Yi, Jianhua Tao, Ruibo Fu, Xinrui Yan, Chenglong Wang, Tao Wang, Chu Yuan Zhang, Xiaohui Zhang, Yan Zhao, Yong Ren, Le Xu, Junzuo Zhou, Hao Gu, Zhengqi Wen, Shan Liang, Zheng Lian, Shuai Nie, Haizhou Li: ADD 2023: the Second Audio Deepfake Detection Challenge. DADA@IJCAI 2023: 125-130

  14. Data from: PartialEdit: Identifying Partial Deepfakes in the Era of Neural...

    • zenodo.org
    application/gzip, csv
    Updated May 27, 2025
    Cite
    You Zhang; You Zhang; Baotong Tian; Baotong Tian; Lin Zhang; Lin Zhang; Zhiyao Duan; Zhiyao Duan (2025). PartialEdit: Identifying Partial Deepfakes in the Era of Neural Speech Editing [Dataset]. http://doi.org/10.5281/zenodo.15519188
    Explore at:
    Available download formats: application/gzip, csv
    Dataset updated
    May 27, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    You Zhang; You Zhang; Baotong Tian; Baotong Tian; Lin Zhang; Lin Zhang; Zhiyao Duan; Zhiyao Duan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is part of the dataset we curated based on VCTK to study partial speech deepfake detection in the era of neural speech editing. For more details, please refer to our Interspeech 2025 paper: "PartialEdit: Identifying Partial Deepfakes in the Era of Neural Speech Editing".

    In the paper, we curated four subsets: E1: VoiceCraft, E2: SSR-Speech, E3: Audiobox-Speech, and E4: Audiobox. Adhering to Audiobox's license, we cannot release the E3 and E4 subsets.

    The folder structure is as follows:

    PartialEdit/
    ├── PartialEdit_E1E2.csv
    ├── E1/
    │   ├── p225/
    │   │   ├── p225_001_edited_partial_16k.wav
    │   │   ├── p225_002_edited_partial_16k.wav
    │   │   └── ...
    │   ├── p231/
    │   │   ├── p231_001_edited_partial_16k.wav
    │   │   ├── p231_002_edited_partial_16k.wav
    │   │   └── ...
    │   └── ...
    ├── E1-Codec/
    │   └── (same structure as E1)
    ├── E2/
    │   └── (same structure as E1)
    ├── E2-Codec/
    │   └── (same structure as E1)
    └── modified_txt/
        ├── p225/
        │   ├── p225_001_modified.txt
        │   ├── p225_002_modified.txt
        │   ├── p225_003_modified.txt
        │   └── ...
        ├── p231/
        │   ├── p231_001_modified.txt
        │   ├── p231_002_modified.txt
        │   └── ...
        └── ...

    This is version 1.0, and we will include links to the paper and demo page soon.

    The `PartialEdit_E1E2.csv` file contains information about the edited regions in each audio file. Each row represents the following columns:

    - `filename`: The name of the audio file.
    - `start of the edited region (s)`: The starting time (in seconds) of the first edited region.
    - `end of the edited region (s)`: The ending time (in seconds) of the first edited region.
    - `total duration (s)`: The total duration (in seconds) of the audio file.

    If there are two edited regions within a file, the row format expands to include:

    - `filename`: The name of the audio file.
    - `start of the edited region (s)`: The starting time (in seconds) of the first edited region.
    - `end of the edited region (s)`: The ending time (in seconds) of the first edited region.
    - `start of the second edited region (s)`: The starting time (in seconds) of the second edited region.
    - `end of the second edited region (s)`: The ending time (in seconds) of the second edited region.
    - `total duration (s)`: The total duration (in seconds) of the audio file.
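
    A minimal sketch of loading the annotation file and computing the edited fraction of each clip is given below; the column names follow the listing above, and how rows without a second region are padded (e.g., empty fields) is an assumption.

    # Compute, per file, the fraction of audio covered by the edited region(s).
    import pandas as pd

    df = pd.read_csv("PartialEdit/PartialEdit_E1E2.csv")

    def edited_fraction(row):
        edited = (row["end of the edited region (s)"]
                  - row["start of the edited region (s)"])
        second_start = row.get("start of the second edited region (s)")
        if second_start is not None and pd.notna(second_start):
            edited += row["end of the second edited region (s)"] - second_start
        return edited / row["total duration (s)"]

    df["edited_fraction"] = df.apply(edited_fraction, axis=1)
    print(df[["filename", "edited_fraction"]].head())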

    To make sure the download is complete, you can check the MD5 code with the following command:

    md5sum *

  15. AI Voice Changer Tool Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Aug 1, 2025
    Cite
    Data Insights Market (2025). AI Voice Changer Tool Report [Dataset]. https://www.datainsightsmarket.com/reports/ai-voice-changer-tool-494160
    Explore at:
    Available download formats: pdf, ppt, doc
    Dataset updated
    Aug 1, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The AI voice changer tool market is experiencing robust growth, projected to reach $237 million in 2025 and maintain a Compound Annual Growth Rate (CAGR) of 9.6% from 2025 to 2033. This expansion is driven by several key factors. The increasing demand for personalized and accessible content creation across various sectors, including entertainment, education, and accessibility solutions, fuels market adoption. Advances in artificial intelligence, specifically in natural language processing and speech synthesis, are continuously improving the quality and realism of AI-generated voices, making them more appealing to both individual users and businesses. Furthermore, the rising affordability and ease of access to AI voice changer tools through cloud-based platforms and user-friendly software are broadening the market's reach. The market is also being shaped by trends towards greater voice-based interaction in applications and the increasing need for efficient and cost-effective audio production. Despite these positive trends, the market faces certain restraints. Concerns about ethical implications, particularly the potential misuse for malicious purposes like deepfakes, represent a significant challenge. The market also needs to overcome technological limitations in perfectly replicating nuanced human speech patterns and emotions. Addressing these challenges through technological advancements and robust ethical guidelines will be crucial for the sustainable and responsible growth of the AI voice changer tool market. Competition among numerous players such as Voice-Swap, Clipchamp, Lovo.ai, Speechify, PlayHT, Murf, Synthesys, VocaliD, Respeecher, Speechelo, Wavve, Altered, Listnr AI, and ReadSpeaker will further influence market dynamics. The market segmentation, while not explicitly provided, can be logically inferred as encompassing different pricing tiers, software vs. cloud-based solutions, and specific application areas (e.g., gaming, e-learning).

  16. CodecFake_wavs

    • huggingface.co
    Updated Aug 2, 2024
    + more versions
    Cite
    Yuan Tseng (2024). CodecFake_wavs [Dataset]. https://huggingface.co/datasets/rogertseng/CodecFake_wavs
    Explore at:
    Dataset updated
    Aug 2, 2024
    Authors
    Yuan Tseng
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    CodecFake: Enhancing Anti-Spoofing Models Against Deepfake Audios from Codec-Based Speech Synthesis Systems

    Paper, Code, Project Page (Interspeech 2024)

    TL;DR: We show that better detection of deepfake speech from codec-based TTS systems can be achieved by training models on speech re-synthesized with neural audio codecs. This dataset is released for this purpose. See our paper and Github for more details on using our dataset.

      Acknowledgement… See the full description on the dataset page: https://huggingface.co/datasets/rogertseng/CodecFake_wavs.
    
  17. FAD: A Chinese Dataset for Fake Audio Detection

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jul 9, 2023
    Cite
    Haoxin Ma; Jiangyan Yi; Haoxin Ma; Jiangyan Yi (2023). FAD: A Chinese Dataset for Fake Audio Detection [Dataset]. http://doi.org/10.5281/zenodo.6635521
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 9, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Haoxin Ma; Jiangyan Yi; Haoxin Ma; Jiangyan Yi
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Fake audio detection is a growing concern and some relevant datasets have been designed for research. But there is no standard public Chinese dataset under additive noise conditions. In this paper, we aim to fill in the gap and design a Chinese fake audio detection dataset (FAD) for studying more generalized detection methods. Twelve mainstream speech generation techniques are used to generate fake audios. To simulate real-life scenarios, three noise datasets are selected for noise addition at five different signal-to-noise ratios. The FAD dataset can be used not only for fake audio detection, but also for detecting the algorithms behind fake utterances for audio forensics. Baseline results are presented with analysis. The results show that fake audio detection methods with generalization ability remain challenging.
    The FAD dataset is publicly available. The source code of the baselines is available on GitHub: https://github.com/ADDchallenge/FAD


    The FAD dataset is designed to evaluate methods for fake audio detection, fake algorithm recognition, and other relevant studies. To better study the robustness of the methods under noisy conditions when applied in real life, we construct a corresponding noisy dataset. The full FAD dataset consists of two versions: a clean version and a noisy version. Both versions are divided into disjoint training, development and test sets in the same way. There is no speaker overlap across these three subsets. Each test set is further divided into seen and unseen test sets. Unseen test sets can evaluate the generalization of the methods to unknown types. It is worth mentioning that both real audios and fake audios in the unseen test set are unknown to the model. For the noisy speech part, we select three noise databases for simulation. Additive noises are added to each audio in the clean dataset at 5 different SNRs. The additive noises of the unseen test set and the remaining subsets come from different noise databases. In each version of the FAD dataset, there are 138,400 utterances in the training set, 14,400 utterances in the development set, 42,000 utterances in the seen test set, and 21,000 utterances in the unseen test set. More detailed statistics are given in Table 2.

    Clean Real Audios Collection
    To eliminate the interference of irrelevant factors, we collect clean real audios from two sources: five open resources from the OpenSLR platform (http://www.openslr.org/12/) and one self-recorded dataset.

    Clean Fake Audios Generation
    We select 11 representative speech synthesis methods to generate the fake audios, plus one method that produces partially fake audios.

    Noisy Audios Simulation
    Noisy audios aim to quantify the robustness of the methods under noisy conditions. To simulate real-life scenarios, we artificially sample the noise signals and add them to clean audios at 5 different SNRs: 0 dB, 5 dB, 10 dB, 15 dB and 20 dB. Additive noises are selected from three noise databases: PNL 100 Nonspeech Sounds, NOISEX-92, and TAU Urban Acoustic Scenes.
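
    A hedged sketch of mixing noise into a clean utterance at a target SNR, mirroring the simulation described above (the scaling formula is standard; file paths are placeholders):

    # Add a noise clip to a clean utterance at a target SNR in dB.
    import numpy as np
    import librosa
    import soundfile as sf

    def add_noise(clean, noise, snr_db):
        reps = int(np.ceil(len(clean) / len(noise)))      # loop noise to cover the clip
        noise = np.tile(noise, reps)[: len(clean)]
        p_clean = np.mean(clean ** 2)
        p_noise = np.mean(noise ** 2) + 1e-12
        scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
        return clean + scale * noise

    clean, sr = librosa.load("clean.wav", sr=16000)
    noise, _ = librosa.load("noise.wav", sr=16000)
    for snr in [0, 5, 10, 15, 20]:
        sf.write(f"noisy_snr{snr}.wav", add_noise(clean, noise, snr), sr)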

    This data set is licensed with a CC BY-NC-ND 4.0 license.
    You can cite the data using the following BibTeX entry:
    @inproceedings{ma2022fad,
      title={FAD: A Chinese Dataset for Fake Audio Detection},
      author={Haoxin Ma and Jiangyan Yi and Chenglong Wang and Xunrui Yan and Jianhua Tao and Tao Wang and Shiming Wang and Le Xu and Ruibo Fu},
      booktitle={Submitted to the 36th Conference on Neural Information Processing Systems (NeurIPS 2022) Track on Datasets and Benchmarks},
      year={2022},
    }

  18. famousfigures

    • huggingface.co
    Cite
    Information Systems, Security and Forensics Lab, famousfigures [Dataset]. https://huggingface.co/datasets/issf/famousfigures
    Explore at:
    Dataset authored and provided by
    Information Systems, Security and Forensics Lab
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Famous Figures: Collecting, Curating, and Annotating Good Quality Speech Deepfake Dataset for Famous Figures - Process and Challenges

    Paper, Project Page

    Authors Hashim Ali, Surya Subramani, Raksha Varahamurthy, Nithin Adupa, Lekha Bollinani, Hafiz Malik

    Interspeech 2025

    Abstract: Current audio deepfake detection systems fail to protect specific individuals from targeted voice spoofing attacks. A comprehensive methodology for collecting, curating, and generating high-quality speech… See the full description on the dataset page: https://huggingface.co/datasets/issf/famousfigures.

  19. FaciaVox a Multimodal Biometric Dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Feb 13, 2025
    Cite
    Abuqaaud, Kamal (2025). FaciaVox a Multimodal Biometric Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_14861091
    Explore at:
    Dataset updated
    Feb 13, 2025
    Dataset provided by
    Bou Nassif, Ali
    Shahin, Ismail
    Abuqaaud, Kamal
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The FaciaVox dataset is an extensive multimodal biometric resource designed to enable in-depth exploration of face-image and voice recording research areas in both masked and unmasked scenarios.

    Features of the Dataset:

    1. Multimodal Data: A total of 1,800 face images (JPG) and 6,000 audio recordings (WAV) were collected, enabling cross-domain analysis of visual and auditory biometrics.

    2. Participants were categorized into four age groups for structured labeling: Label 1: under 16 years; Label 2: 16 to less than 31 years; Label 3: 31 to less than 46 years; Label 4: 46 years and above.

    3. Sibling Data: Some participants are siblings, adding a challenging layer for speaker identification and facial recognition tasks due to genetic similarities in vocal and facial features. Sibling relationships are documented in the accompanying "FaciaVox List" data file.

    4. Standardized Filenames: The dataset uses a consistent, intuitive naming convention for both facial images and voice recordings. Each filename includes: Type (F: Face Image, V: Voice Recording); Participant ID (e.g., sub001); Mask Type (e.g., a: unmasked, b: disposable mask, etc.); and Zoom Level or Sentence ID (e.g., 1x, 3x, 5x for images, or a specific sentence identifier {01, 02, 03, ..., 10} for recordings). A parsing sketch follows this list.

    5. Diverse Demographics: 19 different countries.

    6. A challenging face recognition problem involving reflective mask shields and severe lighting conditions.

    7. Each participant uttered 7 English statements and 3 Arabic statements, regardless of their native language. This adds a challenge for speaker identification.
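
    The naming convention in item 4 can be unpacked programmatically. The sketch below is hypothetical: it assumes underscore-separated fields such as "V_sub001_a_01.wav"; the actual delimiter, field order, and mask codes should be verified against the FaciaVox List file.

    # Hypothetical parser for FaciaVox filenames of the assumed form
    # "V_sub001_a_01.wav" (type, participant ID, mask type, zoom level or sentence ID).
    from pathlib import Path

    MASKS = {"a": "unmasked", "b": "disposable mask"}  # other codes exist in the dataset

    def parse_faciavox(filename: str) -> dict:
        kind, participant, mask, variant = Path(filename).stem.split("_")
        return {
            "modality": "face image" if kind == "F" else "voice recording",
            "participant": participant,
            "mask": MASKS.get(mask, mask),
            "zoom_or_sentence": variant,
        }

    print(parse_faciavox("V_sub001_a_01.wav"))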

    Research Applications

    FaciaVox is a versatile dataset supporting a wide range of research domains, including but not limited to:
    • Speaker Identification (SI) and Face Recognition (FR): Evaluating biometric systems under varying conditions.
    • Impact of Masks on Biometrics: Investigating how different facial coverings affect recognition performance.
    • Language Impact on SI: Exploring the effects of native and non-native speech on speaker identification.
    • Age and Gender Estimation: Inferring demographic information from voice and facial features.
    • Race and Ethnicity Matching: Studying biometrics across diverse populations.
    • Synthetic Voice and Deepfake Detection: Detecting cloned or generated speech.
    • Cross-Domain Biometric Fusion: Combining facial and vocal data for robust authentication.
    • Speech Intelligibility: Assessing how masks influence speech clarity.
    • Image Inpainting: Reconstructing occluded facial regions for improved recognition.

    Researchers can use the facial images and voice recordings independently or in combination to explore multimodal biometric systems. The standardized filenames and accompanying metadata make it easy to align visual and auditory data for cross-domain analyses. Sibling relationships and demographic labels add depth for tasks such as familial voice recognition, demographic profiling, and model bias evaluation.

  20. ADD 2023 Challenge Track 3 Training/Development Dataset

    • zenodo.org
    application/gzip, bin
    Updated Jul 26, 2024
    + more versions
    Cite
    Jiangyan Yi; Chu Yuan Zhang; Jiangyan Yi; Chu Yuan Zhang (2024). ADD 2023 Challenge Track 3 Training/Development Dataset [Dataset]. http://doi.org/10.5281/zenodo.12179632
    Explore at:
    Available download formats: bin, application/gzip
    Dataset updated
    Jul 26, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jiangyan Yi; Chu Yuan Zhang; Jiangyan Yi; Chu Yuan Zhang
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description
    Audio deepfake detection is an emerging topic in the artificial intelligence community. The second Audio Deepfake Detection Challenge (ADD 2023) aims to spur researchers around the world to build new innovative technologies that can further accelerate and foster research on detecting and analyzing deepfake speech utterances. Different from previous challenges (e.g. ADD 2022), ADD 2023 focuses on surpassing the constraints of binary real/fake classification, and actually localizing the manipulated intervals in a partially fake speech as well as pinpointing the source responsible for generating any fake audio. Furthermore, ADD 2023 includes more rounds of evaluation for the fake audio game sub-challenge. The ADD 2023 challenge (http://addchallenge.cn/add2023) includes three subchallenges: audio fake game (FG), manipulation region location (RL) and deepfake algorithm recognition (AR). This paper describes the datasets, evaluation metrics, and protocols. Some findings are also reported in audio deepfake detection tasks.


    The ADD 2023 dataset is publicly available.

    This data set is licensed with a CC BY-NC-ND 4.0 license.

    If you use this dataset, please cite the following paper:
    Jiangyan Yi, Jianhua Tao, Ruibo Fu, Xinrui Yan, Chenglong Wang, Tao Wang, Chu Yuan Zhang, Xiaohui Zhang, Yan Zhao, Yong Ren, Le Xu, Junzuo Zhou, Hao Gu, Zhengqi Wen, Shan Liang, Zheng Lian, Shuai Nie, Haizhou Li: ADD 2023: the Second Audio Deepfake Detection Challenge. DADA@IJCAI 2023: 125-130
