This dataset contains examples of real human speech and DeepFake versions of those speeches generated using Retrieval-based Voice Conversion.
Can machine learning be used to detect when speech is AI-generated?
There are growing implications surrounding generative AI in the speech domain that enable voice cloning and real-time voice conversion from one individual to another. This technology poses a significant ethical threat and could lead to breaches of privacy and misrepresentation; there is therefore an urgent need for real-time detection of AI-generated speech produced by DeepFake Voice Conversion.
To address these emerging issues, we introduce the DEEP-VOICE dataset. DEEP-VOICE comprises real human speech from eight well-known figures and their speech converted to one another's voices using Retrieval-based Voice Conversion.
For each speech, the accompaniment ("background noise") was removed before RVC conversion, and the original accompaniment was then added back to the DeepFake speech:
(Figure: Overview of the Retrieval-based Voice Conversion process to generate DeepFake speech, with Ryan Gosling's speech converted to Margot Robbie's voice. Conversion is run on the extracted vocals before they are layered over the original background ambience.)
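As a minimal illustration of the final re-layering step, the sketch below mixes converted vocals back onto the original accompaniment with plain array arithmetic. The filenames are hypothetical placeholders, and the vocal/accompaniment separation and RVC conversion are assumed to have been done beforehand.

```python
# Minimal sketch of the re-layering step: add the original accompaniment
# back onto the RVC-converted vocals. Filenames are hypothetical.
import numpy as np
import soundfile as sf

vocals, sr = sf.read("converted_vocals.wav")              # DeepFake vocals from RVC
ambience, sr_amb = sf.read("original_accompaniment.wav")  # accompaniment removed before conversion
assert sr == sr_amb, "both stems must share the same sample rate"

n = min(len(vocals), len(ambience))                # trim to the shorter stem
mix = vocals[:n] + ambience[:n]
mix = mix / max(1.0, float(np.max(np.abs(mix))))   # avoid clipping

sf.write("deepfake_with_ambience.wav", mix, sr)
```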
The dataset is made available in two forms.
First, the raw audio can be found in the "AUDIO" directory. The files are arranged into "REAL" and "FAKE" class directories. The audio filenames note which speaker provided the real speech and which voice it was converted to. For example, "Obama-to-Biden" denotes that Barack Obama's speech has been converted to Joe Biden's voice.
Second, the extracted features can be found in the "DATASET-balanced.csv" file. This is the data used in the study below; each feature is extracted from one-second windows of audio, and the classes are balanced through random sampling.
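For orientation, consuming the tabular form might look like the sketch below. It assumes the label column is named "LABEL", which may differ in the actual file, and it is not the modeling pipeline from the study.

```python
# Rough sketch: load the extracted features and fit a baseline classifier.
# The label column name ("LABEL") is an assumption about the CSV layout.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("DATASET-balanced.csv")
X, y = df.drop(columns=["LABEL"]), df["LABEL"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print("held-out accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```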
**Note:** All experimental data is found within the "KAGGLE" directory. The "DEMONSTRATION" directory is used for playing cropped and compressed demos in notebooks, due to Kaggle's limitations on file size.
A successful system could potentially be used as follows:
(Figure: Usage of the real-time system. The end user is notified when the machine learning model has processed the speech audio (e.g. a phone or conference call) and predicted that audio chunks contain AI-generated speech.)
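One way to approximate this chunked, near-real-time usage is sketched below: slide a one-second window over the call audio, compute features per window, and flag the windows a trained classifier predicts as fake. The MFCC-mean features and the "FAKE" label value are simplifying assumptions, not the exact setup from the study.

```python
# Sketch of chunk-wise screening of call audio. The MFCC-mean features and
# the "FAKE" label value are simplifying assumptions; `clf` is any classifier
# trained on matching per-second features.
import librosa
import numpy as np

def one_second_chunks(path, sr=16000):
    audio, _ = librosa.load(path, sr=sr)
    for start in range(0, len(audio) - sr + 1, sr):
        yield audio[start:start + sr]

def chunk_features(chunk, sr=16000):
    mfcc = librosa.feature.mfcc(y=chunk, sr=sr, n_mfcc=20)
    return mfcc.mean(axis=1)            # one 20-dim vector per second of audio

def flag_ai_speech(path, clf, sr=16000):
    flagged = []
    for second, chunk in enumerate(one_second_chunks(path, sr)):
        pred = clf.predict(chunk_features(chunk, sr).reshape(1, -1))[0]
        if pred == "FAKE":
            flagged.append(second)      # offsets (in seconds) predicted as AI-generated
    return flagged
```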
The dataset and all studies using it are linked on its Papers with Code page.
This dataset was produced from the study "Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion"
Bird, J.J. and Lotfi, A., 2023. Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion. arXiv preprint arXiv:2308.12734.
The preprint, "Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion", is available on arXiv.
This dataset is provided under the MIT License:
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The main purpose of this data set is to facilitate research into audio DeepFakes. We hope that this work helps in finding new detection methods to prevent such attempts. These generated media files have increasingly been used for impersonation attempts and online harassment.
The data set consists of 104,885 generated audio clips (16-bit PCM wav). We examine multiple networks trained on two reference data sets. First, the LJSpeech data set consisting of 13,100 short audio clips (on average 6 seconds each; roughly 24 hours total) read by a female speaker. It features passages from 7 non-fiction books and the audio was recorded on a MacBook Pro microphone. Second, we include samples based on the JSUT data set, specifically, basic5000 corpus. This corpus consists of 5,000 sentences covering all basic kanji of the Japanese language (4.8 seconds on average; roughly 6.7 hours total). The recordings were performed by a female native Japanese speaker in an anechoic room. Finally, we include samples from a full text-to-speech pipeline (16,283 phrases; 3.8s on average; roughly 17.5 hours total). Thus, our data set consists of approximately 175 hours of generated audio files in total. Note that we do not redistribute the reference data.
We included a range of architectures in our data set:
MelGAN
Parallel WaveGAN
Multi-Band MelGAN
Full-Band MelGAN
WaveGlow
Additionally, we examined a bigger version of MelGAN and include samples from a full TTS pipeline consisting of a Conformer and a Parallel WaveGAN model.
Collection Process
For WaveGlow, we utilize the official implementation (commit 8afb643) in conjunction with the official pre-trained network on PyTorch Hub. For the remaining networks, we use a popular implementation available on GitHub (commit 12c677e), whose repository also offers pre-trained models. We used the pre-trained networks to generate samples that are similar to their respective training distributions, LJ Speech and JSUT. When sampling the data set, we first extract Mel spectrograms from the original audio files, using the pre-processing scripts of the corresponding repositories. We then feed these Mel spectrograms to the respective models to obtain the data set. For sampling the full TTS results, we use the ESPnet project. To make sure the generated phrases do not overlap with the training set, we downloaded the Common Voice data set and extracted 16,285 phrases from it.
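For readers who want a feel for the "extract Mel spectrograms, then re-synthesize" step, the snippet below computes a log-mel spectrogram with librosa. The FFT size, hop length, and mel-band count are generic placeholders; the actual pre-processing scripts of the respective repositories use their own model-specific settings.

```python
# Generic log-mel extraction standing in for the repositories' own
# pre-processing scripts (parameter values here are placeholders).
import librosa
import numpy as np

def log_mel_spectrogram(path, sr=22050, n_fft=1024, hop_length=256, n_mels=80):
    audio, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    return np.log(np.clip(mel, 1e-5, None))   # (n_mels, frames) conditioning input

# A neural vocoder (e.g. WaveGlow or Parallel WaveGAN) then takes this
# mel spectrogram as input and generates the corresponding waveform.
```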
This data set is licensed with a CC-BY-SA 4.0 license.
This work was supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy -- EXC-2092 CaSa -- 390781972.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Deepfake cross-lingual evaluation dataset (DECRO) is constructed to evaluate the influence of language differences on deepfake detection.
If you use the DECRO dataset for deepfake detection, please cite the paper "Transferring Audio Deepfake Detection Capability across Languages", published at WWW '23.
Deepfake Detection Demo
This is a demo evaluation dataset for the task of Deepfake Detection on human speech. It has been created to demonstrate the capabilities of the Behavioral Signals API.
Information
The dataset contains 22 utterances, with an equal number of genuine ("bonafide") and fake ("spoofed") utterances. Utterances from the "bonafide" class have been sourced from the test set of the CommonVoice-17.0 corpus. The "deepfake" utterances have been cloned… See the full description on the dataset page: https://huggingface.co/datasets/behavioralsignals/deepfake-detection-demo.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
CVoiceFake (Full) encompasses five common languages (English, Chinese, German, French, and Italian) and utilizes multiple advanced and classical voice cloning techniques (Parallel WaveGAN, Multi-band MelGAN, Style MelGAN, Griffin-Lim, WORLD, and DiffWave) to produce audio samples that bear a high resemblance to authentic audio.
Please contact xinfengli@zju.edu.cn if you are interested in the dataset, particularly the DiffWave portion. Any additional discussion is also welcome.
The sampled small dataset is available as CVoiceFake Small as well. Please also refer to the project page: SafeEar Website.
If you find our paper/code/dataset helpful, please consider citing this work with the following reference:
@inproceedings{li2024safeear,
author = {Li, Xinfeng and Li, Kai and Zheng, Yifan and Yan, Chen and Ji, Xiaoyu and Xu, Wenyuan},
title = {{SafeEar: Content Privacy-Preserving Audio Deepfake Detection}},
booktitle = {Proceedings of the 2024 {ACM} {SIGSAC} Conference on Computer and Communications Security (CCS)},
year = {2024},
}
Open Data Commons Attribution License (ODC-By) v1.0 https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
This is the Zenodo repository for the ASVspoof 5 database. ASVspoof 5 is the fifth edition in a series of challenges which promote the study of speech spoofing and deepfake attacks, and the design of detection solutions. Compared to previous challenges, the ASVspoof 5 database is built from crowdsourced data collected from around 2,000 speakers in diverse acoustic conditions. More than 20 attacks, also crowdsourced, are generated and optionally tested using surrogate detection models, while seven adversarial attacks are incorporated for the first time.
Please check README.txt and LICENSE.txt before downloading the database.
Database paper (to be submitted): https://arxiv.org/abs/2502.08857
Please consider citing the reference listed at the bottom of this page.
It is highly recommended to follow the rules and instructions in the ASVspoof 5 challenge evaluation plan (phase 2, https://www.asvspoof.org/) if you want to produce results comparable with the literature.
The latest work using the ASVspoof 5 database can be found in the Automatic Speaker Verification Spoofing Countermeasures Workshop proceedings: https://www.isca-archive.org/asvspoof_2024/index.html
If you are interested in creating spoofed data for research purposes using the ASVspoof 5 protocol, please send a request to info@asvspoof.org
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
With the rapid development of deep learning techniques, the generation and counterfeiting of multimedia material are becoming increasingly straightforward to perform. At the same time, sharing fake content on the web has become so simple that malicious users can create unpleasant situations with minimal effort. Forged media are also getting more and more complex, with manipulated videos (e.g., deepfakes, where both the visual and audio content can be counterfeited) taking over the scene from still images. The multimedia forensic community has addressed the possible threats that this situation implies by developing detectors that verify the authenticity of multimedia objects. However, the vast majority of these tools only analyze one modality at a time. This was not a problem as long as still images were considered the most widely edited media, but now that manipulated videos are becoming customary, performing monomodal analyses could be reductive. Nonetheless, there is a lack of multimodal detectors (systems that consider both audio and video components) in the literature. This is due to the difficulty of developing them, but also to the scarcity of datasets containing forged multimodal data to train and test the designed algorithms.
In this paper we focus on the generation of an audio-visual deepfake dataset. First, we present a general pipeline for synthesizing speech deepfake content from a given real or fake video, facilitating the creation of counterfeit multimodal material. The proposed method uses Text-to-Speech (TTS) and Dynamic Time Warping (DTW) techniques to achieve realistic speech tracks. Then, we use the pipeline to generate and release TIMIT-TTS, a synthetic speech dataset containing the most cutting-edge methods in the TTS field. This can be used as a standalone audio dataset, or combined with DeepfakeTIMIT and VidTIMIT video datasets to perform multimodal research. Finally, we present numerous experiments to benchmark the proposed dataset in both monomodal (i.e., audio) and multimodal (i.e., audio and video) conditions. This highlights the need for multimodal forensic detectors and more multimodal deepfake data.
For the initial version of TIMIT-TTS v1.0
Arxiv: https://arxiv.org/abs/2209.08000
TIMIT-TTS Database v1.0: https://zenodo.org/record/6560159
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
EchoFake: A Replay-Aware Dataset for Practical Speech Deepfake Detection
Code for the baseline models is available at https://github.com/EchoFake/EchoFake. The auto-recording tools are available at https://github.com/EchoFake/EchoFake/tree/main/tools.
Abstract
The growing prevalence of speech deepfakes has raised serious concerns, particularly in real-world scenarios such as telephone fraud and identity theft. While many anti-spoofing systems have demonstrated promising performance… See the full description on the dataset page: https://huggingface.co/datasets/EchoFake/EchoFake.
GNU Lesser General Public License v3.0 http://www.gnu.org/licenses/lgpl-3.0.html
The Fake-or-Real (FoR) dataset is a collection of more than 195,000 utterances from real humans and computer generated speech. The dataset can be used to train classifiers to detect synthetic speech.
The dataset aggregates data from the latest TTS solutions (such as Deep Voice 3 and Google Wavenet TTS) as well as a variety of real human speech, including the Arctic Dataset (http://festvox.org/cmu_arctic/), LJSpeech Dataset (https://keithito.com/LJ-Speech-Dataset/), VoxForge Dataset (http://www.voxforge.org) and our own speech recordings.
The dataset is published in four versions: for-original, for-norm, for-2sec and for-rerec.
The first version, named for-original, contains the files as collected from the speech sources, without any modification (balanced version).
The second version, called for-norm, contains the same files, but balanced in terms of gender and class and normalized in terms of sample rate, volume and number of channels.
The third one, named for-2sec is based on the second one, but with the files truncated at 2 seconds.
The last version, named for-rerec, is a rerecorded version of the for-2sec dataset, to simulate a scenario where an attacker sends an utterance through a voice channel (e.g., a phone call or a voice message).
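As a small illustration, truncating a clip to its first two seconds (the transformation behind for-2sec) could be done as follows; this is a sketch, not the dataset's original processing script.

```python
# Illustrative truncation of a clip to its first two seconds, mirroring how
# for-2sec is derived from for-norm (not the dataset authors' original script).
import soundfile as sf

def truncate_to_two_seconds(in_path, out_path):
    audio, sr = sf.read(in_path)
    sf.write(out_path, audio[: 2 * sr], sr)

truncate_to_two_seconds("utterance.wav", "utterance_2sec.wav")
```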
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
SINE Dataset
Overview
The Speech INfilling Edit (SINE) dataset is a comprehensive collection for speech deepfake detection and audio authenticity verification. This dataset contains ~87GB of audio data distributed across 32 splits, featuring both authentic and synthetically manipulated speech samples.
Dataset Statistics
Total Size: ~87GB
Number of Splits: 32 (split-0.tar.gz to split-31.tar.gz)
Audio Format: WAV files
Source: Speech edited from LibriLight… See the full description on the dataset page: https://huggingface.co/datasets/PeacefulData/SINE.
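Since the corpus ships as 32 tarballs, a small loop such as the following can unpack them locally; the directory names are assumptions about where the downloaded files live, not part of the dataset.

```python
# Unpack split-0.tar.gz ... split-31.tar.gz into one local directory.
# The directory names below are assumptions, not part of the dataset.
import tarfile
from pathlib import Path

archive_dir = Path("SINE_downloads")   # where the downloaded tarballs live
out_dir = Path("SINE_extracted")
out_dir.mkdir(exist_ok=True)

for i in range(32):
    with tarfile.open(archive_dir / f"split-{i}.tar.gz", "r:gz") as tar:
        tar.extractall(out_dir)
```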
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Speech deepfakes are artificial voices generated by machine learning models. Previous literature has highlighted deepfakes as one of the biggest security threats arising from progress in artificial intelligence due to their potential for misuse. However, studies investigating human detection capabilities are limited. We presented genuine and deepfake audio to n = 529 individuals and asked them to identify the deepfakes. We ran our experiments in English and Mandarin to understand if language affects detection performance and decision-making rationale. We found that detection capability is unreliable. Listeners only correctly spotted the deepfakes 73% of the time, and there was no difference in detectability between the two languages. Increasing listener awareness by providing examples of speech deepfakes only improves results slightly. As speech synthesis algorithms improve and become more realistic, we can expect the detection task to become harder. The difficulty of detecting speech deepfakes confirms their potential for misuse and signals that defenses against this threat are needed.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
In this spoof detection dataset, the bonafide speech is resynthesized using various popular neural audio codecs, which are used for compression and low-bandwidth transmission of speech signals. The spoofed speech samples we provide are generated with a selection of popular and well performing language model based speech synthesis methods, which utilize the same codecs as the bonafide audios to obtain discretized speech tokens. This takes the artifacts of the codecs out of the equation and lets… See the full description on the dataset page: https://huggingface.co/datasets/Flux9665/CodecDeepfakeDetection.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Jiangyan Yi, Jianhua Tao, Ruibo Fu, Xinrui Yan, Chenglong Wang, Tao Wang, Chu Yuan Zhang, Xiaohui Zhang, Yan Zhao, Yong Ren, Le Xu, Junzuo Zhou, Hao Gu, Zhengqi Wen, Shan Liang, Zheng Lian, Shuai Nie, Haizhou Li: ADD 2023: the Second Audio Deepfake Detection Challenge. DADA@IJCAI 2023: 125-130
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is part of the dataset we curated based on VCTK to study partial speech deepfake detection in the era of neural speech editing. For more details, please refer to our Interspeech 2025 paper: "PartialEdit: Identifying Partial Deepfakes in the Era of Neural Speech Editing".
In the paper, we curated four subsets: E1: VoiceCraft, E2: SSR-Speech, E3: Audiobox-Speech, and E4: Audiobox. Adhering to Audiobox's license, we cannot release the E3 and E4 subsets.
The folder structure is as follows:
PartialEdit/
├── PartialEdit_E1E2.csv
├── E1/
│   ├── p225/
│   │   ├── p225_001_edited_partial_16k.wav
│   │   ├── p225_002_edited_partial_16k.wav
│   │   └── ...
│   ├── p231/
│   │   ├── p231_001_edited_partial_16k.wav
│   │   ├── p231_002_edited_partial_16k.wav
│   │   └── ...
│   └── ...
├── E1-Codec/
│   └── (same structure as E1)
├── E2/
│   └── (same structure as E1)
├── E2-Codec/
│   └── (same structure as E1)
└── modified_txt/
    ├── p225/
    │   ├── p225_001_modified.txt
    │   ├── p225_002_modified.txt
    │   ├── p225_003_modified.txt
    │   └── ...
    ├── p231/
    │   ├── p231_001_modified.txt
    │   ├── p231_002_modified.txt
    │   └── ...
    └── ...
This is version 1.0, and we will include links to the paper and demo page soon.
The `PartialEdit_E1E2.csv` file contains information about the edited regions in each audio file. Each row contains the following columns (a minimal parsing sketch follows the column lists below):
- `filename`: The name of the audio file.
- `start of the edited region (s)`: The starting time (in seconds) of the first edited region.
- `end of the edited region (s)`: The ending time (in seconds) of the first edited region.
- `total duration (s)`: The total duration (in seconds) of the audio file.
If there are two edited regions within a file, the row format expands to include:
- `filename`: The name of the audio file.
- `start of the edited region (s)`: The starting time (in seconds) of the first edited region.
- `end of the edited region (s)`: The ending time (in seconds) of the first edited region.
- `start of the second edited region (s)`: The starting time (in seconds) of the second edited region.
- `end of the second edited region (s)`: The ending time (in seconds) of the second edited region.
- `total duration (s)`: The total duration (in seconds) of the audio file.
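The following is a minimal sketch for reading these rows, assuming the file is a plain comma-separated table laid out as described above; the header handling is a guess and may need adjusting.

```python
# Minimal parser for PartialEdit_E1E2.csv. Rows carry either one or two
# edited regions, so the numeric column count varies per row.
import csv

def edited_regions(csv_path):
    """Yield (filename, [(start_s, end_s), ...], total_duration_s) per row."""
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            filename, *values = row
            try:
                values = [float(v) for v in values]
            except ValueError:
                continue                      # skip a header row, if present
            total_duration = values[-1]
            bounds = values[:-1]              # 2 or 4 region boundaries
            yield filename, list(zip(bounds[0::2], bounds[1::2])), total_duration
```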
To make sure the download is complete, you can check the MD5 code with the following command:
md5sum *
https://www.datainsightsmarket.com/privacy-policy
The AI voice changer tool market is experiencing robust growth, projected to reach $237 million in 2025 and maintain a Compound Annual Growth Rate (CAGR) of 9.6% from 2025 to 2033. This expansion is driven by several key factors. The increasing demand for personalized and accessible content creation across various sectors, including entertainment, education, and accessibility solutions, fuels market adoption. Advances in artificial intelligence, specifically in natural language processing and speech synthesis, are continuously improving the quality and realism of AI-generated voices, making them more appealing to both individual users and businesses. Furthermore, the rising affordability and ease of access to AI voice changer tools through cloud-based platforms and user-friendly software are broadening the market's reach. The market is also being shaped by trends towards greater voice-based interaction in applications and the increasing need for efficient and cost-effective audio production.

Despite these positive trends, the market faces certain restraints. Concerns about ethical implications, particularly the potential misuse for malicious purposes such as deepfakes, represent a significant challenge. The market also needs to overcome technological limitations in perfectly replicating nuanced human speech patterns and emotions. Addressing these challenges through technological advancements and robust ethical guidelines will be crucial for the sustainable and responsible growth of the AI voice changer tool market.

Competition among numerous players such as Voice-Swap, Clipchamp, Lovo.ai, Speechify, PlayHT, Murf, Synthesys, VocaliD, Respeecher, Speechelo, Wavve, Altered, Listnr AI, and ReadSpeaker will further influence market dynamics. The market segmentation, while not explicitly provided, can be logically inferred as encompassing different pricing tiers, software vs. cloud-based solutions, and specific application areas (e.g., gaming, e-learning).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
CodecFake: Enhancing Anti-Spoofing Models Against Deepfake Audios from Codec-Based Speech Synthesis Systems
Paper,
Code,
Project Page
Interspeech 2024
TL;DR: We show that better detection of deepfake speech from codec-based TTS systems can be achieved by training models on speech re-synthesized with neural audio codecs. This dataset is released for this purpose. See our paper and Github for more details on using our dataset.
Acknowledgement… See the full description on the dataset page: https://huggingface.co/datasets/rogertseng/CodecFake_wavs.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Fake audio detection is a growing concern, and several relevant datasets have been designed for research, but there is no standard public Chinese dataset under additive noise conditions. In this paper, we aim to fill this gap and design a Chinese fake audio detection dataset (FAD) for studying more generalized detection methods. Twelve mainstream speech generation techniques are used to generate the fake audio. To simulate real-life scenarios, three noise datasets are selected for noise addition at five different signal-to-noise ratios. The FAD dataset can be used not only for fake audio detection but also for recognizing which algorithm produced a fake utterance, which is useful for audio forensics. Baseline results are presented with analysis; they show that generalizable fake audio detection remains challenging.
The FAD dataset is publicly available. The source code of the baselines is available on GitHub: https://github.com/ADDchallenge/FAD
The FAD dataset is designed to evaluate methods for fake audio detection, fake-algorithm recognition, and other relevant studies. To better study the robustness of these methods under the noisy conditions encountered in real life, we also construct a corresponding noisy dataset. The full FAD dataset therefore comes in two versions: a clean version and a noisy version. Both versions are divided into disjoint training, development, and test sets in the same way, with no speaker overlap across the three subsets. Each test set is further divided into seen and unseen test sets; the unseen test sets evaluate the generalization of the methods to unknown types. It is worth mentioning that both the real audio and the fake audio in the unseen test set are unknown to the model.
For the noisy speech part, we select three noise databases for simulation. Additive noise is added to each audio clip in the clean dataset at 5 different SNRs. The additive noise of the unseen test set and that of the remaining subsets come from different noise databases. In each version of the FAD dataset, there are 138,400 utterances in the training set, 14,400 in the development set, 42,000 in the seen test set, and 21,000 in the unseen test set. More detailed statistics are given in Table 2.
Clean Real Audios Collection
To eliminate the interference of irrelevant factors, we collect clean real audio from two sources: five open resources from the OpenSLR platform (http://www.openslr.org/12/) and one self-recorded dataset.
Clean Fake Audios Generation
We select 11 representative speech synthesis methods to generate fully fake audio, plus one method that produces partially fake audio.
Noisy Audios Simulation
Noisy audio aims to quantify the robustness of the methods under noisy conditions. To simulate real-life scenarios, we artificially sample noise signals and add them to the clean audio at 5 different SNRs: 0 dB, 5 dB, 10 dB, 15 dB and 20 dB. The additive noise is selected from three noise databases: PNL 100 Nonspeech Sounds, NOISEX-92, and TAU Urban Acoustic Scenes.
This data set is licensed with a CC BY-NC-ND 4.0 license.
You can cite the data using the following BibTeX entry:
@inproceedings{ma2022fad,
title={FAD: A Chinese Dataset for Fake Audio Detection},
author={Haoxin Ma and Jiangyan Yi and Chenglong Wang and Xunrui Yan and Jianhua Tao and Tao Wang and Shiming Wang and Le Xu and Ruibo Fu},
booktitle={Submitted to the 36th Conference on Neural Information Processing Systems (NeurIPS 2022) Track on Datasets and Benchmarks},
year={2022},
}
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Famous Figures: Collecting, Curating, and Annotating Good Quality Speech Deepfake Dataset for Famous Figures – Process and Challenges
Paper, Project Page
Authors: Hashim Ali, Surya Subramani, Raksha Varahamurthy, Nithin Adupa, Lekha Bollinani, Hafiz Malik
Interspeech 2025
Abstract: Current audio deepfake detection systems fail to protect specific individuals from targeted voice spoofing attacks. A comprehensive methodology for collecting, curating, and generating high-quality speech… See the full description on the dataset page: https://huggingface.co/datasets/issf/famousfigures.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The FaciaVox dataset is an extensive multimodal biometric resource designed to enable in-depth exploration of face-image and voice recording research areas in both masked and unmasked scenarios.
Features of the Dataset:
Multimodal Data: A total of 1,800 face images (JPG) and 6,000 audio recordings (WAV) were collected, enabling cross-domain analysis of visual and auditory biometrics.
Participants were categorized into four age groups for structured labeling:
- Label 1: Under 16 years
- Label 2: 16 to less than 31 years
- Label 3: 31 to less than 46 years
- Label 4: 46 years and above
Sibling Data: Some participants are siblings, adding a challenging layer for speaker identification and facial recognition tasks due to genetic similarities in vocal and facial features. Sibling relationships are documented in the accompanying "FaciaVox List" data file.
Standardized Filenames: The dataset uses a consistent, intuitive naming convention for both facial images and voice recordings. Each filename includes:
- Type (F: Face Image, V: Voice Recording)
- Participant ID (e.g., sub001)
- Mask Type (e.g., a: unmasked, b: disposable mask, etc.)
- Zoom Level or Sentence ID (e.g., 1x, 3x, 5x for images, or a specific sentence identifier {01, 02, 03, ..., 10} for recordings)
Diverse Demographics: 19 different countries.
A challenging face recognition problem involving reflective mask shields and severe lighting conditions.
Each participant uttered 7 English statements and 3 Arabic statements, regardless of their native language. This adds a challenge for speaker identification.
Research Applications
FaciaVox is a versatile dataset supporting a wide range of research domains, including but not limited to:
- Speaker Identification (SI) and Face Recognition (FR): Evaluating biometric systems under varying conditions.
- Impact of Masks on Biometrics: Investigating how different facial coverings affect recognition performance.
- Language Impact on SI: Exploring the effects of native and non-native speech on speaker identification.
- Age and Gender Estimation: Inferring demographic information from voice and facial features.
- Race and Ethnicity Matching: Studying biometrics across diverse populations.
- Synthetic Voice and Deepfake Detection: Detecting cloned or generated speech.
- Cross-Domain Biometric Fusion: Combining facial and vocal data for robust authentication.
- Speech Intelligibility: Assessing how masks influence speech clarity.
- Image Inpainting: Reconstructing occluded facial regions for improved recognition.
Researchers can use the facial images and voice recordings independently or in combination to explore multimodal biometric systems. The standardized filenames and accompanying metadata make it easy to align visual and auditory data for cross-domain analyses. Sibling relationships and demographic labels add depth for tasks such as familial voice recognition, demographic profiling, and model bias evaluation.