100+ datasets found
  1. DEEP-VOICE: DeepFake Voice Recognition Dataset

    • paperswithcode.com
    Updated Aug 23, 2023
    Cite
    Jordan Bird (2023). DEEP-VOICE: DeepFake Voice Recognition Dataset [Dataset]. https://paperswithcode.com/dataset/deep-voice-deepfake-voice-recognition
    Explore at:
    Dataset updated
    Aug 23, 2023
    Description

    DEEP-VOICE: Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion This dataset contains examples of real human speech alongside DeepFake versions of those recordings generated using Retrieval-based Voice Conversion.

    Can machine learning be used to detect when speech is AI-generated?

    Introduction There are growing implications surrounding generative AI in the speech domain that enable voice cloning and real-time voice conversion from one individual to another. This technology poses a significant ethical threat and could lead to breaches of privacy and misrepresentation, thus there is an urgent need for real-time detection of AI-generated speech for DeepFake Voice Conversion.

    To address the above emerging issues, we are introducing the DEEP-VOICE dataset. DEEP-VOICE is comprised of real human speech from eight well-known figures and their speech converted to one another using Retrieval-based Voice Conversion.

    For each recording, the accompaniment ("background noise") was removed before conversion using RVC; the original accompaniment was then added back to the DeepFake speech:

    (Above: Overview of the Retrieval-based Voice Conversion process to generate DeepFake speech with Ryan Gosling's speech converted to Margot Robbie. Conversion is run on the extracted vocals before being layered on the original background ambience.)

    Dataset The dataset is made available in two forms.

    First, the raw audio can be found in the "AUDIO" directory. They are arranged within "REAL" and "FAKE" class directories. The audio filenames note which speakers provided the real speech, and which voices they were converted to. For example "Obama-to-Biden" denotes that Barack Obama's speech has been converted to Joe Biden's voice.

    Second, the extracted features can be found in the "DATASET-balanced.csv" file. This is the data used in the study below. Features were extracted from one-second windows of audio, and the classes were balanced through random sampling.

    Note: All experimental data is found within the "KAGGLE" directory. The "DEMONSTRATION" directory is used for playing cropped and compressed demos in notebooks due to Kaggle's limitations on file size.
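
    For a quick baseline on the extracted features, a classifier can be trained directly on "DATASET-balanced.csv". The sketch below is illustrative only: the file path and the name of the class column ("LABEL") are assumptions that should be checked against the CSV.

      # Minimal sketch: train a REAL/FAKE classifier on the one-second-window features.
      # The path and the "LABEL" column name are assumptions.
      import pandas as pd
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.metrics import accuracy_score
      from sklearn.model_selection import train_test_split

      df = pd.read_csv("KAGGLE/DATASET-balanced.csv")

      X = df.drop(columns=["LABEL"])
      y = df["LABEL"]                                   # expected values: REAL / FAKE

      X_train, X_test, y_train, y_test = train_test_split(
          X, y, test_size=0.2, stratify=y, random_state=0
      )

      clf = RandomForestClassifier(n_estimators=200, random_state=0)
      clf.fit(X_train, y_train)
      print("held-out accuracy:", accuracy_score(y_test, clf.predict(X_test)))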

    A successful detection system could potentially be used as follows:

    (Above: Usage of the real-time system. The end user is notified when the machine learning model has processed the speech audio (e.g. a phone or conference call) and predicted that audio chunks contain AI-generated speech.)

    Kaggle The dataset is available on the Kaggle data science platform.

    The Kaggle page can be found by clicking here: Dataset on Kaggle

    Attribution This dataset was produced from the study "Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion"

    The preprint can be found on ArXiv by clicking here: Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion

    License This dataset is provided under the MIT License:

    Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

    The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

  2. Gender Recognition by Voice(original)

    • kaggle.com
    Updated Jan 18, 2025
    Cite
    murtadha najim (2025). Gender Recognition by Voice(original) [Dataset]. https://www.kaggle.com/datasets/murtadhanajim/gender-recognition-by-voiceoriginal
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 18, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    murtadha najim
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset contains WAV audio files for gender classification based on voice. The data is organized into two folders:

    Female: Contains 5,768 audio files of female voices.

    Male: Contains 10,400 audio files of male voices.

    Additional Information:

    Source: The data was obtained online (link provided in the references section).

    Duplicates: Approximately 1,000 files are duplicated. Be sure to check for and handle duplicates when processing the dataset (a deduplication sketch appears at the end of this entry).

    Processed Version: A cleaned and processed version of the dataset is available at this link for convenience.

    Code: The code used for processing this dataset can be found in the Code section.

    Key Features:

    Format: All files are in .wav format.

    Length: around 1.5 to 5 seconds, with an average of 3.26 s.

    Purpose: Designed for building machine learning models that classify gender based on audio data.

    Applications: Useful in speech recognition systems, voice assistant training, and audio analysis.

    Usage Notes:

    This dataset is great for developing machine learning algorithms in:

    Speech processing

    Gender classification

    Signal analysis
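
    Because roughly 1,000 files are exact duplicates, a content-hash pass over both class folders can drop them before training. A minimal sketch, assuming the folders are named Female and Male as described above; the extraction directory name is a placeholder.

      # Minimal sketch: find byte-identical .wav files by MD5 hash and list them.
      import hashlib
      from pathlib import Path

      root = Path("gender-recognition-by-voice")        # assumed extraction directory
      seen, duplicates = {}, []

      for folder in ("Female", "Male"):
          for wav in sorted((root / folder).glob("*.wav")):
              digest = hashlib.md5(wav.read_bytes()).hexdigest()
              if digest in seen:
                  duplicates.append((wav, seen[digest]))   # exact copy of an earlier file
              else:
                  seen[digest] = wav

      print(f"{len(duplicates)} exact duplicates found")
      # To remove them: for dup, _original in duplicates: dup.unlink()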

  3. Arabic Speech Commands Dataset

    • explore.openaire.eu
    • data.niaid.nih.gov
    • +1more
    Updated Apr 5, 2021
    + more versions
    Cite
    Abdulkader Ghandoura (2021). Arabic Speech Commands Dataset [Dataset]. http://doi.org/10.5281/zenodo.4662480
    Explore at:
    Dataset updated
    Apr 5, 2021
    Authors
    Abdulkader Ghandoura
    Description

    Arabic Speech Commands Dataset This dataset is designed to help train simple machine learning models that serve educational and research purposes in the speech recognition domain, mainly for keyword spotting tasks.

    Dataset Description Our dataset is a list of pairs (x, y), where x is the input speech signal and y is the corresponding keyword. The final dataset consists of 12000 such pairs, comprising 40 keywords. Each audio file is one second in length, sampled at 16 kHz. We have 30 participants, each of whom recorded 10 utterances for each keyword. Therefore, we have 300 audio files for each keyword in total (30 * 10 * 40 = 12000), and the total size of all the recorded keywords is ~384 MB. The dataset also contains several background noise recordings obtained from various natural sources of noise. We saved these audio files in a separate folder named background_noise, with a total size of ~49 MB.

    Dataset Structure There are 40 folders, each of which represents one keyword and contains 300 files. The first eight digits of each file name identify the contributor, while the last two digits identify the round number. For example, the file path rotate/00000021_NO_06.wav indicates that the contributor with the ID 00000021 pronounced the keyword rotate for the 6th time.

    Data Split We recommend using the provided CSV files in your experiments. We kept 60% of the dataset for training, 20% for validation, and the remaining 20% for testing. In our split method, we guarantee that all recordings of a certain contributor are within the same subset.

    License This dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. For more details, see the LICENSE file in this folder.

    Citations If you want to use the Arabic Speech Commands dataset in your work, please cite it as:

    @article{arabicspeechcommandsv1,
      author = {Ghandoura, Abdulkader and Hjabo, Farouk and Al Dakkak, Oumayma},
      title = {Building and Benchmarking an Arabic Speech Commands Dataset for Small-Footprint Keyword Spotting},
      journal = {Engineering Applications of Artificial Intelligence},
      year = {2021},
      publisher = {Elsevier}
    }
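
    Given the structure above, the keyword, contributor ID and round number can be recovered directly from each file path, and the recommended splits can be read from the provided CSV files. The sketch below is a minimal illustration; the split file names and the path column name are assumptions to check against the actual CSVs.

      # Minimal sketch: parse metadata from a path such as "rotate/00000021_NO_06.wav"
      # and load one of the provided split files. File and column names are assumptions.
      import csv
      from pathlib import Path

      def parse_path(path: str):
          p = Path(path)
          keyword = p.parent.name                        # e.g. "rotate"
          contributor, _, round_no = p.stem.split("_")   # "00000021", "NO", "06"
          return keyword, contributor, int(round_no)

      print(parse_path("rotate/00000021_NO_06.wav"))     # ('rotate', '00000021', 6)

      def load_split(csv_path: str, column: str = "file"):   # assumed column name
          with open(csv_path, newline="", encoding="utf-8") as f:
              return [row[column] for row in csv.DictReader(f)]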

  4. Dataset for acoustic features extracted from voice recordings of people...

    • data.mendeley.com
    Updated Jan 26, 2023
    Cite
    Javier Carrón (2023). Dataset for acoustic features extracted from voice recordings of people suffering Parkinson’s disease and healthy subjects [Dataset]. http://doi.org/10.17632/fjd6fcfkwn.1
    Explore at:
    Dataset updated
    Jan 26, 2023
    Authors
    Javier Carrón
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains acoustic features extracted from voice recordings of people suffering from Parkinson’s disease and from healthy subjects.

    All the information on data collection and dataset building is presented in the following paper, which should be cited in any scientific publication produced using this dataset:

    Carrón, J., Campos-Roca, Y., Madruga, M. et al. A mobile-assisted voice condition analysis system for Parkinson’s disease: assessment of usability conditions. BioMed Eng OnLine 20, 114 (2021). https://doi.org/10.1186/s12938-021-00951-y

    All the data is included in the .csv file. A brief description of attributes can be found in the .pdf file.

    The research leading to this dataset has been funded by Agencia Estatal de Investigación, Ministerio de Ciencia e Innovación, Spain (Project MTM2017-86875-C3-2-R) and the dataset is part of the project PID2021-122209OB-C32. It has also been funded by Junta de Extremadura, Spain (Projects GR18108 and GR18055), the European Union (European Regional Development Funds), and grant number FPU18/03274.

  5. japanese-voice-combined

    • huggingface.co
    Updated Apr 13, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Kadir Nar (2025). japanese-voice-combined [Dataset]. https://huggingface.co/datasets/kadirnar/japanese-voice-combined
    Explore at:
    Dataset updated
    Apr 13, 2025
    Authors
    Kadir Nar
    Description

    Japanese Voice Dataset Combined

    This dataset combines multiple high-quality Japanese voice datasets to create a comprehensive collection of Japanese speech data. It's designed to be used for speech recognition, text-to-speech, and other speech-related machine learning tasks.

    Dataset Content

    This dataset contains over 86,000 audio samples in Japanese from various high-quality sources. The data comes from the following sources:

    Dataset Estimated Samples

    StoryTTS… See the full description on the dataset page: https://huggingface.co/datasets/kadirnar/japanese-voice-combined.

  6. Russian Speech Recognition Dataset

    • unidata.pro
    wav
    Cite
    Unidata L.L.C-FZ, Russian Speech Recognition Dataset [Dataset]. https://unidata.pro/datasets/russian-speech-recognition-dataset/
    Explore at:
    Available download formats: wav
    Dataset authored and provided by
    Unidata L.L.C-FZ
    Description

    Unidata provides a Russian Speech Recognition dataset to train AI for seamless speech-to-text conversion

  7. A kiswahili Dataset for Development of Text-To-Speech System

    • data.mendeley.com
    Updated Nov 30, 2021
    Cite
    Kiptoo Rono (2021). A kiswahili Dataset for Development of Text-To-Speech System [Dataset]. http://doi.org/10.17632/vbvj6j6pm9.1
    Explore at:
    Dataset updated
    Nov 30, 2021
    Authors
    Kiptoo Rono
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains Kiswahili text and audio files: 7,108 text files with corresponding audio files. It was created from open-source, non-copyrighted material, a Kiswahili audio Bible, whose authors permit use for non-profit, educational, and public benefit purposes. The downloaded audio files were longer than 12.5 s, so they were programmatically split into short clips based on silence and then recombined to random lengths such that each resulting audio file lies between 1 and 12.5 s. This was done using Python 3. The audio files were saved as single-channel, 16-bit PCM WAVE files with a sampling rate of 22.05 kHz.

    The dataset contains approximately 106,000 Kiswahili words, transcribed at an average of 14.96 words per text file and saved in CSV format. Each text file is divided into three parts: a unique ID, the transcribed words, and the normalized words. The unique ID is a number assigned to each text file. The transcribed words are the text spoken by the reader. The normalized text expands abbreviations and numbers into full words. Each audio split was assigned the same unique ID as its text file.
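
    The silence-based splitting and recombination described above can be approximated with pydub. This is only a sketch under stated assumptions: the input file name, the silence thresholds and the recombination logic are illustrative, not the authors' exact procedure.

      # Rough sketch: split on silence, re-join into clips of 1-12.5 s, and export as
      # 16-bit, mono, 22.05 kHz WAV. Thresholds and file names are assumptions.
      import random
      from pydub import AudioSegment
      from pydub.silence import split_on_silence

      audio = AudioSegment.from_file("kiswahili_source.mp3")   # assumed input file
      chunks = split_on_silence(audio, min_silence_len=300, silence_thresh=-40)

      clips, current = [], AudioSegment.empty()
      target_ms = random.uniform(1_000, 12_500)
      for chunk in chunks:
          current += chunk
          if len(current) >= target_ms:                 # len() returns milliseconds
              clips.append(current)
              current = AudioSegment.empty()
              target_ms = random.uniform(1_000, 12_500)
      if len(current) > 0:
          clips.append(current)

      for i, clip in enumerate(clips):
          out = clip.set_frame_rate(22050).set_channels(1).set_sample_width(2)
          out.export(f"{i:05d}.wav", format="wav")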

  8. Persian Speech to Test dataset

    • data.niaid.nih.gov
    Updated Dec 28, 2022
    + more versions
    Cite
    Seyyed Mohammad Masoud Parpanchi (2022). Persian Speech to Test dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7486153
    Explore at:
    Dataset updated
    Dec 28, 2022
    Dataset authored and provided by
    Seyyed Mohammad Masoud Parpanchi
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Persian Speech to Text dataset is an open source dataset for training machine learning models for the task of transcribing audio files in the Persian language into text. It is the largest open source dataset of its kind, with a size of approximately 60GB of data. The dataset consists of audio files in the WAV format and their transcripts in CSV file format. This dataset is a valuable resource for researchers and developers working on natural language processing tasks involving the Persian language, and it provides a large and diverse set of data to train and evaluate machine learning models on. The open source nature of the dataset means that it is freely available to be used and modified by anyone, making it an important resource for advancing research and development in the field.

  9. Axiom voice recognition dataset

    • data.niaid.nih.gov
    Updated Aug 2, 2024
    Cite
    Sara Ermini (2024). Axiom voice recognition dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_1218978
    Explore at:
    Dataset updated
    Aug 2, 2024
    Dataset provided by
    Nicola Bettin
    Antonio Rizzo
    Sara Ermini
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The AXIOM Voice Dataset's main purpose is to gather audio recordings from native Italian speakers. The collection was intended to obtain audio recording samples for training and testing the VIMAR algorithm implemented for the Smart Home scenario on the Axiom board. The final goal was to develop an efficient voice recognition system using machine learning algorithms. A team of UX researchers from the University of Siena collected data for five months and tested the voice recognition system on the AXIOM board [1]. The data acquisition process involved native Italian speakers who provided their written consent to participate in the research project. The participants were selected so as to maintain a cluster with different characteristics in gender, age, region of origin and background.

  10. SpeakingFaces Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Apr 28, 2022
    Cite
    Madina Abdrakhmanova; Askat Kuzdeuov; Sheikh Jarju; Yerbolat Khassanov; Michael Lewis; Huseyin Atakan Varol (2022). SpeakingFaces Dataset [Dataset]. https://paperswithcode.com/dataset/speakingfaces
    Explore at:
    Dataset updated
    Apr 28, 2022
    Authors
    Madina Abdrakhmanova; Askat Kuzdeuov; Sheikh Jarju; Yerbolat Khassanov; Michael Lewis; Huseyin Atakan Varol
    Description

    SpeakingFaces is a publicly-available large-scale dataset developed to support multimodal machine learning research in contexts that utilize a combination of thermal, visual, and audio data streams; examples include human-computer interaction (HCI), biometric authentication, recognition systems, domain transfer, and speech recognition. SpeakingFaces is comprised of well-aligned high-resolution thermal and visual spectra image streams of fully-framed faces synchronized with audio recordings of each subject speaking approximately 100 imperative phrases.

  11. 16kHz Conversational Speech Data | 35,000 Hours | Large Language Model(LLM)...

    • datarade.ai
    Updated Dec 9, 2023
    Cite
    Nexdata (2023). 16kHz Conversational Speech Data | 35,000 Hours | Large Language Model(LLM) Data | Speech AI Datasets|Machine Learning (ML) Data [Dataset]. https://datarade.ai/data-products/nexdata-multilingual-conversational-speech-data-16khz-mob-nexdata
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Dec 9, 2023
    Dataset authored and provided by
    Nexdata
    Area covered
    Turkey, Austria, Saudi Arabia, Malaysia, Canada, Indonesia, Germany, Ecuador, Vietnam, Korea (Republic of)
    Description
    1. Specifications

    Format: 16 kHz, 16-bit, uncompressed WAV, mono channel;

    Environment: quiet indoor environment, without echo;

    Recording content: no preset linguistic data; dozens of topics are specified, and the speakers hold a dialogue on those topics while the recording is performed;

    Demographics: speakers are evenly distributed across all age groups, covering children, teenagers, the middle-aged, the elderly, etc.;

    Annotation: transcription text, speaker identification, gender and noise symbols;

    Device: Android mobile phone, iPhone;

    Language: 100+ languages;

    Application scenarios: speech recognition; voiceprint recognition;

    Accuracy rate: the word accuracy rate is not less than 98%.
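
    A delivery against the specification above (16 kHz, 16-bit, uncompressed mono WAV) can be sanity-checked with Python's standard wave module. A minimal sketch; the directory name is a placeholder.

      # Minimal sketch: flag files that do not match the stated 16 kHz / 16-bit / mono spec.
      import wave
      from pathlib import Path

      for path in sorted(Path("delivery").rglob("*.wav")):     # placeholder directory
          with wave.open(str(path), "rb") as w:
              ok = (w.getframerate() == 16000
                    and w.getsampwidth() == 2                  # 16-bit samples
                    and w.getnchannels() == 1)                 # mono
          if not ok:
              print("spec mismatch:", path)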

    2. About Nexdata
    Nexdata owns off-the-shelf PB-level Large Language Model (LLM) data, 1 million hours of audio data and 800 TB of annotated imagery data. These ready-to-go Machine Learning (ML) data support instant delivery and quickly improve the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/speechrecog?source=Datarade
  12. Speech Dataset in Hindi Language

    • ieee-dataport.org
    Updated Jun 9, 2020
    Cite
    Shivam Shukla (2020). Speech Dataset in Hindi Language [Dataset]. https://ieee-dataport.org/open-access/speech-dataset-hindi-language
    Explore at:
    Dataset updated
    Jun 9, 2020
    Authors
    Shivam Shukla
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    mp3

  13. Dvoice: An open source dataset for Automatic Speech Recognition on Moroccan...

    • explore.openaire.eu
    • data.niaid.nih.gov
    Updated Sep 7, 2021
    Cite
    Imade Benelallam; Anass Allak; Abdou Mohamed Naira (2021). Dvoice : An open source dataset for Automatic Speech Recognition on Moroccan dialectal Arabic [Dataset]. http://doi.org/10.5281/zenodo.5482550
    Explore at:
    Dataset updated
    Sep 7, 2021
    Authors
    Imade Benelallam; Anass Allak; Abdou Mohamed Naira
    Area covered
    Morocco
    Description

    Dialectal Voice is a community project initiated by AIOX Labs to facilitate voice recognition by intelligent systems. Today, the need for AI systems capable of recognizing the human voice is increasingly expressed within communities; however, for some languages such as Darija, there are not enough voice technology solutions. To meet this need, we proposed this program of iterative and interactive construction of a dialectal database, open to all, in order to help improve voice recognition and generation models.

  14. UrduSER: A Dataset for Urdu Speech Emotion Recognition

    • data.mendeley.com
    Updated Apr 28, 2025
    + more versions
    Cite
    Muhammad Zaheer Akhtar (2025). UrduSER: A Dataset for Urdu Speech Emotion Recognition [Dataset]. http://doi.org/10.17632/jcpfjnk5c2.4
    Explore at:
    Dataset updated
    Apr 28, 2025
    Authors
    Muhammad Zaheer Akhtar
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Speech Emotion Recognition (SER) is a rapidly evolving field of research aimed at identifying and categorizing emotional states through the analysis of speech signals. As SER holds significant socio-cultural and commercial importance, researchers are increasingly leveraging machine learning and deep learning techniques to drive advancements in this domain. A high-quality dataset is an essential resource for SER studies in any language. Despite Urdu being the 10th most spoken language globally, there is a significant lack of robust SER datasets, creating a research gap. Existing Urdu SER datasets are often limited by their small size, narrow emotional range, and repetitive content, reducing their applicability in real-world scenarios.

    To address this gap, the Urdu Speech Emotion Recognition (UrduSER) dataset was developed. This comprehensive dataset includes 3500 Urdu speech signals sourced from 10 professional actors, with an equal representation of male and female speakers from diverse age groups. The dataset encompasses seven emotional states: Angry, Fear, Boredom, Disgust, Happy, Neutral, and Sad. The speech samples were curated from a wide collection of Pakistani Urdu drama serials and telefilms available on YouTube, ensuring diversity and natural delivery. Unlike conventional datasets, which rely on predefined dialogs recorded in controlled environments, UrduSER features unique and contextually varied utterances, making it more realistic and applicable for practical applications.

    To ensure balance and consistency, the dataset contains 500 samples per emotional class, with 50 samples contributed by each actor for each emotion. Additionally, an accompanying Excel file provides detailed metadata for each recording, including the file name, duration, format, sample rate, actor details, emotional state, and corresponding Urdu dialog. This metadata enables researchers to efficiently organize and utilize the dataset for their specific needs. The UrduSER dataset underwent rigorous validation, integrating expert evaluation and model-based validation to ensure its reliability, accuracy, and overall suitability for advancing research and development in Urdu Speech Emotion Recognition.
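
    Because the metadata spreadsheet records the emotion and actor for every file, the stated balance (500 samples per emotion, 50 per actor per emotion) can be verified in a few lines of pandas. A minimal sketch; the spreadsheet name and the column names ("Emotion", "Actor") are assumptions to match against the actual file.

      # Minimal sketch: verify class balance from the UrduSER metadata spreadsheet.
      # The file name and column names are assumptions.
      import pandas as pd

      meta = pd.read_excel("UrduSER_metadata.xlsx")            # assumed file name

      per_emotion = meta.groupby("Emotion").size()             # expected: 500 per class
      per_actor = meta.groupby(["Emotion", "Actor"]).size()    # expected: 50 per pair

      print(per_emotion)
      print("balanced:", (per_emotion == 500).all() and (per_actor == 50).all())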

  15. 8kHz Conversational Speech Data | 15,000 Hours | Audio Data | Speech...

    • datarade.ai
    Updated Dec 10, 2023
    Cite
    Nexdata (2023). 8kHz Conversational Speech Data | 15,000 Hours | Audio Data | Speech Recognition Data| Machine Learning (ML) Data [Dataset]. https://datarade.ai/data-products/nexdata-multilingual-conversational-speech-data-8khz-tele-nexdata
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Dec 10, 2023
    Dataset authored and provided by
    Nexdata
    Area covered
    Vietnam, United Arab Emirates, Argentina, Romania, Philippines, Singapore, Netherlands, Czech Republic, United States of America, Poland
    Description
    1. Specifications

    Format: 8 kHz, 8-bit, u-law/a-law PCM, mono channel;

    Environment: quiet indoor environment, without echo;

    Recording content: no preset linguistic data; dozens of topics are specified, and the speakers hold a dialogue on those topics while the recording is performed;

    Demographics: speakers are evenly distributed across all age groups, covering children, teenagers, the middle-aged, the elderly, etc.;

    Annotation: transcription text, speaker identification, gender and noise symbols;

    Device: telephony recording system;

    Language: 100+ languages;

    Application scenarios: speech recognition; voiceprint recognition;

    Accuracy rate: the word accuracy rate is not less than 98%.

    2. About Nexdata
    Nexdata owns off-the-shelf PB-level Large Language Model (LLM) data, 1 million hours of audio data and 800 TB of annotated imagery data. These ready-to-go Machine Learning (ML) data support instant delivery and quickly improve the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/speechrecog?source=Datarade
  16. italian-speech-recognition-dataset

    • huggingface.co
    Updated Mar 12, 2025
    Cite
    UniData (2025). italian-speech-recognition-dataset [Dataset]. https://huggingface.co/datasets/UniDataPro/italian-speech-recognition-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 12, 2025
    Authors
    UniData
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Italian Speech Dataset for recognition task

    Dataset comprises 499 hours of telephone dialogues in Italian, collected from 670+ native speakers across various topics and domains, achieving an impressive 98% Word Accuracy Rate. It is designed for research in automatic speech recognition (ASR) systems. By utilizing this dataset, researchers and developers can advance their understanding and capabilities in natural language processing (NLP), speech recognition, and machine learning… See the full description on the dataset page: https://huggingface.co/datasets/UniDataPro/italian-speech-recognition-dataset.

  17. German Speech Recognition Dataset

    • unidata.pro
    wav
    Updated Mar 19, 2025
    Cite
    Unidata L.L.C-FZ (2025). German Speech Recognition Dataset [Dataset]. https://unidata.pro/datasets/german-speech-recognition-dataset/
    Explore at:
    Available download formats: wav
    Dataset updated
    Mar 19, 2025
    Dataset authored and provided by
    Unidata L.L.C-FZ
    Description

    Unidata’s German Speech Recognition dataset enhances AI transcription, ensuring precise speech-to-text conversion and language understanding

  18. Hate Speech and Offensive Language Detection

    • kaggle.com
    Updated Dec 2, 2023
    Cite
    The Devastator (2023). Hate Speech and Offensive Language Detection [Dataset]. https://www.kaggle.com/datasets/thedevastator/hate-speech-and-offensive-language-detection
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 2, 2023
    Dataset provided by
    Kaggle
    Authors
    The Devastator
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Hate Speech and Offensive Language Detection

    Hate Speech and Offensive Language Detection on Twitter

    By hate_speech_offensive (From Huggingface) [source]

    About this dataset

    This dataset, named hate_speech_offensive, is a meticulously curated collection of annotated tweets with the specific purpose of detecting hate speech and offensive language. The dataset primarily consists of English tweets and is designed to train machine learning models or algorithms in the task of hate speech detection. It should be noted that the dataset has not been divided into multiple subsets, and only the train split is currently available for use.

    The dataset includes several columns that provide valuable information for understanding each tweet's classification. The column count represents the total number of annotations provided for each tweet, whereas hate_speech_count signifies how many annotations classified a particular tweet as hate speech. On the other hand, offensive_language_count indicates the number of annotations categorizing a tweet as containing offensive language. Additionally, neither_count denotes how many annotations identified a tweet as neither hate speech nor offensive language.

    For researchers and developers aiming to create effective models or algorithms capable of detecting hate speech and offensive language on Twitter, this comprehensive dataset offers a rich resource for training and evaluation purposes

    How to use the dataset

    • Introduction:

    • Dataset Overview:

      • The dataset is presented in a CSV file format named 'train.csv'.
      • It consists of annotated tweets with information about their classification as hate speech, offensive language, or neither.
      • Each row represents a tweet along with the corresponding annotations provided by multiple annotators.
      • The main columns that will be essential for your analysis are: count (total number of annotations), hate_speech_count (number of annotations classifying a tweet as hate speech), offensive_language_count (number of annotations classifying a tweet as offensive language), neither_count (number of annotations classifying a tweet as neither hate speech nor offensive language).
    • Data Collection Methodology: The data collection methodology used to create this dataset involved obtaining tweets from Twitter's public API using specific search terms related to hate speech and offensive language. These tweets were then manually labeled by multiple annotators who reviewed them for classification purposes.

    • Data Quality: Although efforts have been made to ensure the accuracy of the data, it is important to acknowledge that annotations are subjective opinions provided by individual annotators. As such, there may be variations in classifications between annotators.

    • Preprocessing Techniques: Prior to training machine learning models or algorithms on this dataset, it is recommended to apply standard preprocessing techniques such as removing URLs, usernames/handles, special characters/punctuation marks, stop words removal, tokenization, stemming/lemmatization etc., depending on your analysis requirements.

    • Exploratory Data Analysis (EDA): Conducting EDA on the dataset will help you gain insights and understand the underlying patterns in hate speech and offensive language. Some potential analysis ideas include:

      • Distribution of tweet counts per classification category (hate speech, offensive language, neither).
      • Most common words/phrases associated with each class.
      • Co-occurrence analysis to identify correlations between hate speech and offensive language.
    • Building Machine Learning Models: To train models for automatic detection of hate speech and offensive language, you can follow these steps: a) Split the dataset into training and testing sets for model evaluation purposes. b) Choose appropriate features/
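
    Following the steps above, a minimal baseline can derive a majority-vote label from the annotation counts and fit a simple text classifier. This is a hedged sketch: the count columns follow the description, but the tweet-text column name ("tweet") is an assumption to verify against train.csv.

      # Minimal baseline sketch: majority-vote label + TF-IDF logistic regression.
      # Column names, especially "tweet", are assumptions.
      import pandas as pd
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import train_test_split
      from sklearn.pipeline import make_pipeline

      df = pd.read_csv("train.csv")
      label_cols = ["hate_speech_count", "offensive_language_count", "neither_count"]
      df["label"] = df[label_cols].idxmax(axis=1)      # majority-vote class per tweet

      X_train, X_test, y_train, y_test = train_test_split(
          df["tweet"], df["label"], test_size=0.2, stratify=df["label"], random_state=0
      )

      model = make_pipeline(TfidfVectorizer(min_df=2), LogisticRegression(max_iter=1000))
      model.fit(X_train, y_train)
      print("held-out accuracy:", model.score(X_test, y_test))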

    Research Ideas

    • Sentiment Analysis: This dataset can be used to train models for sentiment analysis on Twitter data. By classifying tweets as hate speech, offensive language, or neither, the dataset can help in understanding the sentiment behind different tweets and identifying patterns of negative or offensive language.
    • Hate Speech Detection: The dataset can be used to develop models that automatically detect hate speech on Twitter. By training machine learning algorithms on this annotated dataset, it becomes possible to create systems that can identify and flag hate speech in real-time, making social media platforms safer and more inclusive.
    • Content Moderation: Social media platforms can use this dataset to improve their content m...
  19. In The Wild (audio Deepfake)

    • kaggle.com
    zip
    Updated Apr 20, 2024
    Cite
    Abdalla Mohamed (2024). In The Wild (audio Deepfake) [Dataset]. https://www.kaggle.com/datasets/abdallamohamed312/in-the-wild-audio-deepfake
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Apr 20, 2024
    Authors
    Abdalla Mohamed
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    'In-the-Wild' Dataset We present a dataset of audio deepfakes (and corresponding benign audio) for a set of politicians and other public figures, collected from publicly available sources such as social networks and video streaming platforms. For n = 58 celebrities and politicians, we collect both bona-fide and spoofed audio. In total, we collect 20.8 hours of bona-fide and 17.2 hours of spoofed audio. On average, there are 23 minutes of bona-fide and 18 minutes of spoofed audio per speaker.

    The dataset is intended to be used for evaluating deepfake detection and voice anti-spoofing machine-learning models. It is especially useful to judge a model's capability to generalize to realistic, in-the-wild audio samples. Find more information in our paper, and download the dataset here.

    The most interesting deepfake detection models we used in our experiments are open-source on GitHub:

    RawNet 2, RawGAT-ST, PC-Darts. This dataset and the associated documentation are licensed under the Apache License, Version 2.0.

  20. Data from: Common Phone: A Multilingual Dataset for Robust Acoustic...

    • zenodo.org
    application/gzip
    Updated Jul 17, 2024
    Cite
    Philipp Klumpp; Tomás Arias-Vergara; Paula Andrea Pérez-Toro; Elmar Nöth; Juan Rafael Orozco-Arroyave (2024). Common Phone: A Multilingual Dataset for Robust Acoustic Modelling [Dataset]. http://doi.org/10.5281/zenodo.5846137
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Jul 17, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Philipp Klumpp; Tomás Arias-Vergara; Paula Andrea Pérez-Toro; Elmar Nöth; Juan Rafael Orozco-Arroyave
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Release Date: 17.01.22

    Welcome to Common Phone 1.0

    Legal Information

    Common Phone is a subset of the Common Voice corpus collected by Mozilla Corporation. By using Common Phone, you agree to the Common Voice Legal Terms. Common Phone is maintained and distributed by speech researchers at the Pattern Recognition Lab of Friedrich-Alexander-University Erlangen-Nuremberg (FAU) under the CC0 license.

    Like for Common Voice, you must not make any attempt to identify speakers that contributed to Common Phone.

    About Common Phone

    This corpus aims to provide a basis for Machine Learning (ML) researchers and enthusiasts to train and test their models against a wide variety of speakers, hardware/software ecosystems and acoustic conditions to improve generalization and availability of ML in real-world speech applications.
    The current version of Common Phone comprises 116.5 hours of speech samples, collected from 11,246 speakers in 6 languages:

    Language    Speakers (train / dev / test)    Hours (train / dev / test)
    English     4716 / 771 / 774                 14.1 / 2.3 / 2.3
    French      796 / 138 / 135                  13.6 / 2.3 / 2.2
    German      1176 / 202 / 206                 14.5 / 2.5 / 2.6
    Italian     1031 / 176 / 178                 14.6 / 2.5 / 2.5
    Spanish     508 / 88 / 91                    16.5 / 3.0 / 3.1
    Russian     190 / 34 / 36                    12.7 / 2.6 / 2.8
    Total       8417 / 1409 / 1420               85.8 / 15.2 / 15.5

    Presented train, dev and test splits are not identical to those shipped with Common Voice. Speaker separation among splits was realized by only using those speakers that had provided age and gender information. This information can only be provided as a registered user on the website. When logged in, the session ID of contributed recordings is always linked to your user, thus we could easily link recordings to individual speakers. Keep in mind this would not be possible for unregistered users, as their session ID changes if they decide to contribute more than once.
    During speaker selection, we considered that some speakers had contributed to more than one of the six Common Voice datasets (one for each language). In Common Phone, a speaker will only appear in one language.
    The dataset is structured as follows:

    • Six top-level directories, one for each language.
    • Each language folder contains:
      • [train|dev|test].csv files listing audio files, the respective speaker ID and plain-text transcript (a minimal loading sketch follows this list).
      • meta.csv provides speaker information: age group, gender, language, accent (if available) and which of the three splits this speaker was assigned to. File names match corresponding audio file names except their extension.
      • /grids/ contains phonetic transcription for every audio file in Praat TextGrid format.
      • /mp3/ contains audio files in mp3, identical to those of Common Voice, e.g., sampling rates have been preserved and may vary for different files.
      • /wav/ contains raw audio files in 16 bits/sample, 16 kHz single channel. They had been created from the original mp3 audios. We provide them for convenience, keep in mind that their source had undergone MP3-compression.
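
    Based on the layout above, one language's split can be loaded roughly as follows. This is a hedged sketch: the CSV column names are assumptions, and the language folder name should be checked against the release.

      # Minimal sketch: read the English train split and locate each WAV and TextGrid.
      # Column names ("file", "text") and the folder name "en" are assumptions.
      from pathlib import Path
      import pandas as pd

      lang_dir = Path("common_phone/en")               # assumed language folder
      train = pd.read_csv(lang_dir / "train.csv")

      for _, row in train.head(5).iterrows():
          stem = Path(str(row["file"])).stem
          wav = lang_dir / "wav" / f"{stem}.wav"
          grid = lang_dir / "grids" / f"{stem}.TextGrid"
          print(wav.exists(), grid.exists(), row.get("text", ""))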

    Where does the phonetic annotation come from?

    Phonetic annotation was computed via BAS Web Services. We used the regular Pipeline (G2P-MAUS) without ASR to create an alignment of text transcripts with audio signals. We chose International Phonetic Alphabet (IPA) output symbols as they work well even in a multi-lingual setup. Common Phone annotation comprises 101 phonetic symbols, including silence.

    Why Common Phone?

    • Large number of speakers and varying acoustic conditions to improve robustness of ML models
    • Time-aligned IPA phonetic transcription for every audio sample
    • Gender-balanced and age-group-matched (equal number of female/male speakers in every age group)
    • Support for six different languages to leverage multi-lingual approaches
    • Original MP3 files plus standard WAVE files

    Is there any publication available?

    Yes, a paper describing Common Phone in detail is currently under revision for LREC 2022. You can access a pre-print version on arXiv entitled “Common Phone: A Multilingual Dataset for Robust Acoustic Modelling”.
