100+ datasets found
  1. DEEP-VOICE: DeepFake Voice Recognition Dataset

    • paperswithcode.com
    Updated Aug 23, 2023
    Cite
    Jordan Bird (2023). DEEP-VOICE: DeepFake Voice Recognition Dataset [Dataset]. https://paperswithcode.com/dataset/deep-voice-deepfake-voice-recognition
    Explore at:
    Dataset updated
    Aug 23, 2023
    Description

    DEEP-VOICE: Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion This dataset contains examples of real human speech alongside DeepFake versions of those recordings generated using Retrieval-based Voice Conversion.

    Can machine learning be used to detect when speech is AI-generated?

    Introduction There are growing implications surrounding generative AI in the speech domain that enable voice cloning and real-time voice conversion from one individual to another. This technology poses a significant ethical threat and could lead to breaches of privacy and misrepresentation, thus there is an urgent need for real-time detection of AI-generated speech for DeepFake Voice Conversion.

    To address the above emerging issues, we are introducing the DEEP-VOICE dataset. DEEP-VOICE is comprised of real human speech from eight well-known figures and their speech converted to one another using Retrieval-based Voice Conversion.

    For each recording, the accompaniment ("background noise") was removed before conversion using RVC; the original accompaniment was then added back to the DeepFake speech:

    (Above: Overview of the Retrieval-based Voice Conversion process to generate DeepFake speech with Ryan Gosling's speech converted to Margot Robbie. Conversion is run on the extracted vocals before being layered on the original background ambience.)

    Dataset The dataset is made available in two forms.

    First, the raw audio can be found in the "AUDIO" directory. They are arranged within "REAL" and "FAKE" class directories. The audio filenames note which speakers provided the real speech, and which voices they were converted to. For example "Obama-to-Biden" denotes that Barack Obama's speech has been converted to Joe Biden's voice.

    Second, the extracted features can be found in the "DATASET-balanced.csv" file. This is the data used in the study below. Features were extracted from one-second windows of audio, and the classes were balanced through random sampling.

    Note: All experimental data is found within the "KAGGLE" directory. The "DEMONSTRATION" directory is used for playing cropped and compressed demos in notebooks due to Kaggle's limitations on file size.
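
    For a quick baseline on the extracted features, a classifier can be trained directly on "DATASET-balanced.csv". The sketch below is illustrative only: the file path and the name of the class column ("LABEL") are assumptions that should be checked against the CSV.

      # Minimal sketch: train a REAL/FAKE classifier on the one-second-window features.
      # The path and the "LABEL" column name are assumptions.
      import pandas as pd
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.metrics import accuracy_score
      from sklearn.model_selection import train_test_split

      df = pd.read_csv("KAGGLE/DATASET-balanced.csv")

      X = df.drop(columns=["LABEL"])
      y = df["LABEL"]                                   # expected values: REAL / FAKE

      X_train, X_test, y_train, y_test = train_test_split(
          X, y, test_size=0.2, stratify=y, random_state=0
      )

      clf = RandomForestClassifier(n_estimators=200, random_state=0)
      clf.fit(X_train, y_train)
      print("held-out accuracy:", accuracy_score(y_test, clf.predict(X_test)))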

    A successful detection system could potentially be used as follows:

    (Above: Usage of the real-time system. The end user is notified when the machine learning model has processed the speech audio (e.g. a phone or conference call) and predicted that audio chunks contain AI-generated speech.)

    Kaggle The dataset is available on the Kaggle data science platform.

    The Kaggle page can be found by clicking here: Dataset on Kaggle

    Attribution This dataset was produced from the study "Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion"

    The preprint can be found on ArXiv by clicking here: Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion

    License This dataset is provided under the MIT License:

    Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

    The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

  2. Gender Recognition by Voice(original)

    • kaggle.com
    Updated Jan 18, 2025
    Cite
    murtadha najim (2025). Gender Recognition by Voice(original) [Dataset]. https://www.kaggle.com/datasets/murtadhanajim/gender-recognition-by-voiceoriginal
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 18, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    murtadha najim
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset contains WAV audio files for gender classification based on voice. The data is organized into two folders:

    Female: Contains 5,768 audio files of female voices.

    Male: Contains 10,400 audio files of male voices.

    Additional Information:

    Source: The data was obtained online (link provided in the references section).

    Duplicates: Approximately 1,000 files are duplicated. Be sure to check for and handle duplicates when processing the dataset (a deduplication sketch appears at the end of this entry).

    Processed Version: A cleaned and processed version of the dataset is available at this link for convenience.

    Code: The code used for processing this dataset can be found in the Code section.

    Key Features:

    Format: All files are in .wav format.

    Length: around 1.5 to 5 seconds, with an average of 3.26 s.

    Purpose: Designed for building machine learning models that classify gender based on audio data.

    Applications: Useful in speech recognition systems, voice assistant training, and audio analysis.

    Usage Notes:

    This dataset is great for developing machine learning algorithms in:

    Speech processing

    Gender classification

    Signal analysis
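
    Because roughly 1,000 files are exact duplicates, a content-hash pass over both class folders can drop them before training. A minimal sketch, assuming the folders are named Female and Male as described above; the extraction directory name is a placeholder.

      # Minimal sketch: find byte-identical .wav files by MD5 hash and list them.
      import hashlib
      from pathlib import Path

      root = Path("gender-recognition-by-voice")        # assumed extraction directory
      seen, duplicates = {}, []

      for folder in ("Female", "Male"):
          for wav in sorted((root / folder).glob("*.wav")):
              digest = hashlib.md5(wav.read_bytes()).hexdigest()
              if digest in seen:
                  duplicates.append((wav, seen[digest]))   # exact copy of an earlier file
              else:
                  seen[digest] = wav

      print(f"{len(duplicates)} exact duplicates found")
      # To remove them: for dup, _original in duplicates: dup.unlink()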

  3. Arabic Speech Commands Dataset

    • explore.openaire.eu
    • data.niaid.nih.gov
    • +1more
    Updated Apr 5, 2021
    + more versions
    Cite
    Abdulkader Ghandoura (2021). Arabic Speech Commands Dataset [Dataset]. http://doi.org/10.5281/zenodo.4662480
    Explore at:
    Dataset updated
    Apr 5, 2021
    Authors
    Abdulkader Ghandoura
    Description

    Arabic Speech Commands Dataset This dataset is designed to help train simple machine learning models that serve educational and research purposes in the speech recognition domain, mainly for keyword spotting tasks.

    Dataset Description Our dataset is a list of pairs (x, y), where x is the input speech signal and y is the corresponding keyword. The final dataset consists of 12000 such pairs, comprising 40 keywords. Each audio file is one second in length, sampled at 16 kHz. We have 30 participants, each of whom recorded 10 utterances for each keyword. Therefore, we have 300 audio files for each keyword in total (30 * 10 * 40 = 12000), and the total size of all the recorded keywords is ~384 MB. The dataset also contains several background noise recordings obtained from various natural sources of noise. We saved these audio files in a separate folder named background_noise, with a total size of ~49 MB.

    Dataset Structure There are 40 folders, each of which represents one keyword and contains 300 files. The first eight digits of each file name identify the contributor, while the last two digits identify the round number. For example, the file path rotate/00000021_NO_06.wav indicates that the contributor with the ID 00000021 pronounced the keyword rotate for the 6th time.

    Data Split We recommend using the provided CSV files in your experiments. We kept 60% of the dataset for training, 20% for validation, and the remaining 20% for testing. In our split method, we guarantee that all recordings of a certain contributor are within the same subset.

    License This dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. For more details, see the LICENSE file in this folder.

    Citations If you want to use the Arabic Speech Commands dataset in your work, please cite it as:

    @article{arabicspeechcommandsv1,
      author = {Ghandoura, Abdulkader and Hjabo, Farouk and Al Dakkak, Oumayma},
      title = {Building and Benchmarking an Arabic Speech Commands Dataset for Small-Footprint Keyword Spotting},
      journal = {Engineering Applications of Artificial Intelligence},
      year = {2021},
      publisher = {Elsevier}
    }
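
    Given the structure above, the keyword, contributor ID and round number can be recovered directly from each file path, and the recommended splits can be read from the provided CSV files. The sketch below is a minimal illustration; the split file names and the path column name are assumptions to check against the actual CSVs.

      # Minimal sketch: parse metadata from a path such as "rotate/00000021_NO_06.wav"
      # and load one of the provided split files. File and column names are assumptions.
      import csv
      from pathlib import Path

      def parse_path(path: str):
          p = Path(path)
          keyword = p.parent.name                        # e.g. "rotate"
          contributor, _, round_no = p.stem.split("_")   # "00000021", "NO", "06"
          return keyword, contributor, int(round_no)

      print(parse_path("rotate/00000021_NO_06.wav"))     # ('rotate', '00000021', 6)

      def load_split(csv_path: str, column: str = "file"):   # assumed column name
          with open(csv_path, newline="", encoding="utf-8") as f:
              return [row[column] for row in csv.DictReader(f)]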

  4. Dataset for acoustic features extracted from voice recordings of people...

    • data.mendeley.com
    Updated Jan 26, 2023
    Cite
    Javier Carrón (2023). Dataset for acoustic features extracted from voice recordings of people suffering Parkinson’s disease and healthy subjects [Dataset]. http://doi.org/10.17632/fjd6fcfkwn.1
    Explore at:
    Dataset updated
    Jan 26, 2023
    Authors
    Javier Carrón
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains acoustic features extracted from voice recordings of people suffering from Parkinson’s disease and from healthy subjects.

    All the information on data collection and dataset building is presented in the following paper, which should be cited in any scientific publication produced using this dataset:

    Carrón, J., Campos-Roca, Y., Madruga, M. et al. A mobile-assisted voice condition analysis system for Parkinson’s disease: assessment of usability conditions. BioMed Eng OnLine 20, 114 (2021). https://doi.org/10.1186/s12938-021-00951-y

    All the data is included in the .csv file. A brief description of attributes can be found in the .pdf file.

    The research leading to this dataset has been funded by Agencia Estatal de Investigación, Ministerio de Ciencia e Innovación, Spain (Project MTM2017-86875-C3-2-R) and the dataset is part of the project PID2021-122209OB-C32. It has also been funded by Junta de Extremadura, Spain (Projects GR18108 and GR18055), the European Union (European Regional Development Funds), and grant number FPU18/03274.

  5. japanese-voice-combined

    • huggingface.co
    Updated Apr 13, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Kadir Nar (2025). japanese-voice-combined [Dataset]. https://huggingface.co/datasets/kadirnar/japanese-voice-combined
    Explore at:
    Dataset updated
    Apr 13, 2025
    Authors
    Kadir Nar
    Description

    Japanese Voice Dataset Combined

    This dataset combines multiple high-quality Japanese voice datasets to create a comprehensive collection of Japanese speech data. It's designed to be used for speech recognition, text-to-speech, and other speech-related machine learning tasks.

    Dataset Content

    This dataset contains over 86,000 audio samples in Japanese from various high-quality sources. The data comes from the following sources:

    Dataset Estimated Samples

    StoryTTS… See the full description on the dataset page: https://huggingface.co/datasets/kadirnar/japanese-voice-combined.

  6. Russian Speech Recognition Dataset

    • unidata.pro
    wav
    Cite
    Unidata L.L.C-FZ, Russian Speech Recognition Dataset [Dataset]. https://unidata.pro/datasets/russian-speech-recognition-dataset/
    Explore at:
    Available download formats: wav
    Dataset authored and provided by
    Unidata L.L.C-FZ
    Description

    Unidata provides a Russian Speech Recognition dataset to train AI for seamless speech-to-text conversion

  7. A kiswahili Dataset for Development of Text-To-Speech System

    • data.mendeley.com
    Updated Nov 30, 2021
    Cite
    Kiptoo Rono (2021). A kiswahili Dataset for Development of Text-To-Speech System [Dataset]. http://doi.org/10.17632/vbvj6j6pm9.1
    Explore at:
    Dataset updated
    Nov 30, 2021
    Authors
    Kiptoo Rono
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains Kiswahili text and audio files: 7,108 text files with corresponding audio files. It was created from open-source, non-copyrighted material, a Kiswahili audio Bible, whose authors permit use for non-profit, educational, and public benefit purposes. The downloaded audio files were longer than 12.5 s, so they were programmatically split into short clips based on silence and then recombined to random lengths such that each resulting audio file lies between 1 and 12.5 s. This was done using Python 3. The audio files were saved as single-channel, 16-bit PCM WAVE files with a sampling rate of 22.05 kHz.

    The dataset contains approximately 106,000 Kiswahili words, transcribed at an average of 14.96 words per text file and saved in CSV format. Each text file is divided into three parts: a unique ID, the transcribed words, and the normalized words. The unique ID is a number assigned to each text file. The transcribed words are the text spoken by the reader. The normalized text expands abbreviations and numbers into full words. Each audio split was assigned the same unique ID as its text file.
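
    The silence-based splitting and recombination described above can be approximated with pydub. This is only a sketch under stated assumptions: the input file name, the silence thresholds and the recombination logic are illustrative, not the authors' exact procedure.

      # Rough sketch: split on silence, re-join into clips of 1-12.5 s, and export as
      # 16-bit, mono, 22.05 kHz WAV. Thresholds and file names are assumptions.
      import random
      from pydub import AudioSegment
      from pydub.silence import split_on_silence

      audio = AudioSegment.from_file("kiswahili_source.mp3")   # assumed input file
      chunks = split_on_silence(audio, min_silence_len=300, silence_thresh=-40)

      clips, current = [], AudioSegment.empty()
      target_ms = random.uniform(1_000, 12_500)
      for chunk in chunks:
          current += chunk
          if len(current) >= target_ms:                 # len() returns milliseconds
              clips.append(current)
              current = AudioSegment.empty()
              target_ms = random.uniform(1_000, 12_500)
      if len(current) > 0:
          clips.append(current)

      for i, clip in enumerate(clips):
          out = clip.set_frame_rate(22050).set_channels(1).set_sample_width(2)
          out.export(f"{i:05d}.wav", format="wav")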

  8. Persian Speech to Test dataset

    • data.niaid.nih.gov
    Updated Dec 28, 2022
    + more versions
    Cite
    Seyyed Mohammad Masoud Parpanchi (2022). Persian Speech to Test dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7486153
    Explore at:
    Dataset updated
    Dec 28, 2022
    Dataset authored and provided by
    Seyyed Mohammad Masoud Parpanchi
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Persian Speech to Text dataset is an open source dataset for training machine learning models for the task of transcribing audio files in the Persian language into text. It is the largest open source dataset of its kind, with a size of approximately 60GB of data. The dataset consists of audio files in the WAV format and their transcripts in CSV file format. This dataset is a valuable resource for researchers and developers working on natural language processing tasks involving the Persian language, and it provides a large and diverse set of data to train and evaluate machine learning models on. The open source nature of the dataset means that it is freely available to be used and modified by anyone, making it an important resource for advancing research and development in the field.

  9. Axiom voice recognition dataset

    • data.niaid.nih.gov
    Updated Aug 2, 2024
    Cite
    Sara Ermini (2024). Axiom voice recognition dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_1218978
    Explore at:
    Dataset updated
    Aug 2, 2024
    Dataset provided by
    Nicola Bettin
    Antonio Rizzo
    Sara Ermini
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The AXIOM Voice Dataset's main purpose is to gather audio recordings from native Italian speakers. The collection was intended to obtain audio recording samples for training and testing the VIMAR algorithm implemented for the Smart Home scenario on the Axiom board. The final goal was to develop an efficient voice recognition system using machine learning algorithms. A team of UX researchers from the University of Siena collected data for five months and tested the voice recognition system on the AXIOM board [1]. The data acquisition process involved native Italian speakers who provided their written consent to participate in the research project. The participants were selected so as to maintain a cluster with different characteristics in gender, age, region of origin and background.

  10. SpeakingFaces Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Apr 28, 2022
    Cite
    Madina Abdrakhmanova; Askat Kuzdeuov; Sheikh Jarju; Yerbolat Khassanov; Michael Lewis; Huseyin Atakan Varol (2022). SpeakingFaces Dataset [Dataset]. https://paperswithcode.com/dataset/speakingfaces
    Explore at:
    Dataset updated
    Apr 28, 2022
    Authors
    Madina Abdrakhmanova; Askat Kuzdeuov; Sheikh Jarju; Yerbolat Khassanov; Michael Lewis; Huseyin Atakan Varol
    Description

    SpeakingFaces is a publicly-available large-scale dataset developed to support multimodal machine learning research in contexts that utilize a combination of thermal, visual, and audio data streams; examples include human-computer interaction (HCI), biometric authentication, recognition systems, domain transfer, and speech recognition. SpeakingFaces is comprised of well-aligned high-resolution thermal and visual spectra image streams of fully-framed faces synchronized with audio recordings of each subject speaking approximately 100 imperative phrases.

  11. 16kHz Conversational Speech Data | 35,000 Hours | Large Language Model(LLM)...

    • datarade.ai
    Updated Dec 9, 2023
    Cite
    Nexdata (2023). 16kHz Conversational Speech Data | 35,000 Hours | Large Language Model(LLM) Data | Speech AI Datasets|Machine Learning (ML) Data [Dataset]. https://datarade.ai/data-products/nexdata-multilingual-conversational-speech-data-16khz-mob-nexdata
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Dec 9, 2023
    Dataset authored and provided by
    Nexdata
    Area covered
    Turkey, Austria, Saudi Arabia, Malaysia, Canada, Indonesia, Germany, Ecuador, Vietnam, Korea (Republic of)
    Description
    1. Specifications

    Format: 16 kHz, 16-bit, uncompressed WAV, mono channel;

    Environment: quiet indoor environment, without echo;

    Recording content: no preset linguistic data; dozens of topics are specified, and the speakers hold a dialogue on those topics while the recording is performed;

    Demographics: speakers are evenly distributed across all age groups, covering children, teenagers, the middle-aged, the elderly, etc.;

    Annotation: transcription text, speaker identification, gender and noise symbols;

    Device: Android mobile phone, iPhone;

    Language: 100+ languages;

    Application scenarios: speech recognition; voiceprint recognition;

    Accuracy rate: the word accuracy rate is not less than 98%.
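
    A delivery against the specification above (16 kHz, 16-bit, uncompressed mono WAV) can be sanity-checked with Python's standard wave module. A minimal sketch; the directory name is a placeholder.

      # Minimal sketch: flag files that do not match the stated 16 kHz / 16-bit / mono spec.
      import wave
      from pathlib import Path

      for path in sorted(Path("delivery").rglob("*.wav")):     # placeholder directory
          with wave.open(str(path), "rb") as w:
              ok = (w.getframerate() == 16000
                    and w.getsampwidth() == 2                  # 16-bit samples
                    and w.getnchannels() == 1)                 # mono
          if not ok:
              print("spec mismatch:", path)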

    2. About Nexdata
    Nexdata owns off-the-shelf PB-level Large Language Model (LLM) data, 1 million hours of audio data and 800 TB of annotated imagery data. These ready-to-go Machine Learning (ML) data support instant delivery and quickly improve the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/speechrecog?source=Datarade
  12. Speech Dataset in Hindi Language

    • ieee-dataport.org
    Updated Jun 9, 2020
    Cite
    Shivam Shukla (2020). Speech Dataset in Hindi Language [Dataset]. https://ieee-dataport.org/open-access/speech-dataset-hindi-language
    Explore at:
    Dataset updated
    Jun 9, 2020
    Authors
    Shivam Shukla
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    mp3

  13. Dvoice: An open source dataset for Automatic Speech Recognition on Moroccan...

    • explore.openaire.eu
    • data.niaid.nih.gov
    Updated Sep 7, 2021
    Cite
    Imade Benelallam; Anass Allak; Abdou Mohamed Naira (2021). Dvoice : An open source dataset for Automatic Speech Recognition on Moroccan dialectal Arabic [Dataset]. http://doi.org/10.5281/zenodo.5482550
    Explore at:
    Dataset updated
    Sep 7, 2021
    Authors
    Imade Benelallam; Anass Allak; Abdou Mohamed Naira
    Area covered
    Morocco
    Description

    Dialectal Voice is a community project initiated by AIOX Labs to facilitate voice recognition by intelligent systems. Today, the need for AI systems capable of recognizing the human voice is increasingly expressed within communities; however, for some languages such as Darija, there are not enough voice technology solutions. To meet this need, we proposed this program of iterative and interactive construction of a dialectal database, open to all, in order to help improve voice recognition and generation models.

  14. UrduSER: A Dataset for Urdu Speech Emotion Recognition

    • data.mendeley.com
    Updated Apr 28, 2025
    + more versions
    Cite
    Muhammad Zaheer Akhtar (2025). UrduSER: A Dataset for Urdu Speech Emotion Recognition [Dataset]. http://doi.org/10.17632/jcpfjnk5c2.4
    Explore at:
    Dataset updated
    Apr 28, 2025
    Authors
    Muhammad Zaheer Akhtar
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Speech Emotion Recognition (SER) is a rapidly evolving field of research aimed at identifying and categorizing emotional states through the analysis of speech signals. As SER holds significant socio-cultural and commercial importance, researchers are increasingly leveraging machine learning and deep learning techniques to drive advancements in this domain. A high-quality dataset is an essential resource for SER studies in any language. Despite Urdu being the 10th most spoken language globally, there is a significant lack of robust SER datasets, creating a research gap. Existing Urdu SER datasets are often limited by their small size, narrow emotional range, and repetitive content, reducing their applicability in real-world scenarios.

    To address this gap, the Urdu Speech Emotion Recognition (UrduSER) dataset was developed. This comprehensive dataset includes 3500 Urdu speech signals sourced from 10 professional actors, with an equal representation of male and female speakers from diverse age groups. The dataset encompasses seven emotional states: Angry, Fear, Boredom, Disgust, Happy, Neutral, and Sad. The speech samples were curated from a wide collection of Pakistani Urdu drama serials and telefilms available on YouTube, ensuring diversity and natural delivery. Unlike conventional datasets, which rely on predefined dialogs recorded in controlled environments, UrduSER features unique and contextually varied utterances, making it more realistic and applicable for practical applications.

    To ensure balance and consistency, the dataset contains 500 samples per emotional class, with 50 samples contributed by each actor for each emotion. Additionally, an accompanying Excel file provides detailed metadata for each recording, including the file name, duration, format, sample rate, actor details, emotional state, and corresponding Urdu dialog. This metadata enables researchers to efficiently organize and utilize the dataset for their specific needs. The UrduSER dataset underwent rigorous validation, integrating expert evaluation and model-based validation to ensure its reliability, accuracy, and overall suitability for advancing research and development in Urdu Speech Emotion Recognition.
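
    Because the metadata spreadsheet records the emotion and actor for every file, the stated balance (500 samples per emotion, 50 per actor per emotion) can be verified in a few lines of pandas. A minimal sketch; the spreadsheet name and the column names ("Emotion", "Actor") are assumptions to match against the actual file.

      # Minimal sketch: verify class balance from the UrduSER metadata spreadsheet.
      # The file name and column names are assumptions.
      import pandas as pd

      meta = pd.read_excel("UrduSER_metadata.xlsx")            # assumed file name

      per_emotion = meta.groupby("Emotion").size()             # expected: 500 per class
      per_actor = meta.groupby(["Emotion", "Actor"]).size()    # expected: 50 per pair

      print(per_emotion)
      print("balanced:", (per_emotion == 500).all() and (per_actor == 50).all())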

  15. 8kHz Conversational Speech Data | 15,000 Hours | Audio Data | Speech...

    • datarade.ai
    Updated Dec 10, 2023
    Cite
    Nexdata (2023). 8kHz Conversational Speech Data | 15,000 Hours | Audio Data | Speech Recognition Data| Machine Learning (ML) Data [Dataset]. https://datarade.ai/data-products/nexdata-multilingual-conversational-speech-data-8khz-tele-nexdata
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Dec 10, 2023
    Dataset authored and provided by
    Nexdata
    Area covered
    Vietnam, United Arab Emirates, Argentina, Romania, Philippines, Singapore, Netherlands, Czech Republic, United States of America, Poland
    Description
    1. Specifications

    Format: 8 kHz, 8-bit, u-law/a-law PCM, mono channel;

    Environment: quiet indoor environment, without echo;

    Recording content: no preset linguistic data; dozens of topics are specified, and the speakers hold a dialogue on those topics while the recording is performed;

    Demographics: speakers are evenly distributed across all age groups, covering children, teenagers, the middle-aged, the elderly, etc.;

    Annotation: transcription text, speaker identification, gender and noise symbols;

    Device: telephony recording system;

    Language: 100+ languages;

    Application scenarios: speech recognition; voiceprint recognition;

    Accuracy rate: the word accuracy rate is not less than 98%.

    2. About Nexdata
    Nexdata owns off-the-shelf PB-level Large Language Model (LLM) data, 1 million hours of audio data and 800 TB of annotated imagery data. These ready-to-go Machine Learning (ML) data support instant delivery and quickly improve the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/speechrecog?source=Datarade
  16. italian-speech-recognition-dataset

    • huggingface.co
    Updated Mar 12, 2025
    Cite
    UniData (2025). italian-speech-recognition-dataset [Dataset]. https://huggingface.co/datasets/UniDataPro/italian-speech-recognition-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 12, 2025
    Authors
    UniData
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Italian Speech Dataset for recognition task

    Dataset comprises 499 hours of telephone dialogues in Italian, collected from 670+ native speakers across various topics and domains, achieving an impressive 98% Word Accuracy Rate. It is designed for research in automatic speech recognition (ASR) systems. By utilizing this dataset, researchers and developers can advance their understanding and capabilities in natural language processing (NLP), speech recognition, and machine learning… See the full description on the dataset page: https://huggingface.co/datasets/UniDataPro/italian-speech-recognition-dataset.

  17. German Speech Recognition Dataset

    • unidata.pro
    wav
    Updated Mar 19, 2025
    Cite
    Unidata L.L.C-FZ (2025). German Speech Recognition Dataset [Dataset]. https://unidata.pro/datasets/german-speech-recognition-dataset/
    Explore at:
    Available download formats: wav
    Dataset updated
    Mar 19, 2025
    Dataset authored and provided by
    Unidata L.L.C-FZ
    Description

    Unidata’s German Speech Recognition dataset enhances AI transcription, ensuring precise speech-to-text conversion and language understanding

  18. Hate Speech and Offensive Language Detection

    • kaggle.com
    Updated Dec 2, 2023
    Cite
    The Devastator (2023). Hate Speech and Offensive Language Detection [Dataset]. https://www.kaggle.com/datasets/thedevastator/hate-speech-and-offensive-language-detection
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 2, 2023
    Dataset provided by
    Kaggle
    Authors
    The Devastator
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Hate Speech and Offensive Language Detection

    Hate Speech and Offensive Language Detection on Twitter

    By hate_speech_offensive (From Huggingface) [source]

    About this dataset

    This dataset, named hate_speech_offensive, is a meticulously curated collection of annotated tweets with the specific purpose of detecting hate speech and offensive language. The dataset primarily consists of English tweets and is designed to train machine learning models or algorithms in the task of hate speech detection. It should be noted that the dataset has not been divided into multiple subsets, and only the train split is currently available for use.

    The dataset includes several columns that provide valuable information for understanding each tweet's classification. The column count represents the total number of annotations provided for each tweet, whereas hate_speech_count signifies how many annotations classified a particular tweet as hate speech. On the other hand, offensive_language_count indicates the number of annotations categorizing a tweet as containing offensive language. Additionally, neither_count denotes how many annotations identified a tweet as neither hate speech nor offensive language.

    For researchers and developers aiming to create effective models or algorithms capable of detecting hate speech and offensive language on Twitter, this comprehensive dataset offers a rich resource for training and evaluation purposes

    How to use the dataset

    • Introduction:

    • Dataset Overview:

      • The dataset is presented in a CSV file format named 'train.csv'.
      • It consists of annotated tweets with information about their classification as hate speech, offensive language, or neither.
      • Each row represents a tweet along with the corresponding annotations provided by multiple annotators.
      • The main columns that will be essential for your analysis are: count (total number of annotations), hate_speech_count (number of annotations classifying a tweet as hate speech), offensive_language_count (number of annotations classifying a tweet as offensive language), neither_count (number of annotations classifying a tweet as neither hate speech nor offensive language).
    • Data Collection Methodology: The data collection methodology used to create this dataset involved obtaining tweets from Twitter's public API using specific search terms related to hate speech and offensive language. These tweets were then manually labeled by multiple annotators who reviewed them for classification purposes.

    • Data Quality: Although efforts have been made to ensure the accuracy of the data, it is important to acknowledge that annotations are subjective opinions provided by individual annotators. As such, there may be variations in classifications between annotators.

    • Preprocessing Techniques: Prior to training machine learning models or algorithms on this dataset, it is recommended to apply standard preprocessing techniques such as removing URLs, usernames/handles, special characters/punctuation marks, stop words removal, tokenization, stemming/lemmatization etc., depending on your analysis requirements.

    • Exploratory Data Analysis (EDA): Conducting EDA on the dataset will help you gain insights and understand the underlying patterns in hate speech and offensive language. Some potential analysis ideas include:

      • Distribution of tweet counts per classification category (hate speech, offensive language, neither).
      • Most common words/phrases associated with each class.
      • Co-occurrence analysis to identify correlations between hate speech and offensive language.
    • Building Machine Learning Models: To train models for automatic detection of hate speech and offensive language, you can follow these steps: a) Split the dataset into training and testing sets for model evaluation purposes. b) Choose appropriate features/
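
    Following the steps above, a minimal baseline can derive a majority-vote label from the annotation counts and fit a simple text classifier. This is a hedged sketch: the count columns follow the description, but the tweet-text column name ("tweet") is an assumption to verify against train.csv.

      # Minimal baseline sketch: majority-vote label + TF-IDF logistic regression.
      # Column names, especially "tweet", are assumptions.
      import pandas as pd
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import train_test_split
      from sklearn.pipeline import make_pipeline

      df = pd.read_csv("train.csv")
      label_cols = ["hate_speech_count", "offensive_language_count", "neither_count"]
      df["label"] = df[label_cols].idxmax(axis=1)      # majority-vote class per tweet

      X_train, X_test, y_train, y_test = train_test_split(
          df["tweet"], df["label"], test_size=0.2, stratify=df["label"], random_state=0
      )

      model = make_pipeline(TfidfVectorizer(min_df=2), LogisticRegression(max_iter=1000))
      model.fit(X_train, y_train)
      print("held-out accuracy:", model.score(X_test, y_test))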

    Research Ideas

    • Sentiment Analysis: This dataset can be used to train models for sentiment analysis on Twitter data. By classifying tweets as hate speech, offensive language, or neither, the dataset can help in understanding the sentiment behind different tweets and identifying patterns of negative or offensive language.
    • Hate Speech Detection: The dataset can be used to develop models that automatically detect hate speech on Twitter. By training machine learning algorithms on this annotated dataset, it becomes possible to create systems that can identify and flag hate speech in real-time, making social media platforms safer and more inclusive.
    • Content Moderation: Social media platforms can use this dataset to improve their content m...
  19. In The Wild (audio Deepfake)

    • kaggle.com
    zip
    Updated Apr 20, 2024
    Cite
    Abdalla Mohamed (2024). In The Wild (audio Deepfake) [Dataset]. https://www.kaggle.com/datasets/abdallamohamed312/in-the-wild-audio-deepfake
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Apr 20, 2024
    Authors
    Abdalla Mohamed
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    'In-the-Wild' Dataset We present a dataset of audio deepfakes (and corresponding benign audio) for a set of politicians and other public figures, collected from publicly available sources such as social networks and video streaming platforms. For n = 58 celebrities and politicians, we collect both bona-fide and spoofed audio. In total, we collect 20.8 hours of bona-fide and 17.2 hours of spoofed audio. On average, there are 23 minutes of bona-fide and 18 minutes of spoofed audio per speaker.

    The dataset is intended to be used for evaluating deepfake detection and voice anti-spoofing machine-learning models. It is especially useful to judge a model's capability to generalize to realistic, in-the-wild audio samples. Find more information in our paper, and download the dataset here.

    The most interesting deepfake detection models we used in our experiments are open-source on GitHub:

    RawNet 2, RawGAT-ST, PC-Darts. This dataset and the associated documentation are licensed under the Apache License, Version 2.0.

  20. Data from: Common Phone: A Multilingual Dataset for Robust Acoustic...

    • zenodo.org
    application/gzip
    Updated Jul 17, 2024
    Cite
    Philipp Klumpp; Tomás Arias-Vergara; Paula Andrea Pérez-Toro; Elmar Nöth; Juan Rafael Orozco-Arroyave (2024). Common Phone: A Multilingual Dataset for Robust Acoustic Modelling [Dataset]. http://doi.org/10.5281/zenodo.5846137
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Jul 17, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Philipp Klumpp; Tomás Arias-Vergara; Paula Andrea Pérez-Toro; Elmar Nöth; Juan Rafael Orozco-Arroyave
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Release Date: 17.01.22

    Welcome to Common Phone 1.0

    Legal Information

    Common Phone is a subset of the Common Voice corpus collected by Mozilla Corporation. By using Common Phone, you agree to the Common Voice Legal Terms. Common Phone is maintained and distributed by speech researchers at the Pattern Recognition Lab of Friedrich-Alexander-University Erlangen-Nuremberg (FAU) under the CC0 license.

    Like for Common Voice, you must not make any attempt to identify speakers that contributed to Common Phone.

    About Common Phone

    This corpus aims to provide a basis for Machine Learning (ML) researchers and enthusiasts to train and test their models against a wide variety of speakers, hardware/software ecosystems and acoustic conditions to improve generalization and availability of ML in real-world speech applications.
    The current version of Common Phone comprises 116.5 hours of speech samples, collected from 11,246 speakers in 6 languages:

    Language    Speakers (train / dev / test)    Hours (train / dev / test)
    English     4716 / 771 / 774                 14.1 / 2.3 / 2.3
    French      796 / 138 / 135                  13.6 / 2.3 / 2.2
    German      1176 / 202 / 206                 14.5 / 2.5 / 2.6
    Italian     1031 / 176 / 178                 14.6 / 2.5 / 2.5
    Spanish     508 / 88 / 91                    16.5 / 3.0 / 3.1
    Russian     190 / 34 / 36                    12.7 / 2.6 / 2.8
    Total       8417 / 1409 / 1420               85.8 / 15.2 / 15.5

    Presented train, dev and test splits are not identical to those shipped with Common Voice. Speaker separation among splits was realized by only using those speakers that had provided age and gender information. This information can only be provided as a registered user on the website. When logged in, the session ID of contributed recordings is always linked to your user, thus we could easily link recordings to individual speakers. Keep in mind this would not be possible for unregistered users, as their session ID changes if they decide to contribute more than once.
    During speaker selection, we considered that some speakers had contributed to more than one of the six Common Voice datasets (one for each language). In Common Phone, a speaker will only appear in one language.
    The dataset is structured as follows:

    • Six top-level directories, one for each language.
    • Each language folder contains:
      • [train|dev|test].csv files listing audio files, the respective speaker ID and plain-text transcript (a minimal loading sketch follows this list).
      • meta.csv provides speaker information: age group, gender, language, accent (if available) and which of the three splits this speaker was assigned to. File names match corresponding audio file names except their extension.
      • /grids/ contains phonetic transcription for every audio file in Praat TextGrid format.
      • /mp3/ contains audio files in mp3, identical to those of Common Voice, e.g., sampling rates have been preserved and may vary for different files.
      • /wav/ contains raw audio files in 16 bits/sample, 16 kHz single channel. They had been created from the original mp3 audios. We provide them for convenience, keep in mind that their source had undergone MP3-compression.
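
    Based on the layout above, one language's split can be loaded roughly as follows. This is a hedged sketch: the CSV column names are assumptions, and the language folder name should be checked against the release.

      # Minimal sketch: read the English train split and locate each WAV and TextGrid.
      # Column names ("file", "text") and the folder name "en" are assumptions.
      from pathlib import Path
      import pandas as pd

      lang_dir = Path("common_phone/en")               # assumed language folder
      train = pd.read_csv(lang_dir / "train.csv")

      for _, row in train.head(5).iterrows():
          stem = Path(str(row["file"])).stem
          wav = lang_dir / "wav" / f"{stem}.wav"
          grid = lang_dir / "grids" / f"{stem}.TextGrid"
          print(wav.exists(), grid.exists(), row.get("text", ""))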

    Where does the phonetic annotation come from?

    Phonetic annotation was computed via BAS Web Services. We used the regular Pipeline (G2P-MAUS) without ASR to create an alignment of text transcripts with audio signals. We chose International Phonetic Alphabet (IPA) output symbols as they work well even in a multi-lingual setup. Common Phone annotation comprises 101 phonetic symbols, including silence.

    Why Common Phone?

    • Large number of speakers and varying acoustic conditions to improve robustness of ML models
    • Time-aligned IPA phonetic transcription for every audio sample
    • Gender-balanced and age-group-matched (equal number of female/male speakers in every age group)
    • Support for six different languages to leverage multi-lingual approaches
    • Original MP3 files plus standard WAVE files

    Is there any publication available?

    Yes, a paper describing Common Phone in detail is currently under revision for LREC 2022. You can access a pre-print version on arXiv entitled “Common Phone: A Multilingual Dataset for Robust Acoustic Modelling”.
