https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for Common Voice Corpus 12.0
Dataset Summary
The Common Voice dataset consists of a unique MP3 and corresponding text file. Many of the 26119 recorded hours in the dataset also include demographic metadata like age, sex, and accent that can help improve the accuracy of speech recognition engines. The dataset currently consists of 17127 validated hours in 104 languages, but more voices and languages are always added. Take a look at the Languages page to… See the full description on the dataset page: https://huggingface.co/datasets/mozilla-foundation/common_voice_12_0.
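A minimal sketch of loading this corpus with the Hugging Face datasets library is shown below; it assumes you have accepted the dataset terms on the Hub and are authenticated, and depending on your datasets version you may also need trust_remote_code=True:

```python
# Hedged sketch: stream one English sample from Common Voice 12.0.
# Assumes the dataset terms were accepted on the Hugging Face Hub and that
# you are logged in (e.g. via `huggingface-cli login`).
from datasets import load_dataset

cv = load_dataset(
    "mozilla-foundation/common_voice_12_0",
    "en",            # language config; any Common Voice language code works
    split="train",
    streaming=True,  # avoids downloading the full corpus up front
)

sample = next(iter(cv))
print(sample["sentence"])                 # transcript text
print(sample["audio"]["sampling_rate"])   # decoded audio metadata
```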
https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for Common Voice Corpus 11.0
Dataset Summary
The Common Voice dataset consists of a unique MP3 and corresponding text file. Many of the 24210 recorded hours in the dataset also include demographic metadata like age, sex, and accent that can help improve the accuracy of speech recognition engines. The dataset currently consists of 16413 validated hours in 100 languages, but more voices and languages are always added. Take a look at the Languages page to… See the full description on the dataset page: https://huggingface.co/datasets/taqwa92/cm.trial.
https://creativecommons.org/publicdomain/zero/1.0/
Common Voice is a corpus of speech data read by users on the Common Voice website, and based upon text from a number of public domain sources like user submitted blog posts, old books, movies, and other public speech corpora. Its primary purpose is to enable the training and testing of automatic speech recognition (ASR) systems.
In Google Colab, I downloaded the .tar.gz from Common Voice (Mozilla), placed the compressed file in a folder, marked the folder as a dataset, and uploaded it directly.
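A rough sketch of that workflow, with placeholder paths for where the archive was uploaded and where it should be unpacked:

```python
# Unpack an uploaded Common Voice archive in Colab; both paths below are
# placeholders, not fixed locations.
import tarfile

archive_path = "/content/dataset/cv-corpus.tar.gz"   # hypothetical upload location
extract_dir = "/content/cv-corpus"

with tarfile.open(archive_path, "r:gz") as tar:
    tar.extractall(path=extract_dir)   # the archive contains clips/ and *.tsv metadata
```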
https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for Common Voice Corpus 4
Dataset Summary
The Common Voice dataset consists of a unique MP3 and corresponding text file. Many of the 4257 recorded hours in the dataset also include demographic metadata like age, sex, and accent that can help improve the accuracy of speech recognition engines. The dataset currently consists of 3401 validated hours in 40 languages, but more voices and languages are always added. Take a look at the Languages page to request a… See the full description on the dataset page: https://huggingface.co/datasets/mozilla-foundation/common_voice_4_0.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Release Date: 17.01.22
Welcome to Common Phone 1.0
Legal Information
Common Phone is a subset of the Common Voice corpus collected by Mozilla Corporation. By using Common Phone, you agree to the Common Voice Legal Terms. Common Phone is maintained and distributed by speech researchers at the Pattern Recognition Lab of Friedrich-Alexander-University Erlangen-Nuremberg (FAU) under the CC0 license.
Like for Common Voice, you must not make any attempt to identify speakers that contributed to Common Phone.
About Common Phone
This corpus aims to provide a basis for Machine Learning (ML) researchers and enthusiasts to train and test their models against a wide variety of speakers, hardware/software ecosystems and acoustic conditions to improve generalization and availability of ML in real-world speech applications.
The current version of Common Phone comprises 116.5 hours of speech samples, collected from 11,246 speakers in 6 languages:
| Language | Speakers (train / dev / test) | Hours (train / dev / test) |
|---|---|---|
| English | 4716 / 771 / 774 | 14.1 / 2.3 / 2.3 |
| French | 796 / 138 / 135 | 13.6 / 2.3 / 2.2 |
| German | 1176 / 202 / 206 | 14.5 / 2.5 / 2.6 |
| Italian | 1031 / 176 / 178 | 14.6 / 2.5 / 2.5 |
| Spanish | 508 / 88 / 91 | 16.5 / 3.0 / 3.1 |
| Russian | 190 / 34 / 36 | 12.7 / 2.6 / 2.8 |
| Total | 8417 / 1409 / 1420 | 85.8 / 15.2 / 15.5 |
The presented train, dev and test splits are not identical to those shipped with Common Voice. Speaker separation across splits was achieved by using only speakers who had provided age and gender information, which can only be supplied by registered users on the website. When a contributor is logged in, the session ID of their recordings is always linked to their user account, so recordings could easily be linked to individual speakers. Keep in mind that this would not be possible for unregistered users, whose session ID changes if they decide to contribute more than once.
During speaker selection, we considered that some speakers had contributed to more than one of the six Common Voice datasets (one for each language). In Common Phone, a speaker will only appear in one language.
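The selection criterion described above can be illustrated against the Common Voice per-language TSV metadata. In the sketch below the file name, column names, and split ratios are assumptions for illustration only, not the exact procedure used to build Common Phone:

```python
# Keep only utterances whose speaker provided age and gender, then split at
# the speaker level so no client_id appears in more than one split.
# File name and split ratios are illustrative assumptions.
import pandas as pd

df = pd.read_csv("validated.tsv", sep="\t")
df = df.dropna(subset=["age", "gender"])                 # registered speakers only

speakers = df["client_id"].drop_duplicates().sample(frac=1.0, random_state=0)
n = len(speakers)
train_ids = set(speakers[: int(0.8 * n)])
dev_ids = set(speakers[int(0.8 * n): int(0.9 * n)])

train = df[df["client_id"].isin(train_ids)]
dev = df[df["client_id"].isin(dev_ids)]
test = df[~df["client_id"].isin(train_ids | dev_ids)]
```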
The dataset is structured as follows:
Where does the phonetic annotation come from?
Phonetic annotation was computed via BAS Web Services. We used the regular Pipeline (G2P-MAUS) without ASR to create an alignment of text transcripts with audio signals. We chose International Phonetic Alphabet (IPA) output symbols as they work well even in a multi-lingual setup. Common Phone annotation comprises 101 phonetic symbols, including silence.
Why Common Phone?
Is there any publication available?
Yes, a paper describing Common Phone in detail is currently under revision for LREC 2022. You can access a pre-print version on arXiv entitled “Common Phone: A Multilingual Dataset for Robust Acoustic Modelling”.
https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for Common Voice Corpus 6.0
Dataset Summary
The Common Voice dataset consists of a unique MP3 and corresponding text file. Many of the 9261 recorded hours in the dataset also include demographic metadata like age, sex, and accent that can help improve the accuracy of speech recognition engines. The dataset currently consists of 7327 validated hours in 60 languages, but more voices and languages are always added. Take a look at the Languages page to request… See the full description on the dataset page: https://huggingface.co/datasets/mozilla-foundation/common_voice_6_0.
https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for Common Voice Corpus 17.0
Dataset Summary
The Common Voice dataset consists of a unique MP3 and corresponding text file. Many of the 31175 recorded hours in the dataset also include demographic metadata like age, sex, and accent that can help improve the accuracy of speech recognition engines. The dataset currently consists of 20408 validated hours in 124 languages, but more voices and languages are always added. Take a look at the Languages page to… See the full description on the dataset page: https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
The following dataset is intended to be used for gender recognition using audio files recorded in uncontrolled environments, taken from the Mozilla Common Voice Dataset 10.0. It consists of a table of descriptive statistical characteristics of the fundamental frequency for six tonal languages: Chinese (China), Chinese (Hong Kong), Chinese (Taiwan), Thai, Vietnamese, and Punjabi. In addition, it includes an estimate of the vocal tract length of each speaker.
This dataset contains 18 columns:
- 'client_id': Speaker ID from Mozilla Common Voice
- 'path': Name of the mp3 file
- 'sentence': The sentence spoken by the speaker
- 'age': Age in decades (teens, twenties, etc.)
- 'gender': Binary gender (male or female)
- 'duration': Duration of the mp3 in seconds
- 'vocal_tract_length': Vocal tract length in cm
- 'mean_F4': Mean of the fourth formant in Hz
- 'min_pitch': Minimum pitch of the whole pitch contour in Hz
- 'mean_pitch': Mean pitch of the whole pitch contour in Hz
- 'q1_pitch': First quartile of the whole pitch contour in Hz
- 'median_pitch': Median pitch of the whole pitch contour in Hz
- 'q3_pitch': Third quartile of the whole pitch contour in Hz
- 'max_pitch': Maximum pitch of the whole pitch contour in Hz
- 'stddev_pitch': Standard deviation of the whole pitch contour in Hz
- 'estimated_age': Nominal value (adult or teen)
- 'estimated_age_gender': Nominal value (adult-male, adult-female, teen-male and teen-female)
- 'language': Nominal value (Chinese (China), Chinese (Hong Kong), Chinese (Taiwan), Thai, Vietnamese, and Punjabi)
The methodology for the extraction of these characteristics was the following:
1) Only the audio files listed in the valid.tsv file of the respective language were analyzed (this file is contained in the Mozilla Common Voice Dataset, https://commonvoice.mozilla.org/en/datasets). The voiced speech was extracted using the Praat Vocal Toolkit algorithm (https://www.praatvocaltoolkit.com/extract-voiced-and-unvoiced.html).
2) The vocal tract length was calculated with the Vocal Toolkit algorithm (https://www.praatvocaltoolkit.com/calculate-vocal-tract-length.html) as follows: if the audio came from a teen, the maximum formant was set to 8000 Hz; otherwise it was set to 5000 Hz for men and 5500 Hz for women. Finally, the mean of the fourth formant was calculated for the windows with voiced speech only.
3) The fundamental frequency was calculated using the Praat software with the To Pitch (ac) option and the following parameters: a) time step (s) = 0.0 (= auto); b) pitch floor (Hz) = 75.0; c) max. number of candidates = 15; d) very accurate = True; e) silence threshold = 0.03; f) voicing threshold = 0.45; g) octave cost = 0.01; h) octave jump cost = 0.35; i) voiced/unvoiced cost = 0.14; j) pitch ceiling (Hz) = 350.
4) The statistical characteristics of the fundamental frequency were calculated only in the windows that were detected as voiced speech.
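For illustration, steps 3 and 4 can be reproduced with the parselmouth Python bindings for Praat. The parameter values below mirror the To Pitch (ac) settings listed above; the input file name is a placeholder and this is not the authors' original extraction script:

```python
# Compute the pitch contour with Praat's "To Pitch (ac)" via parselmouth and
# derive the pitch statistics over voiced frames only.
import numpy as np
import parselmouth

snd = parselmouth.Sound("clip.mp3")        # placeholder input file
pitch = snd.to_pitch_ac(
    time_step=None,                        # 0.0 (= auto)
    pitch_floor=75.0,
    max_number_of_candidates=15,
    very_accurate=True,
    silence_threshold=0.03,
    voicing_threshold=0.45,
    octave_cost=0.01,
    octave_jump_cost=0.35,
    voiced_unvoiced_cost=0.14,
    pitch_ceiling=350.0,
)

f0 = pitch.selected_array["frequency"]
f0 = f0[f0 > 0]                            # voiced frames only (step 4)
stats = {
    "min_pitch": f0.min(),
    "q1_pitch": np.percentile(f0, 25),
    "median_pitch": np.median(f0),
    "mean_pitch": f0.mean(),
    "q3_pitch": np.percentile(f0, 75),
    "max_pitch": f0.max(),
    "stddev_pitch": f0.std(ddof=1),
}
print(stats)
```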
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Common Voice is an audio dataset that consists of a unique MP3 and corresponding text file. There are 9,283 recorded hours in the dataset. The dataset also includes demographic metadata like age, sex, and accent. The dataset consists of 7,335 validated hours in 60 languages.
https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for Common Voice Corpus 10.0
Dataset Summary
The Common Voice dataset consists of a unique MP3 and corresponding text file. Many of the 20817 recorded hours in the dataset also include demographic metadata like age, sex, and accent that can help improve the accuracy of speech recognition engines. The dataset currently consists of 15234 validated hours in 96 languages, but more voices and languages are always added. Take a look at the Languages page… See the full description on the dataset page: https://huggingface.co/datasets/gogogogo-1/test.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This audio dataset contains the content types most commonly encountered in file fragment classification, such as music, speech, and phone calls.
https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for Common Voice Corpus 3
Dataset Summary
The Common Voice dataset consists of a unique MP3 and corresponding text file. Many of the 2454 recorded hours in the dataset also include demographic metadata like age, sex, and accent that can help improve the accuracy of speech recognition engines. The dataset currently consists of 1979 validated hours in 29 languages, but more voices and languages are always added. Take a look at the Languages page to request a… See the full description on the dataset page: https://huggingface.co/datasets/mozilla-foundation/common_voice_3_0.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Introducing the US English Scripted Monologue Speech Dataset for the Healthcare Domain, a voice dataset built to accelerate the development and deployment of English language automatic speech recognition (ASR) systems, with a sharp focus on real-world healthcare interactions.
This dataset includes over 6,000 high-quality scripted audio prompts recorded in US English, representing typical voice interactions found in the healthcare industry. The data is tailored for use in voice technology systems that power virtual assistants, patient-facing AI tools, and intelligent customer service platforms.
The prompts span a broad range of healthcare-specific interactions, such as:
To maximize authenticity, the prompts integrate linguistic elements and healthcare-specific terms such as:
These elements make the dataset exceptionally suited for training AI systems to understand and respond to natural healthcare-related speech patterns.
Every audio recording is accompanied by a verbatim, manually verified transcription.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Version 2.0, February 2022.
BirdVox-ANAFCC is a dataset of short audio waveforms, each of them containing a flight call from one of 14 birds of North America: four American sparrows, one cardinal, two thrushes, and seven New World warblers.
* American Tree Sparrow (ATSP)
* Chipping Sparrow (CHSP)
* Savannah Sparrow (SAVS)
* White-throated Sparrow (WTSP)
* Rose-breasted Grosbeak (RBGR)
* Gray-cheeked Thrush (GCTH)
* Swainson's Thrush (SWTH)
* American Redstart (AMRE)
* Bay-breasted Warbler (BBWA)
* Black-throated Blue Warbler (BTBW)
* Canada Warbler (CAWA)
* Common Yellowthroat (COYE)
* Mourning Warbler (MOWA)
* Ovenbird (OVEN)
It also contains other sounds which are often confused for one of the species above. These "confounding factors" encompass flight calls from other species of birds, vocalizations from non-avian animals, as well as some machine beeps.
BirdVox-ANAFCC results from an aggregation of various smaller datasets, integrated under a common taxonomy. For more details on this taxonomy, we refer the reader to [1]:
[1] Cramer, Lostanlen, Salamon, Farnsworth, Bello. Chirping up the right tree: Incorporating biological taxonomies into deep bioacoustic classifiers. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020.
The second version of the BirdVox-ANAFCC dataset (v2.0) contains flight calls from the BirdVox-full-night dataset. These flight calls were present in the ICASSP 2020 benchmark but did not appear in the initial release of BirdVox-ANAFCC.
BirdVox-ANAFCC contains the recordings as HDF5 files, sampled at 22,050 Hz, with a single channel (mono). Each HDF5 file contains flight call vocalizations of a particular species; its name ends in _original.h5. The name of the HDF5 dataset in each file is "waveforms", with the corresponding key for each audio recording varying in format depending on the data source.
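A minimal sketch of reading one of these files with h5py; the glob pattern follows the *_original.h5 naming, and per-recording key formats vary by source as noted above:

```python
# Read one species file and fetch its first waveform.
import glob
import h5py

path = sorted(glob.glob("*_original.h5"))[0]   # pick any species file
with h5py.File(path, "r") as f:
    waveforms = f["waveforms"]                 # one entry per recording
    first_key = next(iter(waveforms.keys()))
    clip = waveforms[first_key][()]            # mono waveform at 22,050 Hz
    print(path, first_key, clip.shape)
```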
The file taxonomy.yaml details the three-level taxonomy structure used in this dataset, reflected in dot-separated three-number codes. Additionally, at any level of the taxonomy, the numeric code "0" is reserved for "other" and the code "X" refers to unknown. For example, 1.1.0 corresponds to an American Sparrow with a species outside of our scope of interest, and 1.1.X corresponds to an American Sparrow of unknown species. At the top level (family), the "other" codes (0.*.*) deviate from the family-order-species scheme in order to capture a variety of other out-of-scope sounds, including anthropophony, non-avian biophony, and biophony of avians outside of the scope of interest.
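As an interpretation aid only (not part of the dataset's tooling), the code conventions can be expressed as a tiny helper; the level names used here are generic placeholders:

```python
# Expand a dot-separated taxonomy code using the "0" = other, "X" = unknown
# conventions described above.
def describe_code(code: str) -> str:
    levels = ("level 1 (family)", "level 2", "level 3 (species)")
    parts = []
    for level, token in zip(levels, code.split(".")):
        if token == "0":
            parts.append(f"{level}: other / out of scope")
        elif token == "X":
            parts.append(f"{level}: unknown")
        else:
            parts.append(f"{level}: class {token}")
    return ", ".join(parts)

print(describe_code("1.1.0"))   # American Sparrow of an out-of-scope species
print(describe_code("1.1.X"))   # American Sparrow of unknown species
```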
When BirdVox-ANAFCC is used for academic research, we would highly appreciate it if scientific publications of works partly based on this dataset cite the following publication:
Cramer, Lostanlen, Salamon, Farnsworth, Bello. Chirping up the right tree: Incorporating biological taxonomies into deep bioacoustic classifiers. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020.
The creation of this dataset was supported by NSF grants 1125098 (BIRDCAST) and 1633259 (BIRDVOX), a Google Faculty Award, the Leon Levy Foundation, and two anonymous donors.
Dataset created by Aurora Cramer, Vincent Lostanlen, Bill Evans, Andrew Farnsworth, Justin Salamon, and Juan Pablo Bello.
The BirdVox-ANAFCC dataset is offered free of charge under the terms of the Creative Commons Attribution International License: https://creativecommons.org/licenses/by/4.0/
The dataset and its contents are made available on an "as is" basis and without warranties of any kind, including without limitation satisfactory quality and conformity, merchantability, fitness for a particular purpose, accuracy or completeness, or absence of errors. Subject to any liability that may not be excluded or limited by law, the authors are not liable for, and expressly exclude all liability for, loss or damage however and whenever caused to anyone by any use of the BirdVox-ANAFCC dataset or any part of it.
Please help us improve BirdVox-ANAFCC by sending your feedback to: vincent.lostanlen@gmail.com and auroracramer@nyu.edu
In case of a problem, please include as many details as possible.
1.0, May 2020: initial version, paired with the ICASSP 2020 publication.
2.0, February 2022: added a missing dataset file (BirdVox-70k); updated name of first author (Aurora Cramer).
Jessie Barry, Ian Davies, Tom Fredericks, Jeff Gerbracht, Sara Keen, Holger Klinck, Anne Klingensmith, Ray Mack, Peter Marchetto, Ed Moore, Matt Robbins, Ken Rosenberg, and Chris Tessaglia-Hymes.
We thank contributors and maintainers of the Macaulay Library and the Xeno-Canto website.
We acknowledge that the land on which the data was collected is the unceded territory of the Cayuga nation, which is part of the Haudenosaunee (Iroquois) confederacy.
https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for Common Voice Corpus 9.0
Dataset Summary
The Common Voice dataset consists of a unique MP3 and corresponding text file. Many of the 20217 recorded hours in the dataset also include demographic metadata like age, sex, and accent that can help improve the accuracy of speech recognition engines. The dataset currently consists of 14973 validated hours in 93 languages, but more voices and languages are always added. Take a look at the Languages page to… See the full description on the dataset page: https://huggingface.co/datasets/mozilla-foundation/common_voice_9_0.
https://creativecommons.org/publicdomain/zero/1.0/
Description
• This dataset contains over 7500 fart recordings that were collected over a period of 37 months. I created this dataset for educational purposes, to allow others to study and experiment with audio, and signal processing more broadly.
• The files are in .wav format. I recorded all of the farts using a voice recording app in whatever environment I was in at the time. Thus, there may be background noise, people talking, and variations in volume and clarity (note: I generally step far away to fart if others are present). I validate every single recording and delete those with low volume, too much background noise, or non-fart sounds that are too similar to farts (e.g. phone vibrations). Over time, I have decreased my threshold for acceptable background noise and strive to have minimal or no extraneous noise. There are numerous types of farts as well. I did not record every fart I produced during the time period; rather, I recorded when I was able to do so. Thus, this data is not inclusive of all emitted farts.
• I did not perform any preprocessing on this data, to maintain its versatility. For most audio tasks, you may consider trimming files to a consistent duration, along with other common audio preprocessing techniques (a minimal trimming sketch follows this list).
• The files are named as integers, starting from 1. The order of files bears no significance.
• If you are using these files specifically for fart classification or fart recognition, please bear in mind that this data is biased towards my farts. Consequently, you may find that a model recognizes someone else's farts, such as your own, with different results.
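A minimal trimming sketch for the preprocessing suggestion above; the file name matches the integer naming scheme, while the sample rate and 3-second target are arbitrary choices rather than properties of the dataset:

```python
# Load one recording and trim or pad it to a fixed duration.
import librosa
import numpy as np

target_sr = 22050
target_len = 3 * target_sr                       # 3 seconds, arbitrary target

y, sr = librosa.load("1.wav", sr=target_sr, mono=True)
y, _ = librosa.effects.trim(y, top_db=30)        # drop leading/trailing silence
y = np.pad(y, (0, max(0, target_len - len(y))))[:target_len]
```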
Suggested Uses
• Unsupervised signal classification - You can experiment with categorizing farts without any preexisting knowledge of defining characteristics and potentially apply these learnings to other signal types - speech, radar, tv, radio, light, EEG.
• Supervised signal recognition - This dataset could be used to experiment with developing deep learning models capable of recognizing whether a sound is a fart. An interesting property of farts is variable frequencies and inconsistent durations.
• Sound effects creation - This dataset could be used by sound designers or audio engineers as a basis to create new sound effects for movies, video games, or other media. You could also simply use it as a publicly available and free source of farts.
• Education and outreach - Educators and scientists can use this dataset as an approach to better engage their audiences in signal processing and deep learning.
License
• This data is publicly and freely available to use and modify however you would like. There is no license and there are no limitations on use. I would appreciate being notified when this data is used publicly, purely for my own entertainment.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a Wolof text-to-speech (TTS) dataset. It contains recordings from two native Wolof voice actors (a male and a female voice).
Each actor recorded more than 20,000 sentences.
The notebook accompanying the dataset contains a brief analysis of the dataset and the code for creating the appropriate train/validation and test sets.
The files [male, female]train, [male, female]validation and [male, female]test are also provided to extract the corresponding audio files from data-commonvoice.zip.
The text data comes from news websites, Wikipedia, and self-curated text. With the help of our Wolof expert, we made sure that the text covers the different phonemes of the Wolof language.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Introducing the Italian Scripted Monologue Speech Dataset for the Healthcare Domain, a voice dataset built to accelerate the development and deployment of Italian language automatic speech recognition (ASR) systems, with a sharp focus on real-world healthcare interactions.
This dataset includes over 6,000 high-quality scripted audio prompts recorded in Italian, representing typical voice interactions found in the healthcare industry. The data is tailored for use in voice technology systems that power virtual assistants, patient-facing AI tools, and intelligent customer service platforms.
The prompts span a broad range of healthcare-specific interactions, such as:
To maximize authenticity, the prompts integrate linguistic elements and healthcare-specific terms such as:
These elements make the dataset exceptionally suited for training AI systems to understand and respond to natural healthcare-related speech patterns.
Every audio recording is accompanied by a verbatim, manually verified transcription.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset for this project comprises audio recordings of the operational states of belt conveyor rollers in a mining environment, covering three conditions: normal roller operation, roller shell cracking, and roller breakage. Combined with machine learning models, this dataset can be used for real-time diagnosis of roller operational states. The database contains two main folders: dataset and code.

The dataset folder includes three subfolders:
- wav: Contains 19 WAV files recorded from 19 microphones, capturing the audio data of belt conveyor rollers at a mining site. Of these, 17 files represent normal roller operation, 1 file captures the audio of a roller with shell cracking, and 1 file captures the audio of a roller with complete breakage.
- csv_dataset: Contains 10 subfolders, each holding audio feature datasets extracted from the WAV files with frame lengths ranging from 100 ms to 1000 ms. Each subfolder contains 19 CSV files, corresponding to the 19 audio recordings. The feature datasets within different frame-length subfolders should not be used interchangeably.
- test_dataset: Contains 17 audio feature datasets with a 200 ms frame length. These datasets include features from the 17 normal-operation recordings combined with features from the roller shell cracking and roller breakage recordings. The combined datasets are shuffled 100 times to ensure an even distribution of features from each operational state. This dataset was used in the paper for validating the accuracy and usability of the audio feature datasets for real-time monitoring of roller states.

The code folder contains two sets of code:
- Matlab code: Extracts 25 audio features from the WAV files and generates the 17 audio feature datasets using a 200 ms frame length.
- Python code: Validates the accuracy and usability of the audio feature datasets for real-time monitoring of belt conveyor roller operational states.

This dataset and code combination supports the real-time diagnosis of belt conveyor roller conditions and provides a foundation for validating the effectiveness of audio features in fault detection.
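As a hedged sketch of how the CSV feature datasets might feed a classifier (the file names below are placeholders, and the assumption that each row is one frame's 25-feature vector should be checked against the actual folder layout):

```python
# Train a simple classifier on frame-level features from the three conditions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

frames = []
for path, label in [
    ("csv_dataset/200ms/normal_roller_01.csv", "normal"),        # placeholder paths
    ("csv_dataset/200ms/roller_shell_cracking.csv", "cracking"),
    ("csv_dataset/200ms/roller_breakage.csv", "breakage"),
]:
    df = pd.read_csv(path)
    df["label"] = label
    frames.append(df)

data = pd.concat(frames, ignore_index=True)
X, y = data.drop(columns=["label"]), data["label"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```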
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
This dataset is the "development dataset" for the DCASE 2020 Challenge Task 2 "Unsupervised Detection of Anomalous Sounds for Machine Condition Monitoring" [task description].
The data comprises parts of ToyADMOS and the MIMII Dataset, consisting of the normal/anomalous operating sounds of six types of toy/real machines. Each recording is single-channel audio approximately 10 seconds in length that includes both a target machine's operating sound and environmental noise. The following six types of toy/real machines are used in this task:
Toy-car (ToyADMOS)
Toy-conveyor (ToyADMOS)
Valve (MIMII Dataset)
Pump (MIMII Dataset)
Fan (MIMII Dataset)
Slide rail (MIMII Dataset)
Recording procedure
The ToyADMOS dataset consists of normal/anomalous operating sounds of miniature machines (toys) collected with four microphones, and the MIMII dataset consists of those of real machines collected with eight microphones. Anomalous sounds in these datasets were collected by deliberately damaging the target machines. To simplify the task, we used only the first channel of multi-channel recordings; all recordings are regarded as single-channel recordings from a fixed microphone. The sampling rate of all signals has been downsampled to 16 kHz. From ToyADMOS, we used only IND-type data that contain the operating sounds of the entire operation (i.e., from start to stop) in a recording. We mixed a target machine sound with environmental noise, and only noisy recordings are provided as training/test data. For details of the recording procedure, please refer to the papers on ToyADMOS and the MIMII Dataset.
Data
We first define two important terms in this task: Machine Type and Machine ID. Machine Type means the kind of machine, which in this task is one of six: toy-car, toy-conveyor, valve, pump, fan, and slide rail. Machine ID is the identifier of each individual machine of the same type, of which there are three or four per Machine Type in the training dataset. Each Machine ID's dataset consists of (i) around 1,000 samples of normal sounds for training and (ii) 100-200 samples each of normal and anomalous sounds for the test. The given labels for each training/test sample are Machine Type, Machine ID, and condition (normal/anomaly). Machine Type information is given by the directory name, and Machine ID and condition information are given by the respective file names.
Directory structure
When you unzip the downloaded files from Zenodo, you can see the following directory structure. As described in the previous section, Machine Type information is given by directory name, and Machine ID and condition information are given by file name, as:
/dev_data
/ToyCar
/train (Only normal data for all Machine IDs are included.)
/normal_id_01_00000000.wav
...
/normal_id_01_00000999.wav
/normal_id_02_00000000.wav
...
/normal_id_04_00000999.wav
/test (Normal and anomaly data for all Machine IDs are included.)
/normal_id_01_00000000.wav
...
/normal_id_01_00000349.wav
/anomaly_id_01_00000000.wav
...
/anomaly_id_01_00000263.wav
/normal_id_02_00000000.wav
...
/anomaly_id_04_00000264.wav
/ToyConveyor (The other Machine Types have the same directory structure as ToyCar.)
/fan
/pump
/slider
/valve
The paths of audio files are:
"/dev_data//train/normal_id_[0-9]+.wav"
"/dev_data//test/normal_id_[0-9]+.wav"
"/dev_data//test/anomaly_id_[0-9]+.wav"
For example, the Machine Type and Machine ID of "/ToyCar/train/normal_id_01_00000000.wav" are "ToyCar" and "01", respectively, and its condition is normal. The Machine Type and Machine ID of "/fan/test/anomaly_id_00_00000000.wav" are "fan" and "00", respectively, and its condition is anomalous.
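A small helper showing how that metadata can be recovered from a file path; the regular expression below is an illustration of the naming scheme, not code shipped with the dataset:

```python
# Parse Machine Type, Machine ID and condition from a development-set path.
import re
from pathlib import Path

def parse_dev_path(path: str) -> dict:
    p = Path(path)
    machine_type = p.parts[-3]                                  # e.g. "ToyCar" or "fan"
    m = re.match(r"(normal|anomaly)_id_(\d+)_\d+\.wav", p.name)
    return {"machine_type": machine_type,
            "machine_id": m.group(2),
            "condition": m.group(1)}

print(parse_dev_path("dev_data/ToyCar/train/normal_id_01_00000000.wav"))
# {'machine_type': 'ToyCar', 'machine_id': '01', 'condition': 'normal'}
```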
Baseline system
A simple baseline system is available on the Github repository [URL]. The baseline system provides a simple entry-level approach that gives a reasonable performance in the dataset of Task 2. It is a good starting point, especially for entry-level researchers who want to get familiar with the anomalous-sound-detection task.
Conditions of use
This dataset was created jointly by NTT Corporation and Hitachi, Ltd. and is available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.
Publication
If you use this dataset, please cite all the following three papers:
Yuma Koizumi, Shoichiro Saito, Noboru Harada, Hisashi Uematsu, and Keisuke Imoto, "ToyADMOS: A Dataset of Miniature-Machine Operating Sounds for Anomalous Sound Detection," in Proc of Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2019. [pdf]
Harsh Purohit, Ryo Tanabe, Kenji Ichige, Takashi Endo, Yuki Nikaido, Kaori Suefusa, and Yohei Kawaguchi, “MIMII Dataset: Sound Dataset for Malfunctioning Industrial Machine Investigation and Inspection,” in Proc. 4th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), 2019. [pdf]
Yuma Koizumi, Yohei Kawaguchi, Keisuke Imoto, Toshiki Nakamura, Yuki Nikaido, Ryo Tanabe, Harsh Purohit, Kaori Suefusa, Takashi Endo, Masahiro Yasuda, and Noboru Harada, "Description and Discussion on DCASE2020 Challenge Task2: Unsupervised Anomalous Sound Detection for Machine Condition Monitoring," in Proc. 5th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), 2020. [pdf]
Feedback
If there is any problem, please contact us:
Yuma Koizumi, koizumi.yuma@ieee.org
Yohei Kawaguchi, yohei.kawaguchi.xk@hitachi.com
Keisuke Imoto, keisuke.imoto@ieee.org