A special corpus of Indian languages covering 13 major languages of India. It comprises 10,000+ spoken sentences/utterances each of monolingual and English speech, recorded by both male and female native speakers. Speech waveform files are available in .wav format along with the corresponding text. We hope that these recordings will be useful for researchers and speech technologists working on synthesis and recognition. You can request zip archives of the entire database here.
https://choosealicense.com/licenses/other/
Indic TTS Malayalam Speech Corpus
The Malayalam subset of Indic TTS Corpus, taken from this Kaggle database. The corpus contains one male and one female speaker, with a 2:1 ratio of samples due to missing files for the female speaker. The license is given in the repository.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
ROOTS Subset: roots_indic-ta_wikiquote
wikiquote_filtered
Dataset uid: wikiquote_filtered
Description
Homepage
Licensing
Speaker Locations
Sizes
0.0462 % of total, 0.1697 % of en, 0.0326 % of fr, 0.0216 % of ar, 0.0066 % of zh, 0.0833 % of pt, 0.0357 % of es, 0.0783 % of indic-ta, 0.0361 % of indic-hi, 0.0518 % of ca, 0.0405 % of vi, 0.0834 % of indic-ml, 0.0542 % of indic-te, 0.1172 % of indic-gu, 0.0634 % of… See the full description on the dataset page: https://huggingface.co/datasets/bigscience-data/roots_indic-ta_wikiquote.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
ROOTS Subset: roots_indic-te_wikipedia
wikipedia
Dataset uid: wikipedia
Description
Homepage
Licensing
Speaker Locations
Sizes
3.2299 % of total, 4.2071 % of en, 5.6773 % of ar, 3.3416 % of fr, 5.2815 % of es, 12.4852 % of ca, 0.4288 % of zh, 0.4286 % of zh, 5.4743 % of indic-bn, 8.9062 % of indic-ta, 21.3313 % of indic-te, 4.4845 % of pt, 4.0493 % of indic-hi, 11.3163 % of indic-ml, 22.5300 % of indic-ur, 4.4902 %… See the full description on the dataset page: https://huggingface.co/datasets/bigscience-data/roots_indic-te_wikipedia.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
ROOTS Subset: roots_indic-pa_wikibooks
wikibooks_filtered
Dataset uid: wikibooks_filtered
Description
Homepage
Licensing
Speaker Locations
Sizes
0.0897 % of total, 0.2591 % of en, 0.0965 % of fr, 0.1691 % of es, 0.2834 % of indic-hi, 0.2172 % of pt, 0.0149 % of zh, 0.0279 % of ar, 0.1374 % of vi, 0.5025 % of id, 0.3694 % of indic-ur, 0.5744 % of eu, 0.0769 % of ca, 0.0519 % of indic-ta, 0.1470 % of indic-mr, 0.0751 %… See the full description on the dataset page: https://huggingface.co/datasets/bigscience-data/roots_indic-pa_wikibooks.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
ROOTS Subset: roots_indic-mr_mkb
mkb
Dataset uid: mkb
Description
The Prime Minister's speeches (Mann Ki Baat) on All India Radio, translated into many languages.
Homepage
https://huggingface.co/datasets/mkb http://preon.iiit.ac.in/~jerin/bhasha/
Licensing
Speaker Locations
Sizes
0.0009 % of total, 0.0174 % of indic-ta, 0.0252 % of indic-ml, 0.0416 % of indic-mr, 0.0601 % of indic-gu, 0.0047 % of indic-bn… See the full description on the dataset page: https://huggingface.co/datasets/bigscience-data/roots_indic-mr_mkb.
Telugu Indic TTS Dataset
This dataset is derived from the Indic TTS Database project, specifically using the Telugu monolingual recordings from both male and female speakers. The dataset contains high-quality speech recordings with corresponding text transcriptions, making it suitable for text-to-speech (TTS) research and development.
Dataset Details
Language: Telugu; Total Duration: ~8.74 hours (Male: 4.47 hours, Female: 4.27 hours); Audio Format: WAV; Sampling Rate:… See the full description on the dataset page: https://huggingface.co/datasets/SPRINGLab/IndicTTS_Telugu.
Manipuri Indic TTS Dataset
This dataset is derived from the Indic TTS Database project, specifically using the Manipuri monolingual recordings from both male and female speakers. The dataset contains high-quality speech recordings with corresponding text transcriptions, making it suitable for text-to-speech (TTS) research and development.
Dataset Details
Language: Manipuri; Total Duration: ~20.75 hours (Male: 10.61 hours, Female: 10.14 hours); Audio Format: WAV; Sampling… See the full description on the dataset page: https://huggingface.co/datasets/SPRINGLab/IndicTTS_Manipuri.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
ROOTS Subset: roots_indic-hi_wikiversity
wikiversity_filtered
Dataset uid: wikiversity_filtered
Description
Homepage
Licensing
Speaker Locations
Sizes
0.0367 % of total, 0.1050 % of en, 0.1178 % of fr, 0.1231 % of pt, 0.0072 % of zh, 0.0393 % of es, 0.0076 % of ar, 0.0069 % of indic-hi
BigScience processing steps
Filters applied to: en
filter_wiki_user_titles… See the full description on the dataset page: https://huggingface.co/datasets/bigscience-data/roots_indic-hi_wikiversity.
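The "Sizes" lines in the ROOTS entries above pack several "percent of subset" pairs into a single run of text. As a rough illustration only, the following Python sketch splits such a line into (subset, percent) pairs. The function name `parse_size_shares` and the regular expression are assumptions made for this example and are not part of any BigScience tooling; the sample string is copied from the roots_indic-mr_mkb entry.

```python
import re

# Matches entries such as "0.0462 % of total" or "21.3313 % of indic-te".
# Truncated entries with no subset name (e.g. "4.4902 %…") are skipped.
SHARE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*%\s*of\s+([A-Za-z][\w-]*)")


def parse_size_shares(line):
    """Return (subset, percent) pairs in the order they appear in the line.

    A list of pairs is used instead of a dict because a subset name can
    appear more than once in a listing (for example "zh" in the Wikipedia entry).
    """
    return [(name, float(value)) for value, name in SHARE_PATTERN.findall(line)]


# Sample taken from the roots_indic-mr_mkb entry, including its truncation mark.
sample = (
    "0.0009 % of total 0.0174 % of indic-ta 0.0252 % of indic-ml "
    "0.0416 % of indic-mr 0.0601 % of indic-gu 0.0047 % of indic-bn… "
    "See the full description on the dataset page"
)

shares = parse_size_shares(sample)
for name, percent in shares:
    print(f"{name}: {percent} %")
```

Because the listings are truncated with "…", the parsed pairs may be incomplete; the dataset pages linked in each entry carry the full figures.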