License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
The dataset contains speech samples in English, German, and Spanish. Samples are equally balanced across languages, genders, and speakers.
More information at the spoken-language-dataset repository.
The project was inspired by the TopCoder contest Spoken Languages 2. The contest dataset contains 10-second speech samples recorded in 1 of 176 languages, all based on Bible readings. Unfortunately, in many cases there is a single speaker per language (male in most cases), and even worse, the same speaker appears in the test set. Such a dataset cannot lead to a good, general solution.
There were two possible ways forward:
The second approach was taken.
LibriVox recordings were used to prepare the dataset. Particular attention was paid to having a wide variety of unique speakers: high speaker variance forces the model to concentrate on language properties rather than on a specific voice. Samples are equally balanced across languages, genders, and speakers so that no subgroup is favoured. Finally, the dataset is divided into a train set and a test set; speakers present in the test set are not present in the train set, which helps estimate the generalization error.
The core of the train set is based on 420 minutes (2,520 samples) of original recordings. After applying several audio transformations (pitch, speed, and noise), the train set was extended to 12,180 minutes (73,080 samples). The test set contains 90 minutes (540 samples) of original recordings; no data augmentation has been applied to it.
The original recordings contain 90 unique speakers. The number of unique speakers was increased by adjusting pitch (8 different levels) and speed (8 different levels); after applying these transformations there are 1,530 unique speakers.
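As a quick check of the numbers above: pitch and speed shifts are counted as producing new "speakers" while noise overlays are not, and every original sample gains one copy per transformation level. A minimal sketch of that arithmetic (the constant names are ours):

```python
# Quick sanity check of the figures quoted above (constant names are ours).
ORIGINAL_SPEAKERS = 90
ORIGINAL_TRAIN_SAMPLES = 2520      # 420 minutes of original train recordings
PITCH_LEVELS, SPEED_LEVELS, NOISE_LEVELS = 8, 8, 12

# Pitch and speed shifts alter the voice, so each level counts as a new
# "speaker"; noise overlays do not.
unique_speakers = ORIGINAL_SPEAKERS * (1 + PITCH_LEVELS + SPEED_LEVELS)
print(unique_speakers)             # 1530

# Every original sample is kept, plus one copy per transformation level.
train_samples = ORIGINAL_TRAIN_SAMPLES * (1 + PITCH_LEVELS + SPEED_LEVELS + NOISE_LEVELS)
print(train_samples)               # 73080 samples, i.e. 12180 minutes
```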
The dataset is divided into two directories:
Each sample is a FLAC audio file with:
The original recordings are MP3 files, but they are converted to FLAC up front to avoid repeated lossy re-encoding during the transformations.
The filename of a sample has the following syntax:
(language)_(gender)_(recording ID).fragment(index)[.(transformation)(index)].flac
...and the variables:
- language: en, de, or es
- gender: m or f
- transformation: speed, pitch, or noise
- speed: 1-8
- pitch: 1-8
- noise: 1-12
For example:
es_m_f7d959494477e5e7e33d4666f15311c9.fragment9.speed8.flac
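The filename convention above can be parsed with a small regular expression. The sketch below is illustrative only; the group names and the assumption that the recording ID never contains a dot are ours:

```python
import re

# Pattern for: (language)_(gender)_(recording ID).fragment(index)[.(transformation)(index)].flac
FILENAME_RE = re.compile(
    r"^(?P<language>en|de|es)_"
    r"(?P<gender>m|f)_"
    r"(?P<recording_id>[^.]+)"            # assumed to contain no dots
    r"\.fragment(?P<fragment>\d+)"
    r"(?:\.(?P<transformation>speed|pitch|noise)(?P<level>\d+))?"  # absent for original samples
    r"\.flac$"
)

def parse_sample_name(filename: str) -> dict:
    """Split a sample filename into its labelled parts."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected filename: {filename}")
    return match.groupdict()

print(parse_sample_name("es_m_f7d959494477e5e7e33d4666f15311c9.fragment9.speed8.flac"))
# {'language': 'es', 'gender': 'm', 'recording_id': 'f7d959494477e5e7e33d4666f15311c9',
#  'fragment': '9', 'transformation': 'speed', 'level': '8'}
```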
The dataset was used to train the spoken language identification model. The trained model achieves a 97% F1 score on the test set. It also generalizes well, which was confirmed on real-life content. The fact that the samples are perfectly stratified was one of the reasons for reaching such high performance.
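For reference, a balanced multi-class F1 score of this kind can be computed with scikit-learn. This is a generic sketch with made-up predictions, not the evaluation code used for the published model:

```python
from sklearn.metrics import f1_score

# Hypothetical predictions for a handful of test samples, using the dataset's
# language codes (en, de, es).
y_true = ["en", "de", "es", "es", "de", "en"]
y_pred = ["en", "de", "es", "en", "de", "en"]

# Macro-averaged F1 weighs the three languages equally, which matches the
# dataset's balanced design.
print(f1_score(y_true, y_pred, average="macro"))
```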
Feel free to create your own model and share results!
Many residents of New York City speak more than one language; a number of them speak and understand non-English languages more fluently than English. This dataset, derived from the Census Bureau's American Community Survey (ACS), includes information on over 1.7 million limited English proficient (LEP) residents and a subset of that population called limited English proficient citizens of voting age (CVALEP) at the Community District level. There are 59 community districts throughout NYC, with each district being represented by a Community Board.
License: https://www.futurebeeai.com/policies/ai-data-license-agreement
The English TTS Monologue Speech Dataset is a professionally curated resource built to train realistic, expressive, and production-grade text-to-speech (TTS) systems. It contains studio-recorded long-form speech by trained native English voice artists, each contributing 1 to 2 hours of clean, uninterrupted monologue audio.
Unlike typical prompt-based datasets with short, isolated phrases, this collection features long-form, topic-driven monologues that mirror natural human narration. It includes content types that are directly useful for real-world applications, like audiobook-style storytelling, educational lectures, health advisories, product explainers, digital how-tos, formal announcements, and more.
All recordings are captured in professional studios using high-end equipment and under the guidance of experienced voice directors.
Only clean, production-grade audio makes it into the final dataset.
All voice artists are native English speakers with professional training or prior experience in narration. We ensure a diverse pool in terms of age, gender, and region to bring a balanced and rich vocal dataset.
Scripts are not generic or repetitive; they are professionally authored by domain experts to reflect real-world use cases. They avoid redundancy and include modern vocabulary, emotional range, and phonetically rich sentence structures.
While the script is used during the recording, we also provide post-recording updates to ensure the transcript reflects the final spoken audio. Minor edits are made to adjust for skipped or rephrased words.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Advancements in deep learning have revolutionized numerous real-world applications, including image recognition, visual question answering, and image captioning. Among these, image captioning has emerged as a critical area of research, with substantial progress achieved in Arabic, Chinese, Uyghur, Hindi, and predominantly English. However, despite Urdu being a morphologically rich and widely spoken language, research in Urdu image captioning remains underexplored due to a lack of resources. This study creates a new Urdu Image Captioning Dataset (UCID), called UC-23-RY, to fill the gaps in Urdu image captioning. The dataset contains 159,816 Urdu captions based on the Flickr30k dataset. The study also proposes deep learning architectures designed specifically for Urdu image captioning, including NASNetLarge-LSTM and ResNet-50-LSTM. The NASNetLarge-LSTM and ResNet-50-LSTM models achieved notable BLEU-1 scores of 0.86 and 0.84, respectively, as demonstrated through an evaluation assessing each model's impact on caption quality. The study thus provides a useful dataset and shows how well-suited sophisticated deep learning models are to improving automatic Urdu image captioning.
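For context, BLEU-1 is the unigram-only variant of BLEU. It can be computed with NLTK as sketched below; the tokenised captions are made up for illustration and are not taken from UC-23-RY:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical tokenised captions (reference vs. model output), for illustration only.
reference = [["a", "man", "rides", "a", "horse", "on", "the", "beach"]]
candidate = ["a", "man", "rides", "a", "horse", "near", "the", "beach"]

# weights=(1, 0, 0, 0) restricts BLEU to unigram precision, i.e. BLEU-1.
bleu_1 = sentence_bleu(
    reference,
    candidate,
    weights=(1, 0, 0, 0),
    smoothing_function=SmoothingFunction().method1,
)
print(round(bleu_1, 2))  # 0.88: 7 of the 8 candidate unigrams appear in the reference
```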
This dataset provides detailed information on over 8,000 English words, including nouns and their plural forms, which researchers can mine to study the structure of the English language.
The data covers over 8,000 different English words and is particularly useful for investigating the relationships between words and their inflected forms, such as nouns and their plurals.
See the dataset description for more information.
File: adjectives.csv
File: adverbs.csv
File: nouns.csv

| Column name | Description |
|:------------|:------------------------------------------------------------|
| 007         | The code name of the character. (String)                     |
| 007s        | The number of times the character has been used. (Integer)   |
File: plural-nouns.csv
File: verbs.csv

| Column name | Description                  |
|:------------|:-----------------------------|
| awake       | (adjective) to stop sleeping |
| awoke       | (verb) to stop sleeping      |
| awoken      | (verb) to stop sleeping      |
File: words-multiple-present-participle.csv

| Column name                    | Description                                                   |
|:-------------------------------|:--------------------------------------------------------------|
| Word                           | The word being described. (String)                            |
| Present Participle             | The present participle form of the word. (String)             |
| Present Participle Alternative | An alternative present participle form of the word. (String)  |
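A minimal way to explore these files is with pandas. The sketch below assumes the column layout shown in the tables above; the actual CSV headers may differ:

```python
import pandas as pd

# File and column names follow the tables above; adjust if the downloaded
# CSV headers differ.
nouns = pd.read_csv("nouns.csv")
participles = pd.read_csv("words-multiple-present-participle.csv")

print(nouns.head())

# Words that list more than one present-participle spelling.
print(participles[["Word", "Present Participle", "Present Participle Alternative"]].head())
```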
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
BLEU scores for different datasets in different languages.
License: https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the New Zealand English Scripted Monologue Speech Dataset for the Retail & E-commerce domain. This dataset is built to accelerate the development of English language speech technologies, especially for retail-focused automatic speech recognition (ASR), natural language processing (NLP), voicebots, and conversational AI applications.
This training dataset includes 6,000+ high-quality scripted audio recordings in New Zealand English, created to reflect real-world scenarios in the Retail & E-commerce sector. These prompts are tailored to improve the accuracy and robustness of customer-facing speech technologies.
This dataset includes a comprehensive set of retail-specific topics to ensure wide linguistic coverage for AI training:
To increase training utility, prompts include contextual data such as:
These additions help your models learn to recognize structured and unstructured retail-related speech.
Every audio file is paired with a verbatim transcription, ensuring consistency and alignment for model training.
Detailed metadata is included to support filtering, analysis, and model evaluation:
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Feature and textural extraction model with random weights and Urdu vector for a multimodal approach.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Division of training and testing images based on a split ratio.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time elapsed during training of the ResNet-50-LSTM and NASNetLarge-LSTM models.