13 datasets found
  1. Spoken Language Identification

    • kaggle.com
    zip
    Updated Jul 5, 2018
    Cite
    Tomasz (2018). Spoken Language Identification [Dataset]. https://www.kaggle.com/toponowicz/spoken-language-identification
    Explore at:
    zip (16022179692 bytes)
    Dataset updated
    Jul 5, 2018
    Authors
    Tomasz
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The dataset contains speech samples of English, German and Spanish languages. Samples are equally balanced between languages, genders and speakers.

    More information at the spoken-language-dataset repository.

    Background

    The project was inspired by the TopCoder contest Spoken Languages 2. The contest dataset contains 10-second speech samples recorded in 1 of 176 languages, and is based entirely on Bible readings. Unfortunately, in many cases there is only a single speaker per language (male in most cases). Even worse, the same speaker also appears in the test set. This cannot lead to a good, generic solution.

    There are two approaches we could take:

    • The first approach is to use a big dataset in which all voice and language properties (e.g. gender, age, accent) are equally represented. A good example is Common Voice from Mozilla. This most likely leads to the best performance; however, processing such a huge dataset is expensive and adding new languages is challenging.
    • The second approach is to use a small, handcrafted dataset and boost it with data augmentation. The advantage is that new languages can be added quickly. Last but not least, the dataset is small, so it can be processed quickly.

    The second approach has been taken.

    LibriVox recordings were used to prepare the dataset. Particular attention was paid to including a large variety of unique speakers: high speaker variance forces the model to concentrate on language properties rather than on a specific voice. Samples are equally balanced between languages, genders and speakers so as not to favour any subgroup. Finally, the dataset is divided into a train set and a test set; speakers present in the test set are not present in the train set, which helps estimate the generalization error.

    The core of the train set is based on 420 minutes (2520 samples) of original recordings. After applying several audio transformations (pitch, speed and noise), the train set was extended to 12180 minutes (73080 samples). The test set contains 90 minutes (540 samples) of original recordings; no data augmentation has been applied to it.

    The original recordings contain 90 unique speakers. The number of effective speakers was increased by adjusting pitch (8 different levels) and speed (8 different levels), so after the audio transformations there are 90 × (1 + 8 + 8) = 1530 unique speakers.
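
    The exact augmentation pipeline is documented in the linked spoken-language-dataset repository. Purely as an illustration, one pitch/speed/noise variant of a clip could be produced along the following lines with librosa; the function name, parameter values and file paths below are placeholders, not the dataset's actual settings.

        import librosa
        import numpy as np
        import soundfile as sf

        def make_variant(in_path, out_path, n_steps=0.0, rate=1.0, noise_level=0.0):
            # Illustrative augmentation: pitch shift, time stretch, additive noise.
            y, sr = librosa.load(in_path, sr=22050)              # dataset sample rate
            if n_steps:
                y = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
            if rate != 1.0:
                y = librosa.effects.time_stretch(y, rate=rate)
            if noise_level:
                y = y + noise_level * np.random.randn(len(y))
            sf.write(out_path, y, sr)

        # e.g. one hypothetical speed level out of the 8:
        make_variant("en_m_abc123.fragment1.flac",
                     "en_m_abc123.fragment1.speed3.flac", rate=1.05)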

    Data structure

    The dataset is divided into 2 directories:

    • train (73080 samples)
    • test (540 samples)

    Each sample is an FLAC audio file with:

    • sample rate: 22050
    • bit depth: 16
    • channels: 1
    • duration: 10 seconds (sharp)

    The original recordings are MP3 files, but they are converted to FLAC early on to avoid repeated lossy re-encoding during the transformations.
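
    Downloaded samples can be sanity-checked against the properties above with the soundfile package (a minimal sketch; the file path is made up):

        import soundfile as sf

        info = sf.info("train/en_m_abc123.fragment1.flac")    # illustrative path
        assert info.samplerate == 22050
        assert info.channels == 1
        assert info.subtype == "PCM_16"                        # 16-bit depth
        assert round(info.duration) == 10                      # 10-second clips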

    The filename of each sample has the following syntax:

    (language)_(gender)_(recording ID).fragment(index)[.(transformation)(index)].flac
    

    ...and variables:

    • language: en, de, or es
    • gender: m or f
    • recording ID: a hash of the URL
    • fragment index: 1-30
    • transformation: speed, pitch or noise
    • transformation index:
      • if speed: 1-8
      • if pitch: 1-8
      • if noise: 1-12

    For example:

    es_m_f7d959494477e5e7e33d4666f15311c9.fragment9.speed8.flac
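
    Filenames following this syntax can be taken apart with a small regular expression (an illustrative sketch, not part of the dataset's tooling):

        import re

        PATTERN = re.compile(
            r"(?P<language>en|de|es)_"
            r"(?P<gender>[mf])_"
            r"(?P<recording_id>[0-9a-f]+)"
            r"\.fragment(?P<fragment>\d+)"
            r"(?:\.(?P<transformation>speed|pitch|noise)(?P<level>\d+))?"
            r"\.flac$"
        )

        m = PATTERN.match("es_m_f7d959494477e5e7e33d4666f15311c9.fragment9.speed8.flac")
        print(m.group("language"), m.group("fragment"), m.group("transformation"), m.group("level"))
        # -> es 9 speed 8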
    

    Sample Model

    The dataset was used to train the spoken language identification model. The trained model achieves a 97% F1 score on the test set. It also generalizes well, which was confirmed on real-life content. The fact that the samples are perfectly stratified was one of the reasons for such high performance.

    Feel free to create your own model and share results!
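
    For reference, the F1 metric reported above can be computed with scikit-learn once a model produces language predictions for the 540 test clips (a generic sketch with made-up labels, not the author's evaluation code; the macro averaging mode is an assumption):

        from sklearn.metrics import f1_score

        # Illustrative ground-truth and predicted language labels for a few test clips.
        y_true = ["en", "de", "es", "en", "de", "es"]
        y_pred = ["en", "de", "es", "en", "es", "es"]
        print(f1_score(y_true, y_pred, average="macro"))   # macro-average over the 3 languages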

  2. Population and Languages of the Limited English Proficient (LEP) Speakers by...

    • data.cityofnewyork.us
    • catalog.data.gov
    csv, xlsx, xml
    Updated Apr 25, 2022
    + more versions
    Cite
    Civic Engagement Commission (CEC) (2022). Population and Languages of the Limited English Proficient (LEP) Speakers by Community District [Dataset]. https://data.cityofnewyork.us/City-Government/Population-and-Languages-of-the-Limited-English-Pr/ajin-gkbp
    Explore at:
    xlsx, xml, csv
    Dataset updated
    Apr 25, 2022
    Dataset authored and provided by
    Civic Engagement Commission (CEC)
    Description

    Many residents of New York City speak more than one language; a number of them speak and understand non-English languages more fluently than English. This dataset, derived from the Census Bureau's American Community Survey (ACS), includes information on over 1.7 million limited English proficient (LEP) residents and a subset of that population called limited English proficient citizens of voting age (CVALEP) at the Community District level. There are 59 community districts throughout NYC, with each district being represented by a Community Board.

  3. New Zealand English TTS Speech Dataset for Speech Synthesis

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). New Zealand English TTS Speech Dataset for Speech Synthesis [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/tts-monolgue-english-newzealand
    Explore at:
    wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    New Zealand
    Dataset funded by
    FutureBeeAI
    Description

    The English TTS Monologue Speech Dataset is a professionally curated resource built to train realistic, expressive, and production-grade text-to-speech (TTS) systems. It contains studio-recorded long-form speech by trained native English voice artists, each contributing 1 to 2 hours of clean, uninterrupted monologue audio.

    Unlike typical prompt-based datasets with short, isolated phrases, this collection features long-form, topic-driven monologues that mirror natural human narration. It includes content types that are directly useful for real-world applications, like audiobook-style storytelling, educational lectures, health advisories, product explainers, digital how-tos, formal announcements, and more.

    All recordings are captured in professional studios using high-end equipment and under the guidance of experienced voice directors.

    Recording & Audio Quality

    Audio Format: WAV, 48 kHz, available in 16-bit, 24-bit, and 32-bit depth
    SNR: Minimum 30 dB
    Channel: Mono
    Recording Duration: 20-30 minutes
    Recording Environment: Studio-controlled, acoustically treated rooms
    Per Speaker Volume: 1–2 hours of speech per artist
    Quality Control: Each file is reviewed and cleaned for common acoustic issues, including: reverberation, lip smacks, mouth clicks, thumping, hissing, plosives, sibilance, background noise, static interference, clipping, and other artifacts.

    Only clean, production-grade audio makes it into the final dataset.
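
    The stated 30 dB SNR floor can be spot-checked on delivered WAV files with a rough energy-based estimate (a simplifying sketch only: treating the quietest frames as the noise floor is an assumption, not FutureBeeAI's QC procedure):

        import numpy as np
        import soundfile as sf

        def rough_snr_db(path, frame=2048):
            # Crude SNR estimate: the quietest 10% of frames stand in for the noise floor.
            y, sr = sf.read(path)
            if y.ndim > 1:
                y = y.mean(axis=1)                      # fold to mono
            frames = [y[i:i + frame] for i in range(0, len(y) - frame, frame)]
            energies = np.array([np.mean(f ** 2) for f in frames])
            noise = np.mean(np.sort(energies)[: max(1, len(energies) // 10)])
            return 10 * np.log10(np.mean(energies) / max(noise, 1e-12))

        print(rough_snr_db("sample.wav"))               # expect >= 30 for accepted recordings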

    Voice Artist Selection

    All voice artists are native English speakers with professional training or prior experience in narration. We ensure a diverse pool in terms of age, gender, and region to bring a balanced and rich vocal dataset.

    Artist Profile:
    Gender: Male and Female
    Age Range: 20–60 years
    Regions: Native English-speaking regions of New Zealand
    Selection Process: All artists are screened, onboarded, and sample-approved using FutureBeeAI’s proprietary Yugo platform.

    Script Quality & Coverage

    Scripts are not generic or repetitive; they are professionally authored by domain experts to reflect real-world use cases. They avoid redundancy and include modern vocabulary, emotional range, and phonetically rich sentence structures.

    Word Count per Script: 3,000–5,000 words per 30-minute session
    Content Types:
    Storytelling
    Script and book reading
    Informational explainers
    Government service instructions
    E-commerce tutorials
    Motivational content
    Health & wellness guides
    Education & career advice
    Linguistic Design: Balanced punctuation, emotional range, modern syntax, and vocabulary diversity

    Transcripts & Alignment

    While the script is used during the recording, we also provide post-recording updates to ensure the transcript reflects the final spoken audio. Minor edits are made to adjust for skipped or rephrased words.

    Segmentation: Time-stamped at the sentence level, aligned to actual spoken delivery
    Format: Available in plain text and JSON
    Post-processing:
    Corrected for

  4. English language corpora.

    • plos.figshare.com
    xls
    Updated Jun 2, 2025
    + more versions
    Cite
    Rimsha Muzaffar; Syed Yasser Arafat; Junaid Rashid; Jungeun Kim; Usman Naseem (2025). English language corpora. [Dataset]. http://doi.org/10.1371/journal.pone.0320701.t002
    Explore at:
    xls
    Dataset updated
    Jun 2, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Rimsha Muzaffar; Syed Yasser Arafat; Junaid Rashid; Jungeun Kim; Usman Naseem
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Advancements in deep learning have revolutionized numerous real-world applications, including image recognition, visual question answering, and image captioning. Among these, image captioning has emerged as a critical area of research, with substantial progress achieved in Arabic, Chinese, Uyghur, Hindi, and predominantly English. However, despite Urdu being a morphologically rich and widely spoken language, research in Urdu image captioning remains underexplored due to a lack of resources. This study creates a new Urdu Image Captioning Dataset (UCID), called UC-23-RY, to fill the gaps in Urdu image captioning. The dataset's 159,816 Urdu captions were inspired by the Flickr30k dataset. The study also proposes deep learning architectures designed specifically for Urdu image captioning, including NASNetLarge-LSTM and ResNet-50-LSTM. The NASNetLarge-LSTM and ResNet-50-LSTM models achieved notable BLEU-1 scores of 0.86 and 0.84, respectively, as demonstrated through evaluations assessing each model's impact on caption quality. In addition, the study provides useful datasets and shows how well-suited sophisticated deep learning models are to improving automatic Urdu image captioning.

  5. Common English Parts-of-speech

    • kaggle.com
    zip
    Updated Nov 3, 2022
    Cite
    The Devastator (2022). Common English Parts-of-speech [Dataset]. https://www.kaggle.com/datasets/thedevastator/common-english-parts-of-speech
    Explore at:
    zip (1253764 bytes)
    Dataset updated
    Nov 3, 2022
    Authors
    The Devastator
    Description

    Common English Parts-of-speech

    Over 8,000 words and their plural forms

    About this dataset

    This dataset provides information on over 8,000 English words, including nouns and their plural forms. By mining this data, researchers can gain insight into how English word forms relate to one another.

    How to use the dataset

    This dataset covers over 8,000 English words, including nouns and their plural forms. It is particularly useful for investigating the relationships between words and their plural forms.

    Research Ideas

    • To create a program that can automatically generate plural forms of nouns.
    • To study the relationships between different words and their plural forms.
    • To develop a better understanding of the English language for non-native speakers.

    Acknowledgements

    License

    See the dataset description for more information.

    Columns

    File: adjectives.csv

    File: adverbs.csv

    File: nouns.csv

    | Column name | Description |
    |:------------|:------------|
    | 007         | The code name of the character. (String) |
    | 007s        | The number of times the character has been used. (Integer) |

    File: plural-nouns.csv

    File: verbs.csv

    | Column name | Description |
    |:------------|:------------|
    | awake       | (adjective) to stop sleeping |
    | awoke       | (verb) to stop sleeping |
    | awoken      | (verb) to stop sleeping |

    File: words-multiple-present-participle.csv

    | Column name                    | Description |
    |:-------------------------------|:------------|
    | Word                           | The word being described. (String) |
    | Present Participle             | The present participle form of the word. (String) |
    | Present Participle Alternative | An alternative present participle form of the word. (String) |
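
    As a quick illustration of working with these files, the participle table can be loaded and filtered with pandas (a sketch only, assuming the CSV header matches the column names documented above):

        import pandas as pd

        df = pd.read_csv("words-multiple-present-participle.csv")
        # Keep rows where both an ordinary and an alternative present participle exist.
        both = df.dropna(subset=["Present Participle", "Present Participle Alternative"])
        print(both[["Word", "Present Participle", "Present Participle Alternative"]].head())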

  6. BLEU scores for different datasets in different languages.

    • plos.figshare.com
    xls
    Updated Jun 2, 2025
    Cite
    Rimsha Muzaffar; Syed Yasser Arafat; Junaid Rashid; Jungeun Kim; Usman Naseem (2025). BLEU scores for different datasets in different languages. [Dataset]. http://doi.org/10.1371/journal.pone.0320701.t009
    Explore at:
    xls
    Dataset updated
    Jun 2, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Rimsha Muzaffar; Syed Yasser Arafat; Junaid Rashid; Jungeun Kim; Usman Naseem
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    BLEU scores for different datasets in different languages.

  7. Manual Inspection and Correction.

    • plos.figshare.com
    xls
    Updated Jun 2, 2025
    Cite
    Rimsha Muzaffar; Syed Yasser Arafat; Junaid Rashid; Jungeun Kim; Usman Naseem (2025). Manual Inspection and Correction. [Dataset]. http://doi.org/10.1371/journal.pone.0320701.t004
    Explore at:
    xls
    Dataset updated
    Jun 2, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Rimsha Muzaffar; Syed Yasser Arafat; Junaid Rashid; Jungeun Kim; Usman Naseem
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Advancements in deep learning have revolutionized numerous real-world applications, including image recognition, visual question answering, and image captioning. Among these, image captioning has emerged as a critical area of research, with substantial progress achieved in Arabic, Chinese, Uyghur, Hindi, and predominantly English. However, despite Urdu being a morphologically rich and widely spoken language, research in Urdu image captioning remains underexplored due to a lack of resources. This study creates a new Urdu Image Captioning Dataset (UCID), called UC-23-RY, to fill the gaps in Urdu image captioning. The dataset's 159,816 Urdu captions were inspired by the Flickr30k dataset. The study also proposes deep learning architectures designed specifically for Urdu image captioning, including NASNetLarge-LSTM and ResNet-50-LSTM. The NASNetLarge-LSTM and ResNet-50-LSTM models achieved notable BLEU-1 scores of 0.86 and 0.84, respectively, as demonstrated through evaluations assessing each model's impact on caption quality. In addition, the study provides useful datasets and shows how well-suited sophisticated deep learning models are to improving automatic Urdu image captioning.

  8. New Zealand Retail Scripted Monologue Speech Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). New Zealand Retail Scripted Monologue Speech Dataset [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/retail-scripted-speech-monologues-english-newzealand
    Explore at:
    wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    New Zealand
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the New Zealand English Scripted Monologue Speech Dataset for the Retail & E-commerce domain. This dataset is built to accelerate the development of English-language speech technologies, especially for use in retail-focused automatic speech recognition (ASR), natural language processing (NLP), voicebots, and conversational AI applications.

    Speech Data

    This training dataset includes 6,000+ high-quality scripted audio recordings in New Zealand English, created to reflect real-world scenarios in the Retail & E-commerce sector. These prompts are tailored to improve the accuracy and robustness of customer-facing speech technologies.

    Participant Diversity
    Speakers: 60 native English speakers from across New Zealand
    Geographic Coverage: Multiple New Zealand regions to ensure dialect and accent diversity
    Demographics: Participants aged 18 to 70, with a 60:40 male-to-female distribution
    Recording Details
    Nature of Recording: Scripted monologue-style speech prompts
    Duration: Each recording spans 5 to 30 seconds
    Audio Format: WAV format, mono channel, 16-bit depth, and 8kHz / 16kHz sample rates
    Environment: Recorded in quiet conditions, free from background noise and echo

    Topic Diversity

    This dataset includes a comprehensive set of retail-specific topics to ensure wide linguistic coverage for AI training:

    Customer Service Interactions
    Order Placement and Payment Processes
    Product and Service Inquiries
    Technical Support Queries
    General Information and Guidance
    Promotional and Sales Announcements
    Domain-Specific Service Statements

    Contextual Enrichment

    To increase training utility, prompts include contextual data such as:

    Region-Specific Names: Common New Zealand male and female names in diverse formats
    Addresses: Localized address variations spoken naturally
    Dates & Times: Realistic phrasing in delivery, promotions, and return policies
    Product References: Real-world product names, brands, and categories
    Numerical Data: Spoken numbers and prices used in transactions and offers
    Order IDs & Tracking Numbers: Common references in customer service calls

    These additions help your models learn to recognize structured and unstructured retail-related speech.

    Transcription

    Every audio file is paired with a verbatim transcription, ensuring consistency and alignment for model training.

    Content: Exact scripted prompts as spoken by the participant
    Format: Provided in plain text (.TXT) format with filenames matching the associated audio
    Quality Assurance: All transcripts are verified for accuracy by native English transcribers
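
    Because each transcript shares its filename with the corresponding audio, pairing the two for model training is straightforward (an illustrative sketch; the audio/ and transcripts/ directory layout is an assumption, not the delivered structure):

        from pathlib import Path

        audio_dir, text_dir = Path("audio"), Path("transcripts")   # hypothetical layout

        pairs = []
        for wav in sorted(audio_dir.glob("*.wav")):
            txt = text_dir / (wav.stem + ".txt")                   # matching filename stem
            if txt.exists():
                pairs.append((wav, txt.read_text(encoding="utf-8").strip()))

        print(len(pairs), "aligned (audio, transcript) pairs")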

    Metadata

    Detailed metadata is included to support filtering, analysis, and model evaluation:


  9. Parameter of Resnet-50 and NASNetLarge.

    • plos.figshare.com
    xls
    Updated Jun 2, 2025
    Cite
    Rimsha Muzaffar; Syed Yasser Arafat; Junaid Rashid; Jungeun Kim; Usman Naseem (2025). Parameter of Resnet-50 and NASNetLarge. [Dataset]. http://doi.org/10.1371/journal.pone.0320701.t011
    Explore at:
    xls
    Dataset updated
    Jun 2, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Rimsha Muzaffar; Syed Yasser Arafat; Junaid Rashid; Jungeun Kim; Usman Naseem
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Advancements in deep learning have revolutionized numerous real-world applications, including image recognition, visual question answering, and image captioning. Among these, image captioning has emerged as a critical area of research, with substantial progress achieved in Arabic, Chinese, Uyghur, Hindi, and predominantly English. However, despite Urdu being a morphologically rich and widely spoken language, research in Urdu image captioning remains underexplored due to a lack of resources. This study creates a new Urdu Image Captioning Dataset (UCID), called UC-23-RY, to fill the gaps in Urdu image captioning. The dataset's 159,816 Urdu captions were inspired by the Flickr30k dataset. The study also proposes deep learning architectures designed specifically for Urdu image captioning, including NASNetLarge-LSTM and ResNet-50-LSTM. The NASNetLarge-LSTM and ResNet-50-LSTM models achieved notable BLEU-1 scores of 0.86 and 0.84, respectively, as demonstrated through evaluations assessing each model's impact on caption quality. In addition, the study provides useful datasets and shows how well-suited sophisticated deep learning models are to improving automatic Urdu image captioning.

  10. Comparison of Urdu Image Captioning Studies.

    • plos.figshare.com
    xls
    Updated Jun 2, 2025
    Cite
    Rimsha Muzaffar; Syed Yasser Arafat; Junaid Rashid; Jungeun Kim; Usman Naseem (2025). Comparison of Urdu Image Captioning Studies. [Dataset]. http://doi.org/10.1371/journal.pone.0320701.t001
    Explore at:
    xls
    Dataset updated
    Jun 2, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Rimsha Muzaffar; Syed Yasser Arafat; Junaid Rashid; Jungeun Kim; Usman Naseem
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Advancements in deep learning have revolutionized numerous real-world applications, including image recognition, visual question answering, and image captioning. Among these, image captioning has emerged as a critical area of research, with substantial progress achieved in Arabic, Chinese, Uyghur, Hindi, and predominantly English. However, despite Urdu being a morphologically rich and widely spoken language, research in Urdu image captioning remains underexplored due to a lack of resources. This study creates a new Urdu Image Captioning Dataset (UCID), called UC-23-RY, to fill the gaps in Urdu image captioning. The dataset's 159,816 Urdu captions were inspired by the Flickr30k dataset. The study also proposes deep learning architectures designed specifically for Urdu image captioning, including NASNetLarge-LSTM and ResNet-50-LSTM. The NASNetLarge-LSTM and ResNet-50-LSTM models achieved notable BLEU-1 scores of 0.86 and 0.84, respectively, as demonstrated through evaluations assessing each model's impact on caption quality. In addition, the study provides useful datasets and shows how well-suited sophisticated deep learning models are to improving automatic Urdu image captioning.

  11. Feature and textural extraction model with random weights and urdu vector...

    • plos.figshare.com
    xls
    Updated Jun 2, 2025
    Cite
    Rimsha Muzaffar; Syed Yasser Arafat; Junaid Rashid; Jungeun Kim; Usman Naseem (2025). Feature and textural extraction model with random weights and urdu vector for a multimodal approach. [Dataset]. http://doi.org/10.1371/journal.pone.0320701.t010
    Explore at:
    xls
    Dataset updated
    Jun 2, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Rimsha Muzaffar; Syed Yasser Arafat; Junaid Rashid; Jungeun Kim; Usman Naseem
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Feature and textural extraction model with random weights and urdu vector for a multimodal approach.

  12. Division of training and testing images based on a split ratio.

    • plos.figshare.com
    xls
    Updated Jun 2, 2025
    Cite
    Rimsha Muzaffar; Syed Yasser Arafat; Junaid Rashid; Jungeun Kim; Usman Naseem (2025). Division of training and testing images based on a split ratio. [Dataset]. http://doi.org/10.1371/journal.pone.0320701.t006
    Explore at:
    xls
    Dataset updated
    Jun 2, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Rimsha Muzaffar; Syed Yasser Arafat; Junaid Rashid; Jungeun Kim; Usman Naseem
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Division of training and testing images based on a split ratio.

  13. Time elapsed during training of ResNet-50-LSTM and NASNetLarge-LSTM model.

    • plos.figshare.com
    xls
    Updated Jun 2, 2025
    Cite
    Rimsha Muzaffar; Syed Yasser Arafat; Junaid Rashid; Jungeun Kim; Usman Naseem (2025). Time elapsed during training of ResNet-50-LSTM and NASNetLarge-LSTM model. [Dataset]. http://doi.org/10.1371/journal.pone.0320701.t008
    Explore at:
    xls
    Dataset updated
    Jun 2, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Rimsha Muzaffar; Syed Yasser Arafat; Junaid Rashid; Jungeun Kim; Usman Naseem
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Time elapsed during training of ResNet-50-LSTM and NASNetLarge-LSTM model.

