96 datasets found
  1. ‘Languages spoken across various nations’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Feb 13, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Languages spoken across various nations’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-languages-spoken-across-various-nations-a8e8/latest
    Explore at:
    Dataset updated
    Feb 13, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Languages spoken across various nations’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/shubhamptrivedi/languages-spoken-across-various-nations on 13 February 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    I was fascinated by this type of data, as it gives a slight peek into the cultural diversity of a nation and the kind of literary work to be expected from it.

    Content

    This dataset is a collection of all the languages spoken by the different nations around the world. Nowadays, most nations are bilingual or even trilingual, which can be due to different cultures and groups of people living together in the same nation in harmony. This type of data can be very useful for linguistic research, market research, advertising purposes, and more.

    Acknowledgements

    This dataset was published on the site Infoplease, which is a general information website.

    Inspiration

    I think this dataset can be useful for understanding which type of literature publication would achieve maximum penetration of the market base.

    --- Original source retains full ownership of the source dataset ---

  2. 785 Million Language Translation Database for AI

    • kaggle.com
    Updated Aug 28, 2023
    Cite
    Ramakrishnan Lakshmanan (2023). 785 Million Language Translation Database for AI [Dataset]. https://www.kaggle.com/datasets/ramakrishnan1984/785-million-language-translation-database-ai-ml
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 28, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ramakrishnan Lakshmanan
    License

    http://www.gnu.org/licenses/lgpl-3.0.html

    Description

    Our groundbreaking translation dataset represents a monumental advancement in the field of natural language processing and machine translation. Comprising a staggering 785 million records, this corpus bridges language barriers by offering translations from English to an astonishing 548 languages. The dataset promises to be a cornerstone resource for researchers, engineers, and developers seeking to enhance their machine translation models, cross-lingual analysis, and linguistic investigations.

    Size of the dataset: 41 GB uncompressed, 20 GB compressed.

    Key Features:

    Scope and Scale: With a comprehensive collection of 785 million records, this dataset provides an unparalleled wealth of translated text. Each record consists of an English sentence paired with its translation in one of the 548 target languages, enabling multi-directional translation applications.

    Language Diversity: Encompassing translations into 548 languages, this dataset represents a diverse array of linguistic families, dialects, and scripts. From widely spoken languages to those with limited digital representation, the dataset bridges communication gaps on a global scale.

    Quality and Authenticity: The translations have been meticulously curated, verified, and cross-referenced to ensure high quality and authenticity. This attention to detail guarantees that the dataset is not only extensive but also reliable, serving as a solid foundation for machine learning applications. The data was collected from various open datasets for my personal ML projects and is now being shared for wider use.

    Use Case Versatility: Researchers and practitioners across a spectrum of domains can harness this dataset for a myriad of applications. It facilitates the training and evaluation of machine translation models, empowers cross-lingual sentiment analysis, aids in linguistic typology studies, and supports cultural and sociolinguistic investigations.

    Machine Learning Advancement: Machine translation models, especially neural machine translation (NMT) systems, can leverage this dataset to enhance their training. The large-scale nature of the dataset allows for more robust and contextually accurate translation outputs.

    Fine-tuning and Customization: Developers can fine-tune translation models using specific language pairs, offering a powerful tool for specialized translation tasks. This customization capability ensures that the dataset is adaptable to various industries and use cases.

    Data Format: The dataset is provided in a structured JSON format, facilitating easy integration into existing machine learning pipelines. This structured approach expedites research and experimentation. Each JSON record contains an English word and its equivalent translation. The data was exported from a MongoDB database to ensure record uniqueness; each record is unique, and the records are sorted.
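    The record layout described above (one English word paired with one translation per record) could be consumed as in this sketch; the field names `en`, `lang`, and `translation` are assumptions for illustration, not the dataset's documented schema:

```python
# Sketch of consuming one-word-per-record translation JSON.
# Field names ("en", "lang", "translation") are assumed, not documented.
records = [
    {"en": "water", "lang": "ta", "translation": "தண்ணீர்"},
    {"en": "water", "lang": "sw", "translation": "maji"},
]

def group_by_language(rows):
    """Bucket translation records by target language code."""
    buckets = {}
    for row in rows:
        buckets.setdefault(row["lang"], []).append((row["en"], row["translation"]))
    return buckets

pairs = group_by_language(records)
print(pairs["sw"])  # [('water', 'maji')]
```

    Grouping by target language like this is a typical first step before building per-language-pair training files for an NMT pipeline.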

    Access: The dataset is available for academic and research purposes, enabling the global AI community to contribute to and benefit from its usage. A well-documented API and sample code are provided to expedite exploration and integration.

    The English-to-548-languages translation dataset represents an incredible leap forward in advancing multilingual communication, breaking down barriers to understanding, and fostering collaboration on a global scale. It holds the potential to reshape how we approach cross-lingual communication, linguistic studies, and the development of cutting-edge translation technologies.

    Dataset Composition: The dataset is a culmination of translations from English, a widely spoken and understood language, into 548 distinct languages. Each language represents a unique linguistic and cultural background, providing a rich array of translation contexts. This diverse range of languages spans across various language families, regions, and linguistic complexities, making the dataset a comprehensive repository for linguistic research.

    Data Volume and Scale: With a staggering 785 million records, the dataset boasts an immense scale that captures a vast array of translations and linguistic nuances. Each translation entry consists of an English source text paired with its corresponding translation in one of the 548 target languages. This vast corpus allows researchers and practitioners to explore patterns, trends, and variations across languages, enabling the development of robust and adaptable translation models.

    Linguistic Coverage: The dataset covers an extensive set of languages, including but not limited to Indo-European, Afroasiatic, Sino-Tibetan, Austronesian, Niger-Congo, and many more. This broad linguistic coverage ensures that languages with varying levels of grammatical complexity, vocabulary richness, and syntactic structures are included, enhancing the applicability of translation models across diverse linguistic landscapes.

    Dataset Preparation: The translation ...

  3. jampatoisnli

    • huggingface.co
    Updated Jul 21, 2023
    + more versions
    Cite
    Ruth-Ann Armstrong (2023). jampatoisnli [Dataset]. https://huggingface.co/datasets/Ruth-Ann/jampatoisnli
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 21, 2023
    Authors
    Ruth-Ann Armstrong
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for JamPatoisNLI

      Dataset Summary
    

    JamPatoisNLI provides the first dataset for natural language inference in a creole language, Jamaican Patois. Many of the most-spoken low-resource languages are creoles. These languages commonly have a lexicon derived from a major world language and a distinctive grammar reflecting the languages of the original speakers and the process of language birth by creolization. This gives them a distinctive place in exploring the… See the full description on the dataset page: https://huggingface.co/datasets/Ruth-Ann/jampatoisnli.

  4. XLingHealth

    • huggingface.co
    Updated Feb 7, 2024
    Cite
    Georgia Tech CLAWS Lab (2024). XLingHealth [Dataset]. https://huggingface.co/datasets/claws-lab/XLingHealth
    Explore at:
    Dataset updated
    Feb 7, 2024
    Dataset authored and provided by
    Georgia Tech CLAWS Lab
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for "XLingHealth"

    XLingHealth is a Cross-Lingual Healthcare benchmark for clinical health inquiry that features the top four most spoken languages in the world: English, Spanish, Chinese, and Hindi.

      Statistics
    

    Dataset        Examples   Words (Q)       Words (A)
    HealthQA       1,134      7.72 ± 2.41     242.85 ± 221.88
    LiveQA         246        41.76 ± 37.38   115.25 ± 112.75
    MedicationQA   690        6.86 ± 2.83     61.50 ± 69.44

    Words (Q) and Words (A) represent the average number of words… See the full description on the dataset page: https://huggingface.co/datasets/claws-lab/XLingHealth.
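    The "mean ± standard deviation" word counts in the statistics table can be reproduced from raw per-question counts; a minimal sketch with invented counts (the real figures come from the HealthQA, LiveQA, and MedicationQA splits):

```python
from statistics import mean, stdev

# Invented word counts per question, for illustration only.
question_word_counts = [6, 7, 8, 9, 10]

mu = mean(question_word_counts)
sigma = stdev(question_word_counts)  # sample standard deviation
print(f"Words (Q): {mu:.2f} ± {sigma:.2f}")  # Words (Q): 8.00 ± 1.58
```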

  5. GlobalPhone Vietnamese

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Jun 26, 2017
    + more versions
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2017). GlobalPhone Vietnamese [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0322/
    Explore at:
    Dataset updated
    Jun 26, 2017
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    Description

    The GlobalPhone corpus, developed in collaboration with the Karlsruhe Institute of Technology (KIT), was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language-independent and language-adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322). In each language, about 100 sentences were read by each of 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16-bit, 16 kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering and false starts, and non-verbal effects like laughing and hesitations. Speaker information such as age, gender, and occupation, as well as information about the recording setup, complements the database.
The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers. Data is shortened by means of the shorten program written by Tony Robinson; alternatively, the data can be delivered unshortened. The Vietnamese part of GlobalPhone was collected in summer 2009. In total 160 speakers were recorded, 140 of them in the cities of Hanoi and Ho Chi Minh City in Vietnam, and an additional set of 20 speakers in Karlsruhe, Germany. All speakers are Vietnamese native speakers, covering the main dialectal variants from South and North Vietnam. Of these 160 speakers, 70 were female and 90 were male. The majority of speakers are well educated, being graduate students and engineers. The age distribution of the speakers ranges from 18 to 65 years. Each speaker read between 50 and 200 utterances from newspaper articles, corresponding to roughly 9.5 minutes of speech or 138 utterances per person; in total 22,112 utterances were recorded. The speech was recorded using a close-talking Sennheiser HM420 microphone in a push-to-talk scenario, using an in-house developed laptop-based data collection toolkit. All data were recorde...

  6. American English General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). American English General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-english-usa
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    United States
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the US English General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of English speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world US English communication.

    Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade English speech models that understand and respond to authentic American accents and dialects.

    Speech Data

    The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of US English. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 60 verified native US English speakers from FutureBeeAI’s contributor community.
    Regions: Representing various states of the United States of America to ensure dialectal diversity and demographic balance.
    Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.
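    The stated audio spec (stereo WAV, 16-bit depth, 16 kHz sample rate) can be verified programmatically; a sketch using Python's standard wave module on a synthetic in-memory file, standing in for a real dataset recording:

```python
import io
import wave

# Synthesise one second of stereo 16-bit 16 kHz silence in memory,
# standing in for a real dataset recording.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(2)       # stereo
    w.setsampwidth(2)       # 16-bit -> 2 bytes per sample
    w.setframerate(16000)   # 16 kHz
    w.writeframes(b"\x00\x00\x00\x00" * 16000)  # 16000 frames of silence

# Read it back and check the parameters match the stated spec.
buf.seek(0)
with wave.open(buf, "rb") as w:
    assert (w.getnchannels(), w.getsampwidth(), w.getframerate()) == (2, 2, 16000)
    duration = w.getnframes() / w.getframerate()
print(f"duration: {duration:.1f} s")  # duration: 1.0 s
```

    The same read-back check can be pointed at actual dataset files to validate a download before training.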

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
    High transcription accuracy, achieved through double QA pass, average WER < 5%

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
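    The "average WER < 5%" figure above refers to the standard word error rate; a minimal sketch of how WER is computed with the usual Levenshtein alignment over words (the reference and hypothesis strings are invented):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length,
    computed with a standard Levenshtein DP over words."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[-1][-1] / len(ref)

wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
print(f"WER = {wer:.3f}")  # one substitution over six words -> 0.167
```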

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.

    Usage and Applications

    This dataset is a versatile resource for multiple English speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for US English.
    Voice Assistants: Build smart assistants capable of understanding natural American conversations.
    <div style="margin-top:10px; margin-bottom: 10px; padding-left: 30px; display: flex; gap: 16px; align-items:

  7. English language corpora.

    • plos.figshare.com
    xls
    Updated Jun 2, 2025
    + more versions
    Cite
    Rimsha Muzaffar; Syed Yasser Arafat; Junaid Rashid; Jungeun Kim; Usman Naseem (2025). English language corpora. [Dataset]. http://doi.org/10.1371/journal.pone.0320701.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 2, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Rimsha Muzaffar; Syed Yasser Arafat; Junaid Rashid; Jungeun Kim; Usman Naseem
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Advancements in deep learning have revolutionized numerous real-world applications, including image recognition, visual question answering, and image captioning. Among these, image captioning has emerged as a critical area of research, with substantial progress achieved in Arabic, Chinese, Uyghur, Hindi, and predominantly English. However, despite Urdu being a morphologically rich and widely spoken language, research in Urdu image captioning remains underexplored due to a lack of resources. This study creates a new Urdu Image Captioning Dataset (UCID), called UC-23-RY, to fill the gaps in Urdu image captioning. The dataset contains 159,816 Urdu captions inspired by the Flickr30k dataset. Additionally, the study proposes deep learning architectures designed especially for Urdu image captioning, including NASNetLarge-LSTM and ResNet-50-LSTM. The NASNetLarge-LSTM and ResNet-50-LSTM models achieved notable BLEU-1 scores of 0.86 and 0.84 respectively, as demonstrated through an evaluation assessing each model's impact on caption quality. The study also provides useful datasets and shows how well-suited sophisticated deep learning models are for improving automatic Urdu image captioning.

  8. BLEU scores for different datasets in different languages.

    • plos.figshare.com
    xls
    Updated Jun 2, 2025
    Cite
    Rimsha Muzaffar; Syed Yasser Arafat; Junaid Rashid; Jungeun Kim; Usman Naseem (2025). BLEU scores for different datasets in different languages. [Dataset]. http://doi.org/10.1371/journal.pone.0320701.t009
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 2, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Rimsha Muzaffar; Syed Yasser Arafat; Junaid Rashid; Jungeun Kim; Usman Naseem
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    BLEU scores for different datasets in different languages.

  9. IndQNER: Indonesian Benchmark Dataset from the Indonesian Translation of the...

    • data.niaid.nih.gov
    Updated Jan 27, 2024
    Cite
    Gusmita, Ria Hari (2024). IndQNER: Indonesian Benchmark Dataset from the Indonesian Translation of the Quran [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7454891
    Explore at:
    Dataset updated
    Jan 27, 2024
    Dataset provided by
    Gusmita, Ria Hari
    Firmansyah, Asep Fajar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    IndQNER

    IndQNER is a Named Entity Recognition (NER) benchmark dataset that was created by manually annotating 8 chapters in the Indonesian translation of the Quran. The annotation was performed using a web-based text annotation tool, Tagtog, and the BIO (Beginning-Inside-Outside) tagging format. The dataset contains:

    3117 sentences

    62027 tokens

    2475 named entities

    18 named entity categories
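    The BIO (Beginning-Inside-Outside) format mentioned above marks each token as Beginning, Inside, or Outside an entity. A toy decoding sketch (the sentence and tags are invented for illustration; class names follow the list in this entry, with spaces replaced by underscores as a labeling convention assumed here):

```python
# Invented BIO-tagged example: "Musa received the Taurat at Sinai" (Indonesian).
tokens = ["Musa", "menerima", "Taurat", "di", "Sinai"]
tags   = ["B-Prophet", "O", "B-Holy_book", "O", "B-Geographical_location"]

def decode_bio(tokens, tags):
    """Collect (entity_text, class) spans from BIO-tagged tokens."""
    entities, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):            # a new entity starts here
            if current:
                entities.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)             # continue the open entity
        else:                               # "O": close any open entity
            if current:
                entities.append((" ".join(current), label))
            current, label = [], None
    if current:
        entities.append((" ".join(current), label))
    return entities

print(decode_bio(tokens, tags))
# [('Musa', 'Prophet'), ('Taurat', 'Holy_book'), ('Sinai', 'Geographical_location')]
```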

    Named Entity Classes

    The named entity classes were initially defined by analyzing the existing Quran concepts ontology. The initial classes were updated based on the information acquired during the annotation process. Finally, there are 20 classes, as follows:

    Allah

    Allah's Throne

    Artifact

    Astronomical body

    Event

    False deity

    Holy book

    Language

    Angel

    Person

    Messenger

    Prophet

    Sentient

    Afterlife location

    Geographical location

    Color

    Religion

    Food

    Fruit

    The book of Allah

    Annotation Stage

    There were eight annotators who contributed to the annotation process. They were informatics engineering students at the State Islamic University Syarif Hidayatullah Jakarta.

    Anggita Maharani Gumay Putri

    Muhammad Destamal Junas

    Naufaldi Hafidhigbal

    Nur Kholis Azzam Ubaidillah

    Puspitasari

    Septiany Nur Anggita

    Wilda Nurjannah

    William Santoso

    Verification Stage

    We found many named entity and class candidates during the annotation stage. To verify the candidates, we consulted Quran and Tafseer (content) experts who are lecturers at Quran and Tafseer Department at the State Islamic University Syarif Hidayatullah Jakarta.

    Dr. Eva Nugraha, M.Ag.

    Dr. Jauhar Azizy, MA

    Dr. Lilik Ummi Kultsum, MA

    Evaluation

    We evaluated the annotation quality of IndQNER by performing experiments in two settings: supervised learning (BiLSTM+CRF) and transfer learning (IndoBERT fine-tuning).

    Supervised Learning Setting

    The implementation of BiLSTM and CRF utilized IndoBERT to provide word embeddings. All experiments used a batch size of 16. These are the results:

    Maximum sequence length   Epochs   Precision   Recall   F1 score
    256                       10       0.94        0.92     0.93
    256                       20       0.99        0.97     0.98
    256                       40       0.96        0.96     0.96
    256                       100      0.97        0.96     0.96
    512                       10       0.92        0.92     0.92
    512                       20       0.96        0.95     0.96
    512                       40       0.97        0.95     0.96
    512                       100      0.97        0.95     0.96

    Transfer Learning Setting

    We performed several experiments with different parameters in IndoBERT fine-tuning. All experiments used a learning rate of 2e-5 and a batch size of 16. These are the results:

    Maximum sequence length   Epochs   Precision   Recall   F1 score
    256                       10       0.67        0.65     0.65
    256                       20       0.60        0.59     0.59
    256                       40       0.75        0.72     0.71
    256                       100      0.73        0.68     0.68
    512                       10       0.72        0.62     0.64
    512                       20       0.62        0.57     0.58
    512                       40       0.72        0.66     0.67
    512                       100      0.68        0.68     0.67
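    The F1 score reported in these experiments is the harmonic mean of precision and recall; a quick consistency check using the best supervised result reported here (P = 0.99, R = 0.97):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Best supervised-learning row: P = 0.99, R = 0.97 -> F1 rounds to 0.98
print(round(f1_score(0.99, 0.97), 2))  # 0.98
```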

    This dataset is also part of the NusaCrowd project which aims to collect Natural Language Processing (NLP) datasets for Indonesian and its local languages.

    How to Cite

    @InProceedings{10.1007/978-3-031-35320-8_12,
      author    = "Gusmita, Ria Hari and Firmansyah, Asep Fajar and Moussallem, Diego and Ngonga Ngomo, Axel-Cyrille",
      editor    = "M{\'e}tais, Elisabeth and Meziane, Farid and Sugumaran, Vijayan and Manning, Warren and Reiff-Marganiec, Stephan",
      title     = "IndQNER: Named Entity Recognition Benchmark Dataset from the Indonesian Translation of the Quran",
      booktitle = "Natural Language Processing and Information Systems",
      year      = "2023",
      publisher = "Springer Nature Switzerland",
      address   = "Cham",
      pages     = "170--185",
      abstract  = "Indonesian is classified as underrepresented in the Natural Language Processing (NLP) field, despite being the tenth most spoken language in the world with 198 million speakers. The paucity of datasets is recognized as the main reason for the slow advancements in NLP research for underrepresented languages. Significant attempts were made in 2020 to address this drawback for Indonesian. The Indonesian Natural Language Understanding (IndoNLU) benchmark was introduced alongside the IndoBERT pre-trained language model. The second benchmark, Indonesian Language Evaluation Montage (IndoLEM), was presented in the same year. These benchmarks support several tasks, including Named Entity Recognition (NER). However, all NER datasets are in the public domain and do not contain domain-specific datasets. To alleviate this drawback, we introduce IndQNER, a manually annotated NER benchmark dataset in the religious domain that adheres to a meticulously designed annotation guideline. Since Indonesia has the world's largest Muslim population, we build the dataset from the Indonesian translation of the Quran. The dataset includes 2475 named entities representing 18 different classes. To assess the annotation quality of IndQNER, we perform experiments with BiLSTM and CRF-based NER, as well as IndoBERT fine-tuning. The results reveal that the first model outperforms the second model, achieving 0.98 F1 points. This outcome indicates that IndQNER may be an acceptable evaluation benchmark for Indonesian NER tasks in the aforementioned domain, widening the research's domain range.",
      isbn      = "978-3-031-35320-8"
    }

    Contact

    If you have any questions or feedback, feel free to contact us at ria.hari.gusmita@uni-paderborn.de or ria.gusmita@uinjkt.ac.id

  10. Language spoken at Home (Census 2016)

    • digital-earth-pacificcore.hub.arcgis.com
    • cacgeoportal.com
    • +1more
    Updated May 26, 2019
    Cite
    Esri Australia (2019). Language spoken at Home (Census 2016) [Dataset]. https://digital-earth-pacificcore.hub.arcgis.com/datasets/esriau::language-spoken-at-home-census-2016/about
    Explore at:
    Dataset updated
    May 26, 2019
    Dataset provided by
    Esri (http://esri.com/)
    Authors
    Esri Australia
    Description

    Does the person speak a language other than English at home? This map looks at answers to this question from Census Night.
    Colour: For each SA1 geography, the colour indicates which language 'wins'. SA1 geographies left uncoloured are either tied between two languages or lack sufficient data.
    Colour Intensity: The colour intensity compares the value of the winner to all other values, showing its dominance over the other languages in the same geography.
    Notes: Only the top 6 languages for VIC are considered. Related resources: Census 2016 DataPacks, Predominance Visualisations, Source Code.
    Notice that while one language appears to dominate certain geographies, it doesn't necessarily represent the majority of the population. In fact, in most areas the predominant language makes up just a fraction of the population, due to the number of languages considered.
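    The winner-and-intensity rule described above can be sketched as follows; the SA1 codes and language counts are invented for illustration:

```python
# Invented per-SA1 counts of language spoken at home (English excluded).
sa1_counts = {
    "SA1-201": {"Mandarin": 120, "Greek": 80, "Italian": 60, "Vietnamese": 40},
    "SA1-202": {"Greek": 50, "Mandarin": 50},  # tied -> not coloured
}

def predominant(counts):
    """Return (winning language, dominance share), or None on a tie."""
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie between the top two languages
    winner, top = ranked[0]
    return winner, top / sum(counts.values())

print(predominant(sa1_counts["SA1-201"]))  # ('Mandarin', 0.4)
print(predominant(sa1_counts["SA1-202"]))  # None
```

    Note how the winner's share (0.4 here) can be well under half, matching the listing's caveat that the predominant language often represents only a fraction of the population.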

  11. Manual Inspection and Correction.

    • plos.figshare.com
    xls
    Updated Jun 2, 2025
    Cite
    Rimsha Muzaffar; Syed Yasser Arafat; Junaid Rashid; Jungeun Kim; Usman Naseem (2025). Manual Inspection and Correction. [Dataset]. http://doi.org/10.1371/journal.pone.0320701.t004
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 2, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Rimsha Muzaffar; Syed Yasser Arafat; Junaid Rashid; Jungeun Kim; Usman Naseem
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Advancements in deep learning have revolutionized numerous real-world applications, including image recognition, visual question answering, and image captioning. Among these, image captioning has emerged as a critical area of research, with substantial progress achieved in Arabic, Chinese, Uyghur, Hindi, and predominantly English. However, despite Urdu being a morphologically rich and widely spoken language, research in Urdu image captioning remains underexplored due to a lack of resources. This study creates a new Urdu Image Captioning Dataset (UCID), called UC-23-RY, to fill the gaps in Urdu image captioning. The dataset contains 159,816 Urdu captions inspired by the Flickr30k dataset. Additionally, the study proposes deep learning architectures designed especially for Urdu image captioning, including NASNetLarge-LSTM and ResNet-50-LSTM. The NASNetLarge-LSTM and ResNet-50-LSTM models achieved notable BLEU-1 scores of 0.86 and 0.84 respectively, as demonstrated through an evaluation assessing each model's impact on caption quality. The study also provides useful datasets and shows how well-suited sophisticated deep learning models are for improving automatic Urdu image captioning.

  12. Canadian English General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Canadian English General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-english-canada
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    Canada
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Canadian English General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of English speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Canadian English communication.

    Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade English speech models that understand and respond to authentic Canadian accents and dialects.

    Speech Data

    The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Canadian English. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 60 verified native Canadian English speakers from FutureBeeAI’s contributor community.
    Regions: Representing various provinces of Canada to ensure dialectal diversity and demographic balance.
    Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.
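The stated audio format (stereo WAV, 16-bit depth, 16 kHz) can be verified programmatically with Python's standard `wave` module. The file below is a generated placeholder, not part of the dataset:

```python
import wave, struct

def wav_properties(path):
    """Return (channels, sample_width_bytes, sample_rate, duration_seconds)."""
    with wave.open(path, "rb") as w:
        frames = w.getnframes()
        rate = w.getframerate()
        return (w.getnchannels(), w.getsampwidth(), rate, frames / rate)

# Create a tiny stereo, 16-bit, 16 kHz file to demonstrate the check
# (the file name is illustrative only).
with wave.open("sample.wav", "wb") as w:
    w.setnchannels(2)       # stereo
    w.setsampwidth(2)       # 16-bit
    w.setframerate(16000)   # 16 kHz
    w.writeframes(struct.pack("<2h", 0, 0) * 16000)  # 1 second of silence

channels, width, rate, dur = wav_properties("sample.wav")
print(f"{channels} ch, {width * 8}-bit, {rate} Hz, {dur:.1f} s")  # 2 ch, 16-bit, 16000 Hz, 1.0 s
```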

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
    High transcription accuracy, achieved through a double QA pass (average WER < 5%)

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
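The vendor's exact JSON schema is not published here, so the structure below is purely illustrative; it sketches how speaker-segmented, time-coded utterances of this kind are typically consumed:

```python
import json

# Hypothetical transcript structure -- the actual field names in the
# dataset's JSON files may differ.
transcript_json = """
{
  "recording_id": "conv_0001",
  "utterances": [
    {"speaker": "A", "start": 0.00, "end": 2.35, "text": "Hi, how are you?"},
    {"speaker": "B", "start": 2.40, "end": 4.10, "text": "Good, thanks!"},
    {"speaker": "A", "start": 4.20, "end": 4.60, "text": "[laughter]"}
  ]
}
"""

data = json.loads(transcript_json)
# Accumulate per-speaker talk time from the time-coded utterances.
talk_time = {}
for utt in data["utterances"]:
    talk_time[utt["speaker"]] = talk_time.get(utt["speaker"], 0.0) + utt["end"] - utt["start"]

print({s: round(t, 2) for s, t in talk_time.items()})  # {'A': 2.75, 'B': 1.7}
```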

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.

    Usage and Applications

    This dataset is a versatile resource for multiple English speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for Canadian English.
    Voice Assistants: Build smart assistants capable of understanding natural Canadian conversations.

  13. Global Country Information 2023

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Jun 15, 2024
    Cite
    Nidula Elgiriyewithana; Nidula Elgiriyewithana (2024). Global Country Information 2023 [Dataset]. http://doi.org/10.5281/zenodo.8165229
    Explore at:
    csvAvailable download formats
    Dataset updated
    Jun 15, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nidula Elgiriyewithana; Nidula Elgiriyewithana
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description

    This comprehensive dataset provides a wealth of information about all countries worldwide, covering a wide range of indicators and attributes. It encompasses demographic statistics, economic indicators, environmental factors, healthcare metrics, education statistics, and much more. With every country represented, this dataset offers a complete global perspective on various aspects of nations, enabling in-depth analyses and cross-country comparisons.

    Key Features

    • Country: Name of the country.
    • Density (P/Km2): Population density measured in persons per square kilometer.
    • Abbreviation: Abbreviation or code representing the country.
    • Agricultural Land (%): Percentage of land area used for agricultural purposes.
    • Land Area (Km2): Total land area of the country in square kilometers.
    • Armed Forces Size: Size of the armed forces in the country.
    • Birth Rate: Number of births per 1,000 population per year.
    • Calling Code: International calling code for the country.
    • Capital/Major City: Name of the capital or major city.
    • CO2 Emissions: Carbon dioxide emissions in tons.
    • CPI: Consumer Price Index, a measure of inflation and purchasing power.
    • CPI Change (%): Percentage change in the Consumer Price Index compared to the previous year.
    • Currency_Code: Currency code used in the country.
    • Fertility Rate: Average number of children born to a woman during her lifetime.
    • Forested Area (%): Percentage of land area covered by forests.
    • Gasoline_Price: Price of gasoline per liter in local currency.
    • GDP: Gross Domestic Product, the total value of goods and services produced in the country.
    • Gross Primary Education Enrollment (%): Gross enrollment ratio for primary education.
    • Gross Tertiary Education Enrollment (%): Gross enrollment ratio for tertiary education.
    • Infant Mortality: Number of deaths per 1,000 live births before reaching one year of age.
    • Largest City: Name of the country's largest city.
    • Life Expectancy: Average number of years a newborn is expected to live.
    • Maternal Mortality Ratio: Number of maternal deaths per 100,000 live births.
    • Minimum Wage: Minimum wage level in local currency.
    • Official Language: Official language(s) spoken in the country.
    • Out of Pocket Health Expenditure (%): Percentage of total health expenditure paid out-of-pocket by individuals.
    • Physicians per Thousand: Number of physicians per thousand people.
    • Population: Total population of the country.
    • Population: Labor Force Participation (%): Percentage of the population that is part of the labor force.
    • Tax Revenue (%): Tax revenue as a percentage of GDP.
    • Total Tax Rate: Overall tax burden as a percentage of commercial profits.
    • Unemployment Rate: Percentage of the labor force that is unemployed.
    • Urban Population: Percentage of the population living in urban areas.
    • Latitude: Latitude coordinate of the country's location.
    • Longitude: Longitude coordinate of the country's location.

    Potential Use Cases

    • Analyze population density and land area to study spatial distribution patterns.
    • Investigate the relationship between agricultural land and food security.
    • Examine carbon dioxide emissions and their impact on climate change.
    • Explore correlations between economic indicators such as GDP and various socio-economic factors.
    • Investigate educational enrollment rates and their implications for human capital development.
    • Analyze healthcare metrics such as infant mortality and life expectancy to assess overall well-being.
    • Study labor market dynamics through indicators such as labor force participation and unemployment rates.
    • Investigate the role of taxation and its impact on economic development.
    • Explore urbanization trends and their social and environmental consequences.
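Several of these use cases come down to correlating two columns of the CSV. A stdlib-only sketch of a Pearson correlation, using made-up toy values rather than actual figures from the dataset:

```python
import csv, io, math

# A toy slice in the dataset's CSV shape (values are illustrative).
raw = """Country,GDP,Life Expectancy
Aland,1.2,70.1
Bland,3.4,74.0
Cland,8.9,79.5
Dland,15.0,82.2
"""

rows = list(csv.DictReader(io.StringIO(raw)))
gdp = [float(r["GDP"]) for r in rows]
life = [float(r["Life Expectancy"]) for r in rows]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

r = pearson(gdp, life)
print(round(r, 3))  # close to 1 for this monotonically increasing toy data
```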
  14. Hyper-parameter of the trained model.

    • plos.figshare.com
    xls
    Updated Jun 2, 2025
    Cite
    Rimsha Muzaffar; Syed Yasser Arafat; Junaid Rashid; Jungeun Kim; Usman Naseem (2025). Hyper-parameter of the trained model. [Dataset]. http://doi.org/10.1371/journal.pone.0320701.t007
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Jun 2, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Rimsha Muzaffar; Syed Yasser Arafat; Junaid Rashid; Jungeun Kim; Usman Naseem
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Advancements in deep learning have revolutionized numerous real-world applications, including image recognition, visual question answering, and image captioning. Among these, image captioning has emerged as a critical area of research, with substantial progress achieved in Arabic, Chinese, Uyghur, Hindi, and predominantly English. However, despite Urdu being a morphologically rich and widely spoken language, research in Urdu image captioning remains underexplored due to a lack of resources. This study creates a new Urdu Image Captioning Dataset (UCID), called UC-23-RY, to fill this gap. The dataset contains 159,816 Urdu captions inspired by the Flickr30k dataset. The study also proposes deep learning architectures designed specifically for Urdu image captioning, including NASNetLarge-LSTM and ResNet-50-LSTM. The NASNetLarge-LSTM and ResNet-50-LSTM models achieved notable BLEU-1 scores of 0.86 and 0.84 respectively, as demonstrated through an evaluation assessing each model's impact on caption quality. The study thus contributes useful datasets and shows how well-suited sophisticated deep learning models are to improving automatic Urdu image captioning.

  15. GlobalPhone Portuguese (Brazilian)

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Jun 26, 2017
    + more versions
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2017). GlobalPhone Portuguese (Brazilian) [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0201/
    Explore at:
    Dataset updated
    Jun 26, 2017
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Area covered
    Brazil
    Description

    The GlobalPhone corpus, developed in collaboration with the Karlsruhe Institute of Technology (KIT), was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language-independent and language-adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).

    In each language, about 100 sentences were read by each of roughly 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary; the articles cover national and international political news as well as economic news. The speech is available in 16-bit, 16 kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering and false starts, and non-verbal effects like laughing and hesitations. Speaker information (age, gender, occupation, etc.) as well as information about the recording setup complement the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2,100 native adult speakers. Audio data is compressed with the "shorten" program written by Tony Robinson; alternatively, the data can be delivered unshortened.

    The Portuguese (Brazilian) corpus was produced using the Folha de Sao Paulo newspaper. It contains recordings of 102 speakers (54 male, 48 female) recorded in Porto Velho and Sao Paulo, Brazil. The age distribution is as follows: 6 speakers are below 19, 58 are between 20 and 29, 27 are between 30 and 39, 5 are between 40 and 49, and 5 are over 50 (one speaker's age is unknown).

  16. Speech dataset

    • marketplace.sshopencloud.eu
    Updated Apr 24, 2020
    Cite
    (2020). Speech dataset [Dataset]. https://marketplace.sshopencloud.eu/dataset/8ixgG5
    Explore at:
    Dataset updated
    Apr 24, 2020
    Description

    The real-world speech dataset consists of 3,686 segments of English speech spoken with different accents, provided by the Speech Processing Group at Brno University of Technology, Czech Republic. The majority of the data corresponds to an American accent; only 1.65% corresponds to one of seven other accents (these are referred to as outliers). The speech segments are represented by 400-dimensional so-called i-vectors, which are widely used state-of-the-art features for speaker and language recognition. Reference: Learning Outlier Ensembles: The Best of Both Worlds – Supervised and Unsupervised. Barbora Micenkova, Brian McWilliams, and Ira Assent, KDD ODD2 Workshop, 2014.
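Finding the rare-accent segments is an outlier-detection problem over the i-vector space. A toy, stdlib-only sketch that scores each vector by its distance from the data centroid, with synthetic low-dimensional vectors standing in for real 400-dimensional i-vectors:

```python
import random, math

random.seed(0)
DIM = 8  # real i-vectors are 400-dimensional; 8 keeps the toy fast

# Synthetic stand-ins: "majority" vectors cluster near the origin,
# "outlier" vectors are shifted -- mimicking the 1.65% accent outliers.
majority = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(200)]
outliers = [[random.gauss(5, 1) for _ in range(DIM)] for _ in range(4)]
data = majority + outliers

# Score each vector by its Euclidean distance to the centroid;
# far-from-centroid points are flagged as outliers.
centroid = [sum(v[d] for v in data) / len(data) for d in range(DIM)]

def dist(v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, centroid)))

ranked = sorted(range(len(data)), key=lambda i: dist(data[i]), reverse=True)
top4 = set(ranked[:4])
print(sorted(top4))  # [200, 201, 202, 203] -- the injected outliers
```

Real systems would use stronger detectors (e.g., ensembles, as in the cited workshop paper), but the centroid-distance score illustrates the setup.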

  17. Pashtu Language Digits Dataset (PLDD)

    • data.mendeley.com
    Updated Mar 25, 2022
    + more versions
    Cite
    khalil khan (2022). Pashtu Language Digits Dataset (PLDD) [Dataset]. http://doi.org/10.17632/zbyc7sgp63.2
    Explore at:
    Dataset updated
    Mar 25, 2022
    Authors
    khalil khan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Pashtu is a language spoken by more than 50 million people worldwide and is the national language of Afghanistan. It is also spoken in two of Pakistan's largest provinces, Khyber Pakhtunkhwa and Baluchistan. Although optical character recognition systems for other languages are well developed, very little work has been reported for Pashtu. As an initial step, we introduce this dataset for digit recognition.

  18. ‘Extinct Languages’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 28, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Extinct Languages’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-extinct-languages-6686/latest
    Explore at:
    Dataset updated
    Jan 28, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Extinct Languages’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/the-guardian/extinct-languages on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    A recent Guardian blog post asks: "How many endangered languages are there in the World and what are the chances they will die out completely?" The United Nations Educational, Scientific and Cultural Organization (UNESCO) regularly publishes a list of endangered languages, using a classification system that describes each language's degree of endangerment, up to and including extinction.

    Content

    The full detailed dataset includes names of languages, number of speakers, the names of countries where the language is still spoken, and the degree of endangerment. The UNESCO endangerment classification is as follows:

    • Vulnerable: most children speak the language, but it may be restricted to certain domains (e.g., home)
    • Definitely endangered: children no longer learn the language as a 'mother tongue' in the home
    • Severely endangered: language is spoken by grandparents and older generations; while the parent generation may understand it, they do not speak it to children or among themselves
    • Critically endangered: the youngest speakers are grandparents and older, and they speak the language partially and infrequently
    • Extinct: there are no speakers left
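For analysis or plotting, the UNESCO scale can be encoded ordinally. The numeric values below are an arbitrary choice (not part of the UNESCO standard), and the example languages are purely illustrative:

```python
# Ordinal encoding of the UNESCO endangerment scale; the numbers are
# an arbitrary convention chosen here, not defined by UNESCO.
UNESCO_SCALE = {
    "Vulnerable": 1,
    "Definitely endangered": 2,
    "Severely endangered": 3,
    "Critically endangered": 4,
    "Extinct": 5,
}

# Illustrative (language, classification) pairs, not dataset rows.
languages = [
    ("Sicilian", "Vulnerable"),
    ("Manx", "Critically endangered"),
    ("Breton", "Severely endangered"),
]

# Sort most-endangered first using the ordinal encoding.
ranked = sorted(languages, key=lambda lv: UNESCO_SCALE[lv[1]], reverse=True)
print([name for name, _ in ranked])  # ['Manx', 'Breton', 'Sicilian']
```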

    Acknowledgements

    Data was originally organized and published by The Guardian, and can be accessed via this Datablog post.

    Inspiration

    • How can you best visualize this data?
    • Which rare languages are more isolated (Sicilian, for example) versus more spread out? Can you come up with a hypothesis for why that is the case?
    • Can you compare the number of rare speakers with more relatable figures? For example, are there more Romani speakers in the world than there are residents in a small city in the United States?

    --- Original source retains full ownership of the source dataset ---

  19. Division of training and testing images based on a split ratio.

    • plos.figshare.com
    xls
    Updated Jun 2, 2025
    Cite
    Rimsha Muzaffar; Syed Yasser Arafat; Junaid Rashid; Jungeun Kim; Usman Naseem (2025). Division of training and testing images based on a split ratio. [Dataset]. http://doi.org/10.1371/journal.pone.0320701.t006
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Jun 2, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Rimsha Muzaffar; Syed Yasser Arafat; Junaid Rashid; Jungeun Kim; Usman Naseem
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Division of training and testing images based on a split ratio.
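A split of this kind is commonly implemented as a deterministic shuffle followed by a ratio cut. The sketch below uses hypothetical image IDs and an assumed 80/20 ratio, not the paper's actual split:

```python
import random

def split_by_ratio(items, train_ratio=0.8, seed=42):
    """Shuffle deterministically, then split into train/test by ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded for reproducibility
    cut = int(len(items) * train_ratio)
    return items[:cut], items[cut:]

# Hypothetical image IDs standing in for the dataset's files.
image_ids = [f"img_{i:04d}.jpg" for i in range(100)]
train, test = split_by_ratio(image_ids, train_ratio=0.8)
print(len(train), len(test))  # 80 20
```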

  20. Mandarin General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Mandarin General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-mandarin-china
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Mandarin Chinese General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Mandarin speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Mandarin Chinese communication.

    Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Mandarin speech models that understand and respond to authentic Chinese accents and dialects.

    Speech Data

    The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Mandarin Chinese. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 60 verified native Mandarin Chinese speakers from FutureBeeAI’s contributor community.
    Regions: Representing various provinces of China to ensure dialectal diversity and demographic balance.
    Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
    High transcription accuracy, achieved through a double QA pass (average WER < 5%)

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.

    Usage and Applications

    This dataset is a versatile resource for multiple Mandarin speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for Mandarin Chinese.
    Voice Assistants: Build smart assistants capable of understanding natural Chinese conversations.
