29 datasets found
  1. The Most Spoken Languages Around the World

    • kaggle.com
    Updated Nov 4, 2020
    Cite
    Narmelan Tharmalingam (2020). The Most Spoken Languages Around the World [Dataset]. https://www.kaggle.com/narmelan/100-most-spoken-languages-around-the-world/code
    Explore at:
Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Nov 4, 2020
    Dataset provided by
Kaggle (http://kaggle.com/)
    Authors
    Narmelan Tharmalingam
    License

Open Database License (ODbL) v1.0, https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Area covered
    World
    Description

    Context

After going through quite the verbal loop when ordering foreign currency through the bank, which involved a follow-up discussion with an assigned financial advisor at the branch the next day to confirm details, I noticed that despite our names hinting at similar backgrounds, communicating by phone was much more difficult because of the advisor's thick accent and the different speech patterns of a non-native speaker.

It struck me then, coming from an extremely multicultural and welcoming city, what challenges people given completely different labels in life must face in their daily affairs when they run into the kind of communication barrier I had just encountered, particularly when interacting with those outside their usual bubble. Now imagine this situation occurring every hour across the world in various sectors of business. How might it impede, help, or frustrate, in minor or major ways, as workplace diversity quotas, customer satisfaction needs, and process efficiencies all increase their demands?

The data I was looking for to explore this phenomenon existed in the form of native and non-native speaker counts for the 100 most commonly spoken languages across the globe.

    Content

The database contains the following attributes (a short loading sketch in Python follows the list):

    • Language - name of the language
• Total Speakers - total count, including both native and non-native speakers
    • Native Speakers - native speakers of the language
• Origin - language family of the language
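
    A minimal sketch for loading and exploring these attributes, assuming the CSV has been downloaded from Kaggle (the file name and exact column labels are assumptions, not confirmed by the dataset page):

    import pandas as pd

    df = pd.read_csv("most_spoken_languages.csv")

    # Derive the non-native speaker count and share per language.
    df["Non-native Speakers"] = df["Total Speakers"] - df["Native Speakers"]
    df["Non-native Share"] = df["Non-native Speakers"] / df["Total Speakers"]

    # Languages most often spoken as a second language.
    print(df.sort_values("Non-native Share", ascending=False)
            .head(10)[["Language", "Non-native Share"]])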

    Acknowledgements

The data was collected with the aid of WordTips' visualization of the 22nd edition of Ethnologue, "a research center for language intelligence":

https://www.ethnologue.com/world
    https://www.ethnologue.com/guides/ethnologue200
    https://word.tips/pictures/b684e98f-f512-4ac0-96a4-0efcf6decbc0_most-spoken-languages-world-5.png?auto=compress,format&rect=0,0,2001,7115&w=800&h=2845

    Inspiration

As globalization removes the constraints of distance, what implications will this have for how organizational communications are conducted moving forward? I believe this should be examined in careful context in order to make customer relationship processes meaningful rather than confining them to a strictly detached, transactional basis.

  2. ASL 20-Words Dataset v1

    • kaggle.com
    Updated Nov 3, 2024
    Cite
    Hossam Magdy Balaha (2024). ASL 20-Words Dataset v1 [Dataset]. http://doi.org/10.34740/kaggle/dsv/9797396
    Explore at:
Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Nov 3, 2024
    Dataset provided by
Kaggle (http://kaggle.com/)
    Authors
    Hossam Magdy Balaha
    License

Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The Arabic Sign Language (ASL) 20-Words Dataset v1 was carefully designed to reflect natural conditions, aiming to capture realistic signing environments and circumstances. Recognizing that nearly everyone has access to a smartphone with a camera as of 2020, the dataset was specifically recorded using mobile phones, aligning with how people commonly record videos in daily life. This approach ensures the dataset is grounded in real-world conditions, enhancing its applicability for practical use cases.

    Each video in this dataset was recorded directly on the authors' smartphones, without any form of stabilization—neither hardware nor software. As a result, the videos vary in resolution and were captured across diverse locations, places, and backgrounds. This variability introduces natural noise and conditions, supporting the development of robust deep learning models capable of generalizing across environments.

    In total, the dataset comprises 8,467 videos of 20 sign language words, contributed by 72 volunteers aged between 20 and 24. Each volunteer performed each sign a minimum of five times, resulting in approximately 100 videos per participant. This repetition standardizes the data and ensures each sign is adequately represented across different performers. The dataset’s mean video count per sign is 423.35, with a standard deviation of 18.58, highlighting the balance and consistency achieved across the signs.
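
    As a quick consistency check on those figures (all numbers come straight from the paragraph above):

    videos, signs, volunteers = 8467, 20, 72
    print(videos / signs)       # 423.35 videos per sign, matching the stated mean
    print(videos / volunteers)  # ~117.6 videos per participant on average;
                                # the minimum is 20 signs x 5 repetitions = 100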

    For reference, Table 2 (in the research article) provides the count of videos for each sign, while Figure 2 (in the research article) offers a visual summary of the statistics for each word in the dataset. Additionally, sample frames from each word are displayed in Figure 3 (in the research article), giving a glimpse of the visual content captured.

    For in-depth insights into the methodology and the dataset's creation, see the research paper: Balaha, M.M., El-Kady, S., Balaha, H.M., et al. (2023). "A vision-based deep learning approach for independent-users Arabic sign language interpretation". Multimedia Tools and Applications, 82, 6807–6826. https://doi.org/10.1007/s11042-022-13423-9

    Please consider citing the following if you use this dataset:

    @misc{balaha_asl_2024_db,
     title={ASL 20-Words Dataset v1},
     url={https://www.kaggle.com/dsv/9783691},
     DOI={10.34740/KAGGLE/DSV/9783691},
     publisher={Kaggle},
     author={Mostafa Magdy Balaha and Sara El-Kady and Hossam Magdy Balaha and Mohamed Salama and Eslam Emad and Muhammed Hassan and Mahmoud M. Saafan},
     year={2024}
    }
    
    @article{balaha2023vision,
     title={A vision-based deep learning approach for independent-users Arabic sign language interpretation},
     author={Balaha, Mostafa Magdy and El-Kady, Sara and Balaha, Hossam Magdy and Salama, Mohamed and Emad, Eslam and Hassan, Muhammed and Saafan, Mahmoud M},
     journal={Multimedia Tools and Applications},
     volume={82},
     number={5},
     pages={6807--6826},
     year={2023},
     publisher={Springer}
    }
    

    This dataset is available under the CC BY-NC-SA 4.0 license, which allows for sharing and adaptation under conditions of non-commercial use, proper attribution, and distribution under the same license.

    For further inquiries or information: https://hossambalaha.github.io/.

3. Finnish General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Finnish General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-finnish-finland
    Explore at:
wav (available download format)
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    Finland
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Finnish General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Finnish speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Finnish communication.

Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Finnish speech models that understand and respond to authentic Finnish accents and dialects.

    Speech Data

    The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Finnish. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 60 verified native Finnish speakers from FutureBeeAI’s contributor community.
    Regions: Representing various provinces of Finland to ensure dialectal diversity and demographic balance.
    Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
High transcription accuracy achieved through a double QA pass (average WER < 5%)

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
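
    A short Python sketch for pairing a recording with its transcript (file names and JSON field names are assumptions; the listing specifies only that transcripts are speaker-segmented, time-coded JSON):

    import json
    import wave

    # Verify the audio matches the stated format: stereo, 16-bit, 16 kHz.
    with wave.open("conversation_001.wav", "rb") as wav:
        assert wav.getnchannels() == 2
        assert wav.getsampwidth() * 8 == 16
        assert wav.getframerate() == 16000
        print("duration (s):", wav.getnframes() / wav.getframerate())

    # Walk the time-coded, speaker-segmented utterances.
    with open("conversation_001.json", encoding="utf-8") as f:
        transcript = json.load(f)

    for utt in transcript["utterances"]:  # "utterances", "speaker", "start",
        print(utt["speaker"], utt["start"], utt["end"], utt["text"])  # "end", "text" are assumed keys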

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.

    Usage and Applications

    This dataset is a versatile resource for multiple Finnish speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for Finnish.
    Voice Assistants: Build smart assistants capable of understanding natural Finnish conversations.

4. Trends in Reading and Language Arts Proficiency (2010-2022): Top Of The World Elementary School vs. California vs. Laguna Beach Unified School District

    • publicschoolreview.com
    Updated Sep 21, 2025
    Cite
    Public School Review (2025). Trends in Reading and Language Arts Proficiency (2010-2022): Top Of The World Elementary School vs. California vs. Laguna Beach Unified School District [Dataset]. https://www.publicschoolreview.com/top-of-the-world-elementary-school-profile
    Explore at:
    Dataset updated
    Sep 21, 2025
    Dataset authored and provided by
    Public School Review
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Laguna Beach Unified School District
    Description

This dataset tracks annual reading and language arts proficiency from 2010 to 2022 for Top Of The World Elementary School, compared against California and the Laguna Beach Unified School District.

5. Dataset for: "Big data suggest strong constraints of linguistic similarity on adult language learning"

    • data.niaid.nih.gov
    Updated Jan 24, 2020
    Cite
    Job Schepens (2020). Dataset for: "Big data suggest strong constraints of linguistic similarity on adult language learning" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2863532
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    T. Florian Jaeger
    Job Schepens
    Roeland van Hout
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

This dataset is adapted from raw data with fully anonymized results on the State Examination of Dutch as a Second Language. The exam is officially administered by the Board of Tests and Examinations (College voor Toetsen en Examens, or CvTE; see cvte.nl/about-cvte), which is mandated by the Dutch government.

    The article accompanying the dataset:

    Schepens, Job, Roeland van Hout, and T. Florian Jaeger. “Big Data Suggest Strong Constraints of Linguistic Similarity on Adult Language Learning.” Cognition 194 (January 1, 2020): 104056. https://doi.org/10.1016/j.cognition.2019.104056.

Every row in the dataset represents the first official testing score of a unique learner. The columns contain the following information, based on questionnaires filled in at the time of the exam:

    "L1" - The first language of the learner "C" - The country of birth "L1L2" - The combination of first and best additional language besides Dutch "L2" - The best additional language besides Dutch "AaA" - Age at Arrival in the Netherlands in years (starting date of residence) "LoR" - Length of residence in the Netherlands in years "Edu.day" - Duration of daily education (1 low, 2 middle, 3 high, 4 very high). From 1992 until 2006, learners' education has been measured by means of a side-by-side matrix question in a learner's questionnaire. Learners were asked to mark which type of education they have had (elementary, secondary, or tertiary schooling) by means of filling in for how many years they have been enrolled, in which country, and whether or not they have graduated. Based on this information we were able to estimate how many years learners have had education on a daily basis from six years of age onwards. Since 2006, the question about learners' education has been altered and it is asked directly how many years learners have had formal education on a daily basis from six years of age onwards. Possible answering categories are: 1) 0 thru 5 years; 2) 6 thru 10 years; 3) 11 thru 15 years; 4) 16 years or more. The answers have been merged into the categorical answer. "Sex" - Gender "Family" - Language Family "ISO639.3" - Language ID code according to Ethnologue "Enroll" - Proportion of school-aged youth enrolled in secondary education according to the World Bank. The World Bank reports on education data in a wide number of countries around the world on a regular basis. We took the gross enrollment rate in secondary schooling per country in the year the learner has arrived in the Netherlands as an indicator for a country's educational accessibility at the time learners have left their country of origin. "STEX_speaking_score" - The STEX test score for speaking proficiency. "Dissimilarity_morphological" - Morphological similarity "Dissimilarity_lexical" - Lexical similarity "Dissimilarity_phonological_new_features" - Phonological similarity (in terms of new features) "Dissimilarity_phonological_new_categories" - Phonological similarity (in terms of new sounds)

    A few rows of the data:

    "L1","C","L1L2","L2","AaA","LoR","Edu.day","Sex","Family","ISO639.3","Enroll","STEX_speaking_score","Dissimilarity_morphological","Dissimilarity_lexical","Dissimilarity_phonological_new_features","Dissimilarity_phonological_new_categories" "English","UnitedStates","EnglishMonolingual","Monolingual",34,0,4,"Female","Indo-European","eng ",94,541,0.0094,0.083191,11,19 "English","UnitedStates","EnglishGerman","German",25,16,3,"Female","Indo-European","eng ",94,603,0.0094,0.083191,11,19 "English","UnitedStates","EnglishFrench","French",32,3,4,"Male","Indo-European","eng ",94,562,0.0094,0.083191,11,19 "English","UnitedStates","EnglishSpanish","Spanish",27,8,4,"Male","Indo-European","eng ",94,537,0.0094,0.083191,11,19 "English","UnitedStates","EnglishMonolingual","Monolingual",47,5,3,"Male","Indo-European","eng ",94,505,0.0094,0.083191,11,19

6. Mandarin General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Mandarin General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-mandarin-china
    Explore at:
wav (available download format)
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Mandarin Chinese General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Mandarin speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Mandarin Chinese communication.

Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Mandarin speech models that understand and respond to authentic Chinese accents and dialects.

    Speech Data

    The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Mandarin Chinese. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 60 verified native Mandarin Chinese speakers from FutureBeeAI’s contributor community.
    Regions: Representing various provinces of China to ensure dialectal diversity and demographic balance.
    Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
High transcription accuracy achieved through a double QA pass (average WER < 5%)

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.

    Usage and Applications

    This dataset is a versatile resource for multiple Mandarin speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for Mandarin Chinese.
    Voice Assistants: Build smart assistants capable of understanding natural Chinese conversations.

  7. Most popular database management systems worldwide 2024

    • statista.com
    • tokrwards.com
    Updated Jun 30, 2025
    Cite
    Statista (2025). Most popular database management systems worldwide 2024 [Dataset]. https://www.statista.com/statistics/809750/worldwide-popularity-ranking-database-management-systems/
    Explore at:
    Dataset updated
    Jun 30, 2025
    Dataset authored and provided by
Statista (http://statista.com/)
    Time period covered
    Jun 2024
    Area covered
    Worldwide
    Description

As of June 2024, the most popular database management system (DBMS) worldwide was Oracle, with a ranking score of *******; MySQL and Microsoft SQL Server rounded out the top three. Although the database management industry contains some of the largest companies in the tech industry, such as Microsoft, Oracle, and IBM, a number of free and open-source DBMSs such as PostgreSQL and MariaDB remain competitive.

    Database Management Systems

    As the name implies, DBMSs provide a platform through which developers can organize, update, and control large databases. Given the business world's growing focus on big data and data analytics, knowledge of SQL has become an important asset for software developers around the world, and database management skills are seen as highly desirable. In addition to providing developers with the tools needed to operate databases, DBMSs are also integral to the way consumers access information through applications, which further underlines the software's importance.

8. IndQNER: Indonesian Benchmark Dataset from the Indonesian Translation of the Quran

    • data.niaid.nih.gov
    Updated Jan 27, 2024
    Cite
    Gusmita, Ria Hari (2024). IndQNER: Indonesian Benchmark Dataset from the Indonesian Translation of the Quran [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7454891
    Explore at:
    Dataset updated
    Jan 27, 2024
    Dataset provided by
    Firmansyah, Asep Fajar
    Gusmita, Ria Hari
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    IndQNER

    IndQNER is a Named Entity Recognition (NER) benchmark dataset that was created by manually annotating 8 chapters in the Indonesian translation of the Quran. The annotation was performed using a web-based text annotation tool, Tagtog, and the BIO (Beginning-Inside-Outside) tagging format. The dataset contains:

    3117 sentences

    62027 tokens

    2475 named entities

    18 named entity categories
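
    Since the annotations use the BIO scheme, a small Python sketch for folding BIO-tagged tokens into entity spans may help (the released file layout is not shown here, so the function works on parallel token/tag lists; the example tokens are invented for illustration):

    def bio_to_spans(tokens, tags):
        """Fold BIO tags into (label, entity text) pairs."""
        spans, current = [], None  # current = [label, start, end) over token indices
        for i, tag in enumerate(tags):
            if tag.startswith("B-"):
                if current:
                    spans.append(current)
                current = [tag[2:], i, i + 1]
            elif tag.startswith("I-") and current and tag[2:] == current[0]:
                current[2] = i + 1
            else:  # "O" or a stray I- tag closes any open span
                if current:
                    spans.append(current)
                current = None
        if current:
            spans.append(current)
        return [(label, " ".join(tokens[s:e])) for label, s, e in spans]

    tokens = ["Nabi", "Musa", "menerima", "Taurat"]
    tags = ["B-Prophet", "I-Prophet", "O", "B-Holy_book"]
    print(bio_to_spans(tokens, tags))  # [('Prophet', 'Nabi Musa'), ('Holy_book', 'Taurat')]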

    Named Entity Classes

    The named entity classes were initially defined by analyzing the existing Quran concepts ontology. The initial classes were updated based on the information acquired during the annotation process. Finally, there are 20 classes, as follows:

    Allah

    Allah's Throne

    Artifact

    Astronomical body

    Event

    False deity

    Holy book

    Language

    Angel

    Person

    Messenger

    Prophet

    Sentient

    Afterlife location

    Geographical location

    Color

    Religion

    Food

    Fruit

    The book of Allah

    Annotation Stage

    There were eight annotators who contributed to the annotation process. They were informatics engineering students at the State Islamic University Syarif Hidayatullah Jakarta.

    Anggita Maharani Gumay Putri

    Muhammad Destamal Junas

    Naufaldi Hafidhigbal

    Nur Kholis Azzam Ubaidillah

    Puspitasari

    Septiany Nur Anggita

    Wilda Nurjannah

    William Santoso

    Verification Stage

    We found many named entity and class candidates during the annotation stage. To verify the candidates, we consulted Quran and Tafseer (content) experts who are lecturers at Quran and Tafseer Department at the State Islamic University Syarif Hidayatullah Jakarta.

    Dr. Eva Nugraha, M.Ag.

    Dr. Jauhar Azizy, MA

    Dr. Lilik Ummi Kultsum, MA

    Evaluation

    We evaluated the annotation quality of IndQNER by performing experiments in two settings: supervised learning (BiLSTM+CRF) and transfer learning (IndoBERT fine-tuning).

    Supervised Learning Setting

    The implementation of BiLSTM and CRF utilized IndoBERT to provide word embeddings. All experiments used a batch size of 16. These are the results:

Maximum sequence length | Epochs | Precision | Recall | F1 score
    256                     | 10     | 0.94      | 0.92   | 0.93
    256                     | 20     | 0.99      | 0.97   | 0.98
    256                     | 40     | 0.96      | 0.96   | 0.96
    256                     | 100    | 0.97      | 0.96   | 0.96
    512                     | 10     | 0.92      | 0.92   | 0.92
    512                     | 20     | 0.96      | 0.95   | 0.96
    512                     | 40     | 0.97      | 0.95   | 0.96
    512                     | 100    | 0.97      | 0.95   | 0.96

    Transfer Learning Setting

    We performed several experiments with different parameters in IndoBERT fine-tuning. All experiments used a learning rate of 2e-5 and a batch size of 16. These are the results:

Maximum sequence length | Epochs | Precision | Recall | F1 score
    256                     | 10     | 0.67      | 0.65   | 0.65
    256                     | 20     | 0.60      | 0.59   | 0.59
    256                     | 40     | 0.75      | 0.72   | 0.71
    256                     | 100    | 0.73      | 0.68   | 0.68
    512                     | 10     | 0.72      | 0.62   | 0.64
    512                     | 20     | 0.62      | 0.57   | 0.58
    512                     | 40     | 0.72      | 0.66   | 0.67
    512                     | 100    | 0.68      | 0.68   | 0.67

    This dataset is also part of the NusaCrowd project which aims to collect Natural Language Processing (NLP) datasets for Indonesian and its local languages.

    How to Cite

@InProceedings{10.1007/978-3-031-35320-8_12,
     author="Gusmita, Ria Hari and Firmansyah, Asep Fajar and Moussallem, Diego and Ngonga Ngomo, Axel-Cyrille",
     editor="M{\'e}tais, Elisabeth and Meziane, Farid and Sugumaran, Vijayan and Manning, Warren and Reiff-Marganiec, Stephan",
     title="IndQNER: Named Entity Recognition Benchmark Dataset from the Indonesian Translation of the Quran",
     booktitle="Natural Language Processing and Information Systems",
     year="2023",
     publisher="Springer Nature Switzerland",
     address="Cham",
     pages="170--185",
     abstract="Indonesian is classified as underrepresented in the Natural Language Processing (NLP) field, despite being the tenth most spoken language in the world with 198 million speakers. The paucity of datasets is recognized as the main reason for the slow advancements in NLP research for underrepresented languages. Significant attempts were made in 2020 to address this drawback for Indonesian. The Indonesian Natural Language Understanding (IndoNLU) benchmark was introduced alongside IndoBERT pre-trained language model. The second benchmark, Indonesian Language Evaluation Montage (IndoLEM), was presented in the same year. These benchmarks support several tasks, including Named Entity Recognition (NER). However, all NER datasets are in the public domain and do not contain domain-specific datasets. To alleviate this drawback, we introduce IndQNER, a manually annotated NER benchmark dataset in the religious domain that adheres to a meticulously designed annotation guideline. Since Indonesia has the world's largest Muslim population, we build the dataset from the Indonesian translation of the Quran. The dataset includes 2475 named entities representing 18 different classes. To assess the annotation quality of IndQNER, we perform experiments with BiLSTM and CRF-based NER, as well as IndoBERT fine-tuning. The results reveal that the first model outperforms the second model achieving 0.98 F1 points. This outcome indicates that IndQNER may be an acceptable evaluation metric for Indonesian NER tasks in the aforementioned domain, widening the research's domain range.",
     isbn="978-3-031-35320-8"
    }

    Contact

    If you have any questions or feedback, feel free to contact us at ria.hari.gusmita@uni-paderborn.de or ria.gusmita@uinjkt.ac.id

  9. The GDELT Project

    • kaggle.com
    zip
    Updated Feb 12, 2019
    Cite
    The GDELT Project (2019). The GDELT Project [Dataset]. https://www.kaggle.com/datasets/gdelt/gdelt
    Explore at:
zip (0 bytes, available download format)
    Dataset updated
    Feb 12, 2019
    Dataset authored and provided by
    The GDELT Project
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

The GDELT Project is the largest, most comprehensive, and highest resolution open database of human society ever created. The 2015 data alone records nearly three quarters of a trillion emotional snapshots and more than 1.5 billion location references, while its total archives span more than 215 years, making it one of the largest open-access spatio-temporal datasets in existence and pushing the boundaries of "big data" study of global human society. Its Global Knowledge Graph connects the world's people, organizations, locations, themes, counts, images, and emotions into a single holistic network over the entire planet. How can you query, explore, model, visualize, interact with, and even forecast this vast archive of human society?

    Content

GDELT 2.0 offers a wealth of features in its event database, which includes events reported in articles published in 65 live-translated languages, measurements of 2,300 emotions and themes, high-resolution views of the non-Western world, relevant imagery, videos, social media embeds, quotes, names, amounts, and more.

    You may find these code books helpful:
    GDELT Global Knowledge Graph Codebook V2.1 (PDF)
    GDELT Event Codebook V2.0 (PDF)

    Querying BigQuery tables

You can use the BigQuery Python client library to query tables in this dataset in Kernels; note that methods available in Kernels are limited to querying data. GDELT's public BigQuery tables live in the gdelt-bq project, e.g. gdelt-bq.gdeltv2.[TABLENAME]. Fork a starter kernel to learn how to safely manage analyzing large BigQuery datasets.
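
    A short sketch with the BigQuery Python client library (the table name gdelt-bq.gdeltv2.events is an assumption based on GDELT's public BigQuery hosting; the query itself is illustrative):

    from google.cloud import bigquery

    client = bigquery.Client()  # requires Google Cloud credentials
    query = """
        SELECT EventCode, COUNT(*) AS n
        FROM `gdelt-bq.gdeltv2.events`
        WHERE SQLDATE BETWEEN 20150101 AND 20151231
        GROUP BY EventCode
        ORDER BY n DESC
        LIMIT 10
    """
    for row in client.query(query).result():
        print(row.EventCode, row.n)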

    Acknowledgements

    You may redistribute, rehost, republish, and mirror any of the GDELT datasets in any form. However, any use or redistribution of the data must include a citation to the GDELT Project and a link to the website (https://www.gdeltproject.org/).

10. Saudi Arabian Arabic General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Saudi Arabian Arabic General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-arabic-saudiarabia
    Explore at:
wav (available download format)
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    Saudi Arabia
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Saudi Arabian Arabic General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Arabic speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Saudi Arabian Arabic communication.

Curated by FutureBeeAI, this 40-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Arabic speech models that understand and respond to authentic Saudi accents and dialects.

    Speech Data

    The dataset comprises 40 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Saudi Arabian Arabic. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 80 verified native Saudi Arabian Arabic speakers from FutureBeeAI’s contributor community.
    Regions: Representing various provinces of Saudi Arabia to ensure dialectal diversity and demographic balance.
    Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
High transcription accuracy achieved through a double QA pass (average WER < 5%)

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.

    Usage and Applications

    This dataset is a versatile resource for multiple Arabic speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for Saudi Arabian Arabic.
    Voice Assistants: Build smart assistants capable of understanding natural Saudi conversations.

11. Algeria Arabic General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Algeria Arabic General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-arabic-algeria
    Explore at:
wav (available download format)
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    Algeria
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Algerian Arabic General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Arabic speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Algerian Arabic communication.

Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Arabic speech models that understand and respond to authentic Algerian accents and dialects.

    Speech Data

    The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Algerian Arabic. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 60 verified native Algerian Arabic speakers from FutureBeeAI’s contributor community.
    Regions: Representing various provinces of Algeria to ensure dialectal diversity and demographic balance.
    Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
High transcription accuracy achieved through a double QA pass (average WER < 5%)

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.

    Usage and Applications

    This dataset is a versatile resource for multiple Arabic speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for Algerian Arabic.
    Voice Assistants: Build smart assistants capable of understanding natural Algerian conversations.

12. Italian General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Italian General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-italian-italy
    Explore at:
wav (available download format)
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Italian General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Italian speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Italian communication.

Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Italian speech models that understand and respond to authentic Italian accents and dialects.

    Speech Data

    The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Italian. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 60 verified native Italian speakers from FutureBeeAI’s contributor community.
    Regions: Representing various provinces of Italy to ensure dialectal diversity and demographic balance.
    Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
High transcription accuracy achieved through a double QA pass (average WER < 5%)

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.

    Usage and Applications

    This dataset is a versatile resource for multiple Italian speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for Italian.
    Voice Assistants: Build smart assistants capable of understanding natural Italian conversations.

13. Egyptian Arabic General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Egyptian Arabic General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-arabic-egypt
    Explore at:
wav (available download format)
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Egyptian Arabic General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Arabic speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Egyptian Arabic communication.

Curated by FutureBeeAI, this 40-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Arabic speech models that understand and respond to authentic Egyptian accents and dialects.

    Speech Data

    The dataset comprises 40 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Egyptian Arabic. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 80 verified native Egyptian Arabic speakers from FutureBeeAI’s contributor community.
    Regions: Representing various provinces of Egypt to ensure dialectal diversity and demographic balance.
    Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
High transcription accuracy achieved through a double QA pass (average WER < 5%)

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.

    Usage and Applications

    This dataset is a versatile resource for multiple Arabic speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for Egyptian Arabic.
    Voice Assistants: Build smart assistants capable of understanding natural Egyptian conversations.

14. American English General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). American English General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-english-usa
    Explore at:
wav (available download format)
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    United States
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the US English General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of English speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world US English communication.

Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade English speech models that understand and respond to authentic American accents and dialects.

    Speech Data

    The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of US English. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 60 verified native US English speakers from FutureBeeAI’s contributor community.
Regions: Representing various states of the United States to ensure dialectal diversity and demographic balance.
    Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
High transcription accuracy achieved through a double QA pass (average WER < 5%)

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.

    Usage and Applications

    This dataset is a versatile resource for multiple English speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for US English.
    Voice Assistants: Build smart assistants capable of understanding natural American conversations.

15. Spanish (Spain) General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Spanish(Spain) General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-spanish-spain
    Explore at:
wav (available download format)
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    Spain
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Spanish General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Spanish speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Spanish communication.

Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Spanish speech models that understand and respond to authentic Spanish accents and dialects.

    Speech Data

    The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Spanish. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 60 verified native Spanish speakers from FutureBeeAI’s contributor community.
    Regions: Representing various provinces of Spain to ensure dialectal diversity and demographic balance.
    Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
High transcription accuracy achieved through a double QA pass (average WER < 5%)

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.

    Usage and Applications

    This dataset is a versatile resource for multiple Spanish speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for Spanish.
    Voice Assistants: Build smart assistants capable of understanding natural Spanish conversations.

16. Hindi General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Hindi General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-hindi-india
    Explore at:
    wav (available download formats)
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Hindi General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Hindi speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Hindi communication.

    Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Hindi speech models that understand and respond to authentic Indian accents and dialects.

    Speech Data

    The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Hindi. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 60 verified native Hindi speakers from FutureBeeAI’s contributor community.
    Regions: Representing various states of India to ensure dialectal diversity and demographic balance.
    Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
    High transcription accuracy, achieved through a double QA pass (average WER < 5%)

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.

    Usage and Applications

    This dataset is a versatile resource for multiple Hindi speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for Hindi.
    Voice Assistants: Build smart assistants capable of understanding natural Indian conversations.

  17. Egyptian Arabic Call Center Data for Telecom AI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Egyptian Arabic Call Center Data for Telecom AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/telecom-call-center-conversation-arabic-egypt
    Explore at:
    wav (available download formats)
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    Egypt
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    This Egyptian Arabic Call Center Speech Dataset for the Telecom industry is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for Arabic-speaking telecom customers. Featuring over 40 hours of real-world, unscripted audio, it delivers authentic customer-agent interactions across key telecom support scenarios to help train robust ASR models.

    Curated by FutureBeeAI, this dataset empowers voice AI engineers, telecom automation teams, and NLP researchers to build high-accuracy, production-ready models for telecom-specific use cases.

    Speech Data

    The dataset contains 40 hours of dual-channel call center recordings between native Egyptian Arabic speakers. Captured in realistic customer support settings, these conversations span a wide range of telecom topics from network complaints to billing issues, offering a strong foundation for training and evaluating telecom voice AI solutions.

    Participant Diversity:
    Speakers: 80 native Egyptian Arabic speakers from our verified contributor pool.
    Regions: Representing multiple provinces across Egypt to ensure coverage of various accents and dialects.
    Participant Profile: Balanced gender mix (60% male, 40% female) with age distribution from 18 to 70 years.
    Recording Details:
    Conversation Nature: Naturally flowing, unscripted interactions between agents and customers.
    Call Duration: Ranges from 5 to 15 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, at 8kHz and 16kHz sample rates (see the channel-splitting sketch after this list).
    Recording Environment: Captured in clean conditions with no echo or background noise.
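
    Because the agent and the customer occupy separate channels of a dual-channel recording, a common first preprocessing step is splitting each stereo file into two mono files, one per speaker. A minimal standard-library sketch (which channel holds the agent is an assumption, not documented here):

        import wave

        def split_stereo(path: str) -> None:
            """Split a 16-bit stereo call recording into two mono WAV files."""
            with wave.open(path, "rb") as src:
                assert src.getnchannels() == 2 and src.getsampwidth() == 2
                rate = src.getframerate()
                frames = src.readframes(src.getnframes())
            # A 16-bit stereo frame is 4 bytes: left sample, then right sample.
            left = b"".join(frames[i:i + 2] for i in range(0, len(frames), 4))
            right = b"".join(frames[i + 2:i + 4] for i in range(0, len(frames), 4))
            # Channel-to-role mapping below is assumed.
            for name, data in (("agent.wav", left), ("customer.wav", right)):
                with wave.open(name, "wb") as dst:
                    dst.setnchannels(1)
                    dst.setsampwidth(2)
                    dst.setframerate(rate)
                    dst.writeframes(data)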

    Topic Diversity

    This speech corpus includes both inbound and outbound calls with varied conversational outcomes (positive, negative, and neutral), ensuring broad scenario coverage for telecom AI development.

    Inbound Calls:
    Phone Number Porting
    Network Connectivity Issues
    Billing and Payments
    Technical Support
    Service Activation
    International Roaming Enquiry
    Refund Requests and Billing Adjustments
    Emergency Service Access, and others
    Outbound Calls:
    Welcome Calls & Onboarding
    Payment Reminders
    Customer Satisfaction Surveys
    Technical Updates
    Service Usage Reviews
    Network Complaint Status Calls, and more

    This variety helps train telecom-specific models to manage real-world customer interactions and understand context-specific voice patterns.

    Transcription

    All audio files are accompanied by manually curated, time-coded verbatim transcriptions in JSON format.

    Transcription Includes:
    Speaker-Segmented Dialogues
    Time-coded Segments
    Non-speech Tags (e.g., pauses, coughs)
    High transcription accuracy, with a word error rate below 5% (computed as sketched after this section), thanks to dual-layered quality checks.

    These transcriptions are production-ready, allowing for faster development of ASR and conversational AI systems in the Telecom domain.
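
    For reference, word error rate (WER) is the word-level edit distance, i.e. substitutions plus deletions plus insertions, divided by the number of words in the reference transcript. A self-contained sketch:

        def word_error_rate(reference: str, hypothesis: str) -> float:
            """WER = (substitutions + deletions + insertions) / reference words."""
            ref, hyp = reference.split(), hypothesis.split()
            # Levenshtein distance over words, via dynamic programming.
            d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
            for i in range(len(ref) + 1):
                d[i][0] = i
            for j in range(len(hyp) + 1):
                d[0][j] = j
            for i in range(1, len(ref) + 1):
                for j in range(1, len(hyp) + 1):
                    cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                    d[i][j] = min(d[i - 1][j] + 1,         # deletion
                                  d[i][j - 1] + 1,         # insertion
                                  d[i - 1][j - 1] + cost)  # substitution
            return d[len(ref)][len(hyp)] / max(len(ref), 1)

        print(word_error_rate("the call was dropped", "the call dropped"))  # 0.25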

    Metadata

    Rich metadata is available for each participant and conversation:

    Participant Metadata: ID, age, gender, accent, dialect, and location.
    Conversation Metadata: Topic, sentiment, call type, sample rate, and technical specs.

  18. Vietnamese Call Center Data for Healthcare AI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Vietnamese Call Center Data for Healthcare AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/healthcare-call-center-conversation-vietnamese-vietnam
    Explore at:
    wav (available download formats)
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    This Vietnamese Call Center Speech Dataset for the Healthcare industry is purpose-built to accelerate the development of Vietnamese speech recognition, spoken language understanding, and conversational AI systems. With 30 hours of unscripted, real-world conversations, it delivers the linguistic and contextual depth needed to build high-performance ASR models for medical and wellness-related customer service.

    Created by FutureBeeAI, this dataset empowers voice AI teams, NLP researchers, and data scientists to develop domain-specific models for hospitals, clinics, insurance providers, and telemedicine platforms.

    Speech Data

    The dataset features 30 hours of dual-channel call center conversations between native Vietnamese speakers. These recordings cover a variety of healthcare support topics, enabling the development of speech technologies that are contextually aware and linguistically rich.

    Participant Diversity:
    Speakers: 60 verified native Vietnamese speakers from our contributor community.
    Regions: Diverse provinces across Vietnam to ensure broad dialectal representation.
    Participant Profile: Age range of 18–70 with a gender mix of 60% male and 40% female.
    Recording Details:
    Conversation Nature: Naturally flowing, unscripted conversations.
    Call Duration: Each session ranges between 5 to 15 minutes.
    Audio Format: WAV format, stereo, 16-bit depth at 8kHz and 16kHz sample rates.
    Recording Environment: Captured in clear conditions without background noise or echo.

    Topic Diversity

    The dataset spans inbound and outbound calls, capturing a broad range of healthcare-specific interactions and sentiment types (positive, neutral, negative).

    Inbound Calls:
    Appointment Scheduling
    New Patient Registration
    Surgical Consultation
    Dietary Advice and Consultations
    Insurance Coverage Inquiries
    Follow-up Treatment Requests, and more
    Outbound Calls:
    Appointment Reminders
    Preventive Care Campaigns
    Test Results & Lab Reports
    Health Risk Assessment Calls
    Vaccination Updates
    Wellness Subscription Outreach, and more

    These real-world interactions help build speech models that understand healthcare domain nuances and user intent.

    Transcription

    Every audio file is accompanied by high-quality, manually created transcriptions in JSON format.

    Transcription Includes:
    Speaker-identified Dialogues
    Time-coded Segments
    Non-speech Annotations (e.g., silence, cough)
    High transcription accuracy, with a word error rate below 5%, backed by dual-layer QA checks.

    Metadata

    Each conversation and speaker includes detailed metadata to support fine-tuned training and analysis.

    Participant Metadata: ID, gender, age, region, accent, and dialect.
    Conversation Metadata: Topic, sentiment, call type, sample rate, and technical specs.

    Usage and Applications

    This dataset can be used across a range of healthcare and voice AI use cases:


  19. Urdu General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Urdu General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-urdu-pakistan
    Explore at:
    wav (available download formats)
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Urdu General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Urdu speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Urdu communication.

    Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Urdu speech models that understand and respond to authentic Urdu accents and dialects.

    Speech Data

    The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Urdu. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 60 verified native Urdu speakers from FutureBeeAI’s contributor community.
    Regions: Representing various provinces of Pakistan to ensure dialectal diversity and demographic balance.
    Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
    High transcription accuracy, achieved through a double QA pass (average WER < 5%)

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.

    Usage and Applications

    This dataset is a versatile resource for multiple Urdu speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for Urdu.
    Voice Assistants: Build smart assistants capable of understanding natural Urdu conversations.

  20. Dutch Call Center Data for Healthcare AI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Dutch Call Center Data for Healthcare AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/healthcare-call-center-conversation-dutch-netherlands
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    This Dutch Call Center Speech Dataset for the Healthcare industry is purpose-built to accelerate the development of Dutch speech recognition, spoken language understanding, and conversational AI systems. With 30 hours of unscripted, real-world conversations, it delivers the linguistic and contextual depth needed to build high-performance ASR models for medical and wellness-related customer service.

    Created by FutureBeeAI, this dataset empowers voice AI teams, NLP researchers, and data scientists to develop domain-specific models for hospitals, clinics, insurance providers, and telemedicine platforms.

    Speech Data

    The dataset features 30 hours of dual-channel call center conversations between native Dutch speakers. These recordings cover a variety of healthcare support topics, enabling the development of speech technologies that are contextually aware and linguistically rich.

    Participant Diversity:
    Speakers: 60 verified native Dutch speakers from our contributor community.
    Regions: Diverse provinces across the Netherlands to ensure broad dialectal representation.
    Participant Profile: Age range of 18–70 with a gender mix of 60% male and 40% female.
    Recording Details:
    Conversation Nature: Naturally flowing, unscripted conversations.
    Call Duration: Each session ranges between 5 to 15 minutes.
    Audio Format: WAV format, stereo, 16-bit depth at 8kHz and 16kHz sample rates.
    Recording Environment: Captured in clear conditions without background noise or echo.

    Topic Diversity

    The dataset spans inbound and outbound calls, capturing a broad range of healthcare-specific interactions and sentiment types (positive, neutral, negative).

    Inbound Calls:
    Appointment Scheduling
    New Patient Registration
    Surgical Consultation
    Dietary Advice and Consultations
    Insurance Coverage Inquiries
    Follow-up Treatment Requests, and more
    Outbound Calls:
    Appointment Reminders
    Preventive Care Campaigns
    Test Results & Lab Reports
    Health Risk Assessment Calls
    Vaccination Updates
    Wellness Subscription Outreach, and more

    These real-world interactions help build speech models that understand healthcare domain nuances and user intent.

    Transcription

    Every audio file is accompanied by high-quality, manually created transcriptions in JSON format.

    Transcription Includes:
    Speaker-identified Dialogues
    Time-coded Segments
    Non-speech Annotations (e.g., silence, cough)
    High transcription accuracy, with a word error rate below 5%, backed by dual-layer QA checks.

    Metadata

    Each conversation and speaker includes detailed metadata to support fine-tuned training and analysis.

    Participant Metadata: ID, gender, age, region, accent, and dialect.
    Conversation Metadata: Topic, sentiment, call type, sample rate, and technical specs.

    Usage and Applications

    This dataset can be used across a range of healthcare and voice AI use cases:

