92 datasets found
  1. The most spoken languages worldwide 2025

    • statista.com
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista, The most spoken languages worldwide 2025 [Dataset]. https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/
    Explore at:
    Dataset authored and provided by
    Statistahttp://statista.com/
    Time period covered
    2025
    Area covered
    World
    Description

    In 2025, there were around 1.53 billion people worldwide who spoke English either natively or as a second language, slightly more than the 1.18 billion Mandarin Chinese speakers at the time of survey. Hindi and Spanish accounted for the third and fourth most widespread languages that year. Languages in the United States The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation, and other official pronouncements. The United States is a land of immigration, and the languages spoken in the United States vary as a result of the multicultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which over than 43 million people spoke at home in 2023. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese),1.8 million Tagalog speakers, and 1.57 million Vietnamese speakers counted in the United States that year. Different languages at home The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage of population speaking a language other than English is California. About 45 percent of its population was speaking a language other than English at home in 2023.

  2. 🌍📚 World Languages Dataset 🌍📚

    • kaggle.com
    zip
    Updated Jul 30, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Waqar Ali (2024). 🌍📚 World Languages Dataset 🌍📚 [Dataset]. https://www.kaggle.com/datasets/waqi786/world-languages-dataset
    Explore at:
    zip(5706 bytes)Available download formats
    Dataset updated
    Jul 30, 2024
    Authors
    Waqar Ali
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Area covered
    World
    Description

    This dataset provides a comprehensive overview of 500 languages spoken around the world. It captures essential linguistic features, including language families, geographical regions, writing systems, and the estimated number of native speakers. This dataset aims to highlight the rich diversity of languages and their cultural significance, offering valuable insights for linguists, researchers, and enthusiasts interested in global language distribution.

    The dataset contains real and accurate records for 500 languages across different regions and linguistic families. It covers a diverse range of languages, from widely spoken ones like English and Mandarin to less commonly known languages. The data was meticulously compiled to reflect the authentic linguistic landscape and provide a valuable resource for language studies and cultural analysis.

  3. Number of native Spanish speakers worldwide 2024, by country

    • statista.com
    • boostndoto.org
    • +5more
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista, Number of native Spanish speakers worldwide 2024, by country [Dataset]. https://www.statista.com/statistics/991020/number-native-spanish-speakers-country-worldwide/
    Explore at:
    Dataset authored and provided by
    Statistahttp://statista.com/
    Area covered
    World
    Description

    Mexico is the country with the largest number of native Spanish speakers in the world. As of 2024, 132.5 million people in Mexico spoke Spanish with a native command of the language. Colombia was the nation with the second-highest number of native Spanish speakers, at around 52.7 million. Spain came in third, with 48 million, and Argentina fourth, with 46 million. Spanish, a world language As of 2023, Spanish ranked as the fourth most spoken language in the world, only behind English, Chinese, and Hindi, with over half a billion speakers. Spanish is the official language of over 20 countries, the majority on the American continent, nonetheless, it's also one of the official languages of Equatorial Guinea in Africa. Other countries have a strong influence, like the United States, Morocco, or Brazil, countries included in the list of non-Hispanic countries with the highest number of Spanish speakers. The second most spoken language in the U.S. In the most recent data, Spanish ranked as the language, other than English, with the highest number of speakers, with 12 times more speakers as the second place. Which comes to no surprise following the long history of migrations from Latin American countries to the Northern country. Moreover, only during the fiscal year 2022. 5 out of the top 10 countries of origin of naturalized people in the U.S. came from Spanish-speaking countries.

  4. Most spoken languages worldwide in Millions

    • kaggle.com
    zip
    Updated Oct 14, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Batros Jamali (2023). Most spoken languages worldwide in Millions [Dataset]. https://www.kaggle.com/datasets/batrosjamali/most-spoken-languages-worldwide-in-millions
    Explore at:
    zip(585 bytes)Available download formats
    Dataset updated
    Oct 14, 2023
    Authors
    Batros Jamali
    Area covered
    World
    Description

    Dataset

    This dataset was created by Batros Jamali

    Contents

  5. Common languages used for web content 2025, by share of websites

    • statista.com
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista, Common languages used for web content 2025, by share of websites [Dataset]. https://www.statista.com/statistics/262946/most-common-languages-on-the-internet/
    Explore at:
    Dataset authored and provided by
    Statistahttp://statista.com/
    Time period covered
    Oct 2025
    Area covered
    Worldwide
    Description

    As of October 2025, English was the dominant language for online content, used by nearly half of all websites worldwide. Spanish ranked second, accounting for around 6 percent of web content, followed by German with 5.9 percent. English as the leading online language United States and India, the countries with the most internet users after China, are also the world's biggest English-speaking markets. The internet user base in both countries combined, as of January 2023, was over a billion individuals. This has led to most of the online information being created in English. Consequently, even those who are not native speakers may use it for convenience. Global internet usage by regions As of October 2024, the number of internet users worldwide was 5.52 billion. In the same period, Northern Europe and North America were leading in terms of internet penetration rates worldwide, with around 97 percent of its populations accessing the internet.

  6. g

    ENGLISH PROFICIENCY LEVEL

    • global-relocate.com
    Updated Oct 29, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Global Relocate (2024). ENGLISH PROFICIENCY LEVEL [Dataset]. https://global-relocate.com/rankings/english-proficiency-level
    Explore at:
    Dataset updated
    Oct 29, 2024
    Dataset provided by
    Global Relocate
    Description

    Using data from reports such as the "English Proficiency Index" (EDU) from Education First, one can see the significant impact of culture, education and globalization on the ability of citizens of different countries to speak English.

  7. F

    British English General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). British English General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-english-uk
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    United Kingdom
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the UK English General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of English speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world UK English communication.

    Curated by FutureBeeAI, this 30 hours dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade English speech models that understand and respond to authentic British accents and dialects.

    Speech Data

    The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of UK English. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 60 verified native UK English speakers from FutureBeeAI’s contributor community.
    Regions: Representing various provinces of United Kingdom to ensure dialectal diversity and demographic balance.
    Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
    High transcription accuracy, achieved through double QA pass, average WER < 5%

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.

    Usage and Applications

    This dataset is a versatile resource for multiple English speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for UK English.
    Voice Assistants: Build smart assistants capable of understanding natural British conversations.

  8. Ranking of languages spoken at home in the U.S. 2024, by number of speakers

    • statista.com
    Updated Nov 28, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista (2025). Ranking of languages spoken at home in the U.S. 2024, by number of speakers [Dataset]. https://www.statista.com/statistics/183483/ranking-of-languages-spoken-at-home-in-the-us-in-2008/
    Explore at:
    Dataset updated
    Nov 28, 2025
    Dataset authored and provided by
    Statistahttp://statista.com/
    Time period covered
    2024
    Area covered
    United States
    Description

    In 2024, some 45 million people in the United States spoke Spanish at home. In comparison, the second most spoken non-English language spoken by households was Chinese, at just 3.7 million speakers.The distribution of the U.S. population by ethnicity can be accessed here. A ranking of the most spoken languages across the world can be accessed here.

  9. F

    Canadian English General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). Canadian English General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-english-canada
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    Canada
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Canadian English General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of English speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Canadian English communication.

    Curated by FutureBeeAI, this 30 hours dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade English speech models that understand and respond to authentic Canadian accents and dialects.

    Speech Data

    The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Canadian English. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 60 verified native Canadian English speakers from FutureBeeAI’s contributor community.
    Regions: Representing various provinces of Canada to ensure dialectal diversity and demographic balance.
    Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
    High transcription accuracy, achieved through double QA pass, average WER < 5%

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.

    Usage and Applications

    This dataset is a versatile resource for multiple English speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for Canadian English.
    Voice Assistants: Build smart assistants capable of understanding natural Canadian conversations.
    <div style="margin-top:10px; margin-bottom: 10px; padding-left: 30px; display: flex; gap: 16px;

  10. Spanish speakers in countries where Spanish is not an official language 2024...

    • statista.com
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista, Spanish speakers in countries where Spanish is not an official language 2024 [Dataset]. https://www.statista.com/statistics/1276290/number-spanish-speakers-non-hispanic-countries-worldwide/
    Explore at:
    Dataset authored and provided by
    Statistahttp://statista.com/
    Area covered
    World
    Description

    The United States is the non-hispanic country with the largest number of native Spanish speakers in the world, with approximately 41.89 million people with a native command of the language in 2024. However, the European Union had the largest group of non-native speakers with limited proficiency of Spanish, at around 28 million people. Furthermore, Mexico is the country with the largest number of native Spanish speakers in the world as of 2024.

  11. d

    Global English Speech with Accent Conversational Dataset — Multi-Region...

    • datarade.ai
    .wav
    Updated Jul 21, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FileMarket (2025). Global English Speech with Accent Conversational Dataset — Multi-Region Validated Speech with Gender, Age & Metadata for AI & NLP Training [Dataset]. https://datarade.ai/data-products/global-english-speech-with-accent-conversational-dataset-mu-filemarket
    Explore at:
    .wavAvailable download formats
    Dataset updated
    Jul 21, 2025
    Dataset authored and provided by
    FileMarket
    Area covered
    Montenegro, Nicaragua, Tonga, United States Minor Outlying Islands, Iceland, Comoros, Bangladesh, Yemen, Cook Islands, Haiti
    Description

    The Global English Accent Conversational NLP Dataset is a comprehensive collection of validated English speech recordings sourced from native and non-native English speakers across key global regions. This dataset is designed for training Natural Language Processing models, conversational AI, Automatic Speech Recognition (ASR), and linguistic research, with a focus on regional accent variation.

    Regions and Covered Countries with Primary Spoken Languages:

    Africa: South Africa (English, Zulu, Afrikaans, Xhosa) Nigeria (English, Yoruba, Igbo, Hausa) Kenya (English, Swahili) Ghana (English, Twi, Ewe, Ga) Uganda (English, Luganda) Ethiopia (English, Amharic, Oromo)

    Central & South America: Mexico (Spanish, English as a second language) Guatemala (Spanish, K'iche', English) El Salvador (Spanish, English) Costa Rica (Spanish, English in Caribbean regions) Colombia (Spanish, English in urban centers) Dominican Republic (Spanish, English in tourist zones) Brazil (Portuguese, English in urban areas) Argentina (Spanish, English among educated speakers)

    Southeast Asia & South Asia: Philippines (Filipino, English) Vietnam (Vietnamese, English) Malaysia (Malay, English, Mandarin) Indonesia (Indonesian, Javanese, English) Singapore (English, Mandarin, Malay, Tamil) India (Hindi, English, Bengali, Tamil) Pakistan (Urdu, English, Punjabi)

    Europe: United Kingdom (English) Ireland (English, Irish) Germany (German, English) France (French, English) Spain (Spanish, Catalan, English) Italy (Italian, English) Portugal (Portuguese, English)

    Oceania: Australia (English) New Zealand (English, Māori) Fiji (English, Fijian) North America: United States (English, Spanish) Canada (English, French)

    Dataset Attributes: - Conversational English with natural accent variation - Global coverage with balanced male/female speakers - Rich speaker metadata: age, gender, country, city - Average audio length of ~30 minutes per participant - All samples manually validated for accuracy - Structured format suitable for machine learning and AI applications

    Best suited for: - NLP model training and evaluation - Multilingual ASR system development - Voice assistant and chatbot design - Accent recognition research - Voice synthesis and TTS modeling

    This dataset ensures global linguistic diversity and delivers high-quality audio for AI developers, researchers, and enterprises working on voice-based applications.

  12. English Gaming speech dataset

    • kaggle.com
    zip
    Updated Jun 7, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Frank Wong (2024). English Gaming speech dataset [Dataset]. https://www.kaggle.com/datasets/nexdatafrank/english-gaming-speech-dataset
    Explore at:
    zip(698585 bytes)Available download formats
    Dataset updated
    Jun 7, 2024
    Authors
    Frank Wong
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    English Gaming Real-world Casual Conversation and Monologue speech dataset

    Description

    English Gaming Real-world Casual Conversation and Monologue speech dataset, covers spontaneous dialogue about popular and evergreen games, including player discussions on battle strategies, social interactions, esports news, etc., mirrors real-world interactions. Transcribed with text content, speaker's ID, gender, accent, offensive expression labeling and other attributes. Our dataset was collected from extensive and diversify speakers, geographicly speaking, enhancing model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring the maintenance of user privacy and legal rights throughout the data collection, storage, and usage processes, our datasets are all GDPR, CCPA, PIPL complied. For more details, please refer to the link:https://www.nexdata.ai/datasets/speechrecog/1430?source=Kaggle

    Format

    16k Hz, 16 bit, wav, mono channel;

    Content category

    Spontaneous dialogue or monologue about popular and evergreen games (such as FPS, MOBA, MMORPG, VR, and other gaming genres), including player discussions on battle strategies, social interactions, esports news, etc.

    Recording environment

    Mixed(indoor, outdoor,entertainment)

    Country

    the United Kingdom(GBR), the United States(USA), etc.

    Language(Region) Code

    en-GB,en-US, etc.;

    Language

    English;

    Features of annotation

    Transcription text, timestamp, offensive expression labeling, speaker ID, gender, noise;

    Accuracy Rate

    Sentence Accuracy Rate (SAR) 95%.

    Licensing Information

    Commercial License

  13. Z

    English Language Training (ELT) Market By Product Type (English for Academic...

    • zionmarketresearch.com
    pdf
    Updated Nov 22, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Zion Market Research (2025). English Language Training (ELT) Market By Product Type (English for Academic Purposes, English as a Foreign Language, English for Speakers of Other Languages, English as an Additional Language, English as a Second Language, and English for Specific Purposes), By Application (White-collar Workers, Students, Migrants, Travelers, and Job Seekers), By End-user (Educational Institutions, Corporate Sector, Government Organizations, and Individuals), By Learning Method (Classroom, Online, and Blended), and By Region: Global and Regional Industry Overview, Market Intelligence, Comprehensive Analysis, Historical Data, and Forecasts 2025 - 2034 [Dataset]. https://www.zionmarketresearch.com/report/english-language-training-elt-market
    Explore at:
    pdfAvailable download formats
    Dataset updated
    Nov 22, 2025
    Dataset authored and provided by
    Zion Market Research
    License

    https://www.zionmarketresearch.com/privacy-policyhttps://www.zionmarketresearch.com/privacy-policy

    Time period covered
    2022 - 2030
    Area covered
    Global
    Description

    Global english language training (ELT) market size is expected to generate revenue of around USD 181.67 Billion by 2034, growing at a CAGR of around 8.1% between 2025 and 2034.

  14. Number of students learning the English language 2023, by country

    • statista.com
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista, Number of students learning the English language 2023, by country [Dataset]. https://www.statista.com/statistics/474379/english-language-students-per-country/
    Explore at:
    Dataset authored and provided by
    Statistahttp://statista.com/
    Time period covered
    2023
    Area covered
    Worldwide
    Description

    In 2023, the United Kingdom had the most students learning English as a foreign language. There were approximately ******* students who were learning English as a second language that year, followed by Australia with almost ******* foreign students. The third place ranking was completed by the U.S., with around ******* students learning English as a foreign language. The global English learning market Learning English has become increasingly important for young people, especially in terms of international communication, increasing employability at multinational firms or securing work abroad, and specializing in fields as a global expert. The United Kingdom had both the highest number of students learning English as a foreign language and the shared highest distribution of English language learning revenue. In the past few years, the English language learning market has experienced a decrease in revenue, perhaps partially due to more free online platforms becoming accessible, as well as more companies and schools implementing English learning systems. 2022, however, has shown signs of improvement in terms of revenue. In 2020, revenues generated by the English learning market were dramatically lower due to travel restrictions implemented to limit the spread of COVID-19, as this trend continued into 2021. The global language services industry Over the course of the past two decades, the need for bilingual and multilingual skills has increased substantially. In an increasingly globalized world that is becoming more and more connected over time, people have been motivated to learn additional languages to suit both personal and professional demands. The market size of the global language services industry has grown to reflect this, with expected revenues of around ** billion U.S. dollars projected for 2022.

  15. 743 Hours - UK English Spontaneous Dialogue Speech Dataset (Smartphone...

    • nexdata.ai
    Updated May 13, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Nexdata (2024). 743 Hours - UK English Spontaneous Dialogue Speech Dataset (Smartphone Recorded) [Dataset]. https://www.nexdata.ai/datasets/speechrecog/1393
    Explore at:
    Dataset updated
    May 13, 2024
    Dataset authored and provided by
    Nexdata
    Area covered
    United Kingdom
    Variables measured
    Format, Country, Speaker, Language, Accuracy rate, Content category, Recording device, Recording condition, Language(Region) Code, Features of annotation
    Description

    This dataset contains 200 hours of spontaneous UK English dialogue recorded using smartphones. Collected from approximately 500 native British English speakers across various regions, the dataset reflects real-world acoustic diversity. Each audio segment is accompanied by detailed annotations, including transcription, timestamp, speaker ID, gender, and more. Ideal for applications in automatic speech recognition (ASR), speech synthesis (TTS), speaker diarization, and natural language understanding (NLU), the dataset has been validated by leading AI companies. All data complies with global privacy regulations such as GDPR, CCPA, and PIPL.

  16. MCB_languages_county

    • kaggle.com
    zip
    Updated Oct 1, 2019
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Marisol Brewster (2019). MCB_languages_county [Dataset]. https://www.kaggle.com/mcbrewster/mcb-languages-county
    Explore at:
    zip(414833 bytes)Available download formats
    Dataset updated
    Oct 1, 2019
    Authors
    Marisol Brewster
    Description

    Context

    This is a dataset I found online through the Google Dataset Search portal.

    Content

    The American Community Survey (ACS) 2009-2013 multi-year data are used to list all languages spoken in the United States that were reported during the sample period. These tables provide detailed counts of many more languages than the 39 languages and language groups that are published annually as a part of the routine ACS data release. This is the second tabulation beyond 39 languages since ACS began.

    The tables include all languages that were reported in each geography during the 2009 to 2013 sampling period. For the purpose of tabulation, reported languages are classified in one of 380 possible languages or language groups. Because the data are a sample of the total population, there may be languages spoken that are not reported, either because the ACS did not sample the households where those languages are spoken, or because the person filling out the survey did not report the language or reported another language instead.

    The tables also provide information about self-reported English-speaking ability. Respondents who reported speaking a language other than English were asked to indicate their ability to speak English in one of the following categories: "Very well," "Well," "Not well," or "Not at all." The data on ability to speak English represent the person’s own perception about his or her own ability or, because ACS questionnaires are usually completed by one household member, the responses may represent the perception of another household member.

    These tables are also available through the Census Bureau's application programming interface (API). Please see the developers page for additional details on how to use the API to access these data.

    Acknowledgements

    Sources:

    Google Dataset Search: https://toolbox.google.com/datasetsearch

    2009-2013 American Community Survey

    Original dataset: https://www.census.gov/data/tables/2013/demo/2009-2013-lang-tables.html

    Downloaded From: https://data.world/kvaughn/languages-county

    Banner and thumbnail photo by Farzad Mohsenvand on Unsplash

  17. F

    British English Call Center Data for BFSI AI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). British English Call Center Data for BFSI AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/bfsi-call-center-conversation-english-uk
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    United Kingdom
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    This UK English Call Center Speech Dataset for the BFSI (Banking, Financial Services, and Insurance) sector is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for English-speaking customers. Featuring over 30 hours of real-world, unscripted audio, it offers authentic customer-agent interactions across a range of BFSI services to train robust and domain-aware ASR models.

    Curated by FutureBeeAI, this dataset empowers voice AI developers, financial technology teams, and NLP researchers to build high-accuracy, production-ready models across BFSI customer service scenarios.

    Speech Data

    The dataset contains 30 hours of dual-channel call center recordings between native UK English speakers. Captured in realistic financial support settings, these conversations span diverse BFSI topics from loan enquiries and card disputes to insurance claims and investment options, providing deep contextual coverage for model training and evaluation.

    Participant Diversity:
    Speakers: 60 native UK English speakers from our verified contributor pool.
    Regions: Representing multiple provinces across United Kingdom to ensure coverage of various accents and dialects.
    Participant Profile: Balanced gender mix (60% male, 40% female) with age distribution from 18 to 70 years.
    Recording Details:
    Conversation Nature: Naturally flowing, unscripted interactions between agents and customers.
    Call Duration: Ranges from 5 to 15 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, at 8kHz and 16kHz sample rates.
    Recording Environment: Captured in clean conditions with no echo or background noise.

    Topic Diversity

    This speech corpus includes both inbound and outbound calls with varied conversational outcomes like positive, negative, and neutral, ensuring real-world BFSI voice coverage.

    Inbound Calls:
    Debit Card Block Request
    Transaction Disputes
    Loan Enquiries
    Credit Card Billing Issues
    Account Closure & Claims
    Policy Renewals & Cancellations
    Retirement & Tax Planning
    Investment Risk Queries, and more
    Outbound Calls:
    Loan & Credit Card Offers
    Customer Surveys
    EMI Reminders
    Policy Upgrades
    Insurance Follow-ups
    Investment Opportunity Calls
    Retirement Planning Reviews, and more

    This variety ensures models trained on the dataset are equipped to handle complex financial dialogues with contextual accuracy.

    Transcription

    All audio files are accompanied by manually curated, time-coded verbatim transcriptions in JSON format.

    Transcription Includes:
    Speaker-Segmented Dialogues
    30 hours-coded Segments
    Non-speech Tags (e.g., pauses, background noise)
    High transcription accuracy with word error rate < 5% due to double-layered quality checks.

    These transcriptions are production-ready, making financial domain model training faster and more accurate.

    Metadata

    Rich metadata is available for each participant and conversation:

    Participant Metadata: ID, age, gender, accent,

  18. Callhome dataset

    • kaggle.com
    zip
    Updated Nov 2, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ngô Xuân Kiên (2025). Callhome dataset [Dataset]. https://www.kaggle.com/datasets/kienngx/callhome-dataset
    Explore at:
    zip(1787516032 bytes)Available download formats
    Dataset updated
    Nov 2, 2025
    Authors
    Ngô Xuân Kiên
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The CALLHOME audio dataset contains telephone conversations between English speakers. Each recording captures spontaneous, informal speech between family members or friends, making it valuable for research in speaker diarization, speech recognition, and language identification. The dataset provides time-aligned transcriptions and speaker labels, reflecting real-world conversational variability.

  19. F

    American English Call Center Data for Delivery & Logistics AI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). American English Call Center Data for Delivery & Logistics AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/delivery-call-center-conversation-english-usa
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    United States
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    This US English Call Center Speech Dataset for the Delivery and Logistics industry is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for English-speaking customers. With over 30 hours of real-world, unscripted call center audio, this dataset captures authentic delivery-related conversations essential for training high-performance ASR models.

    Curated by FutureBeeAI, this dataset empowers AI teams, logistics tech providers, and NLP researchers to build accurate, production-ready models for customer support automation in delivery and logistics.

    Speech Data

    The dataset contains 30 hours of dual-channel call center recordings between native US English speakers. Captured across various delivery and logistics service scenarios, these conversations cover everything from order tracking to missed delivery resolutions offering a rich, real-world training base for AI models.

    Participant Diversity:
    Speakers: 60 native US English speakers from our verified contributor pool.
    Regions: Multiple provinces of United States of America for accent and dialect diversity.
    Participant Profile: Balanced gender distribution (60% male, 40% female) with ages ranging from 18 to 70.
    Recording Details:
    Conversation Nature: Naturally flowing, unscripted customer-agent dialogues.
    Call Duration: 5 to 15 minutes on average.
    Audio Format: Stereo WAV, 16-bit depth, recorded at 8kHz and 16kHz.
    Recording Environment: Captured in clean, noise-free, echo-free conditions.

    Topic Diversity

    This speech corpus includes both inbound and outbound delivery-related conversations, covering varied outcomes (positive, negative, neutral) to train adaptable voice models.

    Inbound Calls:
    Order Tracking
    Delivery Complaints
    Undeliverable Addresses
    Return Process Enquiries
    Delivery Method Selection
    Order Modifications, and more
    Outbound Calls:
    Delivery Confirmations
    Subscription Offer Calls
    Incorrect Address Follow-ups
    Missed Delivery Notifications
    Delivery Feedback Surveys
    Out-of-Stock Alerts, and others

    This comprehensive coverage reflects real-world logistics workflows, helping voice AI systems interpret context and intent with precision.

    Transcription

    All recordings come with high-quality, human-generated verbatim transcriptions in JSON format.

    Transcription Includes:
    Speaker-Segmented Dialogues
    Time-coded Segments
    Non-speech Tags (e.g., pauses, noise)
    High transcription accuracy with word error rate under 5% via dual-layer quality checks.

    These transcriptions support fast, reliable model development for English voice AI applications in the delivery sector.

    Metadata

    Detailed metadata is included for each participant and conversation:

    Participant Metadata: ID, age, gender, region, accent, dialect.
    Conversation Metadata: Topic, call type, sentiment, sample rate, and technical attributes.

    This metadata aids in training specialized models, filtering demographics, and running advanced analytics.

    Usage and Applications

    <p

  20. u

    Speech Across Dialects of English: Acoustic Measures from SPADE Project...

    • datacatalogue.ukdataservice.ac.uk
    Updated Feb 21, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Stuart-Smith, J, University of Glasgow; Sonderegger, M, McGill University; Mielke, J, North Carolina State University (2024). Speech Across Dialects of English: Acoustic Measures from SPADE Project Corpora, 1949-2019 [Dataset]. http://doi.org/10.5255/UKDA-SN-854959
    Explore at:
    Dataset updated
    Feb 21, 2024
    Authors
    Stuart-Smith, J, University of Glasgow; Sonderegger, M, McGill University; Mielke, J, North Carolina State University
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Time period covered
    Jan 1, 1949 - Jan 1, 2019
    Area covered
    Ireland, Canada, United Kingdom, United States
    Description

    The SPADE project aims to develop and apply user-friendly software for large-scale speech analysis of existing public and private English speech datasets, in order to understand more about English speech over space and time. To date, we have worked with 42 shared corpora comprising dialects from across the British Isles (England, Wales, Scotland, Ireland) and North America (US, Canada), with an effective time span of over 100 years. We make available here a link to our OSF repository (see below) which has acoustic measures datasets for sibilants and durations and static formants for vowels, for 39 corpora (~2200 hours of speech analysed from ~8600 speakers), with information about dataset generation. In addition, at the OSF site, we provide Praat TextGrids created by SPADE for some corpora. Reading passage text is provided when the measures are based on reading only. Datasets are in their raw form and will require cleaning (e.g. outlier removal) before analysis. In addition, we used whitelisting to anonymise measures datasets generated from non-public, restricted corpora.

    Obtaining a data visualization of a text search within seconds via generic, large-scale search algorithms, such as Google n-gram viewer, is available to anyone. By contrast, speech research is only now entering its own 'big data' revolution. Historically, linguistic research has tended to carry out fine-grained analysis of a few aspects of speech from one or a few languages or dialects. The current scale of speech research studies has shaped our understanding of spoken language and the kinds of questions that we ask. Today, massive digital collections of transcribed speech are available from many different languages, gathered for many different purposes: from oral histories, to large datasets for training speech recognition systems, to legal and political interactions. Sophisticated speech processing tools exist to analyze these data, but require substantial technical skill. Given this confluence of data and tools, linguists have a new opportunity to answer fundamental questions about the nature and development of spoken language.

    Our project seeks to establish the key tools to enable large-scale speech research to become as powerful and pervasive as large-scale text mining. It is based on a partnership of three teams based in Scotland, Canada and the US. Together we exploit methods from computing science and put them to work with tools and methods from speech science, linguistics and digital humanities, to discover how much the sounds of English across the Atlantic vary over space and time.

    We have developed innovative and user-friendly software which exploits the availability of existing speech data and speech processing tools to facilitate large-scale integrated speech corpus analysis across many datasets together. The gains of such an approach are substantial: linguists will be able to scale up answers to existing research questions from one to many varieties of a language, and ask new and different questions about spoken language within and across social, regional, and cultural, contexts. Computational linguistics, speech technology, forensic and clinical linguistics researchers, who engage with variability in spoken language, will also benefit directly from our software. This project also opens up vast potential for those who already use digital scholarship for spoken language collections in the humanities and social sciences more broadly, e.g. literary scholars, sociologists, anthropologists, historians, political scientists. The possibility of ethically non-invasive inspection of speech and texts will allow analysts to uncover far more than is possible through textual analysis alone.

    Our project has developed and applied our new software to a global language, English, using existing public and private spoken datasets of Old World (British Isles) and New World (North American) English, across an effective time span of more than 100 years, spanning the entire 20th century. Much of what we know about spoken English comes from influential studies on a few specific aspects of speech from one or two dialects. This vast literature has established important research questions which has been investigated for the first time on a much larger scale, through standardized data across many different varieties of English.

    Our large-scale study complements current-scale studies, by enabling us to consider stability and change in English across the 20th century on an unparalleled scale. The global nature of English means that our findings will be interesting and relevant to a large international non-academic audience; they have been made accessible through an innovative and dynamic visualization of linguistic variation via an interactive sound mapping website. In addition to new insights into spoken English, this project also lays the crucial groundwork for large-scale speech studies across many datasets from different languages, of different formats and structures.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Statista, The most spoken languages worldwide 2025 [Dataset]. https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/
Organization logo

The most spoken languages worldwide 2025

Explore at:
464 scholarly articles cite this dataset (View in Google Scholar)
Dataset authored and provided by
Statistahttp://statista.com/
Time period covered
2025
Area covered
World
Description

In 2025, there were around 1.53 billion people worldwide who spoke English either natively or as a second language, slightly more than the 1.18 billion Mandarin Chinese speakers at the time of survey. Hindi and Spanish accounted for the third and fourth most widespread languages that year. Languages in the United States The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation, and other official pronouncements. The United States is a land of immigration, and the languages spoken in the United States vary as a result of the multicultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which over than 43 million people spoke at home in 2023. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese),1.8 million Tagalog speakers, and 1.57 million Vietnamese speakers counted in the United States that year. Different languages at home The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage of population speaking a language other than English is California. About 45 percent of its population was speaking a language other than English at home in 2023.

Search
Clear search
Close search
Google apps
Main menu