In 2023, around 43.37 million people in the United States spoke Spanish at home. In comparison, approximately 998,179 people spoke Russian at home during the same year.
In 2025, there were around 1.53 billion people worldwide who spoke English either natively or as a second language, slightly more than the 1.18 billion Mandarin Chinese speakers at the time of survey. Hindi and Spanish were the third and fourth most widespread languages that year.

Languages in the United States
The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation, and other official pronouncements. The United States is a land of immigration, and the languages spoken there vary as a result of the multicultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which more than 43 million people spoke at home in 2023. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese), 1.8 million Tagalog speakers, and 1.57 million Vietnamese speakers counted in the United States that year.

Different languages at home
The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage of population speaking a language other than English is California. About 45 percent of its population spoke a language other than English at home in 2023.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The dataset comprises 547 hours of telephone dialogues in French, collected from 964 native speakers across various topics and domains, with a 98% Word Accuracy Rate. It is designed for research in speech recognition, focusing on various recognition models, primarily aimed at meeting the requirements for automatic speech recognition (ASR) systems.
By utilizing this dataset, researchers and developers can advance their understanding and capabilities in natural language processing (NLP), speech recognition, and machine learning technologies.
The dataset includes high-quality audio recordings with accurate transcriptions, making it ideal for training and evaluating speech recognition models.
The range of native speakers, topics, and domains covered in the dataset makes it an ideal resource for the research community, allowing researchers to study spoken languages, dialects, and language patterns.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Wake Word French Dataset: High-Quality French Wake Word Dataset for AI & Speech Models
Overview
Title: Wake Word French Language Dataset
Dataset Type: Wake Word
Description: Wake Words / Voice Command / Trigger Word…
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
The MEDIA speech database for French was produced by ELDA within the French national project MEDIA (Automatic evaluation of man-machine dialogue systems), as part of the Technolangue programme funded by the French Ministry of Research and New Technologies (MRNT). It contains 1,258 transcribed dialogues from 250 adult speakers. The method chosen for the corpus construction process is that of a ‘Wizard of Oz’ (WoZ) system. This consists of simulating a natural language man-machine dialogue. The scenario was built in the domain of tourism and hotel reservation. The database is formatted following the SpeechDat conventions and it includes the following items:
• 1,258 recorded sessions for a total of 70 hours of speech. The signals are stored in a stereo wave file format. Each of the two speech channels is recorded at 8 kHz with 16 bit quantization with the least significant byte first (“lohi” or Intel format) as signed integers.
• Manual transcription of each session in XML format. Label files were created with the free transcription tool Transcriber (TRS files).
• Phonetic lexicon containing all the words spoken in the database. Column 1 contains the orthography of the French word. Column 2 shows the frequency of the word. Column 3 contains the pronunciation in SAMPA format. Here is a sample entry of the lexicon: 1) agitée 3 A/ Z i t e
• Documentation and statistics are also provided with the database.
The semantic annotation of the corpus is available in this catalogue and referenced ELRA-E0024 (MEDIA Evaluation Package).
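The signal format described above (stereo wave file, two 8 kHz channels, 16-bit little-endian signed integers) can be read with Python's standard library alone. The following is a minimal sketch, not the distributor's tooling: the function name `split_stereo_channels` is hypothetical, and which channel carries which speaker is not stated in the description, so the two channels are simply returned in file order.

```python
import struct
import wave


def split_stereo_channels(path):
    """Return (sample_rate, first_channel, second_channel) for a 16-bit stereo WAV.

    Assumes the layout described for the corpus: two channels,
    16-bit signed samples, least significant byte first.
    """
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 2 or wav.getsampwidth() != 2:
            raise ValueError("expected a 16-bit stereo wave file")
        rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())

    # "<h" is a little-endian signed 16-bit integer ("lohi" / Intel order).
    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)

    # Stereo samples are interleaved: first, second, first, second, ...
    return rate, list(samples[0::2]), list(samples[1::2])
```

For a file that matches the description, `rate` should come back as 8000 and both channel lists should have the same length.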
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
[!NOTE] Dataset origin: https://www.ortolang.fr/market/corpora/calliphonie
[!WARNING] You must go to the ORTOLANG site and log in to download the data.
Description
Content and technical data:
From Ref. 1
Two speakers (a female and a male, native speakers of French) recorded the corpus. They produced each sentence according to two different instructions: (1) emphasis on a specific word of the sentence (generally the verb) and (2)… See the full description on the dataset page: https://huggingface.co/datasets/datasets-CNRS/calliphonie.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Canadian French Scripted Monologue Speech Dataset for the Retail & E-commerce domain. This dataset is built to accelerate the development of French language speech technologies especially for use in retail-focused automatic speech recognition (ASR), natural language processing (NLP), voicebots, and conversational AI applications.
This training dataset includes 6,000+ high-quality scripted audio recordings in Canadian French, created to reflect real-world scenarios in the Retail & E-commerce sector. These prompts are tailored to improve the accuracy and robustness of customer-facing speech technologies.
This dataset includes a comprehensive set of retail-specific topics to ensure wide linguistic coverage for AI training:
To increase training utility, prompts include contextual data such as:
These additions help your models learn to recognize structured and unstructured retail-related speech.
Every audio file is paired with a verbatim transcription, ensuring consistency and alignment for model training.
Detailed metadata is included to support filtering, analysis, and model evaluation:
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
The OrienTel French as spoken in Morocco database comprises 530 Moroccan speakers of French (264 males, 266 females) recorded over the Moroccan fixed and mobile telephone network. This database is partitioned into 1 CD and 1 DVD. The speech databases made within the OrienTel project were validated by SPEX, the Netherlands, to assess their compliance with the OrienTel format and content specifications.
Speech samples are stored as sequences of 8-bit 8 kHz A-law. Each prompted utterance is stored in a separate file. Each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information.
Each speaker uttered the following items:
• 1 isolated single digit
• 1 sequence of 10 isolated digits
• 5 connected digits: 1 prompt sheet number (6 digits), 1 telephone number (6-15 digits), 1 credit card number (14-16 digits), 1 PIN code (6 digits), 1 spontaneous phone number
• 1 currency money amount
• 2 natural numbers
• 3+1 dates: 1 prompted date, 1 relative or general date expression, 1 prompted date phrase + 1 additional (Western calendar)
• 2 time phrases: 1 time of day (spontaneous), 1 time phrase (word style)
• 3 spelled words: 1 spontaneous (own forename), 1 city name, 1 real word for coverage
• 5 directory assistance utterances: 1 spontaneous, own forename, 1 city of childhood (spontaneous), 1 frequent city name, 1 frequent company name, 1 common forename and surname
• 2 yes/no questions: 1 predominantly "yes" question, 1 predominantly "no" question
• 6 application keywords/keyphrases
• 1 word spotting phrase using embedded application words
• 4 phonetically rich words
• 9 phonetically rich sentences
• 2 spontaneous items (for control)
The following age distribution has been obtained: 256 speakers are between 16 and 30, 210 speakers are between 31 and 45, 63 speakers are between 46 and 60, 1 speaker is over 60.
A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
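Since the samples are stored as 8-bit 8 kHz A-law, they have to be expanded to linear PCM before most tools can process them. The sketch below implements the standard G.711 A-law expansion in pure Python. It assumes the signal files are headerless byte streams, which the description does not state explicitly, and the function names are illustrative rather than part of any OrienTel tooling.

```python
def alaw_to_linear(code):
    """Expand one 8-bit A-law code (0-255) to a signed 16-bit PCM value (G.711)."""
    code ^= 0x55                      # undo the even-bit inversion applied by the encoder
    magnitude = (code & 0x0F) << 4    # 4-bit mantissa
    segment = (code & 0x70) >> 4      # 3-bit segment (exponent)
    if segment == 0:
        magnitude += 8
    else:
        magnitude += 0x108
        magnitude <<= segment - 1
    # In A-law, a set sign bit means a positive sample.
    return magnitude if code & 0x80 else -magnitude


def decode_alaw(data):
    """Decode a bytes object of A-law codes into a list of PCM samples."""
    return [alaw_to_linear(byte) for byte in data]


def duration_seconds(data, sample_rate=8000):
    """Duration of an A-law byte stream: one byte per sample at 8 kHz."""
    return len(data) / sample_rate
```

A decoded utterance can then be written out as 16-bit PCM or passed to a feature extractor. A file of 16,000 bytes, for example, corresponds to two seconds of speech at 8 kHz.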
"Facts and Figures, Profiles of Official Language Immigrants: English Speaking Permanent Residents in Quebec presents the annual intake of English-speaking permanent residents in the province of Quebec by category of immigration from 2006 to 2015. The report examines selected characteristics for English-speaking permanent residents.
“English-speaking immigrants” are defined by the following criteria: 1) permanent residents with English as Mother Tongue; 2) permanent residents with Mother Tongue other than English and with “English Only” as official language spoken (excluding “Both English and French” as official language spoken). Note that official language(s) spoken (English only, French only, both French and English, and neither language) are self-declared indicators of knowledge of an official language.
Please note that in these datasets, the figures have been suppressed or rounded to prevent the identification of individuals when the datasets are compiled and compared with other publicly available statistics. Values between 0 and 5 are shown as “--” and all other values are rounded to the nearest multiple of 5. As a result, the sum of the figures may not equal the totals indicated. "
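The suppression and rounding rule explains why published cells may not add up to the published totals. The sketch below illustrates this under one stated assumption: "between 0 and 5" is read as strictly between (1 to 4), which the note does not make explicit. The function name `mask_count` and the example counts are made up for illustration.

```python
def mask_count(value):
    """Apply the disclosure rule to one non-negative integer count.

    Assumption: counts of 1-4 are suppressed and shown as "--";
    every other count is rounded to the nearest multiple of 5.
    """
    if 0 < value < 5:
        return "--"
    return (value + 2) // 5 * 5


# Hypothetical true counts for three categories and their true total.
cells = [6, 6, 6]
true_total = sum(cells)                       # 18

published_cells = [mask_count(c) for c in cells]
published_total = mask_count(true_total)

# Each cell rounds down to 5 while the total rounds up to 20,
# so the published cells sum to 15, not to the published total.
print(published_cells, published_total)
```

Because each figure is rounded independently, differences of this kind are expected and do not indicate an error in the data.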
About 5 231 500 people reported to the 2001 Census that they were bilingual, compared with 4 841 300 five years earlier, an 8.1% increase. In 2001, these individuals represented 17.7% of the population, up from 17.0% in 1996. Nationally, 43.4% of francophones reported that they were bilingual, compared with 9.0% of anglophones. Within Quebec, the growth in the bilingualism rate from 1996 to 2001 was even greater than in the previous five-year period. In 2001, two out of every five individuals (40.8%) reported that they were bilingual, compared with 37.8% in 1996 and 35.4% in 1991.
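The figures in this paragraph can be checked with simple arithmetic. The sketch below recomputes the percentage increase in the number of bilingual respondents and compares the two five-year gains in Quebec's bilingualism rate, using only the numbers quoted above. The helper name `percent_change` is illustrative.

```python
def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100


# Bilingual respondents: 1996 versus 2001.
increase = percent_change(4_841_300, 5_231_500)

# Quebec bilingualism rate, gains in percentage points.
gain_1991_1996 = 37.8 - 35.4
gain_1996_2001 = 40.8 - 37.8

print(round(increase, 1))          # percentage increase in bilingual respondents
print(round(gain_1991_1996, 1))    # gain over 1991-1996
print(round(gain_1996_2001, 1))    # gain over 1996-2001
```

The recomputed increase rounds to the 8.1% quoted, and the 1996 to 2001 gain in Quebec is larger than the 1991 to 1996 gain, consistent with the text.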
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We aim to collect, clean, and store corpora of Fon and French sentences for Natural Language Processing research, including Neural Machine Translation, Named Entity Recognition, etc., for Fon, a very low-resourced and endangered African native language.
Fon (also called Fongbe) is an African-indigenous language spoken mostly in Benin, Togo, and Nigeria - by about 2 million people.
As training data is crucial to the high performance of a machine learning model, the aim of this project is to compile the largest set of training corpora for the research and design of translation and NLP models involving Fon.
Through crowdsourcing and Google Forms surveys, we gathered and cleaned 25,377 parallel Fon-French sentences, all based on daily conversations.
The following people contributed to the crowdsourcing, creation, and cleaning of this version:
1) Name: Bonaventure DOSSOU; Affiliation: MSc Student in Data Engineering, Jacobs University; Contact: femipancrace.dossou@gmail.com
2) Name: Ricardo AHOUNVLAME; Affiliation: Student in Linguistics; Contact: tontonjars@gmail.com
3) Name: Fabroni YOCLOUNON; Affiliation: Creator of the Label IamYourClounon; Contact: iamyourclounon@gmail.com
4) Name: BeninLangues; Affiliation: BeninLangues; Contact: https://beninlangues.com/
5) Name: Chris Emezue; Affiliation: MSc Student in Mathematics in Data Science, Technical University of Munich; Contact: chris.emezue@gmail.com
To join as a contributor, please contact us at: 1) https://twitter.com/bonadossou 2) https://twitter.com/ChrisEmezue 3) https://twitter.com/edAIOfficial
Or contact Bonaventure Dossou (femipancrace.dossou@gmail.com) or Chris Emezue (chris.emezue@gmail.com).
Clavier Fongbé (WebView): https://bonaventuredossou.github.io/clavierfongbe/ (Made by Bonaventure Dossou)
Clavier Fongbé (Mobile Android Version): https://play.google.com/store/apps/details?id=com.fulbertodev.clavierfongbe&hl=en&gl=US (Fabroni Yoclounon, Bonaventure Dossou et al.)
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Canadian French Scripted Monologue Speech Dataset tailored for the BFSI (Banking, Financial Services, and Insurance) domain. This dataset empowers the development of advanced French speech recognition systems, natural language understanding models, and conversational AI solutions focused on the BFSI sector.
This dataset includes over 6,000 scripted prompt recordings in Canadian French, covering a wide range of realistic banking and finance-related scenarios to support robust ASR and voice AI systems.
This dataset spans multiple BFSI-related themes to simulate practical customer interaction scenarios:
To make the dataset as context-rich as possible, each prompt integrates commonly encountered real-world BFSI elements:
Every audio file is paired with verbatim transcription to streamline ASR and NLP model development.
Each data point is enriched with detailed metadata for advanced training and analysis:
This BFSI-focused dataset
https://www.futurebeeai.com/policies/ai-data-license-agreement
Presenting the Canadian French Scripted Monologue Speech Dataset for the Telecom Domain, a purpose-built dataset created to accelerate the development of French speech recognition and voice AI models specifically tailored for the telecommunications industry.
This dataset includes over 6,000 high-quality scripted prompt recordings in Canadian French, representing real-world telecom customer service scenarios. It’s designed to support the training of speech-based AI systems used in call centers, virtual agents, and voice-powered support tools.
The dataset reflects a wide variety of common telecom customer interactions, including:
To maximize contextual richness, prompts include:
Each audio file is paired with an accurate, verbatim transcription for precise model training:
Detailed metadata is included to
https://www.myvisajobs.com/terms-of-service/
A dataset that explores Green Card sponsorship trends, salary data, and employer insights for the education field of teaching French to speakers of other languages in the U.S.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
🇫🇷 French TTS Dataset
This dataset contains French speech audio paired with clean transcriptions, intended for training text-to-speech models such as Spark-TTS or Coqui TTS.
📁 Contents
dataset.parquet: metadata file with audio paths, transcriptions, speaker info
Audio/: directory of all .wav files used for training
📊 Dataset Structure
The dataset.parquet file includes the following columns:
Column Description
audio Path to .wav file
text… See the full description on the dataset page: https://huggingface.co/datasets/Buck26/tts-french-dataset.
https://choosealicense.com/licenses/cc/
This corpus consists of approximately 22 hours of speech recordings. Transcripts are provided for all the recordings. The corpus can be divided into 3 parts:
Collected by a team from the U.S. Military Academy's Center for Technology Enhanced Language Learning (CTELL) in 2003 in Yaoundé, Cameroon. It has recordings from 84 speakers, 48 male and 36 female.
This part was collected by a RDECOM Science Team who participated in the United Nations exercise Central Accord 16 (CA16) in Libreville, Gabon in June 2016. The Science Team included DARPA's Dr. Boyan Onyshkevich and Dr. Aaron Lawson (SRI International), as well as RDECOM scientists. It has recordings from 125 speakers from Cameroon, Chad, Congo and Gabon.
This part was collected from 23 speakers in Niamey, Niger, Oct. 26-30 2015. These speakers were students in a course for officers and sergeants presented by Army trainers assigned to U.S. Army Africa. The data was collected by RDECOM Science & Technology Advisors Major Eddie Strimel and Mr. Bill Bergen.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
COVID-19 has infected many people in France.
The dataset is no longer updated. It contains almost all French metropolitan regions plus overseas regions, last updated on March 9, 2020. If you want to help update this dataset, see the contributions section below.
The intention of this dataset is to put all published information about COVID-19 patients in France in a CSV file.
Source of data: press releases of the French regional health agencies, transcribed into a CSV file by a GitHub community.
This work is inspired by a similar Kaggle dataset made in South Korea.
We need more contributors to build this dataset and keep it updated. Join us on GitHub.
Contributors: Lior Perez, Samia Drappeau, Manon Fourniol, Zoragna, Raphaël Presberg
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
All cities with a population > 1000 or seats of administrative divisions (ca. 80,000).
Sources and Contributions
Sources: GeoNames aggregates over a hundred different data sources.
Ambassadors: GeoNames Ambassadors help in many countries.
Wiki: A wiki allows users to view the data, quickly fix errors, and add missing places.
Donations and Sponsoring: Costs for running GeoNames are covered by donations and sponsoring.
Enrichment: add country name
Techsalerator’s News Event Data in Latin America offers a detailed and extensive dataset designed to provide businesses, analysts, journalists, and researchers with an in-depth view of significant news events across the Latin American region. This dataset captures and categorizes key events reported from a wide array of news sources, including press releases, industry news sites, blogs, and PR platforms, offering valuable insights into regional developments, economic changes, political shifts, and cultural events.
Key Features of the Dataset:

Comprehensive Coverage: The dataset aggregates news events from numerous sources such as company press releases, industry news outlets, blogs, PR sites, and traditional news media. This broad coverage ensures a wide range of information from multiple reporting channels.

Categorization of Events: News events are categorized into various types including business and economic updates, political developments, technological advancements, legal and regulatory changes, and cultural events. This categorization helps users quickly locate and analyze information relevant to their interests or sectors.

Real-Time Updates: The dataset is updated regularly to include the most recent events, ensuring users have access to the latest news and can stay informed about current developments.

Geographic Segmentation: Events are tagged with their respective countries and regions within Latin America. This geographic segmentation allows users to filter and analyze news events based on specific locations, facilitating targeted research and analysis.

Event Details: Each event entry includes comprehensive details such as the date of occurrence, source of the news, a description of the event, and relevant keywords. This thorough detailing helps in understanding the context and significance of each event.

Historical Data: The dataset includes historical news event data, enabling users to track trends and perform comparative analysis over time. This feature supports longitudinal studies and provides insights into how news events evolve.

Advanced Search and Filter Options: Users can search and filter news events based on criteria such as date range, event type, location, and keywords. This functionality allows for precise and efficient retrieval of relevant information.

Latin American Countries Covered:

South America: Argentina, Bolivia, Brazil, Chile, Colombia, Ecuador, Guyana, Paraguay, Peru, Suriname, Uruguay, Venezuela
Central America: Belize, Costa Rica, El Salvador, Guatemala, Honduras, Nicaragua, Panama
Caribbean: Cuba, Dominican Republic, Haiti (Note: Primarily French-speaking but included due to geographic and cultural ties), Jamaica, Trinidad and Tobago

Benefits of the Dataset:

Strategic Insights: Businesses and analysts can use the dataset to gain insights into significant regional developments, economic conditions, and political changes, aiding in strategic decision-making and market analysis.

Market and Industry Trends: The dataset provides valuable information on industry-specific trends and events, helping users understand market dynamics and emerging opportunities.

Media and PR Monitoring: Journalists and PR professionals can track relevant news across Latin America, enabling them to monitor media coverage, identify emerging stories, and manage public relations efforts effectively.

Academic and Research Use: Researchers can utilize the dataset for longitudinal studies, trend analysis, and academic research on various topics related to Latin American news and events.

Techsalerator’s News Event Data in Latin America is a crucial resource for accessing and analyzing significant news events across the region. By providing detailed, categorized, and up-to-date information, it supports effective decision-making, research, and media monitoring across diverse sectors.
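The search criteria mentioned above (date range, event type, location, keywords) can be illustrated with a small in-memory filter. This is only a sketch: the `NewsEvent` record, its field names, and the sample records are hypothetical, modelled on the event details the description lists, and do not reflect Techsalerator's actual schema or API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class NewsEvent:
    """Hypothetical record shape based on the fields named in the description."""
    occurred_on: date
    source: str
    description: str
    keywords: tuple
    country: str
    event_type: str


def filter_events(events, start=None, end=None, event_type=None,
                  country=None, keyword=None):
    """Return the events that match every criterion that was supplied."""
    matches = []
    for event in events:
        if start is not None and event.occurred_on < start:
            continue
        if end is not None and event.occurred_on > end:
            continue
        if event_type is not None and event.event_type != event_type:
            continue
        if country is not None and event.country != country:
            continue
        if keyword is not None and keyword.lower() not in (
            k.lower() for k in event.keywords
        ):
            continue
        matches.append(event)
    return matches


# Made-up sample records, for illustration only.
sample_events = [
    NewsEvent(date(2024, 1, 15), "press release", "Example merger notice",
              ("merger", "retail"), "Brazil", "business"),
    NewsEvent(date(2024, 3, 2), "industry news", "Example regulation update",
              ("regulation",), "Chile", "legal"),
    NewsEvent(date(2023, 11, 20), "blog", "Example product launch",
              ("launch", "retail"), "Brazil", "technology"),
]

brazil_2024 = filter_events(sample_events, start=date(2024, 1, 1), country="Brazil")
retail = filter_events(sample_events, keyword="Retail")
```

Criteria left as `None` are ignored, so the same function covers a single-field lookup as well as a combined query.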
https://www.futurebeeai.com/policies/ai-data-license-agreement
Introducing the Canadian French Scripted Monologue Speech Dataset for the Healthcare Domain, a voice dataset built to accelerate the development and deployment of French language automatic speech recognition (ASR) systems, with a sharp focus on real-world healthcare interactions.
This dataset includes over 6,000 high-quality scripted audio prompts recorded in Canadian French, representing typical voice interactions found in the healthcare industry. The data is tailored for use in voice technology systems that power virtual assistants, patient-facing AI tools, and intelligent customer service platforms.
The prompts span a broad range of healthcare-specific interactions, such as:
To maximize authenticity, the prompts integrate linguistic elements and healthcare-specific terms such as:
These elements make the dataset exceptionally suited for training AI systems to understand and respond to natural healthcare-related speech patterns.
Every audio recording is accompanied by a verbatim, manually verified transcription.