42 datasets found

Ranking of languages spoken at home in the U.S. 2023
statista.com
Updated Apr 14, 2025
Cite
Statista (2025). Ranking of languages spoken at home in the U.S. 2023 [Dataset]. https://www.statista.com/statistics/183483/ranking-of-languages-spoken-at-home-in-the-us-in-2008/
Explore at:
Dataset updated
Apr 14, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
2023
Area covered
United States
Description
In 2023, around 43.37 million people in the United States spoke Spanish at home. In comparison, approximately 998,179 people spoke Russian at home during the same year. Statista also provides the distribution of the U.S. population by ethnicity and a ranking of the most spoken languages across the world.
The most spoken languages worldwide 2025
statista.com
Updated Apr 14, 2025
Cite
Statista (2025). The most spoken languages worldwide 2025 [Dataset]. https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/
Explore at:
Dataset updated
Apr 14, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
2025
Area covered
World
Description
In 2025, there were around 1.53 billion people worldwide who spoke English either natively or as a second language, more than the 1.18 billion Mandarin Chinese speakers at the time of the survey. Hindi and Spanish were the third and fourth most widely spoken languages that year.

Languages in the United States
The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation, and other official pronouncements. The United States is a land of immigration, and the languages spoken there vary as a result of the multicultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which more than 43 million people spoke at home in 2023. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese), 1.8 million Tagalog speakers, and 1.57 million Vietnamese speakers counted in the United States that year.

Different languages at home
The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage is California, where about 45 percent of the population spoke a language other than English at home in 2023.
french-speech-recognition-dataset
huggingface.co
Updated Feb 21, 2025
Cite
Unidata (2025). french-speech-recognition-dataset [Dataset]. https://huggingface.co/datasets/UniDataPro/french-speech-recognition-dataset
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Feb 21, 2025
Authors
Unidata
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Area covered
French
Description
French Speech Dataset for recognition task

Dataset comprises 547 hours of telephone dialogues in French, collected from 964 native speakers across various topics and domains, with an impressive 98% Word Accuracy Rate. It is designed for research in speech recognition, focusing on various recognition models, primarily aimed at meeting the requirements for automatic speech recognition (ASR) systems. By utilizing this dataset, researchers and developers can advance their understanding… See the full description on the dataset page: https://huggingface.co/datasets/UniDataPro/french-speech-recognition-dataset.
Wake Word French Dataset
shaip.com
Updated Apr 5, 2024
Cite
Shaip (2024). Wake Word French Dataset [Dataset]. https://www.shaip.com/offerings/speech-data-catalog/wake-word-french-dataset/
Explore at:
Dataset updated
Apr 5, 2024
Dataset authored and provided by
Shaip
License
CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
French
Description
High-Quality French Wake Word Dataset for AI & Speech Models
Title: Wake Word French Language Dataset
Dataset Type: Wake Word
Description: Wake Words / Voice Command / Trigger Word…
calliphonie
huggingface.co
Updated Oct 21, 2024
Cite
datasets-CNRS (2024). calliphonie [Dataset]. https://huggingface.co/datasets/datasets-CNRS/calliphonie
Explore at:
Dataset updated
Oct 21, 2024
Dataset provided by
French National Centre for Scientific Research (http://www.cnrs.fr/)
Authors
datasets-CNRS
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
[!NOTE] Dataset origin: https://www.ortolang.fr/market/corpora/calliphonie

[!WARNING] You must go to the Ortolang website and log in to download the data.

Description

Content and technical data:

From Ref. 1

Two speakers (a female and a male, native speakers of French) recorded the corpus. They produced each sentence according to two different instructions: (1) emphasis on a specific word of the sentence (generally the verb) and (2)… See the full description on the dataset page: https://huggingface.co/datasets/datasets-CNRS/calliphonie.
Canadian French Retail Scripted Monologue Speech Dataset
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Canadian French Retail Scripted Monologue Speech Dataset [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/retail-scripted-speech-monologues-spanish-usa
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Area covered
Canada, French
Dataset funded by
FutureBeeAI
Description
Introduction
Welcome to the Canadian French Scripted Monologue Speech Dataset for the Retail & E-commerce domain. This dataset is built to accelerate the development of French language speech technologies especially for use in retail-focused automatic speech recognition (ASR), natural language processing (NLP), voicebots, and conversational AI applications.
Speech Data
This training dataset includes 6,000+ high-quality scripted audio recordings in Canadian French, created to reflect real-world scenarios in the Retail & E-commerce sector. These prompts are tailored to improve the accuracy and robustness of customer-facing speech technologies.
•Participant Diversity
  •Speakers: 60 native French speakers from across Canada
  •Geographic Coverage: Multiple Canadian regions to ensure dialect and accent diversity
  •Demographics: Participants aged 18 to 70, with a 60:40 male-to-female distribution
•Recording Details
  •Nature of Recording: Scripted monologue-style speech prompts
  •Duration: Each recording spans 5 to 30 seconds
  •Audio Format: WAV format, mono channel, 16-bit depth, and 8 kHz / 16 kHz sample rates
  •Environment: Recorded in quiet conditions, free from background noise and echo
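The recording specification above (WAV, mono, 16-bit, 8 kHz or 16 kHz, 5 to 30 seconds) can be checked file by file before training. The following is a minimal sketch using Python's standard `wave` module; the function name `check_wav` and the keys of the returned dictionary are illustrative assumptions, not part of any tooling shipped with the dataset.

```python
import wave

# Sample rates and clip-length bounds taken from the dataset description.
EXPECTED_RATES = {8000, 16000}
MIN_SECONDS, MAX_SECONDS = 5, 30


def check_wav(path):
    """Report whether one WAV file matches the described recording format."""
    with wave.open(path, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
        return {
            "mono": wav.getnchannels() == 1,
            "16_bit": wav.getsampwidth() == 2,  # sample width is in bytes
            "rate_ok": wav.getframerate() in EXPECTED_RATES,
            "duration_ok": MIN_SECONDS <= duration <= MAX_SECONDS,
            "duration_s": duration,
        }
```

Calling `check_wav` on each file and collecting the entries where any flag is false gives a quick list of recordings that deviate from the stated format.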

Topic Diversity
This dataset includes a comprehensive set of retail-specific topics to ensure wide linguistic coverage for AI training:
•Customer Service Interactions
•Order Placement and Payment Processes
•Product and Service Inquiries
•Technical Support Queries
•General Information and Guidance
•Promotional and Sales Announcements
•Domain-Specific Service Statements
Contextual Enrichment
To increase training utility, prompts include contextual data such as:
•Region-Specific Names: Common Canadian male and female names in diverse formats
•Addresses: Localized address variations spoken naturally
•Dates & Times: Realistic phrasing in delivery, promotions, and return policies
•Product References: Real-world product names, brands, and categories
•Numerical Data: Spoken numbers and prices used in transactions and offers
•Order IDs & Tracking Numbers: Common references in customer service calls

These additions help your models learn to recognize structured and unstructured retail-related speech.
Transcription
Every audio file is paired with a verbatim transcription, ensuring consistency and alignment for model training.
•Content: Exact scripted prompts as spoken by the participant
•Format: Provided in plain text (.TXT) format with filenames matching the associated audio
•Quality Assurance: All transcripts are verified for accuracy by native French transcribers
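Because transcripts share a filename with their audio, the two can be paired by filename stem. The sketch below assumes all files sit somewhere under one root directory and that extensions may be upper or lower case; the directory layout and the function name `pair_audio_with_transcripts` are assumptions made for illustration.

```python
from pathlib import Path


def pair_audio_with_transcripts(root):
    """Match each .wav file to the .txt file that has the same stem."""
    files = [p for p in Path(root).rglob("*") if p.is_file()]
    # Index transcripts by stem; compare extensions case-insensitively
    # because the description writes the extension as ".TXT".
    transcripts = {p.stem: p for p in files if p.suffix.lower() == ".txt"}
    pairs, missing = [], []
    for wav in sorted(p for p in files if p.suffix.lower() == ".wav"):
        txt = transcripts.get(wav.stem)
        if txt is None:
            missing.append(wav)  # audio without a transcript
        else:
            pairs.append((wav, txt))
    return pairs, missing
```

The `missing` list makes it easy to spot audio files that arrived without a transcript before they reach a training pipeline.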

Metadata
Detailed metadata is included to support filtering, analysis, and model evaluation:
Canadian French Scripted Monologue Speech Data for Telecom
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Canadian French Scripted Monologue Speech Data for Telecom [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/telecom-scripted-speech-monologues-spanish-usa
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Area covered
French, Canada
Dataset funded by
FutureBeeAI
Description
Introduction
Presenting the Canadian French Scripted Monologue Speech Dataset for the Telecom Domain, a purpose-built dataset created to accelerate the development of French speech recognition and voice AI models specifically tailored for the telecommunications industry.
Speech Data
This dataset includes over 6,000 high-quality scripted prompt recordings in Canadian French, representing real-world telecom customer service scenarios. It’s designed to support the training of speech-based AI systems used in call centers, virtual agents, and voice-powered support tools.
•Participant Diversity
  •Speakers: 60 native Canadian French speakers
  •Geographic Distribution: Carefully selected from multiple regions across Canada to capture a wide spectrum of dialects and speaking styles
  •Demographics: Male-to-female ratio of 60:40, aged 18 to 70 years
•Recording Specifications
  •Type: Scripted monologue prompts focused on telecom industry use cases
  •Duration: Each audio clip ranges from 5 to 30 seconds
  •Format: WAV files in mono, 16-bit depth, with sample rates of 8 kHz and 16 kHz
  •Environment: Clean, echo-free, and noise-controlled settings to ensure optimal audio clarity

Topic Coverage
The dataset reflects a wide variety of common telecom customer interactions, including:
•Customer onboarding and service inquiries
•Billing and payment questions
•Data plans and product information
•Technical support requests
•Network coverage discussions
•Regulatory compliance and policy information
•Upgrades, renewals, and service plan changes
•Domain-specific scripted interactions tailored to real-world telecom use cases
Contextual Depth
To maximize contextual richness, prompts include:
•Localized Names: Common Canadian names in various formats
•Addresses: Region-specific address structures for realism
•Dates & Times: Spoken date and time references in typical telecom scenarios (e.g., billing cycles, service activation times)
•Telecom Terminology: Keywords related to mobile data, network, SIM, devices, plans, etc.
•Numbers & Rates: Usage statistics, pricing info, recharge values, and billing figures
•Service Providers: References to telecom companies and third-party service entities

Transcription
Each audio file is paired with an accurate, verbatim transcription for precise model training:
•Content: Transcriptions are direct representations of each recorded prompt
•Format: Plain text (.TXT), with filenames matching their corresponding audio files
•Verification: Every transcription is manually verified by native Canadian French linguists to ensure consistency and accuracy

Metadata
Detailed metadata is included to
2025 Green Card Report for Education Teaching French To Speakers Of Other...
myvisajobs.com
Updated Jan 16, 2025
Cite
MyVisaJobs (2025). 2025 Green Card Report for Education Teaching French To Speakers Of Other Languages [Dataset]. https://www.myvisajobs.com/reports/green-card/major/education-teaching-french-to-speakers-of-other-languages/
Explore at:
Dataset updated
Jan 16, 2025
Dataset provided by
MyVisaJobs.com
Authors
MyVisaJobs
License
https://www.myvisajobs.com/terms-of-service/
Area covered
French
Variables measured
Major, Salary, Petitions Filed
Description
A dataset that explores Green Card sponsorship trends, salary data, and employer insights for the Education major Teaching French to Speakers of Other Languages in the U.S.
OrienTel French as spoken in Morocco database
catalogue.elra.info
live.european-language-grid.eu
Updated Feb 22, 2007
Cite
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2007). OrienTel French as spoken in Morocco database [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0185/
Explore at:
Dataset updated
Feb 22, 2007
Dataset provided by
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
ELRA (European Language Resources Association)
License
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
Area covered
French, Morocco
Description
The OrienTel French as spoken in Morocco database comprises 530 Moroccan speakers of French (264 males, 266 females) recorded over the Moroccan fixed and mobile telephone network. This database is partitioned into 1 CD and 1 DVD. The speech databases made within the OrienTel project were validated by SPEX, the Netherlands, to assess their compliance with the OrienTel format and content specifications.
Speech samples are stored as sequences of 8-bit 8 kHz A-law. Each prompted utterance is stored in a separate file. Each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information.
Each speaker uttered the following items:
•1 isolated single digit
•1 sequence of 10 isolated digits
•5 connected digits: 1 prompt sheet number (6 digits), 1 telephone number (6-15 digits), 1 credit card number (14-16 digits), 1 PIN code (6 digits), 1 spontaneous phone number
•1 currency money amount
•2 natural numbers
•3+1 dates: 1 prompted date, 1 relative or general date expression, 1 prompted date phrase + 1 additional (Western calendar)
•2 time phrases: 1 time of day (spontaneous), 1 time phrase (word style)
•3 spelled words: 1 spontaneous (own forename), 1 city name, 1 real word for coverage
•5 directory assistance utterances: 1 spontaneous, own forename, 1 city of childhood (spontaneous), 1 frequent city name, 1 frequent company name, 1 common forename and surname
•2 yes/no questions: 1 predominantly "yes" question, 1 predominantly "no" question
•6 application keywords/keyphrases
•1 word spotting phrase using embedded application words
•4 phonetically rich words
•9 phonetically rich sentences
•2 spontaneous items (for control)
The following age distribution has been obtained: 256 speakers are between 16 and 30, 210 speakers are between 31 and 45, 63 speakers are between 46 and 60, 1 speaker is over 60.
A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
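Since the samples are stored as 8-bit A-law, they need to be expanded to linear PCM before most toolkits can use them. The sketch below applies the standard G.711 A-law expansion, written by hand; it assumes each signal file is a raw, headerless stream of A-law bytes, which the catalogue entry does not state explicitly. The function names are illustrative.

```python
def alaw_to_linear(code):
    """Expand one 8-bit G.711 A-law code to a signed 16-bit PCM value."""
    code ^= 0x55                      # undo the even-bit inversion
    value = (code & 0x0F) << 4        # 4-bit mantissa
    segment = (code & 0x70) >> 4      # 3-bit segment (exponent)
    if segment == 0:
        value += 8
    else:
        value += 0x108
        value <<= segment - 1
    # In A-law a set sign bit marks a positive sample.
    return value if code & 0x80 else -value


def decode_alaw(data):
    """Decode a bytes object of A-law codes into a list of PCM samples."""
    return [alaw_to_linear(b) for b in data]
```

For a file, `decode_alaw(open(path, "rb").read())` yields one sample per byte at the 8 kHz rate given in the description.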
african_accented_french
huggingface.co
Updated Jun 7, 2022
Cite
Théo Gigant (2022). african_accented_french [Dataset]. https://huggingface.co/datasets/gigant/african_accented_french
Explore at:
Dataset updated
Jun 7, 2022
Authors
Théo Gigant
License
https://choosealicense.com/licenses/cc/
Area covered
French
Description
This corpus consists of approximately 22 hours of speech recordings. Transcripts are provided for all the recordings. The corpus can be divided into 3 parts:

Yaounde

Collected by a team from the U.S. Military Academy's Center for Technology Enhanced Language Learning (CTELL) in 2003 in Yaoundé, Cameroon. It has recordings from 84 speakers, 48 male and 36 female.

CA16

This part was collected by a RDECOM Science Team who participated in the United Nations exercise Central Accord 16 (CA16) in Libreville, Gabon in June 2016. The Science Team included DARPA's Dr. Boyan Onyshkevich and Dr. Aaron Lawson (SRI International), as well as RDECOM scientists. It has recordings from 125 speakers from Cameroon, Chad, Congo and Gabon.

Niger

This part was collected from 23 speakers in Niamey, Niger, Oct. 26-30 2015. These speakers were students in a course for officers and sergeants presented by Army trainers assigned to U.S. Army Africa. The data was collected by RDECOM Science & Technology Advisors Major Eddie Strimel and Mr. Bill Bergen.
Common languages used for web content 2025, by share of websites
statista.com
ai-chatbox.pro
Updated Feb 11, 2025
Cite
Statista (2025). Common languages used for web content 2025, by share of websites [Dataset]. https://www.statista.com/statistics/262946/most-common-languages-on-the-internet/
Explore at:
Dataset updated
Feb 11, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
Feb 2025
Area covered
Worldwide
Description
As of February 2025, English was the most popular language for web content, with over 49.4 percent of websites using it. Spanish ranked second, with six percent of web content, followed by German, with 5.6 percent.

English as the leading online language
The United States and India, the countries with the most internet users after China, are also the world's biggest English-speaking markets. The internet user base in both countries combined, as of January 2023, was over a billion individuals. This has led to most online information being created in English. Consequently, even those who are not native speakers may use it for convenience.

Global internet usage by regions
As of October 2024, the number of internet users worldwide was 5.52 billion. In the same period, Northern Europe and North America were leading in terms of internet penetration rates worldwide, with around 97 percent of their populations accessing the internet.
Percent of Population with Limited Ability to Speak English
coronavirus-resources.esri.com
data.amerigeoss.org
Updated Jul 3, 2019
Cite
Urban Observatory by Esri (2019). Percent of Population with Limited Ability to Speak English [Dataset]. https://coronavirus-resources.esri.com/maps/78a668915cbc4bf983330608f3d687aa
Explore at:
Dataset updated
Jul 3, 2019
Dataset authored and provided by
Urban Observatory by Esri
Area covered

Description
This map shows the percent of population with a limited ability to speak English by census tract. Search for your community and investigate the top language needs in nearby census tracts.
Data as of 2011-2015.
Data Source: U.S. Census Bureau's American Community Survey 5-year estimates, 2011-2015, Table B16001.
Complete list of all languages available in this data set (29): Spanish or Spanish Creole; French (including Patois, Cajun); French Creole; Italian; Portuguese; German; Yiddish; Greek; Russian; Polish; Serbo-Croatian; Armenian; Persian; Gujarati; Hindi; Urdu; Chinese; Japanese; Korean; Mon-Khmer, Cambodian; Hmong; Thai; Laotian; Vietnamese; Tagalog; Navajo; Hungarian; Arabic; Hebrew.
Those who have limited English ability and speak other languages are included in the percentage depicted in the map, but other languages will not appear in the ranked list or in the table.
Accompanying feature layer and viewing app are also available.
News Events Data in Latin America( Techsalerator)
datarade.ai
Updated Mar 20, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Techsalerator (2024). News Events Data in Latin America( Techsalerator) [Dataset]. https://datarade.ai/data-products/news-events-data-in-latin-america-techsalerator-techsalerator
Explore at:
Available download formats: .json, .csv, .xls, .txt
Dataset updated
Mar 20, 2024
Dataset provided by
Techsalerator LLC
Authors
Techsalerator
Area covered
Americas, Latin America, Argentina, Falkland Islands (Malvinas), Cuba, Aruba, Chile, French Guiana, Ecuador, Martinique, Montserrat, Dominican Republic
Description
Techsalerator’s News Event Data in Latin America offers a detailed and extensive dataset designed to provide businesses, analysts, journalists, and researchers with an in-depth view of significant news events across the Latin American region. This dataset captures and categorizes key events reported from a wide array of news sources, including press releases, industry news sites, blogs, and PR platforms, offering valuable insights into regional developments, economic changes, political shifts, and cultural events.

Key Features of the Dataset:

•Comprehensive Coverage: The dataset aggregates news events from numerous sources such as company press releases, industry news outlets, blogs, PR sites, and traditional news media. This broad coverage ensures a wide range of information from multiple reporting channels.
•Categorization of Events: News events are categorized into various types including business and economic updates, political developments, technological advancements, legal and regulatory changes, and cultural events. This categorization helps users quickly locate and analyze information relevant to their interests or sectors.
•Real-Time Updates: The dataset is updated regularly to include the most recent events, ensuring users have access to the latest news and can stay informed about current developments.
•Geographic Segmentation: Events are tagged with their respective countries and regions within Latin America. This geographic segmentation allows users to filter and analyze news events based on specific locations, facilitating targeted research and analysis.
•Event Details: Each event entry includes comprehensive details such as the date of occurrence, source of the news, a description of the event, and relevant keywords. This thorough detailing helps in understanding the context and significance of each event.
•Historical Data: The dataset includes historical news event data, enabling users to track trends and perform comparative analysis over time. This feature supports longitudinal studies and provides insights into how news events evolve.
•Advanced Search and Filter Options: Users can search and filter news events based on criteria such as date range, event type, location, and keywords. This functionality allows for precise and efficient retrieval of relevant information.

Latin American Countries Covered:

•South America: Argentina, Bolivia, Brazil, Chile, Colombia, Ecuador, Guyana, Paraguay, Peru, Suriname, Uruguay, Venezuela
•Central America: Belize, Costa Rica, El Salvador, Guatemala, Honduras, Nicaragua, Panama
•Caribbean: Cuba, Dominican Republic, Haiti (Note: Primarily French-speaking but included due to geographic and cultural ties), Jamaica, Trinidad and Tobago

Benefits of the Dataset:

•Strategic Insights: Businesses and analysts can use the dataset to gain insights into significant regional developments, economic conditions, and political changes, aiding in strategic decision-making and market analysis.
•Market and Industry Trends: The dataset provides valuable information on industry-specific trends and events, helping users understand market dynamics and emerging opportunities.
•Media and PR Monitoring: Journalists and PR professionals can track relevant news across Latin America, enabling them to monitor media coverage, identify emerging stories, and manage public relations efforts effectively.
•Academic and Research Use: Researchers can utilize the dataset for longitudinal studies, trend analysis, and academic research on various topics related to Latin American news and events.

Techsalerator’s News Event Data in Latin America is a crucial resource for accessing and analyzing significant news events across the region. By providing detailed, categorized, and up-to-date information, it supports effective decision-making, research, and media monitoring across diverse sectors.
Dialogue_francais_role_play
huggingface.co
Updated Jan 29, 2009
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
datasets-CNRS (2009). Dialogue_francais_role_play [Dataset]. https://huggingface.co/datasets/datasets-CNRS/Dialogue_francais_role_play
Explore at:
Dataset updated
Jan 29, 2009
Dataset provided by
French National Centre for Scientific Research (http://www.cnrs.fr/)
Authors
datasets-CNRS
Area covered
French
Description
[!NOTE] Dataset origin: https://www.ortolang.fr/market/corpora/sldr000738 and https://www.ortolang.fr/market/corpora/sldr000739

[!CAUTION] This dataset contains only the transcriptions. To retrieve the audio (sldr000738), you must go to the Ortolang website and log in to download the data.

Description

Dialogue in French (role-play). The speech material used here contains dialogues spoken by 38 native speakers of French (10 pairs of… See the full description on the dataset page: https://huggingface.co/datasets/datasets-CNRS/Dialogue_francais_role_play.
Canadian French Scripted Monologue Speech Dataset for BFSI
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Canadian French Scripted Monologue Speech Dataset for BFSI [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/bfsi-scripted-speech-monologues-spanish-usa
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Area covered
French, Canada
Dataset funded by
FutureBeeAI
Description
Introduction
Welcome to the Canadian French Scripted Monologue Speech Dataset tailored for the BFSI (Banking, Financial Services, and Insurance) domain. This dataset empowers the development of advanced French speech recognition systems, natural language understanding models, and conversational AI solutions focused on the BFSI sector.
Speech Data
This dataset includes over 6,000 scripted prompt recordings in Canadian French, covering a wide range of realistic banking and finance-related scenarios to support robust ASR and voice AI systems.
•Participant Diversity
  •Speakers: 60 native Canadian French speakers.
  •Regions: Diverse representation from various Canadian provinces to ensure dialect and accent coverage.
  •Demographics: Age range of 18–70, with a male-to-female ratio of 60:40.
•Recording Details
  •Nature: Scripted monologues and domain-specific prompt recordings.
  •Duration:
  •Audio Format: WAV, mono channel, 16-bit depth, recorded at 8 kHz and 16 kHz sample rates.
  •Environment: Clean, echo-free, and noise-free environments.
Topic & Context Diversity
This dataset spans multiple BFSI-related themes to simulate practical customer interaction scenarios:
•Customer service interactions
•Financial transactions & balance inquiries
•Banking and insurance product queries
•Loan & credit support
•Regulatory and compliance questions
•Technical help and password resets
•Promotional campaigns and service updates
Contextual Elements
To make the dataset as context-rich as possible, each prompt integrates commonly encountered real-world BFSI elements:
•Names: Region-specific names in multiple formats
•Addresses: Local address structures and pronunciations
•Dates & Times: Typical time expressions used in banking
•Organization Names: Names of banks, financial firms, and institutions
•Currencies & Amounts: Spoken currency formats, prices, and numeric data
•IDs & Transaction Numbers: For authentic service simulation

Transcription
Every audio file is paired with verbatim transcription to streamline ASR and NLP model development.
•Content: Exact match of each prompt
•Format: Clean .TXT files, mapped to audio file names
•Accuracy: Reviewed and validated by native Canadian French linguists

Metadata
Each data point is enriched with detailed metadata for advanced training and analysis:
•Participant Metadata: Unique ID, age, gender, state, country, dialect
•Recording Metadata: Transcript, recording setup, sample rate, bit depth, device, file format

Applications and Use Cases
This BFSI-focused dataset
OrienTel French as spoken in Tunisia database
catalogue.elra.info
live.european-language-grid.eu
Updated Feb 22, 2007
Cite
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2007). OrienTel French as spoken in Tunisia database [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0188/
Explore at:
Dataset updated
Feb 22, 2007
Dataset provided by
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
ELRA (European Language Resources Association)
License
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
Area covered
Tunisia, French
Description
The OrienTel French as spoken in Tunisia database comprises 576 Tunisian speakers of French (290 males, 286 females) recorded over the Tunisian fixed and mobile telephone network. This database is partitioned into 1 CD and 1 DVD. The speech databases made within the OrienTel project were validated by SPEX, the Netherlands, to assess their compliance with the OrienTel format and content specifications.
Speech samples are stored as sequences of 8-bit 8 kHz A-law. Each prompted utterance is stored in a separate file. Each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information.
Each speaker uttered the following items:
•1 isolated single digit
•1 sequence of 10 isolated digits
•5 connected digits: 1 prompt sheet number (6 digits), 1 telephone number (6-15 digits), 1 credit card number (14-16 digits), 1 PIN code (6 digits), 1 spontaneous phone number
•1 currency money amount
•2 natural numbers
•3 dates: 1 prompted date, 1 relative or general date expression, 1 prompted date phrase (Western calendar)
•2 time phrases: 1 time of day (spontaneous), 1 time phrase (word style)
•3 spelled words: 1 spontaneous (own forename), 1 city name, 1 real word for coverage
•5 directory assistance utterances: 1 spontaneous, own forename, 1 city of childhood (spontaneous), 1 frequent city name, 1 frequent company name, 1 common forename and surname
•2 yes/no questions: 1 predominantly "yes" question, 1 predominantly "no" question
•6 application keywords/keyphrases
•1 word spotting phrase using embedded application words
•4 phonetically rich words
•9 phonetically rich sentences
•2+3 spontaneous items (for control)
The following age distribution has been obtained: 2 speakers are below 16, 407 speakers are between 16 and 30, 104 speakers are between 31 and 45, 59 speakers are between 46 and 60, 4 speakers are over 60.
A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
MEDIA speech database for French
catalogue.elra.info
live.european-language-grid.eu
Updated Mar 27, 2008
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2008). MEDIA speech database for French [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0272/
Explore at:
Dataset updated
Mar 27, 2008
Dataset provided by
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
ELRA (European Language Resources Association)
License
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
Area covered
French
Description
The MEDIA speech database for French was produced by ELDA within the French national project MEDIA (Automatic evaluation of man-machine dialogue systems), as part of the Technolangue programme funded by the French Ministry of Research and New Technologies (MRNT). It contains 1,258 transcribed dialogues from 250 adult speakers. The method chosen for the corpus construction process is that of a ‘Wizard of Oz’ (WoZ) system. This consists of simulating a natural language man-machine dialogue. The scenario was built in the domain of tourism and hotel reservation.

The database is formatted following the SpeechDat conventions and it includes the following items:
• 1,258 recorded sessions for a total of 70 hours of speech. The signals are stored in a stereo wave file format. Each of the two speech channels is recorded at 8 kHz with 16 bit quantization with the least significant byte first (“lohi” or Intel format) as signed integers.
• Manual transcription of each session in XML format. Label files were created with the free transcription tool Transcriber (TRS files).
• Phonetic lexicon containing all the words spoken in the database. Column 1 contains the orthography of the French word. Column 2 shows the frequency of the word. Column 3 contains the pronunciation in SAMPA format. Here is a sample entry of the lexicon: 1)agitée3A/ Z i t e
• Documentation and statistics are also provided with the database.

The semantic annotation of the corpus is available in this catalogue and referenced ELRA-E0024 (MEDIA Evaluation Package).
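The signal layout described above (stereo wave files, two 8 kHz channels, 16-bit signed little-endian samples) can be read with Python's standard `wave` module. The sketch below is illustrative only: the helper names are hypothetical, and the demo builds a tiny in-memory file rather than touching any real MEDIA recording.

```python
import io
import struct
import wave


def split_stereo_pcm16(frames: bytes) -> tuple[list[int], list[int]]:
    """De-interleave 16-bit little-endian stereo frames into two channels."""
    count = len(frames) // 2
    samples = struct.unpack(f"<{count}h", frames[: count * 2])
    return list(samples[0::2]), list(samples[1::2])


def read_channels(source) -> tuple[int, list[int], list[int]]:
    """Return (sample_rate, first_channel, second_channel) from a stereo WAV."""
    with wave.open(source, "rb") as wav:
        if wav.getnchannels() != 2 or wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit stereo audio")
        rate = wav.getframerate()
        frames = wav.readframes(wav.getnframes())
    first, second = split_stereo_pcm16(frames)
    return rate, first, second


def make_demo_wav() -> io.BytesIO:
    """Build a three-frame stereo 8 kHz file in memory for demonstration."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(2)
        wav.setsampwidth(2)
        wav.setframerate(8000)
        wav.writeframes(struct.pack("<6h", 100, -100, 200, -200, 300, -300))
    buffer.seek(0)
    return buffer


rate, first, second = read_channels(make_demo_wav())
```

Which of the two channels carries which side of the dialogue is not stated in the description, so the sketch labels them only by position.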
BREF-120 - A large corpus of French read speech
catalogue.elra.info
live.european-language-grid.eu
Updated Feb 22, 2007
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2007). BREF-120 - A large corpus of French read speech [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0067/
Explore at:
Dataset updated
Feb 22, 2007
Dataset provided by
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
ELRA (European Language Resources Association)
License
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
Area covered
French
Description
BREF-120 resulted from the efforts of LIMSI-CNRS researchers under sponsorship from the GDR-PRC CHM, the ACCT (OFIL), the EEC (ESPRIT Polyglot project), and the Aupelf-Uref.

A sub-set of BREF-120 is BREF-80 (ELRA-S0006), which consists of about 50-60 sentences per speaker and recordings conducted only with a Shure microphone. In BREF-80, the sentences were chosen to cover as many prompts as possible.

The BREF-120 corpus was designed to provide read speech data for the development and evaluation of continuous speech recognition systems (both speaker-dependent and speaker-independent), and to provide a large corpus of continuous speech for the acquisition of acoustic-phonetic knowledge of spoken French.

BREF-120 is a large read-speech corpus containing over 100 hours of speech material, from 120 speakers (55 males and 65 females). The text materials were selected verbatim from extracts of the French newspaper "Le Monde". Each of 80 speakers read approximately 10,000 words (about 650 sentences) of text, and another 40 speakers each read about half that amount. Simultaneous recordings were made in a sound-proof room using a Shure SM10 microphone and a Crown PCC160 microphone and were monitored to assure their contents. The speech signal was sampled at 16 kHz and digitised with 16 bits. The BREF-120 corpus contains 28 CDs; numbers 1-13 contain the Shure recorded data and numbers 14-28 contain the Crown recorded data.
Data and Code for: The Impact of Host Language Proficiency on Migrants'...
openicpsr.org
delimited
Updated Jan 6, 2023
Lukas Schmid (2023). Data and Code for: The Impact of Host Language Proficiency on Migrants' Employment Outcomes [Dataset]. http://doi.org/10.3886/E183861V1
Explore at:
Available download formats: delimited
Unique identifier
https://doi.org/10.3886/E183861V1
Dataset updated
Jan 6, 2023
Dataset provided by
American Economic Association
Authors
Lukas Schmid
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jan 1, 2008 - Dec 31, 2017
Area covered
Switzerland
Description
This paper estimates the economic gains from proficiency in the host country's language on migrants' employment outcomes by exploiting the exogenous placement of refugees to Swiss cantons and a sharp language border dividing German- and French-speaking regions. Using administrative data on African refugees who applied for Swiss asylum between 2008 and 2017, I compare French-speaking refugees assigned to the French-speaking region to French-speaking refugees assigned to the German-speaking region, and adjust for common regional differences with outcomes from English-speaking African refugees. The results suggest that language proficiency more than doubles the employment level in the first five years after arrival.
Global English Speech with Accent Conversational Dataset — Multi-Region...
datarade.ai
.wav
Updated Jul 21, 2025
+ more versions
FileMarket (2025). Global English Speech with Accent Conversational Dataset — Multi-Region Validated Speech with Gender, Age & Metadata for AI & NLP Training [Dataset]. https://datarade.ai/data-products/global-english-speech-with-accent-conversational-dataset-mu-filemarket
Explore at:
Available download formats: .wav
Dataset updated
Jul 21, 2025
Dataset authored and provided by
FileMarket
Area covered
Nicaragua, United States Minor Outlying Islands, Tonga, Iceland, Comoros, Haiti, Montenegro, Cook Islands, Bangladesh, Yemen
Description
The Global English Accent Conversational NLP Dataset is a comprehensive collection of validated English speech recordings sourced from native and non-native English speakers across key global regions. This dataset is designed for training Natural Language Processing models, conversational AI, Automatic Speech Recognition (ASR), and linguistic research, with a focus on regional accent variation.

Regions and Covered Countries with Primary Spoken Languages:

Africa: South Africa (English, Zulu, Afrikaans, Xhosa); Nigeria (English, Yoruba, Igbo, Hausa); Kenya (English, Swahili); Ghana (English, Twi, Ewe, Ga); Uganda (English, Luganda); Ethiopia (English, Amharic, Oromo)

Central & South America: Mexico (Spanish, English as a second language); Guatemala (Spanish, K'iche', English); El Salvador (Spanish, English); Costa Rica (Spanish, English in Caribbean regions); Colombia (Spanish, English in urban centers); Dominican Republic (Spanish, English in tourist zones); Brazil (Portuguese, English in urban areas); Argentina (Spanish, English among educated speakers)

Southeast Asia & South Asia: Philippines (Filipino, English); Vietnam (Vietnamese, English); Malaysia (Malay, English, Mandarin); Indonesia (Indonesian, Javanese, English); Singapore (English, Mandarin, Malay, Tamil); India (Hindi, English, Bengali, Tamil); Pakistan (Urdu, English, Punjabi)

Europe: United Kingdom (English); Ireland (English, Irish); Germany (German, English); France (French, English); Spain (Spanish, Catalan, English); Italy (Italian, English); Portugal (Portuguese, English)

Oceania: Australia (English); New Zealand (English, Māori); Fiji (English, Fijian)

North America: United States (English, Spanish); Canada (English, French)

Dataset Attributes:
- Conversational English with natural accent variation
- Global coverage with balanced male/female speakers
- Rich speaker metadata: age, gender, country, city
- Average audio length of ~30 minutes per participant
- All samples manually validated for accuracy
- Structured format suitable for machine learning and AI applications

Best suited for:
- NLP model training and evaluation
- Multilingual ASR system development
- Voice assistant and chatbot design
- Accent recognition research
- Voice synthesis and TTS modeling

This dataset ensures global linguistic diversity and delivers high-quality audio for AI developers, researchers, and enterprises working on voice-based applications.

Ranking of languages spoken at home in the U.S. 2023: 15 scholarly articles cite this dataset (View in Google Scholar).
