62 datasets found

Ranking of languages spoken at home in the U.S. 2023
statista.com
Updated Apr 14, 2025
Cite
Statista (2025). Ranking of languages spoken at home in the U.S. 2023 [Dataset]. https://www.statista.com/statistics/183483/ranking-of-languages-spoken-at-home-in-the-us-in-2008/
Explore at:
Dataset updated
Apr 14, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
2023
Area covered
United States
Description
In 2023, around 43.37 million people in the United States spoke Spanish at home. In comparison, approximately 998,179 people were speaking Russian at home during the same year. The distribution of the U.S. population by ethnicity can be accessed here. A ranking of the most spoken languages across the world can be accessed here.
The most spoken languages worldwide 2025
statista.com
Updated Apr 14, 2025
Cite
Statista (2025). The most spoken languages worldwide 2025 [Dataset]. https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/
Explore at:
Dataset updated
Apr 14, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
2025
Area covered
World
Description
In 2025, there were around 1.53 billion people worldwide who spoke English either natively or as a second language, more than the 1.18 billion Mandarin Chinese speakers at the time of survey. Hindi and Spanish accounted for the third and fourth most widespread languages that year.

Languages in the United States

The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation, and other official pronouncements. The United States is a land of immigration, and the languages spoken in the United States vary as a result of the multicultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which more than 43 million people spoke at home in 2023. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese), 1.8 million Tagalog speakers, and 1.57 million Vietnamese speakers counted in the United States that year.

Different languages at home

The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage of population speaking a language other than English is California. About 45 percent of its population was speaking a language other than English at home in 2023.
French Speech Recognition Dataset
kaggle.com
Updated Jun 25, 2025
Cite
Unidata (2025). French Speech Recognition Dataset [Dataset]. https://www.kaggle.com/datasets/unidpro/french-speech-recognition-dataset
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Jun 25, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Unidata
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Area covered
French
Description
French Speech Dataset for recognition task

The dataset comprises 547 hours of telephone dialogues in French, collected from 964 native speakers across various topics and domains, with a 98% Word Accuracy Rate. It is designed for speech recognition research and is aimed primarily at the requirements of automatic speech recognition (ASR) systems.

By utilizing this dataset, researchers and developers can advance their understanding and capabilities in natural language processing (NLP), speech recognition, and machine learning technologies.

The dataset includes high-quality audio recordings with accurate transcriptions, making it ideal for training and evaluating speech recognition models.

💵 Buy the Dataset: This is a limited preview of the data. To access the full dataset, please contact us at https://unidata.pro to discuss your requirements and pricing options.

Metadata for the dataset

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F22059654%2Fb7af35fb0b3dabe083683bebd27fc5e5%2Fweweewew.PNG?generation=1739885543448162&alt=media

Audio files: High-quality recordings in WAV format

Text transcriptions: Accurate and detailed transcripts for each audio segment

Speaker information: Metadata on native speakers, including gender, etc.

Topics: Diverse domains such as general conversations, business, etc.

The native speakers and the variety of topics and domains covered make the dataset a useful resource for the research community, allowing researchers to study spoken languages, dialects, and language patterns.

🌐 UniData provides high-quality datasets, content moderation, data collection and annotation for your AI/ML projects
Wake Word French Dataset
shaip.com
Updated Apr 5, 2024
Cite
Shaip (2024). Wake Word French Dataset [Dataset]. https://www.shaip.com/offerings/speech-data-catalog/wake-word-french-dataset/
Explore at:
Dataset updated
Apr 5, 2024
Dataset authored and provided by
Shaip
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
French
Description
High-Quality French Wake Word Dataset for AI & Speech Models
Title: Wake Word French Language Dataset
Dataset Type: Wake Word
Description: Wake Words / Voice Command / Trigger Word…
MEDIA speech database for French
catalogue.elra.info
live.european-language-grid.eu
Updated Mar 27, 2008
Cite
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2008). MEDIA speech database for French [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0272/
Explore at:
Dataset updated
Mar 27, 2008
Dataset provided by
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
ELRA (European Language Resources Association)
License
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
Area covered
French
Description
The MEDIA speech database for French was produced by ELDA within the French national project MEDIA (Automatic evaluation of man-machine dialogue systems), as part of the Technolangue programme funded by the French Ministry of Research and New Technologies (MRNT). It contains 1,258 transcribed dialogues from 250 adult speakers. The method chosen for the corpus construction process is that of a ‘Wizard of Oz’ (WoZ) system. This consists of simulating a natural language man-machine dialogue. The scenario was built in the domain of tourism and hotel reservation. The database is formatted following the SpeechDat conventions and it includes the following items:
•1,258 recorded sessions for a total of 70 hours of speech. The signals are stored in a stereo wave file format. Each of the two speech channels is recorded at 8 kHz with 16 bit quantization with the least significant byte first (“lohi” or Intel format) as signed integers.
•Manual transcription of each session in XML format. Label files were created with the free transcription tool Transcriber (TRS files).
•Phonetic lexicon containing all the words spoken in the database. Column 1 contains the orthography of the French word. Column 2 shows the frequency of the word. Column 3 contains the pronunciation in SAMPA format. Here is a sample entry of the lexicon: 1) agitée 3 A/ Z i t e
•Documentation and statistics are also provided with the database.
The semantic annotation of the corpus is available in this catalogue and referenced ELRA-E0024 (MEDIA Evaluation Package).
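The signal format described above (stereo, 8 kHz per channel, 16-bit signed samples, least significant byte first) can be illustrated with a short Python sketch. It assumes the two channels are interleaved frame by frame, which is usual for WAV data but is not stated in the description. The `demo` buffer is made up for illustration, and the size estimate assumes the 70 hours refer to stereo recording time.

```python
import struct

SAMPLE_RATE_HZ = 8_000   # per channel, from the description
BYTES_PER_SAMPLE = 2     # 16-bit quantization
CHANNELS = 2             # stereo: two speech channels


def bytes_per_second() -> int:
    """Raw data rate of the described signal format."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS


def decode_frames(raw: bytes) -> list[tuple[int, int]]:
    """Decode interleaved stereo frames of little-endian signed 16-bit samples.

    Assumption: samples alternate channel 1, channel 2 within each frame.
    Any trailing partial frame is ignored.
    """
    frame_size = BYTES_PER_SAMPLE * CHANNELS
    usable = len(raw) - len(raw) % frame_size
    # "<hh" = little-endian ("lohi"), two signed 16-bit integers
    return [struct.unpack_from("<hh", raw, i) for i in range(0, usable, frame_size)]


# Hypothetical two-frame buffer, not taken from the corpus
demo = bytes([0x01, 0x00, 0xFF, 0xFF, 0x00, 0x80, 0xFF, 0x7F])
frames = decode_frames(demo)

# Rough uncompressed size of 70 hours at this data rate, in gigabytes
approx_gb = bytes_per_second() * 70 * 3600 / 1e9
```

At 32,000 bytes per second, 70 hours of audio comes to roughly 8 GB uncompressed, which gives a sense of the storage needed before ordering the corpus.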
calliphonie
huggingface.co
Updated Oct 21, 2024
Cite
datasets-CNRS (2024). calliphonie [Dataset]. https://huggingface.co/datasets/datasets-CNRS/calliphonie
Explore at:
Dataset updated
Oct 21, 2024
Dataset provided by
French National Centre for Scientific Research (http://www.cnrs.fr/)
Authors
datasets-CNRS
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
[!NOTE] Dataset origin: https://www.ortolang.fr/market/corpora/calliphonie

[!WARNING] You must go to the Ortolang website and log in to download the data.

Description

Content and technical data:

From Ref. 1

Two speakers (a female and a male, native speakers of French) recorded the corpus. They produced each sentence according to two different instructions: (1) emphasis on a specific word of the sentence (generally the verb) and (2)… See the full description on the dataset page: https://huggingface.co/datasets/datasets-CNRS/calliphonie.
Canadian French Retail Scripted Monologue Speech Dataset
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Canadian French Retail Scripted Monologue Speech Dataset [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/retail-scripted-speech-monologues-spanish-usa
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Area covered
Canada, French
Dataset funded by
FutureBeeAI
Description
Introduction
Welcome to the Canadian French Scripted Monologue Speech Dataset for the Retail & E-commerce domain. This dataset is built to accelerate the development of French language speech technologies especially for use in retail-focused automatic speech recognition (ASR), natural language processing (NLP), voicebots, and conversational AI applications.
Speech Data
This training dataset includes 6,000+ high-quality scripted audio recordings in Canadian French, created to reflect real-world scenarios in the Retail & E-commerce sector. These prompts are tailored to improve the accuracy and robustness of customer-facing speech technologies.
•Participant Diversity
  •Speakers: 60 native French speakers from across Canada
  •Geographic Coverage: Multiple Canadian regions to ensure dialect and accent diversity
  •Demographics: Participants aged 18 to 70, with a 60:40 male-to-female distribution
•Recording Details
  •Nature of Recording: Scripted monologue-style speech prompts
  •Duration: Each recording spans 5 to 30 seconds
  •Audio Format: WAV format, mono channel, 16-bit depth, and 8 kHz / 16 kHz sample rates
  •Environment: Recorded in quiet conditions, free from background noise and echo
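A downloaded file can be checked against the recording details listed above (WAV, mono, 16-bit, 8 kHz or 16 kHz) with Python's standard `wave` module. The sketch below is only an illustration: `matches_spec` and `make_silent_wav` are hypothetical helper names, and the silent test clips are generated in memory rather than taken from the dataset.

```python
import io
import wave

ALLOWED_RATES_HZ = {8_000, 16_000}  # sample rates named in the listing


def matches_spec(wav_file) -> bool:
    """Return True if the WAV is mono, 16-bit, and sampled at 8 or 16 kHz.

    `wav_file` may be a path or a binary file-like object.
    """
    with wave.open(wav_file, "rb") as w:
        return (
            w.getnchannels() == 1                   # mono channel
            and w.getsampwidth() == 2               # 2 bytes = 16-bit depth
            and w.getframerate() in ALLOWED_RATES_HZ
        )


def make_silent_wav(rate_hz: int, channels: int = 1, seconds: float = 0.01) -> io.BytesIO:
    """Build a short silent 16-bit WAV in memory (test fixture, not dataset audio)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(2)
        w.setframerate(rate_hz)
        w.writeframes(b"\x00\x00" * channels * int(rate_hz * seconds))
    buf.seek(0)  # rewind so the buffer can be read back
    return buf
```

For example, `matches_spec(make_silent_wav(16_000))` holds, while a 44.1 kHz or stereo clip fails the check.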

Topic Diversity
This dataset includes a comprehensive set of retail-specific topics to ensure wide linguistic coverage for AI training:
•Customer Service Interactions
•Order Placement and Payment Processes
•Product and Service Inquiries
•Technical Support Queries
•General Information and Guidance
•Promotional and Sales Announcements
•Domain-Specific Service Statements
Contextual Enrichment
To increase training utility, prompts include contextual data such as:
•Region-Specific Names: Common Canadian male and female names in diverse formats
•Addresses: Localized address variations spoken naturally
•Dates & Times: Realistic phrasing in delivery, promotions, and return policies
•Product References: Real-world product names, brands, and categories
•Numerical Data: Spoken numbers and prices used in transactions and offers
•Order IDs & Tracking Numbers: Common references in customer service calls

These additions help your models learn to recognize structured and unstructured retail-related speech.
Transcription
Every audio file is paired with a verbatim transcription, ensuring consistency and alignment for model training.
•Content: Exact scripted prompts as spoken by the participant
•Format: Provided in plain text (.TXT) format with filenames matching the associated audio
•Quality Assurance: All transcripts are verified for accuracy by native French transcribers
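Because transcripts share their filenames with the associated audio, the two can be paired by file stem. The following Python sketch shows one way to do that; the folder layout and the file names `rec_001` and `rec_002` are hypothetical, since the listing does not describe the delivered directory structure.

```python
from pathlib import Path
import tempfile


def pair_by_stem(root: Path):
    """Return ({stem: (audio_path, transcript_path)}, [unmatched stems]).

    Extensions are compared case-insensitively, so both .txt and .TXT match.
    """
    files = [p for p in root.rglob("*") if p.is_file()]
    audio = {p.stem: p for p in files if p.suffix.lower() == ".wav"}
    text = {p.stem: p for p in files if p.suffix.lower() == ".txt"}
    paired = {s: (audio[s], text[s]) for s in sorted(audio.keys() & text.keys())}
    # Stems present on only one side: missing audio or missing transcript
    unmatched = sorted(audio.keys() ^ text.keys())
    return paired, unmatched


def demo():
    """Build a throwaway folder with hypothetical file names and pair them."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for name in ("rec_001.wav", "rec_001.TXT", "rec_002.wav"):
            (root / name).touch()
        paired, unmatched = pair_by_stem(root)
        return sorted(paired), unmatched
```

Reporting unmatched stems separately makes it easy to spot recordings that arrived without a transcript before training starts.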

Metadata
Detailed metadata is included to support filtering, analysis, and model evaluation:
OrienTel French as spoken in Morocco database
catalogue.elra.info
live.european-language-grid.eu
Updated Feb 22, 2007
Cite
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2007). OrienTel French as spoken in Morocco database [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0185/
Explore at:
Dataset updated
Feb 22, 2007
Dataset provided by
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
ELRA (European Language Resources Association)
License
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
Area covered
Morocco, French
Description
The OrienTel French as spoken in Morocco database comprises 530 Moroccan speakers of French (264 males, 266 females) recorded over the Moroccan fixed and mobile telephone network. This database is partitioned into 1 CD and 1 DVD. The speech databases made within the OrienTel project were validated by SPEX, the Netherlands, to assess their compliance with the OrienTel format and content specifications.
Speech samples are stored as sequences of 8-bit 8 kHz A-law. Each prompted utterance is stored in a separate file. Each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information.
Each speaker uttered the following items:
•1 isolated single digit
•1 sequence of 10 isolated digits
•5 connected digits: 1 prompt sheet number (6 digits), 1 telephone number (6-15 digits), 1 credit card number (14-16 digits), 1 PIN code (6 digits), 1 spontaneous phone number
•1 currency money amount
•2 natural numbers
•3+1 dates: 1 prompted date, 1 relative or general date expression, 1 prompted date phrase + 1 additional (Western calendar)
•2 time phrases: 1 time of day (spontaneous), 1 time phrase (word style)
•3 spelled words: 1 spontaneous (own forename), 1 city name, 1 real word for coverage
•5 directory assistance utterances: 1 spontaneous, own forename, 1 city of childhood (spontaneous), 1 frequent city name, 1 frequent company name, 1 common forename and surname
•2 yes/no questions: 1 predominantly “yes” question, 1 predominantly “no” question
•6 application keywords/keyphrases
•1 word spotting phrase using embedded application words
•4 phonetically rich words
•9 phonetically rich sentences
•2 spontaneous items (for control)
The following age distribution has been obtained: 256 speakers are between 16 and 30, 210 speakers are between 31 and 45, 63 speakers are between 46 and 60, 1 speaker is over 60.
A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
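The speaker counts quoted in this description can be cross-checked with a few lines of Python. This is simple arithmetic on the figures stated above; the dictionary keys and the `share_percent` helper are illustrative names, not part of the database documentation.

```python
SPEAKERS_TOTAL = 530  # total stated in the description

by_gender = {"male": 264, "female": 266}
by_age = {"16-30": 256, "31-45": 210, "46-60": 63, "over 60": 1}


def share_percent(count: int, total: int = SPEAKERS_TOTAL) -> float:
    """Share of the speaker population, in percent, rounded to one decimal."""
    return round(100 * count / total, 1)


# Both breakdowns should add up to the stated total
gender_ok = sum(by_gender.values()) == SPEAKERS_TOTAL
age_ok = sum(by_age.values()) == SPEAKERS_TOTAL

age_shares = {group: share_percent(n) for group, n in by_age.items()}
```

Both breakdowns sum to 530, and the age shares show that nearly half of the speakers fall in the 16 to 30 group, which may matter when the data is used to model older voices.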
Facts and Figures 2015: Profiles of Official Language Immigrants: English...
data.amerigeoss.org
open.canada.ca
xls
Updated Jul 22, 2019
Cite
Canada (2019). Facts and Figures 2015: Profiles of Official Language Immigrants: English Speaking Permanent Residents inside Quebec [Dataset]. https://data.amerigeoss.org/dataset/caa61377-f34c-4f31-89ae-a57c8a73f99d
Explore at:
Available download formats: xls
Dataset updated
Jul 22, 2019
Dataset provided by
Canada
Area covered
Québec City, Quebec
Description
"Facts and Figures, Profiles of Official Language Immigrants: English Speaking Permanent Residents in Quebec presents the annual intake of English-speaking permanent residents in the province of Quebec by category of immigration from 2006 to 2015. The report examines selected characteristics for English-speaking permanent residents.

“English-speaking immigrants” are defined by the following criteria: 1) permanent residents with English as Mother Tongue; 2) permanent residents with Mother Tongue other than English and with “English Only” as official language spoken (excluding “Both English and French” as official language spoken). Note that official language(s) spoken (English only, French only, both French and English, and neither language) are self-declared indicators of knowledge of an official language.

Please note that in these datasets, the figures have been suppressed or rounded to prevent the identification of individuals when the datasets are compiled and compared with other publicly available statistics. Values between 0 and 5 are shown as “--” and all other values are rounded to the nearest multiple of 5. As a result, the sum of the figures may not equal the totals indicated."
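The suppression and rounding rule described above can be sketched in Python to show why published components may not add up to the published total. The function name `disclose` and the sample counts are hypothetical, and the sketch assumes that the range "between 0 and 5" includes both endpoints, which the description does not spell out.

```python
def disclose(value: int) -> str:
    """Apply the stated disclosure rule to one non-negative count.

    Assumption: 0 and 5 themselves are suppressed along with 1 to 4.
    """
    if value < 0:
        raise ValueError("counts cannot be negative")
    if value <= 5:
        return "--"                      # suppressed small value
    # Integer rounding to the nearest multiple of 5 (no ties for integers)
    return str(5 * ((value + 2) // 5))


# Hypothetical raw counts for three categories and their true total
raw_components = [12, 13, 4]
raw_total = sum(raw_components)

shown_components = [disclose(v) for v in raw_components]
shown_total = disclose(raw_total)

# Sum of the components that are still visible after suppression
visible_sum = sum(int(s) for s in shown_components if s != "--")
```

With these made-up counts the visible components add up to 25 while the published total is 30, which is the kind of mismatch the note warns about.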
English-French Bilingualism, 2001 (by census division)
data.amerigeoss.org
data.urbandatacentre.ca
+5more
jp2, zip
Updated Jul 22, 2019
Cite
Canada (2019). English-French Bilingualism, 2001 (by census division) [Dataset]. https://data.amerigeoss.org/tl/dataset/e52da9e1-8893-11e0-9c34-6cf049291510
Explore at:
Available download formats: zip, jp2
Dataset updated
Jul 22, 2019
Dataset provided by
Canada
Area covered
French
Description
About 5 231 500 people reported to the 2001 Census that they were bilingual, compared with 4 841 300 five years earlier, an 8.1% increase. In 2001, these individuals represented 17.7% of the population, up from 17.0% in 1996. Nationally, 43.4% of francophones reported that they were bilingual, compared with 9.0% of anglophones. Within Quebec, the growth in the bilingualism rate from 1996 to 2001 was even greater than in the previous five-year period. In 2001, two out of every five individuals (40.8%) reported that they were bilingual, compared with 37.8% in 1996 and 35.4% in 1991.
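The growth figures in this description can be reproduced with a short calculation. The Python sketch below uses only the numbers quoted above; the variable names are illustrative, and the implied population is a back-of-envelope estimate derived from the rounded 17.7% share, not an official census figure.

```python
def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100


bilingual_1996 = 4_841_300
bilingual_2001 = 5_231_500

# Increase in the number of bilingual people between the two censuses
growth = percent_change(bilingual_1996, bilingual_2001)

# Population implied by "17.7% of the population" (rough estimate only)
implied_population_2001 = bilingual_2001 / 0.177

# Quebec: gain in percentage points over each five-year period
quebec_gain_1991_1996 = 37.8 - 35.4
quebec_gain_1996_2001 = 40.8 - 37.8
```

The computed growth rounds to the 8.1% quoted in the text, and the Quebec gain of about 3.0 points for 1996 to 2001 exceeds the roughly 2.4 points of the previous period, matching the description.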
Fon French Daily Dialogues Parallel Data
live.european-language-grid.eu
huggingface.co
+1more
csv
Updated Apr 11, 2024
Cite
(2024). Fon French Daily Dialogues Parallel Data [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7709
Explore at:
Available download formats: csv
Dataset updated
Apr 11, 2024
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
French
Description
We aim to collect, clean, and store corpora of Fon and French sentences for Natural Language Processing research, including Neural Machine Translation, Named Entity Recognition, etc., for Fon, a very low-resourced and endangered African native language.
Fon (also called Fongbe) is an African-indigenous language spoken mostly in Benin, Togo, and Nigeria by about 2 million people.
As training data is crucial to the high performance of a machine learning model, the aim of this project is to compile the largest set of training corpora for the research and design of translation and NLP models involving Fon.
Through crowdsourcing and Google Forms surveys, we gathered and cleaned 25,377 parallel Fon-French sentences, all based on daily conversations.
The following people and organizations contributed to the crowdsourcing, creation, and cleaning of this version:
1) Name: Bonaventure DOSSOU; Affiliation: MSc Student in Data Engineering, Jacobs University; Contact: femipancrace.dossou@gmail.com
2) Name: Ricardo AHOUNVLAME; Affiliation: Student in Linguistics; Contact: tontonjars@gmail.com
3) Name: Fabroni YOCLOUNON; Affiliation: Creator of the Label IamYourClounon; Contact: iamyourclounon@gmail.com
4) Name: BeninLangues; Affiliation: BeninLangues; Contact: https://beninlangues.com/
5) Name: Chris Emezue; Affiliation: MSc Student in Mathematics in Data Science, Technical University of Munich; Contact: chris.emezue@gmail.com
To join as a contributor, please contact us at: 1) https://twitter.com/bonadossou 2) https://twitter.com/ChrisEmezue 3) https://twitter.com/edAIOfficial
Or contact Bonaventure Dossou (femipancrace.dossou@gmail.com), Chris Emezue (chris.emezue@gmail.com)
Clavier Fongbé (WebView): https://bonaventuredossou.github.io/clavierfongbe/ (Made by Bonaventure Dossou)
Clavier Fongbé (Mobile Android Version): https://play.google.com/store/apps/details?id=com.fulbertodev.clavierfongbe&hl=en&gl=US (Fabroni Yoclounon, Bonaventure Dossou et al.)
Canadian French Scripted Monologue Speech Dataset for BFSI
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Canadian French Scripted Monologue Speech Dataset for BFSI [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/bfsi-scripted-speech-monologues-spanish-usa
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Area covered
Canada, French
Dataset funded by
FutureBeeAI
Description
Introduction
Welcome to the Canadian French Scripted Monologue Speech Dataset tailored for the BFSI (Banking, Financial Services, and Insurance) domain. This dataset empowers the development of advanced French speech recognition systems, natural language understanding models, and conversational AI solutions focused on the BFSI sector.
Speech Data
This dataset includes over 6,000 scripted prompt recordings in Canadian French, covering a wide range of realistic banking and finance-related scenarios to support robust ASR and voice AI systems.
•Participant Diversity
  •Speakers: 60 native Canadian French speakers.
  •Regions: Diverse representation from various Canadian provinces to ensure dialect and accent coverage.
  •Demographics: Age range of 18–70, with a male-to-female ratio of 60:40.
•Recording Details
  •Nature: Scripted monologues and domain-specific prompt recordings.
  •Duration:
  •Audio Format: WAV, mono channel, 16-bit depth, recorded at 8 kHz and 16 kHz sample rates.
  •Environment: Clean, echo-free, and noise-free environments.
Topic & Context Diversity
This dataset spans multiple BFSI-related themes to simulate practical customer interaction scenarios:
•Customer service interactions
•Financial transactions & balance inquiries
•Banking and insurance product queries
•Loan & credit support
•Regulatory and compliance questions
•Technical help and password resets
•Promotional campaigns and service updates
Contextual Elements
To make the dataset as context-rich as possible, each prompt integrates commonly encountered real-world BFSI elements:
•Names: Region-specific names in multiple formats
•Addresses: Local address structures and pronunciations
•Dates & Times: Typical time expressions used in banking
•Organization Names: Names of banks, financial firms, and institutions
•Currencies & Amounts: Spoken currency formats, prices, and numeric data
•IDs & Transaction Numbers: For authentic service simulation

Transcription
Every audio file is paired with verbatim transcription to streamline ASR and NLP model development.
•Content: Exact match of each prompt
•Format: Clean .TXT files, mapped to audio file names
•Accuracy: Reviewed and validated by native Canadian French linguists

Metadata
Each data point is enriched with detailed metadata for advanced training and analysis:
•Participant Metadata: Unique ID, age, gender, state, country, dialect
•Recording Metadata: Transcript, recording setup, sample rate, bit depth, device, file format

Applications and Use Cases
This BFSI-focused dataset
Canadian French Scripted Monologue Speech Data for Telecom
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Canadian French Scripted Monologue Speech Data for Telecom [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/telecom-scripted-speech-monologues-spanish-usa
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Area covered
Canada, French
Dataset funded by
FutureBeeAI
Description
Introduction
Presenting the Canadian French Scripted Monologue Speech Dataset for the Telecom Domain, a purpose-built dataset created to accelerate the development of French speech recognition and voice AI models specifically tailored for the telecommunications industry.
Speech Data
This dataset includes over 6,000 high-quality scripted prompt recordings in Canadian French, representing real-world telecom customer service scenarios. It’s designed to support the training of speech-based AI systems used in call centers, virtual agents, and voice-powered support tools.
•Participant Diversity
  •Speakers: 60 native Canadian French speakers
  •Geographic Distribution: Carefully selected from multiple regions across Canada to capture a wide spectrum of dialects and speaking styles
  •Demographics: Males and females in a 60:40 ratio, aged 18 to 70 years
•Recording Specifications
  •Type: Scripted monologue prompts focused on telecom industry use cases
  •Duration: Each audio clip ranges from 5 to 30 seconds
  •Format: WAV files in mono, 16-bit depth, with sample rates of 8 kHz and 16 kHz
  •Environment: Clean, echo-free, and noise-controlled settings to ensure optimal audio clarity

Topic Coverage
The dataset reflects a wide variety of common telecom customer interactions, including:
•Customer onboarding and service inquiries
•Billing and payment questions
•Data plans and product information
•Technical support requests
•Network coverage discussions
•Regulatory compliance and policy information
•Upgrades, renewals, and service plan changes
•Domain-specific scripted interactions tailored to real-world telecom use cases
Contextual Depth
To maximize contextual richness, prompts include:
•Localized Names: Common Canadian names in various formats
•Addresses: Region-specific address structures for realism
•Dates & Times: Spoken date and time references in typical telecom scenarios (e.g., billing cycles, service activation times)
•Telecom Terminology: Keywords related to mobile data, network, SIM, devices, plans, etc.
•Numbers & Rates: Usage statistics, pricing info, recharge values, and billing figures
•Service Providers: References to telecom companies and third-party service entities

Transcription
Each audio file is paired with an accurate, verbatim transcription for precise model training:
•Content: Transcriptions are direct representations of each recorded prompt
•Format: Plain text (.TXT), with filenames matching their corresponding audio files
•Verification: Every transcription is manually verified by native Canadian French linguists to ensure consistency and accuracy

Metadata
Detailed metadata is included to
2025 Green Card Report for Education Teaching French To Speakers Of Other...
myvisajobs.com
Updated Jan 16, 2025
Cite
MyVisaJobs (2025). 2025 Green Card Report for Education Teaching French To Speakers Of Other Languages [Dataset]. https://www.myvisajobs.com/reports/green-card/major/education-teaching-french-to-speakers-of-other-languages/
Explore at:
Dataset updated
Jan 16, 2025
Dataset authored and provided by
MyVisaJobs
License
https://www.myvisajobs.com/terms-of-service/
Area covered
French
Variables measured
Major, Salary, Petitions Filed
Description
A dataset that explores Green Card sponsorship trends, salary data, and employer insights for the major "Education Teaching French To Speakers Of Other Languages" in the U.S.
tts-french-dataset
huggingface.co
Updated Aug 4, 2025
Cite
Mwila Bwalya David (2025). tts-french-dataset [Dataset]. https://huggingface.co/datasets/Buck26/tts-french-dataset
Explore at:
Dataset updated
Aug 4, 2025
Authors
Mwila Bwalya David
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
French
Description
🇫🇷 French TTS Dataset

This dataset contains French speech audio paired with clean transcriptions, intended for training text-to-speech models such as Spark-TTS or Coqui TTS.

📁 Contents

dataset.parquet: metadata file with audio paths, transcriptions, speaker info
Audio/: directory of all .wav files used for training

📊 Dataset Structure

The dataset.parquet file includes the following columns:

Column Description

audio Path to .wav file

text… See the full description on the dataset page: https://huggingface.co/datasets/Buck26/tts-french-dataset.
african_accented_french
huggingface.co
Updated Jun 7, 2022
Cite
Théo Gigant (2022). african_accented_french [Dataset]. https://huggingface.co/datasets/gigant/african_accented_french
Explore at:
Dataset updated
Jun 7, 2022
Authors
Théo Gigant
License
https://choosealicense.com/licenses/cc/
Area covered
French
Description
This corpus consists of approximately 22 hours of speech recordings. Transcripts are provided for all the recordings. The corpus can be divided into 3 parts:

Yaounde

Collected by a team from the U.S. Military Academy's Center for Technology Enhanced Language Learning (CTELL) in 2003 in Yaoundé, Cameroon. It has recordings from 84 speakers, 48 male and 36 female.

CA16

This part was collected by an RDECOM Science Team that participated in the United Nations exercise Central Accord 16 (CA16) in Libreville, Gabon in June 2016. The Science Team included DARPA's Dr. Boyan Onyshkevich and Dr. Aaron Lawson (SRI International), as well as RDECOM scientists. It has recordings from 125 speakers from Cameroon, Chad, Congo and Gabon.

Niger

This part was collected from 23 speakers in Niamey, Niger, Oct. 26-30 2015. These speakers were students in a course for officers and sergeants presented by Army trainers assigned to U.S. Army Africa. The data was collected by RDECOM Science & Technology Advisors Major Eddie Strimel and Mr. Bill Bergen.
Data from: Coronavirus France dataset
kaggle.com
Updated Mar 15, 2020
Cite
Lior Perez (2020). Coronavirus France dataset [Dataset]. https://www.kaggle.com/datasets/lperez/coronavirus-france-dataset/data
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Mar 15, 2020
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Lior Perez
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
France
Description
Context

COVID-19 has infected many people in France.

Content

The dataset is no longer updated. It contains almost all French metropolitan regions plus overseas regions, last updated on March 9, 2020. If you want to help update this dataset, see the contributions section below.

The intention of this dataset is to put all published information about COVID-19 patients in France in a CSV file.

Acknowledgements

Source of data: press releases of the French regional health agencies. The data was transcribed into a CSV file by a GitHub community.

This work is inspired by a similar project carried out in South Korea: kaggle dataset.

Contributions

We need more contributors to build this dataset and keep it updated. Join us on GitHub.

Contributors: Lior Perez, Samia Drappeau, Manon Fourniol, Zoragna, Raphaël Presberg
Geonames - All Cities with a population > 1000
public.opendatasoft.com
data.smartidf.services
+2 more
csv, excel, geojson +1
Updated Mar 10, 2024
Cite
(2024). Geonames - All Cities with a population > 1000 [Dataset]. https://public.opendatasoft.com/explore/dataset/geonames-all-cities-with-a-population-1000/
Explore at:
Available download formats: csv, json, geojson, excel
Dataset updated
Mar 10, 2024
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
All cities with a population > 1000 or seats of administrative divisions (ca. 80,000).

Sources and Contributions

Sources: GeoNames aggregates over a hundred different data sources.
Ambassadors: GeoNames Ambassadors help in many countries.
Wiki: A wiki allows users to view the data, quickly fix errors, and add missing places.
Donations and Sponsoring: Costs for running GeoNames are covered by donations and sponsoring.

Enrichment: add country name
News Events Data in Latin America( Techsalerator)
datarade.ai
Updated Mar 20, 2024
Cite
Techsalerator (2024). News Events Data in Latin America( Techsalerator) [Dataset]. https://datarade.ai/data-products/news-events-data-in-latin-america-techsalerator-techsalerator
Explore at:
Available download formats: .json, .csv, .xls, .txt
Dataset updated
Mar 20, 2024
Dataset provided by
Techsalerator LLC
Authors
Techsalerator
Area covered
Falkland Islands (Malvinas), Aruba, Argentina, Cuba, Chile, Martinique, Montserrat, Dominican Republic, French Guiana, Ecuador, Americas, Latin America
Description
Techsalerator’s News Event Data in Latin America offers a detailed and extensive dataset designed to provide businesses, analysts, journalists, and researchers with an in-depth view of significant news events across the Latin American region. This dataset captures and categorizes key events reported from a wide array of news sources, including press releases, industry news sites, blogs, and PR platforms, offering valuable insights into regional developments, economic changes, political shifts, and cultural events.

Key Features of the Dataset:

Comprehensive Coverage: The dataset aggregates news events from numerous sources such as company press releases, industry news outlets, blogs, PR sites, and traditional news media. This broad coverage ensures a wide range of information from multiple reporting channels.

Categorization of Events: News events are categorized into various types including business and economic updates, political developments, technological advancements, legal and regulatory changes, and cultural events. This categorization helps users quickly locate and analyze information relevant to their interests or sectors.

Real-Time Updates: The dataset is updated regularly to include the most recent events, ensuring users have access to the latest news and can stay informed about current developments.

Geographic Segmentation: Events are tagged with their respective countries and regions within Latin America. This geographic segmentation allows users to filter and analyze news events based on specific locations, facilitating targeted research and analysis.

Event Details: Each event entry includes comprehensive details such as the date of occurrence, source of the news, a description of the event, and relevant keywords. This thorough detailing helps in understanding the context and significance of each event.

Historical Data: The dataset includes historical news event data, enabling users to track trends and perform comparative analysis over time. This feature supports longitudinal studies and provides insights into how news events evolve.

Advanced Search and Filter Options: Users can search and filter news events based on criteria such as date range, event type, location, and keywords. This functionality allows for precise and efficient retrieval of relevant information.

Latin American Countries Covered:

South America: Argentina, Bolivia, Brazil, Chile, Colombia, Ecuador, Guyana, Paraguay, Peru, Suriname, Uruguay, Venezuela
Central America: Belize, Costa Rica, El Salvador, Guatemala, Honduras, Nicaragua, Panama
Caribbean: Cuba, Dominican Republic, Haiti (Note: Primarily French-speaking but included due to geographic and cultural ties), Jamaica, Trinidad and Tobago

Benefits of the Dataset:

Strategic Insights: Businesses and analysts can use the dataset to gain insights into significant regional developments, economic conditions, and political changes, aiding in strategic decision-making and market analysis.

Market and Industry Trends: The dataset provides valuable information on industry-specific trends and events, helping users understand market dynamics and emerging opportunities.

Media and PR Monitoring: Journalists and PR professionals can track relevant news across Latin America, enabling them to monitor media coverage, identify emerging stories, and manage public relations efforts effectively.

Academic and Research Use: Researchers can utilize the dataset for longitudinal studies, trend analysis, and academic research on various topics related to Latin American news and events.

Techsalerator’s News Event Data in Latin America is a crucial resource for accessing and analyzing significant news events across the region. By providing detailed, categorized, and up-to-date information, it supports effective decision-making, research, and media monitoring across diverse sectors.
Canadian French Scripted Monologue Speech Data for Healthcare
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Canadian French Scripted Monologue Speech Data for Healthcare [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/healthcare-scripted-speech-monologues-spanish-usa
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Area covered
Canada, French
Dataset funded by
FutureBeeAI
Description
Introduction
Introducing the Canadian French Scripted Monologue Speech Dataset for the Healthcare Domain, a voice dataset built to accelerate the development and deployment of French language automatic speech recognition (ASR) systems, with a sharp focus on real-world healthcare interactions.
Speech Data
This dataset includes over 6,000 high-quality scripted audio prompts recorded in Canadian French, representing typical voice interactions found in the healthcare industry. The data is tailored for use in voice technology systems that power virtual assistants, patient-facing AI tools, and intelligent customer service platforms.
•Participant Diversity
  •Speakers: 60 native Canadian French speakers.
  •Regional Balance: Participants are sourced from multiple regions across Canada, reflecting diverse dialects and linguistic traits.
  •Demographics: Includes a mix of male and female participants (60:40 ratio), aged between 18 and 70 years.
•Recording Specifications
  •Nature of Recordings: Scripted monologues based on healthcare-related use cases.
  •Duration: Each clip ranges from 5 to 30 seconds, offering short, context-rich speech samples.
  •Audio Format: WAV files recorded in mono, with 16-bit depth and sample rates of 8 kHz and 16 kHz.
  •Environment: Clean and echo-free spaces ensure clear and noise-free audio capture.
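
The recording specifications above can be checked mechanically once clips are downloaded. The following is a minimal sketch, not part of the dataset or its tooling: it uses Python's standard `wave` module, and the helper names `check_clip` and `make_silent_clip` are hypothetical. It assumes the clips are plain PCM WAV files and only tests the listed properties (mono, 16-bit, 8 kHz or 16 kHz, 5 to 30 seconds).

```python
import io
import wave

# Sample rates named in the listing (assumed to be the only valid ones).
ALLOWED_RATES = {8000, 16000}


def check_clip(wav_file):
    """Return a list of spec violations for one WAV clip (empty if none).

    `wav_file` may be a path or a binary file-like object.
    """
    problems = []
    with wave.open(wav_file, "rb") as clip:
        if clip.getnchannels() != 1:
            problems.append("not mono")
        if clip.getsampwidth() != 2:  # 2 bytes per sample = 16-bit
            problems.append("not 16-bit")
        rate = clip.getframerate()
        if rate not in ALLOWED_RATES:
            problems.append(f"unexpected sample rate {rate}")
        duration = clip.getnframes() / rate
        if not 5 <= duration <= 30:
            problems.append(f"duration {duration:.1f}s outside 5-30s")
    return problems


def make_silent_clip(seconds, rate=16000, channels=1):
    """Build an in-memory 16-bit silent WAV clip, for demonstration only."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as out:
        out.setnchannels(channels)
        out.setsampwidth(2)
        out.setframerate(rate)
        out.writeframes(b"\x00\x00" * channels * rate * seconds)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(check_clip(make_silent_clip(10)))  # a conforming clip
    print(check_clip(make_silent_clip(2)))   # too short for the spec
```

In practice `check_clip` would be called on each downloaded file path; the synthetic clips here only stand in for real data, which is not available in this listing.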

Topic Coverage
The prompts span a broad range of healthcare-specific interactions, such as:
•Patient check-in and follow-up communication
•Appointment booking and cancellation dialogues
•Insurance and regulatory support queries
•Medication, test results, and consultation discussions
•General health tips and wellness advice
•Emergency and urgent care communication
•Technical support for patient portals and apps
•Domain-specific scripted statements and FAQs
Contextual Depth
To maximize authenticity, the prompts integrate linguistic elements and healthcare-specific terms such as:
•Names: Gender- and region-appropriate Canadian names
•Addresses: Varied local address formats spoken naturally
•Dates & Times: References to appointment dates, times, follow-ups, and schedules
•Medical Terminology: Common medical procedures, symptoms, and treatment references
•Numbers & Measurements: Health data like dosages, vitals, and test result values
•Healthcare Institutions: Names of clinics, hospitals, and diagnostic centers

These elements make the dataset exceptionally suited for training AI systems to understand and respond to natural healthcare-related speech patterns.
Transcription
Every audio recording is accompanied by a verbatim, manually verified transcription.
•Content: The transcription mirrors the exact scripted prompt recorded by the speaker.
•Format: Files are delivered in plain text (.TXT) format with consistent naming conventions for seamless integration.

