62 datasets found
  1. Ranking of languages spoken at home in the U.S. 2023

    • statista.com
    Updated Apr 14, 2025
    Cite
    Statista (2025). Ranking of languages spoken at home in the U.S. 2023 [Dataset]. https://www.statista.com/statistics/183483/ranking-of-languages-spoken-at-home-in-the-us-in-2008/
    Dataset updated
    Apr 14, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2023
    Area covered
    United States
    Description

    In 2023, around 43.37 million people in the United States spoke Spanish at home. In comparison, approximately 998,179 people were speaking Russian at home during the same year. The distribution of the U.S. population by ethnicity can be accessed here. A ranking of the most spoken languages across the world can be accessed here.

  2. The most spoken languages worldwide 2025

    • statista.com
    Updated Apr 14, 2025
    Cite
    Statista (2025). The most spoken languages worldwide 2025 [Dataset]. https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/
    Dataset updated
    Apr 14, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2025
    Area covered
    World
    Description

    In 2025, there were around 1.53 billion people worldwide who spoke English either natively or as a second language, slightly more than the 1.18 billion Mandarin Chinese speakers at the time of survey. Hindi and Spanish were the third and fourth most widely spoken languages that year.

    Languages in the United States

    The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation, and other official pronouncements. The United States is a land of immigration, and the languages spoken in the United States vary as a result of the multicultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which more than 43 million people spoke at home in 2023. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese), 1.8 million Tagalog speakers, and 1.57 million Vietnamese speakers counted in the United States that year.

    Different languages at home

    The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage of population speaking a language other than English is California. About 45 percent of its population was speaking a language other than English at home in 2023.

  3. French Speech Recognition Dataset

    • kaggle.com
    Updated Jun 25, 2025
    Cite
    Unidata (2025). French Speech Recognition Dataset [Dataset]. https://www.kaggle.com/datasets/unidpro/french-speech-recognition-dataset
    Dataset updated
    Jun 25, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Unidata
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Area covered
    French
    Description

    French Speech Dataset for recognition task

    Dataset comprises 547 hours of telephone dialogues in French, collected from 964 native speakers across various topics and domains, with an impressive 98% Word Accuracy Rate. It is designed for research in speech recognition, focusing on various recognition models, primarily aimed at meeting the requirements for automatic speech recognition (ASR) systems.

    By utilizing this dataset, researchers and developers can advance their understanding and capabilities in natural language processing (NLP), speech recognition, and machine learning technologies.

    The dataset includes high-quality audio recordings with accurate transcriptions, making it ideal for training and evaluating speech recognition models.

    💵 Buy the Dataset: This is a limited preview of the data. To access the full dataset, please contact us at https://unidata.pro to discuss your requirements and pricing options.

    Metadata for the dataset

    • Audio files: High-quality recordings in WAV format
    • Text transcriptions: Accurate and detailed transcripts for each audio segment
    • Speaker information: Metadata on native speakers, such as gender
    • Topics: Diverse domains such as general conversations and business

    The native speakers and the range of topics and domains covered make the dataset a useful resource for the research community, allowing researchers to study spoken languages, dialects, and language patterns.

    🌐 UniData provides high-quality datasets, content moderation, data collection and annotation for your AI/ML projects

  4. Wake Word French Dataset

    • shaip.com
    Updated Apr 5, 2024
    Cite
    Shaip (2024). Wake Word French Dataset [Dataset]. https://www.shaip.com/offerings/speech-data-catalog/wake-word-french-dataset/
    Dataset updated
    Apr 5, 2024
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    French
    Description

    High-Quality French Wake Word Dataset for AI & Speech Models. Overview: Title: Wake Word French Language Dataset. Dataset Type: Wake Word. Description: Wake Words / Voice Command / Trigger Word…

  5. MEDIA speech database for French

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Mar 27, 2008
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2008). MEDIA speech database for French [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0272/
    Dataset updated
    Mar 27, 2008
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    Area covered
    French
    Description

    The MEDIA speech database for French was produced by ELDA within the French national project MEDIA (Automatic evaluation of man-machine dialogue systems), as part of the Technolangue programme funded by the French Ministry of Research and New Technologies (MRNT). It contains 1,258 transcribed dialogues from 250 adult speakers. The method chosen for the corpus construction process is that of a ‘Wizard of Oz’ (WoZ) system. This consists of simulating a natural language man-machine dialogue. The scenario was built in the domain of tourism and hotel reservation.

    The database is formatted following the SpeechDat conventions and it includes the following items:

    • 1,258 recorded sessions for a total of 70 hours of speech. The signals are stored in a stereo wave file format. Each of the two speech channels is recorded at 8 kHz with 16 bit quantization with the least significant byte first (“lohi” or Intel format) as signed integers.
    • Manual transcription of each session in XML format. Label files were created with the free transcription tool Transcriber (TRS files).
    • Phonetic lexicon containing all the words spoken in the database. Column 1 contains the orthography of the French word. Column 2 shows the frequency of the word. Column 3 contains the pronunciation in SAMPA format. Here is a sample entry of the lexicon: 1) agitée 3 A/ Z i t e
    • Documentation and statistics are also provided with the database.

    The semantic annotation of the corpus is available in this catalogue and referenced ELRA-E0024 (MEDIA Evaluation Package).
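    The signal layout described for this corpus (stereo wave files, 16-bit signed samples, least significant byte first) can be read with Python's standard `wave` module. What follows is a minimal sketch under that assumption only; the function name `split_stereo_channels` and any file path passed to it are hypothetical and not part of the ELRA distribution, and the listing does not say which speaker is on which channel.

```python
import struct
import wave


def split_stereo_channels(path):
    """Return (sample_rate, first_channel, second_channel) for a 16-bit stereo WAV.

    Assumes the layout described in the listing: two channels,
    16-bit signed little-endian samples.
    """
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 2 or wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit stereo audio")
        rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())

    # "<" = little-endian, "h" = signed 16-bit integer.
    # Samples are interleaved: channel 1, channel 2, channel 1, ...
    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
    return rate, list(samples[0::2]), list(samples[1::2])
```

    For a file that matches the description, the returned rate would be 8000 and the two lists would have equal length.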

  6. calliphonie

    • huggingface.co
    Updated Oct 21, 2024
    Cite
    datasets-CNRS (2024). calliphonie [Dataset]. https://huggingface.co/datasets/datasets-CNRS/calliphonie
    Dataset updated
    Oct 21, 2024
    Dataset provided by
    French National Centre for Scientific Research (http://www.cnrs.fr/)
    Authors
    datasets-CNRS
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    [!NOTE] Dataset origin: https://www.ortolang.fr/market/corpora/calliphonie

    [!WARNING] You must go to the Ortolang website and log in to download the data.

      Description
    

    Content and technical data:

      From Ref. 1
    

    Two speakers (a female and a male, native speakers of French) recorded the corpus. They produced each sentence according to two different instructions: (1) emphasis on a specific word of the sentence (generally the verb) and (2)… See the full description on the dataset page: https://huggingface.co/datasets/datasets-CNRS/calliphonie.

  7. Canadian French Retail Scripted Monologue Speech Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Canadian French Retail Scripted Monologue Speech Dataset [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/retail-scripted-speech-monologues-spanish-usa
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    Canada, French
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Canadian French Scripted Monologue Speech Dataset for the Retail & E-commerce domain. This dataset is built to accelerate the development of French language speech technologies especially for use in retail-focused automatic speech recognition (ASR), natural language processing (NLP), voicebots, and conversational AI applications.

    Speech Data

    This training dataset includes 6,000+ high-quality scripted audio recordings in Canadian French, created to reflect real-world scenarios in the Retail & E-commerce sector. These prompts are tailored to improve the accuracy and robustness of customer-facing speech technologies.

    Participant Diversity
    Speakers: 60 native French speakers from across Canada
    Geographic Coverage: Multiple Canadian regions to ensure dialect and accent diversity
    Demographics: Participants aged 18 to 70, with a 60:40 male-to-female distribution
    Recording Details
    Nature of Recording: Scripted monologue-style speech prompts
    Duration: Each recording spans 5 to 30 seconds
    Audio Format: WAV format, mono channel, 16-bit depth, and 8kHz / 16kHz sample rates
    Environment: Recorded in quiet conditions, free from background noise and echo

    Topic Diversity

    This dataset includes a comprehensive set of retail-specific topics to ensure wide linguistic coverage for AI training:

    Customer Service Interactions
    Order Placement and Payment Processes
    Product and Service Inquiries
    Technical Support Queries
    General Information and Guidance
    Promotional and Sales Announcements
    Domain-Specific Service Statements

    Contextual Enrichment

    To increase training utility, prompts include contextual data such as:

    Region-Specific Names: Common Canadian male and female names in diverse formats
    Addresses: Localized address variations spoken naturally
    Dates & Times: Realistic phrasing in delivery, promotions, and return policies
    Product References: Real-world product names, brands, and categories
    Numerical Data: Spoken numbers and prices used in transactions and offers
    Order IDs & Tracking Numbers: Common references in customer service calls

    These additions help your models learn to recognize structured and unstructured retail-related speech.

    Transcription

    Every audio file is paired with a verbatim transcription, ensuring consistency and alignment for model training.

    Content: Exact scripted prompts as spoken by the participant
    Format: Provided in plain text (.TXT) format with filenames matching the associated audio
    Quality Assurance: All transcripts are verified for accuracy by native French transcribers
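
    Since the listing says each transcript is a plain-text file whose filename matches its audio file, the pairing can be done by file stem. The sketch below is only an illustration under assumptions the listing does not confirm: the directory layout, UTF-8 encoding, and the exact case of the `.txt` extension are guesses, and `pair_audio_with_transcripts` is a hypothetical helper name.

```python
from pathlib import Path


def pair_audio_with_transcripts(root):
    """Pair each .wav file under root with the transcript sharing its stem.

    Returns (pairs, missing): pairs holds (audio_path, transcript_text),
    missing holds audio paths that have no transcript next to them.
    """
    pairs, missing = [], []
    for audio in sorted(Path(root).rglob("*.wav")):
        # Accept either extension case, since the listing writes ".TXT".
        candidates = [audio.with_suffix(ext) for ext in (".txt", ".TXT")]
        transcript = next((c for c in candidates if c.is_file()), None)
        if transcript is None:
            missing.append(audio)
        else:
            text = transcript.read_text(encoding="utf-8").strip()
            pairs.append((audio, text))
    return pairs, missing
```

    Returning the unmatched audio files separately makes it easy to check that a download is complete before training on it.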

    Metadata

    Detailed metadata is included to support filtering, analysis, and model evaluation:


  8. OrienTel French as spoken in Morocco database

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Feb 22, 2007
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2007). OrienTel French as spoken in Morocco database [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0185/
    Dataset updated
    Feb 22, 2007
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Area covered
    Morocco, French
    Description

    The OrienTel French as spoken in Morocco database comprises 530 Moroccan speakers of French (264 males, 266 females) recorded over the Moroccan fixed and mobile telephone network. This database is partitioned into 1 CD and 1 DVD. The speech databases made within the OrienTel project were validated by SPEX, the Netherlands, to assess their compliance with the OrienTel format and content specifications.

    Speech samples are stored as sequences of 8-bit 8 kHz A-law. Each prompted utterance is stored in a separate file. Each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information.

    Each speaker uttered the following items:

    • 1 isolated single digit
    • 1 sequence of 10 isolated digits
    • 5 connected digits: 1 prompt sheet number (6 digits), 1 telephone number (6-15 digits), 1 credit card number (14-16 digits), 1 PIN code (6 digits), 1 spontaneous phone number
    • 1 currency money amount
    • 2 natural numbers
    • 3+1 dates: 1 prompted date, 1 relative or general date expression, 1 prompted date phrase + 1 additional (Western calendar)
    • 2 time phrases: 1 time of day (spontaneous), 1 time phrase (word style)
    • 3 spelled words: 1 spontaneous (own forename), 1 city name, 1 real word for coverage
    • 5 directory assistance utterances: 1 spontaneous, own forename, 1 city of childhood (spontaneous), 1 frequent city name, 1 frequent company name, 1 common forename and surname
    • 2 yes/no questions: 1 predominantly “yes” question, 1 predominantly “no” question
    • 6 application keywords/keyphrases
    • 1 word spotting phrase using embedded application words
    • 4 phonetically rich words
    • 9 phonetically rich sentences
    • 2 spontaneous items (for control)

    The following age distribution has been obtained: 256 speakers are between 16 and 30, 210 speakers are between 31 and 45, 63 speakers are between 46 and 60, 1 speaker is over 60.

    A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
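
    Because the samples are stored as 8-bit A-law, they need to be expanded to linear PCM before most tools can use them. The sketch below writes out the standard G.711 A-law expansion by hand (the `audioop` module that used to provide it was removed from the standard library in Python 3.13). It is an illustration, not the tooling shipped with the database, and it assumes the signal files hold raw A-law bytes with no header, which the listing does not state; the function names are hypothetical.

```python
def alaw_to_linear(byte):
    """Expand one G.711 A-law byte (0-255) to a signed 16-bit PCM value."""
    byte ^= 0x55                      # undo the alternate-bit inversion of the encoder
    magnitude = (byte & 0x0F) << 4    # 4-bit mantissa
    segment = (byte & 0x70) >> 4      # 3-bit segment number
    if segment == 0:
        magnitude += 8
    else:
        magnitude = (magnitude + 0x108) << (segment - 1)
    # In A-law a set sign bit marks a positive sample.
    return magnitude if byte & 0x80 else -magnitude


def decode_alaw(data):
    """Decode a bytes object of A-law samples into a list of PCM integers."""
    return [alaw_to_linear(b) for b in data]
```

    At 8 kHz, one second of speech is 8,000 A-law bytes and decodes to 8,000 PCM values.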

  9. Facts and Figures 2015: Profiles of Official Language Immigrants: English...

    • data.amerigeoss.org
    • open.canada.ca
    xls
    Updated Jul 22, 2019
    Cite
    Canada (2019). Facts and Figures 2015: Profiles of Official Language Immigrants: English Speaking Permanent Residents inside Quebec [Dataset]. https://data.amerigeoss.org/dataset/caa61377-f34c-4f31-89ae-a57c8a73f99d
    Available download formats: xls
    Dataset updated
    Jul 22, 2019
    Dataset provided by
    Canada
    Area covered
    Québec City, Quebec
    Description

    Facts and Figures, Profiles of Official Language Immigrants: English Speaking Permanent Residents in Quebec presents the annual intake of English-speaking permanent residents in the province of Quebec by category of immigration from 2006 to 2015. The report examines selected characteristics for English-speaking permanent residents.

    “English-speaking immigrants” are defined by the following criteria: 1) permanent residents with English as Mother Tongue; 2) permanent residents with Mother Tongue other than English and with “English Only” as official language spoken (excluding “Both English and French” as official language spoken). Note that official language(s) spoken (English only, French only, both French and English, and neither language) are self-declared indicators of knowledge of an official language.

    Please note that in these datasets, the figures have been suppressed or rounded to prevent the identification of individuals when the datasets are compiled and compared with other publicly available statistics. Values between 0 and 5 are shown as “--” and all other values are rounded to the nearest multiple of 5. As a result, the sum of the figures may not equal the totals indicated.
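
    The suppression and rounding rule explains why published rows may not add up to their totals. A small sketch of the rule follows, with made-up counts. It assumes "between 0 and 5" means the integers 1 to 4, so that 0 and 5 are shown unchanged; the source does not spell this out, and `disclose` is a hypothetical name.

```python
def disclose(value):
    """Apply the suppression and rounding rule to one non-negative count."""
    if 0 < value < 5:
        return "--"                  # small counts are suppressed (assumed: 1 to 4)
    return 5 * ((value + 2) // 5)    # nearest multiple of 5 for whole numbers


# Made-up counts for three immigration categories and their true total.
true_counts = [3, 7, 12]
shown_counts = [disclose(v) for v in true_counts]   # ["--", 5, 10]
shown_total = disclose(sum(true_counts))            # 22 is shown as 20

# The visible cells add up to 15, not to the published total of 20.
visible_sum = sum(v for v in shown_counts if v != "--")
```

    The total is rounded independently of its parts, which is why the two can disagree.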

  10. English-French Bilingualism, 2001 (by census division)

    • data.amerigeoss.org
    • data.urbandatacentre.ca
    • +5 more
    jp2, zip
    Updated Jul 22, 2019
    Cite
    Canada (2019). English-French Bilingualism, 2001 (by census division) [Dataset]. https://data.amerigeoss.org/tl/dataset/e52da9e1-8893-11e0-9c34-6cf049291510
    Available download formats: zip, jp2
    Dataset updated
    Jul 22, 2019
    Dataset provided by
    Canada
    Area covered
    French
    Description

    About 5 231 500 people reported to the 2001 Census that they were bilingual, compared with 4 841 300 five years earlier, an 8.1% increase. In 2001, these individuals represented 17.7% of the population, up from 17.0% in 1996. Nationally, 43.4% of francophones reported that they were bilingual, compared with 9.0% of anglophones. Within Quebec, the growth in the bilingualism rate from 1996 to 2001 was even greater than in the previous five-year period. In 2001, two out of every five individuals (40.8%) reported that they were bilingual, compared with 37.8% in 1996 and 35.4% in 1991.
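
    The 8.1% figure can be checked from the two census counts quoted in the description. The short calculation below uses only those numbers; the implied population is a back-of-the-envelope estimate from the rounded 17.7% share, not a figure given in the source.

```python
bilingual_1996 = 4_841_300
bilingual_2001 = 5_231_500

# Relative growth in the number of bilingual people between the two censuses.
increase_pct = 100 * (bilingual_2001 - bilingual_1996) / bilingual_1996
print(f"Increase 1996-2001: {increase_pct:.1f}%")   # prints 8.1%

# Population implied by the 17.7% share (approximate: the share is rounded).
implied_population_2001 = bilingual_2001 / 0.177
print(f"Implied 2001 population: {implied_population_2001 / 1e6:.1f} million")   # prints 29.6 million
```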

  11. Fon French Daily Dialogues Parallel Data

    • live.european-language-grid.eu
    • huggingface.co
    • +1 more
    csv
    Updated Apr 11, 2024
    Cite
    (2024). Fon French Daily Dialogues Parallel Data [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7709
    Available download formats: csv
    Dataset updated
    Apr 11, 2024
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    French
    Description

    We aim to collect, clean, and store corpora of Fon and French sentences for Natural Language Processing research, including Neural Machine Translation, Named Entity Recognition, etc., for Fon, a very low-resourced and endangered African native language.

    Fon (also called Fongbe) is an African-indigenous language spoken mostly in Benin, Togo, and Nigeria - by about 2 million people.

    As training data is crucial to the high performance of a machine learning model, the aim of this project is to compile the largest set of training corpora for the research and design of translation and NLP models involving Fon.

    Through crowdsourcing and Google Forms surveys, we gathered and cleaned 25,377 parallel Fon-French sentences, all based on daily conversations.

    The following contributed to the crowdsourcing, creation, and cleaning of this version:

    1) Name: Bonaventure DOSSOU; Affiliation: MSc Student in Data Engineering, Jacobs University; Contact: femipancrace.dossou@gmail.com

    2) Name: Ricardo AHOUNVLAME; Affiliation: Student in Linguistics; Contact: tontonjars@gmail.com

    3) Name: Fabroni YOCLOUNON; Affiliation: Creator of the Label IamYourClounon; Contact: iamyourclounon@gmail.com

    4) Name: BeninLangues; Affiliation: BeninLangues; Contact: https://beninlangues.com/

    5) Name: Chris Emezue; Affiliation: MSc Student in Mathematics in Data Science, Technical University of Munich; Contact: chris.emezue@gmail.com

    To join as a contributor, please contact us at: 1) https://twitter.com/bonadossou 2) https://twitter.com/ChrisEmezue 3) https://twitter.com/edAIOfficial
    Or contact Bonaventure Dossou (femipancrace.dossou@gmail.com) or Chris Emezue (chris.emezue@gmail.com).

    Clavier Fongbé (WebView): https://bonaventuredossou.github.io/clavierfongbe/ (made by Bonaventure Dossou)
    Clavier Fongbé (Mobile Android Version): https://play.google.com/store/apps/details?id=com.fulbertodev.clavierfongbe&hl=en&gl=US (Fabroni Yoclounon, Bonaventure Dossou et al.)

  12. Canadian French Scripted Monologue Speech Dataset for BFSI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Canadian French Scripted Monologue Speech Dataset for BFSI [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/bfsi-scripted-speech-monologues-spanish-usa
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    Canada, French
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Canadian French Scripted Monologue Speech Dataset tailored for the BFSI (Banking, Financial Services, and Insurance) domain. This dataset empowers the development of advanced French speech recognition systems, natural language understanding models, and conversational AI solutions focused on the BFSI sector.

    Speech Data

    This dataset includes over 6,000 scripted prompt recordings in Canadian French, covering a wide range of realistic banking and finance-related scenarios to support robust ASR and voice AI systems.

    Participant Diversity
    Speakers: 60 native Canadian French speakers.
    Regions: Diverse representation from various Canadian provinces to ensure dialect and accent coverage.
    Demographics: Age range of 18–70, with a male-to-female ratio of 60:40.
    Recording Details
    Nature: Scripted monologues and domain-specific prompt recordings.
    Duration:
    Audio Format: WAV, mono channel, 16-bit depth, recorded at 8 kHz and 16 kHz sample rates.
    Environment: Clean, echo-free, and noise-free environments.

    Topic & Context Diversity

    This dataset spans multiple BFSI-related themes to simulate practical customer interaction scenarios:

    Customer service interactions
    Financial transactions & balance inquiries
    Banking and insurance product queries
    Loan & credit support
    Regulatory and compliance questions
    Technical help and password resets
    Promotional campaigns and service updates

    Contextual Elements

    To make the dataset as context-rich as possible, each prompt integrates commonly encountered real-world BFSI elements:

    Names: Region-specific names in multiple formats
    Addresses: Local address structures and pronunciations
    Dates & Times: Typical time expressions used in banking
    Organization Names: Names of banks, financial firms, and institutions
    Currencies & Amounts: Spoken currency formats, prices, and numeric data
    IDs & Transaction Numbers: For authentic service simulation

    Transcription

    Every audio file is paired with verbatim transcription to streamline ASR and NLP model development.

    Content: Exact match of each prompt
    Format: Clean .TXT files, mapped to audio file names
    Accuracy: Reviewed and validated by native Canadian French linguists

    Metadata

    Each data point is enriched with detailed metadata for advanced training and analysis:

    Participant Metadata: Unique ID, age, gender, state, country, dialect
    Recording Metadata: Transcript, recording setup, sample rate, bit depth, device, file format

    Applications and Use Cases

    This BFSI-focused dataset

  13. Canadian French Scripted Monologue Speech Data for Telecom

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Canadian French Scripted Monologue Speech Data for Telecom [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/telecom-scripted-speech-monologues-spanish-usa
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    Canada, French
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Presenting the Canadian French Scripted Monologue Speech Dataset for the Telecom Domain, a purpose-built dataset created to accelerate the development of French speech recognition and voice AI models specifically tailored for the telecommunications industry.

    Speech Data

    This dataset includes over 6,000 high-quality scripted prompt recordings in Canadian French, representing real-world telecom customer service scenarios. It’s designed to support the training of speech-based AI systems used in call centers, virtual agents, and voice-powered support tools.

    Participant Diversity
    Speakers: 60 native Canadian French speakers
    Geographic Distribution: Carefully selected from multiple regions across Canada to capture a wide spectrum of dialects and speaking styles
    Demographics: Male and female speakers in a 60:40 ratio, aged between 18 and 70 years
    Recording Specifications
    Type: Scripted monologue prompts focused on telecom industry use cases
    Duration: Each audio clip ranges from 5 to 30 seconds
    Format: WAV files in mono, 16-bit depth, with sample rates of 8 kHz and 16 kHz
    Environment: Clean, echo-free, and noise-controlled settings to ensure optimal audio clarity

    Topic Coverage

    The dataset reflects a wide variety of common telecom customer interactions, including:

    Customer onboarding and service inquiries
    Billing and payment questions
    Data plans and product information
    Technical support requests
    Network coverage discussions
    Regulatory compliance and policy information
    Upgrades, renewals, and service plan changes
    Domain-specific scripted interactions tailored to real-world telecom use cases

    Contextual Depth

    To maximize contextual richness, prompts include:

    Localized Names: Common Canadian names in various formats
    Addresses: Region-specific address structures for realism
    Dates & Times: Spoken date and time references in typical telecom scenarios (e.g., billing cycles, service activation times)
    Telecom Terminology: Keywords related to mobile data, network, SIM, devices, plans, etc.
    Numbers & Rates: Usage statistics, pricing info, recharge values, and billing figures
    Service Providers: References to telecom companies and third-party service entities

    Transcription

    Each audio file is paired with an accurate, verbatim transcription for precise model training:

    Content: Transcriptions are direct representations of each recorded prompt
    Format: Plain text (.TXT), with filenames matching their corresponding audio files
    Verification: Every transcription is manually verified by native Canadian French linguists to ensure consistency and accuracy

    Metadata

    Detailed metadata is included to

  14. 2025 Green Card Report for Education Teaching French To Speakers Of Other...

    • myvisajobs.com
    Updated Jan 16, 2025
    Cite
    MyVisaJobs (2025). 2025 Green Card Report for Education Teaching French To Speakers Of Other Languages [Dataset]. https://www.myvisajobs.com/reports/green-card/major/education-teaching-french-to-speakers-of-other-languages/
    Dataset updated
    Jan 16, 2025
    Dataset authored and provided by
    MyVisaJobs
    License

    https://www.myvisajobs.com/terms-of-service/

    Area covered
    French
    Variables measured
    Major, Salary, Petitions Filed
    Description

    A dataset that explores Green Card sponsorship trends, salary data, and employer insights for education teaching French to speakers of other languages in the U.S.

  15. tts-french-dataset

    • huggingface.co
    Updated Aug 4, 2025
    Cite
    Mwila Bwalya David (2025). tts-french-dataset [Dataset]. https://huggingface.co/datasets/Buck26/tts-french-dataset
    Dataset updated
    Aug 4, 2025
    Authors
    Mwila Bwalya David
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    French
    Description

    🇫🇷 French TTS Dataset

    This dataset contains French speech audio paired with clean transcriptions, intended for training text-to-speech models such as Spark-TTS or Coqui TTS.

      📁 Contents
    

    • dataset.parquet: metadata file with audio paths, transcriptions, speaker info
    • Audio/: directory of all .wav files used for training

      📊 Dataset Structure
    

    The dataset.parquet file includes the following columns:

    Column Description

    audio Path to .wav file

    text… See the full description on the dataset page: https://huggingface.co/datasets/Buck26/tts-french-dataset.

  16. african_accented_french

    • huggingface.co
    Updated Jun 7, 2022
    Cite
    Théo Gigant (2022). african_accented_french [Dataset]. https://huggingface.co/datasets/gigant/african_accented_french
    Explore at:
    Dataset updated
    Jun 7, 2022
    Authors
    Théo Gigant
    License

    https://choosealicense.com/licenses/cc/

    Area covered
    French
    Description


    This corpus consists of approximately 22 hours of speech recordings. Transcripts are provided for all the recordings. The corpus can be divided into 3 parts:

    1. Yaounde

    Collected by a team from the U.S. Military Academy's Center for Technology Enhanced Language Learning (CTELL) in 2003 in Yaoundé, Cameroon. It has recordings from 84 speakers, 48 male and 36 female.

    2. CA16

    This part was collected by an RDECOM Science Team who participated in the United Nations exercise Central Accord 16 (CA16) in Libreville, Gabon in June 2016. The Science Team included DARPA's Dr. Boyan Onyshkevich and Dr. Aaron Lawson (SRI International), as well as RDECOM scientists. It has recordings from 125 speakers from Cameroon, Chad, Congo and Gabon.

    3. Niger

    This part was collected from 23 speakers in Niamey, Niger, Oct. 26-30 2015. These speakers were students in a course for officers and sergeants presented by Army trainers assigned to U.S. Army Africa. The data was collected by RDECOM Science & Technology Advisors Major Eddie Strimel and Mr. Bill Bergen.

  17. Data from: Coronavirus France dataset

    • kaggle.com
    Updated Mar 15, 2020
    Cite
    Lior Perez (2020). Coronavirus France dataset [Dataset]. https://www.kaggle.com/datasets/lperez/coronavirus-france-dataset/data
    Explore at:
    Dataset updated
    Mar 15, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Lior Perez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    France
    Description

    Context

    COVID-19 has infected many people in France.

    Content

    The dataset is no longer updated. It contains almost all French metropolitan regions plus overseas regions, last updated on March 9, 2020. If you want to help update this dataset, see the Contributions section below.

    The intention of this dataset is to collect all published information about COVID-19 patients in France in a CSV file.

    Acknowledgements

    Source of data: Press releases of the French regional health agencies. Data transcribed into a CSV file by a GitHub community.

    This work is inspired by similar work done in South Korea: kaggle dataset.

    Contributions

    We need more contributors to build this dataset and keep it updated. Join us on GitHub.

    Contributors: Lior Perez, Samia Drappeau, Manon Fourniol, Zoragna, Raphaël Presberg

  18. Geonames - All Cities with a population > 1000

    • public.opendatasoft.com
    • data.smartidf.services
    • +2 more
    csv, excel, geojson +1
    Updated Mar 10, 2024
    + more versions
    Cite
    (2024). Geonames - All Cities with a population > 1000 [Dataset]. https://public.opendatasoft.com/explore/dataset/geonames-all-cities-with-a-population-1000/
    Explore at:
    Available download formats: csv, json, geojson, excel
    Dataset updated
    Mar 10, 2024
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    All cities with a population > 1000 or seats of adm div (ca. 80,000).

    Sources and Contributions
    Sources: GeoNames aggregates over a hundred different data sources.
    Ambassadors: GeoNames Ambassadors help in many countries.
    Wiki: A wiki allows users to view the data, quickly fix errors, and add missing places.
    Donations and Sponsoring: Costs for running GeoNames are covered by donations and sponsoring.

    Enrichment: add country name

  19. News Events Data in Latin America( Techsalerator)

    • datarade.ai
    Updated Mar 20, 2024
    Cite
    Techsalerator (2024). News Events Data in Latin America( Techsalerator) [Dataset]. https://datarade.ai/data-products/news-events-data-in-latin-america-techsalerator-techsalerator
    Explore at:
    Available download formats: .json, .csv, .xls, .txt
    Dataset updated
    Mar 20, 2024
    Dataset provided by
    Techsalerator LLC
    Authors
    Techsalerator
    Area covered
    Falkland Islands (Malvinas), Aruba, Argentina, Cuba, Chile, Martinique, Montserrat, Dominican Republic, French Guiana, Ecuador, Americas, Latin America
    Description

    Techsalerator’s News Event Data in Latin America offers a detailed and extensive dataset designed to provide businesses, analysts, journalists, and researchers with an in-depth view of significant news events across the Latin American region. This dataset captures and categorizes key events reported from a wide array of news sources, including press releases, industry news sites, blogs, and PR platforms, offering valuable insights into regional developments, economic changes, political shifts, and cultural events.

    Key Features of the Dataset:

    Comprehensive Coverage: The dataset aggregates news events from numerous sources such as company press releases, industry news outlets, blogs, PR sites, and traditional news media. This broad coverage ensures a wide range of information from multiple reporting channels.

    Categorization of Events: News events are categorized into various types including business and economic updates, political developments, technological advancements, legal and regulatory changes, and cultural events. This categorization helps users quickly locate and analyze information relevant to their interests or sectors.

    Real-Time Updates: The dataset is updated regularly to include the most recent events, ensuring users have access to the latest news and can stay informed about current developments.

    Geographic Segmentation: Events are tagged with their respective countries and regions within Latin America. This geographic segmentation allows users to filter and analyze news events based on specific locations, facilitating targeted research and analysis.

    Event Details: Each event entry includes comprehensive details such as the date of occurrence, source of the news, a description of the event, and relevant keywords. This thorough detailing helps in understanding the context and significance of each event.

    Historical Data: The dataset includes historical news event data, enabling users to track trends and perform comparative analysis over time. This feature supports longitudinal studies and provides insights into how news events evolve.

    Advanced Search and Filter Options: Users can search and filter news events based on criteria such as date range, event type, location, and keywords. This functionality allows for precise and efficient retrieval of relevant information.

    Latin American Countries Covered:

    South America: Argentina, Bolivia, Brazil, Chile, Colombia, Ecuador, Guyana, Paraguay, Peru, Suriname, Uruguay, Venezuela
    Central America: Belize, Costa Rica, El Salvador, Guatemala, Honduras, Nicaragua, Panama
    Caribbean: Cuba, Dominican Republic, Haiti (Note: Primarily French-speaking but included due to geographic and cultural ties), Jamaica, Trinidad and Tobago

    Benefits of the Dataset:

    Strategic Insights: Businesses and analysts can use the dataset to gain insights into significant regional developments, economic conditions, and political changes, aiding in strategic decision-making and market analysis.
    Market and Industry Trends: The dataset provides valuable information on industry-specific trends and events, helping users understand market dynamics and emerging opportunities.
    Media and PR Monitoring: Journalists and PR professionals can track relevant news across Latin America, enabling them to monitor media coverage, identify emerging stories, and manage public relations efforts effectively.
    Academic and Research Use: Researchers can utilize the dataset for longitudinal studies, trend analysis, and academic research on various topics related to Latin American news and events.

    Techsalerator’s News Event Data in Latin America is a crucial resource for accessing and analyzing significant news events across the region. By providing detailed, categorized, and up-to-date information, it supports effective decision-making, research, and media monitoring across diverse sectors.

  20. Canadian French Scripted Monologue Speech Data for Healthcare

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Canadian French Scripted Monologue Speech Data for Healthcare [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/healthcare-scripted-speech-monologues-spanish-usa
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    Canada, French
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Introducing the Canadian French Scripted Monologue Speech Dataset for the Healthcare Domain, a voice dataset built to accelerate the development and deployment of French language automatic speech recognition (ASR) systems, with a sharp focus on real-world healthcare interactions.

    Speech Data

    This dataset includes over 6,000 high-quality scripted audio prompts recorded in Canadian French, representing typical voice interactions found in the healthcare industry. The data is tailored for use in voice technology systems that power virtual assistants, patient-facing AI tools, and intelligent customer service platforms.

    Participant Diversity
    Speakers: 60 native Canadian French speakers.
    Regional Balance: Participants are sourced from multiple regions across Canada, reflecting diverse dialects and linguistic traits.
    Demographics: Includes a mix of male and female participants (60:40 ratio), aged between 18 and 70 years.
    Recording Specifications
    Nature of Recordings: Scripted monologues based on healthcare-related use cases.
    Duration: Each clip ranges from 5 to 30 seconds, offering short, context-rich speech samples.
    Audio Format: WAV files recorded in mono, with 16-bit depth and sample rates of 8 kHz and 16 kHz.
    Environment: Clean and echo-free spaces ensure clear and noise-free audio capture.
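
The recording specifications above (mono, 16-bit, 8 kHz or 16 kHz, 5 to 30 seconds per clip) can be checked on a downloaded file. The following is a minimal sketch using only Python's standard `wave` module; the function names `check_wav` and `make_silent_wav` are hypothetical helpers written for this illustration and are not part of any FutureBeeAI tooling.

```python
import io
import wave

# Values taken from the listing: mono, 16-bit, 8 kHz or 16 kHz, 5-30 s clips.
ALLOWED_RATES = {8000, 16000}


def check_wav(source):
    """Return a list of spec violations for one WAV file (path or file object)."""
    problems = []
    with wave.open(source, "rb") as wav:
        if wav.getnchannels() != 1:
            problems.append("not mono")
        if wav.getsampwidth() != 2:  # 2 bytes per sample = 16-bit
            problems.append("not 16-bit")
        rate = wav.getframerate()
        if rate not in ALLOWED_RATES:
            problems.append(f"unexpected sample rate {rate}")
        duration = wav.getnframes() / rate
        if not 5 <= duration <= 30:
            problems.append(f"duration {duration:.1f}s outside 5-30s")
    return problems


def make_silent_wav(rate, seconds, channels=1):
    """Build an in-memory 16-bit silent WAV, used here only as stand-in data."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * channels * rate * seconds)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    # A synthetic 6-second, 16 kHz mono clip passes every check.
    print(check_wav(make_silent_wav(16000, 6)))
    # A 44.1 kHz clip is flagged for its sample rate.
    print(check_wav(make_silent_wav(44100, 6)))
```

An empty list means the file matches the stated format. With real data, `check_wav` would be called with the path of each delivered .wav file instead of the synthetic buffer.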

    Topic Coverage

    The prompts span a broad range of healthcare-specific interactions, such as:

    Patient check-in and follow-up communication
    Appointment booking and cancellation dialogues
    Insurance and regulatory support queries
    Medication, test results, and consultation discussions
    General health tips and wellness advice
    Emergency and urgent care communication
    Technical support for patient portals and apps
    Domain-specific scripted statements and FAQs

    Contextual Depth

    To maximize authenticity, the prompts integrate linguistic elements and healthcare-specific terms such as:

    Names: Gender- and region-appropriate Canadian names
    Addresses: Varied local address formats spoken naturally
    Dates & Times: References to appointment dates, times, follow-ups, and schedules
    Medical Terminology: Common medical procedures, symptoms, and treatment references
    Numbers & Measurements: Health data like dosages, vitals, and test result values
    Healthcare Institutions: Names of clinics, hospitals, and diagnostic centers

    These elements make the dataset exceptionally suited for training AI systems to understand and respond to natural healthcare-related speech patterns.

    Transcription

    Every audio recording is accompanied by a verbatim, manually verified transcription.

    Content: The transcription mirrors the exact scripted prompt recorded by the speaker.
    Format: Files are delivered in plain text (.TXT) format with consistent naming conventions for seamless integration.

Cite
Statista (2025). Ranking of languages spoken at home in the U.S. 2023 [Dataset]. https://www.statista.com/statistics/183483/ranking-of-languages-spoken-at-home-in-the-us-in-2008/

Ranking of languages spoken at home in the U.S. 2023

Explore at:
14 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Apr 14, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
2023
Area covered
United States
Description

In 2023, around 43.37 million people in the United States spoke Spanish at home. In comparison, approximately 998,179 people were speaking Russian at home during the same year. The distribution of the U.S. population by ethnicity can be accessed here. A ranking of the most spoken languages across the world can be accessed here.
