42 datasets found
  1. Ranking of languages spoken at home in the U.S. 2023

    • statista.com
    Updated Apr 14, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista (2025). Ranking of languages spoken at home in the U.S. 2023 [Dataset]. https://www.statista.com/statistics/183483/ranking-of-languages-spoken-at-home-in-the-us-in-2008/
    Explore at:
    Dataset updated
    Apr 14, 2025
    Dataset authored and provided by
    Statistahttp://statista.com/
    Time period covered
    2023
    Area covered
    United States
    Description

    In 2023, around 43.37 million people in the United States spoke Spanish at home. In comparison, approximately 998,179 people were speaking Russian at home during the same year. The distribution of the U.S. population by ethnicity can be accessed here. A ranking of the most spoken languages across the world can be accessed here.

  2. The most spoken languages worldwide 2025

    • statista.com
    Updated Apr 14, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista (2025). The most spoken languages worldwide 2025 [Dataset]. https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/
    Explore at:
    Dataset updated
    Apr 14, 2025
    Dataset authored and provided by
    Statistahttp://statista.com/
    Time period covered
    2025
    Area covered
    World
    Description

    In 2025, there were around 1.53 billion people worldwide who spoke English either natively or as a second language, slightly more than the 1.18 billion Mandarin Chinese speakers at the time of survey. Hindi and Spanish accounted for the third and fourth most widespread languages that year. Languages in the United States The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation, and other official pronouncements. The United States is a land of immigration, and the languages spoken in the United States vary as a result of the multicultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which over than 43 million people spoke at home in 2023. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese),1.8 million Tagalog speakers, and 1.57 million Vietnamese speakers counted in the United States that year. Different languages at home The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage of population speaking a language other than English is California. About 45 percent of its population was speaking a language other than English at home in 2023.

  3. h

    french-speech-recognition-dataset

    • huggingface.co
    Updated Feb 21, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Unidata (2025). french-speech-recognition-dataset [Dataset]. https://huggingface.co/datasets/UniDataPro/french-speech-recognition-dataset
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 21, 2025
    Authors
    Unidata
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Area covered
    French
    Description

    French Speech Dataset for recognition task

    Dataset comprises 547 hours of telephone dialogues in French, collected from 964 native speakers across various topics and domains, with an impressive 98% Word Accuracy Rate. It is designed for research in speech recognition, focusing on various recognition models, primarily aimed at meeting the requirements for automatic speech recognition (ASR) systems. By utilizing this dataset, researchers and developers can advance their understanding… See the full description on the dataset page: https://huggingface.co/datasets/UniDataPro/french-speech-recognition-dataset.

  4. s

    Wake Word French Dataset

    • shaip.com
    Updated Apr 5, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Shaip (2024). Wake Word French Dataset [Dataset]. https://www.shaip.com/offerings/speech-data-catalog/wake-word-french-dataset/
    Explore at:
    Dataset updated
    Apr 5, 2024
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    French
    Description

    Home Wake Word French DatasetHigh-Quality French Wake Word Dataset for AI & Speech Models Contact Us OverviewTitleWake Word French Language DatasetDataset TypeWake WordDescriptionWake Words / Voice Command / Trigger Word…

  5. calliphonie

    • huggingface.co
    Updated Oct 21, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    datasets-CNRS (2024). calliphonie [Dataset]. https://huggingface.co/datasets/datasets-CNRS/calliphonie
    Explore at:
    Dataset updated
    Oct 21, 2024
    Dataset provided by
    French National Centre for Scientific Researchhttp://www.cnrs.fr/
    Authors
    datasets-CNRS
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    [!NOTE] Dataset origin: https://www.ortolang.fr/market/corpora/calliphonie

    [!WARNING] Vous devez vous rendre sur le site d'Ortholang et vous connecter afin de télécharger les données.

      Description
    

    Content and technical data:

      From Ref. 1
    

    Two speakers (a female and a male, native speakers of French) recorded the corpus. They produced each sentence according to two different instructions: (1) emphasis on a specific word of the sentence (generally the verb) and (2)… See the full description on the dataset page: https://huggingface.co/datasets/datasets-CNRS/calliphonie.

  6. F

    Canadian French Retail Scripted Monologue Speech Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). Canadian French Retail Scripted Monologue Speech Dataset [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/retail-scripted-speech-monologues-spanish-usa
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    Canada, French
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Canadian French Scripted Monologue Speech Dataset for the Retail & E-commerce domain. This dataset is built to accelerate the development of French language speech technologies especially for use in retail-focused automatic speech recognition (ASR), natural language processing (NLP), voicebots, and conversational AI applications.

    Speech Data

    This training dataset includes 6,000+ high-quality scripted audio recordings in Canadian French, created to reflect real-world scenarios in the Retail & E-commerce sector. These prompts are tailored to improve the accuracy and robustness of customer-facing speech technologies.

    Participant Diversity
    Speakers: 60 native French speakers from across Canada
    Geographic Coverage: Multiple Canada regions to ensure dialect and accent diversity
    Demographics: Participants aged 18 to 70, with a 60:40 male-to-female distribution
    Recording Details
    Nature of Recording: Scripted monologue-style speech prompts
    Duration: Each recording spans 5 to 30 seconds
    Audio Format: WAV format, mono channel, 16-bit depth, and 8kHz / 16kHz sample rates
    Environment: Recorded in quiet conditions, free from background noise and echo

    Topic Diversity

    This dataset includes a comprehensive set of retail-specific topics to ensure wide linguistic coverage for AI training:

    Customer Service Interactions
    Order Placement and Payment Processes
    Product and Service Inquiries
    Technical Support Queries
    General Information and Guidance
    Promotional and Sales Announcements
    Domain-Specific Service Statements

    Contextual Enrichment

    To increase training utility, prompts include contextual data such as:

    Region-Specific Names: Common Canada male and female names in diverse formats
    Addresses: Localized address variations spoken naturally
    Dates & Times: Realistic phrasing in delivery, promotions, and return policies
    Product References: Real-world product names, brands, and categories
    Numerical Data: Spoken numbers and prices used in transactions and offers
    Order IDs & Tracking Numbers: Common references in customer service calls

    These additions help your models learn to recognize structured and unstructured retail-related speech.

    Transcription

    Every audio file is paired with a verbatim transcription, ensuring consistency and alignment for model training.

    Content: Exact scripted prompts as spoken by the participant
    Format: Provided in plain text (.TXT) format with filenames matching the associated audio
    Quality Assurance: All transcripts are verified for accuracy by native French transcribers

    Metadata

    Detailed metadata is included to support filtering, analysis, and model evaluation:

    <span

  7. F

    Canadian French Scripted Monologue Speech Data for Telecom

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). Canadian French Scripted Monologue Speech Data for Telecom [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/telecom-scripted-speech-monologues-spanish-usa
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    French, Canada
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Presenting the Canadian French Scripted Monologue Speech Dataset for the Telecom Domain, a purpose-built dataset created to accelerate the development of French speech recognition and voice AI models specifically tailored for the telecommunications industry.

    Speech Data

    This dataset includes over 6,000 high-quality scripted prompt recordings in Canadian French, representing real-world telecom customer service scenarios. It’s designed to support the training of speech-based AI systems used in call centers, virtual agents, and voice-powered support tools.

    Participant Diversity
    Speakers: 60 native Canadian French speakers
    Geographic Distribution: Carefully selected from multiple regions across Canada to capture a wide spectrum of dialects and speaking styles
    Demographics: Balanced representation of males and females (60:40 ratio), aged between 18 to 70 years
    Recording Specifications
    Type: Scripted monologue prompts focused on telecom industry use cases
    Duration: Each audio clip ranges from 5 to 30 seconds
    Format: WAV files in mono, 16-bit depth, with sample rates of 8 kHz and 16 kHz
    Environment: Clean, echo-free, and noise-controlled settings to ensure optimal audio clarity

    Topic Coverage

    The dataset reflects a wide variety of common telecom customer interactions, including:

    Customer onboarding and service inquiries
    Billing and payment questions
    Data plans and product information
    Technical support requests
    Network coverage discussions
    Regulatory compliance and policy information
    Upgrades, renewals, and service plan changes
    Domain-specific scripted interactions tailored to real-world telecom use cases

    Contextual Depth

    To maximize contextual richness, prompts include:

    Localized Names: Common Canada names in various formats
    Addresses: Region-specific address structures for realism
    Dates & Times: Spoken date and time references in typical telecom scenarios (e.g., billing cycles, service activation times)
    Telecom Terminology: Keywords related to mobile data, network, SIM, devices, plans, etc.
    Numbers & Rates: Usage statistics, pricing info, recharge values, and billing figures
    Service Providers: References to telecom companies and third-party service entities

    Transcription

    Each audio file is paired with an accurate, verbatim transcription for precise model training:

    Content: Transcriptions are direct representations of each recorded prompt
    Format: Plain text (.TXT), with filenames matching their corresponding audio files
    Verification: Every transcription is manually verified by native Canadian French linguists to ensure consistency and accuracy

    Metadata

    Detailed metadata is included to

  8. 2025 Green Card Report for Education Teaching French To Speakers Of Other...

    • myvisajobs.com
    Updated Jan 16, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    MyVisaJobs (2025). 2025 Green Card Report for Education Teaching French To Speakers Of Other Languages [Dataset]. https://www.myvisajobs.com/reports/green-card/major/education-teaching-french-to-speakers-of-other-languages/
    Explore at:
    Dataset updated
    Jan 16, 2025
    Dataset provided by
    MyVisaJobs.com
    Authors
    MyVisaJobs
    License

    https://www.myvisajobs.com/terms-of-service/https://www.myvisajobs.com/terms-of-service/

    Area covered
    French
    Variables measured
    Major, Salary, Petitions Filed
    Description

    A dataset that explores Green Card sponsorship trends, salary data, and employer insights for education teaching french to speakers of other languages in the U.S.

  9. E

    OrienTel French as spoken in Morocco database

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Feb 22, 2007
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2007). OrienTel French as spoken in Morocco database [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0185/
    Explore at:
    Dataset updated
    Feb 22, 2007
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdfhttps://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdfhttps://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Area covered
    French, Morocco
    Description

    The OrienTel French as spoken in Morocco database comprises 530 Moroccan speakers of French (264 males, 266 females) recorded over the Moroccan fixed and mobile telephone network. This database is partitioned into 1 CD and 1 DVD. The speech databases made within the OrienTel project were validated by SPEX, the Netherlands, to assess their compliance with the OrienTel format and content specifications.Speech samples are stored as sequences of 8-bit 8 kHz A-law. Each prompted utterance is stored in a separate file. Each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information.Each speaker uttered the following items:•1 isolated single digit•1 sequencesof 10 isolated digits•5 connected digits : 1 prompt sheet number (6 digits), 1 telephone number (6-15 digits), 1 credit card number (14-16 digits), 1 PIN code (6 digits), 1 spontaneous phone number•1 currency money amount•2 natural numbers•3+1 dates : 1 prompted date, 1 relative or general date expression, 1 prompted date phrase + 1 additional (Western calendar)•2 time phrases : 1 time of day (spontaneous), 1 time phrase (word style)•3 spelled words : 1 spontaneous (own forename), 1 city name, 1 real word for coverage•5 directory assistance utterances : 1 spontaneous, own forename, 1 city of childhood (spontaneous), 1 frequent city name, 1 frequent company name, 1 common forename and surname•2 yes/no questions : 1 predominantly ”yes” question, 1 predominantly ”no” question•6 application keywords/keyphrases•1 word spotting phrase using embedded application words•4 phonetically rich words•9 phonetically rich sentences•2 spontaneous items (for control)The following age distribution has been obtained: 256 speakers are between 16 and 30, 210 speakers are between 31 and 45, 63 speakers are between 46 and 60, 1 speaker is over 60.A pronunciation lexicon with a phonemic transcription in SAMPA is also included.

  10. h

    african_accented_french

    • huggingface.co
    Updated Jun 7, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Théo Gigant (2022). african_accented_french [Dataset]. https://huggingface.co/datasets/gigant/african_accented_french
    Explore at:
    Dataset updated
    Jun 7, 2022
    Authors
    Théo Gigant
    License

    https://choosealicense.com/licenses/cc/https://choosealicense.com/licenses/cc/

    Area covered
    French
    Description


    This corpus consists of approximately 22 hours of speech recordings. Transcripts are provided for all the recordings. The corpus can be divided into 3 parts:

    1. Yaounde

    Collected by a team from the U.S. Military Academy's Center for Technology Enhanced Language Learning (CTELL) in 2003 in Yaoundé, Cameroon. It has recordings from 84 speakers, 48 male and 36 female.

    1. CA16

    This part was collected by a RDECOM Science Team who participated in the United Nations exercise Central Accord 16 (CA16) in Libreville, Gabon in June 2016. The Science Team included DARPA's Dr. Boyan Onyshkevich and Dr. Aaron Lawson (SRI International), as well as RDECOM scientists. It has recordings from 125 speakers from Cameroon, Chad, Congo and Gabon.

    1. Niger

    This part was collected from 23 speakers in Niamey, Niger, Oct. 26-30 2015. These speakers were students in a course for officers and sergeants presented by Army trainers assigned to U.S. Army Africa. The data was collected by RDECOM Science & Technology Advisors Major Eddie Strimel and Mr. Bill Bergen.

  11. Common languages used for web content 2025, by share of websites

    • statista.com
    • ai-chatbox.pro
    Updated Feb 11, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista (2025). Common languages used for web content 2025, by share of websites [Dataset]. https://www.statista.com/statistics/262946/most-common-languages-on-the-internet/
    Explore at:
    Dataset updated
    Feb 11, 2025
    Dataset authored and provided by
    Statistahttp://statista.com/
    Time period covered
    Feb 2025
    Area covered
    Worldwide
    Description

    As of February 2025, English was the most popular language for web content, with over 49.4 percent of websites using it. Spanish ranked second, with six percent of web content, while the content in the German language followed, with 5.6 percent. English as the leading online language United States and India, the countries with the most internet users after China, are also the world's biggest English-speaking markets. The internet user base in both countries combined, as of January 2023, was over a billion individuals. This has led to most of the online information being created in English. Consequently, even those who are not native speakers may use it for convenience. Global internet usage by regions As of October 2024, the number of internet users worldwide was 5.52 billion. In the same period, Northern Europe and North America were leading in terms of internet penetration rates worldwide, with around 97 percent of its populations accessing the internet.

  12. e

    Percent of Population with Limited Ability to Speak English

    • coronavirus-resources.esri.com
    • data.amerigeoss.org
    • +1more
    Updated Jul 3, 2019
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Urban Observatory by Esri (2019). Percent of Population with Limited Ability to Speak English [Dataset]. https://coronavirus-resources.esri.com/maps/78a668915cbc4bf983330608f3d687aa
    Explore at:
    Dataset updated
    Jul 3, 2019
    Dataset authored and provided by
    Urban Observatory by Esri
    Area covered
    Description

    This map shows the percent of population with a limited ability to speak English by census tract. Search to your community and investigate the top language needs in nearby census tracts.*DATA AS OF 2011-2015*Data Source: U.S. Census Bureau's American Community Survey 5-year estimates, 2011-2015, Table B16001.Complete list of all languages available in this data set (29):Spanish or Spanish Creole; French (including Patois, Cajun); French Creole; Italian; Portuguese; German; Yiddish; Greek; Russian; Polish; Serbo-Croatian; Armenian; Persian; Gujarati; Hindi; Urdu; Chinese; Japanese; Korean; Mon-Khmer, Cambodian; Hmong; Thai; Laotian; Vietnamese; Tagalog; Navajo; Hungarian; Arabic; Hebrew. Those who have limited English ability and speak other languages are included in the percentage depicted in the map, but other languages will not appear in the ranked list or in the table.Accompanying feature layer and viewing app are also available.

  13. News Events Data in Latin America( Techsalerator)

    • datarade.ai
    Updated Mar 20, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Techsalerator (2024). News Events Data in Latin America( Techsalerator) [Dataset]. https://datarade.ai/data-products/news-events-data-in-latin-america-techsalerator-techsalerator
    Explore at:
    .json, .csv, .xls, .txtAvailable download formats
    Dataset updated
    Mar 20, 2024
    Dataset provided by
    Techsalerator LLC
    Authors
    Techsalerator
    Area covered
    Americas, Latin America, Argentina, Falkland Islands (Malvinas), Cuba, Aruba, Chile, French Guiana, Ecuador, Martinique, Montserrat, Dominican Republic
    Description

    Techsalerator’s News Event Data in Latin America offers a detailed and extensive dataset designed to provide businesses, analysts, journalists, and researchers with an in-depth view of significant news events across the Latin American region. This dataset captures and categorizes key events reported from a wide array of news sources, including press releases, industry news sites, blogs, and PR platforms, offering valuable insights into regional developments, economic changes, political shifts, and cultural events.

    Key Features of the Dataset: Comprehensive Coverage:

    The dataset aggregates news events from numerous sources such as company press releases, industry news outlets, blogs, PR sites, and traditional news media. This broad coverage ensures a wide range of information from multiple reporting channels. Categorization of Events:

    News events are categorized into various types including business and economic updates, political developments, technological advancements, legal and regulatory changes, and cultural events. This categorization helps users quickly locate and analyze information relevant to their interests or sectors. Real-Time Updates:

    The dataset is updated regularly to include the most recent events, ensuring users have access to the latest news and can stay informed about current developments. Geographic Segmentation:

    Events are tagged with their respective countries and regions within Latin America. This geographic segmentation allows users to filter and analyze news events based on specific locations, facilitating targeted research and analysis. Event Details:

    Each event entry includes comprehensive details such as the date of occurrence, source of the news, a description of the event, and relevant keywords. This thorough detailing helps in understanding the context and significance of each event. Historical Data:

    The dataset includes historical news event data, enabling users to track trends and perform comparative analysis over time. This feature supports longitudinal studies and provides insights into how news events evolve. Advanced Search and Filter Options:

    Users can search and filter news events based on criteria such as date range, event type, location, and keywords. This functionality allows for precise and efficient retrieval of relevant information. Latin American Countries Covered: South America: Argentina Bolivia Brazil Chile Colombia Ecuador Guyana Paraguay Peru Suriname Uruguay Venezuela Central America: Belize Costa Rica El Salvador Guatemala Honduras Nicaragua Panama Caribbean: Cuba Dominican Republic Haiti (Note: Primarily French-speaking but included due to geographic and cultural ties) Jamaica Trinidad and Tobago Benefits of the Dataset: Strategic Insights: Businesses and analysts can use the dataset to gain insights into significant regional developments, economic conditions, and political changes, aiding in strategic decision-making and market analysis. Market and Industry Trends: The dataset provides valuable information on industry-specific trends and events, helping users understand market dynamics and emerging opportunities. Media and PR Monitoring: Journalists and PR professionals can track relevant news across Latin America, enabling them to monitor media coverage, identify emerging stories, and manage public relations efforts effectively. Academic and Research Use: Researchers can utilize the dataset for longitudinal studies, trend analysis, and academic research on various topics related to Latin American news and events. Techsalerator’s News Event Data in Latin America is a crucial resource for accessing and analyzing significant news events across the region. By providing detailed, categorized, and up-to-date information, it supports effective decision-making, research, and media monitoring across diverse sectors.

  14. Dialogue_francais_role_play

    • huggingface.co
    Updated Jan 29, 2009
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    datasets-CNRS (2009). Dialogue_francais_role_play [Dataset]. https://huggingface.co/datasets/datasets-CNRS/Dialogue_francais_role_play
    Explore at:
    Dataset updated
    Jan 29, 2009
    Dataset provided by
    French National Centre for Scientific Researchhttp://www.cnrs.fr/
    Authors
    datasets-CNRS
    Area covered
    French
    Description

    [!NOTE] Dataset origin: https://www.ortolang.fr/market/corpora/sldr000738 and https://www.ortolang.fr/market/corpora/sldr000739

    [!CAUTION] Ce jeu de données ne contient que les transcriptions. Pour récupérer les audios (sldr000738), vous devez vous rendre sur le site d'Ortholang et vous connecter afin de télécharger les données.

      Description
    

    Dialogue in French (role-play). The speech material used here contains dialogues spoken by 38 native speakers of French (10 pairs of… See the full description on the dataset page: https://huggingface.co/datasets/datasets-CNRS/Dialogue_francais_role_play.

  15. F

    Canadian French Scripted Monologue Speech Dataset for BFSI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). Canadian French Scripted Monologue Speech Dataset for BFSI [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/bfsi-scripted-speech-monologues-spanish-usa
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    French, Canada
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Canadian French Scripted Monologue Speech Dataset tailored for the BFSI (Banking, Financial Services, and Insurance) domain. This dataset empowers the development of advanced French speech recognition systems, natural language understanding models, and conversational AI solutions focused on the BFSI sector.

    Speech Data

    This dataset includes over 6,000 scripted prompt recordings in Canadian French, covering a wide range of realistic banking and finance-related scenarios to support robust ASR and voice AI systems.

    Participant Diversity
    Speakers: 60 native Canadian French speakers.
    Regions: Diverse representation from various Canada provinces to ensure dialect and accent coverage.
    Demographics: Age range of 18–70, with a male-to-female ratio of 60:40.
    Recording Details
    Nature: Scripted monologues and domain-specific prompt recordings.Duration:
    Audio Format: WAV, mono channel, 16-bit depth, recorded at 8 kHz and 16 kHz sample rates.
    Environment: Clean, echo-free, and noise-free environments.

    Topic & Context Diversity

    This dataset spans multiple BFSI-related themes to simulate practical customer interaction scenarios:

    Customer service interactions
    Financial transactions & balance inquiries
    Banking and insurance product queries
    Loan & credit support
    Regulatory and compliance questions
    Technical help and password resets
    Promotional campaigns and service updates

    Contextual Elements

    To make the dataset as context-rich as possible, each prompt integrates commonly encountered real-world BFSI elements:

    Names: Region-specific names in multiple formats
    Addresses: Local address structures and pronunciations
    Dates & Times: Typical time expressions used in banking
    Organization Names: Names of banks, financial firms, and institutions
    Currencies & Amounts: Spoken currency formats, prices, and numeric data
    IDs & Transaction Numbers: For authentic service simulation

    Transcription

    Every audio file is paired with verbatim transcription to streamline ASR and NLP model development.

    Content: Exact match of each prompt
    Format: Clean .TXT files, mapped to audio file names
    Accuracy: Reviewed and validated by native Canadian French linguists

    Metadata

    Each data point is enriched with detailed metadata for advanced training and analysis:

    Participant Metadata: Unique ID, age, gender, state, country, dialect
    Recording Metadata: Transcript, recording setup, sample rate, bit depth, device, file format

    Applications and Use Cases

    This BFSI-focused dataset

  16. E

    OrienTel French as spoken in Tunisia database

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Feb 22, 2007
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2007). OrienTel French as spoken in Tunisia database [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0188/
    Explore at:
    Dataset updated
    Feb 22, 2007
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdfhttps://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdfhttps://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Area covered
    Tunisia, French
    Description

    The OrienTel French as spoken in Tunisia database comprises 576 Tunisian speakers of French (290 males, 286 females) recorded over the Tunisian fixed and mobile telephone network. This database is partitioned into 1 CD and 1 DVD. The speech databases made within the OrienTel project were validated by SPEX, the Netherlands, to assess their compliance with the OrienTel format and content specifications.Speech samples are stored as sequences of 8-bit 8 kHz A-law. Each prompted utterance is stored in a separate file. Each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information.Each speaker uttered the following items:•1 isolated single digit•1 sequencesof 10 isolated digits•5 connected digits : 1 prompt sheet number (6 digits), 1 telephone number (6-15 digits), 1 credit card number (14-16 digits), 1 PIN code (6 digits), 1 spontaneous phone number•1 currency money amount•2 natural numbers•3 dates : 1 prompted date, 1 relative or general date expression, 1 prompted date phrase (Western calendar)•2 time phrases : 1 time of day (spontaneous), 1 time phrase (word style)•3 spelled words : 1 spontaneous (own forename), 1 city name, 1 real word for coverage•5 directory assistance utterances : 1 spontaneous, own forename, 1 city of childhood (spontaneous), 1 frequent city name, 1 frequent company name, 1 common forename and surname•2 yes/no questions : 1 predominantly ”yes” question, 1 predominantly ”no” question•6 application keywords/keyphrases•1 word spotting phrase using embedded application words•4 phonetically rich words•9 phonetically rich sentences•2+3 spontaneous items (for control)The following age distribution has been obtained: 2 speakers are below 16, 407 speakers are between 16 and 30, 104 speakers are between 31 and 45, 59 speakers are between 46 and 60, 4 speakers are over 60.A pronunciation lexicon with a phonemic transcription in SAMPA is also included.

  17. E

    MEDIA speech database for French

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Mar 27, 2008
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2008). MEDIA speech database for French [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0272/
    Explore at:
    Dataset updated
    Mar 27, 2008
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdfhttps://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdfhttps://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    Area covered
    French
    Description

    The MEDIA speech database for French was produced by ELDA within the French national project MEDIA (Automatic evaluation of man-machine dialogue systems), as part of the Technolangue programme funded by the French Ministry of Research and New Technologies (MRNT).It contains 1,258 transcribed dialogues from 250 adult speakers. The method chosen for the corpus construction process is that of a ‘Wizard of Oz’ (WoZ) system. This consists of simulating a natural language man-machine dialogue. The scenario was built in the domain of tourism and hotel reservation. The database is formatted following the SpeechDat conventions and it includes the following items:•1,258 recorded sessions for a total of 70 hours of speech. The signals are stored in a stereo wave file format. Each of the two speech channels is recorded at 8 kHz with 16 bit quantization with the least significant byte first (“lohi” or Intel format) as signed integers. •Manual transcription of each session in XML format. Label files were created with the free transcription tool Transcriber (TRS files).•Phonetic lexicon containing all the words spoken in the database. Column 1 contains the orthography of the French word. Column 2 shows the frequency of the word. Column 3 contains the pronunciation in SAMPA format. Here is a sample entry of the lexicon:1)agitée3A/ Z i t e•Documentation and statistics are also provided with the database.The semantic annotation of the corpus is available in this catalogue and referenced ELRA-E0024 (MEDIA Evaluation Package).

  18. E

    BREF-120 - A large corpus of French read speech

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Feb 22, 2007
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2007). BREF-120 - A large corpus of French read speech [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0067/
    Explore at:
    Dataset updated
    Feb 22, 2007
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdfhttps://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdfhttps://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Area covered
    French
    Description

    BREF-120 resulted from the efforts of LIMSI-CNRS researchers under sponsorship from the GDR-PRC CHM, the ACCT (OFIL), the EEC (ESPRIT Polyglot project), and the Aupelf-Uref.A sub-set of BREF-120 is BREF-80 (ELRA-S0006), which consists of about 50-60 sentences per speaker and recordings conducted only with a Shure microphone. In BREF-80, the sentences were chosen to cover as many prompts as possible.The BREF-120 corpus was designed to provide read speech data for the development and evaluation of continuous speech recognition systems (both speaker-dependent and speaker-independent), and to provide a large corpus of continuous speech for the acquisition of acoustic-phonetic knowledge of spoken French.BREF-120 is a large read-speech corpus containing over 100 hours of speech material, from 120 speakers (55 males and 65 females). The text materials were selected verbatim from extracts of the French newspaper "Le Monde". Each of 80 speakers read approximately 10,000 words (about 650 sentences) of text, and another 40 speakers each read about half that amount. Simultaneous recordings were made in a sound-proof room using a Shure SM10 microphone and a Crown PCC160 microphone and were monitored to assure their contents. The speech signal was sampled at 16 kHz and digitised with 16 bits. The BREF-120 corpus contains 28 CDs; numbers 1-13 contain the Shure recorded data and numbers 14-28 contain the Crown recorded data

  19. o

    Data and Code for: The Impact of Host Language Proficiency on Migrants'...

    • openicpsr.org
    delimited
    Updated Jan 6, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Lukas Schmid (2023). Data and Code for: The Impact of Host Language Proficiency on Migrants' Employment Outcomes [Dataset]. http://doi.org/10.3886/E183861V1
    Explore at:
    delimitedAvailable download formats
    Dataset updated
    Jan 6, 2023
    Dataset provided by
    American Economic Association
    Authors
    Lukas Schmid
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 2008 - Dec 31, 2017
    Area covered
    Switzerland
    Description

    This paper estimates the economic gains from proficiency in the host country's language on migrants' employment outcomes by exploiting the exogenous placement of refugees to Swiss cantons and a sharp language border dividing German- and French-speaking regions. Using administrative data on African refugees who applied for Swiss asylum between 2008 and 2017, I compare French-speaking refugees assigned to the French-speaking region to French-speaking refugees assigned to the German-speaking region, and adjust for common regional differences with outcomes from English-speaking African refugees. The results suggest that language proficiency more than doubles the employment level in the first five years after arrival.

  20. d

    Global English Speech with Accent Conversational Dataset — Multi-Region...

    • datarade.ai
    .wav
    Updated Jul 21, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FileMarket (2025). Global English Speech with Accent Conversational Dataset — Multi-Region Validated Speech with Gender, Age & Metadata for AI & NLP Training [Dataset]. https://datarade.ai/data-products/global-english-speech-with-accent-conversational-dataset-mu-filemarket
    Explore at:
    .wavAvailable download formats
    Dataset updated
    Jul 21, 2025
    Dataset authored and provided by
    FileMarket
    Area covered
    Nicaragua, United States Minor Outlying Islands, Tonga, Iceland, Comoros, Haiti, Montenegro, Cook Islands, Bangladesh, Yemen
    Description

    The Global English Accent Conversational NLP Dataset is a comprehensive collection of validated English speech recordings sourced from native and non-native English speakers across key global regions. This dataset is designed for training Natural Language Processing models, conversational AI, Automatic Speech Recognition (ASR), and linguistic research, with a focus on regional accent variation.

    Regions and Covered Countries with Primary Spoken Languages:

    Africa: South Africa (English, Zulu, Afrikaans, Xhosa) Nigeria (English, Yoruba, Igbo, Hausa) Kenya (English, Swahili) Ghana (English, Twi, Ewe, Ga) Uganda (English, Luganda) Ethiopia (English, Amharic, Oromo)

    Central & South America: Mexico (Spanish, English as a second language) Guatemala (Spanish, K'iche', English) El Salvador (Spanish, English) Costa Rica (Spanish, English in Caribbean regions) Colombia (Spanish, English in urban centers) Dominican Republic (Spanish, English in tourist zones) Brazil (Portuguese, English in urban areas) Argentina (Spanish, English among educated speakers)

    Southeast Asia & South Asia: Philippines (Filipino, English) Vietnam (Vietnamese, English) Malaysia (Malay, English, Mandarin) Indonesia (Indonesian, Javanese, English) Singapore (English, Mandarin, Malay, Tamil) India (Hindi, English, Bengali, Tamil) Pakistan (Urdu, English, Punjabi)

    Europe: United Kingdom (English) Ireland (English, Irish) Germany (German, English) France (French, English) Spain (Spanish, Catalan, English) Italy (Italian, English) Portugal (Portuguese, English)

    Oceania: Australia (English) New Zealand (English, Māori) Fiji (English, Fijian) North America: United States (English, Spanish) Canada (English, French)

    Dataset Attributes: - Conversational English with natural accent variation - Global coverage with balanced male/female speakers - Rich speaker metadata: age, gender, country, city - Average audio length of ~30 minutes per participant - All samples manually validated for accuracy - Structured format suitable for machine learning and AI applications

    Best suited for: - NLP model training and evaluation - Multilingual ASR system development - Voice assistant and chatbot design - Accent recognition research - Voice synthesis and TTS modeling

    This dataset ensures global linguistic diversity and delivers high-quality audio for AI developers, researchers, and enterprises working on voice-based applications.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Statista (2025). Ranking of languages spoken at home in the U.S. 2023 [Dataset]. https://www.statista.com/statistics/183483/ranking-of-languages-spoken-at-home-in-the-us-in-2008/
Organization logo

Ranking of languages spoken at home in the U.S. 2023

Explore at:
15 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Apr 14, 2025
Dataset authored and provided by
Statistahttp://statista.com/
Time period covered
2023
Area covered
United States
Description

In 2023, around 43.37 million people in the United States spoke Spanish at home. In comparison, approximately 998,179 people were speaking Russian at home during the same year. The distribution of the U.S. population by ethnicity can be accessed here. A ranking of the most spoken languages across the world can be accessed here.

Search
Clear search
Close search
Google apps
Main menu