71 datasets found
  1. European Portuguese General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). European Portuguese General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-portuguese-portugal
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Portuguese General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Portuguese speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Portuguese communication.

    Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Portuguese speech models that understand and respond to authentic Portuguese accents and dialects.

    Speech Data

    The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Portuguese. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 60 verified native Portuguese speakers from FutureBeeAI’s contributor community.
    Regions: Representing various provinces of Portugal to ensure dialectal diversity and demographic balance.
    Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.
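
The audio format stated above (stereo, 16-bit, 16kHz WAV) can be checked on a downloaded file before training. The following is a minimal sketch using Python's standard `wave` module; the helper names `describe_wav`, `matches_listing`, and `make_silent_wav` are illustrative, not part of any vendor tooling, and the synthetic file stands in for a real recording.

```python
import io
import wave


def describe_wav(source):
    """Return (channels, bits_per_sample, sample_rate_hz) for a WAV path or file object."""
    with wave.open(source, "rb") as wav:
        return wav.getnchannels(), wav.getsampwidth() * 8, wav.getframerate()


def matches_listing(source, channels=2, bits=16, rate=16000):
    """True if the file has the format stated in the listing (defaults: stereo, 16-bit, 16kHz)."""
    return describe_wav(source) == (channels, bits, rate)


def make_silent_wav(channels=2, bits=16, rate=16000, seconds=1):
    """Build an in-memory silent WAV, used here only as a stand-in for a real recording."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(bits // 8)
        wav.setframerate(rate)
        wav.writeframes(bytes(channels * (bits // 8) * rate * seconds))
    buf.seek(0)
    return buf


print(describe_wav(make_silent_wav()))
```

For a file on disk, pass its path instead of the in-memory buffer, for example `matches_listing("conversation.wav")` (a hypothetical file name).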

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
    High transcription accuracy, achieved through a double QA pass (average WER < 5%)

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
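
Word error rate (WER) is conventionally defined as (substitutions + deletions + insertions) divided by the number of words in the reference. The listing does not say how its figure was measured, so the sketch below only shows the standard word-level edit-distance calculation that could be used to spot-check a transcript against a trusted reference; the function name and the sample sentences are illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One missing word out of five reference words.
print(word_error_rate("o voo foi cancelado hoje", "o voo foi cancelado"))
```

In practice, text is usually normalized (case, punctuation, number formatting) before scoring, and the choice of normalization affects the result.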

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
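
As an illustration of the metadata-based filtering mentioned above, the sketch below selects speakers by gender and age. The listing does not publish the metadata schema, so the field names (`participant_id`, `age`, `gender`) and the sample records are invented for the example.

```python
import json

# Invented sample records; the field names and values are hypothetical.
SAMPLE_METADATA = """
[
  {"participant_id": "P001", "age": 24, "gender": "female"},
  {"participant_id": "P002", "age": 61, "gender": "male"},
  {"participant_id": "P003", "age": 37, "gender": "female"}
]
"""


def filter_speakers(records, gender=None, min_age=None, max_age=None):
    """Keep records that satisfy every criterion that was supplied."""
    selected = []
    for rec in records:
        if gender is not None and rec["gender"] != gender:
            continue
        if min_age is not None and rec["age"] < min_age:
            continue
        if max_age is not None and rec["age"] > max_age:
            continue
        selected.append(rec)
    return selected


speakers = json.loads(SAMPLE_METADATA)
younger_women = filter_speakers(speakers, gender="female", max_age=40)
print([rec["participant_id"] for rec in younger_women])
```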

    Usage and Applications

    This dataset is a versatile resource for multiple Portuguese speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for Portuguese.
    Voice Assistants: Build smart assistants capable of understanding natural Portuguese conversations.

  2. Portuguese (Brazil) Call Center Data for Retail & E-Commerce AI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Portuguese (Brazil) Call Center Data for Retail & E-Commerce AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/retail-call-center-conversation-portuguese-brazil
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    Brazil
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    This Brazilian Portuguese Call Center Speech Dataset for the Retail and E-commerce industry is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for Portuguese speakers. Featuring over 30 hours of real-world, unscripted audio, it provides authentic human-to-human customer service conversations vital for training robust ASR models.

    Curated by FutureBeeAI, this dataset empowers voice AI developers, data scientists, and language model researchers to build high-accuracy, production-ready models across retail-focused use cases.

    Speech Data

    The dataset contains 30 hours of dual-channel call center recordings between native Brazilian Portuguese speakers. Captured in realistic scenarios, these conversations span diverse retail topics from product inquiries to order cancellations, providing a wide context range for model training and testing.

    Participant Diversity:
    Speakers: 60 native Brazilian Portuguese speakers from our verified contributor pool.
    Regions: Representing multiple provinces across Brazil to ensure coverage of various accents and dialects.
    Participant Profile: Balanced gender mix (60% male, 40% female) with age distribution from 18 to 70 years.
    Recording Details:
    Conversation Nature: Naturally flowing, unscripted interactions between agents and customers.
    Call Duration: Ranges from 5 to 15 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, at 8kHz and 16kHz sample rates.
    Recording Environment: Captured in clean conditions with no echo or background noise.
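
Dual-channel recordings are often processed one speaker at a time. The sketch below de-interleaves a 16-bit stereo WAV into two sample arrays using only the standard library. It assumes each speaker occupies one channel, which is the usual meaning of "dual-channel" but is not confirmed by the listing, and it makes no claim about which channel holds the agent. The helper `make_stereo_wav` exists only to build test input.

```python
import array
import io
import sys
import wave


def split_stereo(source):
    """Return (sample_rate, left_samples, right_samples) from a 16-bit stereo WAV."""
    with wave.open(source, "rb") as wav:
        if wav.getnchannels() != 2 or wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit stereo audio")
        rate = wav.getframerate()
        samples = array.array("h", wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    # Stereo frames are interleaved as L, R, L, R, ...
    return rate, samples[0::2], samples[1::2]


def make_stereo_wav(left, right, rate=8000):
    """Build an in-memory stereo WAV from two equal-length lists of 16-bit samples."""
    interleaved = array.array("h")
    for l_sample, r_sample in zip(left, right):
        interleaved.extend((l_sample, r_sample))
    if sys.byteorder == "big":
        interleaved.byteswap()
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(2)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(interleaved.tobytes())
    buf.seek(0)
    return buf


rate, left, right = split_stereo(make_stereo_wav([1, 2, 3], [-1, -2, -3]))
print(rate, list(left), list(right))
```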

    Topic Diversity

    This speech corpus includes both inbound and outbound calls with varied conversational outcomes like positive, negative, and neutral, ensuring real-world scenario coverage.

    Inbound Calls:
    Product Inquiries
    Order Cancellations
    Refund & Exchange Requests
    Subscription Queries, and more
    Outbound Calls:
    Order Confirmations
    Upselling & Promotions
    Account Updates
    Loyalty Program Offers
    Customer Verifications, and others

    Such variety enhances your model’s ability to generalize across retail-specific voice interactions.

    Transcription

    All audio files are accompanied by manually curated, time-coded verbatim transcriptions in JSON format.

    Transcription Includes:
    Speaker-Segmented Dialogues
    Time-coded Segments
    Non-speech Tags (e.g., pauses, cough)
    High transcription accuracy with word error rate < 5% due to double-layered quality checks.

    These transcriptions are production-ready, making model training faster and more accurate.

    Metadata

    Rich metadata is available for each participant and conversation:

    Participant Metadata: ID, age, gender, accent, dialect, and location.
    Conversation Metadata: Topic, sentiment, call type, sample rate, and technical specs.

    This granularity supports advanced analytics, dialect filtering, and fine-tuned model evaluation.

    Usage and Applications

    This dataset is ideal for a range of voice AI and NLP applications:

    Automatic Speech Recognition (ASR): Fine-tune Portuguese speech-to-text systems.

  3. Portuguese speaker Speech Dataset in Brazilian

    • data.macgence.com
    mp3
    Updated Apr 2, 2024
    Cite
    Macgence (2024). Portuguese speaker Speech Dataset in Brazilian [Dataset]. https://data.macgence.com/dataset/portuguese-speaker-speech-dataset-in-brazilian
    Available download formats: mp3
    Dataset updated
    Apr 2, 2024
    Dataset authored and provided by
    Macgence
    License

    https://data.macgence.com/terms-and-conditions

    Time period covered
    2025
    Area covered
    Worldwide
    Variables measured
    Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
    Description

    The audio dataset includes general conversations by Brazilian Portuguese speakers, with detailed metadata.

  4. 209 Hours - English(Portugal) Scripted Monologue Smartphone speech dataset

    • m.nexdata.ai
    Updated Oct 30, 2023
    Cite
    Nexdata (2023). 209 Hours - English(Portugal) Scripted Monologue Smartphone speech dataset [Dataset]. https://m.nexdata.ai/datasets/speechrecog/1023
    Dataset updated
    Oct 30, 2023
    Dataset authored and provided by
    Nexdata
    Variables measured
    Format, Country, Speaker, Language, Accuracy Rate, Content category, Recording device, Recording condition, Features of annotation
    Description

    English(Portugal) Scripted Monologue Smartphone speech dataset, collected from monologues based on given scripts and covering the generic domain, human-machine interaction, smart home command and control, in-car command and control, numbers, and other domains. Transcribed with text content and other attributes. The dataset was collected from a large, geographically diverse pool of speakers (532 people in total), enhancing model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout data collection, storage, and usage; our datasets all comply with GDPR, CCPA, and PIPL.

  5. European Portuguese Call Center Data for Travel AI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). European Portuguese Call Center Data for Travel AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/travel-call-center-conversation-portuguese-portugal
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    Portugal
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    This Portuguese Call Center Speech Dataset for the Travel industry is purpose-built to power the next generation of voice AI applications for travel booking, customer support, and itinerary assistance. With over 30 hours of unscripted, real-world conversations, the dataset enables the development of highly accurate speech recognition and natural language understanding models tailored for Portuguese-speaking travelers.

    Created by FutureBeeAI, this dataset supports researchers, data scientists, and conversational AI teams in building voice technologies for airlines, travel portals, and hospitality platforms.

    Speech Data

    The dataset includes 30 hours of dual-channel audio recordings between native Portuguese speakers engaged in real travel-related customer service conversations. These audio files reflect a wide variety of topics, accents, and scenarios found across the travel and tourism industry.

    Participant Diversity:
    Speakers: 60 native Portuguese contributors from our verified pool.
    Regions: Covering multiple provinces of Portugal to capture accent and dialectal variation.
    Participant Profile: Balanced representation of age (18–70) and gender (60% male, 40% female).
    Recording Details:
    Conversation Nature: Naturally flowing, spontaneous customer-agent calls.
    Call Duration: Between 5 and 15 minutes per session.
    Audio Format: Stereo WAV, 16-bit depth, at 8kHz and 16kHz.
    Recording Environment: Captured in controlled, noise-free, echo-free settings.

    Topic Diversity

    Inbound and outbound conversations span a wide range of real-world travel support situations with varied outcomes (positive, neutral, negative).

    Inbound Calls:
    Booking Assistance
    Destination Information
    Flight Delays or Cancellations
    Support for Disabled Passengers
    Health and Safety Travel Inquiries
    Lost or Delayed Luggage, and more
    Outbound Calls:
    Promotional Travel Offers
    Customer Feedback Surveys
    Booking Confirmations
    Flight Rescheduling Alerts
    Visa Expiry Notifications, and others

    These scenarios help models understand and respond to diverse traveler needs in real-time.

    Transcription

    Each call is accompanied by manually curated, high-accuracy transcriptions in JSON format.

    Transcription Includes:
    Speaker-Segmented Dialogues
    Time-Stamped Segments
    Non-speech Markers (e.g., pauses, coughs)
    High transcription accuracy: a dual-layered transcription review keeps the word error rate under 5%.

    Metadata

    Extensive metadata enriches each call and speaker for better filtering and AI training:

    Participant Metadata: ID, age, gender, region, accent, and dialect.
    Conversation Metadata: Topic, domain, call type, sentiment, and audio specs.

    Usage and Applications

    This dataset is ideal for a variety of AI use cases in the travel and tourism space:

    ASR Systems: Train Portuguese speech-to-text engines for travel platforms.

  6. Brazilian Portuguese Speech Recognition Corpus (Desktop)

    • infinityai.ai
    Updated Apr 25, 2024
    Cite
    DataOceanAI (2024). Brazilian Portuguese Speech Recognition Corpus (Desktop) [Dataset]. www.infinityai.ai
    Dataset updated
    Apr 25, 2024
    Dataset provided by
    datatoceanai
    DataOceanAI
    Authors
    DataOceanAI
    Variables measured
    Product name, Recording duration, Recording language, Recording platform, Recording parameters, Recording environment, Product library number
    Description

    The identification data is recorded in a quiet office environment and collected from a total of 200 speakers, including 104 males and 96 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as daily dialogues and news.

  7. AUDIO Human Voice Pronunciations - Portuguese (Portugal)

    • catalogue.elra.info
    Updated Oct 9, 2023
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2023). AUDIO Human Voice Pronunciations - Portuguese (Portugal) [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0490_16/
    Dataset updated
    Oct 9, 2023
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    Area covered
    Portugal
    Description

    Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows:
    Arabic: 8,119 entries
    Catalan: 2,247 entries
    Chinese (Simplified): 4,719 entries
    Czech: 10,629 entries
    Danish: 8,878 entries
    Dutch: 12,538 entries
    English: 24,663 entries
    Greek: 9,725 entries
    Hebrew: 9,138 entries
    Italian: 16,798 entries
    Japanese: 5,161 entries
    Korean: 5,671 entries
    Norwegian: 11,041 entries
    Polish: 8,861 entries
    Portuguese (Brazil): 9,250 entries
    Portuguese (Portugal): 7,676 entries
    Russian: 7,502 entries
    Spanish: 2,297 entries
    Swedish: 7,534 entries
    Thai: 5,173 entries
    Turkish: 6,491 entries

  8. Portuguese Speech Recognition Corpus (Desktop+Mobile)

    • catalogue.elra.info
    Updated Jun 28, 2024
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2024). Portuguese Speech Recognition Corpus (Desktop+Mobile) [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0228_122/
    Dataset updated
    Jun 28, 2024
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Description

    This corpus was recorded in a quiet office environment over 2 channels and collected from a total of 200 speakers, including 102 males and 98 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as keywords. Speech samples are stored as a sequence of 16-bit 44.1kHz for a total of 76 hours of speech per channel.
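
The figures above allow a rough storage estimate. The sketch below assumes uncompressed PCM and ignores file headers and any packaging overhead, neither of which the listing specifies; the function name is illustrative.

```python
def pcm_bytes(hours, sample_rate_hz, bits_per_sample, channels=1):
    """Raw PCM size in bytes: seconds x samples per second x bytes per sample x channels."""
    seconds = hours * 3600
    return seconds * sample_rate_hz * (bits_per_sample // 8) * channels


per_channel = pcm_bytes(76, 44_100, 16)
both_channels = pcm_bytes(76, 44_100, 16, channels=2)

print(f"{per_channel / 1e9:.2f} GB per channel")
print(f"{both_channels / 1e9:.2f} GB for both channels")
```

Under these assumptions, 76 hours of 16-bit 44.1kHz audio comes to about 24.13 GB per channel, or about 48.26 GB for the two channels together.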

  9. g Neutral Speech Male

    • kaggle.com
    Updated Sep 22, 2022
    Cite
    Mediatech Lab (2022). g Neutral Speech Male [Dataset]. https://www.kaggle.com/mediatechlab/gNeutralSpeech
    Dataset updated
    Sep 22, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mediatech Lab
    Description

    **GLOBO’S DATASET TERMS OF USE**

    The present Terms of Use (“Terms”) regulates the license of use that GLOBO COMUNICAÇÃO E PARTICIPAÇÕES S.A., a company organized and existing in accordance with the Brazilian laws, with head offices at Rua Lopes Quintas 303, in the city and State of Rio de Janeiro, enrolled in the Brazilian tax registration number 27.865.757/0001-02 (hereinafter simply referred to as “Globo”), grants to the individual or entity that exercises the rights licensed under these Terms (“You”) for the use of audios referring to the reading of texts published on Jornal Nacional’s page on the “G1” website, owned by Globo (hereinafter referred to as “Contents”), which are stored at this dataset (“Dataset”).

    **1. Grant of License of Use**

    1.1. The scope of these Terms is a non-exclusive, non-sublicensable authorization, for an undefined term, hereby granted by Globo to You, to use the Contents made available via the Dataset for non-commercial purposes, exclusively for the deployment and promotion of research for development and improvement of technologies, including the elaboration of scientific articles, reports and/or any other type of academic publication. Any other form of use of the Contents stored in the Dataset is prohibited.

    1.1.1. The authorization hereby granted is royalty-free, non-exclusive, and restricted to the use of the Contents made available in the Dataset under the terms and conditions mentioned herein. The storage of the Contents, as well as the capture, reproduction, use in any media, or by any other modality, or use in any medium, for commercial purposes or not, without previously obtaining Globo´s express authorization, is expressly prohibited. Thus, any form of use that has not been expressly authorized by Globo is prohibited. It is also expressly forbidden to assemble, alter, manipulate and/or transform the Contents, by any means or process. If the Contents contain Globo's brands or logos, they must be maintained by You, and the inclusion of any type of advertising, brand and/or sponsors, which may be related to the Contents, is prohibited, unless expressly authorized by Globo. Globo does not authorize the dubbing of voices/performances contained in the Content.

    1.2. You may not, under any circumstances, grant or allow third parties to exploit, under any justification, whether for commercial purposes or not, in Brazil and/or abroad, the Contents, as well as its extracts, excerpts and parts, and You will be responsible for any use not permitted in this instrument, under penalty of being liable for misuse. You hereby undertake to reimburse Globo for all and any damages that it may suffer if such grant or unauthorized use occurs.

    1.3. Globo reserves the right to revoke this authorization, at its sole discretion, without the need for any compensation, if it becomes aware of any non-compliance with the conditions established in these Terms.

    1.4. The use of the Contents in VOD (video on demand) and OTT (over the top) services is expressly prohibited. Failure to comply with this item is cause for immediate termination of the license hereby granted, without prejudice to a claim compensation for losses and damages, at Globo’s sole discretion.

    1.5. You undertake to use the Dataset and the Contents properly and diligently, exclusively for the purposes specified in these Terms, as well as to refrain from using them for purposes or as a mean of committing unlawful acts, prohibited by law and/or rules of these Terms and/or harmful to the rights and interests of Globo and/or third parties, subject to the provisions of item 1.3.

    1.6. Globo reserves the right to, unilaterally, add or remove any functionality and/or Content from the Dataset, expand or reduce its storage capacity or usability, alter its presentation, as well as temporarily restrict or suspend its availability, or even terminate it permanently or temporarily, at any time, at its sole discretion, and without prior notice or consent.

    1.7. Globo will use its best efforts to ensure the correct functioning of the Dataset without interference of any kind. However, considering the characteristics of the Internet environment, Globo does not guarantee the availability, infallibility and continuity of the Dataset, nor that it will be useful for performing any activity in particular, for which Globo exempts itself from any liability for direct or indirect damages of any nature that may result from the unavailability, failure and/or alteration in the Dataset.

    **2. Intellectual Property**

    2.1. Globo declares to be fully responsible for the authorization granted herein.

    2.2. You acknowledge that all Contents made available in the Dataset are owned exclusively by Globo.

    2.3. The reproduction or use of the Contents available in the Dataset in disagreement with the rules established in these Terms constitute a viol...

  10. Brazilian Portuguese Conversational Speech Recognition Corpus (Mobile)

    • en.haitianruisheng.com
    Updated May 7, 2024
    Cite
    DataOceanAI (2024). Brazilian Portuguese Conversational Speech Recognition Corpus (Mobile) [Dataset]. en.haitianruisheng.com
    Dataset updated
    May 7, 2024
    Dataset provided by
    datatoceanai
    DataOceanAI
    Authors
    DataOceanAI
    Variables measured
    Product name, Recording duration, Recording language, Recording platform, Recording parameters, Recording environment, Product library number
    Description

    The identification data is recorded in a quiet office/home environment and collected from a total of 198 speakers, including 95 males and 103 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as education, family, food and pets.

  11. Portuguese (Brazil) Call Center Data for Travel AI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Portuguese (Brazil) Call Center Data for Travel AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/travel-call-center-conversation-portuguese-brazil
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    Brazil
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    This Brazilian Portuguese Call Center Speech Dataset for the Travel industry is purpose-built to power the next generation of voice AI applications for travel booking, customer support, and itinerary assistance. With over 30 hours of unscripted, real-world conversations, the dataset enables the development of highly accurate speech recognition and natural language understanding models tailored for Portuguese-speaking travelers.

    Created by FutureBeeAI, this dataset supports researchers, data scientists, and conversational AI teams in building voice technologies for airlines, travel portals, and hospitality platforms.

    Speech Data

    The dataset includes 30 hours of dual-channel audio recordings between native Brazilian Portuguese speakers engaged in real travel-related customer service conversations. These audio files reflect a wide variety of topics, accents, and scenarios found across the travel and tourism industry.

    Participant Diversity:
    Speakers: 60 native Brazilian Portuguese contributors from our verified pool.
    Regions: Covering multiple provinces of Brazil to capture accent and dialectal variation.
    Participant Profile: Balanced representation of age (18–70) and gender (60% male, 40% female).
    Recording Details:
    Conversation Nature: Naturally flowing, spontaneous customer-agent calls.
    Call Duration: Between 5 and 15 minutes per session.
    Audio Format: Stereo WAV, 16-bit depth, at 8kHz and 16kHz.
    Recording Environment: Captured in controlled, noise-free, echo-free settings.

    Topic Diversity

    Inbound and outbound conversations span a wide range of real-world travel support situations with varied outcomes (positive, neutral, negative).

    Inbound Calls:
    Booking Assistance
    Destination Information
    Flight Delays or Cancellations
    Support for Disabled Passengers
    Health and Safety Travel Inquiries
    Lost or Delayed Luggage, and more
    Outbound Calls:
    Promotional Travel Offers
    Customer Feedback Surveys
    Booking Confirmations
    Flight Rescheduling Alerts
    Visa Expiry Notifications, and others

    These scenarios help models understand and respond to diverse traveler needs in real-time.

    Transcription

    Each call is accompanied by manually curated, high-accuracy transcriptions in JSON format.

    Transcription Includes:
    Speaker-Segmented Dialogues
    Time-Stamped Segments
    Non-speech Markers (e.g., pauses, coughs)
    High transcription accuracy: a dual-layered transcription review keeps the word error rate under 5%.

    Metadata

    Extensive metadata enriches each call and speaker for better filtering and AI training:

    Participant Metadata: ID, age, gender, region, accent, and dialect.
    Conversation Metadata: Topic, domain, call type, sentiment, and audio specs.

    Usage and Applications

    This dataset is ideal for a variety of AI use cases in the travel and tourism space:

    ASR Systems: Train Portuguese speech-to-text engines for travel platforms.

  12. CORAA-NURC-SP-Audio-Corpus

    • huggingface.co
    Updated Sep 7, 2024
    Cite
    NILC NLP (2024). CORAA-NURC-SP-Audio-Corpus [Dataset]. https://huggingface.co/datasets/nilc-nlp/CORAA-NURC-SP-Audio-Corpus
    Dataset updated
    Sep 7, 2024
    Dataset authored and provided by
    NILC NLP
    Description

    NURC-SP Corpus

    NURC-SP Corpus CORAA ASR is a publicly available dataset for Automatic Speech Recognition (ASR) in Brazilian Portuguese, containing 239.68 hours of audio (239.30 when filtered) and the corresponding transcriptions (170k+ audio segments). The audio was either validated by annotators or transcribed for the first time for the ASR task.

    How to Use

    The datasets library allows easy loading of the dataset with the load_dataset()… See the full description on the dataset page: https://huggingface.co/datasets/nilc-nlp/CORAA-NURC-SP-Audio-Corpus.

  13. In-Car Speech Dataset: Portuguese (Portugal)

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). In-Car Speech Dataset: Portuguese (Portugal) [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/in-car-speech-dataset-portuguese
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    Portugal
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Portuguese Language In-car Speech Dataset, a comprehensive collection of audio recordings designed to facilitate the development of speech recognition models specifically tailored for in-car environments. This dataset aims to support research and innovation in automotive speech technology, enabling seamless and robust voice interactions within vehicles for drivers and co-passengers.

    Speech Data

    This dataset comprises over 5,000 high-quality audio recordings collected from various in-car environments. These recordings include scripted wake words and command-type prompts.

    Participant Diversity:
    Speakers: 50+ native Portuguese speakers from the FutureBeeAI Community.
    Regions: Ensures a balanced representation of Portuguese accents, dialects, and demographics.
    Participant Profile: Participants range from 18 to 70 years old, representing both males and females in a 60:40 ratio, respectively.
    Recording Nature: Scripted wake word and command type of audio recordings.
    Duration: Average duration of 5 to 20 seconds per audio recording.
    Formats: WAV format with mono channels, a bit depth of 16 bits. The dataset contains different data at 16kHz and 48kHz.
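
The stated audio format (mono, 16-bit, 16 kHz or 48 kHz) can be checked per file before training. The following is a minimal sketch using only Python's standard `wave` module; the function name `check_wav_format` and the constant `ALLOWED_RATES` are illustrative and not part of the dataset's own tooling.

```python
import wave

# Sample rates named in the dataset description (16 kHz and 48 kHz).
ALLOWED_RATES = {16000, 48000}


def check_wav_format(path):
    """Return a list of mismatches against the stated spec (empty if the file conforms)."""
    problems = []
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_width = wav.getsampwidth()  # bytes per sample
        rate = wav.getframerate()
    if channels != 1:
        problems.append(f"expected mono, got {channels} channels")
    if sample_width != 2:
        problems.append(f"expected 16-bit samples, got {sample_width * 8}-bit")
    if rate not in ALLOWED_RATES:
        problems.append(f"unexpected sample rate {rate} Hz")
    return problems
```

Calling `check_wav_format("clip.wav")` on a conforming file returns an empty list, so the result can be used directly as a pass/fail filter.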

    Dataset Diversity

    Apart from participant diversity, the dataset is diverse in terms of different wake words, voice commands, and recording environments.

    Different Automobile Related Wake Words: Hey Mercedes, Hey BMW, Hey Porsche, Hey Volvo, Hey Audi, Hi Genesis, Hey Mini, Hey Toyota, Ok Ford, Hey Hyundai, Ok Honda, Hello Kia, Hey Dodge.
    Different Cars: Data collection was carried out in different types and models of cars.
    Different Types of Voice Commands:
    Navigational Voice Commands
    Mobile Control Voice Commands
    Car Control Voice Commands
    Multimedia & Entertainment Commands
    General, Question Answer, Search Commands
    Recording Time: Participants recorded the given prompts at various times to make the dataset more diverse.
    Morning
    Afternoon
    Evening
    Recording Environment: Various recording environments were captured to acquire more realistic data and to make the dataset inclusive of various types of noises. Some of the environment variables are as follows:
    Noise Level: Silent, Low Noise, Moderate Noise, High Noise
    Parking Location: Indoor, Outdoor
    Car Windows: Open, Closed
    Car AC: On, Off
    Car Engine: On, Off
    Car Movement: Stationary, Moving

    Metadata

    The dataset provides comprehensive metadata for each audio recording and participant:

    Participant Metadata: Unique identifier, age, gender, country, state, district, accent, and dialect.

  14. Brazilian Portuguese Phonemes - Audio

    • kaggle.com
    Updated Mar 4, 2018
    Cite
    Jonas Carvalho (2018). Brazilian Portuguese Phonemes - Audio [Dataset]. https://www.kaggle.com/jonascarvalho/brazilian-portuguese-phonemes-audio/code
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Mar 4, 2018
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Jonas Carvalho
    Area covered
    Brazil
    Description

    Context

    This database was created to study voice transcription in the Portuguese language.

    Content

    It consists of 31 recorded voice samples collected from a Brazilian speaker.

    Acknowledgements

    Credits: AlfaeBeto IAB Youtube Channel.

    Inspiration

    Is it possible to build an accurate, local, automated voice transcription system with this dataset?

  15. pt-br_char

    • huggingface.co
    Updated Jul 6, 2025
    Cite
    Gilb (2025). pt-br_char [Dataset]. https://huggingface.co/datasets/firstpixel/pt-br_char
    Explore at:
    Dataset updated
    Jul 6, 2025
    Authors
    Gilb
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Brazilian Portuguese Merged Speech Dataset (Derived from Common Voice)

    This dataset is a preprocessed and merged version of the Mozilla Common Voice dataset for Brazilian Portuguese (pt-BR). It was created by filtering, merging, and normalizing audio clips to improve usability for speech recognition and TTS (Text-to-Speech) training.

    📌 Dataset Details

    Source: Derived from Common Voice Corpus 20.0
    Language: 🇧🇷 Brazilian Portuguese (pt-BR)
    Format: MP3 (24 kHz, mono… See the full description on the dataset page: https://huggingface.co/datasets/firstpixel/pt-br_char.

  16. Pause study BP audio files

    • figshare.com
    wav
    Updated Oct 13, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Plinio Barbosa (2022). Pause study BP audio files [Dataset]. http://doi.org/10.6084/m9.figshare.21325275.v1
    Explore at:
    wav (available download formats)
    Dataset updated
    Oct 13, 2022
    Dataset provided by
    figshare
    Authors
    Plinio Barbosa
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Audio files of poem declamation, coded as XYBPAPn or XYBPCAn, where X = gender (F = female, M = male), XY = participant (e.g., F1 = first female participant), BP = Brazilian Portuguese, AP = poem by Adélia Prado, CA = poem by Alberto Caeiro, and n = number of the poem by AP or CA, where 2 = negative valence and 1 = positive valence.
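
The naming scheme above can be decoded programmatically. The sketch below assumes file stems look like `F1BPAP2` (gender letter, participant number, `BP`, poet code, poem number); the exact file names in the archive are not shown in the description, so the pattern and the helper name `parse_recording_name` are assumptions for illustration.

```python
import re
from pathlib import Path

# Assumed stem layout: <F|M><participant number>BP<AP|CA><1|2>, e.g. "F1BPAP2".
NAME_PATTERN = re.compile(
    r"^(?P<gender>[FM])(?P<index>\d+)BP(?P<poet>AP|CA)(?P<poem>[12])$"
)
GENDERS = {"F": "female", "M": "male"}
POETS = {"AP": "Adélia Prado", "CA": "Alberto Caeiro"}
VALENCE = {"1": "positive", "2": "negative"}


def parse_recording_name(filename):
    """Decode a recording name into participant, gender, poet, and valence."""
    stem = Path(filename).stem  # drop directory and extension
    match = NAME_PATTERN.match(stem)
    if match is None:
        raise ValueError(f"name does not follow the expected scheme: {filename!r}")
    return {
        "participant": match["gender"] + match["index"],
        "gender": GENDERS[match["gender"]],
        "language": "Brazilian Portuguese",
        "poet": POETS[match["poet"]],
        "valence": VALENCE[match["poem"]],
    }
```

For example, `parse_recording_name("F1BPAP2.wav")` describes the first female participant reading the negative-valence poem by Adélia Prado.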

  17. Data from: Avatar Education Portuguese

    • abacus.library.ubc.ca
    iso, txt
    Updated Nov 15, 2018
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Abacus Data Network (2018). Avatar Education Portuguese [Dataset]. https://abacus.library.ubc.ca/dataset.xhtml?persistentId=hdl:11272.1/AB2/BSQ4NP
    Explore at:
    txt(1308), iso(125351936) (available download formats)
    Dataset updated
    Nov 15, 2018
    Dataset provided by
    Abacus Data Network
    Time period covered
    2018
    Area covered
    United States, Brazil
    Description

    Avatar Education Portuguese was developed by the University of Pernambuco and consists of approximately 80 minutes of Brazilian Portuguese microphone speech with phonetic and orthographic transcriptions. The data was developed for Avatar Education, an animated virtual assistant designed to enhance communication and interaction in educational contexts, such as online learning.

    Data

    The corpus contains 1,400 utterances (700 male and 700 female) of read and spontaneous speech spoken by two professional speakers. Utterances were transcribed at the word level (without time alignments) and at the phoneme level (with time alignment labels). The audio data was recorded at 16kHz (mono, 16-bit) using Pro Tools recording software and stored in FLAC-compressed WAV format. The acoustic environment was controlled for background conditions that occur in application environments.

  18. speakerVerification_PTBR

    • huggingface.co
    Updated Aug 5, 2024
    Cite
    João Gabriel Lima (2024). speakerVerification_PTBR [Dataset]. https://huggingface.co/datasets/nnenufar/speakerVerification_PTBR
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Aug 5, 2024
    Authors
    João Gabriel Lima
    Description

    Dataset card

    This dataset includes ~80k samples of speech audio in Brazilian Portuguese. Samples have variable length ranging from 1 to 4 seconds, with a sampling rate of 16kHz. The metadata file includes speaker tags and corresponding labels for each sample, making it appropriate for speaker identification and speaker verification tasks.

    Dataset Description

    Audio samples are taken from three bigger corpora: C-ORAL Brasil, NURC Recife and NURC SP. Please take into… See the full description on the dataset page: https://huggingface.co/datasets/nnenufar/speakerVerification_PTBR.

  19. European Portuguese Call Center Data for Retail & E-Commerce AI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). European Portuguese Call Center Data for Retail & E-Commerce AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/retail-call-center-conversation-portuguese-portugal
    Explore at:
    wav (available download formats)
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    This Portuguese Call Center Speech Dataset for the Retail and E-commerce industry is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for Portuguese speakers. Featuring over 30 hours of real-world, unscripted audio, it provides authentic human-to-human customer service conversations vital for training robust ASR models.

    Curated by FutureBeeAI, this dataset empowers voice AI developers, data scientists, and language model researchers to build high-accuracy, production-ready models across retail-focused use cases.

    Speech Data

    The dataset contains 30 hours of dual-channel call center recordings between native Portuguese speakers. Captured in realistic scenarios, these conversations span diverse retail topics from product inquiries to order cancellations, providing a wide context range for model training and testing.

    Participant Diversity:
    Speakers: 60 native Portuguese speakers from our verified contributor pool.
    Regions: Representing multiple provinces across Portugal to ensure coverage of various accents and dialects.
    Participant Profile: Gender mix of 60% male and 40% female, with ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Nature: Naturally flowing, unscripted interactions between agents and customers.
    Call Duration: Ranges from 5 to 15 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, at 8kHz and 16kHz sample rates.
    Recording Environment: Captured in clean conditions with no echo or background noise.
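
Because the recordings are dual-channel stereo WAV files, a common preprocessing step is to write each channel to its own mono file. The sketch below uses only Python's standard library. The description does not say which channel carries the agent and which the customer, so the outputs are simply called channel 0 and channel 1, and the function name `split_channels` is illustrative.

```python
import wave
from array import array


def split_channels(stereo_path, channel0_path, channel1_path):
    """Write each channel of a 16-bit stereo WAV to its own mono WAV; return the sample rate."""
    with wave.open(stereo_path, "rb") as src:
        if src.getnchannels() != 2 or src.getsampwidth() != 2:
            raise ValueError("expected a 16-bit stereo WAV file")
        rate = src.getframerate()
        samples = array("h")  # one item per 16-bit sample
        samples.frombytes(src.readframes(src.getnframes()))

    # Stereo frames are interleaved: ch0, ch1, ch0, ch1, ...
    # Only whole samples are moved, so byte order does not need to be touched.
    for path, channel in ((channel0_path, samples[0::2]), (channel1_path, samples[1::2])):
        with wave.open(path, "wb") as dst:
            dst.setnchannels(1)
            dst.setsampwidth(2)
            dst.setframerate(rate)
            dst.writeframes(channel.tobytes())
    return rate
```

The returned sample rate (8 kHz or 16 kHz according to the description) can be used to route files to the matching model configuration.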

    Topic Diversity

    This speech corpus includes both inbound and outbound calls with varied conversational outcomes like positive, negative, and neutral, ensuring real-world scenario coverage.

    Inbound Calls:
    Product Inquiries
    Order Cancellations
    Refund & Exchange Requests
    Subscription Queries, and more
    Outbound Calls:
    Order Confirmations
    Upselling & Promotions
    Account Updates
    Loyalty Program Offers
    Customer Verifications, and others

    Such variety enhances your model’s ability to generalize across retail-specific voice interactions.

    Transcription

    All audio files are accompanied by manually curated, time-coded verbatim transcriptions in JSON format.

    Transcription Includes:
    Speaker-Segmented Dialogues
    Time-coded Segments
    Non-speech Tags (e.g., pauses, cough)
    High transcription accuracy with word error rate < 5% due to double-layered quality checks.
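
For readers who want to check a word error rate figure on their own sample, the sketch below implements the standard definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The description does not say how the provider computes or normalizes its figure, so this is an illustration of the usual metric rather than their procedure; the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between the two strings, divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

In practice both strings are usually normalized first (case, punctuation, non-speech tags), and that choice can shift the result noticeably.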

    These transcriptions are production-ready, making model training faster and more accurate.

    Metadata

    Rich metadata is available for each participant and conversation:

    Participant Metadata: ID, age, gender, accent, dialect, and location.
    Conversation Metadata: Topic, sentiment, call type, sample rate, and technical specs.

    This granularity supports advanced analytics, dialect filtering, and fine-tuned model evaluation.

    Usage and Applications

    This dataset is ideal for a range of voice AI and NLP applications:

    Automatic Speech Recognition (ASR): Fine-tune Portuguese speech-to-text systems.

  20. Collins Multilingual database (MLD) - WordBank

    • live.european-language-grid.eu
    • catalogue.elra.info
    Updated Dec 7, 2016
    + more versions
    Cite
    (2016). Collins Multilingual database (MLD) - WordBank [Dataset]. https://live.european-language-grid.eu/catalogue/lcr/1496
    Explore at:
    Dataset updated
    Dec 7, 2016
    License

    http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Description

    The Collins Multilingual database covers Real Life Daily vocabulary. It is composed of a multilingual lexicon in 32 languages (the WordBank) and a multilingual set of sentences in 28 languages (the PhraseBank, distributed separately under reference ELRA-T0377).

    The WordBank contains 10,000 words for each language (Arabic, Chinese, Croatian, Czech, Danish, Dutch, American English, British English, Finnish, French, German, Greek, Italian, Japanese, Korean, Norwegian, Polish, Portuguese (Iberian), Portuguese (Brazilian), Russian, Spanish (Iberian), Spanish (Latin American), Swedish, Thai, Turkish, Vietnamese, Hindi, Tamil, Bengali, Malayalam, Romanian, Ukrainian), XML-annotated for part-of-speech, gender, irregular forms and disambiguating information for homographs. An additional dataset of 10,000 headwords is included for 12 languages (Chinese, American and British English, French, German, Italian, Japanese, Korean, Iberian and Brazilian Portuguese, Iberian and Latin American Spanish).

    All English headwords contain Cobuild learner’s dictionary style definitions and one or more examples of the word in context.

    Lemmatized lists and verb tables are available for English, French, German, Spanish and Italian. Romanization is provided for Chinese, Japanese, Korean and Thai.

    The corresponding audio files are available for 26 of the 32 languages (excluding Hindi, Tamil, Bengali, Malayalam, Romanian and Ukrainian) and are distributed in a package referenced ELRA-S0382.

Cite
FutureBee AI (2022). European Portuguese General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-portuguese-portugal

European Portuguese General Conversation Speech Dataset for ASR

European Portuguese General Conversation Speech Corpus

Explore at:
wav (available download formats)
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License

https://www.futurebeeai.com/policies/ai-data-license-agreement

Dataset funded by
FutureBeeAI
Description

Introduction

Welcome to the Portuguese General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Portuguese speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Portuguese communication.

Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Portuguese speech models that understand and respond to authentic Portuguese accents and dialects.

Speech Data

The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Portuguese. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

Participant Diversity:
Speakers: 60 verified native Portuguese speakers from FutureBeeAI’s contributor community.
Regions: Representing various provinces of Portugal to ensure dialectal diversity and demographic balance.
Demographics: Gender ratio of 60% male to 40% female, with participant ages ranging from 18 to 70 years.
Recording Details:
Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
Duration: Each conversation ranges from 15 to 60 minutes.
Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
Environment: Quiet, echo-free settings with no background noise.

Topic Diversity

The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

Sample Topics Include:
Family & Relationships
Food & Recipes
Education & Career
Healthcare Discussions
Social Issues
Technology & Gadgets
Travel & Local Culture
Shopping & Marketplace Experiences, and many more.

Transcription

Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

Transcription Highlights:
Speaker-segmented dialogues
Time-coded utterances
Non-speech elements (pauses, laughter, etc.)
High transcription accuracy, achieved through a double QA pass (average WER < 5%)

These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.

Metadata

The dataset comes with granular metadata for both speakers and recordings:

Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
Recording Metadata: Topic, duration, audio format, device type, and sample rate.

Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.

Usage and Applications

This dataset is a versatile resource for multiple Portuguese speech and language AI applications:

ASR Development: Train accurate speech-to-text systems for Portuguese.
Voice Assistants: Build smart assistants capable of understanding natural Portuguese conversations.