https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Portuguese General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Portuguese speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Portuguese communication.
Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Portuguese speech models that understand and respond to authentic Portuguese accents and dialects.
The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Portuguese. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
The dataset comes with granular metadata for both speakers and recordings:
Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
This dataset is a versatile resource for multiple Portuguese speech and language AI applications:
https://www.futurebeeai.com/policies/ai-data-license-agreement
This Brazilian Portuguese Call Center Speech Dataset for the Retail and E-commerce industry is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for Portuguese speakers. Featuring over 30 hours of real-world, unscripted audio, it provides authentic human-to-human customer service conversations vital for training robust ASR models.
Curated by FutureBeeAI, this dataset empowers voice AI developers, data scientists, and language model researchers to build high-accuracy, production-ready models across retail-focused use cases.
The dataset contains 30 hours of dual-channel call center recordings between native Brazilian Portuguese speakers. Captured in realistic scenarios, these conversations span diverse retail topics from product inquiries to order cancellations, providing a wide context range for model training and testing.
This speech corpus includes both inbound and outbound calls with varied conversational outcomes like positive, negative, and neutral, ensuring real-world scenario coverage.
Such variety enhances your model’s ability to generalize across retail-specific voice interactions.
All audio files are accompanied by manually curated, time-coded verbatim transcriptions in JSON format.
These transcriptions are production-ready, making model training faster and more accurate.
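As a rough illustration of how time-coded JSON transcriptions of a dual-channel call might be consumed, the sketch below sums talk time per speaker. The schema is not shown in this listing, so the field names (`segments`, `speaker`, `start`, `end`, `text`) and the sample content are hypothetical and would need to be adapted to the actual files.

```python
import json

# Hypothetical transcript: the real schema is not documented in this listing,
# so every field name below is an assumption made for illustration only.
SAMPLE = """
{
  "segments": [
    {"speaker": "agent", "start": 0.0, "end": 2.5, "text": "Bom dia, como posso ajudar?"},
    {"speaker": "customer", "start": 2.5, "end": 6.0, "text": "Quero cancelar um pedido."},
    {"speaker": "agent", "start": 6.0, "end": 7.5, "text": "Claro."}
  ]
}
"""


def talk_time_by_speaker(transcript: dict) -> dict:
    """Sum segment durations (in seconds) for each speaker label."""
    totals = {}
    for segment in transcript["segments"]:
        duration = segment["end"] - segment["start"]
        totals[segment["speaker"]] = totals.get(segment["speaker"], 0.0) + duration
    return totals


totals = talk_time_by_speaker(json.loads(SAMPLE))
print(totals)
```

A per-speaker total like this is one simple way to check that both sides of a call are represented before using the pair for ASR training.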
Rich metadata is available for each participant and conversation:
This granularity supports advanced analytics, dialect filtering, and fine-tuned model evaluation.
This dataset is ideal for a range of voice AI and NLP applications:
https://data.macgence.com/terms-and-conditions
The audio dataset includes general conversations featuring Brazilian Portuguese speakers, with detailed metadata.
English (Portugal) Scripted Monologue Smartphone speech dataset, collected from monologues based on given scripts, covering the generic domain, human-machine interaction, smart home command and control, in-car command and control, numbers, and other domains. Transcribed with text content and other attributes. The dataset was collected from a large and geographically diverse pool of speakers (532 people in total), enhancing model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout the data collection, storage, and usage processes. Our datasets are all compliant with GDPR, CCPA, and PIPL.
https://www.futurebeeai.com/policies/ai-data-license-agreement
This Portuguese Call Center Speech Dataset for the Travel industry is purpose-built to power the next generation of voice AI applications for travel booking, customer support, and itinerary assistance. With over 30 hours of unscripted, real-world conversations, the dataset enables the development of highly accurate speech recognition and natural language understanding models tailored for Portuguese-speaking travelers.
Created by FutureBeeAI, this dataset supports researchers, data scientists, and conversational AI teams in building voice technologies for airlines, travel portals, and hospitality platforms.
The dataset includes 30 hours of dual-channel audio recordings between native Portuguese speakers engaged in real travel-related customer service conversations. These audio files reflect a wide variety of topics, accents, and scenarios found across the travel and tourism industry.
Inbound and outbound conversations span a wide range of real-world travel support situations with varied outcomes (positive, neutral, negative).
These scenarios help models understand and respond to diverse traveler needs in real-time.
Each call is accompanied by manually curated, high-accuracy transcriptions in JSON format.
Extensive metadata enriches each call and speaker for better filtering and AI training:
This dataset is ideal for a variety of AI use cases in the travel and tourism space:
The identification data is recorded in a quiet office environment and collected from a total of 200 speakers, including 104 males and 96 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as daily dialogues and news.
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows:
• Arabic: 8,119 entries
• Catalan: 2,247 entries
• Chinese (Simplified): 4,719 entries
• Czech: 10,629 entries
• Danish: 8,878 entries
• Dutch: 12,538 entries
• English: 24,663 entries
• Greek: 9,725 entries
• Hebrew: 9,138 entries
• Italian: 16,798 entries
• Japanese: 5,161 entries
• Korean: 5,671 entries
• Norwegian: 11,041 entries
• Polish: 8,861 entries
• Portuguese (Brazil): 9,250 entries
• Portuguese (Portugal): 7,676 entries
• Russian: 7,502 entries
• Spanish: 2,297 entries
• Swedish: 7,534 entries
• Thai: 5,173 entries
• Turkish: 6,491 entries
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
This corpus was recorded in a quiet office environment over 2 channels and collected from a total of 200 speakers, including 102 males and 98 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as keywords. Speech samples are stored as 16-bit, 44.1 kHz audio, for a total of 76 hours of speech per channel.
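For a sense of scale, the stated format (16-bit samples at 44.1 kHz, 76 hours per channel, 2 channels) can be turned into an approximate storage figure. The sketch below assumes uncompressed linear PCM with no container overhead, which the description does not confirm, so treat the result as an estimate rather than the actual package size.

```python
# Back-of-the-envelope size of raw audio.
# Assumption: uncompressed linear PCM, no file headers or compression.

def pcm_bytes(hours: float, sample_rate_hz: int, bits_per_sample: int, channels: int = 1) -> int:
    """Return the raw byte count for the given duration and sample format."""
    seconds = hours * 3600
    bytes_per_sample = bits_per_sample // 8
    return int(seconds * sample_rate_hz * bytes_per_sample * channels)


per_channel = pcm_bytes(hours=76, sample_rate_hz=44_100, bits_per_sample=16)
both_channels = pcm_bytes(hours=76, sample_rate_hz=44_100, bits_per_sample=16, channels=2)

print(f"per channel:   {per_channel / 1e9:.2f} GB")   # 24.13 GB
print(f"both channels: {both_channels / 1e9:.2f} GB")  # 48.26 GB
```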
**GLOBO’S DATASET TERMS OF USE**
The present Terms of Use (“Terms”) regulates the license of use that GLOBO COMUNICAÇÃO E PARTICIPAÇÕES S.A., a company organized and existing in accordance with the Brazilian laws, with head offices at Rua Lopes Quintas 303, in the city and State of Rio de Janeiro, enrolled in the Brazilian tax registration number 27.865.757/0001-02 (hereinafter simply referred to as “Globo”), grants to the individual or entity that exercises the rights licensed under these Terms (“You”) for the use of audios referring to the reading of texts published on Jornal Nacional’s page on the “G1” website, owned by Globo (hereinafter referred to as “Contents”), which are stored at this dataset (“Dataset”).
**1. Grant of License of Use**
1.1. The scope of these Terms is a non-exclusive, non-sublicensable authorization, for an undefined term, hereby granted by Globo to You, to use the Contents made available via the Dataset for non-commercial purposes, exclusively for the deployment and promotion of research for development and improvement of technologies, including the elaboration of scientific articles, reports and/or any other type of academic publication. Any other form of use of the Contents stored in the Dataset is prohibited.
1.1.1. The authorization hereby granted is royalty-free, non-exclusive, and restricted to the use of the Contents made available in the Dataset under the terms and conditions mentioned herein. The storage of the Contents, as well as the capture, reproduction, use in any media, or by any other modality, or use in any medium, for commercial purposes or not, without previously obtaining Globo’s express authorization, is expressly prohibited. Thus, any form of use that has not been expressly authorized by Globo is prohibited. It is also expressly forbidden to assemble, alter, manipulate and/or transform the Contents, by any means or process. If the Contents contain Globo's brands or logos, they must be maintained by You, and the inclusion of any type of advertising, brand and/or sponsors, which may be related to the Contents, is prohibited, unless expressly authorized by Globo. Globo does not authorize the dubbing of voices/performances contained in the Content.
1.2. You may not, under any circumstances, grant or allow third parties to exploit, under any justification, whether for commercial purposes or not, in Brazil and/or abroad, the Contents, as well as its extracts, excerpts and parts, and You will be responsible for any use not permitted in this instrument, under penalty of being liable for misuse. You hereby undertake to reimburse Globo for all and any damages that it may suffer if such grant or unauthorized use occurs.
1.3. Globo reserves the right to revoke this authorization, at its sole discretion, without the need for any compensation, if it becomes aware of any non-compliance with the conditions established in these Terms.
1.4. The use of the Contents in VOD (video on demand) and OTT (over the top) services is expressly prohibited. Failure to comply with this item is cause for immediate termination of the license hereby granted, without prejudice to a claim for compensation for losses and damages, at Globo’s sole discretion.
1.5. You undertake to use the Dataset and the Contents properly and diligently, exclusively for the purposes specified in these Terms, as well as to refrain from using them for purposes or as a means of committing unlawful acts, prohibited by law and/or rules of these Terms and/or harmful to the rights and interests of Globo and/or third parties, subject to the provisions of item 1.3.
1.6. Globo reserves the right to, unilaterally, add or remove any functionality and/or Content from the Dataset, expand or reduce its storage capacity or usability, alter its presentation, as well as temporarily restrict or suspend its availability, or even terminate it permanently or temporarily, at any time, at its sole discretion, and without prior notice or consent.
1.7. Globo will use its best efforts to ensure the correct functioning of the Dataset without interference of any kind. However, considering the characteristics of the Internet environment, Globo does not guarantee the availability, infallibility and continuity of the Dataset, nor that it will be useful for performing any activity in particular, for which Globo exempts itself from any liability for direct or indirect damages of any nature that may result from the unavailability, failure and/or alteration in the Dataset.
**2. Intellectual Property**
2.1. Globo declares to be fully responsible for the authorization granted herein.
2.2. You acknowledge that all Contents made available in the Dataset are owned exclusively by Globo.
2.3. The reproduction or use of the Contents available in the Dataset in disagreement with the rules established in these Terms constitute a viol...
The identification data is recorded in a quiet office/home environment and collected from a total of 198 speakers, including 95 males and 103 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as education, family, food and pets.
https://www.futurebeeai.com/policies/ai-data-license-agreement
This Brazilian Portuguese Call Center Speech Dataset for the Travel industry is purpose-built to power the next generation of voice AI applications for travel booking, customer support, and itinerary assistance. With over 30 hours of unscripted, real-world conversations, the dataset enables the development of highly accurate speech recognition and natural language understanding models tailored for Portuguese-speaking travelers.
Created by FutureBeeAI, this dataset supports researchers, data scientists, and conversational AI teams in building voice technologies for airlines, travel portals, and hospitality platforms.
The dataset includes 30 hours of dual-channel audio recordings between native Brazilian Portuguese speakers engaged in real travel-related customer service conversations. These audio files reflect a wide variety of topics, accents, and scenarios found across the travel and tourism industry.
Inbound and outbound conversations span a wide range of real-world travel support situations with varied outcomes (positive, neutral, negative).
These scenarios help models understand and respond to diverse traveler needs in real-time.
Each call is accompanied by manually curated, high-accuracy transcriptions in JSON format.
Extensive metadata enriches each call and speaker for better filtering and AI training:
This dataset is ideal for a variety of AI use cases in the travel and tourism space:
NURC-SP Corpus
NURC-SP Corpus CORAA ASR is a publicly available dataset for Automatic Speech Recognition (ASR) in Brazilian Portuguese, containing 239.68 hours of audio (239.30 when filtered) and the corresponding transcriptions (170k+ segmented audio clips). The audio was either validated by annotators or transcribed for the first time with the ASR task in mind.
How to Use
The datasets library allows easy loading of the dataset with the load_dataset()… See the full description on the dataset page: https://huggingface.co/datasets/nilc-nlp/CORAA-NURC-SP-Audio-Corpus.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Portuguese Language In-car Speech Dataset, a comprehensive collection of audio recordings designed to facilitate the development of speech recognition models specifically tailored for in-car environments. This dataset aims to support research and innovation in automotive speech technology, enabling seamless and robust voice interactions within vehicles for drivers and co-passengers.
This dataset comprises over 5,000 high-quality audio recordings collected from various in-car environments. These recordings include scripted wake words and command-type prompts.
Apart from participant diversity, the dataset is diverse in terms of different wake words, voice commands, and recording environments.
The dataset provides comprehensive metadata for each audio recording and participant:
This database was created to study voice transcription in the Portuguese language.
It consists of 31 recorded voice samples collected from a Brazilian speaker.
Credits: AlfaeBeto IAB Youtube Channel.
Is it possible to build an accurate, local, automated voice transcription system with this dataset?
Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Brazilian Portuguese Merged Speech Dataset (Derived from Common Voice)
This dataset is a preprocessed and merged version of the Mozilla Common Voice dataset for Brazilian Portuguese (pt-BR). It was created by filtering, merging, and normalizing audio clips to improve usability for speech recognition and TTS (Text-to-Speech) training.
📌 Dataset Details
Source: Derived from Common Voice Corpus 20.0
Language: 🇧🇷 Brazilian Portuguese (pt-BR)
Format: MP3 (24 kHz, mono…
See the full description on the dataset page: https://huggingface.co/datasets/firstpixel/pt-br_char.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Audio files of poem declamations are named XYBPAPn or XYBPCAn, where X = gender (F = female, M = male); XY = participant (e.g., F1 = first female participant); BP = Brazilian Portuguese; AP = poem by Adélia Prado; CA = poem by Alberto Caeiro; and n = poem number for AP or CA, where 1 = positive valence and 2 = negative valence.
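The naming scheme above can be decoded mechanically. The following is a minimal sketch, assuming that the participant index is one or more digits, that n is always 1 or 2, and that the code is given without a file extension; the `PoemRecording` type and `parse_code` helper are illustrative names, not part of the dataset.

```python
import re
from typing import NamedTuple


class PoemRecording(NamedTuple):
    gender: str
    participant: str
    poet: str
    valence: str


# Assumed layout: <F|M><participant digits>BP<AP|CA><1|2>
_PATTERN = re.compile(r"(?P<gender>[FM])(?P<index>\d+)BP(?P<poet>AP|CA)(?P<n>[12])")
_GENDER = {"F": "female", "M": "male"}
_POET = {"AP": "Adélia Prado", "CA": "Alberto Caeiro"}
_VALENCE = {"1": "positive", "2": "negative"}


def parse_code(code: str) -> PoemRecording:
    """Decode a recording code such as 'F1BPAP2' into its labeled parts."""
    match = _PATTERN.fullmatch(code)
    if match is None:
        raise ValueError(f"unrecognized recording code: {code!r}")
    return PoemRecording(
        gender=_GENDER[match["gender"]],
        participant=match["gender"] + match["index"],
        poet=_POET[match["poet"]],
        valence=_VALENCE[match["n"]],
    )


print(parse_code("F1BPAP2"))
```

Here `F1BPAP2` decodes to the first female participant reading the negative-valence poem by Adélia Prado; codes that do not fit the assumed pattern raise `ValueError` rather than being guessed at.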
Avatar Education Portuguese was developed by the University of Pernambuco and consists of approximately 80 minutes of Brazilian Portuguese microphone speech with phonetic and orthographic transcriptions. The data was developed for Avatar Education, an animated virtual assistant designed to enhance communication and interaction in educational contexts, such as online learning.
Data
The corpus contains 1,400 utterances (700 male and 700 female) of read and spontaneous speech spoken by two professional speakers. Utterances were transcribed at the word level (without time alignments) and at the phoneme level (with time alignment labels). The audio data was recorded at 16 kHz (mono, 16-bit) using Pro Tools recording software and stored in FLAC-compressed WAV format. The acoustic environment was controlled for background conditions that occur in application environments.
Dataset card
This dataset includes ~80k samples of speech audio in Brazilian Portuguese. Samples have variable length ranging from 1 to 4 seconds, with a sampling rate of 16kHz. The metadata file includes speaker tags and corresponding labels for each sample, making it appropriate for speaker identification and speaker verification tasks.
Dataset Description
Audio samples are taken from three bigger corpora: C-ORAL Brasil, NURC Recife and NURC SP. Please take into… See the full description on the dataset page: https://huggingface.co/datasets/nnenufar/speakerVerification_PTBR.
https://www.futurebeeai.com/policies/ai-data-license-agreement
This Portuguese Call Center Speech Dataset for the Retail and E-commerce industry is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for Portuguese speakers. Featuring over 30 hours of real-world, unscripted audio, it provides authentic human-to-human customer service conversations vital for training robust ASR models.
Curated by FutureBeeAI, this dataset empowers voice AI developers, data scientists, and language model researchers to build high-accuracy, production-ready models across retail-focused use cases.
The dataset contains 30 hours of dual-channel call center recordings between native Portuguese speakers. Captured in realistic scenarios, these conversations span diverse retail topics from product inquiries to order cancellations, providing a wide context range for model training and testing.
This speech corpus includes both inbound and outbound calls with varied conversational outcomes like positive, negative, and neutral, ensuring real-world scenario coverage.
Such variety enhances your model’s ability to generalize across retail-specific voice interactions.
All audio files are accompanied by manually curated, time-coded verbatim transcriptions in JSON format.
These transcriptions are production-ready, making model training faster and more accurate.
Rich metadata is available for each participant and conversation:
This granularity supports advanced analytics, dialect filtering, and fine-tuned model evaluation.
This dataset is ideal for a range of voice AI and NLP applications:
http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
The Collins Multilingual database covers Real Life Daily vocabulary. It is composed of a multilingual lexicon in 32 languages (the WordBank) and a multilingual set of sentences in 28 languages (the PhraseBank, distributed separately under reference ELRA-T0377).
The WordBank contains 10,000 words for each language (Arabic, Chinese, Croatian, Czech, Danish, Dutch, American English, British English, Finnish, French, German, Greek, Italian, Japanese, Korean, Norwegian, Polish, Portuguese (Iberian), Portuguese (Brazilian), Russian, Spanish (Iberian), Spanish (Latin American), Swedish, Thai, Turkish, Vietnamese, Hindi, Tamil, Bengali, Malayalam, Romanian, Ukrainian), XML-annotated for part-of-speech, gender, irregular forms and disambiguating information for homographs. An additional dataset of 10,000 headwords is included for 12 languages (Chinese, American and British English, French, German, Italian, Japanese, Korean, Iberian and Brazilian Portuguese, Iberian and Latin American Spanish).
All English headwords contain Cobuild learner’s dictionary style definitions and one or more examples of the word in context.
Lemmatized lists and verb tables are available for English, French, German, Spanish and Italian. Romanization is provided for Chinese, Japanese, Korean and Thai.
The corresponding audio files are available for 26 languages of the 32 languages (thus excluding Hindi, Tamil, Bengali, Malayalam, Romanian and Ukrainian) and are distributed in a package referenced ELRA-S0382.