Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Portuguese Voice Emotion Dataset
This dataset contains high-quality (“A-grade”) data. It has been carefully curated, cleaned, and verified to ensure accuracy, completeness, and consistency, making it suitable for high-stakes or production-grade model training.
Dataset Summary
This dataset comprises high-quality Portuguese speech recordings designed for training and evaluating Speech Emotion Recognition (SER) models. The dataset contains voice samples expressing four… See the full description on the dataset page: https://huggingface.co/datasets/Kratos-AI/Portuguese-audio-dataset.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Portuguese General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Portuguese speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Portuguese communication.
Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Portuguese speech models that understand and respond to authentic Portuguese accents and dialects.
The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Portuguese. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
The dataset comes with granular metadata for both speakers and recordings:
Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
This dataset is a versatile resource for multiple Portuguese speech and language AI applications:
https://data.macgence.com/terms-and-conditions
The audio dataset includes general conversations featuring Brazilian Portuguese speakers, with detailed metadata.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Portuguese Speech Dataset for recognition task
Dataset comprises 406 hours of telephone dialogues in Portuguese, collected from 590 native speakers across various topics and domains. This dataset boasts an impressive 98% word accuracy rate, making it a valuable resource for advancing speech recognition technology. By utilizing this dataset, researchers and developers can advance their understanding and capabilities in automatic speech recognition (ASR) systems, transcribing audio… See the full description on the dataset page: https://huggingface.co/datasets/UniDataPro/portuguese-speech-recognition-dataset.
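The description does not say how the 98% word accuracy figure was measured. As a rough illustration only, the sketch below assumes the common convention that word accuracy equals 1 minus the word error rate (WER), with WER computed from a word-level edit distance. The function names and the sample sentences are hypothetical and do not come from the dataset.

```python
def word_edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two word sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Assumed definition: 1 - WER, where WER = edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    return 1.0 - word_edit_distance(ref, hyp) / len(ref)


# Hypothetical example: one substituted word out of four.
print(word_accuracy("bom dia a todos", "bom dia para todos"))  # 0.75
```

Under this definition the value can drop below zero when the hypothesis contains many inserted words, so it is usually reported over a whole test set rather than per utterance.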
https://www.futurebeeai.com/policies/ai-data-license-agreement
This Brazilian Portuguese Call Center Speech Dataset for the Retail and E-commerce industry is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for Portuguese speakers. Featuring over 30 hours of real-world, unscripted audio, it provides authentic human-to-human customer service conversations vital for training robust ASR models.
Curated by FutureBeeAI, this dataset empowers voice AI developers, data scientists, and language model researchers to build high-accuracy, production-ready models across retail-focused use cases.
The dataset contains 30 hours of dual-channel call center recordings between native Brazilian Portuguese speakers. Captured in realistic scenarios, these conversations span diverse retail topics from product inquiries to order cancellations, providing a wide context range for model training and testing.
This speech corpus includes both inbound and outbound calls with varied conversational outcomes like positive, negative, and neutral, ensuring real-world scenario coverage.
Such variety enhances your model’s ability to generalize across retail-specific voice interactions.
All audio files are accompanied by manually curated, time-coded verbatim transcriptions in JSON format.
These transcriptions are production-ready, making model training faster and more accurate.
Rich metadata is available for each participant and conversation:
This granularity supports advanced analytics, dialect filtering, and fine-tuned model evaluation.
This dataset is ideal for a range of voice AI and NLP applications:
https://www.futurebeeai.com/policies/ai-data-license-agreement
The Brazilian Portuguese Scripted Monologue Speech Dataset for the General Domain is a carefully curated resource designed to support the development of Portuguese language speech recognition systems. This dataset focuses on general-purpose conversational topics and is ideal for a wide range of AI applications requiring natural, domain-agnostic Portuguese speech data.
This dataset features over 6,000 high-quality scripted monologue recordings in Brazilian Portuguese. The prompts span diverse real-life topics commonly encountered in general conversations and are intended to help train robust and accurate speech-enabled technologies.
The dataset covers a wide variety of general conversation scenarios, including:
To enhance authenticity, the prompts include:
Each prompt is designed to reflect everyday use cases, making it suitable for developing generalized NLP and ASR solutions.
Every audio file in the dataset is accompanied by a verbatim text transcription, ensuring accurate training and evaluation of speech models.
Rich metadata is included for detailed filtering and analysis:
This dataset can power a variety of Portuguese language AI technologies, including:
**GLOBO’S DATASET TERMS OF USE**
The present Terms of Use (“Terms”) regulates the license of use that GLOBO COMUNICAÇÃO E PARTICIPAÇÕES S.A., a company organized and existing in accordance with the Brazilian laws, with head offices at Rua Lopes Quintas 303, in the city and State of Rio de Janeiro, enrolled in the Brazilian tax registration number 27.865.757/0001-02 (hereinafter simply referred to as “Globo”), grants to the individual or entity that exercises the rights licensed under these Terms (“You”) for the use of audios referring to the reading of texts published on Jornal Nacional’s page on the “G1” website, owned by Globo (hereinafter referred to as “Contents”), which are stored at this dataset (“Dataset”).
**1. Grant of License of Use**
1.1. The scope of these Terms is a non-exclusive, non-sublicensable authorization, for an undefined term, hereby granted by Globo to You, to use the Contents made available via the Dataset for non-commercial purposes, exclusively for the deployment and promotion of research for development and improvement of technologies, including the elaboration of scientific articles, reports and/or any other type of academic publication. Any other form of use of the Contents stored in the Dataset is prohibited.
1.1.1. The authorization hereby granted is royalty-free, non-exclusive, and restricted to the use of the Contents made available in the Dataset under the terms and conditions mentioned herein. The storage of the Contents, as well as the capture, reproduction, use in any media, or by any other modality, or use in any medium, for commercial purposes or not, without previously obtaining Globo's express authorization, is expressly prohibited. Thus, any form of use that has not been expressly authorized by Globo is prohibited. It is also expressly forbidden to assemble, alter, manipulate and/or transform the Contents, by any means or process. If the Contents contain Globo's brands or logos, they must be maintained by You, and the inclusion of any type of advertising, brand and/or sponsors, which may be related to the Contents, is prohibited, unless expressly authorized by Globo. Globo does not authorize the dubbing of voices/performances contained in the Content.
1.2. You may not, under any circumstances, grant or allow third parties to exploit, under any justification, whether for commercial purposes or not, in Brazil and/or abroad, the Contents, as well as its extracts, excerpts and parts, and You will be responsible for any use not permitted in this instrument, under penalty of being liable for misuse. You hereby undertake to reimburse Globo for all and any damages that it may suffer if such grant or unauthorized use occurs.
1.3. Globo reserves the right to revoke this authorization, at its sole discretion, without the need for any compensation, if it becomes aware of any non-compliance with the conditions established in these Terms.
1.4. The use of the Contents in VOD (video on demand) and OTT (over the top) services is expressly prohibited. Failure to comply with this item is cause for immediate termination of the license hereby granted, without prejudice to a claim for compensation for losses and damages, at Globo’s sole discretion.
1.5. You undertake to use the Dataset and the Contents properly and diligently, exclusively for the purposes specified in these Terms, as well as to refrain from using them for purposes or as a means of committing unlawful acts, prohibited by law and/or rules of these Terms and/or harmful to the rights and interests of Globo and/or third parties, subject to the provisions of item 1.3.
1.6. Globo reserves the right to, unilaterally, add or remove any functionality and/or Content from the Dataset, expand or reduce its storage capacity or usability, alter its presentation, as well as temporarily restrict or suspend its availability, or even terminate it permanently or temporarily, at any time, at its sole discretion, and without prior notice or consent.
1.7. Globo will use its best efforts to ensure the correct functioning of the Dataset without interference of any kind. However, considering the characteristics of the Internet environment, Globo does not guarantee the availability, infallibility and continuity of the Dataset, nor that it will be useful for performing any activity in particular, for which Globo exempts itself from any liability for direct or indirect damages of any nature that may result from the unavailability, failure and/or alteration in the Dataset.
**2. Intellectual Property**
2.1. Globo declares to be fully responsible for the authorization granted herein.
2.2. You acknowledge that all Contents made available in the Dataset are owned exclusively by Globo.
2.3. The reproduction or use of the Contents available in the Dataset in disagreement with the rules established in these Terms constitute a viol...
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
This corpus was recorded in a quiet office/home environment over 3 channels and collected from a total of 201 speakers, including 90 males and 111 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as news and daily dialogues. Speech samples are stored as a sequence of 16-bit 16kHz for a total of 113.7 hours of speech per channel.
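The stated figures (16-bit samples, 16 kHz, 113.7 hours per channel, 3 channels) allow a back-of-the-envelope estimate of the raw audio volume. The sketch below assumes uncompressed mono PCM per channel and ignores file headers and any compression; the helper name `pcm_bytes` is hypothetical, and the actual distribution size may differ.

```python
def pcm_bytes(hours: float, sample_rate_hz: int, bit_depth: int, channels: int = 1) -> int:
    """Approximate size in bytes of uncompressed PCM audio (no container headers)."""
    seconds = hours * 3600
    bytes_per_sample = bit_depth // 8
    return round(seconds * sample_rate_hz * bytes_per_sample * channels)


per_channel = pcm_bytes(113.7, 16_000, 16)
all_channels = pcm_bytes(113.7, 16_000, 16, channels=3)

print(f"{per_channel / 1e9:.2f} GB per channel")    # 13.10 GB per channel
print(f"{all_channels / 1e9:.2f} GB for 3 channels")  # 39.29 GB for 3 channels
```

An estimate like this is mainly useful for planning disk space and download time before requesting the corpus.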
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows:
• Arabic: 8,119 entries
• Catalan: 2,247 entries
• Chinese (Simplified): 4,719 entries
• Czech: 10,629 entries
• Danish: 8,878 entries
• Dutch: 12,538 entries
• English: 24,663 entries
• Greek: 9,725 entries
• Hebrew: 9,138 entries
• Italian: 16,798 entries
• Japanese: 5,161 entries
• Korean: 5,671 entries
• Norwegian: 11,041 entries
• Polish: 8,861 entries
• Portuguese (Brazil): 9,250 entries
• Portuguese (Portugal): 7,676 entries
• Russian: 7,502 entries
• Spanish: 2,297 entries
• Swedish: 7,534 entries
• Thai: 5,173 entries
• Turkish: 6,491 entries
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Portuguese Language Visual Speech Dataset! This dataset is a collection of diverse, single-person unscripted spoken videos supporting research in visual speech recognition, emotion detection, and multimodal communication.
This visual speech dataset contains 1,000 videos in Portuguese, each paired with a corresponding high-fidelity audio track. In each video, a participant answers a specific question in an unscripted, spontaneous manner.
Extensive guidelines were followed while recording each video to maintain quality and diversity.
The dataset provides comprehensive metadata for each video recording and participant:
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Portuguese Language In-car Speech Dataset, a comprehensive collection of audio recordings designed to facilitate the development of speech recognition models specifically tailored for in-car environments. This dataset aims to support research and innovation in automotive speech technology, enabling seamless and robust voice interactions within vehicles for drivers and co-passengers.
This dataset comprises over 5,000 high-quality audio recordings collected from various in-car environments. These recordings include scripted wake words and command-type prompts.
Apart from participant diversity, the dataset is diverse in terms of different wake words, voice commands, and recording environments.
The dataset provides comprehensive metadata for each audio recording and participant:
Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Brazilian Portuguese Merged Speech Dataset (Derived from Common Voice)
This dataset is a preprocessed and merged version of the Mozilla Common Voice dataset for Brazilian Portuguese (pt-BR). It was created by filtering, merging, and normalizing audio clips to improve usability for speech recognition and TTS (Text-to-Speech) training.
📌 Dataset Details
Source: Derived from Common Voice Corpus 20.0
Language: 🇧🇷 Brazilian Portuguese (pt-BR)
Format: MP3 (24 kHz, mono… See the full description on the dataset page: https://huggingface.co/datasets/firstpixel/pt-br_char.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Introducing the Brazilian Portuguese Scripted Monologue Speech Dataset for the Healthcare Domain, a voice dataset built to accelerate the development and deployment of Portuguese language automatic speech recognition (ASR) systems, with a sharp focus on real-world healthcare interactions.
This dataset includes over 6,000 high-quality scripted audio prompts recorded in Brazilian Portuguese, representing typical voice interactions found in the healthcare industry. The data is tailored for use in voice technology systems that power virtual assistants, patient-facing AI tools, and intelligent customer service platforms.
The prompts span a broad range of healthcare-specific interactions, such as:
To maximize authenticity, the prompts integrate linguistic elements and healthcare-specific terms such as:
These elements make the dataset exceptionally suited for training AI systems to understand and respond to natural healthcare-related speech patterns.
Every audio recording is accompanied by a verbatim, manually verified transcription.
NURC-SP Corpus
NURC-SP Corpus CORAA ASR is a publicly available dataset for Automatic Speech Recognition (ASR) in Brazilian Portuguese, containing 239.68 hours of audio (239.30 when filtered) and the corresponding transcriptions (170k+ audio segments). The audio was either validated by annotators or transcribed for the first time with the ASR task in mind.
How to Use
The datasets library allows easy loading of the dataset with the load_dataset() function.… See the full description on the dataset page: https://huggingface.co/datasets/nilc-nlp/CORAA-NURC-SP-Audio-Corpus.
Avatar Education Portuguese was developed by the University of Pernambuco and consists of approximately 80 minutes of Brazilian Portuguese microphone speech with phonetic and orthographic transcriptions. The data was developed for Avatar Education, an animated virtual assistant designed to enhance communication and interaction in educational contexts, such as online learning.
Data
The corpus contains 1,400 utterances (700 male and 700 female) of read and spontaneous speech spoken by two professional speakers. Utterances were transcribed at the word level (without time alignments) and at the phoneme level (with time alignment labels). The audio data was recorded at 16kHz (mono, 16-bit) using Pro Tools recording software and stored in flac compressed wav format. The acoustic environment was controlled for background conditions that occur in application environments.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Audio files of poem declamation coded as such:
XYBPAPn or XYBPCAn
where X = gender (F=female, M=male),
XY = participant (e.g., F1 = first female participant),
BP = Brazilian Portuguese.
AP = poem of Adélia Prado
CA = poem of Alberto Caeiro
n = number of poems by AP or CA, where 2 = negative valence and 1 = positive valence.
Nexdata offers 15,000 hours of off-the-shelf machine learning (ML) data: 8kHz conversational speech covering 100+ countries, in languages including English, German, French, Spanish, Italian, Portuguese, Korean, Japanese, Hindi, and Russian, among others.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Audio files of poem declamation coded as such:
XYEPAPn or XYEPCAn
where X = gender (F=female, M=male),
XY = participant (e.g., F1 = first female participant),
EP = European Portuguese.
AP = poem of Adélia Prado
CA = poem of Alberto Caeiro
n = number of poems by AP or CA, where 2 = negative valence and 1 = positive valence.
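The naming scheme above can be decoded mechanically. The following is a minimal sketch, assuming that the participant number is one or more digits, that n is a single digit (1 or 2), and that any file extension can be ignored; none of these details are confirmed by the description. The class and function names are hypothetical, and the pattern accepts both the BP (Brazilian Portuguese) and EP (European Portuguese) variants of the code.

```python
import re
from dataclasses import dataclass
from pathlib import PurePath

# Pattern for codes such as F1EPAP2: gender, participant number, variety, poet, valence.
_CODE = re.compile(
    r"^(?P<gender>[FM])(?P<index>\d+)(?P<variety>BP|EP)(?P<poet>AP|CA)(?P<n>[12])$"
)

GENDER = {"F": "female", "M": "male"}
VARIETY = {"BP": "Brazilian Portuguese", "EP": "European Portuguese"}
POET = {"AP": "Adélia Prado", "CA": "Alberto Caeiro"}
VALENCE = {"1": "positive", "2": "negative"}


@dataclass(frozen=True)
class PoemRecording:
    participant: str  # e.g. "F1" = first female participant
    gender: str
    variety: str
    poet: str
    valence: str


def parse_recording_name(name: str) -> PoemRecording:
    """Decode a file name such as 'F1EPAP2.wav' (the extension, if any, is ignored)."""
    stem = PurePath(name).stem
    match = _CODE.match(stem)
    if match is None:
        raise ValueError(f"not a recognised recording code: {name!r}")
    return PoemRecording(
        participant=match["gender"] + match["index"],
        gender=GENDER[match["gender"]],
        variety=VARIETY[match["variety"]],
        poet=POET[match["poet"]],
        valence=VALENCE[match["n"]],
    )


print(parse_recording_name("F1EPAP2.wav"))
```

Decoding the names into fields like these makes it straightforward to group recordings by valence or poet when preparing emotion-recognition experiments.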
http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
The Fundamental Portuguese Corpus is a corpus of spoken language, collected between 1970 and 1974, composed of 1800 recordings (500 hours) made in Continental Portugal and the Islands. Of these 1800 conversations, a sample was selected and transcribed.
The corpus consists of audio files in .wav format, aligned transcriptions in XML Exmaralda format and transcriptions in plain text. The plain text files also have automatically assigned POS-tag information. The transcriptions of the corpus are also available in html format. The characters have been encoded in UTF-8.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The project will build a corpus of Malaccan Portuguese Creole, which is spoken by about 1000 people in the Portuguese Settlement in Melaka, Malaysia. The purpose of this project is to create a database of video and audio recordings comprising a variety of speaking contexts. The recordings will be paired with time-aligned orthographic transcriptions and annotations. The annotations will allow further linguistic analysis to be carried out while the corpus will serve as a digital resource for the community.