https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Filipino General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Filipino speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Filipino communication.
Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Filipino speech models that understand and respond to authentic Filipino accents and dialects.
The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Filipino. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
The dataset comes with granular metadata for both speakers and recordings:
Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
This dataset is a versatile resource for multiple Filipino speech and language AI applications:
https://www.futurebeeai.com/policies/ai-data-license-agreement
The Filipino TTS Monologue Speech Dataset is a professionally curated resource built to train realistic, expressive, and production-grade text-to-speech (TTS) systems. It contains studio-recorded long-form speech by trained native Filipino voice artists, each contributing 1 to 2 hours of clean, uninterrupted monologue audio.
Unlike typical prompt-based datasets with short, isolated phrases, this collection features long-form, topic-driven monologues that mirror natural human narration. It includes content types that are directly useful for real-world applications, like audiobook-style storytelling, educational lectures, health advisories, product explainers, digital how-tos, formal announcements, and more.
All recordings are captured in professional studios using high-end equipment and under the guidance of experienced voice directors.
Only clean, production-grade audio makes it into the final dataset.
All voice artists are native Filipino speakers with professional training or prior experience in narration. We ensure a diverse pool in terms of age, gender, and region to bring a balanced and rich vocal dataset.
Scripts are neither generic nor repetitive. They are professionally authored by domain experts to reflect real-world use cases, and they include modern vocabulary, emotional range, and phonetically rich sentence structures.
Although the script is used during recording, the transcript is updated afterward so that it reflects the final spoken audio. Minor edits are made to account for skipped or rephrased words.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
To estimate the county of residence of Filipinx healthcare workers who died of COVID-19, we retrieved data from the Kanlungan website during December 2020 [22]. In deciding who to include on the website, the AF3IRM team that established the Kanlungan website set two standards for data collection. First, the team found at least one source explicitly stating that the fallen healthcare worker was of Philippine ancestry; these sources were mostly media articles or obituaries sharing the life stories of the deceased. In a few cases, the confirmation came directly from a family member of the deceased healthcare worker who submitted a tribute. Second, the team required a minimum of two sources to identify and announce fallen healthcare workers. We retrieved 86 US tributes from Kanlungan, but only 81 of them had information on county of residence. In total, 45 US counties with at least one reported tribute to a Filipinx healthcare worker who died of COVID-19 were identified for analysis and will hereafter be referred to as “Kanlungan counties.”
Mortality data by county, race, and ethnicity came from the National Center for Health Statistics (NCHS) [24]. Updated weekly, this dataset is based on vital statistics data and is used for public health surveillance in near real time. It provides provisional mortality estimates based on data received and processed by a specified cutoff date, before the data are finalized and publicly released [25]. We used the data released on December 30, 2020, which included provisional COVID-19 death counts from February 1, 2020 to December 26, 2020 (during the height of the pandemic and before COVID-19 vaccines were available) for counties with at least 100 total COVID-19 deaths. During this time period, 501 counties (15.9% of the total 3,142 counties in all 50 states and Washington DC) [26] met this criterion.
Data on COVID-19 deaths were available for six major racial/ethnic groups: Non-Hispanic White, Non-Hispanic Black, Non-Hispanic Native Hawaiian or Other Pacific Islander, Non-Hispanic American Indian or Alaska Native, Non-Hispanic Asian (hereafter referred to as Asian American), and Hispanic. People of more than one race and those of unknown race were included in the “Other” category. NCHS suppressed county-level data by race and ethnicity if death counts were fewer than 10. In total, 133 US counties reported COVID-19 mortality data for Asian Americans. These data were used to calculate the percentage of all COVID-19 decedents in the county who were Asian American.
We used data from the 2018 American Community Survey (ACS) five-year estimates, downloaded from the Integrated Public Use Microdata Series (IPUMS), to create county-level population demographic variables [27]. IPUMS is publicly available, and the database integrates ACS samples from 2000 to the present with a high degree of precision [27]. We applied survey weights to calculate the following variables at the county level: median age among Asian Americans, average income-to-poverty ratio among Asian Americans, the percentage of the county population that is Filipinx, and the percentage of healthcare workers in the county who are Filipinx. Healthcare workers encompassed all healthcare practitioners, technical occupations, and healthcare service occupations, including nurse practitioners, physicians, surgeons, dentists, physical therapists, home health aides, personal care aides, and other medical technicians and healthcare support workers. County-level data were available for 107 of the 133 counties (80.5%) that had NCHS data on the distribution of COVID-19 deaths among Asian Americans, and for 96 counties (72.2%) with Asian American healthcare workforce data.
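As a rough illustration of the survey-weighted percentages described above, the following minimal sketch computes a weighted share from person-level records. The function name, the field names (`weight`, `is_filipinx`), and the sample rows are hypothetical and are not taken from the study or from IPUMS; the authors' actual procedure may differ.

```python
def weighted_percentage(records, flag_key, weight_key="weight"):
    """Return the survey-weighted percentage of records where flag_key is true.

    Each record is a dict with a numeric survey weight and a boolean flag.
    The field names are assumptions made for this illustration.
    """
    total_weight = sum(r[weight_key] for r in records)
    if total_weight == 0:
        raise ValueError("total survey weight is zero")
    flagged_weight = sum(r[weight_key] for r in records if r[flag_key])
    return 100.0 * flagged_weight / total_weight


# Made-up person-level rows for a single county (illustration only).
sample = [
    {"weight": 100, "is_filipinx": True},
    {"weight": 300, "is_filipinx": False},
    {"weight": 100, "is_filipinx": False},
]

print(weighted_percentage(sample, "is_filipinx"))  # 20.0
```

Each person contributes their survey weight rather than a count of one, so the result estimates the population share rather than the sample share.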
The ACS 2018 five-year estimates were also the source of the county-level percentage of the Asian American population (alone or in combination) who are Filipinx [8]. In addition, the ACS provided county-level population counts [26] used to calculate population density (in thousands of people per square mile), estimated by dividing the total population by the county area and then dividing by 1,000. The county area was calculated in ArcGIS 10.7.1 using the county boundary shapefile, projected to Albers Equal Area Conic (for counties in the contiguous US states), Hawai’i Albers Equal Area Conic (for Hawai’i counties), and Alaska Albers Equal Area Conic (for Alaska counties) [20].
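The density calculation described above (total population divided by county area, then divided by 1,000) can be sketched as follows. The function name and the example figures are hypothetical, and the sketch assumes the area is already expressed in square miles.

```python
def density_per_thousand(total_population, area_sq_miles):
    """Population density in thousands of people per square mile.

    Divides the population by the county area, then by 1,000,
    following the two steps described in the text.
    """
    if area_sq_miles <= 0:
        raise ValueError("county area must be positive")
    return total_population / area_sq_miles / 1000.0


# Made-up county: 2,000,000 residents over 500 square miles.
print(density_per_thousand(2_000_000, 500))  # 4.0
```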