License: https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Indian English General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of English speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Indian English communication.
Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade English speech models that understand and respond to authentic Indian accents and dialects.
The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Indian English. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
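As an illustration of how such audio-plus-transcription pairs might feed an ASR pipeline, the minimal sketch below pairs each WAV file with its JSON transcription. The folder layout and the "transcription" key are assumptions for illustration, not the documented schema of this dataset.

```python
from pathlib import Path
import json

# Hypothetical layout: audio/<id>.wav alongside transcriptions/<id>.json.
# The "transcription" key is an assumption; inspect the delivered JSON for the real schema.
def load_pairs(root: str):
    base = Path(root)
    pairs = []
    for wav in sorted((base / "audio").glob("*.wav")):
        meta_file = base / "transcriptions" / f"{wav.stem}.json"
        if not meta_file.exists():
            continue
        with open(meta_file, encoding="utf-8") as f:
            record = json.load(f)
        pairs.append({"audio": str(wav), "text": record.get("transcription", "")})
    return pairs

if __name__ == "__main__":
    for sample in load_pairs("indian_english_conversations")[:5]:
        print(sample["audio"], "->", sample["text"][:60])
```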
The dataset comes with granular metadata for both speakers and recordings:
Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
This dataset is a versatile resource for multiple English speech and language AI applications:
Everyone who speaks a language, speaks it with an accent. A particular accent essentially reflects a person's linguistic background. When people listen to someone speak with a different accent from their own, they notice the difference, and they may even make certain biased social judgments about the speaker. The speech accent archive is established to uniformly exhibit a large set of speech accents from a variety of language backgrounds. Native and non-native speakers of English all read the same English paragraph and are carefully recorded. The archive is constructed as a teaching tool and as a research tool. It is meant to be used by linguists as well as other people who simply wish to listen to and compare the accents of different English speakers. This dataset allows you to compare the demographic and linguistic backgrounds of the speakers in order to determine which variables are key predictors of each accent. The speech accent archive demonstrates that accents are systematic rather than merely mistaken speech. All of the linguistic analyses of the accents are available for public scrutiny. We welcome comments on the accuracy of our transcriptions and analyses.
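A minimal exploratory sketch of that kind of background comparison is shown below, assuming a hypothetical speakers.csv export with columns such as native_language, age, and age_onset (age at which English study began); the file name and column names are illustrative only.

```python
import pandas as pd

# Hypothetical export of the archive's speaker metadata; column names are assumptions.
speakers = pd.read_csv("speakers.csv")

# Compare linguistic backgrounds: how does age of English onset vary by native language?
summary = (
    speakers.groupby("native_language")["age_onset"]
    .agg(["count", "mean", "median"])
    .sort_values("count", ascending=False)
)
print(summary.head(10))

# A simple correlation check between age at recording and age of English onset.
print(speakers[["age", "age_onset"]].corr())
```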
License: https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Indian English Language Visual Speech Dataset! This dataset is a collection of diverse, single-person unscripted spoken videos supporting research in visual speech recognition, emotion detection, and multimodal communication.
This visual speech dataset contains 1,000 Indian English videos, each paired with a corresponding high-fidelity audio track. In each video, a single participant answers a specific question in an unscripted, spontaneous manner.
Extensive recording guidelines were followed for each video to maintain quality and diversity.
The dataset provides comprehensive metadata for each video recording and participant:
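For multimodal work it is common to separate the audio track from each video before feature extraction. The sketch below does this with the standard ffmpeg command-line tool; the videos/ folder name and the .mp4 extension are assumptions about the delivery format.

```python
import subprocess
from pathlib import Path

# Hypothetical layout: videos/<clip>.mp4; each audio track is written out as 16 kHz mono WAV.
def extract_audio(video_dir: str, out_dir: str) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for video in sorted(Path(video_dir).glob("*.mp4")):
        wav_path = out / f"{video.stem}.wav"
        subprocess.run(
            ["ffmpeg", "-y", "-i", str(video), "-vn", "-ac", "1", "-ar", "16000", str(wav_path)],
            check=True,
        )

if __name__ == "__main__":
    extract_audio("videos", "audio_tracks")
```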
License: https://borealisdata.ca/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.5683/SP2/K7EQTE
Introduction: This file contains documentation on CSLU: Foreign Accented English Release 1.2, Linguistic Data Consortium (LDC) catalog number LDC2006S38 and ISBN 1-58563-392-5. CSLU: Foreign Accented English Release 1.2 consists of continuous speech in English by native speakers of 22 different languages: Arabic, Cantonese, Czech, Farsi, French, German, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Mandarin Chinese, Malay, Polish, Portuguese (Brazilian and Iberian), Russian, Swedish, Spanish, Swahili, Tamil and Vietnamese. The corpus contains 4925 telephone-quality utterances, information about the speakers' linguistic backgrounds, and perceptual judgments about the accents in the utterances. The speakers were asked to speak about themselves in English for 20 seconds. Three native speakers of American English independently listened to each utterance and judged the speakers' accents on a 4-point scale: negligible/no accent, mild accent, strong accent and very strong accent. This corpus is intended to support the study of the underlying characteristics of foreign accent and to enable research, development and evaluation of algorithms for the identification and understanding of accented speech. Some of the files in this corpus are also contained in CSLU: 22 Languages Corpus, LDC2005S26.
Copyright: Portions © 2000-2002 Center for Spoken Language Understanding, Oregon Health & Science University; © 2007 Trustees of the University of Pennsylvania.
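Since each utterance carries three independent 4-point accent judgments, a first analysis step might aggregate those ratings per utterance, as in the hedged sketch below; the accent_ratings.csv file and its column names are assumptions, not the corpus's documented file format.

```python
import pandas as pd

# Hypothetical table of the perceptual judgments: one row per (utterance, rater) pair.
# Column names are assumptions; the corpus documentation defines the actual format.
ratings = pd.read_csv("accent_ratings.csv")  # columns: utterance_id, rater_id, rating (1-4)

# Mean accent rating per utterance across the three native-English judges.
per_utt = ratings.groupby("utterance_id")["rating"].mean()

# A crude agreement measure: how often did all three judges give the same rating?
unanimous = ratings.groupby("utterance_id")["rating"].nunique().eq(1).mean()
print(per_utt.describe())
print(f"Unanimous judgments: {unanimous:.1%}")
```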
There were a total of 20 children, 11 females and 9 males, ranging in age from 7 to 12. All of the children are native speakers of Telugu, an Indian regional language, who are learning English as a second language. All of the audio clips were acquired as .wav files using the open-source SurveyLex platform, which supports dual-channel recording at 44.1 kHz with 16 bits per sample. Each questionnaire was administered as many times as the child could manage, up to a maximum of 10 times per child, to assess variation in words and sentences.
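A quick way to confirm the stated recording specs (44.1 kHz, dual channel, 16-bit) after download is sketched below with the soundfile library; the clips/ folder name is an assumption about how the files are organized.

```python
import soundfile as sf
from pathlib import Path

# Sanity-check each clip against the specs described above.
for wav in sorted(Path("clips").glob("*.wav")):
    info = sf.info(str(wav))
    ok = info.samplerate == 44100 and info.channels == 2 and info.subtype == "PCM_16"
    print(f"{wav.name}: {info.samplerate} Hz, {info.channels} ch, {info.subtype} -> {'OK' if ok else 'CHECK'}")
```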
License: https://www.futurebeeai.com/policies/ai-data-license-agreement
This Indian English Call Center Speech Dataset for the Travel industry is purpose-built to power the next generation of voice AI applications for travel booking, customer support, and itinerary assistance. With over 30 hours of unscripted, real-world conversations, the dataset enables the development of highly accurate speech recognition and natural language understanding models tailored for English-speaking travelers.
Created by FutureBeeAI, this dataset supports researchers, data scientists, and conversational AI teams in building voice technologies for airlines, travel portals, and hospitality platforms.
The dataset includes 30 hours of dual-channel audio recordings between native Indian English speakers engaged in real travel-related customer service conversations. These audio files reflect a wide variety of topics, accents, and scenarios found across the travel and tourism industry.
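Because the calls are dual-channel, a common preprocessing step is to split each stereo file into separate speaker tracks. The sketch below does this with the soundfile library; which channel holds the agent versus the customer is an assumption to verify against the dataset documentation, as are the folder names.

```python
import soundfile as sf
from pathlib import Path

# Split each dual-channel call into two mono files.
# Channel 0 = agent and channel 1 = customer is an assumption; confirm with the dataset docs.
def split_channels(call_path: str, out_dir: str) -> None:
    audio, sr = sf.read(call_path)  # shape: (num_samples, 2) for dual-channel calls
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    stem = Path(call_path).stem
    sf.write(str(out / f"{stem}_agent.wav"), audio[:, 0], sr)
    sf.write(str(out / f"{stem}_customer.wav"), audio[:, 1], sr)

if __name__ == "__main__":
    for call in sorted(Path("travel_calls").glob("*.wav")):
        split_channels(str(call), "split_calls")
```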
Inbound and outbound conversations span a wide range of real-world travel support situations with varied outcomes (positive, neutral, negative).
These scenarios help models understand and respond to diverse traveler needs in real-time.
Each call is accompanied by manually curated, high-accuracy transcriptions in JSON format.
Extensive metadata enriches each call and speaker for better filtering and AI training:
This dataset is ideal for a variety of AI use cases in the travel and tourism space:
This data was made as part of the Alaska Experimental Program to Stimulate Competitive Research (EPSCoR) Northern Test Case. The data can be used to look at language skills and retention over time. This data is the percent of the American Indian and Alaska Native population that speaks only an "Other" language. Other languages include: Navajo, other Native American languages, Hungarian, Arabic, Hebrew, African languages, and all other languages. Source: American Community Survey (ACS). Extent: Data is for all communities in Alaska. Notes: We chose only Natives because our interest is Alaska Natives. However, data for places like Anchorage might have a large other Native presence, which should be examined.
(Excluding those under 5 years old or who speak only English)
Hawaii’s Limited English Proficient (LEP) Population: A Demographic and Socio-Economic Profile
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Card for "english_dialects"
Dataset Summary
This dataset consists of 31 hours of transcribed high-quality audio of English sentences recorded by 120 volunteers speaking with different accents of the British Isles. The dataset is intended for linguistic analysis as well as use for speech technologies. The speakers self-identified as native speakers of Southern England, Midlands, Northern England, Welsh, Scottish and Irish varieties of English. The recording scripts… See the full description on the dataset page: https://huggingface.co/datasets/ylacombe/english_dialects.
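The dataset is hosted on the Hugging Face Hub, so it can be pulled with the datasets library as sketched below. The configuration name used here ("irish_male") is an assumption based on the accent/gender splits described above, and the field names should be confirmed on the dataset page.

```python
from datasets import load_dataset

# Load one accent/gender configuration; check the exact config names on
# https://huggingface.co/datasets/ylacombe/english_dialects before running.
dialects = load_dataset("ylacombe/english_dialects", "irish_male", split="train")

sample = dialects[0]
print(sample["text"])                    # recorded sentence (field name assumed from the dataset card)
print(sample["audio"]["sampling_rate"])  # decoded audio array comes with its sampling rate
```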
English (India) Scripted Monologue Smartphone speech dataset, collected from monologues based on given scripts, covering the generic domain, human-machine interaction, smart home commands, in-car commands, numbers, and other domains. Transcribed with text content and other attributes. The dataset was collected from an extensive and geographically diverse pool of 2,100 native Indian speakers, enhancing model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring the maintenance of user privacy and legal rights throughout the data collection, storage, and usage processes; our datasets are GDPR, CCPA, and PIPL compliant.
License: https://www.futurebeeai.com/policies/ai-data-license-agreement
This Indian English Call Center Speech Dataset for the Real Estate industry is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for English-speaking real estate customers. With over 30 hours of unscripted, real-world audio, this dataset captures authentic conversations between customers and real estate agents, ideal for building robust ASR models.
Curated by FutureBeeAI, this dataset equips voice AI developers, real estate tech platforms, and NLP researchers with the data needed to create high-accuracy, production-ready models for property-focused use cases.
The dataset features 30 hours of dual-channel call center recordings between native Indian English speakers. Captured in realistic real estate consultation and support contexts, these conversations span a wide array of property-related topics, from inquiries to investment advice, offering deep domain coverage for AI model development.
This speech corpus includes both inbound and outbound calls, featuring positive, neutral, and negative outcomes across a wide range of real estate scenarios.
Such domain-rich variety ensures model generalization across common real estate support conversations.
All recordings are accompanied by precise, manually verified transcriptions in JSON format.
These transcriptions streamline ASR and NLP development for English real estate voice applications.
Detailed metadata accompanies each participant and conversation:
This enables smart filtering, dialect-focused model training, and structured dataset exploration.
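As a sketch of the kind of metadata-driven filtering mentioned above, the snippet below selects calls by hypothetical metadata fields (speaker gender and call outcome); the metadata.json manifest and its field names are illustrative assumptions, not the dataset's documented schema.

```python
import json

# Hypothetical manifest: a list of per-call records with speaker and call attributes.
# Field names ("gender", "outcome", "audio_path") are illustrative assumptions.
with open("metadata.json", encoding="utf-8") as f:
    calls = json.load(f)

subset = [
    call for call in calls
    if call.get("gender") == "female" and call.get("outcome") == "positive"
]
print(f"Selected {len(subset)} of {len(calls)} calls")
for call in subset[:3]:
    print(call.get("audio_path"))
```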
This dataset is ideal for voice AI and NLP systems built for the real estate sector:
License: https://www.futurebeeai.com/policies/ai-data-license-agreement
This Indian English Call Center Speech Dataset for the Delivery and Logistics industry is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for English-speaking customers. With over 30 hours of real-world, unscripted call center audio, this dataset captures authentic delivery-related conversations essential for training high-performance ASR models.
Curated by FutureBeeAI, this dataset empowers AI teams, logistics tech providers, and NLP researchers to build accurate, production-ready models for customer support automation in delivery and logistics.
The dataset contains 30 hours of dual-channel call center recordings between native Indian English speakers. Captured across various delivery and logistics service scenarios, these conversations cover everything from order tracking to missed-delivery resolutions, offering a rich, real-world training base for AI models.
This speech corpus includes both inbound and outbound delivery-related conversations, covering varied outcomes (positive, negative, neutral) to train adaptable voice models.
This comprehensive coverage reflects real-world logistics workflows, helping voice AI systems interpret context and intent with precision.
All recordings come with high-quality, human-generated verbatim transcriptions in JSON format.
These transcriptions support fast, reliable model development for English voice AI applications in the delivery sector.
Detailed metadata is included for each participant and conversation:
This metadata aids in training specialized models, filtering demographics, and running advanced analytics.
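One way such audio-plus-transcription pairs are typically consumed is via a JSON-lines manifest (one record per call) that many ASR training setups can read. The sketch below builds one under assumed folder names and an assumed "transcript" key; adjust both to the delivered structure.

```python
import json
from pathlib import Path

# Build a simple JSON-lines manifest: {"audio_filepath": ..., "text": ...} per call.
# Folder names and the "transcript" key are assumptions about the delivered package.
def build_manifest(audio_dir: str, transcript_dir: str, manifest_path: str) -> None:
    with open(manifest_path, "w", encoding="utf-8") as out:
        for wav in sorted(Path(audio_dir).glob("*.wav")):
            ts_file = Path(transcript_dir) / f"{wav.stem}.json"
            if not ts_file.exists():
                continue
            with open(ts_file, encoding="utf-8") as f:
                data = json.load(f)
            record = {"audio_filepath": str(wav), "text": data.get("transcript", "")}
            out.write(json.dumps(record, ensure_ascii=False) + "\n")

if __name__ == "__main__":
    build_manifest("delivery_calls", "delivery_transcripts", "train_manifest.jsonl")
```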
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The ORBIT (Object Recognition for Blind Image Training)-India Dataset is a collection of 105,243 images of 76 commonly used objects, collected by 12 individuals in India who are blind or have low vision. This dataset is an "Indian subset" of the original ORBIT dataset [1, 2], which was collected in the UK and Canada. In contrast to the ORBIT dataset, which was created in a Global North, Western, and English-speaking context, the ORBIT-India dataset features images taken in a low-resource, non-English-speaking, Global South context, home to 90% of the world's population of people with blindness. Since it is easier for blind or low-vision individuals to gather high-quality data by recording videos, this dataset, like the ORBIT dataset, contains images (each sized 224x224) derived from 587 videos. These videos were taken by our data collectors from various parts of India using the Find My Things [3] Android app. Each data collector was asked to record eight videos of at least 10 objects of their choice.
Collected between July and November 2023, this dataset represents a set of objects commonly used by people who are blind or have low vision in India, including earphones, talking watches, toothbrushes, and typical Indian household items like a belan (rolling pin), and a steel glass. These videos were taken in various settings of the data collectors' homes and workspaces using the Find My Things Android app.
The image dataset is stored in the ‘Dataset’ folder, organized into folders assigned to each data collector (P1, P2, ... P12) who collected them. Each collector's folder includes sub-folders named with the object labels as provided by our data collectors. Within each object folder, there are two subfolders: ‘clean’ for images taken on clean surfaces and ‘clutter’ for images taken in cluttered environments where the objects are typically found. The annotations are saved inside an ‘Annotations’ folder containing a JSON file per video (e.g., P1--coffee mug--clean--231220_084852_coffee mug_224.json) with keys corresponding to all frames/images in that video (e.g., "P1--coffee mug--clean--231220_084852_coffee mug_224--000001.jpeg": {"object_not_present_issue": false, "pii_present_issue": false}, "P1--coffee mug--clean--231220_084852_coffee mug_224--000002.jpeg": {"object_not_present_issue": false, "pii_present_issue": false}, ...). The ‘object_not_present_issue’ key is true if the object is not present in the image, and the ‘pii_present_issue’ key is true if there is personally identifiable information (PII) present in the image. Note: all PII present in the images has been blurred to protect the identity and privacy of our data collectors. This dataset version was created by cropping images originally sized at 1080 × 1920; an unscaled version of the dataset will follow soon.
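Given the annotation layout described above, the sketch below walks the ‘Annotations’ folder and keeps only frames not flagged with object_not_present_issue. The annotation keys follow the description; the root folder name ("ORBIT-India") is an assumption to check against the released archive.

```python
import json
from pathlib import Path

# Collect usable frame names, skipping frames where the annotated object is absent.
def usable_frames(root: str):
    base = Path(root)
    keep = []
    for ann_file in sorted((base / "Annotations").glob("*.json")):
        with open(ann_file, encoding="utf-8") as f:
            frames = json.load(f)
        for frame_name, flags in frames.items():
            if not flags.get("object_not_present_issue", False):
                keep.append(frame_name)
    return keep

if __name__ == "__main__":
    frames = usable_frames("ORBIT-India")
    print(f"{len(frames)} frames usable for training")
```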
This project was funded by the Engineering and Physical Sciences Research Council (EPSRC) Industrial ICASE Award with Microsoft Research UK Ltd. as the Industrial Project Partner. We would like to acknowledge and express our gratitude to our data collectors for their efforts and time invested in carefully collecting videos to build this dataset for their community. The dataset is designed for developing few-shot learning algorithms, aiming to support researchers and developers in advancing object-recognition systems. We are excited to share this dataset and would love to hear from you if and how you use this dataset. Please feel free to reach out if you have any questions, comments or suggestions.
REFERENCES:
Daniela Massiceti, Lida Theodorou, Luisa Zintgraf, Matthew Tobias Harris, Simone Stumpf, Cecily Morrison, Edward Cutrell, and Katja Hofmann. 2021. ORBIT: A real-world few-shot dataset for teachable object recognition collected from people who are blind or low vision. DOI: https://doi.org/10.25383/city.14294597
microsoft/ORBIT-Dataset. https://github.com/microsoft/ORBIT-Dataset
Linda Yilin Wen, Cecily Morrison, Martin Grayson, Rita Faia Marques, Daniela Massiceti, Camilla Longden, and Edward Cutrell. 2024. Find My Things: Personalized Accessibility through Teachable AI for People who are Blind or Low Vision. In Extended Abstracts of the 2024 CHI Conference on Human Factors in Computing Systems (CHI EA '24). Association for Computing Machinery, New York, NY, USA, Article 403, 1–6. https://doi.org/10.1145/3613905.3648641
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Dataset abstract: This dataset contains the results from 33 Flemish English as a Foreign Language (EFL) learners, who were exposed to eight native and non-native accents of English. These participants completed (i) a comprehensibility and accentedness rating task, followed by (ii) an orthographic transcription task. In the first task, listeners were asked to rate eight speakers of English on comprehensibility and accentedness on a nine-point scale (1 = easy to understand/no accent; 9 = hard to understand/strong accent). How accentedness ratings and listeners' familiarity with the different accents impacted on their comprehensibility judgements was measured using a linear mixed-effects model. The orthographic transcription task, then, was used to verify how well listeners actually understood the different accents of English (i.e. intelligibility). To that end, participants' transcription accuracy was measured as the number of correctly transcribed words and was estimated using a logistic mixed-effects model. Finally, the relation between listeners' self-reported ease of understanding the different speakers (comprehensibility) and their actual understanding of the speakers (intelligibility) was assessed using a linear mixed-effects regression. R code for the data analysis is provided.
Article abstract: This study investigates how well English as a Foreign Language (EFL) learners report understanding (i.e. comprehensibility) and actually understand (i.e. intelligibility) native and non-native accents of English, and how EFL learners' self-reported ease of understanding and actual understanding of these accents are aligned. Thirty-three Dutch-speaking EFL learners performed a comprehensibility and accentedness judgement task, followed by an orthographic transcription task. The judgement task elicited listeners' scalar ratings of authentic speech from eight speakers with traditional Inner, Outer and Expanding Circle accents. The transcription task assessed listeners' actual understanding of 40 sentences produced by the same eight speakers. Speakers with Inner Circle accents were reported to be more comprehensible than speakers with non-Inner Circle accents, with Expanding Circle speakers being easier to understand than Outer Circle speakers. The strength of a speaker's accent significantly affected listeners' comprehensibility ratings. Most speakers were highly intelligible, with intelligibility scores ranging between 79% and 95%. Listeners' self-reported ease of understanding the speakers in our study generally matched their actual understanding of those speakers, but no correlation between comprehensibility and intelligibility was detected. The study foregrounds the effect of native and non-native accents on comprehensibility and intelligibility, and highlights the importance of multidialectal listening skills.
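The comprehensibility analysis is described as a linear mixed-effects model with accentedness and familiarity as predictors (the original analysis was done in R, and that R code ships with the dataset). A rough Python analogue with statsmodels is sketched below, assuming a hypothetical long-format table with one row per listener-speaker rating; the file name and variable names are illustrative, not the dataset's actual layout.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format table: one row per (listener, speaker) judgement.
# Columns (comprehensibility, accentedness, familiarity, listener) are assumptions.
ratings = pd.read_csv("comprehensibility_ratings.csv")

# Linear mixed-effects model: accentedness and familiarity as fixed effects,
# random intercepts per listener (crossed random effects for speakers would need an extended spec).
model = smf.mixedlm(
    "comprehensibility ~ accentedness + familiarity",
    data=ratings,
    groups=ratings["listener"],
)
result = model.fit()
print(result.summary())
```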
Dataset abstract: This dataset contains the results from 40 L1 British English, 80 Belgian Dutch and 80 European Spanish listeners, who were exposed to English speakers with a General British English, Newcastle and French accent. In the first experiment, participants completed (i) a demographic and linguistic background questionnaire, (ii) an orthographic transcription task and (iii) a vocabulary/general proficiency test (LexTALE; cf. Lemhöfer & Broersma, 2012). In the transcription task, participants listened to 120 stimulus sentences and were asked to write down what the speakers said. Crucially, each sentence contained one target word that was either phonetically unreduced or phonetically reduced. How well the different groups of listeners understood the speakers (i.e. intelligibility), and more importantly the unreduced and reduced words, was measured as the number of correctly transcribed target words and was assessed using a linear mixed-effects regression model. In the second experiment, participants completed (i) a demographic and linguistic background questionnaire, (ii) an auditory lexical decision task and (iii) a vocabulary/general proficiency test (LexTALE; cf. Lemhöfer & Broersma, 2012). In the lexical decision task, participants were asked to decide whether a particular target word was a real word in English or a nonword. Participants' lexical decision responses (word vs. nonword) were analyzed using a mixed-effects logistic regression model, and their response times (i.e. the time interval between stimulus offset and keypress) were analysed using a linear mixed-effects regression model. R code for the data analysis is provided.
Article abstract: This study examines to what extent phonetic reduction in different accents affects intelligibility for non-native (L2) listeners, and if similar reduction processes in listeners' first language (L1) facilitate the recognition and processing of reduced word forms in the target language. In two experiments, 80 Dutch-speaking and 80 Spanish-speaking learners of English were presented with unreduced and reduced pronunciation variants in native and non-native English speech. Results showed that unreduced words are recognized more accurately and more quickly than reduced words, regardless of whether these variants occur in non-regionally, regionally or non-native accented speech. No differential effect of phonetic reduction on intelligibility and spoken word recognition was observed between Dutch-speaking and Spanish-speaking participants, despite the absence of strong vowel reduction in Spanish. These findings suggest that similar speech processes in listeners' L1 and L2 do not invariably lead to an intelligibility benefit or a cross-linguistic facilitation effect in lexical access.
License: https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the US English General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of English speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world US English communication.
Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade English speech models that understand and respond to authentic American accents and dialects.
The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of US English. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
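A quick way to verify the stated audio volume after download is to total the duration of the delivered WAV files, as in the sketch below; the folder path is an assumption about how the package is organized.

```python
import soundfile as sf
from pathlib import Path

# Sum the duration of all WAV files; the folder name is an assumption.
total_seconds = 0.0
for wav in Path("us_english_conversations/audio").rglob("*.wav"):
    info = sf.info(str(wav))
    total_seconds += info.frames / info.samplerate

print(f"Total audio: {total_seconds / 3600:.1f} hours")
```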
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
The dataset comes with granular metadata for both speakers and recordings:
Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
This dataset is a versatile resource for multiple English speech and language AI applications:
License: https://www.futurebeeai.com/policies/ai-data-license-agreement
This Indian English Call Center Speech Dataset for the Telecom industry is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for English-speaking telecom customers. Featuring over 30 hours of real-world, unscripted audio, it delivers authentic customer-agent interactions across key telecom support scenarios to help train robust ASR models.
Curated by FutureBeeAI, this dataset empowers voice AI engineers, telecom automation teams, and NLP researchers to build high-accuracy, production-ready models for telecom-specific use cases.
The dataset contains 30 hours of dual-channel call center recordings between native Indian English speakers. Captured in realistic customer support settings, these conversations span a wide range of telecom topics, from network complaints to billing issues, offering a strong foundation for training and evaluating telecom voice AI solutions.
This speech corpus includes both inbound and outbound calls with varied conversational outcomes (positive, negative, and neutral), ensuring broad scenario coverage for telecom AI development.
This variety helps train telecom-specific models to manage real-world customer interactions and understand context-specific voice patterns.
All audio files are accompanied by manually curated, time-coded verbatim transcriptions in JSON format.
These transcriptions are production-ready, allowing for faster development of ASR and conversational AI systems in the Telecom domain.
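Since the transcriptions are time-coded, they can be converted into simple segment lists for alignment or evaluation. The sketch below assumes each JSON file holds a list of segments with start, end, and text fields; those field names are illustrative, so map them to the actual schema before use.

```python
import json
from pathlib import Path

# Convert hypothetical time-coded JSON transcripts into (start, end, text) tuples.
# The "segments", "start", "end" and "text" keys are assumptions about the schema.
def load_segments(transcript_path: str):
    with open(transcript_path, encoding="utf-8") as f:
        data = json.load(f)
    return [(seg["start"], seg["end"], seg["text"]) for seg in data.get("segments", [])]

if __name__ == "__main__":
    for ts in sorted(Path("telecom_transcripts").glob("*.json"))[:2]:
        for start, end, text in load_segments(str(ts))[:3]:
            print(f"[{start:7.2f} - {end:7.2f}] {text}")
```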
Rich metadata is available for each participant and conversation:
This ZIP file contains an IVT file.
MOBIO is a dataset for mobile face and speaker recognition. The dataset consists of bi-modal (audio and video) data taken from 150 people. The dataset has a female-male ratio of nearly 1:2 (99 males and 51 females) and was collected from August 2008 until July 2010 at six different sites in five different countries. This led to a diverse bi-modal dataset with both native and non-native English speakers. In total, 12 sessions were captured for each client: 6 sessions for Phase I and 6 sessions for Phase II. The Phase I data consists of 21 questions with question types ranging from Short Response Questions, Short Response Free Speech, Set Speech, and Free Speech. The Phase II data consists of 11 questions with question types ranging from Short Response Questions, Set Speech, and Free Speech. A more detailed description of the questions asked of the clients is provided below.
The database was recorded using two mobile devices: a mobile phone and a laptop computer. The mobile phone used to capture the database was a NOKIA N93i, while the laptop computer was a standard 2008 MacBook. The laptop was only used to capture part of the first session; this first session consists of data captured on both the laptop and the mobile phone.
Detailed Description of Questions. Please note that the answers to the Short Response Free Speech and Free Speech questions DO NOT necessarily relate to the question, as the sole purpose is to have the subject producing free speech; therefore, the answers to ALL of these questions are assumed to be false.
1. Short Response Questions: five pre-defined questions. What is your name? (the user supplies their fake name); What is your address? (the user supplies their fake address); What is your birthdate? (the user supplies their fake birthdate); What is your license number? (the user supplies their fake ID card number, the same for each person); What is your credit card number? (the user supplies their fake card number).
2. Short Response Free Speech: five random questions taken from a list of 30-40 questions. The user had to answer these questions by speaking for approximately 5 seconds of recording (sometimes more and sometimes less).
3. Set Speech: the users were asked to read pre-defined text out loud. This text was designed to take longer than 10 seconds to utter, and the participants were allowed to correct themselves while reading these paragraphs. The text that was read was: "I have signed the MOBIO consent form and I understand that my biometric data is being captured for a database that might be made publicly available for research purposes. I understand that I am solely responsible for the content of my statements and my behaviour. I will ensure that when answering a question I do not provide any personal information in response to any question."
4. Free Speech: the free speech session consisted of 10 random questions from a list of approximately 30 questions. The answers to each of these questions took approximately 10 seconds (sometimes less and sometimes more).
Acknowledgements: Elie Khoury, Laurent El-Shafey, Christopher McCool, Manuel Günther, Sébastien Marcel, "Bi-modal biometric authentication on mobile phones in challenging conditions", Image and Vision Computing, Volume 32, Issue 12, 2014. https://doi.org/10.1016/j.imavis.2013.10.001 https://publications.idiap.ch/index.php/publications/show/2689
License: Open Data Commons Attribution License (ODC-By) v1.0, https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
Welcome to the FEIS (Fourteen-channel EEG with Imagined Speech) dataset.
The FEIS dataset comprises Emotiv EPOC+ [1] EEG recordings of:
* 21 participants listening to, imagining speaking, and then actually speaking 16 English phonemes (see supplementary, below)
* 2 participants listening to, imagining speaking, and then actually speaking 16 Chinese syllables (see supplementary, below)
For replicability and for the benefit of further research, this dataset includes the complete experiment set-up, including participants' recorded audio and 'flashcard' screens for audio-visual prompts, the Lua script and .mxs scenario for the OpenVibe [2] environment, as well as all Python scripts for the preparation and processing of data as used in the supporting studies (submitted in support of completion of the MSc Speech and Language Processing with the University of Edinburgh):
* J. Clayton, "Towards phone classification from imagined speech using a lightweight EEG brain-computer interface," M.Sc. dissertation, University of Edinburgh, Edinburgh, UK, 2019.
* S. Wellington, "An investigation into the possibilities and limitations of decoding heard, imagined and spoken phonemes using a low-density, mobile EEG headset," M.Sc. dissertation, University of Edinburgh, Edinburgh, UK, 2019.
Each participant's data comprise 5 .csv files: the 'raw' (unprocessed) EEG recordings for the 'stimuli', 'articulators' (see supplementary, below), 'thinking', 'speaking' and 'resting' phases per epoch for each trial, alongside a 'full' .csv file with the end-to-end experiment recording (for the benefit of calculating deltas). To guard against software deprecation or inaccessibility, the full repository of open-source software used in the above studies is also included. We hope for the FEIS dataset to be of some utility for future researchers, due to the sparsity of similar open-access databases. As such, this dataset is made freely available for all academic and research purposes (non-profit).
REFERENCING: If you use the FEIS dataset, please reference: S. Wellington, J. Clayton, "Fourteen-channel EEG with Imagined Speech (FEIS) dataset," v1.0, University of Edinburgh, Edinburgh, UK, 2019. doi:10.5281/zenodo.3369178
LEGAL: The research supporting the distribution of this dataset has been approved by the PPLS Research Ethics Committee, School of Philosophy, Psychology and Language Sciences, University of Edinburgh (reference number: 435-1819/2). This dataset is made available under the Open Data Commons Attribution License (ODC-BY): http://opendatacommons.org/licenses/by/1.0
ACKNOWLEDGEMENTS: The FEIS database was compiled by Scott Wellington (MSc Speech and Language Processing, University of Edinburgh) and Jonathan Clayton (MSc Speech and Language Processing, University of Edinburgh). Principal Investigators: Oliver Watts (Senior Researcher, CSTR, University of Edinburgh) and Cassia Valentini-Botinhao (Senior Researcher, CSTR, University of Edinburgh).
METADATA: For participants, dataset refs 01 to 21: 01 - NNS; 02 - NNS; 03 - NNS, left-handed; 04 - E; 05 - E, voice heard as part of 'stimuli' portions of trials belongs to participant 04, due to the microphone becoming damaged and unusable prior to recording; 06 - E; 07 - E; 08 - E, ambidextrous; 09 - NNS, left-handed; 10 - E; 11 - NNS; 12 - NNS, only sessions one and two recorded (out of three total), as the participant had to leave the recording session early; 13 - E; 14 - NNS; 15 - NNS; 16 - NNS; 17 - E; 18 - NNS; 19 - E; 20 - E; 21 - E. (E = native speaker of English; NNS = non-native speaker of English, >= C1 level.) For participants, dataset refs chinese-1 and chinese-2: chinese-1 - C; chinese-2 - C, voice heard as part of 'stimuli' portions of trials belongs to participant chinese-1. (C = native speaker of Chinese.)
SUPPLEMENTARY: Under the international 10-20 system, the Emotiv EPOC+ headset has 14 channels: F3, FC5, AF3, F7, T7, P7, O1, O2, P8, T8, F8, AF4, FC6, F4. The 16 English phonemes investigated in dataset refs 01 to 21: /i/ /u:/ /æ/ /ɔ:/ /m/ /n/ /ŋ/ /f/ /s/ /ʃ/ /v/ /z/ /ʒ/ /p/ /t/ /k/. The 16 Chinese syllables investigated in dataset refs chinese-1 and chinese-2: mā má mǎ mà mēng méng měng mèng duō duó duǒ duò tuī tuí tuǐ tuì. All references to 'articulators' (e.g. as part of filenames) refer to the 1-second 'fixation point' portion of trials; the name is a holdover from preliminary trials, which were modelled on the KARA ONE database (http://www.cs.toronto.edu/~complingweb/data/karaOne/karaOne.html) [3].
REFERENCES:
[1] Emotiv EPOC+. https://emotiv.com/epoc. Accessed online 14/08/2019.
[2] Y. Renard, F. Lotte, G. Gibert, M. Congedo, E. Maby, V. Delannoy, O. Bertrand, A. Lécuyer. "OpenViBE: An Open-Source Software Platform to Design, Test and Use Brain-Computer Interfaces in Real and Virtual Environments", Presence: Teleoperators and Virtual Environments, vol. 19, no. 1, 2010.
[3] S. Zhao, F. Rudzicz. "Classifying phonological categories in imagined and articulated speech." In Proceedings of ICASSP 2015, Brisbane, Australia, 2015.
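As a starting point for working with the per-phase CSV files, the sketch below loads one recording with pandas and selects the 14 EPOC+ channels listed in the supplementary notes. The file path and the assumption that columns are named after the channels are illustrative and should be checked against the actual files.

```python
import pandas as pd

# The 14 Emotiv EPOC+ channels listed in the dataset's supplementary notes.
CHANNELS = ["F3", "FC5", "AF3", "F7", "T7", "P7", "O1",
            "O2", "P8", "T8", "F8", "AF4", "FC6", "F4"]

# Hypothetical path to one participant's 'thinking'-phase recording; the assumption
# that the CSV has one column per channel should be verified against the real files.
eeg = pd.read_csv("FEIS/01/thinking.csv")

available = [ch for ch in CHANNELS if ch in eeg.columns]
signals = eeg[available]
print(f"Loaded {len(signals)} samples across {len(available)} channels")
print(signals.describe().loc[["mean", "std"]])
```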