License: https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Indian English General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of English speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Indian English communication.
Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade English speech models that understand and respond to authentic Indian accents and dialects.
The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Indian English. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
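The exact JSON schema of the transcriptions is not specified in this listing. As a minimal sketch, assuming (hypothetically) that each file holds a list of utterance segments with a speaker ID, start/end timestamps, and text, a loader for an ASR pipeline might look like this; all field names below are assumptions to be adapted to the actual files:

```python
import json

def load_segments(json_text):
    """Parse a transcription file into (speaker, start, end, text) tuples.

    The field names used here are assumptions, not the dataset's
    documented schema; adapt them to the real files.
    """
    data = json.loads(json_text)
    return [
        (seg["speaker_id"], seg["start_time"], seg["end_time"], seg["text"])
        for seg in data["segments"]
    ]

# Tiny hypothetical example file for demonstration.
sample = """
{
  "segments": [
    {"speaker_id": "SPK1", "start_time": 0.0, "end_time": 2.4,
     "text": "hello how are you"},
    {"speaker_id": "SPK2", "start_time": 2.5, "end_time": 4.1,
     "text": "i am fine thank you"}
  ]
}
"""
segments = load_segments(sample)
print(len(segments))   # 2
print(segments[0][0])  # SPK1
```

Keeping the loader schema-agnostic like this (a pure function over the parsed JSON) makes it easy to swap in the dataset's real field names once the files are inspected.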
The dataset comes with granular metadata for both speakers and recordings:
Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
This dataset is a versatile resource for multiple English speech and language AI applications:
This dataset provides 300 hours of Indian English conversational speech collected via smartphones from 390 native speakers. Dialogues are based on given topics and are transcribed with text content, timestamps, speaker ID, gender, and other attributes. The data was collected from an extensive, geographically diverse pool of 390 native speakers, enhancing model performance in real and complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout the data collection, storage, and usage processes; our datasets are all GDPR, CCPA, and PIPL compliant.
License: https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Indian English Scripted Monologue Speech Dataset for the Retail & E-commerce domain. This dataset is built to accelerate the development of English language speech technologies especially for use in retail-focused automatic speech recognition (ASR), natural language processing (NLP), voicebots, and conversational AI applications.
This training dataset includes 6,000+ high-quality scripted audio recordings in Indian English, created to reflect real-world scenarios in the Retail & E-commerce sector. These prompts are tailored to improve the accuracy and robustness of customer-facing speech technologies.
This dataset includes a comprehensive set of retail-specific topics to ensure wide linguistic coverage for AI training:
To increase training utility, prompts include contextual data such as:
These additions help your models learn to recognize structured and unstructured retail-related speech.
Every audio file is paired with a verbatim transcription, ensuring consistency and alignment for model training.
Detailed metadata is included to support filtering, analysis, and model evaluation:
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This video and gloss-based dataset has been meticulously crafted to enhance the precision and resilience of ISL (Indian Sign Language) gesture recognition and generation systems. Our goal in sharing this dataset is to contribute to the research community, providing a valuable resource for fellow researchers to explore and innovate in sign language recognition and generation.

Overview of the Dataset: The dataset comprises a diverse array of ISL gesture videos and gloss data. A "gloss" in this context is a written or spoken description of the meaning of a sign, allowing sign language to be represented in written form. For each sign, the dataset includes the corresponding spoken or written language and its gloss. Key components include ISL grammar, which follows a layered approach incorporating specific spatial indices for tense and a lexicon with compounds, and a distinctive word order based on noun, verb, object, adjective, or question particle. The Marathi sign language portion follows the subject-object-verb (SOV) form, facilitating comprehension and adaptation to regional languages; this Marathi sign language gloss aims to become a medium for everyday communication among deaf individuals.

The dataset reflects a careful curation process that simulates real-world scenarios. The videos showcase a variety of gestures performed by a professional signer, capturing a broad spectrum of sign language expressions, recorded against a green screen under controlled lighting conditions. All videos adhere to a uniform pixel resolution, ensuring consistency for data presentation and streamlined pre-processing, and are stored in a format compatible with common machine learning and deep learning frameworks, so they integrate seamlessly into the research pipeline.
English (India) Spontaneous Dialogue Smartphone speech dataset, collected from dialogues based on given topics. Transcribed with text content, timestamps, speaker ID, gender, and other attributes. The data was collected from an extensive, geographically diverse pool of native speakers, enhancing model performance in real and complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout the data collection, storage, and usage processes; our datasets are all GDPR, CCPA, and PIPL compliant.
Format: 16 kHz, 16-bit, uncompressed WAV, mono channel
Content category: Dialogue based on given topics
Recording condition: Low background noise (indoor)
Recording device: Android smartphone, iPhone
Country: India (IN)
Language (Region) Code: en-IN
Language: English
Speakers: 734 native speakers in total
Features of annotation: Transcription text, timestamp, speaker ID, gender, noise
Accuracy rate: Word Correct Rate (WCR) 98%
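The stated audio format (16 kHz, 16-bit, mono, uncompressed WAV) can be verified programmatically before training; a minimal sketch using Python's standard `wave` module, demonstrated here on a synthetic in-memory file rather than an actual dataset file:

```python
import io
import wave

def check_wav(fileobj):
    """Return True if the WAV stream matches the spec: 16 kHz, 16-bit, mono."""
    with wave.open(fileobj, "rb") as w:
        return (w.getframerate() == 16000
                and w.getsampwidth() == 2   # 16 bit = 2 bytes per sample
                and w.getnchannels() == 1)

# Build a tiny stand-in WAV for demonstration (0.01 s of silence).
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 160)
buf.seek(0)

ok = check_wav(buf)
print(ok)  # True
```

In practice the same check would be run over each file path in the corpus to flag any recordings that deviate from the documented format.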
AccentDB is a multi-pairwise parallel corpus of structured and labelled accented speech. It contains speech samples from speakers of 4 non-native accents of English (8 speakers, 4 Indian languages); and also has a compilation of 4 native accents of English (4 countries, 13 speakers) and a metropolitan Indian accent (2 speakers). The dataset available here corresponds to the release titled accentdb_extended on https://accentdb.github.io/#dataset.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('accentdb', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comprehensive dataset containing 78 verified Spoken English locations in India with complete contact information, ratings, reviews, and location data.
License: https://www.futurebeeai.com/policies/ai-data-license-agreement
This Indian English Call Center Speech Dataset for the Real Estate industry is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for English-speaking Real Estate customers. With over 30 hours of unscripted, real-world audio, this dataset captures authentic conversations between customers and real estate agents, ideal for building robust ASR models.
Curated by FutureBeeAI, this dataset equips voice AI developers, real estate tech platforms, and NLP researchers with the data needed to create high-accuracy, production-ready models for property-focused use cases.
The dataset features 30 hours of dual-channel call center recordings between native Indian English speakers. Captured in realistic real estate consultation and support contexts, these conversations span a wide array of property-related topics, from inquiries to investment advice, offering deep domain coverage for AI model development.
This speech corpus includes both inbound and outbound calls, featuring positive, neutral, and negative outcomes across a wide range of real estate scenarios.
Such domain-rich variety ensures model generalization across common real estate support conversations.
All recordings are accompanied by precise, manually verified transcriptions in JSON format.
These transcriptions streamline ASR and NLP development for English real estate voice applications.
Detailed metadata accompanies each participant and conversation:
This enables smart filtering, dialect-focused model training, and structured dataset exploration.
This dataset is ideal for voice AI and NLP systems built for the real estate sector:
License: Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset aids in fine-tuning assistance or chatbot models to comprehend both Hinglish and English through Hindi, enhancing their ability to understand and respond effectively in this hybrid language for optimal performance.
Hinglish is a hybrid language, a blend of Hindi and English, commonly spoken in India. It combines vocabulary and grammar from both languages, often used in text conversations. The Hinglish dataset is crucial for fine-tuning open-source language models like LLAMA-2, which lack exposure to such data in training. In contrast, GPT-3 and later models have been trained on Hinglish data, making them more adept at understanding this hybrid language.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comprehensive dataset containing 55 verified Spoken English Classes locations in India with complete contact information, ratings, reviews, and location data.
License: Attribution-NoDerivs 4.0 (CC BY-ND 4.0), https://creativecommons.org/licenses/by-nd/4.0/
License information was derived automatically
The Indian Sign Language Alphabet Dataset is a curated collection of hand gesture images representing the alphabets (A–Z) in Indian Sign Language (ISL). It was specifically created to support the development of Gesture-Speech, a groundbreaking project that translates Indian Sign Language gestures into spoken words, making communication more accessible for the hearing and speech-impaired communities.
The dataset is intended for researchers, developers, and educators to build robust machine learning models for recognizing ISL gestures and advancing assistive technologies.
The dataset was developed as part of the Gesture-Speech project, an open-source initiative hosted on GitHub. Gesture-Speech focuses on real-time translation of Indian Sign Language gestures into spoken language using deep learning and computer vision. The project’s mission is to: - Empower differently-abled individuals by breaking communication barriers. - Promote inclusivity through assistive AI-driven technologies. - Enable the use of ISL in everyday interactions.
The dataset serves as the backbone of Gesture-Speech’s training pipeline, allowing the development of accurate and efficient recognition models.
Dataset Highlights:
Integration with Gesture-Speech:
The dataset follows a directory-based structure for classification tasks:
Indian_Sign_Language_Alphabet/
│── a/
│ ├── 0.jpg
│ ├── 1.jpg
│ ├── ...
│── b/
│ ├── 0.jpg
│ ├── 1.jpg
│ ├── ...
│── ...
│── z/
All images are 384x384 pixels for consistency and stored in .jpg format.
The Gesture-Speech project can be found on GitHub. Key highlights of the project: - Uses TensorFlow and OpenCV to process hand gestures in real time. - Translates ISL gestures into spoken English or Hindi phrases using a deep learning pipeline. - Open-source and extensible, encouraging contributions and further development.
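The class-per-folder layout above is the standard structure for image-classification training. As a minimal sketch, indexing it into (image path, label) pairs with the standard library might look like this; the demonstration builds a tiny stand-in directory rather than assuming the real dataset is present:

```python
import tempfile
from pathlib import Path

def index_dataset(root):
    """Walk a/ .. z/ class folders and return (image_path, label) pairs."""
    pairs = []
    for class_dir in sorted(Path(root).iterdir()):
        if class_dir.is_dir():
            for img in sorted(class_dir.glob("*.jpg")):
                pairs.append((str(img), class_dir.name))
    return pairs

# Build a tiny stand-in for the real dataset root.
root = tempfile.mkdtemp()
for letter in ("a", "b"):
    class_dir = Path(root) / letter
    class_dir.mkdir()
    (class_dir / "0.jpg").write_bytes(b"")

pairs = index_dataset(root)
print([label for _, label in pairs])  # ['a', 'b']
```

The same (path, label) list plugs directly into common loaders (e.g. a tf.data or PyTorch Dataset pipeline) that decode and resize the 384x384 .jpg files.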
This dataset is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. Users can: - Share: Copy and redistribute the material in any medium or format. - Adapt: Remix, transform, and build upon the material for any purpose, even commercially. - Attribution: Proper credit must be given to the dataset creator.
The dataset and the Gesture-Speech project are dedicated to promoting inclusivity and enabling new opportunities for the differently-abled. Special thanks to contributors and volunteers who helped create this dataset by providing diverse samples of ISL gestures.
English (India) Scripted Monologue Smartphone speech dataset, collected from monologues based on given scripts, covering the generic domain, human-machine interaction, smart home commands, in-car commands, numbers, and other domains. Transcribed with text content and other attributes. The data was collected from an extensive, geographically diverse pool of 2,100 native Indian speakers, enhancing model performance in real and complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout the data collection, storage, and usage processes; our datasets are all GDPR, CCPA, and PIPL compliant.
The Global English Accent Conversational NLP Dataset is a comprehensive collection of validated English speech recordings sourced from native and non-native English speakers across key global regions. This dataset is designed for training Natural Language Processing models, conversational AI, Automatic Speech Recognition (ASR), and linguistic research, with a focus on regional accent variation.
Regions and Covered Countries with Primary Spoken Languages:
Africa: South Africa (English, Zulu, Afrikaans, Xhosa); Nigeria (English, Yoruba, Igbo, Hausa); Kenya (English, Swahili); Ghana (English, Twi, Ewe, Ga); Uganda (English, Luganda); Ethiopia (English, Amharic, Oromo)
Central & South America: Mexico (Spanish, English as a second language); Guatemala (Spanish, K'iche', English); El Salvador (Spanish, English); Costa Rica (Spanish, English in Caribbean regions); Colombia (Spanish, English in urban centers); Dominican Republic (Spanish, English in tourist zones); Brazil (Portuguese, English in urban areas); Argentina (Spanish, English among educated speakers)
Southeast Asia & South Asia: Philippines (Filipino, English); Vietnam (Vietnamese, English); Malaysia (Malay, English, Mandarin); Indonesia (Indonesian, Javanese, English); Singapore (English, Mandarin, Malay, Tamil); India (Hindi, English, Bengali, Tamil); Pakistan (Urdu, English, Punjabi)
Europe: United Kingdom (English); Ireland (English, Irish); Germany (German, English); France (French, English); Spain (Spanish, Catalan, English); Italy (Italian, English); Portugal (Portuguese, English)
Oceania: Australia (English); New Zealand (English, Māori); Fiji (English, Fijian)
North America: United States (English, Spanish); Canada (English, French)
Dataset Attributes: - Conversational English with natural accent variation - Global coverage with balanced male/female speakers - Rich speaker metadata: age, gender, country, city - Average audio length of ~30 minutes per participant - All samples manually validated for accuracy - Structured format suitable for machine learning and AI applications
Best suited for: - NLP model training and evaluation - Multilingual ASR system development - Voice assistant and chatbot design - Accent recognition research - Voice synthesis and TTS modeling
This dataset ensures global linguistic diversity and delivers high-quality audio for AI developers, researchers, and enterprises working on voice-based applications.
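The rich speaker metadata noted above (age, gender, country, city) lends itself to use-case-specific subsetting before training. A minimal filtering sketch follows; the record field names are assumptions for illustration, not the dataset's documented schema:

```python
def filter_speakers(records, *, country=None, gender=None, min_age=None):
    """Select speaker-metadata records matching the given criteria.

    Field names ("country", "gender", "age") are hypothetical and
    should be mapped to the dataset's actual metadata keys.
    """
    out = []
    for r in records:
        if country is not None and r["country"] != country:
            continue
        if gender is not None and r["gender"] != gender:
            continue
        if min_age is not None and r["age"] < min_age:
            continue
        out.append(r)
    return out

# Hypothetical records for demonstration.
speakers = [
    {"id": "S1", "country": "India", "gender": "female", "age": 28},
    {"id": "S2", "country": "Kenya", "gender": "male", "age": 41},
    {"id": "S3", "country": "India", "gender": "male", "age": 35},
]
print([s["id"] for s in filter_speakers(speakers, country="India")])  # ['S1', 'S3']
```

Keyword-only filters keep call sites readable and make it cheap to add further criteria (e.g. city or accent label) as needed.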
License: https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Indian English Language Visual Speech Dataset! This dataset is a collection of diverse, single-person unscripted spoken videos supporting research in visual speech recognition, emotion detection, and multimodal communication.
This visual speech dataset contains 1,000 videos in the Indian English language, each paired with a corresponding high-fidelity audio track. In each video, the participant answers a specific question in an unscripted, spontaneous manner.
Extensive guidelines were followed during recording to maintain quality and diversity.
The dataset provides comprehensive metadata for each video recording and participant:
License: https://www.futurebeeai.com/policies/ai-data-license-agreement
This Indian English Call Center Speech Dataset for the Travel industry is purpose-built to power the next generation of voice AI applications for travel booking, customer support, and itinerary assistance. With over 30 hours of unscripted, real-world conversations, the dataset enables the development of highly accurate speech recognition and natural language understanding models tailored for English-speaking travelers.
Created by FutureBeeAI, this dataset supports researchers, data scientists, and conversational AI teams in building voice technologies for airlines, travel portals, and hospitality platforms.
The dataset includes 30 hours of dual-channel audio recordings between native Indian English speakers engaged in real travel-related customer service conversations. These audio files reflect a wide variety of topics, accents, and scenarios found across the travel and tourism industry.
Inbound and outbound conversations span a wide range of real-world travel support situations with varied outcomes (positive, neutral, negative).
These scenarios help models understand and respond to diverse traveler needs in real-time.
Each call is accompanied by manually curated, high-accuracy transcriptions in JSON format.
Extensive metadata enriches each call and speaker for better filtering and AI training:
This dataset is ideal for a variety of AI use cases in the travel and tourism space:
Comprehensive YouTube channel statistics for India Speak Daily, featuring 411,000 subscribers and 100,536,783 total views. This dataset includes detailed performance metrics such as subscriber growth, video views, engagement rates, and estimated revenue. The channel operates in the News-&-Politics category and is based in IN. Track 8,998 videos with daily and monthly performance data, including view counts, subscriber changes, and earnings estimates. Analyze growth trends, engagement patterns, and compare performance against similar channels in the same category.
License: https://www.futurebeeai.com/policies/ai-data-license-agreement
This Indian English Call Center Speech Dataset for the Delivery and Logistics industry is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for English-speaking customers. With over 30 hours of real-world, unscripted call center audio, this dataset captures authentic delivery-related conversations essential for training high-performance ASR models.
Curated by FutureBeeAI, this dataset empowers AI teams, logistics tech providers, and NLP researchers to build accurate, production-ready models for customer support automation in delivery and logistics.
The dataset contains 30 hours of dual-channel call center recordings between native Indian English speakers. Captured across various delivery and logistics service scenarios, these conversations cover everything from order tracking to missed-delivery resolutions, offering a rich, real-world training base for AI models.
This speech corpus includes both inbound and outbound delivery-related conversations, covering varied outcomes (positive, negative, neutral) to train adaptable voice models.
This comprehensive coverage reflects real-world logistics workflows, helping voice AI systems interpret context and intent with precision.
All recordings come with high-quality, human-generated verbatim transcriptions in JSON format.
These transcriptions support fast, reliable model development for English voice AI applications in the delivery sector.
Detailed metadata is included for each participant and conversation:
This metadata aids in training specialized models, filtering demographics, and running advanced analytics.
License: https://www.futurebeeai.com/policies/ai-data-license-agreement
This Indian English Call Center Speech Dataset for the Telecom industry is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for English-speaking telecom customers. Featuring over 30 hours of real-world, unscripted audio, it delivers authentic customer-agent interactions across key telecom support scenarios to help train robust ASR models.
Curated by FutureBeeAI, this dataset empowers voice AI engineers, telecom automation teams, and NLP researchers to build high-accuracy, production-ready models for telecom-specific use cases.
The dataset contains 30 hours of dual-channel call center recordings between native Indian English speakers. Captured in realistic customer support settings, these conversations span a wide range of telecom topics from network complaints to billing issues, offering a strong foundation for training and evaluating telecom voice AI solutions.
This speech corpus includes both inbound and outbound calls with varied conversational outcomes (positive, negative, and neutral), ensuring broad scenario coverage for telecom AI development.
This variety helps train telecom-specific models to manage real-world customer interactions and understand context-specific voice patterns.
All audio files are accompanied by manually curated, time-coded verbatim transcriptions in JSON format.
These transcriptions are production-ready, allowing for faster development of ASR and conversational AI systems in the Telecom domain.
Rich metadata is available for each participant and conversation:
License: https://www.futurebeeai.com/policies/ai-data-license-agreement
The Indian English Wake Word & Voice Command Dataset is expertly curated to support the training and development of voice-activated systems. This dataset includes a large collection of wake words and command phrases, essential for enabling seamless user interaction with voice assistants and other speech-enabled technologies. It’s designed to ensure accurate wake word detection and voice command recognition, enhancing overall system performance and user experience.
This dataset includes 20,000+ audio recordings of wake words and command phrases. Each participant contributed 400 recordings, captured under varied environmental conditions and speaking speeds. The data covers:
This diversity ensures robust training for real-world voice assistant applications.
Each audio file is accompanied by detailed metadata to support advanced filtering and training needs.
License: https://www.futurebeeai.com/policies/ai-data-license-agreement
This Indian English Call Center Speech Dataset for the Healthcare industry is purpose-built to accelerate the development of English speech recognition, spoken language understanding, and conversational AI systems. With 30 hours of unscripted, real-world conversations, it delivers the linguistic and contextual depth needed to build high-performance ASR models for medical and wellness-related customer service.
Created by FutureBeeAI, this dataset empowers voice AI teams, NLP researchers, and data scientists to develop domain-specific models for hospitals, clinics, insurance providers, and telemedicine platforms.
The dataset features 30 hours of dual-channel call center conversations between native Indian English speakers. These recordings cover a variety of healthcare support topics, enabling the development of speech technologies that are contextually aware and linguistically rich.
The dataset spans inbound and outbound calls, capturing a broad range of healthcare-specific interactions and sentiment types (positive, neutral, negative).
These real-world interactions help build speech models that understand healthcare domain nuances and user intent.
Every audio file is accompanied by high-quality, manually created transcriptions in JSON format.
Each conversation and speaker includes detailed metadata to support fine-tuned training and analysis.
This dataset can be used across a range of healthcare and voice AI use cases: