CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Punjabi Dataset (ਪੰਜਾਬੀ ਡਾਟਾਸੈਟ): High-Quality Punjabi Call-Center, General Conversation, and Podcast Dataset for AI and Speech Models. Data categories: Call-Center Data, General Conversation Data, Podcast Data.
https://www.futurebeeai.com/policies/ai-data-license-agreement
The Punjabi Healthcare Chat Dataset is a rich collection of over 12,000 text-based conversations between customers and call center agents, focused on real-world healthcare interactions. Designed to reflect authentic language use and domain-specific dialogue patterns, this dataset supports the development of conversational AI, chatbots, and NLP models tailored for healthcare applications in Punjabi-speaking regions.
The dataset captures a wide spectrum of healthcare-related chat scenarios, ensuring comprehensive coverage for training robust AI systems:
This variety helps simulate realistic healthcare support workflows and patient-agent dynamics.
This dataset reflects the natural flow of Punjabi healthcare communication and includes:
These elements ensure the dataset is contextually relevant and linguistically rich for real-world use cases.
Conversations range from simple inquiries to complex advisory sessions, including:
Each conversation typically includes these structural components:
This structured flow mirrors actual healthcare support conversations and is ideal for training advanced dialogue systems.
Available in JSON, CSV, and TXT formats, each conversation includes:
https://www.futurebeeai.com/policies/ai-data-license-agreement
This Punjabi Call Center Speech Dataset for the Healthcare industry is purpose-built to accelerate the development of Punjabi speech recognition, spoken language understanding, and conversational AI systems. With 30 hours of unscripted, real-world conversations, it delivers the linguistic and contextual depth needed to build high-performance ASR models for medical and wellness-related customer service.
Created by FutureBeeAI, this dataset empowers voice AI teams, NLP researchers, and data scientists to develop domain-specific models for hospitals, clinics, insurance providers, and telemedicine platforms.
The dataset features 30 hours of dual-channel call center conversations between native Punjabi speakers. These recordings cover a variety of healthcare support topics, enabling the development of speech technologies that are contextually aware and linguistically rich.
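Dual-channel recordings typically keep each speaker on a separate channel, so a common first step is to split a stereo file into two mono tracks. The following is a minimal sketch using only the Python standard library. It assumes 16-bit PCM WAV input; the source does not state the sample width, the sample rate, or which channel carries which speaker, so those are assumptions here, and make_demo_stereo is a hypothetical helper that only builds synthetic test data.

```python
import io
import struct
import wave


def split_stereo(wav_bytes: bytes) -> tuple[bytes, bytes]:
    """Split 16-bit stereo WAV data into two mono WAV byte strings."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as src:
        if src.getnchannels() != 2 or src.getsampwidth() != 2:
            raise ValueError("expected 16-bit stereo audio")
        rate = src.getframerate()
        frames = src.readframes(src.getnframes())

    # Stereo PCM interleaves samples as L0 R0 L1 R1 ...
    samples = struct.unpack(f"<{len(frames) // 2}h", frames)
    channels = (samples[0::2], samples[1::2])

    outputs = []
    for channel in channels:
        buffer = io.BytesIO()
        with wave.open(buffer, "wb") as dst:
            dst.setnchannels(1)
            dst.setsampwidth(2)
            dst.setframerate(rate)
            dst.writeframes(struct.pack(f"<{len(channel)}h", *channel))
        outputs.append(buffer.getvalue())
    return outputs[0], outputs[1]


def make_demo_stereo(rate: int = 8000) -> bytes:
    """Build a tiny synthetic stereo clip (hypothetical test data only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as dst:
        dst.setnchannels(2)
        dst.setsampwidth(2)
        dst.setframerate(rate)
        # First channel holds 1, 2, 3; second channel holds -1, -2, -3.
        dst.writeframes(struct.pack("<6h", 1, -1, 2, -2, 3, -3))
    return buffer.getvalue()


first, second = split_stereo(make_demo_stereo())
```

Each returned value is a complete mono WAV file in memory, which can be written to disk or passed to a speech recognizer one speaker at a time.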
The dataset spans inbound and outbound calls, capturing a broad range of healthcare-specific interactions and sentiment types (positive, neutral, negative).
These real-world interactions help build speech models that understand healthcare domain nuances and user intent.
Every audio file is accompanied by high-quality, manually created transcriptions in JSON format.
Each conversation and speaker includes detailed metadata to support fine-tuned training and analysis.
This dataset can be used across a range of healthcare and voice AI use cases:
PAARI Punjabi TTS Dataset
Dataset Description
This dataset contains TTS-optimized chunks of journalism articles in Punjabi (ਪੰਜਾਬੀ) from the People's Archive of Rural India (PAARI). The articles focus on rural life, agriculture, social issues, and cultural stories from rural India.
Dataset Details
Language: Punjabi (ਪੰਜਾਬੀ)
Script: Gurmukhi
Language Code: pa
Dataset Type: TTS-optimized
Source: Rural India Online
License: Please refer to PAARI's terms of use…
See the full description on the dataset page: https://huggingface.co/datasets/keplersystems/PAARI-Punjabi-TTS.
https://www.futurebeeai.com/policies/ai-data-license-agreement
The Punjabi General Domain Chat Dataset is a high-quality, text-based dataset designed to train and evaluate conversational AI, NLP models, and smart assistants in real-world Punjabi usage. Collected through FutureBeeAI’s trusted crowd community, this dataset reflects natural, native-level Punjabi conversations covering a broad spectrum of everyday topics.
This dataset includes over 10,000 chat transcripts, each featuring free-flowing dialogue between two native Punjabi speakers. The conversations are spontaneous, context-rich, and mimic informal, real-life texting behavior.
Conversations span a wide variety of general-domain topics to ensure comprehensive model exposure:
This diversity ensures the dataset is useful across multiple NLP and language understanding applications.
Chats reflect informal, native-level Punjabi usage with:
Every chat instance is accompanied by structured metadata, which includes:
This metadata supports model filtering, demographic-specific evaluation, and more controlled fine-tuning workflows.
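As a rough illustration of metadata-based filtering, the sketch below selects chat records by topic and speaker age. The record layout and every field name (chat_id, topic, speaker_ages) are hypothetical assumptions made for this example; they are not the dataset's documented schema.

```python
import json

# Hypothetical records: the field names are illustrative assumptions,
# not the dataset's documented schema.
SAMPLE_JSONL = """\
{"chat_id": "c1", "topic": "sports", "speaker_ages": [24, 31]}
{"chat_id": "c2", "topic": "food", "speaker_ages": [45, 52]}
{"chat_id": "c3", "topic": "sports", "speaker_ages": [19, 60]}
"""


def load_records(text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def filter_records(records, topic=None, max_age=None):
    """Keep records matching the topic whose speakers are all <= max_age."""
    kept = []
    for record in records:
        if topic is not None and record["topic"] != topic:
            continue
        if max_age is not None and max(record["speaker_ages"]) > max_age:
            continue
        kept.append(record)
    return kept


records = load_records(SAMPLE_JSONL)
young_sports_chats = filter_records(records, topic="sports", max_age=40)
```

The same pattern extends to any other metadata field once the real schema is known: add one optional keyword argument and one early `continue` per criterion.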
All chat records pass through a rigorous QA process to maintain consistency and accuracy:
This ensures a clean, reliable dataset ready for high-performance AI model training.
This dataset is ideal for training and evaluating a wide range of text-based AI systems:
https://www.futurebeeai.com/policies/ai-data-license-agreement
The Punjabi Telecom Chat Dataset is a comprehensive collection of over 12,000 text-based conversations between telecom customers and call center agents. This dataset captures real-world service interactions and domain-specific language in Punjabi, enabling the development of intelligent conversational AI and NLP systems tailored for the telecommunications sector.
Participant & Chat Overview
This dataset spans a wide range of telecom customer service scenarios:
The conversations reflect real-life telecom interactions in Punjabi, incorporating:
Conversations follow the natural flow of telecom customer service exchanges, including:
PAARI Punjabi Dataset
Dataset Description
This dataset contains journalism articles in Punjabi (ਪੰਜਾਬੀ) from the People's Archive of Rural India (PAARI). The articles focus on rural life, agriculture, social issues, and cultural stories from rural India.
Dataset Details
Language: Punjabi (ਪੰਜਾਬੀ)
Script: Gurmukhi
Language Code: pa
Dataset Type: Standard
Source: Rural India Online
License: Please refer to PAARI's terms of use
Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/keplersystems/PAARI-Punjabi.
https://www.futurebeeai.com/policies/ai-data-license-agreement
The Punjabi Travel Chat Dataset is a comprehensive collection of over 12,000 text-based conversations between customers and call center agents. Focused on real-life travel and tourism interactions, this dataset captures the language, tone, and service dynamics essential for building robust conversational AI, chatbots, and NLP solutions for the travel industry in Punjabi-speaking markets.
The dataset encompasses a wide range of travel and tourism use cases across both customer-initiated and agent-initiated conversations:
This variety ensures wide applicability in both sales enablement and customer support automation.
Conversations are crafted to reflect the everyday language and nuances of Punjabi-speaking travelers:
These linguistic and cultural cues enable the development of context-aware, natural-sounding AI systems.
The dataset captures a variety of interaction types, including:
rupindersingh1313/30_8_2025_dataset
Dataset Description
This dataset contains Punjabi OCR data with page images and their corresponding text annotations, ready for machine learning applications.
Dataset Summary
Language: Punjabi (pa-IN)
Script: Gurmukhi
Total Pages: 769
Source: Generated using Punjabi OCR annotation pipeline
Format: Image-annotation pairs with original JSON annotations
Dataset Splits
Train: 615 samples
Validation: 76 samples
Test:…
See the full description on the dataset page: https://huggingface.co/datasets/rupindersingh1313/30_8_2025_dataset.
Punjabi Transliteration Corpus (PTC)
The Punjabi Transliteration Corpus (PTC) is a comprehensive dataset containing 6.3 million parallel sentences in Gurmukhi and Shahmukhi scripts. This corpus has been meticulously compiled to support the development and evaluation of neural machine transliteration (NMT) models for Punjabi text.
Corpus Details
Total Sentences: 6.3 million
Domains Covered: Various domains including CCaligned, ccmatrix, TED, QED, OPUS, TIco, Wikimedia…
See the full description on the dataset page: https://huggingface.co/datasets/SLPG/Punjabi_Transliteration_Corpus.
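When working with a Gurmukhi/Shahmukhi parallel corpus, a simple sanity check is to confirm that each side of a pair is written in the expected script. The sketch below is one way to do that with Unicode code point ranges: Gurmukhi occupies U+0A00 to U+0A7F, and Shahmukhi is written in the Arabic script, whose basic block is U+0600 to U+06FF. The function names are hypothetical, and the check cannot tell Shahmukhi apart from other Arabic-script languages such as Urdu; it is only a coarse filter, not part of the corpus tooling.

```python
def script_of(text: str) -> str:
    """Classify text as 'gurmukhi', 'shahmukhi', 'mixed', or 'other'."""
    # Gurmukhi Unicode block: U+0A00..U+0A7F.
    has_gurmukhi = any("\u0a00" <= ch <= "\u0a7f" for ch in text)
    # Basic Arabic Unicode block: U+0600..U+06FF (used by Shahmukhi).
    has_arabic = any("\u0600" <= ch <= "\u06ff" for ch in text)
    if has_gurmukhi and has_arabic:
        return "mixed"
    if has_gurmukhi:
        return "gurmukhi"
    if has_arabic:
        return "shahmukhi"
    return "other"


def is_valid_pair(gurmukhi_side: str, shahmukhi_side: str) -> bool:
    """Return True when each side of a pair uses only its expected script."""
    return (
        script_of(gurmukhi_side) == "gurmukhi"
        and script_of(shahmukhi_side) == "shahmukhi"
    )


pair_ok = is_valid_pair("ਪੰਜਾਬੀ", "پنجابی")
```

Pairs that fail the check (swapped columns, mixed-script lines, or Latin-only text) can be logged and reviewed before training a transliteration model.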
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Off-the-shelf Punjabi audio dataset with a total volume of 200 hours, divided into three parts: 60 hours of 8 kHz unscripted, synthetic telephonic call center conversations between an 'agent' and a 'customer'; 100 hours of 8 kHz unscripted, generic telephonic conversations between two people; and 40 hours of 16 kHz public-domain media and podcast audio/video conversations. Topics include Agriculture, Art, Aviation, Banking, Consumer, Crime, Culture, Delivery, Entertainment, Finance, Food, Gaming, Health, Hospitality, IT, Insurance, Legal, News, Oil, Politics, Real Estate, Religion, Retail, Spirituality, Sports, Technology, Telecom, Travel, Weather, Automotive. Audio format: .wav. Transcription format: .json.
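The volume breakdown in this description can be checked with simple arithmetic. The sketch below restates the three subsets as a small Python structure and totals the hours overall and per sample rate. The hours and sample rates come from the description; the dictionary keys and the summarize function are illustrative assumptions, not a published schema.

```python
# Hours and sample rates are taken from the description above;
# the key names are illustrative assumptions.
SUBSETS = [
    {"name": "call center (agent/customer)", "sample_rate_hz": 8000, "hours": 60},
    {"name": "general telephonic conversation", "sample_rate_hz": 8000, "hours": 100},
    {"name": "media and podcasts", "sample_rate_hz": 16000, "hours": 40},
]


def summarize(subsets):
    """Return total hours, hours per sample rate, and percent share per subset."""
    total = sum(s["hours"] for s in subsets)
    by_rate = {}
    for s in subsets:
        by_rate[s["sample_rate_hz"]] = by_rate.get(s["sample_rate_hz"], 0) + s["hours"]
    shares = {s["name"]: round(100 * s["hours"] / total) for s in subsets}
    return total, by_rate, shares


total_hours, hours_by_rate, percent_shares = summarize(SUBSETS)
```

The three parts add up to the stated 200 hours, with 160 hours at 8 kHz and 40 hours at 16 kHz, which matters when a training pipeline needs to resample everything to one rate.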
https://www.futurebeeai.com/policies/ai-data-license-agreement
The Punjabi Delivery & Logistics Chat Dataset is a comprehensive collection of over 12,000 text-based conversations between customers and call center agents. Focused on real-world delivery and logistics interactions, this dataset captures the language, tone, and service patterns essential for developing robust Punjabi-language conversational AI, chatbots, and NLP systems across the delivery ecosystem.
The dataset spans a wide range of delivery and logistics scenarios, ensuring strong coverage across customer service and operational workflows.
This topical spread ensures wide applicability in both customer support automation and logistics optimization use cases.
The conversations reflect the authentic language and interaction style of Punjabi-speaking customers and delivery agents, incorporating:
This linguistic realism enables the development of context-aware and naturally responsive AI systems.
The dataset captures a diverse range of interaction types and delivery workflows:
https://www.futurebeeai.com/policies/ai-data-license-agreement
Introducing the Punjabi Scripted Monologue Speech Dataset for the Healthcare Domain, a voice dataset built to accelerate the development and deployment of Punjabi language automatic speech recognition (ASR) systems, with a sharp focus on real-world healthcare interactions.
This dataset includes over 6,000 high-quality scripted audio prompts recorded in Punjabi, representing typical voice interactions found in the healthcare industry. The data is tailored for use in voice technology systems that power virtual assistants, patient-facing AI tools, and intelligent customer service platforms.
The prompts span a broad range of healthcare-specific interactions, such as:
To maximize authenticity, the prompts integrate linguistic elements and healthcare-specific terms such as:
These elements make the dataset exceptionally suited for training AI systems to understand and respond to natural healthcare-related speech patterns.
Every audio recording is accompanied by a verbatim, manually verified transcription.
https://www.futurebeeai.com/policies/ai-data-license-agreement
The Punjabi TTS Monologue Speech Dataset is a professionally curated resource built to train realistic, expressive, and production-grade text-to-speech (TTS) systems. It contains studio-recorded long-form speech by trained native Punjabi voice artists, each contributing 1 to 2 hours of clean, uninterrupted monologue audio.
Unlike typical prompt-based datasets with short, isolated phrases, this collection features long-form, topic-driven monologues that mirror natural human narration. It includes content types that are directly useful for real-world applications, like audiobook-style storytelling, educational lectures, health advisories, product explainers, digital how-tos, formal announcements, and more.
All recordings are captured in professional studios using high-end equipment and under the guidance of experienced voice directors.
Only clean, production-grade audio makes it into the final dataset.
All voice artists are native Punjabi speakers with professional training or prior experience in narration. The pool is diverse in age, gender, and region, producing a balanced and rich vocal dataset.
Scripts are professionally authored by domain experts to reflect real-world use cases. They avoid generic or repetitive content and include modern vocabulary, emotional range, and phonetically rich sentence structures.
Although recording follows the script, transcripts are updated after recording so that they reflect the final spoken audio. Minor edits account for skipped or rephrased words.
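To illustrate how skipped or rephrased words might be located, the sketch below compares a script against a verbatim transcript at word level using Python's standard difflib module. This is not FutureBeeAI's process, which the source does not describe; the function name and the English sample sentences are hypothetical and chosen only for readability.

```python
import difflib


def transcript_edits(script: str, spoken: str):
    """List word-level differences between a script and what was spoken."""
    script_words = script.split()
    spoken_words = spoken.split()
    matcher = difflib.SequenceMatcher(
        a=script_words, b=spoken_words, autojunk=False
    )
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # 'equal' spans need no transcript update; keep everything else.
        if tag != "equal":
            edits.append((tag, script_words[i1:i2], spoken_words[j1:j2]))
    return edits


# Hypothetical example: one skipped word and one rephrased word.
edits = transcript_edits(
    "please take the medicine twice daily",
    "please take medicine two times daily",
)
```

Here a "delete" entry marks a word the speaker skipped and a "replace" entry marks a rephrasing, which are the two cases the transcript update has to cover.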
https://www.futurebeeai.com/policies/ai-data-license-agreement
This Punjabi Call Center Speech Dataset for the Real Estate industry is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for Punjabi-speaking real estate customers. With over 30 hours of unscripted, real-world audio, this dataset captures authentic conversations between customers and real estate agents, making it well suited to building robust ASR models.
Curated by FutureBeeAI, this dataset equips voice AI developers, real estate tech platforms, and NLP researchers with the data needed to create high-accuracy, production-ready models for property-focused use cases.
The dataset features 30 hours of dual-channel call center recordings between native Punjabi speakers. Captured in realistic real estate consultation and support contexts, these conversations span a wide array of property-related topics, from inquiries to investment advice, offering deep domain coverage for AI model development.
This speech corpus includes both inbound and outbound calls, featuring positive, neutral, and negative outcomes across a wide range of real estate scenarios.
Such domain-rich variety ensures model generalization across common real estate support conversations.
All recordings are accompanied by precise, manually verified transcriptions in JSON format.
These transcriptions streamline ASR and NLP development for Punjabi real estate voice applications.
Detailed metadata accompanies each participant and conversation:
This enables smart filtering, dialect-focused model training, and structured dataset exploration.
This dataset is ideal for voice AI and NLP systems built for the real estate sector:
https://www.futurebeeai.com/policies/ai-data-license-agreement
This Punjabi Call Center Speech Dataset for the Travel industry is purpose-built to power the next generation of voice AI applications for travel booking, customer support, and itinerary assistance. With over 30 hours of unscripted, real-world conversations, the dataset enables the development of highly accurate speech recognition and natural language understanding models tailored for Punjabi-speaking travelers.
Created by FutureBeeAI, this dataset supports researchers, data scientists, and conversational AI teams in building voice technologies for airlines, travel portals, and hospitality platforms.
The dataset includes 30 hours of dual-channel audio recordings between native Punjabi speakers engaged in real travel-related customer service conversations. These audio files reflect a wide variety of topics, accents, and scenarios found across the travel and tourism industry.
Inbound and outbound conversations span a wide range of real-world travel support situations with varied outcomes (positive, neutral, negative).
These scenarios help models understand and respond to diverse traveler needs in real-time.
Each call is accompanied by manually curated, high-accuracy transcriptions in JSON format.
Extensive metadata enriches each call and speaker for better filtering and AI training:
This dataset is ideal for a variety of AI use cases in the travel and tourism space: