32 datasets found

Punjabi Dataset
ceb.shaip.com
Updated Aug 22, 2024
+ more versions
Cite
Shaip (2024). Punjabi Dataset [Dataset]. https://ceb.shaip.com/offerings/speech-data-catalog/punjabi-dataset/
Dataset updated
Aug 22, 2024
Dataset authored and provided by
Shaip
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
Punjabi Dataset (ਪੰਜਾਬੀ ਡਾਟਾਸੈਟ): High-quality Punjabi call-center, general conversation, and podcast datasets for AI and speech models. Categories: Call-Center Data, General Conversation Data, Podcast Data…
Punjabi Agent-Customer Chat Dataset for Healthcare Domain
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Punjabi Agent-Customer Chat Dataset for Healthcare Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/punjabi-healthcare-domain-conversation-text-dataset
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The Punjabi Healthcare Chat Dataset is a rich collection of over 12,000 text-based conversations between customers and call center agents, focused on real-world healthcare interactions. Designed to reflect authentic language use and domain-specific dialogue patterns, this dataset supports the development of conversational AI, chatbots, and NLP models tailored for healthcare applications in Punjabi-speaking regions.
Participant & Chat Overview
•Participants: 200+ native Punjabi speakers from the FutureBeeAI Crowd Community
•Conversation Length: 300–700 words per chat
•Turns per Chat: 50–150 dialogue turns across both participants
•Chat Types: Inbound and outbound
•Sentiment Coverage: Positive, neutral, and negative outcomes included
Topic Diversity
The dataset captures a wide spectrum of healthcare-related chat scenarios, ensuring comprehensive coverage for training robust AI systems:
•Inbound Chats (Customer-Initiated): Appointment scheduling, new patient registration, surgery and treatment consultations, diet and lifestyle discussions, insurance claim inquiries, lab result follow-ups
•Outbound Chats (Agent-Initiated): Appointment reminders and confirmations, health and wellness program offers, test result notifications, preventive care and vaccination reminders, subscription renewals, risk assessment and eligibility follow-ups
This variety helps simulate realistic healthcare support workflows and patient-agent dynamics.
Language Diversity & Realism
This dataset reflects the natural flow of Punjabi healthcare communication and includes:
•Authentic Naming Patterns: Punjabi personal names, clinic names, and brands
•Localized Contact Elements: Addresses, emails, phone numbers, and clinic locations in regional Punjabi formats
•Time & Currency References: Use of dates, times, numeric expressions, and currency units aligned with Punjabi-speaking regions
•Colloquial & Medical Expressions: Local slang, informal speech, and common healthcare-related terminology
These elements ensure the dataset is contextually relevant and linguistically rich for real-world use cases.
Conversational Flow & Structure
Conversations range from simple inquiries to complex advisory sessions, including:
•General inquiries
•Detailed problem-solving
•Routine status updates
•Treatment recommendations
•Support and feedback interactions
Each conversation typically includes these structural components:
•Greetings and verification
•Information gathering
•Problem definition
•Solution delivery
•Closing messages
•Follow-up and feedback (where applicable)
This structured flow mirrors actual healthcare support conversations and is ideal for training advanced dialogue systems.
Data Format & Structure
Available in JSON, CSV, and TXT formats, each conversation includes:
•Full message history with clear speaker labels
•Participant identifiers
•Metadata (e.g., topic tags, region, sentiment)
•Compatibility with common NLP and ML pipelines
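As a rough illustration of the record contents listed above, the following Python sketch parses one chat in JSON form and counts turns per role. The listing does not publish a schema, so every field name here (chat_id, metadata, participants, messages, speaker, text) is an assumption made for the example, and the message text is left as a placeholder.

```python
import json
from collections import Counter

# Hypothetical record layout: the listing only says each conversation has a
# message history with speaker labels, participant identifiers, and metadata.
# All field names and values below are illustrative assumptions.
SAMPLE_RECORD = """
{
  "chat_id": "example-0001",
  "metadata": {"topic": "appointment_scheduling", "region": "Punjab", "sentiment": "positive"},
  "participants": {"P1": "customer", "P2": "agent"},
  "messages": [
    {"speaker": "P1", "text": "..."},
    {"speaker": "P2", "text": "..."},
    {"speaker": "P1", "text": "..."}
  ]
}
"""


def summarize_chat(raw: str) -> dict:
    """Parse one JSON chat record and count dialogue turns per role."""
    record = json.loads(raw)
    roles = record["participants"]
    # Map each message's speaker identifier to its role, then count turns.
    turns = Counter(roles[message["speaker"]] for message in record["messages"])
    return {
        "chat_id": record["chat_id"],
        "topic": record["metadata"]["topic"],
        "sentiment": record["metadata"]["sentiment"],
        "turns_per_role": dict(turns),
        "total_turns": len(record["messages"]),
    }


if __name__ == "__main__":
    print(summarize_chat(SAMPLE_RECORD))
```

A summary like this could be used to check a delivered file against the stated turn range before training, once the real field names are known.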
Applications
Punjabi Dataset
so.shaip.com
Updated Feb 12, 2023
Cite
Shaip (2023). Punjabi Dataset [Dataset]. https://so.shaip.com/offerings/speech-data-catalog/punjabi-dataset/
Dataset updated
Feb 12, 2023
Dataset authored and provided by
Shaip
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
Punjabi Dataset…
Punjabi Call Center Data for Healthcare AI
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Punjabi Call Center Data for Healthcare AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/healthcare-call-center-conversation-punjabi-india
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
This Punjabi Call Center Speech Dataset for the healthcare industry is purpose-built to accelerate the development of Punjabi speech recognition, spoken language understanding, and conversational AI systems. With 30 hours of unscripted, real-world conversations, it delivers the linguistic and contextual depth needed to build high-performance ASR models for medical and wellness-related customer service.
Created by FutureBeeAI, this dataset empowers voice AI teams, NLP researchers, and data scientists to develop domain-specific models for hospitals, clinics, insurance providers, and telemedicine platforms.
Speech Data
The dataset features 30 hours of dual-channel call center conversations between native Punjabi speakers. These recordings cover a variety of healthcare support topics, enabling the development of speech technologies that are contextually aware and linguistically rich.
•Participant Diversity:
•Speakers: 60 verified native Punjabi speakers from our contributor community.
•Regions: Diverse regions across Punjab to ensure broad dialectal representation.
•Participant Profile: Age range of 18–70 with a gender mix of 60% male and 40% female.
•Recording Details:
•Conversation Nature: Naturally flowing, unscripted conversations.
•Call Duration: Each session ranges from 5 to 15 minutes.
•Audio Format: WAV format, stereo, 16-bit depth at 8 kHz and 16 kHz sample rates.
•Recording Environment: Captured in clear conditions without background noise or echo.
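The recording details above (stereo WAV, 16-bit, 8 kHz or 16 kHz) can be checked on a delivered file with Python's standard wave module. This is a minimal sketch: the helper names are made up for the example, and because no real file from the dataset is available here, it validates a silent WAV generated in memory instead.

```python
import io
import wave


def describe_wav(source) -> dict:
    """Read basic format parameters from a WAV path or file-like object."""
    with wave.open(source, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "bit_depth": wav.getsampwidth() * 8,  # sample width is in bytes
            "sample_rate": wav.getframerate(),
            "duration_s": wav.getnframes() / wav.getframerate(),
        }


def matches_listing(info: dict) -> bool:
    """Check the format stated in the listing: stereo, 16-bit, 8 or 16 kHz."""
    return (
        info["channels"] == 2
        and info["bit_depth"] == 16
        and info["sample_rate"] in (8000, 16000)
    )


def make_silent_wav(sample_rate: int = 8000, seconds: int = 1) -> io.BytesIO:
    """Build a silent stereo 16-bit WAV in memory (stand-in for a real call)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(2)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        # 2 channels x 2 bytes per sample for every frame
        wav.writeframes(b"\x00" * (sample_rate * seconds * 2 * 2))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    info = describe_wav(make_silent_wav(16000, 1))
    print(info, matches_listing(info))
```

For a real recording, describe_wav can be called with the file path; the duration field also allows a check against the stated 5 to 15 minute session length.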

Topic Diversity
The dataset spans inbound and outbound calls, capturing a broad range of healthcare-specific interactions and sentiment types (positive, neutral, negative).
•Inbound Calls:
•Appointment Scheduling
•New Patient Registration
•Surgical Consultation
•Dietary Advice and Consultations
•Insurance Coverage Inquiries
•Follow-up Treatment Requests, and more
•Outbound Calls:
•Appointment Reminders
•Preventive Care Campaigns
•Test Results & Lab Reports
•Health Risk Assessment Calls
•Vaccination Updates
•Wellness Subscription Outreach, and more
These real-world interactions help build speech models that understand healthcare domain nuances and user intent.
Transcription
Every audio file is accompanied by high-quality, manually created transcriptions in JSON format.
•Transcription Includes:
•Speaker-identified Dialogues
•Time-coded Segments
•Non-speech Annotations (e.g., silence, cough)
•High transcription accuracy, with a word error rate below 5%, backed by dual-layer QA checks.
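The listing does not say how the word error rate figure was measured. As a hedged illustration, the sketch below uses the common definition, word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words; the function name and whitespace tokenization are assumptions for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("a b c d", "a b x d"))
```

Under this definition, a transcript meets the stated threshold when the returned value is below 0.05 against a trusted reference.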
Metadata
Each conversation and speaker includes detailed metadata to support fine-tuned training and analysis.
•Participant Metadata: ID, gender, age, region, accent, and dialect.
•Conversation Metadata: Topic, sentiment, call type, sample rate, and technical specs.
Usage and Applications
This dataset can be used across a range of healthcare and voice AI use cases:
PAARI-Punjabi-TTS
huggingface.co
Updated Jul 30, 2025
Cite
Kepler Systems (2025). PAARI-Punjabi-TTS [Dataset]. https://huggingface.co/datasets/keplersystems/PAARI-Punjabi-TTS
Dataset updated
Jul 30, 2025
Dataset authored and provided by
Kepler Systems
Description
PAARI Punjabi TTS Dataset

Dataset Description

This dataset contains TTS-optimized chunks of journalism articles in Punjabi (ਪੰਜਾਬੀ) from the People's Archive of Rural India (PAARI). The articles focus on rural life, agriculture, social issues, and cultural stories from rural India.

Dataset Details

Language: Punjabi (ਪੰਜਾਬੀ)
Script: Gurmukhi
Language Code: pa
Dataset Type: TTS-optimized
Source: Rural India Online
License: Please refer to PAARI's terms of use…
See the full description on the dataset page: https://huggingface.co/datasets/keplersystems/PAARI-Punjabi-TTS.
Vadivelu Comedy Punjabi Dataset
sn.shaip.com
Updated Sep 11, 2024
Cite
Shaip (2024). Vadivelu Comedy Punjabi Dataset [Dataset]. https://sn.shaip.com/offerings/speech-data-catalog/punjabi-dataset/
Dataset updated
Sep 11, 2024
Dataset authored and provided by
Shaip
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
Punjabi Dataset (ਪੰਜਾਬੀ ਡਾਟਸੈਟ): High-quality Punjabi call-center, general conversation, and podcast datasets for AI and speech models. Categories: Call-Center Data, General Conversation Data, Podcast Data…
Punjabi Dataset
la.shaip.com
Updated Dec 8, 2024
Cite
Shaip (2024). Punjabi Dataset [Dataset]. https://la.shaip.com/offerings/speech-data-catalog/punjabi-dataset/
Dataset updated
Dec 8, 2024
Dataset authored and provided by
Shaip
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
Punjabi Dataset (ਪੰਜਾਬੀ): High-quality Punjabi call-center, general conversation, and podcast datasets for AI and speech models. Categories: Call-Center Data, General Conversation Data, Podcast Data…
Punjabi Human-Human Chat Dataset for Conversational AI & NLP
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Punjabi Human-Human Chat Dataset for Conversational AI & NLP [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/punjabi-general-domain-conversation-text-dataset
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The Punjabi General Domain Chat Dataset is a high-quality, text-based dataset designed to train and evaluate conversational AI, NLP models, and smart assistants in real-world Punjabi usage. Collected through FutureBeeAI’s trusted crowd community, this dataset reflects natural, native-level Punjabi conversations covering a broad spectrum of everyday topics.
Conversational Text Data
This dataset includes over 10,000 chat transcripts, each featuring free-flowing dialogue between two native Punjabi speakers. The conversations are spontaneous, context-rich, and mimic informal, real-life texting behavior.
•Words per Chat: 300–700
•Turns per Chat: Up to 50 dialogue turns
•Contributors: 150 native Punjabi speakers from the FutureBeeAI Crowd Community
•Format: TXT, DOCS, JSON or CSV (customizable)
•Structure: Each record contains the full chat, topic tag, and metadata block
Diversity and Domain Coverage
Conversations span a wide variety of general-domain topics to ensure comprehensive model exposure:
•Music, books, and movies
•Health and wellness
•Children and parenting
•Family life and relationships
•Food and cooking
•Education and studying
•Festivals and traditions
•Environment and daily life
•Internet and tech usage
•Childhood memories and casual chatting
This diversity ensures the dataset is useful across multiple NLP and language understanding applications.
Linguistic Authenticity
Chats reflect informal, native-level Punjabi usage with:
•Colloquial expressions and local dialect influence
•Domain-relevant terminology
•Language-specific grammar, phrasing, and sentence flow
•Inclusion of realistic details such as names, phone numbers, email addresses, locations, dates, times, local currencies, and culturally grounded references
•Representation of different writing styles and input quirks to ensure training data realism
Metadata
Every chat instance is accompanied by structured metadata, which includes:
•Participant Age
•Gender
•Country/Region
•Chat Domain
•Chat Topic
•Dialect
This metadata supports model filtering, demographic-specific evaluation, and more controlled fine-tuning workflows.
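To make the filtering use case concrete, here is a small Python sketch that selects chats by metadata. The six fields mirror the list above, but the attribute names, the filter_chats helper, and all sample values are hypothetical; the listing does not document the actual metadata encoding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChatMetadata:
    """Hypothetical container for the six metadata fields listed above."""
    age: int
    gender: str
    region: str
    domain: str
    topic: str
    dialect: str


def filter_chats(records, *, min_age=0, max_age=200, **exact):
    """Keep records inside the age range whose fields equal the given values."""
    selected = []
    for record in records:
        if not (min_age <= record.age <= max_age):
            continue
        if all(getattr(record, field) == value for field, value in exact.items()):
            selected.append(record)
    return selected


# Illustrative values only; they are not taken from the dataset.
SAMPLE_CHATS = [
    ChatMetadata(24, "female", "India", "general", "food and cooking", "Majhi"),
    ChatMetadata(41, "male", "India", "general", "festivals and traditions", "Malwai"),
    ChatMetadata(67, "female", "India", "general", "health and wellness", "Majhi"),
]

if __name__ == "__main__":
    print(filter_chats(SAMPLE_CHATS, dialect="Majhi"))
    print(filter_chats(SAMPLE_CHATS, min_age=30, gender="female"))
```

The same pattern could back a demographic-specific evaluation split, for example by holding out one dialect or age band.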
Data Quality Assurance
All chat records pass through a rigorous QA process to maintain consistency and accuracy:
•Manual review for content completeness
•Format checks for chat turns and metadata
•Linguistic verification by native speakers
•Removal of inappropriate or unusable samples
This ensures a clean, reliable dataset ready for high-performance AI model training.
Applications
This dataset is ideal for training and evaluating a wide range of text-based AI systems:
•Conversational AI / Chatbots
•Smart assistants and voicebots
Punjabi Agent-Customer Chat Dataset for Telecom
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Punjabi Agent-Customer Chat Dataset for Telecom [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/punjabi-telecom-domain-conversation-text-dataset
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The Punjabi Telecom Chat Dataset is a comprehensive collection of over 12,000 text-based conversations between telecom customers and call center agents. This dataset captures real-world service interactions and domain-specific language in Punjabi, enabling the development of intelligent conversational AI and NLP systems tailored for the telecommunications sector.
Participant & Chat Overview
•Participants: 200+ native Punjabi speakers from the FutureBeeAI Crowd Community
•Conversation Length: 300–700 words per chat
•Turns per Chat: 50–150 dialogue turns across both participants
•Chat Types: Inbound and outbound
•Sentiment Coverage: A mix of positive, neutral, and negative interactions
Topic Diversity
This dataset spans a wide range of telecom customer service scenarios:
•Inbound Chats (Customer-Initiated)
•Phone number porting
•Network connectivity issues
•Billing inquiries and adjustments
•Technical support requests
•Service activations and upgrades
•International roaming inquiries
•Refunds and complaint resolution
•Emergency service access
•Outbound Chats (Agent-Initiated)
•Welcome and onboarding calls
•Payment reminders and due alerts
•Customer satisfaction surveys
•Technical issue follow-ups
•Usage reviews and service feedback
•Promotions and service offers
Language Nuance & Realism
The conversations reflect real-life telecom interactions in Punjabi, incorporating:
•Naming Patterns: Realistic Punjabi personal, business, and telecom brand names
•Localized Content: Phone numbers, email addresses, and locations consistent with regional norms
•Time & Number Formats: Punjabi representations of dates, times, currencies, and service numbers
•Informal Language & Slang: Common Punjabi expressions, idioms, and conversational shortcuts found in telecom discussions
Conversational Flow & Structure
Conversations follow the natural flow of telecom customer service exchanges, including:
•Dialogue Types:
•Simple service inquiries
•Detailed problem-solving discussions
•Plan explanations and upgrades
•Feedback collection and status updates
•Interaction Stages:
•Initial greetings and verification
•Data or issue collection
•Clarification and troubleshooting
PAARI-Punjabi
huggingface.co
Updated Aug 2, 2025
Cite
Kepler Systems (2025). PAARI-Punjabi [Dataset]. https://huggingface.co/datasets/keplersystems/PAARI-Punjabi
Dataset updated
Aug 2, 2025
Dataset authored and provided by
Kepler Systems
Description
PAARI Punjabi Dataset

Dataset Description

This dataset contains journalism articles in Punjabi (ਪੰਜਾਬੀ) from the People's Archive of Rural India (PAARI). The articles focus on rural life, agriculture, social issues, and cultural stories from rural India.

Dataset Details

Language: Punjabi (ਪੰਜਾਬੀ)
Script: Gurmukhi
Language Code: pa
Dataset Type: Standard
Source: Rural India Online
License: Please refer to PAARI's terms of use

Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/keplersystems/PAARI-Punjabi.
Punjabi Agent-Customer Chat Dataset for Travel
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Punjabi Agent-Customer Chat Dataset for Travel [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/punjabi-travel-domain-conversation-text-dataset
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The Punjabi Travel Chat Dataset is a comprehensive collection of over 12,000 text-based conversations between customers and call center agents. Focused on real-life travel and tourism interactions, this dataset captures the language, tone, and service dynamics essential for building robust conversational AI, chatbots, and NLP solutions for the travel industry in Punjabi-speaking markets.
Participant & Chat Overview
•Participants: 200+ native Punjabi speakers from the FutureBeeAI Crowd Community
•Conversation Length: 300–700 words per chat
•Turns per Chat: 50–150 dialogue turns across both participants
•Chat Types: Inbound and outbound
•Sentiment Coverage: Includes positive, neutral, and negative interaction outcomes
Topic Diversity
The dataset encompasses a wide range of travel and tourism use cases across both customer-initiated and agent-initiated conversations:
•Inbound Chats (Customer-Initiated)
•Booking assistance and travel planning
•Destination information and recommendations
•Flight delays or cancellations
•Lost or delayed baggage support
•Assistance for travelers with disabilities
•Health and safety travel inquiries
•Outbound Chats (Agent-Initiated)
•Promotional offers and travel package deals
•Booking confirmations and schedule updates
•Flight change notifications
•Customer satisfaction surveys
•Visa expiration and renewal reminders
•Loyalty and feedback collection campaigns
This variety ensures wide applicability in both sales enablement and customer support automation.
Language Diversity & Realism
Conversations are crafted to reflect the everyday language and nuances of Punjabi-speaking travelers:
•Naming Patterns: Punjabi personal names, airline and hotel names, tour operators
•Localized Details: Regional email formats, phone numbers, locations, and cultural references
•Time and Currency Expressions: Dates, local times, and prices represented in Punjabi forms
•Slang and Informal Speech: Common phrases and idioms used in travel planning and customer support
These linguistic and cultural cues enable the development of context-aware, natural-sounding AI systems.
Conversational Structure & Flow
The dataset captures a variety of interaction types, including:
•Dialogue Types:
•Quick inquiries and confirmations
•Complex issue resolution
•Advisory and planning sessions
•Travel disruption and recovery support
•Common Flow Elements:
•Greetings and authentication
•Information request and validation
•Problem or request resolution
30_8_2025_dataset
huggingface.co
Updated Aug 30, 2025
Cite
Rupinder Singh (2025). 30_8_2025_dataset [Dataset]. https://huggingface.co/datasets/rupindersingh1313/30_8_2025_dataset
Dataset updated
Aug 30, 2025
Authors
Rupinder Singh
Description
rupindersingh1313/30_8_2025_dataset

Dataset Description

This dataset contains Punjabi OCR data with page images and their corresponding text annotations, ready for machine learning applications.

Dataset Summary

Language: Punjabi (pa-IN)
Script: Gurmukhi
Total Pages: 769
Source: Generated using Punjabi OCR annotation pipeline
Format: Image-annotation pairs with original JSON annotations

Dataset Splits

Train: 615 samples
Validation: 76 samples
Test:…
See the full description on the dataset page: https://huggingface.co/datasets/rupindersingh1313/30_8_2025_dataset.
Punjabi_Transliteration_Corpus
huggingface.co
Updated Jul 21, 2024
Cite
Speech and Language Processing Group (2024). Punjabi_Transliteration_Corpus [Dataset]. https://huggingface.co/datasets/SLPG/Punjabi_Transliteration_Corpus
Dataset updated
Jul 21, 2024
Authors
Speech and Language Processing Group
Description
Punjabi Transliteration Corpus (PTC)

The Punjabi Transliteration Corpus (PTC) is a comprehensive dataset containing 6.3 million parallel sentences in Gurmukhi and Shahmukhi scripts. This corpus has been meticulously compiled to support the development and evaluation of neural machine transliteration (NMT) models for Punjabi text.

Corpus Details

Total Sentences: 6.3 million
Domains Covered: Various domains including CCaligned, ccmatrix, TED, QED, OPUS, TIco, Wikimedia…
See the full description on the dataset page: https://huggingface.co/datasets/SLPG/Punjabi_Transliteration_Corpus.
Punjabi Off-the-Shelf Datasets
sw.shaip.com
bn.shaip.com
+3 more
json
Updated Jan 31, 2022
Cite
Shaip (2022). Punjabi Off-the-Shelf Datasets [Dataset]. https://sw.shaip.com/offerings/speech-data-catalog/
Available download formats: json
Dataset updated
Jan 31, 2022
Dataset authored and provided by
Shaip
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
Off-the-shelf Punjabi audio dataset with a total volume of 200 hours, split into: 8 kHz unscripted, synthetic telephonic call-center conversations between 'agent' and 'customer' (60 hours); 8 kHz unscripted telephonic generic conversations between two people (100 hours); and 16 kHz public-domain media and podcast audio/video conversations (40 hours). Topics include Agriculture, Art, Aviation, Banking, Consumer, Crime, Culture, Delivery, Entertainment, Finance, Food, Gaming, Health, Hospitality, IT, Insurance, Legal, News, Oil, Politics, Real Estate, Religion, Retail, Spirituality, Sports, Technology, Telecom, Travel, Weather, Automotive. Audio format: .wav. Transcription format: .json.
Punjabi Agent-Customer Chat Dataset for Delivery & Logistics
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Punjabi Agent-Customer Chat Dataset for Delivery & Logistics [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/punjabi-delivery-domain-conversation-text-dataset
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The Punjabi Delivery & Logistics Chat Dataset is a comprehensive collection of over 12,000 text-based conversations between customers and call center agents. Focused on real-world delivery and logistics interactions, this dataset captures the language, tone, and service patterns essential for developing robust Punjabi-language conversational AI, chatbots, and NLP systems across the delivery ecosystem.
Participant & Chat Overview
•Participants: 200+ native Punjabi speakers from the FutureBeeAI Crowd Community
•Conversation Length: 300–700 words per chat
•Turns per Chat: 50–150 dialogue turns between customer and agent
•Chat Types: Inbound (customer-initiated) and outbound (agent-initiated)
•Sentiment Coverage: Includes positive, neutral, and negative interaction outcomes
Topic Diversity
The dataset spans a wide range of delivery and logistics scenarios, ensuring strong coverage across customer service and operational workflows.
•Inbound Chats (Customer-Initiated)
•Order tracking and delivery status inquiries
•Complaints about late or missing deliveries
•Undeliverable or incorrect address resolution
•Return process and pickup scheduling
•Order modifications and change requests
•Enquiries about delivery method options
•Outbound Chats (Agent-Initiated)
•Delivery confirmations and dispatch updates
•Subscription renewal or delivery reminders
•Notification of delivery issues or missed attempts
•Out-of-stock or product unavailability alerts
•Satisfaction surveys and service feedback collection
•Address verification for upcoming deliveries
This topical spread ensures wide applicability in both customer support automation and logistics optimization use cases.
Language Diversity & Realism
The conversations reflect the authentic language and interaction style of Punjabi-speaking customers and delivery agents, incorporating:
•Naming Patterns: Personal names, business names, and logistics company references
•Localized Details: Punjabi-format emails, phone numbers, regional addresses, and delivery zones
•Temporal and Numeric Expressions: Dates, delivery windows, prices, and tracking IDs in Punjabi formats
•Slang and Informal Speech: Everyday expressions and delivery-specific idioms used across Punjabi dialects
This linguistic realism enables the development of context-aware and naturally responsive AI systems.
Conversational Structure & Flow
The dataset captures a diverse range of interaction types and delivery workflows:
•Dialogue Types:
•Quick status checks and confirmations
•Multi-turn issue resolution
•Process walkthroughs and guidance
•Feedback and escalation handling
•Common Flow Elements:
•Greetings and caller verification
•Request or complaint initiation
Set Data Punjabi
ms.shaip.com
Updated Aug 15, 2024
+ more versions
Cite
Shaip (2024). Set Data Punjabi [Dataset]. https://ms.shaip.com/offerings/speech-data-catalog/punjabi-dataset/
Dataset updated
Aug 15, 2024
Dataset authored and provided by
Shaip
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
Punjabi Dataset (ਪੰਜਾਬੀ ਡਾਟਾਸੈਟ): High-quality Punjabi call-center, general conversation, and podcast datasets for AI and speech models. Categories: Call-Center Data, General Conversation Data, Podcast Data…
Punjabi Scripted Monologue Speech Data for Healthcare
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Punjabi Scripted Monologue Speech Data for Healthcare [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/healthcare-scripted-speech-monologues-punjabi-india
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
Introducing the Punjabi Scripted Monologue Speech Dataset for the Healthcare Domain, a voice dataset built to accelerate the development and deployment of Punjabi language automatic speech recognition (ASR) systems, with a sharp focus on real-world healthcare interactions.
Speech Data
This dataset includes over 6,000 high-quality scripted audio prompts recorded in Punjabi, representing typical voice interactions found in the healthcare industry. The data is tailored for use in voice technology systems that power virtual assistants, patient-facing AI tools, and intelligent customer service platforms.
•Participant Diversity
•Speakers: 60 native Punjabi speakers.
•Regional Balance: Participants are sourced from multiple regions across Punjab, reflecting diverse dialects and linguistic traits.
•Demographics: Includes a mix of male and female participants (60:40 ratio), aged between 18 and 70 years.
•Recording Specifications
•Nature of Recordings: Scripted monologues based on healthcare-related use cases.
•Duration: Each clip ranges from 5 to 30 seconds, offering short, context-rich speech samples.
•Audio Format: WAV files recorded in mono, with 16-bit depth and sample rates of 8 kHz and 16 kHz.
•Environment: Clean and echo-free spaces ensure clear and noise-free audio capture.
Topic Coverage
The prompts span a broad range of healthcare-specific interactions, such as:
•Patient check-in and follow-up communication
•Appointment booking and cancellation dialogues
•Insurance and regulatory support queries
•Medication, test results, and consultation discussions
•General health tips and wellness advice
•Emergency and urgent care communication
•Technical support for patient portals and apps
•Domain-specific scripted statements and FAQs
Contextual Depth
To maximize authenticity, the prompts integrate linguistic elements and healthcare-specific terms such as:
•Names: Gender- and region-appropriate Punjabi names
•Addresses: Varied local address formats spoken naturally
•Dates & Times: References to appointment dates, times, follow-ups, and schedules
•Medical Terminology: Common medical procedures, symptoms, and treatment references
•Numbers & Measurements: Health data like dosages, vitals, and test result values
•Healthcare Institutions: Names of clinics, hospitals, and diagnostic centers
These elements make the dataset exceptionally suited for training AI systems to understand and respond to natural healthcare-related speech patterns.
Transcription
Every audio recording is accompanied by a verbatim, manually verified transcription.
•Content: The transcription mirrors the exact scripted prompt recorded by the speaker.
•Format: Files are delivered in plain text (.TXT) format with consistent naming conventions for seamless integration.
Punjabi TTS Speech Dataset for Speech Synthesis
futurebeeai.com
wav
Updated Aug 1, 2022
FutureBee AI (2022). Punjabi TTS Speech Dataset for Speech Synthesis [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/tts-monolgue-punjabi-india
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
The Punjabi TTS Monologue Speech Dataset is a professionally curated resource built to train realistic, expressive, and production-grade text-to-speech (TTS) systems. It contains studio-recorded long-form speech by trained native Punjabi voice artists, each contributing 1 to 2 hours of clean, uninterrupted monologue audio.
Unlike typical prompt-based datasets with short, isolated phrases, this collection features long-form, topic-driven monologues that mirror natural human narration. It includes content types that are directly useful for real-world applications, like audiobook-style storytelling, educational lectures, health advisories, product explainers, digital how-tos, formal announcements, and more.
All recordings are captured in professional studios using high-end equipment and under the guidance of experienced voice directors.
Recording & Audio Quality
•
Audio Format: WAV, 48 kHz, available in 16-bit, 24-bit, and 32-bit depth

•
SNR: Minimum 30 dB

•
Channel: Mono

•
Recording Duration: 20-30 minutes

•
Recording Environment: Studio-controlled, acoustically treated rooms

•
Per Speaker Volume: 1–2 hours of speech per artist

•
Quality Control: Each file is reviewed and cleaned for common acoustic issues, including: reverberation, lip smacks, mouth clicks, thumping, hissing, plosives, sibilance, background noise, static interference, clipping, and other artifacts.

Only clean, production-grade audio makes it into the final dataset.
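The listing does not say how the 30 dB SNR floor is measured. One common approximation, sketched below, compares the mean power of a speech segment with the mean power of a noise-only segment (for example, leading silence) and reports the ratio in decibels: SNR = 10 * log10(P_speech / P_noise). The function names and the choice of segments are assumptions for illustration.

```python
import math


def mean_power(samples):
    """Mean squared amplitude of a sequence of PCM samples."""
    return sum(s * s for s in samples) / len(samples)


def snr_db(speech_samples, noise_samples):
    """Approximate SNR in dB from a speech segment and a noise-only segment."""
    noise_power = mean_power(noise_samples)
    if noise_power == 0:
        return math.inf  # digitally silent noise floor
    return 10.0 * math.log10(mean_power(speech_samples) / noise_power)


def meets_snr_floor(speech_samples, noise_samples, floor_db=30.0):
    """True if the estimate reaches the 30 dB minimum quoted in the listing."""
    return snr_db(speech_samples, noise_samples) >= floor_db
```

Because the speech segment also contains background noise, this estimate slightly overstates the true ratio; at 30 dB and above the difference is negligible.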
Voice Artist Selection
All voice artists are native Punjabi speakers with professional training or prior experience in narration. We ensure a diverse pool in terms of age, gender, and region to bring a balanced and rich vocal dataset.
•Artist Profile:
•Gender: Male and Female
•Age Range: 20–60 years
•Regions: Punjabi-speaking regions across Punjab
•
Selection Process: All artists are screened, onboarded, and sample-approved using FutureBeeAI’s proprietary Yugo platform.

Script Quality & Coverage
Scripts are professionally authored by domain experts to reflect real-world use cases. They avoid generic or repetitive content and include modern vocabulary, emotional range, and phonetically rich sentence structures.
•
Word Count per Script: 3,000–5,000 words per 30-minute session

•Content Types:
•Storytelling
•Script and book reading
•Informational explainers
•Government service instructions
•E-commerce tutorials
•Motivational content
•Health & wellness guides
•Education & career advice
•
Linguistic Design: Balanced punctuation, emotional range, modern syntax, and vocabulary diversity
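The word-count figure above implies a speaking rate. As a quick arithmetic check, dividing the quoted 3,000 to 5,000 words by the 30-minute session length gives the range in words per minute; the helper name below is illustrative.

```python
def words_per_minute(word_count, session_minutes):
    """Average speaking rate for a script read over one session."""
    return word_count / session_minutes


# Figures from the listing: 3,000-5,000 words per 30-minute session.
slowest = words_per_minute(3000, 30)
fastest = words_per_minute(5000, 30)
print(f"{slowest:.0f} to {fastest:.0f} words per minute")
```

This works out to roughly 100 to 167 words per minute.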

Transcripts & Alignment
While the script is used during the recording, we also provide post-recording updates to ensure the transcript reflects the final spoken audio. Minor edits are made to adjust for skipped or rephrased words.
•
Segmentation: Time-stamped at the sentence level, aligned to actual spoken delivery

•
Format: Available in plain text and JSON

•Post-processing:
•Corrected for
Punjabi Call Center Data for Realestate AI
futurebeeai.com
wav
Updated Aug 1, 2022
FutureBee AI (2022). Punjabi Call Center Data for Realestate AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/realestate-call-center-conversation-punjabi-india
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
This Punjabi Call Center Speech Dataset for the Real Estate industry is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for Punjabi-speaking real estate customers. With over 30 hours of unscripted, real-world audio, this dataset captures authentic conversations between customers and real estate agents, making it well suited to building robust ASR models.
Curated by FutureBeeAI, this dataset equips voice AI developers, real estate tech platforms, and NLP researchers with the data needed to create high-accuracy, production-ready models for property-focused use cases.
Speech Data
The dataset features 30 hours of dual-channel call center recordings between native Punjabi speakers. Captured in realistic real estate consultation and support contexts, these conversations span a wide array of property-related topics, from inquiries to investment advice, offering deep domain coverage for AI model development.
•Participant Diversity:
•
Speakers: 60 native Punjabi speakers from our verified contributor community.

•
Regions: Representing different regions across Punjab to ensure accent and dialect variation.

•
Participant Profile: Gender mix of 60% male and 40% female, with ages ranging from 18 to 70.

•Recording Details:
•
Conversation Nature: Naturally flowing, unscripted agent-customer discussions.

•
Call Duration: Average 5–15 minutes per call.

•
Audio Format: Stereo WAV, 16-bit, recorded at 8kHz and 16kHz.

•
Recording Environment: Captured in noise-free and echo-free conditions.
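Dual-channel stereo recordings are often separated into one mono file per channel before ASR training. The sketch below does this for 16-bit stereo WAV files using only the standard `wave` module. The listing does not say which channel carries the agent and which the customer, so the neutral names `left` and `right` are used; the function name is illustrative.

```python
import wave


def split_stereo(stereo_path, left_path, right_path):
    """Write each channel of a 16-bit stereo WAV file to its own mono WAV file."""
    with wave.open(stereo_path, "rb") as src:
        if src.getnchannels() != 2 or src.getsampwidth() != 2:
            raise ValueError("expected 16-bit stereo input")
        rate = src.getframerate()
        frames = src.readframes(src.getnframes())

    # Each stereo frame is 4 bytes: 2 for the left sample, then 2 for the right.
    half = len(frames) // 2
    left, right = bytearray(half), bytearray(half)
    left[0::2], left[1::2] = frames[0::4], frames[1::4]
    right[0::2], right[1::2] = frames[2::4], frames[3::4]

    for path, data in ((left_path, left), (right_path, right)):
        with wave.open(path, "wb") as out:
            out.setnchannels(1)
            out.setsampwidth(2)
            out.setframerate(rate)  # keeps the original 8 kHz or 16 kHz rate
            out.writeframes(bytes(data))
```

Copying bytes rather than decoding samples keeps the routine independent of the host's byte order.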

Topic Diversity
This speech corpus includes both inbound and outbound calls, featuring positive, neutral, and negative outcomes across a wide range of real estate scenarios.
•Inbound Calls:
•Property Inquiries
•Rental Availability
•Renovation Consultation
•Property Features & Amenities
•Investment Property Evaluation
•Ownership History & Legal Info, and more
•Outbound Calls:
•New Listing Notifications
•Post-Purchase Follow-ups
•Property Recommendations
•Value Updates
•Customer Satisfaction Surveys, and others
Such domain-rich variety ensures model generalization across common real estate support conversations.
Transcription
All recordings are accompanied by precise, manually verified transcriptions in JSON format.
•Transcription Includes:
•Speaker-Segmented Dialogues
•Time-coded Segments
•Non-speech Tags (e.g., background noise, pauses)
•High transcription accuracy with word error rate below 5% via dual-layer human review.
These transcriptions streamline ASR and NLP development for Punjabi real estate voice applications.
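Word error rate is conventionally defined as the number of word substitutions, deletions, and insertions needed to turn a hypothesis into the reference, divided by the number of reference words. The listing does not describe how its sub-5% figure was computed, so the sketch below shows only the standard definition, using a word-level edit distance; the function name is illustrative.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Row-by-row Levenshtein distance over words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref)
```

A transcript meets the quoted threshold when `word_error_rate(reference, hypothesis) < 0.05`.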
Metadata
Detailed metadata accompanies each participant and conversation:
•
Participant Metadata: ID, age, gender, location, accent, and dialect.

•
Conversation Metadata: Topic, call type, sentiment, sample rate, and technical details.

This enables smart filtering, dialect-focused model training, and structured dataset exploration.
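As an illustration of the filtering described above, the sketch below selects calls by call type, dialect, and sample rate. The record layout, key names, and sample values are hypothetical: the listing names the metadata fields but does not publish a schema.

```python
# Hypothetical records; key names and values are illustrative only.
calls = [
    {"call_id": "c001", "topic": "Property Inquiries", "call_type": "inbound",
     "sentiment": "positive", "sample_rate": 8000,
     "participant": {"id": "s01", "age": 34, "gender": "female", "dialect": "Majhi"}},
    {"call_id": "c002", "topic": "Value Updates", "call_type": "outbound",
     "sentiment": "neutral", "sample_rate": 16000,
     "participant": {"id": "s02", "age": 52, "gender": "male", "dialect": "Malwai"}},
    {"call_id": "c003", "topic": "Rental Availability", "call_type": "inbound",
     "sentiment": "negative", "sample_rate": 16000,
     "participant": {"id": "s03", "age": 27, "gender": "male", "dialect": "Majhi"}},
]


def select_calls(records, call_type=None, dialect=None, sample_rate=None):
    """Return records matching every criterion that is not None."""
    selected = []
    for record in records:
        if call_type is not None and record["call_type"] != call_type:
            continue
        if dialect is not None and record["participant"]["dialect"] != dialect:
            continue
        if sample_rate is not None and record["sample_rate"] != sample_rate:
            continue
        selected.append(record)
    return selected


majhi_inbound = select_calls(calls, call_type="inbound", dialect="Majhi")
```

With the sample records above, `majhi_inbound` contains calls `c001` and `c003`.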
Usage and Applications
This dataset is ideal for voice AI and NLP systems built for the real estate sector:
Punjabi Call Center Data for Travel AI
futurebeeai.com
wav
Updated Aug 1, 2022
FutureBee AI (2022). Punjabi Call Center Data for Travel AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/travel-call-center-conversation-punjabi-india
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
This Punjabi Call Center Speech Dataset for the Travel industry is purpose-built to power the next generation of voice AI applications for travel booking, customer support, and itinerary assistance. With over 30 hours of unscripted, real-world conversations, the dataset enables the development of highly accurate speech recognition and natural language understanding models tailored for Punjabi-speaking travelers.
Created by FutureBeeAI, this dataset supports researchers, data scientists, and conversational AI teams in building voice technologies for airlines, travel portals, and hospitality platforms.
Speech Data
The dataset includes 30 hours of dual-channel audio recordings between native Punjabi speakers engaged in real travel-related customer service conversations. These audio files reflect a wide variety of topics, accents, and scenarios found across the travel and tourism industry.
•Participant Diversity:
•
Speakers: 60 native Punjabi contributors from our verified pool.

•
Regions: Covering multiple Punjab regions to capture accent and dialectal variation.

•
Participant Profile: Ages 18 to 70, with a gender mix of 60% male and 40% female.

•Recording Details:
•
Conversation Nature: Naturally flowing, spontaneous customer-agent calls.

•
Call Duration: Between 5 and 15 minutes per session.

•
Audio Format: Stereo WAV, 16-bit depth, at 8kHz and 16kHz.

•
Recording Environment: Captured in controlled, noise-free, echo-free settings.

Topic Diversity
Inbound and outbound conversations span a wide range of real-world travel support situations with varied outcomes (positive, neutral, negative).
•Inbound Calls:
•Booking Assistance
•Destination Information
•Flight Delays or Cancellations
•Support for Disabled Passengers
•Health and Safety Travel Inquiries
•Lost or Delayed Luggage, and more
•Outbound Calls:
•Promotional Travel Offers
•Customer Feedback Surveys
•Booking Confirmations
•Flight Rescheduling Alerts
•Visa Expiry Notifications, and others
These scenarios help models understand and respond to diverse traveler needs in real-time.
Transcription
Each call is accompanied by manually curated, high-accuracy transcriptions in JSON format.
•Transcription Includes:
•Speaker-Segmented Dialogues
•Time-Stamped Segments
•Non-speech Markers (e.g., pauses, coughs)
•High transcription accuracy: a dual-layered review keeps the word error rate under 5%.
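To show how speaker-segmented, time-stamped JSON transcripts of this kind are typically consumed, the sketch below totals talk time per speaker. The JSON layout (`segments`, `speaker`, `start`, `end`, `text`) and the bracketed non-speech marker are assumptions made for illustration; the listing does not publish the actual schema.

```python
import json
from collections import defaultdict

# Hypothetical transcript; key names and the "[cough]" marker are illustrative.
SAMPLE_TRANSCRIPT = """
{
  "segments": [
    {"speaker": "agent",    "start": 0.0, "end": 2.5, "text": "..."},
    {"speaker": "customer", "start": 2.5, "end": 4.0, "text": "..."},
    {"speaker": null,       "start": 4.0, "end": 4.5, "text": "[cough]"},
    {"speaker": "agent",    "start": 4.5, "end": 7.0, "text": "..."}
  ]
}
"""


def talk_time_by_speaker(transcript_json):
    """Sum segment durations in seconds per speaker, skipping non-speech segments."""
    totals = defaultdict(float)
    for segment in json.loads(transcript_json)["segments"]:
        if segment["speaker"] is None:  # non-speech marker, no speaker attached
            continue
        totals[segment["speaker"]] += segment["end"] - segment["start"]
    return dict(totals)


totals = talk_time_by_speaker(SAMPLE_TRANSCRIPT)
```

The same loop can be adapted to cut audio at segment boundaries or to drop non-speech spans before training.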
Metadata
Extensive metadata enriches each call and speaker for better filtering and AI training:
•
Participant Metadata: ID, age, gender, region, accent, and dialect.

•
Conversation Metadata: Topic, domain, call type, sentiment, and audio specs.

Usage and Applications
This dataset is ideal for a variety of AI use cases in the travel and tourism space:
•
ASR Systems: Train Punjabi speech-to-text engines for travel platforms.
