32 datasets found
  1. s

    Punjabi Dataset

    • ceb.shaip.com
    Updated Aug 22, 2024
    + more versions
    Cite
    Shaip (2024). Punjabi Dataset [Dataset]. https://ceb.shaip.com/offerings/speech-data-catalog/punjabi-dataset/
    Explore at:
    Dataset updated
    Aug 22, 2024
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Punjabi Dataset (ਪੰਜਾਬੀ ਡਾਟਾਸੈਟ): High-quality Punjabi call-center, general conversation, and podcast datasets for AI and speech models.

  2. F

    Punjabi Agent-Customer Chat Dataset for Healthcare Domain

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Punjabi Agent-Customer Chat Dataset for Healthcare Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/punjabi-healthcare-domain-conversation-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The Punjabi Healthcare Chat Dataset is a rich collection of over 12,000 text-based conversations between customers and call center agents, focused on real-world healthcare interactions. Designed to reflect authentic language use and domain-specific dialogue patterns, this dataset supports the development of conversational AI, chatbots, and NLP models tailored for healthcare applications in Punjabi-speaking regions.

    Participant & Chat Overview

    Participants: 200+ native Punjabi speakers from the FutureBeeAI Crowd Community
    Conversation Length: 300–700 words per chat
    Turns per Chat: 50–150 dialogue turns across both participants
    Chat Types: Inbound and outbound
    Sentiment Coverage: Positive, neutral, and negative outcomes included

    Topic Diversity

    The dataset captures a wide spectrum of healthcare-related chat scenarios, ensuring comprehensive coverage for training robust AI systems:

    Inbound Chats (Customer-Initiated): Appointment scheduling, new patient registration, surgery and treatment consultations, diet and lifestyle discussions, insurance claim inquiries, lab result follow-ups
    Outbound Chats (Agent-Initiated): Appointment reminders and confirmations, health and wellness program offers, test result notifications, preventive care and vaccination reminders, subscription renewals, risk assessment and eligibility follow-ups

    This variety helps simulate realistic healthcare support workflows and patient-agent dynamics.

    Language Diversity & Realism

    This dataset reflects the natural flow of Punjabi healthcare communication and includes:

    Authentic Naming Patterns: Punjabi personal names, clinic names, and brands
    Localized Contact Elements: Addresses, emails, phone numbers, and clinic locations in regional Punjabi formats
    Time & Currency References: Use of dates, times, numeric expressions, and currency units aligned with Punjabi-speaking regions
    Colloquial & Medical Expressions: Local slang, informal speech, and common healthcare-related terminology

    These elements ensure the dataset is contextually relevant and linguistically rich for real-world use cases.

    Conversational Flow & Structure

    Conversations range from simple inquiries to complex advisory sessions, including:

    General inquiries
    Detailed problem-solving
    Routine status updates
    Treatment recommendations
    Support and feedback interactions

    Each conversation typically includes these structural components:

    Greetings and verification
    Information gathering
    Problem definition
    Solution delivery
    Closing messages
    Follow-up and feedback (where applicable)

    This structured flow mirrors actual healthcare support conversations and is ideal for training advanced dialogue systems.

    Data Format & Structure

    Available in JSON, CSV, and TXT formats, each conversation includes:

    Full message history with clear speaker labels
    Participant identifiers
    Metadata (e.g., topic tags, region, sentiment)
    Compatibility with common NLP and ML pipelines
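
    As a rough illustration of the fields listed above, one conversation record in the JSON export might be read as follows. This is a minimal sketch: every field name (`conversation_id`, `participants`, `metadata`, `messages`, `speaker`, `text`) is an assumption made for the example, since the listing does not show the vendor's actual schema.

```python
import json
from collections import Counter

# Hypothetical record: all field names are assumptions, not the vendor's schema.
SAMPLE = """
{
  "conversation_id": "chat-0001",
  "participants": ["agent-01", "customer-17"],
  "metadata": {"topic": "appointment scheduling", "region": "Punjab", "sentiment": "neutral"},
  "messages": [
    {"speaker": "customer", "text": "..."},
    {"speaker": "agent", "text": "..."},
    {"speaker": "customer", "text": "..."}
  ]
}
"""


def turns_per_speaker(record: dict) -> Counter:
    """Count dialogue turns for each speaker label in one conversation."""
    return Counter(message["speaker"] for message in record["messages"])


record = json.loads(SAMPLE)
turn_counts = turns_per_speaker(record)
print(record["metadata"]["topic"], dict(turn_counts))
```

    Counting turns per speaker is one quick way to check a record against the stated range of 50–150 turns per chat before feeding it to a pipeline.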

    Applications


  3. s

    Punjabi Dataset

    • so.shaip.com
    Updated Feb 12, 2023
    Cite
    Shaip (2023). Punjabi Dataset [Dataset]. https://so.shaip.com/offerings/speech-data-catalog/punjabi-dataset/
    Explore at:
    Dataset updated
    Feb 12, 2023
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Punjabi Dataset.

  4. F

    Punjabi Call Center Data for Healthcare AI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Punjabi Call Center Data for Healthcare AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/healthcare-call-center-conversation-punjabi-india
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    This Punjabi Call Center Speech Dataset for the Healthcare industry is purpose-built to accelerate the development of Punjabi speech recognition, spoken language understanding, and conversational AI systems. With 30 hours of unscripted, real-world conversations, it delivers the linguistic and contextual depth needed to build high-performance ASR models for medical and wellness-related customer service.

    Created by FutureBeeAI, this dataset empowers voice AI teams, NLP researchers, and data scientists to develop domain-specific models for hospitals, clinics, insurance providers, and telemedicine platforms.

    Speech Data

    The dataset features 30 hours of dual-channel call center conversations between native Punjabi speakers. These recordings cover a variety of healthcare support topics, enabling the development of speech technologies that are contextually aware and linguistically rich.

    Participant Diversity:
    Speakers: 60 verified native Punjabi speakers from our contributor community.
    Regions: Diverse regions across Punjab to ensure broad dialectal representation.
    Participant Profile: Age range of 18–70 with a gender mix of 60% male and 40% female.
    Recording Details:
    Conversation Nature: Naturally flowing, unscripted conversations.
    Call Duration: Each session ranges from 5 to 15 minutes.
    Audio Format: WAV format, stereo, 16-bit depth at 8 kHz and 16 kHz sample rates.
    Recording Environment: Captured in clear conditions without background noise or echo.
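
    The audio specification above (stereo, 16-bit, 8 kHz or 16 kHz) can be checked against a file header with Python's standard `wave` module. The sketch below is illustrative only: `check_wav` and `make_silent_wav` are hypothetical helpers written for this example, and the test audio is synthetic silence rather than data from the dataset.

```python
import io
import wave

EXPECTED_CHANNELS = 2            # stereo (dual-channel), per the listing
EXPECTED_SAMPLE_WIDTH = 2        # bytes per sample, i.e. 16-bit depth
EXPECTED_RATES = {8000, 16000}   # Hz


def check_wav(source) -> list[str]:
    """Return the header fields that do not match the listed specification."""
    problems = []
    with wave.open(source, "rb") as wav:
        if wav.getnchannels() != EXPECTED_CHANNELS:
            problems.append(f"channels={wav.getnchannels()}")
        if wav.getsampwidth() != EXPECTED_SAMPLE_WIDTH:
            problems.append(f"sample_width={wav.getsampwidth()}")
        if wav.getframerate() not in EXPECTED_RATES:
            problems.append(f"rate={wav.getframerate()}")
    return problems


def make_silent_wav(channels: int, rate: int, seconds: float = 0.1) -> io.BytesIO:
    """Build a short silent 16-bit WAV in memory so the check runs without real data."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * channels * int(rate * seconds))
    buffer.seek(0)
    return buffer


stereo_ok = check_wav(make_silent_wav(channels=2, rate=16000))
mono_bad = check_wav(make_silent_wav(channels=1, rate=44100))
print(stereo_ok, mono_bad)
```

    `check_wav` also accepts a file path, so the same function could be run over a downloaded sample before training.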

    Topic Diversity

    The dataset spans inbound and outbound calls, capturing a broad range of healthcare-specific interactions and sentiment types (positive, neutral, negative).

    Inbound Calls:
    Appointment Scheduling
    New Patient Registration
    Surgical Consultation
    Dietary Advice and Consultations
    Insurance Coverage Inquiries
    Follow-up Treatment Requests, and more
    Outbound Calls:
    Appointment Reminders
    Preventive Care Campaigns
    Test Results & Lab Reports
    Health Risk Assessment Calls
    Vaccination Updates
    Wellness Subscription Outreach, and more

    These real-world interactions help build speech models that understand healthcare domain nuances and user intent.

    Transcription

    Every audio file is accompanied by high-quality, manually created transcriptions in JSON format.

    Transcription Includes:
    Speaker-identified Dialogues
    Time-coded Segments
    Non-speech Annotations (e.g., silence, cough)
    High transcription accuracy, with a word error rate below 5%, backed by dual-layer QA checks.
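
    Word error rate is conventionally computed as (substitutions + deletions + insertions) divided by the number of words in the reference transcript. The sketch below shows that standard calculation with a word-level edit distance; it is not the vendor's QA tooling, and the placeholder tokens (`w0`, `w1`, ...) stand in for real Punjabi words.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] is the edit distance between the reference prefix so far and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref)


# Placeholder tokens: one substituted word in a 20-word reference.
ref_words = [f"w{i}" for i in range(20)]
hyp_words = ref_words.copy()
hyp_words[7] = "x"
wer = word_error_rate(" ".join(ref_words), " ".join(hyp_words))
print(f"WER = {wer:.2%}")
```

    One wrong word in twenty already gives a WER of exactly 5%, so a "below 5%" claim implies fewer than one error per twenty reference words on average.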

    Metadata

    Each conversation and speaker includes detailed metadata to support fine-tuned training and analysis.

    Participant Metadata: ID, gender, age, region, accent, and dialect.
    Conversation Metadata: Topic, sentiment, call type, sample rate, and technical specs.

    Usage and Applications

    This dataset can be used across a range of healthcare and voice AI use cases:


  5. h

    PAARI-Punjabi-TTS

    • huggingface.co
    Updated Jul 30, 2025
    Cite
    Kepler Systems (2025). PAARI-Punjabi-TTS [Dataset]. https://huggingface.co/datasets/keplersystems/PAARI-Punjabi-TTS
    Explore at:
    Dataset updated
    Jul 30, 2025
    Dataset authored and provided by
    Kepler Systems
    Description

    PAARI Punjabi TTS Dataset

      Dataset Description
    

    This dataset contains TTS-optimized chunks of journalism articles in Punjabi (ਪੰਜਾਬੀ) from the People's Archive of Rural India (PAARI). The articles focus on rural life, agriculture, social issues, and cultural stories from rural India.

      Dataset Details
    

    Language: Punjabi (ਪੰਜਾਬੀ)
    Script: Gurmukhi
    Language Code: pa
    Dataset Type: TTS-optimized
    Source: Rural India Online
    License: Please refer to PAARI's terms of use…

    See the full description on the dataset page: https://huggingface.co/datasets/keplersystems/PAARI-Punjabi-TTS.

  6. s

    Vadivelu Comedy Punjabi Dataset

    • sn.shaip.com
    Updated Sep 11, 2024
    Cite
    Shaip (2024). Vadivelu Comedy Punjabi Dataset [Dataset]. https://sn.shaip.com/offerings/speech-data-catalog/punjabi-dataset/
    Explore at:
    Dataset updated
    Sep 11, 2024
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Punjabi Dataset (ਪੰਜਾਬੀ ਡਾਟਸੈਟ): High-quality Punjabi call-center, general conversation, and podcast datasets for AI and speech models.

  7. s

    Punjabi Dataset

    • la.shaip.com
    Updated Dec 8, 2024
    Cite
    Shaip (2024). Punjabi Dataset [Dataset]. https://la.shaip.com/offerings/speech-data-catalog/punjabi-dataset/
    Explore at:
    Dataset updated
    Dec 8, 2024
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Punjabi Dataset (ਪੰਜਾਬੀ): High-quality Punjabi call-center, general conversation, and podcast datasets for AI and speech models.

  8. F

    Punjabi Human-Human Chat Dataset for Conversational AI & NLP

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Punjabi Human-Human Chat Dataset for Conversational AI & NLP [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/punjabi-general-domain-conversation-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The Punjabi General Domain Chat Dataset is a high-quality, text-based dataset designed to train and evaluate conversational AI, NLP models, and smart assistants in real-world Punjabi usage. Collected through FutureBeeAI’s trusted crowd community, this dataset reflects natural, native-level Punjabi conversations covering a broad spectrum of everyday topics.

    Conversational Text Data

    This dataset includes over 10,000 chat transcripts, each featuring free-flowing dialogue between two native Punjabi speakers. The conversations are spontaneous, context-rich, and mimic informal, real-life texting behavior.

    Words per Chat: 300–700
    Turns per Chat: Up to 50 dialogue turns
    Contributors: 150 native Punjabi speakers from the FutureBeeAI Crowd Community
    Format: TXT, DOCS, JSON or CSV (customizable)
    Structure: Each record contains the full chat, topic tag, and metadata block

    Diversity and Domain Coverage

    Conversations span a wide variety of general-domain topics to ensure comprehensive model exposure:

    Music, books, and movies
    Health and wellness
    Children and parenting
    Family life and relationships
    Food and cooking
    Education and studying
    Festivals and traditions
    Environment and daily life
    Internet and tech usage
    Childhood memories and casual chatting

    This diversity ensures the dataset is useful across multiple NLP and language understanding applications.

    Linguistic Authenticity

    Chats reflect informal, native-level Punjabi usage with:

    Colloquial expressions and local dialect influence
    Domain-relevant terminology
    Language-specific grammar, phrasing, and sentence flow
    Inclusion of realistic details such as names, phone numbers, email addresses, locations, dates, times, local currencies, and culturally grounded references
    Representation of different writing styles and input quirks to ensure training data realism

    Metadata

    Every chat instance is accompanied by structured metadata, which includes:

    Participant Age
    Gender
    Country/Region
    Chat Domain
    Chat Topic
    Dialect

    This metadata supports model filtering, demographic-specific evaluation, and more controlled fine-tuning workflows.

    Data Quality Assurance

    All chat records pass through a rigorous QA process to maintain consistency and accuracy:

    Manual review for content completeness
    Format checks for chat turns and metadata
    Linguistic verification by native speakers
    Removal of inappropriate or unusable samples

    This ensures a clean, reliable dataset ready for high-performance AI model training.

    Applications

    This dataset is ideal for training and evaluating a wide range of text-based AI systems:

    Conversational AI / Chatbots
    Smart assistants and voicebots

  9. F

    Punjabi Agent-Customer Chat Dataset for Telecom

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Punjabi Agent-Customer Chat Dataset for Telecom [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/punjabi-telecom-domain-conversation-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The Punjabi Telecom Chat Dataset is a comprehensive collection of over 12,000 text-based conversations between telecom customers and call center agents. This dataset captures real-world service interactions and domain-specific language in Punjabi, enabling the development of intelligent conversational AI and NLP systems tailored for the telecommunications sector.

    Participant & Chat Overview

    Participants: 200+ native Punjabi speakers from the FutureBeeAI Crowd Community
    Conversation Length: 300–700 words per chat
    Turns per Chat: 50–150 dialogue turns across both participants
    Chat Types: Inbound and outbound
    Sentiment Coverage: A mix of positive, neutral, and negative interactions

    Topic Diversity

    This dataset spans a wide range of telecom customer service scenarios:

    Inbound Chats (Customer-Initiated)
    Phone number porting
    Network connectivity issues
    Billing inquiries and adjustments
    Technical support requests
    Service activations and upgrades
    International roaming inquiries
    Refunds and complaint resolution
    Emergency service access
    Outbound Chats (Agent-Initiated)
    Welcome and onboarding calls
    Payment reminders and due alerts
    Customer satisfaction surveys
    Technical issue follow-ups
    Usage reviews and service feedback
    Promotions and service offers

    Language Nuance & Realism

    The conversations reflect real-life telecom interactions in Punjabi, incorporating:

    Naming Patterns: Realistic Punjabi personal, business, and telecom brand names
    Localized Content: Phone numbers, email addresses, and locations consistent with regional norms
    Time & Number Formats: Punjabi representations of dates, times, currencies, and service numbers
    Informal Language & Slang: Common Punjabi expressions, idioms, and conversational shortcuts found in telecom discussions

    Conversational Flow & Structure

    Conversations follow the natural flow of telecom customer service exchanges, including:

    Dialogue Types:
    Simple service inquiries
    Detailed problem-solving discussions
    Plan explanations and upgrades
    Feedback collection and status updates
    Interaction Stages:
    Initial greetings and verification
    Data or issue collection
    Clarification and troubleshooting

  10. h

    PAARI-Punjabi

    • huggingface.co
    Updated Aug 2, 2025
    Cite
    Kepler Systems (2025). PAARI-Punjabi [Dataset]. https://huggingface.co/datasets/keplersystems/PAARI-Punjabi
    Explore at:
    Dataset updated
    Aug 2, 2025
    Dataset authored and provided by
    Kepler Systems
    Description

    PAARI Punjabi Dataset

      Dataset Description
    

    This dataset contains journalism articles in Punjabi (ਪੰਜਾਬੀ) from the People's Archive of Rural India (PAARI). The articles focus on rural life, agriculture, social issues, and cultural stories from rural India.

      Dataset Details
    

    Language: Punjabi (ਪੰਜਾਬੀ)
    Script: Gurmukhi
    Language Code: pa
    Dataset Type: Standard
    Source: Rural India Online
    License: Please refer to PAARI's terms of use

      Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/keplersystems/PAARI-Punjabi.
    
  11. F

    Punjabi Agent-Customer Chat Dataset for Travel

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Punjabi Agent-Customer Chat Dataset for Travel [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/punjabi-travel-domain-conversation-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The Punjabi Travel Chat Dataset is a comprehensive collection of over 12,000 text-based conversations between customers and call center agents. Focused on real-life travel and tourism interactions, this dataset captures the language, tone, and service dynamics essential for building robust conversational AI, chatbots, and NLP solutions for the travel industry in Punjabi-speaking markets.

    Participant & Chat Overview

    Participants: 200+ native Punjabi speakers from the FutureBeeAI Crowd Community
    Conversation Length: 300–700 words per chat
    Turns per Chat: 50–150 dialogue turns across both participants
    Chat Types: Inbound and outbound
    Sentiment Coverage: Includes positive, neutral, and negative interaction outcomes

    Topic Diversity

    The dataset encompasses a wide range of travel and tourism use cases across both customer-initiated and agent-initiated conversations:

    Inbound Chats (Customer-Initiated)
    Booking assistance and travel planning
    Destination information and recommendations
    Flight delays or cancellations
    Lost or delayed baggage support
    Assistance for travelers with disabilities
    Health and safety travel inquiries
    Outbound Chats (Agent-Initiated)
    Promotional offers and travel package deals
    Booking confirmations and schedule updates
    Flight change notifications
    Customer satisfaction surveys
    Visa expiration and renewal reminders
    Loyalty and feedback collection campaigns

    This variety ensures wide applicability in both sales enablement and customer support automation.

    Language Diversity & Realism

    Conversations are crafted to reflect the everyday language and nuances of Punjabi-speaking travelers:

    Naming Patterns: Punjabi personal names, airline and hotel names, tour operators
    Localized Details: Regional email formats, phone numbers, locations, and cultural references
    Time and Currency Expressions: Dates, local times, and prices represented in Punjabi forms
    Slang and Informal Speech: Common phrases and idioms used in travel planning and customer support

    These linguistic and cultural cues enable the development of context-aware, natural-sounding AI systems.

    Conversational Structure & Flow

    The dataset captures a variety of interaction types, including:

    Dialogue Types:
    Quick inquiries and confirmations
    Complex issue resolution
    Advisory and planning sessions
    Travel disruption and recovery support
    Common Flow Elements:
    Greetings and authentication
    Information request and validation
    Problem or request resolution

  12. h

    30_8_2025_dataset

    • huggingface.co
    Updated Aug 30, 2025
    Cite
    Rupinder Singh (2025). 30_8_2025_dataset [Dataset]. https://huggingface.co/datasets/rupindersingh1313/30_8_2025_dataset
    Explore at:
    Dataset updated
    Aug 30, 2025
    Authors
    Rupinder Singh
    Description

    rupindersingh1313/30_8_2025_dataset

      Dataset Description
    

    This dataset contains Punjabi OCR data with page images and their corresponding text annotations, ready for machine learning applications.

      Dataset Summary
    

    Language: Punjabi (pa-IN)
    Script: Gurmukhi
    Total Pages: 769
    Source: Generated using Punjabi OCR annotation pipeline
    Format: Image-annotation pairs with original JSON annotations

      Dataset Splits
    

    Train: 615 samples
    Validation: 76 samples
    Test:…

    See the full description on the dataset page: https://huggingface.co/datasets/rupindersingh1313/30_8_2025_dataset.

  13. h

    Punjabi_Transliteration_Corpus

    • huggingface.co
    Updated Jul 21, 2024
    Cite
    Speech and Language Processing Group (2024). Punjabi_Transliteration_Corpus [Dataset]. https://huggingface.co/datasets/SLPG/Punjabi_Transliteration_Corpus
    Explore at:
    Dataset updated
    Jul 21, 2024
    Authors
    Speech and Language Processing Group
    Description

    Punjabi Transliteration Corpus (PTC)

    The Punjabi Transliteration Corpus (PTC) is a comprehensive dataset containing 6.3 million parallel sentences in Gurmukhi and Shahmukhi scripts. This corpus has been meticulously compiled to support the development and evaluation of neural machine transliteration (NMT) models for Punjabi text.

      Corpus Details
    

    Total Sentences: 6.3 million
    Domains Covered: Various domains including CCaligned, ccmatrix, TED, QED, OPUS, TIco, Wikimedia…

    See the full description on the dataset page: https://huggingface.co/datasets/SLPG/Punjabi_Transliteration_Corpus.
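
    Because the corpus pairs Gurmukhi with Shahmukhi text, a simple sanity check on a sentence pair is to confirm which Unicode block each side uses. Gurmukhi occupies U+0A00–U+0A7F, while Shahmukhi is written with the Arabic script (basic block U+0600–U+06FF). The sketch below is a heuristic written for this example, not part of the corpus tooling; `detect_script` is a hypothetical helper, and the Arabic-block test cannot distinguish Shahmukhi from other Arabic-script languages such as Urdu.

```python
def detect_script(text: str) -> str:
    """Classify text as 'gurmukhi', 'shahmukhi', 'mixed', or 'unknown' by Unicode block."""
    has_gurmukhi = any("\u0A00" <= ch <= "\u0A7F" for ch in text)  # Gurmukhi block
    has_arabic = any("\u0600" <= ch <= "\u06FF" for ch in text)    # Arabic block, used by Shahmukhi
    if has_gurmukhi and has_arabic:
        return "mixed"
    if has_gurmukhi:
        return "gurmukhi"
    if has_arabic:
        return "shahmukhi"
    return "unknown"


# The word "Punjabi" written in each script.
pair = ("ਪੰਜਾਬੀ", "پنجابی")
labels = tuple(detect_script(side) for side in pair)
print(labels)
```

    A pair whose two sides do not come back as one `gurmukhi` and one `shahmukhi` label is a candidate for manual inspection.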

  14. s

    Punjabi Off-the-Shelf Datasets

    • sw.shaip.com
    • bn.shaip.com
    • +3more
    json
    Updated Jan 31, 2022
    Cite
    Shaip (2022). Punjabi Off-the-Shelf Datasets [Dataset]. https://sw.shaip.com/offerings/speech-data-catalog/
    Explore at:
    Available download formats: json
    Dataset updated
    Jan 31, 2022
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Off-the-shelf Punjabi audio dataset with a total volume of 200 hours, divided into: 60 hours of 8 kHz unscripted, synthetic telephonic call-center conversations between 'agent' and 'customer'; 100 hours of 8 kHz unscripted telephonic generic conversations between two people; and 40 hours of 16 kHz public-domain media and podcast audio/video conversations. Topics include Agriculture, Art, Aviation, Banking, Consumer, Crime, Culture, Delivery, Entertainment, Finance, Food, Gaming, Health, Hospitality, IT, Insurance, Legal, News, Oil, Politics, Real Estate, Religion, Retail, Spirituality, Sports, Technology, Telecom, Travel, Weather, Automotive. Audio format: .wav. Transcription format: .json.

  15. F

    Punjabi Agent-Customer Chat Dataset for Delivery & Logistics

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Punjabi Agent-Customer Chat Dataset for Delivery & Logistics [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/punjabi-delivery-domain-conversation-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The Punjabi Delivery & Logistics Chat Dataset is a comprehensive collection of over 12,000 text-based conversations between customers and call center agents. Focused on real-world delivery and logistics interactions, this dataset captures the language, tone, and service patterns essential for developing robust Punjabi-language conversational AI, chatbots, and NLP systems across the delivery ecosystem.

    Participant & Chat Overview

    Participants: 200+ native Punjabi speakers from the FutureBeeAI Crowd Community
    Conversation Length: 300–700 words per chat
    Turns per Chat: 50–150 dialogue turns between customer and agent
    Chat Types: Inbound (customer-initiated) and outbound (agent-initiated)
    Sentiment Coverage: Includes positive, neutral, and negative interaction outcomes

    Topic Diversity

    The dataset spans a wide range of delivery and logistics scenarios, ensuring strong coverage across customer service and operational workflows.

    Inbound Chats (Customer-Initiated)
    Order tracking and delivery status inquiries
    Complaints about late or missing deliveries
    Undeliverable or incorrect address resolution
    Return process and pickup scheduling
    Order modifications and change requests
    Enquiries about delivery method options
    Outbound Chats (Agent-Initiated)
    Delivery confirmations and dispatch updates
    Subscription renewal or delivery reminders
    Notification of delivery issues or missed attempts
    Out-of-stock or product unavailability alerts
    Satisfaction surveys and service feedback collection
    Address verification for upcoming deliveries

    This topical spread ensures wide applicability in both customer support automation and logistics optimization use cases.

    Language Diversity & Realism

    The conversations reflect the authentic language and interaction style of Punjabi-speaking customers and delivery agents, incorporating:

    Naming Patterns: Personal names, business names, and logistics company references
    Localized Details: Punjabi-format emails, phone numbers, regional addresses, and delivery zones
    Temporal and Numeric Expressions: Dates, delivery windows, prices, and tracking IDs in Punjabi formats
    Slang and Informal Speech: Everyday expressions and delivery-specific idioms used across Punjabi dialects

    This linguistic realism enables the development of context-aware and naturally responsive AI systems.

    Conversational Structure & Flow

    The dataset captures a diverse range of interaction types and delivery workflows:

    Dialogue Types:
    Quick status checks and confirmations
    Multi-turn issue resolution
    Process walkthroughs and guidance
    Feedback and escalation handling
    Common Flow Elements:
    Greetings and caller verification
    Request or complaint initiation

  16. s

    Set Data Punjabi

    • ms.shaip.com
    Updated Aug 15, 2024
    + more versions
    Cite
    Shaip (2024). Set Data Punjabi [Dataset]. https://ms.shaip.com/offerings/speech-data-catalog/punjabi-dataset/
    Explore at:
    Dataset updated
    Aug 15, 2024
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Punjabi Dataset (ਪੰਜਾਬੀ ਡਾਟਾਸੈਟ): High-quality Punjabi call-center, general conversation, and podcast datasets for AI and speech models.

  17. F

    Punjabi Scripted Monologue Speech Data for Healthcare

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Punjabi Scripted Monologue Speech Data for Healthcare [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/healthcare-scripted-speech-monologues-punjabi-india
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Introducing the Punjabi Scripted Monologue Speech Dataset for the Healthcare Domain, a voice dataset built to accelerate the development and deployment of Punjabi language automatic speech recognition (ASR) systems, with a sharp focus on real-world healthcare interactions.

    Speech Data

    This dataset includes over 6,000 high-quality scripted audio prompts recorded in Punjabi, representing typical voice interactions found in the healthcare industry. The data is tailored for use in voice technology systems that power virtual assistants, patient-facing AI tools, and intelligent customer service platforms.

    Participant Diversity
    Speakers: 60 native Punjabi speakers.
    Regional Balance: Participants are sourced from multiple regions across Punjab, reflecting diverse dialects and linguistic traits.
    Demographics: Includes a mix of male and female participants (60:40 ratio), aged between 18 and 70 years.
    Recording Specifications
    Nature of Recordings: Scripted monologues based on healthcare-related use cases.
    Duration: Each clip runs from 5 to 30 seconds, offering short, context-rich speech samples.
    Audio Format: WAV files recorded in mono, with 16-bit depth and sample rates of 8 kHz and 16 kHz.
    Environment: Clean and echo-free spaces ensure clear and noise-free audio capture.
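
    The recording specifications above can be turned into a quick conformance check. The sketch below is illustrative only: it uses Python's standard `wave` module, the function name `check_clip` is hypothetical, and the thresholds are simply the figures quoted in this listing (mono, 16-bit, 8 kHz or 16 kHz, 5 to 30 seconds). It is not a tool supplied with the dataset.

```python
import wave

# Figures quoted in the listing above.
EXPECTED_CHANNELS = 1                # mono
EXPECTED_SAMPLE_WIDTH_BYTES = 2      # 16-bit
EXPECTED_SAMPLE_RATES = {8000, 16000}
MIN_SECONDS, MAX_SECONDS = 5.0, 30.0


def check_clip(path):
    """Return a list of problems; an empty list means the clip matches the spec."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        width = wav.getsampwidth()
        rate = wav.getframerate()
        seconds = wav.getnframes() / rate

    problems = []
    if channels != EXPECTED_CHANNELS:
        problems.append(f"expected mono, found {channels} channels")
    if width != EXPECTED_SAMPLE_WIDTH_BYTES:
        problems.append(f"expected 16-bit samples, found {8 * width}-bit")
    if rate not in EXPECTED_SAMPLE_RATES:
        problems.append(f"unexpected sample rate {rate} Hz")
    if not MIN_SECONDS <= seconds <= MAX_SECONDS:
        problems.append(f"duration {seconds:.1f} s is outside 5-30 s")
    return problems
```

    Calling `check_clip` on each delivered file and logging any non-empty result is one way to catch clips that fall outside the stated format before training.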

    Topic Coverage

    The prompts span a broad range of healthcare-specific interactions, such as:

    Patient check-in and follow-up communication
    Appointment booking and cancellation dialogues
    Insurance and regulatory support queries
    Medication, test results, and consultation discussions
    General health tips and wellness advice
    Emergency and urgent care communication
    Technical support for patient portals and apps
    Domain-specific scripted statements and FAQs

    Contextual Depth

    To maximize authenticity, the prompts integrate linguistic elements and healthcare-specific terms such as:

    Names: Gender- and region-appropriate Punjabi names
    Addresses: Varied local address formats spoken naturally
    Dates & Times: References to appointment dates, times, follow-ups, and schedules
    Medical Terminology: Common medical procedures, symptoms, and treatment references
    Numbers & Measurements: Health data like dosages, vitals, and test result values
    Healthcare Institutions: Names of clinics, hospitals, and diagnostic centers

    These elements make the dataset exceptionally suited for training AI systems to understand and respond to natural healthcare-related speech patterns.

    Transcription

    Every audio recording is accompanied by a verbatim, manually verified transcription.

    Content: The transcription mirrors the exact scripted prompt recorded by the speaker.
    Format: Files are delivered in plain text (.TXT) format with consistent naming conventions for seamless integration.

  18. F

    Punjabi TTS Speech Dataset for Speech Synthesis

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Punjabi TTS Speech Dataset for Speech Synthesis [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/tts-monolgue-punjabi-india
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    The Punjabi TTS Monologue Speech Dataset is a professionally curated resource built to train realistic, expressive, and production-grade text-to-speech (TTS) systems. It contains studio-recorded long-form speech by trained native Punjabi voice artists, each contributing 1 to 2 hours of clean, uninterrupted monologue audio.

    Unlike typical prompt-based datasets with short, isolated phrases, this collection features long-form, topic-driven monologues that mirror natural human narration. It includes content types that are directly useful for real-world applications, like audiobook-style storytelling, educational lectures, health advisories, product explainers, digital how-tos, formal announcements, and more.

    All recordings are captured in professional studios using high-end equipment and under the guidance of experienced voice directors.

    Recording & Audio Quality

    Audio Format: WAV, 48 kHz, available in 16-bit, 24-bit, and 32-bit depth
    SNR: Minimum 30 dB
    Channel: Mono
    Recording Duration: 20-30 minutes
    Recording Environment: Studio-controlled, acoustically treated rooms
    Per Speaker Volume: 1–2 hours of speech per artist
    Quality Control: Each file is reviewed and cleaned for common acoustic issues, including: reverberation, lip smacks, mouth clicks, thumping, hissing, plosives, sibilance, background noise, static interference, clipping, and other artifacts.

    Only clean, production-grade audio makes it into the final dataset.
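
    The 30 dB minimum SNR quoted above can be read through the usual definition, ten times the base-10 logarithm of signal power over noise power. The listing does not say how the provider measures SNR, so the sketch below is only an assumption-laden illustration: it supposes a speech segment and a noise-only segment (for example, room tone between sentences) are available as lists of samples, and the function names are hypothetical.

```python
import math

MIN_SNR_DB = 30.0  # minimum quoted in the listing above


def mean_power(samples):
    """Average squared amplitude of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)


def snr_db(speech_samples, noise_samples):
    """SNR in decibels: 10 * log10(speech power / noise power)."""
    noise_power = mean_power(noise_samples)
    if noise_power == 0:
        return math.inf  # perfectly silent noise floor
    return 10.0 * math.log10(mean_power(speech_samples) / noise_power)


def meets_minimum(speech_samples, noise_samples):
    """True when the estimated SNR reaches the quoted 30 dB floor."""
    return snr_db(speech_samples, noise_samples) >= MIN_SNR_DB
```

    Because the speech segment also contains the noise floor, this estimate slightly overstates the true ratio; at 30 dB and above the difference is negligible.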

    Voice Artist Selection

    All voice artists are native Punjabi speakers with professional training or prior experience in narration. We ensure a diverse pool in terms of age, gender, and region to bring a balanced and rich vocal dataset.

    Artist Profile:
    Gender: Male and Female
    Age Range: 20–60 years
    Regions: Native Punjabi-speaking regions of Punjab
    Selection Process: All artists are screened, onboarded, and sample-approved using FutureBeeAI’s proprietary Yugo platform.

    Script Quality & Coverage

    Scripts are neither generic nor repetitive. They are professionally authored by domain experts to reflect real-world use cases, and they include modern vocabulary, emotional range, and phonetically rich sentence structures.

    Word Count per Script: 3,000–5,000 words per 30-minute session
    Content Types:
    Storytelling
    Script and book reading
    Informational explainers
    Government service instructions
    E-commerce tutorials
    Motivational content
    Health & wellness guides
    Education & career advice
    Linguistic Design: Balanced punctuation, emotional range, modern syntax, and vocabulary diversity
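
    As a rough sanity check on the word counts above, the quoted 3,000 to 5,000 words per 30-minute session can be converted to a speaking rate. The snippet below is plain arithmetic on those two figures; it assumes the words are spread evenly across the session, which the listing does not state.

```python
def words_per_minute(word_count, minutes):
    """Average speaking rate implied by a script length and a session length."""
    return word_count / minutes


SESSION_MINUTES = 30
low = words_per_minute(3000, SESSION_MINUTES)   # 100.0
high = words_per_minute(5000, SESSION_MINUTES)  # about 166.7

print(f"Implied narration rate: {low:.0f} to {high:.0f} words per minute")
```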

    Transcripts & Alignment

    While the script is used during the recording, we also provide post-recording updates to ensure the transcript reflects the final spoken audio. Minor edits are made to adjust for skipped or rephrased words.

    Segmentation: Time-stamped at the sentence level, aligned to actual spoken delivery
    Format: Available in plain text and JSON
    Post-processing:
    Corrected for

  19. F

    Punjabi Call Center Data for Realestate AI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Punjabi Call Center Data for Realestate AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/realestate-call-center-conversation-punjabi-india
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    This Punjabi Call Center Speech Dataset for the Real Estate industry is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for Punjabi-speaking real estate customers. With over 30 hours of unscripted, real-world audio, this dataset captures authentic conversations between customers and real estate agents, making it well suited to building robust ASR models.

    Curated by FutureBeeAI, this dataset equips voice AI developers, real estate tech platforms, and NLP researchers with the data needed to create high-accuracy, production-ready models for property-focused use cases.

    Speech Data

    The dataset features 30 hours of dual-channel call center recordings between native Punjabi speakers. Captured in realistic real estate consultation and support contexts, these conversations span a wide array of property-related topics, from inquiries to investment advice, offering deep domain coverage for AI model development.

    Participant Diversity:
    Speakers: 60 native Punjabi speakers from our verified contributor community.
    Regions: Representing different regions across Punjab to ensure accent and dialect variation.
    Participant Profile: Balanced gender mix (60% male, 40% female) and age range from 18 to 70.
    Recording Details:
    Conversation Nature: Naturally flowing, unscripted agent-customer discussions.
    Call Duration: Average 5–15 minutes per call.
    Audio Format: Stereo WAV, 16-bit, recorded at 8kHz and 16kHz.
    Recording Environment: Captured in noise-free and echo-free conditions.
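
    Since the recordings are described as dual-channel stereo WAV, a common first step is to separate the two channels into mono files so each side of the call can be processed on its own. The sketch below uses only Python's standard `wave` module; the function name `split_stereo` is hypothetical, and the listing does not say which channel carries the agent and which the customer, so the outputs are simply called left and right.

```python
import wave


def split_stereo(stereo_path, left_path, right_path):
    """Write the two channels of a stereo WAV file to two mono WAV files."""
    with wave.open(stereo_path, "rb") as src:
        if src.getnchannels() != 2:
            raise ValueError("expected a two-channel WAV file")
        width = src.getsampwidth()   # bytes per sample (2 for 16-bit)
        rate = src.getframerate()
        frames = src.readframes(src.getnframes())

    # Stereo frames are interleaved: left sample, right sample, left, right, ...
    frame_size = 2 * width
    left = bytearray()
    right = bytearray()
    for offset in range(0, len(frames), frame_size):
        left += frames[offset:offset + width]
        right += frames[offset + width:offset + frame_size]

    for path, data in ((left_path, left), (right_path, right)):
        with wave.open(path, "wb") as dst:
            dst.setnchannels(1)
            dst.setsampwidth(width)
            dst.setframerate(rate)
            dst.writeframes(bytes(data))
```

    The loop copies whole samples without interpreting them, so it works for any sample width and does not depend on the machine's byte order.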

    Topic Diversity

    This speech corpus includes both inbound and outbound calls, featuring positive, neutral, and negative outcomes across a wide range of real estate scenarios.

    Inbound Calls:
    Property Inquiries
    Rental Availability
    Renovation Consultation
    Property Features & Amenities
    Investment Property Evaluation
    Ownership History & Legal Info, and more
    Outbound Calls:
    New Listing Notifications
    Post-Purchase Follow-ups
    Property Recommendations
    Value Updates
    Customer Satisfaction Surveys, and others

    Such domain-rich variety ensures model generalization across common real estate support conversations.

    Transcription

    All recordings are accompanied by precise, manually verified transcriptions in JSON format.

    Transcription Includes:
    Speaker-Segmented Dialogues
    Time-coded Segments
    Non-speech Tags (e.g., background noise, pauses)
    High transcription accuracy: word error rate below 5%, achieved through dual-layer human review.

    These transcriptions streamline ASR and NLP development for Punjabi real estate voice applications.
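
    For readers unfamiliar with the metric, word error rate is conventionally the number of word substitutions, deletions, and insertions needed to turn a hypothesis into the reference, divided by the number of reference words. The listing does not describe how its sub-5% figure was measured, so the following is only a minimal sketch of the conventional definition, with a hypothetical function name and whitespace tokenization assumed.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level edit distance, keeping only the previous row of the table.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

    Under this definition, a rate below 5% means fewer than one word-level error per twenty reference words.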

    Metadata

    Detailed metadata accompanies each participant and conversation:

    Participant Metadata: ID, age, gender, location, accent, and dialect.
    Conversation Metadata: Topic, call type, sentiment, sample rate, and technical details.

    This enables smart filtering, dialect-focused model training, and structured dataset exploration.

    Usage and Applications

    This dataset is ideal for voice AI and NLP systems built for the real estate sector:

  20. F

    Punjabi Call Center Data for Travel AI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Punjabi Call Center Data for Travel AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/travel-call-center-conversation-punjabi-india
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    This Punjabi Call Center Speech Dataset for the Travel industry is purpose-built to power the next generation of voice AI applications for travel booking, customer support, and itinerary assistance. With over 30 hours of unscripted, real-world conversations, the dataset enables the development of highly accurate speech recognition and natural language understanding models tailored for Punjabi-speaking travelers.

    Created by FutureBeeAI, this dataset supports researchers, data scientists, and conversational AI teams in building voice technologies for airlines, travel portals, and hospitality platforms.

    Speech Data

    The dataset includes 30 hours of dual-channel audio recordings between native Punjabi speakers engaged in real travel-related customer service conversations. These audio files reflect a wide variety of topics, accents, and scenarios found across the travel and tourism industry.

    Participant Diversity:
    Speakers: 60 native Punjabi contributors from our verified pool.
    Regions: Covering multiple Punjab regions to capture accent and dialectal variation.
    Participant Profile: Balanced representation of age (18–70) and gender (60% male, 40% female).
    Recording Details:
    Conversation Nature: Naturally flowing, spontaneous customer-agent calls.
    Call Duration: Between 5 and 15 minutes per session.
    Audio Format: Stereo WAV, 16-bit depth, at 8kHz and 16kHz.
    Recording Environment: Captured in controlled, noise-free, echo-free settings.

    Topic Diversity

    Inbound and outbound conversations span a wide range of real-world travel support situations with varied outcomes (positive, neutral, negative).

    Inbound Calls:
    Booking Assistance
    Destination Information
    Flight Delays or Cancellations
    Support for Disabled Passengers
    Health and Safety Travel Inquiries
    Lost or Delayed Luggage, and more
    Outbound Calls:
    Promotional Travel Offers
    Customer Feedback Surveys
    Booking Confirmations
    Flight Rescheduling Alerts
    Visa Expiry Notifications, and others

    These scenarios help models understand and respond to diverse traveler needs in real time.

    Transcription

    Each call is accompanied by manually curated, high-accuracy transcriptions in JSON format.

    Transcription Includes:
    Speaker-Segmented Dialogues
    Time-Stamped Segments
    Non-speech Markers (e.g., pauses, coughs)
    High transcription accuracy: a dual-layered transcription review keeps the word error rate under 5%.

    Metadata

    Extensive metadata enriches each call and speaker for better filtering and AI training:

    Participant Metadata: ID, age, gender, region, accent, and dialect.
    Conversation Metadata: Topic, domain, call type, sentiment, and audio specs.
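
    Metadata of this kind is typically used to select a subset of calls before training. The listing names the fields but not the file layout or key names, so everything in the sketch below is hypothetical: the key names (`topic`, `call_type`, `sentiment`), the sample records, and the helper `filter_calls` are made up purely to illustrate the idea.

```python
def filter_calls(calls, **criteria):
    """Keep the records whose fields equal every given criterion."""
    return [
        call for call in calls
        if all(call.get(field) == wanted for field, wanted in criteria.items())
    ]


# Made-up records for illustration only; not taken from the dataset.
calls = [
    {"id": "c1", "topic": "Booking Assistance", "call_type": "inbound", "sentiment": "positive"},
    {"id": "c2", "topic": "Lost or Delayed Luggage", "call_type": "inbound", "sentiment": "negative"},
    {"id": "c3", "topic": "Booking Confirmations", "call_type": "outbound", "sentiment": "neutral"},
]

negative_inbound = filter_calls(calls, call_type="inbound", sentiment="negative")
print([call["id"] for call in negative_inbound])
```

    The same pattern extends to participant fields such as region or dialect once the actual key names in the delivered metadata are known.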

    Usage and Applications

    This dataset is ideal for a variety of AI use cases in the travel and tourism space:

    ASR Systems: Train Punjabi speech-to-text engines for travel platforms.
