17 datasets found
  1. Indian English Call Center Data for Travel AI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Indian English Call Center Data for Travel AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/travel-call-center-conversation-english-india
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    India
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    This Indian English Call Center Speech Dataset for the Travel industry is purpose-built to power the next generation of voice AI applications for travel booking, customer support, and itinerary assistance. With over 30 hours of unscripted, real-world conversations, the dataset enables the development of highly accurate speech recognition and natural language understanding models tailored for English-speaking travelers.

    Created by FutureBeeAI, this dataset supports researchers, data scientists, and conversational AI teams in building voice technologies for airlines, travel portals, and hospitality platforms.

    Speech Data

    The dataset includes 30 hours of dual-channel audio recordings between native Indian English speakers engaged in real travel-related customer service conversations. These audio files reflect a wide variety of topics, accents, and scenarios found across the travel and tourism industry.

    Participant Diversity:
    Speakers: 60 native Indian English contributors from our verified pool.
    Regions: Covering multiple Indian states to capture accent and dialectal variation.
    Participant Profile: Balanced representation of age (18–70) and gender (60% male, 40% female).
    Recording Details:
    Conversation Nature: Naturally flowing, spontaneous customer-agent calls.
    Call Duration: Between 5 and 15 minutes per session.
    Audio Format: Stereo WAV, 16-bit depth, at 8kHz and 16kHz.
    Recording Environment: Captured in controlled, noise-free, echo-free settings.
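
    The audio format stated above (stereo WAV, 16-bit, 8 kHz or 16 kHz) can be checked on a downloaded file before training. The following is a minimal sketch using Python's standard `wave` module. The helper names `describe_wav`, `make_silent_wav`, and `matches_listing` are hypothetical, and the silent in-memory file only stands in for a real recording.

```python
import io
import wave


def describe_wav(source):
    """Return (channels, bits_per_sample, sample_rate_hz, duration_seconds)."""
    with wave.open(source, "rb") as wav:
        channels = wav.getnchannels()
        bits = wav.getsampwidth() * 8
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    return channels, bits, rate, duration


def make_silent_wav(channels, rate, seconds):
    """Build an in-memory 16-bit PCM WAV of silence (stand-in for a real file)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * channels * rate * seconds)
    buffer.seek(0)
    return buffer


def matches_listing(source):
    """Check the spec quoted in the listing: stereo, 16-bit, 8 kHz or 16 kHz."""
    channels, bits, rate, _ = describe_wav(source)
    return channels == 2 and bits == 16 and rate in (8000, 16000)
```

    `describe_wav` also accepts a file path, so the same check can be run over every file in a download.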

    Topic Diversity

    Inbound and outbound conversations span a wide range of real-world travel support situations with varied outcomes (positive, neutral, negative).

    Inbound Calls:
    Booking Assistance
    Destination Information
    Flight Delays or Cancellations
    Support for Disabled Passengers
    Health and Safety Travel Inquiries
    Lost or Delayed Luggage, and more
    Outbound Calls:
    Promotional Travel Offers
    Customer Feedback Surveys
    Booking Confirmations
    Flight Rescheduling Alerts
    Visa Expiry Notifications, and others

    These scenarios help models understand and respond to diverse traveler needs in real-time.

    Transcription

    Each call is accompanied by manually curated, high-accuracy transcriptions in JSON format.

    Transcription Includes:
    Speaker-Segmented Dialogues
    Time-Stamped Segments
    Non-speech Markers (e.g., pauses, coughs)
    Quality Control: A dual-layered transcription review keeps the word error rate under 5%.
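
    The listing does not say how the word error rate is measured. As a point of reference, the conventional definition is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The sketch below implements that standard definition; the function name `word_error_rate` and the example sentences are illustrative assumptions, not the provider's tooling.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

    Under this definition, a figure under 5% corresponds to fewer than one word-level edit per 20 reference words.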

    Metadata

    Extensive metadata enriches each call and speaker for better filtering and AI training:

    Participant Metadata: ID, age, gender, region, accent, and dialect.
    Conversation Metadata: Topic, domain, call type, sentiment, and audio specs.

    Usage and Applications

    This dataset is ideal for a variety of AI use cases in the travel and tourism space:

    ASR Systems: Train English speech-to-text engines for travel platforms.

  2. India News Headlines Dataset

    • kaggle.com
    zip
    Updated Nov 11, 2023
    Cite
    Rohit Kulkarni (2023). India News Headlines Dataset [Dataset]. https://www.kaggle.com/datasets/therohk/india-headlines-news-dataset/discussion
    Available download formats: zip (97613967 bytes)
    Dataset updated
    Nov 11, 2023
    Authors
    Rohit Kulkarni
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    India
    Description

    Context

    This news dataset is a persistent historical archive of notable events in the Indian subcontinent from the start of 2001 to Q2 2023, recorded in real time by the journalists of India. It contains approximately 3.8 million events published by Times of India.

    The majority of the data focuses on Indian local news, including national, city-level, and entertainment news.

    Prepared by Rohit Kulkarni

    Content

    Time Range: Start Date: 2001-01-01; End Date: 2023-06-30

    CSV Rows: 3,876,557

    Columns:
    1. publish_date: Date the article was published online, in yyyyMMdd format
    2. headline_category: Category of the headline; ASCII, dot-delimited, lowercase values
    3. headline_text: Text of the headline in English; ASCII characters only
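
    The three columns described above can be parsed with the Python standard library. This is a minimal sketch that assumes the CSV has a header row using the column names listed. The two sample rows and their category values are invented for illustration and are not taken from the dataset.

```python
import csv
import io
from datetime import date, datetime

# Invented sample in the documented layout; real rows will differ.
SAMPLE_CSV = """publish_date,headline_category,headline_text
20010102,unknown,Hypothetical headline about a city council vote
20140516,city.mumbai,Hypothetical headline about election results
"""


def parse_row(row):
    """Convert one raw CSV row into typed fields."""
    return {
        # yyyyMMdd in the listing corresponds to %Y%m%d in strptime.
        "publish_date": datetime.strptime(row["publish_date"], "%Y%m%d").date(),
        # Dot-delimited category becomes a path from general to specific.
        "category_path": row["headline_category"].split("."),
        "headline_text": row["headline_text"],
    }


def read_headlines(handle):
    """Parse every row from an open text handle."""
    return [parse_row(row) for row in csv.DictReader(handle)]


rows = read_headlines(io.StringIO(SAMPLE_CSV))
```

    For the full file of roughly 3.9 million rows, iterating over `csv.DictReader` lazily instead of building a list keeps memory use flat.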

    Inspiration

    Times Group, as a news agency, reaches a very wide audience across Asia and dwarfs every other agency in the quantity of English articles published per day. Due to the heavy daily volume (avg. 600 articles) over multiple years, this data offers a deep insight into Indian society, its priorities, events, issues and talking points, and how they have unfolded over time.

    It is possible to chop this dataset into a smaller piece based on one or more facets.

    • Time Range: Headlines during 2006 Mumbai bombings, 2014 election, ongoing health crisis
    • One or more Categories: like Citywise, Bollywood, ICC updates, Magazine, Middle East
    • One or more Keywords: like crime or ecology related tokens, names of political parties, celebrities, corporations.
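
    The three facets listed above (time range, category, keyword) can be combined into a single filter. The sketch below works on already-parsed rows; the function name `select`, the sample rows, and the category values are hypothetical and only illustrate the dot-delimited category format.

```python
from datetime import date

# Hypothetical rows: (publish_date, headline_category, headline_text)
HEADLINES = [
    (date(2006, 7, 12), "city.mumbai", "Hypothetical headline on train services"),
    (date(2014, 5, 16), "india", "Hypothetical headline on election results"),
    (date(2014, 5, 17), "entertainment.hindi.bollywood", "Hypothetical film review"),
]


def select(rows, start=None, end=None, category_prefix=None, keywords=()):
    """Keep rows that satisfy every facet that was supplied."""
    selected = []
    for published, category, text in rows:
        if start is not None and published < start:
            continue
        if end is not None and published > end:
            continue
        if category_prefix is not None:
            # Match the category itself or any dot-delimited sub-category.
            is_match = category == category_prefix or category.startswith(
                category_prefix + "."
            )
            if not is_match:
                continue
        lowered = text.lower()
        if keywords and not any(word.lower() in lowered for word in keywords):
            continue
        selected.append((published, category, text))
    return selected
```

    For example, `select(HEADLINES, start=date(2014, 1, 1), end=date(2014, 12, 31), keywords=("election",))` keeps only the 2014 row that mentions an election.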

    Similar news datasets exploring other attributes, countries and topics can be seen on my profile.

  3. Number of data compromises and impacted individuals in U.S. 2005-2024

    • statista.com
    Cite
    Statista, Number of data compromises and impacted individuals in U.S. 2005-2024 [Dataset]. https://www.statista.com/statistics/273550/data-breaches-recorded-in-the-united-states-by-number-of-breaches-and-records-exposed/
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    United States
    Description

    In 2024, the number of data compromises in the United States stood at 3,158 cases. Meanwhile, over 1.35 billion individuals were affected in the same year by data compromises, including data breaches, leakage, and exposure. While these are three different events, they have one thing in common: in all three, sensitive data is accessed by an unauthorized threat actor.

    Industries most vulnerable to data breaches

    Some industry sectors usually see more significant cases of private data violations than others. This is determined by the type and volume of personal information that organizations in these sectors store. In 2024, financial services, healthcare, and professional services were the three industry sectors that recorded the most data breaches. Overall, the number of healthcare data breaches in some industry sectors in the United States has gradually increased within the past few years. However, some sectors saw a decrease.

    Largest data exposures worldwide

    In 2020, an adult streaming website, CAM4, experienced a leakage of nearly 11 billion records. This is by far the most extensive reported data leakage. The case is unique, though, because cyber security researchers found the vulnerability before cyber criminals did. The second-largest data breach is the Yahoo data breach, dating back to 2013. The company first reported about one billion exposed records, then later, in 2017, revised the number of leaked records to three billion. The third-biggest data breach happened in March 2018 and involved India’s national identification database, Aadhaar. As a result of this incident, over 1.1 billion records were exposed.

  4. Kannada Scripted Monologue Speech Data in Travel Domain

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Kannada Scripted Monologue Speech Data in Travel Domain [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/travel-scripted-speech-monologues-kannada-india
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Algerian Arabic Scripted Monologue Speech Dataset for the Travel domain, a carefully constructed resource created to support the development of Arabic speech recognition technologies, particularly for applications in travel, tourism, and customer service automation.

    Speech Data

    This training dataset features 6,000+ high-quality scripted prompt recordings in Algerian Arabic, crafted to simulate real-world Travel industry conversations. It’s ideal for building robust ASR systems, virtual assistants, and customer interaction tools.

    Participant Diversity
    Speakers: 60 native Algerian Arabic speakers.
    Geographic Coverage: Participants from multiple regions across Algeria to ensure rich diversity in dialects and accents.
    Demographics: Age range from 18 to 70 years, with a gender ratio of approximately 60% male and 40% female.
    Recording Details
    Prompt Type: Scripted monologue-style prompts.
    Duration: Each audio sample ranges from 5 to 30 seconds.
    Audio Format: WAV files with mono channels, 16-bit depth, and 8 kHz / 16 kHz sample rates.
    Environment: Clean, quiet, echo-free spaces to ensure high-quality recordings.

    Topic Coverage

    The dataset includes a wide spectrum of travel-related interactions to reflect diverse real-world scenarios:

    Booking and reservation dialogues
    Customer support and general inquiries
    Destination-specific guidance
    Technical and login help
    Promotional offers and travel deals
    Service availability and policy information
    Domain-specific statements

    Context Elements

    To boost contextual realism, the scripted prompts integrate frequently encountered travel terms and variables:

    Names: Common Algerian male and female names
    Addresses: Regional address formats and locality names
    Dates & Times: Booking dates, travel periods, and time-based interactions
    Destinations: Mention of cities, countries, airports, and tourist landmarks
    Prices & Numbers: Cost of flights, hotel rates, promotional discounts, etc.
    Booking & Confirmation Codes: Typical ticketing and travel identifiers

    Transcription

    Every audio file is paired with a verbatim transcription in .TXT format.

    Consistency: Each transcript matches its corresponding audio file exactly.
    Accuracy: Transcriptions are reviewed and verified by native Algerian Arabic speakers.
    Usability: File names are synced across audio and text for easy integration.
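
    Because file names are synced across audio and text, each recording can be matched to its transcript by file stem. The sketch below assumes audio and transcripts sit under one root directory; that layout, the helper names, and the `prompt_0001` file names are hypothetical. Suffixes are compared case-insensitively since the listing writes the extension as .TXT.

```python
import tempfile
from pathlib import Path


def pair_audio_with_transcripts(root):
    """Match .wav files to .txt files with the same stem under root."""
    files = [path for path in Path(root).rglob("*") if path.is_file()]
    transcripts = {
        path.stem: path for path in files if path.suffix.lower() == ".txt"
    }
    pairs, missing = [], []
    for audio in sorted(p for p in files if p.suffix.lower() == ".wav"):
        transcript = transcripts.get(audio.stem)
        if transcript is None:
            missing.append(audio)  # audio without a transcript
        else:
            pairs.append((audio, transcript))
    return pairs, missing


def demo():
    """Run the pairing on a throwaway directory with placeholder files."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for name in ("prompt_0001.wav", "prompt_0001.TXT", "prompt_0002.wav"):
            (root / name).write_bytes(b"")
        pairs, missing = pair_audio_with_transcripts(root)
        return (
            [(audio.name, text.name) for audio, text in pairs],
            [audio.name for audio in missing],
        )
```

    Reporting unmatched audio separately makes an incomplete download visible before any training run starts.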

    Metadata

    Each audio file is enriched with detailed metadata to support advanced analytics and filtering:

    Participant Metadata: Unique ID, age, gender, region/state,

  5. Gujarati Scripted Monologue Speech Dataset for BFSI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Gujarati Scripted Monologue Speech Dataset for BFSI [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/bfsi-scripted-speech-monologues-gujarati-india
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Gujarati Scripted Monologue Speech Dataset tailored for the BFSI (Banking, Financial Services, and Insurance) domain. This dataset empowers the development of advanced Gujarati speech recognition systems, natural language understanding models, and conversational AI solutions focused on the BFSI sector.

    Speech Data

    This dataset includes over 6,000 scripted prompt recordings in Gujarati, covering a wide range of realistic banking and finance-related scenarios to support robust ASR and voice AI systems.

    Participant Diversity
    Speakers: 60 native Gujarati speakers.
    Regions: Diverse representation from various regions of Gujarat to ensure dialect and accent coverage.
    Demographics: Age range of 18–70, with a male-to-female ratio of 60:40.
    Recording Details
    Nature: Scripted monologues and domain-specific prompt recordings.
    Duration:
    Audio Format: WAV, mono channel, 16-bit depth, recorded at 8 kHz and 16 kHz sample rates.
    Environment: Clean, echo-free, and noise-free environments.

    Topic & Context Diversity

    This dataset spans multiple BFSI-related themes to simulate practical customer interaction scenarios:

    Customer service interactions
    Financial transactions & balance inquiries
    Banking and insurance product queries
    Loan & credit support
    Regulatory and compliance questions
    Technical help and password resets
    Promotional campaigns and service updates

    Contextual Elements

    To make the dataset as context-rich as possible, each prompt integrates commonly encountered real-world BFSI elements:

    Names: Region-specific names in multiple formats
    Addresses: Local address structures and pronunciations
    Dates & Times: Typical time expressions used in banking
    Organization Names: Names of banks, financial firms, and institutions
    Currencies & Amounts: Spoken currency formats, prices, and numeric data
    IDs & Transaction Numbers: For authentic service simulation

    Transcription

    Every audio file is paired with verbatim transcription to streamline ASR and NLP model development.

    Content: Exact match of each prompt
    Format: Clean .TXT files, mapped to audio file names
    Accuracy: Reviewed and validated by native Gujarati linguists

    Metadata

    Each data point is enriched with detailed metadata for advanced training and analysis:

    Participant Metadata: Unique ID, age, gender, state, country, dialect
    Recording Metadata: Transcript, recording setup, sample rate, bit depth, device, file format

    Applications and Use Cases

    This BFSI-focused dataset is ideal for:


  6. Indian English Scripted Monologue Speech Dataset for BFSI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Indian English Scripted Monologue Speech Dataset for BFSI [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/bfsi-scripted-speech-monologues-english-india
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    India
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Indian English Scripted Monologue Speech Dataset tailored for the BFSI (Banking, Financial Services, and Insurance) domain. This dataset empowers the development of advanced English speech recognition systems, natural language understanding models, and conversational AI solutions focused on the BFSI sector.

    Speech Data

    This dataset includes over 6,000 scripted prompt recordings in Indian English, covering a wide range of realistic banking and finance-related scenarios to support robust ASR and voice AI systems.

    Participant Diversity
    Speakers: 60 native Indian English speakers.
    Regions: Diverse representation from various Indian states to ensure dialect and accent coverage.
    Demographics: Age range of 18–70, with a male-to-female ratio of 60:40.
    Recording Details
    Nature: Scripted monologues and domain-specific prompt recordings.
    Duration:
    Audio Format: WAV, mono channel, 16-bit depth, recorded at 8 kHz and 16 kHz sample rates.
    Environment: Clean, echo-free, and noise-free environments.

    Topic & Context Diversity

    This dataset spans multiple BFSI-related themes to simulate practical customer interaction scenarios:

    Customer service interactions
    Financial transactions & balance inquiries
    Banking and insurance product queries
    Loan & credit support
    Regulatory and compliance questions
    Technical help and password resets
    Promotional campaigns and service updates

    Contextual Elements

    To make the dataset as context-rich as possible, each prompt integrates commonly encountered real-world BFSI elements:

    Names: Region-specific names in multiple formats
    Addresses: Local address structures and pronunciations
    Dates & Times: Typical time expressions used in banking
    Organization Names: Names of banks, financial firms, and institutions
    Currencies & Amounts: Spoken currency formats, prices, and numeric data
    IDs & Transaction Numbers: For authentic service simulation

    Transcription

    Every audio file is paired with verbatim transcription to streamline ASR and NLP model development.

    Content: Exact match of each prompt
    Format: Clean .TXT files, mapped to audio file names
    Accuracy: Reviewed and validated by native Indian English linguists

    Metadata

    Each data point is enriched with detailed metadata for advanced training and analysis:

    Participant Metadata: Unique ID, age, gender, state, country, dialect
    Recording Metadata: Transcript, recording setup, sample rate, bit depth, device, file format

    Applications and Use Cases

    This BFSI-focused dataset is

  7. Tamil Call Center Data for Travel AI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Tamil Call Center Data for Travel AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/travel-call-center-conversation-tamil-india
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    This Tamil Call Center Speech Dataset for the Travel industry is purpose-built to power the next generation of voice AI applications for travel booking, customer support, and itinerary assistance. With over 30 hours of unscripted, real-world conversations, the dataset enables the development of highly accurate speech recognition and natural language understanding models tailored for Tamil-speaking travelers.

    Created by FutureBeeAI, this dataset supports researchers, data scientists, and conversational AI teams in building voice technologies for airlines, travel portals, and hospitality platforms.

    Speech Data

    The dataset includes 30 hours of dual-channel audio recordings between native Tamil speakers engaged in real travel-related customer service conversations. These audio files reflect a wide variety of topics, accents, and scenarios found across the travel and tourism industry.

    Participant Diversity:
    Speakers: 60 native Tamil contributors from our verified pool.
    Regions: Covering multiple Tamil Nadu regions to capture accent and dialectal variation.
    Participant Profile: Balanced representation of age (18–70) and gender (60% male, 40% female).
    Recording Details:
    Conversation Nature: Naturally flowing, spontaneous customer-agent calls.
    Call Duration: Between 5 and 15 minutes per session.
    Audio Format: Stereo WAV, 16-bit depth, at 8kHz and 16kHz.
    Recording Environment: Captured in controlled, noise-free, echo-free settings.
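
    Dual-channel (stereo) recordings are often split into one mono stream per channel before transcription or diarization. The sketch below de-interleaves a 16-bit stereo WAV with the standard library. The listing does not state which channel carries the agent and which the customer, so the result is returned simply as left and right. The helper names and the tiny synthetic file are hypothetical.

```python
import io
import wave
from array import array


def split_channels(source):
    """Return (left_pcm, right_pcm, sample_rate) from a 16-bit stereo WAV."""
    with wave.open(source, "rb") as wav:
        if wav.getnchannels() != 2 or wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit stereo audio")
        rate = wav.getframerate()
        frames = wav.readframes(wav.getnframes())

    # Stereo PCM interleaves samples as L, R, L, R, ... Each 'h' item is one
    # 2-byte sample, so striding by 2 separates the channels.
    samples = array("h")
    samples.frombytes(frames)
    return samples[0::2].tobytes(), samples[1::2].tobytes(), rate


def build_stereo_wav(left_sample, right_sample, frame_count, rate=8000):
    """Create a tiny in-memory stereo WAV with constant per-channel samples."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(2)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes((left_sample + right_sample) * frame_count)
    buffer.seek(0)
    return buffer
```

    Each returned byte string can be written back out as a mono WAV with `wave.open(path, "wb")` at the same sample rate.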

    Topic Diversity

    Inbound and outbound conversations span a wide range of real-world travel support situations with varied outcomes (positive, neutral, negative).

    Inbound Calls:
    Booking Assistance
    Destination Information
    Flight Delays or Cancellations
    Support for Disabled Passengers
    Health and Safety Travel Inquiries
    Lost or Delayed Luggage, and more
    Outbound Calls:
    Promotional Travel Offers
    Customer Feedback Surveys
    Booking Confirmations
    Flight Rescheduling Alerts
    Visa Expiry Notifications, and others

    These scenarios help models understand and respond to diverse traveler needs in real-time.

    Transcription

    Each call is accompanied by manually curated, high-accuracy transcriptions in JSON format.

    Transcription Includes:
    Speaker-Segmented Dialogues
    Time-Stamped Segments
    Non-speech Markers (e.g., pauses, coughs)
    Quality Control: A dual-layered transcription review keeps the word error rate under 5%.

    Metadata

    Extensive metadata enriches each call and speaker for better filtering and AI training:

    Participant Metadata: ID, age, gender, region, accent, and dialect.
    Conversation Metadata: Topic, domain, call type, sentiment, and audio specs.

    Usage and Applications

    This dataset is ideal for a variety of AI use cases in the travel and tourism space:

    ASR Systems: Train Tamil speech-to-text engines for travel platforms.

  8. Tamil General Domain Scripted Monologue Speech Data

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Tamil General Domain Scripted Monologue Speech Data [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/general-scripted-speech-monologues-tamil-india
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The Tamil Scripted Monologue Speech Dataset for the General Domain is a carefully curated resource designed to support the development of Tamil language speech recognition systems. This dataset focuses on general-purpose conversational topics and is ideal for a wide range of AI applications requiring natural, domain-agnostic Tamil speech data.

    Speech Data

    This dataset features over 6,000 high-quality scripted monologue recordings in Tamil. The prompts span diverse real-life topics commonly encountered in general conversations and are intended to help train robust and accurate speech-enabled technologies.

    Participant Diversity
    Speakers: 60 native Tamil speakers
    Regions: Broad regional coverage ensures diverse accents and dialects
    Demographics: Participants aged 18 to 70, with a 60:40 male-to-female ratio
    Recording Specifications
    Recording Type: Scripted monologues and prompt-based recordings
    Audio Duration: 5 to 30 seconds per file
    Format: WAV, mono channel, 16-bit, 8 kHz & 16 kHz sample rates
    Environment: Clean, noise-free conditions to ensure clarity and usability
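
    The recording specifications above are enough to estimate file sizes, since uncompressed PCM size is duration × sample rate × bytes per sample × channels. The sketch below works through that arithmetic; the function name `pcm_bytes` is hypothetical, and the figures exclude the WAV header and any extra metadata chunks.

```python
def pcm_bytes(seconds, sample_rate_hz, bits_per_sample=16, channels=1):
    """Size of the raw PCM payload for an uncompressed recording."""
    bytes_per_sample = bits_per_sample // 8
    return int(seconds * sample_rate_hz * bytes_per_sample * channels)


# Shortest clip at the lower sample rate and longest clip at the higher one.
smallest = pcm_bytes(5, 8000)
largest = pcm_bytes(30, 16000)
```

    Under these assumptions, a single clip falls between 80,000 and 960,000 bytes of audio data, so 6,000 clips at the 30-second, 16 kHz end of the range would come to about 5.76 GB.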

    Topic Coverage

    The dataset covers a wide variety of general conversation scenarios, including:

    Daily Conversations
    Topic-Specific Discussions
    General Knowledge and Advice
    Idioms and Sayings

    Contextual Features

    To enhance authenticity, the prompts include:

    Names: Male and female names specific to different Tamil Nadu regions
    Addresses: Commonly used address formats in daily Tamil speech
    Dates & Times: References used in general scheduling and time expressions
    Organization Names: Names of businesses, institutions, and other entities
    Numbers & Currencies: Mentions of quantities, prices, and monetary values

    Each prompt is designed to reflect everyday use cases, making it suitable for developing generalized NLP and ASR solutions.

    Transcription

    Every audio file in the dataset is accompanied by a verbatim text transcription, ensuring accurate training and evaluation of speech models.

    Content: Exact match to the spoken audio
    Format: Plain text (.TXT), named identically to the corresponding audio file
    Quality Control: All transcripts are validated by native Tamil transcribers

    Metadata

    Rich metadata is included for detailed filtering and analysis:

    Speaker Metadata: Unique speaker ID, age, gender, region, and dialect
    Audio Metadata: Prompt transcript, recording setup, device specs, sample rate, bit depth, and format

    Applications & Use Cases

    This dataset can power a variety of Tamil language AI technologies, including:

    Speech Recognition Training: ASR model development and fine-tuning

  9. Tamil Scripted Monologue Speech Data for Telecom

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Tamil Scripted Monologue Speech Data for Telecom [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/telecom-scripted-speech-monologues-tamil-india
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Presenting the Tamil Scripted Monologue Speech Dataset for the Telecom Domain, a purpose-built dataset created to accelerate the development of Tamil speech recognition and voice AI models specifically tailored for the telecommunications industry.

    Speech Data

    This dataset includes over 6,000 high-quality scripted prompt recordings in Tamil, representing real-world telecom customer service scenarios. It’s designed to support the training of speech-based AI systems used in call centers, virtual agents, and voice-powered support tools.

    Participant Diversity
    Speakers: 60 native Tamil speakers
    Geographic Distribution: Carefully selected from multiple regions across Tamil Nadu to capture a wide spectrum of dialects and speaking styles
    Demographics: Males and females in a 60:40 ratio, aged 18 to 70 years
    Recording Specifications
    Type: Scripted monologue prompts focused on telecom industry use cases
    Duration: Each audio clip ranges from 5 to 30 seconds
    Format: WAV files in mono, 16-bit depth, with sample rates of 8 kHz and 16 kHz
    Environment: Clean, echo-free, and noise-controlled settings to ensure optimal audio clarity

    Topic Coverage

    The dataset reflects a wide variety of common telecom customer interactions, including:

    Customer onboarding and service inquiries
    Billing and payment questions
    Data plans and product information
    Technical support requests
    Network coverage discussions
    Regulatory compliance and policy information
    Upgrades, renewals, and service plan changes
    Domain-specific scripted interactions tailored to real-world telecom use cases

    Contextual Depth

    To maximize contextual richness, prompts include:

    Localized Names: Common Tamil Nadu names in various formats
    Addresses: Region-specific address structures for realism
    Dates & Times: Spoken date and time references in typical telecom scenarios (e.g., billing cycles, service activation times)
    Telecom Terminology: Keywords related to mobile data, network, SIM, devices, plans, etc.
    Numbers & Rates: Usage statistics, pricing info, recharge values, and billing figures
    Service Providers: References to telecom companies and third-party service entities

    Transcription

    Each audio file is paired with an accurate, verbatim transcription for precise model training:

    Content: Transcriptions are direct representations of each recorded prompt
    Format: Plain text (.TXT), with filenames matching their corresponding audio files
    Verification: Every transcription is manually verified by native Tamil linguists to ensure consistency and accuracy

    Metadata

    Detailed metadata is included to enhance dataset usability and

  10. Punjabi Scripted Monologue Speech Dataset for BFSI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Punjabi Scripted Monologue Speech Dataset for BFSI [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/bfsi-scripted-speech-monologues-punjabi-india
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Punjabi Scripted Monologue Speech Dataset tailored for the BFSI (Banking, Financial Services, and Insurance) domain. This dataset empowers the development of advanced Punjabi speech recognition systems, natural language understanding models, and conversational AI solutions focused on the BFSI sector.

    Speech Data

    This dataset includes over 6,000 scripted prompt recordings in Punjabi, covering a wide range of realistic banking and finance-related scenarios to support robust ASR and voice AI systems.

    Participant Diversity
    Speakers: 60 native Punjabi speakers.
    Regions: Diverse representation from various regions of Punjab to ensure dialect and accent coverage.
    Demographics: Age range of 18–70, with a male-to-female ratio of 60:40.
    Recording Details
    Nature: Scripted monologues and domain-specific prompt recordings.
    Duration:
    Audio Format: WAV, mono channel, 16-bit depth, recorded at 8 kHz and 16 kHz sample rates.
    Environment: Clean, echo-free, and noise-free environments.

    Topic & Context Diversity

    This dataset spans multiple BFSI-related themes to simulate practical customer interaction scenarios:

    Customer service interactions
    Financial transactions & balance inquiries
    Banking and insurance product queries
    Loan & credit support
    Regulatory and compliance questions
    Technical help and password resets
    Promotional campaigns and service updates

    Contextual Elements

    To make the dataset as context-rich as possible, each prompt integrates commonly encountered real-world BFSI elements:

    Names: Region-specific names in multiple formats
    Addresses: Local address structures and pronunciations
    Dates & Times: Typical time expressions used in banking
    Organization Names: Names of banks, financial firms, and institutions
    Currencies & Amounts: Spoken currency formats, prices, and numeric data
    IDs & Transaction Numbers: For authentic service simulation

    Transcription

    Every audio file is paired with verbatim transcription to streamline ASR and NLP model development.

    Content: Exact match of each prompt
    Format: Clean .TXT files, mapped to audio file names
    Accuracy: Reviewed and validated by native Punjabi linguists

    Metadata

    Each data point is enriched with detailed metadata for advanced training and analysis:

    Participant Metadata: Unique ID, age, gender, state, country, dialect
    Recording Metadata: Transcript, recording setup, sample rate, bit depth, device, file format

    Applications and Use Cases

    This BFSI-focused dataset is ideal for:


  11. F

    Odia Call Center Data for Travel AI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Odia Call Center Data for Travel AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/travel-call-center-conversation-odia-india
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    This Odia Call Center Speech Dataset for the Travel industry is purpose-built to power the next generation of voice AI applications for travel booking, customer support, and itinerary assistance. With over 40 hours of unscripted, real-world conversations, the dataset enables the development of highly accurate speech recognition and natural language understanding models tailored for Odia-speaking travelers.

    Created by FutureBeeAI, this dataset supports researchers, data scientists, and conversational AI teams in building voice technologies for airlines, travel portals, and hospitality platforms.

    Speech Data

    The dataset includes 40 hours of dual-channel audio recordings between native Odia speakers engaged in real travel-related customer service conversations. These audio files reflect a wide variety of topics, accents, and scenarios found across the travel and tourism industry.

    Participant Diversity:
    Speakers: 80 native Odia contributors from our verified pool.
    Regions: Covering multiple Odisha regions to capture accent and dialectal variation.
    Participant Profile: Balanced representation of age (18–70) and gender (60% male, 40% female).
    Recording Details:
    Conversation Nature: Naturally flowing, spontaneous customer-agent calls.
    Call Duration: Between 5 and 15 minutes per session.
    Audio Format: Stereo WAV, 16-bit depth, at 8kHz and 16kHz.
    Recording Environment: Captured in controlled, noise-free, echo-free settings.

    Topic Diversity

    Inbound and outbound conversations span a wide range of real-world travel support situations with varied outcomes (positive, neutral, negative).

    Inbound Calls:
    Booking Assistance
    Destination Information
    Flight Delays or Cancellations
    Support for Disabled Passengers
    Health and Safety Travel Inquiries
    Lost or Delayed Luggage, and more
    Outbound Calls:
    Promotional Travel Offers
    Customer Feedback Surveys
    Booking Confirmations
    Flight Rescheduling Alerts
    Visa Expiry Notifications, and others

    These scenarios help models understand and respond to diverse traveler needs in real-time.

    Transcription

    Each call is accompanied by manually curated, high-accuracy transcriptions in JSON format.

    Transcription Includes:
    Speaker-Segmented Dialogues
    Time-Stamped Segments
    Non-speech Markers (e.g., pauses, coughs)
    A dual-layered transcription review keeps the word error rate under 5%.

    Metadata

    Extensive metadata enriches each call and speaker for better filtering and AI training:

    Participant Metadata: ID, age, gender, region, accent, and dialect.
    Conversation Metadata: Topic, domain, call type, sentiment, and audio specs.

    Usage and Applications

    This dataset is ideal for a variety of AI use cases in the travel and tourism space:

    ASR Systems: Train Odia speech-to-text engines for travel platforms.

  12. F

    Tamil Scripted Monologue Speech Dataset for BFSI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Tamil Scripted Monologue Speech Dataset for BFSI [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/bfsi-scripted-speech-monologues-tamil-india
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Tamil Scripted Monologue Speech Dataset tailored for the BFSI (Banking, Financial Services, and Insurance) domain. This dataset empowers the development of advanced Tamil speech recognition systems, natural language understanding models, and conversational AI solutions focused on the BFSI sector.

    Speech Data

    This dataset includes over 6,000 scripted prompt recordings in Tamil, covering a wide range of realistic banking and finance-related scenarios to support robust ASR and voice AI systems.

    Participant Diversity
    Speakers: 60 native Tamil speakers.
    Regions: Diverse representation from various regions of Tamil Nadu to ensure dialect and accent coverage.
    Demographics: Age range of 18–70, with a male-to-female ratio of 60:40.
    Recording Details
    Nature: Scripted monologues and domain-specific prompt recordings.
    Audio Format: WAV, mono channel, 16-bit depth, recorded at 8 kHz and 16 kHz sample rates.
    Environment: Clean, echo-free, and noise-free environments.

    Topic & Context Diversity

    This dataset spans multiple BFSI-related themes to simulate practical customer interaction scenarios:

    Customer service interactions
    Financial transactions & balance inquiries
    Banking and insurance product queries
    Loan & credit support
    Regulatory and compliance questions
    Technical help and password resets
    Promotional campaigns and service updates

    Contextual Elements

    To make the dataset as context-rich as possible, each prompt integrates commonly encountered real-world BFSI elements:

    Names: Region-specific names in multiple formats
    Addresses: Local address structures and pronunciations
    Dates & Times: Typical time expressions used in banking
    Organization Names: Names of banks, financial firms, and institutions
    Currencies & Amounts: Spoken currency formats, prices, and numeric data
    IDs & Transaction Numbers: For authentic service simulation

    Transcription

    Every audio file is paired with verbatim transcription to streamline ASR and NLP model development.

    Content: Exact match of each prompt
    Format: Clean .TXT files, mapped to audio file names
    Accuracy: Reviewed and validated by native Tamil linguists

    Metadata

    Each data point is enriched with detailed metadata for advanced training and analysis:

    Participant Metadata: Unique ID, age, gender, state, country, dialect
    Recording Metadata: Transcript, recording setup, sample rate, bit depth, device, file format

    Applications and Use Cases

    This BFSI-focused dataset is ideal for:


  13. F

    Malayalam Scripted Monologue Speech Data for Telecom

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Malayalam Scripted Monologue Speech Data for Telecom [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/telecom-scripted-speech-monologues-malayalam-india
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Presenting the Malayalam Scripted Monologue Speech Dataset for the Telecom Domain, a purpose-built dataset created to accelerate the development of Malayalam speech recognition and voice AI models specifically tailored for the telecommunications industry.

    Speech Data

    This dataset includes over 6,000 high-quality scripted prompt recordings in Malayalam, representing real-world telecom customer service scenarios. It’s designed to support the training of speech-based AI systems used in call centers, virtual agents, and voice-powered support tools.

    Participant Diversity
    Speakers: 60 native Malayalam speakers
    Geographic Distribution: Carefully selected from multiple regions across Kerala to capture a wide spectrum of dialects and speaking styles
    Demographics: Male-to-female ratio of 60:40, with speakers aged 18 to 70 years
    Recording Specifications
    Type: Scripted monologue prompts focused on telecom industry use cases
    Duration: Each audio clip ranges from 5 to 30 seconds
    Format: WAV files in mono, 16-bit depth, with sample rates of 8 kHz and 16 kHz
    Environment: Clean, echo-free, and noise-controlled settings to ensure optimal audio clarity

    Topic Coverage

    The dataset reflects a wide variety of common telecom customer interactions, including:

    Customer onboarding and service inquiries
    Billing and payment questions
    Data plans and product information
    Technical support requests
    Network coverage discussions
    Regulatory compliance and policy information
    Upgrades, renewals, and service plan changes
    Domain-specific scripted interactions tailored to real-world telecom use cases

    Contextual Depth

    To maximize contextual richness, prompts include:

    Localized Names: Common Kerala names in various formats
    Addresses: Region-specific address structures for realism
    Dates & Times: Spoken date and time references in typical telecom scenarios (e.g., billing cycles, service activation times)
    Telecom Terminology: Keywords related to mobile data, network, SIM, devices, plans, etc.
    Numbers & Rates: Usage statistics, pricing info, recharge values, and billing figures
    Service Providers: References to telecom companies and third-party service entities

    Transcription

    Each audio file is paired with an accurate, verbatim transcription for precise model training:

    Content: Transcriptions are direct representations of each recorded prompt
    Format: Plain text (.TXT), with filenames matching their corresponding audio files
    Verification: Every transcription is manually verified by native Malayalam linguists to ensure consistency and accuracy

    Metadata

    Detailed metadata is included to enhance dataset

  14. F

    Malayalam Scripted Monologue Speech Dataset for BFSI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Malayalam Scripted Monologue Speech Dataset for BFSI [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/bfsi-scripted-speech-monologues-malayalam-india
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Malayalam Scripted Monologue Speech Dataset tailored for the BFSI (Banking, Financial Services, and Insurance) domain. This dataset empowers the development of advanced Malayalam speech recognition systems, natural language understanding models, and conversational AI solutions focused on the BFSI sector.

    Speech Data

    This dataset includes over 6,000 scripted prompt recordings in Malayalam, covering a wide range of realistic banking and finance-related scenarios to support robust ASR and voice AI systems.

    Participant Diversity
    Speakers: 60 native Malayalam speakers.
    Regions: Diverse representation from various regions of Kerala to ensure dialect and accent coverage.
    Demographics: Age range of 18–70, with a male-to-female ratio of 60:40.
    Recording Details
    Nature: Scripted monologues and domain-specific prompt recordings.
    Audio Format: WAV, mono channel, 16-bit depth, recorded at 8 kHz and 16 kHz sample rates.
    Environment: Clean, echo-free, and noise-free environments.

    Topic & Context Diversity

    This dataset spans multiple BFSI-related themes to simulate practical customer interaction scenarios:

    Customer service interactions
    Financial transactions & balance inquiries
    Banking and insurance product queries
    Loan & credit support
    Regulatory and compliance questions
    Technical help and password resets
    Promotional campaigns and service updates

    Contextual Elements

    To make the dataset as context-rich as possible, each prompt integrates commonly encountered real-world BFSI elements:

    Names: Region-specific names in multiple formats
    Addresses: Local address structures and pronunciations
    Dates & Times: Typical time expressions used in banking
    Organization Names: Names of banks, financial firms, and institutions
    Currencies & Amounts: Spoken currency formats, prices, and numeric data
    IDs & Transaction Numbers: For authentic service simulation

    Transcription

    Every audio file is paired with verbatim transcription to streamline ASR and NLP model development.

    Content: Exact match of each prompt
    Format: Clean .TXT files, mapped to audio file names
    Accuracy: Reviewed and validated by native Malayalam linguists

    Metadata

    Each data point is enriched with detailed metadata for advanced training and analysis:

    Participant Metadata: Unique ID, age, gender, state, country, dialect
    Recording Metadata: Transcript, recording setup, sample rate, bit depth, device, file format

    Applications and Use Cases

    This BFSI-focused dataset is ideal

  15. F

    Telugu Scripted Monologue Speech Dataset for BFSI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Telugu Scripted Monologue Speech Dataset for BFSI [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/bfsi-scripted-speech-monologues-telugu-india
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Telugu Scripted Monologue Speech Dataset tailored for the BFSI (Banking, Financial Services, and Insurance) domain. This dataset empowers the development of advanced Telugu speech recognition systems, natural language understanding models, and conversational AI solutions focused on the BFSI sector.

    Speech Data

    This dataset includes over 6,000 scripted prompt recordings in Telugu, covering a wide range of realistic banking and finance-related scenarios to support robust ASR and voice AI systems.

    Participant Diversity
    Speakers: 60 native Telugu speakers.
    Regions: Diverse representation from various regions of Andhra Pradesh and Telangana to ensure dialect and accent coverage.
    Demographics: Age range of 18–70, with a male-to-female ratio of 60:40.
    Recording Details
    Nature: Scripted monologues and domain-specific prompt recordings.
    Audio Format: WAV, mono channel, 16-bit depth, recorded at 8 kHz and 16 kHz sample rates.
    Environment: Clean, echo-free, and noise-free environments.

    Topic & Context Diversity

    This dataset spans multiple BFSI-related themes to simulate practical customer interaction scenarios:

    Customer service interactions
    Financial transactions & balance inquiries
    Banking and insurance product queries
    Loan & credit support
    Regulatory and compliance questions
    Technical help and password resets
    Promotional campaigns and service updates

    Contextual Elements

    To make the dataset as context-rich as possible, each prompt integrates commonly encountered real-world BFSI elements:

    Names: Region-specific names in multiple formats
    Addresses: Local address structures and pronunciations
    Dates & Times: Typical time expressions used in banking
    Organization Names: Names of banks, financial firms, and institutions
    Currencies & Amounts: Spoken currency formats, prices, and numeric data
    IDs & Transaction Numbers: For authentic service simulation

    Transcription

    Every audio file is paired with verbatim transcription to streamline ASR and NLP model development.

    Content: Exact match of each prompt
    Format: Clean .TXT files, mapped to audio file names
    Accuracy: Reviewed and validated by native Telugu linguists

    Metadata

    Each data point is enriched with detailed metadata for advanced training and analysis:

    Participant Metadata: Unique ID, age, gender, state, country, dialect
    Recording Metadata: Transcript, recording setup, sample rate, bit depth, device, file format

    Applications and Use Cases

    This BFSI-focused dataset is ideal

  16. F

    Indian Bengali Call Center Data for Travel AI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Indian Bengali Call Center Data for Travel AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/travel-call-center-conversation-bengali-india
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    This Bengali Call Center Speech Dataset for the Travel industry is purpose-built to power the next generation of voice AI applications for travel booking, customer support, and itinerary assistance. With over 30 hours of unscripted, real-world conversations, the dataset enables the development of highly accurate speech recognition and natural language understanding models tailored for Bengali-speaking travelers.

    Created by FutureBeeAI, this dataset supports researchers, data scientists, and conversational AI teams in building voice technologies for airlines, travel portals, and hospitality platforms.

    Speech Data

    The dataset includes 30 hours of dual-channel audio recordings between native Bengali speakers engaged in real travel-related customer service conversations. These audio files reflect a wide variety of topics, accents, and scenarios found across the travel and tourism industry.

    Participant Diversity:
    Speakers: 60 native Bengali contributors from our verified pool.
    Regions: Covering multiple West Bengal regions to capture accent and dialectal variation.
    Participant Profile: Balanced representation of age (18–70) and gender (60% male, 40% female).
    Recording Details:
    Conversation Nature: Naturally flowing, spontaneous customer-agent calls.
    Call Duration: Between 5 and 15 minutes per session.
    Audio Format: Stereo WAV, 16-bit depth, at 8kHz and 16kHz.
    Recording Environment: Captured in controlled, noise-free, echo-free settings.

    Topic Diversity

    Inbound and outbound conversations span a wide range of real-world travel support situations with varied outcomes (positive, neutral, negative).

    Inbound Calls:
    Booking Assistance
    Destination Information
    Flight Delays or Cancellations
    Support for Disabled Passengers
    Health and Safety Travel Inquiries
    Lost or Delayed Luggage, and more
    Outbound Calls:
    Promotional Travel Offers
    Customer Feedback Surveys
    Booking Confirmations
    Flight Rescheduling Alerts
    Visa Expiry Notifications, and others

    These scenarios help models understand and respond to diverse traveler needs in real-time.

    Transcription

    Each call is accompanied by manually curated, high-accuracy transcriptions in JSON format.

    Transcription Includes:
    Speaker-Segmented Dialogues
    Time-Stamped Segments
    Non-speech Markers (e.g., pauses, coughs)
    A dual-layered transcription review keeps the word error rate under 5%.

    Metadata

    Extensive metadata enriches each call and speaker for better filtering and AI training:

    Participant Metadata: ID, age, gender, region, accent, and dialect.
    Conversation Metadata: Topic, domain, call type, sentiment, and audio specs.

    Usage and Applications

    This dataset is ideal for a variety of AI use cases in the travel and tourism space:

    ASR Systems: Train Bengali speech-to-text engines for travel platforms.

  17. F

    Malayalam Call Center Data for Travel AI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Malayalam Call Center Data for Travel AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/travel-call-center-conversation-malayalam-india
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    This Malayalam Call Center Speech Dataset for the Travel industry is purpose-built to power the next generation of voice AI applications for travel booking, customer support, and itinerary assistance. With over 30 hours of unscripted, real-world conversations, the dataset enables the development of highly accurate speech recognition and natural language understanding models tailored for Malayalam-speaking travelers.

    Created by FutureBeeAI, this dataset supports researchers, data scientists, and conversational AI teams in building voice technologies for airlines, travel portals, and hospitality platforms.

    Speech Data

    The dataset includes 30 hours of dual-channel audio recordings between native Malayalam speakers engaged in real travel-related customer service conversations. These audio files reflect a wide variety of topics, accents, and scenarios found across the travel and tourism industry.

    Participant Diversity:
    Speakers: 60 native Malayalam contributors from our verified pool.
    Regions: Covering multiple Kerala regions to capture accent and dialectal variation.
    Participant Profile: Balanced representation of age (18–70) and gender (60% male, 40% female).
    Recording Details:
    Conversation Nature: Naturally flowing, spontaneous customer-agent calls.
    Call Duration: Between 5 and 15 minutes per session.
    Audio Format: Stereo WAV, 16-bit depth, at 8kHz and 16kHz.
    Recording Environment: Captured in controlled, noise-free, echo-free settings.

    Topic Diversity

    Inbound and outbound conversations span a wide range of real-world travel support situations with varied outcomes (positive, neutral, negative).

    Inbound Calls:
    Booking Assistance
    Destination Information
    Flight Delays or Cancellations
    Support for Disabled Passengers
    Health and Safety Travel Inquiries
    Lost or Delayed Luggage, and more
    Outbound Calls:
    Promotional Travel Offers
    Customer Feedback Surveys
    Booking Confirmations
    Flight Rescheduling Alerts
    Visa Expiry Notifications, and others

    These scenarios help models understand and respond to diverse traveler needs in real-time.

    Transcription

    Each call is accompanied by manually curated, high-accuracy transcriptions in JSON format.

    Transcription Includes:
    Speaker-Segmented Dialogues
    Time-Stamped Segments
    Non-speech Markers (e.g., pauses, coughs)
    A dual-layered transcription review keeps the word error rate under 5%.

    Metadata

    Extensive metadata enriches each call and speaker for better filtering and AI training:

    Participant Metadata: ID, age, gender, region, accent, and dialect.
    Conversation Metadata: Topic, domain, call type, sentiment, and audio specs.

    Usage and Applications

    This dataset is ideal for a variety of AI use cases in the travel and tourism space:

    ASR Systems: Train Malayalam speech-to-text engines for travel platforms.


Cite
FutureBee AI (2022). Indian English Call Center Data for Travel AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/travel-call-center-conversation-english-india

Indian English Call Center Data for Travel AI

Indian English call center speech corpus in travel industry

Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License

https://www.futurebeeai.com/policies/ai-data-license-agreement

Area covered
India
Dataset funded by
FutureBeeAI
Description

Introduction

This Indian English Call Center Speech Dataset for the Travel industry is purpose-built to power the next generation of voice AI applications for travel booking, customer support, and itinerary assistance. With over 30 hours of unscripted, real-world conversations, the dataset enables the development of highly accurate speech recognition and natural language understanding models tailored for English-speaking travelers.

Created by FutureBeeAI, this dataset supports researchers, data scientists, and conversational AI teams in building voice technologies for airlines, travel portals, and hospitality platforms.

Speech Data

The dataset includes 30 hours of dual-channel audio recordings between native Indian English speakers engaged in real travel-related customer service conversations. These audio files reflect a wide variety of topics, accents, and scenarios found across the travel and tourism industry.

Participant Diversity:
Speakers: 60 native Indian English contributors from our verified pool.
Regions: Covering multiple Indian states to capture accent and dialectal variation.
Participant Profile: Balanced representation of age (18–70) and gender (60% male, 40% female).
Recording Details:
Conversation Nature: Naturally flowing, spontaneous customer-agent calls.
Call Duration: Between 5 and 15 minutes per session.
Audio Format: Stereo WAV, 16-bit depth, at 8kHz and 16kHz.
Recording Environment: Captured in controlled, noise-free, echo-free settings.
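
The listed audio format (stereo, 16-bit, 8 kHz or 16 kHz) can be checked against a downloaded file by reading its WAV header. The following is a minimal sketch using Python's standard `wave` module; the function names `describe_wav` and `matches_listed_format` are hypothetical and not part of any FutureBeeAI tooling, and the file path passed in is whatever local copy you have.

```python
import wave


def describe_wav(path):
    """Return (channels, bit_depth, sample_rate_hz) read from a WAV header."""
    with wave.open(path, "rb") as wav_file:
        channels = wav_file.getnchannels()
        bit_depth = wav_file.getsampwidth() * 8  # sample width is reported in bytes
        sample_rate = wav_file.getframerate()
    return channels, bit_depth, sample_rate


def matches_listed_format(path, expected_channels=2, expected_bit_depth=16,
                          allowed_rates=(8000, 16000)):
    """Check a file against the format stated in the listing (defaults are
    taken from the 'Audio Format' line above; adjust for mono datasets)."""
    channels, bit_depth, sample_rate = describe_wav(path)
    return (channels == expected_channels
            and bit_depth == expected_bit_depth
            and sample_rate in allowed_rates)
```

For the scripted monologue datasets, which are listed as mono, the same check would be called with `expected_channels=1`.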

Topic Diversity

Inbound and outbound conversations span a wide range of real-world travel support situations with varied outcomes (positive, neutral, negative).

Inbound Calls:
Booking Assistance
Destination Information
Flight Delays or Cancellations
Support for Disabled Passengers
Health and Safety Travel Inquiries
Lost or Delayed Luggage, and more
Outbound Calls:
Promotional Travel Offers
Customer Feedback Surveys
Booking Confirmations
Flight Rescheduling Alerts
Visa Expiry Notifications, and others

These scenarios help models understand and respond to diverse traveler needs in real-time.

Transcription

Each call is accompanied by manually curated, high-accuracy transcriptions in JSON format.

Transcription Includes:
Speaker-Segmented Dialogues
Time-Stamped Segments
Non-speech Markers (e.g., pauses, coughs)
A dual-layered transcription review keeps the word error rate under 5%.
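
The listing does not say how the word error rate is measured. As an illustration only, the sketch below uses the conventional definition: the word-level edit distance (substitutions, deletions, insertions) between a reference and a hypothesis transcript, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions for this example.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row

    return previous_row[-1] / len(ref_words)
```

Under this definition, a rate under 5% means fewer than one word-level error per 20 reference words on average.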

Metadata

Extensive metadata enriches each call and speaker for better filtering and AI training:

Participant Metadata: ID, age, gender, region, accent, and dialect.
Conversation Metadata: Topic, domain, call type, sentiment, and audio specs.

Usage and Applications

This dataset is ideal for a variety of AI use cases in the travel and tourism space:

ASR Systems: Train English speech-to-text engines for travel platforms.
