70 datasets found
  1. Share of English speakers by region India 2019

    • statista.com
    Updated May 14, 2019
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista (2019). Share of English speakers by region India 2019 [Dataset]. https://www.statista.com/statistics/1007578/india-share-of-english-speakers-by-region/
    Explore at:
    Dataset updated
    May 14, 2019
    Dataset authored and provided by
    Statistahttp://statista.com/
    Time period covered
    2019
    Area covered
    India
    Description

    This statistic represents results of a survey about the share of English speakers across India in 2019, by region. During the surveyed time period, the share of respondents who spoke English in urban areas was around ** percent while this was about ***** percent for rural respondents.

  2. Number of native English speakers in India 1971-2011

    • statista.com
    Updated Mar 28, 2019
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista (2019). Number of native English speakers in India 1971-2011 [Dataset]. https://www.statista.com/statistics/987540/number-of-native-english-speakers-india/
    Explore at:
    Dataset updated
    Mar 28, 2019
    Dataset authored and provided by
    Statistahttp://statista.com/
    Time period covered
    1971 - 2011
    Area covered
    India
    Description

    The statistic displays the number of native English speakers in India from 1971 to 2011. About *** thousand Indians recognized English as their mother-tongue according to the 2011 census, up from about ***** thousand speakers in the census of 2001.

  3. Number of English speakers in India 2011, by state

    • statista.com
    Updated May 30, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista (2025). Number of English speakers in India 2011, by state [Dataset]. https://www.statista.com/statistics/1614218/india-english-speakers-by-state/
    Explore at:
    Dataset updated
    May 30, 2025
    Dataset authored and provided by
    Statistahttp://statista.com/
    Time period covered
    2011
    Area covered
    India
    Description

    Nearly 260,000 speakers reported to speak English as their mother-tongue in India as per the latest census. Of these, Maharastra had the highest number of English speakers, followed by Tamil Nadu.

  4. F

    Indian English General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). Indian English General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-english-india
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Indian English General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of English speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Indian English communication.

    Curated by FutureBeeAI, this 30 hours dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade English speech models that understand and respond to authentic Indian accents and dialects.

    Speech Data

    The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Indian English. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 60 verified native Indian English speakers from FutureBeeAI’s contributor community.
    Regions: Representing various provinces of India to ensure dialectal diversity and demographic balance.
    Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
    High transcription accuracy, achieved through double QA pass, average WER < 5%

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.

    Usage and Applications

    This dataset is a versatile resource for multiple English speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for Indian English.
    Voice Assistants: Build smart assistants capable of understanding natural Indian conversations.
    <div style="margin-top:10px; margin-bottom: 10px; padding-left: 30px; display: flex; gap: 16px; align-items:

  5. F

    Indian English Retail Scripted Monologue Speech Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). Indian English Retail Scripted Monologue Speech Dataset [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/retail-scripted-speech-monologues-english-india
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    India
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Indian English Scripted Monologue Speech Dataset for the Retail & E-commerce domain. This dataset is built to accelerate the development of English language speech technologies especially for use in retail-focused automatic speech recognition (ASR), natural language processing (NLP), voicebots, and conversational AI applications.

    Speech Data

    This training dataset includes 6,000+ high-quality scripted audio recordings in Indian English, created to reflect real-world scenarios in the Retail & E-commerce sector. These prompts are tailored to improve the accuracy and robustness of customer-facing speech technologies.

    Participant Diversity
    Speakers: 60 native English speakers from across India
    Geographic Coverage: Multiple India regions to ensure dialect and accent diversity
    Demographics: Participants aged 18 to 70, with a 60:40 male-to-female distribution
    Recording Details
    Nature of Recording: Scripted monologue-style speech prompts
    Duration: Each recording spans 5 to 30 seconds
    Audio Format: WAV format, mono channel, 16-bit depth, and 8kHz / 16kHz sample rates
    Environment: Recorded in quiet conditions, free from background noise and echo

    Topic Diversity

    This dataset includes a comprehensive set of retail-specific topics to ensure wide linguistic coverage for AI training:

    Customer Service Interactions
    Order Placement and Payment Processes
    Product and Service Inquiries
    Technical Support Queries
    General Information and Guidance
    Promotional and Sales Announcements
    Domain-Specific Service Statements

    Contextual Enrichment

    To increase training utility, prompts include contextual data such as:

    Region-Specific Names: Common India male and female names in diverse formats
    Addresses: Localized address variations spoken naturally
    Dates & Times: Realistic phrasing in delivery, promotions, and return policies
    Product References: Real-world product names, brands, and categories
    Numerical Data: Spoken numbers and prices used in transactions and offers
    Order IDs & Tracking Numbers: Common references in customer service calls

    These additions help your models learn to recognize structured and unstructured retail-related speech.

    Transcription

    Every audio file is paired with a verbatim transcription, ensuring consistency and alignment for model training.

    Content: Exact scripted prompts as spoken by the participant
    Format: Provided in plain text (.TXT) format with filenames matching the associated audio
    Quality Assurance: All transcripts are verified for accuracy by native English transcribers

    Metadata

    Detailed metadata is included to support filtering, analysis, and model evaluation:

    <span

  6. Number of Indian and English language internet users in India 2011-2021

    • statista.com
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista, Number of Indian and English language internet users in India 2011-2021 [Dataset]. https://www.statista.com/statistics/718420/internet-user-base-by-language-india/
    Explore at:
    Dataset authored and provided by
    Statistahttp://statista.com/
    Area covered
    India
    Description

    This statistic displays the number of Indian and English language internet users across India from 2011 to 2021. In 2016, the number of English internet users amounted to about *** million and was projected to increase to *** million in 2021. For Indian language users, this number was about *** million users in 2016, and was projected to reach *** million in 2021.

  7. F

    Indian English Call Center Data for Realestate AI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). Indian English Call Center Data for Realestate AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/realestate-call-center-conversation-english-india
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    This Indian English Call Center Speech Dataset for the Real Estate industry is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for English -speaking Real Estate customers. With over 30 hours of unscripted, real-world audio, this dataset captures authentic conversations between customers and real estate agents ideal for building robust ASR models.

    Curated by FutureBeeAI, this dataset equips voice AI developers, real estate tech platforms, and NLP researchers with the data needed to create high-accuracy, production-ready models for property-focused use cases.

    Speech Data

    The dataset features 30 hours of dual-channel call center recordings between native Indian English speakers. Captured in realistic real estate consultation and support contexts, these conversations span a wide array of property-related topics from inquiries to investment advice offering deep domain coverage for AI model development.

    Participant Diversity:
    Speakers: 60 native Indian English speakers from our verified contributor community.
    Regions: Representing different provinces across India to ensure accent and dialect variation.
    Participant Profile: Balanced gender mix (60% male, 40% female) and age range from 18 to 70.
    Recording Details:
    Conversation Nature: Naturally flowing, unscripted agent-customer discussions.
    Call Duration: Average 5–15 minutes per call.
    Audio Format: Stereo WAV, 16-bit, recorded at 8kHz and 16kHz.
    Recording Environment: Captured in noise-free and echo-free conditions.

    Topic Diversity

    This speech corpus includes both inbound and outbound calls, featuring positive, neutral, and negative outcomes across a wide range of real estate scenarios.

    Inbound Calls:
    Property Inquiries
    Rental Availability
    Renovation Consultation
    Property Features & Amenities
    Investment Property Evaluation
    Ownership History & Legal Info, and more
    Outbound Calls:
    New Listing Notifications
    Post-Purchase Follow-ups
    Property Recommendations
    Value Updates
    Customer Satisfaction Surveys, and others

    Such domain-rich variety ensures model generalization across common real estate support conversations.

    Transcription

    All recordings are accompanied by precise, manually verified transcriptions in JSON format.

    Transcription Includes:
    Speaker-Segmented Dialogues
    Time-coded Segments
    Non-speech Tags (e.g., background noise, pauses)
    High transcription accuracy with word error rate below 5% via dual-layer human review.

    These transcriptions streamline ASR and NLP development for English real estate voice applications.

    Metadata

    Detailed metadata accompanies each participant and conversation:

    Participant Metadata: ID, age, gender, location, accent, and dialect.
    Conversation Metadata: Topic, call type, sentiment, sample rate, and technical details.

    This enables smart filtering, dialect-focused model training, and structured dataset exploration.

    Usage and Applications

    This dataset is ideal for voice AI and NLP systems built for the real estate sector:

    <div style="margin-top:10px; margin-bottom: 10px; padding-left: 30px; display: flex; gap: 16px; align-items:

  8. E

    Indian English Speech Data by Mobile Phone - 1,012 Hours

    • catalog.elda.org
    Updated Oct 6, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2022). Indian English Speech Data by Mobile Phone - 1,012 Hours [Dataset]. https://catalog.elda.org/en-us/repository/browse/ELRA-S0456/
    Explore at:
    Dataset updated
    Oct 6, 2022
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalog.elda.org/static/from_media/metashare/licences/ELRA_VAR.pdfhttps://catalog.elda.org/static/from_media/metashare/licences/ELRA_VAR.pdf

    Area covered
    India
    Description

    Indian English audio data captured by mobile phones, 1,012 hours in total, recorded by 2,100 Indian native speakers. The recorded text is designed by linguistic experts, covering generic, interactive, on-board, home and other categories. The text has been proofread manually with high accuracy; this data set can be used for automatic speech recognition, machine translation, and voiceprint recognition.Format:16kHz, 16bit, uncompressed wav, mono channelRecording environment:quiet indoor environment, low background noise, without echoRecording content (read speech):generic category; human-machine interaction category; smart home command and control category; in-car command and control category; numbersDemographics:2,100 speakers totally, with 52% males and 48% females; and 81% speakers of all are in the age group of 18-25,18% speakers of all are in the age group of 26-45, 1% speakers of all are in the age group of 46-60Device:Android mobile phone, iPhoneLanguage:India EnglishApplication scenario:speech recognition, voiceprint recognition

  9. Most common languages spoken in India 2011

    • statista.com
    Updated Nov 28, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista (2025). Most common languages spoken in India 2011 [Dataset]. https://www.statista.com/statistics/616508/most-common-languages-india/
    Explore at:
    Dataset updated
    Nov 28, 2025
    Dataset authored and provided by
    Statistahttp://statista.com/
    Time period covered
    2011
    Area covered
    India
    Description

    Hindi, with over *** million native speakers was the most spoken language across Indian homes, followed by Bengali with ** million speakers, as of 2011 census data. English native speakers accounted for about *** thousand during the measured time period. The colonial rule in India One of the most remarkable and widespread legacies that the British colonial rule left behind was the English language. Before independence, the English language was the solely used for higher education and in government and administrative processes. Post-independence, however, and till today, Hindi was claimed as the language with official government patronage. This lead to resistance from the southern states of India, where Hindi did not have prominence. Consequently, the Official Languages Act of 1963, was enacted by the parliament, which ensured the continued use of English for official purposes in conjunction with Hindi. Multi-linguistic cultures India has approximately ** major languages that are written in about ** different scripts. While the country’s official languages are both, English and Hindi, Hindi remains the most preferred language used online especially in the northern rural areas. The use of English is becoming increasingly popular in the urban areas. In addition, almost every state in India has its own official language that is studied in primary and secondary school as an obligatory second language. Among the most prominent are Bengali, Marathi, and Telugu.

  10. 300 Hours English(India) Speech Dataset - Conversational Dialogue by...

    • m.nexdata.ai
    • nexdata.ai
    Updated May 22, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Nexdata (2025). 300 Hours English(India) Speech Dataset - Conversational Dialogue by Smartphone [Dataset]. https://m.nexdata.ai/datasets/speechrecog/1519?source=huggingface
    Explore at:
    Dataset updated
    May 22, 2025
    Dataset authored and provided by
    Nexdata
    Area covered
    India
    Variables measured
    Format, Country, Speaker, Language, Accuracy rate, Content category, Recording device, Recording condition, Language(Region) Code, Features of annotation
    Description

    This dataset provides 300 hours of Indian English conversational speech collected via smartphones from 390 native speakers. Dialogues based on given topics. Transcribed with text content, timestamp, speaker's ID, gender and other attributes. Our dataset was collected from extensive and diversify speakers(390 native speakers), geographicly speaking, enhancing model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring the maintenance of user privacy and legal rights throughout the data collection, storage, and usage processes, our datasets are all GDPR, CCPA, PIPL complied.

  11. F

    Indian English Call Center Data for Delivery & Logistics AI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). Indian English Call Center Data for Delivery & Logistics AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/delivery-call-center-conversation-english-india
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    This Indian English Call Center Speech Dataset for the Delivery and Logistics industry is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for English-speaking customers. With over 30 hours of real-world, unscripted call center audio, this dataset captures authentic delivery-related conversations essential for training high-performance ASR models.

    Curated by FutureBeeAI, this dataset empowers AI teams, logistics tech providers, and NLP researchers to build accurate, production-ready models for customer support automation in delivery and logistics.

    Speech Data

    The dataset contains 30 hours of dual-channel call center recordings between native Indian English speakers. Captured across various delivery and logistics service scenarios, these conversations cover everything from order tracking to missed delivery resolutions offering a rich, real-world training base for AI models.

    Participant Diversity:
    Speakers: 60 native Indian English speakers from our verified contributor pool.
    Regions: Multiple provinces of India for accent and dialect diversity.
    Participant Profile: Balanced gender distribution (60% male, 40% female) with ages ranging from 18 to 70.
    Recording Details:
    Conversation Nature: Naturally flowing, unscripted customer-agent dialogues.
    Call Duration: 5 to 15 minutes on average.
    Audio Format: Stereo WAV, 16-bit depth, recorded at 8kHz and 16kHz.
    Recording Environment: Captured in clean, noise-free, echo-free conditions.

    Topic Diversity

    This speech corpus includes both inbound and outbound delivery-related conversations, covering varied outcomes (positive, negative, neutral) to train adaptable voice models.

    Inbound Calls:
    Order Tracking
    Delivery Complaints
    Undeliverable Addresses
    Return Process Enquiries
    Delivery Method Selection
    Order Modifications, and more
    Outbound Calls:
    Delivery Confirmations
    Subscription Offer Calls
    Incorrect Address Follow-ups
    Missed Delivery Notifications
    Delivery Feedback Surveys
    Out-of-Stock Alerts, and others

    This comprehensive coverage reflects real-world logistics workflows, helping voice AI systems interpret context and intent with precision.

    Transcription

    All recordings come with high-quality, human-generated verbatim transcriptions in JSON format.

    Transcription Includes:
    Speaker-Segmented Dialogues
    Time-coded Segments
    Non-speech Tags (e.g., pauses, noise)
    High transcription accuracy with word error rate under 5% via dual-layer quality checks.

    These transcriptions support fast, reliable model development for English voice AI applications in the delivery sector.

    Metadata

    Detailed metadata is included for each participant and conversation:

    Participant Metadata: ID, age, gender, region, accent, dialect.
    Conversation Metadata: Topic, call type, sentiment, sample rate, and technical attributes.

    This metadata aids in training specialized models, filtering demographics, and running advanced analytics.

    Usage and Applications

    <p

  12. Nexdata | English(India) Spontaneous Dialogue Smartphone speech dataset |...

    • datarade.ai
    Updated Nov 12, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Nexdata (2025). Nexdata | English(India) Spontaneous Dialogue Smartphone speech dataset | 562 Hours [Dataset]. https://datarade.ai/data-products/nexdata-english-india-spontaneous-dialogue-smartphone-spee-nexdata
    Explore at:
    .bin, .json, .xml, .csv, .xls, .sql, .txtAvailable download formats
    Dataset updated
    Nov 12, 2025
    Dataset authored and provided by
    Nexdata
    Area covered
    India
    Description

    English(India) Spontaneous Dialogue Smartphone speech dataset, collected from dialogues based on given topics. Transcribed with text content, timestamp, speaker's ID, gender and other attributes. Our dataset was collected from extensive and diversify speakers(390 native speakers), geographicly speaking, enhancing model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring the maintenance of user privacy and legal rights throughout the data collection, storage, and usage processes, our datasets are all GDPR, CCPA, PIPL complied.

    Format

    16 kHz, 16 bit, uncompressed wav, mono channel;

    Content category

    Dialogue based on given topics

    Recording condition

    Low background noise (indoor)

    Recording device

    Android smartphone, iPhone

    Country

    India(IN)

    Language(Region) Code

    en-IN

    Language

    English

    Speaker

    734 native speakers in total

    Features of annotation

    Transcription text, timestamp, speaker ID, gender, noise

    Accuracy rate

    Word Correct rate(WCR) 98%

  13. 1,012 Hours - Indian English Speech Data by Mobile Phone

    • nexdata.ai
    Updated Oct 5, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Nexdata (2023). 1,012 Hours - Indian English Speech Data by Mobile Phone [Dataset]. https://www.nexdata.ai/datasets/speechrecog/940?source=Huggingface
    Explore at:
    Dataset updated
    Oct 5, 2023
    Dataset authored and provided by
    Nexdata
    Area covered
    India
    Variables measured
    Format, Country, Speaker, Language, Accuracy Rate, Content category, Recording device, Recording condition, Language(Region) Code, Features of annotation
    Description

    English(India) Scripted Monologue Smartphone speech dataset, collected from monologue based on given scripts, covering generic domain, human-machine interaction, smart home command and in-car command, numbers and other domains. Transcribed with text content and other attributes. Our dataset was collected from extensive and diversify speakers( 2,100 Indian native speakers), geographicly speaking, enhancing model performance in real and complex tasks.Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring the maintenance of user privacy and legal rights throughout the data collection, storage, and usage processes, our datasets are all GDPR, CCPA, PIPL complied.

  14. p

    Spoken English Locations Data for India

    • poidata.io
    csv, json
    Updated Nov 27, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Business Data Provider (2025). Spoken English Locations Data for India [Dataset]. https://poidata.io/brand-report/spoken-english/india
    Explore at:
    csv, jsonAvailable download formats
    Dataset updated
    Nov 27, 2025
    Dataset authored and provided by
    Business Data Provider
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2025
    Area covered
    India
    Variables measured
    Website URL, Phone Number, Review Count, Business Name, Email Address, Business Hours, Customer Rating, Business Address, Brand Affiliation, Geographic Coordinates
    Description

    Comprehensive dataset containing 78 verified Spoken English locations in India with complete contact information, ratings, reviews, and location data.

  15. Common languages used for web content 2025, by share of websites

    • statista.com
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista, Common languages used for web content 2025, by share of websites [Dataset]. https://www.statista.com/statistics/262946/most-common-languages-on-the-internet/
    Explore at:
    Dataset authored and provided by
    Statistahttp://statista.com/
    Time period covered
    Oct 2025
    Area covered
    Worldwide
    Description

    As of October 2025, English was the dominant language for online content, used by nearly half of all websites worldwide. Spanish ranked second, accounting for around 6 percent of web content, followed by German with 5.9 percent. English as the leading online language United States and India, the countries with the most internet users after China, are also the world's biggest English-speaking markets. The internet user base in both countries combined, as of January 2023, was over a billion individuals. This has led to most of the online information being created in English. Consequently, even those who are not native speakers may use it for convenience. Global internet usage by regions As of October 2024, the number of internet users worldwide was 5.52 billion. In the same period, Northern Europe and North America were leading in terms of internet penetration rates worldwide, with around 97 percent of its populations accessing the internet.

  16. The most spoken languages worldwide 2025

    • statista.com
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista, The most spoken languages worldwide 2025 [Dataset]. https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/
    Explore at:
    Dataset authored and provided by
    Statistahttp://statista.com/
    Time period covered
    2025
    Area covered
    World
    Description

    In 2025, there were around 1.53 billion people worldwide who spoke English either natively or as a second language, slightly more than the 1.18 billion Mandarin Chinese speakers at the time of survey. Hindi and Spanish accounted for the third and fourth most widespread languages that year. Languages in the United States The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation, and other official pronouncements. The United States is a land of immigration, and the languages spoken in the United States vary as a result of the multicultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which over than 43 million people spoke at home in 2023. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese),1.8 million Tagalog speakers, and 1.57 million Vietnamese speakers counted in the United States that year. Different languages at home The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage of population speaking a language other than English is California. About 45 percent of its population was speaking a language other than English at home in 2023.

  17. F

    Indian English Call Center Data for Travel AI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). Indian English Call Center Data for Travel AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/travel-call-center-conversation-english-india
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    India
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    This Indian English Call Center Speech Dataset for the Travel industry is purpose-built to power the next generation of voice AI applications for travel booking, customer support, and itinerary assistance. With over 30 hours of unscripted, real-world conversations, the dataset enables the development of highly accurate speech recognition and natural language understanding models tailored for English -speaking travelers.

    Created by FutureBeeAI, this dataset supports researchers, data scientists, and conversational AI teams in building voice technologies for airlines, travel portals, and hospitality platforms.

    Speech Data

    The dataset includes 30 hours of dual-channel audio recordings between native Indian English speakers engaged in real travel-related customer service conversations. These audio files reflect a wide variety of topics, accents, and scenarios found across the travel and tourism industry.

    Participant Diversity:
    Speakers: 60 native Indian English contributors from our verified pool.
    Regions: Covering multiple India provinces to capture accent and dialectal variation.
    Participant Profile: Balanced representation of age (18–70) and gender (60% male, 40% female).
    Recording Details:
    Conversation Nature: Naturally flowing, spontaneous customer-agent calls.
    Call Duration: Between 5 and 15 minutes per session.
    Audio Format: Stereo WAV, 16-bit depth, at 8kHz and 16kHz.
    Recording Environment: Captured in controlled, noise-free, echo-free settings.

    Topic Diversity

    Inbound and outbound conversations span a wide range of real-world travel support situations with varied outcomes (positive, neutral, negative).

    Inbound Calls:
    Booking Assistance
    Destination Information
    Flight Delays or Cancellations
    Support for Disabled Passengers
    Health and Safety Travel Inquiries
    Lost or Delayed Luggage, and more
    Outbound Calls:
    Promotional Travel Offers
    Customer Feedback Surveys
    Booking Confirmations
    Flight Rescheduling Alerts
    Visa Expiry Notifications, and others

    These scenarios help models understand and respond to diverse traveler needs in real-time.

    Transcription

    Each call is accompanied by manually curated, high-accuracy transcriptions in JSON format.

    Transcription Includes:
    Speaker-Segmented Dialogues
    Time-Stamped Segments
    Non-speech Markers (e.g., pauses, coughs)
    High transcription accuracy by dual-layered transcription review ensures word error rate under 5%.

    Metadata

    Extensive metadata enriches each call and speaker for better filtering and AI training:

    Participant Metadata: ID, age, gender, region, accent, and dialect.
    Conversation Metadata: Topic, domain, call type, sentiment, and audio specs.

    Usage and Applications

    This dataset is ideal for a variety of AI use cases in the travel and tourism space:

    ASR Systems: Train English speech-to-text engines for travel platforms.
    <div style="margin-top:10px; margin-bottom: 10px; padding-left:

  18. p

    Spoken English Classes Locations Data for India

    • poidata.io
    csv, json
    Updated Nov 15, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Business Data Provider (2025). Spoken English Classes Locations Data for India [Dataset]. https://poidata.io/brand-report/spoken-english-classes/india
    Explore at:
    json, csvAvailable download formats
    Dataset updated
    Nov 15, 2025
    Dataset authored and provided by
    Business Data Provider
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2025
    Area covered
    India
    Variables measured
    Website URL, Phone Number, Review Count, Business Name, Email Address, Business Hours, Customer Rating, Business Address, Brand Affiliation, Geographic Coordinates
    Description

    Comprehensive dataset containing 55 verified Spoken English Classes locations in India with complete contact information, ratings, reviews, and location data.

  19. m

    Indian Sign Language Video and Text dataset for sentences (ISLVT)

    • data.mendeley.com
    Updated Mar 28, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Prachi Waghmare (2024). Indian Sign Language Video and Text dataset for sentences (ISLVT) [Dataset]. http://doi.org/10.17632/98mzk82wbb.1
    Explore at:
    Dataset updated
    Mar 28, 2024
    Authors
    Prachi Waghmare
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    India
    Description

    This video and gloss-based dataset has been meticulously crafted to enhance the precision and resilience of ISL (Indian Sign Language) gesture recognition and generation systems. Our goal in sharing this dataset is to contribute to the research community, providing a valuable resource for fellow researchers to explore and innovate in the realm of sign language recognition and generation.Overview of the Dataset: Comprising a diverse array of ISL gesture videos and gloss datasets. The term "gloss" in this context often refers to a written or spoken description of the meaning of a sign, allowing for the representation of sign language in a written form. The dataset includes information about the corresponding spoken or written language and the gloss for each sign. Key components of a sign language gloss dataset include ISL grammar that follows a layered approach, incorporating specific spatial indices for tense and a lexicon with compounds. It follows a unique word order based on noun, verb, object, adjective, or part of a question. Marathi sign language follows the subject-object-verb (SOV) form, facilitating comprehension and adaptation to regional languages. This Marathi sign language gloss aims to become a medium for everyday communication among deaf individuals. This dataset reflects a careful curation process, simulating real-world scenarios. The original videos showcase a variety of gestures performed by a professional signer capturing a broad spectrum of sign language expressions. Incorporating Realism with green screen with controlled lighting conditions. All videos within this dataset adhere to pixels, ensuring uniformity for data presentation and facilitating streamlined pre-processing and model development stored in a format compatible with various machine and Deep learning frameworks, these videos seamlessly integrate into the research pipeline

  20. F

    Indian English Call Center Data for Healthcare AI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). Indian English Call Center Data for Healthcare AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/healthcare-call-center-conversation-english-india
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    This Indian English Call Center Speech Dataset for the Healthcare industry is purpose-built to accelerate the development of English speech recognition, spoken language understanding, and conversational AI systems. With 30 Hours of unscripted, real-world conversations, it delivers the linguistic and contextual depth needed to build high-performance ASR models for medical and wellness-related customer service.

    Created by FutureBeeAI, this dataset empowers voice AI teams, NLP researchers, and data scientists to develop domain-specific models for hospitals, clinics, insurance providers, and telemedicine platforms.

    Speech Data

    The dataset features 30 Hours of dual-channel call center conversations between native Indian English speakers. These recordings cover a variety of healthcare support topics, enabling the development of speech technologies that are contextually aware and linguistically rich.

    Participant Diversity:
    Speakers: 60 verified native Indian English speakers from our contributor community.
    Regions: Diverse provinces across India to ensure broad dialectal representation.
    Participant Profile: Age range of 18–70 with a gender mix of 60% male and 40% female.
    RecordingDetails:
    Conversation Nature: Naturally flowing, unscripted conversations.
    Call Duration: Each session ranges between 5 to 15 minutes.
    Audio Format: WAV format, stereo, 16-bit depth at 8kHz and 16kHz sample rates.
    Recording Environment: Captured in clear conditions without background noise or echo.

    Topic Diversity

    The dataset spans inbound and outbound calls, capturing a broad range of healthcare-specific interactions and sentiment types (positive, neutral, negative).

    Inbound Calls:
    Appointment Scheduling
    New Patient Registration
    Surgical Consultation
    Dietary Advice and Consultations
    Insurance Coverage Inquiries
    Follow-up Treatment Requests, and more
    OutboundCalls:
    Appointment Reminders
    Preventive Care Campaigns
    Test Results & Lab Reports
    Health Risk Assessment Calls
    Vaccination Updates
    Wellness Subscription Outreach, and more

    These real-world interactions help build speech models that understand healthcare domain nuances and user intent.

    Transcription

    Every audio file is accompanied by high-quality, manually created transcriptions in JSON format.

    Transcription Includes:
    Speaker-identified Dialogues
    Time-coded Segments
    Non-speech Annotations (e.g., silence, cough)
    High transcription accuracy with word error rate is below 5%, backed by dual-layer QA checks.

    Metadata

    Each conversation and speaker includes detailed metadata to support fine-tuned training and analysis.

    Participant Metadata: ID, gender, age, region, accent, and dialect.
    Conversation Metadata: Topic, sentiment, call type, sample rate, and technical specs.

    Usage and Applications

    This dataset can be used across a range of healthcare and voice AI use cases:

    <b

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Statista (2019). Share of English speakers by region India 2019 [Dataset]. https://www.statista.com/statistics/1007578/india-share-of-english-speakers-by-region/
Organization logo

Share of English speakers by region India 2019

Explore at:
4 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
May 14, 2019
Dataset authored and provided by
Statistahttp://statista.com/
Time period covered
2019
Area covered
India
Description

This statistic represents results of a survey about the share of English speakers across India in 2019, by region. During the surveyed time period, the share of respondents who spoke English in urban areas was around ** percent while this was about ***** percent for rural respondents.

Search
Clear search
Close search
Google apps
Main menu