52 datasets found

F
English-Tamil Parallel Corpus for the BFSI Domain
futurebeeai.com
wav
Updated Aug 1, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). English-Tamil Parallel Corpus for the BFSI Domain [Dataset]. https://www.futurebeeai.com/dataset/parallel-corpora/tamil-english-translated-parallel-corpus-for-bfsi-domain
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
Welcome to the English-Tamil Bilingual Parallel Corpora dataset for the Banking, Financial Services, and Insurance(BFSI) domain. This meticulously curated dataset offers a rich collection of bilingual sentence pairs translated between English and Tamil. It serves as a valuable resource for developing domain-specific machine translation systems, language models, and NLP applications within the BFSI sector.
Dataset Content
•Volume and Diversity
•
Extensive Coverage: Contains over 50,000 bilingual sentence pairs, ideal for a wide range of language processing tasks.

•
Translator Diversity: Created with the help of 200+ native Tamil translators, ensuring varied linguistic styles, tone, and regional expressions.

•Sentence Diversity
•
Word Count: Sentences range from 7 to 25 words, suitable for model training and evaluation.

•
Syntactic Variety: Includes simple, compound, and complex sentence structures.

•
Grammatical Forms: Interrogative (questions) and imperative (commands), Affirmative and negative statements, Active and passive voice constructions.

•
Figurative Language: Incorporates idioms, metaphors, and colloquial expressions relevant to real-world BFSI communications.

•
Discourse Features: Includes logical connectors and transitional phrases for coherent, natural language flow.

•
Cross Translation: Supports bi-directional translation with content translated both from English to Tamil and Tamil to English.

Domain-Specific Content
•
Specialized Terminology: Covers technical vocabulary from banking, insurance, financial services, compliance, investment, and fintech.

•
Authentic Industry Language: Captures real-world usage, including expressions from customer service conversations, financial reporting, and policy documentation.

•
Contextual Coverage: Draws content from scenarios such as:

•Banking transactions and statements
•Risk management reports
•Compliance policies
•Claims processing and customer support dialogs
•
Cross-Domain Elements: Includes supporting vocabulary from general business, legal, and technology domains, relevant to modern BFSI operations.

Format and Structure
•
File Formats: Delivered in Excel format by default, with easy conversion to JSON, TMX, XML, XLIFF, XLS, and other widely supported industry formats.

•
Dataset Structure: Serial Number, Unique ID, Source Sentence and Source Word Count, Target Sentence and Target Word Count

Usage and Applications
•
Machine Translation and Localization: Supports training of accurate translation models and localization systems specific to the BFSI sector.

•
NLP Systems: Useful for enhancing tools such as grammar checkers, spell checkers, predictive text, and speech/text understanding engines.

•
Large Language Models (LLMs): Enables fine-tuning and bilingual enhancement of LLMs for:

•Financial content generation
•Summarization of market reports
•Automated responses to customer service and
F
Tamil Call Center Data for Realestate AI
futurebeeai.com
wav
Updated Aug 1, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). Tamil Call Center Data for Realestate AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/realestate-call-center-conversation-tamil-india
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
This Tamil Call Center Speech Dataset for the Real Estate industry is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for Tamil -speaking Real Estate customers. With over 30 hours of unscripted, real-world audio, this dataset captures authentic conversations between customers and real estate agents ideal for building robust ASR models.
Curated by FutureBeeAI, this dataset equips voice AI developers, real estate tech platforms, and NLP researchers with the data needed to create high-accuracy, production-ready models for property-focused use cases.
Speech Data
The dataset features 30 hours of dual-channel call center recordings between native Tamil speakers. Captured in realistic real estate consultation and support contexts, these conversations span a wide array of property-related topics from inquiries to investment advice offering deep domain coverage for AI model development.
•Participant Diversity:
•
Speakers: 60 native Tamil speakers from our verified contributor community.

•
Regions: Representing different regions across Tamil Nadu to ensure accent and dialect variation.

•
Participant Profile: Balanced gender mix (60% male, 40% female) and age range from 18 to 70.

•Recording Details:
•
Conversation Nature: Naturally flowing, unscripted agent-customer discussions.

•
Call Duration: Average 5–15 minutes per call.

•
Audio Format: Stereo WAV, 16-bit, recorded at 8kHz and 16kHz.

•
Recording Environment: Captured in noise-free and echo-free conditions.

Topic Diversity
This speech corpus includes both inbound and outbound calls, featuring positive, neutral, and negative outcomes across a wide range of real estate scenarios.
•Inbound Calls:
•Property Inquiries
•Rental Availability
•Renovation Consultation
•Property Features & Amenities
•Investment Property Evaluation
•Ownership History & Legal Info, and more
•Outbound Calls:
•New Listing Notifications
•Post-Purchase Follow-ups
•Property Recommendations
•Value Updates
•Customer Satisfaction Surveys, and others
Such domain-rich variety ensures model generalization across common real estate support conversations.
Transcription
All recordings are accompanied by precise, manually verified transcriptions in JSON format.
•Transcription Includes:
•Speaker-Segmented Dialogues
•Time-coded Segments
•Non-speech Tags (e.g., background noise, pauses)
•High transcription accuracy with word error rate below 5% via dual-layer human review.
These transcriptions streamline ASR and NLP development for Tamil real estate voice applications.
Metadata
Detailed metadata accompanies each participant and conversation:
•
Participant Metadata: ID, age, gender, location, accent, and dialect.

•
Conversation Metadata: Topic, call type, sentiment, sample rate, and technical details.

This enables smart filtering, dialect-focused model training, and structured dataset exploration.
Usage and Applications
This dataset is ideal for voice AI and NLP systems built for the real estate sector:
<span
h
Tamil-Finetuning-data
huggingface.co
Updated Feb 20, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Thrisha Sivasakthi (2025). Tamil-Finetuning-data [Dataset]. https://huggingface.co/datasets/ThrishaSivasakthi/Tamil-Finetuning-data
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 20, 2025
Authors
Thrisha Sivasakthi
Description
Dataset Card for Dataset Name

This dataset is designed for fine-tuning Large Language Models (LLMs) in Tamil, enabling them to understand and generate high-quality Tamil text across multiple domains. It contains 72,000 curated and generated samples, ensuring a rich linguistic diversity that improves model generalization. 🔹 Sources: Kaggle Tamil NLP, Sentiment Analysis datasets, and synthetic data. 🔹 Languages: Tamil, Tanglish (Tamil-English mix), and regional Tamil dialects. 🔹… See the full description on the dataset page: https://huggingface.co/datasets/ThrishaSivasakthi/Tamil-Finetuning-data.
m
MADTRAS (Dataset for Aspect-based Sentiment Analysis of Movie Reviews in...
data.mendeley.com
Updated Apr 14, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Arunmozhi Mourougappane (2025). MADTRAS (Dataset for Aspect-based Sentiment Analysis of Movie Reviews in Tamil) [Dataset]. http://doi.org/10.17632/p59cfx4vx6.2
Explore at:
Unique identifier
https://doi.org/10.17632/p59cfx4vx6.2
Dataset updated
Apr 14, 2025
Authors
Arunmozhi Mourougappane
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset is a carefully selected set of Tamil film reviews with the goal of advancing NLP research in the areas of text classification, sentiment analysis, and aspect-based sentiment analysis. We have invited users to review twenty-five films using a Google form. Additional reviews were taken from websites such as IMDb and YouTube. From the list of selected aspects, we also made sure that the review collection was based on the presence of at least one target aspect, including cinematography, acting, screenplay, story, director, songs, background music, and editing. About 1,390 reviews total, tagged for positive as well as negative views across eight different categories, make up the dataset.
Claim Detection and Matching for Indian Languages
zenodo.org
data.niaid.nih.gov
csv
Updated Jun 6, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ashkan Kazemi; Kiran Garimella; Devin Gaffney; Scott A. Hale; Ashkan Kazemi; Kiran Garimella; Devin Gaffney; Scott A. Hale (2021). Claim Detection and Matching for Indian Languages [Dataset]. http://doi.org/10.5281/zenodo.4890950
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.4890950
Dataset updated
Jun 6, 2021
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Ashkan Kazemi; Kiran Garimella; Devin Gaffney; Scott A. Hale; Ashkan Kazemi; Kiran Garimella; Devin Gaffney; Scott A. Hale
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
India
Description
Two datasets are included in this repository: claim matching and claim detection datasets. The collections contain data in 5 languages: Bengali, English, Hindi, Malayalam and Tamil.

The "claim detection" dataset contains textual claims from social media and fact-checking websites annotated for the "fact-check worthiness" of the claims in each message. Data points have one of the three labels of "Yes" (text contains one or more check-worthy claims), "No" and "Probably".

The "claim matching" dataset is a curated collection of pairs of textual claims from social media and fact-checking websites for the purpose of automatic and multilingual claim matching. Pairs of data have one of the four labels of "Very Similar", "Somewhat Similar", "Somewhat Dissimilar" and "Very Dissimilar".

All personally identifiable information (PII) including phone numbers, email addresses, license plate numbers and addresses have been replaced with general tags (e.g.
, etc) to protect user anonymity. A detailed explanation on the curation and annotation process is provided in our ACL 2021 paper:
Kazemi, A.; Garimella, K.; Gaffney, D.; and Hale, S. A. 2021. Claim Matching Beyond English to Scale Global Fact-Checking. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, ACL 2021.
Tamil Speech Dataset – 500 Hours Monologue Audio Corpus
nexdata.ai
Updated Jul 11, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nexdata (2025). Tamil Speech Dataset – 500 Hours Monologue Audio Corpus [Dataset]. https://www.nexdata.ai/datasets/speechrecog/1838
Explore at:
Dataset updated
Jul 11, 2025
Dataset authored and provided by
Nexdata
Variables measured
Format, Speaker, Language, Accuracy Rate, Recording device, Recording condition, Features of annotation
Description
This dataset includes 500 hours of scripted Tamil monologue speech collected using smartphones. Each sample is transcribed with text content and metadata such as speaker ID, gender, and age. The dataset features diverse speakers from various regions, making it highly representative of real-world Tamil language use and suitable for automatic speech recognition (ASR), text-to-speech (TTS), voice activity detection (VAD), and natural language processing (NLP) tasks. Validated by leading AI companies, the dataset is designed to enhance model robustness in multilingual environments and low-resource languages. All data was collected in full compliance with global privacy regulations including GDPR, CCPA, and PIPL, ensuring ethical sourcing and responsible AI development.
F
Tamil Human-Human Chat Dataset for Conversational AI & NLP
futurebeeai.com
wav
Updated Aug 1, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). Tamil Human-Human Chat Dataset for Conversational AI & NLP [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/tamil-general-domain-conversation-text-dataset
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The Tamil General Domain Chat Dataset is a high-quality, text-based dataset designed to train and evaluate conversational AI, NLP models, and smart assistants in real-world Tamil usage. Collected through FutureBeeAI’s trusted crowd community, this dataset reflects natural, native-level Tamil conversations covering a broad spectrum of everyday topics.
Conversational Text Data
This dataset includes over 10000 chat transcripts, each featuring free-flowing dialogue between two native Tamil speakers. The conversations are spontaneous, context-rich, and mimic informal, real-life texting behavior.
•
Words per Chat: 300–700

•
Turns per Chat: Up to 50 dialogue turns

•
Contributors: 150 native Tamil speakers from the FutureBeeAI Crowd Community

•
Format: TXT, DOCS, JSON or CSV (customizable)

•
Structure: Each record contains the full chat, topic tag, and metadata block

Diversity and Domain Coverage
Conversations span a wide variety of general-domain topics to ensure comprehensive model exposure:
•Music, books, and movies
•Health and wellness
•Children and parenting
•Family life and relationships
•Food and cooking
•Education and studying
•Festivals and traditions
•Environment and daily life
•Internet and tech usage
•Childhood memories and casual chatting
This diversity ensures the dataset is useful across multiple NLP and language understanding applications.
Linguistic Authenticity
Chats reflect informal, native-level Tamil usage with:
•Colloquial expressions and local dialect influence
•Domain-relevant terminology
•Language-specific grammar, phrasing, and sentence flow
•Inclusion of realistic details such as names, phone numbers, email addresses, locations, dates, times, local currencies, and culturally grounded references
•Representation of different writing styles and input quirks to ensure training data realism
Metadata
Every chat instance is accompanied by structured metadata, which includes:
•Participant Age
•Gender
•Country/Region
•Chat Domain
•Chat Topic
•Dialect
This metadata supports model filtering, demographic-specific evaluation, and more controlled fine-tuning workflows.
Data Quality Assurance
All chat records pass through a rigorous QA process to maintain consistency and accuracy:
•Manual review for content completeness
•Format checks for chat turns and metadata
•Linguistic verification by native speakers
•Removal of inappropriate or unusable samples
This ensures a clean, reliable dataset ready for high-performance AI model training.
Applications
This dataset is ideal for training and evaluating a wide range of text-based AI systems:
•Conversational AI / Chatbots
•Smart assistants and voicebots
<div
e
EnTam: An English-Tamil Parallel Corpus (EnTam v2.0) - Dataset - B2FIND
b2find.eudat.eu
Updated Nov 18, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2022). EnTam: An English-Tamil Parallel Corpus (EnTam v2.0) - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/9eb44325-3708-574f-a0da-4e8ccff2aa66
Explore at:
Dataset updated
Nov 18, 2022
License
Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0)https://creativecommons.org/licenses/by-nc-sa/3.0/
License information was derived automatically
Description
EnTam is a sentence aligned English-Tamil bilingual corpus from some of the publicly available websites that we have collected for NLP research involving Tamil. The standard set of processing has been applied on the the raw web data before the data became available in sentence aligned English-Tamil parallel corpus suitable for various NLP tasks. The parallel corpus includes texts from bible, cinema and news domains.
m
General Conversations Speech Dataset of Medical sector in Tamil
data.macgence.com
mp3
Updated Jun 2, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Macgence (2025). General Conversations Speech Dataset of Medical sector in Tamil [Dataset]. https://data.macgence.com/dataset/general-conversations-speech-dataset-of-medical-sector-in-tamil
Explore at:
mp3Available download formats
Dataset updated
Jun 2, 2025
Dataset authored and provided by
Macgence
License
https://data.macgence.com/terms-and-conditionshttps://data.macgence.com/terms-and-conditions
Time period covered
2025
Area covered
Worldwide
Variables measured
Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
Description
Access a professionally curated Tamil speech dataset focused on general conversations in the medical sector. Ideal for training voice AI, speech recognition, and NLP models in healthcare.
203 Hours Tamil Speech Dataset – Conversation & Monologue Audio
nexdata.ai
m.nexdata.ai
Updated Apr 7, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nexdata (2025). 203 Hours Tamil Speech Dataset – Conversation & Monologue Audio [Dataset]. https://www.nexdata.ai/datasets/speechrecog/1390
Explore at:
Dataset updated
Apr 7, 2025
Dataset authored and provided by
Nexdata
Variables measured
Format, Country, Language, Accuracy Rate, Language(Region) Code, Recording environment, Features of annotation
Description
203 hours of real-world Tamil speech data featuring both casual conversations and scripted monologues. All audio was recorded from native Tamil speakers across various regions, reflecting real-world linguistic and acoustic diversity. Each sample is manually transcribed and annotated with speaker ID, gender, and other metadata, making it highly suitable for automatic speech recognition (ASR), speech synthesis (TTS), speaker identification, and natural language processing (NLP) applications. The dataset has been validated by leading AI companies and is particularly valuable for training robust AI models for underrepresented languages. All data collection, processing, and usage comply strictly with global data privacy laws including GDPR, CCPA, and PIPL, ensuring legal and ethical use.
m
Call Center Conversations Speech Dataset of E-Commerce Sector in Tamil
data.macgence.com
mp3
Updated Jun 8, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Macgence (2025). Call Center Conversations Speech Dataset of E-Commerce Sector in Tamil [Dataset]. https://data.macgence.com/dataset/call-center-conversations-speech-dataset-of-e-commerce-sector-in-tamil
Explore at:
mp3Available download formats
Dataset updated
Jun 8, 2025
Dataset authored and provided by
Macgence
License
https://data.macgence.com/terms-and-conditionshttps://data.macgence.com/terms-and-conditions
Time period covered
2025
Area covered
Worldwide
Variables measured
Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
Description
Explore our high-quality Tamil speech dataset featuring real call center conversations from the e-commerce sector. Ideal for speech recognition, NLP, and AI training applications.
h
indic-nlp
huggingface.co
Updated Jun 12, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ayush Bagarua (2025). indic-nlp [Dataset]. https://huggingface.co/datasets/ayushbagaria17/indic-nlp
Explore at:
Dataset updated
Jun 12, 2025
Authors
Ayush Bagarua
Description
L3Cube-IndicNews

L3Cube-IndicNews, is a multilingual text classification corpus aimed at curating a high-quality dataset for Indian regional languages, with a specific focus on news headlines and articles. We have centered our work on 11 prominent Indic languages, including Hindi, Bengali, Marathi, Telugu, Tamil, Gujarati, Kannada, Odia, Malayalam, Punjabi and English. Each of these news datasets comprises 10 or more classes of news articles. L3Cube-IndicNews offers 3 distinct… See the full description on the dataset page: https://huggingface.co/datasets/ayushbagaria17/indic-nlp.
m
Indian Agent to Indian Customer call center Speech Dataset in Tamil for...
data.macgence.com
mp3
Updated Mar 17, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Macgence (2024). Indian Agent to Indian Customer call center Speech Dataset in Tamil for Finance [Dataset]. https://data.macgence.com/dataset/indian-agent-to-indian-customer-call-center-speech-dataset-in-tamil-for-finance
Explore at:
mp3Available download formats
Dataset updated
Mar 17, 2024
Dataset authored and provided by
Macgence
License
https://data.macgence.com/terms-and-conditionshttps://data.macgence.com/terms-and-conditions
Time period covered
2025
Area covered
Worldwide
Variables measured
Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
Description
High-quality Tamil speech dataset featuring Indian agent-customer finance calls, ideal for ASR, NLP, and voice AI model training.
m
Indian Agent to Indian Customer call center Speech Dataset in Tamil
data.macgence.com
mp3
Updated Mar 29, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Macgence (2024). Indian Agent to Indian Customer call center Speech Dataset in Tamil [Dataset]. https://data.macgence.com/dataset/indian-agent-to-indian-customer-call-center-speech-dataset-in-tamil
Explore at:
mp3Available download formats
Dataset updated
Mar 29, 2024
Dataset authored and provided by
Macgence
License
https://data.macgence.com/terms-and-conditionshttps://data.macgence.com/terms-and-conditions
Time period covered
2025
Area covered
Worldwide
Variables measured
Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
Description
Explore Macgence's Tamil speech dataset of Indian agent-customer call center conversations—ideal for ASR, NLP, and voice AI training applications.
h
Ettuthogai_Ainkurunooru
huggingface.co
Updated Aug 31, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
தமிழ் தகவல் (2025). Ettuthogai_Ainkurunooru [Dataset]. https://huggingface.co/datasets/TamilThagaval/Ettuthogai_Ainkurunooru
Explore at:
Dataset updated
Aug 31, 2025
Dataset authored and provided by
தமிழ் தகவல்
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Dataset Card for Narrinai Poems

Dataset Summary

This dataset contains Narrinai, one of the eight anthologies (Ettuthogai) of Sangam Tamil literature.It consists of 400+ poems written by various poets, each poem categorized by number and title.
The dataset provides:

Poem Number
Title (in Tamil)
Poem Text

This can be used for NLP tasks in Tamil, such as text generation, retrieval, classification, and cultural/literary research.

Supported Tasks and… See the full description on the dataset page: https://huggingface.co/datasets/TamilThagaval/Ettuthogai_Ainkurunooru.
m
Indian Agent to Indian Customer call center Speech Dataset in Tamil for...
data.macgence.com
mp3
Updated Mar 21, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Macgence (2024). Indian Agent to Indian Customer call center Speech Dataset in Tamil for Banking [Dataset]. https://data.macgence.com/dataset/indian-agent-to-indian-customer-call-center-speech-dataset-in-tamil-for-banking
Explore at:
mp3Available download formats
Dataset updated
Mar 21, 2024
Dataset authored and provided by
Macgence
License
https://data.macgence.com/terms-and-conditionshttps://data.macgence.com/terms-and-conditions
Time period covered
2025
Area covered
Worldwide
Variables measured
Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
Description
Explore authentic Tamil call center speech data for banking, featuring Indian agents and customers. Curated by Macgence for voice AI and NLP projects.
F
Tamil Agent-Customer Chat Dataset for Healthcare Domain
futurebeeai.com
wav
Updated Aug 1, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). Tamil Agent-Customer Chat Dataset for Healthcare Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/tamil-healthcare-domain-conversation-text-dataset
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The Tamil Healthcare Chat Dataset is a rich collection of over 12,000 text-based conversations between customers and call center agents, focused on real-world healthcare interactions. Designed to reflect authentic language use and domain-specific dialogue patterns, this dataset supports the development of conversational AI, chatbots, and NLP models tailored for healthcare applications in Tamil-speaking regions.
Participant & Chat Overview
•
Participants: 200+ native Tamil speakers from the FutureBeeAI Crowd Community

•
Conversation Length: 300–700 words per chat

•
Turns per Chat: 50–150 dialogue turns across both participants

•
Chat Types: Inbound and outbound

•
Sentiment Coverage: Positive, neutral, and negative outcomes included

Topic Diversity
The dataset captures a wide spectrum of healthcare-related chat scenarios, ensuring comprehensive coverage for training robust AI systems:
•
Inbound Chats (Customer-Initiated): Appointment scheduling, new patient registration, surgery and treatment consultations, diet and lifestyle discussions, insurance claim inquiries, lab result follow-ups

•
Outbound Chats (Agent-Initiated): Appointment reminders and confirmations, health and wellness program offers, test result notifications, preventive care and vaccination reminders, subscription renewals, risk assessment and eligibility follow-ups

This variety helps simulate realistic healthcare support workflows and patient-agent dynamics.
Language Diversity & Realism
This dataset reflects the natural flow of Tamil healthcare communication and includes:
•
Authentic Naming Patterns: Tamil personal names, clinic names, and brands

•
Localized Contact Elements: Addresses, emails, phone numbers, and clinic locations in regional Tamil formats

•
Time & Currency References: Use of dates, times, numeric expressions, and currency units aligned with Tamil-speaking regions

•
Colloquial & Medical Expressions: Local slang, informal speech, and common healthcare-related terminology

These elements ensure the dataset is contextually relevant and linguistically rich for real-world use cases.
Conversational Flow & Structure
Conversations range from simple inquiries to complex advisory sessions, including:
•General inquiries
•Detailed problem-solving
•Routine status updates
•Treatment recommendations
•Support and feedback interactions
Each conversation typically includes these structural components:
•Greetings and verification
•Information gathering
•Problem definition
•Solution delivery
•Closing messages
•Follow-up and feedback (where applicable)
This structured flow mirrors actual healthcare support conversations and is ideal for training advanced dialogue systems.
Data Format & Structure
Available in JSON, CSV, and TXT formats, each conversation includes:
•Full message history with clear speaker labels
•Participant identifiers
•Metadata (e.g., topic tags, region, sentiment)
•Compatibility with common NLP and ML pipelines
Applications
<p
d
Global English Speech with Accent Conversational Dataset — Multi-Region...
datarade.ai
.wav
Updated Jul 21, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FileMarket (2025). Global English Speech with Accent Conversational Dataset — Multi-Region Validated Speech with Gender, Age & Metadata for AI & NLP Training [Dataset]. https://datarade.ai/data-products/global-english-speech-with-accent-conversational-dataset-mu-filemarket
Explore at:
.wavAvailable download formats
Dataset updated
Jul 21, 2025
Dataset authored and provided by
FileMarket
Area covered
Nicaragua, Montenegro, Cook Islands, United States Minor Outlying Islands, Tonga, Iceland, Comoros, Bangladesh, Yemen, Haiti
Description
The Global English Accent Conversational NLP Dataset is a comprehensive collection of validated English speech recordings sourced from native and non-native English speakers across key global regions. This dataset is designed for training Natural Language Processing models, conversational AI, Automatic Speech Recognition (ASR), and linguistic research, with a focus on regional accent variation.

Regions and Covered Countries with Primary Spoken Languages:

Africa: South Africa (English, Zulu, Afrikaans, Xhosa) Nigeria (English, Yoruba, Igbo, Hausa) Kenya (English, Swahili) Ghana (English, Twi, Ewe, Ga) Uganda (English, Luganda) Ethiopia (English, Amharic, Oromo)

Central & South America: Mexico (Spanish, English as a second language) Guatemala (Spanish, K'iche', English) El Salvador (Spanish, English) Costa Rica (Spanish, English in Caribbean regions) Colombia (Spanish, English in urban centers) Dominican Republic (Spanish, English in tourist zones) Brazil (Portuguese, English in urban areas) Argentina (Spanish, English among educated speakers)

Southeast Asia & South Asia: Philippines (Filipino, English) Vietnam (Vietnamese, English) Malaysia (Malay, English, Mandarin) Indonesia (Indonesian, Javanese, English) Singapore (English, Mandarin, Malay, Tamil) India (Hindi, English, Bengali, Tamil) Pakistan (Urdu, English, Punjabi)

Europe: United Kingdom (English) Ireland (English, Irish) Germany (German, English) France (French, English) Spain (Spanish, Catalan, English) Italy (Italian, English) Portugal (Portuguese, English)

Oceania: Australia (English) New Zealand (English, Māori) Fiji (English, Fijian) North America: United States (English, Spanish) Canada (English, French)

Dataset Attributes: - Conversational English with natural accent variation - Global coverage with balanced male/female speakers - Rich speaker metadata: age, gender, country, city - Average audio length of ~30 minutes per participant - All samples manually validated for accuracy - Structured format suitable for machine learning and AI applications

Best suited for: - NLP model training and evaluation - Multilingual ASR system development - Voice assistant and chatbot design - Accent recognition research - Voice synthesis and TTS modeling

This dataset ensures global linguistic diversity and delivers high-quality audio for AI developers, researchers, and enterprises working on voice-based applications.
f
Data from: Mpox Narrative on Instagram: A Labeled Multilingual Dataset of...
figshare.com
xlsx
Updated Oct 12, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nirmalya Thakur (2024). Mpox Narrative on Instagram: A Labeled Multilingual Dataset of Instagram Posts on Mpox for Sentiment, Hate Speech, and Anxiety Analysis [Dataset]. http://doi.org/10.6084/m9.figshare.27072247.v1
Explore at:
xlsxAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.27072247.v1
Dataset updated
Oct 12, 2024
Dataset provided by
figshare
Authors
Nirmalya Thakur
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Please cite this paper when using this dataset: N. Thakur, “Mpox narrative on Instagram: A labeled multilingual dataset of Instagram posts on mpox for sentiment, hate speech, and anxiety analysis,” arXiv [cs.LG], 2024, URL: https://arxiv.org/abs/2409.05292Abstract: The world is currently experiencing an outbreak of mpox, which has been declared a Public Health Emergency of International Concern by WHO. During recent virus outbreaks, social media platforms have played a crucial role in keeping the global population informed and updated regarding various aspects of the outbreaks. As a result, in the last few years, researchers from different disciplines have focused on the development of social media datasets focusing on different virus outbreaks. No prior work in this field has focused on the development of a dataset of Instagram posts about the mpox outbreak. The work presented in this paper (stated above) aims to address this research gap. It presents this multilingual dataset of 60,127 Instagram posts about mpox, published between July 23, 2022, and September 5, 2024. This dataset contains Instagram posts about mpox in 52 languages.For each of these posts, the Post ID, Post Description, Date of publication, language, and translated version of the post (translation to English was performed using the Google Translate API) are presented as separate attributes in the dataset. After developing this dataset, sentiment analysis, hate speech detection, and anxiety or stress detection were also performed. This process included classifying each post intoone of the fine-grain sentiment classes, i.e., fear, surprise, joy, sadness, anger, disgust, or neutralhate or not hateanxiety/stress detected or no anxiety/stress detected.These results are presented as separate attributes in the dataset for the training and testing of machine learning algorithms for sentiment, hate speech, and anxiety or stress detection, as well as for other applications.The 52 distinct languages in which Instagram posts are present in the dataset are English, Portuguese, Indonesian, Spanish, Korean, French, Hindi, Finnish, Turkish, Italian, German, Tamil, Urdu, Thai, Arabic, Persian, Tagalog, Dutch, Catalan, Bengali, Marathi, Malayalam, Swahili, Afrikaans, Panjabi, Gujarati, Somali, Lithuanian, Norwegian, Estonian, Swedish, Telugu, Russian, Danish, Slovak, Japanese, Kannada, Polish, Vietnamese, Hebrew, Romanian, Nepali, Czech, Modern Greek, Albanian, Croatian, Slovenian, Bulgarian, Ukrainian, Welsh, Hungarian, and Latvian.The following is a description of the attributes present in this dataset:Post ID: Unique ID of each Instagram postPost Description: Complete description of each post in the language in which it was originally publishedDate: Date of publication in MM/DD/YYYY formatLanguage: Language of the post as detected using the Google Translate APITranslated Post Description: Translated version of the post description. All posts which were not in English were translated into English using the Google Translate API. No language translation was performed for English posts.Sentiment: Results of sentiment analysis (using the preprocessed version of the translated Post Description) where each post was classified into one of the sentiment classes: fear, surprise, joy, sadness, anger, disgust, and neutralHate: Results of hate speech detection (using the preprocessed version of the translated Post Description) where each post was classified as hate or not hateAnxiety or Stress: Results of anxiety or stress detection (using the preprocessed version of the translated Post Description) where each post was classified as stress/anxiety detected or no stress/anxiety detected.All the Instagram posts that were collected during this data mining process to develop this dataset were publicly available on Instagram and did not require a user to log in to Instagram to view the same (at the time of writing this paper).
F
English-Tamil Parallel Corpus for the Legal Domain
futurebeeai.com
wav
Updated Aug 1, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). English-Tamil Parallel Corpus for the Legal Domain [Dataset]. https://www.futurebeeai.com/dataset/parallel-corpora/tamil-english-translated-parallel-corpus-for-legal-domain
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The English-Tamil Legal Parallel Corpus is a high-quality bilingual dataset designed to support the development of multilingual legal language models, machine translation systems, and text-based AI tools. With over 50,000 carefully translated sentence pairs, this dataset serves as a critical resource for anyone working on cross-lingual legal technology or NLP applications in the legal field.
Dataset Content
•Volume and Translator Diversity
•Sentence Count: Over 50,000 bilingual sentence pairs
•Translator Base: More than 200 native Tamil linguists with domain familiarity contributed to the translation process
•Dataset Origin: Built from scratch with legal use cases in mind, ensuring domain relevance and application readiness
•Sentence Variety
•Length Range: Sentences contain 7 to 25 words
•Grammatical Structures: Includes simple, compound, and complex sentences
•Form Types: Covers questions, commands, affirmations, and negations
•Voice Representation: Balanced use of active and passive sentence constructions
•Cross Translation: Dataset includes both English-to-Tamil and Tamil-to-English segments to ensure bidirectional support
•Linguistic Features:
•Idiomatic expressions and legal jargon
•Sentence connectors and discourse markers to preserve argument structure and legal reasoning
Legal Domain Specialization
•Legal Terminology Coverage
This dataset includes terminology across a wide range of legal subdomains such as:
•Contracts, agreements, and commercial law
•Criminal and civil litigation
•Legal procedures, rulings, and statutory interpretation
•Administrative, constitutional, and regulatory terms
•Courtroom dialogue, judgments, and legal advisories
•Contextual Diversity
Sentence pairs are drawn from realistic legal content types, including:
•Legal briefs, affidavits, and memoranda
•Terms of service and data protection policies
•Research articles and legal scholarship
•Standard forms and templates
•Legislative, policy, and compliance language
•Cross-Domain Elements
To reflect the multidisciplinary nature of legal texts, the dataset also includes content that touches on:
•Government policy
•Business and finance
•Technology, IP, and cybersecurity law
Format and Structure
•
Available Formats: Delivered in Excel, with optional conversions to TMX, JSON, XML, XLIFF, or other localization formats

•Included Fields:
•Serial Number
•Unique ID
•Source Sentence and Word Count
•Target Sentence and Word Count
Use Cases and Applications
•
Legal Machine Translation: Build accurate translation engines for contracts, laws, and compliance documentation

•
Multilingual NLP Tools: Develop legal summarization tools, AI writing assistants, and terminology alignment engines

Facebook

Twitter

Click to copy link

Link copied

Cite

FutureBee AI (2022). English-Tamil Parallel Corpus for the BFSI Domain [Dataset]. https://www.futurebeeai.com/dataset/parallel-corpora/tamil-english-translated-parallel-corpus-for-bfsi-domain

English-Tamil Parallel Corpus for the BFSI Domain

BFSI domain translated text corpus in English-Tamil for NLP

Explore at:

wavAvailable download formats

Dataset updated

Aug 1, 2022

Dataset provided by

FutureBeeAI

Authors

FutureBee AI

License

https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement

Dataset funded by

FutureBeeAI

Description

Introduction

Welcome to the English-Tamil Bilingual Parallel Corpora dataset for the Banking, Financial Services, and Insurance(BFSI) domain. This meticulously curated dataset offers a rich collection of bilingual sentence pairs translated between English and Tamil. It serves as a valuable resource for developing domain-specific machine translation systems, language models, and NLP applications within the BFSI sector.

Dataset Content

•Volume and Diversity

•

Extensive Coverage: Contains over 50,000 bilingual sentence pairs, ideal for a wide range of language processing tasks.

•

Translator Diversity: Created with the help of 200+ native Tamil translators, ensuring varied linguistic styles, tone, and regional expressions.

•Sentence Diversity

•

Word Count: Sentences range from 7 to 25 words, suitable for model training and evaluation.

•

Syntactic Variety: Includes simple, compound, and complex sentence structures.

•

Grammatical Forms: Interrogative (questions) and imperative (commands), Affirmative and negative statements, Active and passive voice constructions.

•

Figurative Language: Incorporates idioms, metaphors, and colloquial expressions relevant to real-world BFSI communications.

•

Discourse Features: Includes logical connectors and transitional phrases for coherent, natural language flow.

•

Cross Translation: Supports bi-directional translation with content translated both from English to Tamil and Tamil to English.

Domain-Specific Content

•

Specialized Terminology: Covers technical vocabulary from banking, insurance, financial services, compliance, investment, and fintech.

•

Authentic Industry Language: Captures real-world usage, including expressions from customer service conversations, financial reporting, and policy documentation.

•

Contextual Coverage: Draws content from scenarios such as:

•Banking transactions and statements

•Risk management reports

•Compliance policies

•Claims processing and customer support dialogs

•

Cross-Domain Elements: Includes supporting vocabulary from general business, legal, and technology domains, relevant to modern BFSI operations.

Format and Structure

•

File Formats: Delivered in Excel format by default, with easy conversion to JSON, TMX, XML, XLIFF, XLS, and other widely supported industry formats.

•

Dataset Structure: Serial Number, Unique ID, Source Sentence and Source Word Count, Target Sentence and Target Word Count

Usage and Applications

•

Machine Translation and Localization: Supports training of accurate translation models and localization systems specific to the BFSI sector.

•

NLP Systems: Useful for enhancing tools such as grammar checkers, spell checkers, predictive text, and speech/text understanding engines.

•

Large Language Models (LLMs): Enables fine-tuning and bilingual enhancement of LLMs for:

•Financial content generation

•Summarization of market reports

•Automated responses to customer service and

Clear search

Close search

Google apps

Main menu

English-Tamil Parallel Corpus for the BFSI Domain

Introduction

Dataset Content

Domain-Specific Content

Format and Structure

Usage and Applications

Tamil Call Center Data for Realestate AI

Introduction

Speech Data

Topic Diversity

Transcription

Metadata

Usage and Applications

Tamil-Finetuning-data

MADTRAS (Dataset for Aspect-based Sentiment Analysis of Movie Reviews in...

Claim Detection and Matching for Indian Languages

Tamil Speech Dataset – 500 Hours Monologue Audio Corpus

Tamil Human-Human Chat Dataset for Conversational AI & NLP

Introduction

Conversational Text Data

Diversity and Domain Coverage

Linguistic Authenticity

Metadata

Data Quality Assurance

Applications

EnTam: An English-Tamil Parallel Corpus (EnTam v2.0) - Dataset - B2FIND

General Conversations Speech Dataset of Medical sector in Tamil

203 Hours Tamil Speech Dataset – Conversation & Monologue Audio

Call Center Conversations Speech Dataset of E-Commerce Sector in Tamil

indic-nlp

Indian Agent to Indian Customer call center Speech Dataset in Tamil for...

Indian Agent to Indian Customer call center Speech Dataset in Tamil

Ettuthogai_Ainkurunooru

Indian Agent to Indian Customer call center Speech Dataset in Tamil for...

Tamil Agent-Customer Chat Dataset for Healthcare Domain

Introduction

Participant & Chat Overview

Topic Diversity

Language Diversity & Realism

Conversational Flow & Structure

Data Format & Structure

Applications

Global English Speech with Accent Conversational Dataset — Multi-Region...

Data from: Mpox Narrative on Instagram: A Labeled Multilingual Dataset of...

English-Tamil Parallel Corpus for the Legal Domain

Introduction

Dataset Content

Legal Domain Specialization

Format and Structure

Use Cases and Applications

English-Tamil Parallel Corpus for the BFSI DomainSee More Versions

BFSI domain translated text corpus in English-Tamil for NLP

Introduction

Dataset Content

Domain-Specific Content

Format and Structure

Usage and Applications

English-Tamil Parallel Corpus for the BFSI Domain