49 datasets found
  1. englist_tamil_parallel_sent

    • kaggle.com
    Updated Aug 3, 2023
    Cite
    hemanth kumar (2023). englist_tamil_parallel_sent [Dataset]. https://www.kaggle.com/datasets/hemanthkumar21/englist-tamil-parallel-sent
    Dataset updated
    Aug 3, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    hemanth kumar
    Description

    The English-Tamil Parallel Sentences Dataset is a valuable resource for natural language processing (NLP) tasks that require bilingual training data, such as machine translation, cross-lingual information retrieval, and language understanding applications. This dataset contains a collection of parallel sentences in both English and Tamil languages, allowing researchers and developers to build and evaluate robust multilingual NLP models.

    Potential Use Cases:

    1. Machine Translation: Researchers can leverage this dataset to train machine translation models that effectively convert English text to Tamil and vice versa.
    2. Cross-Lingual Information Retrieval: The parallel sentences can be used to develop cross-lingual search systems, enabling users to retrieve relevant information across both languages.
    3. Multilingual Chatbots: Developers can use the dataset to build multilingual chatbots that understand and respond to user queries in English and Tamil.
    4. Sentiment Analysis: Researchers can use the dataset for cross-lingual sentiment analysis, enabling analysis of sentiment in both languages.
  2. English-Tamil Parallel Corpus for the BFSI Domain

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). English-Tamil Parallel Corpus for the BFSI Domain [Dataset]. https://www.futurebeeai.com/dataset/parallel-corpora/tamil-english-translated-parallel-corpus-for-bfsi-domain
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the English-Tamil Bilingual Parallel Corpora dataset for the Banking, Financial Services, and Insurance (BFSI) domain. This curated dataset offers a collection of bilingual sentence pairs translated between English and Tamil. It serves as a resource for developing domain-specific machine translation systems, language models, and NLP applications within the BFSI sector.

    Dataset Content

    Volume and Diversity
    Extensive Coverage: Contains over 50,000 bilingual sentence pairs, ideal for a wide range of language processing tasks.
    Translator Diversity: Created with the help of 200+ native Tamil translators, ensuring varied linguistic styles, tone, and regional expressions.
    Sentence Diversity
    Word Count: Sentences range from 7 to 25 words, suitable for model training and evaluation.
    Syntactic Variety: Includes simple, compound, and complex sentence structures.
    Grammatical Forms: Interrogative (questions) and imperative (commands) sentences, affirmative and negative statements, and active and passive voice constructions.
    Figurative Language: Incorporates idioms, metaphors, and colloquial expressions relevant to real-world BFSI communications.
    Discourse Features: Includes logical connectors and transitional phrases for coherent, natural language flow.
    Cross Translation: Supports bi-directional translation with content translated both from English to Tamil and Tamil to English.

    Domain-Specific Content

    Specialized Terminology: Covers technical vocabulary from banking, insurance, financial services, compliance, investment, and fintech.
    Authentic Industry Language: Captures real-world usage, including expressions from customer service conversations, financial reporting, and policy documentation.
    Contextual Coverage: Draws content from scenarios such as:
    Banking transactions and statements
    Risk management reports
    Compliance policies
    Claims processing and customer support dialogs
    Cross-Domain Elements: Includes supporting vocabulary from general business, legal, and technology domains, relevant to modern BFSI operations.

    Format and Structure

    File Formats: Delivered in Excel format by default, with easy conversion to JSON, TMX, XML, XLIFF, XLS, and other widely supported industry formats.
    Dataset Structure: Serial Number, Unique ID, Source Sentence and Source Word Count, Target Sentence and Target Word Count
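
    The column layout above can be sanity-checked after export. The following is a minimal sketch, not the vendor's tooling: the field names, the whitespace-based word counting, and the choice to apply the 7-to-25-word range to the source sentence only are all assumptions made for illustration.

```python
from dataclasses import dataclass

MIN_WORDS, MAX_WORDS = 7, 25  # sentence-length range stated in the listing


@dataclass
class SentencePair:
    """One row of the corpus; the field names are hypothetical."""
    serial_number: int
    unique_id: str
    source_sentence: str
    source_word_count: int
    target_sentence: str
    target_word_count: int


def word_count(sentence: str) -> int:
    # Whitespace tokenization is an assumption; the listing does not
    # document how the word counts were produced.
    return len(sentence.split())


def check_pair(pair: SentencePair) -> list[str]:
    """Return a list of human-readable problems found in one row."""
    problems = []
    if word_count(pair.source_sentence) != pair.source_word_count:
        problems.append("source word count mismatch")
    if word_count(pair.target_sentence) != pair.target_word_count:
        problems.append("target word count mismatch")
    if not MIN_WORDS <= pair.source_word_count <= MAX_WORDS:
        problems.append("source length outside 7-25 words")
    return problems
```

    A row that passes returns an empty list, so the function can be mapped over an exported file to flag rows worth a manual look.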

    Usage and Applications

    Machine Translation and Localization: Supports training of accurate translation models and localization systems specific to the BFSI sector.
    NLP Systems: Useful for enhancing tools such as grammar checkers, spell checkers, predictive text, and speech/text understanding engines.
    Large Language Models (LLMs): Enables fine-tuning and bilingual enhancement of LLMs for:
    Financial content generation
    Summarization of market reports
    Automated responses to customer service and

  3. Tamil Call Center Data for Realestate AI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Tamil Call Center Data for Realestate AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/realestate-call-center-conversation-tamil-india
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    This Tamil Call Center Speech Dataset for the Real Estate industry is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for Tamil-speaking real estate customers. With over 30 hours of unscripted, real-world audio, this dataset captures authentic conversations between customers and real estate agents, making it well suited to building robust ASR models.

    Curated by FutureBeeAI, this dataset equips voice AI developers, real estate tech platforms, and NLP researchers with the data needed to create high-accuracy, production-ready models for property-focused use cases.

    Speech Data

    The dataset features 30 hours of dual-channel call center recordings between native Tamil speakers. Captured in realistic real estate consultation and support contexts, these conversations span a wide array of property-related topics, from inquiries to investment advice, offering deep domain coverage for AI model development.

    Participant Diversity:
    Speakers: 60 native Tamil speakers from our verified contributor community.
    Regions: Representing different regions across Tamil Nadu to ensure accent and dialect variation.
    Participant Profile: Gender mix of 60% male and 40% female; ages range from 18 to 70.
    Recording Details:
    Conversation Nature: Naturally flowing, unscripted agent-customer discussions.
    Call Duration: Average 5–15 minutes per call.
    Audio Format: Stereo WAV, 16-bit, recorded at 8kHz and 16kHz.
    Recording Environment: Captured in noise-free and echo-free conditions.
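
    The stated audio format (stereo, 16-bit, 8 kHz or 16 kHz WAV) can be verified per file before training. Below is a small sketch using only Python's standard `wave` module; the helper names are hypothetical, and the synthetic silent clip stands in for a real recording.

```python
import io
import wave

ALLOWED_RATES = {8000, 16000}  # sample rates named in the listing


def describe_wav(data: bytes) -> dict:
    """Read channel count, bit depth, and sample rate from WAV bytes."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "bits": wav.getsampwidth() * 8,
            "sample_rate": wav.getframerate(),
        }


def matches_listing(info: dict) -> bool:
    """True if the file is stereo, 16-bit, and sampled at 8 or 16 kHz."""
    return (
        info["channels"] == 2
        and info["bits"] == 16
        and info["sample_rate"] in ALLOWED_RATES
    )


def make_silent_wav(sample_rate: int, channels: int = 2, frames: int = 80) -> bytes:
    """Build a tiny silent 16-bit WAV in memory, for demonstration only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * channels * frames)
    return buffer.getvalue()


if __name__ == "__main__":
    print(matches_listing(describe_wav(make_silent_wav(16000))))
```

    For real files, the bytes would come from reading each recording from disk instead of `make_silent_wav`.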

    Topic Diversity

    This speech corpus includes both inbound and outbound calls, featuring positive, neutral, and negative outcomes across a wide range of real estate scenarios.

    Inbound Calls:
    Property Inquiries
    Rental Availability
    Renovation Consultation
    Property Features & Amenities
    Investment Property Evaluation
    Ownership History & Legal Info, and more
    Outbound Calls:
    New Listing Notifications
    Post-Purchase Follow-ups
    Property Recommendations
    Value Updates
    Customer Satisfaction Surveys, and others

    Such domain-rich variety ensures model generalization across common real estate support conversations.

    Transcription

    All recordings are accompanied by precise, manually verified transcriptions in JSON format.

    Transcription Includes:
    Speaker-Segmented Dialogues
    Time-coded Segments
    Non-speech Tags (e.g., background noise, pauses)
    High transcription accuracy with word error rate below 5% via dual-layer human review.

    These transcriptions streamline ASR and NLP development for Tamil real estate voice applications.
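
    The listing quotes a word error rate below 5% but does not say how it was measured. As a point of reference, the conventional definition (word-level edit distance divided by the number of reference words) can be sketched as follows; whitespace tokenization is an assumption and may not match the vendor's procedure.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the first i-1 reference
    # words and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

    Under this definition, one substituted word in a four-word reference gives a rate of 0.25, so a 5% threshold corresponds to roughly one error per twenty reference words.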

    Metadata

    Detailed metadata accompanies each participant and conversation:

    Participant Metadata: ID, age, gender, location, accent, and dialect.
    Conversation Metadata: Topic, call type, sentiment, sample rate, and technical details.

    This enables smart filtering, dialect-focused model training, and structured dataset exploration.

    Usage and Applications

    This dataset is ideal for voice AI and NLP systems built for the real estate sector:


  4. MADTRAS (Dataset for Aspect-based Sentiment Analysis of Movie Reviews in...

    • data.mendeley.com
    Updated Apr 14, 2025
    Cite
    Arunmozhi Mourougappane (2025). MADTRAS (Dataset for Aspect-based Sentiment Analysis of Movie Reviews in Tamil) [Dataset]. http://doi.org/10.17632/p59cfx4vx6.2
    Dataset updated
    Apr 14, 2025
    Authors
    Arunmozhi Mourougappane
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset is a curated set of Tamil film reviews intended to advance NLP research in text classification, sentiment analysis, and aspect-based sentiment analysis. We invited users to review twenty-five films using a Google form. Additional reviews were taken from websites such as IMDb and YouTube. We also made sure that every collected review mentions at least one of the target aspects: cinematography, acting, screenplay, story, director, songs, background music, and editing. The dataset comprises about 1,390 reviews, tagged as positive or negative across these eight aspect categories.

  5. Tamil-Finetuning-data

    • huggingface.co
    Updated Feb 20, 2025
    Cite
    Thrisha Sivasakthi (2025). Tamil-Finetuning-data [Dataset]. https://huggingface.co/datasets/ThrishaSivasakthi/Tamil-Finetuning-data
    Dataset updated
    Feb 20, 2025
    Authors
    Thrisha Sivasakthi
    Description

    Dataset Card for Dataset Name

    This dataset is designed for fine-tuning Large Language Models (LLMs) in Tamil, enabling them to understand and generate high-quality Tamil text across multiple domains. It contains 72,000 curated and generated samples, ensuring a rich linguistic diversity that improves model generalization. 🔹 Sources: Kaggle Tamil NLP, Sentiment Analysis datasets, and synthetic data. 🔹 Languages: Tamil, Tanglish (Tamil-English mix), and regional Tamil dialects. 🔹… See the full description on the dataset page: https://huggingface.co/datasets/ThrishaSivasakthi/Tamil-Finetuning-data.

  6. Claim Detection and Matching for Indian Languages

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Jun 6, 2021
    Cite
    Ashkan Kazemi; Kiran Garimella; Devin Gaffney; Scott A. Hale (2021). Claim Detection and Matching for Indian Languages [Dataset]. http://doi.org/10.5281/zenodo.4890950
    Available download formats: csv
    Dataset updated
    Jun 6, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ashkan Kazemi; Kiran Garimella; Devin Gaffney; Scott A. Hale
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    India
    Description

    Two datasets are included in this repository: claim matching and claim detection datasets. The collections contain data in 5 languages: Bengali, English, Hindi, Malayalam and Tamil.

    The "claim detection" dataset contains textual claims from social media and fact-checking websites annotated for the "fact-check worthiness" of the claims in each message. Data points have one of the three labels of "Yes" (text contains one or more check-worthy claims), "No" and "Probably".

    The "claim matching" dataset is a curated collection of pairs of textual claims from social media and fact-checking websites for the purpose of automatic and multilingual claim matching. Pairs of data have one of the four labels of "Very Similar", "Somewhat Similar", "Somewhat Dissimilar" and "Very Dissimilar".
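
    The two label sets described above lend themselves to a quick consistency check when loading the CSV files. This is an illustrative sketch only: the task keys and the function name are hypothetical, and the actual column names in the repository are not given in this listing.

```python
from collections import Counter

# Label vocabularies as quoted in the dataset description.
CLAIM_DETECTION_LABELS = frozenset({"Yes", "No", "Probably"})
CLAIM_MATCHING_LABELS = frozenset(
    {"Very Similar", "Somewhat Similar", "Somewhat Dissimilar", "Very Dissimilar"}
)

# Hypothetical task keys used only by this sketch.
LABELS_BY_TASK = {
    "claim_detection": CLAIM_DETECTION_LABELS,
    "claim_matching": CLAIM_MATCHING_LABELS,
}


def invalid_labels(task: str, labels: list[str]) -> Counter:
    """Count labels that fall outside the documented set for a task."""
    allowed = LABELS_BY_TASK[task]
    return Counter(label for label in labels if label not in allowed)
```

    An empty counter means every label in the column belongs to the documented vocabulary for that task.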

    All personally identifiable information (PII), including phone numbers, email addresses, license plate numbers, and addresses, has been replaced with general tags to protect user anonymity. A detailed explanation of the curation and annotation process is provided in our ACL 2021 paper:
    Kazemi, A.; Garimella, K.; Gaffney, D.; and Hale, S. A. 2021. Claim Matching Beyond English to Scale Global Fact-Checking. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, ACL 2021.

  7. Tamil Speech Dataset – 500 Hours Monologue Audio Corpus

    • nexdata.ai
    Updated Jul 11, 2025
    Cite
    Nexdata (2025). Tamil Speech Dataset – 500 Hours Monologue Audio Corpus [Dataset]. https://www.nexdata.ai/datasets/speechrecog/1838
    Dataset updated
    Jul 11, 2025
    Dataset authored and provided by
    Nexdata
    Variables measured
    Format, Speaker, Language, Accuracy Rate, Recording device, Recording condition, Features of annotation
    Description

    This dataset includes 500 hours of scripted Tamil monologue speech collected using smartphones. Each sample is transcribed with text content and metadata such as speaker ID, gender, and age. The dataset features diverse speakers from various regions, making it highly representative of real-world Tamil language use and suitable for automatic speech recognition (ASR), text-to-speech (TTS), voice activity detection (VAD), and natural language processing (NLP) tasks. Validated by leading AI companies, the dataset is designed to enhance model robustness in multilingual environments and low-resource languages. All data was collected in full compliance with global privacy regulations including GDPR, CCPA, and PIPL, ensuring ethical sourcing and responsible AI development.

  8. Tamil Human-Human Chat Dataset for Conversational AI & NLP

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Tamil Human-Human Chat Dataset for Conversational AI & NLP [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/tamil-general-domain-conversation-text-dataset
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The Tamil General Domain Chat Dataset is a high-quality, text-based dataset designed to train and evaluate conversational AI, NLP models, and smart assistants in real-world Tamil usage. Collected through FutureBeeAI’s trusted crowd community, this dataset reflects natural, native-level Tamil conversations covering a broad spectrum of everyday topics.

    Conversational Text Data

    This dataset includes over 10,000 chat transcripts, each featuring free-flowing dialogue between two native Tamil speakers. The conversations are spontaneous, context-rich, and mimic informal, real-life texting behavior.

    Words per Chat: 300–700
    Turns per Chat: Up to 50 dialogue turns
    Contributors: 150 native Tamil speakers from the FutureBeeAI Crowd Community
    Format: TXT, DOCS, JSON or CSV (customizable)
    Structure: Each record contains the full chat, topic tag, and metadata block

    Diversity and Domain Coverage

    Conversations span a wide variety of general-domain topics to ensure comprehensive model exposure:

    Music, books, and movies
    Health and wellness
    Children and parenting
    Family life and relationships
    Food and cooking
    Education and studying
    Festivals and traditions
    Environment and daily life
    Internet and tech usage
    Childhood memories and casual chatting

    This diversity ensures the dataset is useful across multiple NLP and language understanding applications.

    Linguistic Authenticity

    Chats reflect informal, native-level Tamil usage with:

    Colloquial expressions and local dialect influence
    Domain-relevant terminology
    Language-specific grammar, phrasing, and sentence flow
    Inclusion of realistic details such as names, phone numbers, email addresses, locations, dates, times, local currencies, and culturally grounded references
    Representation of different writing styles and input quirks to ensure training data realism

    Metadata

    Every chat instance is accompanied by structured metadata, which includes:

    Participant Age
    Gender
    Country/Region
    Chat Domain
    Chat Topic
    Dialect

    This metadata supports model filtering, demographic-specific evaluation, and more controlled fine-tuning workflows.

    Data Quality Assurance

    All chat records pass through a rigorous QA process to maintain consistency and accuracy:

    Manual review for content completeness
    Format checks for chat turns and metadata
    Linguistic verification by native speakers
    Removal of inappropriate or unusable samples

    This ensures a clean, reliable dataset ready for high-performance AI model training.

    Applications

    This dataset is ideal for training and evaluating a wide range of text-based AI systems:

    Conversational AI / Chatbots
    Smart assistants and voicebots

  9. General Conversations Speech Dataset of Medical sector in Tamil

    • data.macgence.com
    mp3
    Updated Jun 2, 2025
    Cite
    Macgence (2025). General Conversations Speech Dataset of Medical sector in Tamil [Dataset]. https://data.macgence.com/dataset/general-conversations-speech-dataset-of-medical-sector-in-tamil
    Available download formats: mp3
    Dataset updated
    Jun 2, 2025
    Dataset authored and provided by
    Macgence
    License

    https://data.macgence.com/terms-and-conditions

    Time period covered
    2025
    Area covered
    Worldwide
    Variables measured
    Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
    Description

    Access a professionally curated Tamil speech dataset focused on general conversations in the medical sector. Ideal for training voice AI, speech recognition, and NLP models in healthcare.

  10. EnTam: An English-Tamil Parallel Corpus (EnTam v2.0) - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Nov 18, 2022
    + more versions
    Cite
    (2022). EnTam: An English-Tamil Parallel Corpus (EnTam v2.0) - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/9eb44325-3708-574f-a0da-4e8ccff2aa66
    Dataset updated
    Nov 18, 2022
    License

    Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0): https://creativecommons.org/licenses/by-nc-sa/3.0/
    License information was derived automatically

    Description

    EnTam is a sentence-aligned English-Tamil bilingual corpus that we collected from publicly available websites for NLP research involving Tamil. A standard set of processing steps was applied to the raw web data before it was released as a sentence-aligned English-Tamil parallel corpus suitable for various NLP tasks. The parallel corpus includes texts from the Bible, cinema, and news domains.

  11. 203 Hours Tamil Speech Dataset – Conversation & Monologue Audio

    • m.nexdata.ai
    • nexdata.ai
    Updated May 9, 2025
    Cite
    Nexdata (2025). 203 Hours Tamil Speech Dataset – Conversation & Monologue Audio [Dataset]. https://m.nexdata.ai/datasets/speechrecog/1390?source=huggingface
    Dataset updated
    May 9, 2025
    Dataset authored and provided by
    Nexdata
    Variables measured
    Format, Country, Language, Accuracy Rate, Language(Region) Code, Recording environment, Features of annotation
    Description

    203 hours of real-world Tamil speech data featuring both casual conversations and scripted monologues. All audio was recorded from native Tamil speakers across various regions, reflecting real-world linguistic and acoustic diversity. Each sample is manually transcribed and annotated with speaker ID, gender, and other metadata, making it highly suitable for automatic speech recognition (ASR), speech synthesis (TTS), speaker identification, and natural language processing (NLP) applications. The dataset has been validated by leading AI companies and is particularly valuable for training robust AI models for underrepresented languages. All data collection, processing, and usage comply strictly with global data privacy laws including GDPR, CCPA, and PIPL, ensuring legal and ethical use.

  12. Call Center Conversations Speech Dataset of E-Commerce Sector in Tamil

    • data.macgence.com
    mp3
    Updated Jun 8, 2025
    Cite
    Macgence (2025). Call Center Conversations Speech Dataset of E-Commerce Sector in Tamil [Dataset]. https://data.macgence.com/dataset/call-center-conversations-speech-dataset-of-e-commerce-sector-in-tamil
    Available download formats: mp3
    Dataset updated
    Jun 8, 2025
    Dataset authored and provided by
    Macgence
    License

    https://data.macgence.com/terms-and-conditions

    Time period covered
    2025
    Area covered
    Worldwide
    Variables measured
    Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
    Description

    Explore our high-quality Tamil speech dataset featuring real call center conversations from the e-commerce sector. Ideal for speech recognition, NLP, and AI training applications.

  13. Indian Agent to Indian Customer call center Speech Dataset in Tamil for...

    • data.macgence.com
    mp3
    Updated Mar 17, 2024
    Cite
    Macgence (2024). Indian Agent to Indian Customer call center Speech Dataset in Tamil for Finance [Dataset]. https://data.macgence.com/dataset/indian-agent-to-indian-customer-call-center-speech-dataset-in-tamil-for-finance
    Available download formats: mp3
    Dataset updated
    Mar 17, 2024
    Dataset authored and provided by
    Macgence
    License

    https://data.macgence.com/terms-and-conditions

    Time period covered
    2025
    Area covered
    Worldwide
    Variables measured
    Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
    Description

    High-quality Tamil speech dataset featuring Indian agent-customer finance calls, ideal for ASR, NLP, and voice AI model training.

  14. Ainkurunooru

    • huggingface.co
    Updated Aug 31, 2025
    Cite
    தமிழ் தகவல் (2025). Ainkurunooru [Dataset]. https://huggingface.co/datasets/TamilThagaval/Ainkurunooru
    Dataset updated
    Aug 31, 2025
    Dataset authored and provided by
    தமிழ் தகவல்
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Dataset Card for Narrinai Poems

    Dataset Summary
    

    This dataset contains Narrinai, one of the eight anthologies (Ettuthogai) of Sangam Tamil literature. It consists of 400+ poems written by various poets, each poem categorized by number and title.
    The dataset provides:

    Poem Number
    Title (in Tamil)
    Poem Text

    This can be used for NLP tasks in Tamil, such as text generation, retrieval, classification, and cultural/literary research.

    Supported Tasks and… See the full description on the dataset page: https://huggingface.co/datasets/TamilThagaval/Ainkurunooru.
    
  15. indic-nlp

    • huggingface.co
    Updated Jun 12, 2025
    Cite
    Ayush Bagarua (2025). indic-nlp [Dataset]. https://huggingface.co/datasets/ayushbagaria17/indic-nlp
    Dataset updated
    Jun 12, 2025
    Authors
    Ayush Bagarua
    Description

    L3Cube-IndicNews

    L3Cube-IndicNews is a multilingual text classification corpus aimed at curating a high-quality dataset for Indian regional languages, with a specific focus on news headlines and articles. We have centered our work on 11 prominent Indic languages, including Hindi, Bengali, Marathi, Telugu, Tamil, Gujarati, Kannada, Odia, Malayalam, Punjabi and English. Each of these news datasets comprises 10 or more classes of news articles. L3Cube-IndicNews offers 3 distinct… See the full description on the dataset page: https://huggingface.co/datasets/ayushbagaria17/indic-nlp.

  16. Indian Agent to Indian Customer call center Speech Dataset in Tamil

    • data.macgence.com
    mp3
    Updated Mar 29, 2024
    Cite
    Macgence (2024). Indian Agent to Indian Customer call center Speech Dataset in Tamil [Dataset]. https://data.macgence.com/dataset/indian-agent-to-indian-customer-call-center-speech-dataset-in-tamil
    Available download formats: mp3
    Dataset updated
    Mar 29, 2024
    Dataset authored and provided by
    Macgence
    License

    https://data.macgence.com/terms-and-conditions

    Time period covered
    2025
    Area covered
    Worldwide
    Variables measured
    Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
    Description

    Explore Macgence's Tamil speech dataset of Indian agent-customer call center conversations—ideal for ASR, NLP, and voice AI training applications.

  17. Global English Speech with Accent Conversational Dataset — Multi-Region...

    • datarade.ai
    .wav
    Updated Jul 21, 2025
    + more versions
    Cite
    FileMarket (2025). Global English Speech with Accent Conversational Dataset — Multi-Region Validated Speech with Gender, Age & Metadata for AI & NLP Training [Dataset]. https://datarade.ai/data-products/global-english-speech-with-accent-conversational-dataset-mu-filemarket
    Available download formats: .wav
    Dataset updated
    Jul 21, 2025
    Dataset authored and provided by
    FileMarket
    Area covered
    Montenegro, Nicaragua, Iceland, Tonga, United States Minor Outlying Islands, Comoros, Cook Islands, Bangladesh, Yemen, Haiti
    Description

    The Global English Accent Conversational NLP Dataset is a comprehensive collection of validated English speech recordings sourced from native and non-native English speakers across key global regions. This dataset is designed for training Natural Language Processing models, conversational AI, Automatic Speech Recognition (ASR), and linguistic research, with a focus on regional accent variation.

    Regions and Covered Countries with Primary Spoken Languages:

    Africa: South Africa (English, Zulu, Afrikaans, Xhosa); Nigeria (English, Yoruba, Igbo, Hausa); Kenya (English, Swahili); Ghana (English, Twi, Ewe, Ga); Uganda (English, Luganda); Ethiopia (English, Amharic, Oromo)

    Central & South America: Mexico (Spanish, English as a second language); Guatemala (Spanish, K'iche', English); El Salvador (Spanish, English); Costa Rica (Spanish, English in Caribbean regions); Colombia (Spanish, English in urban centers); Dominican Republic (Spanish, English in tourist zones); Brazil (Portuguese, English in urban areas); Argentina (Spanish, English among educated speakers)

    Southeast Asia & South Asia: Philippines (Filipino, English); Vietnam (Vietnamese, English); Malaysia (Malay, English, Mandarin); Indonesia (Indonesian, Javanese, English); Singapore (English, Mandarin, Malay, Tamil); India (Hindi, English, Bengali, Tamil); Pakistan (Urdu, English, Punjabi)

    Europe: United Kingdom (English); Ireland (English, Irish); Germany (German, English); France (French, English); Spain (Spanish, Catalan, English); Italy (Italian, English); Portugal (Portuguese, English)

    Oceania: Australia (English); New Zealand (English, Māori); Fiji (English, Fijian)

    North America: United States (English, Spanish); Canada (English, French)

    Dataset Attributes:
    - Conversational English with natural accent variation
    - Global coverage with balanced male/female speakers
    - Rich speaker metadata: age, gender, country, city
    - Average audio length of ~30 minutes per participant
    - All samples manually validated for accuracy
    - Structured format suitable for machine learning and AI applications

    Best suited for:
    - NLP model training and evaluation
    - Multilingual ASR system development
    - Voice assistant and chatbot design
    - Accent recognition research
    - Voice synthesis and TTS modeling

    This dataset ensures global linguistic diversity and delivers high-quality audio for AI developers, researchers, and enterprises working on voice-based applications.

  18. Indian Agent to Indian Customer call center Speech Dataset in Tamil for...

    • data.macgence.com
    mp3
    Updated Mar 21, 2024
    Cite
    Macgence (2024). Indian Agent to Indian Customer call center Speech Dataset in Tamil for Banking [Dataset]. https://data.macgence.com/dataset/indian-agent-to-indian-customer-call-center-speech-dataset-in-tamil-for-banking
    Explore at:
    Available download formats: mp3
    Dataset updated
    Mar 21, 2024
    Dataset authored and provided by
    Macgence
    License

    https://data.macgence.com/terms-and-conditions

    Time period covered
    2025
    Area covered
    Worldwide
    Variables measured
    Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
    Description

    Explore authentic Tamil call center speech data for banking, featuring Indian agents and customers. Curated by Macgence for voice AI and NLP projects.

  19. Data from: Mpox Narrative on Instagram: A Labeled Multilingual Dataset of...

    • figshare.com
    xlsx
    Updated Oct 12, 2024
    + more versions
    Cite
    Nirmalya Thakur (2024). Mpox Narrative on Instagram: A Labeled Multilingual Dataset of Instagram Posts on Mpox for Sentiment, Hate Speech, and Anxiety Analysis [Dataset]. http://doi.org/10.6084/m9.figshare.27072247.v1
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Oct 12, 2024
    Dataset provided by
    figshare
    Authors
    Nirmalya Thakur
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Please cite this paper when using this dataset: N. Thakur, “Mpox narrative on Instagram: A labeled multilingual dataset of Instagram posts on mpox for sentiment, hate speech, and anxiety analysis,” arXiv [cs.LG], 2024, URL: https://arxiv.org/abs/2409.05292

    Abstract: The world is currently experiencing an outbreak of mpox, which has been declared a Public Health Emergency of International Concern by WHO. During recent virus outbreaks, social media platforms have played a crucial role in keeping the global population informed and updated regarding various aspects of the outbreaks. As a result, in the last few years, researchers from different disciplines have focused on the development of social media datasets focusing on different virus outbreaks. No prior work in this field has focused on the development of a dataset of Instagram posts about the mpox outbreak. The work presented in this paper (stated above) aims to address this research gap. It presents this multilingual dataset of 60,127 Instagram posts about mpox, published between July 23, 2022, and September 5, 2024. This dataset contains Instagram posts about mpox in 52 languages.

    For each of these posts, the Post ID, Post Description, Date of publication, language, and translated version of the post (translation to English was performed using the Google Translate API) are presented as separate attributes in the dataset. After developing this dataset, sentiment analysis, hate speech detection, and anxiety or stress detection were also performed. This process included classifying each post into:

    - one of the fine-grain sentiment classes, i.e., fear, surprise, joy, sadness, anger, disgust, or neutral
    - hate or not hate
    - anxiety/stress detected or no anxiety/stress detected

    These results are presented as separate attributes in the dataset for the training and testing of machine learning algorithms for sentiment, hate speech, and anxiety or stress detection, as well as for other applications.

    The 52 distinct languages in which Instagram posts are present in the dataset are English, Portuguese, Indonesian, Spanish, Korean, French, Hindi, Finnish, Turkish, Italian, German, Tamil, Urdu, Thai, Arabic, Persian, Tagalog, Dutch, Catalan, Bengali, Marathi, Malayalam, Swahili, Afrikaans, Panjabi, Gujarati, Somali, Lithuanian, Norwegian, Estonian, Swedish, Telugu, Russian, Danish, Slovak, Japanese, Kannada, Polish, Vietnamese, Hebrew, Romanian, Nepali, Czech, Modern Greek, Albanian, Croatian, Slovenian, Bulgarian, Ukrainian, Welsh, Hungarian, and Latvian.

    The following is a description of the attributes present in this dataset:

    - Post ID: Unique ID of each Instagram post
    - Post Description: Complete description of each post in the language in which it was originally published
    - Date: Date of publication in MM/DD/YYYY format
    - Language: Language of the post as detected using the Google Translate API
    - Translated Post Description: Translated version of the post description. All posts which were not in English were translated into English using the Google Translate API. No language translation was performed for English posts.
    - Sentiment: Results of sentiment analysis (using the preprocessed version of the translated Post Description) where each post was classified into one of the sentiment classes: fear, surprise, joy, sadness, anger, disgust, and neutral
    - Hate: Results of hate speech detection (using the preprocessed version of the translated Post Description) where each post was classified as hate or not hate
    - Anxiety or Stress: Results of anxiety or stress detection (using the preprocessed version of the translated Post Description) where each post was classified as stress/anxiety detected or no stress/anxiety detected.

    All the Instagram posts that were collected during this data mining process to develop this dataset were publicly available on Instagram and did not require a user to log in to Instagram to view the same (at the time of writing this paper).
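
The attribute description above can be turned into a quick sanity check before using the file for training. The following is a minimal sketch, not part of the dataset or the paper: the function name `check_row` is hypothetical, each row is assumed to be a dict of strings keyed by the attribute names listed above, and the exact spelling and casing of the label values in the spreadsheet is an assumption.

```python
from datetime import date, datetime

# Attribute names as listed in the dataset description.
REQUIRED_COLUMNS = (
    "Post ID", "Post Description", "Date", "Language",
    "Translated Post Description", "Sentiment", "Hate", "Anxiety or Stress",
)
# The seven sentiment classes named in the description (casing is assumed).
SENTIMENT_CLASSES = {"fear", "surprise", "joy", "sadness", "anger", "disgust", "neutral"}
# Publication window stated in the description.
WINDOW_START, WINDOW_END = date(2022, 7, 23), date(2024, 9, 5)


def check_row(row: dict) -> list[str]:
    """Return a list of problems found in one row (empty if none)."""
    problems = [f"missing column: {c}" for c in REQUIRED_COLUMNS if c not in row]

    if "Date" in row:
        try:
            # The description states MM/DD/YYYY format.
            published = datetime.strptime(row["Date"], "%m/%d/%Y").date()
        except ValueError:
            problems.append(f"bad date: {row['Date']!r}")
        else:
            if not WINDOW_START <= published <= WINDOW_END:
                problems.append(f"date outside collection window: {row['Date']}")

    if "Sentiment" in row and row["Sentiment"].strip().lower() not in SENTIMENT_CLASSES:
        problems.append(f"unknown sentiment: {row['Sentiment']!r}")

    return problems
```

Rows that return a non-empty list can be logged and inspected by hand instead of being silently dropped.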

  20. thirukkural-dataset

    • huggingface.co
    Updated Aug 31, 2025
    Cite
    தமிழ் தகவல் (2025). thirukkural-dataset [Dataset]. https://huggingface.co/datasets/TamilThagaval/thirukkural-dataset
    Explore at:
    Dataset updated
    Aug 31, 2025
    Dataset authored and provided by
    தமிழ் தகவல்
    Description

    Thirukkural Dataset

    This dataset contains the full Thirukkural (1330 couplets) in Tamil along with English translations and explanations. It is structured in a way that is easy to use for developers, researchers, and AI projects, especially for NLP, chatbots, or literature analysis.

    Dataset Structure

    Each record in the dataset contains the following fields:

    Field          Description
    verse_number   Numeric index of the verse (1–1330)
    tamil_kural    The original Tamil Thirukkural verse
    english_verse  English…

    See the full description on the dataset page: https://huggingface.co/datasets/TamilThagaval/thirukkural-dataset.
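
Since the description promises all 1330 couplets indexed by `verse_number`, a completeness check is a cheap first step after loading. This is a minimal sketch under stated assumptions: records are assumed to be plain dicts with an integer `verse_number` field, and the helper name `missing_verses` is hypothetical rather than anything the dataset provides.

```python
TOTAL_VERSES = 1330  # couplet count stated in the dataset description


def missing_verses(records):
    """Return the verse numbers in 1..1330 that no record covers."""
    seen = {
        r["verse_number"]
        for r in records
        if isinstance(r.get("verse_number"), int)
    }
    return [n for n in range(1, TOTAL_VERSES + 1) if n not in seen]
```

An empty result means every verse number from 1 to 1330 appears at least once; it does not check for duplicates or for empty text fields.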
