30 datasets found
  1. Claim Detection and Matching for Indian Languages

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Jun 6, 2021
    Cite
    Ashkan Kazemi; Kiran Garimella; Devin Gaffney; Scott A. Hale (2021). Claim Detection and Matching for Indian Languages [Dataset]. http://doi.org/10.5281/zenodo.4890950
    Explore at:
    csv (available download formats)
    Dataset updated
    Jun 6, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ashkan Kazemi; Kiran Garimella; Devin Gaffney; Scott A. Hale
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    India
    Description

    Two datasets are included in this repository: claim matching and claim detection datasets. The collections contain data in 5 languages: Bengali, English, Hindi, Malayalam and Tamil.

    The "claim detection" dataset contains textual claims from social media and fact-checking websites annotated for the "fact-check worthiness" of the claims in each message. Data points have one of the three labels of "Yes" (text contains one or more check-worthy claims), "No" and "Probably".

    The "claim matching" dataset is a curated collection of pairs of textual claims from social media and fact-checking websites for the purpose of automatic and multilingual claim matching. Pairs of data have one of the four labels of "Very Similar", "Somewhat Similar", "Somewhat Dissimilar" and "Very Dissimilar".

    All personally identifiable information (PII), including phone numbers, email addresses, license plate numbers and addresses, has been replaced with general tags to protect user anonymity. A detailed explanation of the curation and annotation process is provided in our ACL 2021 paper:
    Kazemi, A.; Garimella, K.; Gaffney, D.; and Hale, S. A. 2021. Claim Matching Beyond English to Scale Global Fact-Checking. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, ACL 2021.
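
The label sets described above can be encoded for model training. The following is a minimal sketch, not the authors' code: the label strings are quoted from the description, while the ordinal scores, the function names, and the handling of "Probably" are assumptions made for illustration.

```python
# Label vocabularies quoted from the dataset description.
DETECTION_LABELS = ("Yes", "Probably", "No")
MATCHING_LABELS = (
    "Very Dissimilar",
    "Somewhat Dissimilar",
    "Somewhat Similar",
    "Very Similar",
)

# Hypothetical ordinal encoding: 0 = least similar, 3 = most similar.
MATCHING_SCORE = {label: rank for rank, label in enumerate(MATCHING_LABELS)}


def encode_matching_label(label: str) -> int:
    """Return the ordinal score for a claim-matching label."""
    try:
        return MATCHING_SCORE[label.strip()]
    except KeyError:
        raise ValueError(f"unknown claim-matching label: {label!r}") from None


def is_check_worthy(label: str, include_probably: bool = False) -> bool:
    """Binarize a claim-detection label.

    Whether "Probably" counts as check-worthy is a modeling choice,
    so it is exposed as a parameter rather than hard-coded.
    """
    label = label.strip()
    if label not in DETECTION_LABELS:
        raise ValueError(f"unknown claim-detection label: {label!r}")
    return label == "Yes" or (include_probably and label == "Probably")
```

Rejecting unknown labels explicitly helps catch spelling variants in the CSV files before they silently become a separate class.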

  2. englist_tamil_parallel_sent

    • kaggle.com
    Updated Aug 3, 2023
    Cite
    hemanth kumar (2023). englist_tamil_parallel_sent [Dataset]. https://www.kaggle.com/datasets/hemanthkumar21/englist-tamil-parallel-sent
    Explore at:
    Croissant (Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.)
    Dataset updated
    Aug 3, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    hemanth kumar
    Description

    The English-Tamil Parallel Sentences Dataset is a valuable resource for natural language processing (NLP) tasks that require bilingual training data, such as machine translation, cross-lingual information retrieval, and language understanding applications. This dataset contains a collection of parallel sentences in both English and Tamil languages, allowing researchers and developers to build and evaluate robust multilingual NLP models.

    Potential Use Cases:

    1. Machine Translation: Researchers can leverage this dataset to train machine translation models that effectively convert English text to Tamil and vice versa.
    2. Cross-Lingual Information Retrieval: The parallel sentences can be used to develop cross-lingual search systems, enabling users to retrieve relevant information across both languages.
    3. Multilingual Chatbots: Developers can use the dataset to build multilingual chatbots that understand and respond to user queries in English and Tamil.
    4. Sentiment Analysis: Researchers can use the dataset for cross-lingual sentiment analysis, enabling analysis of sentiment in both languages.
  3. English-Tamil Translated Parallel Corpora for BFSI Domain

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). English-Tamil Translated Parallel Corpora for BFSI Domain [Dataset]. https://www.futurebeeai.com/dataset/parallel-corpora/tamil-english-translated-parallel-corpus-for-bfsi-domain
    Explore at:
    wav (available download formats)
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the English-Tamil Bilingual Parallel Corpora dataset for the Banking, Financial Services, and Insurance (BFSI) domain! This meticulously curated dataset offers a rich collection of bilingual text data, translated between English and Tamil, providing a valuable resource for developing BFSI domain-specific language models and machine translation engines.

    Dataset Content

    • Volume and Diversity:
        • Extensive Dataset: Over 50,000 sentences offering a robust dataset for various applications.
        • Translator Diversity: Contributions from more than 200 native translators ensure a wide range of linguistic styles and interpretations.
    • Sentence Diversity:
        • Word Count: Sentences range from 7 to 25 words, suitable for various computational linguistic applications.
        • Syntactic Variety: The corpus encompasses sentences with varying syntactic structures, including simple, compound, and complex sentences.
        • Interrogative and Imperative Forms: The corpus includes sentences in interrogative (question) and imperative (command) forms, reflecting the conversational nature of the BFSI industry.
        • Affirmative and Negative Statements: Both affirmative and negative statements are represented in the corpus, ensuring different polarities.
        • Passive and Active Voice: The corpus features sentences written in both active and passive voice, ensuring different perspectives and representations of information.
        • Idiomatic Expressions and Figurative Language: The corpus incorporates idiomatic expressions, metaphors, and figurative language commonly used in the BFSI domain.
        • Discourse Markers and Connectives: The corpus includes a wide range of discourse markers and connectives, such as conjunctions, transitional phrases, and logical connectors, which are crucial for capturing the logical flow and coherence of the text.
        • Cross Translation: The dataset includes cross-translation: part of the dataset is translated from English to Tamil and another part from Tamil to English, to improve bidirectional translation capabilities.

    Domain Specific Content

    This Parallel Corpus is meticulously curated to capture the linguistic intricacies and domain-specific nuances inherent to the BFSI industry.

    • Industry-Tailored Terminology: The corpus encompasses a comprehensive lexicon of BFSI-specific terminology, ranging from technical banking and financial terms to insurance-related vocabulary and regulatory jargon.
    • Authentic Industry Expressions: Beyond technical terminology, the corpus captures the authentic expressions, idioms, and colloquialisms used within the BFSI industry.
    • Contexts Specific to BFSI: The corpus encompasses a wide range of contexts specific to the BFSI domain, including financial transactions, regulatory compliance, risk management, customer service interactions, and more.
    • Cross-Domain Applicability: While the primary focus is on the BFSI sector, the corpus also includes relevant cross-domain content, such as general business terminology, legal terms, and language related to technology and digital services.

    Format and Structure

    • Multiple Formats: Available in Excel format, with the ability to convert to JSON, TMX, XML, XLIFF, XLS, and other industry-standard formats, facilitating ease of use and integration.
    • Structure: It contains information like Serial Number, Unique ID, Source Sentence, Source Sentence Word Count, Target Sentence, and Target Sentence Word Count.
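
The column structure listed above can be modeled as a small record type with a consistency check. This is only a sketch under stated assumptions: the field names are hypothetical renderings of the listed columns, word counts are assumed to be whitespace-based, and the 7 to 25 word range is assumed to apply to the source sentence.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ParallelRow:
    # Hypothetical field names mirroring the columns listed in the description.
    serial_number: int
    unique_id: str
    source_sentence: str
    source_word_count: int
    target_sentence: str
    target_word_count: int


def word_count(sentence: str) -> int:
    """Count whitespace-separated tokens (an assumed counting rule)."""
    return len(sentence.split())


def check_row(row: ParallelRow, min_words: int = 7, max_words: int = 25) -> List[str]:
    """Return a list of problems; an empty list means the row looks consistent."""
    problems = []
    if word_count(row.source_sentence) != row.source_word_count:
        problems.append("source word count does not match the source sentence")
    if word_count(row.target_sentence) != row.target_word_count:
        problems.append("target word count does not match the target sentence")
    if not min_words <= row.source_word_count <= max_words:
        problems.append("source word count is outside the stated range")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to audit a whole spreadsheet export in a single pass.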

    Usage and Application

    • Machine Translation and Language Localization: It serves as a valuable training resource for developing robust machine translation engines tailored to the BFSI domain.
    • NLP Applications: It enables the creation and improvement of predictive keyboards, spell checkers, grammar checkers, and text/speech understanding systems.

  4. Tamil-Finetuning-data

    • huggingface.co
    Updated Feb 20, 2025
    Cite
    Thrisha Sivasakthi (2025). Tamil-Finetuning-data [Dataset]. https://huggingface.co/datasets/ThrishaSivasakthi/Tamil-Finetuning-data
    Explore at:
    Croissant (Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.)
    Dataset updated
    Feb 20, 2025
    Authors
    Thrisha Sivasakthi
    Description

    Dataset Card for Dataset Name

    This dataset is designed for fine-tuning Large Language Models (LLMs) in Tamil, enabling them to understand and generate high-quality Tamil text across multiple domains. It contains 72,000 curated and generated samples, ensuring a rich linguistic diversity that improves model generalization.
    • Sources: Kaggle Tamil NLP, Sentiment Analysis datasets, and synthetic data.
    • Languages: Tamil, Tanglish (Tamil-English mix), and regional Tamil dialects.
    • …
    See the full description on the dataset page: https://huggingface.co/datasets/ThrishaSivasakthi/Tamil-Finetuning-data.

  5. MADTRAS (Dataset for Aspect-based Sentiment Analysis of Movie Reviews in...

    • data.mendeley.com
    Updated Apr 14, 2025
    Cite
    Arunmozhi Mourougappane (2025). MADTRAS (Dataset for Aspect-based Sentiment Analysis of Movie Reviews in Tamil) [Dataset]. http://doi.org/10.17632/p59cfx4vx6.2
    Dataset updated
    Apr 14, 2025
    Authors
    Arunmozhi Mourougappane
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset is a carefully selected set of Tamil film reviews with the goal of advancing NLP research in the areas of text classification, sentiment analysis, and aspect-based sentiment analysis. We invited users to review twenty-five films using a Google form. Additional reviews were taken from websites such as IMDb and YouTube. We also made sure that every collected review mentions at least one of the selected target aspects: cinematography, acting, screenplay, story, director, songs, background music, and editing. The dataset comprises about 1,390 reviews, tagged for positive and negative views across these eight categories.

  6. Tamil Call Center Data for Realestate AI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Tamil Call Center Data for Realestate AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/realestate-call-center-conversation-tamil-india
    Explore at:
    wav (available download formats)
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    This Tamil Call Center Speech Dataset for the Real Estate industry is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for Tamil-speaking Real Estate customers. With over 30 hours of unscripted, real-world audio, this dataset captures authentic conversations between customers and real estate agents, ideal for building robust ASR models.

    Curated by FutureBeeAI, this dataset equips voice AI developers, real estate tech platforms, and NLP researchers with the data needed to create high-accuracy, production-ready models for property-focused use cases.

    Speech Data

    The dataset features 30 hours of dual-channel call center recordings between native Tamil speakers. Captured in realistic real estate consultation and support contexts, these conversations span a wide array of property-related topics, from inquiries to investment advice, offering deep domain coverage for AI model development.

    • Participant Diversity:
        • Speakers: 60 native Tamil speakers from our verified contributor community.
        • Regions: Representing different regions across Tamil Nadu to ensure accent and dialect variation.
        • Participant Profile: Balanced gender mix (60% male, 40% female) and age range from 18 to 70.
    • Recording Details:
        • Conversation Nature: Naturally flowing, unscripted agent-customer discussions.
        • Call Duration: Average 5–15 minutes per call.
        • Audio Format: Stereo WAV, 16-bit, recorded at 8kHz and 16kHz.
        • Recording Environment: Captured in noise-free and echo-free conditions.
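
The audio format stated above (stereo WAV, 16-bit, 8 kHz or 16 kHz) can be verified from a file header with Python's standard `wave` module. The sketch below is illustrative only: the function names are hypothetical, and `make_silent_wav` builds a synthetic in-memory file so the example does not depend on the actual dataset.

```python
import io
import wave

ALLOWED_RATES = (8000, 16000)  # 8 kHz and 16 kHz, as stated in the listing


def describe_wav(source) -> dict:
    """Read header fields from a WAV file path or a binary file object."""
    with wave.open(source, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "bits_per_sample": wav.getsampwidth() * 8,
            "sample_rate": rate,
            "duration_seconds": frames / rate,
        }


def matches_listing(info: dict) -> bool:
    """Check the header against the stereo, 16-bit, 8/16 kHz description."""
    return (
        info["channels"] == 2
        and info["bits_per_sample"] == 16
        and info["sample_rate"] in ALLOWED_RATES
    )


def make_silent_wav(seconds: float = 1.0, rate: int = 8000) -> io.BytesIO:
    """Build a synthetic stereo 16-bit WAV in memory (demo data only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(2)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(rate)
        # Each frame holds 2 channels x 2 bytes of silence.
        wav.writeframes(b"\x00" * (int(seconds * rate) * 2 * 2))
    buffer.seek(0)
    return buffer
```

A check like this is a cheap first step before training, since mixing 8 kHz and 16 kHz recordings without resampling is a common source of ASR degradation.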

    Topic Diversity

    This speech corpus includes both inbound and outbound calls, featuring positive, neutral, and negative outcomes across a wide range of real estate scenarios.

    • Inbound Calls:
        • Property Inquiries
        • Rental Availability
        • Renovation Consultation
        • Property Features & Amenities
        • Investment Property Evaluation
        • Ownership History & Legal Info, and more
    • Outbound Calls:
        • New Listing Notifications
        • Post-Purchase Follow-ups
        • Property Recommendations
        • Value Updates
        • Customer Satisfaction Surveys, and others

    Such domain-rich variety ensures model generalization across common real estate support conversations.

    Transcription

    All recordings are accompanied by precise, manually verified transcriptions in JSON format.

    • Transcription Includes:
        • Speaker-Segmented Dialogues
        • Time-coded Segments
        • Non-speech Tags (e.g., background noise, pauses)
    • High transcription accuracy with word error rate below 5% via dual-layer human review.

    These transcriptions streamline ASR and NLP development for Tamil real estate voice applications.
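
The word error rate (WER) figure mentioned above is conventionally computed as the word-level edit distance between a reference and a hypothesis transcript, divided by the number of reference words. The sketch below shows that standard computation; it is not the vendor's measurement procedure, which the listing does not describe, and it assumes whitespace tokenization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far (excluding the current one) and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing from hypothesis
            insertion = current[j - 1] + 1  # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.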

    Metadata

    Detailed metadata accompanies each participant and conversation:

    • Participant Metadata: ID, age, gender, location, accent, and dialect.
    • Conversation Metadata: Topic, call type, sentiment, sample rate, and technical details.

    This enables smart filtering, dialect-focused model training, and structured dataset exploration.
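
Filtering on conversation metadata of the kind listed above might look like the sketch below. The record keys (`call_type`, `sentiment`, `sample_rate`) and the example records are invented for illustration; the real metadata schema may differ.

```python
from typing import Dict, Iterable, List


def filter_conversations(
    records: Iterable[Dict[str, object]], **criteria: object
) -> List[Dict[str, object]]:
    """Keep records whose metadata equals every requested value."""
    return [
        record
        for record in records
        if all(record.get(key) == value for key, value in criteria.items())
    ]


# Invented example records; not taken from the dataset.
conversations = [
    {"id": "c1", "call_type": "inbound", "sentiment": "positive", "sample_rate": 8000},
    {"id": "c2", "call_type": "outbound", "sentiment": "neutral", "sample_rate": 16000},
    {"id": "c3", "call_type": "inbound", "sentiment": "negative", "sample_rate": 16000},
]

inbound_16k = filter_conversations(conversations, call_type="inbound", sample_rate=16000)
print([record["id"] for record in inbound_16k])
```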

    Usage and Applications

    This dataset is ideal for voice AI and NLP systems built for the real estate sector:


  7. Data from: EnTam: An English-Tamil Parallel Corpus (EnTam v2.0)

    • live.european-language-grid.eu
    binary format
    Updated Oct 30, 2014
    Cite
    (2014). EnTam: An English-Tamil Parallel Corpus (EnTam v2.0) [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/1085
    Explore at:
    binary format (available download formats)
    Dataset updated
    Oct 30, 2014
    License

    Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0): https://creativecommons.org/licenses/by-nc-sa/3.0/
    License information was derived automatically

    Description

    EnTam is a sentence-aligned English-Tamil bilingual corpus that we have collected from publicly available websites for NLP research involving Tamil. Standard processing was applied to the raw web data before it was made available as a sentence-aligned English-Tamil parallel corpus suitable for various NLP tasks. The parallel corpus includes texts from the Bible, cinema and news domains.

  8. Call Center Conversations Speech Dataset of E-Commerce Sector in Tamil

    • data.macgence.com
    mp3
    Updated Jun 8, 2025
    Cite
    Macgence (2025). Call Center Conversations Speech Dataset of E-Commerce Sector in Tamil [Dataset]. https://data.macgence.com/dataset/call-center-conversations-speech-dataset-of-e-commerce-sector-in-tamil
    Explore at:
    mp3 (available download formats)
    Dataset updated
    Jun 8, 2025
    Dataset authored and provided by
    Macgence
    License

    https://data.macgence.com/terms-and-conditions

    Time period covered
    2025
    Area covered
    Worldwide
    Variables measured
    Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
    Description

    Explore our high-quality Tamil speech dataset featuring real call center conversations from the e-commerce sector. Ideal for speech recognition, NLP, and AI training applications.

  9. 203 Hours Tamil Speech Dataset – Conversation & Monologue Audio

    • nexdata.ai
    • m.nexdata.ai
    Updated Apr 7, 2025
    Cite
    Nexdata (2025). 203 Hours Tamil Speech Dataset – Conversation & Monologue Audio [Dataset]. https://www.nexdata.ai/datasets/speechrecog/1390
    Dataset updated
    Apr 7, 2025
    Dataset authored and provided by
    Nexdata
    Variables measured
    Format, Country, Language, Accuracy Rate, Language(Region) Code, Recording environment, Features of annotation
    Description

    203 hours of real-world Tamil speech data featuring both casual conversations and scripted monologues. All audio was recorded from native Tamil speakers across various regions, reflecting real-world linguistic and acoustic diversity. Each sample is manually transcribed and annotated with speaker ID, gender, and other metadata, making it highly suitable for automatic speech recognition (ASR), speech synthesis (TTS), speaker identification, and natural language processing (NLP) applications. The dataset has been validated by leading AI companies and is particularly valuable for training robust AI models for underrepresented languages. All data collection, processing, and usage comply strictly with global data privacy laws including GDPR, CCPA, and PIPL, ensuring legal and ethical use.

  10. Indian Agent to Indian Customer call center Speech Dataset in Tamil for...

    • data.macgence.com
    mp3
    Updated Mar 17, 2024
    Cite
    Macgence (2024). Indian Agent to Indian Customer call center Speech Dataset in Tamil for Finance [Dataset]. https://data.macgence.com/dataset/indian-agent-to-indian-customer-call-center-speech-dataset-in-tamil-for-finance
    Explore at:
    mp3 (available download formats)
    Dataset updated
    Mar 17, 2024
    Dataset authored and provided by
    Macgence
    License

    https://data.macgence.com/terms-and-conditions

    Time period covered
    2025
    Area covered
    Worldwide
    Variables measured
    Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
    Description

    High-quality Tamil speech dataset featuring Indian agent-customer finance calls, ideal for ASR, NLP, and voice AI model training.

  11. Indian Agent to Indian Customer call center Speech Dataset in Tamil

    • data.macgence.com
    mp3
    Updated Mar 29, 2024
    Cite
    Macgence (2024). Indian Agent to Indian Customer call center Speech Dataset in Tamil [Dataset]. https://data.macgence.com/dataset/indian-agent-to-indian-customer-call-center-speech-dataset-in-tamil
    Explore at:
    mp3 (available download formats)
    Dataset updated
    Mar 29, 2024
    Dataset authored and provided by
    Macgence
    License

    https://data.macgence.com/terms-and-conditions

    Time period covered
    2025
    Area covered
    Worldwide
    Variables measured
    Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
    Description

    Explore Macgence's Tamil speech dataset of Indian agent-customer call center conversations—ideal for ASR, NLP, and voice AI training applications.

  12. Indian Agent to Indian Customer call center Speech Dataset in Tamil for...

    • data.macgence.com
    mp3
    Updated Mar 21, 2024
    Cite
    Macgence (2024). Indian Agent to Indian Customer call center Speech Dataset in Tamil for Banking [Dataset]. https://data.macgence.com/dataset/indian-agent-to-indian-customer-call-center-speech-dataset-in-tamil-for-banking
    Explore at:
    mp3 (available download formats)
    Dataset updated
    Mar 21, 2024
    Dataset authored and provided by
    Macgence
    License

    https://data.macgence.com/terms-and-conditions

    Time period covered
    2025
    Area covered
    Worldwide
    Variables measured
    Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
    Description

    Explore authentic Tamil call center speech data for banking, featuring Indian agents and customers. Curated by Macgence for voice AI and NLP projects.

  13. bhasha-wiki

    • huggingface.co
    Updated May 27, 2024
    Cite
    Soket Labs (2024). bhasha-wiki [Dataset]. https://huggingface.co/datasets/soketlabs/bhasha-wiki
    Explore at:
    Croissant (Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.)
    Dataset updated
    May 27, 2024
    Dataset authored and provided by
    Soket Labs
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Dataset Card for Bhasha-Wiki

    Translated Wikipedia articles

    Dataset Details

    Dataset is being updated

    Dataset Description

    We have translated 6.4 million English Wikipedia articles into 6 Indic languages. The translations were done using the IndicTrans2 model.

    Curated by: Soket AI labs
    Language(s) (NLP): Hindi, Bengali, Gujarati, Tamil, Kannada, Urdu
    License: cc-by-sa-3.0

    Uses

    For pretraining or fine-tuning Indic language models

    Dataset… See the full description on the dataset page: https://huggingface.co/datasets/soketlabs/bhasha-wiki.

  14. IndicNLP Corpus Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Feb 2, 2021
    Cite
    Anoop Kunchukuttan; Divyanshu Kakwani; Satish Golla; Gokul N. C.; Avik Bhattacharyya; Mitesh M. Khapra; Pratyush Kumar (2021). IndicNLP Corpus Dataset [Dataset]. https://paperswithcode.com/dataset/indicnlp-corpus
    Dataset updated
    Feb 2, 2021
    Authors
    Anoop Kunchukuttan; Divyanshu Kakwani; Satish Golla; Gokul N. C.; Avik Bhattacharyya; Mitesh M. Khapra; Pratyush Kumar
    Description

    The IndicNLP corpus is a large-scale, general-domain corpus containing 2.7 billion words for 10 Indian languages from two language families.

  15. indic-nlp

    • huggingface.co
    Cite
    Ayush Bagarua, indic-nlp [Dataset]. https://huggingface.co/datasets/ayushbagaria17/indic-nlp
    Authors
    Ayush Bagarua
    Description

    L3Cube-IndicNews

    L3Cube-IndicNews is a multilingual text classification corpus aimed at curating a high-quality dataset for Indian regional languages, with a specific focus on news headlines and articles. We have centered our work on 11 prominent Indic languages, including Hindi, Bengali, Marathi, Telugu, Tamil, Gujarati, Kannada, Odia, Malayalam, Punjabi and English. Each of these news datasets comprises 10 or more classes of news articles. L3Cube-IndicNews offers 3 distinct… See the full description on the dataset page: https://huggingface.co/datasets/ayushbagaria17/indic-nlp.

  16. IndicCorp Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Mar 10, 2024
    Cite
    Divyanshu Kakwani; Anoop Kunchukuttan; Satish Golla; Gokul N.C.; Avik Bhattacharyya; Mitesh M. Khapra; Pratyush Kumar. (2024). IndicCorp Dataset [Dataset]. https://paperswithcode.com/dataset/indiccorp
    Dataset updated
    Mar 10, 2024
    Authors
    Divyanshu Kakwani; Anoop Kunchukuttan; Satish Golla; Gokul N.C.; Avik Bhattacharyya; Mitesh M. Khapra; Pratyush Kumar.
    Description

    IndicCorp is a large monolingual corpus with around 9 billion tokens covering 12 of the major Indian languages. It was developed by discovering and scraping thousands of web sources (primarily news, magazines and books) over a duration of several months.

    Languages covered: Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu

    Corpus Format: The corpus is a single large text file containing one sentence per line. The publicly released version is randomly shuffled, untokenized and deduplicated.
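
Because the corpus is described as one large text file with one sentence per line, it can be processed as a stream without loading it into memory. The sketch below is a rough illustration: the function name is hypothetical, and since the released text is untokenized, whitespace splitting only approximates the token counts reported for the corpus.

```python
import io
from typing import Iterable, Tuple


def corpus_stats(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (sentence_count, whitespace_token_count) for a line stream."""
    sentences = 0
    tokens = 0
    for line in lines:
        text = line.strip()
        if not text:
            continue  # ignore blank lines
        sentences += 1
        tokens += len(text.split())
    return sentences, tokens


# Demo on an in-memory stand-in for a corpus file.
sample = io.StringIO("one two three\n\nfour five\n")
print(corpus_stats(sample))
```

For a real file, the same function accepts an open handle, for example `with open(path, encoding="utf-8") as handle: corpus_stats(handle)`, where `path` is wherever the downloaded corpus was saved.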

    Downloads

    Language   # News Articles*   Sentences   Tokens
    as         0.60M              1.39M       32.6M
    bn         3.83M              39.9M       836M
    en         3.49M              54.3M       1.22B
    gu         2.63M              41.1M       719M
    hi         4.95M              63.1M       1.86B
    kn         3.76M              53.3M       713M
    ml         4.75M              50.2M       721M
    mr         2.31M              34.0M       551M
    or         0.69M              6.94M       107M
    pa         2.64M              29.2M       773M
    ta         4.41M              31.5M       582M
    te         3.98M              47.9M       674M
    * Excluding articles obtained from the OSCAR corpus
  17. General domain Human-Human conversation chats in Tamil

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). General domain Human-Human conversation chats in Tamil [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/tamil-general-domain-conversation-text-dataset
    Explore at:
    wav (available download formats)
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    This training dataset comprises more than 10,000 general-domain text conversations, each between two native Tamil speakers. The chats cover a variety of everyday topics, services, and issues, such as music, books, festivals, health, kids, family, environment, study, childhood, cuisine, internet, and movies, which makes the dataset diverse.

    These chats contain language-specific words and phrases and follow the native way of talking, which makes them more information-rich for your NLP model. Besides being specific to its topic, each chat contains attributes such as people's names, addresses, contact information, email addresses, times, dates, local currency, telephone numbers, and local slang, in various formats, to make the text data unbiased.

    Each chat script has between 300 and 700 words and up to 50 turns. 150 members of the FutureBeeAI crowd community contributed to this dataset. You will also receive chat metadata, such as participant age, gender, and country information, along with the chats. Dataset applications include conversational AI, natural language processing (NLP), smart assistants, text recognition, text analytics, and text prediction.

    This dataset is being expanded with new chats all the time. We are able to produce text data in a variety of languages to meet your unique requirements. Check out the FutureBeeAI community for a custom collection.

    This training dataset's licence belongs to FutureBeeAI!

  18. On the Naturalness of Software

    • explore.openaire.eu
    Updated Jan 1, 2012
    Cite
    Abram Hindle; E. T. Barr; Z. Su; P. T. Devanbu; M. Gabel (2012). On the Naturalness of Software [Dataset]. http://doi.org/10.7939/r3-xp7r-ke29
    Dataset updated
    Jan 1, 2012
    Authors
    Abram Hindle; E. T. Barr; Z. Su; P. T. Devanbu; M. Gabel
    Description

    Natural languages like English are rich, complex, and powerful. The highly creative and graceful use of languages like English and Tamil, by masters like Shakespeare and Avvaiyar, can certainly delight and inspire. But in practice, given cognitive constraints and the exigencies of daily life, most human utterances are far simpler and much more repetitive and predictable. In fact, these utterances can be very usefully modeled using modern statistical methods. This fact has led to the phenomenal success of statistical approaches to speech recognition, natural language translation, question-answering, and text mining and comprehension. We begin with the conjecture that most software is also natural, in the sense that it is created by humans at work, with all the attendant constraints and limitations - and thus, like natural language, it is also likely to be repetitive and predictable. We then proceed to ask whether a) code can be usefully modeled by statistical language models and b) such models can be leveraged to support software engineers. Using the widely adopted n-gram model, we provide empirical evidence supportive of a positive answer to both these questions. We show that code is also very repetitive, and in fact even more so than natural languages. As an example use of the model, we have developed a simple code completion engine for Java that, despite its simplicity, already improves Eclipse's built-in completion capability. We conclude the paper by laying out a vision for future research in this area.
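
The idea of measuring how repetitive and predictable token sequences are can be illustrated with a tiny n-gram model. The sketch below is a deliberate simplification and not the paper's experimental setup: it uses a bigram model with add-one (Laplace) smoothing, and the class name, padding symbols, and toy training data are all assumptions. Lower cross-entropy means the sequence is more predictable under the model.

```python
import math
from collections import Counter
from typing import List, Sequence


class BigramModel:
    """Bigram language model with add-one (Laplace) smoothing."""

    def __init__(self, sequences: Sequence[List[str]]):
        self.context_counts = Counter()
        self.bigram_counts = Counter()
        # Possible next symbols: every training token plus the end marker.
        self.vocabulary = {"</s>"}
        for tokens in sequences:
            padded = ["<s>"] + list(tokens) + ["</s>"]
            self.vocabulary.update(tokens)
            for prev, nxt in zip(padded, padded[1:]):
                self.context_counts[prev] += 1
                self.bigram_counts[(prev, nxt)] += 1

    def probability(self, prev: str, nxt: str) -> float:
        """Smoothed P(nxt | prev). Tokens never seen in training are not
        modeled properly here; a real system would add an unknown-token class."""
        numerator = self.bigram_counts[(prev, nxt)] + 1
        denominator = self.context_counts[prev] + len(self.vocabulary)
        return numerator / denominator

    def cross_entropy(self, tokens: List[str]) -> float:
        """Average negative log2 probability per predicted symbol, in bits."""
        padded = ["<s>"] + list(tokens) + ["</s>"]
        log_sum = sum(
            math.log2(self.probability(prev, nxt))
            for prev, nxt in zip(padded, padded[1:])
        )
        return -log_sum / (len(padded) - 1)


# Toy training data: the same token sequence repeated three times.
model = BigramModel([["for", "i", "in", "x"]] * 3)
seen = model.cross_entropy(["for", "i", "in", "x"])
scrambled = model.cross_entropy(["x", "in", "i", "for"])
print(seen < scrambled)
```

The familiar ordering scores lower cross-entropy than the scrambled one, which is the sense in which repetitive text or code is "predictable" to such a model.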

  19. Tamil Call Center Data for Healthcare AI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Tamil Call Center Data for Healthcare AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/healthcare-call-center-conversation-tamil-india
    Explore at:
    wav (available download formats)
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    This Tamil Call Center Speech Dataset for the Healthcare industry is purpose-built to accelerate the development of Tamil speech recognition, spoken language understanding, and conversational AI systems. With 30 Hours of unscripted, real-world conversations, it delivers the linguistic and contextual depth needed to build high-performance ASR models for medical and wellness-related customer service.

    Created by FutureBeeAI, this dataset empowers voice AI teams, NLP researchers, and data scientists to develop domain-specific models for hospitals, clinics, insurance providers, and telemedicine platforms.

    Speech Data

    The dataset features 30 Hours of dual-channel call center conversations between native Tamil speakers. These recordings cover a variety of healthcare support topics, enabling the development of speech technologies that are contextually aware and linguistically rich.

    • Participant Diversity:
      • Speakers: 60 verified native Tamil speakers from our contributor community.
      • Regions: Diverse regions across Tamil Nadu to ensure broad dialectal representation.
      • Participant Profile: Age range of 18–70 with a gender mix of 60% male and 40% female.
    • Recording Details:
      • Conversation Nature: Naturally flowing, unscripted conversations.
      • Call Duration: Each session ranges from 5 to 15 minutes.
      • Audio Format: WAV format, stereo, 16-bit depth at 8kHz and 16kHz sample rates.
      • Recording Environment: Captured in clear conditions without background noise or echo.
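The audio properties listed above (stereo, 16-bit, 8 kHz or 16 kHz) can be checked on a downloaded file with Python's standard `wave` module. This is a minimal sketch; the helper names and the file path in the usage comment are hypothetical and not part of the dataset's tooling.

```python
import wave

def describe_wav(path):
    """Return channel count, bit depth, sample rate and duration of a WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "bit_depth": wav.getsampwidth() * 8,  # sample width is in bytes
            "sample_rate": rate,
            "duration_s": frames / rate,
        }

def matches_listing(info):
    """Check the properties stated in the listing: stereo, 16-bit, 8 kHz or 16 kHz."""
    return (
        info["channels"] == 2
        and info["bit_depth"] == 16
        and info["sample_rate"] in (8000, 16000)
    )

# Usage (hypothetical path):
# info = describe_wav("call_0001.wav")
# print(info, matches_listing(info))
```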

    Topic Diversity

    The dataset spans inbound and outbound calls, capturing a broad range of healthcare-specific interactions and sentiment types (positive, neutral, negative).

    • Inbound Calls:
      • Appointment Scheduling
      • New Patient Registration
      • Surgical Consultation
      • Dietary Advice and Consultations
      • Insurance Coverage Inquiries
      • Follow-up Treatment Requests, and more
    • Outbound Calls:
      • Appointment Reminders
      • Preventive Care Campaigns
      • Test Results & Lab Reports
      • Health Risk Assessment Calls
      • Vaccination Updates
      • Wellness Subscription Outreach, and more

    These real-world interactions help build speech models that understand healthcare domain nuances and user intent.

    Transcription

    Every audio file is accompanied by high-quality, manually created transcriptions in JSON format.

    • Transcription Includes:
      • Speaker-identified Dialogues
      • Time-coded Segments
      • Non-speech Annotations (e.g., silence, cough)
    • High transcription accuracy, with a word error rate below 5%, backed by dual-layer QA checks.

    Metadata

    Each conversation and speaker includes detailed metadata to support fine-tuned training and analysis.

    • Participant Metadata: ID, gender, age, region, accent, and dialect.
    • Conversation Metadata: Topic, sentiment, call type, sample rate, and technical specs.

    Usage and Applications

    This dataset can be used across a range of healthcare and voice AI use cases:


  20. Data_Sheet_1_Development and testing of a multi-lingual Natural Language...

    • frontiersin.figshare.com
    docx
    Updated Jun 1, 2023
    Cite
    Lily Wei Yun Yang; Wei Yan Ng; Xiaofeng Lei; Shaun Chern Yuan Tan; Zhaoran Wang; Ming Yan; Mohan Kashyap Pargi; Xiaoman Zhang; Jane Sujuan Lim; Dinesh Visva Gunasekeran; Franklin Chee Ping Tan; Chen Ee Lee; Khung Keong Yeo; Hiang Khoon Tan; Henry Sun Sien Ho; Benedict Wee Bor Tan; Tien Yin Wong; Kenneth Yung Chiang Kwek; Rick Siow Mong Goh; Yong Liu; Daniel Shu Wei Ting (2023). Data_Sheet_1_Development and testing of a multi-lingual Natural Language Processing-based deep learning system in 10 languages for COVID-19 pandemic crisis: A multi-center study.docx [Dataset]. http://doi.org/10.3389/fpubh.2023.1063466.s001
    Available download formats: docx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers
    Authors
    Lily Wei Yun Yang; Wei Yan Ng; Xiaofeng Lei; Shaun Chern Yuan Tan; Zhaoran Wang; Ming Yan; Mohan Kashyap Pargi; Xiaoman Zhang; Jane Sujuan Lim; Dinesh Visva Gunasekeran; Franklin Chee Ping Tan; Chen Ee Lee; Khung Keong Yeo; Hiang Khoon Tan; Henry Sun Sien Ho; Benedict Wee Bor Tan; Tien Yin Wong; Kenneth Yung Chiang Kwek; Rick Siow Mong Goh; Yong Liu; Daniel Shu Wei Ting
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Purpose: The COVID-19 pandemic has drastically disrupted global healthcare systems. With the higher demand for healthcare and misinformation related to COVID-19, there is a need to explore alternative models to improve communication. Artificial Intelligence (AI) and Natural Language Processing (NLP) have emerged as promising solutions to improve healthcare delivery. Chatbots could fill a pivotal role in the dissemination and easy accessibility of accurate information in a pandemic. In this study, we developed a multi-lingual NLP-based AI chatbot, DR-COVID, which responds accurately to open-ended, COVID-19 related questions. This was used to facilitate pandemic education and healthcare delivery.

    Methods: First, we developed DR-COVID with an ensemble NLP model on the Telegram platform (https://t.me/drcovid_nlp_chatbot). Second, we evaluated various performance metrics. Third, we evaluated multi-lingual text-to-text translation to Chinese, Malay, Tamil, Filipino, Thai, Japanese, French, Spanish, and Portuguese. We utilized 2,728 training questions and 821 test questions in English. Primary outcome measurements were (A) overall and top 3 accuracies; (B) Area Under the Curve (AUC), precision, recall, and F1 score. Overall accuracy referred to a correct response for the top answer, whereas top 3 accuracy referred to an appropriate response for any one answer amongst the top 3 answers. AUC and its relevant matrices were obtained from the Receiver Operation Characteristics (ROC) curve. Secondary outcomes were (A) multi-lingual accuracy; (B) comparison to enterprise-grade chatbot systems. The sharing of training and testing datasets on an open-source platform will also contribute to existing data.

    Results: Our NLP model, utilizing the ensemble architecture, achieved overall and top 3 accuracies of 0.838 [95% confidence interval (CI): 0.826–0.851] and 0.922 [95% CI: 0.913–0.932] respectively. For overall and top 3 results, AUC scores of 0.917 [95% CI: 0.911–0.925] and 0.960 [95% CI: 0.955–0.964] were achieved respectively. We achieved multi-linguicism with nine non-English languages, with Portuguese performing the best overall at 0.900. Lastly, DR-COVID generated answers more accurately and quickly than other chatbots, within 1.12–2.15 s across three devices tested.

    Conclusion: DR-COVID is a clinically effective NLP-based conversational AI chatbot, and a promising solution for healthcare delivery in the pandemic era.
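The distinction between overall (top 1) and top 3 accuracy described in the abstract can be sketched as follows. This assumes each question has a single gold answer ID and that the model returns a ranked list of answer IDs; the study judged "appropriate" responses by its own protocol, so this is an illustration of the metric, not the authors' evaluation code. All IDs below are made up.

```python
def top_k_accuracy(ranked_answers, gold_answers, k):
    """Fraction of questions whose gold answer appears among the top-k ranked answers."""
    if len(ranked_answers) != len(gold_answers):
        raise ValueError("need exactly one gold answer per question")
    hits = sum(
        gold in ranked[:k]
        for ranked, gold in zip(ranked_answers, gold_answers)
    )
    return hits / len(gold_answers)

# Made-up ranked predictions for four questions, best answer first.
ranked = [
    ["a1", "a7", "a3"],
    ["a2", "a5", "a9"],
    ["a4", "a6", "a8"],
    ["a3", "a1", "a2"],
]
gold = ["a1", "a5", "a0", "a2"]

overall = top_k_accuracy(ranked, gold, k=1)  # only the first answer counts
top3 = top_k_accuracy(ranked, gold, k=3)     # any of the first three counts
print(overall, top3)
```

Top 3 accuracy can never be lower than overall accuracy, because every top 1 hit is also a top 3 hit.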

Cite
Divyanshu Kakwani; Anoop Kunchukuttan; Satish Golla; Gokul N.C.; Avik Bhattacharyya; Mitesh M. Khapra; Pratyush Kumar. (2024). IndicCorp Dataset [Dataset]. https://paperswithcode.com/dataset/indiccorp

IndicCorp Dataset

Explore at:
139 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Mar 10, 2024
Authors
Divyanshu Kakwani; Anoop Kunchukuttan; Satish Golla; Gokul N.C.; Avik Bhattacharyya; Mitesh M. Khapra; Pratyush Kumar.
Description

IndicCorp is a large collection of monolingual corpora with around 9 billion tokens covering 12 of the major Indian languages. It was developed by discovering and scraping thousands of web sources, primarily news, magazines and books, over a period of several months.

Languages covered: Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu

Corpus Format: The corpus is a single large text file containing one sentence per line. The publicly released version is randomly shuffled, untokenized and deduplicated.
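Because the corpus is described as one sentence per line, a single pass over the file is enough to gather basic statistics. The sketch below assumes whitespace tokenization, which is an approximation (the release is untokenized, so counts may not match the published token figures); the file name in the usage comment is hypothetical.

```python
def corpus_stats(lines):
    """Count non-empty sentences and whitespace-separated tokens in an iterable of lines."""
    sentences = 0
    tokens = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        sentences += 1
        tokens += len(line.split())
    return sentences, tokens

# Usage (hypothetical file name), streaming so the whole file is never held in memory:
# with open("ta.txt", encoding="utf-8") as corpus:
#     print(corpus_stats(corpus))
```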

Downloads

Language   # News Articles*   Sentences   Tokens   Link
as         0.60M              1.39M       32.6M    link
bn         3.83M              39.9M       836M     link
en         3.49M              54.3M       1.22B    link
gu         2.63M              41.1M       719M     link
hi         4.95M              63.1M       1.86B    link
kn         3.76M              53.3M       713M     link
ml         4.75M              50.2M       721M     link
mr         2.31M              34.0M       551M     link
or         0.69M              6.94M       107M     link
pa         2.64M              29.2M       773M     link
ta         4.41M              31.5M       582M     link
te         3.98M              47.9M       674M     link

* Excluding articles obtained from the OSCAR corpus