68 datasets found
  1. Tamil (Tamizh) Wikipedia Text Dataset for NLP

    • kaggle.com
    zip
    Updated Nov 12, 2024
    Cite
    Younus_Mohamed (2024). Tamil (Tamizh) Wikipedia Text Dataset for NLP [Dataset]. https://www.kaggle.com/datasets/younusmohamed/tamil-tamizh-wikipedia-articles
    Available download formats: zip (339341289 bytes)
    Dataset updated
    Nov 12, 2024
    Authors
    Younus_Mohamed
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    This dataset is part of a larger mission to transform Tamil into a high-resource language in the field of Natural Language Processing (NLP). As one of the oldest and most culturally rich languages, Tamil has a unique linguistic structure, yet it remains underrepresented in the NLP landscape. This dataset, extracted from Tamil Wikipedia, serves as a foundational resource to support Tamil language processing, text mining, and machine learning applications.

    What’s Included

    - Text Data: This dataset contains over 569,000 articles in raw text form, extracted from Tamil Wikipedia. The collection is ideal for language model training, word frequency analysis, and text mining.

    - Scripts and Processing Tools: Code snippets are provided for processing .bz2 compressed files, generating word counts, and handling data for NLP applications.
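
    The dataset's own processing scripts are not reproduced in this listing. As a rough sketch of the kind of .bz2 word counting described above, the following assumes a UTF-8 text file compressed with bzip2 and treats any run of characters from the Tamil Unicode block as a word. The function name count_tamil_words and the regular expression are illustrative assumptions, not the author's code.

```python
import bz2
import re
from collections import Counter

# Assumption: a "word" is a run of code points in the Tamil Unicode block
# (U+0B80 to U+0BFF). This keeps vowel signs and the pulli attached to their
# consonants, which a plain \w+ pattern would not do.
TAMIL_WORD = re.compile(r"[\u0B80-\u0BFF]+")


def count_tamil_words(path, encoding="utf-8"):
    """Stream a bzip2-compressed text file and count Tamil words in it."""
    counts = Counter()
    # bz2.open in text mode decompresses lazily, so the whole file is
    # never held in memory at once.
    with bz2.open(path, mode="rt", encoding=encoding) as handle:
        for line in handle:
            counts.update(TAMIL_WORD.findall(line))
    return counts
```

    Calling count_tamil_words(path).most_common(20) on such a file would list the twenty most frequent Tamil tokens. Non-Tamil tokens (Latin text, digits, markup) are ignored by design in this sketch.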

    Why This Dataset?

    Although Tamil has a documented lexicon of over 100,000 words, only a fraction of these are actively used in everyday communication. The largest available Tamil treebank currently holds only 10,000 words, limiting the scope for training accurate language models. This dataset aims to bridge that gap by providing a robust, open-source corpus for researchers, developers, and linguists who want to work on Tamil language technologies.

    How You Can Use This Dataset

    - Language Modeling: Train or fine-tune models like BERT, GPT, or LSTM-based language models for Tamil.

    - Linguistic Research: Analyze Tamil morphology, syntax, and vocabulary usage.

    - Data Augmentation: Use the raw text to generate augmented data for multilingual NLP applications.

    - Word Embeddings and Semantic Analysis: Create embeddings for Tamil words, useful in multilingual setups or standalone applications.

    Let’s Collaborate!

    I believe that advancing Tamil in NLP cannot be a solo effort. Contributions in the form of additional data, annotations, or even new tools for Tamil language processing are welcome! By working together, we can make Tamil a truly high-resource language in NLP.

    License

    This dataset is based on content from Tamil Wikipedia and is shared under the Creative Commons Attribution-ShareAlike 3.0 Unported License (CC BY-SA 3.0). Proper attribution to Wikipedia is required when using this data.

  2. Tamil NLP

    • kaggle.com
    zip
    Updated Mar 11, 2019
    Cite
    SRK (2019). Tamil NLP [Dataset]. https://www.kaggle.com/datasets/sudalairajkumar/tamil-nlp/code
    Available download formats: zip (3067205 bytes)
    Dataset updated
    Mar 11, 2019
    Authors
    SRK
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Context

    Indic NLP - Natural Language Processing for Indian Languages.

    This dataset is a step towards the same for the Tamil language. Thanks to Malaikannan for the initiative and Selva for getting the data from websites. The idea is to add more datasets related to Tamil NLP in a single place.

    Content

    The dataset has the following files.

    Tamil News Classification

    This dataset has 14521 rows for training and 3631 rows for testing. It has 6 news categories - "tamilnadu", "india", "cinema", "sports", "politics", "world". The data is obtained from this link

    • tamil_news_train.csv - Train dataset for tamil news classification.
    • tamil_news_test.csv - Test dataset for tamil news classification

    Tamil Movie Review Dataset

    This dataset has 480 training samples and 121 testing samples. It has the review text in Tamil and ratings from 1 to 5. The data is obtained from this link

    • tamil_movie_reviews_train.csv - Train dataset for tamil movie reviews
    • tamil_movie_reviews_test.csv - Test dataset for tamil movie reviews

    Thirukkural Dataset

    From Wikipedia: the Tirukkural, or the Kural for short, is a classic Tamil text consisting of 1,330 couplets, or Kurals, dealing with the everyday virtues of an individual. It is one of the two oldest works now extant in Tamil literature.

    I have split the data into train and test and we can use the kural and / or the explanations to predict the three parts - aram (virtue), porul (polity) and inbam (love). The dataset is obtained from this link.

    • tamil_thirukkural_train - train dataset having 1064 rows
    • tamil_thirukkural_test - test dataset having 266 rows

    Will add more datasets in the following versions.

    Acknowledgements

    My sincere thanks to :

    • Malaikannan for starting this initiative
    • Selvakumar for getting the data
    • Vijay Anand for the Thirukkural data

    Inspiration

    Some questions which can be answered are

    1. Can we do text classification for Tamil and get accuracies similar to those for other languages?
    2. How do language models perform for Tamil?

    And many more interesting questions remain to be answered.

    Check out this link to find similar and dissimilar words for Tamil.

  3. Tamil and Tanglish YT Video Transcripts for NLP

    • kaggle.com
    zip
    Updated Jan 14, 2025
    Cite
    Younus_Mohamed (2025). Tamil and Tanglish YT Video Transcripts for NLP [Dataset]. https://www.kaggle.com/datasets/younusmohamed/tamil-and-tanglish-yt-video-transcripts-for-nlp
    Available download formats: zip (158098806 bytes)
    Dataset updated
    Jan 14, 2025
    Authors
    Younus_Mohamed
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Tamil and Tanglish YT Video Transcripts for NLP
    This dataset provides a comprehensive collection of processed and combined YouTube video transcripts featuring Tamil and Tanglish (Tamil-English mixed) text. It is designed to support various Natural Language Processing (NLP) tasks, including sentiment analysis, language modeling, transliteration, and code-switching studies.

    Key Features

    • Processed Text: Contains cleaned transcripts for noise-free analysis.
    • Tanglish Support: Focuses on Tamil-English code-switched text, commonly found in digital platforms.
    • Diverse Topics: Includes transcripts from multiple topics, such as news, entertainment, and education, offering a rich context for NLP applications.
    • Wide Applicability: Suitable for text classification, translation, and low-resource language modeling tasks.

    Columns

    • topic: The topic category of the transcript (e.g., Tamil News, Tamil Weather Updates).
    • video_id: Unique identifier for each video.
    • start: Timestamp for when the text appears in the video.
    • text: Original transcript text from the video.
    • cleaned_text: Processed transcript text with unnecessary characters and noise removed.
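
    As a minimal sketch of how these columns might be used, the snippet below stitches the per-segment rows back into one transcript per video. It assumes the data is available as CSV rows with exactly the column names listed above and that start holds a numeric offset in seconds; the file layout and the helper name transcripts_by_video are assumptions, not something the listing confirms.

```python
import csv
import io
from collections import defaultdict


def transcripts_by_video(rows):
    """Join cleaned_text segments per video_id, ordered by their start time.

    `rows` is any iterable of dicts with the keys video_id, start and
    cleaned_text, for example a csv.DictReader.
    """
    segments = defaultdict(list)
    for row in rows:
        # Assumption: `start` parses as a number of seconds.
        segments[row["video_id"]].append((float(row["start"]), row["cleaned_text"]))
    return {
        video_id: " ".join(text for _, text in sorted(parts, key=lambda part: part[0]))
        for video_id, parts in segments.items()
    }


# Tiny in-memory example using the documented column names.
sample = io.StringIO(
    "topic,video_id,start,text,cleaned_text\n"
    "Tamil News,v1,5.0,B!,b\n"
    "Tamil News,v1,1.0,A!,a\n"
)
print(transcripts_by_video(csv.DictReader(sample)))
```

    For a real file, the same function could be fed csv.DictReader(open(path, encoding="utf-8", newline="")), where path points to wherever the extracted CSV is stored.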

    Use Cases

    • Develop NLP models for Tamil-English code-switched data.
    • Analyze sentiment or trends in Tamil and Tanglish content.
    • Train language models for low-resource and multilingual applications.

    Acknowledgment

    This dataset was carefully curated and cleaned to empower researchers, developers, and linguists to explore and advance NLP for Tamil and Tanglish text.

    Tags

    • Natural Language Processing (NLP)
    • Tamil NLP
    • Tanglish
    • Code-Switching
    • Low-Resource Languages
    • Multilingual NLP
    • Sentiment Analysis
    • Text Mining

    License

    The dataset is released under the Public Domain Dedication (CC0) license, allowing free use, modification, and sharing.

  4. English-Tamil Parallel Corpus for the BFSI Domain

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). English-Tamil Parallel Corpus for the BFSI Domain [Dataset]. https://www.futurebeeai.com/dataset/parallel-corpora/tamil-english-translated-parallel-corpus-for-bfsi-domain
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the English-Tamil Bilingual Parallel Corpora dataset for the Banking, Financial Services, and Insurance (BFSI) domain. This meticulously curated dataset offers a rich collection of bilingual sentence pairs translated between English and Tamil. It serves as a valuable resource for developing domain-specific machine translation systems, language models, and NLP applications within the BFSI sector.

    Dataset Content

    Volume and Diversity
    Extensive Coverage: Contains over 50,000 bilingual sentence pairs, ideal for a wide range of language processing tasks.
    Translator Diversity: Created with the help of 200+ native Tamil translators, ensuring varied linguistic styles, tone, and regional expressions.
    Sentence Diversity
    Word Count: Sentences range from 7 to 25 words, suitable for model training and evaluation.
    Syntactic Variety: Includes simple, compound, and complex sentence structures.
    Grammatical Forms: Interrogative (questions) and imperative (commands), Affirmative and negative statements, Active and passive voice constructions.
    Figurative Language: Incorporates idioms, metaphors, and colloquial expressions relevant to real-world BFSI communications.
    Discourse Features: Includes logical connectors and transitional phrases for coherent, natural language flow.
    Cross Translation: Supports bi-directional translation with content translated both from English to Tamil and Tamil to English.

    Domain-Specific Content

    Specialized Terminology: Covers technical vocabulary from banking, insurance, financial services, compliance, investment, and fintech.
    Authentic Industry Language: Captures real-world usage, including expressions from customer service conversations, financial reporting, and policy documentation.
    Contextual Coverage: Draws content from scenarios such as:
    Banking transactions and statements
    Risk management reports
    Compliance policies
    Claims processing and customer support dialogs
    Cross-Domain Elements: Includes supporting vocabulary from general business, legal, and technology domains, relevant to modern BFSI operations.

    Format and Structure

    File Formats: Delivered in Excel format by default, with easy conversion to JSON, TMX, XML, XLIFF, XLS, and other widely supported industry formats.
    Dataset Structure: Serial Number, Unique ID, Source Sentence and Source Word Count, Target Sentence and Target Word Count

    Usage and Applications

    Machine Translation and Localization: Supports training of accurate translation models and localization systems specific to the BFSI sector.
    NLP Systems: Useful for enhancing tools such as grammar checkers, spell checkers, predictive text, and speech/text understanding engines.
    Large Language Models (LLMs): Enables fine-tuning and bilingual enhancement of LLMs for:
    Financial content generation
    Summarization of market reports
    Automated responses to customer service and

  5. Tamil-Finetuning-data

    • huggingface.co
    Updated Feb 20, 2025
    Cite
    Thrisha Sivasakthi (2025). Tamil-Finetuning-data [Dataset]. https://huggingface.co/datasets/ThrishaSivasakthi/Tamil-Finetuning-data
    Dataset updated
    Feb 20, 2025
    Authors
    Thrisha Sivasakthi
    Description

    Dataset Card for Dataset Name

    This dataset is designed for fine-tuning Large Language Models (LLMs) in Tamil, enabling them to understand and generate high-quality Tamil text across multiple domains. It contains 72,000 curated and generated samples, ensuring a rich linguistic diversity that improves model generalization. 🔹 Sources: Kaggle Tamil NLP, Sentiment Analysis datasets, and synthetic data. 🔹 Languages: Tamil, Tanglish (Tamil-English mix), and regional Tamil dialects. 🔹… See the full description on the dataset page: https://huggingface.co/datasets/ThrishaSivasakthi/Tamil-Finetuning-data.

  6. 203 Hours Tamil Speech Dataset – Conversation & Monologue Audio

    • m.nexdata.ai
    • nexdata.ai
    Updated Jul 16, 2025
    Cite
    Nexdata (2025). 203 Hours Tamil Speech Dataset – Conversation & Monologue Audio [Dataset]. https://m.nexdata.ai/datasets/speechrecog/1390?source=huggingface
    Dataset updated
    Jul 16, 2025
    Dataset authored and provided by
    Nexdata
    Variables measured
    Format, Country, Language, Accuracy Rate, Language(Region) Code, Recording environment, Features of annotation
    Description

    203 hours of real-world Tamil speech data featuring both casual conversations and scripted monologues. All audio was recorded from native Tamil speakers across various regions, reflecting real-world linguistic and acoustic diversity. Each sample is manually transcribed and annotated with speaker ID, gender, and other metadata, making it highly suitable for automatic speech recognition (ASR), speech synthesis (TTS), speaker identification, and natural language processing (NLP) applications. The dataset has been validated by leading AI companies and is particularly valuable for training robust AI models for underrepresented languages. All data collection, processing, and usage comply strictly with global data privacy laws including GDPR, CCPA, and PIPL, ensuring legal and ethical use.

  7. Data from: MADTRAS (Dataset for Aspect-based Sentiment Analysis of Movie...

    • data.mendeley.com
    Updated Apr 14, 2025
    Cite
    Arunmozhi Mourougappane (2025). MADTRAS (Dataset for Aspect-based Sentiment Analysis of Movie Reviews in Tamil) [Dataset]. http://doi.org/10.17632/p59cfx4vx6.2
    Dataset updated
    Apr 14, 2025
    Authors
    Arunmozhi Mourougappane
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset is a carefully selected set of Tamil film reviews with the goal of advancing NLP research in the areas of text classification, sentiment analysis, and aspect-based sentiment analysis. We have invited users to review twenty-five films using a Google form. Additional reviews were taken from websites such as IMDb and YouTube. From the list of selected aspects, we also made sure that the review collection was based on the presence of at least one target aspect, including cinematography, acting, screenplay, story, director, songs, background music, and editing. About 1,390 reviews total, tagged for positive as well as negative views across eight different categories, make up the dataset.

  8. Ponniyan selvan Tamil Book for NLP

    • kaggle.com
    zip
    Updated Sep 9, 2020
    Cite
    Dinesh Kumar Sarangapani (2020). Ponniyan selvan Tamil Book for NLP [Dataset]. https://www.kaggle.com/datasets/dineshkumarsarang/ponniyan-selvan-tamil-book-for-nlp/discussion
    Available download formats: zip (1985053 bytes)
    Dataset updated
    Sep 9, 2020
    Authors
    Dinesh Kumar Sarangapani
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Dinesh Kumar Sarangapani

    Released under CC0: Public Domain

    Contents

  9. Claim Detection and Matching for Indian Languages

    • zenodo.org
    • data.niaid.nih.gov
    • +1 more
    csv
    Updated Jun 6, 2021
    Cite
    Ashkan Kazemi; Kiran Garimella; Devin Gaffney; Scott A. Hale (2021). Claim Detection and Matching for Indian Languages [Dataset]. http://doi.org/10.5281/zenodo.4890950
    Available download formats: csv
    Dataset updated
    Jun 6, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ashkan Kazemi; Kiran Garimella; Devin Gaffney; Scott A. Hale
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    India
    Description

    Two datasets are included in this repository: claim matching and claim detection datasets. The collections contain data in 5 languages: Bengali, English, Hindi, Malayalam and Tamil.

    The "claim detection" dataset contains textual claims from social media and fact-checking websites annotated for the "fact-check worthiness" of the claims in each message. Data points have one of the three labels of "Yes" (text contains one or more check-worthy claims), "No" and "Probably".

    The "claim matching" dataset is a curated collection of pairs of textual claims from social media and fact-checking websites for the purpose of automatic and multilingual claim matching. Pairs of data have one of the four labels of "Very Similar", "Somewhat Similar", "Somewhat Dissimilar" and "Very Dissimilar".

    All personally identifiable information (PII), including phone numbers, email addresses, license plate numbers and addresses, has been replaced with general tags to protect user anonymity. A detailed explanation of the curation and annotation process is provided in our ACL 2021 paper:
    Kazemi, A.; Garimella, K.; Gaffney, D.; and Hale, S. A. 2021. Claim Matching Beyond English to Scale Global Fact-Checking. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, ACL 2021.

  10. Tamil Words Frequency

    • kaggle.com
    zip
    Updated Jul 3, 2023
    Cite
    Laaveshwaran Parthiban (2023). Tamil Words Frequency [Dataset]. https://www.kaggle.com/datasets/aviiciii/tamil-words-frequency
    Available download formats: zip (42398729 bytes)
    Dataset updated
    Jul 3, 2023
    Authors
    Laaveshwaran Parthiban
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Github Repo: https://github.com/aviiciii/tamil-word-frequency

    This is the output data of the above project, which analyses the frequency of words in the Tamil language across various sources of data.

    The Kaggle dataset used in this repository provides a comprehensive collection of Tamil language text data for Natural Language Processing (NLP) tasks. The dataset has been curated and compiled from various sources, and it serves as a valuable resource for linguistic research, NLP model training, and data analysis in the Tamil language domain.

    The dataset contains a significant amount of Tamil language text, offering a diverse range of topics and genres. It includes a variety of text sources, such as news articles, books, blogs, social media content, and more. The texts cover a wide spectrum of vocabulary, enabling researchers and practitioners to explore different aspects of the Tamil language.

  11. Tamil Call Center Data for Realestate AI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Tamil Call Center Data for Realestate AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/realestate-call-center-conversation-tamil-india
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    This Tamil Call Center Speech Dataset for the Real Estate industry is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for Tamil-speaking Real Estate customers. With over 30 hours of unscripted, real-world audio, this dataset captures authentic conversations between customers and real estate agents, ideal for building robust ASR models.

    Curated by FutureBeeAI, this dataset equips voice AI developers, real estate tech platforms, and NLP researchers with the data needed to create high-accuracy, production-ready models for property-focused use cases.

    Speech Data

    The dataset features 30 hours of dual-channel call center recordings between native Tamil speakers. Captured in realistic real estate consultation and support contexts, these conversations span a wide array of property-related topics, from inquiries to investment advice, offering deep domain coverage for AI model development.

    Participant Diversity:
    Speakers: 60 native Tamil speakers from our verified contributor community.
    Regions: Representing different regions across Tamil Nadu to ensure accent and dialect variation.
    Participant Profile: Balanced gender mix (60% male, 40% female) and age range from 18 to 70.
    Recording Details:
    Conversation Nature: Naturally flowing, unscripted agent-customer discussions.
    Call Duration: Average 5–15 minutes per call.
    Audio Format: Stereo WAV, 16-bit, recorded at 8kHz and 16kHz.
    Recording Environment: Captured in noise-free and echo-free conditions.

    Topic Diversity

    This speech corpus includes both inbound and outbound calls, featuring positive, neutral, and negative outcomes across a wide range of real estate scenarios.

    Inbound Calls:
    Property Inquiries
    Rental Availability
    Renovation Consultation
    Property Features & Amenities
    Investment Property Evaluation
    Ownership History & Legal Info, and more
    Outbound Calls:
    New Listing Notifications
    Post-Purchase Follow-ups
    Property Recommendations
    Value Updates
    Customer Satisfaction Surveys, and others

    Such domain-rich variety ensures model generalization across common real estate support conversations.

    Transcription

    All recordings are accompanied by precise, manually verified transcriptions in JSON format.

    Transcription Includes:
    Speaker-Segmented Dialogues
    Time-coded Segments
    Non-speech Tags (e.g., background noise, pauses)
    High transcription accuracy with word error rate below 5% via dual-layer human review.
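
    The listing does not say how the vendor measures the quoted word error rate. For reference, the conventional definition is the word-level edit distance (substitutions, deletions and insertions) divided by the number of words in the reference transcript. The sketch below implements that standard definition; the function name and whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing from hypothesis
            insertion = current[j - 1] + 1  # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One substitution and one deletion against a four-word reference.
print(word_error_rate("a b c d", "a x c"))  # 0.5
```

    Under this definition, a rate below 5% means fewer than one word-level edit for every twenty reference words.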

    These transcriptions streamline ASR and NLP development for Tamil real estate voice applications.

    Metadata

    Detailed metadata accompanies each participant and conversation:

    Participant Metadata: ID, age, gender, location, accent, and dialect.
    Conversation Metadata: Topic, call type, sentiment, sample rate, and technical details.

    This enables smart filtering, dialect-focused model training, and structured dataset exploration.

    Usage and Applications

    This dataset is ideal for voice AI and NLP systems built for the real estate sector:

  12. tamil-english-colloquial-translations

    • huggingface.co
    Cite
    Nandhini Varadharajan, tamil-english-colloquial-translations [Dataset]. https://huggingface.co/datasets/nandhinivaradharajan14/tamil-english-colloquial-translations
    Authors
    Nandhini Varadharajan
    Description

    Dataset: English-Tamil (en-ta) Parallel Corpus

    This dataset contains parallel sentences in English and Tamil (en-ta) that have been curated from multiple sources. It is designed for tasks such as machine translation, language modeling, and other natural language processing (NLP) applications involving English and Tamil.

    Dataset Composition

    The dataset is composed of three main parts, which have been concatenated into a single file with two columns: ta (Tamil) and… See the full description on the dataset page: https://huggingface.co/datasets/nandhinivaradharajan14/tamil-english-colloquial-translations.

  13. Data from: EnTam: An English-Tamil Parallel Corpus (EnTam v2.0)

    • live.european-language-grid.eu
    binary format
    Updated Oct 30, 2014
    Cite
    (2014). EnTam: An English-Tamil Parallel Corpus (EnTam v2.0) [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/1085
    Available download formats: binary format
    Dataset updated
    Oct 30, 2014
    License

    Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0): https://creativecommons.org/licenses/by-nc-sa/3.0/
    License information was derived automatically

    Description

    EnTam is a sentence-aligned English-Tamil bilingual corpus that we collected from publicly available websites for NLP research involving Tamil. Standard processing was applied to the raw web data before it was released as a sentence-aligned English-Tamil parallel corpus suitable for various NLP tasks. The parallel corpus includes texts from the Bible, cinema and news domains.

  14. indic-nlp

    • huggingface.co
    Updated Jun 12, 2025
    Cite
    Ayush Bagarua (2025). indic-nlp [Dataset]. https://huggingface.co/datasets/ayushbagaria17/indic-nlp
    Dataset updated
    Jun 12, 2025
    Authors
    Ayush Bagarua
    Description

    L3Cube-IndicNews

    L3Cube-IndicNews is a multilingual text classification corpus aimed at curating a high-quality dataset for Indian regional languages, with a specific focus on news headlines and articles. We have centered our work on 11 prominent Indic languages, including Hindi, Bengali, Marathi, Telugu, Tamil, Gujarati, Kannada, Odia, Malayalam, Punjabi and English. Each of these news datasets comprises 10 or more classes of news articles. L3Cube-IndicNews offers 3 distinct… See the full description on the dataset page: https://huggingface.co/datasets/ayushbagaria17/indic-nlp.

  15. TamilSentiMix NLP

    • kaggle.com
    zip
    Updated Oct 18, 2023
    Cite
    Joakim Arvidsson (2023). TamilSentiMix NLP [Dataset]. https://www.kaggle.com/datasets/joebeachcapital/tamilsentimix
    Available download formats: zip (1891 bytes)
    Dataset updated
    Oct 18, 2023
    Authors
    Joakim Arvidsson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Understanding the sentiment of a comment from a video or an image is an essential task in many applications. Sentiment analysis of a text can be useful for various decision-making processes. One such application is to analyse the popular sentiments of videos on social media based on viewer comments. However, comments from social media do not follow strict rules of grammar, and they contain mixing of more than one language, often written in non-native scripts. Non-availability of annotated code-mixed data for a low-resourced language like Tamil also adds difficulty to this problem. To overcome this, we created a gold standard Tamil-English code-switched, sentiment-annotated corpus containing 15,744 comment posts from YouTube. In this paper, we describe the process of creating the corpus and assigning polarities. We present inter-annotator agreement and show the results of sentiment analysis trained on this corpus as a benchmark.

    Introductory Paper

    Corpus Creation for Sentiment Analysis in Code-Mixed Tamil-English Text

    By Bharathi Raja Chakravarthi, V. Muralidaran, R. Priyadharshini, John P. McCrae. 2020

    Published in Workshop on Spoken Language Technologies for Under-resourced Languages

  16. Indian Agent to Indian Customer call center Speech Dataset in Tamil

    • data.macgence.com
    mp3
    Updated Mar 29, 2024
    Cite
    Macgence (2024). Indian Agent to Indian Customer call center Speech Dataset in Tamil [Dataset]. https://data.macgence.com/dataset/indian-agent-to-indian-customer-call-center-speech-dataset-in-tamil
    Available download formats: mp3
    Dataset updated
    Mar 29, 2024
    Dataset authored and provided by
    Macgence
    License

    https://data.macgence.com/terms-and-conditions

    Time period covered
    2025
    Area covered
    Worldwide, India
    Variables measured
    Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
    Description

    Explore Macgence's Tamil speech dataset of Indian agent-customer call center conversations—ideal for ASR, NLP, and voice AI training applications.

  17. Indian Agent to Indian Customer call center Speech Dataset in Tamil for...

    • data.macgence.com
    mp3
    Updated Mar 17, 2024
    Cite
    Macgence (2024). Indian Agent to Indian Customer call center Speech Dataset in Tamil for Finance [Dataset]. https://data.macgence.com/dataset/indian-agent-to-indian-customer-call-center-speech-dataset-in-tamil-for-finance
    Available download formats: mp3
    Dataset updated
    Mar 17, 2024
    Dataset authored and provided by
    Macgence
    License

    https://data.macgence.com/terms-and-conditions

    Time period covered
    2025
    Area covered
    Worldwide, India
    Variables measured
    Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
    Description

    High-quality Tamil speech dataset featuring Indian agent-customer finance calls, ideal for ASR, NLP, and voice AI model training.

  18. Tamil - Language Corpus for NLP

    • kaggle.com
    zip
    Updated Apr 8, 2020
    Cite
    Praveen (2020). Tamil - Language Corpus for NLP [Dataset]. https://www.kaggle.com/datasets/praveengovi/tamil-language-corpus-for-nlp/discussion
    Explore at:
    Available download formats: zip (2440281343 bytes)
    Dataset updated
    Apr 8, 2020
    Authors
    Praveen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Context

    Tamil is one of the longest-surviving classical languages in the world. It has been described as "the only language of contemporary India which is recognizably continuous with a classical past". The variety and quality of classical Tamil literature has led to it being described as "one of the great classical traditions and literatures of the world".

    The Tamil language corpus helps researchers, IT professionals, and students create Tamil language models for sentiment classification, topic modeling, text summarization, text generation, named entity recognition, knowledge graphs, and chatbots.

    Content

    The Tamil language corpus consists of articles from Wikipedia and Tamil daily news. The dataset is split into train and test sets for ease of use in building machine learning models.

    Acknowledgements

    Thanks to Vanagamudi and Gaurov for their contributions to Tamil NLP; the datasets used in their NLP work were really helpful in preparing this dataset.

    https://github.com/vanangamudi/tamil-lm2 https://github.com/goru001/nlp-for-tamil

    Inspiration

    Advancing the Tamil language in the world of artificial intelligence and contributing to education and research.

  19. English-Tamil Parallel Corpus for the Legal Domain

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). English-Tamil Parallel Corpus for the Legal Domain [Dataset]. https://www.futurebeeai.com/dataset/parallel-corpora/tamil-english-translated-parallel-corpus-for-legal-domain
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The English-Tamil Legal Parallel Corpus is a high-quality bilingual dataset designed to support the development of multilingual legal language models, machine translation systems, and text-based AI tools. With over 50,000 carefully translated sentence pairs, this dataset serves as a critical resource for anyone working on cross-lingual legal technology or NLP applications in the legal field.

    Dataset Content

    Volume and Translator Diversity

    Sentence Count: Over 50,000 bilingual sentence pairs
    Translator Base: More than 200 native Tamil linguists with domain familiarity contributed to the translation process
    Dataset Origin: Built from scratch with legal use cases in mind, ensuring domain relevance and application readiness

    Sentence Variety

    Length Range: Sentences contain 7 to 25 words
    Grammatical Structures: Includes simple, compound, and complex sentences
    Form Types: Covers questions, commands, affirmations, and negations
    Voice Representation: Balanced use of active and passive sentence constructions
    Cross Translation: Dataset includes both English-to-Tamil and Tamil-to-English segments to ensure bidirectional support
    Linguistic Features:
    Idiomatic expressions and legal jargon
    Sentence connectors and discourse markers to preserve argument structure and legal reasoning

    Legal Domain Specialization

    Legal Terminology Coverage

    This dataset includes terminology across a wide range of legal subdomains such as:

    Contracts, agreements, and commercial law
    Criminal and civil litigation
    Legal procedures, rulings, and statutory interpretation
    Administrative, constitutional, and regulatory terms
    Courtroom dialogue, judgments, and legal advisories
    Contextual Diversity

    Sentence pairs are drawn from realistic legal content types, including:

    Legal briefs, affidavits, and memoranda
    Terms of service and data protection policies
    Research articles and legal scholarship
    Standard forms and templates
    Legislative, policy, and compliance language
    Cross-Domain Elements

    To reflect the multidisciplinary nature of legal texts, the dataset also includes content that touches on:

    Government policy
    Business and finance
    Technology, IP, and cybersecurity law

    Format and Structure

    Available Formats: Delivered in Excel, with optional conversions to TMX, JSON, XML, XLIFF, or other localization formats
    Included Fields:
    Serial Number
    Unique ID
    Source Sentence and Word Count
    Target Sentence and Word Count
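The included fields listed above could be modeled roughly as follows. This is only a sketch: the class name `SentencePair`, the attribute names, and the helper `word_counts_match` are hypothetical and do not come from the vendor, and counting words by splitting on whitespace is an assumption about how the word-count fields were produced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SentencePair:
    """One row of the parallel corpus (hypothetical field names)."""
    serial_number: int
    unique_id: str
    source_sentence: str
    source_word_count: int
    target_sentence: str
    target_word_count: int


def word_counts_match(pair: SentencePair) -> bool:
    """Check the stored counts against a whitespace-based recount.

    Assumption: a "word" is a whitespace-separated token. The real
    dataset may count words differently.
    """
    return (
        len(pair.source_sentence.split()) == pair.source_word_count
        and len(pair.target_sentence.split()) == pair.target_word_count
    )
```

A check like this can flag rows whose text and counts drifted apart during a format conversion, for example after exporting from Excel to JSON.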

    Use Cases and Applications

    Legal Machine Translation: Build accurate translation engines for contracts, laws, and compliance documentation
    Multilingual NLP Tools: Develop legal summarization tools, AI writing assistants, and terminology alignment engines

  20. Indian Agent to Indian Customer call center Speech Dataset in Tamil for...

    • data.macgence.com
    mp3
    Updated Mar 21, 2024
    Cite
    Macgence (2024). Indian Agent to Indian Customer call center Speech Dataset in Tamil for Banking [Dataset]. https://data.macgence.com/dataset/indian-agent-to-indian-customer-call-center-speech-dataset-in-tamil-for-banking
    Explore at:
    Available download formats: mp3
    Dataset updated
    Mar 21, 2024
    Dataset authored and provided by
    Macgence
    License

    https://data.macgence.com/terms-and-conditions

    Time period covered
    2025
    Area covered
    Worldwide, India
    Variables measured
    Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
    Description

    Explore authentic Tamil call center speech data for banking, featuring Indian agents and customers. Curated by Macgence for voice AI and NLP projects.

Cite
Younus_Mohamed (2024). Tamil (Tamizh) Wikipedia Text Dataset for NLP [Dataset]. https://www.kaggle.com/datasets/younusmohamed/tamil-tamizh-wikipedia-articles

Tamil (Tamizh) Wikipedia Text Dataset for NLP

Building a High-Resource Future for Tamil in NLP: Collaborative Efforts for Data

Explore at:
Available download formats: zip (339341289 bytes)
Dataset updated
Nov 12, 2024
Authors
Younus_Mohamed
License

Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically

Description

This dataset is part of a larger mission to transform Tamil into a high-resource language in the field of Natural Language Processing (NLP). As one of the oldest and most culturally rich languages, Tamil has a unique linguistic structure, yet it remains underrepresented in the NLP landscape. This dataset, extracted from Tamil Wikipedia, serves as a foundational resource to support Tamil language processing, text mining, and machine learning applications.

What’s Included

- Text Data: This dataset contains over 569,000 articles in raw text form, extracted from Tamil Wikipedia. The collection is ideal for language model training, word frequency analysis, and text mining.

- Scripts and Processing Tools: Code snippets are provided for processing .bz2 compressed files, generating word counts, and handling data for NLP applications.
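The dataset's own snippets are not reproduced on this page, so the following is only a minimal sketch of how a .bz2 compressed text file might be read and turned into word counts. The function names `strip_punct` and `count_words` are hypothetical, and the sketch assumes UTF-8 text with whitespace-separated words. It splits on whitespace instead of using a `\w+` regular expression because Tamil vowel signs are combining marks, which Python's `\w` does not match.

```python
import bz2
import unicodedata
from collections import Counter


def strip_punct(token: str) -> str:
    """Trim leading and trailing punctuation (P*) and symbol (S*) characters."""
    start, end = 0, len(token)
    while start < end and unicodedata.category(token[start])[0] in "PS":
        start += 1
    while end > start and unicodedata.category(token[end - 1])[0] in "PS":
        end -= 1
    return token[start:end]


def count_words(path: str, encoding: str = "utf-8") -> Counter:
    """Stream a .bz2 text file line by line and count its words."""
    counts: Counter = Counter()
    # Text mode ("rt") decompresses and decodes on the fly, so the whole
    # file never has to fit in memory.
    with bz2.open(path, mode="rt", encoding=encoding) as handle:
        for line in handle:
            for raw in line.split():
                word = strip_punct(raw)
                if word:
                    counts[word] += 1
    return counts
```

Calling `count_words("articles.txt.bz2").most_common(20)` would then list the twenty most frequent words; the file name here is a placeholder, not a file known to be in the archive.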

Why This Dataset?

Although Tamil has a documented lexicon of over 100,000 words, only a fraction of these are actively used in everyday communication. The largest available Tamil treebank currently holds only 10,000 words, limiting the scope for training accurate language models. This dataset aims to bridge that gap by providing a robust, open-source corpus for researchers, developers, and linguists who want to work on Tamil language technologies.

How You Can Use This Dataset

- Language Modeling: Train or fine-tune models like BERT, GPT, or LSTM-based language models for Tamil.

- Linguistic Research: Analyze Tamil morphology, syntax, and vocabulary usage.

- Data Augmentation: Use the raw text to generate augmented data for multilingual NLP applications.

- Word Embeddings and Semantic Analysis: Create embeddings for Tamil words, useful in multilingual setups or standalone applications.

Let’s Collaborate!

I believe that advancing Tamil in NLP cannot be a solo effort. Contributions in the form of additional data, annotations, or even new tools for Tamil language processing are welcome! By working together, we can make Tamil a truly high-resource language in NLP.

License

This dataset is based on content from Tamil Wikipedia and is shared under the Creative Commons Attribution-ShareAlike 3.0 Unported License (CC BY-SA 3.0). Proper attribution to Wikipedia is required when using this data.
