License: Attribution-ShareAlike 3.0 (CC BY-SA 3.0), https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
This dataset is part of a larger mission to transform Tamil into a high-resource language in the field of Natural Language Processing (NLP). As one of the oldest and most culturally rich languages, Tamil has a unique linguistic structure, yet it remains underrepresented in the NLP landscape. This dataset, extracted from Tamil Wikipedia, serves as a foundational resource to support Tamil language processing, text mining, and machine learning applications.
- Text Data: This dataset contains over 569,000 articles in raw text form, extracted from Tamil Wikipedia. The collection is ideal for language model training, word frequency analysis, and text mining.
- Scripts and Processing Tools: Code snippets are provided for processing .bz2 compressed files, generating word counts, and handling data for NLP applications.
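As a minimal sketch of the kind of .bz2 processing such snippets cover (the file name and whitespace tokenization here are illustrative, not the dataset's actual scripts):

```python
import bz2
from collections import Counter

def word_counts(path):
    """Stream a .bz2-compressed text dump and count word frequencies."""
    counts = Counter()
    with bz2.open(path, mode="rt", encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())
    return counts

# Demo with a tiny archive written on the fly; a real Wikipedia dump
# would be a much larger .bz2 file with the same interface.
with bz2.open("sample.txt.bz2", "wt", encoding="utf-8") as f:
    f.write("தமிழ் மொழி\nதமிழ் விக்கிப்பீடியா\n")

counts = word_counts("sample.txt.bz2")
print(counts.most_common(1))  # [('தமிழ்', 2)]
```

Streaming line by line keeps memory flat even for multi-gigabyte dumps, since only the counter grows.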
Despite having a documented lexicon of over 100,000 words, only a fraction of these are actively used in everyday communication. The largest available Tamil treebank currently holds only 10,000 words, limiting the scope for training accurate language models. This dataset aims to bridge that gap by providing a robust, open-source corpus for researchers, developers, and linguists who want to work on Tamil language technologies.
- Language Modeling: Train or fine-tune models like BERT, GPT, or LSTM-based language models for Tamil. - Linguistic Research: Analyze Tamil morphology, syntax, and vocabulary usage. - Data Augmentation: Use the raw text to generate augmented data for multilingual NLP applications. - Word Embeddings and Semantic Analysis: Create embeddings for Tamil words, useful in multilingual setups or standalone applications.
I believe that advancing Tamil in NLP cannot be a solo effort. Contributions in the form of additional data, annotations, or even new tools for Tamil language processing are welcome! By working together, we can make Tamil a truly high-resource language in NLP.
This dataset is based on content from Tamil Wikipedia and is shared under the Creative Commons Attribution-ShareAlike 3.0 Unported License (CC BY-SA 3.0). Proper attribution to Wikipedia is required when using this data.
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
Indic NLP - Natural Language Processing for Indian Languages.
This dataset is a step in that direction for the Tamil language. Thanks to Malaikannan for the initiative and to Selva for collecting the data from websites. The idea is to gather datasets related to Tamil NLP in one place.
The dataset has the following files.
Tamil News Classification
This dataset has 14,521 rows for training and 3,631 rows for testing. It has six news categories: "tamilnadu", "india", "cinema", "sports", "politics", and "world". The data is obtained from this link
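For illustration, a sketch of loading such a file and tallying the class distribution; the CSV layout and sample rows are assumptions, not the dataset's exact format:

```python
import csv
import io
from collections import Counter

CATEGORIES = {"tamilnadu", "india", "cinema", "sports", "politics", "world"}

# Hypothetical slice of the training CSV; the real file has 14,521 rows.
sample_csv = """text,category
அரசு அறிவிப்பு,tamilnadu
தேர்தல் செய்தி,politics
புதிய படம்,cinema
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
labels = Counter(r["category"] for r in rows)

# Sanity check: every label belongs to one of the six classes.
assert set(labels) <= CATEGORIES
print(labels)
```

Checking the label distribution first is a cheap way to catch class imbalance before training a classifier.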
Tamil Movie Review Dataset
This dataset has 480 training samples and 121 testing samples. It has the review text in Tamil and ratings between 1 and 5. The data is obtained from this link
Thirukkural Dataset
From Wikipedia: the Tirukkural, or the Kural for short, is a classic Tamil text consisting of 1,330 couplets, or kurals, dealing with the everyday virtues of an individual. It is one of the two oldest works now extant in Tamil literature.
I have split the data into train and test sets, and the kural and/or its explanations can be used to predict the three parts: aram (virtue), porul (polity), and inbam (love). The dataset is obtained from this link.
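As a sketch of such a split (the section boundaries below follow the Kural's standard structure of 380, 700, and 250 couplets; the 90/10 ratio and seed are arbitrary illustrative choices, not the dataset's actual split):

```python
import random

# Couplet IDs with their section labels: aram is kurals 1-380,
# porul 381-1080, and inbam 1081-1330.
kurals = [(i, "aram" if i <= 380 else "porul" if i <= 1080 else "inbam")
          for i in range(1, 1331)]

random.seed(42)              # fixed seed for a reproducible split
random.shuffle(kurals)
split = int(0.9 * len(kurals))
train, test = kurals[:split], kurals[split:]

print(len(train), len(test))  # 1197 133
```

With only 1,330 examples, a fixed-seed split matters: it keeps the benchmark comparable across experiments.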
More datasets will be added in future versions.
My sincere thanks to:
Some questions which can be answered include:
...and many more interesting questions remain to be explored.
Check out this link to find similar and dissimilar words in Tamil.
License: CC0 1.0 Universal (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
Tamil and Tanglish YT Video Transcripts for NLP
This dataset provides a comprehensive collection of processed and combined YouTube video transcripts featuring Tamil and Tanglish (Tamil-English mixed) text. It is designed to support various Natural Language Processing (NLP) tasks, including sentiment analysis, language modeling, transliteration, and code-switching studies.
This dataset was carefully curated and cleaned to empower researchers, developers, and linguists to explore and advance NLP for Tamil and Tanglish text.
The dataset is released under the Public Domain Dedication (CC0) license, allowing free use, modification, and sharing.
License: FutureBeeAI AI Data License Agreement, https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the English-Tamil Bilingual Parallel Corpora dataset for the Banking, Financial Services, and Insurance (BFSI) domain. This meticulously curated dataset offers a rich collection of bilingual sentence pairs translated between English and Tamil. It serves as a valuable resource for developing domain-specific machine translation systems, language models, and NLP applications within the BFSI sector.
Dataset Card for Dataset Name
This dataset is designed for fine-tuning Large Language Models (LLMs) in Tamil, enabling them to understand and generate high-quality Tamil text across multiple domains. It contains 72,000 curated and generated samples, ensuring a rich linguistic diversity that improves model generalization. 🔹 Sources: Kaggle Tamil NLP, Sentiment Analysis datasets, and synthetic data. 🔹 Languages: Tamil, Tanglish (Tamil-English mix), and regional Tamil dialects. 🔹… See the full description on the dataset page: https://huggingface.co/datasets/ThrishaSivasakthi/Tamil-Finetuning-data.
203 hours of real-world Tamil speech data featuring both casual conversations and scripted monologues. All audio was recorded from native Tamil speakers across various regions, reflecting real-world linguistic and acoustic diversity. Each sample is manually transcribed and annotated with speaker ID, gender, and other metadata, making it highly suitable for automatic speech recognition (ASR), speech synthesis (TTS), speaker identification, and natural language processing (NLP) applications. The dataset has been validated by leading AI companies and is particularly valuable for training robust AI models for underrepresented languages. All data collection, processing, and usage comply strictly with global data privacy laws, including GDPR, CCPA, and PIPL, ensuring legal and ethical use.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
The dataset is a carefully selected set of Tamil film reviews intended to advance NLP research in text classification, sentiment analysis, and aspect-based sentiment analysis. Users were invited to review twenty-five films via a Google form, and additional reviews were taken from websites such as IMDb and YouTube. Reviews were collected only if they mentioned at least one target aspect from the selected list: cinematography, acting, screenplay, story, director, songs, background music, and editing. In total, about 1,390 reviews make up the dataset, tagged for positive and negative views across the eight aspect categories.
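A toy sketch of keyword-based aspect spotting over those eight aspects; the keyword lists are invented for illustration and are not the annotation scheme actually used for the dataset:

```python
# Illustrative keyword lists for the eight target aspects.
ASPECT_KEYWORDS = {
    "cinematography": ["cinematography", "camera", "visuals"],
    "acting": ["acting", "performance"],
    "screenplay": ["screenplay"],
    "story": ["story", "plot"],
    "director": ["director", "direction"],
    "songs": ["songs"],
    "background music": ["background music", "bgm"],
    "editing": ["editing"],
}

def detect_aspects(review: str):
    """Return the aspects whose keywords appear in a lowercased review."""
    text = review.lower()
    return sorted(aspect for aspect, kws in ASPECT_KEYWORDS.items()
                  if any(kw in text for kw in kws))

print(detect_aspects("Great acting and a gripping story, but weak BGM."))
# ['acting', 'background music', 'story']
```

Real aspect-based sentiment systems learn such associations from the annotations rather than relying on fixed keyword lists, but the sketch shows the target structure of the task.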
License: CC0 1.0 Universal (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Dinesh Kumar Sarangapani
Released under CC0: Public Domain
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Two datasets are included in this repository: claim matching and claim detection datasets. The collections contain data in 5 languages: Bengali, English, Hindi, Malayalam and Tamil.
The "claim detection" dataset contains textual claims from social media and fact-checking websites annotated for the "fact-check worthiness" of the claims in each message. Data points have one of three labels: "Yes" (the text contains one or more check-worthy claims), "No", and "Probably".
The "claim matching" dataset is a curated collection of pairs of textual claims from social media and fact-checking websites for the purpose of automatic and multilingual claim matching. Pairs have one of four labels: "Very Similar", "Somewhat Similar", "Somewhat Dissimilar", and "Very Dissimilar".
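For downstream use, the four-way similarity labels are often collapsed to a binary matching decision; the cutoff below is an assumption for illustration, not something the dataset prescribes:

```python
# The four similarity labels from the claim-matching data, in order.
LABELS = ["Very Similar", "Somewhat Similar",
          "Somewhat Dissimilar", "Very Dissimilar"]

def is_match(label: str) -> bool:
    """Collapse the 4-way label to a binary match/no-match decision."""
    if label not in LABELS:
        raise ValueError(f"unknown label: {label}")
    return label in ("Very Similar", "Somewhat Similar")

print([is_match(l) for l in LABELS])  # [True, True, False, False]
```

Where exactly to draw the line (e.g. whether "Somewhat Similar" counts as a match) depends on the application's tolerance for false positives.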
All personally identifiable information (PII), including phone numbers, email addresses, license plate numbers, and addresses, has been replaced with general tags to protect user anonymity. A detailed explanation of the curation and annotation process is provided in our ACL 2021 paper:
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
Github Repo: https://github.com/aviiciii/tamil-word-frequency
This is the output data of the above project, which analyses the frequency of words in the Tamil language across various data sources.
The Kaggle dataset used in this repository provides a comprehensive collection of Tamil language text data for Natural Language Processing (NLP) tasks. The dataset has been curated and compiled from various sources, and it serves as a valuable resource for linguistic research, NLP model training, and data analysis in the Tamil language domain.
The dataset contains a significant amount of Tamil language text, offering a diverse range of topics and genres. It includes a variety of text sources, such as news articles, books, blogs, social media content, and more. The texts cover a wide spectrum of vocabulary, enabling researchers and practitioners to explore different aspects of the Tamil language.
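A small sketch of the kind of frequency analysis the project performs, restricted to Tamil-script tokens via the Unicode Tamil block (U+0B80 to U+0BFF); the sample text is illustrative:

```python
import re
from collections import Counter

# Match runs of characters from the Unicode Tamil block only,
# so English words and punctuation are skipped.
TAMIL_WORD = re.compile(r"[\u0B80-\u0BFF]+")

def top_words(text: str, n: int = 3):
    """Return the n most frequent Tamil-script tokens in the text."""
    return Counter(TAMIL_WORD.findall(text)).most_common(n)

sample = "தமிழ் ஒரு செம்மொழி. தமிழ் இலக்கியம் பழமையானது. (English words are skipped.)"
print(top_words(sample))
```

Filtering by script range is a pragmatic choice for mixed-language web text, though it also drops Tamil words written in Latin transliteration.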
License: FutureBeeAI AI Data License Agreement, https://www.futurebeeai.com/policies/ai-data-license-agreement
This Tamil Call Center Speech Dataset for the Real Estate industry is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for Tamil-speaking Real Estate customers. With over 30 hours of unscripted, real-world audio, this dataset captures authentic conversations between customers and real estate agents, making it ideal for building robust ASR models.
Curated by FutureBeeAI, this dataset equips voice AI developers, real estate tech platforms, and NLP researchers with the data needed to create high-accuracy, production-ready models for property-focused use cases.
The dataset features 30 hours of dual-channel call center recordings between native Tamil speakers. Captured in realistic real estate consultation and support contexts, these conversations span a wide array of property-related topics, from inquiries to investment advice, offering deep domain coverage for AI model development.
This speech corpus includes both inbound and outbound calls, featuring positive, neutral, and negative outcomes across a wide range of real estate scenarios.
Such domain-rich variety ensures model generalization across common real estate support conversations.
All recordings are accompanied by precise, manually verified transcriptions in JSON format.
These transcriptions streamline ASR and NLP development for Tamil real estate voice applications.
Detailed metadata accompanies each participant and conversation:
This enables smart filtering, dialect-focused model training, and structured dataset exploration.
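A toy example of filtering such transcripts by speaker metadata; the JSON field names (speaker_id, gender, text) are assumed for illustration and may differ from the dataset's actual schema:

```python
import json

# Hypothetical transcript records in the JSON layout described above.
records_json = """[
  {"speaker_id": "TA_001", "gender": "female", "text": "வணக்கம், எப்படி உதவலாம்?"},
  {"speaker_id": "TA_002", "gender": "male",   "text": "வீடு வாங்க விரும்புகிறேன்."}
]"""

records = json.loads(records_json)

# Metadata filtering: keep only utterances from female speakers,
# e.g. to build a gender-balanced training subset.
female = [r for r in records if r["gender"] == "female"]
print(len(female), female[0]["speaker_id"])  # 1 TA_001
```

The same pattern extends to any metadata field, such as selecting speakers from a particular region for dialect-focused training.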
This dataset is ideal for voice AI and NLP systems built for the real estate sector:
Dataset: English-Tamil (en-ta) Parallel Corpus
This dataset contains parallel sentences in English and Tamil (en-ta) that have been curated from multiple sources. It is designed for tasks such as machine translation, language modeling, and other natural language processing (NLP) applications involving English and Tamil.
Dataset Composition
The dataset is composed of three main parts, which have been concatenated into a single file with two columns: ta (Tamil) and… See the full description on the dataset page: https://huggingface.co/datasets/nandhinivaradharajan14/tamil-english-colloquial-translations.
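A minimal sketch of reading the two-column ta/en layout described above; the sample rows are illustrative, not taken from the corpus:

```python
import csv
import io

# Illustrative two-column slice in the ta/en layout.
pairs_csv = """ta,en
வணக்கம்,Hello
நன்றி,Thank you
"""

# Parse into (Tamil, English) sentence pairs ready for MT training.
pairs = [(row["ta"], row["en"])
         for row in csv.DictReader(io.StringIO(pairs_csv))]
print(pairs[0])  # ('வணக்கம்', 'Hello')
```

Keeping pairs as aligned tuples makes it straightforward to feed them to any sequence-to-sequence training pipeline.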
License: Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0), https://creativecommons.org/licenses/by-nc-sa/3.0/
EnTam is a sentence-aligned English-Tamil bilingual corpus collected from publicly available websites for NLP research involving Tamil. A standard set of processing steps was applied to the raw web data to produce a sentence-aligned English-Tamil parallel corpus suitable for various NLP tasks. The parallel corpus includes texts from the Bible, cinema, and news domains.
L3Cube-IndicNews
L3Cube-IndicNews is a multilingual text classification corpus aimed at curating a high-quality dataset for Indian regional languages, with a specific focus on news headlines and articles. We have centered our work on 11 prominent Indic languages, including Hindi, Bengali, Marathi, Telugu, Tamil, Gujarati, Kannada, Odia, Malayalam, Punjabi, and English. Each of these news datasets comprises 10 or more classes of news articles. L3Cube-IndicNews offers 3 distinct… See the full description on the dataset page: https://huggingface.co/datasets/ayushbagaria17/indic-nlp.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Understanding the sentiment of a comment from a video or an image is an essential task in many applications. Sentiment analysis of a text can be useful for various decision-making processes. One such application is to analyse the popular sentiments of videos on social media based on viewer comments. However, comments from social media do not follow strict rules of grammar, and they contain mixing of more than one language, often written in non-native scripts. Non-availability of annotated code-mixed data for a low-resourced language like Tamil also adds difficulty to this problem. To overcome this, we created a gold standard Tamil-English code-switched, sentiment-annotated corpus containing 15,744 comment posts from YouTube. In this paper, we describe the process of creating the corpus and assigning polarities. We present inter-annotator agreement and show the results of sentiment analysis trained on this corpus as a benchmark.
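Code-switching of this kind can be detected crudely at the token level by Unicode script ranges; this is a toy heuristic for illustration, not the method used in the paper:

```python
def scripts_used(text: str):
    """Label each whitespace token as tamil, latin, mixed, or other
    based on the Unicode code points it contains."""
    def script(token: str) -> str:
        tamil = any("\u0B80" <= ch <= "\u0BFF" for ch in token)
        latin = any(("a" <= ch <= "z") or ("A" <= ch <= "Z") for ch in token)
        if tamil and latin:
            return "mixed"
        return "tamil" if tamil else "latin" if latin else "other"
    return [script(t) for t in text.split()]

# Tamil word followed by romanized (Tanglish) tokens.
print(scripts_used("படம் semma nalla irunthuchu"))
# ['tamil', 'latin', 'latin', 'latin']
```

Note that Latin-script tokens can be English words or romanized Tamil, which is exactly the ambiguity that makes annotated corpora like this one necessary.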
Introductory Paper
Corpus Creation for Sentiment Analysis in Code-Mixed Tamil-English Text
By Bharathi Raja Chakravarthi, V. Muralidaran, R. Priyadharshini, John P. McCrae. 2020
Published in Workshop on Spoken Language Technologies for Under-resourced Languages
License: Macgence Terms and Conditions, https://data.macgence.com/terms-and-conditions
Explore Macgence's Tamil speech dataset of Indian agent-customer call center conversations—ideal for ASR, NLP, and voice AI training applications.
License: Macgence Terms and Conditions, https://data.macgence.com/terms-and-conditions
High-quality Tamil speech dataset featuring Indian agent-customer finance calls, ideal for ASR, NLP, and voice AI model training.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Tamil is one of the longest-surviving classical languages in the world. It has been described as "the only language of contemporary India which is recognizably continuous with a classical past". The variety and quality of classical Tamil literature have led to it being described as "one of the great classical traditions and literatures of the world".
The Tamil language corpus helps researchers, IT professionals, and students create Tamil language models for sentiment classification, topic modeling, text summarisation, text generation, named entity recognition, knowledge graphs, and chatbots.
The corpus consists of articles from Wikipedia and Tamil daily news. The dataset is split into train and test sets for ease of use in building machine learning models.
Thanks to Vanangamudi and Gaurov for their contributions to Tamil NLP; the datasets from their work were very helpful in preparing this one.
https://github.com/vanangamudi/tamil-lm2 https://github.com/goru001/nlp-for-tamil
Evolving the Tamil language in the artificial intelligence world and contributing to education and research.
License: FutureBeeAI AI Data License Agreement, https://www.futurebeeai.com/policies/ai-data-license-agreement
The English-Tamil Legal Parallel Corpus is a high-quality bilingual dataset designed to support the development of multilingual legal language models, machine translation systems, and text-based AI tools. With over 50,000 carefully translated sentence pairs, this dataset serves as a critical resource for anyone working on cross-lingual legal technology or NLP applications in the legal field.
This dataset includes terminology across a wide range of legal subdomains such as:
Sentence pairs are drawn from realistic legal content types, including:
To reflect the multidisciplinary nature of legal texts, the dataset also includes content that touches on:
License: Macgence Terms and Conditions, https://data.macgence.com/terms-and-conditions
Explore authentic Tamil call center speech data for banking, featuring Indian agents and customers. Curated by Macgence for voice AI and NLP projects.