68 datasets found

Tamil (Tamizh) Wikipedia Text Dataset for NLP
kaggle.com
zip
Updated Nov 12, 2024
Cite
Younus_Mohamed (2024). Tamil (Tamizh) Wikipedia Text Dataset for NLP [Dataset]. https://www.kaggle.com/datasets/younusmohamed/tamil-tamizh-wikipedia-articles
Explore at:
Available download formats: zip (339341289 bytes)
Dataset updated
Nov 12, 2024
Authors
Younus_Mohamed
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Description
This dataset is part of a larger mission to transform Tamil into a high-resource language in the field of Natural Language Processing (NLP). As one of the oldest and most culturally rich languages, Tamil has a unique linguistic structure, yet it remains underrepresented in the NLP landscape. This dataset, extracted from Tamil Wikipedia, serves as a foundational resource to support Tamil language processing, text mining, and machine learning applications.

What’s Included

- Text Data: This dataset contains over 569,000 articles in raw text form, extracted from Tamil Wikipedia. The collection is ideal for language model training, word frequency analysis, and text mining.

- Scripts and Processing Tools: Code snippets are provided for processing .bz2 compressed files, generating word counts, and handling data for NLP applications.
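
The listing mentions code for processing .bz2 files and generating word counts but does not reproduce it. The following is a minimal sketch of that idea, assuming the articles are UTF-8 plain text inside a .bz2 file. The function name `count_tamil_words` and the tokenisation rule (a token is a run of characters from the Tamil Unicode block) are illustrative assumptions, not the dataset author's code.

```python
import bz2
import re
from collections import Counter

# Assumption: a "word" is a run of characters in the Tamil Unicode block
# (U+0B80 to U+0BFF). Latin text, digits and punctuation are ignored.
TAMIL_TOKEN = re.compile(r"[\u0B80-\u0BFF]+")


def count_tamil_words(path, encoding="utf-8"):
    """Stream a .bz2-compressed text file and count Tamil-script tokens."""
    counts = Counter()
    # bz2.open in text mode decompresses lazily, so the whole archive
    # never has to fit in memory at once.
    with bz2.open(path, mode="rt", encoding=encoding) as handle:
        for line in handle:
            counts.update(TAMIL_TOKEN.findall(line))
    return counts
```

Calling `count_tamil_words(path).most_common(20)` would then list the twenty most frequent tokens. A real pipeline would probably also need Unicode normalisation and handling of wiki markup, which this sketch leaves out.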

Why This Dataset?

Although Tamil has a documented lexicon of over 100,000 words, only a fraction of these are actively used in everyday communication. The largest available Tamil treebank currently holds only 10,000 words, limiting the scope for training accurate language models. This dataset aims to bridge that gap by providing a robust, open-source corpus for researchers, developers, and linguists who want to work on Tamil language technologies.

How You Can Use This Dataset

- Language Modeling: Train or fine-tune models like BERT, GPT, or LSTM-based language models for Tamil.

- Linguistic Research: Analyze Tamil morphology, syntax, and vocabulary usage.

- Data Augmentation: Use the raw text to generate augmented data for multilingual NLP applications.

- Word Embeddings and Semantic Analysis: Create embeddings for Tamil words, useful in multilingual setups or standalone applications.

Let’s Collaborate!

I believe that advancing Tamil in NLP cannot be a solo effort. Contributions in the form of additional data, annotations, or even new tools for Tamil language processing are welcome! By working together, we can make Tamil a truly high-resource language in NLP.

License

This dataset is based on content from Tamil Wikipedia and is shared under the Creative Commons Attribution-ShareAlike 3.0 Unported License (CC BY-SA 3.0). Proper attribution to Wikipedia is required when using this data.
Tamil NLP
kaggle.com
zip
Updated Mar 11, 2019
Cite
SRK (2019). Tamil NLP [Dataset]. https://www.kaggle.com/datasets/sudalairajkumar/tamil-nlp/code
Explore at:
Available download formats: zip (3067205 bytes)
Dataset updated
Mar 11, 2019
Authors
SRK
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Context

Indic NLP - Natural Language Processing for Indian Languages.

This dataset is a step in that direction for the Tamil language. Thanks to Malaikannan for the initiative and to Selva for collecting the data from websites. The idea is to gather more datasets related to Tamil NLP in a single place.

Content

The dataset has the following files.

Tamil News Classification

This dataset has 14521 rows for training and 3631 rows for testing. It has 6 news categories - "tamilnadu", "india", "cinema", "sports", "politics", "world". The data is obtained from this link

tamil_news_train.csv - Train dataset for tamil news classification.

tamil_news_test.csv - Test dataset for tamil news classification

Tamil Movie Review Dataset

This dataset has 480 training samples and 121 testing samples. It contains the review text in Tamil and ratings from 1 to 5. The data is obtained from this link

tamil_movie_reviews_train.csv - Train dataset for tamil movie reviews

tamil_movie_reviews_test.csv - Test dataset for tamil movie reviews

Thirukkural Dataset

From Wikipedia, The Tirukkural, or shortly the Kural, is a classic Tamil text consisting of 1,330 couplets or Kurals, dealing with the everyday virtues of an individual. It is one of the two oldest works now extant in Tamil literature.

I have split the data into train and test and we can use the kural and / or the explanations to predict the three parts - aram (virtue), porul (polity) and inbam (love). The dataset is obtained from this link.

tamil_thirukkural_train - train dataset having 1064 rows

tamil_thirukkural_test - test dataset having 266 rows

Will add more datasets in the following versions.

Acknowledgements

My sincere thanks to :

Malaikannan for starting this initiative

Selvakumar for getting the data

Vijay Anand for the Thirukkural data

Inspiration

Some questions which can be answered are

Can we do text classification for Tamil and get accuracies similar to those for other languages?

How do language models perform for Tamil?

And many more interesting questions remain to be answered.

Check out this link to find similar and dissimilar words for Tamil.
Tamil and Tanglish YT Video Transcripts for NLP
kaggle.com
zip
Updated Jan 14, 2025
Cite
Younus_Mohamed (2025). Tamil and Tanglish YT Video Transcripts for NLP [Dataset]. https://www.kaggle.com/datasets/younusmohamed/tamil-and-tanglish-yt-video-transcripts-for-nlp
Explore at:
Available download formats: zip (158098806 bytes)
Dataset updated
Jan 14, 2025
Authors
Younus_Mohamed
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Tamil and Tanglish YT Video Transcripts for NLP
This dataset provides a comprehensive collection of processed and combined YouTube video transcripts featuring Tamil and Tanglish (Tamil-English mixed) text. It is designed to support various Natural Language Processing (NLP) tasks, including sentiment analysis, language modeling, transliteration, and code-switching studies.

Key Features

Processed Text: Contains cleaned transcripts for noise-free analysis.

Tanglish Support: Focuses on Tamil-English code-switched text, commonly found in digital platforms.

Diverse Topics: Includes transcripts from multiple topics, such as news, entertainment, and education, offering a rich context for NLP applications.

Wide Applicability: Suitable for text classification, translation, and low-resource language modeling tasks.

Columns

topic: The topic category of the transcript (e.g., Tamil News, Tamil Weather Updates).

video_id: Unique identifier for each video.

start: Timestamp for when the text appears in the video.

text: Original transcript text from the video.

cleaned_text: Processed transcript text with unnecessary characters and noise removed.
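
Because each row holds one timestamped segment, a common first step is to stitch the segments of each video back into a single transcript. The sketch below uses the column names listed above. It assumes the data ships as a CSV file and that `start` parses as a number (for example, seconds); the function name `transcripts_by_video` is hypothetical.

```python
import csv
from collections import defaultdict


def transcripts_by_video(csv_path):
    """Join the cleaned_text segments of each video in timestamp order."""
    segments = defaultdict(list)
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # Assumption: `start` is numeric, so it can be used as a sort key.
            segments[row["video_id"]].append(
                (float(row["start"]), row["cleaned_text"])
            )
    # Sort each video's segments by start time, then join the text.
    return {
        video_id: " ".join(text for _, text in sorted(parts))
        for video_id, parts in segments.items()
    }
```

The result maps each `video_id` to one string, which is usually a more convenient unit than individual caption segments for language modeling or code-switching analysis.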

Use Cases

Develop NLP models for Tamil-English code-switched data.

Analyze sentiment or trends in Tamil and Tanglish content.

Train language models for low-resource and multilingual applications.

Acknowledgment

This dataset was carefully curated and cleaned to empower researchers, developers, and linguists to explore and advance NLP for Tamil and Tanglish text.

Tags

Natural Language Processing (NLP)

Tamil NLP

Tanglish

Code-Switching

Low-Resource Languages

Multilingual NLP

Sentiment Analysis

Text Mining

License

The dataset is released under the Public Domain Dedication (CC0) license, allowing free use, modification, and sharing.
English-Tamil Parallel Corpus for the BFSI Domain
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). English-Tamil Parallel Corpus for the BFSI Domain [Dataset]. https://www.futurebeeai.com/dataset/parallel-corpora/tamil-english-translated-parallel-corpus-for-bfsi-domain
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
Welcome to the English-Tamil Bilingual Parallel Corpora dataset for the Banking, Financial Services, and Insurance (BFSI) domain. This meticulously curated dataset offers a rich collection of bilingual sentence pairs translated between English and Tamil. It serves as a valuable resource for developing domain-specific machine translation systems, language models, and NLP applications within the BFSI sector.
Dataset Content
•Volume and Diversity
•Extensive Coverage: Contains over 50,000 bilingual sentence pairs, ideal for a wide range of language processing tasks.
•Translator Diversity: Created with the help of 200+ native Tamil translators, ensuring varied linguistic styles, tone, and regional expressions.
•Sentence Diversity
•Word Count: Sentences range from 7 to 25 words, suitable for model training and evaluation.
•Syntactic Variety: Includes simple, compound, and complex sentence structures.
•Grammatical Forms: Interrogative (questions) and imperative (commands), affirmative and negative statements, active and passive voice constructions.
•Figurative Language: Incorporates idioms, metaphors, and colloquial expressions relevant to real-world BFSI communications.
•Discourse Features: Includes logical connectors and transitional phrases for coherent, natural language flow.
•Cross Translation: Supports bi-directional translation with content translated both from English to Tamil and Tamil to English.

Domain-Specific Content
•Specialized Terminology: Covers technical vocabulary from banking, insurance, financial services, compliance, investment, and fintech.
•Authentic Industry Language: Captures real-world usage, including expressions from customer service conversations, financial reporting, and policy documentation.
•Contextual Coverage: Draws content from scenarios such as:
•Banking transactions and statements
•Risk management reports
•Compliance policies
•Claims processing and customer support dialogs
•Cross-Domain Elements: Includes supporting vocabulary from general business, legal, and technology domains, relevant to modern BFSI operations.

Format and Structure
•File Formats: Delivered in Excel format by default, with easy conversion to JSON, TMX, XML, XLIFF, XLS, and other widely supported industry formats.
•Dataset Structure: Serial Number, Unique ID, Source Sentence and Source Word Count, Target Sentence and Target Word Count

Usage and Applications
•Machine Translation and Localization: Supports training of accurate translation models and localization systems specific to the BFSI sector.
•NLP Systems: Useful for enhancing tools such as grammar checkers, spell checkers, predictive text, and speech/text understanding engines.
•Large Language Models (LLMs): Enables fine-tuning and bilingual enhancement of LLMs for:
•Financial content generation
•Summarization of market reports
•Automated responses to customer service and
Tamil-Finetuning-data
huggingface.co
Updated Feb 20, 2025
Cite
Thrisha Sivasakthi (2025). Tamil-Finetuning-data [Dataset]. https://huggingface.co/datasets/ThrishaSivasakthi/Tamil-Finetuning-data
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Feb 20, 2025
Authors
Thrisha Sivasakthi
Description
Dataset Card for Dataset Name

This dataset is designed for fine-tuning Large Language Models (LLMs) in Tamil, enabling them to understand and generate high-quality Tamil text across multiple domains. It contains 72,000 curated and generated samples, ensuring a rich linguistic diversity that improves model generalization. 🔹 Sources: Kaggle Tamil NLP, Sentiment Analysis datasets, and synthetic data. 🔹 Languages: Tamil, Tanglish (Tamil-English mix), and regional Tamil dialects. 🔹… See the full description on the dataset page: https://huggingface.co/datasets/ThrishaSivasakthi/Tamil-Finetuning-data.
203 Hours Tamil Speech Dataset – Conversation & Monologue Audio
m.nexdata.ai
nexdata.ai
Updated Jul 16, 2025
Cite
Nexdata (2025). 203 Hours Tamil Speech Dataset – Conversation & Monologue Audio [Dataset]. https://m.nexdata.ai/datasets/speechrecog/1390?source=huggingface
Dataset updated
Jul 16, 2025
Dataset authored and provided by
Nexdata
Variables measured
Format, Country, Language, Accuracy Rate, Language(Region) Code, Recording environment, Features of annotation
Description
203 hours of real-world Tamil speech data featuring both casual conversations and scripted monologues. All audio was recorded from native Tamil speakers across various regions, reflecting real-world linguistic and acoustic diversity. Each sample is manually transcribed and annotated with speaker ID, gender, and other metadata, making it highly suitable for automatic speech recognition (ASR), speech synthesis (TTS), speaker identification, and natural language processing (NLP) applications. The dataset has been validated by leading AI companies and is particularly valuable for training robust AI models for underrepresented languages. All data collection, processing, and usage comply strictly with global data privacy laws including GDPR, CCPA, and PIPL, ensuring legal and ethical use.
Data from: MADTRAS (Dataset for Aspect-based Sentiment Analysis of Movie...
data.mendeley.com
Updated Apr 14, 2025
Cite
Arunmozhi Mourougappane (2025). MADTRAS (Dataset for Aspect-based Sentiment Analysis of Movie Reviews in Tamil) [Dataset]. http://doi.org/10.17632/p59cfx4vx6.2
Explore at:
Unique identifier
https://doi.org/10.17632/p59cfx4vx6.2
Dataset updated
Apr 14, 2025
Authors
Arunmozhi Mourougappane
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset is a carefully selected set of Tamil film reviews with the goal of advancing NLP research in the areas of text classification, sentiment analysis, and aspect-based sentiment analysis. We invited users to review twenty-five films using a Google form. Additional reviews were taken from websites such as IMDb and YouTube. We also made sure that each collected review mentions at least one target aspect from the selected list: cinematography, acting, screenplay, story, director, songs, background music, and editing. The dataset comprises about 1,390 reviews, tagged for positive as well as negative views across eight different categories.
Ponniyan selvan Tamil Book for NLP
kaggle.com
zip
Updated Sep 9, 2020
Cite
Dinesh Kumar Sarangapani (2020). Ponniyan selvan Tamil Book for NLP [Dataset]. https://www.kaggle.com/datasets/dineshkumarsarang/ponniyan-selvan-tamil-book-for-nlp/discussion
Explore at:
Available download formats: zip (1985053 bytes)
Dataset updated
Sep 9, 2020
Authors
Dinesh Kumar Sarangapani
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Dataset

This dataset was created by Dinesh Kumar Sarangapani

Released under CC0: Public Domain

Contents
Claim Detection and Matching for Indian Languages
zenodo.org
data.niaid.nih.gov
csv
Updated Jun 6, 2021
Cite
Ashkan Kazemi; Kiran Garimella; Devin Gaffney; Scott A. Hale (2021). Claim Detection and Matching for Indian Languages [Dataset]. http://doi.org/10.5281/zenodo.4890950
Explore at:
Available download formats: csv
Unique identifier
https://doi.org/10.5281/zenodo.4890950
Dataset updated
Jun 6, 2021
Dataset provided by
Zenodo http://zenodo.org/
Authors
Ashkan Kazemi; Kiran Garimella; Devin Gaffney; Scott A. Hale
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
India
Description
Two datasets are included in this repository: claim matching and claim detection datasets. The collections contain data in 5 languages: Bengali, English, Hindi, Malayalam and Tamil.

The "claim detection" dataset contains textual claims from social media and fact-checking websites annotated for the "fact-check worthiness" of the claims in each message. Data points have one of the three labels of "Yes" (text contains one or more check-worthy claims), "No" and "Probably".

The "claim matching" dataset is a curated collection of pairs of textual claims from social media and fact-checking websites for the purpose of automatic and multilingual claim matching. Pairs of data have one of the four labels of "Very Similar", "Somewhat Similar", "Somewhat Dissimilar" and "Very Dissimilar".
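
For training a binary matcher, the four claim-matching labels are often collapsed into match / no-match. The sketch below shows one way to do that. The label strings are quoted from the description, while the numeric scale, the default threshold, and the names `SIMILARITY_SCALE` and `is_match` are illustrative assumptions rather than part of the dataset.

```python
# Label strings come from the dataset description; the 0-3 scale is an
# assumption made here for illustration.
SIMILARITY_SCALE = {
    "Very Dissimilar": 0,
    "Somewhat Dissimilar": 1,
    "Somewhat Similar": 2,
    "Very Similar": 3,
}


def is_match(label, threshold=2):
    """Collapse a four-way similarity label into a binary match decision."""
    # Raises KeyError on an unknown label, so bad rows are not silently
    # treated as non-matches.
    return SIMILARITY_SCALE[label] >= threshold
```

Raising the threshold to 3 would count only "Very Similar" pairs as matches, which trades recall for precision.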

All personally identifiable information (PII), including phone numbers, email addresses, license plate numbers and addresses, has been replaced with general tags to protect user anonymity. A detailed explanation of the curation and annotation process is provided in our ACL 2021 paper:
Kazemi, A.; Garimella, K.; Gaffney, D.; and Hale, S. A. 2021. Claim Matching Beyond English to Scale Global Fact-Checking. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, ACL 2021.
Tamil Words Frequency
kaggle.com
zip
Updated Jul 3, 2023
Cite
Laaveshwaran Parthiban (2023). Tamil Words Frequency [Dataset]. https://www.kaggle.com/datasets/aviiciii/tamil-words-frequency
Explore at:
Available download formats: zip (42398729 bytes)
Dataset updated
Jul 3, 2023
Authors
Laaveshwaran Parthiban
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Github Repo: https://github.com/aviiciii/tamil-word-frequency

This is the output data of the above project, which analyses the frequency of words in the Tamil language across various data sources.

The Kaggle dataset used in this repository provides a comprehensive collection of Tamil language text data for Natural Language Processing (NLP) tasks. The dataset has been curated and compiled from various sources, and it serves as a valuable resource for linguistic research, NLP model training, and data analysis in the Tamil language domain.

The dataset contains a significant amount of Tamil language text, offering a diverse range of topics and genres. It includes a variety of text sources, such as news articles, books, blogs, social media content, and more. The texts cover a wide spectrum of vocabulary, enabling researchers and practitioners to explore different aspects of the Tamil language.
Tamil Call Center Data for Realestate AI
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Tamil Call Center Data for Realestate AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/realestate-call-center-conversation-tamil-india
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
This Tamil Call Center Speech Dataset for the Real Estate industry is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for Tamil-speaking Real Estate customers. With over 30 hours of unscripted, real-world audio, this dataset captures authentic conversations between customers and real estate agents, ideal for building robust ASR models.
Curated by FutureBeeAI, this dataset equips voice AI developers, real estate tech platforms, and NLP researchers with the data needed to create high-accuracy, production-ready models for property-focused use cases.
Speech Data
The dataset features 30 hours of dual-channel call center recordings between native Tamil speakers. Captured in realistic real estate consultation and support contexts, these conversations span a wide array of property-related topics, from inquiries to investment advice, offering deep domain coverage for AI model development.
•Participant Diversity:
•Speakers: 60 native Tamil speakers from our verified contributor community.
•Regions: Representing different regions across Tamil Nadu to ensure accent and dialect variation.
•Participant Profile: Balanced gender mix (60% male, 40% female) and age range from 18 to 70.
•Recording Details:
•Conversation Nature: Naturally flowing, unscripted agent-customer discussions.
•Call Duration: Average 5–15 minutes per call.
•Audio Format: Stereo WAV, 16-bit, recorded at 8kHz and 16kHz.
•Recording Environment: Captured in noise-free and echo-free conditions.
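
Before training on recordings like these, it can be worth checking that each file actually matches the listed audio format. The sketch below uses Python's standard `wave` module to test for stereo, 16-bit audio at 8 kHz or 16 kHz. The function name `matches_listed_format` is hypothetical, and the check assumes uncompressed PCM WAV files.

```python
import wave

# Sample rates listed in the recording details: 8 kHz and 16 kHz.
EXPECTED_RATES = {8000, 16000}


def matches_listed_format(path):
    """Return True if a WAV file is stereo, 16-bit, and sampled at 8 or 16 kHz."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getnchannels() == 2        # stereo (dual-channel)
            and wav.getsampwidth() == 2    # 2 bytes per sample = 16-bit
            and wav.getframerate() in EXPECTED_RATES
        )
```

Files that fail the check can be logged and resampled or excluded, depending on what the downstream ASR pipeline expects.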

Topic Diversity
This speech corpus includes both inbound and outbound calls, featuring positive, neutral, and negative outcomes across a wide range of real estate scenarios.
•Inbound Calls:
•Property Inquiries
•Rental Availability
•Renovation Consultation
•Property Features & Amenities
•Investment Property Evaluation
•Ownership History & Legal Info, and more
•Outbound Calls:
•New Listing Notifications
•Post-Purchase Follow-ups
•Property Recommendations
•Value Updates
•Customer Satisfaction Surveys, and others
Such domain-rich variety ensures model generalization across common real estate support conversations.
Transcription
All recordings are accompanied by precise, manually verified transcriptions in JSON format.
•Transcription Includes:
•Speaker-Segmented Dialogues
•Time-coded Segments
•Non-speech Tags (e.g., background noise, pauses)
•High transcription accuracy with word error rate below 5% via dual-layer human review.
These transcriptions streamline ASR and NLP development for Tamil real estate voice applications.
Metadata
Detailed metadata accompanies each participant and conversation:
•Participant Metadata: ID, age, gender, location, accent, and dialect.
•Conversation Metadata: Topic, call type, sentiment, sample rate, and technical details.

This enables smart filtering, dialect-focused model training, and structured dataset exploration.
Usage and Applications
This dataset is ideal for voice AI and NLP systems built for the real estate sector:
tamil-english-colloquial-translations
huggingface.co
Cite
Nandhini Varadharajan, tamil-english-colloquial-translations [Dataset]. https://huggingface.co/datasets/nandhinivaradharajan14/tamil-english-colloquial-translations
Authors
Nandhini Varadharajan
Description
Dataset: English-Tamil (en-ta) Parallel Corpus

This dataset contains parallel sentences in English and Tamil (en-ta) that have been curated from multiple sources. It is designed for tasks such as machine translation, language modeling, and other natural language processing (NLP) applications involving English and Tamil.

Dataset Composition

The dataset is composed of three main parts, which have been concatenated into a single file with two columns: ta (Tamil) and… See the full description on the dataset page: https://huggingface.co/datasets/nandhinivaradharajan14/tamil-english-colloquial-translations.
Data from: EnTam: An English-Tamil Parallel Corpus (EnTam v2.0)
live.european-language-grid.eu
binary format
Updated Oct 30, 2014
Cite
(2014). EnTam: An English-Tamil Parallel Corpus (EnTam v2.0) [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/1085
Explore at:
Available download formats: binary format
Dataset updated
Oct 30, 2014
License
Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0) https://creativecommons.org/licenses/by-nc-sa/3.0/
License information was derived automatically
Description
EnTam is a sentence-aligned English-Tamil bilingual corpus that we collected from publicly available websites for NLP research involving Tamil. Standard processing was applied to the raw web data before it was released as a sentence-aligned English-Tamil parallel corpus suitable for various NLP tasks. The parallel corpus includes texts from the Bible, cinema and news domains.
indic-nlp
huggingface.co
Updated Jun 12, 2025
Cite
Ayush Bagarua (2025). indic-nlp [Dataset]. https://huggingface.co/datasets/ayushbagaria17/indic-nlp
Dataset updated
Jun 12, 2025
Authors
Ayush Bagarua
Description
L3Cube-IndicNews

L3Cube-IndicNews is a multilingual text classification corpus aimed at curating a high-quality dataset for Indian regional languages, with a specific focus on news headlines and articles. We have centered our work on 11 prominent Indic languages, including Hindi, Bengali, Marathi, Telugu, Tamil, Gujarati, Kannada, Odia, Malayalam, Punjabi and English. Each of these news datasets comprises 10 or more classes of news articles. L3Cube-IndicNews offers 3 distinct… See the full description on the dataset page: https://huggingface.co/datasets/ayushbagaria17/indic-nlp.
TamilSentiMix NLP
kaggle.com
zip
Updated Oct 18, 2023
Cite
Joakim Arvidsson (2023). TamilSentiMix NLP [Dataset]. https://www.kaggle.com/datasets/joebeachcapital/tamilsentimix
Explore at:
Available download formats: zip (1891 bytes)
Dataset updated
Oct 18, 2023
Authors
Joakim Arvidsson
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Understanding the sentiment of a comment from a video or an image is an essential task in many applications. Sentiment analysis of a text can be useful for various decision-making processes. One such application is to analyse the popular sentiments of videos on social media based on viewer comments. However, comments from social media do not follow strict rules of grammar, and they contain mixing of more than one language, often written in non-native scripts. Non-availability of annotated code-mixed data for a low-resourced language like Tamil also adds difficulty to this problem. To overcome this, we created a gold standard Tamil-English code-switched, sentiment-annotated corpus containing 15,744 comment posts from YouTube. In this paper, we describe the process of creating the corpus and assigning polarities. We present inter-annotator agreement and show the results of sentiment analysis trained on this corpus as a benchmark.

Introductory Paper

Corpus Creation for Sentiment Analysis in Code-Mixed Tamil-English Text

By Bharathi Raja Chakravarthi, V. Muralidaran, R. Priyadharshini, John P. McCrae. 2020

Published in Workshop on Spoken Language Technologies for Under-resourced Languages
Indian Agent to Indian Customer call center Speech Dataset in Tamil
data.macgence.com
mp3
Updated Mar 29, 2024
Cite
Macgence (2024). Indian Agent to Indian Customer call center Speech Dataset in Tamil [Dataset]. https://data.macgence.com/dataset/indian-agent-to-indian-customer-call-center-speech-dataset-in-tamil
Explore at:
Available download formats: mp3
Dataset updated
Mar 29, 2024
Dataset authored and provided by
Macgence
License
https://data.macgence.com/terms-and-conditions
Time period covered
2025
Area covered
Worldwide, India
Variables measured
Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
Description
Explore Macgence's Tamil speech dataset of Indian agent-customer call center conversations—ideal for ASR, NLP, and voice AI training applications.
Indian Agent to Indian Customer call center Speech Dataset in Tamil for...
data.macgence.com
mp3
Updated Mar 17, 2024
Cite
Macgence (2024). Indian Agent to Indian Customer call center Speech Dataset in Tamil for Finance [Dataset]. https://data.macgence.com/dataset/indian-agent-to-indian-customer-call-center-speech-dataset-in-tamil-for-finance
Explore at:
Available download formats: mp3
Dataset updated
Mar 17, 2024
Dataset authored and provided by
Macgence
License
https://data.macgence.com/terms-and-conditions
Time period covered
2025
Area covered
Worldwide, India
Variables measured
Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
Description
High-quality Tamil speech dataset featuring Indian agent-customer finance calls, ideal for ASR, NLP, and voice AI model training.
Tamil - Language Corpus for NLP
kaggle.com
zip
Updated Apr 8, 2020
Cite
Praveen (2020). Tamil - Language Corpus for NLP [Dataset]. https://www.kaggle.com/datasets/praveengovi/tamil-language-corpus-for-nlp/discussion
Explore at:
Available download formats: zip (2440281343 bytes)
Dataset updated
Apr 8, 2020
Authors
Praveen
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description

Context

Tamil is one of the longest-surviving classical languages in the world. It has been described as "the only language of contemporary India which is recognizably continuous with a classical past". The variety and quality of classical Tamil literature has led to it being described as "one of the great classical traditions and literatures of the world".

The Tamil language corpus helps researchers, IT professionals, and students create Tamil language models for sentiment classification, topic modeling, text summarisation, text generation, named entity recognition, knowledge graphs, and chatbots.

Content

The Tamil language corpus consists of articles from Wikipedia and Tamil daily news. The dataset is split into train and test sets for ease of use in building machine learning models.

Acknowledgements

Thanks to Vanagamudi and Gaurov for their contributions to Tamil NLP; the datasets used in their NLP work were really helpful in preparing this dataset.

https://github.com/vanangamudi/tamil-lm2 https://github.com/goru001/nlp-for-tamil

Inspiration

To advance the Tamil language in the world of Artificial Intelligence and to contribute to education and research.
English-Tamil Parallel Corpus for the Legal Domain
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). English-Tamil Parallel Corpus for the Legal Domain [Dataset]. https://www.futurebeeai.com/dataset/parallel-corpora/tamil-english-translated-parallel-corpus-for-legal-domain
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The English-Tamil Legal Parallel Corpus is a high-quality bilingual dataset designed to support the development of multilingual legal language models, machine translation systems, and text-based AI tools. With over 50,000 carefully translated sentence pairs, this dataset serves as a critical resource for anyone working on cross-lingual legal technology or NLP applications in the legal field.
Dataset Content
•Volume and Translator Diversity
  •Sentence Count: Over 50,000 bilingual sentence pairs
  •Translator Base: More than 200 native Tamil linguists with domain familiarity contributed to the translation process
  •Dataset Origin: Built from scratch with legal use cases in mind, ensuring domain relevance and application readiness
•Sentence Variety
  •Length Range: Sentences contain 7 to 25 words
  •Grammatical Structures: Includes simple, compound, and complex sentences
  •Form Types: Covers questions, commands, affirmations, and negations
  •Voice Representation: Balanced use of active and passive sentence constructions
  •Cross Translation: Dataset includes both English-to-Tamil and Tamil-to-English segments to ensure bidirectional support
  •Linguistic Features:
    •Idiomatic expressions and legal jargon
    •Sentence connectors and discourse markers to preserve argument structure and legal reasoning
Legal Domain Specialization
•Legal Terminology Coverage
  This dataset includes terminology across a wide range of legal subdomains such as:
  •Contracts, agreements, and commercial law
  •Criminal and civil litigation
  •Legal procedures, rulings, and statutory interpretation
  •Administrative, constitutional, and regulatory terms
  •Courtroom dialogue, judgments, and legal advisories
•Contextual Diversity
  Sentence pairs are drawn from realistic legal content types, including:
  •Legal briefs, affidavits, and memoranda
  •Terms of service and data protection policies
  •Research articles and legal scholarship
  •Standard forms and templates
  •Legislative, policy, and compliance language
•Cross-Domain Elements
  To reflect the multidisciplinary nature of legal texts, the dataset also includes content that touches on:
  •Government policy
  •Business and finance
  •Technology, IP, and cybersecurity law
Format and Structure
•Available Formats: Delivered in Excel, with optional conversions to TMX, JSON, XML, XLIFF, or other localization formats
•Included Fields:
  •Serial Number
  •Unique ID
  •Source Sentence and Word Count
  •Target Sentence and Word Count
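To make the field list concrete, the sketch below models one row of the corpus as a Python dataclass and checks that the stored word counts agree with the sentences. The class name `SentencePair`, the helper `make_pair`, the attribute names, and whitespace-based word counting are all illustrative assumptions; the actual Excel column headers and the provider's counting method are not stated in this listing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SentencePair:
    """One row of the parallel corpus (attribute names are hypothetical)."""
    serial_number: int
    unique_id: str
    source_sentence: str
    source_word_count: int
    target_sentence: str
    target_word_count: int

    def counts_are_consistent(self) -> bool:
        """Return True if both stored counts equal the whitespace token counts."""
        return (
            len(self.source_sentence.split()) == self.source_word_count
            and len(self.target_sentence.split()) == self.target_word_count
        )


def make_pair(serial_number: int, unique_id: str, source: str, target: str) -> SentencePair:
    """Build a pair, deriving word counts by splitting on whitespace (an assumption)."""
    return SentencePair(
        serial_number,
        unique_id,
        source,
        len(source.split()),
        target,
        len(target.split()),
    )
```

A check like `counts_are_consistent()` could be run over every row after loading the Excel file to catch rows where a sentence was edited without updating its count.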
Use Cases and Applications
•Legal Machine Translation: Build accurate translation engines for contracts, laws, and compliance documentation
•Multilingual NLP Tools: Develop legal summarization tools, AI writing assistants, and terminology alignment engines
Indian Agent to Indian Customer call center Speech Dataset in Tamil for...
data.macgence.com
mp3
Updated Mar 21, 2024
Cite
Macgence (2024). Indian Agent to Indian Customer call center Speech Dataset in Tamil for Banking [Dataset]. https://data.macgence.com/dataset/indian-agent-to-indian-customer-call-center-speech-dataset-in-tamil-for-banking
Explore at:
Available download formats: mp3
Dataset updated
Mar 21, 2024
Dataset authored and provided by
Macgence
License
https://data.macgence.com/terms-and-conditions
Time period covered
2025
Area covered
Worldwide, India
Variables measured
Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
Description
Explore authentic Tamil call center speech data for banking, featuring Indian agents and customers. Curated by Macgence for voice AI and NLP projects.

Cite

Younus_Mohamed (2024). Tamil (Tamizh) Wikipedia Text Dataset for NLP [Dataset]. https://www.kaggle.com/datasets/younusmohamed/tamil-tamizh-wikipedia-articles

Tamil (Tamizh) Wikipedia Text Dataset for NLP

Building a High-Resource Future for Tamil in NLP: Collaborative Efforts for Data

Explore at:

Available download formats: zip (339341289 bytes)

Dataset updated

Nov 12, 2024

Authors

Younus_Mohamed

License

Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically

Description

This dataset is part of a larger mission to transform Tamil into a high-resource language in the field of Natural Language Processing (NLP). As one of the oldest and most culturally rich languages, Tamil has a unique linguistic structure, yet it remains underrepresented in the NLP landscape. This dataset, extracted from Tamil Wikipedia, serves as a foundational resource to support Tamil language processing, text mining, and machine learning applications.

What’s Included

- Text Data: This dataset contains over 569,000 articles in raw text form, extracted from Tamil Wikipedia. The collection is ideal for language model training, word frequency analysis, and text mining.

- Scripts and Processing Tools: Code snippets are provided for processing .bz2 compressed files, generating word counts, and handling data for NLP applications.
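
The dataset's own snippets are not reproduced in this listing. As a rough illustration of the kind of processing described, here is a minimal sketch that streams a .bz2-compressed text file and counts word frequencies using only the Python standard library. The function name `count_words`, the file name in the usage note, and the whitespace tokenization are assumptions, not part of the dataset.

```python
import bz2
from collections import Counter


def count_words(path, encoding="utf-8"):
    """Count whitespace-separated tokens in a .bz2-compressed text file.

    Hypothetical helper: splitting on whitespace is a simplifying
    assumption and performs no Tamil-specific normalization.
    """
    counts = Counter()
    # Mode "rt" opens the compressed file as text and decompresses it
    # incrementally, so the whole file never has to fit in memory at once.
    with bz2.open(path, mode="rt", encoding=encoding) as handle:
        for line in handle:
            counts.update(line.split())
    return counts
```

Calling `count_words("tamil_wiki.txt.bz2").most_common(20)` would list the twenty most frequent tokens, where `tamil_wiki.txt.bz2` is a placeholder for whichever compressed file is actually downloaded.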

Why This Dataset?

Although Tamil has a documented lexicon of over 100,000 words, only a fraction of these are actively used in everyday communication. The largest available Tamil treebank currently holds only 10,000 words, limiting the scope for training accurate language models. This dataset aims to bridge that gap by providing a robust, open-source corpus for researchers, developers, and linguists who want to work on Tamil language technologies.

How You Can Use This Dataset

- Language Modeling: Train or fine-tune models like BERT, GPT, or LSTM-based language models for Tamil.

- Linguistic Research: Analyze Tamil morphology, syntax, and vocabulary usage.

- Data Augmentation: Use the raw text to generate augmented data for multilingual NLP applications.

- Word Embeddings and Semantic Analysis: Create embeddings for Tamil words, useful in multilingual setups or standalone applications.

Let’s Collaborate!

I believe that advancing Tamil in NLP cannot be a solo effort. Contributions in the form of additional data, annotations, or even new tools for Tamil language processing are welcome! By working together, we can make Tamil a truly high-resource language in NLP.

License

This dataset is based on content from Tamil Wikipedia and is shared under the Creative Commons Attribution-ShareAlike 3.0 Unported License (CC BY-SA 3.0). Proper attribution to Wikipedia is required when using this data.

