70 datasets found

Tamil NLP
kaggle.com
Updated Mar 11, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
SRK (2019). Tamil NLP [Dataset]. https://www.kaggle.com/sudalairajkumar/tamil-nlp/code
Explore at:
Dataset updated
Mar 11, 2019
Dataset provided by
Kagglehttp://kaggle.com/
Authors
SRK
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Context

Indic NLP - Natural Language Processing for Indian Languages.

This dataset is a step towards the same for tamil language. Thanks for Malaikannan for the initiative and Selva for getting the data from websites. The idea is to add more datasets related to Tamil NLP at a single place.

Content

The dataset has the following files.

Tamil News Classficaition

This dataset has 14521 rows for training and 3631 rows for testing. It has 6 news categories - "tamilnadu", "india", "cinema", "sports", "politics", "world". The data is obtained from this link

tamil_news_train.csv - Train dataset for tamil news classification.

tamil_news_test.csv - Test dataset for tamil news classification

Tamil Movie Review Dataset

This dataset has 480 training samples and 121 testing samples. It has the review text in tamil and ratings between 1 to 5. The data is obtained from this link

tamil_movie_reviews_train.csv - Train dataset for tamil movie reviews

tamil_movie_reviews_test.csv - Test dataset for tamil movie reviews

Thirukkural Dataset

From Wikipedia, The Tirukkural, or shortly the Kural, is a classic Tamil text consisting of 1,330 couplets or Kurals, dealing with the everyday virtues of an individual. It is one of the two oldest works now extant in Tamil literature.

I have split the data into train and test and we can use the kural and / or the explanations to predict the three parts - aram (virtue), porul (polity) and inbam (love). The dataset is obtained from this link.

tamil_thirukkural_train - train dataset having 1064 rows

tamil_thirukkural_test - test dataset having 266 rows

Will add more datasets in the following versions.

Acknowledgements

My sincere thanks to :

Malaikannan for starting this initiative

Selvakumar for getting the data

Vijay Anand for the Thirukkural data

Inspiration

Some questions which can be answered are

Can we do text classification for Tamil languages and get good accuracies similar to other languages?

How does the Language models do for Tamil?

And lot more interesting questions to be answered.

Checkout this link to find similar and dissimilar words for Tamil.
F
Tamil-English translated Parallel Corpora for Legal Domain
futurebeeai.com
wav
Updated Aug 1, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). Tamil-English translated Parallel Corpora for Legal Domain [Dataset]. https://www.futurebeeai.com/dataset/parallel-corpora/tamil-english-translated-parallel-corpus-for-legal-domain
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/data-license-agreementhttps://www.futurebeeai.com/data-license-agreement
Dataset funded by
FutureBeeAI
Description
This bilingual parallel corpus consists of 50K+ sentence text data translated to Tamil from English with the help of more than 200 native translators in the Legal domain. These domain-specific parallel corpora have native language slang, phrases, and language-specific words, and follow the native way of talking, making the corpus more information-rich. Many of the same sentences are translated by various native translators, allowing us to compare how various groups interpret the same text.,
The sentences in this comparable corpus range in length from 7 to 15 words. The data is accessible in excel format and can be converted into TMX, XML, XLIFF, or other equivalent formats. ,
These parallel bilingual corpora can be utilised for the research and development of bilingual lexicography and machine translation engines. Additionally, it can be used to create numerous language databases for applications like predictive keyboards, spell checkers, grammar checkers, text/speech understanding systems, text-to-speech modules, and many others that are based on NLP.,
More translated sentences are constantly being added to this parallel corpus. Depending on your unique requirements, we can curate numerous parallel corpora in various languages. For synthetic custom curation, do not forget to check out the FutureBeeAI community. The license for this parallel corpus dataset belongs to FutureBeeAI!
Ponniyan selvan Tamil Book for NLP
kaggle.com
zip
Updated Sep 9, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dinesh Kumar Sarangapani (2020). Ponniyan selvan Tamil Book for NLP [Dataset]. https://www.kaggle.com/datasets/dineshkumarsarang/ponniyan-selvan-tamil-book-for-nlp
Explore at:
zip(1985053 bytes)Available download formats
Dataset updated
Sep 9, 2020
Authors
Dinesh Kumar Sarangapani
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Dataset

This dataset was created by Dinesh Kumar Sarangapani

Released under CC0: Public Domain

Contents
Claim Detection and Matching for Indian Languages
zenodo.org
explore.openaire.eu
csv
Updated Jun 6, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ashkan Kazemi; Kiran Garimella; Devin Gaffney; Scott A. Hale; Ashkan Kazemi; Kiran Garimella; Devin Gaffney; Scott A. Hale (2021). Claim Detection and Matching for Indian Languages [Dataset]. http://doi.org/10.5281/zenodo.4890950
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.4890950
Dataset updated
Jun 6, 2021
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Ashkan Kazemi; Kiran Garimella; Devin Gaffney; Scott A. Hale; Ashkan Kazemi; Kiran Garimella; Devin Gaffney; Scott A. Hale
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
India
Description
Two datasets are included in this repository: claim matching and claim detection datasets. The collections contain data in 5 languages: Bengali, English, Hindi, Malayalam and Tamil.

The "claim detection" dataset contains textual claims from social media and fact-checking websites annotated for the "fact-check worthiness" of the claims in each message. Data points have one of the three labels of "Yes" (text contains one or more check-worthy claims), "No" and "Probably".

The "claim matching" dataset is a curated collection of pairs of textual claims from social media and fact-checking websites for the purpose of automatic and multilingual claim matching. Pairs of data have one of the four labels of "Very Similar", "Somewhat Similar", "Somewhat Dissimilar" and "Very Dissimilar".

All personally identifiable information (PII) including phone numbers, email addresses, license plate numbers and addresses have been replaced with general tags (e.g.
, etc) to protect user anonymity. A detailed explanation on the curation and annotation process is provided in our ACL 2021 paper:
Kazemi, A.; Garimella, K.; Gaffney, D.; and Hale, S. A. 2021. Claim Matching Beyond English to Scale Global Fact-Checking. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, ACL 2021.
F
Tamil (India) General Conversation Speech Dataset
futurebeeai.com
wav
Updated Aug 1, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). Tamil (India) General Conversation Speech Dataset [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-tamil-india
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/data-license-agreementhttps://www.futurebeeai.com/data-license-agreement
Dataset funded by
FutureBeeAI
Description
Welcome to the Tamil Language General Conversation Speech Dataset, a comprehensive and diverse collection of voice data specifically curated to advance the development of Tamil language speech recognition models, with a particular focus on Indian accents and dialects.
With high-quality audio recordings, detailed metadata, and accurate transcriptions, it empowers researchers and developers to enhance natural language processing, conversational AI, and Generative Voice AI algorithms. Moreover, it facilitates the creation of sophisticated voice assistants and voice bots tailored to the unique linguistic nuances found in the Tamil language spoken in India.
Speech Data:
This training dataset comprises 50 hours of audio recordings covering a wide range of topics and scenarios, ensuring robustness and accuracy in speech technology applications. To achieve this, we collaborated with a diverse network of 70 native Tamil speakers from different part of Tamil Nadu. This collaborative effort guarantees a balanced representation of Indian accents, dialects, and demographics, reducing biases and promoting inclusivity.
Each audio recording captures the essence of spontaneous, unscripted conversations between two individuals, with an average duration ranging from 15 to 60 minutes. The speech data is available in WAV format, with stereo channel files having a bit depth of 16 bits and a sample rate of 8 kHz. The recording environment is generally quiet, without background noise and echo.
Metadata:
In addition to the audio recordings, our dataset provides comprehensive metadata for each participant. This metadata includes the participant's age, gender, country, state, and dialect. Furthermore, additional metadata such as recording device detail, topic of recording, bit depth, and sample rate will be provided.
The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Tamil language speech recognition models.
Transcription:
This dataset provides a manual verbatim transcription of each audio file to enhance your workflow efficiency. The transcriptions are available in JSON format. The transcriptions capture speaker-wise transcription with time-coded segmentation along with non-speech labels and tags.
Our goal is to expedite the deployment of Tamil language conversational AI and NLP models by offering ready-to-use transcriptions, ultimately saving valuable time and resources in the development process.
Updates and Customization:
We understand the importance of collecting data in various environments to build robust ASR models. Therefore, our voice dataset is regularly updated with new audio data captured in diverse real-world conditions.
If you require a custom training dataset with specific environmental conditions such as in-car, busy street, restaurant, or any other scenario, we can accommodate your request. We can provide voice data with customized sample rates ranging from 8kHz to 48kHz, allowing you to fine-tune your models for different audio recording setups. Additionally, we can also customize the transcription following your specific guidelines and requirements, to further support your ASR development process.
License:
This audio dataset, created by FutureBeeAI, is now available for commercial use.
Conclusion:
Whether you are training or fine-tuning speech recognition models, advancing NLP algorithms, exploring generative voice AI, or building cutting-edge voice assistants and bots, our dataset serves as a reliable and valuable resource.
E
GlobalPhone Tamil
live.european-language-grid.eu
catalogue.elra.info
audio format
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
GlobalPhone Tamil [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/1916
Explore at:
audio formatAvailable download formats
License
http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdfhttp://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdfhttp://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
Description
The GlobalPhone corpus developed in collaboration with the Karlsruhe Institute of Technology (KIT) was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks.

The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).

In each language about 100 sentences were read from each of the 100 speakers. The read texts were selected from national newspapers available via Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16bit, 16kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering, false starts, and non-verbal effects like laughing and hesitations. Speaker information like age, gender, occupation, etc. as well as information about the recording setup complement the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers.

Data is shortened by means of the shorten program written by Tony Robinson. Alternatively, the data could be delivered unshorten.

The Tamil corpus was produced using the Thinaboomi Tamil Daily newspaper. It contains recordings of 47 speakers (gender unspecified) recorded in India. No age distribution is available.
d
EnTam: An English-Tamil Parallel Corpus (EnTam v2.0) - Dataset - B2FIND
b2find.dkrz.de
Updated Apr 28, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2023). EnTam: An English-Tamil Parallel Corpus (EnTam v2.0) - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/9eb44325-3708-574f-a0da-4e8ccff2aa66
Explore at:
Dataset updated
Apr 28, 2023
License
Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0)https://creativecommons.org/licenses/by-nc-sa/3.0/
License information was derived automatically
Description
EnTam is a sentence aligned English-Tamil bilingual corpus from some of the publicly available websites that we have collected for NLP research involving Tamil. The standard set of processing has been applied on the the raw web data before the data became available in sentence aligned English-Tamil parallel corpus suitable for various NLP tasks. The parallel corpus includes texts from bible, cinema and news domains.
k
Tamil---Language-Corpus-for-NLP
kaggle.com
Updated Mar 23, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2020). Tamil---Language-Corpus-for-NLP [Dataset]. https://www.kaggle.com/praveengovi/tamil-language-corpus-for-nlp/code
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Mar 23, 2020
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
தமிழ் மொழி கார்பஸ் தரவுத்தொகுப்பு - இயற்கை மொழி செயலாக்கம்
F
Tamil-English translated Parallel Corpora for Management Domain
futurebeeai.com
wav
Updated Aug 1, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). Tamil-English translated Parallel Corpora for Management Domain [Dataset]. https://www.futurebeeai.com/dataset/parallel-corpora/tamil-english-translated-parallel-corpus-for-management-domain
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/data-license-agreementhttps://www.futurebeeai.com/data-license-agreement
Dataset funded by
FutureBeeAI
Description
This bilingual parallel corpus consists of 50K+ sentence text data translated to Tamil from English with the help of more than 200 native translators in the Management domain. These domain-specific parallel corpora have native language slang, phrases, and language-specific words, and follow the native way of talking, making the corpus more information-rich. Many of the same sentences are translated by various native translators, allowing us to compare how various groups interpret the same text.,
The sentences in this comparable corpus range in length from 7 to 15 words. The data is accessible in excel format and can be converted into TMX, XML, XLIFF, or other equivalent formats. ,
These parallel bilingual corpora can be utilised for the research and development of bilingual lexicography and machine translation engines. Additionally, it can be used to create numerous language databases for applications like predictive keyboards, spell checkers, grammar checkers, text/speech understanding systems, text-to-speech modules, and many others that are based on NLP.,
More translated sentences are constantly being added to this parallel corpus. Depending on your unique requirements, we can curate numerous parallel corpora in various languages. For synthetic custom curation, do not forget to check out the FutureBeeAI community. The license for this parallel corpus dataset belongs to FutureBeeAI!
d
TAUS Language Translation Data | Parallel translation for Colloquial English...
datarade.ai
Updated Dec 15, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
TAUS (2020). TAUS Language Translation Data | Parallel translation for Colloquial English into various languages for Machine Learning [Dataset]. https://datarade.ai/data-products/taus-parallel-text-colloquial-domain-english-low-resource-see-description-taus
Explore at:
.xml, .csv, .xls, .txtAvailable download formats
Dataset updated
Dec 15, 2020
Dataset authored and provided by
TAUS
Area covered
Bangladesh, Myanmar, Lao People's Democratic Republic, Turkey, Vietnam, Timor-Leste, Iraq, Iran (Islamic Republic of), Nepal, Indonesia
Description
The corpus is a great fit for training chat bots or social media content, and will give the conversation with your local audience a friendly, casual tone. From product user reviews and blog post comments to everyday business small talk, your MT engine will be able to handle even the most creative user voices.

This corpus contains over 1 million words, and a total vocabulary of more than 37000 different words. Need more data? In the following months, TAUS will release more equally sized corpora for the same domain and language combinations, with a significant increase of vocabulary.

English - Hindi English - Urdu English - Tamil English - Nepali English - Turkish English - Pashto English - Sorani English - Bengali English - Burmese English - Assamese English - Telugu English - Sinhalese English - Dari English - Punjabi (Pakistan) English - Punjabi (India) English - Lao English - Kurmanji (lat) English - Kurmanji (arab)

Other languages are available on demand.

IndicCorp Dataset

paperswithcode.com
opendatalab.com

Updated Mar 10, 2024

Facebook

Twitter

Click to copy link

Link copied

Cite

Divyanshu Kakwani; Anoop Kunchukuttan; Satish Golla; Gokul N.C.; Avik Bhattacharyya; Mitesh M. Khapra; Pratyush Kumar. (2024). IndicCorp Dataset [Dataset]. https://paperswithcode.com/dataset/indiccorp

Explore at:

Dataset updated

Mar 10, 2024

Authors

Divyanshu Kakwani; Anoop Kunchukuttan; Satish Golla; Gokul N.C.; Avik Bhattacharyya; Mitesh M. Khapra; Pratyush Kumar.

Description

IndicCorp is a large monolingual corpora with around 9 billion tokens covering 12 of the major Indian languages. It has been developed by discovering and scraping thousands of web sources - primarily news, magazines and books, over a duration of several months.

Languages covered: Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu

Corpus Format: The corpus is a single large text file containing one sentence per line. The publicly released version is randomly shuffled, untokenized and deduplicated.

Downloads

Language	# News Articles*	Sentences	Tokens	Link
as	0.60M	1.39M	32.6M	link
bn	3.83M	39.9M	836M	link
en	3.49M	54.3M	1.22B	link
gu	2.63M	41.1M	719M	link
hi	4.95M	63.1M	1.86B	link
kn	3.76M	53.3M	713M	link
ml	4.75M	50.2M	721M	link
mr	2.31M	34.0M	551M	link
or	0.69M	6.94M	107M	link
pa	2.64M	29.2M	773M	link
ta	4.41M	31.5M	582M	link
te	3.98M	47.9M	674M	link

Excluding articles obtained from the OSCAR corpus

HPL Tamil Dataset
kaggle.com
zip
Updated Apr 9, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rohit Khadka (2024). HPL Tamil Dataset [Dataset]. https://www.kaggle.com/datasets/rohitkhadka375741/hpl-tamil
Explore at:
zip(0 bytes)Available download formats
Dataset updated
Apr 9, 2024
Authors
Rohit Khadka
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
"HPL Tamil" dataset serves as a valuable resource for anyone interested in studying and analyzing the Tamil language, facilitating advancements in computational linguistics and NLP research.
XQA Tamil
kaggle.com
zip
Updated Sep 30, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Manav Dhamani (2021). XQA Tamil [Dataset]. https://www.kaggle.com/mdhamani/xqa-tamil
Explore at:
zip(11676466 bytes)Available download formats
Dataset updated
Sep 30, 2021
Authors
Manav Dhamani
Description
Dataset

This dataset was created by Manav Dhamani

Contents
Tamil - Language Corpus for NLP
kaggle.com
Updated Apr 8, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Praveen (2020). Tamil - Language Corpus for NLP [Dataset]. https://www.kaggle.com/praveengovi/tamil-language-corpus-for-nlp/discussion
Explore at:
Dataset updated
Apr 8, 2020
Dataset provided by
Kaggle
Authors
Praveen
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
https://cms.qz.com/wp-content/uploads/2017/04/tamil.jpg?quality=75&strip=all&w=1400" alt="">

Context

Tamil is one of the longest-surviving classical languages in the world.It described as "the only language of contemporary India which is recognizably continuous with a classical past. The variety and quality of classical Tamil literature has led to it being described as "one of the great classical traditions and literatures of the world".

Tamil language Corpus helps researches,IT professionals and students to create tamil language models for classifying sentiments , Topic modeling , text summarisation , text generation ,Named Entity recognition ,Knowledge graph and Chatbot

Content

Tamil language Corpus consist of articles from Wikipedia & Tamil daily news , Dataset split into train and test for ease of use in building machine learning models

Acknowledgements

Thanks to Vanagamudi and Gaurov for contribution to tamil NLP and dataset used for their NLP is really helpful to prepare this dataset

https://github.com/vanangamudi/tamil-lm2 https://github.com/goru001/nlp-for-tamil

Inspiration

Evolving the tamil language in Artificial Intelligence world & contribute to education and research
Tamil Wikipedia Articles
kaggle.com
Updated Dec 25, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gaurav (2019). Tamil Wikipedia Articles [Dataset]. https://www.kaggle.com/disisbig/tamil-wikipedia-articles/notebooks
Explore at:
Dataset updated
Dec 25, 2019
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Gaurav
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
This data set consists of 127k Wikipedia Articles which have been cleaned.

It has a Train set and Validation set, which were used to train and benchmark Language Models for Tamil in the repository NLP for Tamil

The scripts which were used to fetch and clean articles can be found here

Thanks to Ravi for sharing this data set

Feel free to use this data set creatively and for building better Language Models
F
Travel domain Human-Human conversation chats in Tamil
futurebeeai.com
wav
Updated Aug 1, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). Travel domain Human-Human conversation chats in Tamil [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/tamil-travel-domain-conversation-text-dataset
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/data-license-agreementhttps://www.futurebeeai.com/data-license-agreement
Dataset funded by
FutureBeeAI
Description
This training dataset comprises more than 10,000 conversational text data between two native Tamil people in the travel domain. We have a collection of chats on a variety of different topics/services/issues of daily life, such as music, books, festivals, health, kids, family, environment, study, childhood, cuisine, internet, movies, etc., and that makes the dataset diverse.,
These chats consist of language-specific words, and phrases and follow the native way of talking which makes the chats more information-rich for your NLP model. Apart from each chat being specific to the topic, it contains various attributes like people's names, addresses, contact information, email address, time, date, local currency, telephone numbers, local slang, etc too in various formats to make the text data unbiased.,
These chat scripts have between 300 and 700 words and up to 50 turns. 150 people that are a part of the FutureBeeAI crowd community contributed to this dataset. You will also receive chat metadata, such as participant age, gender, and country information, along with the chats. Dataset applications include conversational AI, natural language processing (NLP), smart assistants, text recognition, text analytics, and text prediction.,
This dataset is being expanded with new chats all the time. We are able to produce text data in a variety of languages to meet your unique requirements. Check out the FutureBeeAI community for a custom collection.,
This training dataset's licence belongs to FutureBeeAI!
E
Data from: Tamil Dependency Treebank v0.1
live.european-language-grid.eu
lindat.mff.cuni.cz
binary format
Updated Oct 30, 2014
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2014). Tamil Dependency Treebank v0.1 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/1084
Explore at:
binary formatAvailable download formats
Dataset updated
Oct 30, 2014
License
Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0)https://creativecommons.org/licenses/by-nc-sa/3.0/
License information was derived automatically
Description
Tamil Dependency Treebank version 0.1 (TamilTB.v0.1) is an attempt to develop a syntactically annotated corpora for Tamil. TamilTB.v0.1 contains 600 sentences enriched with manual annotation of morphology and dependency syntax in the style of Prague Dependency Treebank. TamilTB.v0.1 has been created at the Institute of Formal and Applied Linguistics, Charles University in Prague.
h
offenseval_dravidian
huggingface.co
opendatalab.com
Updated Oct 19, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
The HF Datasets community (2023). offenseval_dravidian [Dataset]. https://huggingface.co/datasets/offenseval_dravidian
Explore at:
Dataset updated
Oct 19, 2023
Dataset authored and provided by
The HF Datasets community
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Offensive language identification in dravidian lanaguages dataset. The goal of this task is to identify offensive language content of the code-mixed dataset of comments/posts in Dravidian Languages ( (Tamil-English, Malayalam-English, and Kannada-English)) collected from social media.
m
Transliteration Sentence Dataset
data.mendeley.com
Updated Apr 16, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Md. Jabed Hosen (2024). Transliteration Sentence Dataset [Dataset]. http://doi.org/10.17632/38y7g2fcny.1
Explore at:
Unique identifier
https://doi.org/10.17632/38y7g2fcny.1
Dataset updated
Apr 16, 2024
Authors
Md. Jabed Hosen
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A transliteration sentence is like writing the same words but using different letters that sound the same. It helps people who speak different languages understand each other better. This dataset, drawn from 12 varied datasets initially intended for tasks such as sentiment analysis, hate speech detection, social media analysis, and review classification, endeavors to encompass a wide array of linguistic subtleties and fluctuations inherent in real-world language usage. Each data instance was meticulously labeled based on the language of the sentences. From this amalgamation of datasets, we curated a dataset comprising 65,473 instances, comprising 19,859 Bangla, 17,309 Hindi, 17,000 English, and 11,305 Tamil data instances, specifically tailored for transliteration sentence identification.
P
IndicGLUE Dataset
paperswithcode.com
Updated Mar 9, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Divyanshu Kakwani; Anoop Kunchukuttan; Satish Golla; Gokul N.C.; Avik Bhattacharyya; Mitesh M. Khapra; Pratyush Kumar. (2022). IndicGLUE Dataset [Dataset]. https://paperswithcode.com/dataset/indicglue
Explore at:
Dataset updated
Mar 9, 2022
Authors
Divyanshu Kakwani; Anoop Kunchukuttan; Satish Golla; Gokul N.C.; Avik Bhattacharyya; Mitesh M. Khapra; Pratyush Kumar.
Description
We now introduce IndicGLUE, the Indic General Language Understanding Evaluation Benchmark, which is a collection of various NLP tasks as de- scribed below. The goal is to provide an evaluation benchmark for natural language understanding ca- pabilities of NLP models on diverse tasks and mul- tiple Indian languages.

Facebook

Twitter

Click to copy link

Link copied

Cite

IndicCorp Dataset

Explore at:

91 scholarly articles cite this dataset (View in Google Scholar)

Dataset updated

Mar 10, 2024

Authors

Divyanshu Kakwani; Anoop Kunchukuttan; Satish Golla; Gokul N.C.; Avik Bhattacharyya; Mitesh M. Khapra; Pratyush Kumar.

Description

Languages covered: Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu

Corpus Format: The corpus is a single large text file containing one sentence per line. The publicly released version is randomly shuffled, untokenized and deduplicated.

Downloads

Language	# News Articles*	Sentences	Tokens	Link
as	0.60M	1.39M	32.6M	link
bn	3.83M	39.9M	836M	link
en	3.49M	54.3M	1.22B	link
gu	2.63M	41.1M	719M	link
hi	4.95M	63.1M	1.86B	link
kn	3.76M	53.3M	713M	link
ml	4.75M	50.2M	721M	link
mr	2.31M	34.0M	551M	link
or	0.69M	6.94M	107M	link
pa	2.64M	29.2M	773M	link
ta	4.41M	31.5M	582M	link
te	3.98M	47.9M	674M	link

Excluding articles obtained from the OSCAR corpus

Clear search

Close search

Google apps

Main menu

Tamil NLP

Context

Content

Acknowledgements

Inspiration

Tamil-English translated Parallel Corpora for Legal Domain

Ponniyan selvan Tamil Book for NLP

Dataset

Contents

Claim Detection and Matching for Indian Languages

Tamil (India) General Conversation Speech Dataset

GlobalPhone Tamil

EnTam: An English-Tamil Parallel Corpus (EnTam v2.0) - Dataset - B2FIND

Tamil---Language-Corpus-for-NLP

Tamil-English translated Parallel Corpora for Management Domain

TAUS Language Translation Data | Parallel translation for Colloquial English...

IndicCorp Dataset

HPL Tamil Dataset

XQA Tamil

Dataset

Contents

Tamil - Language Corpus for NLP

Context

Content

Acknowledgements

Inspiration

Tamil Wikipedia Articles

Travel domain Human-Human conversation chats in Tamil

Data from: Tamil Dependency Treebank v0.1

offenseval_dravidian

Transliteration Sentence Dataset

IndicGLUE Dataset

IndicCorp Dataset