67 datasets found

Tamil NLP
kaggle.com
Updated Mar 11, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
SRK (2019). Tamil NLP [Dataset]. https://www.kaggle.com/sudalairajkumar/tamil-nlp/code
Explore at:
Dataset updated
Mar 11, 2019
Dataset provided by
Kagglehttp://kaggle.com/
Authors
SRK
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Context

Indic NLP - Natural Language Processing for Indian Languages.

This dataset is a step towards the same for tamil language. Thanks for Malaikannan for the initiative and Selva for getting the data from websites. The idea is to add more datasets related to Tamil NLP at a single place.

Content

The dataset has the following files.

Tamil News Classficaition

This dataset has 14521 rows for training and 3631 rows for testing. It has 6 news categories - "tamilnadu", "india", "cinema", "sports", "politics", "world". The data is obtained from this link

tamil_news_train.csv - Train dataset for tamil news classification.

tamil_news_test.csv - Test dataset for tamil news classification

Tamil Movie Review Dataset

This dataset has 480 training samples and 121 testing samples. It has the review text in tamil and ratings between 1 to 5. The data is obtained from this link

tamil_movie_reviews_train.csv - Train dataset for tamil movie reviews

tamil_movie_reviews_test.csv - Test dataset for tamil movie reviews

Thirukkural Dataset

From Wikipedia, The Tirukkural, or shortly the Kural, is a classic Tamil text consisting of 1,330 couplets or Kurals, dealing with the everyday virtues of an individual. It is one of the two oldest works now extant in Tamil literature.

I have split the data into train and test and we can use the kural and / or the explanations to predict the three parts - aram (virtue), porul (polity) and inbam (love). The dataset is obtained from this link.

tamil_thirukkural_train - train dataset having 1064 rows

tamil_thirukkural_test - test dataset having 266 rows

Will add more datasets in the following versions.

Acknowledgements

My sincere thanks to :

Malaikannan for starting this initiative

Selvakumar for getting the data

Vijay Anand for the Thirukkural data

Inspiration

Some questions which can be answered are

Can we do text classification for Tamil languages and get good accuracies similar to other languages?

How does the Language models do for Tamil?

And lot more interesting questions to be answered.

Checkout this link to find similar and dissimilar words for Tamil.
F
Tamil-English translated Parallel Corpora for Legal Domain
futurebeeai.com
wav
Updated Aug 1, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). Tamil-English translated Parallel Corpora for Legal Domain [Dataset]. https://www.futurebeeai.com/dataset/parallel-corpora/tamil-english-translated-parallel-corpus-for-legal-domain
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/data-license-agreementhttps://www.futurebeeai.com/data-license-agreement
Dataset funded by
FutureBeeAI
Description
This bilingual parallel corpus consists of 50K+ sentence text data translated to Tamil from English with the help of more than 200 native translators in the Legal domain. These domain-specific parallel corpora have native language slang, phrases, and language-specific words, and follow the native way of talking, making the corpus more information-rich. Many of the same sentences are translated by various native translators, allowing us to compare how various groups interpret the same text.,
The sentences in this comparable corpus range in length from 7 to 15 words. The data is accessible in excel format and can be converted into TMX, XML, XLIFF, or other equivalent formats. ,
These parallel bilingual corpora can be utilised for the research and development of bilingual lexicography and machine translation engines. Additionally, it can be used to create numerous language databases for applications like predictive keyboards, spell checkers, grammar checkers, text/speech understanding systems, text-to-speech modules, and many others that are based on NLP.,
More translated sentences are constantly being added to this parallel corpus. Depending on your unique requirements, we can curate numerous parallel corpora in various languages. For synthetic custom curation, do not forget to check out the FutureBeeAI community. The license for this parallel corpus dataset belongs to FutureBeeAI!
F
Tamil (India) General Conversation Speech Dataset
futurebeeai.com
wav
Updated Aug 1, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). Tamil (India) General Conversation Speech Dataset [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-tamil-india
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/data-license-agreementhttps://www.futurebeeai.com/data-license-agreement
Dataset funded by
FutureBeeAI
Description
Welcome to the Tamil Language General Conversation Speech Dataset, a comprehensive and diverse collection of voice data specifically curated to advance the development of Tamil language speech recognition models, with a particular focus on Indian accents and dialects.
With high-quality audio recordings, detailed metadata, and accurate transcriptions, it empowers researchers and developers to enhance natural language processing, conversational AI, and Generative Voice AI algorithms. Moreover, it facilitates the creation of sophisticated voice assistants and voice bots tailored to the unique linguistic nuances found in the Tamil language spoken in India.
Speech Data:
This training dataset comprises 50 hours of audio recordings covering a wide range of topics and scenarios, ensuring robustness and accuracy in speech technology applications. To achieve this, we collaborated with a diverse network of 70 native Tamil speakers from different part of Tamil Nadu. This collaborative effort guarantees a balanced representation of Indian accents, dialects, and demographics, reducing biases and promoting inclusivity.
Each audio recording captures the essence of spontaneous, unscripted conversations between two individuals, with an average duration ranging from 15 to 60 minutes. The speech data is available in WAV format, with stereo channel files having a bit depth of 16 bits and a sample rate of 8 kHz. The recording environment is generally quiet, without background noise and echo.
Metadata:
In addition to the audio recordings, our dataset provides comprehensive metadata for each participant. This metadata includes the participant's age, gender, country, state, and dialect. Furthermore, additional metadata such as recording device detail, topic of recording, bit depth, and sample rate will be provided.
The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Tamil language speech recognition models.
Transcription:
This dataset provides a manual verbatim transcription of each audio file to enhance your workflow efficiency. The transcriptions are available in JSON format. The transcriptions capture speaker-wise transcription with time-coded segmentation along with non-speech labels and tags.
Our goal is to expedite the deployment of Tamil language conversational AI and NLP models by offering ready-to-use transcriptions, ultimately saving valuable time and resources in the development process.
Updates and Customization:
We understand the importance of collecting data in various environments to build robust ASR models. Therefore, our voice dataset is regularly updated with new audio data captured in diverse real-world conditions.
If you require a custom training dataset with specific environmental conditions such as in-car, busy street, restaurant, or any other scenario, we can accommodate your request. We can provide voice data with customized sample rates ranging from 8kHz to 48kHz, allowing you to fine-tune your models for different audio recording setups. Additionally, we can also customize the transcription following your specific guidelines and requirements, to further support your ASR development process.
License:
This audio dataset, created by FutureBeeAI, is now available for commercial use.
Conclusion:
Whether you are training or fine-tuning speech recognition models, advancing NLP algorithms, exploring generative voice AI, or building cutting-edge voice assistants and bots, our dataset serves as a reliable and valuable resource.
Ponniyan selvan Tamil Book for NLP
kaggle.com
zip
Updated Sep 9, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dinesh Kumar Sarangapani (2020). Ponniyan selvan Tamil Book for NLP [Dataset]. https://www.kaggle.com/dineshkumarsarang/ponniyan-selvan-tamil-book-for-nlp
Explore at:
zip(1985053 bytes)Available download formats
Dataset updated
Sep 9, 2020
Authors
Dinesh Kumar Sarangapani
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Dataset

This dataset was created by Dinesh Kumar Sarangapani

Released under CC0: Public Domain

Contents
k
Tamil---Language-Corpus-for-NLP
kaggle.com
Updated Mar 23, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2020). Tamil---Language-Corpus-for-NLP [Dataset]. https://www.kaggle.com/praveengovi/tamil-language-corpus-for-nlp/code
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Mar 23, 2020
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
தமிழ் மொழி கார்பஸ் தரவுத்தொகுப்பு - இயற்கை மொழி செயலாக்கம்
Claim Detection and Matching for Indian Languages
zenodo.org
explore.openaire.eu
csv
Updated Jun 6, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ashkan Kazemi; Kiran Garimella; Devin Gaffney; Scott A. Hale; Ashkan Kazemi; Kiran Garimella; Devin Gaffney; Scott A. Hale (2021). Claim Detection and Matching for Indian Languages [Dataset]. http://doi.org/10.5281/zenodo.4890950
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.4890950
Dataset updated
Jun 6, 2021
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Ashkan Kazemi; Kiran Garimella; Devin Gaffney; Scott A. Hale; Ashkan Kazemi; Kiran Garimella; Devin Gaffney; Scott A. Hale
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
India
Description
Two datasets are included in this repository: claim matching and claim detection datasets. The collections contain data in 5 languages: Bengali, English, Hindi, Malayalam and Tamil.

The "claim detection" dataset contains textual claims from social media and fact-checking websites annotated for the "fact-check worthiness" of the claims in each message. Data points have one of the three labels of "Yes" (text contains one or more check-worthy claims), "No" and "Probably".

The "claim matching" dataset is a curated collection of pairs of textual claims from social media and fact-checking websites for the purpose of automatic and multilingual claim matching. Pairs of data have one of the four labels of "Very Similar", "Somewhat Similar", "Somewhat Dissimilar" and "Very Dissimilar".

All personally identifiable information (PII) including phone numbers, email addresses, license plate numbers and addresses have been replaced with general tags (e.g.
, etc) to protect user anonymity. A detailed explanation on the curation and annotation process is provided in our ACL 2021 paper:
Kazemi, A.; Garimella, K.; Gaffney, D.; and Hale, S. A. 2021. Claim Matching Beyond English to Scale Global Fact-Checking. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, ACL 2021.
F
Tamil-English translated Parallel Corpora for Management Domain
futurebeeai.com
wav
Updated Aug 1, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). Tamil-English translated Parallel Corpora for Management Domain [Dataset]. https://www.futurebeeai.com/dataset/parallel-corpora/tamil-english-translated-parallel-corpus-for-management-domain
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/data-license-agreementhttps://www.futurebeeai.com/data-license-agreement
Dataset funded by
FutureBeeAI
Description
This bilingual parallel corpus consists of 50K+ sentence text data translated to Tamil from English with the help of more than 200 native translators in the Management domain. These domain-specific parallel corpora have native language slang, phrases, and language-specific words, and follow the native way of talking, making the corpus more information-rich. Many of the same sentences are translated by various native translators, allowing us to compare how various groups interpret the same text.,
The sentences in this comparable corpus range in length from 7 to 15 words. The data is accessible in excel format and can be converted into TMX, XML, XLIFF, or other equivalent formats. ,
These parallel bilingual corpora can be utilised for the research and development of bilingual lexicography and machine translation engines. Additionally, it can be used to create numerous language databases for applications like predictive keyboards, spell checkers, grammar checkers, text/speech understanding systems, text-to-speech modules, and many others that are based on NLP.,
More translated sentences are constantly being added to this parallel corpus. Depending on your unique requirements, we can curate numerous parallel corpora in various languages. For synthetic custom curation, do not forget to check out the FutureBeeAI community. The license for this parallel corpus dataset belongs to FutureBeeAI!
E
GlobalPhone Bulgarian
live.european-language-grid.eu
catalogue.elra.info
audio format
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
GlobalPhone Bulgarian [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/2097
Explore at:
audio formatAvailable download formats
License
http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdfhttp://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdfhttp://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
Description
The GlobalPhone corpus developed in collaboration with the Karlsruhe Institute of Technology (KIT) was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks.

The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).

In each language about 100 sentences were read from each of the 100 speakers. The read texts were selected from national newspapers available via Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16bit, 16kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering, false starts, and non-verbal effects like laughing and hesitations. Speaker information like age, gender, occupation, etc. as well as information about the recording setup complement the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers.

Data is shortened by means of the shorten program written by Tony Robinson. Alternatively, the data could be delivered unshorten.

The Bulgarian part of GlobalPhone was collected in 2005 in the cities of Sofia and Pazardzhik, Bulgaria. All speakers are Bulgarian native speakers from the west and central part of Bulgaria. Data was collected from 77 speakers in total, of which 45 were female and 32 were male. The majority of speakers are well educated, being graduated students, construction engineers, and teachers. The age distribution of the speakers ranges from 18 to 65 years. Of all speakers, 62 reported to be non-smokers, 15 are smokers, no further information about health status is provided. Each speaker read on average about 112 utterances from newspaper articles, corresponding to roughly 16.6 minutes of speech or 1940 words per person, in total we recorded 8674 utterances. The speech was recorded using a close-talking microphone Sennheiser HM420 in a push-to-talk scenario using an inhouse developed modern laptop-based data collection toolkit. All data were recorded at 16kHz and 16bit resolution in PCM format. The data collection took place in small-sized rooms with low background noise, while one speaker was recorded in a public place. Information on recording place and environmental noise conditions are provided in a separate speaker session file for each speaker. The text data used for recording mainly came from the news posted in online editions of three national Bulgarian newspaper websites as listed below. About 350 articles with more than 10,000 sentences were downloaded and processed (manually edited to normalize and clean the text, resolve abbreviations and numbers). We followed the standard GlobalPhone protocols and focused on national and international politics and economics news (see [SCHULTZ 2002]). In sum, 8674 utterances were spoken, corresponding to 21.4 hours of speech or 150,000 spoken words in total, covering a vocabulary of 23,000 words. The transcriptions are provided in Bulgarian script (Cyrillic) in UTF-8 encoding. The Bulgarian data are organized in a training set of 63 speakers, a development set of 7 speakers (spk IDs 051, 055, 058, 084, 090, 100, 106), and an evaluation set of 7 speakers (spk IDs 040, 059, 063, 068, 095, 109, 110).

Bulgarian Newspaper sources: Banker: http://www.banker.bg Kesh: http://www.cash.bg Sega: http://www.segabg.com

[Mircheva 2006] Aneliya Mircheva (2006): Bulgarian Speech Recognition and Multilingual Language Modeling, Project Term (Studienarbeit), Institute for Theoretical Informatics, University Karlsruhe. [Schultz 2002] Tanja Schultz (2002): GlobalPhone: A Multilingual Speech and Text Database developed at Karlsruhe University, Proceedings of the International Conference of Spoken Language Processing, ICSLP 2002, Denver, CO, September 2002.
d
EnTam: An English-Tamil Parallel Corpus (EnTam v2.0) - Dataset - B2FIND
b2find.dkrz.de
Updated Apr 28, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2023). EnTam: An English-Tamil Parallel Corpus (EnTam v2.0) - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/9eb44325-3708-574f-a0da-4e8ccff2aa66
Explore at:
Dataset updated
Apr 28, 2023
License
Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0)https://creativecommons.org/licenses/by-nc-sa/3.0/
License information was derived automatically
Description
EnTam is a sentence aligned English-Tamil bilingual corpus from some of the publicly available websites that we have collected for NLP research involving Tamil. The standard set of processing has been applied on the the raw web data before the data became available in sentence aligned English-Tamil parallel corpus suitable for various NLP tasks. The parallel corpus includes texts from bible, cinema and news domains.
XQA Tamil
kaggle.com
zip
Updated Sep 30, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Manav Dhamani (2021). XQA Tamil [Dataset]. https://www.kaggle.com/mdhamani/xqa-tamil
Explore at:
zip(11676466 bytes)Available download formats
Dataset updated
Sep 30, 2021
Authors
Manav Dhamani
Description
Dataset

This dataset was created by Manav Dhamani

Contents
d
TAUS Language Translation Data | Parallel translation for Colloquial English...
datarade.ai
Updated Dec 15, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
TAUS (2020). TAUS Language Translation Data | Parallel translation for Colloquial English into various languages for Machine Learning [Dataset]. https://datarade.ai/data-products/taus-parallel-text-colloquial-domain-english-low-resource-see-description-taus
Explore at:
.xml, .csv, .xls, .txtAvailable download formats
Dataset updated
Dec 15, 2020
Dataset authored and provided by
TAUS
Area covered
Bangladesh, Lao People's Democratic Republic, Myanmar, Nepal, Iraq, Iran (Islamic Republic of), Turkey, Vietnam, Timor-Leste, Indonesia
Description
The corpus is a great fit for training chat bots or social media content, and will give the conversation with your local audience a friendly, casual tone. From product user reviews and blog post comments to everyday business small talk, your MT engine will be able to handle even the most creative user voices.

This corpus contains over 1 million words, and a total vocabulary of more than 37000 different words. Need more data? In the following months, TAUS will release more equally sized corpora for the same domain and language combinations, with a significant increase of vocabulary.

English - Hindi English - Urdu English - Tamil English - Nepali English - Turkish English - Pashto English - Sorani English - Bengali English - Burmese English - Assamese English - Telugu English - Sinhalese English - Dari English - Punjabi (Pakistan) English - Punjabi (India) English - Lao English - Kurmanji (lat) English - Kurmanji (arab)

Other languages are available on demand.
F
Travel domain Human-Human conversation chats in Tamil
futurebeeai.com
wav
Updated Aug 1, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). Travel domain Human-Human conversation chats in Tamil [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/tamil-travel-domain-conversation-text-dataset
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/data-license-agreementhttps://www.futurebeeai.com/data-license-agreement
Dataset funded by
FutureBeeAI
Description
This training dataset comprises more than 10,000 conversational text data between two native Tamil people in the travel domain. We have a collection of chats on a variety of different topics/services/issues of daily life, such as music, books, festivals, health, kids, family, environment, study, childhood, cuisine, internet, movies, etc., and that makes the dataset diverse.,
These chats consist of language-specific words, and phrases and follow the native way of talking which makes the chats more information-rich for your NLP model. Apart from each chat being specific to the topic, it contains various attributes like people's names, addresses, contact information, email address, time, date, local currency, telephone numbers, local slang, etc too in various formats to make the text data unbiased.,
These chat scripts have between 300 and 700 words and up to 50 turns. 150 people that are a part of the FutureBeeAI crowd community contributed to this dataset. You will also receive chat metadata, such as participant age, gender, and country information, along with the chats. Dataset applications include conversational AI, natural language processing (NLP), smart assistants, text recognition, text analytics, and text prediction.,
This dataset is being expanded with new chats all the time. We are able to produce text data in a variety of languages to meet your unique requirements. Check out the FutureBeeAI community for a custom collection.,
This training dataset's licence belongs to FutureBeeAI!

IndicCorp Dataset

paperswithcode.com
opendatalab.com

Updated Mar 10, 2024

Facebook

Twitter

Click to copy link

Link copied

Cite

Divyanshu Kakwani; Anoop Kunchukuttan; Satish Golla; Gokul N.C.; Avik Bhattacharyya; Mitesh M. Khapra; Pratyush Kumar. (2024). IndicCorp Dataset [Dataset]. https://paperswithcode.com/dataset/indiccorp

Explore at:

Dataset updated

Mar 10, 2024

Authors

Divyanshu Kakwani; Anoop Kunchukuttan; Satish Golla; Gokul N.C.; Avik Bhattacharyya; Mitesh M. Khapra; Pratyush Kumar.

Description

IndicCorp is a large monolingual corpora with around 9 billion tokens covering 12 of the major Indian languages. It has been developed by discovering and scraping thousands of web sources - primarily news, magazines and books, over a duration of several months.

Languages covered: Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu

Corpus Format: The corpus is a single large text file containing one sentence per line. The publicly released version is randomly shuffled, untokenized and deduplicated.

Downloads

Language	# News Articles*	Sentences	Tokens	Link
as	0.60M	1.39M	32.6M	link
bn	3.83M	39.9M	836M	link
en	3.49M	54.3M	1.22B	link
gu	2.63M	41.1M	719M	link
hi	4.95M	63.1M	1.86B	link
kn	3.76M	53.3M	713M	link
ml	4.75M	50.2M	721M	link
mr	2.31M	34.0M	551M	link
or	0.69M	6.94M	107M	link
pa	2.64M	29.2M	773M	link
ta	4.41M	31.5M	582M	link
te	3.98M	47.9M	674M	link

Excluding articles obtained from the OSCAR corpus

HPL Tamil Dataset
kaggle.com
zip
Updated Apr 9, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rohit Khadka (2024). HPL Tamil Dataset [Dataset]. https://www.kaggle.com/datasets/rohitkhadka375741/hpl-tamil
Explore at:
zip(0 bytes)Available download formats
Dataset updated
Apr 9, 2024
Authors
Rohit Khadka
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
"HPL Tamil" dataset serves as a valuable resource for anyone interested in studying and analyzing the Tamil language, facilitating advancements in computational linguistics and NLP research.
F
General Domain Scripted Monologue Speech Data: Tamil (India)
futurebeeai.com
wav
Updated Aug 1, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). General Domain Scripted Monologue Speech Data: Tamil (India) [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/general-scripted-speech-monologues-tamil-india
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/data-license-agreementhttps://www.futurebeeai.com/data-license-agreement
Dataset funded by
FutureBeeAI
Description
Welcome to the Tamil Language Scripted Monologue Speech Dataset, a comprehensive and diverse collection of single utterance voice data specifically designed to advance the development of Tamil language speech recognition models, with a particular focus on Indian accents.,
With high-quality audio recordings, detailed metadata, and accurate transcriptions, it empowers researchers and developers to enhance natural language processing, conversational AI, and generative voice AI algorithms. Moreover, it facilitates the creation of sophisticated voice assistants and voice bots tailored to the unique linguistic nuances found in the Tamil language spoken in India.,
Speech Data:,
This training dataset consists of 5000+ high-quality scripted single-sentence recordings in the Tamil Language. These sentences contain various elements like person names, organization names, currencies, dates, times, locations, and more, which makes them very useful for developing robust natural language processing algorithms.,
This dataset contains the speech voices of 40 native Tamil speakers from different parts of Tamilnadu. This collaborative effort guarantees a balanced representation of Indian accents and demographics, reducing biases and promoting inclusivity.,
The average duration of each audio recording is around 5-30 seconds. The speech data is available in WAV format, with monochannel files having a bit depth of 16 bits and a sample rate of 48 kHz. The recording environment is generally quiet, without background noise and echo.,
Metadata:,
In addition to the audio recordings, our dataset provides comprehensive metadata for each participant. This metadata includes the participant's age, gender, country, state, and dialect. Furthermore, additional metadata such as recording device details, bit depth, and sample rate will be provided.,
The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Tamil speech recognition models.,
Transcription (Text File):,
This dataset provides text files containing scripted prompts along with each audio file. The transcription is available in TXT file format with proper renaming corresponding to its audio file.,
This text data can further be annotated with named entity recognition (NER) to expedite the deployment of Tamil conversational AI and NLP models.,
Updates and Customization:,
We understand the importance of collecting data in various environments to build robust ASR models. Therefore, our voice dataset is regularly updated with new audio data captured in diverse real-world conditions.,
If you require a custom training dataset with specific environmental conditions such as in-car, busy street, restaurant, or any other scenario or with different speaking speeds like fast, slow or normal, we can accommodate your request. We can provide voice data with customized sample rates ranging from 8 kHz to 48 kHz, allowing you to fine-tune your models for different audio recording setups.,
License:,
This audio dataset, created by FutureBeeAI, is now available for commercial use.,
Conclusion:,
Whether you are training or fine-tuning speech recognition models, advancing NLP algorithms, exploring speech AI, or building cutting-edge voice assistants and bots, our dataset serves as a reliable and valuable resource.
E
Data from: Tamil Dependency Treebank v0.1
live.european-language-grid.eu
lindat.mff.cuni.cz
binary format
Updated Oct 30, 2014
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2014). Tamil Dependency Treebank v0.1 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/1084
Explore at:
binary formatAvailable download formats
Dataset updated
Oct 30, 2014
License
Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0)https://creativecommons.org/licenses/by-nc-sa/3.0/
License information was derived automatically
Description
Tamil Dependency Treebank version 0.1 (TamilTB.v0.1) is an attempt to develop a syntactically annotated corpora for Tamil. TamilTB.v0.1 contains 600 sentences enriched with manual annotation of morphology and dependency syntax in the style of Prague Dependency Treebank. TamilTB.v0.1 has been created at the Institute of Formal and Applied Linguistics, Charles University in Prague.
h
offenseval_dravidian
huggingface.co
opendatalab.com
Updated Oct 19, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
The HF Datasets community (2023). offenseval_dravidian [Dataset]. https://huggingface.co/datasets/offenseval_dravidian
Explore at:
Dataset updated
Oct 19, 2023
Dataset authored and provided by
The HF Datasets community
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Offensive language identification in dravidian lanaguages dataset. The goal of this task is to identify offensive language content of the code-mixed dataset of comments/posts in Dravidian Languages ( (Tamil-English, Malayalam-English, and Kannada-English)) collected from social media.
P
IndicGLUE Dataset
paperswithcode.com
Updated Mar 9, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Divyanshu Kakwani; Anoop Kunchukuttan; Satish Golla; Gokul N.C.; Avik Bhattacharyya; Mitesh M. Khapra; Pratyush Kumar. (2022). IndicGLUE Dataset [Dataset]. https://paperswithcode.com/dataset/indicglue
Explore at:
Dataset updated
Mar 9, 2022
Authors
Divyanshu Kakwani; Anoop Kunchukuttan; Satish Golla; Gokul N.C.; Avik Bhattacharyya; Mitesh M. Khapra; Pratyush Kumar.
Description
We now introduce IndicGLUE, the Indic General Language Understanding Evaluation Benchmark, which is a collection of various NLP tasks as de- scribed below. The goal is to provide an evaluation benchmark for natural language understanding ca- pabilities of NLP models on diverse tasks and mul- tiple Indian languages.
m
Transliteration Sentence Dataset
data.mendeley.com
Updated Apr 16, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Md. Jabed Hosen (2024). Transliteration Sentence Dataset [Dataset]. http://doi.org/10.17632/38y7g2fcny.1
Explore at:
Unique identifier
https://doi.org/10.17632/38y7g2fcny.1
Dataset updated
Apr 16, 2024
Authors
Md. Jabed Hosen
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A transliteration sentence is like writing the same words but using different letters that sound the same. It helps people who speak different languages understand each other better. This dataset, drawn from 12 varied datasets initially intended for tasks such as sentiment analysis, hate speech detection, social media analysis, and review classification, endeavors to encompass a wide array of linguistic subtleties and fluctuations inherent in real-world language usage. Each data instance was meticulously labeled based on the language of the sentences. From this amalgamation of datasets, we curated a dataset comprising 65,473 instances, comprising 19,859 Bangla, 17,309 Hindi, 17,000 English, and 11,305 Tamil data instances, specifically tailored for transliteration sentence identification.
Hindi and Tamil Wiki text cleaned
kaggle.com
zip
Updated Sep 5, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
makaveli (2021). Hindi and Tamil Wiki text cleaned [Dataset]. https://www.kaggle.com/starkking07/hindi-and-tamil-wiki-text-cleaned
Explore at:
zip(253902101 bytes)Available download formats
Dataset updated
Sep 5, 2021
Authors
makaveli
Description
Dataset

This dataset was created by makaveli

Contents

Facebook

Twitter

Click to copy link

Link copied

Cite

SRK (2019). Tamil NLP [Dataset]. https://www.kaggle.com/sudalairajkumar/tamil-nlp/code

Tamil NLP

Datasets for Natural Language Processing in Tamil

Explore at:

24 scholarly articles cite this dataset (View in Google Scholar)

Dataset updated

Mar 11, 2019

Dataset provided by

Kagglehttp://kaggle.com/

Authors

SRK

License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically

Description

Context

Indic NLP - Natural Language Processing for Indian Languages.

This dataset is a step towards the same for tamil language. Thanks for Malaikannan for the initiative and Selva for getting the data from websites. The idea is to add more datasets related to Tamil NLP at a single place.

Content

The dataset has the following files.

Tamil News Classficaition

This dataset has 14521 rows for training and 3631 rows for testing. It has 6 news categories - "tamilnadu", "india", "cinema", "sports", "politics", "world". The data is obtained from this link

tamil_news_train.csv - Train dataset for tamil news classification.
tamil_news_test.csv - Test dataset for tamil news classification

Tamil Movie Review Dataset

This dataset has 480 training samples and 121 testing samples. It has the review text in tamil and ratings between 1 to 5. The data is obtained from this link

tamil_movie_reviews_train.csv - Train dataset for tamil movie reviews
tamil_movie_reviews_test.csv - Test dataset for tamil movie reviews

Thirukkural Dataset

From Wikipedia, The Tirukkural, or shortly the Kural, is a classic Tamil text consisting of 1,330 couplets or Kurals, dealing with the everyday virtues of an individual. It is one of the two oldest works now extant in Tamil literature.

I have split the data into train and test and we can use the kural and / or the explanations to predict the three parts - aram (virtue), porul (polity) and inbam (love). The dataset is obtained from this link.

tamil_thirukkural_train - train dataset having 1064 rows
tamil_thirukkural_test - test dataset having 266 rows

Will add more datasets in the following versions.

Acknowledgements

My sincere thanks to :

Malaikannan for starting this initiative
Selvakumar for getting the data
Vijay Anand for the Thirukkural data

Inspiration

Some questions which can be answered are

Can we do text classification for Tamil languages and get good accuracies similar to other languages?
How does the Language models do for Tamil?

And lot more interesting questions to be answered.

Checkout this link to find similar and dissimilar words for Tamil.

Clear search

Close search

Google apps

Main menu

Tamil NLP

Context

Content

Acknowledgements

Inspiration

Tamil-English translated Parallel Corpora for Legal Domain

Tamil (India) General Conversation Speech Dataset

Ponniyan selvan Tamil Book for NLP

Dataset

Contents

Tamil---Language-Corpus-for-NLP

Claim Detection and Matching for Indian Languages

Tamil-English translated Parallel Corpora for Management Domain

GlobalPhone Bulgarian

EnTam: An English-Tamil Parallel Corpus (EnTam v2.0) - Dataset - B2FIND

XQA Tamil

Dataset

Contents

TAUS Language Translation Data | Parallel translation for Colloquial English...

Travel domain Human-Human conversation chats in Tamil

IndicCorp Dataset

HPL Tamil Dataset

General Domain Scripted Monologue Speech Data: Tamil (India)

Data from: Tamil Dependency Treebank v0.1

offenseval_dravidian

IndicGLUE Dataset

Transliteration Sentence Dataset

Hindi and Tamil Wiki text cleaned

Dataset

Contents

Tamil NLP

Datasets for Natural Language Processing in Tamil

Context

Content

Acknowledgements

Inspiration