67 datasets found
  1. Tamil NLP

    • kaggle.com
    Updated Mar 11, 2019
    Cite
    SRK (2019). Tamil NLP [Dataset]. https://www.kaggle.com/sudalairajkumar/tamil-nlp/code
    Explore at:
    Dataset updated
    Mar 11, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    SRK
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Context

    Indic NLP - Natural Language Processing for Indian Languages.

    This dataset is a step toward that goal for the Tamil language. Thanks to Malaikannan for the initiative and to Selva for collecting the data from websites. The idea is to gather more datasets related to Tamil NLP in a single place.

    Content

    The dataset has the following files.

    Tamil News Classification

    This dataset has 14521 rows for training and 3631 rows for testing. It covers 6 news categories: "tamilnadu", "india", "cinema", "sports", "politics", and "world". The data is obtained from this link.

    • tamil_news_train.csv - Train dataset for Tamil news classification
    • tamil_news_test.csv - Test dataset for Tamil news classification
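    A first sanity check on a classification file like this is the label distribution. The following is a minimal sketch using only the standard library; the column names `category` and `text` and the sample rows are assumptions made for illustration, since the listing does not state the actual CSV schema.

```python
import csv
import io
from collections import Counter


def label_counts(csv_text: str, label_column: str) -> Counter:
    """Count how often each label appears in CSV text that has a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[label_column] for row in reader)


# Tiny made-up sample; the real files and column names may differ.
sample = (
    "category,text\n"
    "sports,headline one\n"
    "cinema,headline two\n"
    "sports,headline three\n"
)

counts = label_counts(sample, "category")
for label, count in counts.most_common():
    print(label, count)
```

    For the real files, the same function could be applied to the contents of tamil_news_train.csv once the actual label column name is known.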

    Tamil Movie Review Dataset

    This dataset has 480 training samples and 121 testing samples. It contains the review text in Tamil and ratings from 1 to 5. The data is obtained from this link.

    • tamil_movie_reviews_train.csv - Train dataset for Tamil movie reviews
    • tamil_movie_reviews_test.csv - Test dataset for Tamil movie reviews

    Thirukkural Dataset

    From Wikipedia: the Tirukkural, or the Kural for short, is a classic Tamil text consisting of 1,330 couplets, or kurals, dealing with the everyday virtues of an individual. It is one of the two oldest works now extant in Tamil literature.

    I have split the data into train and test sets. The kural and/or the explanations can be used to predict the three parts: aram (virtue), porul (polity), and inbam (love). The dataset is obtained from this link.

    • tamil_thirukkural_train - train dataset having 1064 rows
    • tamil_thirukkural_test - test dataset having 266 rows
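    The row counts quoted above for the three datasets imply roughly an 80/20 train/test split in each case. The short sketch below only redoes that arithmetic from the stated numbers; the dictionary keys are labels chosen here, not file names from the dataset.

```python
# Row counts as stated in the dataset description: (train rows, test rows).
splits = {
    "tamil_news": (14521, 3631),
    "tamil_movie_reviews": (480, 121),
    "tamil_thirukkural": (1064, 266),
}


def train_fraction(train_rows: int, test_rows: int) -> float:
    """Return the share of all rows that fall in the training split."""
    return train_rows / (train_rows + test_rows)


for name, (train_rows, test_rows) in splits.items():
    total = train_rows + test_rows
    fraction = train_fraction(train_rows, test_rows)
    print(f"{name}: {total} rows in total, train fraction {fraction:.3f}")
```

    The Thirukkural counts add up to 1330, which matches the number of couplets given in the description.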

    More datasets will be added in later versions.

    Acknowledgements

    My sincere thanks to :

    • Malaikannan for starting this initiative
    • Selvakumar for getting the data
    • Vijay Anand for the Thirukkural data

    Inspiration

    Some questions which can be answered are

    1. Can we do text classification for Tamil and reach accuracies similar to those for other languages?
    2. How do language models perform for Tamil?

    And many more interesting questions remain to be answered.

    Check out this link to find similar and dissimilar words for Tamil.

  2. Tamil-English translated Parallel Corpora for Legal Domain

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Tamil-English translated Parallel Corpora for Legal Domain [Dataset]. https://www.futurebeeai.com/dataset/parallel-corpora/tamil-english-translated-parallel-corpus-for-legal-domain
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    This bilingual parallel corpus consists of more than 50,000 sentences translated from English to Tamil in the Legal domain with the help of more than 200 native translators. The corpus includes native slang, phrases, and language-specific words, and follows the native way of talking, which makes it more information-rich. Many of the sentences are translated by several native translators, which makes it possible to compare how different groups interpret the same text.

    The sentences in this corpus range in length from 7 to 15 words. The data is available in Excel format and can be converted into TMX, XML, XLIFF, or other equivalent formats.
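    As an illustration of the conversion mentioned above, here is a minimal sketch that writes sentence pairs as a TMX 1.4 document with the Python standard library. It assumes the pairs have already been read from the spreadsheet into a list of (English, Tamil) tuples; the function name, the header values such as the tool name, and the sample sentence are hypothetical and not taken from the provider.

```python
import xml.etree.ElementTree as ET

# ElementTree serializes this qualified name as the attribute "xml:lang".
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def pairs_to_tmx(pairs, src_lang="en", tgt_lang="ta"):
    """Build a TMX 1.4 document (as a string) from (source, target) pairs."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "example-script",  # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "xlsx",                   # assumed original file format
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:
        # One translation unit per sentence pair, one variant per language.
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, source), (tgt_lang, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


tmx_text = pairs_to_tmx([("Hello", "வணக்கம்")])
print(tmx_text)
```

    Reading the Excel file itself needs a third-party library, which is why this sketch starts from an in-memory list.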

    These parallel corpora can be used for research and development in bilingual lexicography and machine translation engines. They can also be used to build language resources for NLP-based applications such as predictive keyboards, spell checkers, grammar checkers, text/speech understanding systems, text-to-speech modules, and many others.

    More translated sentences are constantly being added to this parallel corpus. Depending on your requirements, we can curate parallel corpora in various languages. For synthetic custom curation, check out the FutureBeeAI community. The license for this parallel corpus dataset belongs to FutureBeeAI.

  3. Tamil (India) General Conversation Speech Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Tamil (India) General Conversation Speech Dataset [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-tamil-india
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Welcome to the Tamil Language General Conversation Speech Dataset, a comprehensive and diverse collection of voice data specifically curated to advance the development of Tamil language speech recognition models, with a particular focus on Indian accents and dialects.

    With high-quality audio recordings, detailed metadata, and accurate transcriptions, it empowers researchers and developers to enhance natural language processing, conversational AI, and Generative Voice AI algorithms. Moreover, it facilitates the creation of sophisticated voice assistants and voice bots tailored to the unique linguistic nuances found in the Tamil language spoken in India.

    Speech Data:

    This training dataset comprises 50 hours of audio recordings covering a wide range of topics and scenarios, ensuring robustness and accuracy in speech technology applications. To achieve this, we collaborated with a diverse network of 70 native Tamil speakers from different parts of Tamil Nadu. This collaborative effort guarantees a balanced representation of Indian accents, dialects, and demographics, reducing biases and promoting inclusivity.

    Each audio recording captures the essence of spontaneous, unscripted conversations between two individuals, with an average duration ranging from 15 to 60 minutes. The speech data is available in WAV format, with stereo channel files having a bit depth of 16 bits and a sample rate of 8 kHz. The recording environment is generally quiet, without background noise and echo.
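    The stated audio format (WAV, stereo, 16-bit, 8 kHz) fixes the data rate, so the size of 50 hours of audio can be estimated directly. The sketch below assumes uncompressed PCM and ignores file headers; the helper names are chosen here for illustration. It also shows how the format of a delivered file could be checked with the standard `wave` module.

```python
import io
import wave


def pcm_bytes_per_second(sample_rate_hz: int, bit_depth: int, channels: int) -> int:
    """Data rate of uncompressed PCM audio in bytes per second."""
    return sample_rate_hz * (bit_depth // 8) * channels


def describe_wav(fileobj) -> dict:
    """Read channel count, bit depth, sample rate and duration from a WAV file."""
    with wave.open(fileobj, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "bit_depth": wav.getsampwidth() * 8,
            "sample_rate_hz": wav.getframerate(),
            "seconds": wav.getnframes() / wav.getframerate(),
        }


# Format stated in the description: 8 kHz, 16-bit, stereo, 50 hours in total.
rate = pcm_bytes_per_second(8000, 16, 2)
total_gb = rate * 50 * 3600 / 1e9
print(f"{rate} bytes per second, about {total_gb:.2f} GB for 50 hours")
```

    `describe_wav` accepts a path or an open binary file, so it could be run over each delivered recording to confirm that it matches the advertised format.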

    Metadata:

    In addition to the audio recordings, our dataset provides comprehensive metadata for each participant. This metadata includes the participant's age, gender, country, state, and dialect. Additional metadata such as recording device details, topic of recording, bit depth, and sample rate is also provided.

    The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Tamil language speech recognition models.

    Transcription:

    This dataset provides a manual verbatim transcription of each audio file to enhance your workflow efficiency. The transcriptions are available in JSON format. The transcriptions capture speaker-wise transcription with time-coded segmentation along with non-speech labels and tags.

    Our goal is to expedite the deployment of Tamil language conversational AI and NLP models by offering ready-to-use transcriptions, ultimately saving valuable time and resources in the development process.

    Updates and Customization:

    We understand the importance of collecting data in various environments to build robust ASR models. Therefore, our voice dataset is regularly updated with new audio data captured in diverse real-world conditions.

    If you require a custom training dataset with specific environmental conditions such as in-car, busy street, restaurant, or any other scenario, we can accommodate your request. We can provide voice data with customized sample rates ranging from 8kHz to 48kHz, allowing you to fine-tune your models for different audio recording setups. Additionally, we can also customize the transcription following your specific guidelines and requirements, to further support your ASR development process.

    License:

    This audio dataset, created by FutureBeeAI, is now available for commercial use.

    Conclusion:

    Whether you are training or fine-tuning speech recognition models, advancing NLP algorithms, exploring generative voice AI, or building cutting-edge voice assistants and bots, our dataset serves as a reliable and valuable resource.

  4. Ponniyan selvan Tamil Book for NLP

    • kaggle.com
    zip
    Updated Sep 9, 2020
    Cite
    Dinesh Kumar Sarangapani (2020). Ponniyan selvan Tamil Book for NLP [Dataset]. https://www.kaggle.com/dineshkumarsarang/ponniyan-selvan-tamil-book-for-nlp
    Explore at:
    Available download formats: zip (1985053 bytes)
    Dataset updated
    Sep 9, 2020
    Authors
    Dinesh Kumar Sarangapani
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Dinesh Kumar Sarangapani

    Released under CC0: Public Domain

    Contents

  5. Tamil---Language-Corpus-for-NLP

    • kaggle.com
    Updated Mar 23, 2020
    Cite
    (2020). Tamil---Language-Corpus-for-NLP [Dataset]. https://www.kaggle.com/praveengovi/tamil-language-corpus-for-nlp/code
    Explore at:
    Dataset updated
    Mar 23, 2020
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Tamil language corpus dataset - natural language processing

  6. Claim Detection and Matching for Indian Languages

    • zenodo.org
    • explore.openaire.eu
    csv
    Updated Jun 6, 2021
    Cite
    Ashkan Kazemi; Kiran Garimella; Devin Gaffney; Scott A. Hale (2021). Claim Detection and Matching for Indian Languages [Dataset]. http://doi.org/10.5281/zenodo.4890950
    Explore at:
    Available download formats: csv
    Dataset updated
    Jun 6, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ashkan Kazemi; Kiran Garimella; Devin Gaffney; Scott A. Hale
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    India
    Description

    Two datasets are included in this repository: claim matching and claim detection datasets. The collections contain data in 5 languages: Bengali, English, Hindi, Malayalam and Tamil.

    The "claim detection" dataset contains textual claims from social media and fact-checking websites annotated for the "fact-check worthiness" of the claims in each message. Data points have one of the three labels of "Yes" (text contains one or more check-worthy claims), "No" and "Probably".

    The "claim matching" dataset is a curated collection of pairs of textual claims from social media and fact-checking websites for the purpose of automatic and multilingual claim matching. Pairs of data have one of the four labels of "Very Similar", "Somewhat Similar", "Somewhat Dissimilar" and "Very Dissimilar".

    All personally identifiable information (PII), including phone numbers, email addresses, license plate numbers, and addresses, has been replaced with general tags to protect user anonymity. A detailed explanation of the curation and annotation process is provided in our ACL 2021 paper:
    Kazemi, A.; Garimella, K.; Gaffney, D.; and Hale, S. A. 2021. Claim Matching Beyond English to Scale Global Fact-Checking. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, ACL 2021.

  7. Tamil-English translated Parallel Corpora for Management Domain

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Tamil-English translated Parallel Corpora for Management Domain [Dataset]. https://www.futurebeeai.com/dataset/parallel-corpora/tamil-english-translated-parallel-corpus-for-management-domain
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    This bilingual parallel corpus consists of more than 50,000 sentences translated from English to Tamil in the Management domain with the help of more than 200 native translators. The corpus includes native slang, phrases, and language-specific words, and follows the native way of talking, which makes it more information-rich. Many of the sentences are translated by several native translators, which makes it possible to compare how different groups interpret the same text.

    The sentences in this corpus range in length from 7 to 15 words. The data is available in Excel format and can be converted into TMX, XML, XLIFF, or other equivalent formats.

    These parallel corpora can be used for research and development in bilingual lexicography and machine translation engines. They can also be used to build language resources for NLP-based applications such as predictive keyboards, spell checkers, grammar checkers, text/speech understanding systems, text-to-speech modules, and many others.

    More translated sentences are constantly being added to this parallel corpus. Depending on your requirements, we can curate parallel corpora in various languages. For synthetic custom curation, check out the FutureBeeAI community. The license for this parallel corpus dataset belongs to FutureBeeAI.

  8. GlobalPhone Bulgarian

    • live.european-language-grid.eu
    • catalogue.elra.info
    audio format
    + more versions
    Cite
    GlobalPhone Bulgarian [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/2097
    Explore at:
    Available download formats: audio format
    License

    http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Description

    The GlobalPhone corpus developed in collaboration with the Karlsruhe Institute of Technology (KIT) was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks.

    The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).

    In each language, about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available on the Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16-bit, 16 kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering, false starts, and non-verbal effects like laughing and hesitations. Speaker information such as age, gender, and occupation, as well as information about the recording setup, complements the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers.

    The data is compressed ("shortened") with the shorten program written by Tony Robinson. Alternatively, the data can be delivered unshortened.

    The Bulgarian part of GlobalPhone was collected in 2005 in the cities of Sofia and Pazardzhik, Bulgaria. All speakers are native speakers of Bulgarian from the western and central parts of Bulgaria. Data was collected from 77 speakers in total, of whom 45 were female and 32 were male. The majority of speakers are well educated, being graduated students, construction engineers, and teachers. The speakers range in age from 18 to 65 years. Of all speakers, 62 reported being non-smokers and 15 are smokers; no further information about health status is provided. Each speaker read on average about 112 utterances from newspaper articles, corresponding to roughly 16.6 minutes of speech or 1940 words per person; in total we recorded 8674 utterances.

    The speech was recorded using a Sennheiser HM420 close-talking microphone in a push-to-talk scenario, using a modern laptop-based data collection toolkit developed in-house. All data were recorded at 16 kHz and 16-bit resolution in PCM format. The data collection took place in small rooms with low background noise, while one speaker was recorded in a public place. Information on the recording place and environmental noise conditions is provided in a separate speaker session file for each speaker.

    The text data used for recording mainly came from news posted in the online editions of three national Bulgarian newspapers, listed below. About 350 articles with more than 10,000 sentences were downloaded and processed (manually edited to normalize and clean the text and to resolve abbreviations and numbers). We followed the standard GlobalPhone protocols and focused on national and international politics and economics news (see [Schultz 2002]). In sum, 8674 utterances were spoken, corresponding to 21.4 hours of speech or 150,000 spoken words in total, covering a vocabulary of 23,000 words. The transcriptions are provided in Bulgarian script (Cyrillic) in UTF-8 encoding.

    The Bulgarian data are organized in a training set of 63 speakers, a development set of 7 speakers (spk IDs 051, 055, 058, 084, 090, 100, 106), and an evaluation set of 7 speakers (spk IDs 040, 059, 063, 068, 095, 109, 110).
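    The figures quoted for the Bulgarian subset can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers from the description; the variable names are chosen here for illustration.

```python
# Figures as stated in the description of the Bulgarian subset.
speakers_by_set = {"training": 63, "development": 7, "evaluation": 7}
speakers_by_gender = {"female": 45, "male": 32}
total_utterances = 8674
total_hours = 21.4
total_words = 150_000

total_speakers = sum(speakers_by_set.values())

# Per-speaker averages implied by the corpus-level totals.
per_speaker = {
    "utterances": total_utterances / total_speakers,
    "minutes": total_hours * 60 / total_speakers,
    "words": total_words / total_speakers,
}

print(f"speakers: {total_speakers}")
for name, value in per_speaker.items():
    print(f"{name} per speaker: {value:.1f}")
```

    The three speaker sets and the gender breakdown both add up to 77, and the implied per-speaker averages are close to the quoted values of about 112 utterances, 16.6 minutes, and 1940 words; small differences are expected because the corpus-level totals are rounded.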

    Bulgarian newspaper sources:

    • Banker: http://www.banker.bg
    • Kesh: http://www.cash.bg
    • Sega: http://www.segabg.com

    [Mircheva 2006] Aneliya Mircheva (2006): Bulgarian Speech Recognition and Multilingual Language Modeling, Project Term (Studienarbeit), Institute for Theoretical Informatics, University Karlsruhe.

    [Schultz 2002] Tanja Schultz (2002): GlobalPhone: A Multilingual Speech and Text Database developed at Karlsruhe University, Proceedings of the International Conference of Spoken Language Processing, ICSLP 2002, Denver, CO, September 2002.

  9. EnTam: An English-Tamil Parallel Corpus (EnTam v2.0) - Dataset - B2FIND

    • b2find.dkrz.de
    Updated Apr 28, 2023
    + more versions
    Cite
    (2023). EnTam: An English-Tamil Parallel Corpus (EnTam v2.0) - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/9eb44325-3708-574f-a0da-4e8ccff2aa66
    Explore at:
    Dataset updated
    Apr 28, 2023
    License

    Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0): https://creativecommons.org/licenses/by-nc-sa/3.0/
    License information was derived automatically

    Description

    EnTam is a sentence-aligned English-Tamil bilingual corpus collected from publicly available websites for NLP research involving Tamil. A standard set of processing steps was applied to the raw web data before it was released as a sentence-aligned English-Tamil parallel corpus suitable for various NLP tasks. The parallel corpus includes texts from the Bible, cinema, and news domains.

  10. XQA Tamil

    • kaggle.com
    zip
    Updated Sep 30, 2021
    Cite
    Manav Dhamani (2021). XQA Tamil [Dataset]. https://www.kaggle.com/mdhamani/xqa-tamil
    Explore at:
    Available download formats: zip (11676466 bytes)
    Dataset updated
    Sep 30, 2021
    Authors
    Manav Dhamani
    Description

    Dataset

    This dataset was created by Manav Dhamani

    Contents

  11. TAUS Language Translation Data | Parallel translation for Colloquial English...

    • datarade.ai
    Updated Dec 15, 2020
    Cite
    TAUS (2020). TAUS Language Translation Data | Parallel translation for Colloquial English into various languages for Machine Learning [Dataset]. https://datarade.ai/data-products/taus-parallel-text-colloquial-domain-english-low-resource-see-description-taus
    Explore at:
    Available download formats: .xml, .csv, .xls, .txt
    Dataset updated
    Dec 15, 2020
    Dataset authored and provided by
    TAUS
    Area covered
    Bangladesh, Lao People's Democratic Republic, Myanmar, Nepal, Iraq, Iran (Islamic Republic of), Turkey, Vietnam, Timor-Leste, Indonesia
    Description

    The corpus is a great fit for training chat bots or social media content, and will give the conversation with your local audience a friendly, casual tone. From product user reviews and blog post comments to everyday business small talk, your MT engine will be able to handle even the most creative user voices.

    This corpus contains over 1 million words, and a total vocabulary of more than 37000 different words. Need more data? In the following months, TAUS will release more equally sized corpora for the same domain and language combinations, with a significant increase of vocabulary.

    • English - Hindi
    • English - Urdu
    • English - Tamil
    • English - Nepali
    • English - Turkish
    • English - Pashto
    • English - Sorani
    • English - Bengali
    • English - Burmese
    • English - Assamese
    • English - Telugu
    • English - Sinhalese
    • English - Dari
    • English - Punjabi (Pakistan)
    • English - Punjabi (India)
    • English - Lao
    • English - Kurmanji (lat)
    • English - Kurmanji (arab)

    Other languages are available on demand.

  12. Travel domain Human-Human conversation chats in Tamil

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Travel domain Human-Human conversation chats in Tamil [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/tamil-travel-domain-conversation-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    This training dataset comprises more than 10,000 text conversations between two native Tamil speakers in the travel domain. The chats cover a variety of topics, services, and issues of daily life, such as music, books, festivals, health, kids, family, environment, study, childhood, cuisine, internet, and movies, which makes the dataset diverse.

    The chats contain language-specific words and phrases and follow the native way of talking, which makes them more information-rich for your NLP model. In addition to being topic-specific, each chat contains attributes such as people's names, addresses, contact information, email addresses, times, dates, local currency, telephone numbers, and local slang in various formats, to make the text data unbiased.

    Each chat script has between 300 and 700 words and up to 50 turns. 150 people from the FutureBeeAI crowd community contributed to this dataset. Chat metadata, such as participant age, gender, and country, is provided along with the chats. Dataset applications include conversational AI, natural language processing (NLP), smart assistants, text recognition, text analytics, and text prediction.

    This dataset is continually expanded with new chats. We can produce text data in a variety of languages to meet your requirements. Check out the FutureBeeAI community for a custom collection.

    The license for this training dataset belongs to FutureBeeAI.

  13. IndicCorp Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Mar 10, 2024
    Cite
    Divyanshu Kakwani; Anoop Kunchukuttan; Satish Golla; Gokul N.C.; Avik Bhattacharyya; Mitesh M. Khapra; Pratyush Kumar. (2024). IndicCorp Dataset [Dataset]. https://paperswithcode.com/dataset/indiccorp
    Explore at:
    Dataset updated
    Mar 10, 2024
    Authors
    Divyanshu Kakwani; Anoop Kunchukuttan; Satish Golla; Gokul N.C.; Avik Bhattacharyya; Mitesh M. Khapra; Pratyush Kumar.
    Description

    IndicCorp is a large collection of monolingual corpora with around 9 billion tokens covering 12 of the major Indian languages. It was built by discovering and scraping thousands of web sources, primarily news, magazines, and books, over a period of several months.

    Languages covered: Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu

    Corpus Format: The corpus is a single large text file containing one sentence per line. The publicly released version is randomly shuffled, untokenized and deduplicated.
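    Because the corpus is described as one sentence per line, it can be processed as a stream without loading the whole file into memory. The following is a minimal sketch that counts sentences and whitespace-separated tokens; the function name and the two sample sentences are made up for illustration, and whitespace splitting is only a rough proxy that may not match how the published token counts were computed.

```python
import io


def corpus_stats(lines):
    """Count non-empty lines (sentences) and whitespace-separated tokens."""
    sentences = 0
    tokens = 0
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines
        sentences += 1
        tokens += len(text.split())
    return sentences, tokens


# Two made-up Tamil sentences standing in for the real corpus file.
sample = io.StringIO("முதல் வாக்கியம் இங்கே\nஇரண்டாவது வாக்கியம்\n")
print(corpus_stats(sample))
```

    For a real file, the same function could be called on an open file handle, for example one obtained with `open(path, encoding="utf-8")`, since file objects yield one line at a time.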

    Downloads

    Language   # News Articles*   Sentences   Tokens   Link
    as         0.60M              1.39M       32.6M    link
    bn         3.83M              39.9M       836M     link
    en         3.49M              54.3M       1.22B    link
    gu         2.63M              41.1M       719M     link
    hi         4.95M              63.1M       1.86B    link
    kn         3.76M              53.3M       713M     link
    ml         4.75M              50.2M       721M     link
    mr         2.31M              34.0M       551M     link
    or         0.69M              6.94M       107M     link
    pa         2.64M              29.2M       773M     link
    ta         4.41M              31.5M       582M     link
    te         3.98M              47.9M       674M     link

    * Excluding articles obtained from the OSCAR corpus
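    The per-language token counts in the table can be summed to check the "around 9 billion tokens" figure in the description. The sketch below parses the abbreviated counts as printed; the helper name `parse_count` is chosen here for illustration.

```python
def parse_count(text: str) -> float:
    """Convert an abbreviated count such as '32.6M' or '1.22B' to a number."""
    multipliers = {"K": 1e3, "M": 1e6, "B": 1e9}
    suffix = text[-1].upper()
    if suffix in multipliers:
        return float(text[:-1]) * multipliers[suffix]
    return float(text)


# Token counts per language code, copied from the table.
tokens = {
    "as": "32.6M", "bn": "836M", "en": "1.22B", "gu": "719M",
    "hi": "1.86B", "kn": "713M", "ml": "721M", "mr": "551M",
    "or": "107M", "pa": "773M", "ta": "582M", "te": "674M",
}

total = sum(parse_count(value) for value in tokens.values())
tamil_share = parse_count(tokens["ta"]) / total

print(f"total tokens: {total / 1e9:.2f}B")
print(f"Tamil share of tokens: {tamil_share:.1%}")
```

    The total comes to roughly 8.8 billion tokens, which is consistent with the rounded figure in the description.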
  14. HPL Tamil Dataset

    • kaggle.com
    zip
    Updated Apr 9, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Rohit Khadka (2024). HPL Tamil Dataset [Dataset]. https://www.kaggle.com/datasets/rohitkhadka375741/hpl-tamil
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Apr 9, 2024
    Authors
    Rohit Khadka
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    "HPL Tamil" dataset serves as a valuable resource for anyone interested in studying and analyzing the Tamil language, facilitating advancements in computational linguistics and NLP research.

  15. General Domain Scripted Monologue Speech Data: Tamil (India)

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). General Domain Scripted Monologue Speech Data: Tamil (India) [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/general-scripted-speech-monologues-tamil-india
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Welcome to the Tamil Language Scripted Monologue Speech Dataset, a comprehensive and diverse collection of single-utterance voice data specifically designed to advance the development of Tamil language speech recognition models, with a particular focus on Indian accents.

    With high-quality audio recordings, detailed metadata, and accurate transcriptions, it empowers researchers and developers to enhance natural language processing, conversational AI, and generative voice AI algorithms. Moreover, it facilitates the creation of sophisticated voice assistants and voice bots tailored to the unique linguistic nuances found in the Tamil language spoken in India.

    Speech Data:

    This training dataset consists of 5000+ high-quality scripted single-sentence recordings in the Tamil language. These sentences contain various elements like person names, organization names, currencies, dates, times, locations, and more, which makes them very useful for developing robust natural language processing algorithms.

    This dataset contains the voices of 40 native Tamil speakers from different parts of Tamil Nadu. This collaborative effort guarantees a balanced representation of Indian accents and demographics, reducing biases and promoting inclusivity.

    Each audio recording lasts around 5-30 seconds. The speech data is available in WAV format, as mono-channel files with a bit depth of 16 bits and a sample rate of 48 kHz. The recording environment is generally quiet, without background noise and echo.

    Metadata:

    In addition to the audio recordings, our dataset provides comprehensive metadata for each participant. This metadata includes the participant's age, gender, country, state, and dialect. Furthermore, additional metadata such as recording device details, bit depth, and sample rate will be provided.

    The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Tamil speech recognition models.

    Transcription (Text File):

    This dataset provides a text file containing the scripted prompt for each audio file. Each transcription is in TXT format and is named to match its corresponding audio file.
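
    Because each transcription is named after its audio file, the two can be paired by filename. The sketch below assumes that a recording and its transcript share the same filename stem (for example a hypothetical `utt_001.wav` and `utt_001.txt`) and that transcripts are UTF-8 encoded; the actual directory layout and naming scheme are not specified in the description.

```python
from pathlib import Path


def pair_audio_with_transcripts(root):
    """Pair every .wav under root with the .txt file that has the same stem.

    Returns (pairs, missing): pairs is a list of (audio_path, transcript_text),
    missing lists audio files for which no transcript was found.
    """
    root = Path(root)
    # Index transcripts by stem; assumes stems are unique across the dataset.
    transcripts = {p.stem: p for p in root.rglob("*.txt")}
    pairs, missing = [], []
    for audio in sorted(root.rglob("*.wav")):
        text_path = transcripts.get(audio.stem)
        if text_path is None:
            missing.append(audio)
        else:
            pairs.append((audio, text_path.read_text(encoding="utf-8").strip()))
    return pairs, missing
```

    Reporting the `missing` list separately makes it easy to spot recordings that arrived without a transcript.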

    This text data can further be annotated with named entity recognition (NER) to expedite the deployment of Tamil conversational AI and NLP models.

    Updates and Customization:

    We understand the importance of collecting data in various environments to build robust ASR models. Therefore, our voice dataset is regularly updated with new audio data captured in diverse real-world conditions.

    If you require a custom training dataset with specific environmental conditions (such as in-car, busy street, or restaurant) or with different speaking speeds (fast, slow, or normal), we can accommodate your request. We can provide voice data with customized sample rates ranging from 8 kHz to 48 kHz, allowing you to fine-tune your models for different audio recording setups.

    License:

    This audio dataset, created by FutureBeeAI, is now available for commercial use.

    Conclusion:

    Whether you are training or fine-tuning speech recognition models, advancing NLP algorithms, exploring speech AI, or building cutting-edge voice assistants and bots, our dataset serves as a reliable and valuable resource.

  16. Data from: Tamil Dependency Treebank v0.1

    • live.european-language-grid.eu
    • lindat.mff.cuni.cz
    binary format
    Updated Oct 30, 2014
    Cite
    (2014). Tamil Dependency Treebank v0.1 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/1084
    Available download formats: binary format
    Dataset updated
    Oct 30, 2014
    License

    Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0) https://creativecommons.org/licenses/by-nc-sa/3.0/
    License information was derived automatically

    Description

    Tamil Dependency Treebank version 0.1 (TamilTB.v0.1) is an attempt to develop a syntactically annotated corpus for Tamil. TamilTB.v0.1 contains 600 sentences enriched with manual annotation of morphology and dependency syntax in the style of the Prague Dependency Treebank. TamilTB.v0.1 was created at the Institute of Formal and Applied Linguistics, Charles University in Prague.

  17. offenseval_dravidian

    • huggingface.co
    • opendatalab.com
    Updated Oct 19, 2023
    Cite
    The HF Datasets community (2023). offenseval_dravidian [Dataset]. https://huggingface.co/datasets/offenseval_dravidian
    Dataset updated
    Oct 19, 2023
    Dataset authored and provided by
    The HF Datasets community
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Offensive language identification dataset for Dravidian languages. The goal of this task is to identify offensive language content in a code-mixed dataset of comments and posts in Dravidian languages (Tamil-English, Malayalam-English, and Kannada-English) collected from social media.

  18. IndicGLUE Dataset

    • paperswithcode.com
    Updated Mar 9, 2022
    Cite
    Divyanshu Kakwani; Anoop Kunchukuttan; Satish Golla; Gokul N.C.; Avik Bhattacharyya; Mitesh M. Khapra; Pratyush Kumar. (2022). IndicGLUE Dataset [Dataset]. https://paperswithcode.com/dataset/indicglue
    Dataset updated
    Mar 9, 2022
    Authors
    Divyanshu Kakwani; Anoop Kunchukuttan; Satish Golla; Gokul N.C.; Avik Bhattacharyya; Mitesh M. Khapra; Pratyush Kumar.
    Description

    We now introduce IndicGLUE, the Indic General Language Understanding Evaluation Benchmark, which is a collection of various NLP tasks as described below. The goal is to provide an evaluation benchmark for natural language understanding capabilities of NLP models on diverse tasks and multiple Indian languages.

  19. Transliteration Sentence Dataset

    • data.mendeley.com
    Updated Apr 16, 2024
    Cite
    Md. Jabed Hosen (2024). Transliteration Sentence Dataset [Dataset]. http://doi.org/10.17632/38y7g2fcny.1
    Dataset updated
    Apr 16, 2024
    Authors
    Md. Jabed Hosen
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A transliterated sentence writes the same words in a different script whose letters sound the same. It helps people who speak different languages understand each other better. This dataset is drawn from 12 varied datasets originally intended for tasks such as sentiment analysis, hate speech detection, social media analysis, and review classification, and aims to cover a wide range of the linguistic subtleties and variation found in real-world language usage. Each data instance was labeled according to the language of the sentence. From this combination of datasets, we curated 65,473 instances (19,859 Bangla, 17,309 Hindi, 17,000 English, and 11,305 Tamil), specifically tailored for transliteration sentence identification.

  20. Hindi and Tamil Wiki text cleaned

    • kaggle.com
    zip
    Updated Sep 5, 2021
    Cite
    makaveli (2021). Hindi and Tamil Wiki text cleaned [Dataset]. https://www.kaggle.com/starkking07/hindi-and-tamil-wiki-text-cleaned
    Available download formats: zip (253902101 bytes)
    Dataset updated
    Sep 5, 2021
    Authors
    makaveli
    Description

    Dataset

    This dataset was created by makaveli

    Contents

Cite
SRK (2019). Tamil NLP [Dataset]. https://www.kaggle.com/sudalairajkumar/tamil-nlp/code

Tamil NLP

Datasets for Natural Language Processing in Tamil

24 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Mar 11, 2019
Dataset provided by
Kaggle http://kaggle.com/
Authors
SRK
License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically

Description

Context

Indic NLP - Natural Language Processing for Indian Languages.

This dataset is a step towards the same for the Tamil language. Thanks to Malaikannan for the initiative and to Selva for getting the data from websites. The idea is to collect more datasets related to Tamil NLP in a single place.

Content

The dataset has the following files.

Tamil News Classification

This dataset has 14521 rows for training and 3631 rows for testing. It has 6 news categories - "tamilnadu", "india", "cinema", "sports", "politics", "world". The data is obtained from this link

  • tamil_news_train.csv - Train dataset for tamil news classification.
  • tamil_news_test.csv - Test dataset for tamil news classification

Tamil Movie Review Dataset

This dataset has 480 training samples and 121 testing samples. It has the review text in Tamil and ratings from 1 to 5. The data is obtained from this link

  • tamil_movie_reviews_train.csv - Train dataset for tamil movie reviews
  • tamil_movie_reviews_test.csv - Test dataset for tamil movie reviews

Thirukkural Dataset

From Wikipedia: the Tirukkural, or the Kural for short, is a classic Tamil text consisting of 1,330 couplets or Kurals, dealing with the everyday virtues of an individual. It is one of the two oldest works now extant in Tamil literature.

I have split the data into train and test and we can use the kural and / or the explanations to predict the three parts - aram (virtue), porul (polity) and inbam (love). The dataset is obtained from this link.

  • tamil_thirukkural_train - train dataset having 1064 rows
  • tamil_thirukkural_test - test dataset having 266 rows
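
The row counts quoted above can be sanity-checked with a few lines of arithmetic. This is a small sketch that uses only the numbers stated in this description; the dictionary keys and the `split_summary` helper are illustrative names, and nothing here reads the actual CSV files.

```python
# (train_rows, test_rows) as stated in the dataset description.
SPLITS = {
    "tamil_news": (14521, 3631),
    "tamil_movie_reviews": (480, 121),
    "tamil_thirukkural": (1064, 266),
}


def split_summary(train_rows, test_rows):
    """Return the total row count and the test share rounded to three decimals."""
    total = train_rows + test_rows
    return total, round(test_rows / total, 3)


for name, (train_rows, test_rows) in SPLITS.items():
    total, test_share = split_summary(train_rows, test_rows)
    print(f"{name}: {total} rows in total, test share {test_share}")
```

For the Thirukkural files the total comes to 1,330 rows, matching the number of couplets quoted from Wikipedia, and all three datasets use roughly an 80/20 train/test split.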

More datasets will be added in future versions.

Acknowledgements

My sincere thanks to :

  • Malaikannan for starting this initiative
  • Selvakumar for getting the data
  • Vijay Anand for the Thirukkural data

Inspiration

Some questions which can be answered are

  1. Can we do text classification for Tamil and get accuracies similar to those for other languages?
  2. How do language models perform for Tamil?

And many more interesting questions remain to be answered.

Check out this link to find similar and dissimilar words for Tamil.
