70 datasets found
  1. Tamil NLP

    • kaggle.com
    Updated Mar 11, 2019
    Cite
    SRK (2019). Tamil NLP [Dataset]. https://www.kaggle.com/sudalairajkumar/tamil-nlp/code
    Explore at:
    Dataset updated
    Mar 11, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    SRK
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Context

    Indic NLP - Natural Language Processing for Indian Languages.

    This dataset is a step towards the same for the Tamil language. Thanks to Malaikannan for the initiative and Selva for getting the data from websites. The idea is to collect more datasets related to Tamil NLP in a single place.

    Content

    The dataset has the following files.

    Tamil News Classification

    This dataset has 14521 rows for training and 3631 rows for testing. It has 6 news categories - "tamilnadu", "india", "cinema", "sports", "politics", "world". The data is obtained from this link.

    • tamil_news_train.csv - Train dataset for Tamil news classification
    • tamil_news_test.csv - Test dataset for Tamil news classification

    Tamil Movie Review Dataset

    This dataset has 480 training samples and 121 testing samples. It has the review text in Tamil and ratings from 1 to 5. The data is obtained from this link.

    • tamil_movie_reviews_train.csv - Train dataset for Tamil movie reviews
    • tamil_movie_reviews_test.csv - Test dataset for Tamil movie reviews

    Thirukkural Dataset

    From Wikipedia, The Tirukkural, or shortly the Kural, is a classic Tamil text consisting of 1,330 couplets or Kurals, dealing with the everyday virtues of an individual. It is one of the two oldest works now extant in Tamil literature.

    I have split the data into train and test and we can use the kural and / or the explanations to predict the three parts - aram (virtue), porul (polity) and inbam (love). The dataset is obtained from this link.

    • tamil_thirukkural_train - train dataset having 1064 rows
    • tamil_thirukkural_test - test dataset having 266 rows
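
    As a quick sanity check on the row counts quoted above, the sketch below computes the total size and train share of each split. The dictionary keys are illustrative labels, not file names confirmed by the source; only the numbers come from the description.

```python
# Split sizes (train rows, test rows) as stated in the dataset description.
# The dictionary keys are illustrative labels, not confirmed file names.
SPLITS = {
    "tamil_news": (14521, 3631),
    "tamil_movie_reviews": (480, 121),
    "tamil_thirukkural": (1064, 266),
}


def train_fraction(n_train: int, n_test: int) -> float:
    """Return the share of rows that fall in the training split."""
    return n_train / (n_train + n_test)


for name, (n_train, n_test) in SPLITS.items():
    total = n_train + n_test
    print(f"{name}: {total} rows, {train_fraction(n_train, n_test):.1%} train")
```

    All three splits come out at roughly 80/20, and the Thirukkural rows add up to the 1,330 couplets mentioned in the description.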

    Will add more datasets in the following versions.

    Acknowledgements

    My sincere thanks to :

    • Malaikannan for starting this initiative
    • Selvakumar for getting the data
    • Vijay Anand for the Thirukkural data

    Inspiration

    Some questions which can be answered are

    1. Can we do text classification for Tamil and get accuracies as good as those for other languages?
    2. How do language models perform for Tamil?

    And a lot more interesting questions remain to be answered.

    Check out this link to find similar and dissimilar words for Tamil.

  2. Tamil-English translated Parallel Corpora for Legal Domain

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Tamil-English translated Parallel Corpora for Legal Domain [Dataset]. https://www.futurebeeai.com/dataset/parallel-corpora/tamil-english-translated-parallel-corpus-for-legal-domain
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    This bilingual parallel corpus consists of more than 50,000 sentences translated from English to Tamil by more than 200 native translators in the Legal domain. The domain-specific corpus contains native slang, phrases, and language-specific words, and follows the native way of talking, which makes it more information-rich. Many sentences are translated by several native translators, which allows comparison of how different groups interpret the same text.

    The sentences in this corpus range in length from 7 to 15 words. The data is available in Excel format and can be converted into TMX, XML, XLIFF, or other equivalent formats.
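
    To illustrate the kind of conversion mentioned above, here is a minimal sketch that turns (English, Tamil) sentence pairs into a TMX 1.4 document using only the Python standard library. It assumes the pairs have already been read from the spreadsheet; the column layout of the actual Excel file is not described in the source, and the header values (tool name, original format) are placeholders.

```python
import xml.etree.ElementTree as ET

# ElementTree spells the reserved xml:lang attribute with its namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def pairs_to_tmx(pairs, src_lang="en", tgt_lang="ta"):
    """Build a minimal TMX 1.4 document from (source, target) sentence pairs."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "example-script",   # placeholder tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "xlsx",                    # assumed original format
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        tu = ET.SubElement(body, "tu")      # one translation unit per pair
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


# Placeholder pair; real rows would come from the delivered spreadsheet.
sample = [("The court adjourned the hearing.", "(Tamil translation)")]
print(pairs_to_tmx(sample))
```

    Because several translators may render the same source sentence, each translation would simply become its own translation unit in this layout.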

    These parallel bilingual corpora can be used for the research and development of bilingual lexicography and machine translation engines. They can also be used to create language databases for applications such as predictive keyboards, spell checkers, grammar checkers, text/speech understanding systems, text-to-speech modules, and many other NLP-based applications.

    More translated sentences are constantly being added to this parallel corpus. Depending on your unique requirements, we can curate numerous parallel corpora in various languages. For synthetic custom curation, do not forget to check out the FutureBeeAI community. The license for this parallel corpus dataset belongs to FutureBeeAI!

  3. Ponniyan selvan Tamil Book for NLP

    • kaggle.com
    zip
    Updated Sep 9, 2020
    Cite
    Dinesh Kumar Sarangapani (2020). Ponniyan selvan Tamil Book for NLP [Dataset]. https://www.kaggle.com/datasets/dineshkumarsarang/ponniyan-selvan-tamil-book-for-nlp
    Explore at:
    Available download formats: zip (1985053 bytes)
    Dataset updated
    Sep 9, 2020
    Authors
    Dinesh Kumar Sarangapani
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Dinesh Kumar Sarangapani

    Released under CC0: Public Domain

    Contents

  4. Claim Detection and Matching for Indian Languages

    • zenodo.org
    • explore.openaire.eu
    csv
    Updated Jun 6, 2021
    Cite
    Ashkan Kazemi; Kiran Garimella; Devin Gaffney; Scott A. Hale (2021). Claim Detection and Matching for Indian Languages [Dataset]. http://doi.org/10.5281/zenodo.4890950
    Explore at:
    Available download formats: csv
    Dataset updated
    Jun 6, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ashkan Kazemi; Kiran Garimella; Devin Gaffney; Scott A. Hale
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    India
    Description

    Two datasets are included in this repository: claim matching and claim detection datasets. The collections contain data in 5 languages: Bengali, English, Hindi, Malayalam and Tamil.

    The "claim detection" dataset contains textual claims from social media and fact-checking websites annotated for the "fact-check worthiness" of the claims in each message. Data points have one of the three labels of "Yes" (text contains one or more check-worthy claims), "No" and "Probably".

    The "claim matching" dataset is a curated collection of pairs of textual claims from social media and fact-checking websites for the purpose of automatic and multilingual claim matching. Pairs of data have one of the four labels of "Very Similar", "Somewhat Similar", "Somewhat Dissimilar" and "Very Dissimilar".

    All personally identifiable information (PII), including phone numbers, email addresses, license plate numbers and addresses, has been replaced with general tags to protect user anonymity. A detailed explanation of the curation and annotation process is provided in our ACL 2021 paper:
    Kazemi, A.; Garimella, K.; Gaffney, D.; and Hale, S. A. 2021. Claim Matching Beyond English to Scale Global Fact-Checking. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, ACL 2021.

  5. Tamil (India) General Conversation Speech Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Tamil (India) General Conversation Speech Dataset [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-tamil-india
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Welcome to the Tamil Language General Conversation Speech Dataset, a comprehensive and diverse collection of voice data specifically curated to advance the development of Tamil language speech recognition models, with a particular focus on Indian accents and dialects.

    With high-quality audio recordings, detailed metadata, and accurate transcriptions, it empowers researchers and developers to enhance natural language processing, conversational AI, and Generative Voice AI algorithms. Moreover, it facilitates the creation of sophisticated voice assistants and voice bots tailored to the unique linguistic nuances found in the Tamil language spoken in India.

    Speech Data:

    This training dataset comprises 50 hours of audio recordings covering a wide range of topics and scenarios, ensuring robustness and accuracy in speech technology applications. To achieve this, we collaborated with a diverse network of 70 native Tamil speakers from different parts of Tamil Nadu. This collaborative effort guarantees a balanced representation of Indian accents, dialects, and demographics, reducing biases and promoting inclusivity.

    Each audio recording captures the essence of spontaneous, unscripted conversations between two individuals, with an average duration ranging from 15 to 60 minutes. The speech data is available in WAV format, with stereo channel files having a bit depth of 16 bits and a sample rate of 8 kHz. The recording environment is generally quiet, without background noise and echo.
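
    The stated audio format (stereo, 16-bit, 8 kHz) can be checked with Python's standard `wave` module, and it also fixes the uncompressed data rate. The sketch below is illustrative only: it verifies a synthetic one-second clip rather than a real file from the dataset, and the function names are hypothetical.

```python
import io
import wave

CHANNELS = 2        # stereo, per the description
SAMPLE_WIDTH = 2    # bytes per sample (16-bit)
SAMPLE_RATE = 8000  # Hz


def bytes_per_hour(channels=CHANNELS, width=SAMPLE_WIDTH, rate=SAMPLE_RATE):
    """Uncompressed PCM payload for one hour of audio, ignoring the header."""
    return channels * width * rate * 3600


def matches_spec(wav_file) -> bool:
    """Check that a WAV file (path or file object) has the stated parameters."""
    with wave.open(wav_file, "rb") as w:
        found = (w.getnchannels(), w.getsampwidth(), w.getframerate())
    return found == (CHANNELS, SAMPLE_WIDTH, SAMPLE_RATE)


# Synthetic one-second silent clip standing in for a real recording.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(CHANNELS)
    w.setsampwidth(SAMPLE_WIDTH)
    w.setframerate(SAMPLE_RATE)
    w.writeframes(b"\x00" * (CHANNELS * SAMPLE_WIDTH * SAMPLE_RATE))
buf.seek(0)

print(matches_spec(buf))
print(f"{bytes_per_hour() / 1e6:.1f} MB per hour")
```

    At 32,000 bytes per second, 50 hours of audio in this format amounts to roughly 5.76 GB of raw PCM data.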

    Metadata:

    In addition to the audio recordings, our dataset provides comprehensive metadata for each participant. This metadata includes the participant's age, gender, country, state, and dialect. Furthermore, additional metadata such as recording device detail, topic of recording, bit depth, and sample rate will be provided.

    The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Tamil language speech recognition models.

    Transcription:

    This dataset provides a manual verbatim transcription of each audio file to enhance your workflow efficiency. The transcriptions are available in JSON format. The transcriptions capture speaker-wise transcription with time-coded segmentation along with non-speech labels and tags.

    Our goal is to expedite the deployment of Tamil language conversational AI and NLP models by offering ready-to-use transcriptions, ultimately saving valuable time and resources in the development process.

    Updates and Customization:

    We understand the importance of collecting data in various environments to build robust ASR models. Therefore, our voice dataset is regularly updated with new audio data captured in diverse real-world conditions.

    If you require a custom training dataset with specific environmental conditions such as in-car, busy street, restaurant, or any other scenario, we can accommodate your request. We can provide voice data with customized sample rates ranging from 8kHz to 48kHz, allowing you to fine-tune your models for different audio recording setups. Additionally, we can also customize the transcription following your specific guidelines and requirements, to further support your ASR development process.

    License:

    This audio dataset, created by FutureBeeAI, is now available for commercial use.

    Conclusion:

    Whether you are training or fine-tuning speech recognition models, advancing NLP algorithms, exploring generative voice AI, or building cutting-edge voice assistants and bots, our dataset serves as a reliable and valuable resource.

  6. GlobalPhone Tamil

    • live.european-language-grid.eu
    • catalogue.elra.info
    audio format
    + more versions
    Cite
    GlobalPhone Tamil [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/1916
    Explore at:
    Available download formats: audio format
    License

    http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    Description

    The GlobalPhone corpus developed in collaboration with the Karlsruhe Institute of Technology (KIT) was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks.

    The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).

    In each language about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16-bit, 16 kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering, false starts, and non-verbal effects like laughing and hesitations. Speaker information like age, gender, occupation, etc. as well as information about the recording setup complement the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers.

    Data is compressed with the shorten program written by Tony Robinson. Alternatively, the data can be delivered uncompressed.

    The Tamil corpus was produced using the Thinaboomi Tamil Daily newspaper. It contains recordings of 47 speakers (gender unspecified) recorded in India. No age distribution is available.

  7. EnTam: An English-Tamil Parallel Corpus (EnTam v2.0) - Dataset - B2FIND

    • b2find.dkrz.de
    Updated Apr 28, 2023
    + more versions
    Cite
    (2023). EnTam: An English-Tamil Parallel Corpus (EnTam v2.0) - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/9eb44325-3708-574f-a0da-4e8ccff2aa66
    Explore at:
    Dataset updated
    Apr 28, 2023
    License

    Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0): https://creativecommons.org/licenses/by-nc-sa/3.0/
    License information was derived automatically

    Description

    EnTam is a sentence-aligned English-Tamil bilingual corpus that we collected from publicly available websites for NLP research involving Tamil. A standard set of processing steps was applied to the raw web data to produce a sentence-aligned English-Tamil parallel corpus suitable for various NLP tasks. The parallel corpus includes texts from the Bible, cinema and news domains.

  8. Tamil---Language-Corpus-for-NLP

    • kaggle.com
    Updated Mar 23, 2020
    Cite
    (2020). Tamil---Language-Corpus-for-NLP [Dataset]. https://www.kaggle.com/praveengovi/tamil-language-corpus-for-nlp/code
    Explore at:
    Dataset updated
    Mar 23, 2020
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Tamil language corpus dataset - natural language processing

  9. Tamil-English translated Parallel Corpora for Management Domain

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Tamil-English translated Parallel Corpora for Management Domain [Dataset]. https://www.futurebeeai.com/dataset/parallel-corpora/tamil-english-translated-parallel-corpus-for-management-domain
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    This bilingual parallel corpus consists of more than 50,000 sentences translated from English to Tamil by more than 200 native translators in the Management domain. The domain-specific corpus contains native slang, phrases, and language-specific words, and follows the native way of talking, which makes it more information-rich. Many sentences are translated by several native translators, which allows comparison of how different groups interpret the same text.

    The sentences in this corpus range in length from 7 to 15 words. The data is available in Excel format and can be converted into TMX, XML, XLIFF, or other equivalent formats.

    These parallel bilingual corpora can be used for the research and development of bilingual lexicography and machine translation engines. They can also be used to create language databases for applications such as predictive keyboards, spell checkers, grammar checkers, text/speech understanding systems, text-to-speech modules, and many other NLP-based applications.

    More translated sentences are constantly being added to this parallel corpus. Depending on your unique requirements, we can curate numerous parallel corpora in various languages. For synthetic custom curation, do not forget to check out the FutureBeeAI community. The license for this parallel corpus dataset belongs to FutureBeeAI!

  10. TAUS Language Translation Data | Parallel translation for Colloquial English...

    • datarade.ai
    Updated Dec 15, 2020
    Cite
    TAUS (2020). TAUS Language Translation Data | Parallel translation for Colloquial English into various languages for Machine Learning [Dataset]. https://datarade.ai/data-products/taus-parallel-text-colloquial-domain-english-low-resource-see-description-taus
    Explore at:
    Available download formats: .xml, .csv, .xls, .txt
    Dataset updated
    Dec 15, 2020
    Dataset authored and provided by
    TAUS
    Area covered
    Bangladesh, Myanmar, Lao People's Democratic Republic, Turkey, Vietnam, Timor-Leste, Iraq, Iran (Islamic Republic of), Nepal, Indonesia
    Description

    The corpus is a great fit for training chat bots or social media content, and will give the conversation with your local audience a friendly, casual tone. From product user reviews and blog post comments to everyday business small talk, your MT engine will be able to handle even the most creative user voices.

    This corpus contains over 1 million words, and a total vocabulary of more than 37000 different words. Need more data? In the following months, TAUS will release more equally sized corpora for the same domain and language combinations, with a significant increase of vocabulary.

    • English - Hindi
    • English - Urdu
    • English - Tamil
    • English - Nepali
    • English - Turkish
    • English - Pashto
    • English - Sorani
    • English - Bengali
    • English - Burmese
    • English - Assamese
    • English - Telugu
    • English - Sinhalese
    • English - Dari
    • English - Punjabi (Pakistan)
    • English - Punjabi (India)
    • English - Lao
    • English - Kurmanji (lat)
    • English - Kurmanji (arab)

    Other languages are available on demand.

  11. IndicCorp Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Mar 10, 2024
    Cite
    Divyanshu Kakwani; Anoop Kunchukuttan; Satish Golla; Gokul N.C.; Avik Bhattacharyya; Mitesh M. Khapra; Pratyush Kumar. (2024). IndicCorp Dataset [Dataset]. https://paperswithcode.com/dataset/indiccorp
    Explore at:
    Dataset updated
    Mar 10, 2024
    Authors
    Divyanshu Kakwani; Anoop Kunchukuttan; Satish Golla; Gokul N.C.; Avik Bhattacharyya; Mitesh M. Khapra; Pratyush Kumar.
    Description

    IndicCorp is a large monolingual corpus with around 9 billion tokens covering 12 of the major Indian languages. It was built by discovering and scraping thousands of web sources, primarily news, magazines and books, over several months.

    Languages covered: Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu

    Corpus Format: The corpus is a single large text file containing one sentence per line. The publicly released version is randomly shuffled, untokenized and deduplicated.
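
    Since the corpus is described as one sentence per line in a single large file, it can be processed as a stream without loading it into memory. The sketch below counts sentences and whitespace-separated tokens; whitespace splitting is only a rough stand-in, not the tokenization behind the published figures, and the in-memory sample replaces a real corpus file.

```python
import io
from typing import Iterable, Tuple


def count_sentences_and_tokens(lines: Iterable[str]) -> Tuple[int, int]:
    """Stream over one-sentence-per-line text and count sentences and tokens.

    Tokens are approximated by whitespace splitting, which is an assumption
    of this sketch rather than the tokenization used for the corpus.
    """
    sentences = tokens = 0
    for line in lines:
        line = line.strip()
        if not line:          # skip blank lines
            continue
        sentences += 1
        tokens += len(line.split())
    return sentences, tokens


# Small in-memory stand-in for a corpus file.
sample = io.StringIO("first sentence here\n\nsecond sentence\n")
print(count_sentences_and_tokens(sample))  # → (2, 5)
```

    For a real download, the same function can be passed an open file handle, for example one opened with `encoding="utf-8"`, so that lines are read lazily.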

    Downloads

    Language   # News Articles*   Sentences   Tokens   Link
    as         0.60M              1.39M       32.6M    link
    bn         3.83M              39.9M       836M     link
    en         3.49M              54.3M       1.22B    link
    gu         2.63M              41.1M       719M     link
    hi         4.95M              63.1M       1.86B    link
    kn         3.76M              53.3M       713M     link
    ml         4.75M              50.2M       721M     link
    mr         2.31M              34.0M       551M     link
    or         0.69M              6.94M       107M     link
    pa         2.64M              29.2M       773M     link
    ta         4.41M              31.5M       582M     link
    te         3.98M              47.9M       674M     link

    * Excluding articles obtained from the OSCAR corpus
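
    The per-language figures in the table above can be cross-checked against the "around 9 billion tokens" claim, and they also give a rough average sentence length per language. The sketch below hard-codes the sentence and token counts from the table (in millions); nothing else is assumed.

```python
# (sentences, tokens) in millions, copied from the table above.
STATS = {
    "as": (1.39, 32.6), "bn": (39.9, 836), "en": (54.3, 1220),
    "gu": (41.1, 719), "hi": (63.1, 1860), "kn": (53.3, 713),
    "ml": (50.2, 721), "mr": (34.0, 551), "or": (6.94, 107),
    "pa": (29.2, 773), "ta": (31.5, 582), "te": (47.9, 674),
}

total_tokens_m = sum(tokens for _, tokens in STATS.values())


def tokens_per_sentence(lang: str) -> float:
    """Average tokens per sentence for one language code."""
    sentences, tokens = STATS[lang]
    return tokens / sentences


print(f"total tokens: {total_tokens_m / 1000:.2f}B")
print(f"ta tokens per sentence: {tokens_per_sentence('ta'):.1f}")
```

    The total comes to about 8.79 billion tokens, consistent with the rounded figure in the description, and Tamil averages about 18.5 tokens per sentence.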
  12. HPL Tamil Dataset

    • kaggle.com
    zip
    Updated Apr 9, 2024
    Cite
    Rohit Khadka (2024). HPL Tamil Dataset [Dataset]. https://www.kaggle.com/datasets/rohitkhadka375741/hpl-tamil
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Apr 9, 2024
    Authors
    Rohit Khadka
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    "HPL Tamil" dataset serves as a valuable resource for anyone interested in studying and analyzing the Tamil language, facilitating advancements in computational linguistics and NLP research.

  13. XQA Tamil

    • kaggle.com
    zip
    Updated Sep 30, 2021
    Cite
    Manav Dhamani (2021). XQA Tamil [Dataset]. https://www.kaggle.com/mdhamani/xqa-tamil
    Explore at:
    Available download formats: zip (11676466 bytes)
    Dataset updated
    Sep 30, 2021
    Authors
    Manav Dhamani
    Description

    Dataset

    This dataset was created by Manav Dhamani

    Contents

  14. Tamil - Language Corpus for NLP

    • kaggle.com
    Updated Apr 8, 2020
    Cite
    Praveen (2020). Tamil - Language Corpus for NLP [Dataset]. https://www.kaggle.com/praveengovi/tamil-language-corpus-for-nlp/discussion
    Explore at:
    Dataset updated
    Apr 8, 2020
    Dataset provided by
    Kaggle
    Authors
    Praveen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description


    Context

    Tamil is one of the longest-surviving classical languages in the world. It has been described as "the only language of contemporary India which is recognizably continuous with a classical past". The variety and quality of classical Tamil literature has led to it being described as "one of the great classical traditions and literatures of the world".

    The Tamil language corpus helps researchers, IT professionals and students build Tamil language models for sentiment classification, topic modeling, text summarisation, text generation, named entity recognition, knowledge graphs and chatbots.

    Content

    The Tamil language corpus consists of articles from Wikipedia and Tamil daily news. The dataset is split into train and test sets for ease of use in building machine learning models.

    Acknowledgements

    Thanks to Vanagamudi and Gaurov for their contributions to Tamil NLP; the datasets used in their work were very helpful in preparing this dataset.

    https://github.com/vanangamudi/tamil-lm2 https://github.com/goru001/nlp-for-tamil

    Inspiration

    Advancing the Tamil language in the world of artificial intelligence and contributing to education and research.

  15. Tamil Wikipedia Articles

    • kaggle.com
    Updated Dec 25, 2019
    Cite
    Gaurav (2019). Tamil Wikipedia Articles [Dataset]. https://www.kaggle.com/disisbig/tamil-wikipedia-articles/notebooks
    Explore at:
    Dataset updated
    Dec 25, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Gaurav
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This data set consists of 127k Wikipedia Articles which have been cleaned.

    It has a Train set and Validation set, which were used to train and benchmark Language Models for Tamil in the repository NLP for Tamil

    The scripts which were used to fetch and clean articles can be found here

    Thanks to Ravi for sharing this data set

    Feel free to use this data set creatively and for building better Language Models

  16. Travel domain Human-Human conversation chats in Tamil

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Travel domain Human-Human conversation chats in Tamil [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/tamil-travel-domain-conversation-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    This training dataset comprises more than 10,000 text conversations between two native Tamil speakers in the travel domain. We have a collection of chats on a variety of topics, services and issues of daily life, such as music, books, festivals, health, kids, family, environment, study, childhood, cuisine, internet, movies, etc., which makes the dataset diverse.

    These chats consist of language-specific words and phrases and follow the native way of talking, which makes the chats more information-rich for your NLP model. Apart from being specific to its topic, each chat contains various attributes such as people's names, addresses, contact information, email addresses, times, dates, local currency, telephone numbers, local slang, etc., in various formats to make the text data unbiased.

    These chat scripts have between 300 and 700 words and up to 50 turns. 150 people who are part of the FutureBeeAI crowd community contributed to this dataset. You will also receive chat metadata, such as participant age, gender, and country information, along with the chats. Dataset applications include conversational AI, natural language processing (NLP), smart assistants, text recognition, text analytics, and text prediction.

    This dataset is being expanded with new chats all the time. We are able to produce text data in a variety of languages to meet your unique requirements. Check out the FutureBeeAI community for a custom collection.

    This training dataset's licence belongs to FutureBeeAI!

  17. Data from: Tamil Dependency Treebank v0.1

    • live.european-language-grid.eu
    • lindat.mff.cuni.cz
    binary format
    Updated Oct 30, 2014
    + more versions
    Cite
    (2014). Tamil Dependency Treebank v0.1 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/1084
    Explore at:
    Available download formats: binary format
    Dataset updated
    Oct 30, 2014
    License

    Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0): https://creativecommons.org/licenses/by-nc-sa/3.0/
    License information was derived automatically

    Description

    Tamil Dependency Treebank version 0.1 (TamilTB.v0.1) is an attempt to develop a syntactically annotated corpus for Tamil. TamilTB.v0.1 contains 600 sentences enriched with manual annotation of morphology and dependency syntax in the style of the Prague Dependency Treebank. TamilTB.v0.1 was created at the Institute of Formal and Applied Linguistics, Charles University in Prague.

  18. offenseval_dravidian

    • huggingface.co
    • opendatalab.com
    Updated Oct 19, 2023
    Cite
    The HF Datasets community (2023). offenseval_dravidian [Dataset]. https://huggingface.co/datasets/offenseval_dravidian
    Explore at:
    Dataset updated
    Oct 19, 2023
    Dataset authored and provided by
    The HF Datasets community
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Offensive language identification dataset for Dravidian languages. The goal of this task is to identify offensive language content in a code-mixed dataset of comments and posts in Dravidian languages (Tamil-English, Malayalam-English, and Kannada-English) collected from social media.

  19. Transliteration Sentence Dataset

    • data.mendeley.com
    Updated Apr 16, 2024
    Cite
    Md. Jabed Hosen (2024). Transliteration Sentence Dataset [Dataset]. http://doi.org/10.17632/38y7g2fcny.1
    Explore at:
    Dataset updated
    Apr 16, 2024
    Authors
    Md. Jabed Hosen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A transliteration sentence is like writing the same words but using different letters that sound the same. It helps people who speak different languages understand each other better. This dataset, drawn from 12 varied datasets initially intended for tasks such as sentiment analysis, hate speech detection, social media analysis, and review classification, endeavors to encompass a wide array of linguistic subtleties and fluctuations inherent in real-world language usage. Each data instance was meticulously labeled based on the language of the sentences. From this amalgamation of datasets, we curated a dataset comprising 65,473 instances, comprising 19,859 Bangla, 17,309 Hindi, 17,000 English, and 11,305 Tamil data instances, specifically tailored for transliteration sentence identification.
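
    The per-language counts quoted above can be checked against the stated total and turned into class shares. The sketch below also derives inverse-frequency class weights, a common heuristic for imbalanced classification that is an assumption of this example and not something the dataset authors describe.

```python
# Instance counts per language, as stated in the description.
COUNTS = {"Bangla": 19859, "Hindi": 17309, "English": 17000, "Tamil": 11305}

total = sum(COUNTS.values())

# Share of the dataset held by each language.
shares = {lang: n / total for lang, n in COUNTS.items()}

# Inverse-frequency weights (heuristic, not from the source): classes with
# fewer instances get a proportionally larger weight.
weights = {lang: total / (len(COUNTS) * n) for lang, n in COUNTS.items()}

for lang in COUNTS:
    print(f"{lang}: {shares[lang]:.1%} of data, weight {weights[lang]:.2f}")
```

    The counts add up to the stated 65,473 instances, with Tamil the smallest class at about 17% of the data.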

  20. IndicGLUE Dataset

    • paperswithcode.com
    Updated Mar 9, 2022
    Cite
    Divyanshu Kakwani; Anoop Kunchukuttan; Satish Golla; Gokul N.C.; Avik Bhattacharyya; Mitesh M. Khapra; Pratyush Kumar. (2022). IndicGLUE Dataset [Dataset]. https://paperswithcode.com/dataset/indicglue
    Explore at:
    Dataset updated
    Mar 9, 2022
    Authors
    Divyanshu Kakwani; Anoop Kunchukuttan; Satish Golla; Gokul N.C.; Avik Bhattacharyya; Mitesh M. Khapra; Pratyush Kumar.
    Description

    We now introduce IndicGLUE, the Indic General Language Understanding Evaluation Benchmark, which is a collection of various NLP tasks as described below. The goal is to provide an evaluation benchmark for natural language understanding capabilities of NLP models on diverse tasks and multiple Indian languages.
