100+ datasets found
  1. Tamil NLP

    • kaggle.com
    Updated Mar 11, 2019
    Cite
    SRK (2019). Tamil NLP [Dataset]. https://www.kaggle.com/datasets/sudalairajkumar/tamil-nlp/data
    Dataset updated
    Mar 11, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    SRK
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Context

    Indic NLP - Natural Language Processing for Indian Languages.

    This dataset is a step towards the same for the Tamil language. Thanks to Malaikannan for the initiative and Selva for getting the data from websites. The idea is to add more datasets related to Tamil NLP in a single place.

    Content

    The dataset has the following files.

    Tamil News Classification

    This dataset has 14521 rows for training and 3631 rows for testing. It has 6 news categories - "tamilnadu", "india", "cinema", "sports", "politics", "world". The data is obtained from this link

    • tamil_news_train.csv - Train dataset for tamil news classification.
    • tamil_news_test.csv - Test dataset for tamil news classification

    Tamil Movie Review Dataset

    This dataset has 480 training samples and 121 testing samples. It has the review text in Tamil and ratings from 1 to 5. The data is obtained from this link

    • tamil_movie_reviews_train.csv - Train dataset for tamil movie reviews
    • tamil_movie_reviews_test.csv - Test dataset for tamil movie reviews

    Thirukkural Dataset

    From Wikipedia, The Tirukkural, or shortly the Kural, is a classic Tamil text consisting of 1,330 couplets or Kurals, dealing with the everyday virtues of an individual. It is one of the two oldest works now extant in Tamil literature.

    I have split the data into train and test and we can use the kural and / or the explanations to predict the three parts - aram (virtue), porul (polity) and inbam (love). The dataset is obtained from this link.

    • tamil_thirukkural_train - train dataset having 1064 rows
    • tamil_thirukkural_test - test dataset having 266 rows
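The row counts quoted above suggest roughly an 80/20 train/test split for all three tasks. The following is a minimal sketch that checks this using only the numbers stated in the description; the dictionary keys are shorthand chosen for illustration, not file or column names from the dataset.

```python
# Row counts exactly as stated in the dataset description: (train, test).
splits = {
    "news": (14521, 3631),
    "movie_reviews": (480, 121),
    "thirukkural": (1064, 266),
}


def train_fraction(train_rows: int, test_rows: int) -> float:
    """Share of all rows that fall in the training split."""
    return train_rows / (train_rows + test_rows)


fractions = {name: train_fraction(tr, te) for name, (tr, te) in splits.items()}

for name, frac in fractions.items():
    total = sum(splits[name])
    print(f"{name}: {total} rows in total, {frac:.1%} used for training")
```

The Thirukkural counts add up to 1,330 rows, which matches the 1,330 couplets mentioned in the description.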

    Will add more datasets in the following versions.

    Acknowledgements

    My sincere thanks to :

    • Malaikannan for starting this initiative
    • Selvakumar for getting the data
    • Vijay Anand for the Thirukkural data

    Inspiration

    Some questions which can be answered are

    1. Can we do text classification for Tamil and get accuracies similar to those for other languages?
    2. How do language models perform for Tamil?

    And a lot more interesting questions to be answered.

    Check out this link to find similar and dissimilar words for Tamil.

  2. Tamil (India) General Conversation Speech Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Tamil (India) General Conversation Speech Dataset [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-tamil-india
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Welcome to the Tamil Language General Conversation Speech Dataset, a comprehensive and diverse collection of voice data specifically curated to advance the development of Tamil language speech recognition models, with a particular focus on Indian accents and dialects.

    With high-quality audio recordings, detailed metadata, and accurate transcriptions, it empowers researchers and developers to enhance natural language processing, conversational AI, and Generative Voice AI algorithms. Moreover, it facilitates the creation of sophisticated voice assistants and voice bots tailored to the unique linguistic nuances found in the Tamil language spoken in India.

    Speech Data:

    This training dataset comprises 50 hours of audio recordings covering a wide range of topics and scenarios, ensuring robustness and accuracy in speech technology applications. To achieve this, we collaborated with a diverse network of 70 native Tamil speakers from different parts of Tamil Nadu. This collaborative effort guarantees a balanced representation of Indian accents, dialects, and demographics, reducing biases and promoting inclusivity.

    Each audio recording captures the essence of spontaneous, unscripted conversations between two individuals, with an average duration ranging from 15 to 60 minutes. The speech data is available in WAV format, with stereo channel files having a bit depth of 16 bits and a sample rate of 8 kHz. The recording environment is generally quiet, without background noise and echo.
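As a rough illustration of the stated audio format (stereo, 16-bit, 8 kHz WAV), the sketch below uses Python's standard `wave` module to check a file against those parameters and to estimate the uncompressed size of 50 hours of audio. The function names are hypothetical, and the WAV data is synthetic silence built in memory so the example runs without the dataset.

```python
import io
import wave

# Format stated in the description: stereo, 16-bit samples, 8 kHz.
CHANNELS = 2
SAMPLE_WIDTH_BYTES = 2  # 16 bits
SAMPLE_RATE_HZ = 8000


def matches_described_format(wav_file) -> bool:
    """True if the WAV file has the channel count, bit depth and rate described above."""
    with wave.open(wav_file, "rb") as wav:
        return (
            wav.getnchannels() == CHANNELS
            and wav.getsampwidth() == SAMPLE_WIDTH_BYTES
            and wav.getframerate() == SAMPLE_RATE_HZ
        )


def pcm_bytes(hours: float) -> int:
    """Uncompressed PCM payload size for the given duration (file headers excluded)."""
    return int(hours * 3600 * SAMPLE_RATE_HZ * CHANNELS * SAMPLE_WIDTH_BYTES)


# One second of synthetic silence, written to memory instead of disk.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(CHANNELS)
    out.setsampwidth(SAMPLE_WIDTH_BYTES)
    out.setframerate(SAMPLE_RATE_HZ)
    out.writeframes(b"\x00" * (SAMPLE_RATE_HZ * CHANNELS * SAMPLE_WIDTH_BYTES))
buffer.seek(0)

print(matches_described_format(buffer))
print(pcm_bytes(50) / 1e9, "GB of raw PCM for 50 hours")
```

At these settings the audio occupies 32,000 bytes per second, so 50 hours comes to about 5.76 GB before any compression.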

    Metadata:

    In addition to the audio recordings, our dataset provides comprehensive metadata for each participant. This metadata includes the participant's age, gender, country, state, and dialect. Furthermore, additional metadata such as recording device detail, topic of recording, bit depth, and sample rate will be provided.

    The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Tamil language speech recognition models.

    Transcription:

    This dataset provides a manual verbatim transcription of each audio file to enhance your workflow efficiency. The transcriptions are available in JSON format. The transcriptions capture speaker-wise transcription with time-coded segmentation along with non-speech labels and tags.
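The description does not publish the JSON schema, so the following is only a sketch under an assumed layout: a hypothetical `segments` list whose entries carry hypothetical `speaker`, `start`, `end`, and `text` fields. It shows how speaker-wise, time-coded segments of this kind could be summed into per-speaker talk time.

```python
import json
from collections import defaultdict

# HYPOTHETICAL transcript layout: every field name here is an assumption
# made for illustration, not the vendor's actual schema.
sample = json.loads("""
{
  "segments": [
    {"speaker": "A", "start": 0.0, "end": 2.5, "text": "..."},
    {"speaker": "B", "start": 2.5, "end": 4.0, "text": "[noise]"},
    {"speaker": "A", "start": 4.0, "end": 7.0, "text": "..."}
  ]
}
""")


def talk_time_per_speaker(transcript: dict) -> dict:
    """Sum segment durations (in seconds) for each speaker label."""
    totals = defaultdict(float)
    for segment in transcript["segments"]:
        totals[segment["speaker"]] += segment["end"] - segment["start"]
    return dict(totals)


print(talk_time_per_speaker(sample))
```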

    Our goal is to expedite the deployment of Tamil language conversational AI and NLP models by offering ready-to-use transcriptions, ultimately saving valuable time and resources in the development process.

    Updates and Customization:

    We understand the importance of collecting data in various environments to build robust ASR models. Therefore, our voice dataset is regularly updated with new audio data captured in diverse real-world conditions.

    If you require a custom training dataset with specific environmental conditions such as in-car, busy street, restaurant, or any other scenario, we can accommodate your request. We can provide voice data with customized sample rates ranging from 8kHz to 48kHz, allowing you to fine-tune your models for different audio recording setups. Additionally, we can also customize the transcription following your specific guidelines and requirements, to further support your ASR development process.

    License:

    This audio dataset, created by FutureBeeAI, is now available for commercial use.

    Conclusion:

    Whether you are training or fine-tuning speech recognition models, advancing NLP algorithms, exploring generative voice AI, or building cutting-edge voice assistants and bots, our dataset serves as a reliable and valuable resource.

  3. HPL Tamil Dataset

    • kaggle.com
    zip
    Updated Apr 9, 2024
    Cite
    Rohit Khadka (2024). HPL Tamil Dataset [Dataset]. https://www.kaggle.com/datasets/rohitkhadka375741/hpl-tamil
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Apr 9, 2024
    Authors
    Rohit Khadka
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    "HPL Tamil" dataset serves as a valuable resource for anyone interested in studying and analyzing the Tamil language, facilitating advancements in computational linguistics and NLP research.

  4. General domain Human-Human conversation chats in Tamil

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). General domain Human-Human conversation chats in Tamil [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/tamil-general-domain-conversation-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    This training dataset comprises more than 10,000 general-domain text conversations between two native Tamil speakers. The chats cover a variety of everyday topics, services, and issues, such as music, books, festivals, health, kids, family, environment, study, childhood, cuisine, internet, and movies, which makes the dataset diverse.

    These chats contain language-specific words and phrases and follow the native way of talking, which makes them more information-rich for your NLP model. Besides being specific to its topic, each chat contains attributes such as people's names, addresses, contact information, email addresses, times, dates, local currency, telephone numbers, and local slang in various formats, to make the text data unbiased.

    These chat scripts have between 300 and 700 words and up to 50 turns. 150 people that are a part of the FutureBeeAI crowd community contributed to this dataset. You will also receive chat metadata, such as participant age, gender, and country information, along with the chats. Dataset applications include conversational AI, natural language processing (NLP), smart assistants, text recognition, text analytics, and text prediction.

    This dataset is being expanded with new chats all the time. We are able to produce text data in a variety of languages to meet your unique requirements. Check out the FutureBeeAI community for a custom collection.

    This training dataset's licence belongs to FutureBeeAI!

  5. Tamil News Dataset

    • kaggle.com
    Updated Jan 4, 2020
    Cite
    Gaurav (2020). Tamil News Dataset [Dataset]. https://www.kaggle.com/disisbig/tamil-news-dataset/tasks
    Dataset updated
    Jan 4, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Gaurav
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This data set contains ~6500 news articles which were collected by Ravi from Tamil news websites.

    The data set has been cleaned and contains train and test sets that you can use to benchmark your classification models in Tamil.

    The scripts which were used to create the data set can be found here

    Credits: Full credit to Ravi for this Data set. Also, Thanks to thetamilhindu headline crawler built using news crawler from vanangamudi

  6. EnTam: An English-Tamil Parallel Corpus (EnTam v2.0) - Dataset - B2FIND

    • b2find.dkrz.de
    Updated Apr 28, 2023
    Cite
    (2023). EnTam: An English-Tamil Parallel Corpus (EnTam v2.0) - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/9eb44325-3708-574f-a0da-4e8ccff2aa66
    Dataset updated
    Apr 28, 2023
    License

    Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0): https://creativecommons.org/licenses/by-nc-sa/3.0/
    License information was derived automatically

    Description

    EnTam is a sentence-aligned English-Tamil bilingual corpus collected from publicly available websites for NLP research involving Tamil. A standard set of processing steps was applied to the raw web data before it was released as a sentence-aligned English-Tamil parallel corpus suitable for various NLP tasks. The parallel corpus includes texts from the Bible, cinema, and news domains.

  7. Claim Detection and Matching for Indian Languages

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Jun 6, 2021
    Cite
    Ashkan Kazemi; Kiran Garimella; Devin Gaffney; Scott A. Hale (2021). Claim Detection and Matching for Indian Languages [Dataset]. http://doi.org/10.5281/zenodo.4890950
    Explore at:
    Available download formats: csv
    Dataset updated
    Jun 6, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ashkan Kazemi; Kiran Garimella; Devin Gaffney; Scott A. Hale
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    India
    Description

    Two datasets are included in this repository: claim matching and claim detection datasets. The collections contain data in 5 languages: Bengali, English, Hindi, Malayalam and Tamil.

    The "claim detection" dataset contains textual claims from social media and fact-checking websites annotated for the "fact-check worthiness" of the claims in each message. Data points have one of the three labels of "Yes" (text contains one or more check-worthy claims), "No" and "Probably".

    The "claim matching" dataset is a curated collection of pairs of textual claims from social media and fact-checking websites for the purpose of automatic and multilingual claim matching. Pairs of data have one of the four labels of "Very Similar", "Somewhat Similar", "Somewhat Dissimilar" and "Very Dissimilar".
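The two label sets described above can be written down as constants and used to sanity-check annotations. This is a minimal sketch, not the authors' tooling: the function name is hypothetical and the toy annotations are made up purely to exercise it.

```python
from collections import Counter

# Label sets exactly as listed in the description above.
DETECTION_LABELS = ("Yes", "No", "Probably")
MATCHING_LABELS = (
    "Very Similar",
    "Somewhat Similar",
    "Somewhat Dissimilar",
    "Very Dissimilar",
)


def label_distribution(labels, allowed):
    """Count labels, rejecting any value outside the documented label set."""
    unknown = set(labels) - set(allowed)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    counts = Counter(labels)
    return {label: counts.get(label, 0) for label in allowed}


# Made-up annotations, used only to demonstrate the function.
toy = ["Yes", "No", "No", "Probably", "Yes", "No"]
print(label_distribution(toy, DETECTION_LABELS))
```

Rejecting unknown values early makes it easier to notice spelling variants or stray labels before they reach a classifier.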

    All personally identifiable information (PII) including phone numbers, email addresses, license plate numbers and addresses have been replaced with general tags (e.g.

    , etc) to protect user anonymity. A detailed explanation on the curation and annotation process is provided in our ACL 2021 paper:
    Kazemi, A.; Garimella, K.; Gaffney, D.; and Hale, S. A. 2021. Claim Matching Beyond English to Scale Global Fact-Checking. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, ACL 2021.

  8. Tamil Conversation Chat Dataset for Healthcare Domain

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Tamil Conversation Chat Dataset for Healthcare Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/tamil-healthcare-domain-conversation-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The dataset comprises over 12,000 chat conversations, each focusing on specific Healthcare related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.

    Participants Details: 200+ native Tamil participants from the FutureBeeAI community.
    Word Count & Length: Chats are diverse, averaging 300 to 700 words and 50 to 150 turns across both speakers.

    Topic Diversity

    The chat dataset covers a wide range of conversations on Healthcare topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Healthcare use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.

    Inbound Chats:
    Appointment Scheduling
    New Patient Registration
    Surgery Consultation
    Consultation regarding Diet, and many more
    Outbound Chats:
    Appointment Reminder
    Health & Wellness Subscription Programs
    Lab Test Results
    Health Risk Assessments
    Preventive Care Reminders, and many more

    Language Variety & Nuances

    The conversations in this dataset capture the diverse language styles and expressions prevalent in Tamil Healthcare interactions. This diversity ensures the dataset accurately represents the language used by Tamil speakers in Healthcare contexts.

    The dataset encompasses a wide array of language elements, including:

    Naming Conventions: Chats include a variety of Tamil personal and business names.
    Localized Details: Real-world addresses, emails, phone numbers, and other contact information according to different Tamil-speaking regions.
    Temporal and Numeric Expressions: Dates, times, currencies, and numbers in Tamil forms, adhering to local conventions.
    Idiomatic Expressions and Slang: It includes local slang, idioms, and informal phrases present in Tamil Healthcare conversations.

    This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to Tamil Healthcare interactions.

    Conversational Flow and Interaction Types

    The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Healthcare customer-agent interactions.

    Simple Inquiries
    Detailed Discussions
    Transactional Interactions
    Problem-Solving Dialogues
    Advisory Sessions
    Routine Checks and Follow-Ups

    Each of these conversations contains various aspects of conversation flow like:

    Greetings
    Authentication
    Information gathering
    Resolution identification
    Solution Delivery
    Closing and Follow-ups
    Feedback, etc

    This structured and varied conversational flow enables the creation of advanced NLP models that can effectively manage and respond to a wide range of customer service scenarios.

    Data Format and Structure

    The dataset is available in JSON, CSV, and TXT formats, with each conversation containing attributes like participant identifiers and chat messages, designed to

  9. Tamil Brainstorming Prompt & Response Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Tamil Brainstorming Prompt & Response Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/tamil-brainstorming-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Welcome to the Tamil Brainstorming Prompt-Response Dataset, a meticulously curated collection of 2000 prompt and response pairs. This dataset is a valuable resource for enhancing the creative and generative abilities of Language Models (LMs), a critical aspect in advancing generative AI.

    Dataset Content: This brainstorming dataset comprises a diverse set of prompts and responses where the prompt contains instruction, context, constraints, and restrictions while completion contains the most accurate response list for the given prompt. Both these prompts and completions are available in Tamil language.

    These prompt and completion pairs cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more. Each prompt is accompanied by a response, providing valuable information and insights to enhance the language model training process. Both the prompt and response were manually curated by native Tamil people, and references were taken from diverse sources like books, news articles, websites, and other reliable references.

    This dataset encompasses various prompt types, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. Additionally, you'll find prompts and responses containing rich text elements, such as tables, code, JSON, etc., all in proper markdown format.

    Prompt Diversity: To ensure diversity, our brainstorming dataset features prompts of varying complexity levels, ranging from easy to medium and hard. The prompts also vary in length, including short, medium, and long prompts, providing a comprehensive range. Furthermore, the dataset includes prompts with constraints and persona restrictions, making it exceptionally valuable for LLM training.

    Response Formats: Our dataset accommodates diverse learning experiences, offering responses across different domains depending on the prompt. For these brainstorming prompts, responses are generally provided in list format. These responses encompass text strings, numerical values, and dates, enhancing the language model's ability to generate reliable, coherent, and contextually appropriate answers.

    Data Format and Annotation Details: This fully labeled Tamil Brainstorming Prompt Completion Dataset is available in both JSON and CSV formats. It includes comprehensive annotation details, including a unique ID, prompt, prompt type, prompt length, prompt complexity, domain, response, and the presence of rich text.

    Quality and Accuracy: Our dataset upholds the highest standards of quality and accuracy. Each prompt undergoes meticulous validation, and the corresponding responses are thoroughly verified. We prioritize inclusivity, ensuring that the dataset incorporates prompts and completions representing diverse perspectives and writing styles, maintaining an unbiased and discrimination-free stance.

    The Tamil version is grammatically accurate without any spelling or grammatical errors. No copyrighted, toxic, or harmful content is used during the construction of this dataset.

    Continuous Updates and Customization: The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. We continuously work to expand this dataset, ensuring its ongoing growth and relevance. Additionally, FutureBeeAI offers the flexibility to curate custom brainstorming prompt and completion datasets tailored to specific requirements, providing you with customization options.

    License: This dataset, created by FutureBeeAI, is now available for commercial use. Researchers, data scientists, and developers can leverage this fully labeled and ready-to-deploy Tamil Brainstorming Prompt-Completion Dataset to enhance the creative and accurate response generation capabilities of their generative AI models and explore new approaches to NLP tasks.
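The annotation details listed in this description (unique ID, prompt, prompt type, prompt length, prompt complexity, domain, response, and presence of rich text) map naturally onto a record type. The sketch below assumes a CSV export; the column names and the `true`/`false` encoding are hypothetical, since the description names the annotations but not how the files spell them.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class PromptRecord:
    # Field names are hypothetical; only the list of annotations comes from the description.
    uid: str
    prompt: str
    prompt_type: str        # e.g. instruction, continuation, in-context learning
    prompt_length: str      # short, medium or long
    prompt_complexity: str  # easy, medium or hard
    domain: str
    response: str
    has_rich_text: bool


def read_records(csv_text: str) -> list[PromptRecord]:
    """Parse CSV text with the assumed column names into PromptRecord objects."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [
        PromptRecord(
            uid=row["uid"],
            prompt=row["prompt"],
            prompt_type=row["prompt_type"],
            prompt_length=row["prompt_length"],
            prompt_complexity=row["prompt_complexity"],
            domain=row["domain"],
            response=row["response"],
            # Assumed encoding of the rich-text flag as the strings "true"/"false".
            has_rich_text=row["has_rich_text"].strip().lower() == "true",
        )
        for row in rows
    ]


# Toy row, invented for illustration only.
toy_csv = (
    "uid,prompt,prompt_type,prompt_length,prompt_complexity,domain,response,has_rich_text\n"
    "1,example prompt,instruction,short,easy,science,example response,false\n"
)
records = read_records(toy_csv)
print(records[0])
```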

  10. XQA Tamil

    • kaggle.com
    zip
    Updated Sep 30, 2021
    Cite
    Manav Dhamani (2021). XQA Tamil [Dataset]. https://www.kaggle.com/mdhamani/xqa-tamil
    Explore at:
    Available download formats: zip (11676466 bytes)
    Dataset updated
    Sep 30, 2021
    Authors
    Manav Dhamani
    Description

    Dataset

    This dataset was created by Manav Dhamani

    Contents

  11. Tamil Chain of Thought Prompt & Response Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Tamil Chain of Thought Prompt & Response Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/tamil-chain-of-thought-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Welcome to the Tamil Chain of Thought prompt-response dataset, a meticulously curated collection containing 3000 comprehensive prompt and response pairs. This dataset is an invaluable resource for training Language Models (LMs) to generate well-reasoned answers and minimize inaccuracies. Its primary utility lies in enhancing LLMs' reasoning skills for solving arithmetic, common sense, symbolic reasoning, and complex problems.

    Dataset Content: This COT dataset comprises a diverse set of instructions and questions paired with corresponding answers and rationales in the Tamil language. These prompts and completions cover a broad range of topics and questions, including mathematical concepts, common sense reasoning, complex problem-solving, scientific inquiries, puzzles, and more. Each prompt is meticulously accompanied by a response and rationale, providing essential information and insights to enhance the language model training process. These prompts, completions, and rationales were manually curated by native Tamil people, drawing references from various sources, including open-source datasets, news articles, websites, and other reliable references. Our chain-of-thought prompt-completion dataset includes various prompt types, such as instructional prompts, continuations, and in-context learning (zero-shot, few-shot) prompts. Additionally, the dataset contains prompts and completions enriched with various forms of rich text, such as lists, tables, code snippets, JSON, and more, with proper markdown format.

    Prompt Diversity: To ensure a wide-ranging dataset, we have included prompts from a plethora of topics related to mathematics, common sense reasoning, and symbolic reasoning. These topics encompass arithmetic, percentages, ratios, geometry, analogies, spatial reasoning, temporal reasoning, logic puzzles, patterns, and sequences, among others. These prompts vary in complexity, spanning easy, medium, and hard levels. Various question types are included, such as multiple-choice, direct queries, and true/false assessments.

    Response Formats: To accommodate diverse learning experiences, our dataset incorporates different types of answers depending on the prompt and provides step-by-step rationales. The detailed rationale aids the language model in building reasoning process for complex questions. These responses encompass text strings, numerical values, and date and time formats, enhancing the language model's ability to generate reliable, coherent, and contextually appropriate answers.

    Data Format and Annotation Details: This fully labeled Tamil Chain of Thought Prompt Completion Dataset is available in JSON and CSV formats. It includes annotation details such as a unique ID, prompt, prompt type, prompt complexity, prompt category, domain, response, rationale, response type, and rich text presence.

    Quality and Accuracy: Our dataset upholds the highest standards of quality and accuracy. Each prompt undergoes meticulous validation, and the corresponding responses and rationales are thoroughly verified. We prioritize inclusivity, ensuring that the dataset incorporates prompts and completions representing diverse perspectives and writing styles, maintaining an unbiased and discrimination-free stance. The Tamil version is grammatically accurate without any spelling or grammatical errors. No copyrighted, toxic, or harmful content is used during the construction of this dataset.

    Continuous Updates and Customization: The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Ongoing efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to gather custom chain of thought prompt completion data tailored to specific needs, providing flexibility and customization options.

    License: The dataset, created by FutureBeeAI, is now available for commercial use. Researchers, data scientists, and developers can leverage this fully labeled and ready-to-deploy Tamil Chain of Thought Prompt Completion Dataset to enhance the rationale and accurate response generation capabilities of their generative AI models and explore new approaches to NLP tasks.

  12. Data from: Text classification model fastText-Trendi-Topics 1.0

    • live.european-language-grid.eu
    • clarin.si
    Updated Oct 27, 2022
    Cite
    (2022). Text classification model fastText-Trendi-Topics 1.0 [Dataset]. https://live.european-language-grid.eu/catalogue/tool-service/20819
    Dataset updated
    Oct 27, 2022
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    The fastText-Trendi-Topics model is a text classification model for categorizing news texts with one of 13 topic labels. It was trained on a set of approx. 36,000 Slovene texts from various Slovene news sources included in the Trendi Monitor Corpus of Slovene (http://hdl.handle.net/11356/1590) such as "rtvslo.si", "sta.si", "delo.si", "dnevnik.si", "vecer.com", "24ur.com", "siol.net", "gorenjskiglas.si", etc.

    The texts were semi-automatically categorized into 13 categories based on the sections under which they were published (i.e. URLs). The set of labels was developed in accordance with related categorization schemas used in other corpora and comprises the following topics: "črna kronika" (crime and accidents), "gospodarstvo, posel, finance" (economy, business, finance), "izobraževanje" (education), "okolje" (environment), "prosti čas" (free time), "šport" (sport), "umetnost, kultura" (art, culture), "vreme" (weather), "zabava" (entertainment), "zdravje" (health), "znanost in tehnologija" (science and technology), "politika" (politics), and "družba" (society). The categorization process is explained in more detail in Kosem et al. (2022): https://nl.ijs.si/jtdh22/pdf/JTDH2022_Kosem-et-al_Spremljevalni-korpus-Trendi.pdf

    The model was trained on the labeled texts using the word embeddings CLARIN.SI-embed.sl 1.0 (http://hdl.handle.net/11356/1204) and validated on a development set of 1,293 texts using the fastText library, 1000 epochs, and default values for the rest of the hyperparameters (see https://github.com/TajaKuzman/FastText-Classification-SLED for the full code).

    The model achieves a macro-F1-score of 0.85 on a test set of 1,295 texts (best for "vreme" at 0.97, worst for "prosti čas" at 0.67).
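For readers unfamiliar with the metric, macro-F1 is the unweighted mean of the per-class F1 scores, so a weak class such as "prosti čas" pulls the average down as much as a strong class lifts it. The sketch below is a plain-Python illustration of that definition, not the evaluation code used by the model's authors; the toy labels reuse three of the topic names above and the predictions are invented.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denominator = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Invented toy data: one "šport" text is misclassified as "politika".
y_true = ["vreme", "vreme", "šport", "šport", "politika", "politika"]
y_pred = ["vreme", "vreme", "šport", "politika", "politika", "politika"]

print(round(macro_f1(y_true, y_pred), 3))
```

In this toy case the per-class F1 scores are 1.0 for "vreme", 2/3 for "šport", and 0.8 for "politika", and the macro average is their simple mean.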

    Please note that the SloBERTa-Trendi-Topics 1.0 text classification model is also available (http://hdl.handle.net/11356/1709) that achieves higher classification accuracy, but is slower and computationally more demanding.

  13. Tamil Open Ended Classification Prompt & Response Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Tamil Open Ended Classification Prompt & Response Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/tamil-open-ended-classification-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Welcome to the Tamil Open Ended Classification Prompt-Response Dataset—an extensive collection of 3000 meticulously curated prompt and response pairs. This dataset is a valuable resource for training Language Models (LMs) to classify input text accurately, a crucial aspect in advancing generative AI.

    Dataset Content: This open-ended classification dataset comprises a diverse set of prompts and responses. Each prompt contains the input text to be classified and may also contain task instructions, context, constraints, and restrictions; each completion contains the best classification category as the response. Both prompts and completions are in Tamil. Because the dataset is open-ended, prompts do not list candidate categories to choose from.

    These prompt and completion pairs cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more. Each prompt is accompanied by a response, providing valuable information and insights to enhance the language model training process. Both the prompts and the responses were manually curated by native Tamil speakers, with references taken from diverse sources such as books, news articles, websites, and other reliable references.

    This open-ended classification prompt and completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains prompts and responses with different types of rich text, including tables, code, JSON, etc., with proper markdown.

    Prompt Diversity: To ensure diversity, this open-ended classification dataset includes prompts with varying complexity levels, ranging from easy to medium and hard. Additionally, prompts are diverse in terms of length from short to medium and long, creating a comprehensive variety. The classification dataset also contains prompts with constraints and persona restrictions, which makes it even more useful for LLM training.

    Response Formats: To accommodate diverse learning experiences, our dataset incorporates different types of responses depending on the prompt. These formats include single-word, short phrase, and single sentence type of response. These responses encompass text strings, numerical values, and date and time formats, enhancing the language model's ability to generate reliable, coherent, and contextually appropriate answers.

    Data Format and Annotation Details: This fully labeled Tamil Open Ended Classification Prompt Completion Dataset is available in JSON and CSV formats. It includes annotation details such as a unique ID, prompt, prompt type, prompt length, prompt complexity, domain, response, response type, and rich text presence.

    Quality and Accuracy: Our dataset upholds the highest standards of quality and accuracy. Each prompt undergoes meticulous validation, and the corresponding responses are thoroughly verified. We prioritize inclusivity, ensuring that the dataset incorporates prompts and completions representing diverse perspectives and writing styles, maintaining an unbiased and discrimination-free stance.
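    To make the annotation details concrete, the sketch below checks that a JSON record carries the listed fields. The key names (`id`, `prompt_type`, `rich_text`, and so on) and the sample record are hypothetical; the description names the annotations but not the exact keys, so they should be adjusted to the delivered files.

```python
import json

# Hypothetical key names derived from the annotation list in the description.
REQUIRED_FIELDS = {
    "id", "prompt", "prompt_type", "prompt_length", "prompt_complexity",
    "domain", "response", "response_type", "rich_text",
}


def missing_fields(record: dict) -> set:
    """Return the annotation fields that are absent from a record."""
    return REQUIRED_FIELDS - set(record)


# Placeholder record, not taken from the dataset.
sample = json.loads("""
{
  "id": "example-0001",
  "prompt": "placeholder prompt text",
  "prompt_type": "instruction",
  "prompt_length": "short",
  "prompt_complexity": "easy",
  "domain": "science",
  "response": "placeholder category",
  "response_type": "single word",
  "rich_text": false
}
""")
problems = missing_fields(sample)
```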

    The Tamil version is grammatically accurate without any spelling or grammatical errors. No copyrighted, toxic, or harmful content is used during the construction of this dataset.

    Continuous Updates and Customization: The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Ongoing efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to gather custom open-ended classification prompt and completion data tailored to specific needs, providing flexibility and customization options.

    License: The dataset, created by FutureBeeAI, is now available for commercial use. Researchers, data scientists, and developers can leverage this fully labeled and ready-to-deploy Tamil Open Ended Classification Prompt-Completion Dataset to enhance the classification abilities and accurate response generation capabilities of their generative AI models and explore new approaches to NLP tasks.

  14. bhasha-wiki

    • huggingface.co
    Updated Apr 16, 2024
    Cite
    Soket Labs (2024). bhasha-wiki [Dataset]. https://huggingface.co/datasets/soketlabs/bhasha-wiki
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 16, 2024
    Dataset authored and provided by
    Soket Labs
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Dataset Card for Bhasha-Wiki

    Translated Wikipedia articles

    Dataset Details

    Dataset is being updated

    Dataset Description

    We have translated 6.4 million English Wikipedia articles into 6 Indic languages. The translations were done using the IndicTrans2 model.

    Curated by: Soket AI labs
    Language(s) (NLP): Hindi, Bengali, Gujarati, Tamil, Kannada, Urdu
    License: cc-by-sa-3.0

    Uses

    For pretraining or fine-tuning of Indic language models… See the full description on the dataset page: https://huggingface.co/datasets/soketlabs/bhasha-wiki.

  15. TAUS Language Translation Data | Parallel translation for Colloquial English...

    • datarade.ai
    Updated Dec 15, 2020
    Cite
    TAUS (2020). TAUS Language Translation Data | Parallel translation for Colloquial English into various languages for Machine Learning [Dataset]. https://datarade.ai/data-products/taus-parallel-text-colloquial-domain-english-low-resource-see-description-taus
    Explore at:
    Available download formats: .xml, .csv, .xls, .txt
    Dataset updated
    Dec 15, 2020
    Dataset authored and provided by
    TAUS
    Area covered
    Myanmar, Indonesia, Nepal, Iraq, Iran (Islamic Republic of), Turkey, Lao People's Democratic Republic, Bangladesh, Timor-Leste, Vietnam
    Description

    The corpus is a great fit for training chat bots or social media content, and will give the conversation with your local audience a friendly, casual tone. From product user reviews and blog post comments to everyday business small talk, your MT engine will be able to handle even the most creative user voices.

    This corpus contains over 1 million words, and a total vocabulary of more than 37000 different words. Need more data? In the following months, TAUS will release more equally sized corpora for the same domain and language combinations, with a significant increase of vocabulary.

    Language pairs: English - Hindi, English - Urdu, English - Tamil, English - Nepali, English - Turkish, English - Pashto, English - Sorani, English - Bengali, English - Burmese, English - Assamese, English - Telugu, English - Sinhalese, English - Dari, English - Punjabi (Pakistan), English - Punjabi (India), English - Lao, English - Kurmanji (lat), English - Kurmanji (arab)

    Other languages are available on demand.

  16. Ponniyan selvan Tamil Book for NLP

    • kaggle.com
    zip
    Updated Sep 9, 2020
    Cite
    Dinesh Kumar Sarangapani (2020). Ponniyan selvan Tamil Book for NLP [Dataset]. https://www.kaggle.com/dineshkumarsarang/ponniyan-selvan-tamil-book-for-nlp
    Explore at:
    Available download formats: zip (1985053 bytes)
    Dataset updated
    Sep 9, 2020
    Authors
    Dinesh Kumar Sarangapani
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Dinesh Kumar Sarangapani

    Released under CC0: Public Domain

    Contents

  17. HeLI-OTS: Language Identifier

    • live.european-language-grid.eu
    Updated Nov 4, 2022
    Cite
    (2022). HeLI-OTS: Language Identifier [Dataset]. https://live.european-language-grid.eu/catalogue/tool-service/18085
    Explore at:
    Dataset updated
    Nov 4, 2022
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    HeLI off-the-shelf language identifier with language models for 200 languages. The API endpoint returns a list of n-best language predictions, each with an associated score; lower scores represent higher confidence.
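    Because lower scores mean higher confidence, the top prediction is the entry with the minimum score. The sketch below illustrates this; the response shape (a list of dicts with `language` and `score` keys) and the values are assumptions for illustration, not the documented output of the HeLI-OTS endpoint.

```python
def best_language(predictions):
    """Return the language code whose score is lowest (most confident)."""
    return min(predictions, key=lambda p: p["score"])["language"]


# Made-up n-best list, for illustration only.
n_best = [
    {"language": "est", "score": 4.57},
    {"language": "fin", "score": 3.92},
    {"language": "krl", "score": 4.81},
]
top = best_language(n_best)
```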

    The original HeLI-OTS code is published under the Apache Licence version 2.0, and is copyright Tommi Jauhiainen and Heidi Jauhiainen, University of Helsinki (2022). Further information: documentation and landing page.

  18. Tilde Automatic Speech Recognition (ASR), Lithuanian Language

    • live.european-language-grid.eu
    Updated Dec 31, 2013
    Cite
    (2013). Tilde Automatic Speech Recognition (ASR), Lithuanian Language [Dataset]. https://live.european-language-grid.eu/catalogue/tool-service/621
    Explore at:
    Dataset updated
    Dec 31, 2013
    License

    https://tilde.com/products-and-services/machine-translation

    Description

    Tilde has worked on spoken language processing since the late 1990s. Special attention is paid to the data sparseness problem that is typical of morphologically rich languages and to novel methods for data acquisition from the web. Tilde continues its research on speech recognition by adapting the developed technologies to new languages and specific domains.

  19. TurkuNLP: Finnish NER

    • live.european-language-grid.eu
    Updated Sep 8, 2022
    Cite
    (2022). TurkuNLP: Finnish NER [Dataset]. https://live.european-language-grid.eu/catalogue/tool-service/20211
    Explore at:
    Dataset updated
    Sep 8, 2022
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Finnish named entity recognizer based on the work of TurkuNLP (Sampo Pyysalo et al.). See the documentation and landing page for further information.

  20. IndicNLP Corpus Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Apr 29, 2020
    Cite
    Anoop Kunchukuttan; Divyanshu Kakwani; Satish Golla; Gokul N. C.; Avik Bhattacharyya; Mitesh M. Khapra; Pratyush Kumar (2020). IndicNLP Corpus Dataset [Dataset]. https://paperswithcode.com/dataset/indicnlp-corpus
    Explore at:
    Dataset updated
    Apr 29, 2020
    Authors
    Anoop Kunchukuttan; Divyanshu Kakwani; Satish Golla; Gokul N. C.; Avik Bhattacharyya; Mitesh M. Khapra; Pratyush Kumar
    Description

    The IndicNLP corpus is a large-scale, general-domain corpus containing 2.7 billion words for 10 Indian languages from two language families.

Cite
SRK (2019). Tamil NLP [Dataset]. https://www.kaggle.com/datasets/sudalairajkumar/tamil-nlp/data

Tamil NLP

Datasets for Natural Language Processing in Tamil

Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Mar 11, 2019
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
SRK
License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically

Description

Context

Indic NLP - Natural Language Processing for Indian Languages.

This dataset is a step towards the same for the Tamil language. Thanks to Malaikannan for the initiative and to Selva for getting the data from websites. The idea is to collect more datasets related to Tamil NLP in a single place.

Content

The dataset has the following files.

Tamil News Classification

This dataset has 14521 rows for training and 3631 rows for testing. It has 6 news categories - "tamilnadu", "india", "cinema", "sports", "politics", "world". The data is obtained from this link

  • tamil_news_train.csv - Train dataset for Tamil news classification
  • tamil_news_test.csv - Test dataset for Tamil news classification
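A quick first step with files like these is to check how the rows are distributed over the news categories. The sketch below counts labels with the standard library; the column name `category` and the in-memory sample rows are assumptions, since the description does not list the CSV columns.

```python
import csv
import io
from collections import Counter


def label_counts(csv_file, label_column="category"):
    """Count rows per label in an open CSV file with a header row."""
    reader = csv.DictReader(csv_file)
    return Counter(row[label_column] for row in reader)


# Placeholder rows standing in for the real file (hypothetical columns).
sample = io.StringIO(
    "category,text\n"
    "sports,placeholder headline one\n"
    "cinema,placeholder headline two\n"
    "sports,placeholder headline three\n"
)
counts = label_counts(sample)
```

With the real data, the same function could be called on `open("tamil_news_train.csv", encoding="utf-8", newline="")`, after adjusting `label_column` to the actual header.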

Tamil Movie Review Dataset

This dataset has 480 training samples and 121 testing samples. It has the review text in Tamil and ratings from 1 to 5. The data is obtained from this link

  • tamil_movie_reviews_train.csv - Train dataset for Tamil movie reviews
  • tamil_movie_reviews_test.csv - Test dataset for Tamil movie reviews

Thirukkural Dataset

From Wikipedia, The Tirukkural, or shortly the Kural, is a classic Tamil text consisting of 1,330 couplets or Kurals, dealing with the everyday virtues of an individual. It is one of the two oldest works now extant in Tamil literature.

I have split the data into train and test sets; the kural and/or the explanations can be used to predict the three parts: aram (virtue), porul (polity), and inbam (love). The dataset is obtained from this link.

  • tamil_thirukkural_train - train dataset having 1064 rows
  • tamil_thirukkural_test - test dataset having 266 rows

Will add more datasets in the following versions.

Acknowledgements

My sincere thanks to:

  • Malaikannan for starting this initiative
  • Selvakumar for getting the data
  • Vijay Anand for the Thirukkural data

Inspiration

Some questions that can be answered are:

  1. Can we do text classification for Tamil and get accuracies similar to those for other languages?
  2. How do language models perform for Tamil?

Many more interesting questions remain to be answered.

Check out this link to find similar and dissimilar words for Tamil.
