39 datasets found
  1. Ponniyan selvan Tamil Book for NLP

    • kaggle.com
    zip
    Updated Sep 9, 2020
    Cite
    Dinesh Kumar Sarangapani (2020). Ponniyan selvan Tamil Book for NLP [Dataset]. https://www.kaggle.com/dineshkumarsarang/ponniyan-selvan-tamil-book-for-nlp
    Explore at:
    Available download formats: zip (1985053 bytes)
    Dataset updated
    Sep 9, 2020
    Authors
    Dinesh Kumar Sarangapani
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Dinesh Kumar Sarangapani

    Released under CC0: Public Domain

    Contents

  2. Tamil (India) General Conversation Speech Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Tamil (India) General Conversation Speech Dataset [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-tamil-india
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Welcome to the Tamil Language General Conversation Speech Dataset, a comprehensive and diverse collection of voice data specifically curated to advance the development of Tamil language speech recognition models, with a particular focus on Indian accents and dialects.

    With high-quality audio recordings, detailed metadata, and accurate transcriptions, it empowers researchers and developers to enhance natural language processing, conversational AI, and Generative Voice AI algorithms. Moreover, it facilitates the creation of sophisticated voice assistants and voice bots tailored to the unique linguistic nuances found in the Tamil language spoken in India.

    Speech Data:

    This training dataset comprises 50 hours of audio recordings covering a wide range of topics and scenarios, ensuring robustness and accuracy in speech technology applications. To achieve this, we collaborated with a diverse network of 70 native Tamil speakers from different parts of Tamil Nadu. This collaborative effort guarantees a balanced representation of Indian accents, dialects, and demographics, reducing biases and promoting inclusivity.

    Each audio recording captures the essence of spontaneous, unscripted conversations between two individuals, with an average duration ranging from 15 to 60 minutes. The speech data is available in WAV format, with stereo channel files having a bit depth of 16 bits and a sample rate of 8 kHz. The recording environment is generally quiet, without background noise and echo.
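
    The stated audio parameters (stereo, 16-bit, 8 kHz WAV) can be checked on a downloaded file before training. The following is a minimal sketch using Python's standard `wave` module; the function names and the `EXPECTED` table are illustrative assumptions, not part of the dataset's tooling.

```python
import wave

# Values taken from the dataset description above (stereo, 16-bit, 8 kHz).
EXPECTED = {"channels": 2, "sample_width_bytes": 2, "sample_rate_hz": 8000}


def describe_wav(path):
    """Return the basic PCM parameters of a WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),  # 2 bytes = 16 bits
            "sample_rate_hz": rate,
            "duration_s": frames / rate,
        }


def matches_description(info):
    """True if the file is stereo, 16-bit, 8 kHz as the listing states."""
    return all(info[key] == value for key, value in EXPECTED.items())
```

    Calling `matches_description(describe_wav("recording.wav"))` on each file (the file name here is hypothetical) flags recordings that would need resampling or channel handling before use.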

    Metadata:

    In addition to the audio recordings, our dataset provides comprehensive metadata for each participant. This metadata includes the participant's age, gender, country, state, and dialect. Furthermore, additional metadata such as recording device detail, topic of recording, bit depth, and sample rate will be provided.

    The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Tamil language speech recognition models.

    Transcription:

    This dataset provides a manual verbatim transcription of each audio file to enhance your workflow efficiency. The transcriptions are available in JSON format. The transcriptions capture speaker-wise transcription with time-coded segmentation along with non-speech labels and tags.
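
    As an illustration of how speaker-wise, time-coded segments might be consumed, the sketch below sums talk time per speaker. The listing does not publish the JSON schema, so the field names (`segments`, `speaker`, `start`, `end`, `text`) and the sample values are assumptions made only for this example.

```python
import json
from collections import defaultdict

# Hypothetical transcript layout; the real field names may differ.
SAMPLE = """
{"segments": [
  {"speaker": "A", "start": 0.0, "end": 2.5, "text": "(speech)"},
  {"speaker": "B", "start": 2.5, "end": 4.0, "text": "[noise]"},
  {"speaker": "A", "start": 4.0, "end": 7.0, "text": "(speech)"}
]}
"""


def talk_time_by_speaker(transcript):
    """Sum segment durations (in seconds) for each speaker label."""
    totals = defaultdict(float)
    for seg in transcript["segments"]:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)


totals = talk_time_by_speaker(json.loads(SAMPLE))
print(totals)
```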

    Our goal is to expedite the deployment of Tamil language conversational AI and NLP models by offering ready-to-use transcriptions, ultimately saving valuable time and resources in the development process.

    Updates and Customization:

    We understand the importance of collecting data in various environments to build robust ASR models. Therefore, our voice dataset is regularly updated with new audio data captured in diverse real-world conditions.

    If you require a custom training dataset with specific environmental conditions such as in-car, busy street, restaurant, or any other scenario, we can accommodate your request. We can provide voice data with customized sample rates ranging from 8kHz to 48kHz, allowing you to fine-tune your models for different audio recording setups. Additionally, we can also customize the transcription following your specific guidelines and requirements, to further support your ASR development process.

    License:

    This audio dataset, created by FutureBeeAI, is now available for commercial use.

    Conclusion:

    Whether you are training or fine-tuning speech recognition models, advancing NLP algorithms, exploring generative voice AI, or building cutting-edge voice assistants and bots, our dataset serves as a reliable and valuable resource.

  3. HPL Tamil Dataset

    • kaggle.com
    zip
    Updated Apr 9, 2024
    Cite
    Rohit Khadka (2024). HPL Tamil Dataset [Dataset]. https://www.kaggle.com/datasets/rohitkhadka375741/hpl-tamil
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Apr 9, 2024
    Authors
    Rohit Khadka
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    "HPL Tamil" dataset serves as a valuable resource for anyone interested in studying and analyzing the Tamil language, facilitating advancements in computational linguistics and NLP research.

  4. EnTam: An English-Tamil Parallel Corpus (EnTam v2.0) - Dataset - B2FIND

    • b2find.dkrz.de
    + more versions
    Cite
    EnTam: An English-Tamil Parallel Corpus (EnTam v2.0) - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/9eb44325-3708-574f-a0da-4e8ccff2aa66
    Explore at:
    License

    Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0) https://creativecommons.org/licenses/by-nc-sa/3.0/
    License information was derived automatically

    Description

    EnTam is a sentence-aligned English-Tamil bilingual corpus that we collected from publicly available websites for NLP research involving Tamil. A standard set of processing steps was applied to the raw web data before it was released as a sentence-aligned English-Tamil parallel corpus suitable for various NLP tasks. The parallel corpus includes texts from the Bible, cinema, and news domains.

  5. Tamil Open Ended Classification Prompt & Response Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Tamil Open Ended Classification Prompt & Response Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/tamil-open-ended-classification-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Welcome to the Tamil Open Ended Classification Prompt-Response Dataset—an extensive collection of 3000 meticulously curated prompt and response pairs. This dataset is a valuable resource for training Language Models (LMs) to classify input text accurately, a crucial aspect in advancing generative AI.

    Dataset Content: This open-ended classification dataset comprises a diverse set of prompts and responses. Each prompt contains the input text to be classified and may also contain a task instruction, context, constraints, and restrictions, while the completion contains the best classification category as the response. Both prompts and completions are in the Tamil language. Because this is an open-ended dataset, the prompt does not list options from which to choose the classification category.

    These prompt and completion pairs cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more. Each prompt is accompanied by a response, providing valuable information and insights to enhance the language model training process. Both the prompt and response were manually curated by native Tamil people, and references were taken from diverse sources like books, news articles, websites, and other reliable references.

    This open-ended classification prompt and completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains prompts and responses with different types of rich text, including tables, code, JSON, etc., with proper markdown.

    Prompt Diversity: To ensure diversity, this open-ended classification dataset includes prompts with varying complexity levels, ranging from easy to medium and hard. Additionally, prompts are diverse in terms of length from short to medium and long, creating a comprehensive variety. The classification dataset also contains prompts with constraints and persona restrictions, which makes it even more useful for LLM training.

    Response Formats: To accommodate diverse learning experiences, our dataset incorporates different types of responses depending on the prompt. These formats include single-word, short phrase, and single sentence type of response. These responses encompass text strings, numerical values, and date and time formats, enhancing the language model's ability to generate reliable, coherent, and contextually appropriate answers.

    Data Format and Annotation Details: This fully labeled Tamil Open Ended Classification Prompt Completion Dataset is available in JSON and CSV formats. It includes annotation details such as a unique ID, prompt, prompt type, prompt length, prompt complexity, domain, response, response type, and rich text presence.

    Quality and Accuracy: Our dataset upholds the highest standards of quality and accuracy. Each prompt undergoes meticulous validation, and the corresponding responses are thoroughly verified. We prioritize inclusivity, ensuring that the dataset incorporates prompts and completions representing diverse perspectives and writing styles, maintaining an unbiased and discrimination-free stance.

    The Tamil version is grammatically accurate without any spelling or grammatical errors. No copyrighted, toxic, or harmful content is used during the construction of this dataset.

    Continuous Updates and Customization: The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Ongoing efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to gather custom open-ended classification prompt and completion data tailored to specific needs, providing flexibility and customization options.

    License: The dataset, created by FutureBeeAI, is now available for commercial use. Researchers, data scientists, and developers can leverage this fully labeled and ready-to-deploy Tamil Open Ended Classification Prompt-Completion Dataset to enhance the classification abilities and accurate response generation capabilities of their generative AI models and explore new approaches to NLP tasks.
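
    The annotation fields named in the data-format paragraph above lend themselves to simple filtering, for example selecting only hard prompts. This is a rough sketch: the column names and the sample rows are assumptions derived from the field list, since the listing does not show the actual CSV header.

```python
import csv
import io

# Hypothetical column names based on the annotation details listed above.
SAMPLE_CSV = """id,prompt,prompt_type,prompt_length,prompt_complexity,domain,response,response_type,rich_text
1,prompt text 1,instruction,short,easy,science,category A,single word,no
2,prompt text 2,few-shot,long,hard,history,category B,short phrase,yes
3,prompt text 3,continuation,medium,hard,science,category C,single sentence,no
"""


def select_rows(csv_text, complexity):
    """Return the rows whose prompt_complexity equals the given level."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row["prompt_complexity"] == complexity]


hard_rows = select_rows(SAMPLE_CSV, "hard")
print([row["id"] for row in hard_rows])
```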

  6. Claim Detection and Matching for Indian Languages

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Jun 6, 2021
    Cite
    Ashkan Kazemi; Kiran Garimella; Devin Gaffney; Scott A. Hale (2021). Claim Detection and Matching for Indian Languages [Dataset]. http://doi.org/10.5281/zenodo.4890950
    Explore at:
    Available download formats: csv
    Dataset updated
    Jun 6, 2021
    Dataset provided by
    Zenodo http://zenodo.org/
    Authors
    Ashkan Kazemi; Kiran Garimella; Devin Gaffney; Scott A. Hale
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    India
    Description

    Two datasets are included in this repository: claim matching and claim detection datasets. The collections contain data in 5 languages: Bengali, English, Hindi, Malayalam and Tamil.

    The "claim detection" dataset contains textual claims from social media and fact-checking websites annotated for the "fact-check worthiness" of the claims in each message. Data points have one of the three labels of "Yes" (text contains one or more check-worthy claims), "No" and "Probably".

    The "claim matching" dataset is a curated collection of pairs of textual claims from social media and fact-checking websites for the purpose of automatic and multilingual claim matching. Pairs of data have one of the four labels of "Very Similar", "Somewhat Similar", "Somewhat Dissimilar" and "Very Dissimilar".
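
    Because both datasets use small, closed label sets, a loader can validate labels before computing a distribution. The sketch below uses the label strings quoted above; the function name and the sample label list are illustrative, and the actual CSV column names are not given in this listing.

```python
from collections import Counter

# Label sets quoted in the description above.
DETECTION_LABELS = {"Yes", "No", "Probably"}
MATCHING_LABELS = {
    "Very Similar",
    "Somewhat Similar",
    "Somewhat Dissimilar",
    "Very Dissimilar",
}


def label_counts(labels, allowed):
    """Count labels, rejecting any value outside the allowed set."""
    unknown = set(labels) - allowed
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    return Counter(labels)


counts = label_counts(["Yes", "No", "Yes", "Probably"], DETECTION_LABELS)
print(counts)
```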

    All personally identifiable information (PII), including phone numbers, email addresses, license plate numbers, and addresses, has been replaced with general tags to protect user anonymity. A detailed explanation of the curation and annotation process is provided in our ACL 2021 paper:
    Kazemi, A.; Garimella, K.; Gaffney, D.; and Hale, S. A. 2021. Claim Matching Beyond English to Scale Global Fact-Checking. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, ACL 2021.

  7. General domain Human-Human conversation chats in Tamil

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). General domain Human-Human conversation chats in Tamil [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/tamil-general-domain-conversation-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    This training dataset comprises more than 10,000 text conversations, each between two native Tamil speakers, in the general domain. The chats cover a variety of everyday topics, services, and issues, such as music, books, festivals, health, kids, family, environment, study, childhood, cuisine, internet, and movies, which makes the dataset diverse.

    The chats use language-specific words and phrases and follow the native way of talking, which makes them more information-rich for your NLP model. Besides being specific to its topic, each chat contains attributes such as people's names, addresses, contact information, email addresses, times, dates, local currency, telephone numbers, and local slang in various formats to make the text data unbiased.

    Each chat script has between 300 and 700 words and up to 50 turns. 150 members of the FutureBeeAI crowd community contributed to this dataset. You will also receive chat metadata, such as participant age, gender, and country information, along with the chats. Dataset applications include conversational AI, natural language processing (NLP), smart assistants, text recognition, text analytics, and text prediction.

    This dataset is being expanded with new chats all the time. We are able to produce text data in a variety of languages to meet your unique requirements. Check out the FutureBeeAI community for a custom collection.

    This training dataset's licence belongs to FutureBeeAI!

  8. Tamil Brainstorming Prompt & Response Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Tamil Brainstorming Prompt & Response Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/tamil-brainstorming-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Welcome to the Tamil Brainstorming Prompt-Response Dataset, a meticulously curated collection of 2000 prompt and response pairs. This dataset is a valuable resource for enhancing the creative and generative abilities of Language Models (LMs), a critical aspect in advancing generative AI.

    Dataset Content: This brainstorming dataset comprises a diverse set of prompts and responses where the prompt contains instruction, context, constraints, and restrictions while completion contains the most accurate response list for the given prompt. Both these prompts and completions are available in Tamil language.

    These prompt and completion pairs cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more. Each prompt is accompanied by a response, providing valuable information and insights to enhance the language model training process. Both the prompt and response were manually curated by native Tamil people, and references were taken from diverse sources like books, news articles, websites, and other reliable references.

    This dataset encompasses various prompt types, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. Additionally, you'll find prompts and responses containing rich text elements, such as tables, code, JSON, etc., all in proper markdown format.

    Prompt Diversity: To ensure diversity, our brainstorming dataset features prompts of varying complexity levels, ranging from easy to medium and hard. The prompts also vary in length, including short, medium, and long prompts, providing a comprehensive range. Furthermore, the dataset includes prompts with constraints and persona restrictions, making it exceptionally valuable for LLM training.

    Response Formats: Our dataset accommodates diverse learning experiences, offering responses across different domains depending on the prompt. For these brainstorming prompts, responses are generally provided in list format. These responses encompass text strings, numerical values, and dates, enhancing the language model's ability to generate reliable, coherent, and contextually appropriate answers.

    Data Format and Annotation Details: This fully labeled Tamil Brainstorming Prompt Completion Dataset is available in both JSON and CSV formats. It includes comprehensive annotation details, including a unique ID, prompt, prompt type, prompt length, prompt complexity, domain, response, and the presence of rich text.

    Quality and Accuracy: Our dataset upholds the highest standards of quality and accuracy. Each prompt undergoes meticulous validation, and the corresponding responses are thoroughly verified. We prioritize inclusivity, ensuring that the dataset incorporates prompts and completions representing diverse perspectives and writing styles, maintaining an unbiased and discrimination-free stance.

    The Tamil version is grammatically accurate without any spelling or grammatical errors. No copyrighted, toxic, or harmful content is used during the construction of this dataset.

    Continuous Updates and Customization: The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. We continuously work to expand this dataset, ensuring its ongoing growth and relevance. Additionally, FutureBeeAI offers the flexibility to curate custom brainstorming prompt and completion datasets tailored to specific requirements, providing you with customization options.

    License: This dataset, created by FutureBeeAI, is now available for commercial use. Researchers, data scientists, and developers can leverage this fully labeled and ready-to-deploy Tamil Brainstorming Prompt-Completion Dataset to enhance the creative and accurate response generation capabilities of their generative AI models and explore new approaches to NLP tasks.

  9. XQA Tamil

    • kaggle.com
    zip
    Updated Sep 30, 2021
    Cite
    Manav Dhamani (2021). XQA Tamil [Dataset]. https://www.kaggle.com/mdhamani/xqa-tamil
    Explore at:
    Available download formats: zip (11676466 bytes)
    Dataset updated
    Sep 30, 2021
    Authors
    Manav Dhamani
    Description

    Dataset

    This dataset was created by Manav Dhamani

    Contents

  10. TAUS Language Translation Data | Parallel translation for Colloquial English...

    • datarade.ai
    Updated Dec 15, 2020
    Cite
    TAUS (2020). TAUS Language Translation Data | Parallel translation for Colloquial English into various languages for Machine Learning [Dataset]. https://datarade.ai/data-products/taus-parallel-text-colloquial-domain-english-low-resource-see-description-taus
    Explore at:
    Available download formats: .xml, .csv, .xls, .txt
    Dataset updated
    Dec 15, 2020
    Dataset authored and provided by
    TAUS
    Area covered
    Nepal, Bangladesh, Myanmar, Indonesia, Lao People's Democratic Republic, Iran (Islamic Republic of), Iraq, Vietnam, Timor-Leste, Turkey
    Description

    The corpus is a great fit for training chat bots or social media content, and will give the conversation with your local audience a friendly, casual tone. From product user reviews and blog post comments to everyday business small talk, your MT engine will be able to handle even the most creative user voices.

    This corpus contains over 1 million words, and a total vocabulary of more than 37000 different words. Need more data? In the following months, TAUS will release more equally sized corpora for the same domain and language combinations, with a significant increase of vocabulary.

    English - Hindi, English - Urdu, English - Tamil, English - Nepali, English - Turkish, English - Pashto, English - Sorani, English - Bengali, English - Burmese, English - Assamese, English - Telugu, English - Sinhalese, English - Dari, English - Punjabi (Pakistan), English - Punjabi (India), English - Lao, English - Kurmanji (lat), English - Kurmanji (arab)

    Other languages are available on demand.

  11. Tamil Chain of Thought Prompt & Response Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Tamil Chain of Thought Prompt & Response Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/tamil-chain-of-thought-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Welcome to the Tamil Chain of Thought prompt-response dataset, a meticulously curated collection containing 3000 comprehensive prompt and response pairs. This dataset is an invaluable resource for training Language Models (LMs) to generate well-reasoned answers and minimize inaccuracies. Its primary utility lies in enhancing LLMs' reasoning skills for solving arithmetic, common sense, symbolic reasoning, and complex problems.

    Dataset Content: This COT dataset comprises a diverse set of instructions and questions paired with corresponding answers and rationales in the Tamil language. These prompts and completions cover a broad range of topics and questions, including mathematical concepts, common sense reasoning, complex problem-solving, scientific inquiries, puzzles, and more. Each prompt is meticulously accompanied by a response and rationale, providing essential information and insights to enhance the language model training process. These prompts, completions, and rationales were manually curated by native Tamil people, drawing references from various sources, including open-source datasets, news articles, websites, and other reliable references.

    Our chain-of-thought prompt-completion dataset includes various prompt types, such as instructional prompts, continuations, and in-context learning (zero-shot, few-shot) prompts. Additionally, the dataset contains prompts and completions enriched with various forms of rich text, such as lists, tables, code snippets, JSON, and more, with proper markdown format.

    Prompt Diversity: To ensure a wide-ranging dataset, we have included prompts from a plethora of topics related to mathematics, common sense reasoning, and symbolic reasoning. These topics encompass arithmetic, percentages, ratios, geometry, analogies, spatial reasoning, temporal reasoning, logic puzzles, patterns, and sequences, among others. These prompts vary in complexity, spanning easy, medium, and hard levels. Various question types are included, such as multiple-choice, direct queries, and true/false assessments.

    Response Formats: To accommodate diverse learning experiences, our dataset incorporates different types of answers depending on the prompt and provides step-by-step rationales. The detailed rationale aids the language model in building reasoning process for complex questions. These responses encompass text strings, numerical values, and date and time formats, enhancing the language model's ability to generate reliable, coherent, and contextually appropriate answers.

    Data Format and Annotation Details: This fully labeled Tamil Chain of Thought Prompt Completion Dataset is available in JSON and CSV formats. It includes annotation details such as a unique ID, prompt, prompt type, prompt complexity, prompt category, domain, response, rationale, response type, and rich text presence.

    Quality and Accuracy: Our dataset upholds the highest standards of quality and accuracy. Each prompt undergoes meticulous validation, and the corresponding responses and rationales are thoroughly verified. We prioritize inclusivity, ensuring that the dataset incorporates prompts and completions representing diverse perspectives and writing styles, maintaining an unbiased and discrimination-free stance. The Tamil version is grammatically accurate without any spelling or grammatical errors. No copyrighted, toxic, or harmful content is used during the construction of this dataset.

    Continuous Updates and Customization: The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Ongoing efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to gather custom chain of thought prompt completion data tailored to specific needs, providing flexibility and customization options.

    License: The dataset, created by FutureBeeAI, is now available for commercial use. Researchers, data scientists, and developers can leverage this fully labeled and ready-to-deploy Tamil Chain of Thought Prompt Completion Dataset to enhance the rationale and accurate response generation capabilities of their generative AI models and explore new approaches to NLP tasks.

  12. offenseval_dravidian

    • huggingface.co
    Cite
    Community Datasets, offenseval_dravidian [Dataset]. https://huggingface.co/datasets/community-datasets/offenseval_dravidian
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset authored and provided by
    Community Datasets
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Card for Offenseval Dravidian

    Dataset Summary

    Offensive language identification is a classification task in natural language processing (NLP) where the aim is to moderate and minimise offensive content in social media. It has been an active area of research in both academia and industry for the past two decades. There is an increasing demand for offensive language identification on social media texts which are largely code-mixed. Code-mixing is a prevalent… See the full description on the dataset page: https://huggingface.co/datasets/community-datasets/offenseval_dravidian.

  13. IndicCorp Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Mar 10, 2024
    Cite
    Divyanshu Kakwani; Anoop Kunchukuttan; Satish Golla; Gokul N.C.; Avik Bhattacharyya; Mitesh M. Khapra; Pratyush Kumar. (2024). IndicCorp Dataset [Dataset]. https://paperswithcode.com/dataset/indiccorp
    Explore at:
    Dataset updated
    Mar 10, 2024
    Authors
    Divyanshu Kakwani; Anoop Kunchukuttan; Satish Golla; Gokul N.C.; Avik Bhattacharyya; Mitesh M. Khapra; Pratyush Kumar.
    Description

    IndicCorp is a large collection of monolingual corpora with around 9 billion tokens covering 12 major Indian languages. It was developed by discovering and scraping thousands of web sources, primarily news, magazines, and books, over several months.

    Languages covered: Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu

    Corpus Format: The corpus is a single large text file containing one sentence per line. The publicly released version is randomly shuffled, untokenized and deduplicated.
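
    Since the corpus is described as one large file with one sentence per line, it can be processed as a stream rather than loaded into memory. The sketch below counts sentences and whitespace-separated tokens; the function name is illustrative, and whitespace splitting is only an approximation that will not necessarily reproduce the token counts reported for the corpus.

```python
def corpus_stats(lines):
    """Count non-empty lines (sentences) and whitespace-separated tokens."""
    sentences = 0
    tokens = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        sentences += 1
        tokens += len(line.split())
    return sentences, tokens


# Any iterable of lines works, including an open file handle.
sample = ["first sentence here\n", "\n", "second one\n"]
stats = corpus_stats(sample)
print(stats)
```

    For a real file, passing the handle from `open(path, encoding="utf-8")` keeps memory use flat because lines are read one at a time.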

    Downloads

    Language  # News Articles*  Sentences  Tokens
    as        0.60M             1.39M      32.6M
    bn        3.83M             39.9M      836M
    en        3.49M             54.3M      1.22B
    gu        2.63M             41.1M      719M
    hi        4.95M             63.1M      1.86B
    kn        3.76M             53.3M      713M
    ml        4.75M             50.2M      721M
    mr        2.31M             34.0M      551M
    or        0.69M             6.94M      107M
    pa        2.64M             29.2M      773M
    ta        4.41M             31.5M      582M
    te        3.98M             47.9M      674M

    * Excluding articles obtained from the OSCAR corpus

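
    The per-language token counts in the table above can be cross-checked against the "around 9 billion tokens" figure. This short sketch copies the Tokens column (in millions) and sums it; the total comes to roughly 8.8 billion, with Tamil at about 6.6% of it.

```python
# Token counts in millions, copied from the Tokens column of the table above.
TOKENS_MILLIONS = {
    "as": 32.6, "bn": 836, "en": 1220, "gu": 719,
    "hi": 1860, "kn": 713, "ml": 721, "mr": 551,
    "or": 107, "pa": 773, "ta": 582, "te": 674,
}

total_millions = sum(TOKENS_MILLIONS.values())
total_billions = total_millions / 1000
tamil_share = TOKENS_MILLIONS["ta"] / total_millions

print(f"total: {total_billions:.2f}B tokens")
print(f"Tamil share: {tamil_share:.1%}")
```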
  14. IndicDialogue Dataset

    • data.mendeley.com
    Updated Jun 11, 2024
    Cite
    Noor Mairukh Khan Arnob (2024). IndicDialogue Dataset [Dataset]. http://doi.org/10.17632/wcb4bxbyxx.2
    Explore at:
    Dataset updated
    Jun 11, 2024
    Authors
    Noor Mairukh Khan Arnob
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0) https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Description

    The IndicDialogue dataset contains raw subtitle SRT files and dialogues extracted from them. The subtitles are in 10 Indic languages, namely Hindi, Bengali, Marathi, Telugu, Tamil, Urdu, Odia, Sindhi, Nepali and Assamese. This dataset provides a corpus for performing various NLP tasks in low-resource languages using SLMs (Small Language Models) and LLMs (Large Language Models).
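To illustrate what extracting dialogue from an SRT file involves, here is a minimal sketch. It assumes the common SRT layout (a numeric cue index, a timestamp line containing `-->`, then one or more text lines, with cues separated by blank lines). The function name is hypothetical, and this is not the extraction procedure used by the dataset author, which is not described here.

```python
import re
from typing import List

# Matches simple inline markup such as <i>...</i> that subtitles often carry.
_TAG = re.compile(r"<[^>]+>")


def extract_dialogues(srt_text: str) -> List[str]:
    """Return one dialogue string per subtitle cue, without indices or timestamps."""
    dialogues = []
    # Drop a leading byte-order mark and normalize Windows line endings.
    normalized = srt_text.lstrip("\ufeff").replace("\r\n", "\n").strip()
    for block in re.split(r"\n\s*\n", normalized):
        lines = [ln.strip() for ln in block.split("\n") if ln.strip()]
        if lines and lines[0].isdigit():      # cue index
            lines = lines[1:]
        if lines and "-->" in lines[0]:       # timestamp line
            lines = lines[1:]
        text = _TAG.sub("", " ".join(lines)).strip()
        if text:
            dialogues.append(text)
    return dialogues
```

Multi-line cues are joined with a single space, which keeps each cue as one dialogue unit; a different downstream task might prefer to keep the line breaks.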

  15. m

    Transliteration Sentence Dataset

    • data.mendeley.com
    Updated Apr 16, 2024
    Md. Jabed Hosen (2024). Transliteration Sentence Dataset [Dataset]. http://doi.org/10.17632/38y7g2fcny.1
    Explore at:
    Dataset updated
    Apr 16, 2024
    Authors
    Md. Jabed Hosen
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A transliteration sentence is like writing the same words but using different letters that sound the same. It helps people who speak different languages understand each other better. This dataset is drawn from 12 varied datasets originally intended for tasks such as sentiment analysis, hate speech detection, social media analysis, and review classification, and aims to cover a wide range of the linguistic variation found in real-world language use. Each data instance was labeled with the language of its sentence. From these sources, we curated a dataset of 65,473 instances (19,859 Bangla, 17,309 Hindi, 17,000 English, and 11,305 Tamil), specifically tailored for transliteration sentence identification.

  16. f

    Data_Sheet_1_Development and testing of a multi-lingual Natural Language...

    • frontiersin.figshare.com
    docx
    Updated Jun 1, 2023
    Lily Wei Yun Yang; Wei Yan Ng; Xiaofeng Lei; Shaun Chern Yuan Tan; Zhaoran Wang; Ming Yan; Mohan Kashyap Pargi; Xiaoman Zhang; Jane Sujuan Lim; Dinesh Visva Gunasekeran; Franklin Chee Ping Tan; Chen Ee Lee; Khung Keong Yeo; Hiang Khoon Tan; Henry Sun Sien Ho; Benedict Wee Bor Tan; Tien Yin Wong; Kenneth Yung Chiang Kwek; Rick Siow Mong Goh; Yong Liu; Daniel Shu Wei Ting (2023). Data_Sheet_1_Development and testing of a multi-lingual Natural Language Processing-based deep learning system in 10 languages for COVID-19 pandemic crisis: A multi-center study.docx [Dataset]. http://doi.org/10.3389/fpubh.2023.1063466.s001
    Explore at:
    docx Available download formats
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers
    Authors
    Lily Wei Yun Yang; Wei Yan Ng; Xiaofeng Lei; Shaun Chern Yuan Tan; Zhaoran Wang; Ming Yan; Mohan Kashyap Pargi; Xiaoman Zhang; Jane Sujuan Lim; Dinesh Visva Gunasekeran; Franklin Chee Ping Tan; Chen Ee Lee; Khung Keong Yeo; Hiang Khoon Tan; Henry Sun Sien Ho; Benedict Wee Bor Tan; Tien Yin Wong; Kenneth Yung Chiang Kwek; Rick Siow Mong Goh; Yong Liu; Daniel Shu Wei Ting
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Purpose: The COVID-19 pandemic has drastically disrupted global healthcare systems. With the higher demand for healthcare and misinformation related to COVID-19, there is a need to explore alternative models to improve communication. Artificial Intelligence (AI) and Natural Language Processing (NLP) have emerged as promising solutions to improve healthcare delivery. Chatbots could fill a pivotal role in the dissemination and easy accessibility of accurate information in a pandemic. In this study, we developed a multi-lingual NLP-based AI chatbot, DR-COVID, which responds accurately to open-ended, COVID-19 related questions. This was used to facilitate pandemic education and healthcare delivery.

    Methods: First, we developed DR-COVID with an ensemble NLP model on the Telegram platform (https://t.me/drcovid_nlp_chatbot). Second, we evaluated various performance metrics. Third, we evaluated multi-lingual text-to-text translation to Chinese, Malay, Tamil, Filipino, Thai, Japanese, French, Spanish, and Portuguese. We utilized 2,728 training questions and 821 test questions in English. Primary outcome measurements were (A) overall and top 3 accuracies; (B) Area Under the Curve (AUC), precision, recall, and F1 score. Overall accuracy referred to a correct response for the top answer, whereas top 3 accuracy referred to an appropriate response for any one answer amongst the top 3 answers. AUC and its relevant matrices were obtained from the Receiver Operation Characteristics (ROC) curve. Secondary outcomes were (A) multi-lingual accuracy; (B) comparison to enterprise-grade chatbot systems. The sharing of training and testing datasets on an open-source platform will also contribute to existing data.

    Results: Our NLP model, utilizing the ensemble architecture, achieved overall and top 3 accuracies of 0.838 [95% confidence interval (CI): 0.826–0.851] and 0.922 [95% CI: 0.913–0.932] respectively. For overall and top 3 results, AUC scores of 0.917 [95% CI: 0.911–0.925] and 0.960 [95% CI: 0.955–0.964] were achieved respectively. We achieved multi-linguicism with nine non-English languages, with Portuguese performing the best overall at 0.900. Lastly, DR-COVID generated answers more accurately and quickly than other chatbots, within 1.12–2.15 s across three devices tested.

    Conclusion: DR-COVID is a clinically effective NLP-based conversational AI chatbot, and a promising solution for healthcare delivery in the pandemic era.
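The distinction between overall accuracy (the top answer is correct) and top 3 accuracy (any of the top 3 answers is appropriate) can be made concrete with a small sketch. The function name and the toy data below are hypothetical illustrations and are not taken from the study's code or data.

```python
from typing import List, Sequence


def top_k_accuracy(ranked_answers: Sequence[Sequence[str]],
                   gold_answers: Sequence[str],
                   k: int) -> float:
    """Fraction of questions whose gold answer appears among the first k ranked answers."""
    if len(ranked_answers) != len(gold_answers):
        raise ValueError("ranked_answers and gold_answers must have the same length")
    if not gold_answers:
        raise ValueError("at least one question is required")
    hits = sum(
        1 for ranked, gold in zip(ranked_answers, gold_answers)
        if gold in ranked[:k]
    )
    return hits / len(gold_answers)


# Toy example: each inner list is a model's ranked answers for one question.
ranked: List[List[str]] = [
    ["a1", "a2", "a3"],   # gold is ranked first
    ["b2", "b1", "b3"],   # gold is ranked second
    ["c2", "c3", "c4"],   # gold is not in the top 3
    ["d1", "d2", "d3"],   # gold is ranked first
]
gold = ["a1", "b1", "c1", "d1"]

overall = top_k_accuracy(ranked, gold, k=1)   # top answer only
top3 = top_k_accuracy(ranked, gold, k=3)      # any of the top 3
```

With `k=1` the metric is the overall accuracy as defined above; raising `k` can only keep the score the same or increase it, which is why top 3 accuracy is reported as the higher figure.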

  17. h

    tatoeba_mt

    • huggingface.co
    • opendatalab.com
    Updated Mar 4, 2022
    + more versions
    Language Technology Research Group at the University of Helsinki (2022). tatoeba_mt [Dataset]. https://huggingface.co/datasets/Helsinki-NLP/tatoeba_mt
    Explore at:
    Dataset updated
    Mar 4, 2022
    Dataset authored and provided by
    Language Technology Research Group at the University of Helsinki
    License

    Attribution 2.0 (CC BY 2.0) https://creativecommons.org/licenses/by/2.0/
    License information was derived automatically

    Description

    The Tatoeba Translation Challenge is a multilingual data set of machine translation benchmarks derived from user-contributed translations collected by Tatoeba.org and provided as a parallel corpus from OPUS. This dataset includes test and development data sorted by language pair. It includes test sets for hundreds of language pairs and is continuously updated. Please check the version number tag to refer to the release that you are using.

  18. A

    ‘Language Detection’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 28, 2022
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Language Detection’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-language-detection-bd39/latest
    Explore at:
    Dataset updated
    Jan 28, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Language Detection’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/basilb2s/language-detection on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    About the Dataset

    It's a small language detection dataset. It consists of text samples in 17 different languages, i.e., you will be able to create an NLP model that predicts which of 17 languages a text is written in.

    Languages

    1) English 2) Malayalam 3) Hindi 4) Tamil 5) Kannada 6) French 7) Spanish 8) Portuguese 9) Italian 10) Russian 11) Swedish 12) Dutch 13) Arabic 14) Turkish 15) German 16) Danish 17) Greek

    --- Original source retains full ownership of the source dataset ---

  19. P

    IndicGLUE Dataset

    • paperswithcode.com
    Updated Mar 9, 2022
    Divyanshu Kakwani; Anoop Kunchukuttan; Satish Golla; Gokul N.C.; Avik Bhattacharyya; Mitesh M. Khapra; Pratyush Kumar. (2022). IndicGLUE Dataset [Dataset]. https://paperswithcode.com/dataset/indicglue
    Explore at:
    Dataset updated
    Mar 9, 2022
    Authors
    Divyanshu Kakwani; Anoop Kunchukuttan; Satish Golla; Gokul N.C.; Avik Bhattacharyya; Mitesh M. Khapra; Pratyush Kumar.
    Description

    We now introduce IndicGLUE, the Indic General Language Understanding Evaluation Benchmark, which is a collection of various NLP tasks as described below. The goal is to provide an evaluation benchmark for natural language understanding capabilities of NLP models on diverse tasks and multiple Indian languages.

  20. h

    Tamil_Tamizh_Wikipedia_Text_Dataset_for_NLP

    • huggingface.co
    Updated Dec 1, 2024
    Mohamed Younus (2024). Tamil_Tamizh_Wikipedia_Text_Dataset_for_NLP [Dataset]. https://huggingface.co/datasets/younusmohamed77/Tamil_Tamizh_Wikipedia_Text_Dataset_for_NLP
    Explore at:
    Dataset updated
    Dec 1, 2024
    Authors
    Mohamed Younus
    Description

    About this directory File Information

    Tamil Wikipedia Text Articles/ (Folder) Description: This folder contains the raw text extracted from individual Tamil Wikipedia articles. Each .txt file represents a single article in plain text, organized into subdirectories (e.g., AA, AB) to mirror the structure of the original Wikipedia dump.

    Format: .txt files, UTF-8 encoded. Use Cases: Suitable for language modeling, text analysis, NLP preprocessing, and other language research tasks.
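Given the layout described above (UTF-8 `.txt` files, one article each, nested in subdirectories such as AA and AB), the articles can be iterated with a recursive directory walk. This is a minimal sketch under that assumption; the function name and the root folder path are hypothetical.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_articles(root: Path) -> Iterator[Tuple[str, str]]:
    """Yield (relative path, article text) for every .txt file under root.

    Files are visited in sorted order so repeated runs are deterministic.
    """
    for path in sorted(root.rglob("*.txt")):
        yield path.relative_to(root).as_posix(), path.read_text(encoding="utf-8")
```

For example, `for name, text in iter_articles(Path("Tamil Wikipedia Text Articles")): ...` would visit each article once; the folder name is taken from the description above, and its location on disk is an assumption.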

    LICENSE… See the full description on the dataset page: https://huggingface.co/datasets/younusmohamed77/Tamil_Tamizh_Wikipedia_Text_Dataset_for_NLP.

Dinesh Kumar Sarangapani (2020). Ponniyan selvan Tamil Book for NLP [Dataset]. https://www.kaggle.com/dineshkumarsarang/ponniyan-selvan-tamil-book-for-nlp

Ponniyan selvan Tamil Book for NLP

Use Famous Tamil Book for Tamil NLP notebooks

