9 datasets found
  1. Hinglish TOP Dataset

    • kaggle.com
    • huggingface.co
    Updated Oct 5, 2023
    Cite
    Karthik Vinayan (2023). Hinglish TOP Dataset [Dataset]. https://www.kaggle.com/datasets/nexuswho/hinglish-top-dataset
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 5, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Karthik Vinayan
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Hinglish-TOP consists of the largest (10K) human-annotated code-switched semantic parsing dataset and 170K utterances generated using the CST5 augmentation technique introduced in the paper. Queries are derived from TOPv2, a multi-domain task-oriented semantic parsing dataset. Experiments suggest that with CST5, up to 20x less labeled data can achieve the same semantic parsing performance.

    Dataset Structure and File Format

    The dataset is divided into two subfolders: human annotated data and synthetically generated data. The human annotated data contains the train, test and validation splits, whereas the synthetically generated data is a single file.

    The files are in .tsv format. There are 5 columns, which contain the English query, code-switched query, English parse, code-switched parse and the domain for each entry, in that order.
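
    As a rough sketch, files with this five-column layout could be read with Python's csv module. The column labels below are names chosen here for illustration, the files are assumed to have no header row, and the sample row is a made-up placeholder rather than a record from the dataset.

```python
import csv
import io

# Illustrative labels for the five columns, in the order the description gives.
# These names are assumptions; the files may or may not define their own header.
COLUMNS = ["en_query", "cs_query", "en_parse", "cs_parse", "domain"]


def read_rows(handle):
    """Yield one dict per TSV line, keyed by the illustrative column labels."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        yield dict(zip(COLUMNS, fields))


# Made-up placeholder row standing in for a real file.
sample = io.StringIO(
    "english query\thinglish query\tenglish parse\thinglish parse\tsome_domain\n"
)
rows = list(read_rows(sample))
print(rows[0]["domain"])
```

    For a real file, an opened file object (for example from open(path, newline="", encoding="utf-8"), where path is whichever split you downloaded) would take the place of the StringIO object.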

  2. hinglish-data-for-sentiment-analysis

    • kaggle.com
    Updated Jun 5, 2023
    Cite
    Shivajeet Rai (2023). hinglish-data-for-sentiment-analysis [Dataset]. https://www.kaggle.com/datasets/shivajeetrai/hinglish-data-for-sentiment-analysis/versions/1
    Dataset updated
    Jun 5, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Shivajeet Rai
    Description

    Dataset

    This dataset was created by Shivajeet Rai


  3. Hinglish sentiment

    • kaggle.com
    Updated Apr 11, 2024
    Cite
    Ankit Lakra (2024). Hinglish sentiment [Dataset]. https://www.kaggle.com/datasets/ankitlakraa/hinglish-sentiment/versions/1
    Dataset updated
    Apr 11, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ankit Lakra
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Ankit Lakra

    Released under Apache 2.0


  4. Conversation

    • kaggle.com
    Updated Dec 28, 2023
    Cite
    Sahil Siddiki (2023). Conversation [Dataset]. https://www.kaggle.com/datasets/siddikisahil47/conversation/discussion
    Dataset updated
    Dec 28, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sahil Siddiki
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This dataset helps fine-tune assistant or chatbot models to understand both Hinglish and English through Hindi, so that they can respond effectively in this hybrid language.

    Hinglish is a hybrid language, a blend of Hindi and English, commonly spoken in India. It combines vocabulary and grammar from both languages, often used in text conversations. The Hinglish dataset is crucial for fine-tuning open-source language models like LLAMA-2, which lack exposure to such data in training. In contrast, GPT-3 and later models have been trained on Hinglish data, making them more adept at understanding this hybrid language.

  5. Hinglish_Hindi_parallel_corpus

    • kaggle.com
    zip
    Updated Mar 2, 2020
    Cite
    Stuti (2020). Hinglish_Hindi_parallel_corpus [Dataset]. https://www.kaggle.com/stutig29/hinglish-hindi-parallel-corpus
    Available download formats: zip (11374 bytes)
    Dataset updated
    Mar 2, 2020
    Authors
    Stuti
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Stuti

    Released under CC0: Public Domain

    Contents

    It contains the following files:

  6. News_Hinglish_English

    • huggingface.co
    Updated Jun 21, 2024
    Cite
    suyash agarwal (2024). News_Hinglish_English [Dataset]. http://doi.org/10.57967/hf/5120
    Dataset updated
    Jun 21, 2024
    Authors
    suyash agarwal
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset Summary

    This is a collection of text conversations in Hinglish (code-mixed Hindi and English) and their corresponding English versions. It can be used for translating between the two. The dataset was generated by translating the first 5000 news items from the Inshorts Dataset - English News [https://www.kaggle.com/datasets/shivamtaneja2304/inshorts-dataset-english].

    Languages

    Hinglish, English

    Dataset Structure

    An example from the json file… See the full description on the dataset page: https://huggingface.co/datasets/suyash2739/News_Hinglish_English.

  7. Stop Words Hinglish for NLP

    • kaggle.com
    Updated Mar 14, 2024
    Cite
    Pranam Shetty (2024). Stop Words Hinglish for NLP [Dataset]. https://www.kaggle.com/datasets/prxshetty/stop-words-hinglish
    Dataset updated
    Mar 14, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Pranam Shetty
    Description

    The provided list contains common stop words used in natural language processing (NLP) tasks. Stop words are words that are filtered out before or after processing of natural language data. They are typically the most common words in a language and don't carry significant meaning, thus often removed to focus on the more important words or tokens in a text. This dataset can be used in various NLP applications such as text classification, sentiment analysis, and information retrieval to improve the accuracy and efficiency of text processing algorithms. By eliminating these stop words, the computational resources can be utilized more effectively, and the analysis can focus on the meaningful content of the text.
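
    The filtering step described above might look like the following minimal sketch. The word set is a hypothetical stand-in and is not copied from the dataset, and whitespace tokenization is assumed for simplicity.

```python
# Hypothetical stand-in for the dataset's word list; these entries are
# illustrative and are not copied from the dataset.
STOP_WORDS = {"hai", "ka", "ki", "the", "is", "to"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase, split on whitespace, and drop tokens found in stop_words."""
    return [tok for tok in text.lower().split() if tok not in stop_words]


print(remove_stop_words("Yeh movie ka review bahut accha hai"))
# ['yeh', 'movie', 'review', 'bahut', 'accha']
```

    Using a set keeps each membership check constant-time, which matters when the same list is applied to a large corpus.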

  8. YouTube Comments on Wrestlers Protest

    • kaggle.com
    Updated Jul 14, 2023
    Cite
    Aatman Vaidya (2023). YouTube Comments on Wrestlers Protest [Dataset]. https://www.kaggle.com/datasets/aatmanvaidya/youtube-comments-on-wrestlers-protest
    Dataset updated
    Jul 14, 2023
    Dataset provided by
    Kaggle
    Authors
    Aatman Vaidya
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    YouTube
    Description

    This dataset has 1.12 million YouTube comments about the wrestlers' protest. It aims to shed light on public sentiment and opinions surrounding this event, and offers a resource for research and analysis of the dynamics of online discussions.

    The comments are primarily in English, Hindi and Hinglish.

    The description of each column can be found in the attributes.txt file. The comments have been scraped using the official YouTube Data API.

  9. Mobile/Non-mobile Tech Articles & Tweets

    • kaggle.com
    Updated Sep 7, 2021
    Cite
    Shreya Sajal (2021). Mobile/Non-mobile Tech Articles & Tweets [Dataset]. https://www.kaggle.com/datasets/shreyasajal/mobilenonmobile-tech-articles-and-tweets
    Dataset updated
    Sep 7, 2021
    Dataset provided by
    Kaggle
    Authors
    Shreya Sajal
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Data Source: Bridgei2i
    Languages: Hindi, English, Hinglish

    The datasets contain articles and tweets with their respective mobile/non-mobile tech flags. They can be used for text classification, text preprocessing, translation and transliteration tasks.

    DATA DICTIONARY

    dev_article : 
      - Text_ID : unique article ids
      - Text: Article Text Data
      - Headline: Headline to the article
      - Mobile_Tech_Flag: Flag shows whether article is related to mobile_tech 
    dev_tweet : 
      - Text_ID : unique tweet ids
      - Text: Tweet Text Data
      - Mobile_Tech_Flag: Flag shows whether tweet is related to mobile_tech
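
    As an illustration of how the dev_tweet fields above could be consumed, the sketch below splits rows by the flag column. The CSV layout, the IDs, the text and the 0/1 flag encoding are all assumptions made for this example; the source does not state the file format or the flag values.

```python
import csv
import io

# Placeholder content shaped like the dev_tweet table described above.
# The CSV format, the IDs, the text, and the 0/1 flag encoding are assumptions.
sample = io.StringIO(
    "Text_ID,Text,Mobile_Tech_Flag\n"
    "t1,placeholder tweet about a phone,1\n"
    "t2,placeholder tweet about something else,0\n"
)

rows = list(csv.DictReader(sample))

# Split tweets by the mobile-tech flag, as a classifier's labels would.
mobile = [r["Text_ID"] for r in rows if r["Mobile_Tech_Flag"] == "1"]
other = [r["Text_ID"] for r in rows if r["Mobile_Tech_Flag"] != "1"]
print(mobile, other)
```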
    

Hinglish TOP Dataset

English-Hinglish Dataset

6 scholarly articles cite this dataset.