9 datasets found

Hinglish TOP Dataset
kaggle.com
huggingface.co
Updated Oct 5, 2023
Cite
Karthik Vinayan (2023). Hinglish TOP Dataset [Dataset]. https://www.kaggle.com/datasets/nexuswho/hinglish-top-dataset
Dataset updated
Oct 5, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Karthik Vinayan
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
Hinglish-TOP comprises 10K human-annotated code-switched semantic parsing examples, described as the largest such dataset, plus 170K utterances generated with the CST5 augmentation technique introduced in the accompanying paper. Queries are derived from TOPv2, a multi-domain task-oriented semantic parsing dataset. Experiments suggest that with CST5, up to 20x less labeled data can achieve the same semantic parsing performance.

Dataset Structure and File Format

The dataset is divided into two subfolders: human-annotated data and synthetically generated data. The human-annotated folder contains the train, test, and validation splits, whereas the synthetically generated folder contains a single file with all the generated data.

The files are in .tsv format. Each has 5 columns, in this order: English query, code-switched query, English parse, code-switched parse, and domain.
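A minimal sketch of reading one of these five-column .tsv files with Python's standard library. The column names in `COLUMNS` are hypothetical labels chosen for illustration (the description gives only the column order), the sample row is invented rather than taken from the dataset, and whether the real files carry a header row is not stated, so check before relying on this.

```python
import csv
import io

# Hypothetical names: the description gives only the column order.
COLUMNS = ["en_query", "cs_query", "en_parse", "cs_parse", "domain"]


def read_hinglish_top(stream):
    """Yield one dict per well-formed 5-column row of a TSV stream."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != len(COLUMNS):
            continue  # skip malformed rows rather than guess at their layout
        yield dict(zip(COLUMNS, row))


# Invented sample row, for illustration only (not from the dataset).
sample = (
    "set alarm for 7 am\t"
    "7 baje ka alarm set karo\t"
    "[IN:CREATE_ALARM [SL:DATE_TIME for 7 am ] ]\t"
    "[IN:CREATE_ALARM [SL:DATE_TIME 7 baje ka ] ]\t"
    "alarm\n"
)

rows = list(read_hinglish_top(io.StringIO(sample)))
print(rows[0]["cs_query"], "|", rows[0]["domain"])
```

For a real file, pass an open file handle (for example `open(path, encoding="utf-8", newline="")`) instead of the in-memory sample.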
hinglish-data-for-sentiment-analysis
kaggle.com
Updated Jun 5, 2023
Cite
Shivajeet Rai (2023). hinglish-data-for-sentiment-analysis [Dataset]. https://www.kaggle.com/datasets/shivajeetrai/hinglish-data-for-sentiment-analysis/versions/1
Dataset updated
Jun 5, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Shivajeet Rai
Description
This dataset was created by Shivajeet Rai
Hinglish sentiment
kaggle.com
Updated Apr 11, 2024
Cite
Ankit Lakra (2024). Hinglish sentiment [Dataset]. https://www.kaggle.com/datasets/ankitlakraa/hinglish-sentiment/versions/1
Dataset updated
Apr 11, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Ankit Lakra
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
This dataset was created by Ankit Lakra

Released under Apache 2.0
Conversation
kaggle.com
Updated Dec 28, 2023
Cite
Sahil Siddiki (2023). Conversation [Dataset]. https://www.kaggle.com/datasets/siddikisahil47/conversation/discussion
Dataset updated
Dec 28, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Sahil Siddiki
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
This dataset helps fine-tune assistant or chatbot models to understand both Hinglish and English through Hindi, so that they can comprehend and respond effectively in this hybrid language.

Hinglish is a hybrid language, a blend of Hindi and English, commonly spoken in India. It combines vocabulary and grammar from both languages and is often used in text conversations. According to the author, the dataset is useful for fine-tuning open-source language models such as Llama 2, which lack exposure to such data in training, whereas GPT-3 and later models have been trained on Hinglish data and are therefore more adept at understanding it.
Hinglish_Hindi_parallel_corpus
kaggle.com
zip
Updated Mar 2, 2020
Cite
Stuti (2020). Hinglish_Hindi_parallel_corpus [Dataset]. https://www.kaggle.com/stutig29/hinglish-hindi-parallel-corpus
Available download formats: zip (11374 bytes)
Dataset updated
Mar 2, 2020
Authors
Stuti
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset was created by Stuti

Released under CC0: Public Domain

Contents

It contains the following files:
h
News_Hinglish_English
huggingface.co
Updated Jun 21, 2024
Cite
suyash agarwal (2024). News_Hinglish_English [Dataset]. http://doi.org/10.57967/hf/5120
Unique identifier
https://doi.org/10.57967/hf/5120
Dataset updated
Jun 21, 2024
Authors
suyash agarwal
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
Dataset Summary

This is a collection of texts in Hinglish (code-mixed Hindi and English) and their corresponding English versions. It can be used for translating between the two. The dataset was generated by translating the first 5,000 news items from the Inshorts Dataset - English News (https://www.kaggle.com/datasets/shivamtaneja2304/inshorts-dataset-english).

Languages

Hinglish, English

Dataset Structure

An example from the json file… See the full description on the dataset page: https://huggingface.co/datasets/suyash2739/News_Hinglish_English.
Stop Words Hinglish for NLP
kaggle.com
Updated Mar 14, 2024
Cite
Pranam Shetty (2024). Stop Words Hinglish for NLP [Dataset]. https://www.kaggle.com/datasets/prxshetty/stop-words-hinglish
Dataset updated
Mar 14, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Pranam Shetty
Description
The provided list contains common stop words used in natural language processing (NLP) tasks. Stop words are words that are filtered out before or after processing natural language data. They are typically the most common words in a language and carry little meaning on their own, so they are often removed to focus on the more informative words or tokens in a text. This dataset can be used in NLP applications such as text classification, sentiment analysis, and information retrieval to improve the accuracy and efficiency of text processing. Eliminating stop words lets computational resources be used more effectively and lets the analysis focus on the meaningful content of the text.
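The filtering step described above can be sketched as follows. The words in `STOP_WORDS` are illustrative placeholders, not entries copied from this dataset, and the simple regex tokenizer is an assumption; in practice the set would be loaded from the dataset's own file.

```python
import re
from typing import Iterable, List

# Illustrative placeholders only; NOT taken from the dataset's list.
STOP_WORDS = frozenset({"hai", "ka", "ki", "ko", "the", "is", "a"})


def remove_stop_words(text: str, stop_words: Iterable[str] = STOP_WORDS) -> List[str]:
    """Lowercase the text, split it into word tokens, and drop stop words."""
    stop = set(stop_words)
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in stop]


print(remove_stop_words("Yeh movie ka ending bahut achha hai"))
# prints ['yeh', 'movie', 'ending', 'bahut', 'achha']
```

A set (rather than a list) keeps each membership check constant-time, which matters once the stop-word list grows to hundreds of entries.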
YouTube Comments on Wrestlers Protest
kaggle.com
Updated Jul 14, 2023
Cite
Aatman Vaidya (2023). YouTube Comments on Wrestlers Protest [Dataset]. https://www.kaggle.com/datasets/aatmanvaidya/youtube-comments-on-wrestlers-protest
Dataset updated
Jul 14, 2023
Dataset provided by
Kaggle
Authors
Aatman Vaidya
License
https://creativecommons.org/publicdomain/zero/1.0/
Area covered
YouTube
Description
This dataset contains 1.12 million YouTube comments about the wrestlers' protest. It aims to shed light on public sentiment and opinions surrounding the event and offers a resource for research and analysis of the dynamics of online discussion.

The comments are primarily in English, Hindi, and Hinglish.

The description of each column can be found in the attributes.txt file. The comments have been scraped using the official YouTube Data API.
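Because the comments mix English, Hindi, and Hinglish, a rough first pass might bucket each comment by script. The sketch below is an assumption about how one could approach this, not part of the dataset: it counts characters in the Unicode Devanagari block (U+0900 to U+097F) and ASCII letters. It cannot tell English from romanized Hinglish, since both use Latin script.

```python
def script_profile(text: str) -> str:
    """Classify text as 'devanagari', 'latin', 'mixed', or 'other' by script."""
    # Unicode Devanagari block: U+0900 to U+097F.
    devanagari = sum(1 for ch in text if "\u0900" <= ch <= "\u097f")
    # ASCII letters only; accented Latin characters are ignored in this sketch.
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    if devanagari and latin:
        return "mixed"
    if devanagari:
        return "devanagari"
    if latin:
        return "latin"
    return "other"


for comment in ["यह सही है", "yeh sahi hai", "यह bilkul सही hai", "123 !!"]:
    print(script_profile(comment))
```

Separating English from romanized Hinglish would need a language-identification model or a lexicon, which this heuristic does not attempt.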
Mobile/Non-mobile Tech Articles & Tweets
kaggle.com
Updated Sep 7, 2021
Cite
Shreya Sajal (2021). Mobile/Non-mobile Tech Articles & Tweets [Dataset]. https://www.kaggle.com/datasets/shreyasajal/mobilenonmobile-tech-articles-and-tweets
Dataset updated
Sep 7, 2021
Dataset provided by
Kaggle
Authors
Shreya Sajal
License
http://opendatacommons.org/licenses/dbcl/1.0/
Description
Data source: Bridgei2i
Languages: Hindi, English, Hinglish

The datasets contain articles and tweets with their respective mobile/non-mobile tech flags. They can be used for text classification, text preprocessing, translation, and transliteration tasks.

DATA DICTIONARY

dev_article:
- Text_ID: unique article IDs
- Text: article text data
- Headline: headline of the article
- Mobile_Tech_Flag: flag showing whether the article is related to mobile tech

dev_tweet:
- Text_ID: unique tweet IDs
- Text: tweet text data
- Mobile_Tech_Flag: flag showing whether the tweet is related to mobile tech
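As a sketch of how the tweet fields in the data dictionary might be used, the snippet below filters rows by `Mobile_Tech_Flag`. The CSV file format, the `"1"`/`"0"` flag encoding, and the sample rows are all assumptions made for illustration; the description does not specify them, so verify against the actual files.

```python
import csv
import io


def mobile_tech_rows(stream, flag_value="1"):
    """Yield rows whose Mobile_Tech_Flag equals flag_value (an assumed encoding)."""
    for row in csv.DictReader(stream):
        if row.get("Mobile_Tech_Flag") == flag_value:
            yield row


# Invented sample rows using the dev_tweet column names from the data dictionary.
SAMPLE_CSV = (
    "Text_ID,Text,Mobile_Tech_Flag\n"
    "t1,naya phone launch ho gaya,1\n"
    "t2,aaj mausam accha hai,0\n"
)

print([row["Text_ID"] for row in mobile_tech_rows(io.StringIO(SAMPLE_CSV))])
# prints ['t1']
```

The same function would apply to dev_article rows, since they share the `Mobile_Tech_Flag` column.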
Karthik Vinayan (2023). Hinglish TOP Dataset [Dataset]. https://www.kaggle.com/datasets/nexuswho/hinglish-top-dataset

Hinglish TOP Dataset

English-Hinglish Dataset

6 scholarly articles cite this dataset (View in Google Scholar)
