Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Hinglish-TOP consists of the largest (10K) human-annotated code-switched semantic parsing dataset and 170K utterances generated using the CST5 augmentation technique introduced in the paper. Queries are derived from TOPv2, a multi-domain task-oriented semantic parsing dataset. Experiments suggest that with CST5, up to 20x less labeled data can achieve the same semantic parsing performance.
Dataset Structure and File Format
The dataset is divided into two subfolders: human-annotated data and synthetically generated data. The human-annotated folder contains the train, test, and validation splits, whereas the synthetically generated folder contains a single file with all the generated data.
The files are in .tsv format. Each has 5 columns, in this order: English query, code-switched query, English parse, code-switched parse, and domain.
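The five-column layout described above could be read along the following lines. This is only a sketch: the column names are assumptions (the description gives the column order, not header names), it is not stated whether the files carry a header row, and the sample row is an invented placeholder rather than data from Hinglish-TOP.

```python
import csv
import io

# Assumed names for the five columns; the description only fixes their order.
COLUMNS = ["en_query", "cs_query", "en_parse", "cs_parse", "domain"]


def read_rows(handle):
    """Yield one dict per tab-separated line, keyed by the assumed column names."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if len(fields) != len(COLUMNS):
            continue  # skip lines that do not have exactly five fields
        yield dict(zip(COLUMNS, fields))


# Invented placeholder row for illustration only (not taken from the dataset).
sample = io.StringIO(
    "set an alarm\talarm set karo\t[IN:CREATE_ALARM ]\t[IN:CREATE_ALARM ]\talarm\n"
)
rows = list(read_rows(sample))
print(rows[0]["cs_query"], "|", rows[0]["domain"])
```

To read a real split, the in-memory sample would be replaced with a file opened via open(path, encoding="utf-8", newline=""). If the files turn out to have a header row, the first yielded row should be skipped.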
This dataset was created by Shivajeet Rai
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Ankit Lakra
Released under Apache 2.0
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset helps fine-tune assistant or chatbot models to comprehend both Hinglish and English through Hindi, improving their ability to understand and respond effectively in this hybrid language.
Hinglish is a hybrid language, a blend of Hindi and English, commonly spoken in India. It combines vocabulary and grammar from both languages and is often used in text conversations. The Hinglish dataset is intended for fine-tuning open-source language models such as Llama 2, which lack exposure to such data in training. In contrast, GPT-3 and later models have been trained on Hinglish data, making them more adept at understanding this hybrid language.
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Stuti
Released under CC0: Public Domain
It contains the following files:
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Summary
This is a collection of text conversations in Hinglish (code-mixed Hindi-English) and their corresponding English versions. It can be used for translating between the two. The dataset was generated by translating the first 5,000 news items from the Inshorts Dataset - English News [https://www.kaggle.com/datasets/shivamtaneja2304/inshorts-dataset-english].
Languages
Hinglish, English
Dataset Structure
An example from the json file… See the full description on the dataset page: https://huggingface.co/datasets/suyash2739/News_Hinglish_English.
The provided list contains common stop words used in natural language processing (NLP) tasks. Stop words are words that are filtered out before or after processing natural language data. They are typically the most common words in a language and carry little meaning on their own, so they are often removed to focus on the more informative words or tokens in a text. This dataset can be used in NLP applications such as text classification, sentiment analysis, and information retrieval to improve the accuracy and efficiency of text processing. Eliminating stop words lets computational resources be used more effectively and keeps the analysis focused on the meaningful content of the text.
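As a rough illustration of the filtering step described above, the sketch below drops stop words from a tokenized sentence. The small STOP_WORDS set is a hypothetical stand-in for illustration; in practice the words would be loaded from the dataset's own list, whose file name and format are not given here.

```python
import re

# Hypothetical stand-in for the dataset's stop-word list.
STOP_WORDS = {"the", "is", "a", "of", "and"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in stop_words]


filtered = remove_stop_words("The movie is a hit")
print(filtered)  # ['movie', 'hit']
```

A set is used for the stop words so that each membership check is constant time, which matters when filtering large corpora.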
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
This dataset has 1.12 million YouTube comments about the wrestlers' protest. It aims to shed light on public sentiment and opinions surrounding the event, and offers a resource for research and analysis of the dynamics of online discussions.
The comments are primarily in English, Hindi, and Hinglish.
The description of each column can be found in the attributes.txt file. The comments have been scraped using the official YouTube Data API.
Open Data Commons Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/
Data Source: Bridgei2i
Languages: Hindi, English, Hinglish
The datasets contain articles and tweets with their respective mobile/non-mobile tech flags. They can be used for text classification, text preprocessing, translation, and transliteration tasks.
DATA DICTIONARY
dev_article :
- Text_ID : unique article ids
- Text: Article Text Data
- Headline: Headline to the article
- Mobile_Tech_Flag: Flag shows whether article is related to mobile_tech
dev_tweet :
- Text_ID : unique tweet ids
- Text: Tweet Text Data
- Mobile_Tech_Flag: Flag shows whether tweet is related to mobile_tech
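The data dictionary above suggests a simple way to separate mobile-tech items from the rest. The sketch below assumes the tweet table is available as CSV with exactly the listed column names and that Mobile_Tech_Flag is encoded as 1/0. Neither the file format nor the flag encoding is stated in the description, and the sample rows are invented placeholders.

```python
import csv
import io

# Invented placeholder rows using the column names from the data dictionary.
# The 1/0 encoding of Mobile_Tech_Flag is an assumption.
sample_csv = io.StringIO(
    "Text_ID,Text,Mobile_Tech_Flag\n"
    "tweet_0001,placeholder tweet about a phone launch,1\n"
    "tweet_0002,placeholder tweet about cricket,0\n"
)


def split_by_flag(handle):
    """Return (mobile_tech_rows, other_rows) based on Mobile_Tech_Flag."""
    mobile, other = [], []
    for row in csv.DictReader(handle):
        is_mobile = row["Mobile_Tech_Flag"].strip() == "1"
        (mobile if is_mobile else other).append(row)
    return mobile, other


mobile_rows, other_rows = split_by_flag(sample_csv)
print(len(mobile_rows), len(other_rows))
```

The same function would apply to dev_article, since it shares the Text_ID, Text, and Mobile_Tech_Flag columns and only adds Headline.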