Source: CrisisMMD (Alam et al., 2017)
Data Type: Multimodal. Each sample includes:
- tweet_text (social media text)
- tweet_image (corresponding image from the tweet)
Total Samples Used: ~18,802 (from the dataset)
Class Labels:
- 0 → Non-informative
- 1 → Informative
Only samples whose tweet_text and tweet_image labels agree were kept (12,743 tweets), then converted into train and test .pt files.
✅ Preprocessing Done
Text: Tokenized using the BERT tokenizer (bert-base-uncased); extracted input_ids… See the full description on the dataset page: https://huggingface.co/datasets/Henishma/crisisMMD_cleaned_task1.
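A minimal sketch of the text-side preprocessing described above, assuming hypothetical tweet strings, labels, and output file name; it only illustrates the tokenize-and-save step, not the actual dataset script:

```python
import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Hypothetical tweets and labels (1 = Informative, 0 = Non-informative).
texts = ["Flood waters rising near the bridge", "Nice weather today"]
labels = [1, 0]

# Tokenize to fixed-length input_ids / attention_mask tensors.
enc = tokenizer(texts, padding="max_length", truncation=True,
                max_length=128, return_tensors="pt")

# Save as a .pt file, mirroring the train/test files described above.
torch.save({"input_ids": enc["input_ids"],
            "attention_mask": enc["attention_mask"],
            "labels": torch.tensor(labels)}, "train.pt")
```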
This was converted from the PyTorch state_dict, and I'm not sure it will work because I got the warning below. I don't think the cls parameters matter, but I'm wondering about the position_ids.
Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.bias', 'bert.embeddings.position_ids', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.bias']
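For what it's worth, the `cls.*` weights belong to the pre-training heads (masked-LM and next-sentence prediction) that the base `TFBertModel` does not use, and `bert.embeddings.position_ids` is just a fixed buffer of position indices, so this warning is normally harmless. A minimal sketch of the conversion using the `from_pt` flag in transformers (the checkpoint path is a placeholder):

```python
from transformers import BertTokenizerFast, TFBertModel

# "path/to/pytorch_checkpoint" is a placeholder for the directory containing
# pytorch_model.bin and config.json (plus tokenizer files).
model = TFBertModel.from_pretrained("path/to/pytorch_checkpoint", from_pt=True)
tokenizer = BertTokenizerFast.from_pretrained("path/to/pytorch_checkpoint")

# Quick sanity check: encoder output should be (batch, seq_len, hidden_size).
inputs = tokenizer("a quick sanity check", return_tensors="tf")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
```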
MuRIL is a BERT model pre-trained on 17 Indian languages and their transliterated counterparts. We have released the pre-trained model (with the MLM layer intact, enabling masked word predictions) in this repository. We have also released the encoder on TFHub with an additional pre-processing module that processes raw text into the expected input format for the encoder. You can find more details on MuRIL in this paper.
Apache 2.0 License
Link to model on Hugging Face Hub
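Since the MLM layer is kept intact, masked word prediction should work out of the box. A hedged usage sketch, assuming the Hub ID is google/muril-base-cased (check the link above for the exact repository name):

```python
from transformers import pipeline

# "google/muril-base-cased" is assumed here; substitute the actual Hub ID.
fill_mask = pipeline("fill-mask", model="google/muril-base-cased")

# Predict the masked word in a Hindi sentence ("India is a [MASK] country.").
print(fill_mask("भारत एक [MASK] देश है।"))
```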
This model uses a BERT base architecture [1] pretrained from scratch using the Wikipedia [2], Common Crawl [3], PMINDIA [4] and Dakshina [5] corpora for 17 [6] Indian languages.
We use a training paradigm similar to multilingual BERT, with a few modifications as listed below:
We include translation and transliteration segment pairs in training as well. We keep an exponent value of 0.3 rather than 0.7 for upsampling, which has been shown to enhance low-resource performance [7]. See the Training section for more details.
The MuRIL model is pre-trained on monolingual segments as well as parallel segments, as detailed below.
We make use of publicly available corpora from Wikipedia and Common Crawl for 17 Indian languages.
We have two types of parallel data:
- Translated Data
We obtain translations of the above monolingual corpora using the Google NMT pipeline. We feed translated segment pairs as input. We also make use of the publicly available PMINDIA corpus.
- Transliterated Data
We obtain transliterations of Wikipedia using the IndicTrans [8] library. We feed transliterated segment pairs as input. We also make use of the publicly available Dakshina dataset.
We keep an exponent value of 0.3 to calculate duplication multiplier values for upsampling of lower-resourced languages and set dupe factors accordingly. Note that we limit transliterated pairs to Wikipedia only.
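As a rough illustration of this exponent-based upsampling (not the actual MuRIL data pipeline; the corpus sizes are made up), smoothing the natural language proportions with s = 0.3 and taking the ratio to the natural frequency yields the per-language duplication multiplier:

```python
# Exponent-smoothed sampling with s = 0.3; hypothetical corpus sizes.
s = 0.3
corpus_sizes = {"hi": 1_000_000, "ta": 200_000, "sa": 10_000}

total = sum(corpus_sizes.values())
p = {lang: n / total for lang, n in corpus_sizes.items()}   # natural proportions
z = sum(v ** s for v in p.values())
q = {lang: (v ** s) / z for lang, v in p.items()}           # smoothed sampling probabilities

# Duplication multiplier (dupe factor) relative to natural frequency:
# lower-resourced languages get the larger multipliers.
dupe = {lang: round(q[lang] / p[lang], 2) for lang in corpus_sizes}
print(dupe)
```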
The model was trained using a self-supervised masked language modeling task. We do whole word masking with a maximum of 80 predictions. The model was trained for 1000K steps, with a batch size of 4096, and a max sequence length of 512.
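For reference, the transformers library exposes a whole-word-masking collator that mirrors this objective; the sketch below is illustrative only (it is not the original TF pre-training code, and the Hub ID is an assumption):

```python
from transformers import AutoTokenizer, DataCollatorForWholeWordMask

tokenizer = AutoTokenizer.from_pretrained("google/muril-base-cased")  # assumed Hub ID
collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)

# The collator masks whole words, i.e. every WordPiece of a chosen word.
batch = collator([tokenizer("भारत एक विशाल देश है")])
print(batch["input_ids"])  # some positions replaced by [MASK]
print(batch["labels"])     # original ids at masked positions, -100 elsewhere
```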
All parameters in the module are trainable, and fine-tuning all parameters is the recommended practice.
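A hedged fine-tuning sketch with nothing frozen, in line with the recommendation above; the Hub ID, classification head, and hyperparameters are assumptions rather than values from the card:

```python
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

# Assumed Hub ID; add from_pt=True if only PyTorch weights are available.
model = TFAutoModelForSequenceClassification.from_pretrained(
    "google/muril-base-cased", num_labels=2)
model.trainable = True  # keep every parameter trainable (no frozen layers)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
# model.fit(train_dataset, epochs=3)  # train_dataset: tf.data.Dataset of (tokenized inputs, labels)
```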
This model is intended to be used for a variety of downstream NLP tasks for Indian languages. The model is also trained on transliterated data, since transliteration is commonly observed in the Indian context. This model is not expected to perform well on languages other than the ones used in pre-training, i.e., the 17 Indian languages.
@misc{khanuja2021muril,
title={MuRIL: Multilingual Representations for Indian Languages},
author={Simran Khanuja and Diksha Bansal and Sarvesh Mehtani and Savya Khosla and Atreyee Dey and Balaji Gopalan and Dilip Kumar Margam and Pooja Aggarwal and Rajiv Teja Nagipogu and Shachi Dave and Shruti Gupta and Subhash Chandra Bose Gali and Vish Subramanian and Partha Talukdar},
year={2021},
eprint={2103.10730},
archivePrefix={arXiv},
primaryClass={cs.CL}
}