Source: CrisisMMD (Alam et al., 2017)
Data Type: Multimodal. Each sample includes:
- tweet_text (social media text)
- tweet_image (corresponding image from the tweet)
Total Samples Used: ~18,802 (from the dataset)
Class Labels:
- 0 → Non-informative
- 1 → Informative
Only samples whose tweet_text and tweet_image labels agree were kept (12,743 tweets), then converted into train and test .pt files.
✅ Preprocessing Done
Text: Tokenized using the BERT tokenizer (bert-base-uncased); extracted input_ids… See the full description on the dataset page: https://huggingface.co/datasets/Henishma/crisisMMD_cleaned_task1.
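A minimal sketch of the text-side preprocessing described above, assuming hypothetical tweet strings, labels, and output file name; it only illustrates the tokenize-and-save step, not the actual dataset script:

```python
import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Hypothetical tweets and labels (1 = Informative, 0 = Non-informative).
texts = ["Flood waters rising near the bridge", "Nice weather today"]
labels = [1, 0]

# Tokenize to fixed-length input_ids / attention_mask tensors.
enc = tokenizer(texts, padding="max_length", truncation=True,
                max_length=128, return_tensors="pt")

# Save as a .pt file, mirroring the train/test files described above.
torch.save({"input_ids": enc["input_ids"],
            "attention_mask": enc["attention_mask"],
            "labels": torch.tensor(labels)}, "train.pt")
```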
This was converted from the PyTorch state_dict, and I'm not sure it will work because I got the warning below. I don't think the cls parameters matter, but I'm wondering about the position_ids.
Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.bias', 'bert.embeddings.position_ids', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.bias']
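For what it's worth, the `cls.*` weights belong to the pre-training heads (masked-LM and next-sentence prediction) that the base `TFBertModel` does not use, and `bert.embeddings.position_ids` is just a fixed buffer of position indices, so this warning is normally harmless. A minimal sketch of the conversion using the `from_pt` flag in transformers (the checkpoint path is a placeholder):

```python
from transformers import BertTokenizerFast, TFBertModel

# "path/to/pytorch_checkpoint" is a placeholder for the directory containing
# pytorch_model.bin and config.json (plus tokenizer files).
model = TFBertModel.from_pretrained("path/to/pytorch_checkpoint", from_pt=True)
tokenizer = BertTokenizerFast.from_pretrained("path/to/pytorch_checkpoint")

# Quick sanity check: encoder output should be (batch, seq_len, hidden_size).
inputs = tokenizer("a quick sanity check", return_tensors="tf")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
```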
MuRIL is a BERT model pre-trained on 17 Indian languages and their transliterated counterparts. We have released the pre-trained model (with the MLM layer intact, enabling masked word predictions) in this repository. We have also released the encoder on TFHub with an additional pre-processing module that processes raw text into the expected input format for the encoder. You can find more details on MuRIL in this paper.
Apache 2.0 License
Link to model on Hugging Face Hub
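Since the MLM layer is kept intact, masked word prediction should work out of the box. A hedged usage sketch, assuming the Hub ID is google/muril-base-cased (check the link above for the exact repository name):

```python
from transformers import pipeline

# "google/muril-base-cased" is assumed here; substitute the actual Hub ID.
fill_mask = pipeline("fill-mask", model="google/muril-base-cased")

# Predict the masked word in a Hindi sentence ("India is a [MASK] country.").
print(fill_mask("भारत एक [MASK] देश है।"))
```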
This model uses a BERT base architecture [1] pretrained from scratch using the Wikipedia [2], Common Crawl [3], PMINDIA [4] and Dakshina [5] corpora for 17 [6] Indian languages.
We use a training paradigm similar to multilingual BERT, with a few modifications as listed below:
We include translation and transliteration segment pairs in training as well. We keep an exponent value of 0.3 rather than 0.7 for upsampling, which has been shown to enhance low-resource performance [7]. See the Training section for more details.
The MuRIL model is pre-trained on monolingual segments as well as parallel segments, as detailed below.
We make use of publicly available corpora from Wikipedia and Common Crawl for 17 Indian languages.
We have two types of parallel data:
- Translated Data
We obtain translations of the above monolingual corpora using the Google NMT pipeline. We feed translated segment pairs as input. We also make use of the publicly available PMINDIA corpus.
- Transliterated Data
We obtain transliterations of Wikipedia using the IndicTrans [8] library. We feed transliterated segment pairs as input. We also make use of the publicly available Dakshina dataset.
We keep an exponent value of 0.3 to calculate duplication multiplier values for upsampling of lower-resourced languages and set dupe factors accordingly. Note that we limit transliterated pairs to Wikipedia only.
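As a rough illustration of this exponent-based upsampling (not the actual MuRIL data pipeline; the corpus sizes are made up), smoothing the natural language proportions with s = 0.3 and taking the ratio to the natural frequency yields the per-language duplication multiplier:

```python
# Exponent-smoothed sampling with s = 0.3; hypothetical corpus sizes.
s = 0.3
corpus_sizes = {"hi": 1_000_000, "ta": 200_000, "sa": 10_000}

total = sum(corpus_sizes.values())
p = {lang: n / total for lang, n in corpus_sizes.items()}   # natural proportions
z = sum(v ** s for v in p.values())
q = {lang: (v ** s) / z for lang, v in p.items()}           # smoothed sampling probabilities

# Duplication multiplier (dupe factor) relative to natural frequency:
# lower-resourced languages get the larger multipliers.
dupe = {lang: round(q[lang] / p[lang], 2) for lang in corpus_sizes}
print(dupe)
```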
The model was trained using a self-supervised masked language modeling task. We do whole word masking with a maximum of 80 predictions. The model was trained for 1000K steps, with a batch size of 4096, and a max sequence length of 512.
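For reference, the transformers library exposes a whole-word-masking collator that mirrors this objective; the sketch below is illustrative only (it is not the original TF pre-training code, and the Hub ID is an assumption):

```python
from transformers import AutoTokenizer, DataCollatorForWholeWordMask

tokenizer = AutoTokenizer.from_pretrained("google/muril-base-cased")  # assumed Hub ID
collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)

# The collator masks whole words, i.e. every WordPiece of a chosen word.
batch = collator([tokenizer("भारत एक विशाल देश है")])
print(batch["input_ids"])  # some positions replaced by [MASK]
print(batch["labels"])     # original ids at masked positions, -100 elsewhere
```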
All parameters in the module are trainable, and fine-tuning all parameters is the recommended practice.
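A hedged fine-tuning sketch with nothing frozen, in line with the recommendation above; the Hub ID, classification head, and hyperparameters are assumptions rather than values from the card:

```python
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

# Assumed Hub ID; add from_pt=True if only PyTorch weights are available.
model = TFAutoModelForSequenceClassification.from_pretrained(
    "google/muril-base-cased", num_labels=2)
model.trainable = True  # keep every parameter trainable (no frozen layers)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
# model.fit(train_dataset, epochs=3)  # train_dataset: tf.data.Dataset of (tokenized inputs, labels)
```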
This model is intended to be used for a variety of downstream NLP tasks for Indian languages. The model is also trained on transliterated data, since transliteration is commonly observed in the Indian context. This model is not expected to perform well on languages other than the ones used in pre-training, i.e., the 17 Indian languages.
@misc{khanuja2021muril,
title={MuRIL: Multilingual Representations for Indian Languages},
author={Simran Khanuja and Diksha Bansal and Sarvesh Mehtani and Savya Khosla and Atreyee Dey and Balaji Gopalan and Dilip Kumar Margam and Pooja Aggarwal and Rajiv Teja Nagipogu and Shachi Dave and Shruti Gupta and Subhash Chandra Bose Gali and Vish Subramanian and Partha Talukdar},
year={2021},
eprint={2103.10730},
archivePrefix={arXiv},
primaryClass={cs.CL}
}