2 datasets found
  1. crisisMMD_cleaned_task1

    • huggingface.co
    Updated Aug 9, 2024
    Cite
    Henishma AR (2024). crisisMMD_cleaned_task1 [Dataset]. https://huggingface.co/datasets/Henishma/crisisMMD_cleaned_task1
    Authors
    Henishma AR
    Description

    Source: CrisisMMD (Alam et al., 2017). Data type: multimodal; each sample includes tweet_text (social media text) and tweet_image (the corresponding image from the tweet). Total samples used: ~18,802 from the original dataset. Class labels: 0 → Non-informative, 1 → Informative. Only samples whose tweet_text and tweet_image labels agree were kept (12,743 tweets), which were then converted into train and test .pt files. Preprocessing done: text tokenized with the BERT tokenizer (bert-base-uncased), extracting input_ids… See the full description on the dataset page: https://huggingface.co/datasets/Henishma/crisisMMD_cleaned_task1.
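
    A minimal sketch of the preprocessing described above, assuming hypothetical in-memory lists of tweets and labels rather than the author's actual cleaning script:

    # Tokenize tweet text with bert-base-uncased and bundle the tensors into a .pt file,
    # mirroring the train/test .pt files mentioned in the description.
    import torch
    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    texts = ["Flood waters rising near the bridge", "Good morning everyone!"]  # hypothetical tweets
    labels = [1, 0]  # 1 = Informative, 0 = Non-informative

    enc = tokenizer(texts, padding="max_length", truncation=True, max_length=128, return_tensors="pt")

    torch.save(
        {"input_ids": enc["input_ids"],
         "attention_mask": enc["attention_mask"],
         "labels": torch.tensor(labels)},
        "train.pt",
    )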

  2. MuRIL Large tf

    • kaggle.com
    Updated Oct 16, 2021
    Cite
    Nicholas Broad (2021). MuRIL Large tf [Dataset]. https://www.kaggle.com/nbroad/muril-large-tf/metadata
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Nicholas Broad
    Description

    This was converted from the PyTorch state_dict, and I'm not sure it will work because I got the following warning. I don't think the cls parameters matter, but I'm wondering about the position_ids:

    Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.bias', 'bert.embeddings.position_ids', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.bias']
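
    A hedged sketch of the kind of conversion that produces this warning, using the Transformers from_pt path; the source checkpoint id ("google/muril-large-cased") is an assumption and may not be exactly what was converted here:

    # Load the PyTorch weights into the TF 2.0 architecture. The pretraining-head
    # (cls.*) weights and the position_ids buffer have no counterpart in TFBertModel,
    # which is what triggers the "some weights were not used" warning quoted above.
    from transformers import TFBertModel

    tf_model = TFBertModel.from_pretrained("google/muril-large-cased", from_pt=True)
    tf_model.save_pretrained("muril-large-tf")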

    MuRIL: Multilingual Representations for Indian Languages

    MuRIL is a BERT model pre-trained on 17 Indian languages and their transliterated counterparts. We have released the pre-trained model (with the MLM layer intact, enabling masked word predictions) in this repository. We have also released the encoder on TFHub with an additional pre-processing module that processes raw text into the expected input format for the encoder. You can find more details on MuRIL in this paper.
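
    Because the MLM layer is intact, masked word prediction works out of the box; a minimal sketch using the Hugging Face fill-mask pipeline (the model id "google/muril-large-cased" is an assumption based on the Hub link below):

    # Predict the masked token in a Hindi sentence ("India is a [MASK] country.").
    from transformers import pipeline

    fill = pipeline("fill-mask", model="google/muril-large-cased")
    for pred in fill("भारत एक [MASK] देश है।")[:3]:
        print(pred["token_str"], round(pred["score"], 3))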

    Apache 2.0 License

    Link to model on Hugging Face Hub

    Overview

    This model uses a BERT base architecture [1] pretrained from scratch using the Wikipedia [2], Common Crawl [3], PMINDIA [4] and Dakshina [5] corpora for 17 [6] Indian languages.

    We use a training paradigm similar to multilingual BERT, with a few modifications:

    - We include translation and transliteration segment pairs in training as well.
    - We keep an exponent value of 0.3 rather than 0.7 for upsampling, which has been shown to enhance low-resource performance [7].

    See the Training section for more details.

    Training

    The MuRIL model is pre-trained on monolingual segments as well as parallel segments, as detailed below.

    Monolingual Data

    We make use of publicly available corpora from Wikipedia and Common Crawl for 17 Indian languages.

    Parallel Data

    We have two types of parallel data:

    - Translated Data: We obtain translations of the above monolingual corpora using the Google NMT pipeline. We feed translated segment pairs as input. We also make use of the publicly available PMINDIA corpus.
    - Transliterated Data: We obtain transliterations of Wikipedia using the IndicTrans [8] library. We feed transliterated segment pairs as input. We also make use of the publicly available Dakshina dataset.

    We keep an exponent value of 0.3 to calculate duplication multiplier values for upsampling of lower-resourced languages and set dupe factors accordingly. Note that we limit transliterated pairs to Wikipedia only.
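
    A worked sketch of the exponent-based upsampling described above (not the authors' code); the corpus sizes are made-up illustrative numbers:

    # Sampling probabilities proportional to n_i ** 0.3, so low-resource languages
    # are sampled far more often than their raw data share would suggest.
    sizes = {"en": 1_000_000_000, "hi": 100_000_000, "as": 1_000_000}  # hypothetical segment counts
    alpha = 0.3

    weights = {lang: n ** alpha for lang, n in sizes.items()}
    total = sum(weights.values())
    raw_total = sum(sizes.values())

    for lang, n in sizes.items():
        prob = weights[lang] / total
        raw_share = n / raw_total
        # The ratio prob / raw_share acts like a duplication multiplier ("dupe factor").
        print(lang, round(prob, 3), round(prob / raw_share, 1))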

    The model was trained using a self-supervised masked language modeling task. We do whole word masking with a maximum of 80 predictions. The model was trained for 1000K steps, with a batch size of 4096, and a max sequence length of 512.
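
    A hedged sketch of the whole-word-masking objective using the Transformers data collator; this mirrors the described task but is not the original TF pretraining code, and the 15% masking rate is an assumption:

    # Build one whole-word-masked example; labels are -100 everywhere except the masked positions.
    from transformers import AutoTokenizer, DataCollatorForWholeWordMask

    tokenizer = AutoTokenizer.from_pretrained("google/muril-large-cased")
    collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)

    example = tokenizer("नमस्ते दुनिया", truncation=True, max_length=512)
    batch = collator([example])
    print(batch["input_ids"].shape, int((batch["labels"] != -100).sum()), "positions masked")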

    Trainable parameters

    All parameters in the module are trainable, and fine-tuning all parameters is the recommended practice.
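
    A minimal fine-tuning sketch with every parameter left trainable, assuming a toy binary classification task; the model id and the PyTorch classification head are assumptions, not part of this dataset:

    # Full fine-tuning: the optimizer sees all encoder parameters plus the new classification head.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("google/muril-large-cased")
    model = AutoModelForSequenceClassification.from_pretrained("google/muril-large-cased", num_labels=2)
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    batch = tokenizer(["यह फिल्म बहुत अच्छी थी"], return_tensors="pt")  # hypothetical labelled example
    loss = model(**batch, labels=torch.tensor([1])).loss
    loss.backward()
    optimizer.step()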

    Uses & Limitations

    This model is intended to be used for a variety of downstream NLP tasks for Indian languages. It is also trained on transliterated data, since transliteration is a phenomenon commonly observed in the Indian context. The model is not expected to perform well on languages other than the ones used in pre-training, i.e. the 17 Indian languages listed below.

    Citation

    @misc{khanuja2021muril,
       title={MuRIL: Multilingual Representations for Indian Languages},
       author={Simran Khanuja and Diksha Bansal and Sarvesh Mehtani and Savya Khosla and Atreyee Dey and Balaji Gopalan and Dilip Kumar Margam and Pooja Aggarwal and Rajiv Teja Nagipogu and Shachi Dave and Shruti Gupta and Subhash Chandra Bose Gali and Vish Subramanian and Partha Talukdar},
       year={2021},
       eprint={2103.10730},
       archivePrefix={arXiv},
       primaryClass={cs.CL}
    }
    

    References

    [1]: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805, 2018.

    [2]: Wikipedia

    [3]: Common Crawl

    [4]: PMINDIA

    [5]: Dakshina

    [6]: Assamese (as), Bengali (bn), English (en), Gujarati (gu), Hindi (hi), Kannada (kn), Kashmiri (ks), Malayalam (ml), Marathi (mr), Nepali (ne), Oriya (or), Punjabi (pa), Sanskrit (sa), Sindhi (sd), Tamil (ta), Telugu (te) and Urdu (ur).

    [7]: Conneau, Alexis, et al. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.0...

