13 datasets found
  1. Bio_ClinicalBERT

    • kaggle.com
    zip
    Updated Apr 21, 2022
    Cite
    Aditi Dutta (2022). Bio_ClinicalBERT [Dataset]. https://www.kaggle.com/datasets/aditidutta/bio-clinicalbert
    Explore at:
    Available download formats: zip (806570272 bytes)
    Dataset updated
    Apr 21, 2022
    Authors
    Aditi Dutta
    Description

    # ClinicalBERT - Bio + Clinical BERT Model

    The Publicly Available Clinical BERT Embeddings paper contains four unique clinicalBERT models: initialized with BERT-Base (cased_L-12_H-768_A-12) or BioBERT (BioBERT-Base v1.0 + PubMed 200K + PMC 270K) & trained on either all MIMIC notes or only discharge summaries.

    This model card describes the Bio+Clinical BERT model, which was initialized from BioBERT & trained on all MIMIC notes.

    Pretraining Data

    The Bio_ClinicalBERT model was trained on all notes from MIMIC-III, a database containing electronic health records from ICU patients at the Beth Israel Hospital in Boston, MA. For more details, see the MIMIC-III documentation. All notes from the NOTEEVENTS table were included (~880M words).

    Model Pretraining

    Note Preprocessing

    Each note in MIMIC was first split into sections using a rules-based section splitter (e.g., discharge summary notes were split into "History of Present Illness", "Family History", "Brief Hospital Course", etc. sections). Each section was then split into sentences using SciSpacy (the en_core_sci_md model).
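
    For illustration, that sentence-splitting step might look like the sketch below; a minimal example assuming scispacy and the en_core_sci_md model are installed, with an invented sample section:

    import spacy

    # Load the SciSpacy biomedical model used for sentence segmentation
    # (requires scispacy plus the en_core_sci_md model wheel).
    nlp = spacy.load("en_core_sci_md")

    section = ("History of Present Illness: The patient is a 65 yo male with a history of CHF "
               "who presented with worsening shortness of breath. He was admitted for diuresis.")
    sentences = [sent.text for sent in nlp(section).sents]
    print(sentences)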

    Pretraining Hyperparameters

    We used a batch size of 32, a maximum sequence length of 128, and a learning rate of 5e-5 for pre-training our models. The models trained on all MIMIC notes were trained for 150,000 steps. The dup factor for duplicating input data with different masks was set to 5. All other default parameters were used (specifically, masked language model probability = 0.15 and max predictions per sequence = 20).
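
    These hyperparameters map roughly onto a modern transformers masked-LM setup. The sketch below is an approximation, not the original TensorFlow BERT pretraining code; the BioBERT checkpoint id and the tokenized MIMIC dataset are stand-ins:

    from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-v1.1")   # stand-in for BioBERT-Base v1.0
    model = AutoModelForMaskedLM.from_pretrained("dmis-lab/biobert-v1.1")

    # Masked-LM collator with the masking probability quoted above.
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

    args = TrainingArguments(
        output_dir="bio_clinical_bert_mlm",
        per_device_train_batch_size=32,   # batch size 32
        learning_rate=5e-5,               # learning rate 5e-5
        max_steps=150_000,                # 150,000 steps
    )
    # trainer = Trainer(model=model, args=args, data_collator=collator,
    #                   train_dataset=tokenized_mimic_notes)  # MIMIC-III notes require credentialed access
    # trainer.train()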

    How to use the model

    Load the model via the transformers library:

    from transformers import AutoTokenizer, AutoModel
    tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
    model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
    

    More Information

    Refer to the original paper, Publicly Available Clinical BERT Embeddings (NAACL Clinical NLP Workshop 2019) for additional details and performance on NLI and NER tasks.

  2. BERT-base-multilingual-cased

    • kaggle.com
    zip
    Updated Jun 15, 2021
    Cite
    Aditi Dutta (2021). BERT-base-multilingual-cased [Dataset]. https://www.kaggle.com/aditidutta/bert-base-multilingual-cased
    Explore at:
    Available download formats: zip (2329614912 bytes)
    Dataset updated
    Jun 15, 2021
    Authors
    Aditi Dutta
    Description

    Pretrained model on the top 104 languages with the largest Wikipedia using a masked language modeling (MLM) objective. This dataset contains many popular BERT weights retrieved directly on Hugging Face's model repository and hosted on Kaggle. (104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters)

    NOTE: You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to be fine-tuned on a downstream task.

    Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification, or question answering. For tasks such as text generation you should look at models like GPT-2.

    Here is how to use this model to get the features of a given text in PyTorch:

    from transformers import BertTokenizer, BertModel
    tokenizer = BertTokenizer.from_pretrained('PATH_TO_THIS_FILE')
    model = BertModel.from_pretrained("PATH_TO_THIS_FILE")
    text = "Replace me by any text you'd like."
    encoded_input = tokenizer(text, return_tensors='pt')
    output = model(**encoded_input)
    

    and in TensorFlow:

    from transformers import BertTokenizer, TFBertModel
    tokenizer = BertTokenizer.from_pretrained('PATH_TO_THIS_FILE')
    model = TFBertModel.from_pretrained("PATH_TO_THIS_FILE")
    text = "Replace me by any text you'd like."
    encoded_input = tokenizer(text, return_tensors='tf')
    output = model(encoded_input)
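
    As the note above says, the checkpoint is mostly intended to be fine-tuned on a downstream task; a minimal sequence-classification setup might look like this (the label count is a placeholder):

    from transformers import BertForSequenceClassification, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('PATH_TO_THIS_FILE')
    model = BertForSequenceClassification.from_pretrained('PATH_TO_THIS_FILE',
                                                           num_labels=3)  # placeholder label count
    # From here, fine-tune with the transformers Trainer or a plain PyTorch loop
    # over a labeled dataset; the classification head is randomly initialized.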
    

    Acknowledgments

    All the copyrights and IP relating to BERT belong to the original authors (Devlin et al., 2019) and Google. All copyrights relating to the transformers library belong to Hugging Face. Some of the models are community created or trained. Please reach out directly to the authors if you have questions regarding licenses and usage.

    @article{DBLP:journals/corr/abs-1810-04805,
      author        = {Jacob Devlin and Ming{-}Wei Chang and Kenton Lee and Kristina Toutanova},
      title         = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language Understanding},
      journal       = {CoRR},
      volume        = {abs/1810.04805},
      year          = {2018},
      url           = {http://arxiv.org/abs/1810.04805},
      archivePrefix = {arXiv},
      eprint        = {1810.04805},
      timestamp     = {Tue, 30 Oct 2018 20:39:56 +0100},
      biburl        = {https://dblp.org/rec/journals/corr/abs-1810-04805.bib},
      bibsource     = {dblp computer science bibliography, https://dblp.org}
    }

  3. safety-qa-bert-dataset

    • huggingface.co
    Cite
    Abid Ali Khan Danish, safety-qa-bert-dataset [Dataset]. https://huggingface.co/datasets/adanish91/safety-qa-bert-dataset
    Explore at:
    Authors
    Abid Ali Khan Danish
    Description

    Safety QA Dataset

      Dataset Description
    

    There are two datasets, both drawn from publicly available data from the Mine Safety and Health Administration (MSHA). The 'seed_annotated_data.csv' file contains seed-annotated data, where answers to safety-related questions are annotated in the accident narratives for initial training. The main 'training data.csv' file is used during the active learning (AL) process for question answering tasks in occupational safety and health… See the full description on the dataset page: https://huggingface.co/datasets/adanish91/safety-qa-bert-dataset.
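
    A minimal loading sketch with the datasets library; whether the CSVs resolve automatically or need an explicit data_files mapping is an assumption:

    from datasets import load_dataset

    # Point data_files at the CSV named in the description above.
    ds = load_dataset("adanish91/safety-qa-bert-dataset",
                      data_files={"seed": "seed_annotated_data.csv"})
    print(ds["seed"][0])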

  4. VirBiCla-training

    • huggingface.co
    Updated Aug 9, 2024
    Cite
    Clelia Astra Bertelli (2024). VirBiCla-training [Dataset]. https://huggingface.co/datasets/as-cle-bert/VirBiCla-training
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Aug 9, 2024
    Authors
    Clelia Astra Bertelli
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset Card for VirBiCla-training

    VirBiCla is a ML-based viral DNA detector designed for long-read sequencing metagenomics. This dataset is a support dataset for training the base ML model.

      Dataset Details

      Dataset Sources

    Repository: GitHub repository for VirBiCla

      Uses
    

    This dataset is intended as support for training the base VirBiCla model

      Dataset Structure
    

    The dataset is a CSV file composed of 60,003 record sequences (coming… See the full description on the dataset page: https://huggingface.co/datasets/as-cle-bert/VirBiCla-training.

  5. yue-wiki-pl-bert

    • huggingface.co
    Updated Apr 6, 2025
    Cite
    hon9kon9ize (2025). yue-wiki-pl-bert [Dataset]. https://huggingface.co/datasets/hon9kon9ize/yue-wiki-pl-bert
    Explore at:
    Dataset updated
    Apr 6, 2025
    Dataset authored and provided by
    hon9kon9ize
    Description

    Yue-Wiki-PL-BERT Dataset

      Overview
    

    This dataset contains processed text data from Cantonese Wikipedia articles, specifically formatted for training or fine-tuning BERT-like models for Cantonese language processing. The dataset is created by hon9kon9ize and contains 176,177 rows of training data.

      Description
    

    The Yue-Wiki-PL-BERT dataset is a structured collection of Cantonese text data extracted from Wikipedia, with each entry containing:

    id: A… See the full description on the dataset page: https://huggingface.co/datasets/hon9kon9ize/yue-wiki-pl-bert.

  6. Symptom_Text_Labels

    • huggingface.co
    Updated Oct 20, 2024
    Cite
    Wisnu Afifuddin (2024). Symptom_Text_Labels [Dataset]. https://huggingface.co/datasets/InVoS/Symptom_Text_Labels
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Oct 20, 2024
    Authors
    Wisnu Afifuddin
    Description

    Dataset Card for Dataset Name

    Dataset for BERT Training Model

      Dataset Details
    

    This dataset contains sentence text and symptoms. I created it using a dataset I found on Hugging Face under the account name Venetis, then modified it to contain more text sentences and symptom labels.

      Dataset Description
    

    Curated by: [More Information Needed] Funded by [optional]: [More Information Needed] Shared by [optional]: [More Information Needed] Language(s) (NLP): [More… See the full description on the dataset page: https://huggingface.co/datasets/InVoS/Symptom_Text_Labels.

  7. Complete code and datasets for "ESNLIR: Expanding Spanish NLI Benchmarks...

    • zenodo.org
    bin, pdf, zip
    Updated Nov 12, 2025
    Cite
    Johan David Rodriguez Portela; Rubén Francisco Manrique Piramanrique; Nicolás Perez Terán (2025). Complete code and datasets for "ESNLIR: Expanding Spanish NLI Benchmarks with Multi-Genre and Causal Annotation" [Dataset]. http://doi.org/10.5281/zenodo.15002575
    Explore at:
    Available download formats: bin, zip, pdf
    Dataset updated
    Nov 12, 2025
    Dataset provided by
    Arxiv
    Authors
    Johan David Rodriguez Portela; Rubén Francisco Manrique Piramanrique; Nicolás Perez Terán
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ESNLIR: Expanding Spanish NLI Benchmarks with Multi-Genre and Causal Annotation

    This is the complete code, model and datasets for the article ESNLIR: Expanding Spanish NLI Benchmarks with Multi-genre and Causal Annotation.

    In case you cannot access the article, this preprint is available: ESNLIR: A Spanish Multi-Genre Dataset with Causal Relationships.

    How to cite:

    Portela, J.R., Pérez-Terán, N., Manrique, R. (2026). ESNLIR: Expanding Spanish NLI Benchmarks with Multi-genre and Causal Annotation. In: Florez, H., Peluffo-Ordoñez, D. (eds) Applied Informatics. ICAI 2025. Communications in Computer and Information Science, vol 2667. Springer, Cham. https://doi.org/10.1007/978-3-032-07175-0_23

    IMPORTANT UPDATE!!!

    It is strongly advised to work with the following links, instead of working directly from Zenodo:

    • CODE REPOSITORY: This repository contains the code used for the article.

    • SMALL EXAMPLE REPOSITORY: This repository contains a small code example showing you how to train, and predict using a very small toy dataset, with the same structure.

    • HUGGING FACE COLLECTION: Huggingface collection containing the dataset and models.

    If you still want to use the Zenodo repository, follow the steps below. But once again, it is way easier to work with the links above.

    ----------------------------------------------------------------------------------------------

    Installation

    This repository is a poetry project, which means that it can be installed easily by executing the following command from a shell in the repository folder:

    poetry install

    As this repository is script-based, the README.md file contains all the commands executed to generate the dataset and train the models.

    ----------------------------------------------------------------------------------------------

    Core code

    The core code used for all the experiments is in the folder auto-nli, and all the calls to the core code with the requested parameters can be found in README.md.

    ----------------------------------------------------------------------------------------------

    Parameters

    All the parameters to create datasets and train models with the core code are found in the folder parameters.

    ----------------------------------------------------------------------------------------------

    Models

    Model types

    For BERT-based models (all in PyTorch), two types of Hugging Face models were used for training; they are also required when loading a dataset, because of the tokenizer: the BERTIN-based RoBERTa models and the XLM-RoBERTa models listed below.

    Model folder

    The model folder contains all the trained models for the paper. There are three types of models:

    • baseline: An XGBoost model that can be loaded with pickle.
    • roberta: BERTIN based models in pytorch. You can load them with the model_path
    • xlmroberta: XLMRoBERTa based models in pytorch. You can load them with the model_path

    Models with the suffix _annot are models trained with the premise (first sentence) only. Apart from the PyTorch model folder, each model result folder (ex: ) contains the test results for the test set and the stress test sets (ex: ).

    Load model

    Models are found in the folder model; all of them are PyTorch models, which can be loaded through the Hugging Face interface:

    from transformers import AutoModel

    model = AutoModel.from_pretrained("<path_to_model_folder>")  # placeholder: path to one of the trained model folders

    ----------------------------------------------------------------------------------------------

    Dataset

    labeled_final_dataset.jsonl

    This file is included outside the ZIP containing all other files, and it contains the final test dataset with 974 examples selected by human majority label matching the original linking phrase label.

    Other datasets:

    The datasets can be found in the folder data that is divided in the following folders:

    base_dataset

    The splits to train, validate and test the models.

    splits_data

    Splits of train-val-test extracted for each corpora. They are used to generate base_dataset.

    sentence_data

    Pairs of sentences found in each corpus. They are used to generate splits_data.

    Dataset dictionary

    This repository contains the splits that resulted from the research project "ESNLIR: A Spanish Multi-Genre Dataset with Causal Relationships". All the splits are in JSONL format and have the same fields per example:

    • sentence_1: First sentence of the pair.
    • sentence_2: Second sentence of the pair.
    • connector: Linking phrase used to extract pair.
    • connector_type: NLI label, one of "contrasting", "entailment", "reasoning", or "neutral"
    • extraction_strategy: "linking_phrase" for "contrasting", "entailment", and "reasoning"; "none" for "neutral"
    • distance: How many sentences before the connector sentence_1 appears
    • sentence_1_position: Number of sentence for sentence_1 in the source document
    • sentence_1_paragraph: Number of paragraph for sentence_1 in the source document
    • sentence_2_position: Number of sentence for sentence_2 in the source document
    • sentence_2_paragraph: Number of paragraph for sentence_2 in the source document
    • id: Unique identifier for the example
    • dataset: Source corpus of the pair. Metadata of corpus, including source can be found in dataset_metadata.xlsx.
    • genre: Writing genre of the dataset.
    • domain: Domain genre of the dataset.

    Example:

    {"sentence_1":"sefior Bcajavides no es moderado, tampoco lo convertirse e\u00f1 declarada divergencia de miras polileido en griego","sentence_2":"era mayor claricomentarios, as\u00ed de los peri\u00f3dicos como de los homes dado \u00e1 la voluntad de los hombres, sin que sobreticas","connector":"por consiguiente,","connector_type":"reasoning","extraction_strategy":"linking_phrase","distance":1.0,"sentence_1_paragraph":4,"sentence_1_position":86,"sentence_2_paragraph":4,"sentence_2_position":87,"id":"esnews_spanish_pd_news_531537","dataset":"esnews_spanish_pd_news","genre":"news","domain":"spanish_public_domain_news"}

    Dataset load

    To load a dataset split as a PyTorch object for training, validation, or testing, you must use the custom BERTDataset class:

    import os

    from auto_nli.model.bert_based.dataset import BERTDataset

    dataset = BERTDataset(
        os.path.join(dataset_folder, ...),  # path to the split file (elided in the original)
        max_len=...,          # maximum sequence length
        model_type=...,       # which pretrained model/tokenizer to use (see "Model types" above)
        only_premise=...,     # True to use only the premise (sentence_1)
        max_samples=...,      # optional cap on the number of examples
    )

    ----------------------------------------------------------------------------------------------

    Notebooks

    The folder notebooks contains a collection of jupyter notebooks used to preprocess datasets and visualize results.

  8. clr-finetuned-model-weights

    • kaggle.com
    zip
    Updated Jul 10, 2021
    Cite
    Saurav Maheshkar ☕️ (2021). clr-finetuned-model-weights [Dataset]. https://www.kaggle.com/sauravmaheshkar/clrfinetunedmodelweights
    Explore at:
    Available download formats: zip (4356936181 bytes)
    Dataset updated
    Jul 10, 2021
    Authors
    Saurav Maheshkar ☕️
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description


    FineTuning Metrics

    Architecture         Weights           Training Loss   Validation Loss
    roberta-base         huggingface/hub   0.641           0.4728
    bert-base-uncased    huggingface/hub   0.6781          0.4977
    albert-base          huggingface/hub   0.7119          0.5155
    xlm-roberta-base     huggingface/hub   0.7225          0.525
    bert-large-uncased   huggingface/hub   0.7482          0.5161
    albert-large         huggingface/hub   1.075           0.9921
    roberta-large        huggingface/hub   2.749           1.075

  9. pre_train_odia_data_processed

    • huggingface.co
    Updated Nov 10, 2024
    Cite
    OdiaGenAIdata (2024). pre_train_odia_data_processed [Dataset]. https://huggingface.co/datasets/OdiaGenAIdata/pre_train_odia_data_processed
    Explore at:
    Dataset updated
    Nov 10, 2024
    Dataset authored and provided by
    OdiaGenAIdata
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    About

    This dataset is curated from different open-source datasets and from Odia data prepared using different techniques (web scraping, OCR), manually corrected by Odia native speakers. The dataset is uniformly processed but contains duplicated entries, which can be handled depending on usage. For more details about the data, see the blog post.

      Use Cases
    

    The dataset has many use cases such as:

    Pre-training Odia LLM, Building the Odia BERT model, Building Odia… See the full description on the dataset page: https://huggingface.co/datasets/OdiaGenAIdata/pre_train_odia_data_processed.
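
    Since the corpus intentionally keeps duplicated entries, a possible deduplication sketch with the datasets library; the split name and the text column name are assumptions about the schema:

    from datasets import load_dataset

    ds = load_dataset("OdiaGenAIdata/pre_train_odia_data_processed", split="train")  # split name assumed
    df = ds.to_pandas()
    deduped = df.drop_duplicates(subset="text")   # "text" column name is assumed
    print(f"{len(df)} rows -> {len(deduped)} rows after dropping duplicates")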

  10. MuRIL Large tf

    • kaggle.com
    zip
    Updated Oct 16, 2021
    Cite
    Nicholas Broad (2021). MuRIL Large tf [Dataset]. https://www.kaggle.com/nbroad/muril-large-tf
    Explore at:
    Available download formats: zip (1883316797 bytes)
    Dataset updated
    Oct 16, 2021
    Authors
    Nicholas Broad
    Description

    This was converted from the PyTorch state_dict, and I'm not sure it will work because I got the warning below. I don't think the cls parameters matter, but I'm wondering about the position_ids.

    Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.bias', 'bert.embeddings.position_ids', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.bias']
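
    For reference, this kind of PyTorch-to-TensorFlow conversion is usually done by loading the PyTorch checkpoint with from_pt=True; a minimal sketch, assuming the Hugging Face id google/muril-large-cased (not stated here):

    from transformers import TFBertModel

    # Load PyTorch weights into the TF 2.0 class; the unused cls.* heads trigger
    # exactly the kind of warning quoted above and are dropped for a plain TFBertModel.
    tf_model = TFBertModel.from_pretrained("google/muril-large-cased", from_pt=True)
    tf_model.save_pretrained("muril-large-tf")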

    MuRIL: Multilingual Representations for Indian Languages

    MuRIL is a BERT model pre-trained on 17 Indian languages and their transliterated counterparts. We have released the pre-trained model (with the MLM layer intact, enabling masked word predictions) in this repository. We have also released the encoder on TFHub with an additional pre-processing module, that processes raw text into the expected input format for the encoder. You can find more details on MuRIL in this paper.

    Apache 2.0 License

    Link to model on Hugging Face Hub

    Overview

    This model uses a BERT base architecture [1] pretrained from scratch using the Wikipedia [2], Common Crawl [3], PMINDIA [4] and Dakshina [5] corpora for 17 [6] Indian languages.

    We use a training paradigm similar to multilingual BERT, with a few modifications as listed:

    • We include translation and transliteration segment pairs in training as well.
    • We keep an exponent value of 0.3 (and not 0.7) for upsampling, shown to enhance low-resource performance. [7]

    See the Training section for more details.

    Training

    The MuRIL model is pre-trained on monolingual segments as well as parallel segments, as detailed below.

    Monolingual Data

    We make use of publicly available corpora from Wikipedia and Common Crawl for 17 Indian languages.

    Parallel Data

    We have two types of parallel data:

    • Translated Data: We obtain translations of the above monolingual corpora using the Google NMT pipeline. We feed translated segment pairs as input. We also make use of the publicly available PMINDIA corpus.
    • Transliterated Data: We obtain transliterations of Wikipedia using the IndicTrans [8] library. We feed transliterated segment pairs as input. We also make use of the publicly available Dakshina dataset.

    We keep an exponent value of 0.3 to calculate duplication multiplier values for upsampling of lower resourced languages and set dupe factors accordingly. Note, we limit transliterated pairs to Wikipedia only.

    The model was trained using a self-supervised masked language modeling task. We do whole word masking with a maximum of 80 predictions. The model was trained for 1000K steps, with a batch size of 4096, and a max sequence length of 512.

    Trainable parameters

    All parameters in the module are trainable, and fine-tuning all parameters is the recommended practice.

    Uses & Limitations

    This model is intended to be used for a variety of downstream NLP tasks for Indian languages. This model is trained on transliterated data as well, a phenomenon commonly observed in the Indian context. This model is not expected to perform well on languages other than the ones used in pretraining, i.e. the 17 Indian languages.

    Citation

    @misc{khanuja2021muril,
       title={MuRIL: Multilingual Representations for Indian Languages},
       author={Simran Khanuja and Diksha Bansal and Sarvesh Mehtani and Savya Khosla and Atreyee Dey and Balaji Gopalan and Dilip Kumar Margam and Pooja Aggarwal and Rajiv Teja Nagipogu and Shachi Dave and Shruti Gupta and Subhash Chandra Bose Gali and Vish Subramanian and Partha Talukdar},
       year={2021},
       eprint={2103.10730},
       archivePrefix={arXiv},
       primaryClass={cs.CL}
    }
    

    References

    [1]: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805, 2018.

    [2]: Wikipedia

    [3]: Common Crawl

    [4]: PMINDIA

    [5]: Dakshina

    [6]: Assamese (as), Bengali (bn), English (en), Gujarati (gu), Hindi (hi), Kannada (kn), Kashmiri (ks), Malayalam (ml), Marathi (mr), Nepali (ne), Oriya (or), Punjabi (pa), Sanskrit (sa), Sindhi (sd), Tamil (ta), Telugu (te) and Urdu (ur).

    [7]: Conneau, Alexis, et al. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116 (2019).

  11. MuRIL Large pt

    • kaggle.com
    zip
    Updated Oct 16, 2021
    + more versions
    Cite
    Nicholas Broad (2021). MuRIL Large pt [Dataset]. https://www.kaggle.com/datasets/nbroad/muril-large-pt/code
    Explore at:
    Available download formats: zip (1883086276 bytes)
    Dataset updated
    Oct 16, 2021
    Authors
    Nicholas Broad
    Description

    MuRIL: Multilingual Representations for Indian Languages

    MuRIL is a BERT model pre-trained on 17 Indian languages and their transliterated counterparts. We have released the pre-trained model (with the MLM layer intact, enabling masked word predictions) in this repository. We have also released the encoder on TFHub with an additional pre-processing module, that processes raw text into the expected input format for the encoder. You can find more details on MuRIL in this paper.

    Apache 2.0 License

    Link to model on Hugging Face Hub

    Overview

    This model uses a BERT base architecture [1] pretrained from scratch using the Wikipedia [2], Common Crawl [3], PMINDIA [4] and Dakshina [5] corpora for 17 [6] Indian languages.

    We use a training paradigm similar to multilingual BERT, with a few modifications as listed:

    • We include translation and transliteration segment pairs in training as well.
    • We keep an exponent value of 0.3 (and not 0.7) for upsampling, shown to enhance low-resource performance. [7]

    See the Training section for more details.

    Training

    The MuRIL model is pre-trained on monolingual segments as well as parallel segments, as detailed below.

    Monolingual Data

    We make use of publicly available corpora from Wikipedia and Common Crawl for 17 Indian languages.

    Parallel Data

    We have two types of parallel data:

    • Translated Data: We obtain translations of the above monolingual corpora using the Google NMT pipeline. We feed translated segment pairs as input. We also make use of the publicly available PMINDIA corpus.
    • Transliterated Data: We obtain transliterations of Wikipedia using the IndicTrans [8] library. We feed transliterated segment pairs as input. We also make use of the publicly available Dakshina dataset.

    We keep an exponent value of 0.3 to calculate duplication multiplier values for upsampling of lower resourced languages and set dupe factors accordingly. Note, we limit transliterated pairs to Wikipedia only.

    The model was trained using a self-supervised masked language modeling task. We do whole word masking with a maximum of 80 predictions. The model was trained for 1000K steps, with a batch size of 4096, and a max sequence length of 512.

    Trainable parameters

    All parameters in the module are trainable, and fine-tuning all parameters is the recommended practice.

    Uses & Limitations

    This model is intended to be used for a variety of downstream NLP tasks for Indian languages. This model is trained on transliterated data as well, a phenomenon commonly observed in the Indian context. This model is not expected to perform well on languages other than the ones used in pretraining, i.e. the 17 Indian languages.

    Citation

    @misc{khanuja2021muril,
       title={MuRIL: Multilingual Representations for Indian Languages},
       author={Simran Khanuja and Diksha Bansal and Sarvesh Mehtani and Savya Khosla and Atreyee Dey and Balaji Gopalan and Dilip Kumar Margam and Pooja Aggarwal and Rajiv Teja Nagipogu and Shachi Dave and Shruti Gupta and Subhash Chandra Bose Gali and Vish Subramanian and Partha Talukdar},
       year={2021},
       eprint={2103.10730},
       archivePrefix={arXiv},
       primaryClass={cs.CL}
    }
    

    References

    [1]: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805, 2018.

    [2]: Wikipedia

    [3]: Common Crawl

    [4]: PMINDIA

    [5]: Dakshina

    [6]: Assamese (as), Bengali (bn), English (en), Gujarati (gu), Hindi (hi), Kannada (kn), Kashmiri (ks), Malayalam (ml), Marathi (mr), Nepali (ne), Oriya (or), Punjabi (pa), Sanskrit (sa), Sindhi (sd), Tamil (ta), Telugu (te) and Urdu (ur).

    [7]: Conneau, Alexis, et al. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116 (2019).

    [8]: IndicTrans

    [9]: Hu, J., Ruder, S., Siddhant, A., Neubig, G., Firat, O., & Johnson, M. (2020). Xtreme: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization. arXiv preprint arXiv:2003.11080.

    [10]: Fang, Y., Wang, S., Gan, Z., Sun, S., & Liu, J. (2020). FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding. arXiv preprint arXiv:2009.05166.

  12. Sentiment Analysis Dataset

    • kaggle.com
    zip
    Updated May 3, 2025
    Cite
    abdelmalek eladjelet (2025). Sentiment Analysis Dataset [Dataset]. https://www.kaggle.com/datasets/abdelmalekeladjelet/sentiment-analysis-dataset
    Explore at:
    Available download formats: zip (9105036 bytes)
    Dataset updated
    May 3, 2025
    Authors
    abdelmalek eladjelet
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    🧠 Multi-Class Sentiment Analysis Dataset (240K+ English Comments)

    📌 Description

    This dataset is a large-scale collection of 241,000+ English-language comments sourced from various online platforms. Each comment is annotated with a sentiment label:

    • 0 — Negative
    • 1 — Neutral
    • 2 — Positive

    The data has been gathered from multiple websites, such as:

    • Hugging Face: https://huggingface.co/datasets/Sp1786/multiclass-sentiment-analysis-dataset
    • Kaggle: https://www.kaggle.com/datasets/abhi8923shriv/sentiment-analysis-dataset
    • Kaggle: https://www.kaggle.com/datasets/jp797498e/twitter-entity-sentiment-analysis
    • Kaggle: https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment

    The goal is to enable training and evaluation of multi-class sentiment analysis models for real-world text data. The dataset is already preprocessed — lowercase, cleaned from punctuation, URLs, numbers, and stopwords — and is ready for NLP pipelines.
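
    For illustration, a cleaning pipeline along those lines might look like the sketch below (lowercasing, then stripping URLs, numbers, punctuation, and NLTK English stopwords; the author's exact preprocessing script is not specified):

    import re
    import string

    from nltk.corpus import stopwords   # requires nltk.download("stopwords")

    STOPWORDS = set(stopwords.words("english"))

    def clean_comment(text: str) -> str:
        text = text.lower()
        text = re.sub(r"https?://\S+|www\.\S+", " ", text)                # remove URLs
        text = re.sub(r"\d+", " ", text)                                  # remove numbers
        text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
        return " ".join(t for t in text.split() if t not in STOPWORDS)    # remove stopwords

    print(clean_comment("Apple Pay is SO convenient, secure and easy to use! https://apple.com"))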

    📊 Columns

    Column      Description
    Comment     User-generated text content
    Sentiment   Sentiment label (0=Negative, 1=Neutral, 2=Positive)

    🚀 Use Cases

    • 🧠 Train sentiment classifiers using LSTM, BiLSTM, CNN, BERT, or RoBERTa
    • 🔍 Evaluate preprocessing and tokenization strategies
    • 📈 Benchmark NLP models on multi-class classification tasks
    • 🎓 Educational projects and research in opinion mining or text classification
    • 🧪 Fine-tune transformer models on a large and diverse sentiment dataset

    💬 Example

    Comment: "apple pay is so convenient secure and easy to use"
    Sentiment: 2 (Positive)
    
  13. newsqa

    • huggingface.co
    Cite
    Sandipan, newsqa [Dataset]. https://huggingface.co/datasets/Sandipan1994/newsqa
    Explore at:
    Authors
    Sandipan
    Description

    🗂️ Dataset Card: newsqa

      📌 Dataset Summary
    

    The newsqa dataset is a question-answering (QA) dataset designed for extractive reading comprehension. Each example contains:

    • context: A passage (typically news text)
    • question: A natural-language question referring to the context
    • answers: Ground-truth answer spans (answer_start and text)
    • id: A unique identifier for each QA pair

    The dataset is suitable for training extractive QA models such as BERT-QA, RoBERTa-QA, LLaMA… See the full description on the dataset page: https://huggingface.co/datasets/Sandipan1994/newsqa.
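
    As an illustration of that schema, a minimal sketch that loads the dataset and runs an off-the-shelf extractive QA model over one example; the split name and the SQuAD-finetuned checkpoint are assumptions, not part of this dataset:

    from datasets import load_dataset
    from transformers import pipeline

    ds = load_dataset("Sandipan1994/newsqa", split="train")   # split name assumed
    example = ds[0]                                           # fields: id, context, question, answers

    qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
    prediction = qa(question=example["question"], context=example["context"])
    print("predicted:", prediction["answer"])
    print("gold span(s):", example["answers"])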

