# ClinicalBERT - Bio + Clinical BERT Model
The Publicly Available Clinical BERT Embeddings paper contains four unique clinicalBERT models: initialized with BERT-Base (cased_L-12_H-768_A-12) or BioBERT (BioBERT-Base v1.0 + PubMed 200K + PMC 270K) & trained on either all MIMIC notes or only discharge summaries.
This model card describes the Bio+Clinical BERT model, which was initialized from BioBERT & trained on all MIMIC notes.
The Bio_ClinicalBERT model was trained on all notes from MIMIC III, a database containing electronic health records from ICU patients at the Beth Israel Hospital in Boston, MA. For more details on MIMIC, see here. All notes from the NOTEEVENTS table were included (~880M words).
Each note in MIMIC was first split into sections using a rules-based section splitter (e.g. discharge summary notes were split into "History of Present Illness", "Family History", "Brief Hospital Course", etc. sections). Then each section was split into sentences using SciSpacy (the en_core_sci_md model).
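For readers who want to reproduce the sentence-splitting step, here is a minimal sketch using SciSpacy's en_core_sci_md model (it assumes scispacy and the model package are installed; the example text is illustrative, not from MIMIC):

import spacy

# Load SciSpacy's biomedical model; its parser provides sentence boundaries.
nlp = spacy.load("en_core_sci_md")

section_text = (
    "The patient was admitted for chest pain. "
    "Serial troponins were negative and the patient remained stable."
)
doc = nlp(section_text)
sentences = [sent.text for sent in doc.sents]
print(sentences)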
We used a batch size of 32, a maximum sequence length of 128, and a learning rate of 5 × 10⁻⁵ for pre-training our models. The models trained on all MIMIC notes were trained for 150,000 steps. The dup factor for duplicating input data with different masks was set to 5. All other default parameters were used (specifically, masked language model probability = 0.15 and max predictions per sequence = 20).
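The models were pre-trained with Google's TensorFlow BERT code; purely as an illustration, the reported hyperparameters map onto the Hugging Face Trainer API roughly as follows (the dup factor and max-predictions-per-sequence settings belong to the original TF data pipeline and have no direct equivalent here):

from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    TrainingArguments,
)

# Tokenizer/vocabulary of the released checkpoint; the model itself was
# initialized from a BioBERT checkpoint before MIMIC pre-training.
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

# Dynamic masking with the reported masked-LM probability.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# Reported optimization settings: batch size 32, learning rate 5e-5, 150,000 steps.
args = TrainingArguments(
    output_dir="clinicalbert-mlm",
    per_device_train_batch_size=32,
    learning_rate=5e-5,
    max_steps=150_000,
)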
Load the model via the transformers library:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
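Continuing the snippet above (this usage example is not part of the original card), contextual embeddings for a clinical sentence can be extracted like so:

import torch

text = "The patient was discharged home in stable condition."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# One 768-dimensional vector per input token.
print(outputs.last_hidden_state.shape)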
Refer to the original paper, Publicly Available Clinical BERT Embeddings (NAACL Clinical NLP Workshop 2019) for additional details and performance on NLI and NER tasks.
Pretrained model on the top 104 languages with the largest Wikipedia, using a masked language modeling (MLM) objective. This dataset contains many popular BERT weights retrieved directly from Hugging Face's model repository and hosted on Kaggle. (104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters)
NOTE: You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to be fine-tuned on a downstream task.
Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. For tasks such as text generation you should look at a model like GPT-2.
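As a quick illustration of the masked-language-modeling use mentioned above, a fill-mask pipeline works directly on this checkpoint (this snippet is illustrative rather than part of the original card; the path placeholder matches the examples below):

from transformers import pipeline

unmasker = pipeline('fill-mask', model='PATH_TO_THIS_FILE')
unmasker("Hello I'm a [MASK] model.")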
Here is how to use this model to get the features of a given text in PyTorch:
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('PATH_TO_THIS_FILE')
model = BertModel.from_pretrained("PATH_TO_THIS_FILE")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
and in TensorFlow:
from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained('PATH_TO_THIS_FILE')
model = TFBertModel.from_pretrained("PATH_TO_THIS_FILE")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
Acknowledgments
All the copyrights and IP relating to BERT belong to the original authors (Devlin et al., 2019) and Google. All copyrights relating to the transformers library belong to Hugging Face. Some of the models are community created or trained. Please reach out directly to the authors if you have questions regarding licenses and usage.
@article{DBLP:journals/corr/abs-1810-04805,
  author = {Jacob Devlin and Ming{-}Wei Chang and Kenton Lee and Kristina Toutanova},
  title = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language Understanding},
  journal = {CoRR},
  volume = {abs/1810.04805},
  year = {2018},
  url = {http://arxiv.org/abs/1810.04805},
  archivePrefix = {arXiv},
  eprint = {1810.04805},
  timestamp = {Tue, 30 Oct 2018 20:39:56 +0100},
  biburl = {https://dblp.org/rec/journals/corr/abs-1810-04805.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
Safety QA Dataset
Dataset Description
There are two publicly available datasets from the Mine Safety and Health Administration (MSHA). The 'seed_annotated_data.csv' file contains seed-annotated data, in which the answers to safety-related questions are annotated in the accident narratives, for initial training. The main 'training data.csv' file is used during the active learning (AL) process for question-answering tasks in occupational safety and health… See the full description on the dataset page: https://huggingface.co/datasets/adanish91/safety-qa-bert-dataset.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for VirBiCla-training
VirBiCla is an ML-based viral DNA detector designed for long-read sequencing metagenomics. This dataset is a support dataset for training the base ML model.
Dataset Details
Dataset Sources [optional]
Repository: GitHub repository for VirBiCla
Uses
This dataset is intended as support for training the base VirBiCla model
Dataset Structure
The dataset is a CSV file composed of 60,003 sequence records (coming… See the full description on the dataset page: https://huggingface.co/datasets/as-cle-bert/VirBiCla-training.
Yue-Wiki-PL-BERT Dataset
Overview
This dataset contains processed text data from Cantonese Wikipedia articles, specifically formatted for training or fine-tuning BERT-like models for Cantonese language processing. The dataset was created by hon9kon9ize and contains 176,177 rows of training data.
Description
The Yue-Wiki-PL-BERT dataset is a structured collection of Cantonese text data extracted from Wikipedia, with each entry containing:
id: A… See the full description on the dataset page: https://huggingface.co/datasets/hon9kon9ize/yue-wiki-pl-bert.
Dataset Card for Dataset Name
Dataset for BERT Training Model
Dataset Details
This dataset contains sentence text and symptoms. I created it using a dataset I found on Hugging Face under the account name Venetis, then modified it to contain more text sentences and symptom labels.
Dataset Description
Curated by: [More Information Needed] Funded by [optional]: [More Information Needed] Shared by [optional]: [More Information Needed] Language(s) (NLP): [More… See the full description on the dataset page: https://huggingface.co/datasets/InVoS/Symptom_Text_Labels.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the complete code, model and datasets for the article ESNLIR: Expanding Spanish NLI Benchmarks with Multi-genre and Causal Annotation
In case you cannot access the article, this preprint is available: ESNLIR: A Spanish Multi-Genre Dataset with Causal Relationships.
Portela, J.R., Pérez-Terán, N., Manrique, R. (2026). ESNLIR: Expanding Spanish NLI Benchmarks with Multi-genre and Causal Annotation. In: Florez, H., Peluffo-Ordoñez, D. (eds) Applied Informatics. ICAI 2025. Communications in Computer and Information Science, vol 2667. Springer, Cham. https://doi.org/10.1007/978-3-032-07175-0_23
If you still want to use the Zenodo repository, follow the steps below. But once again, it is way easier to work with the links above.
----------------------------------------------------------------------------------------------
This repository is a poetry project, which means that it can be installed easily by executing the following command from a shell in the repository folder:
poetry install
As this repository is script based, the README.md file contains all the commands executed to generate the dataset and train models.
----------------------------------------------------------------------------------------------
The core code used for all the experiments is in the folder auto-nli, and all the calls to the core code with the requested parameters are found in README.md.
----------------------------------------------------------------------------------------------
All the parameters to create datasets and train models with the core code are found in the folder parameters.
----------------------------------------------------------------------------------------------
For BERT-based models, all in PyTorch, two types of Hugging Face models were used for training; they are also required to load a dataset, because of the tokenizer:
The model folder contains all the trained models for the paper. There are three types of models:
Models with the suffix _annot are models trained on the premise (first sentence) only. Apart from the PyTorch model folder, each model result folder contains the test results for the test set and the stress test sets.
Models are found in the folder model, and all of them are PyTorch models which can be loaded with the Hugging Face interface:
from transformers import AutoModel

model = AutoModel.from_pretrained('<path_to_model_folder>')  # placeholder for the path to one of the model folders
----------------------------------------------------------------------------------------------
This file is included outside the ZIP containing all other files; it contains the final test dataset of 974 examples, selected because the human majority label matches the original linking-phrase label.
The datasets can be found in the folder data, which is divided into the following folders:
- The splits used to train, validate and test the models.
- Train-val-test splits extracted for each corpus. They are used to generate base_dataset.
- Pairs of sentences found in each corpus. They are used to generate splits_data.
This repository contains the splits that resulted from the research project "ESNLIR: A Spanish Multi-Genre Dataset with Causal Relationships". All the splits are in JSONL format and have the same fields per example:
Example:
{"sentence_1":"sefior Bcajavides no es moderado, tampoco lo convertirse e\u00f1 declarada divergencia de miras polileido en griego","sentence_2":"era mayor claricomentarios, as\u00ed de los peri\u00f3dicos como de los homes dado \u00e1 la voluntad de los hombres, sin que sobreticas","connector":"por consiguiente,","connector_type":"reasoning","extraction_strategy":"linking_phrase","distance":1.0,"sentence_1_paragraph":4,"sentence_1_position":86,"sentence_2_paragraph":4,"sentence_2_position":87,"id":"esnews_spanish_pd_news_531537","dataset":"esnews_spanish_pd_news","genre":"news","domain":"spanish_public_domain_news"}
To load a dataset/split as a PyTorch object used to train, validate and test models, use the custom dataset class:
import os
from auto_nli.model.bert_based.dataset import BERTDataset

# Ellipses stand for values not shown here (file name within dataset_folder,
# sequence length, model type, premise-only flag, sample cap).
dataset = BERTDataset(
    os.path.join(dataset_folder, ...),
    max_len=...,
    model_type=...,
    only_premise=...,
    max_samples=...,
)
----------------------------------------------------------------------------------------------
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
![CommonLit - Big Banner](https://github.com/SauravMaheshkar/CommonLit-Readibility/blob/main/assets/CommonLit%20-%20Big%20Banner.png?raw=true)
| Architecture | Weights | Training Loss | Validation Loss |
|---|---|---|---|
| roberta-base | huggingface/hub | 0.641 | 0.4728 |
| bert-base-uncased | huggingface/hub | 0.6781 | 0.4977 |
| albert-base | huggingface/hub | 0.7119 | 0.5155 |
| xlm-roberta-base | huggingface/hub | 0.7225 | 0.525 |
| bert-large-uncased | huggingface/hub | 0.7482 | 0.5161 |
| albert-large | huggingface/hub | 1.075 | 0.9921 |
| roberta-large | huggingface/hub | 2.749 | 1.075 |
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
About
This dataset is curated from different open-source datasets and from Odia data prepared using different techniques (web scraping, OCR), and it has been manually corrected by Odia native speakers. The dataset is uniformly processed and contains duplicate entries, which can be handled depending on usage. For more details about the data, go through the blog post.
Use Cases
The dataset has many use cases such as:
Pre-training Odia LLM, Building the Odia BERT model, Building Odia… See the full description on the dataset page: https://huggingface.co/datasets/OdiaGenAIdata/pre_train_odia_data_processed.
This was converted from the PyTorch state_dict, and I'm not sure it will work because I got this warning. I don't think the cls parameters matter, but I'm wondering about the position_ids.
Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.bias', 'bert.embeddings.position_ids', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.bias']
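For reference, a conversion along these lines can be done with the from_pt flag; this is a hedged sketch (the Hub ID google/muril-base-cased for the source PyTorch checkpoint is an assumption), and unused cls.* prediction-head weights are expected to be reported when loading into a bare TFBertModel:

from transformers import TFBertModel

# Load the PyTorch checkpoint into a TF 2.0 model and save the TF weights.
tf_model = TFBertModel.from_pretrained("google/muril-base-cased", from_pt=True)
tf_model.save_pretrained("muril-base-cased-tf")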
MuRIL is a BERT model pre-trained on 17 Indian languages and their transliterated counterparts. We have released the pre-trained model (with the MLM layer intact, enabling masked word predictions) in this repository. We have also released the encoder on TFHub with an additional pre-processing module that processes raw text into the expected input format for the encoder. You can find more details on MuRIL in this paper.
Apache 2.0 License
Link to model on Hugging Face Hub
This model uses a BERT base architecture [1] pretrained from scratch using the Wikipedia [2], Common Crawl [3], PMINDIA [4] and Dakshina [5] corpora for 17 [6] Indian languages.
We use a training paradigm similar to multilingual BERT, with a few modifications as listed below:
We include translation and transliteration segment pairs in training as well. We keep an exponent value of 0.3 rather than 0.7 for upsampling, which has been shown to enhance low-resource performance [7]. See the Training section for more details.
The MuRIL model is pre-trained on monolingual segments as well as parallel segments, as detailed below.
We make use of publicly available corpora from Wikipedia and Common Crawl for 17 Indian languages.
We have two types of parallel data:
- Translated Data
We obtain translations of the above monolingual corpora using the Google NMT pipeline. We feed translated segment pairs as input. We also make use of the publicly available PMINDIA corpus.
- Transliterated Data
We obtain transliterations of Wikipedia using the IndicTrans [8] library. We feed transliterated segment pairs as input. We also make use of the publicly available Dakshina dataset.
We keep an exponent value of 0.3 to calculate duplication multiplier values for upsampling of lower resourced languages and set dupe factors accordingly. Note, we limit transliterated pairs to Wikipedia only.
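To make the exponent-0.3 smoothing concrete, here is a small illustrative computation (the corpus sizes are made up; MuRIL itself derives duplication factors from these smoothed weights):

# Raising each language's share of the data to the power 0.3 flattens the
# distribution, so low-resource languages are upsampled relative to their raw size.
sizes = {"hi": 1_000_000, "ta": 200_000, "brx": 10_000}  # hypothetical sentence counts

total = sum(sizes.values())
smoothed = {lang: (n / total) ** 0.3 for lang, n in sizes.items()}
norm = sum(smoothed.values())
probs = {lang: round(w / norm, 3) for lang, w in smoothed.items()}
print(probs)  # brx gets ~13% of the sampling mass despite being <1% of the raw data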
The model was trained using a self-supervised masked language modeling task. We do whole word masking with a maximum of 80 predictions. The model was trained for 1000K steps, with a batch size of 4096, and a max sequence length of 512.
All parameters in the module are trainable, and fine-tuning all parameters is the recommended practice.
This model is intended to be used for a variety of downstream NLP tasks for Indian languages. This model is trained on transliterated data as well, a phenomenon commonly observed in the Indian context. This model is not expected to perform well on languages other than the ones used in pre-training, i.e. the 17 Indian languages.
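Following the recommendation above to fine-tune all parameters, here is a minimal sketch of setting MuRIL up for a downstream classification task (the Hub ID google/muril-base-cased and the 3-label setup are assumptions for illustration):

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("google/muril-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "google/muril-base-cased", num_labels=3
)
# No parameters are frozen here; train the whole model with your usual
# Trainer / optimization loop on an Indian-language dataset.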
@misc{khanuja2021muril,
title={MuRIL: Multilingual Representations for Indian Languages},
author={Simran Khanuja and Diksha Bansal and Sarvesh Mehtani and Savya Khosla and Atreyee Dey and Balaji Gopalan and Dilip Kumar Margam and Pooja Aggarwal and Rajiv Teja Nagipogu and Shachi Dave and Shruti Gupta and Subhash Chandra Bose Gali and Vish Subramanian and Partha Talukdar},
year={2021},
eprint={2103.10730},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
This dataset is a large-scale collection of 241,000+ English-language comments sourced from various online platforms. Each comment is annotated with a sentiment label:
The data has been gathered from multiple websites, such as:
Hugging Face: https://huggingface.co/datasets/Sp1786/multiclass-sentiment-analysis-dataset
Kaggle: https://www.kaggle.com/datasets/abhi8923shriv/sentiment-analysis-dataset
https://www.kaggle.com/datasets/jp797498e/twitter-entity-sentiment-analysis
https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment
The goal is to enable training and evaluation of multi-class sentiment analysis models for real-world text data. The dataset is already preprocessed — lowercase, cleaned from punctuation, URLs, numbers, and stopwords — and is ready for NLP pipelines.
| Column | Description |
|---|---|
| Comment | User-generated text content |
| Sentiment | Sentiment label (0=Negative, 1=Neutral, 2=Positive) |
Comment: "apple pay is so convenient secure and easy to use"
Sentiment: 2 (Positive)
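As an illustration of the preprocessing described above (lowercasing and removing punctuation, URLs, numbers and stopwords), here is a rough sketch; the stopword list is a tiny stand-in, not the one used to build the dataset:

import re
import string

STOPWORDS = {"the", "a", "an", "is", "and", "to", "of"}  # illustrative subset only

def clean_comment(text: str) -> str:
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"\d+", " ", text)            # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(t for t in text.split() if t not in STOPWORDS)

print(clean_comment("Check out https://example.com 10/10 would recommend!!!"))
# -> "check out would recommend"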
🗂️ Dataset Card: newsqa
📌 Dataset Summary
The newsqa dataset is a question-answering (QA) dataset designed for extractive reading comprehension. Each example contains:
context: A passage (typically news text)
question: A natural-language question referring to the context
answers: Ground-truth answer spans (answer_start and text)
id: A unique identifier for each QA pair
The dataset is suitable for training extractive QA models such as BERT-QA, RoBERTa-QA, LLaMA… See the full description on the dataset page: https://huggingface.co/datasets/Sandipan1994/newsqa.
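A hedged end-to-end sketch of using the dataset with an off-the-shelf extractive QA model (the split name and the distilbert-base-cased-distilled-squad checkpoint are assumptions, not part of the dataset card):

from datasets import load_dataset
from transformers import pipeline

ds = load_dataset("Sandipan1994/newsqa", split="train")  # split name assumed
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

example = ds[0]
pred = qa(question=example["question"], context=example["context"])
print(pred["answer"], pred["score"])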