13 datasets found
  1. Bio_ClinicalBERT

    • kaggle.com
    zip
    Updated Apr 21, 2022
    Cite
    Aditi Dutta (2022). Bio_ClinicalBERT [Dataset]. https://www.kaggle.com/datasets/aditidutta/bio-clinicalbert
    Explore at:
    Available download formats: zip (806570272 bytes)
    Dataset updated
    Apr 21, 2022
    Authors
    Aditi Dutta
    Description

    # ClinicalBERT - Bio + Clinical BERT Model

    The Publicly Available Clinical BERT Embeddings paper contains four unique clinicalBERT models: initialized with BERT-Base (cased_L-12_H-768_A-12) or BioBERT (BioBERT-Base v1.0 + PubMed 200K + PMC 270K) & trained on either all MIMIC notes or only discharge summaries.

    This model card describes the Bio+Clinical BERT model, which was initialized from BioBERT & trained on all MIMIC notes.

    Pretraining Data

    The Bio_ClinicalBERT model was trained on all notes from MIMIC-III, a database containing electronic health records from ICU patients at the Beth Israel Hospital in Boston, MA. For more details, see the MIMIC-III documentation. All notes from the NOTEEVENTS table were included (~880M words).

    Model Pretraining

    Note Preprocessing

    Each note in MIMIC was first split into sections using a rules-based section splitter (e.g., discharge summary notes were split into "History of Present Illness", "Family History", "Brief Hospital Course", etc. sections). Each section was then split into sentences using SciSpacy (the en_core_sci_md model).
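
    For illustration, that sentence-splitting step might look like the sketch below; a minimal example assuming scispacy and the en_core_sci_md model are installed, with an invented sample section:

    import spacy

    # Load the SciSpacy biomedical model used for sentence segmentation
    # (requires scispacy plus the en_core_sci_md model wheel).
    nlp = spacy.load("en_core_sci_md")

    section = ("History of Present Illness: The patient is a 65 yo male with a history of CHF "
               "who presented with worsening shortness of breath. He was admitted for diuresis.")
    sentences = [sent.text for sent in nlp(section).sents]
    print(sentences)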

    Pretraining Hyperparameters

    We used a batch size of 32, a maximum sequence length of 128, and a learning rate of 5e-5 for pre-training our models. The models trained on all MIMIC notes were trained for 150,000 steps. The dup factor for duplicating input data with different masks was set to 5. All other default parameters were used (specifically, masked language model probability = 0.15 and max predictions per sequence = 20).
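
    These hyperparameters map roughly onto a modern transformers masked-LM setup. The sketch below is an approximation, not the original TensorFlow BERT pretraining code; the BioBERT checkpoint id and the tokenized MIMIC dataset are stand-ins:

    from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-v1.1")   # stand-in for BioBERT-Base v1.0
    model = AutoModelForMaskedLM.from_pretrained("dmis-lab/biobert-v1.1")

    # Masked-LM collator with the masking probability quoted above.
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

    args = TrainingArguments(
        output_dir="bio_clinical_bert_mlm",
        per_device_train_batch_size=32,   # batch size 32
        learning_rate=5e-5,               # learning rate 5e-5
        max_steps=150_000,                # 150,000 steps
    )
    # trainer = Trainer(model=model, args=args, data_collator=collator,
    #                   train_dataset=tokenized_mimic_notes)  # MIMIC-III notes require credentialed access
    # trainer.train()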

    How to use the model

    Load the model via the transformers library:

    from transformers import AutoTokenizer, AutoModel
    tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
    model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
    

    More Information

    Refer to the original paper, Publicly Available Clinical BERT Embeddings (NAACL Clinical NLP Workshop 2019) for additional details and performance on NLI and NER tasks.

  2. BERT-base-multilingual-cased

    • kaggle.com
    zip
    Updated Jun 15, 2021
    Cite
    Aditi Dutta (2021). BERT-base-multilingual-cased [Dataset]. https://www.kaggle.com/aditidutta/bert-base-multilingual-cased
    Explore at:
    Available download formats: zip (2329614912 bytes)
    Dataset updated
    Jun 15, 2021
    Authors
    Aditi Dutta
    Description

    Pretrained model on the top 104 languages with the largest Wikipedia using a masked language modeling (MLM) objective. This dataset contains many popular BERT weights retrieved directly on Hugging Face's model repository and hosted on Kaggle. (104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters)

    NOTE: You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to be fine-tuned on a downstream task.

    Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification, or question answering. For tasks such as text generation you should look at models like GPT-2.

    Here is how to use this model to get the features of a given text in PyTorch:

    from transformers import BertTokenizer, BertModel
    tokenizer = BertTokenizer.from_pretrained('PATH_TO_THIS_FILE')
    model = BertModel.from_pretrained("PATH_TO_THIS_FILE")
    text = "Replace me by any text you'd like."
    encoded_input = tokenizer(text, return_tensors='pt')
    output = model(**encoded_input)
    

    and in TensorFlow:

    from transformers import BertTokenizer, TFBertModel
    tokenizer = BertTokenizer.from_pretrained('PATH_TO_THIS_FILE')
    model = TFBertModel.from_pretrained("PATH_TO_THIS_FILE")
    text = "Replace me by any text you'd like."
    encoded_input = tokenizer(text, return_tensors='tf')
    output = model(encoded_input)
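
    As the note above says, the checkpoint is mostly intended to be fine-tuned on a downstream task; a minimal sequence-classification setup might look like this (the label count is a placeholder):

    from transformers import BertForSequenceClassification, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('PATH_TO_THIS_FILE')
    model = BertForSequenceClassification.from_pretrained('PATH_TO_THIS_FILE',
                                                           num_labels=3)  # placeholder label count
    # From here, fine-tune with the transformers Trainer or a plain PyTorch loop
    # over a labeled dataset; the classification head is randomly initialized.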
    

    Acknowledgments

    All the copyrights and IP relating to BERT belong to the original authors (Devlin et al., 2019) and Google. All copyrights relating to the transformers library belong to Hugging Face. Some of the models are community created or trained. Please reach out directly to the authors if you have questions regarding licenses and usage.

    @article{DBLP:journals/corr/abs-1810-04805,
      author        = {Jacob Devlin and Ming{-}Wei Chang and Kenton Lee and Kristina Toutanova},
      title         = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language Understanding},
      journal       = {CoRR},
      volume        = {abs/1810.04805},
      year          = {2018},
      url           = {http://arxiv.org/abs/1810.04805},
      archivePrefix = {arXiv},
      eprint        = {1810.04805},
      timestamp     = {Tue, 30 Oct 2018 20:39:56 +0100},
      biburl        = {https://dblp.org/rec/journals/corr/abs-1810-04805.bib},
      bibsource     = {dblp computer science bibliography, https://dblp.org}
    }

  3. safety-qa-bert-dataset

    • huggingface.co
    Cite
    Abid Ali Khan Danish, safety-qa-bert-dataset [Dataset]. https://huggingface.co/datasets/adanish91/safety-qa-bert-dataset
    Explore at:
    Authors
    Abid Ali Khan Danish
    Description

    Safety QA Dataset

      Dataset Description
    

    There are two datasets, both drawn from publicly available data from the Mine Safety and Health Administration (MSHA). The 'seed_annotated_data.csv' file contains seed-annotated data, where answers to safety-related questions are annotated in the accident narratives for initial training. The main 'training data.csv' file is used during the active learning (AL) process for question answering tasks in occupational safety and health… See the full description on the dataset page: https://huggingface.co/datasets/adanish91/safety-qa-bert-dataset.
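
    A minimal loading sketch with the datasets library; whether the CSVs resolve automatically or need an explicit data_files mapping is an assumption:

    from datasets import load_dataset

    # Point data_files at the CSV named in the description above.
    ds = load_dataset("adanish91/safety-qa-bert-dataset",
                      data_files={"seed": "seed_annotated_data.csv"})
    print(ds["seed"][0])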

  4. VirBiCla-training

    • huggingface.co
    Updated Aug 9, 2024
    Cite
    Clelia Astra Bertelli (2024). VirBiCla-training [Dataset]. https://huggingface.co/datasets/as-cle-bert/VirBiCla-training
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Aug 9, 2024
    Authors
    Clelia Astra Bertelli
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset Card for VirBiCla-training

    VirBiCla is a ML-based viral DNA detector designed for long-read sequencing metagenomics. This dataset is a support dataset for training the base ML model.

      Dataset Details

      Dataset Sources

    Repository: GitHub repository for VirBiCla

      Uses
    

    This dataset is intended as support for training the base VirBiCla model

      Dataset Structure
    

    The dataset is a CSV file composed of 60,003 record sequences (coming… See the full description on the dataset page: https://huggingface.co/datasets/as-cle-bert/VirBiCla-training.

  5. yue-wiki-pl-bert

    • huggingface.co
    Updated Apr 6, 2025
    Cite
    hon9kon9ize (2025). yue-wiki-pl-bert [Dataset]. https://huggingface.co/datasets/hon9kon9ize/yue-wiki-pl-bert
    Explore at:
    Dataset updated
    Apr 6, 2025
    Dataset authored and provided by
    hon9kon9ize
    Description

    Yue-Wiki-PL-BERT Dataset

      Overview
    

    This dataset contains processed text data from Cantonese Wikipedia articles, specifically formatted for training or fine-tuning BERT-like models for Cantonese language processing. The dataset is created by hon9kon9ize and contains 176,177 rows of training data.

      Description
    

    The Yue-Wiki-PL-BERT dataset is a structured collection of Cantonese text data extracted from Wikipedia, with each entry containing:

    id: A… See the full description on the dataset page: https://huggingface.co/datasets/hon9kon9ize/yue-wiki-pl-bert.

  6. Symptom_Text_Labels

    • huggingface.co
    Updated Oct 20, 2024
    Cite
    Wisnu Afifuddin (2024). Symptom_Text_Labels [Dataset]. https://huggingface.co/datasets/InVoS/Symptom_Text_Labels
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Oct 20, 2024
    Authors
    Wisnu Afifuddin
    Description

    Dataset Card for Dataset Name

    Dataset for BERT Training Model

      Dataset Details
    

    This dataset contains sentence text and symptoms. I created it using a dataset I found on Hugging Face under the account name Venetis, then modified it to contain more text sentences and symptom labels.

      Dataset Description
    

    Curated by: [More Information Needed] Funded by [optional]: [More Information Needed] Shared by [optional]: [More Information Needed] Language(s) (NLP): [More… See the full description on the dataset page: https://huggingface.co/datasets/InVoS/Symptom_Text_Labels.

  7. Complete code and datasets for "ESNLIR: Expanding Spanish NLI Benchmarks...

    • zenodo.org
    bin, pdf, zip
    Updated Nov 12, 2025
    Cite
    Johan David Rodriguez Portela; Rubén Francisco Manrique Piramanrique; Nicolás Perez Terán (2025). Complete code and datasets for "ESNLIR: Expanding Spanish NLI Benchmarks with Multi-Genre and Causal Annotation" [Dataset]. http://doi.org/10.5281/zenodo.15002575
    Explore at:
    Available download formats: bin, zip, pdf
    Dataset updated
    Nov 12, 2025
    Dataset provided by
    Arxiv
    Authors
    Johan David Rodriguez Portela; Rubén Francisco Manrique Piramanrique; Nicolás Perez Terán
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ESNLIR: Expanding Spanish NLI Benchmarks with Multi-Genre and Causal Annotation

    This is the complete code, model and datasets for the article ESNLIR: Expanding Spanish NLI Benchmarks with Multi-genre and Causal Annotation.

    In case you cannot access the article, this preprint is available: ESNLIR: A Spanish Multi-Genre Dataset with Causal Relationships.

    How to cite:

    Portela, J.R., Pérez-Terán, N., Manrique, R. (2026). ESNLIR: Expanding Spanish NLI Benchmarks with Multi-genre and Causal Annotation. In: Florez, H., Peluffo-Ordoñez, D. (eds) Applied Informatics. ICAI 2025. Communications in Computer and Information Science, vol 2667. Springer, Cham. https://doi.org/10.1007/978-3-032-07175-0_23

    IMPORTANT UPDATE!!!

    It is strongly advised to work with the following links, instead of working directly from Zenodo:

    • CODE REPOSITORY: This repository contains the code used for the article.

    • SMALL EXAMPLE REPOSITORY: This repository contains a small code example showing you how to train, and predict using a very small toy dataset, with the same structure.

    • HUGGING FACE COLLECTION: Huggingface collection containing the dataset and models.

    If you still want to use the Zenodo repository, follow the steps below. But once again, it is way easier to work with the links above.

    ----------------------------------------------------------------------------------------------

    Installation

    This repository is a poetry project, which means that it can be installed easily by executing the following command from a shell in the repository folder:

    poetry install

    As this repository is script-based, the README.md file contains all the commands executed to generate the dataset and train the models.

    ----------------------------------------------------------------------------------------------

    Core code

    The core code used for all the experiments is in the folder auto-nli, and all the calls to the core code with the requested parameters can be found in README.md.

    ----------------------------------------------------------------------------------------------

    Parameters

    All the parameters to create datasets and train models with the core code are found in the folder parameters.

    ----------------------------------------------------------------------------------------------

    Models

    Model types

    For BERT-based models (all in PyTorch), two types of Hugging Face models were used for training; they are also required when loading a dataset, because of the tokenizer: the BERTIN-based RoBERTa models and the XLM-RoBERTa models listed below.

    Model folder

    The model folder contains all the trained models for the paper. There are three types of models:

    • baseline: An XGBoost model that can be loaded with pickle.
    • roberta: BERTIN based models in pytorch. You can load them with the model_path
    • xlmroberta: XLMRoBERTa based models in pytorch. You can load them with the model_path

    Models with the suffix _annot are models trained with the premise (first sentence) only. Apart from the PyTorch model folder, each model result folder (ex: ) contains the test results for the test set and the stress test sets (ex: ).

    Load model

    Models are found in the folder model; all of them are PyTorch models, which can be loaded through the Hugging Face interface:

    from transformers import AutoModel

    model = AutoModel.from_pretrained("<path_to_model_folder>")  # placeholder: path to one of the trained model folders

    ----------------------------------------------------------------------------------------------

    Dataset

    labeled_final_dataset.jsonl

    This file is included outside the ZIP containing all other files, and it contains the final test dataset with 974 examples selected by human majority label matching the original linking phrase label.

    Other datasets:

    The datasets can be found in the folder data that is divided in the following folders:

    base_dataset

    The splits to train, validate and test the models.

    splits_data

    Splits of train-val-test extracted for each corpora. They are used to generate base_dataset.

    sentence_data

    Pairs of sentences found in each corpus. They are used to generate splits_data.

    Dataset dictionary

    This repository contains the splits that resulted from the research project "ESNLIR: A Spanish Multi-Genre Dataset with Causal Relationships". All the splits are in JSONL format and have the same fields per example:

    • sentence_1: First sentence of the pair.
    • sentence_2: Second sentence of the pair.
    • connector: Linking phrase used to extract pair.
    • connector_type: NLI label, one of "contrasting", "entailment", "reasoning", or "neutral"
    • extraction_strategy: "linking_phrase" for "contrasting", "entailment", and "reasoning"; "none" for "neutral"
    • distance: How many sentences before the connector sentence_1 appears
    • sentence_1_position: Number of sentence for sentence_1 in the source document
    • sentence_1_paragraph: Number of paragraph for sentence_1 in the source document
    • sentence_2_position: Number of sentence for sentence_2 in the source document
    • sentence_2_paragraph: Number of paragraph for sentence_2 in the source document
    • id: Unique identifier for the example
    • dataset: Source corpus of the pair. Metadata of corpus, including source can be found in dataset_metadata.xlsx.
    • genre: Writing genre of the dataset.
    • domain: Domain genre of the dataset.

    Example:

    {"sentence_1":"sefior Bcajavides no es moderado, tampoco lo convertirse e\u00f1 declarada divergencia de miras polileido en griego","sentence_2":"era mayor claricomentarios, as\u00ed de los peri\u00f3dicos como de los homes dado \u00e1 la voluntad de los hombres, sin que sobreticas","connector":"por consiguiente,","connector_type":"reasoning","extraction_strategy":"linking_phrase","distance":1.0,"sentence_1_paragraph":4,"sentence_1_position":86,"sentence_2_paragraph":4,"sentence_2_position":87,"id":"esnews_spanish_pd_news_531537","dataset":"esnews_spanish_pd_news","genre":"news","domain":"spanish_public_domain_news"}

    Dataset load

    To load a dataset split as a PyTorch object for training, validation, or testing, you must use the custom BERTDataset class:

    import os

    from auto_nli.model.bert_based.dataset import BERTDataset

    dataset = BERTDataset(
        os.path.join(dataset_folder, ...),  # path to the split file (elided in the original)
        max_len=...,          # maximum sequence length
        model_type=...,       # which pretrained model/tokenizer to use (see "Model types" above)
        only_premise=...,     # True to use only the premise (sentence_1)
        max_samples=...,      # optional cap on the number of examples
    )

    ----------------------------------------------------------------------------------------------

    Notebooks

    The folder notebooks contains a collection of jupyter notebooks used to preprocess datasets and visualize results.

  8. clr-finetuned-model-weights

    • kaggle.com
    zip
    Updated Jul 10, 2021
    Cite
    Saurav Maheshkar ☕️ (2021). clr-finetuned-model-weights [Dataset]. https://www.kaggle.com/sauravmaheshkar/clrfinetunedmodelweights
    Explore at:
    Available download formats: zip (4356936181 bytes)
    Dataset updated
    Jul 10, 2021
    Authors
    Saurav Maheshkar ☕️
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description


    FineTuning Metrics

    Architecture         Weights           Training Loss   Validation Loss
    roberta-base         huggingface/hub   0.641           0.4728
    bert-base-uncased    huggingface/hub   0.6781          0.4977
    albert-base          huggingface/hub   0.7119          0.5155
    xlm-roberta-base     huggingface/hub   0.7225          0.525
    bert-large-uncased   huggingface/hub   0.7482          0.5161
    albert-large         huggingface/hub   1.075           0.9921
    roberta-large        huggingface/hub   2.749           1.075

  9. pre_train_odia_data_processed

    • huggingface.co
    Updated Nov 10, 2024
    Cite
    OdiaGenAIdata (2024). pre_train_odia_data_processed [Dataset]. https://huggingface.co/datasets/OdiaGenAIdata/pre_train_odia_data_processed
    Explore at:
    Dataset updated
    Nov 10, 2024
    Dataset authored and provided by
    OdiaGenAIdata
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    About

    This dataset is curated from different open-source datasets and from Odia data prepared using different techniques (web scraping, OCR), manually corrected by Odia native speakers. The dataset is uniformly processed but contains duplicated entries, which can be handled depending on usage. For more details about the data, see the blog post.

      Use Cases
    

    The dataset has many use cases such as:

    Pre-training Odia LLM, Building the Odia BERT model, Building Odia… See the full description on the dataset page: https://huggingface.co/datasets/OdiaGenAIdata/pre_train_odia_data_processed.
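
    Since the corpus intentionally keeps duplicated entries, a possible deduplication sketch with the datasets library; the split name and the text column name are assumptions about the schema:

    from datasets import load_dataset

    ds = load_dataset("OdiaGenAIdata/pre_train_odia_data_processed", split="train")  # split name assumed
    df = ds.to_pandas()
    deduped = df.drop_duplicates(subset="text")   # "text" column name is assumed
    print(f"{len(df)} rows -> {len(deduped)} rows after dropping duplicates")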

  10. MuRIL Large tf

    • kaggle.com
    zip
    Updated Oct 16, 2021
    Cite
    Nicholas Broad (2021). MuRIL Large tf [Dataset]. https://www.kaggle.com/nbroad/muril-large-tf
    Explore at:
    Available download formats: zip (1883316797 bytes)
    Dataset updated
    Oct 16, 2021
    Authors
    Nicholas Broad
    Description

    This was converted from the PyTorch state_dict, and I'm not sure it will work because I got the warning below. I don't think the cls parameters matter, but I'm wondering about the position_ids.

    Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.bias', 'bert.embeddings.position_ids', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.bias']
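
    For reference, this kind of PyTorch-to-TensorFlow conversion is usually done by loading the PyTorch checkpoint with from_pt=True; a minimal sketch, assuming the Hugging Face id google/muril-large-cased (not stated here):

    from transformers import TFBertModel

    # Load PyTorch weights into the TF 2.0 class; the unused cls.* heads trigger
    # exactly the kind of warning quoted above and are dropped for a plain TFBertModel.
    tf_model = TFBertModel.from_pretrained("google/muril-large-cased", from_pt=True)
    tf_model.save_pretrained("muril-large-tf")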

    MuRIL: Multilingual Representations for Indian Languages

    MuRIL is a BERT model pre-trained on 17 Indian languages and their transliterated counterparts. We have released the pre-trained model (with the MLM layer intact, enabling masked word predictions) in this repository. We have also released the encoder on TFHub with an additional pre-processing module, that processes raw text into the expected input format for the encoder. You can find more details on MuRIL in this paper.

    Apache 2.0 License

    Link to model on Hugging Face Hub

    Overview

    This model uses a BERT base architecture [1] pretrained from scratch using the Wikipedia [2], Common Crawl [3], PMINDIA [4] and Dakshina [5] corpora for 17 [6] Indian languages.

    We use a training paradigm similar to multilingual BERT, with a few modifications as listed:

    • We include translation and transliteration segment pairs in training as well.
    • We keep an exponent value of 0.3 (and not 0.7) for upsampling, shown to enhance low-resource performance. [7]

    See the Training section for more details.

    Training

    The MuRIL model is pre-trained on monolingual segments as well as parallel segments, as detailed below.

    Monolingual Data

    We make use of publicly available corpora from Wikipedia and Common Crawl for 17 Indian languages.

    Parallel Data

    We have two types of parallel data:

    • Translated Data: We obtain translations of the above monolingual corpora using the Google NMT pipeline. We feed translated segment pairs as input. We also make use of the publicly available PMINDIA corpus.
    • Transliterated Data: We obtain transliterations of Wikipedia using the IndicTrans [8] library. We feed transliterated segment pairs as input. We also make use of the publicly available Dakshina dataset.

    We keep an exponent value of 0.3 to calculate duplication multiplier values for upsampling of lower resourced languages and set dupe factors accordingly. Note, we limit transliterated pairs to Wikipedia only.

    The model was trained using a self-supervised masked language modeling task. We do whole word masking with a maximum of 80 predictions. The model was trained for 1000K steps, with a batch size of 4096, and a max sequence length of 512.

    Trainable parameters

    All parameters in the module are trainable, and fine-tuning all parameters is the recommended practice.

    Uses & Limitations

    This model is intended to be used for a variety of downstream NLP tasks for Indian languages. This model is trained on transliterated data as well, a phenomenon commonly observed in the Indian context. This model is not expected to perform well on languages other than the ones used in pretraining, i.e. the 17 Indian languages.

    Citation

    @misc{khanuja2021muril,
       title={MuRIL: Multilingual Representations for Indian Languages},
       author={Simran Khanuja and Diksha Bansal and Sarvesh Mehtani and Savya Khosla and Atreyee Dey and Balaji Gopalan and Dilip Kumar Margam and Pooja Aggarwal and Rajiv Teja Nagipogu and Shachi Dave and Shruti Gupta and Subhash Chandra Bose Gali and Vish Subramanian and Partha Talukdar},
       year={2021},
       eprint={2103.10730},
       archivePrefix={arXiv},
       primaryClass={cs.CL}
    }
    

    References

    [1]: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805, 2018.

    [2]: Wikipedia

    [3]: Common Crawl

    [4]: PMINDIA

    [5]: Dakshina

    [6]: Assamese (as), Bengali (bn), English (en), Gujarati (gu), Hindi (hi), Kannada (kn), Kashmiri (ks), Malayalam (ml), Marathi (mr), Nepali (ne), Oriya (or), Punjabi (pa), Sanskrit (sa), Sindhi (sd), Tamil (ta), Telugu (te) and Urdu (ur).

    [7]: Conneau, Alexis, et al. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116 (2019).

  11. MuRIL Large pt

    • kaggle.com
    zip
    Updated Oct 16, 2021
    + more versions
    Cite
    Nicholas Broad (2021). MuRIL Large pt [Dataset]. https://www.kaggle.com/datasets/nbroad/muril-large-pt/code
    Explore at:
    Available download formats: zip (1883086276 bytes)
    Dataset updated
    Oct 16, 2021
    Authors
    Nicholas Broad
    Description

    MuRIL: Multilingual Representations for Indian Languages

    MuRIL is a BERT model pre-trained on 17 Indian languages and their transliterated counterparts. We have released the pre-trained model (with the MLM layer intact, enabling masked word predictions) in this repository. We have also released the encoder on TFHub with an additional pre-processing module, that processes raw text into the expected input format for the encoder. You can find more details on MuRIL in this paper.

    Apache 2.0 License

    Link to model on Hugging Face Hub

    Overview

    This model uses a BERT base architecture [1] pretrained from scratch using the Wikipedia [2], Common Crawl [3], PMINDIA [4] and Dakshina [5] corpora for 17 [6] Indian languages.

    We use a training paradigm similar to multilingual BERT, with a few modifications as listed:

    • We include translation and transliteration segment pairs in training as well.
    • We keep an exponent value of 0.3 (and not 0.7) for upsampling, shown to enhance low-resource performance. [7]

    See the Training section for more details.

    Training

    The MuRIL model is pre-trained on monolingual segments as well as parallel segments, as detailed below.

    Monolingual Data

    We make use of publicly available corpora from Wikipedia and Common Crawl for 17 Indian languages.

    Parallel Data

    We have two types of parallel data:

    • Translated Data: We obtain translations of the above monolingual corpora using the Google NMT pipeline. We feed translated segment pairs as input. We also make use of the publicly available PMINDIA corpus.
    • Transliterated Data: We obtain transliterations of Wikipedia using the IndicTrans [8] library. We feed transliterated segment pairs as input. We also make use of the publicly available Dakshina dataset.

    We keep an exponent value of 0.3 to calculate duplication multiplier values for upsampling of lower resourced languages and set dupe factors accordingly. Note, we limit transliterated pairs to Wikipedia only.

    The model was trained using a self-supervised masked language modeling task. We do whole word masking with a maximum of 80 predictions. The model was trained for 1000K steps, with a batch size of 4096, and a max sequence length of 512.

    Trainable parameters

    All parameters in the module are trainable, and fine-tuning all parameters is the recommended practice.

    Uses & Limitations

    This model is intended to be used for a variety of downstream NLP tasks for Indian languages. This model is trained on transliterated data as well, a phenomenon commonly observed in the Indian context. This model is not expected to perform well on languages other than the ones used in pretraining, i.e. the 17 Indian languages.

    Citation

    @misc{khanuja2021muril,
       title={MuRIL: Multilingual Representations for Indian Languages},
       author={Simran Khanuja and Diksha Bansal and Sarvesh Mehtani and Savya Khosla and Atreyee Dey and Balaji Gopalan and Dilip Kumar Margam and Pooja Aggarwal and Rajiv Teja Nagipogu and Shachi Dave and Shruti Gupta and Subhash Chandra Bose Gali and Vish Subramanian and Partha Talukdar},
       year={2021},
       eprint={2103.10730},
       archivePrefix={arXiv},
       primaryClass={cs.CL}
    }
    

    References

    [1]: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805, 2018.

    [2]: Wikipedia

    [3]: Common Crawl

    [4]: PMINDIA

    [5]: Dakshina

    [6]: Assamese (as), Bengali (bn), English (en), Gujarati (gu), Hindi (hi), Kannada (kn), Kashmiri (ks), Malayalam (ml), Marathi (mr), Nepali (ne), Oriya (or), Punjabi (pa), Sanskrit (sa), Sindhi (sd), Tamil (ta), Telugu (te) and Urdu (ur).

    [7]: Conneau, Alexis, et al. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116 (2019).

    [8]: IndicTrans

    [9]: Hu, J., Ruder, S., Siddhant, A., Neubig, G., Firat, O., & Johnson, M. (2020). Xtreme: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization. arXiv preprint arXiv:2003.11080.

    [10]: Fang, Y., Wang, S., Gan, Z., Sun, S., & Liu, J. (2020). FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding. arXiv preprint arXiv:2009.05166.

  12. Sentiment Analysis Dataset

    • kaggle.com
    zip
    Updated May 3, 2025
    Cite
    abdelmalek eladjelet (2025). Sentiment Analysis Dataset [Dataset]. https://www.kaggle.com/datasets/abdelmalekeladjelet/sentiment-analysis-dataset
    Explore at:
    Available download formats: zip (9105036 bytes)
    Dataset updated
    May 3, 2025
    Authors
    abdelmalek eladjelet
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    🧠 Multi-Class Sentiment Analysis Dataset (240K+ English Comments)

    📌 Description

    This dataset is a large-scale collection of 241,000+ English-language comments sourced from various online platforms. Each comment is annotated with a sentiment label:

    • 0 — Negative
    • 1 — Neutral
    • 2 — Positive

    The data has been gathered from multiple websites, such as:

    • Hugging Face: https://huggingface.co/datasets/Sp1786/multiclass-sentiment-analysis-dataset
    • Kaggle: https://www.kaggle.com/datasets/abhi8923shriv/sentiment-analysis-dataset
    • Kaggle: https://www.kaggle.com/datasets/jp797498e/twitter-entity-sentiment-analysis
    • Kaggle: https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment

    The goal is to enable training and evaluation of multi-class sentiment analysis models for real-world text data. The dataset is already preprocessed — lowercase, cleaned from punctuation, URLs, numbers, and stopwords — and is ready for NLP pipelines.
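
    For illustration, a cleaning pipeline along those lines might look like the sketch below (lowercasing, then stripping URLs, numbers, punctuation, and NLTK English stopwords; the author's exact preprocessing script is not specified):

    import re
    import string

    from nltk.corpus import stopwords   # requires nltk.download("stopwords")

    STOPWORDS = set(stopwords.words("english"))

    def clean_comment(text: str) -> str:
        text = text.lower()
        text = re.sub(r"https?://\S+|www\.\S+", " ", text)                # remove URLs
        text = re.sub(r"\d+", " ", text)                                  # remove numbers
        text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
        return " ".join(t for t in text.split() if t not in STOPWORDS)    # remove stopwords

    print(clean_comment("Apple Pay is SO convenient, secure and easy to use! https://apple.com"))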

    📊 Columns

    Column      Description
    Comment     User-generated text content
    Sentiment   Sentiment label (0=Negative, 1=Neutral, 2=Positive)

    🚀 Use Cases

    • 🧠 Train sentiment classifiers using LSTM, BiLSTM, CNN, BERT, or RoBERTa
    • 🔍 Evaluate preprocessing and tokenization strategies
    • 📈 Benchmark NLP models on multi-class classification tasks
    • 🎓 Educational projects and research in opinion mining or text classification
    • 🧪 Fine-tune transformer models on a large and diverse sentiment dataset

    💬 Example

    Comment: "apple pay is so convenient secure and easy to use"
    Sentiment: 2 (Positive)
    
  13. newsqa

    • huggingface.co
    Cite
    Sandipan, newsqa [Dataset]. https://huggingface.co/datasets/Sandipan1994/newsqa
    Explore at:
    Authors
    Sandipan
    Description

    🗂️ Dataset Card: newsqa

      📌 Dataset Summary
    

    The newsqa dataset is a question-answering (QA) dataset designed for extractive reading comprehension. Each example contains:

    • context: A passage (typically news text)
    • question: A natural-language question referring to the context
    • answers: Ground-truth answer spans (answer_start and text)
    • id: A unique identifier for each QA pair

    The dataset is suitable for training extractive QA models such as BERT-QA, RoBERTa-QA, LLaMA… See the full description on the dataset page: https://huggingface.co/datasets/Sandipan1994/newsqa.
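
    As an illustration of that schema, a minimal sketch that loads the dataset and runs an off-the-shelf extractive QA model over one example; the split name and the SQuAD-finetuned checkpoint are assumptions, not part of this dataset:

    from datasets import load_dataset
    from transformers import pipeline

    ds = load_dataset("Sandipan1994/newsqa", split="train")   # split name assumed
    example = ds[0]                                           # fields: id, context, question, answers

    qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
    prediction = qa(question=example["question"], context=example["context"])
    print("predicted:", prediction["answer"])
    print("gold span(s):", example["answers"])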

