a2labs-ai/lenny-functional-torch dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
torch-uncertainty/Places365-C dataset hosted on Hugging Face and contributed by the HF Datasets community
apssg96/torch-to-manim dataset hosted on Hugging Face and contributed by the HF Datasets community
Medical Dataset for Abbreviation Disambiguation for Natural Language Understanding (MeDAL) is a large medical text dataset curated for abbreviation disambiguation, designed for natural language understanding pre-training in the medical domain. It was published at the ClinicalNLP workshop at EMNLP.
💻 Code 🤗 Dataset (Hugging Face) 💾 Dataset (Kaggle) 💽 Dataset (Zenodo) 📜 Paper (ACL) 📝 Paper (Arxiv) ⚡ Pre-trained ELECTRA (Hugging Face)
We recommend downloading from Kaggle if you can authenticate through their API. The advantage of Kaggle is that the data is compressed, so it downloads faster. Links to the data can be found at the top of the readme.
First, you will need to create an account on kaggle.com. Afterwards, you will need to install the kaggle API:
pip install kaggle
Then, you will need to follow the instructions here to add your username and key. Once that's done, you can run:
kaggle datasets download xhlulu/medal-emnlp
Now, unzip everything and place them inside the data directory:
unzip -nq medal-emnlp.zip -d data
mv data/pretrain_sample/* data/
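Alternatively, if you prefer to stay in Python rather than the CLI, the same download can be scripted with the kaggle package's API client. A minimal sketch, assuming your API token is already in ~/.kaggle/kaggle.json:

```python
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # reads ~/.kaggle/kaggle.json (or KAGGLE_USERNAME / KAGGLE_KEY env vars)
# download the dataset archive into ./data and extract it in place
api.dataset_download_files("xhlulu/medal-emnlp", path="data", unzip=True)
```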
For the LSTM models, we will need to use the fastText embeddings. To do so, first download and extract the weights:
wget -nc -P data/ https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip
unzip -nq data/crawl-300d-2M-subword.zip -d data/
You can directly load LSTM and LSTM-SA with torch.hub:
```python
import torch

lstm = torch.hub.load("BruceWen120/medal", "lstm")
lstm_sa = torch.hub.load("BruceWen120/medal", "lstm_sa")
```
If you want to use the Electra model, you need to first install transformers:
pip install transformers
Then, you can load it with torch.hub:
```python
import torch

electra = torch.hub.load("BruceWen120/medal", "electra")
```
If you are only interested in the pre-trained ELECTRA weights (without the disambiguation head), you can load them directly from the Hugging Face repository with transformers:
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained("xhlu/electra-medal")
tokenizer = AutoTokenizer.from_pretrained("xhlu/electra-medal")
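As a quick sanity check that the weights load and produce sentence-level features, the encoder can be run like any other transformers model; the example sentence and the mean-pooling step are our own illustration, not part of the original instructions:

```python
import torch

inputs = tokenizer("The patient was given an ACE inhibitor for HTN.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# mean-pool the token embeddings into one sentence-level vector
sentence_embedding = outputs.last_hidden_state.mean(dim=1)
print(sentence_embedding.shape)  # (1, hidden_size)
```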
Download the bibtex here, or copy the text below:
@inproceedings{wen-etal-2020-medal,
title = "{M}e{DAL}: Medical Abbreviation Disambiguation Dataset for Natural Language Understanding Pretraining",
author = "Wen, Zhi and Lu, Xing Han and Reddy, Siva",
booktitle = "Proceedings of the 3rd Clinical Natural Language Processing Workshop",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.clinicalnlp-1.15",
pages = "130--135",
}
The ELECTRA model is licensed under Apache 2.0. The licenses for the libraries used in this project (transformers, PyTorch, etc.) can be found in their respective GitHub repositories. Our model is released under an MIT license.
The original dataset was retrieved and modified from the NLM website. By using this dataset, you are bound by the terms and conditions specified by NLM:
INTRODUCTION
Downloading data from the National Library of Medicine FTP servers indicates your acceptance of the following Terms and Conditions: No charges, usage fees or royalties are paid to NLM for this data.
MEDLINE/PUBMED SPECIFIC TERMS
NLM freely provides PubMed/MEDLINE data. Please note some PubMed/MEDLINE abstracts may be protected by copyright.
GENERAL TERMS AND CONDITIONS
Users of the data agree to:
- acknowledge NLM as the source of the data by including the phrase "Courtesy of the U.S. National Library of Medicine" in a clear and conspicuous manner,
- properly use registration and/or trademark symbols when referring to NLM products, and
- not indicate or imply that NLM has endorsed its products/services/applications.
Users who republish or redistribute the data (services, products or raw data) agree to:
- maintain the most current version of all distributed data, or
- make known in a clear and conspicuous manner that the products/services/applications do not reflect the most current/accurate data available from NLM.
These data are produced with a reasonable standard of care, but NLM makes no warranties express or implied, including no warranty of merchantability or fitness for particular purpose, regarding the accuracy or completeness of the data. Users agree to hold NLM and the U.S. Government harmless from any liability resulting from errors in the data. NLM disclaims any liability for any consequences due to use, misuse, or interpretation of information contained or not contained in the data.
NLM does not provide legal advice regarding copyright, fair use, or other aspects of intellectual property rights. See the NLM Copyright page.
NLM reserves the right to change the type and format of its machine-readable data. NLM will take reasonable steps to inform users of any changes to the format of the data before the data are distributed via the announcement section or subscription to email and RSS updates.
Source:
https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2
https://huggingface.co/sentence-transformers/paraphrase-xlm-r-multilingual-v1
https://huggingface.co/xlm-roberta-base
```python
import torch
from sentence_transformers import SentenceTransformer

if torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')

model_path = "../input/sbert-models/paraphrase-multilingual-mpnet-base-v2"
model = SentenceTransformer(model_path, device=device)

# `text` is the string (or list of strings) you want to embed
embedding = model.encode(text, device=device)
```
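Since these embeddings are typically used for semantic search, a short follow-up showing how to rank candidate sentences against a query may help; the example strings are illustrative only:

```python
from sentence_transformers import util

query_emb = model.encode("How do I resize an image?", convert_to_tensor=True, device=device)
doc_embs = model.encode(
    ["Scaling pictures with PIL", "Training a tokenizer", "Resizing images with OpenCV"],
    convert_to_tensor=True,
    device=device,
)
scores = util.cos_sim(query_emb, doc_embs)[0]   # cosine similarity of the query to each candidate
best_first = scores.argsort(descending=True)    # candidate indices, best match first
print(best_first, scores[best_first])
```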
torch-uncertainty/ood-datasets-splits dataset hosted on Hugging Face and contributed by the HF Datasets community
TorchBench
The TorchBench suite of BackendBench is designed to mimic real-world use cases. It provides operators and inputs derived from 155 model traces found in TIMM (67), Hugging Face Transformers (45), and TorchBench (43). (These are also the models PyTorch developers use to validate performance.) You can view the origin of these traces by switching the subset in the dataset viewer to ops_traces_models and torchbench for the full dataset. When running BackendBench, much of the… See the full description on the dataset page: https://huggingface.co/datasets/GPUMODE/backendbench_tests.
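A hedged loading sketch with the datasets library; the configuration and split names below are taken from the description above (the viewer's subsets) and may not match the repository's exact strings:

```python
from datasets import load_dataset

# "torchbench" is the full-suite subset named above; "ops_traces_models" maps each
# operator trace back to the TIMM / Transformers / TorchBench model it came from.
bench = load_dataset("GPUMODE/backendbench_tests", "torchbench", split="train")
print(bench)  # inspect the columns before relying on any particular field name
```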
FFHQ Dataset (pravsels/FFHQ_1024) encoded using the dc-ae-f128c512-mix-1.0 autoencoder.
Example usage:
```python
import sys
sys.path.append('../dcae')  # https://github.com/vladmandic/dcae
from dcae import DCAE

from datasets import load_dataset
import torch
import torchvision

dataset = load_dataset("SwayStar123/FFHQ_1024_DC-AE_f128", split="train")
dc_ae = DCAE("dc-ae-f128c512-mix-1.0", device="cuda", dtype=torch.bfloat16).eval()  # Must be bfloat16; with float16 it produces terrible outputs.

def …
```
See the full description on the dataset page: https://huggingface.co/datasets/SwayStar123/FFHQ_1024_DC-AE_f128.
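Because the usage snippet above is cut off at the decode helper, a small inspection step can help before wiring up the decoder. This reuses the dataset object from the snippet and only prints what is actually there, since the column names are not documented on this page:

```python
# assumes `dataset` from the snippet above is already loaded
sample = dataset[0]
print(list(sample.keys()))     # discover the latent / metadata column names
for key, value in sample.items():
    print(key, type(value))
```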
This dataset contains ProtT5 model vector embeddings for the protein sequences of CAFA 5.
ProtBERT: https://huggingface.co/Rostlab/prot_bert
ProtT5: https://huggingface.co/Rostlab/prot_t5_xl_uniref50
When used for feature extraction, the two models achieve the following results, respectively:
ProtBERT test results:
| Task/Dataset | secondary structure (3-states) | secondary structure (8-states) | Localization | Membrane |
|---|---|---|---|---|
| CASP12 | 75 | 63 | | |
| TS115 | 83 | 72 | | |
| CB513 | 81 | 66 | | |
| DeepLoc | | | 79 | 91 |
ProtT5 test results:
| Task/Dataset | secondary structure (3-states) | secondary structure (8-states) | Localization | Membrane |
|---|---|---|---|---|
| CASP12 | 81 | 70 | | |
| TS115 | 87 | 77 | | |
| CB513 | 86 | 74 | | |
| DeepLoc | | | 81 | 91 |
Thanks to @henriupton for protbert-embeddings-for-cafa5.
Notebook:
```python
print("Load ProtBERT Model...")

from transformers import T5Tokenizer, T5EncoderModel
import torch
from Bio import SeqIO
import re

MAIN_DIR = "/kaggle/input/cafa-5-protein-function-prediction"

import pandas as pd
import numpy as np
from tqdm import tqdm
import time
import matplotlib.pyplot as plt
import gc

import wandb
import random
import os

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print("Using device: {}".format(device))

transformer_link = "Rostlab/prot_t5_xl_half_uniref50-enc"
print("Loading: {}".format(transformer_link))
model = T5EncoderModel.from_pretrained(transformer_link)
model.full() if device == 'cpu' else model.half()  # only cast to full precision if no GPU is available
model = model.to(device)
model = model.eval()
tokenizer = T5Tokenizer.from_pretrained(transformer_link, do_lower_case=False)

class config:
    train_sequences_path = MAIN_DIR + "/Train/train_sequences.fasta"
    train_labels_path = MAIN_DIR + "/Train/train_terms.tsv"
    test_sequences_path = MAIN_DIR + "/Test (Targets)/testsuperset.fasta"
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def get_bert_embedding(sequence: str, len_seq_limit: int):
    '''
    Function to collect the last hidden state embedding vector from the pre-trained ProtBERT model.

    INPUTS:
    - sequence (str): protein sequence (ex: AAABBB) from the fasta file
    - len_seq_limit (int): maximum sequence length (i.e. nb of letters) for truncation

    OUTPUTS:
    - output_hidden: last hidden state embedding vector of length 1024 for the input sequence
    '''
    sequence_w_spaces = ' '.join(list(sequence))
    encoded_input = tokenizer(
        sequence_w_spaces,
        truncation=True,
        max_length=len_seq_limit,
        padding='max_length',
        return_tensors='pt').to(config.device)
    output = model(**encoded_input)
    output_hidden = output['last_hidden_state'][:, 0][0].detach().cpu().numpy()
    assert len(output_hidden) == 1024
    return output_hidden

print("Loading train set ProtBERT Embeddings...")
fasta_train = SeqIO.parse(config.train_sequences_path, "fasta")
print("Total Nb of Elements : ", len(list(fasta_train)))
fasta_train = SeqIO.parse(config.train_sequences_path, "fasta")
ids_list = []
embed_vects_list = []
t0 = time.time()
checkpoint = 0
for item in tqdm(fasta_train, total=142246):
    ids_list.append(item.id)
    embed_vects_list.append(get_bert_embedding(sequence=item.seq, len_seq_limit=1200))
    checkpoint += 1
    if checkpoint >= 100:
        df_res = pd.DataFrame(data={"id": ids_list, "embed_vect": embed_vects_list})
        np.save('/kaggle/working/train_ids.npy', np.array(ids_list))
        np.save('/kaggle/working/train_embeddings.npy', np.array(embed_vects_list))
        checkpoint = 0

np.save('/kaggle/working/train_ids.npy', np.array(ids_list))
np.save('/kaggle/working/train_embeddings.npy', np.array(embed_vects_list))
print('Total Elapsed Time:', time.time() - t0)

print("Loading test set ProtBERT Embeddings...")
fasta_test = SeqIO.parse(config.test_sequences_path, "fasta")
print("Total Nb of Elements : ...
```
Dataset Description
Tiny ImageNet is a reduced version of the original ImageNet dataset, containing 200 classes (a subset of the 1,000 ImageNet categories).
Homepage: https://www.image-net.org/
Citation
@inproceedings{deng2009imagenet,
  title = {ImageNet: A large-scale hierarchical image database},
  author = {Deng, Jia and others},
  booktitle = {CVPR},
  year = {2009}
}
Training code:
```python
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import os
import pandas as pd
import numpy as np

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
TEMP_DIR = "tmp"
os.makedirs(TEMP_DIR, exist_ok=True)

train = pd.read_csv('input/map-charting-student-math-misunderstandings/train.csv')
train.Misconception = train.Misconception.fillna('NA')
train['target'] = train.Category + ":" + train.Misconception

le = LabelEncoder()
train['label'] = le.fit_transform(train['target'])
n_classes = len(le.classes_)  # Number of unique target classes
print(f"Train shape: {train.shape} with {n_classes} target classes")
print("Train head:")
train.head()

idx = train.apply(lambda row: row.Category.split('_')[0], axis=1) == 'True'
correct = train.loc[idx].copy()
correct['c'] = correct.groupby(['QuestionId', 'MC_Answer']).MC_Answer.transform('count')
correct = correct.sort_values('c', ascending=False)
correct = correct.drop_duplicates(['QuestionId'])
correct = correct[['QuestionId', 'MC_Answer']]
correct['is_correct'] = 1  # Mark these as correct answers

train = train.merge(correct, on=['QuestionId', 'MC_Answer'], how='left')
train.is_correct = train.is_correct.fillna(0)

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

Model_Name = "unsloth/Meta-Llama-3.1-8B-Instruct"
model = AutoModelForSequenceClassification.from_pretrained(Model_Name, num_labels=n_classes, torch_dtype=torch.bfloat16, device_map="balanced", cache_dir=TEMP_DIR)
tokenizer = AutoTokenizer.from_pretrained(Model_Name, cache_dir=TEMP_DIR)

def format_input(row):
    x = "Yes"
    if not row['is_correct']:
        x = "No"
    return (
        f"Question: {row['QuestionText']} "
        f"Answer: {row['MC_Answer']} "
        f"Correct? {x} "
        f"Student Explanation: {row['StudentExplanation']}"
    )

train['text'] = train.apply(format_input, axis=1)
print("Example prompt for our LLM:")
print()
print(train.text.values[0])
from datasets import Dataset
COLS = ['text', 'label']
train_df_clean = train[COLS].copy() # Use 'train' instead of 'train_df'
train_df_clean['label'] = train_df_clean['label'].astype(np.int64)
train_df_clean = train_df_clean.reset_index(drop=True)
train_ds = Dataset.from_pandas(train_df_clean, preserve_index=False)
def tokenize(batch):
    """Tokenizes a batch of text inputs."""
    return tokenizer(batch["text"], truncation=True, max_length=256)
train_ds = train_ds.map(tokenize, batched=True, remove_columns=['text'])
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = tokenizer.pad_token_id
import os
from huggingface_hub import scan_cache_dir

# free the hub cache: drop every cached revision of every repo
cache_info = scan_cache_dir()
cache_info.delete_revisions(*[repo.revisions for repo in cache_info.repos]).execute()

from transformers import TrainingArguments, Trainer, DataCollatorWithPadding
import tempfile
import shutil

os.makedirs(f"{TEMP_DIR}/training_output/", exist_ok=True)
os.makedirs(f"{TEMP_DIR}/logs/", exist_ok=True)
training_args = TrainingArguments(
output_dir=f"{TEMP_DIR}/training_output/",
do_train=True,
do_eval=False,
save_strategy="no",
num_train_epochs=3,
per_device_train_batch_size=16,
learning_rate=5e-5,
logging_dir=f"{TEMP_DIR}/logs/",
logging_steps=500,
bf16=True,
fp16=False,
report_to="none",
warmup_ratio=0.1,
lr_scheduler_type="cosine",
dataloader_pin_memory=False,
gradient_checkpointing=True,
)
def compute_map3(eval_pred):
    """Computes Mean Average Precision at 3 (MAP@3) for evaluation."""
    logits, labels = eval_pred
    probs = torch.nn.functional.softmax(torch.tensor(logits), dim=-1).numpy()
    # Get top 3 predicted class indi...
```
8k context length embedding model from https://huggingface.co/nomic-ai/nomic-embed-text-v1.5. For offline use in Kaggle notebooks.
Useful for trialing embeddings based search over larger texts. Check my notebook for example: https://www.kaggle.com/code/donkeys/q-a-q-from-summaries-a-from-writeups
See the model page for more details: https://huggingface.co/nomic-ai/nomic-embed-text-v1.5
Some example code (check my notebook above and model page for more details):
```python
from sentence_transformers import util
from sentence_transformers import SentenceTransformer
import torch
import torch.nn.functional as F   # for the optional Matryoshka post-processing below
from tqdm import tqdm

embed_model = SentenceTransformer("/kaggle/input/nomic-embed-text-v1-5-model/nomic-embed-text-v1.5", trust_remote_code=True)

# nomic-embed expects task prefixes: queries get "search_query: ", documents get "search_document: "
search_q = f"search_query: {search_q_desc}"

def encode_in_batches(my_model, documents, batch_size=16):
    embeddings = []
    for i in tqdm(range(0, len(documents), batch_size)):
        batch = list(documents[i:i+batch_size].values)
        batch_embeddings = my_model.encode(batch, convert_to_tensor=True)
        embeddings.extend(batch_embeddings)
        torch.cuda.empty_cache()
    return embeddings

df["search_doc"] = df["desc"].apply(lambda x: "search_document: " + x)
doc_embeddings = encode_in_batches(embed_model, df["search_doc"])

# optional Matryoshka truncation: layer-norm, cut to matryoshka_dim dimensions, re-normalize
embeddings = embed_model.encode(sentences, convert_to_tensor=True)
embeddings = F.layer_norm(embeddings, normalized_shape=(embeddings.shape[1],))
embeddings = embeddings[:, :matryoshka_dim]
embeddings = F.normalize(embeddings, p=2, dim=1)

# rank documents by cosine similarity to the encoded query (q_embeddings holds the query embeddings)
similarities = util.cos_sim(q_embeddings[0], doc_embeddings)
similarities = similarities[0]
top_n = 50
top_k_values, top_k_indices = torch.topk(similarities, top_n)
top_k_rows = []
for idx in top_k_indices:   # topk on the flattened similarities already returns 1-D indices
    top_row = df.iloc[int(idx)]
    top_k_rows.append(top_row)
```
...
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
https://github.com/deepseek-ai/DeepSeek-Math https://huggingface.co/deepseek-ai/deepseek-math-7b-instruct
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
model_name = "/kaggle/input/deepseek-math-7b-instruct/deepseek-math-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
model.generation_config = GenerationConfig.from_pretrained(model_name)
model.generation_config.pad_token_id = model.generation_config.eos_token_id
messages = [
    {"role": "user", "content": "what is the integral of x^2 from 0 to 2?\nPlease reason step by step, and put your final answer within \\boxed{}."}
]
input_tensor = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(input_tensor.to(model.device), max_new_tokens=100)
result = tokenizer.decode(outputs[0][input_tensor.shape[1]:], skip_special_tokens=True)
print(result)
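Since the prompt asks the model to put its final answer inside \boxed{}, a small post-processing step is often useful. The regex below is our own addition, not part of the DeepSeek example:

```python
import re

# greedy match: assumes \boxed{...} is the last braced expression in the output
match = re.search(r"\\boxed\{(.*)\}", result)
final_answer = match.group(1) if match else None
print(final_answer)
```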
!pip install datasets transformers torch pandas

import datasets
import transformers
import torch
import pandas as pd

pip show datasets
pip show transformers
pip show torch
pip show pandas
Install required dependencies
Run this in your terminal or in a separate cell:
!pip install datasets transformers torch pandas
import pandas as pd
from datasets import Dataset
from transformers import MarianMTModel, MarianTokenizer, TrainingArguments, Trainer
import torch
… See the full description on the dataset page: https://huggingface.co/datasets/mini1234/subtitles.
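The truncated snippet above imports MarianMTModel and MarianTokenizer for fine-tuning on the subtitles. As a starting point, a plain translation call looks roughly like the sketch below; the Helsinki-NLP checkpoint and language pair are assumptions, since the card does not name one:

```python
import torch
from transformers import MarianMTModel, MarianTokenizer

checkpoint = "Helsinki-NLP/opus-mt-en-fr"  # assumed language pair; pick the one matching the subtitles
tokenizer = MarianTokenizer.from_pretrained(checkpoint)
model = MarianMTModel.from_pretrained(checkpoint)

batch = tokenizer(["I'll see you tomorrow."], return_tensors="pt", padding=True)
with torch.no_grad():
    generated = model.generate(**batch)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```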
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Card for "torch-forum"
Dataset structure:
```
{
  title: str,
  category: str,
  posts: List[{
    poster: str,
    contents: str,
    likes: int,
    isAccepted: bool
  }]
}
```
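Given the schema above, each record can be walked post by post. The repository id below is a placeholder (the card only calls the dataset "torch-forum"), so substitute the actual path:

```python
from datasets import load_dataset

ds = load_dataset("<user>/torch-forum", split="train")  # placeholder repo id
example = ds[0]
print(example["title"], "-", example["category"])
for post in example["posts"]:
    marker = "[accepted]" if post["isAccepted"] else ""
    print(f'{post["poster"]} ({post["likes"]} likes) {marker}: {post["contents"][:80]}')
```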
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The corrupted version of the Flowers102 fine-grained classification dataset.
How to use this dataset
For all the corruptions, extract the tar.gz files with the following command:
for f in *.tar.gz; do tar -xzf "$f" && rm "$f"; done
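If you prefer to do the extraction from a notebook, the same loop can be written in Python; it assumes the *.tar.gz archives sit in the current directory, exactly like the shell command above:

```python
import glob
import os
import tarfile

for path in glob.glob("*.tar.gz"):
    with tarfile.open(path, "r:gz") as archive:
        archive.extractall(".")  # same effect as `tar -xzf "$f"`
    os.remove(path)              # mirrors the `&& rm "$f"` step
```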
License
The license of the original dataset is unclear.
Citation
If you use this dataset please consider citing: The authors of the original dataset, @inproceedings{nilsback2008automated, title={Automated flower… See the full description on the dataset page: https://huggingface.co/datasets/torch-uncertainty/Flowers102-C.
eitanturok/API-Bench-TorchHub dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Description
The Describable Textures Dataset (DTD) includes 5,640 images of textures annotated with human-centric attributes. This split is derived from the OpenOOD benchmark OOD evaluation splits.
Homepage: https://www.robots.ox.ac.uk/~vgg/data/dtd/ OpenOOD Benchmark: https://github.com/Jingkang50/OpenOOD/
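A hedged loading example with the datasets library, using the repository id from the dataset page linked below; the split name is an assumption and may differ on the Hub:

```python
from datasets import load_dataset

# repository id taken from the dataset page referenced here; the split name is assumed
texture_ood = load_dataset("torch-uncertainty/Texture", split="test")
print(texture_ood)
```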
Citation
@inproceedings{cimpoi2014describing, title={Describing textures in the wild}, author={Cimpoi, Mircea and others}, booktitle={CVPR}… See the full description on the dataset page: https://huggingface.co/datasets/torch-uncertainty/Texture.
siro1/example-dataset-H100-Qwen3-Coder-30B-A3B-Instruct-torchbench dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Description
SSB-Hard is an OOD dataset focusing on categories that are semantically similar to, but distinct from, those in ImageNet. This split is derived from the OpenOOD benchmark OOD evaluation splits.
OpenOOD Benchmark: https://github.com/Jingkang50/OpenOOD/
Citation
@inproceedings{vaze2022openset, title={Open-set Recognition: A Good Closed-set Classifier is All You Need?}, author={Vaze, Siddharth and others}, booktitle={ICLR}, year={2022} } @inproceedings{yang2022openood… See the full description on the dataset page: https://huggingface.co/datasets/torch-uncertainty/SSB_hard.