a2labs-ai/lenny-functional-torch dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
torch-uncertainty/Places365-C dataset hosted on Hugging Face and contributed by the HF Datasets community
apssg96/torch-to-manim dataset hosted on Hugging Face and contributed by the HF Datasets community
Medical Dataset for Abbreviation Disambiguation for Natural Language Understanding (MeDAL) is a large medical text dataset curated for abbreviation disambiguation, designed for natural language understanding pre-training in the medical domain. It was published at the ClinicalNLP workshop at EMNLP.
💻 Code 🤗 Dataset (Hugging Face) 💾 Dataset (Kaggle) 💽 Dataset (Zenodo) 📜 Paper (ACL) 📝 Paper (Arxiv) ⚡ Pre-trained ELECTRA (Hugging Face)
We recommend downloading from Kaggle if you can authenticate through their API. The advantage of Kaggle is that the data is compressed, so it downloads faster. Links to the data can be found at the top of the readme.
First, you will need to create an account on kaggle.com. Afterwards, you will need to install the kaggle API:
pip install kaggle
Then, you will need to follow the instructions here to add your username and key. Once that's done, you can run:
kaggle datasets download xhlulu/medal-emnlp
Now, unzip everything and place them inside the data directory:
unzip -nq medal-emnlp.zip -d data
mv data/pretrain_sample/* data/
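Alternatively, if you prefer to stay in Python rather than the CLI, the same download can be scripted with the kaggle package's API client. A minimal sketch, assuming your API token is already in ~/.kaggle/kaggle.json:

```python
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # reads ~/.kaggle/kaggle.json (or KAGGLE_USERNAME / KAGGLE_KEY env vars)
# download the dataset archive into ./data and extract it in place
api.dataset_download_files("xhlulu/medal-emnlp", path="data", unzip=True)
```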
For the LSTM models, we will need to use the fastText embeddings. To do so, first download and extract the weights:
wget -nc -P data/ https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip
unzip -nq data/crawl-300d-2M-subword.zip -d data/
You can directly load LSTM and LSTM-SA with torch.hub:
```python
import torch

lstm = torch.hub.load("BruceWen120/medal", "lstm")
lstm_sa = torch.hub.load("BruceWen120/medal", "lstm_sa")
```
If you want to use the Electra model, you need to first install transformers:
pip install transformers
Then, you can load it with torch.hub:
```python
import torch

electra = torch.hub.load("BruceWen120/medal", "electra")
```
If you are only interested in the pre-trained ELECTRA weights (without the disambiguation head), you can load them directly from the Hugging Face repository with transformers:
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained("xhlu/electra-medal")
tokenizer = AutoTokenizer.from_pretrained("xhlu/electra-medal")
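As a quick sanity check that the weights load and produce sentence-level features, the encoder can be run like any other transformers model; the example sentence and the mean-pooling step are our own illustration, not part of the original instructions:

```python
import torch

inputs = tokenizer("The patient was given an ACE inhibitor for HTN.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# mean-pool the token embeddings into one sentence-level vector
sentence_embedding = outputs.last_hidden_state.mean(dim=1)
print(sentence_embedding.shape)  # (1, hidden_size)
```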
Download the bibtex here, or copy the text below:
@inproceedings{wen-etal-2020-medal,
title = "{M}e{DAL}: Medical Abbreviation Disambiguation Dataset for Natural Language Understanding Pretraining",
author = "Wen, Zhi and Lu, Xing Han and Reddy, Siva",
booktitle = "Proceedings of the 3rd Clinical Natural Language Processing Workshop",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.clinicalnlp-1.15",
pages = "130--135",
}
The ELECTRA model is licensed under Apache 2.0. The licenses for the libraries used in this project (transformers, PyTorch, etc.) can be found in their respective GitHub repositories. Our model is released under an MIT license.
The original dataset was retrieved and modified from the NLM website. By using this dataset, you are bound by the terms and conditions specified by NLM:
INTRODUCTION
Downloading data from the National Library of Medicine FTP servers indicates your acceptance of the following Terms and Conditions: No charges, usage fees or royalties are paid to NLM for this data.
MEDLINE/PUBMED SPECIFIC TERMS
NLM freely provides PubMed/MEDLINE data. Please note some PubMed/MEDLINE abstracts may be protected by copyright.
GENERAL TERMS AND CONDITIONS
Users of the data agree to:
- acknowledge NLM as the source of the data by including the phrase "Courtesy of the U.S. National Library of Medicine" in a clear and conspicuous manner,
- properly use registration and/or trademark symbols when referring to NLM products, and
- not indicate or imply that NLM has endorsed its products/services/applications.
Users who republish or redistribute the data (services, products or raw data) agree to:
- maintain the most current version of all distributed data, or
- make known in a clear and conspicuous manner that the products/services/applications do not reflect the most current/accurate data available from NLM.
These data are produced with a reasonable standard of care, but NLM makes no warranties express or implied, including no warranty of merchantability or fitness for particular purpose, regarding the accuracy or completeness of the data. Users agree to hold NLM and the U.S. Government harmless from any liability resulting from errors in the data. NLM disclaims any liability for any consequences due to use, misuse, or interpretation of information contained or not contained in the data.
NLM does not provide legal advice regarding copyright, fair use, or other aspects of intellectual property rights. See the NLM Copyright page.
NLM reserves the right to change the type and format of its machine-readable data. NLM will take reasonable steps to inform users of any changes to the format of the data before the data are distributed via the announcement section or subscription to email and RSS updates.
Source:
https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2
https://huggingface.co/sentence-transformers/paraphrase-xlm-r-multilingual-v1
https://huggingface.co/xlm-roberta-base
```python
import torch
from sentence_transformers import SentenceTransformer

if torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')

model_path = "../input/sbert-models/paraphrase-multilingual-mpnet-base-v2"
model = SentenceTransformer(model_path, device=device)

# `text` is the string (or list of strings) you want to embed
embedding = model.encode(text, device=device)
```
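Since these embeddings are typically used for semantic search, a short follow-up showing how to rank candidate sentences against a query may help; the example strings are illustrative only:

```python
from sentence_transformers import util

query_emb = model.encode("How do I resize an image?", convert_to_tensor=True, device=device)
doc_embs = model.encode(
    ["Scaling pictures with PIL", "Training a tokenizer", "Resizing images with OpenCV"],
    convert_to_tensor=True,
    device=device,
)
scores = util.cos_sim(query_emb, doc_embs)[0]   # cosine similarity of the query to each candidate
best_first = scores.argsort(descending=True)    # candidate indices, best match first
print(best_first, scores[best_first])
```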
torch-uncertainty/ood-datasets-splits dataset hosted on Hugging Face and contributed by the HF Datasets community
TorchBench
The TorchBench suite of BackendBench is designed to mimic real-world use cases. It provides operators and inputs derived from 155 model traces found in TIMM (67), Hugging Face Transformers (45), and TorchBench (43). (These are also the models PyTorch developers use to validate performance.) You can view the origin of these traces by switching the subset in the dataset viewer to ops_traces_models and torchbench for the full dataset. When running BackendBench, much of the… See the full description on the dataset page: https://huggingface.co/datasets/GPUMODE/backendbench_tests.
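A hedged loading sketch with the datasets library; the configuration and split names below are taken from the description above (the viewer's subsets) and may not match the repository's exact strings:

```python
from datasets import load_dataset

# "torchbench" is the full-suite subset named above; "ops_traces_models" maps each
# operator trace back to the TIMM / Transformers / TorchBench model it came from.
bench = load_dataset("GPUMODE/backendbench_tests", "torchbench", split="train")
print(bench)  # inspect the columns before relying on any particular field name
```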
FFHQ Dataset (pravsels/FFHQ_1024) encoded using the dc-ae-f128c512-mix-1.0 autoencoder.
Example usage:
```python
import sys
sys.path.append('../dcae')  # https://github.com/vladmandic/dcae
from dcae import DCAE

from datasets import load_dataset
import torch
import torchvision

dataset = load_dataset("SwayStar123/FFHQ_1024_DC-AE_f128", split="train")
dc_ae = DCAE("dc-ae-f128c512-mix-1.0", device="cuda", dtype=torch.bfloat16).eval()  # Must be bfloat16; with float16 it produces terrible outputs.

def …
```
See the full description on the dataset page: https://huggingface.co/datasets/SwayStar123/FFHQ_1024_DC-AE_f128.
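Because the usage snippet above is cut off at the decode helper, a small inspection step can help before wiring up the decoder. This reuses the dataset object from the snippet and only prints what is actually there, since the column names are not documented on this page:

```python
# assumes `dataset` from the snippet above is already loaded
sample = dataset[0]
print(list(sample.keys()))     # discover the latent / metadata column names
for key, value in sample.items():
    print(key, type(value))
```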
This dataset contains ProtT5 model vector embeddings for the protein sequences of CAFA 5.
ProtBERT: https://huggingface.co/Rostlab/prot_bert
ProtT5: https://huggingface.co/Rostlab/prot_t5_xl_uniref50
When used for feature extraction, the two models achieve the following results, respectively:
ProtBERT test results:
| Task/Dataset | secondary structure (3-states) | secondary structure (8-states) | Localization | Membrane |
|---|---|---|---|---|
| CASP12 | 75 | 63 | | |
| TS115 | 83 | 72 | | |
| CB513 | 81 | 66 | | |
| DeepLoc | | | 79 | 91 |
ProtT5 test results:
| Task/Dataset | secondary structure (3-states) | secondary structure (8-states) | Localization | Membrane |
|---|---|---|---|---|
| CASP12 | 81 | 70 | | |
| TS115 | 87 | 77 | | |
| CB513 | 86 | 74 | | |
| DeepLoc | | | 81 | 91 |
Thanks to @henriupton for protbert-embeddings-for-cafa5.
Notebook:
```python
print("Load ProtBERT Model...")

from transformers import T5Tokenizer, T5EncoderModel
import torch
from Bio import SeqIO
import re

MAIN_DIR = "/kaggle/input/cafa-5-protein-function-prediction"

import pandas as pd
import numpy as np
from tqdm import tqdm
import time
import matplotlib.pyplot as plt
import gc

import wandb
import random
import os

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print("Using device: {}".format(device))

transformer_link = "Rostlab/prot_t5_xl_half_uniref50-enc"
print("Loading: {}".format(transformer_link))
model = T5EncoderModel.from_pretrained(transformer_link)
model.full() if device == 'cpu' else model.half()  # only cast to full precision if no GPU is available
model = model.to(device)
model = model.eval()
tokenizer = T5Tokenizer.from_pretrained(transformer_link, do_lower_case=False)

class config:
    train_sequences_path = MAIN_DIR + "/Train/train_sequences.fasta"
    train_labels_path = MAIN_DIR + "/Train/train_terms.tsv"
    test_sequences_path = MAIN_DIR + "/Test (Targets)/testsuperset.fasta"
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def get_bert_embedding(sequence: str, len_seq_limit: int):
    '''
    Function to collect the last hidden state embedding vector from the pre-trained ProtBERT model.

    INPUTS:
    - sequence (str): protein sequence (ex: AAABBB) from the fasta file
    - len_seq_limit (int): maximum sequence length (i.e. nb of letters) for truncation

    OUTPUTS:
    - output_hidden: last hidden state embedding vector of length 1024 for the input sequence
    '''
    sequence_w_spaces = ' '.join(list(sequence))
    encoded_input = tokenizer(
        sequence_w_spaces,
        truncation=True,
        max_length=len_seq_limit,
        padding='max_length',
        return_tensors='pt').to(config.device)
    output = model(**encoded_input)
    output_hidden = output['last_hidden_state'][:, 0][0].detach().cpu().numpy()
    assert len(output_hidden) == 1024
    return output_hidden

print("Loading train set ProtBERT Embeddings...")
fasta_train = SeqIO.parse(config.train_sequences_path, "fasta")
print("Total Nb of Elements : ", len(list(fasta_train)))
fasta_train = SeqIO.parse(config.train_sequences_path, "fasta")
ids_list = []
embed_vects_list = []
t0 = time.time()
checkpoint = 0
for item in tqdm(fasta_train, total=142246):
    ids_list.append(item.id)
    embed_vects_list.append(get_bert_embedding(sequence=item.seq, len_seq_limit=1200))
    checkpoint += 1
    if checkpoint >= 100:
        df_res = pd.DataFrame(data={"id": ids_list, "embed_vect": embed_vects_list})
        np.save('/kaggle/working/train_ids.npy', np.array(ids_list))
        np.save('/kaggle/working/train_embeddings.npy', np.array(embed_vects_list))
        checkpoint = 0

np.save('/kaggle/working/train_ids.npy', np.array(ids_list))
np.save('/kaggle/working/train_embeddings.npy', np.array(embed_vects_list))
print('Total Elapsed Time:', time.time() - t0)

print("Loading test set ProtBERT Embeddings...")
fasta_test = SeqIO.parse(config.test_sequences_path, "fasta")
print("Total Nb of Elements : ...
```
Dataset Description
Tiny ImageNet is a reduced version of the original ImageNet dataset, containing 200 classes (a subset of the 1,000 ImageNet categories).
Homepage: https://www.image-net.org/
Citation
@inproceedings{deng2009imagenet,
  title = {ImageNet: A large-scale hierarchical image database},
  author = {Deng, Jia and others},
  booktitle = {CVPR},
  year = {2009}
}
Training code:
```python
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import os
import pandas as pd
import numpy as np

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
TEMP_DIR = "tmp"
os.makedirs(TEMP_DIR, exist_ok=True)

train = pd.read_csv('input/map-charting-student-math-misunderstandings/train.csv')
train.Misconception = train.Misconception.fillna('NA')
train['target'] = train.Category + ":" + train.Misconception

le = LabelEncoder()
train['label'] = le.fit_transform(train['target'])
n_classes = len(le.classes_)  # Number of unique target classes
print(f"Train shape: {train.shape} with {n_classes} target classes")
print("Train head:")
train.head()

idx = train.apply(lambda row: row.Category.split('_')[0], axis=1) == 'True'
correct = train.loc[idx].copy()
correct['c'] = correct.groupby(['QuestionId', 'MC_Answer']).MC_Answer.transform('count')
correct = correct.sort_values('c', ascending=False)
correct = correct.drop_duplicates(['QuestionId'])
correct = correct[['QuestionId', 'MC_Answer']]
correct['is_correct'] = 1  # Mark these as correct answers

train = train.merge(correct, on=['QuestionId', 'MC_Answer'], how='left')
train.is_correct = train.is_correct.fillna(0)

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

Model_Name = "unsloth/Meta-Llama-3.1-8B-Instruct"
model = AutoModelForSequenceClassification.from_pretrained(Model_Name, num_labels=n_classes, torch_dtype=torch.bfloat16, device_map="balanced", cache_dir=TEMP_DIR)
tokenizer = AutoTokenizer.from_pretrained(Model_Name, cache_dir=TEMP_DIR)

def format_input(row):
    x = "Yes"
    if not row['is_correct']:
        x = "No"
    return (
        f"Question: {row['QuestionText']} "
        f"Answer: {row['MC_Answer']} "
        f"Correct? {x} "
        f"Student Explanation: {row['StudentExplanation']}"
    )

train['text'] = train.apply(format_input, axis=1)
print("Example prompt for our LLM:")
print()
print(train.text.values[0])
from datasets import Dataset
COLS = ['text', 'label']
train_df_clean = train[COLS].copy() # Use 'train' instead of 'train_df'
train_df_clean['label'] = train_df_clean['label'].astype(np.int64)
train_df_clean = train_df_clean.reset_index(drop=True)
train_ds = Dataset.from_pandas(train_df_clean, preserve_index=False)
def tokenize(batch):
    """Tokenizes a batch of text inputs."""
    return tokenizer(batch["text"], truncation=True, max_length=256)
train_ds = train_ds.map(tokenize, batched=True, remove_columns=['text'])
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = tokenizer.pad_token_id
import os
from huggingface_hub import scan_cache_dir

# free the hub cache: drop every cached revision of every repo
cache_info = scan_cache_dir()
cache_info.delete_revisions(*[repo.revisions for repo in cache_info.repos]).execute()

from transformers import TrainingArguments, Trainer, DataCollatorWithPadding
import tempfile
import shutil

os.makedirs(f"{TEMP_DIR}/training_output/", exist_ok=True)
os.makedirs(f"{TEMP_DIR}/logs/", exist_ok=True)
training_args = TrainingArguments(
output_dir=f"{TEMP_DIR}/training_output/",
do_train=True,
do_eval=False,
save_strategy="no",
num_train_epochs=3,
per_device_train_batch_size=16,
learning_rate=5e-5,
logging_dir=f"{TEMP_DIR}/logs/",
logging_steps=500,
bf16=True,
fp16=False,
report_to="none",
warmup_ratio=0.1,
lr_scheduler_type="cosine",
dataloader_pin_memory=False,
gradient_checkpointing=True,
)
def compute_map3(eval_pred):
    """Computes Mean Average Precision at 3 (MAP@3) for evaluation."""
    logits, labels = eval_pred
    probs = torch.nn.functional.softmax(torch.tensor(logits), dim=-1).numpy()
    # Get top 3 predicted class indi...
```
8k context length embedding model from https://huggingface.co/nomic-ai/nomic-embed-text-v1.5. For offline use in Kaggle notebooks.
Useful for trialing embeddings based search over larger texts. Check my notebook for example: https://www.kaggle.com/code/donkeys/q-a-q-from-summaries-a-from-writeups
See the model page for more details: https://huggingface.co/nomic-ai/nomic-embed-text-v1.5
Some example code (check my notebook above and model page for more details):
```python
from sentence_transformers import util
from sentence_transformers import SentenceTransformer
import torch
import torch.nn.functional as F   # for the optional Matryoshka post-processing below
from tqdm import tqdm

embed_model = SentenceTransformer("/kaggle/input/nomic-embed-text-v1-5-model/nomic-embed-text-v1.5", trust_remote_code=True)

# nomic-embed expects task prefixes: queries get "search_query: ", documents get "search_document: "
search_q = f"search_query: {search_q_desc}"

def encode_in_batches(my_model, documents, batch_size=16):
    embeddings = []
    for i in tqdm(range(0, len(documents), batch_size)):
        batch = list(documents[i:i+batch_size].values)
        batch_embeddings = my_model.encode(batch, convert_to_tensor=True)
        embeddings.extend(batch_embeddings)
        torch.cuda.empty_cache()
    return embeddings

df["search_doc"] = df["desc"].apply(lambda x: "search_document: " + x)
doc_embeddings = encode_in_batches(embed_model, df["search_doc"])

# optional Matryoshka truncation: layer-norm, cut to matryoshka_dim dimensions, re-normalize
embeddings = embed_model.encode(sentences, convert_to_tensor=True)
embeddings = F.layer_norm(embeddings, normalized_shape=(embeddings.shape[1],))
embeddings = embeddings[:, :matryoshka_dim]
embeddings = F.normalize(embeddings, p=2, dim=1)

# rank documents by cosine similarity to the encoded query (q_embeddings holds the query embeddings)
similarities = util.cos_sim(q_embeddings[0], doc_embeddings)
similarities = similarities[0]
top_n = 50
top_k_values, top_k_indices = torch.topk(similarities, top_n)
top_k_rows = []
for idx in top_k_indices:   # topk on the flattened similarities already returns 1-D indices
    top_row = df.iloc[int(idx)]
    top_k_rows.append(top_row)
```
...
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
https://github.com/deepseek-ai/DeepSeek-Math https://huggingface.co/deepseek-ai/deepseek-math-7b-instruct
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
model_name = "/kaggle/input/deepseek-math-7b-instruct/deepseek-math-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
model.generation_config = GenerationConfig.from_pretrained(model_name)
model.generation_config.pad_token_id = model.generation_config.eos_token_id
messages = [
    {"role": "user", "content": "what is the integral of x^2 from 0 to 2?\nPlease reason step by step, and put your final answer within \\boxed{}."}
]
input_tensor = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(input_tensor.to(model.device), max_new_tokens=100)
result = tokenizer.decode(outputs[0][input_tensor.shape[1]:], skip_special_tokens=True)
print(result)
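Since the prompt asks the model to put its final answer inside \boxed{}, a small post-processing step is often useful. The regex below is our own addition, not part of the DeepSeek example:

```python
import re

# greedy match: assumes \boxed{...} is the last braced expression in the output
match = re.search(r"\\boxed\{(.*)\}", result)
final_answer = match.group(1) if match else None
print(final_answer)
```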
!pip install datasets transformers torch pandas

import datasets
import transformers
import torch
import pandas as pd

pip show datasets
pip show transformers
pip show torch
pip show pandas
Install required dependencies
Run this in your terminal or in a separate cell:
!pip install datasets transformers torch pandas
import pandas as pd
from datasets import Dataset
from transformers import MarianMTModel, MarianTokenizer, TrainingArguments, Trainer
import torch
… See the full description on the dataset page: https://huggingface.co/datasets/mini1234/subtitles.
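The truncated snippet above imports MarianMTModel and MarianTokenizer for fine-tuning on the subtitles. As a starting point, a plain translation call looks roughly like the sketch below; the Helsinki-NLP checkpoint and language pair are assumptions, since the card does not name one:

```python
import torch
from transformers import MarianMTModel, MarianTokenizer

checkpoint = "Helsinki-NLP/opus-mt-en-fr"  # assumed language pair; pick the one matching the subtitles
tokenizer = MarianTokenizer.from_pretrained(checkpoint)
model = MarianMTModel.from_pretrained(checkpoint)

batch = tokenizer(["I'll see you tomorrow."], return_tensors="pt", padding=True)
with torch.no_grad():
    generated = model.generate(**batch)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```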
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Card for "torch-forum"
Dataset structure:
```
{
  title: str,
  category: str,
  posts: List[{
    poster: str,
    contents: str,
    likes: int,
    isAccepted: bool
  }]
}
```
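Given the schema above, each record can be walked post by post. The repository id below is a placeholder (the card only calls the dataset "torch-forum"), so substitute the actual path:

```python
from datasets import load_dataset

ds = load_dataset("<user>/torch-forum", split="train")  # placeholder repo id
example = ds[0]
print(example["title"], "-", example["category"])
for post in example["posts"]:
    marker = "[accepted]" if post["isAccepted"] else ""
    print(f'{post["poster"]} ({post["likes"]} likes) {marker}: {post["contents"][:80]}')
```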
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The corrupted version of the Flowers102 fine-grained classification dataset.
How to use this dataset
For all the corruptions, extract the tar.gz files with the following command:
for f in *.tar.gz; do tar -xzf "$f" && rm "$f"; done
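If you prefer to do the extraction from a notebook, the same loop can be written in Python; it assumes the *.tar.gz archives sit in the current directory, exactly like the shell command above:

```python
import glob
import os
import tarfile

for path in glob.glob("*.tar.gz"):
    with tarfile.open(path, "r:gz") as archive:
        archive.extractall(".")  # same effect as `tar -xzf "$f"`
    os.remove(path)              # mirrors the `&& rm "$f"` step
```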
License
The license of the original dataset is unclear.
Citation
If you use this dataset please consider citing: The authors of the original dataset, @inproceedings{nilsback2008automated, title={Automated flower… See the full description on the dataset page: https://huggingface.co/datasets/torch-uncertainty/Flowers102-C.
eitanturok/API-Bench-TorchHub dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Description
The Describable Textures Dataset (DTD) includes 5,640 images of textures annotated with human-centric attributes. This split is derived from the OpenOOD benchmark OOD evaluation splits.
Homepage: https://www.robots.ox.ac.uk/~vgg/data/dtd/ OpenOOD Benchmark: https://github.com/Jingkang50/OpenOOD/
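A hedged loading example with the datasets library, using the repository id from the dataset page linked below; the split name is an assumption and may differ on the Hub:

```python
from datasets import load_dataset

# repository id taken from the dataset page referenced here; the split name is assumed
texture_ood = load_dataset("torch-uncertainty/Texture", split="test")
print(texture_ood)
```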
Citation
@inproceedings{cimpoi2014describing, title={Describing textures in the wild}, author={Cimpoi, Mircea and others}, booktitle={CVPR}… See the full description on the dataset page: https://huggingface.co/datasets/torch-uncertainty/Texture.
siro1/example-dataset-H100-Qwen3-Coder-30B-A3B-Instruct-torchbench dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Description
SSB-Hard is an OOD dataset focusing on categories that are semantically similar to, but distinct from, those in ImageNet. This split is derived from the OpenOOD benchmark OOD evaluation splits.
OpenOOD Benchmark: https://github.com/Jingkang50/OpenOOD/
Citation
@inproceedings{vaze2022openset, title={Open-set Recognition: A Good Closed-set Classifier is All You Need?}, author={Vaze, Siddharth and others}, booktitle={ICLR}, year={2022} } @inproceedings{yang2022openood… See the full description on the dataset page: https://huggingface.co/datasets/torch-uncertainty/SSB_hard.