Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The Hugging Face Hub hosts many models for a variety of machine learning tasks. Models are stored in repositories, so they benefit from all the features possessed by every repo on the Hugging Face Hub.
| Variable | Description |
|---|---|
| model_id | Unique identifier of the model on the Hub |
| pipeline | The model's pipeline tag; there are 40 pipelines in total. To learn more, read: Hugging Face Pipeline |
| downloads | Number of downloads |
| likes | Number of likes |
| author_id | ID of the model's author |
| author_name | Name of the model's author |
| author_type | user or organization |
| author_isPro | Whether the author is a paid user or organization |
| lastModified | Last modification date, from 2014-08-10 to 2023-11-27 |
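A table like this can be rebuilt with the huggingface_hub client. The sketch below is illustrative rather than the collection script actually used for this dataset; exact ModelInfo attribute names vary slightly across huggingface_hub versions.

from huggingface_hub import HfApi
api = HfApi()
rows = []
for m in api.list_models(limit=100, full=True):  # drop limit to crawl the full Hub
    rows.append({
        "model_id": m.id,
        "pipeline": m.pipeline_tag,
        "downloads": m.downloads,
        "likes": m.likes,
        "author_id": m.id.split("/")[0] if "/" in m.id else None,
        "lastModified": m.last_modified,
    })
print(rows[0])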
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
KTDA-Datasets
This dataset card aims to describe the datasets used in the KTDA.
Install
pip install huggingface-hub
Usage
huggingface-cli download --repo-type dataset XavierJiezou/ktda-datasets --local-dir data --include grass.zip
huggingface-cli download --repo-type dataset XavierJiezou/ktda-datasets --local-dir data --include cloud.zip
unzip grass.zip -d grass
unzip cloud.zip -d l8_biome
… See the full description on the dataset page: https://huggingface.co/datasets/XavierJiezou/ktda-datasets.
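The same two archives can also be fetched from Python; a minimal sketch, assuming huggingface_hub's snapshot_download with pattern filtering:

from huggingface_hub import snapshot_download
snapshot_download(
    repo_id="XavierJiezou/ktda-datasets",
    repo_type="dataset",
    local_dir="data",
    allow_patterns=["grass.zip", "cloud.zip"],  # fetch only the two archives
)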
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
Dataset containing metadata for all publicly uploaded models (10,000+) available on the HuggingFace model hub. Data was collected between 15 and 20 June 2021.
The dataset was generated using the huggingface_hub APIs provided by the Hugging Face team.
This is my first dataset upload on Kaggle. I hope you like it. :)
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
This dataset tracks repository name changes over time. Each row represents a unique combination of repository ID and name, with the timestamp of when that name was first observed. Since GitHub allows repository renaming while preserving the internal repository ID, this dataset enables tracking the full naming history of any repository.
Schema
| Column | Type | Description |
|---|---|---|
| repo_id | int64 | GitHub's internal repository identifier |
| repo_name | string | Repository… |

See the full description on the dataset page: https://huggingface.co/datasets/git2vec/repo_names.
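With this schema, the full naming history of a repository is a group-by on repo_id sorted by the observation timestamp. A minimal sketch; since the table above is truncated, the split name, the timestamp column name, and the example repo_id are all assumptions:

from datasets import load_dataset
df = load_dataset("git2vec/repo_names", split="train").to_pandas()
TS_COL = "first_seen"  # hypothetical name of the first-observed timestamp column
history = df[df["repo_id"] == 123456].sort_values(TS_COL)  # 123456 is an example ID
print(history[["repo_name", TS_COL]])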
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The source repo is google/flan-t5-large.
from transformers import AutoTokenizer, AutoModel
# Load the weights and tokenizer from the attached Kaggle dataset directory.
model = AutoModel.from_pretrained('/kaggle/input/googleflan-t5-large/flan-t5-large')
tokenizer = AutoTokenizer.from_pretrained('/kaggle/input/googleflan-t5-large/flan-t5-large')
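Note that AutoModel returns the bare encoder-decoder without a language-modeling head; for actual text-to-text inference you would normally load AutoModelForSeq2SeqLM instead. A minimal sketch, assuming the same local path:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
path = '/kaggle/input/googleflan-t5-large/flan-t5-large'
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForSeq2SeqLM.from_pretrained(path)
inputs = tokenizer("Translate English to German: How are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)  # FLAN-T5 takes plain-text instructions
print(tokenizer.decode(outputs[0], skip_special_tokens=True))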
This dataset contains many popular BERT weights retrieved directly from Hugging Face's model repository, and hosted on Kaggle. It will be automatically updated every month to ensure that the latest version is available to the user. By making it a dataset, it is significantly faster to load the weights, since you can directly attach a Kaggle dataset to the notebook rather than downloading the data every time. See the speed comparison notebook.
The banner was adapted from figures by Jimmy Lin (tweet; slide) released under CC BY 4.0. BERT has an Apache 2.0 license according to the model repository.
To use this dataset, simply attach it to your notebook and specify the path to the dataset. For example:
from transformers import AutoTokenizer, AutoModelForMaskedLM
# Root of the attached Kaggle dataset; each subfolder holds one BERT variant.
MODEL_DIR = "/kaggle/input/huggingface-bert/"
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR + "bert-large-uncased")
model = AutoModelForMaskedLM.from_pretrained(MODEL_DIR + "bert-large-uncased")
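Once loaded, the masked-LM head can be exercised directly. A small sketch building the standard fill-mask pipeline from the same local path:

from transformers import pipeline
fill_mask = pipeline(
    "fill-mask",
    model=MODEL_DIR + "bert-large-uncased",  # reuse the attached Kaggle path
    tokenizer=MODEL_DIR + "bert-large-uncased",
)
print(fill_mask("The goal of life is [MASK]."))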
All the copyrights and IP relating to BERT belong to the original authors (Devlin et al., 2019) and Google. All copyrights relating to the transformers library belong to Hugging Face. The banner image was created thanks to Jimmy Lin, so any modification of this figure should mention the original author and respect the conditions of the license; all copyrights related to the images belong to him.
Some of the models are community created or trained. Please reach out directly to the authors if you have questions regarding licenses and usage.
Other: https://choosealicense.com/licenses/other/
GitHub R repositories dataset
R source files from GitHub.
This dataset has been created using the public GitHub datasets from Google BigQuery.
This is the actual query that has been used to export the data:
EXPORT DATA
  OPTIONS (
    uri = 'gs://your-bucket/gh-r/*.parquet',
    format = 'PARQUET') as
(
  select
    f.id, f.repo_name, f.path,
    c.content, c.size
  from (
    SELECT distinct
      id, repo_name, path
    FROM bigquery-public-data.github_repos.files
    where ends_with(path…

See the full description on the dataset page: https://huggingface.co/datasets/dfalbel/github-r-repos.
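Once exported and uploaded, the parquet shards can be read back with the datasets library. A minimal sketch, assuming the repo loads with the standard API; streaming avoids materializing all the R sources locally:

from datasets import load_dataset
from itertools import islice
ds = load_dataset("dfalbel/github-r-repos", split="train", streaming=True)
for example in islice(ds, 2):
    print(example["repo_name"], example["path"], example["size"])  # columns from the query above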
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
FineWeb 2 is the second iteration of the popular 🍷 FineWeb dataset, bringing high quality pretraining data to over 1000 🗣️ languages. For the actual data, please see the HuggingFace repository.
The 🥂 FineWeb2 dataset is fully reproducible, available under the permissive ODC-By 1.0 license and extensively validated through hundreds of ablation experiments.
In particular, on the set of 9 diverse languages we used to guide our processing decisions, 🥂 FineWeb2 outperforms other popular pretraining datasets covering multiple languages (such as CC-100, mC4, CulturaX or HPLT) while being substantially larger, and in some cases it even performs better than datasets specifically curated for a single one of these languages, on our diverse set of carefully selected evaluation tasks: FineTasks.
The dataset is also listed on HF; here is the official HF page.
"My focus is on sharing this valuable open-source dataset to help AI and ML practitioners easily find resources on Kaggle."
Detailed information about FineWeb 2 is provided in the README.md file below ↓
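For reference, data like this is typically sampled with the datasets library. The repo id and per-language config name below are assumptions (FineWeb 2 publishes one config per language), so verify them against the official HF page:

from datasets import load_dataset
ds = load_dataset(
    "HuggingFaceFW/fineweb-2",  # assumed repo id
    name="fra_Latn",            # hypothetical language config
    split="train",
    streaming=True,             # the corpus is far too large to download eagerly
)
print(next(iter(ds))["text"][:200])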
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
TF-ID arXiv papers dataset
This is the dataset for finetuning TF-ID models. It contains about 4,600 images (academic paper pages) with bounding boxes of tables and figures in COCO format. The papers are selected from Hugging Face Daily Papers, covering mostly AI/ML/DL-related topics. You can use this dataset to reproduce all TF-ID models. All bounding boxes were annotated manually by Yifei Hu.
Project Repo
github.com/ai8hyf/TF-ID
Variants
Unzip the… See the full description on the dataset page: https://huggingface.co/datasets/yifeihu/TF-ID-arxiv-papers.
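Since the boxes follow the COCO convention, the annotation file from whichever variant you unzip can be inspected with the standard library; the file name here is hypothetical:

import json
with open("annotations.json") as f:  # hypothetical name of the unzipped COCO file
    coco = json.load(f)
print(len(coco["images"]), "pages,", len(coco["annotations"]), "boxes")
print(coco["categories"])  # expected to list the table and figure classes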
XAMI: XMM-Newton optical Artefact Mapping for astronomical Instance segmentation
The Dataset
Check the XAMI model and the XAMI dataset on Github.
Downloading the dataset
Using a Python script
from huggingface_hub import hf_hub_download
dataset_name = 'xami_dataset'  # the dataset name on Hugging Face
images_dir = '.'  # the output directory of the dataset images
hf_hub_download(
    repo_id="iulia-elisa/XAMI-dataset",  # the Hugging Face repo ID
    repo_type='dataset'…

See the full description on the dataset page: https://huggingface.co/datasets/iulia-elisa/XAMI-dataset.
from huggingface_hub import HfApi
api = HfApi()
api.upload_file(
    # Raw string so the backslashes are not read as escape sequences;
    # this must point at a file, not a folder.
    path_or_fileobj=r"C:\Users\tusha\Desktop\New folder",  # Replace with your file path
    path_in_repo="data.csv",
    repo_id="Tusharbansod108/disease_data",  # Replace with your repo ID
    repo_type="dataset",
)
This is a pretrained transformer that is available in the transformers module, from Hugging Face here:
https://huggingface.co/cross-encoder/nli-distilroberta-base
The files in this repository were uploaded from the source on the developers' website.
Read the README.md file in the Hugging Face repo for more info: https://huggingface.co/cross-encoder/nli-distilroberta-base/blob/main/README.md
Also, take a look at the sentence-transformers documentation for more models and usage:
https://www.sbert.net/docs/pretrained_models.html
The model files are located in the 0_Transformer folder.
example:
from transformers import pipeline
classifier = pipeline("zero-shot-classification",
                      model="../input/crossencodernlidistilrobertabasev2/0_Transformer",
                      )
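A call then looks along these lines; the input text and candidate labels are illustrative:

result = classifier("One day I will see the world",
                    candidate_labels=["travel", "cooking", "dancing"])
print(result["labels"][0], result["scores"][0])  # best label and its score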
ODC-By: https://choosealicense.com/licenses/odc-by/
Asset Download
The assets need to be placed into RLinf's ManiSkill environment folder with the name assets.
cd
You can also use git to clone the repository: cd
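The commands above are cut off; as a fallback, the repo can be fetched the same way as the other datasets on this page. A sketch, assuming the assets sit at the root of the dataset repo:

huggingface-cli download --repo-type dataset RLinf/maniskill_assets --local-dir assets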
License
Our assets are attributed to… See the full description on the dataset page: https://huggingface.co/datasets/RLinf/maniskill_assets.
GLiNER is a Named Entity Recognition (NER) model capable of identifying any entity type using a bidirectional transformer encoder (BERT-like). It provides a practical alternative to traditional NER models, which are limited to predefined entities, and to Large Language Models (LLMs) that, despite their flexibility, are costly and large for resource-constrained scenarios.
- The training data is in the data/ directory, covering both the Pile-NER 📚 and NuNER 📘 datasets.
- Focal Loss (see this paper) is used instead of BCE to handle class imbalance, as some entity types are more frequent than others.
- An adjustable prediction threshold (model.predict_entities(text, labels, threshold=0.5)) lets the model predict more or fewer entities depending on the context; in practice, the model tends to predict fewer entities when the entity type or domain is not well represented in the training data.

To use this model, you must install the GLiNER Python library:
!pip install gliner
Once you've downloaded the GLiNER library, you can import the GLiNER class. You can then load this model using GLiNER.from_pretrained and predict entities with predict_entities.
from gliner import GLiNER
model = GLiNER.from_pretrained("urchade/gliner_base")
text = """
Cristiano Ronaldo dos Santos Aveiro (Portuguese pronunciation: [kɾiʃˈtjɐnu ʁɔˈnaldu]; born 5 February 1985) is a Portuguese professional footballer who plays as a forward for and captains both Saudi Pro League club Al Nassr and the Portugal national team. Widely regarded as one of the greatest players of all time, Ronaldo has won five Ballon d'Or awards,[note 3] a record three UEFA Men's Player of the Year Awards, and four European Golden Shoes, the most by a European player. He has won 33 trophies in his career, including seven league titles, five UEFA Champions Leagues, the UEFA European Championship and the UEFA Nations League. Ronaldo holds the records for most appearances (183), goals (140) and assists (42) in the Champions League, goals in the European Championship (14), international goals (128) and international appearances (205). He is one of the few players to have made over 1,200 professional career appearances, the most by an outfield player, and has scored over 850 offici...
"""
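The snippet is cut off above; per the predict_entities signature quoted earlier, the usage continues along these lines, with an illustrative label set:

labels = ["person", "award", "date", "competitions", "teams"]  # illustrative labels
entities = model.predict_entities(text, labels, threshold=0.5)
for entity in entities:
    print(entity["text"], "=>", entity["label"])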
Download Dataset
from datasets import load_dataset
repo_id = "sxj1024/cumcm_test"
dataset = load_dataset(repo_id)
print(dataset)
Use this dataset and run !pip install -U --no-build-isolation --no-deps ../input/transformers-master/ -qq, or run !pip install -U transformers.
Pretrained RemBERT model on 110 languages using a masked language modeling (MLM) objective. It was introduced in the paper Rethinking embedding coupling in pre-trained language models. A direct export of the model checkpoint was first made available in this repository. This version of the checkpoint is lightweight since it is meant to be finetuned for classification and excludes the output embedding weights.
RemBERT's main difference from mBERT is that the input and output embeddings are not tied. Instead, RemBERT uses small input embeddings and larger output embeddings. This makes the model more efficient, since the output embeddings are discarded during fine-tuning. It is also more accurate, especially when reinvesting the input embeddings' parameters into the core model, as is done in RemBERT.
You should fine-tune this model for your downstream task. It is meant to be a general-purpose model, similar to mBERT. In our paper, we have successfully applied this model to tasks such as classification, question answering, NER, and POS-tagging. For tasks such as text generation, you should look at models like GPT-2.
The RemBERT model was pretrained on multilingual Wikipedia data over 110 languages. The full language list is in this repository.
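A typical fine-tuning starting point then looks as follows; the checkpoint directory is a hypothetical Kaggle input path for the files this card describes:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
path = "../input/rembert"  # hypothetical path to the attached checkpoint
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForSequenceClassification.from_pretrained(path, num_labels=2)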
GPL-3.0: https://choosealicense.com/licenses/gpl-3.0/
CHARM
📃 Paper • 💻 [Github Repo] • 🌐 Project Page
This repository contains the test dataset presented in the paper CHARM: Control-point-based 3D Anime Hairstyle Auto-Regressive Modeling. CHARM is a novel parametric representation and generative framework for anime hairstyle modeling.
Usage
You can download the files directly from this repository or use the huggingface_hub library:
from huggingface_hub import hf_hub_download, list_repo_files
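Continuing that import, a download loop might look as follows; the repo id is a placeholder, since the card does not spell it out:

repo_id = "owner/charm-dataset"  # hypothetical; substitute the actual repo id
for fname in list_repo_files(repo_id, repo_type="dataset"):
    local_path = hf_hub_download(repo_id=repo_id, filename=fname, repo_type="dataset")
    print("downloaded", local_path)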
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Arxiver consists of 63,357 arXiv papers converted to multi-markdown (.mmd) format. Our dataset includes original arXiv article IDs, titles, abstracts, authors, publication dates, URLs and corresponding markdown files published between January 2023 and October 2023.
We hope our dataset will be useful for various applications such as semantic search, domain specific language modeling, question answering and summarization.
The Arxiver dataset was created using a neural OCR model, Nougat. After OCR processing, we apply custom text-processing steps to refine the data, including extracting author information, removing reference sections, and performing additional cleaning and formatting. Please refer to our GitHub repo for details.
The original articles are maintained by arXiv and copyrighted to the original authors; please refer to the arXiv license information page for details. We release our dataset under a Creative Commons Attribution-NonCommercial-ShareAlike (CC BY-NC-SA 4.0) license. If you use this dataset in your research or project, please cite it as follows:
@misc{acar_arxiver2024,
  author = {Alican Acar and Alara Dirik and Muhammet Hatipoglu},
  title = {ArXiver},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/neuralwork/arxiver}}
}
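For the applications listed above (search, language modeling, summarization), loading the corpus is a single call; a minimal sketch, assuming the standard datasets API and the repo id from the citation:

from datasets import load_dataset
ds = load_dataset("neuralwork/arxiver", split="train")
print(ds[0]["title"])  # column name inferred from the description above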
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
This dataset captures branch creation history across all GitHub repositories. Each row represents a unique combination of repository ID and branch name, with the timestamp of the first observed push to that branch. This enables analysis of branching strategies, feature branch lifecycles, and development workflow patterns.
Schema
| Column | Type | Description |
|---|---|---|
| repo_id | int64 | GitHub's internal repository identifier |
| branch_name | string | Name of the branch… |

See the full description on the dataset page: https://huggingface.co/datasets/git2vec/repo_branches.
AGPL-3.0: https://choosealicense.com/licenses/agpl-3.0/
This repo contains BSD100, Set5 and Set14 for super resolution evaluation study. To access the zipped file: from huggingface_hub import hf_hub_download
repo_id = "keanteng/bsd100-set5-set14"
filename = "BSD100.zip"  # or Set5.zip and Set14.zip
# repo_type="dataset" is needed because this is a dataset repo, not a model repo.
local_filepath = hf_hub_download(repo_id=repo_id, filename=filename, repo_type="dataset")
print(f"File downloaded to: {local_filepath}")
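The downloaded archive can then be unpacked in-process with the standard library:

import zipfile
with zipfile.ZipFile(local_filepath) as zf:
    zf.extractall("BSD100")  # extract the benchmark images into a local folder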