Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for Dataset Name
Dataset Summary
This dataset contains the 20 trending repositories of each type (models, datasets, and Spaces) on Hugging Face, recorded every day. Each type can be loaded from its own dataset config.
Supported Tasks and Leaderboards
[More Information Needed]
Languages
Not relevant.
Dataset Structure
Data Instances
The dataset contains three configurations: models: the history of trending models on Hugging… See the full description on the dataset page: https://huggingface.co/datasets/severo/trending-repos.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The Hugging Face Hub hosts many models for a variety of machine learning tasks. Models are stored in repositories, so they benefit from all the features available to every repository on the Hugging Face Hub.
| Variable | Description |
|---|---|
| model_id | Unique identifier of the model on the Hub |
| pipeline | Pipeline (task) tag; there are 40 pipelines in total. To learn more, read: Hugging Face Pipeline |
| downloads | Number of downloads |
| likes | Number of likes |
| author_id | Identifier of the model's author |
| author_name | Display name of the author |
| author_type | user or organization |
| author_isPro | Whether the user or organization has a paid (Pro) subscription |
| lastModified | Last modification date, from 2014-08-10 to 2023-11-27 |
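As a rough illustration of how rows with this schema can be consumed, the snippet below builds one invented sample record matching the column names and applies a hypothetical filter; neither the values nor the filter come from the dataset itself.

```python
# Hypothetical sample row matching the schema above (all values invented):
row = {
    "model_id": "google/mobilebert-uncased",
    "pipeline": "fill-mask",
    "downloads": 120_000,
    "likes": 85,
    "author_id": "google",
    "author_name": "Google",
    "author_type": "organization",
    "author_isPro": False,
    "lastModified": "2023-11-27",
}

def is_popular_org_model(r, min_downloads=10_000):
    """Keep only organization-owned models above a download threshold."""
    return r["author_type"] == "organization" and r["downloads"] >= min_downloads

print(is_popular_org_model(row))  # True for the sample row above
```

The same predicate can be passed to a dataframe filter or a `filter()` call once the rows are actually loaded.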
This dataset contains different variants of the RoBERTa and XLM-RoBERTa models by Meta AI available on Hugging Face's model repository.
Packaging the models as a dataset makes loading the weights significantly faster, since you can attach a Kaggle dataset directly to the notebook rather than downloading the weights every time. See the speed comparison notebook. Another benefit of loading models as a dataset is that they can be used in competitions that require internet access to be "off".
For more information on usage, visit the RoBERTa Hugging Face docs and the XLM-RoBERTa Hugging Face docs.
Usage
To use this dataset, attach it to your notebook and specify the path to the dataset. For example:
from transformers import AutoTokenizer, AutoModelForPreTraining

# Path to the attached Kaggle dataset; each model variant lives in its own subfolder.
MODEL_DIR = "/kaggle/input/huggingface-roberta/"
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR + "roberta-base")
model = AutoModelForPreTraining.from_pretrained(MODEL_DIR + "roberta-base")
Acknowledgements
All the copyrights and IP relating to RoBERTa and XLM-RoBERTa belong to the original authors (Liu et al. and Conneau et al.) and Meta AI. All copyrights relating to the transformers library belong to Hugging Face. Please reach out directly to the authors if you have questions regarding licenses and usage.
jsulz/hub-repo-stats dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This comprehensive dataset contains detailed information about all the models, datasets, and spaces available on the Huggingface Hub. It is an essential resource for anyone looking to explore the extensive range of tools and datasets available for machine learning and AI research.
This dataset is ideal for researchers, developers, and AI enthusiasts who are looking for a one-stop repository of models, datasets, and spaces from the Huggingface Hub. It provides a holistic view and simplifies the task of finding the right tools for various machine learning and AI projects.
Note: This dataset is not officially affiliated with or endorsed by the Huggingface organization.
This repository contains the mapping from integer IDs to actual label names (in HuggingFace Transformers typically called id2label) for several datasets. Current datasets include:
- ImageNet-1k
- ImageNet-22k (also called ImageNet-21k as there are 21,843 classes)
- COCO detection 2017
- COCO panoptic 2017
- ADE20k (actually, the MIT Scene Parsing benchmark, which is a subset of ADE20k)
- Cityscapes
- VQAv2
- Kinetics-700
- RVL-CDIP
- PASCAL VOC
- Kinetics-400
- ...
You can read in a label file as follows (using… See the full description on the dataset page: https://huggingface.co/datasets/huggingface/label-files.
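The description above is truncated, so as a minimal sketch of the usual pattern: a label file is a JSON object whose keys are string ids that are typically cast back to integers. The file name `imagenet-1k-id2label.json` and the `hf_hub_download` call shown in the comment are assumptions based on common usage of this repo; here the download is replaced by an inline JSON sample so the key cast is visible.

```python
import json

# In practice you would first fetch the file, e.g. (assumed file name):
#   from huggingface_hub import hf_hub_download
#   path = hf_hub_download("huggingface/label-files",
#                          "imagenet-1k-id2label.json", repo_type="dataset")
#   raw = open(path).read()
raw = '{"0": "tench", "1": "goldfish"}'  # inline sample of the JSON shape

# JSON object keys are always strings, so cast them back to integer ids:
id2label = {int(k): v for k, v in json.loads(raw).items()}
label2id = {v: k for k, v in id2label.items()}
print(id2label[0])  # tench
```

The resulting `id2label` / `label2id` dicts are the same shape Transformers model configs expect.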
CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/
Dataset containing metadata for all the publicly uploaded models (10,000+) available on the HuggingFace model hub. Data was collected between 15 and 20 June 2021.
The dataset was generated using the huggingface_hub APIs provided by the Hugging Face team.
This is my first dataset upload on Kaggle. I hope you like it. :)
RTL-Repo Benchmark
This repository contains the data for the RTL-Repo benchmark introduced in the paper RTL-Repo: A Benchmark for Evaluating LLMs on Large-Scale RTL Design Projects.
👋 Overview
RTL-Repo is a benchmark for evaluating LLMs' effectiveness in generating Verilog code autocompletions within large, complex codebases. It assesses the model's ability to understand and remember the entire Verilog repository context and generate new code that is correct, relevant… See the full description on the dataset page: https://huggingface.co/datasets/ahmedallam/RTL-Repo.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the data used in the paper titled "On the Suitability of Hugging Face Hub for Empirical Studies". For RQ1 we share the survey responses and the interview transcription, while for RQ2 we share the link to the repository where the data is hosted.
This dataset contains different variants of the MobileBERT model by Google available on Hugging Face's model repository.
Packaging the models as a dataset makes loading the weights significantly faster, since you can attach a Kaggle dataset directly to the notebook rather than downloading the weights every time. See the speed comparison notebook. Another benefit of loading models as a dataset is that they can be used in competitions that require internet access to be "off".
For more information on usage, visit the MobileBERT Hugging Face docs.
Usage
To use this dataset, attach it to your notebook and specify the path to the dataset. For example:
from transformers import AutoTokenizer, AutoModelForPreTraining

# Path to the attached Kaggle dataset; here it points directly at the model folder.
MODEL_DIR = "/kaggle/input/huggingface-google-mobilebert/"
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForPreTraining.from_pretrained(MODEL_DIR)
Acknowledgements
All the copyrights and IP relating to MobileBERT belong to the original authors (Sun et al.) and Google. All copyrights relating to the transformers library belong to Hugging Face. Please reach out directly to the authors if you have questions regarding licenses and usage.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
FineWeb 2 is the second iteration of the popular 🍷 FineWeb dataset, bringing high-quality pretraining data to over 1000 🗣️ languages. For the actual data, please see the HuggingFace repository.
The 🥂 FineWeb2 dataset is fully reproducible, available under the permissive ODC-By 1.0 license and extensively validated through hundreds of ablation experiments.
In particular, on the set of 9 diverse languages we used to guide our processing decisions, 🥂 FineWeb2 outperforms other popular multilingual pretraining datasets (such as CC-100, mC4, CulturaX, or HPLT) while being substantially larger, and in some cases it even outperforms datasets specifically curated for a single one of these languages, on our diverse set of carefully selected evaluation tasks: FineTasks.
The dataset is also listed on Hugging Face; see the official HF page.
"My focus is on sharing this valuable open-source dataset to help AI and ML practitioners easily find resources on Kaggle."
Detailed information about FineWeb 2 is listed in the README.md file below ↓
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
KTDA-Datasets
This dataset card aims to describe the datasets used in KTDA.
Install
pip install huggingface-hub
Usage
huggingface-cli download --repo-type dataset XavierJiezou/ktda-datasets --local-dir data --include grass.zip
huggingface-cli download --repo-type dataset XavierJiezou/ktda-datasets --local-dir data --include cloud.zip
unzip grass.zip -d grass
unzip cloud.zip -d l8_biome… See the full description on the dataset page: https://huggingface.co/datasets/XavierJiezou/ktda-datasets.
sonyashijin/rtl-repo-curated dataset hosted on Hugging Face and contributed by the HF Datasets community
tdross/test-repo dataset hosted on Hugging Face and contributed by the HF Datasets community
samfred2/my-target-repo dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Guide: How to share your data on the BoAmps repository
This guide explains step by step how to share BoAmps format reports on this public Hugging Face repository.
Prerequisites
Before starting, make sure you have:
- A Hugging Face account
- The files you want to upload
Method 1: Hugging Face Web Interface
Log in to Hugging Face
Go to the boamps dataset
Navigate to the files: Click on "Files and versions" then on the "data" folder
Click on "Contribute" then… See the full description on the dataset page: https://huggingface.co/datasets/boavizta/open_data_boamps.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
tstone87/repo dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dear researchers and engineers: you are accessing a dataset that would cost millions of dollars to build and took an enormous amount of effort to negotiate favorable terms for its use. Your support, by liking the repositories and upvoting the collection, costs nothing but gives us valuable motivation to continue our contributions to the community. We reserve the right not to approve the request if you don't support our efforts. Thank you very much for your collaboration!
HISTAI Dataset
HISTAI is a… See the full description on the dataset page: https://huggingface.co/datasets/histai/HISTAI-metadata.
cityTS/gxucpc-repo-v1 dataset hosted on Hugging Face and contributed by the HF Datasets community