https://creativecommons.org/publicdomain/zero/1.0/
Hugging Face, Inc. is an American company that develops tools for building applications using machine learning. It is most notable for its Transformers library built for natural language processing applications and its platform that allows users to share machine learning models and datasets.
This dataset contains data on 16k models available on huggingface.co. Each model record has the following features: 1. model URL 2. model title 3. downloads and likes 4. updated
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This is the blind eval dataset of high-quality, diverse, human-written instructions with demonstrations. We will be using this for step 3 evaluations in our RLHF pipeline.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Card for H4 Stack Exchange Preferences Dataset
Dataset Summary
This dataset contains questions and answers from the Stack Overflow Data Dump for the purpose of preference model training. Importantly, the questions have been filtered to fit the following criteria for preference models (following closely from Askell et al. 2021): have >=2 answers. This data could also be used for instruction fine-tuning and language model training. The questions are grouped with… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences.
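The ">=2 answers" criterion can be illustrated with a minimal sketch. The record layout (an "answers" list whose items have "text" and "score" fields) and the idea of ranking answers by score are assumptions made for illustration only; they are not taken from the dataset card, and the helper names are hypothetical.

```python
from itertools import combinations


def is_eligible(question, min_answers=2):
    """Keep only questions with at least `min_answers` answers."""
    return len(question["answers"]) >= min_answers


def preference_pairs(question):
    """Build (preferred, rejected) text pairs from one question.

    Assumption: a higher "score" means a better answer. Ties are skipped
    because they carry no preference signal.
    """
    ranked = sorted(question["answers"], key=lambda a: a["score"], reverse=True)
    pairs = []
    for better, worse in combinations(ranked, 2):
        if better["score"] > worse["score"]:
            pairs.append((better["text"], worse["text"]))
    return pairs


# Hypothetical sample records, not real dataset rows.
questions = [
    {"answers": [{"text": "a", "score": 5}, {"text": "b", "score": 2}, {"text": "c", "score": 2}]},
    {"answers": [{"text": "only answer", "score": 1}]},
]

kept = [q for q in questions if is_eligible(q)]
pairs = [pair for q in kept for pair in preference_pairs(q)]
```

A question with a single answer yields no comparison, which is why such a filter would be applied before building pairs.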
This dataset was created by Xen Xiou
This is the latest version of Hugging Face datasets to be used in offline notebooks on Kaggle. It is automatically updated every week.
!pip install datasets --no-index --find-links=file:///kaggle/input/hf-ds -U -q
Dataset Card for Dataset Name
This dataset card aims to be a base template for new datasets. It has been generated using this raw template.
Dataset Details
Dataset Description
Curated by: [More Information Needed]
Funded by [optional]: [More Information Needed]
Shared by [optional]: [More Information Needed]
Language(s) (NLP): [More Information Needed]
License: [More Information Needed]
Dataset Sources [optional]
Repository: [More… See the full description on the dataset page: https://huggingface.co/datasets/templates/dataset-card-example.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by Mohannad Ayman Salah
Released under MIT
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains metadata for all publicly uploaded models (10,000+) available on the HuggingFace model hub. Data was collected between 15 and 20 June 2021.
The dataset was generated using the huggingface_hub APIs provided by the Hugging Face team.
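As a rough sketch of how such metadata might be gathered with huggingface_hub, the snippet below flattens model records into rows. It is not the dataset author's script. The helper names (model_to_row, fetch_rows) are hypothetical, and the attribute names read from each record (id or modelId, downloads, likes, last_modified) are assumptions that have varied across library versions, so missing ones fall back to None.

```python
def model_to_row(info):
    """Flatten one model record into a plain dict.

    Attribute names are assumptions; anything missing becomes None.
    """
    model_id = getattr(info, "id", None) or getattr(info, "modelId", None)
    return {
        "model_id": model_id,
        "url": f"https://huggingface.co/{model_id}",
        "downloads": getattr(info, "downloads", None),
        "likes": getattr(info, "likes", None),
        "updated": getattr(info, "last_modified", None),
    }


def fetch_rows(limit=10):
    """Fetch up to `limit` public models from the Hub (needs network access)."""
    # Imported lazily so model_to_row can be used without the library installed.
    from huggingface_hub import list_models

    return [model_to_row(m) for m in list_models(limit=limit)]
```

Calling fetch_rows(limit=5) would return a small list of dicts that could then be written to CSV; the limit is kept small here only to keep the request cheap.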
This is my first dataset upload on Kaggle. I hope you like it. :)
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This comprehensive dataset contains detailed information about all the models, datasets, and spaces available on the Huggingface Hub. It is an essential resource for anyone looking to explore the extensive range of tools and datasets available for machine learning and AI research.
This dataset is ideal for researchers, developers, and AI enthusiasts who are looking for a one-stop repository of models, datasets, and spaces from the Huggingface Hub. It provides a holistic view and simplifies the task of finding the right tools for various machine learning and AI projects.
Note: This dataset is not officially affiliated with or endorsed by the Huggingface organization.
http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
This dataset was created by amulil
Released under GPL 2
This is a labeled corpus dataset of article text with corresponding political bias obtained from Huggingface. It contains 17,362 articles labeled left, right, or center by the editors of allsides.com. Articles were manually annotated by news editors who were attempting to select representative articles from the left, right and center of each article topic.
huggingface-projects/drlc-leaderboard-data dataset hosted on Hugging Face and contributed by the HF Datasets community
huggingface/paper-central-data-2 dataset hosted on Hugging Face and contributed by the HF Datasets community
https://choosealicense.com/licenses/odc-by/
SmolLM-Corpus
This dataset is a curated collection of high-quality educational and synthetic data designed for training small language models. You can find more details about the models trained on this dataset in our SmolLM blog post.
Dataset subsets
Cosmopedia v2
Cosmopedia v2 is an enhanced version of Cosmopedia, the largest synthetic dataset for pre-training, consisting of over 39 million textbooks, blog posts, and stories generated by… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
huggingface-projects/contribute-a-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
HabibAhmed/Data-Science-Instruct-Dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
https://choosealicense.com/licenses/odc-by/
FineWeb-Edu
1.3 trillion tokens of the finest educational data the web has to offer
Paper: https://arxiv.org/abs/2406.17557
What is it?
The FineWeb-Edu dataset consists of 1.3T tokens, and 5.4T tokens in the FineWeb-Edu-score-2 variant, of educational web pages filtered from the FineWeb dataset. This is the 1.3 trillion token version. To enhance FineWeb's quality, we developed an educational quality classifier using annotations generated by Llama3-70B-Instruct. We then… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu.
https://choosealicense.com/licenses/other/
Drive Stats
Drive Stats is a public data set of daily metrics on the hard drives in Backblaze's cloud storage infrastructure that Backblaze has open-sourced since April 2013. Currently, Drive Stats comprises over 388 million records, rising by over 240,000 records per day. Drive Stats is an append-only dataset, effectively logging daily statistics that, once written, are never updated or deleted. This is our first Hugging Face dataset; feel free to suggest improvements by creating a… See the full description on the dataset page: https://huggingface.co/datasets/backblaze/Drive_Stats.
snoop2head/enron_aeslc_emails dataset hosted on Hugging Face and contributed by the HF Datasets community
Nicolybgs/healthcare_data dataset hosted on Hugging Face and contributed by the HF Datasets community