Facebook
Twitterhttps://choosealicense.com/licenses/cc/https://choosealicense.com/licenses/cc/
FineVideo
FineVideo Description Dataset Explorer Revisions Dataset Distribution
How to download and use FineVideo Using datasets Using huggingface_hub Load a subset of the dataset
Dataset StructureData Instances Data Fields
Dataset Creation License CC-By Considerations for Using the Data Social Impact of Dataset Discussion of Biases
Additional Information Credits Future Work Opting out of FineVideo Citation Information
Terms of use for FineVideo… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFV/finevideo.
Facebook
Twitterhttp://www.gnu.org/licenses/old-licenses/gpl-2.0.en.htmlhttp://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
This dataset was created by amulil
Released under GPL 2
Facebook
Twitterhttps://choosealicense.com/licenses/odc-by/https://choosealicense.com/licenses/odc-by/
🍷 FineWeb
15 trillion tokens of the finest data the 🌐 web has to offer
What is it?
The 🍷 FineWeb dataset consists of more than 18.5T tokens (originally 15T tokens) of cleaned and deduplicated english web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This replication package contains datasets and scripts related to the paper: "*How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study*"
## Root directory
- `statistics.r`: R script used to compute the correlation between usage and downloads, and the RQ1/RQ2 inter-rater agreements
- `modelsInfo.zip`: zip file containing all the downloaded model cards (in JSON format)
- `script`: directory containing all the scripts used to collect and process data. For further details, see README file inside the script directory.
## Dataset
- `Dataset/Dataset_HF-models-list.csv`: list of HF models analyzed
- `Dataset/Dataset_github-prj-list.txt`: list of GitHub projects using the *transformers* library
- `Dataset/Dataset_github-Prj_model-Used.csv`: contains usage pairs: project, model
- `Dataset/Dataset_prj-num-models-reused.csv`: number of models used by each GitHub project
- `Dataset/Dataset_model-download_num-prj_correlation.csv` contains, for each model used by GitHub projects: the name, the task, the number of reusing projects, and the number of downloads
## RQ1
- `RQ1/RQ1_dataset-list.txt`: list of HF datasets
- `RQ1/RQ1_datasetSample.csv`: sample set of models used for the manual analysis of datasets
- `RQ1/RQ1_analyzeDatasetTags.py`: Python script to analyze model tags for the presence of datasets. it requires to unzip the `modelsInfo.zip` in a directory with the same name (`modelsInfo`) at the root of the replication package folder. Produces the output to stdout. To redirect in a file fo be analyzed by the `RQ2/countDataset.py` script
- `RQ1/RQ1_countDataset.py`: given the output of `RQ2/analyzeDatasetTags.py` (passed as argument) produces, for each model, a list of Booleans indicating whether (i) the model only declares HF datasets, (ii) the model only declares external datasets, (iii) the model declares both, and (iv) the model is part of the sample for the manual analysis
- `RQ1/RQ1_datasetTags.csv`: output of `RQ2/analyzeDatasetTags.py`
- `RQ1/RQ1_dataset_usage_count.csv`: output of `RQ2/countDataset.py`
## RQ2
- `RQ2/tableBias.pdf`: table detailing the number of occurrences of different types of bias by model Task
- `RQ2/RQ2_bias_classification_sheet.csv`: results of the manual labeling
- `RQ2/RQ2_isBiased.csv`: file to compute the inter-rater agreement of whether or not a model documents Bias
- `RQ2/RQ2_biasAgrLabels.csv`: file to compute the inter-rater agreement related to bias categories
- `RQ2/RQ2_final_bias_categories_with_levels.csv`: for each model in the sample, this file lists (i) the bias leaf category, (ii) the first-level category, and (iii) the intermediate category
## RQ3
- `RQ3/RQ3_LicenseValidation.csv`: manual validation of a sample of licenses
- `RQ3/RQ3_{NETWORK-RESTRICTIVE|RESTRICTIVE|WEAK-RESTRICTIVE|PERMISSIVE}-license-list.txt`: lists of licenses with different permissiveness
- `RQ3/RQ3_prjs_license.csv`: for each project linked to models, among other fields it indicates the license tag and name
- `RQ3/RQ3_models_license.csv`: for each model, indicates among other pieces of info, whether the model has a license, and if yes what kind of license
- `RQ3/RQ3_model-prj-license_contingency_table.csv`: usage contingency table between projects' licenses (columns) and models' licenses (rows)
- `RQ3/RQ3_models_prjs_licenses_with_type.csv`: pairs project-model, with their respective licenses and permissiveness level
## scripts
Contains the scripts used to mine Hugging Face and GitHub. Details are in the enclosed README
Facebook
Twitterhttps://choosealicense.com/licenses/odc-by/https://choosealicense.com/licenses/odc-by/
📚 FineWeb-Edu
1.3 trillion tokens of the finest educational data the 🌐 web has to offer
Paper: https://arxiv.org/abs/2406.17557
What is it?
📚 FineWeb-Edu dataset consists of 1.3T tokens and 5.4T tokens (FineWeb-Edu-score-2) of educational web pages filtered from 🍷 FineWeb dataset. This is the 1.3 trillion version. To enhance FineWeb's quality, we developed an educational quality classifier using annotations generated by LLama3-70B-Instruct. We then… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu.
Facebook
TwitterThis dataset contains different variants of the RoBERTa and XLM-RoBERTa model by Meta AI available on Hugging Face's model repository.
By making it a dataset, it is significantly faster to load the weights since you can directly attach a Kaggle dataset to the notebook rather than downloading the data every time. See the speed comparison notebook. Another benefit of loading models as a dataset is that it can be used in competitions that require internet access to be "off".
For more information on usage visit the roberta hugging face docs and the xlm-roberta hugging face docs.
Usage
To use this dataset, attach it to your notebook and specify the path to the dataset. For example:
from transformers import AutoTokenizer, AutoModelForPreTraining
MODEL_DIR = "/kaggle/input/huggingface-roberta/"
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR + "roberta-base")
model = AutoModelForPreTraining.from_pretrained(MODEL_DIR + "roberta-base")
Acknowledgements All the copyrights and IP relating to RoBERTa and XLM-RoBERTa belong to the original authors (Liu et al. and Conneau et al.) and Meta AI. All copyrights relating to the transformers library belong to Hugging Face. Please reach out directly to the authors if you have questions regarding licenses and usage.
Facebook
TwitterApache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
D4RL Dataset on HuggingFace
This repository hosts the pre-downloaded D4RL dataset on HuggingFace. It is designed to provide accelerated data downloading for users, eliminating the need to download the dataset from scratch.
Installation
To use this dataset, you need to clone it into your local .d4rl directory. Here are the steps to do so:
Navigate to your .d4rl directory:
cd ~/.d4rl
Clone the dataset repository from HuggingFace:
git clone… See the full description on the dataset page: https://huggingface.co/datasets/imone/D4RL.
Facebook
TwitterThis dataset contains different variants of the ALBERTv2 model by Google available on Hugging Face's model repository.
By making it a dataset, it is significantly faster to load the weights since you can directly attach a Kaggle dataset to the notebook rather than downloading the data every time. See the speed comparison notebook. Another benefit of loading models as a dataset is that it can be used in competitions that require internet access to be "off".
For more information on usage visit the albert hugging face docs.
Usage
To use this dataset, attach it to your notebook and specify the path to the dataset. For example:
from transformers import AutoTokenizer, AutoModelForPreTraining
MODEL_DIR = "/kaggle/input/huggingface-albert-v2/"
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR + "albert-base-v2")
model = AutoModelForPreTraining.from_pretrained(MODEL_DIR + "albert-base-v2")
Acknowledgements All the copyrights and IP relating to ALBERT belong to the original authors (Lan et al.) and Google. All copyrights relating to the transformers library belong to Hugging Face. Please reach out directly to the authors if you have questions regarding licenses and usage.
Facebook
TwitterAttribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
KTDA-Datasets
This dataset card aims to describe the datasets used in the KTDA.
Install
pip install huggingface-hub
Usage
huggingface-cli download --repo-type dataset XavierJiezou/ktda-datasets --local-dir data --include grass.zip huggingface-cli download --repo-type dataset XavierJiezou/ktda-datasets --local-dir data --include cloud.zip
unzip grass.zip -d grass unzip cloud.zip -d l8_biome… See the full description on the dataset page: https://huggingface.co/datasets/XavierJiezou/ktda-datasets.
Facebook
Twitterhttps://choosealicense.com/licenses/cdla-permissive-2.0/https://choosealicense.com/licenses/cdla-permissive-2.0/
This dataset is the Version 2.0 of microsoft/FStarDataSet.
Primary-Objective
This dataset's primary objective is to train and evaluate Proof-oriented Programming with AI (PoPAI, in short). Given a specification of a program and proof in F*, the objective of a AI model is to synthesize the implemantation (see below for details about the usage of this dataset, including the input and output).
Data Format
Each of the examples in this dataset are organized as dictionaries… See the full description on the dataset page: https://huggingface.co/datasets/microsoft/FStarDataSet-V2.
Facebook
TwitterThe dataset consists of 59166 jsonl files and is ~895GB compressed. It is a cleaned and deduplicated version of Together's RedPajama. Check out our blog post explaining our methods, our code on GitHub, and join the discussion on the Cerebras Discord.
Getting Started
You can download the dataset using Hugging Face datasets: from datasets import load_dataset ds = load_dataset("cerebras/SlimPajama-627B")
Background
Today we are releasing SlimPajama – the largest… See the full description on the dataset page: https://huggingface.co/datasets/cerebras/SlimPajama-627B.
Facebook
Twitterhttps://choosealicense.com/licenses/other/https://choosealicense.com/licenses/other/
This data accompanies the WebUI project (https://dl.acm.org/doi/abs/10.1145/3544548.3581158) For more information, check out the project website: https://uimodeling.github.io/ To download this dataset, you need to install the huggingface-hub package pip install huggingface-hub
Use snapshot_download from huggingface_hub import snapshot_download snapshot_download(repo_id="biglab/webui-all", repo_type="dataset")
IMPORTANT
Before downloading and using, please review the copyright info here:… See the full description on the dataset page: https://huggingface.co/datasets/biglab/webui-all.
Facebook
TwitterApache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Downloading the Options IV SP500 Dataset
This document will guide you through the steps to download the Options IV SP500 dataset from Hugging Face Datasets. This dataset includes data on the options of the S&P 500, including implied volatility. To start, you'll need to install Hugging Face's datasets library if you haven't done so already. You can do this using the following pip command: !pip install datasets
Here's the Python code to load the Options IV SP500 dataset from Hugging… See the full description on the dataset page: https://huggingface.co/datasets/gauss314/options-IV-SP500.
Facebook
TwitterMIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for GSM8K
Dataset Summary
GSM8K (Grade School Math 8K) is a dataset of 8.5K high quality linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning.
These problems take between 2 and 8 steps to solve. Solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − ×÷) to reach the… See the full description on the dataset page: https://huggingface.co/datasets/openai/gsm8k.
Facebook
TwitterApache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
OpenAssistant Conversations Dataset (OASST1)
Dataset Summary
In an effort to democratize research on large-scale alignment, we release OpenAssistant Conversations (OASST1), a human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages in 35 different languages, annotated with 461,292 quality ratings, resulting in over 10,000 fully annotated conversation trees. The corpus is a product of a worldwide crowd-sourcing effort… See the full description on the dataset page: https://huggingface.co/datasets/OpenAssistant/oasst1.
Facebook
Twitterhttps://choosealicense.com/licenses/unknown/https://choosealicense.com/licenses/unknown/
Dataset Card for [Dataset Name]
Dataset Summary
The Stanford Sentiment Treebank is a corpus with fully labeled parse trees that allows for a complete analysis of the compositional effects of sentiment in language. The corpus is based on the dataset introduced by Pang and Lee (2005) and consists of 11,855 single sentences extracted from movie reviews. It was parsed with the Stanford parser and includes a total of 215,154 unique phrases from those parse trees, each… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/sst2.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for VLM4Bio
Instructions for downloading the dataset
Install Git LFS Git clone the VLM4Bio repository to download all metadata and associated files Run the following commands in a terminal:
git clone https://huggingface.co/datasets/imageomics/VLM4Bio cd VLM4Bio
Downloading and processing bird images
To download the bird images, run the following command:
bash download_bird_images.sh
This should download the bird images inside datasets/Bird/images… See the full description on the dataset page: https://huggingface.co/datasets/imageomics/VLM4Bio.
Facebook
TwitterDataset Card for The Cauldron
Dataset description
The Cauldron is part of the Idefics2 release. It is a massive collection of 50 vision-language datasets (training sets only) that were used for the fine-tuning of the vision-language model Idefics2.
Load the dataset
To load the dataset, install the library datasets with pip install datasets. Then, from datasets import load_dataset ds = load_dataset("HuggingFaceM4/the_cauldron", "ai2d")
to download and load the… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceM4/the_cauldron.
Facebook
Twitterhttps://choosealicense.com/licenses/other/https://choosealicense.com/licenses/other/
Dataset Card for GLUE
Dataset Summary
GLUE, the General Language Understanding Evaluation benchmark (https://gluebenchmark.com/) is a collection of resources for training, evaluating, and analyzing natural language understanding systems.
Supported Tasks and Leaderboards
The leaderboard for the GLUE benchmark can be found at this address. It comprises the following tasks:
ax
A manually-curated evaluation dataset for fine-grained analysis of system… See the full description on the dataset page: https://huggingface.co/datasets/nyu-mll/glue.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MInDS-14
MINDS-14 is training and evaluation resource for intent detection task with spoken data. It covers 14 intents extracted from a commercial system in the e-banking domain, associated with spoken examples in 14 diverse language varieties.
Example
MInDS-14 can be downloaded and used as follows: from datasets import load_dataset
minds_14 = load_dataset("PolyAI/minds14", "fr-FR") # for French
Facebook
Twitterhttps://choosealicense.com/licenses/cc/https://choosealicense.com/licenses/cc/
FineVideo
FineVideo Description Dataset Explorer Revisions Dataset Distribution
How to download and use FineVideo Using datasets Using huggingface_hub Load a subset of the dataset
Dataset StructureData Instances Data Fields
Dataset Creation License CC-By Considerations for Using the Data Social Impact of Dataset Discussion of Biases
Additional Information Credits Future Work Opting out of FineVideo Citation Information
Terms of use for FineVideo… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFV/finevideo.