Dataset Card for Hugging Face Hub Model Cards
This dataset consists of model cards for models hosted on the Hugging Face Hub. The model cards are created by the community and provide information about the model, its performance, its intended uses, and more. This dataset is updated on a daily basis and includes publicly available models on the Hugging Face Hub. This dataset is made available to help support users wanting to work with a large number of Model Cards from the Hub. We… See the full description on the dataset page: https://huggingface.co/datasets/librarian-bots/model_cards_with_metadata.
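For programmatic access, a minimal sketch using the datasets library is shown below; the split name and record fields are assumptions, so check the dataset viewer on the Hub for the actual schema:
from datasets import load_dataset

# Load the model cards dataset from the Hub (the "train" split name is an assumption)
cards = load_dataset("librarian-bots/model_cards_with_metadata", split="train")
print(cards)       # number of rows and column names
print(cards[0])    # inspect a single model card record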
Dataset Card for Dataset Name
This dataset card aims to be a base template for new datasets. It has been generated using this raw template.
Dataset Details
Dataset Description
- Curated by: [More Information Needed]
- Funded by [optional]: [More Information Needed]
- Shared by [optional]: [More Information Needed]
- Language(s) (NLP): [More Information Needed]
- License: [More Information Needed]
Dataset Sources [optional]
Repository: [More… See the full description on the dataset page: https://huggingface.co/datasets/chen1914/card.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This replication package contains datasets and scripts related to the paper: "*How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study*"
- statistics.r: R script used to compute the correlation between usage and downloads, and the RQ1/RQ2 inter-rater agreements
- modelsInfo.zip: zip file containing all the downloaded model cards (in JSON format)
- script: directory containing all the scripts used to collect and process data. For further details, see the README file inside the script directory.
- Dataset/Dataset_HF-models-list.csv: list of HF models analyzed
- Dataset/Dataset_github-prj-list.txt: list of GitHub projects using the transformers library
- Dataset/Dataset_github-Prj_model-Used.csv: contains usage pairs: project, model
- Dataset/Dataset_prj-num-models-reused.csv: number of models used by each GitHub project
- Dataset/Dataset_model-download_num-prj_correlation.csv: contains, for each model used by GitHub projects, the name, the task, the number of reusing projects, and the number of downloads
- RQ1/RQ1_dataset-list.txt: list of HF datasets
- RQ1/RQ1_datasetSample.csv: sample set of models used for the manual analysis of datasets
- RQ1/RQ1_analyzeDatasetTags.py: Python script to analyze model tags for the presence of datasets (see the sketch after this list). It requires unzipping modelsInfo.zip into a directory with the same name (modelsInfo) at the root of the replication package folder. It produces its output to stdout; redirect it to a file to be analyzed by the RQ2/countDataset.py script.
- RQ1/RQ1_countDataset.py: given the output of RQ2/analyzeDatasetTags.py (passed as argument), produces, for each model, a list of Booleans indicating whether (i) the model only declares HF datasets, (ii) the model only declares external datasets, (iii) the model declares both, and (iv) the model is part of the sample for the manual analysis
- RQ1/RQ1_datasetTags.csv: output of RQ2/analyzeDatasetTags.py
- RQ1/RQ1_dataset_usage_count.csv: output of RQ2/countDataset.py
- RQ2/tableBias.pdf: table detailing the number of occurrences of different types of bias by model task
- RQ2/RQ2_bias_classification_sheet.csv: results of the manual labeling
- RQ2/RQ2_isBiased.csv: file to compute the inter-rater agreement on whether or not a model documents bias
- RQ2/RQ2_biasAgrLabels.csv: file to compute the inter-rater agreement related to bias categories
- RQ2/RQ2_final_bias_categories_with_levels.csv: for each model in the sample, this file lists (i) the bias leaf category, (ii) the first-level category, and (iii) the intermediate category
- RQ3/RQ3_LicenseValidation.csv: manual validation of a sample of licenses
- RQ3/RQ3_{NETWORK-RESTRICTIVE|RESTRICTIVE|WEAK-RESTRICTIVE|PERMISSIVE}-license-list.txt: lists of licenses with different levels of permissiveness
- RQ3/RQ3_prjs_license.csv: for each project linked to models, indicates, among other fields, the license tag and name
- RQ3/RQ3_models_license.csv: for each model, indicates, among other pieces of info, whether the model has a license, and if so, what kind of license
- RQ3/RQ3_model-prj-license_contingency_table.csv: usage contingency table between projects' licenses (columns) and models' licenses (rows)
- RQ3/RQ3_models_prjs_licenses_with_type.csv: project-model pairs, with their respective licenses and permissiveness levels
The replication package also contains the scripts used to mine Hugging Face and GitHub; details are in the enclosed README.
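As a rough illustration (not the authors' script) of what RQ1_analyzeDatasetTags.py does, the sketch below scans the unzipped modelsInfo directory and counts models that declare at least one dataset tag; the JSON layout and the "dataset:" tag prefix are assumptions:
import json
from pathlib import Path

# Illustrative sketch only: the "tags" field and the "dataset:" prefix are assumptions.
declared, total = 0, 0
for path in Path("modelsInfo").glob("*.json"):
    total += 1
    info = json.loads(path.read_text())
    tags = info.get("tags") or []
    if any(str(tag).startswith("dataset:") for tag in tags):
        declared += 1
print(f"{declared}/{total} models declare at least one dataset tag")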
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Pre-trained models (PTMs) are becoming increasingly popular in the software engineering community. Their usage is facilitated by model repositories, e.g., HuggingFace, which collect, store, and maintain a wide range of PTMs. However, the actual adoption of these models in real-world projects is still an open question. In particular, many of them are used in toy projects or simply as a mirror for the HF repository. Thus, we see the need for a curated codebase related to PTMs to support developers and practitioners who are interested in using them in their projects.
This artifact contains CodeXHug, a curated dataset of HuggingFace PTMs exploited in the GitHub ecosystem. Starting from the latest HF dump, we first perform data curation to collect PTMs that have a tag and a model card. Then, we query the GitHub platform to find actual usages of the identified PTMs, resulting in 7,325 different models and 372,063 Python files. We also present a statistical analysis of the dataset, highlighting the most popular PTMs and the most common tasks for which they are used. Finally, we discuss the research opportunities enabled by CodeXHug and the implications of our findings for the software engineering community.
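One plausible (purely illustrative) way to detect such usages in Python files is to search for Hub identifiers passed to from_pretrained calls; the sketch below is an assumption about the detection step, not the actual CodeXHug mining pipeline:
import re

# Illustration only: the regex and the notion of "usage" are assumptions.
PATTERN = re.compile(r"""from_pretrained\(\s*['"]([\w./-]+)['"]""")

def find_ptm_usages(source_code: str) -> list[str]:
    """Return model identifiers referenced via from_pretrained(...) calls."""
    return PATTERN.findall(source_code)

example = 'model = AutoModel.from_pretrained("bert-base-uncased")'
print(find_ptm_usages(example))  # ['bert-base-uncased']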
librarian-bots/model-card-sentences dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
language: en
Model Description: GPT-2 Large is the 774M parameter version of GPT-2, a transformer-based language model created and released by OpenAI. The model was pretrained on English text using a causal language modeling (CLM) objective.
Use the code below to get started with the model. You can use this model directly with a pipeline for text generation. Since the generation relies on some randomness, we set a seed for reproducibility:
>>> from transformers import pipeline, set_seed
>>> generator = pipeline('text-generation', model='gpt2-large')
>>> set_seed(42)
>>> generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)
[{'generated_text': "Hello, I'm a language model, I can do language modeling. In fact, this is one of the reasons I use languages. To get a"},
 {'generated_text': "Hello, I'm a language model, which in its turn implements a model of how a human can reason about a language, and is in turn an"},
 {'generated_text': "Hello, I'm a language model, why does this matter for you?\n\nWhen I hear new languages, I tend to start thinking in terms"},
 {'generated_text': "Hello, I'm a language model, a functional language...\n\nI don't need to know anything else. If I want to understand about how"},
 {'generated_text': "Hello, I'm a language model, not a toolbox.\n\nIn a nutshell, a language model is a set of attributes that define how"}]
Here is how to use this model to get the features of a given text in PyTorch:
from transformers import GPT2Tokenizer, GPT2Model

# Load the tokenizer and the base (headless) GPT-2 Large model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')
model = GPT2Model.from_pretrained('gpt2-large')

# Tokenize the input and run a forward pass to obtain the hidden states
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
and in TensorFlow:
from transformers import GPT2Tokenizer, TFGPT2Model

# Same feature-extraction example with the TensorFlow model class
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')
model = TFGPT2Model.from_pretrained('gpt2-large')

text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
In their model card about GPT-2, OpenAI wrote:
The primary intended users of these models are AI researchers and practitioners.
We primarily imagine these language models will be used by researchers to better understand the behaviors, capabilities, biases, and constraints of large-scale generative language models.
In their model card about GPT-2, OpenAI wrote:
Here are some secondary use cases we believe are likely:
- Writing assistance: Grammar assistance, autocompletion (for normal prose or code)
- Creative writing and art: exploring the generation of creative, fictional texts; aiding creation of poetry and other literary art.
- Entertainment: Creation of games, chat bots, and amusing generations.
In their model card about GPT-2, OpenAI wrote:
Because large-scale language models like GPT-2 ...
Dataset Card for HuggingFaceH4/rs_test
SFT model: HuggingFaceH4/falcon-40b-ift-v3.1
Reward model: HuggingFaceH4/pythia-70m-rm-v0.0
Temperature: 0.7
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Card for H4 Stack Exchange Preferences Dataset
Dataset Summary
This dataset contains questions and answers from the Stack Overflow Data Dump for the purpose of preference model training. Importantly, the questions have been filtered to fit the following criteria for preference models (following closely from Askell et al. 2021): have >=2 answers. This data could also be used for instruction fine-tuning and language model training. The questions are… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences.
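A minimal sketch of turning one question's answers into (chosen, rejected) preference pairs is shown below; the field names (answers, pm_score, text) are assumptions, so verify them against the dataset card:
from itertools import combinations
from datasets import load_dataset

# Stream one example and build preference pairs by comparing answer scores.
# Field names are assumptions; check the dataset card for the actual schema.
ds = load_dataset("HuggingFaceH4/stack-exchange-preferences", split="train", streaming=True)
example = next(iter(ds))

pairs = []
for a, b in combinations(example["answers"], 2):
    if a["pm_score"] != b["pm_score"]:
        chosen, rejected = (a, b) if a["pm_score"] > b["pm_score"] else (b, a)
        pairs.append((chosen["text"], rejected["text"]))
print(f"built {len(pairs)} preference pairs from one question")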
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This work shares a dataset that contains Spanish (SPA) to Mexican Sign Language (MSL) glosses (transcribed MSL) pairs of sentences for a downstream task. The methodology used to prepare the shared dataset considered the construction of a SPA-to-MSL corpus with a specific representation of the Spanish language for MSL interpretation. The proposed corpus is a reference dataset for evaluating diverse neural machine translation (NMT) system variants. With the support of grammatical MSL books and advice from MSL interpreters, this study developed a 3,000-sentence-pair SPA-to-MSL dataset. The distribution of the 3,000 sentences in the corpus follows the linguistic composition of the Spanish language. To test the functionality of the corpus as a data source for NMT, two neural transformer models for Spanish paraphrasing were used to assess the usability of the proposed dataset. The first NMT model uses a Helsinki-NLP SPA-SPA transformer developed by the Language Technologies Research Group at the University of Helsinki. The second NMT model considers a SPA-to-SPA pre-trained neural transformer presented as a BARTO approach. Both evaluations considered a transfer learning strategy, which has been shown to be effective for modeling low-resource languages, achieving state-of-the-art results in translation quality.
- Spanish-MSL glosses dataset: an .xlsx file containing 3,000 Spanish-MSL gloss pairs. To use the dataset, it needs to be converted to .csv format.
- Model M1: a Colab notebook containing the programming methodology for fine-tuning Helsinki-NLP/opus-mt-es-es, available on Hugging Face at https://huggingface.co/Helsinki-NLP/opus-mt-es-es. It was fine-tuned on the MSL-Spanish glosses corpus. It uses the transformers library from Hugging Face and the Trainer API for translation and evaluation. Translation quality was measured with ROUGE, TER, and BLEU. The model card of M1 is available at https://huggingface.co/VaniLara/esp-to-lsm-model, which includes a guide on how to use the model with the transformers library (see also the sketch after this list). To run the Colab you will need to create an access token, use your own Google Drive account, and create a repo on Hugging Face.
- Model M2: a Colab notebook containing the programming methodology for fine-tuning vgaraujov/bart-base-spanish, available on Hugging Face: https://huggingface.co/Helsinki-NLP/opus-mt-es-es. It was fine-tuned on the MSL-Spanish glosses corpus. It uses the transformers library from Hugging Face and the Trainer API for translation and evaluation. Translation quality was measured with ROUGE, TER, and BLEU. The model card of M2 is available at https://huggingface.co/VaniLara/esp-to-lsm-barto-model, which includes a guide on how to use the model with the transformers library. To run the Colab you will need to create an access token, use your own Google Drive account, and create a repo on Hugging Face.
- Model M1-split-version and Model M2-split-version: versions using the dataset split into 80% training, 10% validation, and 10% testing. Model cards are available at https://huggingface.co/vania2911/esp-to-lsm-barto-model and https://huggingface.co/vania2911/esp-to-lsm-model-split.
- Translations M1 and M2: contain the reference and predicted translations for each model.
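As a quick, hedged illustration of how the fine-tuned M1 model could be used, a minimal sketch with the transformers pipeline follows; the task name and output format are assumptions, and the model card linked above remains the authoritative usage guide:
from transformers import pipeline

# Minimal sketch: translate a Spanish sentence into MSL glosses with the fine-tuned M1 model.
translator = pipeline("translation", model="VaniLara/esp-to-lsm-model")
print(translator("Hola, ¿cómo estás?"))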
Dataset Card for "input-dataset"
More Information needed
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for OpenAI HumanEval
Dataset Summary
The HumanEval dataset released by OpenAI includes 164 programming problems with a function signature, docstring, body, and several unit tests. They were handwritten to ensure they would not be included in the training set of code generation models.
Supported Tasks and Leaderboards
Languages
The programming problems are written in Python and contain English natural text in comments and… See the full description on the dataset page: https://huggingface.co/datasets/openai/openai_humaneval.
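A minimal sketch for loading HumanEval and inspecting one problem is shown below; the field names (task_id, prompt, canonical_solution, test, entry_point) follow the dataset card and should be verified against the dataset viewer:
from datasets import load_dataset

# HumanEval ships a single "test" split of 164 problems.
humaneval = load_dataset("openai/openai_humaneval", split="test")
problem = humaneval[0]
print(problem["task_id"])
print(problem["prompt"])  # function signature and docstring to be completed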
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Summary
databricks-dolly-15k is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. This dataset can be used for any purpose, whether academic or commercial, under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported… See the full description on the dataset page: https://huggingface.co/datasets/databricks/databricks-dolly-15k.
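A minimal sketch for loading the dataset is shown below; the field names (instruction, context, response, category) follow the dataset card:
from datasets import load_dataset

# Load databricks-dolly-15k and inspect one instruction-following record.
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")
record = dolly[0]
print(record["category"], "-", record["instruction"])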
Dataset Card for "instruction-pilot-outputs-greedy"
This dataset contains model outputs generated from the human demonstrations provided in HuggingFaceH4/instruction-pilot-prompts.
To convert each language model into a dialogue agent, we prepended the following LangChain prompt to each input:
The following is a friendly conversation between a human and an AI.
The AI is talkative and provides lots of specific details from its context.
If the AI does not know the answer to a… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/instruction-pilot-outputs-greedy.
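A rough sketch of this prompting step is given below; the prompt text is abbreviated to a placeholder, and the model name and generation call are illustrative rather than the exact pipeline used to produce this dataset:
from transformers import pipeline

# Illustration only: prepend the dialogue prompt to a human demonstration and decode greedily.
DIALOGUE_PROMPT = "The following is a friendly conversation between a human and an AI. ..."  # placeholder

generator = pipeline("text-generation", model="gpt2")  # placeholder model
prompt = DIALOGUE_PROMPT + "\nHuman: What is the capital of France?\nAI:"
output = generator(prompt, max_new_tokens=50, do_sample=False)  # greedy decoding
print(output[0]["generated_text"])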
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
We curate a large corpus of legal and administrative data. The utility of this data is twofold: (1) to aggregate legal and administrative data sources that demonstrate different norms and legal standards for data filtering; (2) to collect a dataset that can be used in the future for pretraining legal-domain language models, a key direction in access-to-justice initiatives.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
We present XLSum, a comprehensive and diverse dataset comprising 1.35 million professionally annotated article-summary pairs from BBC, extracted using a set of carefully designed heuristics. The dataset covers 45 languages ranging from low to high-resource, for many of which no public dataset is currently available. XL-Sum is highly abstractive, concise, and of high quality, as indicated by human and intrinsic evaluation.
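A minimal sketch for loading one language subset is shown below; the Hub identifier (csebuetnlp/xlsum), the config name ("english"), and the split names are assumptions, so check the dataset page for the exact identifiers:
from datasets import load_dataset

# Load the English subset of XL-Sum and inspect one article-summary pair.
xlsum_en = load_dataset("csebuetnlp/xlsum", "english", split="train")
print(xlsum_en[0]["title"])
print(xlsum_en[0]["summary"])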
Dataset Card for "pythia-70m-rs"
More Information needed
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Model Card: Document Visual Retrieval Test (internal)
Dataset Overview
This dataset is designed to evaluate the performance of visual retrievers by testing their ability to match a query to a relevant image. Each of the three examples in this dataset contains a text query and an associated image, which is a scanned page from the foundational "Attention is All You Need" paper. The purpose of this dataset is to facilitate the evaluation of visual retrievers, where… See the full description on the dataset page: https://huggingface.co/datasets/hf-internal-testing/document-visual-retrieval-test.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for "helpful-self-instruct-raw"
This dataset is derived from the finetuning subset of Self-Instruct, with some light formatting to remove trailing spaces and <|endoftext|> tokens.
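An illustrative sketch of the kind of cleanup described above (not the exact script used) might look like this:
def clean(text: str) -> str:
    # Drop <|endoftext|> tokens and strip trailing spaces from each line.
    text = text.replace("<|endoftext|>", "")
    return "\n".join(line.rstrip() for line in text.splitlines()).strip()

print(clean("Write a haiku about autumn.  <|endoftext|>"))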
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Card for Dataset Name
https://choosealicense.com/licenses/other/
XGLUE is a new benchmark dataset to evaluate the performance of cross-lingual pre-trained models with respect to cross-lingual natural language understanding and generation. The benchmark is composed of the following 11 tasks:
- NER
- POS Tagging (POS)
- News Classification (NC)
- MLQA
- XNLI
- PAWS-X
- Query-Ad Matching (QADSM)
- Web Page Ranking (WPR)
- QA Matching (QAM)
- Question Generation (QG)
- News Title Generation (NTG)
For more information, please take a look at https://microsoft.github.io/XGLUE/.
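A minimal sketch for loading one XGLUE task with the datasets library is shown below; the Hub identifier ("xglue"), the config name ("ner"), and the split names are assumptions, so consult the XGLUE page for the exact names:
from datasets import load_dataset

# Load the NER task of XGLUE and inspect one training example.
xglue_ner = load_dataset("xglue", "ner", split="train")
print(xglue_ner[0])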