Dataset Card for Hugging Face Hub Model Cards
This dataset consists of model cards for models hosted on the Hugging Face Hub. The model cards are created by the community and provide information about the model, its performance, its intended uses, and more. This dataset is updated on a daily basis and includes publicly available models on the Hugging Face Hub. This dataset is made available to help support users wanting to work with a large number of Model Cards from the Hub. We… See the full description on the dataset page: https://huggingface.co/datasets/librarian-bots/model_cards_with_metadata.
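For programmatic access, a minimal sketch using the datasets library is shown below; the split name and record fields are assumptions, so check the dataset viewer on the Hub for the actual schema:
from datasets import load_dataset

# Load the model cards dataset from the Hub (the "train" split name is an assumption)
cards = load_dataset("librarian-bots/model_cards_with_metadata", split="train")
print(cards)       # number of rows and column names
print(cards[0])    # inspect a single model card record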
Dataset Card for Dataset Name
This dataset card aims to be a base template for new datasets. It has been generated using this raw template.
Dataset Details
Dataset Description
- Curated by: [More Information Needed]
- Funded by [optional]: [More Information Needed]
- Shared by [optional]: [More Information Needed]
- Language(s) (NLP): [More Information Needed]
- License: [More Information Needed]
Dataset Sources [optional]
Repository: [More… See the full description on the dataset page: https://huggingface.co/datasets/chen1914/card.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This replication package contains datasets and scripts related to the paper: "*How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study*"
- statistics.r: R script used to compute the correlation between usage and downloads, and the RQ1/RQ2 inter-rater agreements
- modelsInfo.zip: zip file containing all the downloaded model cards (in JSON format)
- script: directory containing all the scripts used to collect and process data. For further details, see the README file inside the script directory.
- Dataset/Dataset_HF-models-list.csv: list of HF models analyzed
- Dataset/Dataset_github-prj-list.txt: list of GitHub projects using the transformers library
- Dataset/Dataset_github-Prj_model-Used.csv: contains usage pairs: project, model
- Dataset/Dataset_prj-num-models-reused.csv: number of models used by each GitHub project
- Dataset/Dataset_model-download_num-prj_correlation.csv: contains, for each model used by GitHub projects, the name, the task, the number of reusing projects, and the number of downloads
- RQ1/RQ1_dataset-list.txt: list of HF datasets
- RQ1/RQ1_datasetSample.csv: sample set of models used for the manual analysis of datasets
- RQ1/RQ1_analyzeDatasetTags.py: Python script to analyze model tags for the presence of datasets (see the sketch after this list). It requires unzipping modelsInfo.zip into a directory with the same name (modelsInfo) at the root of the replication package folder. It produces its output to stdout; redirect it to a file to be analyzed by the RQ2/countDataset.py script.
- RQ1/RQ1_countDataset.py: given the output of RQ2/analyzeDatasetTags.py (passed as argument), produces, for each model, a list of Booleans indicating whether (i) the model only declares HF datasets, (ii) the model only declares external datasets, (iii) the model declares both, and (iv) the model is part of the sample for the manual analysis
- RQ1/RQ1_datasetTags.csv: output of RQ2/analyzeDatasetTags.py
- RQ1/RQ1_dataset_usage_count.csv: output of RQ2/countDataset.py
- RQ2/tableBias.pdf: table detailing the number of occurrences of different types of bias by model task
- RQ2/RQ2_bias_classification_sheet.csv: results of the manual labeling
- RQ2/RQ2_isBiased.csv: file to compute the inter-rater agreement on whether or not a model documents bias
- RQ2/RQ2_biasAgrLabels.csv: file to compute the inter-rater agreement related to bias categories
- RQ2/RQ2_final_bias_categories_with_levels.csv: for each model in the sample, this file lists (i) the bias leaf category, (ii) the first-level category, and (iii) the intermediate category
- RQ3/RQ3_LicenseValidation.csv: manual validation of a sample of licenses
- RQ3/RQ3_{NETWORK-RESTRICTIVE|RESTRICTIVE|WEAK-RESTRICTIVE|PERMISSIVE}-license-list.txt: lists of licenses with different levels of permissiveness
- RQ3/RQ3_prjs_license.csv: for each project linked to models, indicates, among other fields, the license tag and name
- RQ3/RQ3_models_license.csv: for each model, indicates, among other pieces of info, whether the model has a license, and if so, what kind of license
- RQ3/RQ3_model-prj-license_contingency_table.csv: usage contingency table between projects' licenses (columns) and models' licenses (rows)
- RQ3/RQ3_models_prjs_licenses_with_type.csv: project-model pairs, with their respective licenses and permissiveness levels
The replication package also contains the scripts used to mine Hugging Face and GitHub; details are in the enclosed README.
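As a rough illustration (not the authors' script) of what RQ1_analyzeDatasetTags.py does, the sketch below scans the unzipped modelsInfo directory and counts models that declare at least one dataset tag; the JSON layout and the "dataset:" tag prefix are assumptions:
import json
from pathlib import Path

# Illustrative sketch only: the "tags" field and the "dataset:" prefix are assumptions.
declared, total = 0, 0
for path in Path("modelsInfo").glob("*.json"):
    total += 1
    info = json.loads(path.read_text())
    tags = info.get("tags") or []
    if any(str(tag).startswith("dataset:") for tag in tags):
        declared += 1
print(f"{declared}/{total} models declare at least one dataset tag")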
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Pre-trained models (PTMs) are becoming increasingly popular in the software engineering community. Their usage is facilitated by model repositories, e.g., HuggingFace, which collect, store, and maintain a wide range of PTMs. However, the actual adoption of these models in real-world projects is still an open question. In particular, many of them are used in toy projects or simply as a mirror for the HF repository. Thus, we see the need for a curated codebase related to PTMs to support developers and practitioners who are interested in using them in their projects.
This artifact contains CodeXHug, a curated dataset of HuggingFace PTMs exploited in the GitHub ecosystem. Starting from the latest HF dump, we first perform data curation to collect PTMs that have a tag and a model card. Then, we query the GitHub platform to find actual usages of the identified PTMs, resulting in 7,325 different models and 372,063 Python files. We also present a statistical analysis of the dataset, highlighting the most popular PTMs and the most common tasks for which they are used. Finally, we discuss the research opportunities enabled by CodeXHug and the implications of our findings for the software engineering community.
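One plausible (purely illustrative) way to detect such usages in Python files is to search for Hub identifiers passed to from_pretrained calls; the sketch below is an assumption about the detection step, not the actual CodeXHug mining pipeline:
import re

# Illustration only: the regex and the notion of "usage" are assumptions.
PATTERN = re.compile(r"""from_pretrained\(\s*['"]([\w./-]+)['"]""")

def find_ptm_usages(source_code: str) -> list[str]:
    """Return model identifiers referenced via from_pretrained(...) calls."""
    return PATTERN.findall(source_code)

example = 'model = AutoModel.from_pretrained("bert-base-uncased")'
print(find_ptm_usages(example))  # ['bert-base-uncased']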
librarian-bots/model-card-sentences dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
language: en
Model Description: GPT-2 Large is the 774M parameter version of GPT-2, a transformer-based language model created and released by OpenAI. The model was pretrained on English text using a causal language modeling (CLM) objective.
Use the code below to get started with the model. You can use this model directly with a pipeline for text generation. Since the generation relies on some randomness, we set a seed for reproducibility:
>>> from transformers import pipeline, set_seed
>>> generator = pipeline('text-generation', model='gpt2-large')
>>> set_seed(42)
>>> generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)
[{'generated_text': "Hello, I'm a language model, I can do language modeling. In fact, this is one of the reasons I use languages. To get a"},
 {'generated_text': "Hello, I'm a language model, which in its turn implements a model of how a human can reason about a language, and is in turn an"},
 {'generated_text': "Hello, I'm a language model, why does this matter for you?\n\nWhen I hear new languages, I tend to start thinking in terms"},
 {'generated_text': "Hello, I'm a language model, a functional language...\n\nI don't need to know anything else. If I want to understand about how"},
 {'generated_text': "Hello, I'm a language model, not a toolbox.\n\nIn a nutshell, a language model is a set of attributes that define how"}]
Here is how to use this model to get the features of a given text in PyTorch:
from transformers import GPT2Tokenizer, GPT2Model

# Load the tokenizer and the base (headless) GPT-2 Large model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')
model = GPT2Model.from_pretrained('gpt2-large')

# Tokenize the input and run a forward pass to obtain the hidden states
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
and in TensorFlow:
from transformers import GPT2Tokenizer, TFGPT2Model

# Same feature-extraction example with the TensorFlow model class
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')
model = TFGPT2Model.from_pretrained('gpt2-large')

text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
In their model card about GPT-2, OpenAI wrote:
The primary intended users of these models are AI researchers and practitioners.
We primarily imagine these language models will be used by researchers to better understand the behaviors, capabilities, biases, and constraints of large-scale generative language models.
In their model card about GPT-2, OpenAI wrote:
Here are some secondary use cases we believe are likely:
- Writing assistance: Grammar assistance, autocompletion (for normal prose or code)
- Creative writing and art: exploring the generation of creative, fictional texts; aiding creation of poetry and other literary art.
- Entertainment: Creation of games, chat bots, and amusing generations.
In their model card about GPT-2, OpenAI wrote:
Because large-scale language models like GPT-2 ...
Dataset Card for HuggingFaceH4/rs_test
SFT model: HuggingFaceH4/falcon-40b-ift-v3.1
Reward model: HuggingFaceH4/pythia-70m-rm-v0.0
Temperature: 0.7
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Card for H4 Stack Exchange Preferences Dataset
Dataset Summary
This dataset contains questions and answers from the Stack Overflow Data Dump for the purpose of preference model training. Importantly, the questions have been filtered to fit the following criteria for preference models (following closely from Askell et al. 2021): have >=2 answers. This data could also be used for instruction fine-tuning and language model training. The questions are… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences.
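A minimal sketch of turning one question's answers into (chosen, rejected) preference pairs is shown below; the field names (answers, pm_score, text) are assumptions, so verify them against the dataset card:
from itertools import combinations
from datasets import load_dataset

# Stream one example and build preference pairs by comparing answer scores.
# Field names are assumptions; check the dataset card for the actual schema.
ds = load_dataset("HuggingFaceH4/stack-exchange-preferences", split="train", streaming=True)
example = next(iter(ds))

pairs = []
for a, b in combinations(example["answers"], 2):
    if a["pm_score"] != b["pm_score"]:
        chosen, rejected = (a, b) if a["pm_score"] > b["pm_score"] else (b, a)
        pairs.append((chosen["text"], rejected["text"]))
print(f"built {len(pairs)} preference pairs from one question")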
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This work shares a dataset that contains Spanish (SPA) to Mexican Sign Language (MSL) glosses (transcribed MSL) pairs of sentences for a downstream task. The methodology used to prepare the shared dataset considered the construction of a SPA-to-MSL corpus with a specific representation of the Spanish language for MSL interpretation. The proposed corpus is a reference dataset for evaluating diverse neural machine translation (NMT) system variants. With the support of grammatical MSL books and advice from MSL interpreters, this study developed a 3,000-sentence-pair SPA-to-MSL dataset. The distribution of the 3,000 sentences in the corpus follows the linguistic composition of the Spanish language. To test the functionality of the corpus as a data source for NMT, two neural transformer models for Spanish paraphrasing were used to assess the usability of the proposed dataset. The first NMT model uses a Helsinki-NLP SPA-SPA transformer developed by the Language Technologies Research Group at the University of Helsinki. The second NMT model considers a SPA-to-SPA pre-trained neural transformer presented as a BARTO approach. Both evaluations considered a transfer learning strategy, which has been shown to be effective for modeling low-resource languages, achieving state-of-the-art results in translation quality.
- Spanish-MSL glosses dataset: an .xlsx file containing 3,000 Spanish-MSL gloss pairs. To use the dataset, it needs to be converted to .csv format.
- Model M1: a Colab notebook containing the programming methodology for fine-tuning Helsinki-NLP/opus-mt-es-es, available on Hugging Face at https://huggingface.co/Helsinki-NLP/opus-mt-es-es. It was fine-tuned on the MSL-Spanish glosses corpus. It uses the transformers library from Hugging Face and the Trainer API for translation and evaluation. Translation quality was measured with ROUGE, TER, and BLEU. The model card of M1 is available at https://huggingface.co/VaniLara/esp-to-lsm-model, which includes a guide on how to use the model with the transformers library (see also the sketch after this list). To run the Colab you will need to create an access token, use your own Google Drive account, and create a repo on Hugging Face.
- Model M2: a Colab notebook containing the programming methodology for fine-tuning vgaraujov/bart-base-spanish, available on Hugging Face: https://huggingface.co/Helsinki-NLP/opus-mt-es-es. It was fine-tuned on the MSL-Spanish glosses corpus. It uses the transformers library from Hugging Face and the Trainer API for translation and evaluation. Translation quality was measured with ROUGE, TER, and BLEU. The model card of M2 is available at https://huggingface.co/VaniLara/esp-to-lsm-barto-model, which includes a guide on how to use the model with the transformers library. To run the Colab you will need to create an access token, use your own Google Drive account, and create a repo on Hugging Face.
- Model M1-split-version and Model M2-split-version: versions using the dataset split into 80% training, 10% validation, and 10% testing. Model cards are available at https://huggingface.co/vania2911/esp-to-lsm-barto-model and https://huggingface.co/vania2911/esp-to-lsm-model-split.
- Translations M1 and M2: contain the reference and predicted translations for each model.
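As a quick, hedged illustration of how the fine-tuned M1 model could be used, a minimal sketch with the transformers pipeline follows; the task name and output format are assumptions, and the model card linked above remains the authoritative usage guide:
from transformers import pipeline

# Minimal sketch: translate a Spanish sentence into MSL glosses with the fine-tuned M1 model.
translator = pipeline("translation", model="VaniLara/esp-to-lsm-model")
print(translator("Hola, ¿cómo estás?"))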
Dataset Card for "input-dataset"
More Information needed
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for OpenAI HumanEval
Dataset Summary
The HumanEval dataset released by OpenAI includes 164 programming problems with a function signature, docstring, body, and several unit tests. They were handwritten to ensure they would not be included in the training set of code generation models.
Supported Tasks and Leaderboards
Languages
The programming problems are written in Python and contain English natural text in comments and… See the full description on the dataset page: https://huggingface.co/datasets/openai/openai_humaneval.
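A minimal sketch for loading HumanEval and inspecting one problem is shown below; the field names (task_id, prompt, canonical_solution, test, entry_point) follow the dataset card and should be verified against the dataset viewer:
from datasets import load_dataset

# HumanEval ships a single "test" split of 164 problems.
humaneval = load_dataset("openai/openai_humaneval", split="test")
problem = humaneval[0]
print(problem["task_id"])
print(problem["prompt"])  # function signature and docstring to be completed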
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Summary
databricks-dolly-15k is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. This dataset can be used for any purpose, whether academic or commercial, under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported… See the full description on the dataset page: https://huggingface.co/datasets/databricks/databricks-dolly-15k.
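A minimal sketch for loading the dataset is shown below; the field names (instruction, context, response, category) follow the dataset card:
from datasets import load_dataset

# Load databricks-dolly-15k and inspect one instruction-following record.
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")
record = dolly[0]
print(record["category"], "-", record["instruction"])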
Dataset Card for "instruction-pilot-outputs-greedy"
This dataset contains model outputs generated from the human demonstrations provided in HuggingFaceH4/instruction-pilot-prompts.
To convert each language model into a dialogue agent, we prepended the following LangChain prompt to each input:
The following is a friendly conversation between a human and an AI.
The AI is talkative and provides lots of specific details from its context.
If the AI does not know the answer to a… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/instruction-pilot-outputs-greedy.
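A rough sketch of this prompting step is given below; the prompt text is abbreviated to a placeholder, and the model name and generation call are illustrative rather than the exact pipeline used to produce this dataset:
from transformers import pipeline

# Illustration only: prepend the dialogue prompt to a human demonstration and decode greedily.
DIALOGUE_PROMPT = "The following is a friendly conversation between a human and an AI. ..."  # placeholder

generator = pipeline("text-generation", model="gpt2")  # placeholder model
prompt = DIALOGUE_PROMPT + "\nHuman: What is the capital of France?\nAI:"
output = generator(prompt, max_new_tokens=50, do_sample=False)  # greedy decoding
print(output[0]["generated_text"])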
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
We curate a large corpus of legal and administrative data. The utility of this data is twofold: (1) to aggregate legal and administrative data sources that demonstrate different norms and legal standards for data filtering; (2) to collect a dataset that can be used in the future for pretraining legal-domain language models, a key direction in access-to-justice initiatives.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
We present XLSum, a comprehensive and diverse dataset comprising 1.35 million professionally annotated article-summary pairs from BBC, extracted using a set of carefully designed heuristics. The dataset covers 45 languages ranging from low to high-resource, for many of which no public dataset is currently available. XL-Sum is highly abstractive, concise, and of high quality, as indicated by human and intrinsic evaluation.
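A minimal sketch for loading one language subset is shown below; the Hub identifier (csebuetnlp/xlsum), the config name ("english"), and the split names are assumptions, so check the dataset page for the exact identifiers:
from datasets import load_dataset

# Load the English subset of XL-Sum and inspect one article-summary pair.
xlsum_en = load_dataset("csebuetnlp/xlsum", "english", split="train")
print(xlsum_en[0]["title"])
print(xlsum_en[0]["summary"])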
Dataset Card for "pythia-70m-rs"
More Information needed
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Model Card: Document Visual Retrieval Test (internal)
Dataset Overview
This dataset is designed to evaluate the performance of visual retrievers by testing their ability to match a query to a relevant image. Each of the three examples in this dataset contains a text query and an associated image, which is a scanned page from the foundational "Attention is All You Need" paper. The purpose of this dataset is to facilitate the evaluation of visual retrievers, where… See the full description on the dataset page: https://huggingface.co/datasets/hf-internal-testing/document-visual-retrieval-test.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for "helpful-self-instruct-raw"
This dataset is derived from the finetuning subset of Self-Instruct, with some light formatting to remove trailing spaces and <|endoftext|> tokens.
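An illustrative sketch of the kind of cleanup described above (not the exact script used) might look like this:
def clean(text: str) -> str:
    # Drop <|endoftext|> tokens and strip trailing spaces from each line.
    text = text.replace("<|endoftext|>", "")
    return "\n".join(line.rstrip() for line in text.splitlines()).strip()

print(clean("Write a haiku about autumn.  <|endoftext|>"))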
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Card for Dataset Name
https://choosealicense.com/licenses/other/
XGLUE is a new benchmark dataset to evaluate the performance of cross-lingual pre-trained models with respect to cross-lingual natural language understanding and generation. The benchmark is composed of the following 11 tasks:
- NER
- POS Tagging (POS)
- News Classification (NC)
- MLQA
- XNLI
- PAWS-X
- Query-Ad Matching (QADSM)
- Web Page Ranking (WPR)
- QA Matching (QAM)
- Question Generation (QG)
- News Title Generation (NTG)
For more information, please take a look at https://microsoft.github.io/XGLUE/.
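A minimal sketch for loading one XGLUE task with the datasets library is shown below; the Hub identifier ("xglue"), the config name ("ner"), and the split names are assumptions, so consult the XGLUE page for the exact names:
from datasets import load_dataset

# Load the NER task of XGLUE and inspect one training example.
xglue_ner = load_dataset("xglue", "ner", split="train")
print(xglue_ner[0])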