100+ datasets found
  1. model_cards_with_metadata

    • hf-proxy-cf.effarig.site
    • huggingface.co
    + more versions
    Cite
    Librarian Bots, model_cards_with_metadata [Dataset]. https://hf-proxy-cf.effarig.site/datasets/librarian-bots/model_cards_with_metadata
    Dataset authored and provided by
    Librarian Bots
    Description

    Dataset Card for Hugging Face Hub Model Cards

    This dataset consists of model cards for models hosted on the Hugging Face Hub. The model cards are created by the community and provide information about the model, its performance, its intended uses, and more. This dataset is updated on a daily basis and includes publicly available models on the Hugging Face Hub. This dataset is made available to help support users wanting to work with a large number of Model Cards from the Hub. We… See the full description on the dataset page: https://huggingface.co/datasets/librarian-bots/model_cards_with_metadata.

  2. card

    • huggingface.co
    Updated Jan 20, 2025
    + more versions
    Cite
    card [Dataset]. https://huggingface.co/datasets/chen1914/card
    Dataset updated
    Jan 20, 2025
    Authors
    fang chen
    Description

    Dataset Card for Dataset Name

    This dataset card aims to be a base template for new datasets. It has been generated using this raw template.

    Dataset Details

    Dataset Description

    • Curated by: [More Information Needed]
    • Funded by [optional]: [More Information Needed]
    • Shared by [optional]: [More Information Needed]
    • Language(s) (NLP): [More Information Needed]
    • License: [More Information Needed]

    Dataset Sources [optional]

    Repository: [More… See the full description on the dataset page: https://huggingface.co/datasets/chen1914/card.

  3. Dataset of the paper: "How do Hugging Face Models Document Datasets, Bias,...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 16, 2024
    Cite
    Mastropaolo, Antonio (2024). Dataset of the paper: "How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8200098
    Dataset updated
    Jan 16, 2024
    Dataset provided by
    Mastropaolo, Antonio
    Pepe, Federica
    Nardone, Vittoria
    Di Penta, Massimiliano
    Canfora, Gerardo
    BAVOTA, Gabriele
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This replication package contains datasets and scripts related to the paper: "How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study"

    Root directory

    • statistics.r: R script used to compute the correlation between usage and downloads, and the RQ1/RQ2 inter-rater agreements
    • modelsInfo.zip: zip file containing all the downloaded model cards (in JSON format)
    • script: directory containing all the scripts used to collect and process data. For further details, see README file inside the script directory.

    Dataset

    • Dataset/Dataset_HF-models-list.csv: list of HF models analyzed
    • Dataset/Dataset_github-prj-list.txt: list of GitHub projects using the transformers library
    • Dataset/Dataset_github-Prj_model-Used.csv: contains usage pairs: project, model
    • Dataset/Dataset_prj-num-models-reused.csv: number of models used by each GitHub project
    • Dataset/Dataset_model-download_num-prj_correlation.csv: contains, for each model used by GitHub projects, the name, the task, the number of reusing projects, and the number of downloads

    RQ1

    • RQ1/RQ1_dataset-list.txt: list of HF datasets
    • RQ1/RQ1_datasetSample.csv: sample set of models used for the manual analysis of datasets
    • RQ1/RQ1_analyzeDatasetTags.py: Python script to analyze model tags for the presence of datasets. It requires modelsInfo.zip to be unzipped into a directory with the same name (modelsInfo) at the root of the replication package folder. It writes its output to stdout, which should be redirected to a file to be analyzed by the RQ2/countDataset.py script
    • RQ1/RQ1_countDataset.py: given the output of RQ2/analyzeDatasetTags.py (passed as argument) produces, for each model, a list of Booleans indicating whether (i) the model only declares HF datasets, (ii) the model only declares external datasets, (iii) the model declares both, and (iv) the model is part of the sample for the manual analysis
    • RQ1/RQ1_datasetTags.csv: output of RQ2/analyzeDatasetTags.py
    • RQ1/RQ1_dataset_usage_count.csv: output of RQ2/countDataset.py
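
    The four Booleans produced per model can be illustrated with a minimal sketch. This is not the code of the replication package: the function name, the flag names, and the toy lookup sets are assumptions made for illustration only.

```python
# Hypothetical sketch: names and data are illustrative assumptions,
# not taken from RQ1_countDataset.py.

def classify_declared_datasets(declared, hf_datasets, manual_sample, model_id):
    """Return the four per-model Booleans described in the list above."""
    declared = set(declared)
    on_hub = declared & hf_datasets      # declared datasets that are HF datasets
    external = declared - hf_datasets    # declared datasets not found on HF
    return {
        "only_hf": bool(on_hub) and not external,
        "only_external": bool(external) and not on_hub,
        "both": bool(on_hub) and bool(external),
        "in_manual_sample": model_id in manual_sample,
    }


hf_datasets = {"squad", "glue"}          # toy stand-in for the list of HF datasets
manual_sample = {"org/model-a"}          # toy stand-in for the manual-analysis sample

flags = classify_declared_datasets(
    ["squad", "my-private-corpus"], hf_datasets, manual_sample, "org/model-a"
)
print(flags)
```

    A model that declares no datasets at all yields False for the first three flags under this sketch; whether the original script treats that case the same way is not stated in the description.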

    RQ2

    • RQ2/tableBias.pdf: table detailing the number of occurrences of different types of bias by model Task
    • RQ2/RQ2_bias_classification_sheet.csv: results of the manual labeling
    • RQ2/RQ2_isBiased.csv: file to compute the inter-rater agreement of whether or not a model documents Bias
    • RQ2/RQ2_biasAgrLabels.csv: file to compute the inter-rater agreement related to bias categories
    • RQ2/RQ2_final_bias_categories_with_levels.csv: for each model in the sample, this file lists (i) the bias leaf category, (ii) the first-level category, and (iii) the intermediate category
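
    The description does not say which inter-rater agreement statistic is computed from these files. As a hedged illustration only, the sketch below assumes Cohen's kappa for two raters; the function name and the toy labels are hypothetical.

```python
from collections import Counter

# Assumption: Cohen's kappa for two raters. The replication package may use
# a different agreement statistic; this is only an illustrative sketch.

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Fraction of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy labels: does the model card document bias (True) or not (False)?
rater_a = [True, True, False, False, True, False]
rater_b = [True, False, False, False, True, True]
kappa = cohens_kappa(rater_a, rater_b)
print(round(kappa, 3))
```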

    RQ3

    • RQ3/RQ3_LicenseValidation.csv: manual validation of a sample of licenses
    • RQ3/RQ3_{NETWORK-RESTRICTIVE|RESTRICTIVE|WEAK-RESTRICTIVE|PERMISSIVE}-license-list.txt: lists of licenses with different permissiveness
    • RQ3/RQ3_prjs_license.csv: for each project linked to models, among other fields it indicates the license tag and name
    • RQ3/RQ3_models_license.csv: for each model, indicates among other pieces of info, whether the model has a license, and if yes what kind of license
    • RQ3/RQ3_model-prj-license_contingency_table.csv: usage contingency table between projects' licenses (columns) and models' licenses (rows)
    • RQ3/RQ3_models_prjs_licenses_with_type.csv: pairs project-model, with their respective licenses and permissiveness level
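
    A contingency table of this shape (model licenses as rows, project licenses as columns) can be built from project-model pairs roughly as follows. The sketch is an assumption about the general technique, not the authors' script, and the toy pairs are invented.

```python
from collections import Counter

# Hypothetical sketch: builds a rows-by-columns count table from
# (model_license_type, project_license_type) pairs. Data is invented.

def contingency_table(pairs):
    counts = Counter(pairs)
    rows = sorted({model_lic for model_lic, _ in counts})
    cols = sorted({prj_lic for _, prj_lic in counts})
    # Counter returns 0 for combinations that never occur.
    table = {m: {p: counts[(m, p)] for p in cols} for m in rows}
    return rows, cols, table


pairs = [
    ("PERMISSIVE", "PERMISSIVE"),
    ("PERMISSIVE", "RESTRICTIVE"),
    ("RESTRICTIVE", "PERMISSIVE"),
    ("PERMISSIVE", "PERMISSIVE"),
]
rows, cols, table = contingency_table(pairs)
for model_license in rows:
    print(model_license, [table[model_license][c] for c in cols])
```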

    scripts

    Contains the scripts used to mine Hugging Face and GitHub. Details are in the enclosed README

  4. CofeXHug: A curated dataset of HuggingFace pre-trained models exploited in...

    • zenodo.org
    bin, zip
    Updated Dec 5, 2024
    Cite
    Claudio Di Sipio; Juri Di Rocco; Davide Di Ruscio; Stefano Palombo (2024). CofeXHug: A curated dataset of HuggingFace pre-trained models exploited in the GitHub ecosystem [Dataset]. http://doi.org/10.5281/zenodo.14267550
    Available download formats: bin, zip
    Dataset updated
    Dec 5, 2024
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Claudio Di Sipio; Juri Di Rocco; Davide Di Ruscio; Stefano Palombo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Pre-trained models (PTMs) are becoming increasingly popular in the software engineering community. Their usage is facilitated by model repositories, e.g., HuggingFace, which collect, store, and maintain a wide range of PTMs. However, the actual adoption of these models in real-world projects is still an open question. In particular, many of them are used in toy projects or simply as a mirror for the HF repository. Thus, we see the need for a curated codebase related to PTMs to support developers and practitioners who are interested in using them in their projects.
    This artifact contains CodeXHug, a curated dataset of HuggingFace PTMs exploited in the GitHub ecosystem. Starting from the latest HF dump, we first conduct a data curation to collect PTMs with a tag and a model card. Then, the GitHub platform has been queried to find actual usages of the identified PTMs, resulting in 7,325 different models and 372,063 Python files. We also present a statistical analysis of the dataset, highlighting the most popular PTMs and the most common tasks for which they are used. Finally, we discuss the research opportunities enabled by CodeXHug and the implications of our findings for the software engineering community.

  5. model-card-sentences

    • huggingface.co
    Updated Nov 20, 2023
    + more versions
    Cite
    model-card-sentences [Dataset]. https://huggingface.co/datasets/librarian-bots/model-card-sentences
    Dataset updated
    Nov 20, 2023
    Dataset authored and provided by
    Librarian Bots
    Description

    librarian-bots/model-card-sentences dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. AIMO-24: Model (openai-community/gpt2-large)

    • kaggle.com
    zip
    Updated Apr 7, 2024
    Cite
    Dinh Thoai Tran @ randrise.com (2024). AIMO-24: Model (openai-community/gpt2-large) [Dataset]. https://www.kaggle.com/datasets/dinhttrandrise/aimo-24-model-openai-community-gpt2-large
    Available download formats: zip (0 bytes)
    Dataset updated
    Apr 7, 2024
    Authors
    Dinh Thoai Tran @ randrise.com
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    language: en

    license: mit

    GPT-2 Large

    Model Details

    Model Description: GPT-2 Large is the 774M parameter version of GPT-2, a transformer-based language model created and released by OpenAI. The model is pretrained on English text using a causal language modeling (CLM) objective.

    How to Get Started with the Model

    Use the code below to get started with the model. You can use this model directly with a pipeline for text generation. Since the generation relies on some randomness, we set a seed for reproducibility:

    >>> from transformers import pipeline, set_seed
    >>> generator = pipeline('text-generation', model='gpt2-large')
    >>> set_seed(42)
    >>> generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)
    
    [{'generated_text': "Hello, I'm a language model, I can do language modeling. In fact, this is one of the reasons I use languages. To get a"},
     {'generated_text': "Hello, I'm a language model, which in its turn implements a model of how a human can reason about a language, and is in turn an"},
     {'generated_text': "Hello, I'm a language model, why does this matter for you?\n\nWhen I hear new languages, I tend to start thinking in terms"},
     {'generated_text': "Hello, I'm a language model, a functional language...\n\nI don't need to know anything else. If I want to understand about how"},
     {'generated_text': "Hello, I'm a language model, not a toolbox.\n\nIn a nutshell, a language model is a set of attributes that define how"}]
    

    Here is how to use this model to get the features of a given text in PyTorch:

    from transformers import GPT2Tokenizer, GPT2Model
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')
    model = GPT2Model.from_pretrained('gpt2-large')
    text = "Replace me by any text you'd like."
    encoded_input = tokenizer(text, return_tensors='pt')
    output = model(**encoded_input)
    

    and in TensorFlow:

    from transformers import GPT2Tokenizer, TFGPT2Model
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')
    model = TFGPT2Model.from_pretrained('gpt2-large')
    text = "Replace me by any text you'd like."
    encoded_input = tokenizer(text, return_tensors='tf')
    output = model(encoded_input)
    

    Uses

    Direct Use

    In their model card about GPT-2, OpenAI wrote:

    The primary intended users of these models are AI researchers and practitioners.

    We primarily imagine these language models will be used by researchers to better understand the behaviors, capabilities, biases, and constraints of large-scale generative language models.

    Downstream Use

    In their model card about GPT-2, OpenAI wrote:

    Here are some secondary use cases we believe are likely:

    • Writing assistance: Grammar assistance, autocompletion (for normal prose or code)
    • Creative writing and art: exploring the generation of creative, fictional texts; aiding creation of poetry and other literary art.
    • Entertainment: Creation of games, chat bots, and amusing generations.

    Misuse and Out-of-scope Use

    In their model card about GPT-2, OpenAI wrote:

    Because large-scale language models like GPT-2 ...

  7. rs_test

    • huggingface.co
    Updated Aug 9, 2023
    Cite
    Hugging Face H4 (2023). rs_test [Dataset]. https://huggingface.co/datasets/HuggingFaceH4/rs_test
    Dataset updated
    Aug 9, 2023
    Dataset provided by
    Hugging Face: https://huggingface.co/
    Authors
    Hugging Face H4
    Description

    Dataset Card for HuggingFaceH4/rs_test

    • SFT model: HuggingFaceH4/falcon-40b-ift-v3.1
    • Reward model: HuggingFaceH4/pythia-70m-rm-v0.0
    • Temperature: 0.7

  8. stack-exchange-preferences

    • huggingface.co
    • opendatalab.com
    Cite
    stack-exchange-preferences [Dataset]. https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences
    Dataset provided by
    Hugging Face: https://huggingface.co/
    Authors
    Hugging Face H4
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for H4 Stack Exchange Preferences Dataset

    Dataset Summary

    This dataset contains questions and answers from the Stack Overflow Data Dump for the purpose of preference model training. Importantly, the questions have been filtered to fit the following criteria for preference models (following closely from Askell et al. 2021): have >=2 answers. This data could also be used for instruction fine-tuning and language model training. The questions are… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences.
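
    The one filtering criterion visible in this truncated description (questions must have at least two answers) could be applied as in the sketch below. The record layout with "question" and "answers" keys is an assumption for illustration and is not the dataset's documented schema.

```python
# Hypothetical sketch: the record layout is assumed, not taken from the dataset card.

def keep_multi_answer_questions(records, min_answers=2):
    """Keep only questions with at least `min_answers` answers."""
    return [r for r in records if len(r["answers"]) >= min_answers]


records = [
    {"question": "How do I reverse a list?",
     "answers": ["Use reversed().", "Slice with [::-1]."]},
    {"question": "What is a monad?", "answers": ["A design pattern."]},
    {"question": "Unanswered question", "answers": []},
]
kept = keep_multi_answer_questions(records)
print([r["question"] for r in kept])
```

    A preference model needs at least two answers per question so that one can be ranked against another, which is why single-answer questions are dropped.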

  9. Spanish to Mexican Sign Language (MSL) glosses corpus for NLP tasks

    • scidb.cn
    • figshare.com
    Updated Mar 3, 2025
    Cite
    Diana Vania Lara Ortiz; Jorge Isaac Chairez Oria; Rita Fuentes Quetziquel Aguilar (2025). Spanish to Mexican Sign Language (MSL) glosses corpus for NLP tasks [Dataset]. http://doi.org/10.57760/sciencedb.21522
    Dataset updated
    Mar 3, 2025
    Dataset provided by
    Science Data Bank
    Authors
    Diana Vania Lara Ortiz; Jorge Isaac Chairez Oria; Rita Fuentes Quetziquel Aguilar
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Area covered
    Mexico
    Description

    This work shares a dataset that contains pairs of sentences in Spanish (SPA) and Mexican Sign Language (MSL) glosses (transcribed MSL) for a downstream task. The dataset was prepared by constructing a SPA-to-MSL corpus with a specific representation of the Spanish language for MSL interpretation. The proposed corpus is a reference dataset for evaluating diverse neural machine translation (NMT) system variants. With the support of grammatical MSL books and advice from MSL interpreters, this study developed a dataset of 3000 SPA-to-MSL sentence pairs. The distribution of the 3000 sentences in the corpus follows the linguistic composition of the Spanish language. To test the corpus as a data source for NMT, two neural transformer models for Spanish paraphrasing were used. The first NMT model uses a Helsinki-NLP SPA-SPA transformer developed by the Language Technologies Research Group at the University of Helsinki. The second NMT model uses a pre-trained Spa-to-Spa neural transformer presented as the BARTO approach. Both evaluations used a transfer learning strategy, which has been demonstrated to be effective for modeling low-resource languages, achieving state-of-the-art results in translation quality.

    • Spanish-MSL glosses dataset: an .xlsx file that contains 3000 Spanish-MSL gloss pairs. To use the dataset, it needs to be converted to .csv format.
    • Model M1: a Colab file that contains the programming methodology for fine-tuning Helsinki-NLP/opus-mt-es-es, available on Hugging Face: https://huggingface.co/Helsinki-NLP/opus-mt-es-es. It was fine-tuned on the MSL-Spanish glosses corpus. It uses the transformers library from Hugging Face and the Trainer API for translation and evaluation. To evaluate the quality of translation, ROUGE, TER, and BLEU were measured. The model card of M1 is available at https://huggingface.co/VaniLara/esp-to-lsm-model and includes a guide on how to use it with the transformers library. To use the Colab, you will need to create an access token, use your own Google Drive account, and create a repo on Hugging Face.
    • Model M2: a Colab file that contains the programming methodology for fine-tuning vgaraujov/bart-base-spanish, available on Hugging Face: https://huggingface.co/Helsinki-NLP/opus-mt-es-es. It was fine-tuned on the MSL-Spanish glosses corpus. It uses the transformers library from Hugging Face and the Trainer API for translation and evaluation. To evaluate the quality of translation, ROUGE, TER, and BLEU were measured. The model card of M2 is available at https://huggingface.co/VaniLara/esp-to-lsm-barto-model and includes a guide on how to use it with the transformers library. To use the Colab, you will need to create an access token, use your own Google Drive account, and create a repo on Hugging Face.
    • Model M1-split-version and Model M2-split-version: these use the dataset split into 80% training, 10% validation, and 10% testing. Model cards are available at https://huggingface.co/vania2911/esp-to-lsm-barto-model and https://huggingface.co/vania2911/esp-to-lsm-model-split.
    • Translations M1 and M2: contain the reference and predicted translations for each model.
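
    An 80/10/10 split of the 3000 sentence pairs can be sketched as follows. The description does not say how the split was drawn, so the seeded shuffle, the function name, and the placeholder pairs are assumptions.

```python
import random

# Hypothetical sketch: a seeded shuffle followed by an 80/10/10 cut.
# The original split procedure is not described in the source.

def split_80_10_10(pairs, seed=0):
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)   # deterministic for a given seed
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]       # remainder goes to the test set
    return train, val, test


# Placeholder pairs standing in for the (Spanish, MSL gloss) sentence pairs.
pairs = [(f"spa-{i}", f"msl-{i}") for i in range(3000)]
train, val, test = split_80_10_10(pairs)
print(len(train), len(val), len(test))
```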

  10. auto-retrain-input-dataset

    • huggingface.co
    Updated Jan 26, 2023
    Cite
    Huggingface Projects (2023). auto-retrain-input-dataset [Dataset]. https://huggingface.co/datasets/huggingface-projects/auto-retrain-input-dataset
    Dataset updated
    Jan 26, 2023
    Dataset provided by
    Hugging Face: https://huggingface.co/
    Authors
    Huggingface Projects
    Description

    Dataset Card for "input-dataset"

    More Information needed

  11. openai_humaneval

    • huggingface.co
    Updated Jan 1, 2022
    Cite
    OpenAI (2022). openai_humaneval [Dataset]. https://huggingface.co/datasets/openai/openai_humaneval
    Dataset updated
    Jan 1, 2022
    Dataset authored and provided by
    OpenAI: http://openai.com/
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for OpenAI HumanEval

    Dataset Summary

    The HumanEval dataset released by OpenAI includes 164 programming problems with a function signature, docstring, body, and several unit tests. They were handwritten so that they would not be included in the training set of code generation models.
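
    How such a problem is checked against its unit tests can be sketched as below. The toy record is invented for illustration and is not one of the 164 problems; the field names (prompt, canonical_solution, test, entry_point) are assumed to match the dataset's columns, and the helper function is hypothetical.

```python
# Hypothetical sketch with an invented record. exec() runs arbitrary code,
# so real model completions should only be run inside a sandbox.

problem = {
    "prompt": 'def add(a, b):\n    """Return the sum of a and b."""\n',
    "canonical_solution": "    return a + b\n",
    "test": (
        "def check(candidate):\n"
        "    assert candidate(1, 2) == 3\n"
        "    assert candidate(-1, 1) == 0\n"
    ),
    "entry_point": "add",
}


def passes_unit_tests(problem, completion):
    """Join signature+docstring, a candidate body, and the tests, then run them."""
    program = problem["prompt"] + completion + "\n" + problem["test"]
    namespace = {}
    try:
        exec(program, namespace)
        namespace["check"](namespace[problem["entry_point"]])
    except Exception:
        return False
    return True


print(passes_unit_tests(problem, problem["canonical_solution"]))
print(passes_unit_tests(problem, "    return a - b\n"))
```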

    Supported Tasks and Leaderboards

    Languages

    The programming problems are written in Python and contain English natural text in comments and… See the full description on the dataset page: https://huggingface.co/datasets/openai/openai_humaneval.

  12. databricks-dolly-15k

    • huggingface.co
    Updated Apr 17, 2023
    + more versions
    Cite
    Databricks (2023). databricks-dolly-15k [Dataset]. https://huggingface.co/datasets/databricks/databricks-dolly-15k
    Dataset updated
    Apr 17, 2023
    Dataset authored and provided by
    Databricks: http://databricks.com/
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Summary

    databricks-dolly-15k is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. This dataset can be used for any purpose, whether academic or commercial, under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported… See the full description on the dataset page: https://huggingface.co/datasets/databricks/databricks-dolly-15k.

  13. instruction-pilot-outputs-greedy

    • huggingface.co
    Updated Aug 15, 2022
    Cite
    Hugging Face H4 (2022). instruction-pilot-outputs-greedy [Dataset]. https://huggingface.co/datasets/HuggingFaceH4/instruction-pilot-outputs-greedy
    Dataset updated
    Aug 15, 2022
    Dataset provided by
    Hugging Face: https://huggingface.co/
    Authors
    Hugging Face H4
    Description

    Dataset Card for "instruction-pilot-outputs-greedy"

    This dataset contains model outputs generated from the human demonstrations provided in HuggingFaceH4/instruction-pilot-prompts. To convert each language model into a dialogue agent, we prepended the following LangChain prompt to each input: The following is a friendly conversation between a human and an AI.
    The AI is talkative and provides lots of specific details from its context.
    If the AI does not know the answer to a… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/instruction-pilot-outputs-greedy.

  14. pile-of-law

    • huggingface.co
    • opendatalab.com
    Updated Jul 10, 2022
    Cite
    Pile of Law (2022). pile-of-law [Dataset]. https://huggingface.co/datasets/pile-of-law/pile-of-law
    Dataset updated
    Jul 10, 2022
    Dataset authored and provided by
    Pile of Law
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    We curate a large corpus of legal and administrative data. The utility of this data is twofold: (1) to aggregate legal and administrative data sources that demonstrate different norms and legal standards for data filtering; (2) to collect a dataset that can be used in the future for pretraining legal-domain language models, a key direction in access-to-justice initiatives.

  15. xlsum

    • huggingface.co
    Updated Dec 18, 2021
    Cite
    xlsum [Dataset]. https://huggingface.co/datasets/GEM/xlsum
    Dataset updated
    Dec 18, 2021
    Dataset authored and provided by
    GEM benchmark
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    We present XLSum, a comprehensive and diverse dataset comprising 1.35 million professionally annotated article-summary pairs from BBC, extracted using a set of carefully designed heuristics. The dataset covers 45 languages ranging from low to high-resource, for many of which no public dataset is currently available. XL-Sum is highly abstractive, concise, and of high quality, as indicated by human and intrinsic evaluation.

  16. pythia-70m-rs

    • huggingface.co
    Updated Aug 1, 2023
    Cite
    Hugging Face H4 (2023). pythia-70m-rs [Dataset]. https://huggingface.co/datasets/HuggingFaceH4/pythia-70m-rs
    Dataset updated
    Aug 1, 2023
    Dataset provided by
    Hugging Face: https://huggingface.co/
    Authors
    Hugging Face H4
    Description

    Dataset Card for "pythia-70m-rs"

    More Information needed

  17. document-visual-retrieval-test

    • huggingface.co
    Updated Oct 31, 2024
    Cite
    Hugging Face Internal Testing Organization (2024). document-visual-retrieval-test [Dataset]. https://huggingface.co/datasets/hf-internal-testing/document-visual-retrieval-test
    Dataset updated
    Oct 31, 2024
    Dataset provided by
    Hugging Face: https://huggingface.co/
    Authors
    Hugging Face Internal Testing Organization
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Model Card: Document Visual Retrieval Test (internal)

    Dataset Overview

    This dataset is designed to evaluate the performance of visual retrievers by testing their ability to match a query to a relevant image. Each of the three examples in this dataset contains a text query and an associated image, which is a scanned page from the foundational "Attention is All You Need" paper. The purpose of this dataset is to facilitate the evaluation of visual retrievers, where… See the full description on the dataset page: https://huggingface.co/datasets/hf-internal-testing/document-visual-retrieval-test.

  18. helpful-self-instruct-raw

    • huggingface.co
    Updated Feb 22, 2023
    Cite
    Hugging Face H4 (2023). helpful-self-instruct-raw [Dataset]. https://huggingface.co/datasets/HuggingFaceH4/helpful-self-instruct-raw
    Dataset updated
    Feb 22, 2023
    Dataset provided by
    Hugging Face: https://huggingface.co/
    Authors
    Hugging Face H4
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for "helpful-self-instruct-raw"

    This dataset is derived from the finetuning subset of Self-Instruct, with some light formatting to remove trailing spaces and <|endoftext|> tokens.
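
    The light formatting mentioned above might look like the following sketch. The function name and the sample string are hypothetical, and whether the original processing strips the token everywhere or only at the end of a record is not stated; this version removes every occurrence.

```python
# Hypothetical sketch of the cleanup: drop <|endoftext|> tokens and
# strip trailing whitespace from each line.

END_OF_TEXT = "<|endoftext|>"


def clean_example(text):
    without_token = text.replace(END_OF_TEXT, "")
    return "\n".join(line.rstrip() for line in without_token.split("\n"))


raw = "Write a short poem about autumn.  \nOutput: Leaves fall slowly.<|endoftext|>"
cleaned = clean_example(raw)
print(repr(cleaned))
```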

  19. pmp-se-test-dataset

    • huggingface.co
    Cite
    pmp-se-test-dataset [Dataset]. https://huggingface.co/datasets/HuggingFaceH4/pmp-se-test-dataset
    Dataset provided by
    Hugging Face: https://huggingface.co/
    Authors
    Hugging Face H4
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for Dataset Name

  20. xglue

    • huggingface.co
    Updated Dec 2, 2020
    + more versions
    Cite
    xglue [Dataset]. https://huggingface.co/datasets/microsoft/xglue
    Dataset updated
    Dec 2, 2020
    Dataset authored and provided by
    Microsoft: http://microsoft.com/
    License

    https://choosealicense.com/licenses/other/

    Description

    XGLUE is a new benchmark dataset to evaluate the performance of cross-lingual pre-trained models with respect to cross-lingual natural language understanding and generation. The benchmark is composed of the following 11 tasks:

    • NER
    • POS Tagging (POS)
    • News Classification (NC)
    • MLQA
    • XNLI
    • PAWS-X
    • Query-Ad Matching (QADSM)
    • Web Page Ranking (WPR)
    • QA Matching (QAM)
    • Question Generation (QG)
    • News Title Generation (NTG)

    For more information, please take a look at https://microsoft.github.io/XGLUE/.
