100+ datasets found
  1. Data from: huggingface

    • kaggle.com
    Updated Mar 22, 2022
    amulil (2022). huggingface [Dataset]. https://www.kaggle.com/datasets/amulil/amulil-huggingface
    Dataset updated
    Mar 22, 2022
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    amulil
    License

    http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    Dataset

    This dataset was created by amulil

    Released under GPL 2

    Contents

  2. finevideo

    • huggingface.co
    Updated Sep 12, 2024
    Hugging Face FineVideo (2024). finevideo [Dataset]. https://huggingface.co/datasets/HuggingFaceFV/finevideo
    Dataset updated
    Sep 12, 2024
    Dataset provided by
    Hugging Face: https://huggingface.co/
    Authors
    Hugging Face FineVideo
    License

    https://choosealicense.com/licenses/cc/

    Description

    FineVideo

    FineVideo Description Dataset Explorer Revisions Dataset Distribution

    How to download and use FineVideo Using datasets Using huggingface_hub Load a subset of the dataset

    Dataset Structure Data Instances Data Fields

    Dataset Creation License CC-By Considerations for Using the Data Social Impact of Dataset Discussion of Biases

    Additional Information Credits Future Work Opting out of FineVideo Citation Information

    Terms of use for FineVideo… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFV/finevideo.

  3. ShareGPT_Vicuna_unfiltered

    • huggingface.co
    Updated Apr 12, 2023
    z. (2023). ShareGPT_Vicuna_unfiltered [Dataset]. https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered
    Dataset updated
    Apr 12, 2023
    Authors
    z.
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Further cleaning done. Please look through the dataset and ensure that I didn't miss anything. Update: Confirmed working method for training the model: https://huggingface.co/AlekseyKorshuk/vicuna-7b/discussions/4#64346c08ef6d5abefe42c12c Two choices:

    Removes instances of "I'm sorry, but": https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/blob/main/ShareGPT_V3_unfiltered_cleaned_split_no_imsorry.json

    Has instances of "I'm sorry, but":… See the full description on the dataset page: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered.

  4. fineweb

    • huggingface.co
    FineData, fineweb [Dataset]. http://doi.org/10.57967/hf/2493
    Dataset authored and provided by
    FineData
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    🍷 FineWeb

    15 trillion tokens of the finest data the 🌐 web has to offer

      What is it?
    

    The 🍷 FineWeb dataset consists of more than 15T tokens of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large-scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release of the full dataset under… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.

  5. Dataset of the paper: "How do Hugging Face Models Document Datasets, Bias,...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jan 16, 2024
    Federica Pepe; Vittoria Nardone; Antonio Mastropaolo; Gerardo Canfora; Gabriele BAVOTA; Massimiliano Di Penta (2024). Dataset of the paper: "How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study" [Dataset]. http://doi.org/10.5281/zenodo.10058142
    Dataset updated
    Jan 16, 2024
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Federica Pepe; Vittoria Nardone; Antonio Mastropaolo; Gerardo Canfora; Gabriele BAVOTA; Massimiliano Di Penta
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This replication package contains datasets and scripts related to the paper: "*How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study*"

    ## Root directory

    - `statistics.r`: R script used to compute the correlation between usage and downloads, and the RQ1/RQ2 inter-rater agreements

    - `modelsInfo.zip`: zip file containing all the downloaded model cards (in JSON format)

    - `script`: directory containing all the scripts used to collect and process data. For further details, see README file inside the script directory.

    ## Dataset

    - `Dataset/Dataset_HF-models-list.csv`: list of HF models analyzed

    - `Dataset/Dataset_github-prj-list.txt`: list of GitHub projects using the *transformers* library

    - `Dataset/Dataset_github-Prj_model-Used.csv`: contains usage pairs: project, model

    - `Dataset/Dataset_prj-num-models-reused.csv`: number of models used by each GitHub project

    - `Dataset/Dataset_model-download_num-prj_correlation.csv` contains, for each model used by GitHub projects: the name, the task, the number of reusing projects, and the number of downloads

    ## RQ1

    - `RQ1/RQ1_dataset-list.txt`: list of HF datasets

    - `RQ1/RQ1_datasetSample.csv`: sample set of models used for the manual analysis of datasets

    - `RQ1/RQ1_analyzeDatasetTags.py`: Python script to analyze model tags for the presence of datasets. It requires unzipping `modelsInfo.zip` into a directory with the same name (`modelsInfo`) at the root of the replication package folder. It writes its output to stdout; redirect it to a file to be analyzed by the `RQ2/countDataset.py` script

    - `RQ1/RQ1_countDataset.py`: given the output of `RQ2/analyzeDatasetTags.py` (passed as argument) produces, for each model, a list of Booleans indicating whether (i) the model only declares HF datasets, (ii) the model only declares external datasets, (iii) the model declares both, and (iv) the model is part of the sample for the manual analysis

    - `RQ1/RQ1_datasetTags.csv`: output of `RQ2/analyzeDatasetTags.py`

    - `RQ1/RQ1_dataset_usage_count.csv`: output of `RQ2/countDataset.py`
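
    The four Booleans described for `RQ1/RQ1_countDataset.py` can be illustrated with a minimal sketch. This is not the authors' script: the function name `classify_declared_datasets`, the `DatasetFlags` type, and the input shapes (a list of declared dataset names, a set of known HF dataset names, a set of sampled model IDs) are all assumptions made for illustration.

```python
from typing import Iterable, NamedTuple


class DatasetFlags(NamedTuple):
    only_hf: bool        # (i) the model declares only HF datasets
    only_external: bool  # (ii) the model declares only external datasets
    both: bool           # (iii) the model declares both kinds
    in_sample: bool      # (iv) the model is in the manual-analysis sample


def classify_declared_datasets(
    model_id: str,
    declared: Iterable[str],
    hf_datasets: set[str],
    sample_models: set[str],
) -> DatasetFlags:
    """Classify one model by where its declared datasets come from (hypothetical helper)."""
    declared = list(declared)
    has_hf = any(name in hf_datasets for name in declared)
    has_external = any(name not in hf_datasets for name in declared)
    return DatasetFlags(
        only_hf=has_hf and not has_external,
        only_external=has_external and not has_hf,
        both=has_hf and has_external,
        in_sample=model_id in sample_models,
    )


# Toy inputs, invented for illustration only.
hf = {"squad", "glue"}
sample = {"org/model-a"}
flags = classify_declared_datasets("org/model-a", ["squad", "my-private-corpus"], hf, sample)
print(flags)
```

    A model that declares no datasets at all gets `False` for the first three flags under this sketch; how the original script treats that case is not stated in the description.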

    ## RQ2

    - `RQ2/tableBias.pdf`: table detailing the number of occurrences of different types of bias by model Task

    - `RQ2/RQ2_bias_classification_sheet.csv`: results of the manual labeling

    - `RQ2/RQ2_isBiased.csv`: file to compute the inter-rater agreement of whether or not a model documents Bias

    - `RQ2/RQ2_biasAgrLabels.csv`: file to compute the inter-rater agreement related to bias categories

    - `RQ2/RQ2_final_bias_categories_with_levels.csv`: for each model in the sample, this file lists (i) the bias leaf category, (ii) the first-level category, and (iii) the intermediate category

    ## RQ3

    - `RQ3/RQ3_LicenseValidation.csv`: manual validation of a sample of licenses

    - `RQ3/RQ3_{NETWORK-RESTRICTIVE|RESTRICTIVE|WEAK-RESTRICTIVE|PERMISSIVE}-license-list.txt`: lists of licenses with different permissiveness

    - `RQ3/RQ3_prjs_license.csv`: for each project linked to models, among other fields it indicates the license tag and name

    - `RQ3/RQ3_models_license.csv`: for each model, indicates among other pieces of info, whether the model has a license, and if yes what kind of license

    - `RQ3/RQ3_model-prj-license_contingency_table.csv`: usage contingency table between projects' licenses (columns) and models' licenses (rows)

    - `RQ3/RQ3_models_prjs_licenses_with_type.csv`: pairs project-model, with their respective licenses and permissiveness level

    ## scripts

    Contains the scripts used to mine Hugging Face and GitHub. Details are in the enclosed README

  6. LAV-DF

    • huggingface.co
    Updated Jul 11, 2023
    ControlNet (2023). LAV-DF [Dataset]. https://huggingface.co/datasets/ControlNet/LAV-DF
    Dataset updated
    Jul 11, 2023
    Authors
    ControlNet
    License

    https://choosealicense.com/licenses/cc/

    Description

    Localized Audio Visual DeepFake Dataset (LAV-DF)

    This repo is the dataset for the DICTA paper Do You Really Mean That? Content Driven Audio-Visual Deepfake Dataset and Multimodal Method for Temporal Forgery Localization (Best Award), and the journal paper "Glitch in the Matrix!": A Large Scale Benchmark for Content Driven Audio-Visual Forgery Detection and Localization submitted to CVIU.

      LAV-DF Dataset

      Download

    To use this LAV-DF dataset, you should… See the full description on the dataset page: https://huggingface.co/datasets/ControlNet/LAV-DF.

  7. MVBench

    • huggingface.co
    Updated Aug 8, 2024
    OpenGVLab (2024). MVBench [Dataset]. https://huggingface.co/datasets/OpenGVLab/MVBench
    Dataset updated
    Aug 8, 2024
    Dataset authored and provided by
    OpenGVLab
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    MVBench

      Important Update
    

    [18/10/2024] Due to the NTU RGB+D license, 320 videos from NTU RGB+D need to be downloaded manually. Please visit ROSE Lab to access the data. We also provide a list of the 320 videos used in MVBench for your reference.

    We introduce a novel static-to-dynamic method for defining temporal-related tasks. By converting static tasks into dynamic ones, we facilitate systematic generation of video tasks necessitating a wide range of temporal abilities, from… See the full description on the dataset page: https://huggingface.co/datasets/OpenGVLab/MVBench.

  8. databricks-dolly-15k

    • huggingface.co
    + more versions
    Databricks, databricks-dolly-15k [Dataset]. https://huggingface.co/datasets/databricks/databricks-dolly-15k
    Dataset authored and provided by
    Databricks: http://databricks.com/
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Summary

    databricks-dolly-15k is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. This dataset can be used for any purpose, whether academic or commercial, under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported… See the full description on the dataset page: https://huggingface.co/datasets/databricks/databricks-dolly-15k.

  9. openai_humaneval

    • huggingface.co
    Updated Jan 1, 2022
    OpenAI (2022). openai_humaneval [Dataset]. https://huggingface.co/datasets/openai/openai_humaneval
    Dataset updated
    Jan 1, 2022
    Dataset authored and provided by
    OpenAI: https://openai.com/
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for OpenAI HumanEval

      Dataset Summary
    

    The HumanEval dataset released by OpenAI includes 164 programming problems, each with a function signature, docstring, body, and several unit tests. The problems were handwritten to ensure that they are not included in the training sets of code generation models.
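
    To make the "signature, docstring, body, unit tests" layout concrete, here is a minimal sketch of checking one candidate body against a problem's tests. It assumes each record exposes a prompt (signature plus docstring), a test snippet that defines `check(candidate)`, and an entry-point name; those field roles are an assumption based on the dataset card rather than something this listing states. The helper `passes_unit_tests` and the toy problem are invented for illustration.

```python
def passes_unit_tests(prompt: str, completion: str, test: str, entry_point: str) -> bool:
    """Run a candidate completion against a problem's unit tests.

    WARNING: exec() runs arbitrary code. Use it only on trusted input
    or inside a sandbox.
    """
    # Assemble: signature + docstring, candidate body, tests, then the call.
    program = prompt + completion + "\n\n" + test + "\n\n" + f"check({entry_point})\n"
    namespace: dict = {}
    try:
        exec(program, namespace)
    except Exception:
        # Any failed assertion or runtime error counts as a failure.
        return False
    return True


# A hand-written toy problem in the same shape; it is NOT taken from the dataset.
toy_prompt = 'def add(a, b):\n    """Return the sum of a and b."""\n'
toy_test = (
    "def check(candidate):\n"
    "    assert candidate(1, 2) == 3\n"
    "    assert candidate(-1, 1) == 0\n"
)

print(passes_unit_tests(toy_prompt, "    return a + b\n", toy_test, "add"))  # True
print(passes_unit_tests(toy_prompt, "    return a - b\n", toy_test, "add"))  # False
```

    Real evaluation harnesses typically add timeouts and process isolation, which this sketch leaves out.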

      Supported Tasks and Leaderboards

      Languages

    The programming problems are written in Python and contain English natural text in comments and docstrings.… See the full description on the dataset page: https://huggingface.co/datasets/openai/openai_humaneval.

  10. tiny-imagenet

    • huggingface.co
    • datasets.activeloop.ai
    Updated Aug 12, 2022
    Hao Zheng (2022). tiny-imagenet [Dataset]. https://huggingface.co/datasets/zh-plus/tiny-imagenet
    Dataset updated
    Aug 12, 2022
    Authors
    Hao Zheng
    License

    https://choosealicense.com/licenses/undefined/

    Description

    Dataset Card for tiny-imagenet

      Dataset Summary
    

    Tiny ImageNet contains 100000 images of 200 classes (500 for each class) downsized to 64×64 colored images. Each class has 500 training images, 50 validation images, and 50 test images.
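
    The counts in this summary can be cross-checked with a few lines of arithmetic. The sketch below takes the quoted per-class figures at face value; it says nothing about which splits the repository actually ships. The variable names are illustrative.

```python
NUM_CLASSES = 200
# Per-class image counts as quoted in the summary above.
PER_CLASS = {"train": 500, "validation": 50, "test": 50}

# Images per split = number of classes x images per class in that split.
split_sizes = {split: NUM_CLASSES * count for split, count in PER_CLASS.items()}
total_images = sum(split_sizes.values())

print(split_sizes)
print(total_images)
```

    Under these figures the quoted 100000 images matches the training split alone (200 × 500), while validation and test would add 10000 images each.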

      Languages
    

    The class labels in the dataset are in English.

      Dataset Structure

      Data Instances

    { 'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=64x64 at 0x1A800E8E190>, 'label': 15 }… See the full description on the dataset page: https://huggingface.co/datasets/zh-plus/tiny-imagenet.

  11. rag-mini-wikipedia

    • huggingface.co
    Updated May 5, 2025
    RAG Datasets (2025). rag-mini-wikipedia [Dataset]. https://huggingface.co/datasets/rag-datasets/rag-mini-wikipedia
    Dataset updated
    May 5, 2025
    Dataset authored and provided by
    RAG Datasets
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    In this huggingface discussion you can share what you used the dataset for. The dataset derives from https://www.kaggle.com/datasets/rtatman/questionanswer-dataset?resource=download; we generated our own subset using generate.py.

  12. SHP

    • huggingface.co
    • opendatalab.com
    Updated Mar 1, 2023
    Stanford NLP (2023). SHP [Dataset]. https://huggingface.co/datasets/stanfordnlp/SHP
    Dataset updated
    Mar 1, 2023
    Dataset authored and provided by
    Stanford NLP
    Description

    🚢 Stanford Human Preferences Dataset (SHP)

    If you mention this dataset in a paper, please cite the paper: Understanding Dataset Difficulty with V-Usable Information (ICML 2022).

      Summary
    

    SHP is a dataset of 385K collective human preferences over responses to questions/instructions in 18 different subject areas, from cooking to legal advice. The preferences are meant to reflect the helpfulness of one response over another, and are intended to be used for training RLHF… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/SHP.

  13. imagenet_1k_resized_256

    • huggingface.co
    Updated Feb 26, 2025
    Evan (2025). imagenet_1k_resized_256 [Dataset]. https://huggingface.co/datasets/evanarlian/imagenet_1k_resized_256
    Dataset updated
    Feb 26, 2025
    Authors
    Evan
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for "imagenet_1k_resized_256"

      Dataset summary
    

    The same ImageNet dataset, but with the smaller side of every image resized to 256. Many pretraining workflows resize images to 256 and then random-crop to 224x224, which is why 256 was chosen. The resized dataset can also be downloaded much faster and consumes less space than the original one. See here for detailed readme.
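
    The "smaller side resized to 256" rule can be sketched as a size computation. This is only an illustration of the arithmetic: the function name `resize_shorter_side` is hypothetical, and rounding to the nearest pixel is an assumption, since the dataset author's preprocessing may round differently.

```python
def resize_shorter_side(width: int, height: int, target: int = 256) -> tuple[int, int]:
    """Scale (width, height) so the shorter side equals `target`, keeping the aspect ratio."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = target / min(width, height)
    # max(...) guards against floating-point error pushing the short side below target.
    return max(target, round(width * scale)), max(target, round(height * scale))


print(resize_shorter_side(500, 375))  # (341, 256)
print(resize_shorter_side(375, 500))  # (256, 341)
```

    After this step every image is at least 256 pixels on both sides, so a 224x224 random crop always fits.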

      Dataset Structure
    

    Below is the example of one row of data. Note that the labels in… See the full description on the dataset page: https://huggingface.co/datasets/evanarlian/imagenet_1k_resized_256.

  14. ZOD

    • huggingface.co
    Updated Mar 12, 2024
    + more versions
    Zenseact (2024). ZOD [Dataset]. https://huggingface.co/datasets/Zenseact/ZOD
    Dataset updated
    Mar 12, 2024
    Dataset authored and provided by
    Zenseact
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The Zenseact Open Dataset (ZOD) is a large multi-modal autonomous driving (AD) dataset, created by researchers at Zenseact. It was collected over a 2-year period in 14 different European countries, using a fleet of vehicles equipped with a full sensor suite. The dataset consists of three subsets: Frames, Sequences, and Drives, designed to encompass both data diversity and support for spatiotemporal learning, sensor fusion, localization, and mapping. Together with the data, we have developed an SDK containing tutorials, downloading functionality, and a dataset API for easy access to the data. The development kit is available on GitHub.

  15. FStarDataSet-V2

    • huggingface.co
    Updated Sep 4, 2024
    Microsoft (2024). FStarDataSet-V2 [Dataset]. https://huggingface.co/datasets/microsoft/FStarDataSet-V2
    Dataset updated
    Sep 4, 2024
    Dataset authored and provided by
    Microsoft: http://microsoft.com/
    License

    https://choosealicense.com/licenses/cdla-permissive-2.0/

    Description

    This dataset is the Version 2.0 of microsoft/FStarDataSet.

      Primary-Objective
    

    This dataset's primary objective is to train and evaluate Proof-oriented Programming with AI (PoPAI, in short). Given a specification of a program and proof in F*, the objective of an AI model is to synthesize the implementation (see below for details about the usage of this dataset, including the input and output).

      Data Format
    

    Each of the examples in this dataset are organized as dictionaries… See the full description on the dataset page: https://huggingface.co/datasets/microsoft/FStarDataSet-V2.

  16. gsm8k

    • huggingface.co
    Updated Aug 11, 2022
    + more versions
    OpenAI (2022). gsm8k [Dataset]. https://huggingface.co/datasets/openai/gsm8k
    Dataset updated
    Aug 11, 2022
    Dataset authored and provided by
    OpenAI: https://openai.com/
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for GSM8K

      Dataset Summary
    

    GSM8K (Grade School Math 8K) is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning.

    These problems take between 2 and 8 steps to solve. Solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − ×÷) to reach the… See the full description on the dataset page: https://huggingface.co/datasets/openai/gsm8k.

  17. InternVid

    • huggingface.co
    • opendatalab.com
    Updated Jul 17, 2023
    + more versions
    OpenGVLab (2023). InternVid [Dataset]. https://huggingface.co/datasets/OpenGVLab/InternVid
    Dataset updated
    Jul 17, 2023
    Dataset authored and provided by
    OpenGVLab
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    InternVid

      InternVid-10M-FLT
    

    We present InternVid-10M-FLT, a subset of this dataset, consisting of 10 million video clips, with generated high-quality captions for publicly available web videos.

      Download
    

    The 10M samples are provided in a jsonlines file. Columns include the videoID, timestamps, generated caption, and their UMT similarity scores.

      How to Use
    

    from datasets import load_dataset
    dataset = load_dataset("OpenGVLab/InternVid")
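
    The jsonlines layout described under Download can also be read without any third-party library. The sketch below is a minimal illustration under stated assumptions: the key names (`video_id`, `start`, `end`, `caption`, `umt_score`) are hypothetical stand-ins for the columns listed above, the helper `iter_clips` is invented, and the two records are made-up toy data.

```python
import io
import json
from typing import Iterator, TextIO

SCORE_KEY = "umt_score"  # hypothetical key name for the UMT similarity score


def iter_clips(lines: TextIO, min_score: float = 0.0) -> Iterator[dict]:
    """Yield one dict per non-empty jsonlines row whose score is >= min_score."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        if float(record[SCORE_KEY]) >= min_score:
            yield record


# Toy records invented for illustration; real files may use different keys.
toy_jsonl = (
    '{"video_id": "abc", "start": "00:00:01.000", "end": "00:00:05.000", '
    '"caption": "a dog runs on a beach", "umt_score": 0.41}\n'
    '{"video_id": "def", "start": "00:01:10.000", "end": "00:01:14.000", '
    '"caption": "a person cooks pasta", "umt_score": 0.12}\n'
)

kept = list(iter_clips(io.StringIO(toy_jsonl), min_score=0.3))
print([clip["video_id"] for clip in kept])
```

    For a real file, pass an open file handle instead of `io.StringIO`; reading line by line keeps memory use flat even for millions of rows.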

      Method… See the full description on the dataset page: https://huggingface.co/datasets/OpenGVLab/InternVid.
    
  18. LLaVA-NeXT-Data

    • huggingface.co
    + more versions
    LMMs-Lab, LLaVA-NeXT-Data [Dataset]. https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data
    Dataset authored and provided by
    LMMs-Lab
    Description

    Dataset Card for LLaVA-NeXT

    We provide the full details of the LLaVA-NeXT dataset. This dataset includes the data that was used in the instruction tuning stage for LLaVA-NeXT and LLaVA-NeXT(stronger). Aug 30, 2024: We updated the dataset with a raw format (decompressed into a json file and images in a structured folder); you can download them directly if you are familiar with the LLaVA data format.

      Dataset Sources
    

    Compared to the instruction data mixture for LLaVA-1.5… See the full description on the dataset page: https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data.

  19. esb-datasets-test-only

    • huggingface.co
    Updated Sep 9, 2023
    + more versions
    Hugging Face for Audio (2023). esb-datasets-test-only [Dataset]. https://huggingface.co/datasets/hf-audio/esb-datasets-test-only
    Dataset updated
    Sep 9, 2023
    Dataset provided by
    Hugging Face: https://huggingface.co/
    Authors
    Hugging Face for Audio
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    All eight datasets in ESB can be downloaded and prepared in a single line of code through the Hugging Face Datasets library:

    from datasets import load_dataset

    librispeech = load_dataset("esb/datasets", "librispeech", split="train")

    "esb/datasets": the repository namespace. This is fixed for all ESB datasets.

    "librispeech": the dataset name. This can be changed to any one of the eight datasets in ESB to download that dataset.

    split="train": the split. Set this to one of… See the full description on the dataset page: https://huggingface.co/datasets/hf-audio/esb-datasets-test-only.
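
    The three arguments explained above can be wrapped in a small helper so that only the dataset name and split vary. This is a sketch, not part of ESB: `esb_load_kwargs` and `load_esb` are hypothetical names, the `datasets` package is a third-party dependency, and actually loading data needs network access and may require accepting the terms of the underlying datasets.

```python
ESB_REPO = "esb/datasets"  # fixed repository namespace, as stated above


def esb_load_kwargs(name: str, split: str = "train") -> dict:
    """Build the keyword arguments for datasets.load_dataset (hypothetical helper)."""
    if not name:
        raise ValueError("name must be a non-empty ESB dataset name")
    return {"path": ESB_REPO, "name": name, "split": split}


def load_esb(name: str, split: str = "train", **extra):
    """Load one ESB dataset; extra keyword arguments go straight to load_dataset."""
    # Imported lazily so the helper above works without the package installed.
    from datasets import load_dataset  # third-party: pip install datasets

    return load_dataset(**esb_load_kwargs(name, split), **extra)


print(esb_load_kwargs("librispeech"))
# Example call (downloads data, so it is left commented out):
# librispeech = load_esb("librispeech", split="train", streaming=True)
```

    Passing `streaming=True` is optional; it avoids downloading a full split before iterating, which can help with large audio corpora.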

  20. alpaca

    • huggingface.co
    • opendatalab.com
    Updated Mar 14, 2023
    + more versions
    Tatsu Lab (2023). alpaca [Dataset]. https://huggingface.co/datasets/tatsu-lab/alpaca
    Dataset updated
    Mar 14, 2023
    Dataset authored and provided by
    Tatsu Lab
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Dataset Card for Alpaca

      Dataset Summary
    

    Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used to conduct instruction-tuning for language models and make the language model follow instructions better. The authors built on the data generation pipeline from the Self-Instruct framework and made the following modifications:

    The text-davinci-003 engine to generate the instruction data instead… See the full description on the dataset page: https://huggingface.co/datasets/tatsu-lab/alpaca.

Data from: huggingface

download pretrained model files
