100+ datasets found
  1. load_dataset

    • huggingface.co
    Updated Dec 4, 2024
    Cite
    HANUEL GU (2024). load_dataset [Dataset]. https://huggingface.co/datasets/HANEUL999/load_dataset
    Dataset updated
    Dec 4, 2024
    Authors
    HANUEL GU
    Description

    HANEUL999/load_dataset dataset hosted on Hugging Face and contributed by the HF Datasets community

  2. cats-image

    • huggingface.co
    Updated Apr 23, 2022
    Cite
    Hugging Face (2022). cats-image [Dataset]. https://huggingface.co/datasets/huggingface/cats-image
    Dataset updated
    Apr 23, 2022
    Dataset authored and provided by
    Hugging Face (https://huggingface.co/)
    Description

    huggingface/cats-image dataset hosted on Hugging Face and contributed by the HF Datasets community

  3. huggingface-datasets-issues-2024-03-20

    • huggingface.co
    Updated Mar 20, 2024
    Cite
    Soonwook Hwang (2024). huggingface-datasets-issues-2024-03-20 [Dataset]. https://huggingface.co/datasets/hwang2006/huggingface-datasets-issues-2024-03-20
    Dataset updated
    Mar 20, 2024
    Authors
    Soonwook Hwang
    Description

    Load Dataset

        from datasets import load_dataset

        issues_dataset = load_dataset("hwang2006/huggingface-datasets-issues-2024-03-20", split="train")

    DatasetGenerationError: An error occurred while generating the dataset

        from huggingface_hub import hf_hub_url
        import pandas as pd
        from datasets import Dataset

        data_files = hf_hub_url(repo_id="hwang2006/huggingface-datasets-issues-2024-03-20", filename="datasets-issues-with-comments.jsonl", repo_type="dataset")
        print(data_files)…

    See the full description on the dataset page: https://huggingface.co/datasets/hwang2006/huggingface-datasets-issues-2024-03-20.
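
    The description is cut off, so the rest of the workaround is not shown here. As a rough, standard-library-only illustration of the general idea (parse the .jsonl file yourself when load_dataset cannot generate the dataset), the sketch below reads JSON Lines records one at a time. The sample records and their field names (number, title, comments) are invented for illustration and are not taken from the dataset.

```python
import io
import json


def read_jsonl(stream):
    """Parse a JSON Lines stream into a list of dicts, skipping blank lines."""
    records = []
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as err:
            # Report which line is malformed instead of failing opaquely.
            raise ValueError(f"line {line_number} is not valid JSON: {err}") from err
    return records


# Hypothetical two-record sample standing in for the downloaded file.
sample = io.StringIO(
    '{"number": 1, "title": "first issue", "comments": []}\n'
    "\n"
    '{"number": 2, "title": "second issue", "comments": ["thanks"]}\n'
)
issues = read_jsonl(sample)
print(len(issues), issues[1]["comments"])
```

    Parsing line by line makes it easier to locate a record that breaks schema inference, which is one plausible cause of a generation error.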

  4. huggingface_doc

    • huggingface.co
    Updated Jan 19, 2024
    Cite
    Aymeric Roucher (2024). huggingface_doc [Dataset]. https://huggingface.co/datasets/m-ric/huggingface_doc
    Dataset updated
    Jan 19, 2024
    Authors
    Aymeric Roucher
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    m-ric/huggingface_doc dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. MMLU-Pro

    • huggingface.co
    Updated Jul 3, 2024
    Cite
    Raja Biswas (2024). MMLU-Pro [Dataset]. https://huggingface.co/datasets/rbiswasfc/MMLU-Pro
    Dataset updated
    Jul 3, 2024
    Authors
    Raja Biswas
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    This dataset is derived from TIGER-Lab/MMLU-Pro by running the following script:

        from datasets import Dataset, load_dataset
        from sklearn.model_selection import GroupKFold

        data_df = load_dataset("TIGER-Lab/MMLU-Pro", split="test").to_pandas()
        data_df = data_df[data_df["options"].apply(len) == 10].copy()
        data_df = data_df.reset_index(drop=True)

        # train-test split

        def add_fold(df, n_splits=5, group_col="category"):
            skf = GroupKFold(n_splits=n_splits)

            for f, (t_, v_) in…

    See the full description on the dataset page: https://huggingface.co/datasets/rbiswasfc/MMLU-Pro.
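
    The script is truncated before the fold loop, so its exact behaviour is not visible. To illustrate what a grouped split guarantees (every row of a given group, here a category, lands in the same fold), the sketch below assigns folds with a simple greedy rule in pure Python. It is a simplified stand-in rather than scikit-learn's implementation, and the category labels are hypothetical.

```python
from collections import Counter


def assign_group_folds(groups, n_splits=5):
    """Return one fold index per row so that rows sharing a group share a fold."""
    counts = Counter(groups)
    if len(counts) < n_splits:
        raise ValueError("need at least as many distinct groups as folds")
    fold_sizes = [0] * n_splits
    group_to_fold = {}
    # Place the largest groups first, each into the currently smallest fold.
    for group, size in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        fold = fold_sizes.index(min(fold_sizes))
        group_to_fold[group] = fold
        fold_sizes[fold] += size
    return [group_to_fold[g] for g in groups]


# Hypothetical category labels, one per row.
categories = (
    ["math"] * 4 + ["law"] * 3 + ["physics"] * 3
    + ["history"] * 2 + ["biology"] * 2 + ["chemistry"]
)
folds = assign_group_folds(categories, n_splits=5)
print(sorted(set(zip(categories, folds))))
```

    Because a category never straddles two folds, a model validated on one fold is always evaluated on categories it did not see in that fold's training rows.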
    
  6. mt_bench_prompts

    • huggingface.co
    Updated Jul 3, 2023
    Cite
    Hugging Face H4 (2023). mt_bench_prompts [Dataset]. https://huggingface.co/datasets/HuggingFaceH4/mt_bench_prompts
    Dataset updated
    Jul 3, 2023
    Dataset provided by
    Hugging Face (https://huggingface.co/)
    Authors
    Hugging Face H4
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    MT Bench by LMSYS

    This set of evaluation prompts is created by the LMSYS org for better evaluation of chat models. For more information, see the paper.

      Dataset loading
    

    To load this dataset, use 🤗 datasets:

        from datasets import load_dataset

        data = load_dataset("HuggingFaceH4/mt_bench_prompts", split="train")

      Dataset creation
    

    To create the dataset, we do the following for our internal tooling.

    rename turns to prompts, add empty reference to remaining prompts… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/mt_bench_prompts.

  7. excelformer

    • huggingface.co
    Updated Jun 14, 2024
    Cite
    Jiahuan Yan (2024). excelformer [Dataset]. https://huggingface.co/datasets/jyansir/excelformer
    Dataset updated
    Jun 14, 2024
    Authors
    Jiahuan Yan
    Description

    ExcelFormer Benchmark

    The datasets used in ExcelFormer. The usage example is as follows:

        from datasets import load_dataset
        import pandas as pd
        import numpy as np

        # process train split, similar to other splits
        data = {}
        datasets = load_dataset('jyansir/excelformer') # load 96 small-scale datasets in default
        # datasets = load_dataset('jyansir/excelformer', 'large') # load 21 large-scale datasets with specification
        dataset = datasets['train'].to_dict()
        for table_name, table, task in…

    See the full description on the dataset page: https://huggingface.co/datasets/jyansir/excelformer.

  8. the_cauldron

    • huggingface.co
    Updated Apr 15, 2024
    Cite
    HuggingFaceM4 (2024). the_cauldron [Dataset]. https://huggingface.co/datasets/HuggingFaceM4/the_cauldron
    Dataset updated
    Apr 15, 2024
    Dataset authored and provided by
    HuggingFaceM4
    Description

    Dataset Card for The Cauldron

      Dataset description
    

    The Cauldron is part of the Idefics2 release. It is a massive collection of 50 vision-language datasets (training sets only) that were used for the fine-tuning of the vision-language model Idefics2.

      Load the dataset
    

    To load the dataset, install the library datasets with pip install datasets. Then,

        from datasets import load_dataset

        ds = load_dataset("HuggingFaceM4/the_cauldron", "ai2d")

    to download and load the… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceM4/the_cauldron.

  9. helpful-instructions

    • huggingface.co
    Updated Jul 20, 2023
    Cite
    Hugging Face H4 (2023). helpful-instructions [Dataset]. https://huggingface.co/datasets/HuggingFaceH4/helpful-instructions
    Dataset updated
    Jul 20, 2023
    Dataset provided by
    Hugging Face (https://huggingface.co/)
    Authors
    Hugging Face H4
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset Card for Helpful Instructions

      Dataset Summary
    

    Helpful Instructions is a dataset of (instruction, demonstration) pairs that are derived from public datasets. As the name suggests, it focuses on instructions that are "helpful", i.e. the kind of questions or tasks a human user might instruct an AI assistant to perform. You can load the dataset as follows:

        from datasets import load_dataset

        # Load all subsets
        helpful_instructions =…

    See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/helpful-instructions.

  10. minds14

    • huggingface.co
    Updated Apr 24, 2022
    Cite
    PolyAI (2022). minds14 [Dataset]. https://huggingface.co/datasets/PolyAI/minds14
    Dataset updated
    Apr 24, 2022
    Dataset authored and provided by
    PolyAI
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    MInDS-14

    MInDS-14 is a training and evaluation resource for the intent detection task with spoken data. It covers 14 intents extracted from a commercial system in the e-banking domain, associated with spoken examples in 14 diverse language varieties.

      Example
    

    MInDS-14 can be downloaded and used as follows:

        from datasets import load_dataset

        minds_14 = load_dataset("PolyAI/minds14", "fr-FR") # for French

        # to download all data for multi-lingual fine-tuning uncomment following…

    See the full description on the dataset page: https://huggingface.co/datasets/PolyAI/minds14.

  11. multi30k

    • huggingface.co
    Updated Jul 17, 2023
    Cite
    Ben Trevett (2023). multi30k [Dataset]. https://huggingface.co/datasets/bentrevett/multi30k
    Dataset updated
    Jul 17, 2023
    Authors
    Ben Trevett
    Description

    Multi30k

    This dataset contains the "multi30k" dataset, which is the "task 1" dataset from here. Each example consists of an "en" and a "de" feature. "en" is an English sentence, and "de" is the German translation of the English sentence.

      Data Splits
    

    The Multi30k dataset has 3 splits: train, validation, and test.

    Dataset Split    Number of Instances in Split
    Train            29,000
    Validation       1,014
    Test             1,000

      Citation Information… See the full description on the dataset page: https://huggingface.co/datasets/bentrevett/multi30k.
    
  12. esb-datasets-test-only

    • huggingface.co
    Updated Sep 9, 2023
    Cite
    Hugging Face for Audio (2023). esb-datasets-test-only [Dataset]. https://huggingface.co/datasets/hf-audio/esb-datasets-test-only
    Dataset updated
    Sep 9, 2023
    Dataset provided by
    Hugging Face (https://huggingface.co/)
    Authors
    Hugging Face for Audio
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    All eight datasets in ESB can be downloaded and prepared in just a single line of code through the Hugging Face Datasets library:

        from datasets import load_dataset

        librispeech = load_dataset("esb/datasets", "librispeech", split="train")

    "esb/datasets": the repository namespace. This is fixed for all ESB datasets.

    "librispeech": the dataset name. This can be changed to any one of the eight datasets in ESB to download that dataset.

    split="train": the split. Set this to one of… See the full description on the dataset page: https://huggingface.co/datasets/hf-audio/esb-datasets-test-only.

  13. movie_QA

    • huggingface.co
    Updated Feb 15, 2025
    Cite
    Hi Truong (2025). movie_QA [Dataset]. https://huggingface.co/datasets/HiTruong/movie_QA
    Dataset updated
    Feb 15, 2025
    Authors
    Hi Truong
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset Name

    This dataset contains 75.9k rows of question-answer pairs, split into training and testing sets.

      Splits
    

    train_v1: 20,000 rows
    train_v2: 20,000 rows
    train_v3: 20,000 rows
    test: 15,900 rows

      Usage
    

    You can load the dataset using the Hugging Face datasets library:

        from datasets import load_dataset

        dataset = load_dataset("HiTruong/movie_QA")

  14. SlimPajama-627B

    • huggingface.co
    • opendatalab.com
    Updated Oct 2, 2012
    Cite
    Cerebras (2012). SlimPajama-627B [Dataset]. https://huggingface.co/datasets/cerebras/SlimPajama-627B
    Dataset updated
    Oct 2, 2012
    Dataset authored and provided by
    Cerebras (http://cerebras.ai/)
    Description

    The dataset consists of 59166 jsonl files and is ~895GB compressed. It is a cleaned and deduplicated version of Together's RedPajama. Check out our blog post explaining our methods, our code on GitHub, and join the discussion on the Cerebras Discord.

      Getting Started
    

    You can download the dataset using Hugging Face datasets:

        from datasets import load_dataset

        ds = load_dataset("cerebras/SlimPajama-627B")

      Background
    

    Today we are releasing SlimPajama – the largest… See the full description on the dataset page: https://huggingface.co/datasets/cerebras/SlimPajama-627B.

  15. covertype

    • huggingface.co
    Updated Feb 23, 2022
    Cite
    Mattia (2022). covertype [Dataset]. https://huggingface.co/datasets/mstz/covertype
    Dataset updated
    Feb 23, 2022
    Authors
    Mattia
    License

    https://choosealicense.com/licenses/cc/

    Description

    Covertype

    Classification of pixels into 7 forest cover types based on attributes such as elevation, aspect, slope, hillshade, soil-type, and more. The Covertype dataset from the UCI ML repository.

    Configuration    Task                         Description
    covertype        Multiclass classification    Classify the area as one of 7 cover classes.

      Usage
    

        from datasets import load_dataset

        dataset = load_dataset("mstz/covertype")["train"]

  16. rlaif-v_formatted

    • huggingface.co
    Updated Nov 26, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Hugging Face H4 (2024). rlaif-v_formatted [Dataset]. https://huggingface.co/datasets/HuggingFaceH4/rlaif-v_formatted
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 26, 2024
    Dataset provided by
    Hugging Face (https://huggingface.co/)
    Authors
    Hugging Face H4
    Description

        from datasets import load_dataset, features

        def format(examples):
            """
            Convert prompt from "xxx" to [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "xxx"}]}]
            and chosen and rejected from "xxx" to [{"role": "assistant", "content": [{"type": "text", "text": "xxx"}]}].
            Images are wrapped in a list.
            """
            output = {"images": [], "prompt": [], "chosen": [], "rejected": []}
            for image, question, chosen, rejected in zip(examples["image"]…

    See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/rlaif-v_formatted.
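
    The function body is cut off, but its docstring spells out the target structure. The sketch below applies that conversion to a single example in plain Python, as a reading aid only. The helper name to_chat_format, the sample strings, and the string placeholder used instead of a real image object are all assumptions, and the original batched function may differ in its details.

```python
def to_chat_format(prompt, chosen, rejected, image):
    """Convert one raw example into the chat-style structure the docstring describes."""
    return {
        # Images are wrapped in a list.
        "images": [image],
        # The user turn carries an image slot followed by the question text.
        "prompt": [
            {
                "role": "user",
                "content": [{"type": "image"}, {"type": "text", "text": prompt}],
            }
        ],
        # Chosen and rejected answers become single assistant turns.
        "chosen": [
            {"role": "assistant", "content": [{"type": "text", "text": chosen}]}
        ],
        "rejected": [
            {"role": "assistant", "content": [{"type": "text", "text": rejected}]}
        ],
    }


# Hypothetical example; a real row would hold an image object, not a string.
example = to_chat_format(
    "What is shown?", "A cat.", "A dog.", image="<image placeholder>"
)
print(example["prompt"][0]["content"])
```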

  17. natural-questions

    • huggingface.co
    Updated Aug 4, 2023
    Cite
    ir-datasets (2023). natural-questions [Dataset]. https://huggingface.co/datasets/irds/natural-questions
    Dataset updated
    Aug 4, 2023
    Dataset authored and provided by
    ir-datasets
    Description

    Dataset Card for natural-questions

    The natural-questions dataset, provided by the ir-datasets package. For more information about the dataset, see the documentation.

      Data
    

    This dataset provides:

    docs (documents, i.e., the corpus); count=28,390,850

      Usage
    

        from datasets import load_dataset

        docs = load_dataset('irds/natural-questions', 'docs')
        for record in docs:
            record # {'doc_id': ..., 'text': ..., 'html': ..., 'start_byte': ..., 'end_byte': ...…

    See the full description on the dataset page: https://huggingface.co/datasets/irds/natural-questions.

  18. oasst1

    • huggingface.co
    Updated Apr 12, 2023
    Cite
    OpenAssistant (2023). oasst1 [Dataset]. https://huggingface.co/datasets/OpenAssistant/oasst1
    Dataset updated
    Apr 12, 2023
    Dataset authored and provided by
    OpenAssistant
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    OpenAssistant Conversations Dataset (OASST1)

      Dataset Summary
    

    In an effort to democratize research on large-scale alignment, we release OpenAssistant Conversations (OASST1), a human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages in 35 different languages, annotated with 461,292 quality ratings, resulting in over 10,000 fully annotated conversation trees. The corpus is a product of a worldwide crowd-sourcing effort… See the full description on the dataset page: https://huggingface.co/datasets/OpenAssistant/oasst1.

  19. scientific_papers

    • huggingface.co
    • tensorflow.org
    Updated Feb 21, 2021
    Cite
    Arman Cohan (2021). scientific_papers [Dataset]. https://huggingface.co/datasets/armanc/scientific_papers
    Dataset updated
    Feb 21, 2021
    Authors
    Arman Cohan
    License

    https://choosealicense.com/licenses/unknown/

    Description

    The scientific papers dataset contains two sets of long and structured documents, obtained from the ArXiv and PubMed OpenAccess repositories.

    Both "arxiv" and "pubmed" have three features:

    - article: the body of the document, paragraphs separated by "/n".
    - abstract: the abstract of the document, paragraphs separated by "/n".
    - section_names: titles of sections, separated by "/n".
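
    Since each feature is a single string with an embedded separator, a consumer typically splits it before use. The sketch below shows that step, assuming the "/n" in the description denotes the newline character. The record contents are invented for illustration and the helper name split_fields is hypothetical.

```python
def split_fields(record, separator="\n"):
    """Split each separator-joined feature of a record into a list of parts."""
    return {
        "article": record["article"].split(separator),
        "abstract": record["abstract"].split(separator),
        "section_names": record["section_names"].split(separator),
    }


# Hypothetical record shaped like the features listed in the description.
record = {
    "article": "First paragraph of the body.\nSecond paragraph of the body.",
    "abstract": "A one-paragraph abstract.",
    "section_names": "Introduction\nMethods\nResults",
}
parts = split_fields(record)
print(len(parts["article"]), parts["section_names"])
```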

  20. dialogsum

    • huggingface.co
    Updated Jun 29, 2022
    Cite
    Karthick Kaliannan Neelamohan (2022). dialogsum [Dataset]. https://huggingface.co/datasets/knkarthick/dialogsum
    Dataset updated
    Jun 29, 2022
    Authors
    Karthick Kaliannan Neelamohan
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
    License information was derived automatically

    Description

    Dataset Card for DIALOGSum Corpus

      Dataset Description
    
    
    
    
    
      Links
    

    Homepage: https://aclanthology.org/2021.findings-acl.449
    Repository: https://github.com/cylnlp/dialogsum
    Paper: https://aclanthology.org/2021.findings-acl.449
    Point of Contact: https://huggingface.co/knkarthick

      Dataset Summary
    

    DialogSum is a large-scale dialogue summarization dataset, consisting of 13,460 (Plus 100 holdout data for topic generation) dialogues with corresponding… See the full description on the dataset page: https://huggingface.co/datasets/knkarthick/dialogsum.
