HANEUL999/load_dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
huggingface/cats-image dataset hosted on Hugging Face and contributed by the HF Datasets community
Load Dataset

```python
from datasets import load_dataset
from huggingface_hub import hf_hub_url
import pandas as pd
from datasets import Dataset

data_files = hf_hub_url(
    repo_id="hwang2006/huggingface-datasets-issues-2024-03-20",
    filename="datasets-issues-with-comments.jsonl",
    repo_type="dataset",
)
print(data_files)
```

… See the full description on the dataset page: https://huggingface.co/datasets/hwang2006/huggingface-datasets-issues-2024-03-20.
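The snippet cuts off after printing the resolved URL. A minimal sketch of a plausible continuation, assuming the JSONL file is read with pandas and wrapped in a Dataset (this continuation is not shown on the card):

```python
# Plausible continuation (not from the card): read the resolved JSONL URL
# with pandas and wrap it in a datasets.Dataset.
issues_df = pd.read_json(data_files, lines=True)
issues_dataset = Dataset.from_pandas(issues_df)
print(issues_dataset)
```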
MIT License: https://opensource.org/licenses/MIT
m-ric/huggingface_doc dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
This dataset is derived from TIGER-Lab/MMLU-Pro by running the following script:

```python
from datasets import Dataset, load_dataset
from sklearn.model_selection import GroupKFold

data_df = load_dataset("TIGER-Lab/MMLU-Pro", split="test").to_pandas()
data_df = data_df[data_df["options"].apply(len) == 10].copy()
data_df = data_df.reset_index(drop=True)

def add_fold(df, n_splits=5, group_col="category"):
    skf = GroupKFold(n_splits=n_splits)
    for f, (t_, v_) in …
```

… See the full description on the dataset page: https://huggingface.co/datasets/rbiswasfc/MMLU-Pro.
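The fold loop is truncated above. A sketch of the standard GroupKFold completion, assuming the fold index is written into a new fold column (the column name and the return value are assumptions, not from the card):

```python
# Hypothetical completion of add_fold; the card's script is truncated here.
def add_fold(df, n_splits=5, group_col="category"):
    skf = GroupKFold(n_splits=n_splits)
    for f, (t_, v_) in enumerate(skf.split(df, groups=df[group_col])):
        df.loc[v_, "fold"] = f  # mark the validation rows of fold f
    return df

data_df = add_fold(data_df)
```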
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
MT Bench by LMSYS
This set of evaluation prompts was created by the LMSYS org for better evaluation of chat models. For more information, see the paper.
Dataset loading
To load this dataset, use 🤗 datasets:

```python
from datasets import load_dataset

data = load_dataset("HuggingFaceH4/mt_bench_prompts", split="train")
```
Dataset creation
To create the dataset, we do the following with our internal tooling:
rename turns to prompts, add empty reference to remaining prompts… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/mt_bench_prompts.
ExcelFormer Benchmark
The datasets used in ExcelFormer. A usage example is as follows:

```python
from datasets import load_dataset
import pandas as pd
import numpy as np

data = {}
datasets = load_dataset('jyansir/excelformer')  # loads the 96 small-scale datasets by default

dataset = datasets['train'].to_dict()
for table_name, table, task in …
```

… See the full description on the dataset page: https://huggingface.co/datasets/jyansir/excelformer.
Dataset Card for The Cauldron
Dataset description
The Cauldron is part of the Idefics2 release. It is a massive collection of 50 vision-language datasets (training sets only) that were used for the fine-tuning of the vision-language model Idefics2.
Load the dataset
To load the dataset, install the datasets library with pip install datasets. Then:

```python
from datasets import load_dataset

ds = load_dataset("HuggingFaceM4/the_cauldron", "ai2d")
```

to download and load the… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceM4/the_cauldron.
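A quick sanity check after loading; the train split follows from the card's note that only training sets are included, while the exact field names are not shown in this snippet:

```python
# Inspect the first example of the (assumed) train split.
example = ds["train"][0]
print(example.keys())
```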
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
Dataset Card for Helpful Instructions
Dataset Summary
Helpful Instructions is a dataset of (instruction, demonstration) pairs derived from public datasets. As the name suggests, it focuses on instructions that are "helpful", i.e. the kind of questions or tasks a human user might instruct an AI assistant to perform. You can load the dataset as follows:

```python
from datasets import load_dataset

helpful_instructions = …
```

… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/helpful-instructions.
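The assignment above is truncated on the card; given the repo id in the dataset page URL, a plausible completion would be:

```python
# Plausible completion; repo id inferred from the dataset page URL.
helpful_instructions = load_dataset("HuggingFaceH4/helpful-instructions")
```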
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
MInDS-14
MINDS-14 is a training and evaluation resource for the intent detection task with spoken data. It covers 14 intents extracted from a commercial system in the e-banking domain, associated with spoken examples in 14 diverse language varieties.
Example
MInDS-14 can be downloaded and used as follows:

```python
from datasets import load_dataset

minds_14 = load_dataset("PolyAI/minds14", "fr-FR")  # for French
```
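A minimal follow-up sketch for inspecting what was loaded; the split and field names are assumptions, since the snippet does not show them:

```python
# Hypothetical inspection; split/field names are assumed, not from the card.
print(minds_14)
example = minds_14["train"][0]
print(example.keys())
```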
Multi30k
This dataset contains the "multi30k" dataset, which is the "task 1" dataset from here. Each example consists of an "en" and a "de" feature. "en" is an English sentence, and "de" is the German translation of the English sentence.
Data Splits
The Multi30k dataset has 3 splits: train, validation, and test.
| Dataset Split | Number of Instances in Split |
| --- | --- |
| Train | 29,000 |
| Validation | 1,014 |
| Test | 1,000 |
Citation Information… See the full description on the dataset page: https://huggingface.co/datasets/bentrevett/multi30k.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
All eight of the datasets in ESB can be downloaded and prepared in just a single line of code through the Hugging Face Datasets library:

```python
from datasets import load_dataset

librispeech = load_dataset("esb/datasets", "librispeech", split="train")
```
"esb/datasets": the repository namespace. This is fixed for all ESB datasets.
"librispeech": the dataset name. This can be changed to any of any one of the eight datasets in ESB to download that dataset.
split="train": the split. Set this to one of… See the full description on the dataset page: https://huggingface.co/datasets/hf-audio/esb-datasets-test-only.
MIT License: https://opensource.org/licenses/MIT
Dataset Name
This dataset contains 75.9k rows of question-answer pairs, split into training and testing sets.
Splits
- train_v1: 20,000 rows
- train_v2: 20,000 rows
- train_v3: 20,000 rows
- test: 15,900 rows
Usage
You can load the dataset using the Hugging Face datasets library:

```python
from datasets import load_dataset

dataset = load_dataset("HiTruong/movie_QA")
```
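Since the card names several splits, individual splits can presumably be requested directly; the split names below come from the Splits section above:

```python
# Split names taken from the card's "Splits" list above.
train_v1 = load_dataset("HiTruong/movie_QA", split="train_v1")
test = load_dataset("HiTruong/movie_QA", split="test")
```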
The dataset consists of 59,166 jsonl files and is ~895GB compressed. It is a cleaned and deduplicated version of Together's RedPajama. Check out our blog post explaining our methods, our code on GitHub, and join the discussion on the Cerebras Discord.
Getting Started
You can download the dataset using Hugging Face datasets:

```python
from datasets import load_dataset

ds = load_dataset("cerebras/SlimPajama-627B")
```
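Given the ~895GB compressed size noted above, streaming the corpus instead of downloading it in full may be preferable; this uses the standard datasets streaming mode and is a sketch, not advice from the card (the text field name is an assumption):

```python
# Stream records instead of downloading ~895GB up front.
ds_stream = load_dataset("cerebras/SlimPajama-627B", streaming=True)
for record in ds_stream["train"]:
    print(record["text"][:200])  # "text" field name is an assumption
    break
```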
Background
Today we are releasing SlimPajama – the largest… See the full description on the dataset page: https://huggingface.co/datasets/cerebras/SlimPajama-627B.
License: https://choosealicense.com/licenses/cc/
Covertype
Classification of pixels into 7 forest cover types based on attributes such as elevation, aspect, slope, hillshade, soil-type, and more. The Covertype dataset from the UCI ML repository.
| Configuration | Task | Description |
| --- | --- | --- |
| covertype | Multiclass classification | Classify the area as one of 7 cover classes. |
Usage
```python
from datasets import load_dataset

dataset = load_dataset("mstz/covertype")["train"]
```
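For tabular classification it is common to move the split into pandas; a minimal sketch using the standard Dataset.to_pandas() conversion (the column names are not shown in this snippet):

```python
# Convert the train split to a DataFrame for tabular work.
df = dataset.to_pandas()
print(df.shape)
print(df.columns.tolist())
```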
```python
from datasets import load_dataset, features

def format(examples):
    """
    Convert prompt from "xxx" to
    [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "xxx"}]}]
    and chosen and rejected from "xxx" to
    [{"role": "assistant", "content": [{"type": "text", "text": "xxx"}]}].
    Images are wrapped in a list.
    """
    output = {"images": [], "prompt": [], "chosen": [], "rejected": []}
    for image, question, chosen, rejected in zip(examples["image"]…
```

… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/rlaif-v_formatted.
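Because format receives columnar batches, it would presumably be applied with Dataset.map(batched=True); this is a sketch, and the source repo id below is an assumption not shown in the snippet:

```python
# Sketch: apply the (truncated) batched formatter to the raw source dataset.
# The source repo id "openbmb/RLAIF-V-Dataset" is an assumption.
raw = load_dataset("openbmb/RLAIF-V-Dataset", split="train")
formatted = raw.map(format, batched=True, remove_columns=raw.column_names)
```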
Dataset Card for natural-questions
The natural-questions dataset, provided by the ir-datasets package. For more information about the dataset, see the documentation.
Data
This dataset provides:
- docs (documents, i.e., the corpus); count=28,390,850
Usage
```python
from datasets import load_dataset

docs = load_dataset('irds/natural-questions', 'docs')
for record in docs:
    record  # {'doc_id': ..., 'text': ..., 'html': ..., 'start_byte': ..., 'end_byte': ...
```

… See the full description on the dataset page: https://huggingface.co/datasets/irds/natural-questions.
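With roughly 28.4M corpus documents, it is usually better to peek at a few records than to iterate everything; a small sketch with itertools (not from the card):

```python
import itertools

# Inspect only the first three corpus records.
for record in itertools.islice(docs, 3):
    print(record["doc_id"])  # field name taken from the record comment above
```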
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
OpenAssistant Conversations Dataset (OASST1)
Dataset Summary
In an effort to democratize research on large-scale alignment, we release OpenAssistant Conversations (OASST1), a human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages in 35 different languages, annotated with 461,292 quality ratings, resulting in over 10,000 fully annotated conversation trees. The corpus is a product of a worldwide crowd-sourcing effort… See the full description on the dataset page: https://huggingface.co/datasets/OpenAssistant/oasst1.
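The summary stops before any loading snippet; by analogy with the other cards here, the corpus would presumably be loaded with the repo id from the dataset page URL:

```python
from datasets import load_dataset

# Repo id inferred from the dataset page URL.
oasst1 = load_dataset("OpenAssistant/oasst1")
print(oasst1)
```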
License: https://choosealicense.com/licenses/unknown/
The scientific papers dataset contains two sets of long and structured documents, obtained from the ArXiv and PubMed OpenAccess repositories.
Both "arxiv" and "pubmed" have three features:
- article: the body of the document, paragraphs separated by "\n".
- abstract: the abstract of the document, paragraphs separated by "\n".
- section_names: titles of sections, separated by "\n".
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
Dataset Card for DIALOGSum Corpus
Dataset Description
Links
- Homepage: https://aclanthology.org/2021.findings-acl.449
- Repository: https://github.com/cylnlp/dialogsum
- Paper: https://aclanthology.org/2021.findings-acl.449
- Point of Contact: https://huggingface.co/knkarthick
Dataset Summary
DialogSum is a large-scale dialogue summarization dataset, consisting of 13,460 dialogues (plus 100 held-out dialogues for topic generation) with corresponding… See the full description on the dataset page: https://huggingface.co/datasets/knkarthick/dialogsum.
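As with the other cards, a minimal loading sketch; the repo id comes from the dataset page URL above, and the split layout is an assumption:

```python
from datasets import load_dataset

# Repo id inferred from the dataset page URL above.
dialogsum = load_dataset("knkarthick/dialogsum")
print(dialogsum)
```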