HANEUL999/load_dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
huggingface/cats-image dataset hosted on Hugging Face and contributed by the HF Datasets community
Load Dataset

```python
from datasets import load_dataset
from huggingface_hub import hf_hub_url
import pandas as pd
from datasets import Dataset

data_files = hf_hub_url(
    repo_id="hwang2006/huggingface-datasets-issues-2024-03-20",
    filename="datasets-issues-with-comments.jsonl",
    repo_type="dataset",
)
print(data_files)
```

… See the full description on the dataset page: https://huggingface.co/datasets/hwang2006/huggingface-datasets-issues-2024-03-20.
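The snippet cuts off after printing the resolved URL. A minimal sketch of a plausible continuation, assuming the JSONL file is read with pandas and wrapped in a Dataset (this continuation is not shown on the card):

```python
# Plausible continuation (not from the card): read the resolved JSONL URL
# with pandas and wrap it in a datasets.Dataset.
issues_df = pd.read_json(data_files, lines=True)
issues_dataset = Dataset.from_pandas(issues_df)
print(issues_dataset)
```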
MIT License: https://opensource.org/licenses/MIT
m-ric/huggingface_doc dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
This dataset is derived from TIGER-Lab/MMLU-Pro by running the following script:

```python
from datasets import Dataset, load_dataset
from sklearn.model_selection import GroupKFold

data_df = load_dataset("TIGER-Lab/MMLU-Pro", split="test").to_pandas()
data_df = data_df[data_df["options"].apply(len) == 10].copy()
data_df = data_df.reset_index(drop=True)

def add_fold(df, n_splits=5, group_col="category"):
    skf = GroupKFold(n_splits=n_splits)
    for f, (t_, v_) in …
```

… See the full description on the dataset page: https://huggingface.co/datasets/rbiswasfc/MMLU-Pro.
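The fold loop is truncated above. A sketch of the standard GroupKFold completion, assuming the fold index is written into a new fold column (the column name and the return value are assumptions, not from the card):

```python
# Hypothetical completion of add_fold; the card's script is truncated here.
def add_fold(df, n_splits=5, group_col="category"):
    skf = GroupKFold(n_splits=n_splits)
    for f, (t_, v_) in enumerate(skf.split(df, groups=df[group_col])):
        df.loc[v_, "fold"] = f  # mark the validation rows of fold f
    return df

data_df = add_fold(data_df)
```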
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
MT Bench by LMSYS
This set of evaluation prompts was created by the LMSYS org for better evaluation of chat models. For more information, see the paper.
Dataset loading
To load this dataset, use 🤗 datasets:

```python
from datasets import load_dataset

data = load_dataset("HuggingFaceH4/mt_bench_prompts", split="train")
```
Dataset creation
To create the dataset, we do the following with our internal tooling:
rename turns to prompts, add empty reference to remaining prompts… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/mt_bench_prompts.
ExcelFormer Benchmark
The datasets used in ExcelFormer. A usage example is as follows:

```python
from datasets import load_dataset
import pandas as pd
import numpy as np

data = {}
datasets = load_dataset('jyansir/excelformer')  # loads the 96 small-scale datasets by default

dataset = datasets['train'].to_dict()
for table_name, table, task in …
```

… See the full description on the dataset page: https://huggingface.co/datasets/jyansir/excelformer.
Dataset Card for The Cauldron
Dataset description
The Cauldron is part of the Idefics2 release. It is a massive collection of 50 vision-language datasets (training sets only) that were used for the fine-tuning of the vision-language model Idefics2.
Load the dataset
To load the dataset, install the datasets library with pip install datasets. Then:

```python
from datasets import load_dataset

ds = load_dataset("HuggingFaceM4/the_cauldron", "ai2d")
```

to download and load the… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceM4/the_cauldron.
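A quick sanity check after loading; the train split follows from the card's note that only training sets are included, while the exact field names are not shown in this snippet:

```python
# Inspect the first example of the (assumed) train split.
example = ds["train"][0]
print(example.keys())
```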
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
Dataset Card for Helpful Instructions
Dataset Summary
Helpful Instructions is a dataset of (instruction, demonstration) pairs derived from public datasets. As the name suggests, it focuses on instructions that are "helpful", i.e. the kind of questions or tasks a human user might instruct an AI assistant to perform. You can load the dataset as follows:

```python
from datasets import load_dataset

helpful_instructions = …
```

… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/helpful-instructions.
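The assignment above is truncated on the card; given the repo id in the dataset page URL, a plausible completion would be:

```python
# Plausible completion; repo id inferred from the dataset page URL.
helpful_instructions = load_dataset("HuggingFaceH4/helpful-instructions")
```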
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
MInDS-14
MINDS-14 is a training and evaluation resource for the intent detection task with spoken data. It covers 14 intents extracted from a commercial system in the e-banking domain, associated with spoken examples in 14 diverse language varieties.
Example
MInDS-14 can be downloaded and used as follows:

```python
from datasets import load_dataset

minds_14 = load_dataset("PolyAI/minds14", "fr-FR")  # for French
```
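A minimal follow-up sketch for inspecting what was loaded; the split and field names are assumptions, since the snippet does not show them:

```python
# Hypothetical inspection; split/field names are assumed, not from the card.
print(minds_14)
example = minds_14["train"][0]
print(example.keys())
```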
Multi30k
This dataset contains the "multi30k" dataset, which is the "task 1" dataset from here. Each example consists of an "en" and a "de" feature. "en" is an English sentence, and "de" is the German translation of the English sentence.
Data Splits
The Multi30k dataset has 3 splits: train, validation, and test.
| Dataset Split | Number of Instances in Split |
| --- | --- |
| Train | 29,000 |
| Validation | 1,014 |
| Test | 1,000 |
Citation Information… See the full description on the dataset page: https://huggingface.co/datasets/bentrevett/multi30k.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
All eight of the datasets in ESB can be downloaded and prepared in just a single line of code through the Hugging Face Datasets library:

```python
from datasets import load_dataset

librispeech = load_dataset("esb/datasets", "librispeech", split="train")
```
"esb/datasets": the repository namespace. This is fixed for all ESB datasets.
"librispeech": the dataset name. This can be changed to any of any one of the eight datasets in ESB to download that dataset.
split="train": the split. Set this to one of… See the full description on the dataset page: https://huggingface.co/datasets/hf-audio/esb-datasets-test-only.
MIT License: https://opensource.org/licenses/MIT
Dataset Name
This dataset contains 75.9k rows of question-answer pairs, split into training and testing sets.
Splits
- train_v1: 20,000 rows
- train_v2: 20,000 rows
- train_v3: 20,000 rows
- test: 15,900 rows
Usage
You can load the dataset using the Hugging Face datasets library:

```python
from datasets import load_dataset

dataset = load_dataset("HiTruong/movie_QA")
```
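Since the card names several splits, individual splits can presumably be requested directly; the split names below come from the Splits section above:

```python
# Split names taken from the card's "Splits" list above.
train_v1 = load_dataset("HiTruong/movie_QA", split="train_v1")
test = load_dataset("HiTruong/movie_QA", split="test")
```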
The dataset consists of 59,166 jsonl files and is ~895GB compressed. It is a cleaned and deduplicated version of Together's RedPajama. Check out our blog post explaining our methods, our code on GitHub, and join the discussion on the Cerebras Discord.
Getting Started
You can download the dataset using Hugging Face datasets:

```python
from datasets import load_dataset

ds = load_dataset("cerebras/SlimPajama-627B")
```
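Given the ~895GB compressed size noted above, streaming the corpus instead of downloading it in full may be preferable; this uses the standard datasets streaming mode and is a sketch, not advice from the card (the text field name is an assumption):

```python
# Stream records instead of downloading ~895GB up front.
ds_stream = load_dataset("cerebras/SlimPajama-627B", streaming=True)
for record in ds_stream["train"]:
    print(record["text"][:200])  # "text" field name is an assumption
    break
```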
Background
Today we are releasing SlimPajama – the largest… See the full description on the dataset page: https://huggingface.co/datasets/cerebras/SlimPajama-627B.
License: https://choosealicense.com/licenses/cc/
Covertype
Classification of pixels into 7 forest cover types based on attributes such as elevation, aspect, slope, hillshade, soil-type, and more. The Covertype dataset from the UCI ML repository.
| Configuration | Task | Description |
| --- | --- | --- |
| covertype | Multiclass classification | Classify the area as one of 7 cover classes. |
Usage
```python
from datasets import load_dataset

dataset = load_dataset("mstz/covertype")["train"]
```
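For tabular classification it is common to move the split into pandas; a minimal sketch using the standard Dataset.to_pandas() conversion (the column names are not shown in this snippet):

```python
# Convert the train split to a DataFrame for tabular work.
df = dataset.to_pandas()
print(df.shape)
print(df.columns.tolist())
```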
```python
from datasets import load_dataset, features

def format(examples):
    """
    Convert prompt from "xxx" to
    [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "xxx"}]}]
    and chosen and rejected from "xxx" to
    [{"role": "assistant", "content": [{"type": "text", "text": "xxx"}]}].
    Images are wrapped in a list.
    """
    output = {"images": [], "prompt": [], "chosen": [], "rejected": []}
    for image, question, chosen, rejected in zip(examples["image"]…
```

… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/rlaif-v_formatted.
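Because format receives columnar batches, it would presumably be applied with Dataset.map(batched=True); this is a sketch, and the source repo id below is an assumption not shown in the snippet:

```python
# Sketch: apply the (truncated) batched formatter to the raw source dataset.
# The source repo id "openbmb/RLAIF-V-Dataset" is an assumption.
raw = load_dataset("openbmb/RLAIF-V-Dataset", split="train")
formatted = raw.map(format, batched=True, remove_columns=raw.column_names)
```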
Dataset Card for natural-questions
The natural-questions dataset, provided by the ir-datasets package. For more information about the dataset, see the documentation.
Data
This dataset provides:
- docs (documents, i.e., the corpus); count=28,390,850
Usage
```python
from datasets import load_dataset

docs = load_dataset('irds/natural-questions', 'docs')
for record in docs:
    record  # {'doc_id': ..., 'text': ..., 'html': ..., 'start_byte': ..., 'end_byte': ...
```

… See the full description on the dataset page: https://huggingface.co/datasets/irds/natural-questions.
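With roughly 28.4M corpus documents, it is usually better to peek at a few records than to iterate everything; a small sketch with itertools (not from the card):

```python
import itertools

# Inspect only the first three corpus records.
for record in itertools.islice(docs, 3):
    print(record["doc_id"])  # field name taken from the record comment above
```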
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
OpenAssistant Conversations Dataset (OASST1)
Dataset Summary
In an effort to democratize research on large-scale alignment, we release OpenAssistant Conversations (OASST1), a human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages in 35 different languages, annotated with 461,292 quality ratings, resulting in over 10,000 fully annotated conversation trees. The corpus is a product of a worldwide crowd-sourcing effort… See the full description on the dataset page: https://huggingface.co/datasets/OpenAssistant/oasst1.
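The summary stops before any loading snippet; by analogy with the other cards here, the corpus would presumably be loaded with the repo id from the dataset page URL:

```python
from datasets import load_dataset

# Repo id inferred from the dataset page URL.
oasst1 = load_dataset("OpenAssistant/oasst1")
print(oasst1)
```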
License: https://choosealicense.com/licenses/unknown/
The scientific papers dataset contains two sets of long and structured documents, obtained from the ArXiv and PubMed OpenAccess repositories.
Both "arxiv" and "pubmed" have three features:
- article: the body of the document, paragraphs separated by "\n".
- abstract: the abstract of the document, paragraphs separated by "\n".
- section_names: titles of sections, separated by "\n".
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
Dataset Card for DIALOGSum Corpus
Dataset Description
Links
- Homepage: https://aclanthology.org/2021.findings-acl.449
- Repository: https://github.com/cylnlp/dialogsum
- Paper: https://aclanthology.org/2021.findings-acl.449
- Point of Contact: https://huggingface.co/knkarthick
Dataset Summary
DialogSum is a large-scale dialogue summarization dataset, consisting of 13,460 dialogues (plus 100 held-out dialogues for topic generation) with corresponding… See the full description on the dataset page: https://huggingface.co/datasets/knkarthick/dialogsum.
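As with the other cards, a minimal loading sketch; the repo id comes from the dataset page URL above, and the split layout is an assumption:

```python
from datasets import load_dataset

# Repo id inferred from the dataset page URL above.
dialogsum = load_dataset("knkarthick/dialogsum")
print(dialogsum)
```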