100+ datasets found

finevideo
huggingface.co
Updated Sep 12, 2024
Hugging Face FineVideo (2024). finevideo [Dataset]. https://huggingface.co/datasets/HuggingFaceFV/finevideo
Dataset updated
Sep 12, 2024
Dataset provided by
Hugging Face https://huggingface.co/
Authors
Hugging Face FineVideo
License
https://choosealicense.com/licenses/cc/
Description
FineVideo

FineVideo Description Dataset Explorer Revisions Dataset Distribution

How to download and use FineVideo Using datasets Using huggingface_hub Load a subset of the dataset

Dataset Structure Data Instances Data Fields

Dataset Creation License CC-By Considerations for Using the Data Social Impact of Dataset Discussion of Biases

Additional Information Credits Future Work Opting out of FineVideo Citation Information

Terms of use for FineVideo… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFV/finevideo.
fineweb
huggingface.co
FineData, fineweb [Dataset]. http://doi.org/10.57967/hf/2493
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.57967/hf/2493
Dataset authored and provided by
FineData
License
https://choosealicense.com/licenses/odc-by/
Description
🍷 FineWeb

15 trillion tokens of the finest data the 🌐 web has to offer

What is it?

The 🍷 FineWeb dataset consists of more than 15T tokens of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large-scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release of the full dataset under… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.
databricks-dolly-15k
huggingface.co
+ more versions
Databricks, databricks-dolly-15k [Dataset]. https://huggingface.co/datasets/databricks/databricks-dolly-15k
Dataset authored and provided by
Databricks http://databricks.com/
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Description
Summary

databricks-dolly-15k is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. This dataset can be used for any purpose, whether academic or commercial, under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported… See the full description on the dataset page: https://huggingface.co/datasets/databricks/databricks-dolly-15k.
dialogsum
huggingface.co
Updated Jun 29, 2022
+ more versions
Karthick Kaliannan Neelamohan (2022). dialogsum [Dataset]. https://huggingface.co/datasets/knkarthick/dialogsum
Dataset updated
Jun 29, 2022
Authors
Karthick Kaliannan Neelamohan
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
Dataset Card for DIALOGSum Corpus

Dataset Description Links

Homepage: https://aclanthology.org/2021.findings-acl.449
Repository: https://github.com/cylnlp/dialogsum
Paper: https://aclanthology.org/2021.findings-acl.449
Point of Contact: https://huggingface.co/knkarthick

Dataset Summary

DialogSum is a large-scale dialogue summarization dataset, consisting of 13,460 dialogues (plus 100 holdout dialogues for topic generation) with corresponding… See the full description on the dataset page: https://huggingface.co/datasets/knkarthick/dialogsum.
ultrachat_200k
huggingface.co
opendatalab.com
+1 more
Updated Oct 29, 2023
+ more versions
Hugging Face H4 (2023). ultrachat_200k [Dataset]. https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k
Dataset updated
Oct 29, 2023
Dataset provided by
Hugging Face https://huggingface.co/
Authors
Hugging Face H4
License
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Description
Dataset Card for UltraChat 200k

Dataset Description

This is a heavily filtered version of the UltraChat dataset and was used to train Zephyr-7B-β, a state-of-the-art 7B chat model. The original dataset consists of 1.4M dialogues generated by ChatGPT and spanning a wide range of topics. To create UltraChat 200k, we applied the following logic:

Selection of a subset of data for faster supervised fine tuning. Truecasing of the dataset, as we observed around 5% of the data… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k.
openai_humaneval
huggingface.co
Updated Jan 1, 2022
OpenAI (2022). openai_humaneval [Dataset]. https://huggingface.co/datasets/openai/openai_humaneval
Dataset updated
Jan 1, 2022
Dataset authored and provided by
OpenAI https://openai.com/
License
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Description
Dataset Card for OpenAI HumanEval

Dataset Summary

The HumanEval dataset released by OpenAI includes 164 programming problems, each with a function signature, docstring, body, and several unit tests. The problems were handwritten to ensure that they are not included in the training sets of code generation models.
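To make the structure described above concrete, here is a minimal sketch of how one such problem could be checked against a candidate completion. The record below is a made-up toy example, not an actual HumanEval problem, and the field names (`prompt`, `canonical_solution`, `test`, `entry_point`) are assumptions that should be verified on the dataset page.

```python
# Hypothetical record mirroring the parts named above (signature + docstring,
# body, unit tests). It is NOT taken from the real dataset.
problem = {
    "prompt": 'def add(a, b):\n    """Return the sum of a and b."""\n',
    "canonical_solution": "    return a + b\n",
    "test": (
        "def check(candidate):\n"
        "    assert candidate(1, 2) == 3\n"
        "    assert candidate(-1, 1) == 0\n"
    ),
    "entry_point": "add",
}


def passes(problem, completion):
    """Return True if prompt + completion satisfies the problem's unit tests."""
    program = problem["prompt"] + completion + "\n" + problem["test"]
    namespace = {}
    try:
        # exec runs arbitrary code: only do this in a sandbox for untrusted completions.
        exec(program, namespace)
        namespace["check"](namespace[problem["entry_point"]])
    except Exception:
        return False
    return True


print(passes(problem, problem["canonical_solution"]))
print(passes(problem, "    return a - b\n"))
```

The first call checks the reference body and the second a deliberately wrong one; a real evaluation harness would additionally isolate execution and enforce timeouts.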

Supported Tasks and Leaderboards Languages

The programming problems are written in Python and contain English natural text in comments and docstrings.… See the full description on the dataset page: https://huggingface.co/datasets/openai/openai_humaneval.
oak
huggingface.co
Updated Jul 13, 2024
tabularisai (2024). oak [Dataset]. https://huggingface.co/datasets/tabularisai/oak
Dataset updated
Jul 13, 2024
Dataset authored and provided by
tabularisai
License
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
NEWS:

A new version of the dataset with 120,000,000 more tokens has been uploaded: OAK v1.1

Open Artificial Knowledge (OAK) Dataset Overview

The Open Artificial Knowledge (OAK) dataset is a large-scale resource of over 650 million tokens designed to address the challenges of acquiring high-quality, diverse, and ethically sourced training data for Large Language Models (LLMs). OAK leverages an ensemble of state-of-the-art LLMs to generate high-quality text… See the full description on the dataset page: https://huggingface.co/datasets/tabularisai/oak.
wikihow
huggingface.co
paperswithcode.com
+2 more
Updated Mar 15, 2024
William Yang Wang (2024). wikihow [Dataset]. https://huggingface.co/datasets/wangwilliamyang/wikihow
Dataset updated
Mar 15, 2024
Authors
William Yang Wang
Description
WikiHow is a new large-scale dataset using the online WikiHow (http://www.wikihow.com/) knowledge base.

There are two features:
- text: wikihow answers texts.
- headline: bold lines as summary.

There are two separate versions:
- all: consisting of the concatenation of all paragraphs as the articles and the bold lines as the reference summaries.
- sep: consisting of each paragraph and its summary.

Download "wikihowAll.csv" and "wikihowSep.csv" from https://github.com/mahnazkoupaee/WikiHow-Dataset and place them in the manual folder (see https://www.tensorflow.org/datasets/api_docs/python/tfds/download/DownloadConfig). Train/validation/test splits are provided by the authors. Preprocessing is applied to remove short articles (abstract length < 0.75 article length) and clean up extra commas.
imagenet_1k_resized_256
huggingface.co
Updated Feb 26, 2025
Evan (2025). imagenet_1k_resized_256 [Dataset]. https://huggingface.co/datasets/evanarlian/imagenet_1k_resized_256
Dataset updated
Feb 26, 2025
Authors
Evan
License
https://choosealicense.com/licenses/other/
Description
Dataset Card for "imagenet_1k_resized_256"

Dataset summary

The same ImageNet dataset, but with the shorter side of every image resized to 256. Many pretraining workflows resize images to 256 and then randomly crop them to 224x224, which is why 256 was chosen. The resized dataset also downloads much faster and consumes less space than the original one. See here for a detailed readme.
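The resize-then-crop geometry mentioned above can be sketched with plain arithmetic. This is only an illustration under stated assumptions: the helper names `resize_shorter_side` and `random_crop_box` are hypothetical, the rounding rule is a guess, and the dataset author's actual resizing code may differ.

```python
import random


def resize_shorter_side(width, height, target=256):
    """Return (new_width, new_height) with the shorter side set to `target`,
    keeping the aspect ratio (the longer side is rounded to the nearest pixel)."""
    if width <= height:
        return target, round(height * target / width)
    return round(width * target / height), target


def random_crop_box(width, height, crop=224, rng=random):
    """Pick a random crop x crop window inside a width x height image.
    Returns (left, top, right, bottom) in pixel coordinates."""
    left = rng.randint(0, width - crop)
    top = rng.randint(0, height - crop)
    return left, top, left + crop, top + crop


# Example: a 500x375 image becomes 341x256, then a 224x224 window is sampled.
new_w, new_h = resize_shorter_side(500, 375)
print(new_w, new_h)
print(random_crop_box(new_w, new_h))
```

Because the resize is already baked into the stored images, only the cheap random crop would need to run at training time.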

Dataset Structure

Below is an example of one row of data. Note that the labels in… See the full description on the dataset page: https://huggingface.co/datasets/evanarlian/imagenet_1k_resized_256.
LAV-DF
huggingface.co
Updated Jul 11, 2023
ControlNet (2023). LAV-DF [Dataset]. https://huggingface.co/datasets/ControlNet/LAV-DF
Dataset updated
Jul 11, 2023
Authors
ControlNet
License
https://choosealicense.com/licenses/cc/
Description
Localized Audio Visual DeepFake Dataset (LAV-DF)

This repo is the dataset for the DICTA paper Do You Really Mean That? Content Driven Audio-Visual Deepfake Dataset and Multimodal Method for Temporal Forgery Localization (Best Award), and the journal paper "Glitch in the Matrix!": A Large Scale Benchmark for Content Driven Audio-Visual Forgery Detection and Localization submitted to CVIU.

LAV-DF Dataset Download

To use this LAV-DF dataset, you should… See the full description on the dataset page: https://huggingface.co/datasets/ControlNet/LAV-DF.
MVBench
huggingface.co
Updated Aug 8, 2024
OpenGVLab (2024). MVBench [Dataset]. https://huggingface.co/datasets/OpenGVLab/MVBench
Dataset updated
Aug 8, 2024
Dataset authored and provided by
OpenGVLab
License
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Description
MVBench

Important Update

[18/10/2024] Due to the NTU RGB+D license, 320 videos from NTU RGB+D need to be downloaded manually. Please visit ROSE Lab to access the data. We also provide a list of the 320 videos used in MVBench for your reference.

We introduce a novel static-to-dynamic method for defining temporal-related tasks. By converting static tasks into dynamic ones, we facilitate systematic generation of video tasks necessitating a wide range of temporal abilities, from… See the full description on the dataset page: https://huggingface.co/datasets/OpenGVLab/MVBench.
alpaca
huggingface.co
opendatalab.com
Updated Mar 14, 2023
+ more versions
Tatsu Lab (2023). alpaca [Dataset]. https://huggingface.co/datasets/tatsu-lab/alpaca
Dataset updated
Mar 14, 2023
Dataset authored and provided by
Tatsu Lab
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Dataset Card for Alpaca

Dataset Summary

Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used to conduct instruction-tuning for language models and make them follow instructions better. The authors built on the data generation pipeline from the Self-Instruct framework and made the following modifications:

The text-davinci-003 engine to generate the instruction data instead… See the full description on the dataset page: https://huggingface.co/datasets/tatsu-lab/alpaca.
VisIT-Bench
huggingface.co
ML Foundations, VisIT-Bench [Dataset]. https://huggingface.co/datasets/mlfoundations/VisIT-Bench
Dataset authored and provided by
ML Foundations
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset Card for VisIT-Bench

Dataset Description Links Dataset Structure Data Fields Data Splits Data Loading

Licensing Information Annotations Considerations for Using the Data Citation Information

Dataset Description

VisIT-Bench is a dataset and benchmark for vision-and-language instruction following. The dataset is comprised of image-instruction pairs and corresponding example outputs, spanning a wide range of tasks, from simple object recognition to complex… See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/VisIT-Bench.
cos_e
huggingface.co
Updated Dec 10, 2021
Salesforce (2021). cos_e [Dataset]. https://huggingface.co/datasets/Salesforce/cos_e
Dataset updated
Dec 10, 2021
Dataset provided by
Salesforce Inc http://salesforce.com/
Authors
Salesforce
License
https://choosealicense.com/licenses/unknown/
Description
Dataset Card for "cos_e"

Dataset Summary

Common Sense Explanations (CoS-E) allows for training language models to automatically generate explanations that can be used during training and inference in a novel Commonsense Auto-Generated Explanation (CAGE) framework.

Supported Tasks and Leaderboards

More Information Needed

Languages

More Information Needed

Dataset Structure Data Instances v1.0

Size of downloaded dataset… See the full description on the dataset page: https://huggingface.co/datasets/Salesforce/cos_e.
SlimPajama-627B
huggingface.co
opendatalab.com
Updated Oct 2, 2012
Cerebras (2012). SlimPajama-627B [Dataset]. https://huggingface.co/datasets/cerebras/SlimPajama-627B
Dataset updated
Oct 2, 2012
Dataset authored and provided by
Cerebras
Description
The dataset consists of 59166 jsonl files and is ~895GB compressed. It is a cleaned and deduplicated version of Together's RedPajama. Check out our blog post explaining our methods, our code on GitHub, and join the discussion on the Cerebras Discord.

Getting Started

You can download the dataset using Hugging Face datasets:

    from datasets import load_dataset
    ds = load_dataset("cerebras/SlimPajama-627B")

Background

Today we are releasing SlimPajama – the largest… See the full description on the dataset page: https://huggingface.co/datasets/cerebras/SlimPajama-627B.
ShareGPT-4o
huggingface.co
Updated May 28, 2024
+ more versions
OpenGVLab (2024). ShareGPT-4o [Dataset]. https://huggingface.co/datasets/OpenGVLab/ShareGPT-4o
Dataset updated
May 28, 2024
Dataset authored and provided by
OpenGVLab
License
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Description
OpenGVLab/ShareGPT-4o dataset hosted on Hugging Face and contributed by the HF Datasets community
coco
huggingface.co
Updated Mar 3, 2023
+ more versions
Detection datasets (2023). coco [Dataset]. https://huggingface.co/datasets/detection-datasets/coco
Dataset updated
Mar 3, 2023
Dataset authored and provided by
Detection datasets
Description
detection-datasets/coco dataset hosted on Hugging Face and contributed by the HF Datasets community
instruction-dataset
huggingface.co
opendatalab.com
Updated Feb 10, 2023
Hugging Face H4 (2023). instruction-dataset [Dataset]. https://huggingface.co/datasets/HuggingFaceH4/instruction-dataset
Dataset updated
Feb 10, 2023
Dataset provided by
Hugging Face https://huggingface.co/
Authors
Hugging Face H4
License
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
This is the blind eval dataset of high-quality, diverse, human-written instructions with demonstrations. We will be using this for step 3 evaluations in our RLHF pipeline.
youtube_regrets
huggingface.co
Updated May 16, 2023
Mozilla Foundation (2023). youtube_regrets [Dataset]. http://doi.org/10.57967/hf/2217
Unique identifier
https://doi.org/10.57967/hf/2217
Dataset updated
May 16, 2023
Dataset authored and provided by
Mozilla Foundation http://mozilla.org/
License
https://choosealicense.com/licenses/cc0-1.0/
Description
Dataset Card for Mozilla RegretsReporter Public Data

Dataset Summary RegretsReporter Data

This data set card describes the public data sets made available based on Mozilla’s RegretsReporter research as well as the Viu Política research from the University of Exeter and Vero Instituto. This data was collected from participants in Mozilla’s RegretsReporter studies. Participants installed a web extension to participate in each study. In the case of the first… See the full description on the dataset page: https://huggingface.co/datasets/mozilla-foundation/youtube_regrets.
tiny-imagenet
huggingface.co
datasets.activeloop.ai
Updated Aug 12, 2022
Hao Zheng (2022). tiny-imagenet [Dataset]. https://huggingface.co/datasets/zh-plus/tiny-imagenet
Dataset updated
Aug 12, 2022
Authors
Hao Zheng
License
https://choosealicense.com/licenses/undefined/
Description
Dataset Card for tiny-imagenet

Dataset Summary

Tiny ImageNet contains 100000 images of 200 classes (500 for each class) downsized to 64×64 colored images. Each class has 500 training images, 50 validation images, and 50 test images.
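As a quick sanity check on the figures above, the per-class counts can be multiplied out per split. This is a small arithmetic sketch using only the numbers stated in the description; the split names used as dictionary keys are assumptions and may not match the dataset's actual split names.

```python
# Per-class image counts as stated in the description above.
NUM_CLASSES = 200
PER_CLASS = {"train": 500, "validation": 50, "test": 50}


def split_totals(num_classes, per_class):
    """Multiply each per-class count by the number of classes."""
    return {split: num_classes * count for split, count in per_class.items()}


totals = split_totals(NUM_CLASSES, PER_CLASS)
print(totals)                # {'train': 100000, 'validation': 10000, 'test': 10000}
print(sum(totals.values()))  # 120000
```

Under these counts, the 100000 figure quoted in the summary corresponds to the training images alone.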

Languages

The class labels in the dataset are in English.

Dataset Structure Data Instances

{ 'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=64x64 at 0x1A800E8E190>, 'label': 15 }… See the full description on the dataset page: https://huggingface.co/datasets/zh-plus/tiny-imagenet.