100+ datasets found

mini-en
huggingface.co
Updated Mar 28, 2025
Cite
Nam Pham (2025). mini-en [Dataset]. https://huggingface.co/datasets/nampdn-ai/mini-en
Dataset updated
Mar 28, 2025
Authors
Nam Pham
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Tiny English

A collection of short texts that have been curated for long-term human value. The texts in this dataset have been filtered from the falcon-refinedweb and minipile datasets to ensure better quality while keeping the dataset tiny. The tiny-en dataset is concise and small, yet highly diverse, making it an excellent resource for training natural language processing models. Despite its compact size, the dataset offers a wide range of content that has been carefully selected for… See the full description on the dataset page: https://huggingface.co/datasets/nampdn-ai/mini-en.
sec-data-mini
huggingface.co
Cite
Arcee AI, sec-data-mini [Dataset]. https://huggingface.co/datasets/arcee-ai/sec-data-mini
Dataset provided by
Arcee AI, Inc.
Authors
Arcee AI
Description
arcee-ai/sec-data-mini dataset hosted on Hugging Face and contributed by the HF Datasets community
smollm-corpus
huggingface.co
Updated Jul 16, 2024
Cite
Hugging Face Smol Models Research (2024). smollm-corpus [Dataset]. https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus
Dataset updated
Jul 16, 2024
Dataset provided by
Hugging Face (https://huggingface.co/)
Authors
Hugging Face Smol Models Research
License
https://choosealicense.com/licenses/odc-by/
Description
SmolLM-Corpus

This dataset is a curated collection of high-quality educational and synthetic data designed for training small language models. You can find more details about the models trained on this dataset in our SmolLM blog post.

Dataset subsets

Cosmopedia v2

Cosmopedia v2 is an enhanced version of Cosmopedia, the largest synthetic dataset for pre-training, consisting of over 39 million textbooks, blog posts, and stories generated by… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus.
rag-mini-wikipedia
huggingface.co
Updated Jun 29, 2025
Cite
RAG Datasets (2025). rag-mini-wikipedia [Dataset]. https://huggingface.co/datasets/rag-datasets/rag-mini-wikipedia
Dataset updated
Jun 29, 2025
Dataset authored and provided by
RAG Datasets
License
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Description
In this Hugging Face discussion you can share what you used the dataset for. The dataset derives from https://www.kaggle.com/datasets/rtatman/questionanswer-dataset?resource=download; we generated our own subset using generate.py.
tiny-imagenet
huggingface.co
datasets.activeloop.ai
Updated Aug 12, 2022
Cite
Hao Zheng (2022). tiny-imagenet [Dataset]. https://huggingface.co/datasets/zh-plus/tiny-imagenet
Dataset updated
Aug 12, 2022
Authors
Hao Zheng
License
https://choosealicense.com/licenses/undefined/
Description
Dataset Card for tiny-imagenet

Dataset Summary

Tiny ImageNet contains 100000 images of 200 classes (500 for each class) downsized to 64×64 colored images. Each class has 500 training images, 50 validation images, and 50 test images.
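
As a quick sanity check on the figures in this summary, the per-class counts can be multiplied out. This is a minimal sketch; the variable names are illustrative and do not come from the dataset card.

```python
# Per-class counts as stated in the dataset summary above.
NUM_CLASSES = 200
PER_CLASS = {"train": 500, "validation": 50, "test": 50}

# Total images in a split = images per class * number of classes.
split_totals = {split: n * NUM_CLASSES for split, n in PER_CLASS.items()}
total_images = sum(split_totals.values())

print(split_totals)
print(total_images)
```

Under these figures, the 100000 images quoted in the summary correspond to the training split alone; the validation and test images come on top of that.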

Languages

The class labels in the dataset are in English.

Dataset Structure

Data Instances

{ 'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=64x64 at 0x1A800E8E190>, 'label': 15 }… See the full description on the dataset page: https://huggingface.co/datasets/zh-plus/tiny-imagenet.
BAAI_bge-small-en-v1_5-02082024-vrdv-webapp
huggingface.co
Updated Aug 2, 2024
Cite
Fine-tuned Embeddings (2024). BAAI_bge-small-en-v1_5-02082024-vrdv-webapp [Dataset]. https://huggingface.co/datasets/fine-tuned/BAAI_bge-small-en-v1_5-02082024-vrdv-webapp
Dataset updated
Aug 2, 2024
Dataset authored and provided by
Fine-tuned Embeddings
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
BAAI_bge-small-en-v1_5-02082024-vrdv-webapp Dataset

Dataset Description

The dataset "general domain" is a generated dataset designed to support the development of domain-specific embedding models for retrieval tasks.

Associated Model

This dataset was used to train the BAAI_bge-small-en-v1_5-02082024-vrdv-webapp model.

How to Use

To use this dataset for model training or evaluation, you can load it using the Hugging Face datasets library as follows:… See the full description on the dataset page: https://huggingface.co/datasets/fine-tuned/BAAI_bge-small-en-v1_5-02082024-vrdv-webapp.
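
The card's own snippet is cut off in this listing, so the following is only a hedged sketch of what loading might look like. It assumes the `datasets` package is installed and that the repository exposes a "train" split, which this listing does not confirm. `load_split` is a hypothetical helper name.

```python
# Repository ID taken from the dataset page URL above.
REPO_ID = "fine-tuned/BAAI_bge-small-en-v1_5-02082024-vrdv-webapp"


def load_split(repo_id: str = REPO_ID, split: str = "train"):
    """Download and return one split of the dataset from the Hugging Face Hub.

    The default split name "train" is an assumption, not taken from the card.
    """
    # Imported lazily so that defining this helper needs neither the
    # library nor network access; calling it needs both.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split)
```

Calling `load_split()` would fetch the data over the network. If the split name does not exist, `load_dataset` raises an error that lists the splits actually available.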
CoVLA-Dataset-Mini
huggingface.co
Updated Aug 21, 2024
Cite
Turing Inc. (2024). CoVLA-Dataset-Mini [Dataset]. https://huggingface.co/datasets/turing-motors/CoVLA-Dataset-Mini
Dataset updated
Aug 21, 2024
Dataset authored and provided by
Turing Inc.
Description
CoVLA-Dataset-Mini

Dataset description

CoVLA-Dataset-Mini is a subset of the CoVLA-Dataset (Comprehensive Vision-Language Action), containing data from 50 scenes. CoVLA-Dataset is an extensive dataset comprising real-world driving videos spanning more than 80 hours. This dataset leverages a novel, scalable approach based on automated data processing and a caption generation pipeline to generate accurate driving trajectories paired with detailed natural language… See the full description on the dataset page: https://huggingface.co/datasets/turing-motors/CoVLA-Dataset-Mini.
mini-imagenet
huggingface.co
Updated Nov 21, 2024
Cite
PyTorch Image Models (2024). mini-imagenet [Dataset]. https://huggingface.co/datasets/timm/mini-imagenet
Dataset updated
Nov 21, 2024
Dataset authored and provided by
PyTorch Image Models
License
https://choosealicense.com/licenses/other/
Description
Dataset Description

A mini version of ImageNet-1k with 100 of the 1000 classes present. Unlike some 'mini' variants, this one includes the original images at their original sizes; many such subsets downsample to 84x84 or other smaller resolutions.

Data Splits

Train

50000 samples from ImageNet-1k train split

Validation

10000 samples from ImageNet-1k train split

Test

5000 samples from ImageNet-1k validation split (all 50 samples per class)… See the full description on the dataset page: https://huggingface.co/datasets/timm/mini-imagenet.
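
The split sizes above imply a per-class sample count, which can be worked out as follows. This sketch assumes the classes are evenly balanced within each split. The description states this only for the test split (50 per class), so the train and validation figures are inferred, not confirmed.

```python
NUM_CLASSES = 100  # "100 of 1000 classes present"
SPLIT_SIZES = {"train": 50_000, "validation": 10_000, "test": 5_000}

# Assumption: every class contributes the same number of samples to a split.
per_class = {split: size // NUM_CLASSES for split, size in SPLIT_SIZES.items()}

print(per_class)
```

The test figure of 50 per class matches the "all 50 samples per class" note in the description, which is a useful consistency check on the other two values.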
TinyStories
huggingface.co
paperswithcode.com
Updated May 16, 2023
Cite
Ronen Eldan (2023). TinyStories [Dataset]. https://huggingface.co/datasets/roneneldan/TinyStories
Dataset updated
May 16, 2023
Authors
Ronen Eldan
License
https://choosealicense.com/licenses/cdla-sharing-1.0/
Description
A dataset of short stories, synthetically generated by GPT-3.5 and GPT-4, that use only a small vocabulary. It is described in the following paper: https://arxiv.org/abs/2305.07759. The models referred to in the paper were trained on TinyStories-train.txt (the file tinystories-valid.txt can be used for validation loss). These models can be found on Hugging Face, at roneneldan/TinyStories-1M/3M/8M/28M/33M/1Layer-21M. Additional resources: tinystories_all_data.tar.gz - contains a superset of… See the full description on the dataset page: https://huggingface.co/datasets/roneneldan/TinyStories.
tiny-textbooks
huggingface.co
Updated Jan 26, 2024
Cite
Nam Pham (2024). tiny-textbooks [Dataset]. http://doi.org/10.57967/hf/1126
Unique identifier
https://doi.org/10.57967/hf/1126
Dataset updated
Jan 26, 2024
Authors
Nam Pham
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Textbook-like Dataset: A High-Quality Resource for Small Language Models

The idea is inspired by the technical report Textbooks Are All You Need II: phi-1.5. The source texts in this dataset were gathered and carefully selected from the best of the falcon-refinedweb and minipile datasets to ensure diversity and quality while keeping the dataset tiny. The dataset was synthesized using 4x3090 Ti cards over a period of 500 hours, thanks to the fine-tuned Nous-Hermes-Llama2-13b model. Why… See the full description on the dataset page: https://huggingface.co/datasets/nampdn-ai/tiny-textbooks.
SFC-mini
huggingface.co
Updated Jun 21, 2024
Cite
Faisal Qarah (2024). SFC-mini [Dataset]. https://huggingface.co/datasets/faisalq/SFC-mini
Dataset updated
Jun 21, 2024
Authors
Faisal Qarah
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
faisalq/SFC-mini dataset hosted on Hugging Face and contributed by the HF Datasets community
ultrachat_200k
huggingface.co
opendatalab.com
Updated Oct 29, 2023
Cite
Hugging Face H4 (2023). ultrachat_200k [Dataset]. https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k
Dataset updated
Oct 29, 2023
Dataset provided by
Hugging Face (https://huggingface.co/)
Authors
Hugging Face H4
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
Dataset Card for UltraChat 200k

Dataset Description

This is a heavily filtered version of the UltraChat dataset and was used to train Zephyr-7B-β, a state-of-the-art 7B chat model. The original dataset consists of 1.4M dialogues generated by ChatGPT, spanning a wide range of topics. To create UltraChat 200k, we applied the following logic:

Selection of a subset of data for faster supervised fine-tuning.
Truecasing of the dataset, as we observed around 5% of the data… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k.
small-coco-wm_1_120k
huggingface.co
Updated Feb 6, 2023
Cite
RIW (2023). small-coco-wm_1_120k [Dataset]. https://huggingface.co/datasets/RIW/small-coco-wm_1_120k
Dataset updated
Feb 6, 2023
Dataset authored and provided by
RIW
Description
RIW/small-coco-wm_1_120k dataset hosted on Hugging Face and contributed by the HF Datasets community
MAIR-Results-text-embedding-3-small
huggingface.co
Updated Oct 21, 2024
Cite
MAIR-Bench (2024). MAIR-Results-text-embedding-3-small [Dataset]. https://huggingface.co/datasets/MAIR-Bench/MAIR-Results-text-embedding-3-small
Dataset updated
Oct 21, 2024
Authors
MAIR-Bench
Description
MAIR-Bench/MAIR-Results-text-embedding-3-small dataset hosted on Hugging Face and contributed by the HF Datasets community
all-gpt-4.1-mini
huggingface.co
Cite
testcase-eval, all-gpt-4.1-mini [Dataset]. https://huggingface.co/datasets/testcase-evaluate/all-gpt-4.1-mini
Dataset authored and provided by
testcase-eval
Description
testcase-evaluate/all-gpt-4.1-mini dataset hosted on Hugging Face and contributed by the HF Datasets community
MBD-mini
huggingface.co
Updated Aug 22, 2024
Cite
sb-ai-lab (2024). MBD-mini [Dataset]. https://huggingface.co/datasets/ai-lab/MBD-mini
Dataset updated
Aug 22, 2024
Dataset authored and provided by
sb-ai-lab
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset summary

This dataset is designed to assist in predicting a customer's propensity to purchase various products within a month following the reporting date. The dataset includes anonymized historical data on transaction activity, dialog embeddings, and geo-activity for some bank clients over 12 months. The mini MBD dataset contains a reduced subset of the data, making it easier and faster to work with during the development and testing phases. It includes a smaller number… See the full description on the dataset page: https://huggingface.co/datasets/ai-lab/MBD-mini.
dialogsum
huggingface.co
Updated Jun 29, 2022
Cite
Karthick Kaliannan Neelamohan (2022). dialogsum [Dataset]. https://huggingface.co/datasets/knkarthick/dialogsum
Dataset updated
Jun 29, 2022
Authors
Karthick Kaliannan Neelamohan
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
Dataset Card for DIALOGSum Corpus

Dataset Description

Links

Homepage: https://aclanthology.org/2021.findings-acl.449
Repository: https://github.com/cylnlp/dialogsum
Paper: https://aclanthology.org/2021.findings-acl.449
Point of Contact: https://huggingface.co/knkarthick

Dataset Summary

DialogSum is a large-scale dialogue summarization dataset, consisting of 13,460 dialogues (plus 100 held out for topic generation) with corresponding… See the full description on the dataset page: https://huggingface.co/datasets/knkarthick/dialogsum.
mini-mmc4
huggingface.co
Updated Oct 11, 2024
Cite
Fabrice Nemo (2024). mini-mmc4 [Dataset]. https://huggingface.co/datasets/fabnem/mini-mmc4
Dataset updated
Oct 11, 2024
Authors
Fabrice Nemo
Description
fabnem/mini-mmc4 dataset hosted on Hugging Face and contributed by the HF Datasets community
github-code-clean-small
huggingface.co
Updated Sep 16, 2022
Cite
Loubna Ben Allal (2022). github-code-clean-small [Dataset]. https://huggingface.co/datasets/loubnabnl/github-code-clean-small
Dataset updated
Sep 16, 2022
Authors
Loubna Ben Allal
Description
loubnabnl/github-code-clean-small dataset hosted on Hugging Face and contributed by the HF Datasets community
fineweb
huggingface.co
Cite
FineData, fineweb [Dataset]. http://doi.org/10.57967/hf/2493
Unique identifier
https://doi.org/10.57967/hf/2493
Dataset authored and provided by
FineData
License
https://choosealicense.com/licenses/odc-by/
Description
🍷 FineWeb

15 trillion tokens of the finest data the 🌐 web has to offer

What is it?

The 🍷 FineWeb dataset consists of more than 15T tokens of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and was run on the 🏭 datatrove library, our large-scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release of the full dataset under… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.
