Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Tiny English
A collection of short texts that have been curated for long-term human value. The texts in this dataset have been filtered from the falcon-refinedweb and minipile datasets to ensure better quality while keeping the dataset tiny. The tiny-en dataset is small yet highly diverse, making it an excellent resource for training natural language processing models. Despite its compact size, the dataset offers a wide range of content that has been carefully selected for… See the full description on the dataset page: https://huggingface.co/datasets/nampdn-ai/mini-en.
arcee-ai/sec-data-mini dataset hosted on Hugging Face and contributed by the HF Datasets community
https://choosealicense.com/licenses/odc-by/
SmolLM-Corpus
This dataset is a curated collection of high-quality educational and synthetic data designed for training small language models. You can find more details about the models trained on this dataset in our SmolLM blog post.
Dataset subsets
Cosmopedia v2
Cosmopedia v2 is an enhanced version of Cosmopedia, the largest synthetic dataset for pre-training, consisting of over 39 million textbooks, blog posts, and stories generated by… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus.
Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
In this Hugging Face discussion you can share what you used the dataset for. The dataset derives from https://www.kaggle.com/datasets/rtatman/questionanswer-dataset?resource=download; we generated our own subset using generate.py.
https://choosealicense.com/licenses/undefined/
Dataset Card for tiny-imagenet
Dataset Summary
Tiny ImageNet contains 100000 images of 200 classes (500 for each class) downsized to 64×64 colored images. Each class has 500 training images, 50 validation images, and 50 test images.
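The counts above can be cross-checked with a little arithmetic. The sketch below assumes the per-class figures in the summary are exact and that every class is equally represented; the names NUM_CLASSES, PER_CLASS and split_sizes are illustrative and are not part of the dataset or its loader.

```python
# Per-class image counts as stated in the dataset summary.
NUM_CLASSES = 200
PER_CLASS = {"train": 500, "validation": 50, "test": 50}


def split_sizes(num_classes, per_class):
    """Return the total number of images in each split,
    assuming every class contributes the same number of images."""
    return {split: num_classes * count for split, count in per_class.items()}


sizes = split_sizes(NUM_CLASSES, PER_CLASS)
total = sum(sizes.values())

for split, size in sizes.items():
    print(f"{split}: {size}")
print(f"all splits: {total}")
```

Under these assumptions the 100000 figure in the summary matches the training split alone (200 classes at 500 images each), with the validation and test images counted on top of it.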
Languages
The class labels in the dataset are in English.
Dataset Structure
Data Instances
{ 'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=64x64 at 0x1A800E8E190>, 'label': 15 }… See the full description on the dataset page: https://huggingface.co/datasets/zh-plus/tiny-imagenet.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
BAAI_bge-small-en-v1_5-02082024-vrdv-webapp Dataset
Dataset Description
The dataset "general domain" is a generated dataset designed to support the development of domain specific embedding models for retrieval tasks.
Associated Model
This dataset was used to train the BAAI_bge-small-en-v1_5-02082024-vrdv-webapp model.
How to Use
To use this dataset for model training or evaluation, you can load it using the Hugging Face datasets library as follows:… See the full description on the dataset page: https://huggingface.co/datasets/fine-tuned/BAAI_bge-small-en-v1_5-02082024-vrdv-webapp.
CoVLA-Dataset-Mini
Dataset description
CoVLA-Dataset-Mini is a subset of the CoVLA-Dataset (Comprehensive Vision-Language Action), containing data from 50 scenes. CoVLA-Dataset is an extensive dataset comprising real-world driving videos spanning more than 80 hours. This dataset leverages a novel, scalable approach based on automated data processing and a caption generation pipeline to generate accurate driving trajectories paired with detailed natural language… See the full description on the dataset page: https://huggingface.co/datasets/turing-motors/CoVLA-Dataset-Mini.
https://choosealicense.com/licenses/other/
Dataset Description
A mini version of ImageNet-1k with 100 of 1000 classes present. Unlike some 'mini' variants this one includes the original images at their original sizes. Many such subsets downsample to 84x84 or other smaller resolutions.
Data Splits
Train
50000 samples from ImageNet-1k train split
Validation
10000 samples from ImageNet-1k train split
Test
5000 samples from ImageNet-1k validation split (all 50 samples per class)… See the full description on the dataset page: https://huggingface.co/datasets/timm/mini-imagenet.
https://choosealicense.com/licenses/cdla-sharing-1.0/
Dataset containing synthetically generated (by GPT-3.5 and GPT-4) short stories that only use a small vocabulary. Described in the following paper: https://arxiv.org/abs/2305.07759. The models referred to in the paper were trained on TinyStories-train.txt (the file tinystories-valid.txt can be used for validation loss). These models can be found on Huggingface, at roneneldan/TinyStories-1M/3M/8M/28M/33M/1Layer-21M. Additional resources: tinystories_all_data.tar.gz - contains a superset of… See the full description on the dataset page: https://huggingface.co/datasets/roneneldan/TinyStories.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Textbook-like Dataset: A High-Quality Resource for Small Language Models
The idea is inspired by the Textbooks Are All You Need II: phi-1.5 technical report. The source texts in this dataset were gathered and carefully selected from the best of the falcon-refinedweb and minipile datasets to ensure diversity and quality while staying tiny in size. The dataset was synthesized using 4x3090 Ti cards over a period of 500 hours, thanks to the Nous-Hermes-Llama2-13b finetuned model. Why… See the full description on the dataset page: https://huggingface.co/datasets/nampdn-ai/tiny-textbooks.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
faisalq/SFC-mini dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for UltraChat 200k
Dataset Description
This is a heavily filtered version of the UltraChat dataset and was used to train Zephyr-7B-β, a state-of-the-art 7B chat model. The original dataset consists of 1.4M dialogues generated by ChatGPT and spanning a wide range of topics. To create UltraChat 200k, we applied the following logic:
Selection of a subset of data for faster supervised fine tuning. Truecasing of the dataset, as we observed around 5% of the data… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k.
RIW/small-coco-wm_1_120k dataset hosted on Hugging Face and contributed by the HF Datasets community
MAIR-Bench/MAIR-Results-text-embedding-3-small dataset hosted on Hugging Face and contributed by the HF Datasets community
testcase-evaluate/all-gpt-4.1-mini dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset summary
This dataset is designed to assist in predicting a customer's propensity to purchase various products within a month following the reporting date. The dataset includes anonymized historical data on transaction activity, dialog embeddings, and geo-activity for some bank clients over 12 months. The mini MBD dataset contains a reduced subset of the data, making it easier and faster to work with during the development and testing phases. It includes a smaller number… See the full description on the dataset page: https://huggingface.co/datasets/ai-lab/MBD-mini.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Dataset Card for DIALOGSum Corpus
Dataset Description
Links
Homepage: https://aclanthology.org/2021.findings-acl.449
Repository: https://github.com/cylnlp/dialogsum
Paper: https://aclanthology.org/2021.findings-acl.449
Point of Contact: https://huggingface.co/knkarthick
Dataset Summary
DialogSum is a large-scale dialogue summarization dataset, consisting of 13,460 dialogues (plus 100 holdout examples for topic generation) with corresponding… See the full description on the dataset page: https://huggingface.co/datasets/knkarthick/dialogsum.
fabnem/mini-mmc4 dataset hosted on Hugging Face and contributed by the HF Datasets community
loubnabnl/github-code-clean-small dataset hosted on Hugging Face and contributed by the HF Datasets community
https://choosealicense.com/licenses/odc-by/
🍷 FineWeb
15 trillion tokens of the finest data the 🌐 web has to offer
What is it?
The 🍷 FineWeb dataset consists of more than 15T tokens of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and was run on the 🏭 datatrove library, our large-scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release of the full dataset under… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.