100+ datasets found
  1. h

    mini-en

    • huggingface.co
    Updated Mar 28, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Nam Pham (2025). mini-en [Dataset]. https://huggingface.co/datasets/nampdn-ai/mini-en
    Explore at:
    Dataset updated
    Mar 28, 2025
    Authors
    Nam Pham
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Tiny English

    A collection of short texts that have been curated for long-term human value. The texts in this dataset have been filtered from the falcon-refinedweb and minipile datasets to ensure better quality and tiny in size. The tiny-en dataset is concise and small in size, yet highly diverse, making it an excellent resource for training natural language processing models. Despite its compact size, the dataset offers a wide range of content that has been carefully selected for… See the full description on the dataset page: https://huggingface.co/datasets/nampdn-ai/mini-en.

  2. sec-data-mini

    • huggingface.co
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Arcee AI, sec-data-mini [Dataset]. https://huggingface.co/datasets/arcee-ai/sec-data-mini
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset provided by
    Arcee AI, Inc.
    Authors
    Arcee AI
    Description

    arcee-ai/sec-data-mini dataset hosted on Hugging Face and contributed by the HF Datasets community

  3. smollm-corpus

    • huggingface.co
    Updated Jul 16, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Hugging Face Smol Models Research (2024). smollm-corpus [Dataset]. https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 16, 2024
    Dataset provided by
    Hugging Facehttps://huggingface.co/
    Authors
    Hugging Face Smol Models Research
    License

    https://choosealicense.com/licenses/odc-by/https://choosealicense.com/licenses/odc-by/

    Description

    SmolLM-Corpus

    This dataset is a curated collection of high-quality educational and synthetic data designed for training small language models. You can find more details about the models trained on this dataset in our SmolLM blog post.

      Dataset subsets
    
    
    
    
    
      Cosmopedia v2
    

    Cosmopedia v2 is an enhanced version of Cosmopedia, the largest synthetic dataset for pre-training, consisting of over 39 million textbooks, blog posts, and stories generated by… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus.

  4. h

    rag-mini-wikipedia

    • huggingface.co
    Updated Jun 29, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    RAG Datasets (2025). rag-mini-wikipedia [Dataset]. https://huggingface.co/datasets/rag-datasets/rag-mini-wikipedia
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 29, 2025
    Dataset authored and provided by
    RAG Datasets
    License

    Attribution 3.0 (CC BY 3.0)https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    In this huggingface discussion you can share what you used the dataset for. Derives from https://www.kaggle.com/datasets/rtatman/questionanswer-dataset?resource=download we generated our own subset using generate.py.

  5. h

    tiny-imagenet

    • huggingface.co
    • datasets.activeloop.ai
    Updated Aug 12, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Hao Zheng (2022). tiny-imagenet [Dataset]. https://huggingface.co/datasets/zh-plus/tiny-imagenet
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 12, 2022
    Authors
    Hao Zheng
    License

    https://choosealicense.com/licenses/undefined/https://choosealicense.com/licenses/undefined/

    Description

    Dataset Card for tiny-imagenet

      Dataset Summary
    

    Tiny ImageNet contains 100000 images of 200 classes (500 for each class) downsized to 64×64 colored images. Each class has 500 training images, 50 validation images, and 50 test images.

      Languages
    

    The class labels in the dataset are in English.

      Dataset Structure
    
    
    
    
    
      Data Instances
    

    { 'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=64x64 at 0x1A800E8E190, 'label': 15 }… See the full description on the dataset page: https://huggingface.co/datasets/zh-plus/tiny-imagenet.

  6. h

    BAAI_bge-small-en-v1_5-02082024-vrdv-webapp

    • huggingface.co
    Updated Aug 2, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Fine-tuned Embeddings (2024). BAAI_bge-small-en-v1_5-02082024-vrdv-webapp [Dataset]. https://huggingface.co/datasets/fine-tuned/BAAI_bge-small-en-v1_5-02082024-vrdv-webapp
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 2, 2024
    Dataset authored and provided by
    Fine-tuned Embeddings
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    BAAI_bge-small-en-v1_5-02082024-vrdv-webapp Dataset

      Dataset Description
    

    The dataset "general domain" is a generated dataset designed to support the development of domain specific embedding models for retrieval tasks.

      Associated Model
    

    This dataset was used to train the BAAI_bge-small-en-v1_5-02082024-vrdv-webapp model.

      How to Use
    

    To use this dataset for model training or evaluation, you can load it using the Hugging Face datasets library as follows:… See the full description on the dataset page: https://huggingface.co/datasets/fine-tuned/BAAI_bge-small-en-v1_5-02082024-vrdv-webapp.

  7. h

    CoVLA-Dataset-Mini

    • huggingface.co
    Updated Aug 21, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Turing Inc. (2024). CoVLA-Dataset-Mini [Dataset]. https://huggingface.co/datasets/turing-motors/CoVLA-Dataset-Mini
    Explore at:
    Dataset updated
    Aug 21, 2024
    Dataset authored and provided by
    Turing Inc.
    Description

    CoVLA-Dataset-Mini

      Dataset description
    

    CoVLA-Dataset-Mini is a subset of the CoVLA-Dataset (Comprehensive Vision-Language Action), containing data from 50 scenes. CoVLA-Dataset is an extensive dataset comprising real-world driving videos spanning more than 80 hours. This dataset leverages a novel, scalable approach based on automated data processing and a caption generation pipeline to generate accurate driving trajectories paired with detailed natural language… See the full description on the dataset page: https://huggingface.co/datasets/turing-motors/CoVLA-Dataset-Mini.

  8. h

    mini-imagenet

    • huggingface.co
    Updated Nov 21, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    PyTorch Image Models (2024). mini-imagenet [Dataset]. https://huggingface.co/datasets/timm/mini-imagenet
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 21, 2024
    Dataset authored and provided by
    PyTorch Image Models
    License

    https://choosealicense.com/licenses/other/https://choosealicense.com/licenses/other/

    Description

    Dataset Description

    A mini version of ImageNet-1k with 100 of 1000 classes present. Unlike some 'mini' variants this one includes the original images at their original sizes. Many such subsets downsample to 84x84 or other smaller resolutions.

      Data Splits
    
    
    
    
    
      Train
    

    50000 samples from ImageNet-1k train split

      Validation
    

    10000 samples from ImageNet-1k train split

      Test
    

    5000 samples from ImageNet-1k validation split (all 50 samples per class)… See the full description on the dataset page: https://huggingface.co/datasets/timm/mini-imagenet.

  9. h

    TinyStories

    • huggingface.co
    • paperswithcode.com
    • +1more
    Updated May 16, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ronen Eldan (2023). TinyStories [Dataset]. https://huggingface.co/datasets/roneneldan/TinyStories
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    May 16, 2023
    Authors
    Ronen Eldan
    License

    https://choosealicense.com/licenses/cdla-sharing-1.0/https://choosealicense.com/licenses/cdla-sharing-1.0/

    Description

    Dataset containing synthetically generated (by GPT-3.5 and GPT-4) short stories that only use a small vocabulary. Described in the following paper: https://arxiv.org/abs/2305.07759. The models referred to in the paper were trained on TinyStories-train.txt (the file tinystories-valid.txt can be used for validation loss). These models can be found on Huggingface, at roneneldan/TinyStories-1M/3M/8M/28M/33M/1Layer-21M. Additional resources: tinystories_all_data.tar.gz - contains a superset of… See the full description on the dataset page: https://huggingface.co/datasets/roneneldan/TinyStories.

  10. h

    tiny-textbooks

    • huggingface.co
    Updated Jan 26, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Nam Pham (2024). tiny-textbooks [Dataset]. http://doi.org/10.57967/hf/1126
    Explore at:
    Dataset updated
    Jan 26, 2024
    Authors
    Nam Pham
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Textbook-like Dataset: A High-Quality Resource for Small Language Models

    The idea is simply inspired by the Textbooks Are All You Need II: phi-1.5 technical report paper. The source texts in this dataset have been gathered and carefully select the best of the falcon-refinedweb and minipile datasets to ensure the diversity, quality while tiny in size. The dataset was synthesized using 4x3090 Ti cards over a period of 500 hours, thanks to Nous-Hermes-Llama2-13b finetuned model. Why… See the full description on the dataset page: https://huggingface.co/datasets/nampdn-ai/tiny-textbooks.

  11. h

    SFC-mini

    • huggingface.co
    Updated Jun 21, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Faisal Qarah (2024). SFC-mini [Dataset]. https://huggingface.co/datasets/faisalq/SFC-mini
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 21, 2024
    Authors
    Faisal Qarah
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    faisalq/SFC-mini dataset hosted on Hugging Face and contributed by the HF Datasets community

  12. ultrachat_200k

    • huggingface.co
    • opendatalab.com
    • +1more
    Updated Oct 29, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Hugging Face H4 (2023). ultrachat_200k [Dataset]. https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 29, 2023
    Dataset provided by
    Hugging Facehttps://huggingface.co/
    Authors
    Hugging Face H4
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for UltraChat 200k

      Dataset Description
    

    This is a heavily filtered version of the UltraChat dataset and was used to train Zephyr-7B-β, a state of the art 7b chat model. The original datasets consists of 1.4M dialogues generated by ChatGPT and spanning a wide range of topics. To create UltraChat 200k, we applied the following logic:

    Selection of a subset of data for faster supervised fine tuning. Truecasing of the dataset, as we observed around 5% of the data… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k.

  13. h

    small-coco-wm_1_120k

    • huggingface.co
    Updated Feb 6, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    RIW (2023). small-coco-wm_1_120k [Dataset]. https://huggingface.co/datasets/RIW/small-coco-wm_1_120k
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 6, 2023
    Dataset authored and provided by
    RIW
    Description

    RIW/small-coco-wm_1_120k dataset hosted on Hugging Face and contributed by the HF Datasets community

  14. h

    MAIR-Results-text-embedding-3-small

    • huggingface.co
    Updated Oct 21, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    MAIR-Bench (2024). MAIR-Results-text-embedding-3-small [Dataset]. https://huggingface.co/datasets/MAIR-Bench/MAIR-Results-text-embedding-3-small
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 21, 2024
    Authors
    MAIR-Bench
    Description

    MAIR-Bench/MAIR-Results-text-embedding-3-small dataset hosted on Hugging Face and contributed by the HF Datasets community

  15. h

    all-gpt-4.1-mini

    • huggingface.co
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    testcase-eval, all-gpt-4.1-mini [Dataset]. https://huggingface.co/datasets/testcase-evaluate/all-gpt-4.1-mini
    Explore at:
    Dataset authored and provided by
    testcase-eval
    Description

    testcase-evaluate/all-gpt-4.1-mini dataset hosted on Hugging Face and contributed by the HF Datasets community

  16. h

    MBD-mini

    • huggingface.co
    Updated Aug 22, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    sb-ai-lab (2024). MBD-mini [Dataset]. https://huggingface.co/datasets/ai-lab/MBD-mini
    Explore at:
    Dataset updated
    Aug 22, 2024
    Dataset authored and provided by
    sb-ai-lab
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset summary

    This dataset is designed to assist in predicting a customer's propensity to purchase various products within a month following the reporting date. The dataset includes anonymized historical data on transaction activity, dialog embeddings, and geo-activity for some bank clients over 12 months. The mini MBD dataset contains a reduced subset of the data, making it easier and faster to work with during the development and testing phases. It includes a smaller number… See the full description on the dataset page: https://huggingface.co/datasets/ai-lab/MBD-mini.

  17. h

    dialogsum

    • huggingface.co
    Updated Jun 29, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Karthick Kaliannan Neelamohan (2022). dialogsum [Dataset]. https://huggingface.co/datasets/knkarthick/dialogsum
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 29, 2022
    Authors
    Karthick Kaliannan Neelamohan
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for DIALOGSum Corpus

      Dataset Description
    
    
    
    
    
      Links
    

    Homepage: https://aclanthology.org/2021.findings-acl.449 Repository: https://github.com/cylnlp/dialogsum Paper: https://aclanthology.org/2021.findings-acl.449 Point of Contact: https://huggingface.co/knkarthick

      Dataset Summary
    

    DialogSum is a large-scale dialogue summarization dataset, consisting of 13,460 (Plus 100 holdout data for topic generation) dialogues with corresponding… See the full description on the dataset page: https://huggingface.co/datasets/knkarthick/dialogsum.

  18. h

    mini-mmc4

    • huggingface.co
    Updated Oct 11, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Fabrice Nemo (2024). mini-mmc4 [Dataset]. https://huggingface.co/datasets/fabnem/mini-mmc4
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 11, 2024
    Authors
    Fabrice Nemo
    Description

    fabnem/mini-mmc4 dataset hosted on Hugging Face and contributed by the HF Datasets community

  19. h

    github-code-clean-small

    • huggingface.co
    Updated Sep 16, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Loubna Ben Allal (2022). github-code-clean-small [Dataset]. https://huggingface.co/datasets/loubnabnl/github-code-clean-small
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 16, 2022
    Authors
    Loubna Ben Allal
    Description

    loubnabnl/github-code-clean-small dataset hosted on Hugging Face and contributed by the HF Datasets community

  20. h

    fineweb

    • huggingface.co
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FineData, fineweb [Dataset]. http://doi.org/10.57967/hf/2493
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset authored and provided by
    FineData
    License

    https://choosealicense.com/licenses/odc-by/https://choosealicense.com/licenses/odc-by/

    Description

    🍷 FineWeb

    15 trillion tokens of the finest data the 🌐 web has to offer

      What is it?
    

    The 🍷 FineWeb dataset consists of more than 15T tokens of cleaned and deduplicated english web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release of the full dataset under… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Nam Pham (2025). mini-en [Dataset]. https://huggingface.co/datasets/nampdn-ai/mini-en

mini-en

Tiny English

nampdn-ai/mini-en

Explore at:
Dataset updated
Mar 28, 2025
Authors
Nam Pham
License

Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically

Description

Tiny English

A collection of short texts that have been curated for long-term human value. The texts in this dataset have been filtered from the falcon-refinedweb and minipile datasets to ensure better quality and tiny in size. The tiny-en dataset is concise and small in size, yet highly diverse, making it an excellent resource for training natural language processing models. Despite its compact size, the dataset offers a wide range of content that has been carefully selected for… See the full description on the dataset page: https://huggingface.co/datasets/nampdn-ai/mini-en.

Search
Clear search
Close search
Google apps
Main menu