28 datasets found
  1. Cleaned-Darija-SFT-Mixture-6k-Stratified-Sample

    • huggingface.co
    Updated Oct 23, 2025
    Cite
    mohamed stifi (2025). Cleaned-Darija-SFT-Mixture-6k-Stratified-Sample [Dataset]. https://huggingface.co/datasets/mohamed-stifi/Cleaned-Darija-SFT-Mixture-6k-Stratified-Sample
    Dataset updated
    Oct 23, 2025
    Authors
    mohamed stifi
    Description

    mohamed-stifi/Cleaned-Darija-SFT-Mixture-6k-Stratified-Sample dataset hosted on Hugging Face and contributed by the HF Datasets community

  2. FormalMATH-Kimina-1.5B-Stratified

    • huggingface.co
    Updated Aug 11, 2025
    Cite
    Ricardo (2025). FormalMATH-Kimina-1.5B-Stratified [Dataset]. https://huggingface.co/datasets/ricdomolm/FormalMATH-Kimina-1.5B-Stratified
    Dataset updated
    Aug 11, 2025
    Authors
    Ricardo
    Description

    ricdomolm/FormalMATH-Kimina-1.5B-Stratified dataset hosted on Hugging Face and contributed by the HF Datasets community

  3. OHP-Stratified-2

    • huggingface.co
    Updated May 10, 2024
    + more versions
    Cite
    ORPO Explorers (2024). OHP-Stratified-2 [Dataset]. https://huggingface.co/datasets/orpo-explorers/OHP-Stratified-2
    Dataset updated
    May 10, 2024
    Dataset authored and provided by
    ORPO Explorers
    Description

    orpo-explorers/OHP-Stratified-2 dataset hosted on Hugging Face and contributed by the HF Datasets community

  4. PopQA-s-pop-stratified-split-results

    • huggingface.co
    Cite
    Daniil Korbut, PopQA-s-pop-stratified-split-results [Dataset]. https://huggingface.co/datasets/rtriangle/PopQA-s-pop-stratified-split-results
    Authors
    Daniil Korbut
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    rtriangle/PopQA-s-pop-stratified-split-results dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. OHP-15k-Stratified-3

    • huggingface.co
    Updated Apr 25, 2024
    + more versions
    Cite
    ORPO Explorers (2024). OHP-15k-Stratified-3 [Dataset]. https://huggingface.co/datasets/orpo-explorers/OHP-15k-Stratified-3
    Dataset updated
    Apr 25, 2024
    Dataset authored and provided by
    ORPO Explorers
    Description

    orpo-explorers/OHP-15k-Stratified-3 dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. wikipar-stratified

    • huggingface.co
    Updated Jul 7, 2025
    Cite
    Dmytro Chaplynskyi (2025). wikipar-stratified [Dataset]. https://huggingface.co/datasets/dchaplinsky/wikipar-stratified
    Dataset updated
    Jul 7, 2025
    Authors
    Dmytro Chaplynskyi
    Description

    dchaplinsky/wikipar-stratified dataset hosted on Hugging Face and contributed by the HF Datasets community

  7. wildchat-stratified-sample

    • huggingface.co
    Cite
    Anand, wildchat-stratified-sample [Dataset]. https://huggingface.co/datasets/Avinaash/wildchat-stratified-sample
    Authors
    Anand
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    WildChat Stratified Sample

      Dataset Description
    

    This dataset contains a stratified sample of 263 GPT-4 conversations (347 total turns) from the WildChat dataset. The sample was carefully selected to ensure balanced representation across conversation turn positions and user message lengths.

      Dataset Summary
    

    Total Conversations: 263
    Total Turns/Rows: 347
    Average Turns per Conversation: 1.32
    Conversation Length: 1-5 turns (conversations with >5 turns excluded)…
    See the full description on the dataset page: https://huggingface.co/datasets/Avinaash/wildchat-stratified-sample.

  8. stratified-kmeans-diverse-instruction-following-100K-1M

    • huggingface.co
    Updated Oct 8, 2025
    Cite
    Aman Priyanshu (2025). stratified-kmeans-diverse-instruction-following-100K-1M [Dataset]. https://huggingface.co/datasets/AmanPriyanshu/stratified-kmeans-diverse-instruction-following-100K-1M
    Dataset updated
    Oct 8, 2025
    Authors
    Aman Priyanshu
    Description

    Stratified K-Means Diverse Instruction-Following Dataset (100K-1M)

    A carefully balanced subset combining Tulu-3 SFT Mixture and Orca AgentInstruct, featuring embedding-based k-means sampling across diverse instruction-following tasks at multiple scales.

      👥 Follow the Authors
    

    Aman Priyanshu

    Supriti Vijay

      Overview
    

    This dataset provides stratified subsets at 50k, 100k, 250k, 500k, and 1M scales, combining high-quality instruction-following data from… See the full description on the dataset page: https://huggingface.co/datasets/AmanPriyanshu/stratified-kmeans-diverse-instruction-following-100K-1M.

  9. PopQA-s-pop-stratified-split

    • huggingface.co
    + more versions
    Cite
    Mikhail Seleznyov, PopQA-s-pop-stratified-split [Dataset]. https://huggingface.co/datasets/myyycroft/PopQA-s-pop-stratified-split
    Authors
    Mikhail Seleznyov
    Description

    myyycroft/PopQA-s-pop-stratified-split dataset hosted on Hugging Face and contributed by the HF Datasets community

  10. FormalMATH-Kimina-1.5B-Stratified-Decontaminated

    • huggingface.co
    Updated May 8, 2025
    Cite
    Ricardo (2025). FormalMATH-Kimina-1.5B-Stratified-Decontaminated [Dataset]. https://huggingface.co/datasets/ricdomolm/FormalMATH-Kimina-1.5B-Stratified-Decontaminated
    Dataset updated
    May 8, 2025
    Authors
    Ricardo
    Description

    ricdomolm/FormalMATH-Kimina-1.5B-Stratified-Decontaminated dataset hosted on Hugging Face and contributed by the HF Datasets community

  11. stratified-kmeans-diverse-pretraining-100K-1M

    • huggingface.co
    Updated Oct 8, 2025
    Cite
    Aman Priyanshu (2025). stratified-kmeans-diverse-pretraining-100K-1M [Dataset]. https://huggingface.co/datasets/AmanPriyanshu/stratified-kmeans-diverse-pretraining-100K-1M
    Dataset updated
    Oct 8, 2025
    Authors
    Aman Priyanshu
    Description

    Stratified K-Means Diverse Pre-Training Dataset (100K-1M)

    A carefully balanced subset combining FineWeb-Edu and Proof-Pile-2, featuring embedding-based k-means sampling to ensure diverse representation across educational and mathematical/scientific content at multiple scales.

      👥 Follow the Authors
    

    Aman Priyanshu

    Supriti Vijay

      Overview
    

    This dataset provides stratified subsets at 50k, 100k, 250k, 500k, and 1M scales, combining high-quality… See the full description on the dataset page: https://huggingface.co/datasets/AmanPriyanshu/stratified-kmeans-diverse-pretraining-100K-1M.

  12. BigEarthS2-RGB-1k-stratified-balanced

    • huggingface.co
    + more versions
    Cite
    Jichao Fang, BigEarthS2-RGB-1k-stratified-balanced [Dataset]. https://huggingface.co/datasets/jfang/BigEarthS2-RGB-1k-stratified-balanced
    Authors
    Jichao Fang
    Description

    jfang/BigEarthS2-RGB-1k-stratified-balanced dataset hosted on Hugging Face and contributed by the HF Datasets community

  13. BigEarthS2-RGB-10k-stratified-balanced

    • huggingface.co
    + more versions
    Cite
    Jichao Fang, BigEarthS2-RGB-10k-stratified-balanced [Dataset]. https://huggingface.co/datasets/jfang/BigEarthS2-RGB-10k-stratified-balanced
    Authors
    Jichao Fang
    Description

    jfang/BigEarthS2-RGB-10k-stratified-balanced dataset hosted on Hugging Face and contributed by the HF Datasets community

  14. BigEarthS2-RGB-10k-stratified-min-guarantee

    • huggingface.co
    Updated Jul 27, 2025
    + more versions
    Cite
    Jichao Fang (2025). BigEarthS2-RGB-10k-stratified-min-guarantee [Dataset]. https://huggingface.co/datasets/jfang/BigEarthS2-RGB-10k-stratified-min-guarantee
    Dataset updated
    Jul 27, 2025
    Authors
    Jichao Fang
    Description

    jfang/BigEarthS2-RGB-10k-stratified-min-guarantee dataset hosted on Hugging Face and contributed by the HF Datasets community

  15. stratified-kmeans-diverse-reasoning-100K-1M

    • huggingface.co
    Updated Oct 8, 2025
    Cite
    Aman Priyanshu (2025). stratified-kmeans-diverse-reasoning-100K-1M [Dataset]. https://huggingface.co/datasets/AmanPriyanshu/stratified-kmeans-diverse-reasoning-100K-1M
    Dataset updated
    Oct 8, 2025
    Authors
    Aman Priyanshu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Stratified K-Means Diverse Reasoning Dataset (100K-1M)

    A carefully balanced subset of NVIDIA's Llama-Nemotron Post-Training Dataset, featuring square-root rebalanced sampling across math, code, science, instruction-following, chat, and safety tasks at multiple scales.

      👥 Follow the Authors
    

    Aman Priyanshu

    Supriti Vijay

      Overview
    

    This dataset provides stratified subsets at 50k, 100k, 250k, 500k, and 1M scales from the Llama-Nemotron… See the full description on the dataset page: https://huggingface.co/datasets/AmanPriyanshu/stratified-kmeans-diverse-reasoning-100K-1M.
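
The description above mentions "square-root rebalanced sampling" without showing the procedure. One common reading, sketched here as an assumption rather than the dataset authors' actual method, is to give each category a quota proportional to the square root of its size, which boosts small categories relative to plain proportional sampling. The function name `sqrt_rebalanced_quotas` and the category counts are hypothetical.

```python
import math


def sqrt_rebalanced_quotas(category_sizes, total):
    """Split `total` samples across categories in proportion to sqrt(size).

    Hypothetical helper: quotas are capped at the category size so a
    category is never asked for more rows than it has.
    """
    weights = {name: math.sqrt(size) for name, size in category_sizes.items()}
    weight_sum = sum(weights.values())
    quotas = {}
    for name, size in category_sizes.items():
        share = round(total * weights[name] / weight_sum)
        quotas[name] = min(size, share)
    return quotas


# Made-up category counts, chosen only to keep the arithmetic readable.
counts = {"math": 1_000_000, "code": 250_000, "safety": 10_000}
quotas = sqrt_rebalanced_quotas(counts, total=16_000)
print(quotas)
```

With these made-up counts, the smallest category receives 1,000 of the 16,000 samples, where strictly proportional sampling would give it about 127.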

  16. dolphin-r1-deepseek-stratified

    • huggingface.co
    Updated Jul 5, 2025
    Cite
    Unavailable (2025). dolphin-r1-deepseek-stratified [Dataset]. https://huggingface.co/datasets/Ewere/dolphin-r1-deepseek-stratified
    Dataset updated
    Jul 5, 2025
    Authors
    Unavailable
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This is a direct copy of mlabonne/dolphin-r1-deepseek with extra stratification information added in post-processing. The strata include:

    task_type
    output_length
    response_style
    domain
    complexity

    No validation has been done on the strata information; use at your own risk.

  17. ag_news

    • huggingface.co
    Updated Sep 14, 2025
    + more versions
    Cite
    SZ (2025). ag_news [Dataset]. https://huggingface.co/datasets/szhuggingface/ag_news
    Dataset updated
    Sep 14, 2025
    Authors
    SZ
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Stratified and Nested Subsets of AG News for Performance Benchmarking

      Dataset Summary
    

    This repository contains a script to generate stratified and progressively smaller, nested subsets of the AG News dataset. It was specifically designed to benchmark the performance (e.g., accuracy, training time, and resource usage) of language models on varying amounts of training data. This new version improves the benchmarking process by:

    Preserving the Original Test Set: It uses the… See the full description on the dataset page: https://huggingface.co/datasets/szhuggingface/ag_news.
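
The repository's own script is not shown in this listing. As a rough illustration of "stratified and progressively smaller, nested subsets", the sketch below shuffles the indices of each class once and takes a prefix of every class for each fraction, so a smaller subset is always contained in a larger one. The function name `nested_stratified_subsets`, its parameters, and the toy labels are assumptions for illustration, and only the training indices are handled.

```python
import random
from collections import defaultdict


def nested_stratified_subsets(labels, fractions, seed=0):
    """Return one sorted index list per fraction, largest fraction first.

    Each class is shuffled once; every subset takes a prefix of that
    shuffled order, which keeps class proportions and makes the
    subsets nested.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    for indices in by_label.values():
        rng.shuffle(indices)

    subsets = []
    for frac in sorted(fractions, reverse=True):
        chosen = []
        for indices in by_label.values():
            keep = max(1, int(len(indices) * frac))  # at least one row per class
            chosen.extend(indices[:keep])
        subsets.append(sorted(chosen))
    return subsets


# Toy labels: four balanced classes of 100 rows each.
labels = [0, 1, 2, 3] * 100
subsets = nested_stratified_subsets(labels, fractions=[0.5, 0.1, 0.25])
print([len(s) for s in subsets])  # [200, 100, 40]
```

Because every subset is a prefix of the same per-class ordering, a model trained on the 10% subset has seen only rows that are also in the 25% and 50% subsets, which keeps comparisons across training-set sizes consistent.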

  18. AmDi.alpha.stratified.small

    • huggingface.co
    Updated Feb 28, 2008
    + more versions
    Cite
    Kai Kugler (2008). AmDi.alpha.stratified.small [Dataset]. https://huggingface.co/datasets/kugler/AmDi.alpha.stratified.small
    Dataset updated
    Feb 28, 2008
    Authors
    Kai Kugler
    Description

    kugler/AmDi.alpha.stratified.small dataset hosted on Hugging Face and contributed by the HF Datasets community

  19. punjabi-sentiment-tagger-stratified

    • huggingface.co
    Updated Aug 20, 2025
    + more versions
    Cite
    Polyglots FYP (2025). punjabi-sentiment-tagger-stratified [Dataset]. https://huggingface.co/datasets/polyglots/punjabi-sentiment-tagger-stratified
    Dataset updated
    Aug 20, 2025
    Dataset authored and provided by
    Polyglots FYP
    Description

    polyglots/punjabi-sentiment-tagger-stratified dataset hosted on Hugging Face and contributed by the HF Datasets community

  20. OmniDocBench

    • huggingface.co
    Updated Jan 31, 2025
    Cite
    Quivr (2025). OmniDocBench [Dataset]. https://huggingface.co/datasets/Quivr/OmniDocBench
    Dataset updated
    Jan 31, 2025
    Dataset authored and provided by
    Quivr
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Forked from opendatalab/OmniDocBench.

      Sampler
    

    We have added a simple Python tool for filtering and performing stratified sampling on OmniDocBench data.

      Features
    

    Filter JSON entries based on custom criteria
    Perform stratified sampling based on multiple categories
    Handle nested JSON fields

      Installation
    
    
    
    
    
      Local Development Install (Recommended)
    

    git clone https://huggingface.co/Quivr/OmniDocBench.git
    cd OmniDocBench
    pip install -r requirements.txt
    #…
    See the full description on the dataset page: https://huggingface.co/datasets/Quivr/OmniDocBench.
