28 datasets found

Cleaned-Darija-SFT-Mixture-6k-Stratified-Sample
huggingface.co
Updated Oct 23, 2025
Cite
mohamed stifi (2025). Cleaned-Darija-SFT-Mixture-6k-Stratified-Sample [Dataset]. https://huggingface.co/datasets/mohamed-stifi/Cleaned-Darija-SFT-Mixture-6k-Stratified-Sample
Dataset updated
Oct 23, 2025
Authors
mohamed stifi
Description
mohamed-stifi/Cleaned-Darija-SFT-Mixture-6k-Stratified-Sample dataset hosted on Hugging Face and contributed by the HF Datasets community
FormalMATH-Kimina-1.5B-Stratified
huggingface.co
Updated Aug 11, 2025
Cite
Ricardo (2025). FormalMATH-Kimina-1.5B-Stratified [Dataset]. https://huggingface.co/datasets/ricdomolm/FormalMATH-Kimina-1.5B-Stratified
Dataset updated
Aug 11, 2025
Authors
Ricardo
Description
ricdomolm/FormalMATH-Kimina-1.5B-Stratified dataset hosted on Hugging Face and contributed by the HF Datasets community
OHP-Stratified-2
huggingface.co
Updated May 10, 2024
+ more versions
Cite
ORPO Explorers (2024). OHP-Stratified-2 [Dataset]. https://huggingface.co/datasets/orpo-explorers/OHP-Stratified-2
Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
May 10, 2024
Dataset authored and provided by
ORPO Explorers
Description
orpo-explorers/OHP-Stratified-2 dataset hosted on Hugging Face and contributed by the HF Datasets community
PopQA-s-pop-stratified-split-results
huggingface.co
Cite
Daniil Korbut, PopQA-s-pop-stratified-split-results [Dataset]. https://huggingface.co/datasets/rtriangle/PopQA-s-pop-stratified-split-results
Authors
Daniil Korbut
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
rtriangle/PopQA-s-pop-stratified-split-results dataset hosted on Hugging Face and contributed by the HF Datasets community
OHP-15k-Stratified-3
huggingface.co
Updated Apr 25, 2024
+ more versions
Cite
ORPO Explorers (2024). OHP-15k-Stratified-3 [Dataset]. https://huggingface.co/datasets/orpo-explorers/OHP-15k-Stratified-3
Explore at: Croissant
Dataset updated
Apr 25, 2024
Dataset authored and provided by
ORPO Explorers
Description
orpo-explorers/OHP-15k-Stratified-3 dataset hosted on Hugging Face and contributed by the HF Datasets community
wikipar-stratified
huggingface.co
Updated Jul 7, 2025
Cite
Dmytro Chaplynskyi (2025). wikipar-stratified [Dataset]. https://huggingface.co/datasets/dchaplinsky/wikipar-stratified
Dataset updated
Jul 7, 2025
Authors
Dmytro Chaplynskyi
Description
dchaplinsky/wikipar-stratified dataset hosted on Hugging Face and contributed by the HF Datasets community
wildchat-stratified-sample
huggingface.co
Cite
Anand, wildchat-stratified-sample [Dataset]. https://huggingface.co/datasets/Avinaash/wildchat-stratified-sample
Authors
Anand
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
WildChat Stratified Sample

Dataset Description

This dataset contains a stratified sample of 263 GPT-4 conversations (347 total turns) from the WildChat dataset. The sample was carefully selected to ensure balanced representation across conversation turn positions and user message lengths.

Dataset Summary

Total Conversations: 263
Total Turns/Rows: 347
Average Turns per Conversation: 1.32
Conversation Length: 1-5 turns (conversations with >5 turns excluded)…

See the full description on the dataset page: https://huggingface.co/datasets/Avinaash/wildchat-stratified-sample.
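The balanced selection across turn positions and user message lengths described above could be sketched roughly as follows. This is only an illustration and not the dataset author's script: the field names `turn` and `user_message`, the bucket edges, and the `per_stratum` cap are all assumptions.

```python
import random
from collections import defaultdict

def length_bucket(text, edges=(50, 200)):
    """Map a message to a coarse length bucket (the edges are hypothetical)."""
    n = len(text)
    if n < edges[0]:
        return "short"
    if n < edges[1]:
        return "medium"
    return "long"

def stratified_sample(rows, per_stratum, seed=0):
    """Draw up to `per_stratum` rows from each (turn, length bucket) stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        key = (row["turn"], length_bucket(row["user_message"]))
        strata[key].append(row)
    sample = []
    for key in sorted(strata):
        group = strata[key]
        # Small strata contribute everything they have.
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample
```

Capping every stratum at the same size is what keeps rare combinations (for example, long messages at late turns) from being swamped by common ones.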
stratified-kmeans-diverse-instruction-following-100K-1M
huggingface.co
Updated Oct 8, 2025
Cite
Aman Priyanshu (2025). stratified-kmeans-diverse-instruction-following-100K-1M [Dataset]. https://huggingface.co/datasets/AmanPriyanshu/stratified-kmeans-diverse-instruction-following-100K-1M
Dataset updated
Oct 8, 2025
Authors
Aman Priyanshu
Description
Stratified K-Means Diverse Instruction-Following Dataset (100K-1M)

A carefully balanced subset combining Tulu-3 SFT Mixture and Orca AgentInstruct, featuring embedding-based k-means sampling across diverse instruction-following tasks at multiple scales.

👥 Follow the Authors

Aman Priyanshu

Supriti Vijay

Overview

This dataset provides stratified subsets at 50k, 100k, 250k, 500k, and 1M scales, combining high-quality instruction-following data from… See the full description on the dataset page: https://huggingface.co/datasets/AmanPriyanshu/stratified-kmeans-diverse-instruction-following-100K-1M.
PopQA-s-pop-stratified-split
huggingface.co
+ more versions
Cite
Mikhail Seleznyov, PopQA-s-pop-stratified-split [Dataset]. https://huggingface.co/datasets/myyycroft/PopQA-s-pop-stratified-split
Authors
Mikhail Seleznyov
Description
myyycroft/PopQA-s-pop-stratified-split dataset hosted on Hugging Face and contributed by the HF Datasets community
FormalMATH-Kimina-1.5B-Stratified-Decontaminated
huggingface.co
Updated May 8, 2025
Cite
Ricardo (2025). FormalMATH-Kimina-1.5B-Stratified-Decontaminated [Dataset]. https://huggingface.co/datasets/ricdomolm/FormalMATH-Kimina-1.5B-Stratified-Decontaminated
Dataset updated
May 8, 2025
Authors
Ricardo
Description
ricdomolm/FormalMATH-Kimina-1.5B-Stratified-Decontaminated dataset hosted on Hugging Face and contributed by the HF Datasets community
stratified-kmeans-diverse-pretraining-100K-1M
huggingface.co
Updated Oct 8, 2025
Cite
Aman Priyanshu (2025). stratified-kmeans-diverse-pretraining-100K-1M [Dataset]. https://huggingface.co/datasets/AmanPriyanshu/stratified-kmeans-diverse-pretraining-100K-1M
Dataset updated
Oct 8, 2025
Authors
Aman Priyanshu
Description
Stratified K-Means Diverse Pre-Training Dataset (100K-1M)

A carefully balanced subset combining FineWeb-Edu and Proof-Pile-2, featuring embedding-based k-means sampling to ensure diverse representation across educational and mathematical/scientific content at multiple scales.

👥 Follow the Authors

Aman Priyanshu

Supriti Vijay

Overview

This dataset provides stratified subsets at 50k, 100k, 250k, 500k, and 1M scales, combining high-quality… See the full description on the dataset page: https://huggingface.co/datasets/AmanPriyanshu/stratified-kmeans-diverse-pretraining-100K-1M.
BigEarthS2-RGB-1k-stratified-balanced
huggingface.co
+ more versions
Cite
Jichao Fang, BigEarthS2-RGB-1k-stratified-balanced [Dataset]. https://huggingface.co/datasets/jfang/BigEarthS2-RGB-1k-stratified-balanced
Authors
Jichao Fang
Description
jfang/BigEarthS2-RGB-1k-stratified-balanced dataset hosted on Hugging Face and contributed by the HF Datasets community
BigEarthS2-RGB-10k-stratified-balanced
huggingface.co
+ more versions
Cite
Jichao Fang, BigEarthS2-RGB-10k-stratified-balanced [Dataset]. https://huggingface.co/datasets/jfang/BigEarthS2-RGB-10k-stratified-balanced
Authors
Jichao Fang
Description
jfang/BigEarthS2-RGB-10k-stratified-balanced dataset hosted on Hugging Face and contributed by the HF Datasets community
BigEarthS2-RGB-10k-stratified-min-guarantee
huggingface.co
Updated Jul 27, 2025
+ more versions
Cite
Jichao Fang (2025). BigEarthS2-RGB-10k-stratified-min-guarantee [Dataset]. https://huggingface.co/datasets/jfang/BigEarthS2-RGB-10k-stratified-min-guarantee
Dataset updated
Jul 27, 2025
Authors
Jichao Fang
Description
jfang/BigEarthS2-RGB-10k-stratified-min-guarantee dataset hosted on Hugging Face and contributed by the HF Datasets community
stratified-kmeans-diverse-reasoning-100K-1M
huggingface.co
Updated Oct 8, 2025
Cite
Aman Priyanshu (2025). stratified-kmeans-diverse-reasoning-100K-1M [Dataset]. https://huggingface.co/datasets/AmanPriyanshu/stratified-kmeans-diverse-reasoning-100K-1M
Dataset updated
Oct 8, 2025
Authors
Aman Priyanshu
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Stratified K-Means Diverse Reasoning Dataset (100K-1M)

A carefully balanced subset of NVIDIA's Llama-Nemotron Post-Training Dataset, featuring square-root rebalanced sampling across math, code, science, instruction-following, chat, and safety tasks at multiple scales.

👥 Follow the Authors

Aman Priyanshu

Supriti Vijay

Overview

This dataset provides stratified subsets at 50k, 100k, 250k, 500k, and 1M scales from the Llama-Nemotron… See the full description on the dataset page: https://huggingface.co/datasets/AmanPriyanshu/stratified-kmeans-diverse-reasoning-100K-1M.
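As a rough illustration of the square-root rebalanced sampling mentioned in the description above, the sketch below allocates a sampling budget in proportion to the square root of each category's size. It is an assumption about what such a scheme generally looks like, not the authors' code; the function name and the category counts in the usage note are hypothetical.

```python
import math

def sqrt_allocation(counts, total):
    """Split `total` samples across categories in proportion to sqrt(count).

    Each allocation is rounded and capped at the category size, so the
    result may not sum exactly to `total`.
    """
    weights = {name: math.sqrt(n) for name, n in counts.items()}
    norm = sum(weights.values())
    return {
        name: min(counts[name], round(total * w / norm))
        for name, w in weights.items()
    }
```

With hypothetical counts `{"math": 10000, "code": 2500, "safety": 100}` and a budget of 1600, the largest category receives 62.5% of the samples instead of the roughly 79% it would get under proportional allocation, which is the flattening effect square-root weighting is meant to provide.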
dolphin-r1-deepseek-stratified
huggingface.co
Updated Jul 5, 2025
Cite
Unavailable (2025). dolphin-r1-deepseek-stratified [Dataset]. https://huggingface.co/datasets/Ewere/dolphin-r1-deepseek-stratified
Dataset updated
Jul 5, 2025
Authors
Unavailable
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
This is a direct copy of mlabonne/dolphin-r1-deepseek with extra stratification information added in post-processing. The strata include:

task_type
output_length
response_style
domain
complexity

The strata information has not been validated; use at your own risk.
ag_news
huggingface.co
Updated Sep 14, 2025
+ more versions
Cite
SZ (2025). ag_news [Dataset]. https://huggingface.co/datasets/szhuggingface/ag_news
Dataset updated
Sep 14, 2025
Authors
SZ
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
Stratified and Nested Subsets of AG News for Performance Benchmarking

Dataset Summary

This repository contains a script to generate stratified and progressively smaller, nested subsets of the AG News dataset. It was specifically designed to benchmark the performance (e.g., accuracy, training time, and resource usage) of language models on varying amounts of training data. This new version improves the benchmarking process by:

Preserving the Original Test Set: It uses the… See the full description on the dataset page: https://huggingface.co/datasets/szhuggingface/ag_news.
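One way to build stratified, progressively smaller, nested subsets like those described above is to shuffle each class once and take prefixes of increasing length. The sketch below is not the repository's script; it assumes an equal allocation per class and a hypothetical function name.

```python
import random
from collections import defaultdict

def nested_stratified_subsets(labels, sizes, seed=0):
    """Return {size: sorted indices}, with smaller subsets contained in larger ones.

    Assumes each size divides evenly by the number of classes and that every
    class has enough examples; both are simplifications for this sketch.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    for indices in by_label.values():
        rng.shuffle(indices)
    subsets = {}
    for size in sizes:
        per_class = size // len(by_label)
        # Taking a prefix of each shuffled class list makes the subsets nested.
        chosen = [i for indices in by_label.values() for i in indices[:per_class]]
        subsets[size] = sorted(chosen)
    return subsets
```

Because every subset is a prefix of the same per-class shuffle, a model trained on the smallest subset has seen only examples that also appear in each larger one, which keeps comparisons across training-set sizes consistent.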
AmDi.alpha.stratified.small
huggingface.co
Updated Feb 28, 2008
+ more versions
Cite
Kai Kugler (2008). AmDi.alpha.stratified.small [Dataset]. https://huggingface.co/datasets/kugler/AmDi.alpha.stratified.small
Dataset updated
Feb 28, 2008
Authors
Kai Kugler
Description
kugler/AmDi.alpha.stratified.small dataset hosted on Hugging Face and contributed by the HF Datasets community
punjabi-sentiment-tagger-stratified
huggingface.co
Updated Aug 20, 2025
+ more versions
Cite
Polyglots FYP (2025). punjabi-sentiment-tagger-stratified [Dataset]. https://huggingface.co/datasets/polyglots/punjabi-sentiment-tagger-stratified
Dataset updated
Aug 20, 2025
Dataset authored and provided by
Polyglots FYP
Description
polyglots/punjabi-sentiment-tagger-stratified dataset hosted on Hugging Face and contributed by the HF Datasets community
OmniDocBench
huggingface.co
Updated Jan 31, 2025
Cite
Quivr (2025). OmniDocBench [Dataset]. https://huggingface.co/datasets/Quivr/OmniDocBench
Explore at: Croissant
Dataset updated
Jan 31, 2025
Dataset authored and provided by
Quivr
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
Forked from opendatalab/OmniDocBench.

Sampler

We have added a simple Python tool for filtering and performing stratified sampling on OmniDocBench data.

Features

Filter JSON entries based on custom criteria
Perform stratified sampling based on multiple categories
Handle nested JSON fields

Installation

Local Development Install (Recommended)

git clone https://huggingface.co/Quivr/OmniDocBench.git
cd OmniDocBench
pip install -r requirements.txt
#…

See the full description on the dataset page: https://huggingface.co/datasets/Quivr/OmniDocBench.
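The three features listed above (filtering, stratified sampling over several categories, and nested-field handling) could be combined along the lines of the sketch below. It does not reproduce the fork's actual tool or its interface: the helper names, the dotted-path convention, and field names such as `page_info.language` are all hypothetical.

```python
import random
from collections import defaultdict

def get_nested(entry, dotted_path, default=None):
    """Look up a value such as 'page_info.language' inside nested dicts."""
    current = entry
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current

def filter_and_sample(entries, keep, strata_paths, per_stratum, seed=0):
    """Keep entries passing `keep`, then sample up to `per_stratum` per stratum."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for entry in entries:
        if keep(entry):
            # A stratum is the tuple of values found at each nested path.
            key = tuple(get_nested(entry, path) for path in strata_paths)
            groups[key].append(entry)
    sample = []
    for key in sorted(groups, key=repr):
        group = groups[key]
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample
```

A call such as `filter_and_sample(entries, lambda e: get_nested(e, "page_info.language") == "en", ["page_info.language", "page_info.layout"], 3)` would keep only English pages and then draw at most three entries per layout, again using made-up field names.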
