mohamed-stifi/Cleaned-Darija-SFT-Mixture-6k-Stratified-Sample dataset hosted on Hugging Face and contributed by the HF Datasets community
ricdomolm/FormalMATH-Kimina-1.5B-Stratified dataset hosted on Hugging Face and contributed by the HF Datasets community
orpo-explorers/OHP-Stratified-2 dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
rtriangle/PopQA-s-pop-stratified-split-results dataset hosted on Hugging Face and contributed by the HF Datasets community
orpo-explorers/OHP-15k-Stratified-3 dataset hosted on Hugging Face and contributed by the HF Datasets community
dchaplinsky/wikipar-stratified dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
WildChat Stratified Sample
Dataset Description
This dataset contains a stratified sample of 263 GPT-4 conversations (347 total turns) from the WildChat dataset. The sample was carefully selected to ensure balanced representation across conversation turn positions and user message lengths.
Dataset Summary
Total Conversations: 263
Total Turns/Rows: 347
Average Turns per Conversation: 1.32
Conversation Length: 1-5 turns (conversations with >5 turns excluded)…
See the full description on the dataset page: https://huggingface.co/datasets/Avinaash/wildchat-stratified-sample.
Stratified K-Means Diverse Instruction-Following Dataset (100K-1M)
A carefully balanced subset combining Tulu-3 SFT Mixture and Orca AgentInstruct, featuring embedding-based k-means sampling across diverse instruction-following tasks at multiple scales.
👥 Follow the Authors
Aman Priyanshu
Supriti Vijay
Overview
This dataset provides stratified subsets at 50k, 100k, 250k, 500k, and 1M scales, combining high-quality instruction-following data from… See the full description on the dataset page: https://huggingface.co/datasets/AmanPriyanshu/stratified-kmeans-diverse-instruction-following-100K-1M.
myyycroft/PopQA-s-pop-stratified-split dataset hosted on Hugging Face and contributed by the HF Datasets community
ricdomolm/FormalMATH-Kimina-1.5B-Stratified-Decontaminated dataset hosted on Hugging Face and contributed by the HF Datasets community
Stratified K-Means Diverse Pre-Training Dataset (100K-1M)
A carefully balanced subset combining FineWeb-Edu and Proof-Pile-2, featuring embedding-based k-means sampling to ensure diverse representation across educational and mathematical/scientific content at multiple scales.
👥 Follow the Authors
Aman Priyanshu
Supriti Vijay
Overview
This dataset provides stratified subsets at 50k, 100k, 250k, 500k, and 1M scales, combining high-quality… See the full description on the dataset page: https://huggingface.co/datasets/AmanPriyanshu/stratified-kmeans-diverse-pretraining-100K-1M.
jfang/BigEarthS2-RGB-1k-stratified-balanced dataset hosted on Hugging Face and contributed by the HF Datasets community
jfang/BigEarthS2-RGB-10k-stratified-balanced dataset hosted on Hugging Face and contributed by the HF Datasets community
jfang/BigEarthS2-RGB-10k-stratified-min-guarantee dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Stratified K-Means Diverse Reasoning Dataset (100K-1M)
A carefully balanced subset of NVIDIA's Llama-Nemotron Post-Training Dataset, featuring square-root rebalanced sampling across math, code, science, instruction-following, chat, and safety tasks at multiple scales.
👥 Follow the Authors
Aman Priyanshu
Supriti Vijay
Overview
This dataset provides stratified subsets at 50k, 100k, 250k, 500k, and 1M scales from the Llama-Nemotron… See the full description on the dataset page: https://huggingface.co/datasets/AmanPriyanshu/stratified-kmeans-diverse-reasoning-100K-1M.
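The "square-root rebalanced sampling" mentioned in this description can be illustrated with a minimal sketch. It assumes the usual meaning of the term: each task category receives a share of the sample budget proportional to the square root of its size, which boosts small categories relative to plain proportional sampling. The function name, the category sizes, and the leftover-handling rule are illustrative assumptions, not the dataset authors' code.

```python
import math


def sqrt_rebalanced_quotas(category_sizes, budget):
    """Split `budget` rows across categories in proportion to sqrt(size).

    Hypothetical helper for illustration; not taken from the dataset card.
    """
    weights = {name: math.sqrt(size) for name, size in category_sizes.items()}
    total_weight = sum(weights.values())

    # Floor each share and never request more rows than a category holds.
    quotas = {
        name: min(category_sizes[name], int(budget * weight / total_weight))
        for name, weight in weights.items()
    }

    # Hand out rows lost to flooring one at a time, heaviest category first.
    leftover = budget - sum(quotas.values())
    order = sorted(weights, key=weights.get, reverse=True)
    while leftover > 0:
        progressed = False
        for name in order:
            if leftover == 0:
                break
            if quotas[name] < category_sizes[name]:
                quotas[name] += 1
                leftover -= 1
                progressed = True
        if not progressed:
            break  # the budget exceeds the rows available in all categories
    return quotas


# Made-up category sizes: the raw ratio is 100:25:1, the sampled ratio is 10:5:1.
quotas = sqrt_rebalanced_quotas({"math": 10000, "code": 2500, "chat": 100}, 160)
# quotas == {"math": 100, "code": 50, "chat": 10}
```

With proportional sampling the same budget would give "chat" only one or two rows; the square-root weighting gives it ten.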
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This is a direct copy of mlabonne/dolphin-r1-deepseek with extra stratification information added in post-processing. The strata include:
task_type
output_length
response_style
domain
complexity
The strata information has not been validated; use at your own risk.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Stratified and Nested Subsets of AG News for Performance Benchmarking
Dataset Summary
This repository contains a script to generate stratified and progressively smaller, nested subsets of the AG News dataset. It was specifically designed to benchmark the performance (e.g., accuracy, training time, and resource usage) of language models on varying amounts of training data. This new version improves the benchmarking process by:
Preserving the Original Test Set: It uses the… See the full description on the dataset page: https://huggingface.co/datasets/szhuggingface/ag_news.
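One way to obtain subsets that are both stratified and nested is sketched below. This is an illustration under stated assumptions, not the repository's script: the function name, the fractions, and the seed are hypothetical, and only the training labels are handled. The idea is to shuffle each class once and take prefixes of that fixed order, so every smaller subset is contained in every larger one while class proportions are preserved.

```python
import random
from collections import defaultdict


def nested_stratified_subsets(labels, fractions, seed=0):
    """Return {fraction: sorted row indices}, stratified by label and nested.

    Hypothetical helper for illustration; not the repository's script.
    """
    rng = random.Random(seed)

    # Group row indices by class label.
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    # Shuffle each class once; prefixes of one fixed order are nested by construction.
    for indices in by_label.values():
        rng.shuffle(indices)

    subsets = {}
    for fraction in fractions:
        chosen = []
        for indices in by_label.values():
            keep = max(1, round(len(indices) * fraction))  # at least one row per class
            chosen.extend(indices[:keep])
        subsets[fraction] = sorted(chosen)
    return subsets


# Toy labels standing in for a four-class dataset such as AG News.
toy_labels = [i % 4 for i in range(400)]
subsets = nested_stratified_subsets(toy_labels, fractions=[0.5, 0.25, 0.1])
```

Because the per-class order is fixed before any subset is cut, a model trained on the 10% subset has seen only rows that also appear in the 25% and 50% subsets, which keeps comparisons across data sizes consistent.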
kugler/AmDi.alpha.stratified.small dataset hosted on Hugging Face and contributed by the HF Datasets community
polyglots/punjabi-sentiment-tagger-stratified dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Forked from opendatalab/OmniDocBench.
Sampler
We have added a simple Python tool for filtering and performing stratified sampling on OmniDocBench data.
Features
Filter JSON entries based on custom criteria
Perform stratified sampling based on multiple categories
Handle nested JSON fields
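The three features listed above can be sketched in a few lines of plain Python. This is not the tool shipped in the repository: the function names, the dotted-path convention, and the field names (`meta.language`, `meta.layout`) are hypothetical and do not reflect the actual OmniDocBench schema.

```python
import random
from collections import defaultdict


def get_nested(entry, dotted_path, default=None):
    """Read a nested field such as 'meta.language' from a JSON-like dict."""
    current = entry
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def stratified_sample(entries, paths, per_stratum, keep=lambda entry: True, seed=0):
    """Filter entries with `keep`, then draw up to `per_stratum` rows per stratum.

    A stratum is the tuple of values found at `paths`, so several categories
    can be combined. Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for entry in entries:
        if keep(entry):
            key = tuple(get_nested(entry, path) for path in paths)
            strata[key].append(entry)

    sample = []
    for key in sorted(strata, key=repr):  # fixed order keeps runs reproducible
        group = strata[key]
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample


# Made-up entries: 2 languages x 2 layouts, 3 pages each.
entries = [
    {"id": f"{lang}-{layout}-{n}", "meta": {"language": lang, "layout": layout}}
    for lang in ("en", "zh")
    for layout in ("single", "double")
    for n in range(3)
]
picked = stratified_sample(entries, ["meta.language", "meta.layout"], per_stratum=2)
english_only = stratified_sample(
    entries,
    ["meta.layout"],
    per_stratum=2,
    keep=lambda entry: get_nested(entry, "meta.language") == "en",
)
```

Here `picked` holds two entries from each of the four language/layout combinations, and `english_only` applies the filter first and then samples two entries per layout.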
Installation
Local Development Install (Recommended)
git clone https://huggingface.co/Quivr/OmniDocBench.git
cd OmniDocBench
pip install -r requirements.txt
# …
See the full description on the dataset page: https://huggingface.co/datasets/Quivr/OmniDocBench.