87 datasets found

distilabel-intel-orca-dpo-pairs
huggingface.co
Updated Dec 11, 2024
Link copied
Cite
Argilla (2024). distilabel-intel-orca-dpo-pairs [Dataset]. https://huggingface.co/datasets/argilla/distilabel-intel-orca-dpo-pairs
Dataset updated
Dec 11, 2024
Dataset authored and provided by
Argilla
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
distilabel Orca Pairs for DPO

This dataset is a "distilabeled" version of the widely used Intel/orca_dpo_pairs dataset. The original has been used by hundreds of open-source practitioners and models. We knew from fixing UltraFeedback (and before that, Alpacas and Dollys) that this dataset could be improved substantially. Continuing with our mission to build the best alignment datasets for open-source LLMs and the community, we spent a few hours improving it with… See the full description on the dataset page: https://huggingface.co/datasets/argilla/distilabel-intel-orca-dpo-pairs.

distilabel-reflection-tuning
huggingface.co
Updated Sep 6, 2024
Cite
Gabriel Martín Blázquez (2024). distilabel-reflection-tuning [Dataset]. https://huggingface.co/datasets/gabrielmbmb/distilabel-reflection-tuning
Explore at:
Croissant: Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 6, 2024
Authors
Gabriel Martín Blázquez
Description
Dataset Card for distilabel-reflection-tuning

This dataset has been created with distilabel. The pipeline script was uploaded to easily reproduce the dataset: reflection.py. It can be run directly using the CLI: distilabel pipeline run --script "https://huggingface.co/datasets/gabrielmbmb/distilabel-reflection-tuning/raw/main/reflection.py"

Dataset Summary

This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated… See the full description on the dataset page: https://huggingface.co/datasets/gabrielmbmb/distilabel-reflection-tuning.

example-dataset-distilabel
huggingface.co
Updated Aug 27, 2025
Cite
David Berenstein (2025). example-dataset-distilabel [Dataset]. https://huggingface.co/datasets/davidberenstein1957/example-dataset-distilabel
Dataset updated
Aug 27, 2025
Authors
David Berenstein
Description
Dataset Card for example-dataset-distilabel

This dataset has been created with distilabel.

Dataset Summary

This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/davidberenstein1957/example-dataset-distilabel/raw/main/pipeline.yaml"

or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/davidberenstein1957/example-dataset-distilabel.
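Most of the dataset cards in this listing point at a pipeline.yaml using the same URL pattern (dataset page, then /raw/main/pipeline.yaml) and the same two CLI subcommands, run and info. As a minimal sketch, the pattern can be reproduced as below. The helper names pipeline_config_url and distilabel_command are hypothetical and exist only for this illustration; the code builds the command line but does not execute it, and it assumes the file lives on the main branch as in the URLs quoted above.

```python
import shlex

# Prefix taken from the dataset URLs quoted in the listing.
HF_BASE = "https://huggingface.co/datasets"


def pipeline_config_url(repo_id: str, filename: str = "pipeline.yaml") -> str:
    """Return the raw-file URL of `filename` on the dataset repo's main branch."""
    return f"{HF_BASE}/{repo_id}/raw/main/{filename}"


def distilabel_command(repo_id: str, action: str = "run") -> list[str]:
    """Build the argv for `distilabel pipeline run|info --config <url>`.

    Hypothetical helper: it only assembles the command shown in the
    dataset cards; it does not call distilabel.
    """
    if action not in ("run", "info"):
        raise ValueError("action must be 'run' or 'info'")
    return ["distilabel", "pipeline", action, "--config", pipeline_config_url(repo_id)]


if __name__ == "__main__":
    argv = distilabel_command("davidberenstein1957/example-dataset-distilabel")
    # Print a shell-safe version of the command for copy and paste.
    print(shlex.join(argv))
```

Returning an argv list rather than a single string keeps the URL as one argument if the command is later passed to subprocess.run.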

distilabel-magpie-dataset-ray
huggingface.co
Updated Jul 30, 2024
Cite
Gabriel Martín Blázquez (2024). distilabel-magpie-dataset-ray [Dataset]. https://huggingface.co/datasets/gabrielmbmb/distilabel-magpie-dataset-ray
Explore at:
Croissant
Dataset updated
Jul 30, 2024
Authors
Gabriel Martín Blázquez
Description
Dataset Card for distilabel-magpie-dataset-ray

This dataset has been created with distilabel.

Dataset Summary

This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/gabrielmbmb/distilabel-magpie-dataset-ray/raw/main/pipeline.yaml"

or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/gabrielmbmb/distilabel-magpie-dataset-ray.

distilabel-dataset-generator-only-instructions
huggingface.co
Updated Sep 12, 2024
Cite
Daniel Vila (2024). distilabel-dataset-generator-only-instructions [Dataset]. https://huggingface.co/datasets/dvilasuero/distilabel-dataset-generator-only-instructions
Explore at:
Croissant
Dataset updated
Sep 12, 2024
Authors
Daniel Vila
Description
Dataset Card for distilabel-dataset-generator-only-instructions

This dataset has been created with distilabel.

Dataset Summary

This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/dvilasuero/distilabel-dataset-generator-only-instructions/raw/main/pipeline.yaml"

or explore the configuration: distilabel… See the full description on the dataset page: https://huggingface.co/datasets/dvilasuero/distilabel-dataset-generator-only-instructions.

instruction-dataset-mini-with-generations
huggingface.co
Updated Feb 10, 2023
Cite
distilabel-internal-testing (2023). instruction-dataset-mini-with-generations [Dataset]. https://huggingface.co/datasets/distilabel-internal-testing/instruction-dataset-mini-with-generations
Explore at:
Croissant
Dataset updated
Feb 10, 2023
Dataset authored and provided by
distilabel-internal-testing
Description
Dataset Card for instruction-dataset-mini-with-generations

This dataset has been created with distilabel.

Dataset Summary

This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/distilabel-internal-testing/instruction-dataset-mini-with-generations/raw/main/pipeline.yaml"

or explore the configuration:… See the full description on the dataset page: https://huggingface.co/datasets/distilabel-internal-testing/instruction-dataset-mini-with-generations.

instructions
huggingface.co
Updated Feb 10, 2023
Cite
distilabel-internal-testing (2023). instructions [Dataset]. https://huggingface.co/datasets/distilabel-internal-testing/instructions
Explore at:
Croissant
Dataset updated
Feb 10, 2023
Dataset authored and provided by
distilabel-internal-testing
Description
distilabel-internal-testing/instructions dataset hosted on Hugging Face and contributed by the HF Datasets community

example-texcat-generation-dataset
huggingface.co
Updated Sep 30, 2024
Cite
distilabel-internal-testing (2024). example-texcat-generation-dataset [Dataset]. https://huggingface.co/datasets/distilabel-internal-testing/example-texcat-generation-dataset
Explore at:
Croissant
Dataset updated
Sep 30, 2024
Dataset authored and provided by
distilabel-internal-testing
Description
Dataset Card for test1

This dataset has been created with distilabel.

Dataset Summary

This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/sdiazlor/test1/raw/main/pipeline.yaml"

or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/distilabel-internal-testing/example-texcat-generation-dataset.

distilabel-reasoning-R1-Llama-70B
huggingface.co
Updated Jan 23, 2025
Cite
Lightblue KK. (2025). distilabel-reasoning-R1-Llama-70B [Dataset]. https://huggingface.co/datasets/lightblue/distilabel-reasoning-R1-Llama-70B
Explore at:
Croissant
Dataset updated
Jan 23, 2025
Dataset authored and provided by
Lightblue KK.
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
How this Data was made

We made this data through the following steps:

1. Sample English reasoning-style prompts from argilla/distilabel-reasoning-prompts.
2. Remove similar prompts using text similarity based on BAAI/bge-m3 embeddings.
3. Translate English prompts to Japanese using gpt-4o-mini-2024-07-18.
4. Generate answers to prompts using deepseek-ai/DeepSeek-R1-Distill-Llama-70B.
5. Filter responses (to ja_valid) which did not:
   - Finish within 2048 tokens
   - Contain a valid
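The "remove similar prompts" step can be illustrated with a small sketch. The dataset card does not say how similarity was thresholded or which prompt of a near-duplicate pair was kept, so the greedy keep-first strategy, the 0.9 threshold, and the function names below are all assumptions made for illustration. The toy two-dimensional vectors stand in for real BAAI/bge-m3 embeddings, which are not computed here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def drop_similar(prompts, embeddings, threshold=0.9):
    """Greedy filter (assumed strategy): keep a prompt only if its similarity
    to every previously kept prompt is below `threshold`."""
    kept = []  # indices of prompts retained so far
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return [prompts[i] for i in kept]


if __name__ == "__main__":
    # Hypothetical prompts with toy embeddings; the second is a near-duplicate of the first.
    prompts = ["prompt A", "prompt A, reworded", "prompt B"]
    vectors = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
    print(drop_similar(prompts, vectors))
```

This pairwise comparison is quadratic in the number of prompts, which is fine for a sketch but would likely need an approximate nearest-neighbour index at dataset scale.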

instruction-dataset-mini
huggingface.co
Updated Feb 10, 2023
Cite
distilabel-internal-testing (2023). instruction-dataset-mini [Dataset]. https://huggingface.co/datasets/distilabel-internal-testing/instruction-dataset-mini
Explore at:
Croissant
Dataset updated
Feb 10, 2023
Dataset authored and provided by
distilabel-internal-testing
Description
distilabel-internal-testing/instruction-dataset-mini dataset hosted on Hugging Face and contributed by the HF Datasets community

distilabel-ollama-test
huggingface.co
Cite
David Meikle, distilabel-ollama-test [Dataset]. https://huggingface.co/datasets/davidmeikle/distilabel-ollama-test
Authors
David Meikle
Description
Dataset Card for distilabel-ollama-test

This dataset has been created with distilabel.

Dataset Summary

This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/davidmeikle/distilabel-ollama-test/raw/main/pipeline.yaml"

or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/davidmeikle/distilabel-ollama-test.

testing-dataset-distilabel
huggingface.co
Updated Sep 27, 2024
Cite
Shahzaib Niaz (2024). testing-dataset-distilabel [Dataset]. https://huggingface.co/datasets/shazoo2k/testing-dataset-distilabel
Explore at:
Croissant
Dataset updated
Sep 27, 2024
Authors
Shahzaib Niaz
Description
Dataset Card for testing-dataset-distilabel

This dataset has been created with distilabel.

Dataset Summary

This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/shazoo2k/testing-dataset-distilabel/raw/main/pipeline.yaml"

or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/shazoo2k/testing-dataset-distilabel.

DPO-distilabel-capybara-dpo-7k-binarized
huggingface.co
Updated Jul 10, 2024
Cite
Columbia NLP (2024). DPO-distilabel-capybara-dpo-7k-binarized [Dataset]. https://huggingface.co/datasets/Columbia-NLP/DPO-distilabel-capybara-dpo-7k-binarized
Explore at:
Croissant
Dataset updated
Jul 10, 2024
Dataset authored and provided by
Columbia NLP
Description
Dataset Card for DPO-distilabel-capybara-dpo-7k-binarized

Reformatted from argilla/distilabel-capybara-dpo-7k-binarized dataset. The LION-series are trained using an empirically optimized pipeline that consists of three stages: SFT, DPO, and online preference learning (online DPO). We find simple techniques such as sequence packing, loss masking in SFT, increasing the preference dataset size in DPO, and online DPO training can significantly improve the performance of language… See the full description on the dataset page: https://huggingface.co/datasets/Columbia-NLP/DPO-distilabel-capybara-dpo-7k-binarized.

example-retrieval-reranking-dataset
huggingface.co
Updated Aug 22, 2024
Cite
distilabel-internal-testing (2024). example-retrieval-reranking-dataset [Dataset]. https://huggingface.co/datasets/distilabel-internal-testing/example-retrieval-reranking-dataset
Explore at:
Croissant
Dataset updated
Aug 22, 2024
Dataset authored and provided by
distilabel-internal-testing
Description
Dataset Card for example-retrieval-reranking-dataset

This dataset has been created with distilabel.

Dataset Summary

This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/sdiazlor/example-retrieval-reranking-dataset/raw/main/pipeline.yaml"

or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/distilabel-internal-testing/example-retrieval-reranking-dataset.

img-prefs-distilabel-artifacts-sample
huggingface.co
Updated Sep 5, 2024
Cite
Daniel Vila (2024). img-prefs-distilabel-artifacts-sample [Dataset]. https://huggingface.co/datasets/dvilasuero/img-prefs-distilabel-artifacts-sample
Explore at:
Croissant
Dataset updated
Sep 5, 2024
Authors
Daniel Vila
Description
Dataset Card for img-prefs-distilabel-artifacts-sample

This dataset has been created with distilabel.

Dataset Summary

This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/dvilasuero/img-prefs-distilabel-artifacts-sample/raw/main/pipeline.yaml"

or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/dvilasuero/img-prefs-distilabel-artifacts-sample.

distilabel-example
huggingface.co
Cite
A, distilabel-example [Dataset]. https://huggingface.co/datasets/AAAA128/distilabel-example
Explore at:
Croissant
Authors
A
Description
AAAA128/distilabel-example dataset hosted on Hugging Face and contributed by the HF Datasets community

distilabel-example-1
huggingface.co
Updated Jul 22, 2025
Cite
David Meikle (2025). distilabel-example-1 [Dataset]. https://huggingface.co/datasets/davidmeikle/distilabel-example-1
Dataset updated
Jul 22, 2025
Authors
David Meikle
Description
davidmeikle/distilabel-example-1 dataset hosted on Hugging Face and contributed by the HF Datasets community

a1-preference-v1.02
huggingface.co
Updated Jan 7, 2025
Cite
Ash C (2025). a1-preference-v1.02 [Dataset]. https://huggingface.co/datasets/ashercn97/a1-preference-v1.02
Explore at:
Croissant
Dataset updated
Jan 7, 2025
Authors
Ash C
Description
Dataset Card for a1-preference-v1.02

This dataset has been created with distilabel.

Dataset Summary

This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/ashercn97/a1-preference-v1.02/raw/main/pipeline.yaml"

or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/ashercn97/a1-preference-v1.02.

replacing-judges-with-juries-distilabel
huggingface.co
Updated Sep 2, 2024
Cite
Alvaro Bartolome (2024). replacing-judges-with-juries-distilabel [Dataset]. https://huggingface.co/datasets/alvarobartt/replacing-judges-with-juries-distilabel
Explore at:
Croissant
Dataset updated
Sep 2, 2024
Authors
Alvaro Bartolome
Description
Dataset Card for replacing-judges-with-juries-distilabel

This dataset has been created with distilabel.

Dataset Summary

This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/alvarobartt/replacing-judges-with-juries-distilabel/raw/main/pipeline.yaml"

or explore the configuration: distilabel pipeline info… See the full description on the dataset page: https://huggingface.co/datasets/alvarobartt/replacing-judges-with-juries-distilabel.

embeddings-dataset-paraphrase
huggingface.co
Updated Jun 26, 2024
Cite
distilabel-internal-testing (2024). embeddings-dataset-paraphrase [Dataset]. https://huggingface.co/datasets/distilabel-internal-testing/embeddings-dataset-paraphrase
Explore at:
Croissant
Dataset updated
Jun 26, 2024
Dataset authored and provided by
distilabel-internal-testing
Description
Dataset Card for embeddings-dataset-paraphrase

This dataset has been created with distilabel.

Dataset Summary

This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/distilabel-internal-testing/embeddings-dataset-paraphrase/raw/main/pipeline.yaml"

or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/distilabel-internal-testing/embeddings-dataset-paraphrase.

18 scholarly articles cite the argilla/distilabel-intel-orca-dpo-pairs dataset.