Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
distilabel Orca Pairs for DPO
The dataset is a "distilabeled" version of the widely used dataset: Intel/orca_dpo_pairs. The original dataset has been used by hundreds of open-source practitioners and models. We knew from fixing UltraFeedback (and before that, Alpacas and Dollys) that this dataset could be substantially improved. Continuing with our mission to build the best alignment datasets for open-source LLMs and the community, we spent a few hours improving it with… See the full description on the dataset page: https://huggingface.co/datasets/argilla/distilabel-intel-orca-dpo-pairs.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
Capybara-DPO 7K binarized
A DPO dataset built with distilabel atop the awesome LDJnr/Capybara
This is a preview version to collect feedback from the community. v2 will include the full base dataset and responses from more powerful models.
Why?
Multi-turn dialogue data is key to fine-tuning capable chat models. Multi-turn preference data has been used by the most relevant RLHF works (Anthropic, Meta Llama2, etc.). Unfortunately, there are very few… See the full description on the dataset page: https://huggingface.co/datasets/argilla/distilabel-capybara-dpo-7k-binarized.
Dataset Card for distilabel-dataset-generator-only-instructions
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/dvilasuero/distilabel-dataset-generator-only-instructions/raw/main/pipeline.yaml"
or explore the configuration: distilabel… See the full description on the dataset page: https://huggingface.co/datasets/dvilasuero/distilabel-dataset-generator-only-instructions.
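Several cards in this listing point the distilabel CLI at a pipeline.yaml served from the dataset repository. The following is a minimal sketch of how those URLs and commands are put together, assuming the /raw/main/pipeline.yaml layout seen in the cards. The helper names pipeline_config_url and cli_command are hypothetical and are not part of distilabel.

```python
from urllib.parse import quote

# Base URL for dataset repositories, as it appears in the cards above.
HUB_BASE = "https://huggingface.co/datasets"


def pipeline_config_url(repo_id: str, revision: str = "main",
                        filename: str = "pipeline.yaml") -> str:
    """Build the raw-file URL of a pipeline config stored in a dataset repo.

    Hypothetical helper: it only mirrors the URL pattern quoted in the cards.
    """
    return f"{HUB_BASE}/{quote(repo_id)}/raw/{quote(revision)}/{quote(filename)}"


def cli_command(repo_id: str, subcommand: str = "run") -> list[str]:
    """Return the argument list for `distilabel pipeline run|info --config URL`."""
    if subcommand not in ("run", "info"):
        raise ValueError("subcommand must be 'run' or 'info'")
    return ["distilabel", "pipeline", subcommand,
            "--config", pipeline_config_url(repo_id)]


if __name__ == "__main__":
    # Print the command instead of executing it, so nothing is downloaded.
    print(" ".join(cli_command("dvilasuero/distilabel-dataset-generator-only-instructions")))
```

The argument list can be handed to subprocess.run if distilabel is installed; printing it first makes it easy to check the URL before anything runs.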
Dataset Card for distilabel-example4
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/archit11/distilabel-example4/raw/main/pipeline.yaml"
or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/archit11/distilabel-example4.
Dataset Card for distilabel-reflection-tuning
This dataset has been created with distilabel. The pipeline script was uploaded to easily reproduce the dataset: reflection.py. It can be run directly using the CLI: distilabel pipeline run --script "https://huggingface.co/datasets/gabrielmbmb/distilabel-reflection-tuning/raw/main/reflection.py"
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated… See the full description on the dataset page: https://huggingface.co/datasets/gabrielmbmb/distilabel-reflection-tuning.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
distilabel Orca Pairs for KTO
A version of the highly loved distilabel Orca Pairs for DPO, transformed into KTO signals.
The dataset is a "distilabeled" version of the widely used dataset: Intel/orca_dpo_pairs. The original dataset has been used by 100s of open-source practitioners and models. We knew from fixing UltraFeedback (and before that, Alpacas and Dollys) that this dataset could be highly improved. Continuing with our mission to build the best alignment datasets… See the full description on the dataset page: https://huggingface.co/datasets/argilla/distilabel-intel-orca-kto.
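The card excerpt does not say how the DPO pairs were turned into KTO signals. One common way to do this kind of conversion is to split each chosen/rejected pair into two single-completion rows with a boolean label. The sketch below illustrates that idea only. The field names prompt, chosen, rejected, completion, and label are assumptions, not the dataset's documented schema.

```python
def dpo_pairs_to_kto_rows(pairs):
    """Split each preference pair into one desirable and one undesirable row.

    Illustrative only: the input and output field names are assumed.
    """
    rows = []
    for pair in pairs:
        # The preferred answer becomes a positively labeled example.
        rows.append({"prompt": pair["prompt"],
                     "completion": pair["chosen"],
                     "label": True})
        # The dispreferred answer becomes a negatively labeled example.
        rows.append({"prompt": pair["prompt"],
                     "completion": pair["rejected"],
                     "label": False})
    return rows


if __name__ == "__main__":
    toy_pairs = [{"prompt": "What is 2 + 2?", "chosen": "4", "rejected": "5"}]
    for row in dpo_pairs_to_kto_rows(toy_pairs):
        print(row)
```

Each pair yields two rows, so the converted data has twice as many examples as the paired data, evenly split between the two labels.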
distilabel-internal-testing/instruction-dataset-sample dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for distilabel-instruction-to-preference-dataset
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/Mervyn999/distilabel-instruction-to-preference-dataset/raw/main/pipeline.yaml"
or explore the configuration: distilabel pipeline… See the full description on the dataset page: https://huggingface.co/datasets/Mervyn999/distilabel-instruction-to-preference-dataset.
argilla/distilabel-sample-evol-instruct dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for distilabel-magpie-math
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/gabrielmbmb/distilabel-magpie-math/raw/main/pipeline.yaml"
or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/gabrielmbmb/distilabel-magpie-math.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
How this Data was made
We made this data through the following steps:
1. Sample English reasoning-style prompts from argilla/distilabel-reasoning-prompts.
2. Remove similar prompts using text similarity based on BAAI/bge-m3 embeddings.
3. Translate English prompts to Japanese using gpt-4o-mini-2024-07-18.
4. Generate answers to prompts using deepseek-ai/DeepSeek-R1-Distill-Llama-70B.
5. Filter responses (to ja_valid) which did not:
   - Finish within 2048 tokens
   - Contain a valid
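The similarity-based removal step above can be sketched as a greedy filter over precomputed embeddings. This is an illustration under stated assumptions, not the authors' procedure: the card does not give the threshold or the exact algorithm, so the 0.9 cutoff and the function names are hypothetical, and toy 2-D vectors stand in for real BAAI/bge-m3 embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_similar_prompts(prompts, embeddings, threshold=0.9):
    """Keep a prompt only if it is below `threshold` similarity to every kept prompt.

    `threshold` is an assumed value chosen for illustration.
    """
    kept_indices = []
    for i, embedding in enumerate(embeddings):
        if all(cosine_similarity(embedding, embeddings[j]) < threshold
               for j in kept_indices):
            kept_indices.append(i)
    return [prompts[i] for i in kept_indices]


if __name__ == "__main__":
    prompts = ["prompt A", "prompt A, reworded", "prompt B"]
    toy_embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
    print(drop_similar_prompts(prompts, toy_embeddings))
```

The greedy pass compares each prompt against everything already kept, which is quadratic in the worst case. For large prompt sets an approximate nearest-neighbour index would be the usual replacement.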
Dataset Card for distilabel-example-test
This dataset has been created with Argilla. As shown in the sections below, this dataset can be loaded into your Argilla server as explained in Load with Argilla, or used directly with the datasets library in Load with datasets.
Using this dataset with Argilla
To load with Argilla, install it with pip install argilla --upgrade and then use the following code: import argilla as rg
ds =… See the full description on the dataset page: https://huggingface.co/datasets/thomwolf/distilabel-example-test.
Dataset Card for instruction-dataset-with-llama3
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/distilabel-internal-testing/instruction-dataset-with-llama3/raw/main/pipeline.yaml"
or explore the configuration: distilabel pipeline info… See the full description on the dataset page: https://huggingface.co/datasets/distilabel-internal-testing/instruction-dataset-with-llama3.
Dataset Card for instruction-dataset-mini-with-generations
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/distilabel-internal-testing/instruction-dataset-mini-with-generations/raw/main/pipeline.yaml"
or explore the configuration:… See the full description on the dataset page: https://huggingface.co/datasets/distilabel-internal-testing/instruction-dataset-mini-with-generations.
Dataset Card for inference-endpoints-structured-generation-multiple
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/distilabel-internal-testing/inference-endpoints-structured-generation-multiple/raw/main/pipeline.yaml"
or explore the… See the full description on the dataset page: https://huggingface.co/datasets/distilabel-internal-testing/inference-endpoints-structured-generation-multiple.
m-newhauser/rag-synthetic-distilabel dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for distilabel-artifacts-example
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/distilabel-internal-testing/distilabel-artifacts-example/raw/main/pipeline.yaml"
or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/distilabel-internal-testing/distilabel-artifacts-example.
Dataset Card for preferance-dataset-with-distilabel
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/riteshkr/preferance-dataset-with-distilabel/raw/main/pipeline.yaml"
or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/riteshkr/preferance-dataset-with-distilabel.
distilabel-internal-testing/instruction-dataset-mini dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for testing-dataset-distilabel
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/shazoo2k/testing-dataset-distilabel/raw/main/pipeline.yaml"
or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/shazoo2k/testing-dataset-distilabel.