Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
distilabel Orca Pairs for DPO
The dataset is a "distilabeled" version of the widely used dataset Intel/orca_dpo_pairs. The original dataset has been used by hundreds of open-source practitioners and models. We knew from fixing UltraFeedback (and before that, Alpacas and Dollys) that this dataset could be significantly improved. Continuing with our mission to build the best alignment datasets for open-source LLMs and the community, we spent a few hours improving it with… See the full description on the dataset page: https://huggingface.co/datasets/argilla/distilabel-intel-orca-dpo-pairs.
Dataset Card for distilabel-reflection-tuning
This dataset has been created with distilabel. The pipeline script was uploaded to easily reproduce the dataset: reflection.py. It can be run directly using the CLI: distilabel pipeline run --script "https://huggingface.co/datasets/gabrielmbmb/distilabel-reflection-tuning/raw/main/reflection.py"
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated… See the full description on the dataset page: https://huggingface.co/datasets/gabrielmbmb/distilabel-reflection-tuning.
Dataset Card for example-dataset-distilabel
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/davidberenstein1957/example-dataset-distilabel/raw/main/pipeline.yaml"
or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/davidberenstein1957/example-dataset-distilabel.
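The cards in this listing share one URL pattern for the hosted pipeline.yaml and the same two CLI subcommands (run and info). As a minimal sketch of that pattern, the Python below builds the config URL and the command line for a given dataset repository. The helper names and the revision parameter are hypothetical and exist only for illustration. The URL layout and the --config flag are taken from the cards themselves.

```python
import shlex

HUB_DATASETS = "https://huggingface.co/datasets"


def pipeline_config_url(repo_id: str, revision: str = "main") -> str:
    """Return the raw pipeline.yaml URL for a dataset repo.

    Follows the URL pattern shown in the cards; `revision` is an
    assumption added for illustration (the cards always use "main").
    """
    return f"{HUB_DATASETS}/{repo_id}/raw/{revision}/pipeline.yaml"


def distilabel_command(action: str, repo_id: str) -> list[str]:
    """Build the argv list for `distilabel pipeline run|info --config <url>`."""
    if action not in {"run", "info"}:
        raise ValueError(f"unsupported action: {action!r}")
    return ["distilabel", "pipeline", action, "--config", pipeline_config_url(repo_id)]


# Example: the command quoted in the card above, as a shell-safe string.
cmd = distilabel_command("run", "davidberenstein1957/example-dataset-distilabel")
print(shlex.join(cmd))
```

Returning an argv list rather than a single string means the result can be passed to subprocess.run without shell quoting problems, should you want to execute it.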
Dataset Card for distilabel-magpie-dataset-ray
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/gabrielmbmb/distilabel-magpie-dataset-ray/raw/main/pipeline.yaml"
or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/gabrielmbmb/distilabel-magpie-dataset-ray.
Dataset Card for distilabel-dataset-generator-only-instructions
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/dvilasuero/distilabel-dataset-generator-only-instructions/raw/main/pipeline.yaml"
or explore the configuration: distilabel… See the full description on the dataset page: https://huggingface.co/datasets/dvilasuero/distilabel-dataset-generator-only-instructions.
Dataset Card for instruction-dataset-mini-with-generations
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/distilabel-internal-testing/instruction-dataset-mini-with-generations/raw/main/pipeline.yaml"
or explore the configuration:… See the full description on the dataset page: https://huggingface.co/datasets/distilabel-internal-testing/instruction-dataset-mini-with-generations.
distilabel-internal-testing/instructions dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for test1
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/sdiazlor/test1/raw/main/pipeline.yaml"
or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/distilabel-internal-testing/example-texcat-generation-dataset.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
How this Data was made
We made this data through the following steps:
1. Sample English reasoning-style prompts from argilla/distilabel-reasoning-prompts.
2. Remove similar prompts using text similarity based on BAAI/bge-m3 embeddings.
3. Translate English prompts to Japanese using gpt-4o-mini-2024-07-18.
4. Generate answers to prompts using deepseek-ai/DeepSeek-R1-Distill-Llama-70B.
5. Filter responses (to ja_valid) which did not:
   - Finish within 2048 tokens
   - Contain a valid
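The "remove similar prompts" step could look roughly like the sketch below. The source states only that similarity was computed from BAAI/bge-m3 embeddings. The greedy keep-first strategy, the 0.9 cosine threshold, the function names, and the toy three-dimensional vectors (standing in for real embeddings) are all assumptions made for illustration, not the authors' method.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def drop_near_duplicates(prompts, embeddings, threshold=0.9):
    """Greedily keep a prompt only if it is below `threshold` similarity
    to every prompt kept so far (threshold is an assumed value)."""
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return [prompts[i] for i in kept]


# Toy data: in practice the vectors would come from an embedding model.
prompts = [
    "Solve 2x + 3 = 7.",
    "Solve 2x + 3 = 7 for x.",
    "Explain why the sky is blue.",
]
toy_embeddings = [
    [1.0, 0.0, 0.1],
    [0.98, 0.05, 0.12],
    [0.0, 1.0, 0.2],
]
unique_prompts = drop_near_duplicates(prompts, toy_embeddings)
```

The pairwise comparison is quadratic in the number of kept prompts, which is fine for a sketch but would likely need an approximate nearest-neighbour index at dataset scale.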
distilabel-internal-testing/instruction-dataset-mini dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for distilabel-ollama-test
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/davidmeikle/distilabel-ollama-test/raw/main/pipeline.yaml"
or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/davidmeikle/distilabel-ollama-test.
Dataset Card for testing-dataset-distilabel
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/shazoo2k/testing-dataset-distilabel/raw/main/pipeline.yaml"
or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/shazoo2k/testing-dataset-distilabel.
Dataset Card for DPO-distilabel-capybara-dpo-7k-binarized
Reformatted from the argilla/distilabel-capybara-dpo-7k-binarized dataset. The LION-series are trained using an empirically optimized pipeline that consists of three stages: SFT, DPO, and online preference learning (online DPO). We find simple techniques such as sequence packing, loss masking in SFT, increasing the preference dataset size in DPO, and online DPO training can significantly improve the performance of language… See the full description on the dataset page: https://huggingface.co/datasets/Columbia-NLP/DPO-distilabel-capybara-dpo-7k-binarized.
Dataset Card for example-retrieval-reranking-dataset
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/sdiazlor/example-retrieval-reranking-dataset/raw/main/pipeline.yaml"
or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/distilabel-internal-testing/example-retrieval-reranking-dataset.
Dataset Card for img-prefs-distilabel-artifacts-sample
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/dvilasuero/img-prefs-distilabel-artifacts-sample/raw/main/pipeline.yaml"
or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/dvilasuero/img-prefs-distilabel-artifacts-sample.
AAAA128/distilabel-example dataset hosted on Hugging Face and contributed by the HF Datasets community
davidmeikle/distilabel-example-1 dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for a1-preference-v1.02
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/ashercn97/a1-preference-v1.02/raw/main/pipeline.yaml"
or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/ashercn97/a1-preference-v1.02.
Dataset Card for replacing-judges-with-juries-distilabel
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/alvarobartt/replacing-judges-with-juries-distilabel/raw/main/pipeline.yaml"
or explore the configuration: distilabel pipeline info… See the full description on the dataset page: https://huggingface.co/datasets/alvarobartt/replacing-judges-with-juries-distilabel.
Dataset Card for embeddings-dataset-paraphrase
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/distilabel-internal-testing/embeddings-dataset-paraphrase/raw/main/pipeline.yaml"
or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/distilabel-internal-testing/embeddings-dataset-paraphrase.