Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
[!NOTE] We have released a paper for OpenThoughts! See our paper here.
Open-Thoughts-114k
Open synthetic reasoning dataset with 114k high-quality examples covering math, science, code, and puzzles! Inspect the content with rich formatting in the Curator Viewer.
Available Subsets
The default subset contains the ready-to-train data used to finetune the OpenThinker-7B and OpenThinker-32B models: ds = load_dataset("open-thoughts/OpenThoughts-114k", split="train")… See the full description on the dataset page: https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
[!NOTE] We have released a paper for OpenThoughts! See our paper here.
OpenThoughts3-1.2M
Open-source state-of-the-art reasoning dataset with 1.2M rows. 🚀 OpenThoughts3-1.2M is the third iteration in our line of OpenThoughts datasets, building on our previous OpenThoughts-114k and OpenThoughts2-1M. This time around, we scale even further and generate our dataset in a much more systematic way -- OpenThoughts3-1.2M is the result of a… See the full description on the dataset page: https://huggingface.co/datasets/open-thoughts/OpenThoughts3-1.2M.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
[!NOTE] We have released a paper for OpenThoughts! See our paper here.
OpenThoughts2-1M
Open synthetic reasoning dataset with 1M high-quality examples covering math, science, code, and puzzles! OpenThoughts2-1M builds upon our previous OpenThoughts-114k dataset, augmenting it with existing datasets like OpenR1, as well as additional math and code reasoning data. This dataset was used to train OpenThinker2-7B and OpenThinker2-32B. Inspect the content with rich… See the full description on the dataset page: https://huggingface.co/datasets/open-thoughts/OpenThoughts2-1M.
This repository contains the dataset used in the paper I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders. Code: https://github.com/AIRI-Institute/SAE-Reasoning
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset is used for training Sparse Autoencoders (SAEs) to identify reasoning features in Large Language Models (LLMs), as described in the paper I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders. Code for the paper is available at: https://github.com/AIRI-Institute/SAE-Reasoning. The dataset consists of tokenized text data: a single tokens feature holding sequences of int64 token IDs… See the full description on the dataset page: https://huggingface.co/datasets/andreuka18/DeepSeek-R1-Distill-Llama-8B-lmsys-openthoughts-tokenized.
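As a rough, self-contained sketch of what training an SAE involves (all sizes, names, and the plain-SGD update below are illustrative toy choices, not the paper's actual setup): the token sequences are run through the LLM, hidden activations are collected, and a sparse autoencoder learns to reconstruct those activations under an L1 sparsity penalty.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: real activations come from an LLM forward pass over the tokens.
d_model, d_sae, n = 16, 64, 256
acts = rng.normal(size=(n, d_model))

W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_enc = np.zeros(d_sae)

def sae_loss(acts, W_enc, W_dec, b_enc, l1):
    f = np.maximum(acts @ W_enc + b_enc, 0.0)   # ReLU feature activations
    err = f @ W_dec - acts                      # reconstruction error
    return (0.5 * np.sum(err ** 2) + l1 * np.sum(np.abs(f))) / len(acts)

lr, l1 = 0.05, 1e-3
loss_before = sae_loss(acts, W_enc, W_dec, b_enc, l1)

for _ in range(300):
    f = np.maximum(acts @ W_enc + b_enc, 0.0)
    err = f @ W_dec - acts
    # Gradients of the averaged reconstruction + L1 loss defined above.
    g_dec = f.T @ err / n
    g_f = (err @ W_dec.T + l1 * np.sign(f)) / n
    g_f *= (f > 0)                              # ReLU derivative mask
    g_enc = acts.T @ g_f
    g_b = g_f.sum(axis=0)
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc
    b_enc -= lr * g_b

loss_after = sae_loss(acts, W_enc, W_dec, b_enc, l1)
```

In practice the activations come from a chosen layer of DeepSeek-R1-Distill-Llama-8B and the SAE is vastly wider; see the linked repository for the actual training code.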
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
CoTton-67k
CoTton-67k is a 67,844-example dataset of soft reasoning conversations in the ShareGPT format. Each entry contains an exchange between a user and a model, showcasing high-quality Chain-of-Thought (CoT) reasoning in natural language. The dataset was presented in the paper OpenThoughts: Data Recipes for Reasoning Models. Abstract: Reasoning models have made rapid progress on many benchmarks involving math, code, and science. Yet, there are still many open questions about… See the full description on the dataset page: https://huggingface.co/datasets/NewstaR/CoTton-67k-6725-Collective.
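The ShareGPT format mentioned above stores each example as a list of conversation turns. A minimal sketch (the record below is invented for illustration and is not an actual CoTton-67k row; the field names follow the common ShareGPT convention):

```python
# Illustrative ShareGPT-style record (not a real CoTton-67k row).
record = {
    "conversations": [
        {"from": "human", "value": "Why does ice float on water?"},
        {"from": "gpt", "value": "Ice is less dense than liquid water because ..."},
    ]
}

def turns(rec):
    """Yield (speaker, text) pairs from a ShareGPT-style record."""
    for msg in rec["conversations"]:
        yield msg["from"], msg["value"]

speakers = [s for s, _ in turns(record)]
print(speakers)  # ['human', 'gpt']
```

Each `value` on the model side carries the natural-language Chain-of-Thought that the dataset showcases.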