Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
[!NOTE] We have released a paper for OpenThoughts! See our paper here.
Open-Thoughts-114k
Open synthetic reasoning dataset with 114k high-quality examples covering math, science, code, and puzzles! Inspect the content with rich formatting in the Curator Viewer.
Available Subsets
The default subset contains the ready-to-train data used to finetune the OpenThinker-7B and OpenThinker-32B models: ds = load_dataset("open-thoughts/OpenThoughts-114k", split="train")… See the full description on the dataset page: https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
[!NOTE] We have released a paper for OpenThoughts! See our paper here.
OpenThoughts3-1.2M
Open-source state-of-the-art reasoning dataset with 1.2M rows. 🚀 OpenThoughts3-1.2M is the third iteration in our line of OpenThoughts datasets, building on our previous OpenThoughts-114k and OpenThoughts2-1M. This time around, we scale even further and generate our dataset in a much more systematic way -- OpenThoughts3-1.2M is the result of a… See the full description on the dataset page: https://huggingface.co/datasets/open-thoughts/OpenThoughts3-1.2M.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
[!NOTE] We have released a paper for OpenThoughts! See our paper here.
OpenThoughts2-1M
Open synthetic reasoning dataset with 1M high-quality examples covering math, science, code, and puzzles! OpenThoughts2-1M builds upon our previous OpenThoughts-114k dataset, augmenting it with existing datasets like OpenR1, as well as additional math and code reasoning data. This dataset was used to train OpenThinker2-7B and OpenThinker2-32B. Inspect the content with rich… See the full description on the dataset page: https://huggingface.co/datasets/open-thoughts/OpenThoughts2-1M.
This repository contains the dataset used in the paper I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders. Code: https://github.com/AIRI-Institute/SAE-Reasoning
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset is used for training Sparse Autoencoders (SAEs) to identify reasoning features in Large Language Models (LLMs), as described in the paper I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders. Code for the paper is available at: https://github.com/AIRI-Institute/SAE-Reasoning. The dataset consists of tokenized text data: a single tokens feature holding sequences of int64 token IDs… See the full description on the dataset page: https://huggingface.co/datasets/andreuka18/DeepSeek-R1-Distill-Llama-8B-lmsys-openthoughts-tokenized.
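As a rough, self-contained sketch of what training an SAE involves (all sizes, names, and the plain-SGD update below are illustrative toy choices, not the paper's actual setup): the token sequences are run through the LLM, hidden activations are collected, and a sparse autoencoder learns to reconstruct those activations under an L1 sparsity penalty.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: real activations come from an LLM forward pass over the tokens.
d_model, d_sae, n = 16, 64, 256
acts = rng.normal(size=(n, d_model))

W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_enc = np.zeros(d_sae)

def sae_loss(acts, W_enc, W_dec, b_enc, l1):
    f = np.maximum(acts @ W_enc + b_enc, 0.0)   # ReLU feature activations
    err = f @ W_dec - acts                      # reconstruction error
    return (0.5 * np.sum(err ** 2) + l1 * np.sum(np.abs(f))) / len(acts)

lr, l1 = 0.05, 1e-3
loss_before = sae_loss(acts, W_enc, W_dec, b_enc, l1)

for _ in range(300):
    f = np.maximum(acts @ W_enc + b_enc, 0.0)
    err = f @ W_dec - acts
    # Gradients of the averaged reconstruction + L1 loss defined above.
    g_dec = f.T @ err / n
    g_f = (err @ W_dec.T + l1 * np.sign(f)) / n
    g_f *= (f > 0)                              # ReLU derivative mask
    g_enc = acts.T @ g_f
    g_b = g_f.sum(axis=0)
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc
    b_enc -= lr * g_b

loss_after = sae_loss(acts, W_enc, W_dec, b_enc, l1)
```

In practice the activations come from a chosen layer of DeepSeek-R1-Distill-Llama-8B and the SAE is vastly wider; see the linked repository for the actual training code.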
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
CoTton-67k
CoTton-67k is a 67,844-example dataset of soft reasoning conversations in the ShareGPT format. Each entry contains an exchange between a user and a model, showcasing high-quality Chain-of-Thought (CoT) reasoning in natural language. The dataset was presented in the paper OpenThoughts: Data Recipes for Reasoning Models. Abstract: Reasoning models have made rapid progress on many benchmarks involving math, code, and science. Yet, there are still many open questions about… See the full description on the dataset page: https://huggingface.co/datasets/NewstaR/CoTton-67k-6725-Collective.
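The ShareGPT format mentioned above stores each example as a list of conversation turns. A minimal sketch (the record below is invented for illustration and is not an actual CoTton-67k row; the field names follow the common ShareGPT convention):

```python
# Illustrative ShareGPT-style record (not a real CoTton-67k row).
record = {
    "conversations": [
        {"from": "human", "value": "Why does ice float on water?"},
        {"from": "gpt", "value": "Ice is less dense than liquid water because ..."},
    ]
}

def turns(rec):
    """Yield (speaker, text) pairs from a ShareGPT-style record."""
    for msg in rec["conversations"]:
        yield msg["from"], msg["value"]

speakers = [s for s, _ in turns(record)]
print(speakers)  # ['human', 'gpt']
```

Each `value` on the model side carries the natural-language Chain-of-Thought that the dataset showcases.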