6 datasets found
  1. OpenThoughts-114k

    Source: huggingface.co
    Updated: Jan 28, 2025 (more versions available)
    Provider: Open Thoughts
    Cited by: 34 scholarly articles (view in Google Scholar)
    Citation: Open Thoughts (2025). OpenThoughts-114k [Dataset]. https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k
    License: Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0); license information was derived automatically

    Description

    [!NOTE] We have released a paper for OpenThoughts! See our paper here.

      Open-Thoughts-114k
    

    Open synthetic reasoning dataset with 114k high-quality examples covering math, science, code, and puzzles. Inspect the content with rich formatting in the Curator Viewer.

      Available Subsets
    

    The default subset contains the ready-to-train data used to finetune the OpenThinker-7B and OpenThinker-32B models: ds = load_dataset("open-thoughts/OpenThoughts-114k", split="train")… See the full description on the dataset page: https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k.
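
    A minimal loading sketch for the default subset, using the Hugging Face datasets library. It mirrors the load_dataset call quoted in the description above; nothing beyond that snippet is assumed about the dataset's schema.

        # Minimal sketch: load the ready-to-train "default" subset of
        # OpenThoughts-114k, mirroring the snippet quoted in the description.
        from datasets import load_dataset

        ds = load_dataset("open-thoughts/OpenThoughts-114k", split="train")

        print(ds)              # row count and column names
        print(ds[0].keys())    # fields of a single example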

  2. OpenThoughts3-1.2M

    Source: huggingface.co
    Updated: Jun 5, 2025
    Provider: Open Thoughts
    Citation: Open Thoughts (2025). OpenThoughts3-1.2M [Dataset]. https://huggingface.co/datasets/open-thoughts/OpenThoughts3-1.2M
    License: Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0); license information was derived automatically

    Description

    paper | dataset | model

    [!NOTE] We have released a paper for OpenThoughts! See our paper here.

      OpenThoughts3-1.2M
    

    Open-source state-of-the-art reasoning dataset with 1.2M rows. 🚀 OpenThoughts3-1.2M is the third iteration in our line of OpenThoughts datasets, building on our previous OpenThoughts-114k and OpenThoughts2-1M. This time around, we scale even further and generate our dataset in a much more systematic way -- OpenThoughts3-1.2M is the result of a… See the full description on the dataset page: https://huggingface.co/datasets/open-thoughts/OpenThoughts3-1.2M.
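
    Given the 1.2M-row scale, a hedged sketch of streaming the dataset instead of downloading it in full may be useful. Streaming is a standard feature of the Hugging Face datasets library; the "train" split name is an assumption not confirmed by this listing.

        # Sketch: stream OpenThoughts3-1.2M instead of materializing all
        # 1.2M rows locally. Assumes a "train" split (not confirmed above).
        from itertools import islice

        from datasets import load_dataset

        stream = load_dataset(
            "open-thoughts/OpenThoughts3-1.2M",
            split="train",
            streaming=True,
        )

        # Peek at a few examples without a full download.
        for example in islice(stream, 3):
            print(list(example.keys()))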

  3. OpenThoughts2-1M

    Source: huggingface.co
    Updated: Apr 7, 2025
    Provider: Open Thoughts
    Citation: Open Thoughts (2025). OpenThoughts2-1M [Dataset]. https://huggingface.co/datasets/open-thoughts/OpenThoughts2-1M
    License: Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0); license information was derived automatically

    Description

    [!NOTE] We have released a paper for OpenThoughts! See our paper here.

      OpenThoughts2-1M
    

    Open synthetic reasoning dataset with 1M high-quality examples covering math, science, code, and puzzles! OpenThoughts2-1M builds upon our previous OpenThoughts-114k dataset, augmenting it with existing datasets like OpenR1, as well as additional math and code reasoning data. This dataset was used to train OpenThinker2-7B and OpenThinker2-32B. Inspect the content with rich… See the full description on the dataset page: https://huggingface.co/datasets/open-thoughts/OpenThoughts2-1M.

  4. OpenThoughts-10k-DeepSeek-R1

    Source: huggingface.co
    Updated: Mar 31, 2025
    Author: Andrey Galichin
    Citation: Andrey Galichin (2025). OpenThoughts-10k-DeepSeek-R1 [Dataset]. https://huggingface.co/datasets/andreuka18/OpenThoughts-10k-DeepSeek-R1
    Description

    This repository contains the dataset used in the paper I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders. Code: https://github.com/AIRI-Institute/SAE-Reasoning

  5. DeepSeek-R1-Distill-Llama-8B-lmsys-openthoughts-tokenized

    Source: huggingface.co
    Updated: Mar 31, 2025
    Author: Andrey Galichin
    Citation: Andrey Galichin (2025). DeepSeek-R1-Distill-Llama-8B-lmsys-openthoughts-tokenized [Dataset]. https://huggingface.co/datasets/andreuka18/DeepSeek-R1-Distill-Llama-8B-lmsys-openthoughts-tokenized
    License: MIT License (https://opensource.org/licenses/MIT); license information was derived automatically

    Description

    This dataset is used for training Sparse Autoencoders (SAEs) to identify reasoning features in Large Language Models (LLMs), as described in the paper "I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders". Code for the paper is available at: https://github.com/AIRI-Institute/SAE-Reasoning. The dataset consists of tokenized text data used for training the SAEs. The dataset card metadata (truncated) reads:

      dataset_info:
        features:
          - name: tokens
            sequence: int64
        splits:
          - name: …

    See the full description on the dataset page: https://huggingface.co/datasets/andreuka18/DeepSeek-R1-Distill-Llama-8B-lmsys-openthoughts-tokenized.
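
    A minimal sketch of loading the tokenized data and detokenizing one example. Two assumptions are not confirmed by this listing: that the dataset exposes a "train" split, and that the token ids were produced with the DeepSeek-R1-Distill-Llama-8B tokenizer, as the repository name suggests.

        # Sketch: inspect the tokenized SAE-training data. Assumptions (see
        # lead-in): a "train" split exists, and the tokens come from the
        # deepseek-ai/DeepSeek-R1-Distill-Llama-8B tokenizer.
        from datasets import load_dataset
        from transformers import AutoTokenizer

        ds = load_dataset(
            "andreuka18/DeepSeek-R1-Distill-Llama-8B-lmsys-openthoughts-tokenized",
            split="train",
        )
        tokenizer = AutoTokenizer.from_pretrained(
            "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
        )

        # Each row carries a "tokens" field: a sequence of int64 token ids.
        tokens = ds[0]["tokens"]
        print(len(tokens))                    # sequence length of the first row
        print(tokenizer.decode(tokens[:50]))  # peek at the detokenized text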

  6. CoTton-67k-6725-Collective

    Source: huggingface.co
    Updated: Jun 22, 2025
    Provider: Newstar Research ASIA
    Citation: Newstar Research ASIA (2025). CoTton-67k-6725-Collective [Dataset]. https://huggingface.co/datasets/NewstaR/CoTton-67k-6725-Collective
    License: Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0); license information was derived automatically

    Description

    CoTton-67k

    CoTton-67k is a 67,844-example dataset of soft reasoning conversations in the ShareGPT format. Each entry contains an exchange between a user and a model, showcasing high-quality Chain-of-Thought (CoT) reasoning in natural language. The dataset was presented in the paper OpenThoughts: Data Recipes for Reasoning Models. Abstract: Reasoning models have made rapid progress on many benchmarks involving math, code, and science. Yet, there are still many open questions about… See the full description on the dataset page: https://huggingface.co/datasets/NewstaR/CoTton-67k-6725-Collective.
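
    A hedged sketch of reading one ShareGPT-format record. The key names ("conversations", "from", "value") follow the common ShareGPT convention; the actual schema of CoTton-67k-6725-Collective is not confirmed by this listing, which is why the column names are printed first.

        # Sketch: read a ShareGPT-style conversation. The key names below are
        # the usual ShareGPT convention, not confirmed for this dataset.
        from datasets import load_dataset

        ds = load_dataset("NewstaR/CoTton-67k-6725-Collective", split="train")
        print(ds.column_names)  # check the actual schema first

        example = ds[0]
        # Typical ShareGPT layout: a list of turns alternating user/model.
        for turn in example.get("conversations", []):
            print(turn["from"], ":", turn["value"][:80])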
