100+ datasets found
  1. Chain-of-Thought collection

    • kaggle.com
    zip
    Updated Jun 19, 2023
    Cite
    Konrad Banachewicz (2023). Chain-of-Thought collection [Dataset]. http://identifiers.org/arxiv:2305.140
    Explore at:
    Available download formats: zip (1260225915 bytes)
    Dataset updated
    Jun 19, 2023
    Authors
    Konrad Banachewicz
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    Dataset accompanying the paper "The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning" (https://arxiv.org/abs/2305.14045), including 1.88M CoT rationales extracted across 1,060 tasks.

    From the release repo https://github.com/kaistAI/CoT-Collection: Large Language Models (LLMs) have shown enhanced capabilities of solving novel tasks by reasoning step-by-step, known as Chain-of-Thought (CoT) reasoning. How can we instill the same capability of reasoning step-by-step on unseen tasks into LMs that possess fewer than 100B parameters? To address this question, we first introduce the CoT Collection, a new instruction-tuning dataset that augments 1.88 million CoT rationales across 1,060 tasks. We show that continually fine-tuning Flan-T5 (3B & 11B) with the CoT Collection enables the 3B & 11B LMs to perform CoT better on unseen tasks, leading to an improvement in the average zero-shot accuracy on 27 datasets of the BIG-Bench-Hard benchmark by +4.34% and +2.44%, respectively. Furthermore, we show that instruction tuning with CoT allows LMs to possess stronger few-shot learning capabilities, resulting in an improvement of +2.97% and +2.37% on 4 domain-specific tasks over Flan-T5 (3B & 11B), respectively.

  2. bytesized32-world-model-cot

    • huggingface.co
    Updated May 26, 2025
    + more versions
    Cite
    THUML @ Tsinghua University (2025). bytesized32-world-model-cot [Dataset]. https://huggingface.co/datasets/thuml/bytesized32-world-model-cot
    Explore at:
    Dataset updated
    May 26, 2025
    Dataset authored and provided by
    THUML @ Tsinghua University
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Area covered
    World
    Description

    See https://github.com/thuml/RLVR-World for examples for using this dataset.

      Citation
    

    @article{wu2025rlvr,
     title={RLVR-World: Training World Models with Reinforcement Learning},
     author={Jialong Wu and Shaofeng Yin and Ningya Feng and Mingsheng Long},
     journal={arXiv preprint arXiv:2505.13934},
     year={2025},
    }

  3. Chinese Chain of Thought Prompt & Response Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Chinese Chain of Thought Prompt & Response Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/chinese-chain-of-thought-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Welcome to the Chinese Chain of Thought prompt-response dataset, a meticulously curated collection containing 3000 comprehensive prompt and response pairs. This dataset is an invaluable resource for training Language Models (LMs) to generate well-reasoned answers and minimize inaccuracies. Its primary utility lies in enhancing LLMs' reasoning skills for solving arithmetic, common sense, symbolic reasoning, and complex problems.

    Dataset Content

    This COT dataset comprises a diverse set of instructions and questions paired with corresponding answers and rationales in the Chinese language. These prompts and completions cover a broad range of topics and questions, including mathematical concepts, common sense reasoning, complex problem-solving, scientific inquiries, puzzles, and more.

    Each prompt is meticulously accompanied by a response and rationale, providing essential information and insights to enhance the language model training process. These prompts, completions, and rationales were manually curated by native Chinese people, drawing references from various sources, including open-source datasets, news articles, websites, and other reliable references.

    Our chain-of-thought prompt-completion dataset includes various prompt types, such as instructional prompts, continuations, and in-context learning (zero-shot, few-shot) prompts. Additionally, the dataset contains prompts and completions enriched with various forms of rich text, such as lists, tables, code snippets, JSON, and more, with proper markdown format.

    Prompt Diversity

    To ensure a wide-ranging dataset, we have included prompts from a plethora of topics related to mathematics, common sense reasoning, and symbolic reasoning. These topics encompass arithmetic, percentages, ratios, geometry, analogies, spatial reasoning, temporal reasoning, logic puzzles, patterns, and sequences, among others.

    These prompts vary in complexity, spanning easy, medium, and hard levels. Various question types are included, such as multiple-choice, direct queries, and true/false assessments.

    Response Formats

    To accommodate diverse learning experiences, our dataset incorporates different types of answers depending on the prompt and provides step-by-step rationales. The detailed rationale helps the language model build a reasoning process for complex questions.

    These responses encompass text strings, numerical values, and date and time formats, enhancing the language model's ability to generate reliable, coherent, and contextually appropriate answers.

    Data Format and Annotation Details

    This fully labeled Chinese Chain of Thought Prompt Completion Dataset is available in JSON and CSV formats. It includes annotation details such as a unique ID, prompt, prompt type, prompt complexity, prompt category, domain, response, rationale, response type, and rich text presence.
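    A record in the layout described above can be queried in a few lines. The snippet below is a minimal sketch: the field names follow the annotation details listed, but the exact key names in the delivered JSON/CSV files are an assumption, and the records themselves are hypothetical.

    ```python
    # Hypothetical records mirroring the annotation fields described above
    # (unique ID, prompt, prompt type, complexity, domain, response, rationale,
    # response type, rich text presence). Exact key names may differ.
    records = [
        {"id": "cot-0001", "prompt": "一个篮子里有12个苹果，拿走5个后还剩几个？",
         "prompt_type": "direct query", "prompt_complexity": "easy",
         "domain": "arithmetic", "response": "7",
         "rationale": "12 - 5 = 7，所以还剩 7 个苹果。",
         "response_type": "numerical", "rich_text": False},
        {"id": "cot-0002", "prompt": "以下哪个选项符合该数列的规律？",
         "prompt_type": "multiple-choice", "prompt_complexity": "hard",
         "domain": "logic puzzles", "response": "B",
         "rationale": "观察相邻项之差可以发现规律……",
         "response_type": "text", "rich_text": True},
    ]

    # Keep only hard prompts that come with a rationale.
    hard = [r for r in records if r["prompt_complexity"] == "hard" and r["rationale"]]
    print(len(hard))  # 1
    ```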

    Quality and Accuracy

    Our dataset upholds the highest standards of quality and accuracy. Each prompt undergoes meticulous validation, and the corresponding responses and rationales are thoroughly verified. We prioritize inclusivity, ensuring that the dataset incorporates prompts and completions representing diverse perspectives and writing styles, maintaining an unbiased and discrimination-free stance.

    The Chinese version is grammatically accurate without any spelling or grammatical errors. No copyrighted, toxic, or harmful content is used during the construction of this dataset.

    Continuous Updates and Customization

    The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Ongoing efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to gather custom chain of thought prompt completion data tailored to specific needs, providing flexibility and customization options.

    License

    The dataset, created by FutureBeeAI, is now available for commercial use. Researchers, data scientists, and developers can leverage this fully labeled and ready-to-deploy Chinese Chain of Thought Prompt Completion Dataset to enhance the rationale and accurate response generation capabilities of their generative AI models and explore new approaches to NLP tasks.

  4. Open-CoT-Reasoning-Mini

    • huggingface.co
    Updated May 31, 2025
    Cite
    Raymond Lee (2025). Open-CoT-Reasoning-Mini [Dataset]. http://doi.org/10.57967/hf/5566
    Explore at:
    Dataset updated
    May 31, 2025
    Authors
    Raymond Lee
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Introducing Open-CoT-Reasoning-Mini:

    An open source dataset of 10,200 distilled Chain-of-Thought (CoT) reasoning samples across diverse domains including mathematics, medicine, art, social sciences, computer science, logic puzzles, etc. This comprehensive collection is designed to boost step-by-step reasoning capabilities in language models under 10 billion parameters, enabling any non-reasoning model to develop structured analytical thinking across multiple disciplines.… See the full description on the dataset page: https://huggingface.co/datasets/Raymond-dev-546730/Open-CoT-Reasoning-Mini.

  5. Tamil Chain of Thought Prompt & Response Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Tamil Chain of Thought Prompt & Response Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/tamil-chain-of-thought-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Welcome to the Tamil Chain of Thought prompt-response dataset, a meticulously curated collection containing 3000 comprehensive prompt and response pairs. This dataset is an invaluable resource for training Language Models (LMs) to generate well-reasoned answers and minimize inaccuracies. Its primary utility lies in enhancing LLMs' reasoning skills for solving arithmetic, common sense, symbolic reasoning, and complex problems.

    Dataset Content

    This COT dataset comprises a diverse set of instructions and questions paired with corresponding answers and rationales in the Tamil language. These prompts and completions cover a broad range of topics and questions, including mathematical concepts, common sense reasoning, complex problem-solving, scientific inquiries, puzzles, and more.

    Each prompt is meticulously accompanied by a response and rationale, providing essential information and insights to enhance the language model training process. These prompts, completions, and rationales were manually curated by native Tamil people, drawing references from various sources, including open-source datasets, news articles, websites, and other reliable references.

    Our chain-of-thought prompt-completion dataset includes various prompt types, such as instructional prompts, continuations, and in-context learning (zero-shot, few-shot) prompts. Additionally, the dataset contains prompts and completions enriched with various forms of rich text, such as lists, tables, code snippets, JSON, and more, with proper markdown format.

    Prompt Diversity

    To ensure a wide-ranging dataset, we have included prompts from a plethora of topics related to mathematics, common sense reasoning, and symbolic reasoning. These topics encompass arithmetic, percentages, ratios, geometry, analogies, spatial reasoning, temporal reasoning, logic puzzles, patterns, and sequences, among others.

    These prompts vary in complexity, spanning easy, medium, and hard levels. Various question types are included, such as multiple-choice, direct queries, and true/false assessments.

    Response Formats

    To accommodate diverse learning experiences, our dataset incorporates different types of answers depending on the prompt and provides step-by-step rationales. The detailed rationale helps the language model build a reasoning process for complex questions.

    These responses encompass text strings, numerical values, and date and time formats, enhancing the language model's ability to generate reliable, coherent, and contextually appropriate answers.

    Data Format and Annotation Details

    This fully labeled Tamil Chain of Thought Prompt Completion Dataset is available in JSON and CSV formats. It includes annotation details such as a unique ID, prompt, prompt type, prompt complexity, prompt category, domain, response, rationale, response type, and rich text presence.

    Quality and Accuracy

    Our dataset upholds the highest standards of quality and accuracy. Each prompt undergoes meticulous validation, and the corresponding responses and rationales are thoroughly verified. We prioritize inclusivity, ensuring that the dataset incorporates prompts and completions representing diverse perspectives and writing styles, maintaining an unbiased and discrimination-free stance.

    The Tamil version is grammatically accurate without any spelling or grammatical errors. No copyrighted, toxic, or harmful content is used during the construction of this dataset.

    Continuous Updates and Customization

    The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Ongoing efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to gather custom chain of thought prompt completion data tailored to specific needs, providing flexibility and customization options.

    License

    The dataset, created by FutureBeeAI, is now available for commercial use. Researchers, data scientists, and developers can leverage this fully labeled and ready-to-deploy Tamil Chain of Thought Prompt Completion Dataset to enhance the rationale and accurate response generation capabilities of their generative AI models and explore new approaches to NLP tasks.

  6. t5 reasoning cot

    • kaggle.com
    zip
    Updated Jul 22, 2025
    Cite
    Anandita Garg (2025). t5 reasoning cot [Dataset]. https://www.kaggle.com/datasets/ananditaaaaa/t5-reasoning-cot
    Explore at:
    Available download formats: zip (1346298667 bytes)
    Dataset updated
    Jul 22, 2025
    Authors
    Anandita Garg
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Anandita Garg

    Released under Apache 2.0

    Contents

  7. Math CoT Arabic English Reasoning

    • kaggle.com
    zip
    Updated May 16, 2025
    + more versions
    Cite
    Miscovery (2025). Math CoT Arabic English Reasoning [Dataset]. https://www.kaggle.com/datasets/miscovery/math-cot-arabic-english-reasoning
    Explore at:
    Available download formats: zip (920398 bytes)
    Dataset updated
    May 16, 2025
    Authors
    Miscovery
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Math CoT Arabic English Dataset

    A high-quality, bilingual (English & Arabic) dataset for Chain-of-Thought (COT) reasoning in mathematics and related disciplines, developed by Miscovery AI.

    Overview

    Math-COT is a unique dataset designed to facilitate and benchmark the development of chain-of-thought reasoning capabilities in language models across mathematical domains. With meticulously crafted examples, explicit reasoning steps, and bilingual support, this dataset offers a robust foundation for training and evaluating mathematical reasoning abilities.

    Key Features

    • 99% Clean & High-Quality Data: Human-reviewed, accurately annotated examples with verified solutions
    • Bilingual Support: Complete English and Arabic parallel content for cross-lingual research and applications
    • Structured Reasoning Steps: Each problem solution is broken down into explicit step-by-step reasoning
    • Diverse Subject Coverage: Spans 21 different categories within mathematics and adjacent fields
    • Comprehensive Format: Includes questions, answers, reasoning chains, and relevant metadata

    Dataset Structure

    Each entry in the dataset contains the following fields:

    {
     "en_question": "Question text in English",
     "ar_question": "Question text in Arabic",
     "en_answer": "Detailed step-by-step solution in English",
     "ar_answer": "Detailed step-by-step solution in Arabic",
     "category": "Mathematical category",
     "en_q_word": "Word count of English question",
     "ar_q_word": "Word count of Arabic question",
     "en_a_word": "Word count of English answer",
     "ar_a_word": "Word count of Arabic answer"
    }
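    The word-count fields in the structure above are derivable from the text fields, so they can be sanity-checked when loading. A minimal sketch with a hypothetical entry (not taken from the dataset):

    ```python
    # Hypothetical entry using the field names from the structure above.
    entry = {
        "en_question": "What is the sum of the first 10 positive integers?",
        "ar_question": "ما هو مجموع أول 10 أعداد صحيحة موجبة؟",
        "en_answer": "Using n(n+1)/2 with n = 10 gives 10 * 11 / 2 = 55.",
        "category": "Mathematics - Arithmetic",
    }

    # Recompute the *_word metadata from the text with a whitespace split.
    entry["en_q_word"] = len(entry["en_question"].split())
    entry["ar_q_word"] = len(entry["ar_question"].split())
    entry["en_a_word"] = len(entry["en_answer"].split())

    print(entry["en_q_word"])  # 10
    ```

    The dataset's own counts may use a different tokenization, so small deviations from a plain whitespace split would not necessarily indicate an error.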
    

    Categories

    The dataset covers 21 distinct categories:

    1. Mathematics - Arithmetic
    2. Mathematics - Algebra
    3. Mathematics - Geometry
    4. Mathematics - Trigonometry
    5. Mathematics - Calculus
    6. Mathematics - Linear Algebra
    7. Mathematics - Probability
    8. Mathematics - Statistics
    9. Mathematics - Set Theory
    10. Mathematics - Number Theory
    11. Mathematics - Discrete Math
    12. Mathematics - Topology
    13. Mathematics - Differential Equations
    14. Mathematics - Real Analysis
    15. Math Puzzles
    16. Linguistics
    17. Logic and Reasoning
    18. Philosophy
    19. Sports and Games
    20. Psychology
    21. Cultural Traditions

    Example

    Here's a sample entry from the dataset:

    {
     "en_question": "A bag contains only red and blue balls. If one ball is drawn at random, the probability that it is red is 2/5. If 8 more red balls are added, the probability of drawing a red ball becomes 4/5. How many blue balls are there in the bag?",
     "ar_question": "تحتوي الحقيبة على كرات حمراء وزرقاء فقط. إذا تم سحب كرة واحدة عشوائيًا ، فإن احتمال أن تكون حمراء هو 2/5. إذا تمت إضافة 8 كرات حمراء أخرى ، يصبح احتمال سحب كرة حمراء 4/5. كم عدد الكرات الزرقاء الموجودة في الحقيبة؟",
     ...
    }

    Usage

    This dataset is especially valuable for:

    • Training and evaluating mathematical reasoning in language models
    • Research on step-by-step problem solving approaches
    • Developing educational AI assistants for mathematics
    • Cross-lingual research on mathematical reasoning
    • Benchmarking Chain-of-Thought (COT) capabilities

    Citation

    If you use this dataset in your research, please cite:

    @dataset{miscoveryai2025mathcot,
     title={Math CoT Arabic English Reasoning: A Bilingual Dataset for Chain-of-Thought Mathematical Reasoning},
     author={Miscovery AI},
     year={2025},
     publisher={Kaggle},
     url={https://www.kaggle.com/datasets/miscovery/math-cot-arabic-english-reasoning}
    }
    

    License

    This project is licensed under the MIT License - see the LICENSE file for details.

    Contact

    For questions, feedback, or issues related to this dataset, please contact Miscovery AI at info@miscovery.com.

  8. Eedi-Cot-7B base model LoRA MAP

    • kaggle.com
    zip
    Updated Oct 14, 2025
    Cite
    Jatin Mehra_666 (2025). Eedi-Cot-7B base model LoRA MAP [Dataset]. https://www.kaggle.com/datasets/jatinmehra666/eedi-cot-7b-base-model-lora-map
    Explore at:
    Available download formats: zip (245134843 bytes)
    Dataset updated
    Oct 14, 2025
    Authors
    Jatin Mehra_666
    Description

    Dataset

    This dataset was created by Jatin Mehra_666

    Contents

  9. Hindi Chain of Thought Prompt & Response Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Hindi Chain of Thought Prompt & Response Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/hindi-chain-of-thought-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Welcome to the Hindi Chain of Thought prompt-response dataset, a meticulously curated collection containing 3000 comprehensive prompt and response pairs. This dataset is an invaluable resource for training Language Models (LMs) to generate well-reasoned answers and minimize inaccuracies. Its primary utility lies in enhancing LLMs' reasoning skills for solving arithmetic, common sense, symbolic reasoning, and complex problems.

    Dataset Content

    This COT dataset comprises a diverse set of instructions and questions paired with corresponding answers and rationales in the Hindi language. These prompts and completions cover a broad range of topics and questions, including mathematical concepts, common sense reasoning, complex problem-solving, scientific inquiries, puzzles, and more.

    Each prompt is meticulously accompanied by a response and rationale, providing essential information and insights to enhance the language model training process. These prompts, completions, and rationales were manually curated by native Hindi people, drawing references from various sources, including open-source datasets, news articles, websites, and other reliable references.

    Our chain-of-thought prompt-completion dataset includes various prompt types, such as instructional prompts, continuations, and in-context learning (zero-shot, few-shot) prompts. Additionally, the dataset contains prompts and completions enriched with various forms of rich text, such as lists, tables, code snippets, JSON, and more, with proper markdown format.

    Prompt Diversity

    To ensure a wide-ranging dataset, we have included prompts from a plethora of topics related to mathematics, common sense reasoning, and symbolic reasoning. These topics encompass arithmetic, percentages, ratios, geometry, analogies, spatial reasoning, temporal reasoning, logic puzzles, patterns, and sequences, among others.

    These prompts vary in complexity, spanning easy, medium, and hard levels. Various question types are included, such as multiple-choice, direct queries, and true/false assessments.

    Response Formats

    To accommodate diverse learning experiences, our dataset incorporates different types of answers depending on the prompt and provides step-by-step rationales. The detailed rationale helps the language model build a reasoning process for complex questions.

    These responses encompass text strings, numerical values, and date and time formats, enhancing the language model's ability to generate reliable, coherent, and contextually appropriate answers.

    Data Format and Annotation Details

    This fully labeled Hindi Chain of Thought Prompt Completion Dataset is available in JSON and CSV formats. It includes annotation details such as a unique ID, prompt, prompt type, prompt complexity, prompt category, domain, response, rationale, response type, and rich text presence.

    Quality and Accuracy

    Our dataset upholds the highest standards of quality and accuracy. Each prompt undergoes meticulous validation, and the corresponding responses and rationales are thoroughly verified. We prioritize inclusivity, ensuring that the dataset incorporates prompts and completions representing diverse perspectives and writing styles, maintaining an unbiased and discrimination-free stance.

    The Hindi version is grammatically accurate without any spelling or grammatical errors. No copyrighted, toxic, or harmful content is used during the construction of this dataset.

    Continuous Updates and Customization

    The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Ongoing efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to gather custom chain of thought prompt completion data tailored to specific needs, providing flexibility and customization options.

    License

    The dataset, created by FutureBeeAI, is now available for commercial use. Researchers, data scientists, and developers can leverage this fully labeled and ready-to-deploy Hindi Chain of Thought Prompt Completion Dataset to enhance the rationale and accurate response generation capabilities of their generative AI models and explore new approaches to NLP tasks.

  10. Know Saraswati COT

    • kaggle.com
    • huggingface.co
    zip
    Updated Nov 22, 2023
    Cite
    The Devastator (2023). Know Saraswati COT [Dataset]. https://www.kaggle.com/datasets/thedevastator/open-source-logical-reasoning-dataset
    Explore at:
    Available download formats: zip (43869884 bytes)
    Dataset updated
    Nov 22, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Open Source Logical Reasoning Dataset

    Exploring Stream of Consciousness Thinking with GPT-4

    By Huggingface Hub [source]

    About this dataset

    Know-Saraswati-COT is an open source dataset supporting the training of models in logical reasoning and stream of consciousness thinking. Created using GPT-4 as an homage to Goddess Saraswati, the embodiment of wisdom and enlightenment, the corpus is aimed at analyzing introspective, free-flowing thought processes. Encompassing both logic and creativity, Know-Saraswati-COT enables users to craft machine learning models that combine analytical capacity with imaginative possibilities, converting raw data into a standardized representation of syntax structure and argument, critical components for creative computational thought at broad scale. The goal is machines that understand not only instructions but also complex concepts requiring comprehensive understanding for successful execution in real-world applications.


    How to use the dataset

    To begin working with this dataset, download the ‘Train.csv’ file from Kaggle, which contains instructions and corresponding outputs for training models in logical reasoning and stream of consciousness thinking. The columns in this file are 'instruction', the instruction given to a machine learning model, and 'output', the response generated by that model based on its interpretation of the instruction.

    Once you have downloaded the dataset, verify that it downloaded correctly with some basic checks: confirm that all columns are populated and check whether any instructions repeat within the file. Knowing how many unique examples exist tells you how much data you can use for training and helps you build better systems over time through feedback loops from regular users of the dataset.
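    The duplicate-instruction check described above takes only a few lines with the standard library. The sketch below uses an in-memory stand-in for Train.csv, assuming only the 'instruction' and 'output' columns mentioned; the sample rows are hypothetical.

    ```python
    import csv
    import io
    from collections import Counter

    # In-memory stand-in for Train.csv with the 'instruction' and 'output'
    # columns described above (replace io.StringIO with open("Train.csv")).
    sample = io.StringIO(
        "instruction,output\n"
        "Explain why 17 is prime.,17 has no divisors other than 1 and itself.\n"
        "Explain why 17 is prime.,Checking 2 through 4 shows no factor.\n"
        "Summarize the argument.,The author claims the premise is circular.\n"
    )
    rows = list(csv.DictReader(sample))

    # Count how often each instruction appears and report repeats.
    counts = Counter(r["instruction"] for r in rows)
    duplicates = {instr: n for instr, n in counts.items() if n > 1}
    print(duplicates)  # {'Explain why 17 is prime.': 2}
    ```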

    You can then apply data processing techniques such as normalization and feature extraction so a Machine Learning (ML) model can be trained properly on the dataset before its accuracy is measured on test cases. Depending on which features need to be extracted from each instruction, this could involve breaking long strings into separate words or phrases. More complex scenarios may demand additional data engineering, such as speech-recognition parsing to extract text from audio formats, according to the needs of the individual project. Capturing richer features in this way helps models represent the knowledge exchanged in natural human conversation, which in turn benefits downstream applications such as customer experience management.

    Research Ideas

    • Using Know-Saraswati-COT to create engaging story lines by training models to generate new stories with logical reasoning and stream of consciousness thought processes.
    • Training AI models to develop strong creative writing skills, especially for science fiction and fantasy genres.
    • Utilizing the dataset to expand knowledge resources in fields such as philosophy, psychology, science, art, and culture by better understanding how GPT-4 models respond to natural language instruction inputs.

    Acknowle...

  11. Automated Unit Test Generation via Chain of Thought Prompt and Reinforcement...

    • figshare.com
    • datasetcatalog.nlm.nih.gov
    zip
    Updated Jan 20, 2025
    Cite
    Anonymous Anonymous (2025). Automated Unit Test Generation via Chain of Thought Prompt and Reinforcement Learning [Dataset]. http://doi.org/10.6084/m9.figshare.28235750.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 20, 2025
    Dataset provided by
    Figshare: http://figshare.com/
    Authors
    Anonymous Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    This is the replication package for the paper "Automated Unit Test Generation via Chain of Thought Prompt and Reinforcement Learning".

    Organization of the Replication Package

    • checkpoints.zip: fine-tuned models, including TestCTRL, TestCT, TestCT-no-cot, TestCT-intention, TestCT-input, TestCT-ti, CodeBERT-line, CodeT5-line, CodeGPT-line, CodeBERT-branch, CodeT5-branch, and CodeGPT-branch.
    • dataset.zip: datasets for fine-tuning and reinforcement learning, including the CoT dataset, the reward dataset (reward folder), and the dataset for PPO optimization (rl folder).
    • evaluation.zip: scripts for evaluating the generated tests, including CodeBLEU, syntactic correct rate, compilation passing rate, line coverage rate, and branch coverage rate.
    • finetune.zip: scripts and configs for fine-tuning large language models for test generation.
    • generated_test_result.zip: the generated tests.
    • pretrain.zip: pre-trained models, including CodeLlama, CodeBERT, CodeT5, and CodeBERT.
    • CoT_quality.zip: the example of evaluating CoT prompts.
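    The rates listed among the evaluation metrics (syntactic correct rate, compilation passing rate, coverage rates) are ratio metrics over the generated tests. A minimal sketch of how such a rate could be computed, using hypothetical compile results rather than the paper's actual scripts:

    ```python
    # Hypothetical per-test compile outcomes for a batch of generated tests;
    # the replication package's evaluation scripts compute analogous ratios
    # over real compiler output.
    compiled = [True, True, False, True, False, True, True, True]

    compilation_passing_rate = sum(compiled) / len(compiled)
    print(f"{compilation_passing_rate:.2%}")  # 75.00%
    ```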

  12. CoT-Verification-340k

    • huggingface.co
    Updated May 26, 2025
    Zigeng Chen (2025). CoT-Verification-340k [Dataset]. https://huggingface.co/datasets/Zigeng/CoT-Verification-340k
    Explore at:
    Dataset updated
    May 26, 2025
    Authors
    Zigeng Chen
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    CoT-Verification-340k Dataset: Improving Reasoning Model Efficiency through Verification

    This dataset is used for supervised verification fine-tuning of large reasoning models. It contains 340,000 question-solution pairs annotated with solution correctness, including 160,000 correct Chain-of-Thought (CoT) solutions and 190,000 incorrect ones. This data is designed to train models to effectively verify the correctness of reasoning steps, leading to more efficient and accurate… See the full description on the dataset page: https://huggingface.co/datasets/Zigeng/CoT-Verification-340k.
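    A sketch of how such correctness-labeled pairs might be split for verification fine-tuning. The field names here ("question", "solution", "label") are illustrative assumptions, not the dataset's documented schema:

```python
# Toy records mimicking correctness-annotated question-solution pairs;
# the key names are assumptions for illustration only.
records = [
    {"question": "2 + 3 = ?",
     "solution": "2 + 3 = 5, so the answer is 5.",
     "label": "correct"},
    {"question": "2 + 3 = ?",
     "solution": "2 + 3 = 6, so the answer is 6.",
     "label": "incorrect"},
]

# Split into the two pools a verifier is trained to distinguish.
correct = [r for r in records if r["label"] == "correct"]
incorrect = [r for r in records if r["label"] == "incorrect"]
print(len(correct), len(incorrect))  # 1 1
```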

  13. Cot Dataset

    • universe.roboflow.com
    zip
    Updated Apr 28, 2022
    Nurhakim Norhazhar (2022). Cot Dataset [Dataset]. https://universe.roboflow.com/nurhakim-norhazhar/cot/dataset/1
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 28, 2022
    Dataset authored and provided by
    Nurhakim Norhazhar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Objects Bounding Boxes
    Description

    COT

    ## Overview
    
    COT is a dataset for object detection tasks - it contains Objects annotations for 2,221 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  14. Compiled-COT

    • huggingface.co
    Updated Feb 11, 2025
    Kamesh R (2025). Compiled-COT [Dataset]. https://huggingface.co/datasets/Kameshr/Compiled-COT
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 11, 2025
    Authors
    Kamesh R
    Description

    Compiled-CoT: Enhancing Chain-of-Thought Reasoning

    Compiled-CoT is a framework designed to improve Chain-of-Thought (CoT) reasoning capabilities in language models by leveraging curated datasets, refined prompting techniques, and adaptive learning mechanisms. It is designed to enhance model reasoning across various domains, especially in mathematical, logical, and commonsense tasks.
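    For orientation, a generic chain-of-thought prompt wrapper of the kind such frameworks refine. This template is illustrative only and is not taken from the Compiled-CoT repository:

```python
def cot_prompt(question: str) -> str:
    """Wrap a question in a generic CoT instruction (illustrative template)."""
    return (
        f"Question: {question}\n"
        "Let's think step by step, then give the final answer "
        "on a line starting with 'Answer:'."
    )


# Usage: feed the wrapped prompt to a language model instead of the bare question.
print(cot_prompt("If a train travels 60 km in 40 minutes, what is its speed in km/h?"))
```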

      Contributing
    

    Contributions are welcome! If you'd like to improve the framework or add new… See the full description on the dataset page: https://huggingface.co/datasets/Kameshr/Compiled-COT.

  15. Bengali Chain of Thought Prompt & Response Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    FutureBee AI (2022). Bengali Chain of Thought Prompt & Response Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/bengali-chain-of-thought-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Welcome to the Bengali Chain of Thought prompt-response dataset, a meticulously curated collection containing 3000 comprehensive prompt and response pairs. This dataset is an invaluable resource for training Language Models (LMs) to generate well-reasoned answers and minimize inaccuracies. Its primary utility lies in enhancing LLMs' reasoning skills for solving arithmetic, common sense, symbolic reasoning, and complex problems.

    Dataset Content

    This COT dataset comprises a diverse set of instructions and questions paired with corresponding answers and rationales in the Bengali language. These prompts and completions cover a broad range of topics and questions, including mathematical concepts, common sense reasoning, complex problem-solving, scientific inquiries, puzzles, and more.

    Each prompt is meticulously accompanied by a response and rationale, providing essential information and insights to enhance the language model training process. These prompts, completions, and rationales were manually curated by native Bengali people, drawing references from various sources, including open-source datasets, news articles, websites, and other reliable references.

    Our chain-of-thought prompt-completion dataset includes various prompt types, such as instructional prompts, continuations, and in-context learning (zero-shot, few-shot) prompts. Additionally, the dataset contains prompts and completions enriched with various forms of rich text, such as lists, tables, code snippets, JSON, and more, with proper markdown format.

    Prompt Diversity

    To ensure a wide-ranging dataset, we have included prompts from a plethora of topics related to mathematics, common sense reasoning, and symbolic reasoning. These topics encompass arithmetic, percentages, ratios, geometry, analogies, spatial reasoning, temporal reasoning, logic puzzles, patterns, and sequences, among others.

    These prompts vary in complexity, spanning easy, medium, and hard levels. Various question types are included, such as multiple-choice, direct queries, and true/false assessments.

    Response Formats

    To accommodate diverse learning experiences, our dataset incorporates different types of answers depending on the prompt and provides step-by-step rationales. The detailed rationales help the language model build a reasoning process for complex questions.

    These responses encompass text strings, numerical values, and date and time formats, enhancing the language model's ability to generate reliable, coherent, and contextually appropriate answers.

    Data Format and Annotation Details

    This fully labeled Bengali Chain of Thought Prompt Completion Dataset is available in JSON and CSV formats. It includes annotation details such as a unique ID, prompt, prompt type, prompt complexity, prompt category, domain, response, rationale, response type, and rich text presence.
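    A hypothetical record illustrating those annotation fields. The exact key names and values below are assumptions for illustration, not the delivered schema:

```python
import json

# Hypothetical Bengali CoT record; key names are illustrative assumptions.
record = json.loads("""
{
  "id": "bn-cot-00001",
  "prompt": "একটি ক্লাসে ১২ জন ছাত্র ও ৮ জন ছাত্রী আছে। মোট শিক্ষার্থী কতজন?",
  "prompt_type": "instructional",
  "prompt_complexity": "easy",
  "prompt_category": "arithmetic",
  "domain": "mathematics",
  "response": "২০ জন",
  "rationale": "১২ + ৮ = ২০, তাই মোট শিক্ষার্থী ২০ জন।",
  "response_type": "numerical",
  "rich_text_present": false
}
""")
print(record["response_type"])  # numerical
```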

    Quality and Accuracy

    Our dataset upholds the highest standards of quality and accuracy. Each prompt undergoes meticulous validation, and the corresponding responses and rationales are thoroughly verified. We prioritize inclusivity, ensuring that the dataset incorporates prompts and completions representing diverse perspectives and writing styles, maintaining an unbiased and discrimination-free stance.

    The Bengali version is grammatically accurate without any spelling or grammatical errors. No copyrighted, toxic, or harmful content is used during the construction of this dataset.

    Continuous Updates and Customization

    The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Ongoing efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to gather custom chain of thought prompt completion data tailored to specific needs, providing flexibility and customization options.

    License

    The dataset, created by FutureBeeAI, is now available for commercial use. Researchers, data scientists, and developers can leverage this fully labeled, ready-to-deploy Bengali Chain of Thought Prompt Completion Dataset to strengthen the reasoning and accurate-response-generation capabilities of their generative AI models and to explore new approaches to NLP tasks.

  16. HumanRef-CoT-45k

    • huggingface.co
    Updated Jun 5, 2025
    IDEA-Research (2025). HumanRef-CoT-45k [Dataset]. https://huggingface.co/datasets/IDEA-Research/HumanRef-CoT-45k
    Explore at:
    Dataset updated
    Jun 5, 2025
    Dataset authored and provided by
    IDEA-Research
    Description

    🦖🧠 Rex-Thinker: Grounded Object Referring via Chain-of-Thought Reasoning 🦖🧠

    We propose Rex-Thinker, a Chain-of-Thought (CoT) reasoning model for object referring that addresses two key challenges: lack of interpretability and inability to reject unmatched expressions. Instead of directly predicting bounding boxes, Rex-Thinker reasons step-by-step over candidate objects to determine which, if any, match a given expression.… See the full description on the dataset page: https://huggingface.co/datasets/IDEA-Research/HumanRef-CoT-45k.

  17. rich-cot

    • huggingface.co
    Updated Nov 27, 2025
    Simon Storf (2025). rich-cot [Dataset]. https://huggingface.co/datasets/Syghmon/rich-cot
    Explore at:
    Dataset updated
    Nov 27, 2025
    Authors
    Simon Storf
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Scheming Detection CoT Dataset

      Dataset Description
    

    This dataset contains Chain-of-Thought (CoT) reasoning for the scheming detection task. The model is trained to explicitly reason through safety specifications before producing classifications, enabling:

    • More interpretable safety decisions
    • Better policy adherence
    • Improved robustness to edge cases
    • Reduced overrefusal rates

      Dataset Statistics
    

    Total Samples: 44,129
    Generated: 2025-11-27
    Generation Model: … See the full description on the dataset page: https://huggingface.co/datasets/Syghmon/rich-cot.

  18. Comparison with the results of LLMs with CoT prompt.

    • plos.figshare.com
    xls
    Updated Sep 3, 2025
    Jin Jin; Fan Wang; Shengzheng Tian (2025). Comparison with the results of LLMs with CoT prompt. [Dataset]. http://doi.org/10.1371/journal.pone.0330684.t005
    Explore at:
    Available download formats: xls
    Dataset updated
    Sep 3, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Jin Jin; Fan Wang; Shengzheng Tian
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Comparison with the results of LLMs with CoT prompt.

  19. Camping Cot Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Sep 1, 2025
    Growth Market Reports (2025). Camping Cot Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/camping-cot-market
    Explore at:
    Available download formats: csv, pptx, pdf
    Dataset updated
    Sep 1, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Camping Cot Market Outlook



    According to our latest research, the global camping cot market reached USD 1.34 billion in 2024, reflecting a robust and expanding industry. The market is poised for steady growth, with a projected compound annual growth rate (CAGR) of 5.7% from 2025 to 2033, and is forecast to reach USD 2.22 billion by the end of 2033. This trajectory is primarily driven by the increasing popularity of outdoor recreational activities, rising disposable incomes, and the growing trend of adventure tourism worldwide. These factors are expected to continue fueling demand for camping cots across diverse end-user segments over the next decade.
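    As a quick sanity check of those figures (the base year, CAGR, and horizon are the report's; the arithmetic is ours), compounding USD 1.34 billion at 5.7% over the nine years 2025-2033 lands close to the stated USD 2.22 billion forecast:

```python
base_2024 = 1.34   # market size in USD billions (2024)
cagr = 0.057       # projected compound annual growth rate
years = 9          # 2025 through 2033 inclusive

projected_2033 = base_2024 * (1 + cagr) ** years
print(round(projected_2033, 2))  # 2.21 (consistent with the ~2.22B forecast)
```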




    The primary growth factor propelling the camping cot market is the surge in outdoor recreational activities, including camping, hiking, and backpacking. As urbanization intensifies and lifestyles become more hectic, consumers are increasingly seeking opportunities to reconnect with nature and pursue wellness through outdoor experiences. This shift in consumer behavior has resulted in heightened demand for comfortable and convenient camping gear, with camping cots emerging as a preferred choice for ensuring restful sleep in outdoor environments. The proliferation of camping sites, national parks, and adventure travel operators has further contributed to the widespread adoption of camping cots, especially among millennials and families looking for safe and ergonomic sleeping solutions during their excursions.




    Another crucial driver is the continuous innovation in materials and product design, which has significantly enhanced the functionality and portability of camping cots. Manufacturers are leveraging advanced materials such as lightweight aluminum alloys, high-tensile steel, and weather-resistant fabrics to create durable yet easy-to-carry products. The introduction of folding and portable camping cots has made it feasible for users to transport and set up their sleeping arrangements with minimal effort, catering to both individual campers and group expeditions. Additionally, the market has witnessed the emergence of specialized camping cots, including double cots for couples, kids' cots for families, and heavy-duty models for commercial or institutional use, further expanding the consumer base.




    The expanding e-commerce ecosystem and the rise of online retail channels have also played a pivotal role in accelerating market growth. Online platforms provide consumers with access to a wide array of camping cot options, detailed product descriptions, user reviews, and competitive pricing, thus enabling informed purchase decisions. The convenience of home delivery and easy return policies has encouraged more consumers to invest in camping gear online, especially in regions where physical specialty stores may be limited. This digital transformation, coupled with targeted marketing campaigns by leading brands, has significantly boosted market penetration and awareness, particularly among tech-savvy and younger demographics.



    In recent years, the concept of pet camping has gained traction among outdoor enthusiasts, leading to the development of specialized products like the Pet Camping Cot. These cots are designed to provide a comfortable and elevated sleeping surface for pets, ensuring they can rest safely and comfortably during camping trips. With features such as durable frames, weather-resistant fabrics, and easy portability, pet camping cots have become a popular choice for pet owners who want to include their furry companions in outdoor adventures. As more families and individuals embrace pet-friendly travel, the demand for pet camping cots is expected to rise, encouraging manufacturers to innovate and expand their product offerings to cater to this growing market segment.




    From a regional perspective, North America currently dominates the camping cot market, accounting for the largest revenue share in 2024, followed closely by Europe and the Asia Pacific. North America's leadership can be attributed to its established outdoor recreation culture, extensive network of campsites, and high consumer spending on leisure activities. Europe's market is also robust, supported by a strong tradition of camping and well-developed tourism infrastructure. Meanwhile, the Asia Pacific region is wi

  20. CCI4.0-M2-CoT-v1

    • huggingface.co
    Updated May 6, 2025
    Beijing Academy of Artificial Intelligence (2025). CCI4.0-M2-CoT-v1 [Dataset]. https://huggingface.co/datasets/BAAI/CCI4.0-M2-CoT-v1
    Explore at:
    Dataset updated
    May 6, 2025
    Dataset authored and provided by
    Beijing Academy of Artificial Intelligence
    Description

    CCI4.0-M2 v1 Dataset Documentation

    Tech Report👁

      Overview
    

    CCI4.0-M2 v1 is a comprehensive dataset collection consisting of two specialized subsets designed for language model training:

    CCI4.0-M2-Base v1
      Download Link: BAAI_datahub / modelscope / hf
      Notes: 5.2TB of Chinese webpages and 22TB of English webpages; some data is released in CCI4.0-M2-Extra (BAAI_datahub / modelscope / hf) due to license concerns.

    CCI4.0-M2-CoT v1
      Download Link: BAAI_datahub / modelscope / hf
      Notes: 430 million CoT… See the full description on the dataset page: https://huggingface.co/datasets/BAAI/CCI4.0-M2-CoT-v1.
