Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
NaturalReasoning is a large-scale dataset for general reasoning tasks. It consists of high-quality challenging reasoning questions backtranslated from pretraining corpora DCLM and FineMath. The questions have been deduplicated and decontaminated from popular reasoning benchmarks including MATH, GPQA, MMLU-Pro, MMLU-STEM. For each question, we extract the reference final answer from the original document from the pretraining corpora if possible. We also provide a model-generated response from… See the full description on the dataset page: https://huggingface.co/datasets/facebook/natural_reasoning.
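A minimal sketch of loading this dataset with the Hugging Face `datasets` library; the split name "train" and printing a raw example are assumptions, not details taken from the description above.

```python
# Sketch: load NaturalReasoning from the Hugging Face Hub and look at one example.
# Assumes the `datasets` library is installed; the split name is an assumption.
from datasets import load_dataset

ds = load_dataset("facebook/natural_reasoning", split="train")

# Each record is a backtranslated reasoning question, with a reference answer
# extracted from the source document where available (see description above).
print(ds[0])
```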
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Natural Sciences Reasoning: the "smolest" reasoning dataset
A smol-scale open dataset for reasoning tasks using Hugging Face Inference Endpoints. While intentionally limited in scale, this resource prioritizes:
Reproducible pipeline for reasoning tasks using a variety of models (DeepSeek V3, DeepSeek-R1, Llama-70B-Instruct, etc.)
Knowledge sharing for domains other than Math and Code reasoning
In this repo, you can find:
The prompts and the pipeline (see the config file). The… See the full description on the dataset page: https://huggingface.co/datasets/dvilasuero/natural-science-reasoning.
Facebook Natural Reasoning - Curriculum Learning
This dataset is a curriculum learning version of the Facebook Natural Reasoning dataset, automatically split into four complexity levels:
Splits
simple_reasoning: 28898 examples - Basic arithmetic, one-step problems
basic_reasoning: 713776 examples - Chained operations, single-variable algebra
intermediate_reasoning: 397706 examples - Proportional reasoning, multi-step problems
complex_reasoning: 5444 examples - … See the full description on the dataset page: https://huggingface.co/datasets/essobi/facebook-natural-reasoning-curriculum.
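The split names above suggest the dataset can be loaded level by level. A minimal sketch follows, assuming the `datasets` library and that the Hub exposes the splits under exactly these names.

```python
# Sketch: load each curriculum level of the dataset described above.
# Assumes the `datasets` library; split names are taken from the list above,
# but their exact availability on the Hub is an assumption.
from datasets import load_dataset

splits = [
    "simple_reasoning",
    "basic_reasoning",
    "intermediate_reasoning",
    "complex_reasoning",
]

for name in splits:
    ds = load_dataset("essobi/facebook-natural-reasoning-curriculum", split=name)
    print(name, len(ds))  # e.g. simple_reasoning 28898
```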
DR.BENCH is a dataset for developing and evaluating cNLP models with clinical diagnostic reasoning ability. The suite includes six tasks from ten publicly available datasets addressing clinical text understanding, medical knowledge reasoning, and diagnosis generation.
qingyangzhang/Natural-Reasoning-STEM-25K dataset hosted on Hugging Face and contributed by the HF Datasets community
https://heidata.uni-heidelberg.de/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.11588/DATA/C56QUV
Understanding a social situation requires the ability to reason about the underlying emotions and behaviour of others. For example, when we read a personal story, we use our prior commonsense knowledge and social intelligence to infer the emotions and motives of the characters and to anticipate their actions. For machines to understand text related to personal stories and social conversations, they must be able to make commonsense inferences. While most people can reason deeply about the social implications of a text, this is challenging for natural language processing systems because these implications are often subtle and implicit. This dissertation argues that NLP systems must learn to reason more explicitly about the underlying social knowledge in text to perform social commonsense reasoning. We divide this argument into two sub-problems: (i) understanding the underlying social knowledge and (ii) explicitly reasoning about such knowledge for social commonsense reasoning. To address these problems, we propose building NLP systems that integrate neural network based learning with structured knowledge representations.

In the first part of this dissertation, we study the role of structured commonsense knowledge in understanding the social dynamics of characters and their actions in stories. Our motivation for enriching the model with structured commonsense knowledge is to bridge the gap between the surface meaning of texts and the underlying social implication of each event in the stories. We develop a novel model that incorporates commonsense knowledge into neural models and showcases the importance of commonsense knowledge in understanding the social dynamics of story characters. Further, we investigate the role of the temporal dynamics of story events in understanding social situations. We develop a model that can explicitly learn, from personal narrative stories, what social event follows another. We demonstrate that implicitly leveraging such temporal knowledge about story events can support social commonsense reasoning tasks.

In the second part of this dissertation, we investigate methods to explicitly reason about knowledge related to the social dynamics of characters (behavior, mental states) and the cause/effect of social events. We propose a novel model named multi-head knowledge attention that incorporates such social knowledge into state-of-the-art neural NLP models to address two complex commonsense inference tasks. We demonstrate that our method of incorporating knowledge can improve (i) the robustness and interpretability of the model and (ii) the overall performance of the model compared to other knowledge integration methods. We also investigate social commonsense reasoning as a natural language generation task. We design a story completion task that requires natural language generation models to perform both forward and backward reasoning. We study the role of contextualized commonsense knowledge in natural language generation tasks and propose a model that jointly learns to generate contextualized inference rules as well as narrative stories. We demonstrate that our model can outperform state-of-the-art non-contextualized commonsense knowledge-based generation models. We hope that the research presented in this dissertation will open up interesting directions for future research on social commonsense reasoning and related topics.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Baseline models for the paper lilGym: Natural Language Visual Reasoning with Reinforcement Learning.
NLVR contains 92,244 pairs of human-written English sentences grounded in synthetic images. Because the images are synthetically generated, this dataset can be used for semantic parsing.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
For more than 20 years, research has proven the beneficial effect of natural frequencies when it comes to solving Bayesian reasoning tasks (Gigerenzer and Hoffrage, 1995). In a recent meta-analysis, McDowell and Jacobs (2017) showed that presenting a task in natural frequency format increases performance rates to 24% compared to only 4% when the same task is presented in probability format. Nevertheless, on average three quarters of participants in their meta-analysis failed to obtain the correct solution for such a task in frequency format. In this paper, we present an empirical study on what participants typically do wrong when confronted with natural frequencies. We found that many of them did not actually use natural frequencies for their calculations, but translated them back into complicated probabilities instead. This switch from the intuitive presentation format to a less intuitive calculation format will be discussed within the framework of psychological theories (e.g., the Einstellung effect).
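To illustrate the format contrast the study builds on, here is a small worked example with made-up numbers (the specific values are illustrative assumptions, not figures from the paper): the same Bayesian task expressed in probability format and in natural frequency format.

```python
# Illustrative Bayesian task (made-up numbers), solved in both formats.

# Probability format: base rate 1%, true-positive rate 80%, false-positive rate 9.6%.
base_rate, tpr, fpr = 0.01, 0.80, 0.096
ppv_probabilities = (base_rate * tpr) / (base_rate * tpr + (1 - base_rate) * fpr)

# Natural frequency format: of 1,000 people, 10 have the condition;
# 8 of those 10 test positive, and about 95 of the other 990 also test positive.
true_positives, false_positives = 8, 95
ppv_frequencies = true_positives / (true_positives + false_positives)

print(round(ppv_probabilities, 3), round(ppv_frequencies, 3))  # both ≈ 0.078
```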
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Cognitive psychology has shown that understanding numerical information is deeply related to the format in which this information is presented; percentages are difficult to grasp, whereas frequency formats are intuitively accessible. This plays a vital role in the medical domain, where difficult risk-related probability judgments have to be made both by professionals and their patients. In this article, we demonstrate that the idea of representing statistical information in terms of frequency formats is not only helpful for communicating risks, but can also be applied to primary school stochastics, where percentages and fractions are not yet available. For this purpose, we report on an intervention study conducted in grade 4 of primary school. The results show, on the one hand, that primary school students could already solve Bayesian reasoning tasks in the pretest when natural frequencies were used. On the other hand, the students profited from the intervention, in which they used different representations, namely colored tinker cubes and natural frequencies, to describe and quantify frequencies and probabilities. These results align with findings from cognitive psychology that activities with hands-on material, as well as pointing out the underlying nested-sets structure, can foster Bayesian reasoning. The results are discussed in particular with regard to teaching stochastics in (primary) school.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Previous research on Bayesian reasoning has typically investigated people's ability to assess a posterior probability (i.e., a positive predictive value) based on prior knowledge (i.e., base rate, true-positive rate, and false-positive rate). In this article, we systematically examine the extent to which people understand the effects of changes in the three input probabilities on the positive predictive value, that is, covariational reasoning. In this regard, two different operationalizations for measuring covariational reasoning (i.e., single-choice vs. slider format) are investigated in an empirical study with N = 229 university students. In addition, we aim to answer the question whether a skill in "conventional" Bayesian reasoning is a prerequisite for covariational reasoning.
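The quantity in question follows directly from Bayes' theorem, which makes the covariational question concrete: how does the positive predictive value move when one of the three inputs changes? A minimal sketch with illustrative numbers (not taken from the study) follows.

```python
# Positive predictive value (PPV) from base rate, true-positive rate (TPR)
# and false-positive rate (FPR), via Bayes' theorem.
def ppv(base_rate: float, tpr: float, fpr: float) -> float:
    """P(condition | positive test)."""
    return (base_rate * tpr) / (base_rate * tpr + (1 - base_rate) * fpr)

# Covariational reasoning: vary one input at a time (illustrative numbers only).
print(ppv(0.01, 0.9, 0.05))  # low base rate              -> PPV ≈ 0.15
print(ppv(0.10, 0.9, 0.05))  # higher base rate           -> PPV ≈ 0.67
print(ppv(0.10, 0.9, 0.20))  # higher false-positive rate -> PPV ≈ 0.33
```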
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This dataset is derived from the Natural Questions (NQ) dataset which is a large benchmark for open question answering research (https://ai.google.com/research/NaturalQuestions).
This dataset contains a collection of cause-effect pairs along with their context (the text describing the causal relation between the cause and the effect) as well as the original question in the NQ data set. It also contains a collection of "negative" pairs, phrases that are mentioned in the context but have no causal relation.
This dataset is constructed by filtering questions in the NQ dataset that follow a certain pattern indicating that the question is causal. Either the cause or the effect is the (short) answer in the original NQ dataset, and the other side is manually derived from the context.
The data is shared in JSONL format with every line being a processed NQ question with relevant fields described above. In versions 2 and 2.5, each JSON object has the following fields:
Versions 1.5 and 2.5 enforce exactness: causes and effects appear verbatim as they do in the context passage. This differs from versions 1 and 2, where causes and effects were annotated according to human understanding and manual curation, i.e., how a human reader would parse causes and effects in the text, with simple grammatical consistency enforced. For example, for a passage such as "Above - average sea water temperatures caused by global warming is the leading cause of coral bleaching.", versions 1 and 2 might state the cause as "above average sea water temperatures", whereas versions 1.5 and 2.5 will always state "Above - average sea water temperatures", i.e., the cause and effect are exactly as seen in the text. These versions make it easier to match the output of any causal relation extraction (CRE) algorithm against the annotated causes and effects, e.g., running some CRE method X and comparing its results with the annotated spans. Hence versions 1.5 and 2.5 are also called Evaluation versions.
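Since the exact field list is not reproduced in this description, the sketch below only shows how a JSONL release like this is typically read; the file name is an assumption, and the comments describe expected content rather than the documented schema.

```python
# Sketch: iterate over a JSONL release of the cause-effect data described above.
# The file name is an assumption; the schema (field names) is not listed here,
# so the code just prints whatever keys each record actually has.
import json

with open("nq_cause_effect.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        # One processed NQ question per line: expected to carry the context
        # passage, the cause/effect pair, and the original question.
        print(sorted(record.keys()))
        break
```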
License: https://creativecommons.org/licenses/by-sa/3.0/
Contacts:
Gaurav Dass: dassg2 AT rpi.edu
Oktie Hassanazadeh
Building machines that can reason about physical events and their causal relationships is crucial for flexible interaction with the physical world. However, most existing physical and causal reasoning benchmarks are exclusively based on synthetically generated events and synthetic natural language descriptions of causal relationships. This design brings up two issues. First, there is a lack of diversity in both event types and natural language descriptions; second, causal relationships based on manually-defined heuristics are different from human judgments. To address both shortcomings, we present the CLEVRER-Humans benchmark, a video reasoning dataset for causal judgment of physical events with human labels. We employ two techniques to improve data collection efficiency: first, a novel iterative event cloze task to elicit a new representation of events in videos, which we term Causal Event Graphs (CEGs); second, a data augmentation technique based on neural language generative models. … We use a three-stage annotation pipeline. The first stage focuses on collecting human-written event descriptions using event cloze tasks, but only for a small number of videos. In the second stage, we augment the data for all videos using neural event description generators trained on the data collected from the first stage. In the third stage, we condense CEGs by collecting binary causal relation labels for all pairs of events from humans.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Paper: "CLadder: Assessing Causal Reasoning in Language Models" (NeurIPS 2023) by Zhijing Jin*, Yuen Chen*, Felix Leeb*, Luigi Gresele*, Ojasv Kamal, Zhiheng Lyu, Kevin Blin, Fernando Gonzalez, Max Kleiman-Weiner, Mrinmaya Sachan, Bernhard Schƶlkopf. (http://arxiv.org/abs/2312.04350) Abstract: The ability to perform causal reasoning is widely considered a core feature of intelligence. In this work, we investigate whether large language models (LLMs) can coherently reason about causality. Much of the existing work in natural language processing (NLP) focuses on evaluating commonsense causal reasoning in LLMs, thus failing to assess whether a model can perform causal inference in accordance with a set of well-defined formal rules. To address this, we propose a new NLP task, causal inference in natural language, inspired by the "causal inference engine" postulated by Judea Pearl et al. We compose a large dataset, CLadder, with 10K samples: based on a collection of causal graphs and queries (associational, interventional, and counterfactual), we obtain symbolic questions and ground-truth answers, through an oracle causal inference engine. These are then translated into natural language. We evaluate multiple LLMs on our dataset, and we introduce and evaluate a bespoke chain-of-thought prompting strategy, CausalCoT. We show that our task is highly challenging for LLMs, and we conduct an in-depth analysis to gain deeper insight into the causal reasoning abilities of LLMs. Our data is open-sourced at https://github.com/causalNLP/cladder, and our code can be found at https://huggingface.co/datasets/causalnlp/CLadder.
https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
This dataset consists of 52K instruction-following data generated by GPT-4 in English using the same prompts as in Alpaca. This data has been crafted specifically to help researchers break ground and explore new strategies for natural language processing, with a special focus on instruction-following reasoning.
What makes this dataset unique and powerful is that it offers a wide variety of options for experimenting with models that excel at instruction-following tasks: from refining specific components, such as predicting outputs or analyzing long textual conversations, to training and evaluating end-to-end approaches on the full dataset. It allows researchers to iterate on their experiments rapidly while having the confidence of a high-performing model with few limitations, making this an invaluable resource for anyone looking to push the boundaries of artificial intelligence techniques for logical reasoning problems.
For more datasets, click here.
This dataset is an invaluable resource for researching artificial intelligence approaches to logical reasoning problems. This dataset consists of 52K instruction-following samples generated by GPT-4 in English using the same prompts as in Alpaca. Here are some tips on how to make the most out of this dataset:
The columns in this dataset provide essential data that can help researchers evaluate their models on a task involving instruction following:
instruction, input, output, and text. In order to use this data effectively, researchers should be familiar with each column and understand its purpose and its contribution to instruction-following principles. a) The instruction column provides a statement which an AI model must interpret in order to complete a task correctly; b) the input column is pre-generated data that helps the AI model make sense of the instruction; c) the output column indicates what kind of result must be returned after the AI model interprets the instruction correctly; and finally, d) the text column is the full text generated by GPT-4, which gives deeper insight into what gave rise to the output from the input and instruction handling. Note: it is very important that researchers pay attention to all four columns when working with this dataset, as the four components work together as an integrated whole.
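A minimal sketch of inspecting these four columns with pandas, assuming the data has been downloaded locally as the train.csv file mentioned further below; the local path is an assumption.

```python
# Sketch: load train.csv and look at the four columns described above.
# Assumes pandas is installed and the CSV has been downloaded locally.
import pandas as pd

df = pd.read_csv("train.csv")

# The four columns discussed above.
print(df[["instruction", "input", "output", "text"]].head())

# Rough check: how often does an example come with a non-empty `input`?
print((df["input"].fillna("").str.len() > 0).mean())
```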
To get better results, one should consider fine-tuning existing models so that they become better suited for instruction-following tasks, using these four columns as guidance. It would also be useful if the dataset came with corresponding hyperparameters, so users could fine-tune more quickly without losing accuracy or any other metric relevant to such scenarios.
Additionally, readers should review the context closely to assess accuracy and decide which model type best suits their use case before attempting any evaluation, since some options may produce more accurate results but take longer to process, or vice versa; the right choice will depend largely on the dataset and the evaluation metrics used.
- Training intelligent conversational agents with instruction-following reasoning capabilities.
- Developing more complex and powerful instructions processing models driven by natural language understanding and reasoning algorithms.
- Establishing an online platform that helps academic, business, or other organizations build auto-grading systems for evaluating instruction-following skills of their staff at large scale and at relatively low cost.
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv | Colu...
QUITE (Quantifying Uncertainty in natural language Text) is an entirely new benchmark for assessing the capabilities of neural language model-based systems with respect to Bayesian reasoning, on a large set of input texts that describe probabilistic relationships in natural language.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains 1,000 entries organized in a structured JSON format. Each entry consists of two fields: instruction and output. The instruction field presents a high-difficulty mathematical problem in natural language, while the output field contains the complete reasoning process generated by a large reasoning model. Specifically, the output includes a think field, which provides a detailed long-chain reasoning solution, and an answer field, which summarizes the final standard answer.
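A minimal sketch of reading such a file, based on the field names given above; the file name, and whether output is a nested object holding think and answer (as assumed here) or a single string containing both, are assumptions.

```python
# Sketch: read the JSON file described above and access its fields.
# File name is an assumption; treating `output` as a nested object with
# `think` and `answer` keys is also an assumption based on the description.
import json

with open("math_reasoning_1000.json", encoding="utf-8") as f:
    entries = json.load(f)  # expected: a list of 1,000 entries

first = entries[0]
print(first["instruction"])        # high-difficulty math problem in natural language
print(first["output"]["think"])    # detailed long-chain reasoning
print(first["output"]["answer"])   # final standard answer
```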
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
As a core cognitive skill that enables the transferability of information across domains, analogical reasoning has been extensively studied for both humans and computational models. However, while cognitive theories of analogy often focus on narratives and study the distinction between surface, relational, and system similarities, existing work in natural language processing has a narrower focus, largely limited to relational analogies between word pairs. This gap brings a natural question: can state-of-the-art large language models (LLMs) detect system analogies between narratives? To gain insight into this question and extend word-based relational analogies to relational system analogies, we devise a comprehensive computational framework that operationalizes dominant theories of analogy, using narrative elements to create surface and system mappings. Leveraging the interplay between these mappings, we create a binary task and benchmark for Analogical Reasoning on Narratives (ARN), covering four categories of far (cross-domain)/near (within-domain) analogies and disanalogies. We show that while all LLMs can largely recognize near analogies, even the largest ones struggle with far analogies in a zero-shot setting, with GPT4.0 scoring below random. Guiding the models through solved examples and chain-of-thought reasoning enhances their analogical reasoning ability. Yet, since even in the few-shot setting the best model only performs halfway between random and humans, ARN opens exciting directions for computational analogical reasoners.
Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
The experimental AI Dialog Software Application is able to build a Mental Model from a plain English Dialogue input. Subsequently, Comprehension and Logical Capability may be tested using plain English Queries. AI Dialog is used to experimentally test the validity and utility of a novel software application design, as a path to Conversational Artificial Intelligence. The theory behind AI Dialog is fully described in the book: Belch, Francis R. (2021) Artificial Intelligence That Can Comprehend, Available at: Amazon.com.
There are also two YouTube lectures each of about 1 hour duration describing a radical new theory of Linguistic Semantics used to implement the AI Dialog Software Application. These are:
Semantics I - A radical new approach - link address - https://www.youtube.com/watch?v=aTVRp2-9niU&t=209s
Semantics II - Dialogues & Mental Models - link address - https://www.youtube.com/watch?v=6S9VG9sINmg&t=37s
This is a download of the executable of the AI Dialog Software Application Version 3.2 Alpha Release. This version supersedes Version 3.1 to allow both U.K. and U.S. spelling in user input.
The AI Dialog Software is protected by international copyright, but is made available to use for non-commercial personal study purposes.
The application will run on Windows 10® PC, laptop, and tablet systems, and requires about 1 MB of storage.
The download file is zipped and needs to be unzipped. After this, the content of the folder AI Dialog 3.2 Alpha Release is:
• Application Files (Folder)
• Documentation (Folder)
• NLP2016Autumn (Manifest)
• Setup (Application)
In the Documentation folder are two PDF files:
• Appendix I - Tuition Lessons (PDF)
• Appendix II - AI Dialog Specification (PDF)
The first is a hard copy of the tuition lessons. The second is a specification of a subset of English for use with the AI Dialog system. However, there is no need to consult either of these initially, as AI Dialog incorporates a quick start tuition module.
To install AI Dialog, double click the Setup file. This starts AI Dialog immediately after installation, but places an application icon on the Windows 10® Start list for restarting later.
After AI Dialog starts, just follow the pop-up message instructions, which lead to a quick start interactive tuition module, fully describing how to use the application.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Synthetic reasoning dataset
Original version:
https://huggingface.co/datasets/lighteval/synthetic_reasoning_natural
Translation source code: https://github.com/martinakaduc/ura-llama/tree/main/dataset_scripts/custom_datasets