10 datasets found

Data from: MathCheck
huggingface.co
Updated Jul 12, 2024
Cite
PremiLab-Math (2024). MathCheck [Dataset]. https://huggingface.co/datasets/PremiLab-Math/MathCheck
Dataset updated
Jul 12, 2024
Dataset authored and provided by
PremiLab-Math
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Exceptional mathematical reasoning ability is one of the key features that demonstrate the power of large language models (LLMs). How to comprehensively define and evaluate the mathematical abilities of LLMs, and even reflect the user experience in real-world scenarios, has emerged as a critical issue. Current benchmarks predominantly concentrate on problem-solving capabilities, which presents a substantial risk of model overfitting and fails to accurately represent genuine mathematical… See the full description on the dataset page: https://huggingface.co/datasets/PremiLab-Math/MathCheck.
MetaMath QA
kaggle.com
Updated Nov 23, 2023
Cite
The Devastator (2023). MetaMath QA [Dataset]. https://www.kaggle.com/datasets/thedevastator/metamathqa-performance-with-mistral-7b/suggestions?status=pending
Dataset updated
Nov 23, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
The Devastator
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
MetaMath QA

Mathematical Questions for Large Language Models

By Huggingface Hub [source]

About this dataset

This dataset contains MetaMathQA questions and answers collected from the Mistral-7B question-answering system. It provides the responses, types, and queries, with the stated aim of improving MetaMathQA performance while maintaining high accuracy. Its structure lets users investigate how question-answering models behave across different query types.

How to use the dataset

Data Dictionary

The MetaMathQA dataset contains three columns: response, type, and query.
- Response: the response to the query given by the question answering system. (String)
- Type: the type of query provided as input to the system. (String)
- Query: the question posed to the system for which a response is required. (String)

Preparing data for analysis

Before analysis, familiarize yourself with the kinds of values present in each column and check whether any preprocessing is needed, such as removing unwanted characters or filling in missing values, so that the data can be used without issues when training or testing a model later in your workflow.
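A minimal sketch of that first inspection pass, using only the Python standard library. The column names come from the data dictionary; the two sample rows and the `load_rows` helper are invented for illustration and stand in for the real train.csv, whose exact contents are not shown here.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for train.csv; these rows are invented for illustration.
SAMPLE_CSV = (
    "query,response,type\n"
    '"What is 2 + 3?","2 + 3 = 5",arithmetic\n'
    '"  Solve x + 1 = 4. ",,algebra\n'
)
COLUMNS = ("query", "response", "type")


def load_rows(handle):
    """Read the three documented columns, trimming whitespace and
    turning missing values into empty strings."""
    rows = []
    for raw in csv.DictReader(handle):
        rows.append({col: (raw.get(col) or "").strip() for col in COLUMNS})
    return rows


rows = load_rows(io.StringIO(SAMPLE_CSV))

# Count empty cells per column and the distribution of query types.
missing = Counter(col for row in rows for col in COLUMNS if not row[col])
type_counts = Counter(row["type"] for row in rows)

print("missing values per column:", dict(missing))
print("rows per type:", dict(type_counts))
```

To run this against the real file, one would pass an open file handle for train.csv to `load_rows` in place of the in-memory sample.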

Training models using Mistral 7B

Mistral 7B is an open-weight large language model with roughly 7 billion parameters, not a framework for fitting models to tabular (CSV) data. A typical use of 'MetaMathQA' with it is fine-tuning on the query and response columns after preprocessing. Classical algorithms such as Support Vector Machines (SVM), logistic regression, and decision trees are a separate option for auxiliary tasks such as predicting the type column from the query text. If you take that route, it is good practice to tune hyperparameters with methods such as scikit-learn's GridSearchCV and RandomizedSearchCV, and then to validate the selected models with metrics such as accuracy, F1, precision, and recall.

Testing the model

After the building phase, test the trained model thoroughly on the evaluation metrics mentioned above. Use it to make predictions on new test cases, for example ones supplied by domain experts, rerun the quality-assurance checks against the same metrics, and update the baseline scores as further experiments are run.

Research Ideas

Generating natural language processing (NLP) models to better identify patterns and connections between questions, answers, and types.

Developing an understanding of how effective particular language features are at producing successful question-answering results for different types of queries.

Optimizing search algorithms that surface relevant answers based on the type of query.

Acknowledgements

If you use this dataset in your research, please credit the original authors.

License

License: CC0 1.0 Universal (CC0 1.0), Public Domain Dedication. No copyright: you can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: train.csv

| Column name | Description                         |
|:------------|:------------------------------------|
| response    | The response to the query. (String) |
| type        | The type of query. (String)         |

Acknowledgements

If you use this dataset in your research, please credit the original authors and Huggingface Hub.
Data from: The IBEM Dataset: a large printed scientific image dataset for...
zenodo.org
zip
Updated May 25, 2023
Cite
Dan Anitei; Joan Andreu Sánchez; José Miguel Benedí (2023). The IBEM Dataset: a large printed scientific image dataset for indexing and searching mathematical expressions [Dataset]. http://doi.org/10.5281/zenodo.7963703
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.7963703
Dataset updated
May 25, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Dan Anitei; Joan Andreu Sánchez; José Miguel Benedí
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The IBEM dataset consists of 600 documents with a total number of 8272 pages, containing 29603 isolated and 137089 embedded Mathematical Expressions (MEs). The objective of the IBEM dataset is to facilitate the indexing and searching of MEs in massive collections of STEM documents. The dataset was built by parsing the LaTeX source files of documents from the KDD Cup Collection. Several experiments can be carried out with the IBEM dataset ground-truth (GT): ME detection and extraction, ME recognition, etc.

The dataset consists of the following files:

“IBEM.json”: file containing the IBEM GT information. The data is firstly organized by pages, then by the type of expression (“embedded” or “displayed”), and lastly by the GT of each individual ME. For each ME we provide:

xy page-level coordinates, reported as relative (%) to the width/height of the page image.

“split” attribute indicating the number of fragments into which the ME has been split. MEs can be split over various lines, columns or pages. The LaTeX transcript of a split ME has been replicated exactly (entire LaTeX definition) for each fragment.

“latex” original transcript as extracted from the LaTeX source files of the documents. This definition can contain user-defined macros. In order to be able to compile these expressions, each page includes the preamble of the source files containing the defined macros and the packages used by the authors of the documents.

“latex_expand” transcript reconstructed from the output stream of the LuaLaTeX engine in which user-defined macros have been expanded. The transcript has the same visual representation as the original transcript, with the addition that the LaTeX definitions are tokenized, the order of sub/superscript elements has been fixed, and matrices have been transformed to arrays.

“latex_norm” transcript resulting from applying an extra normalization process to the “latex_expand” expression. This normalization process includes removing font information such as slant, style, and weight.

“partitions/*.lst”: files containing the lists of pages forming the partition sets.

“pages/*.jpg”: individual pages extracted from the documents.

The dataset is partitioned into various sets as provided for the ICDAR 2021 Competition on Mathematical Formula Detection. The ground-truth related to this competition, which is included in this dataset version, can also be found here. More information about the competition can be found in the following paper:

D. Anitei, J.A. Sánchez, J.M. Fuentes, R. Paredes, and J.M. Benedí. ICDAR 2021 Competition on Mathematical Formula Detection. In ICDAR, pages 783–795, 2021.

For ME recognition tasks, we recommend rendering the “latex_expand” version of the formulae in order to create standalone expressions that have the same visual representation as MEs found in the original documents (see attached python script “extract_GT.py”). Extracting MEs from the documents based on coordinates is more complex, as special care is needed to concatenate the fragments of split expressions. Baseline results for ME recognition tasks will soon be made available.
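Because the ground-truth coordinates are reported as percentages of the page width and height, cropping an ME from a page image first requires converting them to pixels. The sketch below shows that conversion only. The `(x1, y1, x2, y2)` box layout, the function name, and the page size are assumptions made for illustration; the actual field layout in IBEM.json may differ, and split expressions still need their fragments concatenated separately.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]


def percent_box_to_pixels(
    box_pct: Box, page_width: int, page_height: int
) -> Tuple[int, int, int, int]:
    """Convert an (x1, y1, x2, y2) box given in percent of the page size
    into integer pixel coordinates.

    The box layout is hypothetical; adapt it to the fields in IBEM.json.
    """
    x1, y1, x2, y2 = box_pct
    return (
        round(x1 / 100 * page_width),
        round(y1 / 100 * page_height),
        round(x2 / 100 * page_width),
        round(y2 / 100 * page_height),
    )


# Invented example: a box spanning 10%-50% of the width and 20%-25% of
# the height on a 1000 x 2000 pixel page image.
pixel_box = percent_box_to_pixels((10.0, 20.0, 50.0, 25.0), 1000, 2000)
print(pixel_box)
```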
NaturalProofs Dataset
paperswithcode.com
opendatalab.com
+2 more
Updated May 28, 2025
Cite
Sean Welleck; Jiacheng Liu; Ronan Le Bras; Hannaneh Hajishirzi; Yejin Choi; Kyunghyun Cho (2025). NaturalProofs Dataset [Dataset]. https://paperswithcode.com/dataset/naturalproofs
Dataset updated
May 28, 2025
Authors
Sean Welleck; Jiacheng Liu; Ronan Le Bras; Hannaneh Hajishirzi; Yejin Choi; Kyunghyun Cho
Description
The NaturalProofs Dataset is a large-scale dataset for studying mathematical reasoning in natural language. NaturalProofs consists of roughly 20,000 theorem statements and proofs, 12,500 definitions, and 1,000 additional pages (e.g. axioms, corollaries) derived from ProofWiki, an online compendium of mathematical proofs written by a community of contributors.
Comparative Judgement of Statements About Mathematical Definitions
dataverse.no
dataverse.azure.uit.no
csv, txt
Updated Sep 28, 2023
Cite
Tore Forbregd; Hermund Torkildsen; Eivind Kaspersen; Trygve Solstad (2023). Comparative Judgement of Statements About Mathematical Definitions [Dataset]. http://doi.org/10.18710/EOZKTR
Available download formats: txt (3623), csv (2523), csv (37503), csv (43566)
Unique identifier
https://doi.org/10.18710/EOZKTR
Dataset updated
Sep 28, 2023
Dataset provided by
DataverseNO
Authors
Tore Forbregd; Hermund Torkildsen; Eivind Kaspersen; Trygve Solstad
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Data from a comparative judgement survey of 62 working mathematics educators (ME) at Norwegian universities or city colleges and 57 working mathematicians (WM) at Norwegian universities. The data comprise 3607 comparisons in total: 1780 by the ME and 1827 by the WM. Respondents compared pairs of statements on mathematical definitions, compiled from a literature review on mathematical definitions in the mathematics education literature. Each WM was asked to judge 40 pairs of statements with the following question: “As a researcher in mathematics, where your target group is other mathematicians, what is more important about mathematical definitions?” Each ME was asked to judge 41 pairs of statements with the following question: “For a mathematical definition in the context of teaching and learning, what is more important?” The comparative judgement was done with No More Marking software (nomoremarking.com).

The data set consists of the following data:
- comparisons made by ME (ME.csv)
- comparisons made by WM (WM.csv)
- look-up table of statement codes and statement formulations (key.csv)

Each line in a comparison file represents one comparison, where the "winner" column holds the winner and the "loser" column the loser of the comparison.
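As a rough first look at such comparison files, one can tally how often each statement wins relative to how often it appears. The sketch below does this with the standard library. Only the "winner" and "loser" column names come from the description; the statement codes S1 to S3 and the four sample rows are invented, and a simple win rate is a stand-in for, not a reproduction of, whatever scaling model the survey software applies.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for ME.csv or WM.csv; codes and rows are invented.
SAMPLE = "winner,loser\nS1,S2\nS1,S3\nS2,S3\nS3,S1\n"


def win_rates(handle):
    """Return {statement: wins / appearances} from winner/loser rows."""
    wins, appearances = Counter(), Counter()
    for row in csv.DictReader(handle):
        wins[row["winner"]] += 1
        appearances[row["winner"]] += 1
        appearances[row["loser"]] += 1
    return {code: wins[code] / count for code, count in appearances.items()}


rates = win_rates(io.StringIO(SAMPLE))

# Statements ordered from most to least often preferred.
ranking = sorted(rates, key=rates.get, reverse=True)
print(ranking)
```

The statement codes in the ranking could then be mapped to their formulations through key.csv.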
MMLU Dataset
paperswithcode.com
Updated Feb 15, 2022
Cite
Dan Hendrycks; Collin Burns; Steven Basart; Andy Zou; Mantas Mazeika; Dawn Song; Jacob Steinhardt (2022). MMLU Dataset [Dataset]. https://paperswithcode.com/dataset/mmlu
Dataset updated
Feb 15, 2022
Authors
Dan Hendrycks; Collin Burns; Steven Basart; Andy Zou; Mantas Mazeika; Dawn Song; Jacob Steinhardt
Description
MMLU (Massive Multitask Language Understanding) is a benchmark designed to measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This makes the benchmark more challenging and more similar to how we evaluate humans. The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem-solving ability. Subjects range from traditional areas, such as mathematics and history, to more specialized areas like law and ethics. The granularity and breadth of the subjects make the benchmark well suited to identifying a model’s blind spots.
SCG Dataset from Graph Neural Networks in Supply Chain Analytics and...
data.niaid.nih.gov
Updated Sep 3, 2024
Cite
Wasi, Azmine Toushik (2024). SCG Dataset from Graph Neural Networks in Supply Chain Analytics and Optimization: Concepts, Perspectives, Dataset and Benchmarks [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13652825
Dataset updated
Sep 3, 2024
Dataset provided by
Wasi, Azmine Toushik
Akib, Adipto Raihan
Bappy, Mahathir Mohammad
Islam, MD Shafikul
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Abstract: Graph Neural Networks (GNNs) have recently gained traction in transportation, bioinformatics, language and image processing, but research on their application to supply chain management remains limited. Supply chains are inherently graph-like, making them ideal for GNN methodologies, which can optimize and solve complex problems. The barriers include a lack of proper conceptual foundations, familiarity with graph applications in SCM, and real-world benchmark datasets for GNN-based supply chain research. To address this, we discuss and connect supply chains with graph structures for effective GNN application, providing detailed formulations, examples, mathematical definitions, and task guidelines. Additionally, we present a multi-perspective real-world benchmark dataset from a leading FMCG company in Bangladesh, focusing on supply chain planning. We discuss various supply chain tasks using GNNs and benchmark several state-of-the-art models on homogeneous and heterogeneous graphs across six supply chain analytics tasks. Our analysis shows that GNN-based models consistently outperform statistical ML and other deep learning models by around 10-30% in regression, 10-30% in classification and detection tasks, and 15-40% in anomaly detection tasks on designated metrics. With this work, we lay the groundwork for solving supply chain problems using GNNs, supported by conceptual discussions, methodological insights, and a comprehensive dataset.
GPQA Dataset
paperswithcode.com
Updated Jan 30, 2025
Cite
David Rein; Betty Li Hou; Asa Cooper Stickland; Jackson Petty; Richard Yuanzhe Pang; Julien Dirani; Julian Michael; Samuel R. Bowman (2025). GPQA Dataset [Dataset]. https://paperswithcode.com/dataset/gpqa
Dataset updated
Jan 30, 2025
Authors
David Rein; Betty Li Hou; Asa Cooper Stickland; Jackson Petty; Richard Yuanzhe Pang; Julien Dirani; Julian Michael; Samuel R. Bowman
Description
GPQA stands for Graduate-Level Google-Proof Q&A Benchmark. It is a challenging dataset designed to evaluate the capabilities of Large Language Models (LLMs) and scalable oversight mechanisms.

- Description: GPQA consists of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. The questions are intentionally designed to be high-quality and extremely difficult.
- Expert accuracy: Even experts who hold or are pursuing PhDs in the corresponding domains achieve only 65% accuracy on these questions (or 74% when excluding clear mistakes identified in retrospect).
- Google-proof: The questions are "Google-proof," meaning that even with unrestricted access to the web, highly skilled non-expert validators only reach an accuracy of 34% despite spending over 30 minutes searching for answers.
- AI systems difficulty: State-of-the-art AI systems, including the authors' strongest GPT-4-based baseline, achieve only 39% accuracy on this dataset.

The difficulty of GPQA for both skilled non-experts and cutting-edge AI systems makes it an excellent resource for conducting realistic scalable oversight experiments. These experiments aim to explore ways for human experts to reliably obtain truthful information from AI systems that surpass human capabilities¹³.

In summary, GPQA serves as a valuable benchmark for assessing the robustness and limitations of language models, especially when faced with complex and nuanced questions. Its difficulty level encourages research into effective oversight methods, bridging the gap between AI and human expertise.

(1) [2311.12022] GPQA: A Graduate-Level Google-Proof Q&A Benchmark - arXiv.org. https://arxiv.org/abs/2311.12022.
(2) GPQA: A Graduate-Level Google-Proof Q&A Benchmark - Klu. https://klu.ai/glossary/gpqa-eval.
(3) GPA Dataset (Spring 2010 through Spring 2020) - Data Science Discovery. https://discovery.cs.illinois.edu/dataset/gpa/.
(4) GPQA: A Graduate-Level Google-Proof Q&A Benchmark - GitHub. https://github.com/idavidrein/gpqa.
(5) Data Sets - OpenIntro. https://www.openintro.org/data/index.php?data=satgpa.
(6) https://doi.org/10.48550/arXiv.2311.12022.
(7) https://arxiv.org/abs/2311.12022.
Data Sheet 1_Mathematical methodology for defining a frequent attender...
frontiersin.figshare.com
pdf
Updated Feb 11, 2025
Cite
Elizabeth Williams; Syaribah N. Brice; Dave Price (2025). Data Sheet 1_Mathematical methodology for defining a frequent attender within emergency departments.pdf [Dataset]. http://doi.org/10.3389/femer.2025.1462764.s001
Available download formats: pdf
Unique identifier
https://doi.org/10.3389/femer.2025.1462764.s001
Dataset updated
Feb 11, 2025
Dataset provided by
Frontiers
Authors
Elizabeth Williams; Syaribah N. Brice; Dave Price
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Objective: Emergency department (ED) frequent attenders (FA) have been the subject of discussion in many countries. This group of patients have contributed to the high expenses of health services and strained capacity in the department. Studies related to ED FAs aim to describe the characteristics of patients such as demographic and socioeconomic factors. The analysis may explore the relationship between these factors and multiple patient visits. However, the definition used for classifying patients varies across studies. While most studies used frequency of attendance to define the FA, the derivation of the frequency is not clear.

Methods: We propose a mathematical methodology to define the time interval between ED returns for classifying FAs. K-means clustering and the Elbow method were used to identify suitable FA definitions. Recursive clustering on the smallest time interval cluster created a new, smaller cluster and formal FA definition.

Results: Applied to a case study dataset of approximately 336,000 ED attendances, this framework can consistently and effectively identify FAs across EDs. Based on our data, a FA is defined as a patient with three or more attendances within sequential 21-day periods.

Conclusion: This study introduces a standardized framework for defining ED FAs, providing a consistent and effective means of identification across different EDs. Furthermore, the methodology can be used to identify patients who are at risk of becoming a FA. This allows for the implementation of targeted interventions aimed at reducing the number of future attendances.
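The resulting definition can be turned into a simple check on one patient's attendance dates. The sketch below rests on an assumed reading of "sequential 21-day periods", namely a run of at least three attendances in which each return falls within 21 days of the previous one; the abstract does not spell this out, so the paper itself should be consulted before relying on it. The function name, parameters, and dates are invented for illustration.

```python
from datetime import date
from typing import Iterable


def is_frequent_attender(
    visits: Iterable[date], min_visits: int = 3, max_gap_days: int = 21
) -> bool:
    """Return True if some run of consecutive attendances, each at most
    `max_gap_days` after the previous one, reaches `min_visits`.

    This is one assumed interpretation of the abstract's definition.
    """
    ordered = sorted(visits)
    run = 1 if ordered else 0
    for previous, current in zip(ordered, ordered[1:]):
        # Extend the run on a quick return, otherwise start a new run.
        run = run + 1 if (current - previous).days <= max_gap_days else 1
        if run >= min_visits:
            return True
    return run >= min_visits


# Invented attendance histories: gaps of 14 and 17 days, then 14 and 46 days.
clustered = [date(2024, 1, 1), date(2024, 1, 15), date(2024, 2, 1)]
spread_out = [date(2024, 1, 1), date(2024, 1, 15), date(2024, 3, 1)]
print(is_frequent_attender(clustered), is_frequent_attender(spread_out))
```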
Do Humans Optimally Exploit Redundancy to Control Step Variability in...
figshare.com
data.niaid.nih.gov
+1 more
pdf
Updated Jun 1, 2023
Cite
Jonathan B. Dingwell; Joby John; Joseph P. Cusumano (2023). Do Humans Optimally Exploit Redundancy to Control Step Variability in Walking? [Dataset]. http://doi.org/10.1371/journal.pcbi.1000856
Available download formats: pdf
Unique identifier
https://doi.org/10.1371/journal.pcbi.1000856
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS Computational Biology
Authors
Jonathan B. Dingwell; Joby John; Joseph P. Cusumano
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
It is widely accepted that humans and animals minimize energetic cost while walking. While such principles predict average behavior, they do not explain the variability observed in walking. For robust performance, walking movements must adapt at each step, not just on average. Here, we propose an analytical framework that reconciles issues of optimality, redundancy, and stochasticity. For human treadmill walking, we defined a goal function to formulate a precise mathematical definition of one possible control strategy: maintain constant speed at each stride. We recorded stride times and stride lengths from healthy subjects walking at five speeds. The specified goal function yielded a decomposition of stride-to-stride variations into new gait variables explicitly related to achieving the hypothesized strategy. Subjects exhibited greatly decreased variability for goal-relevant gait fluctuations directly related to achieving this strategy, but far greater variability for goal-irrelevant fluctuations. More importantly, humans immediately corrected goal-relevant deviations at each successive stride, while allowing goal-irrelevant deviations to persist across multiple strides. To demonstrate that this was not the only strategy people could have used to successfully accomplish the task, we created three surrogate data sets. Each tested a specific alternative hypothesis that subjects used a different strategy that made no reference to the hypothesized goal function. Humans did not adopt any of these viable alternative strategies. Finally, we developed a sequence of stochastic control models of stride-to-stride variability for walking, based on the Minimum Intervention Principle. We demonstrate that healthy humans are not precisely “optimal,” but instead consistently slightly over-correct small deviations in walking speed at each stride. 
Our results reveal a new governing principle for regulating stride-to-stride fluctuations in human walking that acts independently of, but in parallel with, minimizing energetic cost. Thus, humans exploit task redundancies to achieve robust control while minimizing effort and allowing potentially beneficial motor variability.
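To make the constant-speed strategy concrete: if each stride has a length and a duration, the strategy amounts to keeping length divided by time equal to the treadmill speed at every stride. The sketch below computes the per-stride deviation from that goal. The function name, the units, and the numbers are invented for illustration, and this is not the authors' decomposition into goal-relevant and goal-irrelevant variables, only the basic quantity such an analysis starts from.

```python
from typing import List, Sequence


def speed_deviations(
    stride_lengths_m: Sequence[float],
    stride_times_s: Sequence[float],
    treadmill_speed_ms: float,
) -> List[float]:
    """Per-stride deviation of walking speed (length / time) from the
    treadmill speed; zero means the stride exactly met the speed goal."""
    if len(stride_lengths_m) != len(stride_times_s):
        raise ValueError("need one stride time per stride length")
    return [
        length / time - treadmill_speed_ms
        for length, time in zip(stride_lengths_m, stride_times_s)
    ]


# Invented strides on a treadmill assumed to run at 1.2 m/s: the first
# and third strides meet the goal, the second is too slow.
deviations = speed_deviations([1.20, 1.10, 1.32], [1.0, 1.0, 1.1], 1.2)
print([round(d, 3) for d in deviations])
```

Note that the first and third strides differ in both length and time yet yield the same speed, which is the kind of task redundancy the description refers to.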
MathCheck (PremiLab-Math/MathCheck): 102 scholarly articles cite this dataset (View in Google Scholar)