Mathematics database.
This dataset code generates mathematical question and answer pairs from a range of question types at roughly school-level difficulty. It is designed to test the mathematical learning and algebraic reasoning skills of learning models.
Original paper: Analysing Mathematical Reasoning Abilities of Neural Models (Saxton, Grefenstette, Hill, Kohli).
Example usage:
```python
import tensorflow_datasets as tfds

train_examples, val_examples = tfds.load(
    'math_dataset/arithmetic_mul',
    split=['train', 'test'],
    as_supervised=True)
```
To use this dataset:
```python
import tensorflow_datasets as tfds

ds = tfds.load('math_dataset', split='train')
for ex in ds.take(4):
    print(ex)
```
See the guide for more information on tensorflow_datasets.
## Example questions
Question: Solve -42*r + 27*c = -1167 and 130*r + 4*c = 372 for r.
Answer: 4
Question: Calculate -841880142.544 + 411127.
Answer: -841469015.544
Question: Let x(g) = 9*g + 1. Let q(c) = 2*c + 1. Let f(i) = 3*i - 39. Let w(j) = q(x(j)). Calculate f(w(a)).
Answer: 54*a - 30
It contains 2 million (question, answer) pairs per module, with questions limited to 160 characters in length and answers to 30 characters. Note that the training data for each question type is split into "train-easy", "train-medium", and "train-hard", which allows training models via a curriculum. The data can also be mixed together uniformly from these training datasets to obtain the results reported in the paper.
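For instance, here is a minimal sketch of uniform mixing with tf.data, assuming the three per-difficulty subsets have already been loaded as separate datasets (the toy datasets below are stand-ins; `sample_from_datasets` requires TF >= 2.7):

```python
import tensorflow as tf

# Hypothetical stand-ins for the train-easy / train-medium / train-hard
# subsets; in practice these would be built from the raw data release.
easy = tf.data.Dataset.from_tensor_slices(["easy_q1", "easy_q2"])
medium = tf.data.Dataset.from_tensor_slices(["med_q1", "med_q2"])
hard = tf.data.Dataset.from_tensor_slices(["hard_q1", "hard_q2"])

# Interleave the three difficulty levels uniformly.
mixed = tf.data.Dataset.sample_from_datasets(
    [easy.repeat(), medium.repeat(), hard.repeat()],
    weights=[1 / 3, 1 / 3, 1 / 3])

for question in mixed.take(5):
    print(question.numpy())
```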
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
MathAnalystPile
MathAnalystPile is a dataset for continued pretraining of large language models to enhance their mathematical reasoning abilities. It contains approximately 20B tokens of math-related data covering web pages, textbooks, model-synthesized text, and math-related code. We open-source the full pretraining dataset to facilitate future research in this field.
Data Composition
MathAnalystPile contains a wide range of math-related data. The number of tokens of each… See the full description on the dataset page: https://huggingface.co/datasets/MathGenie/MathCode-Pile-Full.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset tracks annual math proficiency from 2011 to 2022 for Range View Elementary School vs. Colorado and Weld County Reorganized School District No. Re-4.
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
MM-K12
[GitHub] [Paper] MM-K12 is a curated, high-quality dataset containing 10,000 multimodal math problems sourced from K-12 educational content. Each problem includes both textual and visual components, covering a wide range of mathematical topics (e.g., arithmetic, geometry, algebra). All problems have unique, verifiable answers, making the dataset ideal for supervised training, evaluation, and reward modeling in multimodal mathematical reasoning tasks. The dataset… See the full description on the dataset page: https://huggingface.co/datasets/Cierra0506/MM-K12.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This paper addresses the computational methods and challenges associated with prime number generation, a critical component in encryption algorithms for ensuring data security. Generating prime numbers efficiently is a critical challenge in various domains, including cryptography, number theory, and computer science. The quest for more effective prime generation algorithms is driven by the increasing demand for secure communication and data storage, and the need for efficient algorithms to solve complex mathematical problems. Our goal is to address this challenge by presenting two novel algorithms for generating prime numbers: one that generates primes up to a given limit and another that generates primes within a specified range. These algorithms are founded on the formulas of odd-composed numbers, allowing them to achieve remarkable performance improvements compared to existing prime number generation algorithms. Our comprehensive experimental results reveal that our proposed algorithms outperform well-established prime number generation algorithms such as Miller-Rabin, Sieve of Atkin, Sieve of Eratosthenes, and Sieve of Sundaram in terms of mean execution time. More notably, our algorithms exhibit the unique ability to generate primes over arbitrary ranges with commendable performance. This substantial enhancement in performance and adaptability can significantly impact the effectiveness of various applications that depend on prime numbers, from cryptographic systems to distributed computing. By providing an efficient and flexible method for generating prime numbers, our proposed algorithms can help develop more secure and reliable communication systems, enable faster computations in number theory, and support advanced computer science and mathematics research.
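The paper's own algorithms are not reproduced here; for context, here is a minimal sketch of two of the classic baselines it benchmarks against: a Sieve of Eratosthenes for the up-to-a-limit case and a segmented variant for the range case.

```python
import math

def primes_up_to(limit):
    """Classic Sieve of Eratosthenes: all primes <= limit."""
    if limit < 2:
        return []
    sieve = [True] * (limit + 1)
    sieve[0] = sieve[1] = False
    for p in range(2, math.isqrt(limit) + 1):
        if sieve[p]:
            # Mark all multiples of p starting at p*p as composite.
            for multiple in range(p * p, limit + 1, p):
                sieve[multiple] = False
    return [n for n, is_prime in enumerate(sieve) if is_prime]

def primes_in_range(lo, hi):
    """Segmented sieve: all primes in [lo, hi], using base primes <= sqrt(hi)."""
    base = primes_up_to(math.isqrt(hi))
    segment = [True] * (hi - lo + 1)
    for p in base:
        # First multiple of p inside [lo, hi], never p itself.
        start = max(p * p, ((lo + p - 1) // p) * p)
        for multiple in range(start, hi + 1, p):
            segment[multiple - lo] = False
    return [lo + i for i, is_prime in enumerate(segment)
            if is_prime and lo + i > 1]

print(primes_in_range(100, 150))
# [101, 103, 107, 109, 113, 127, 131, 137, 139, 149]
```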
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Dataset Card for NuminaMath CoT
Dataset Summary
Approximately 860k math problems, where each solution is formatted in a Chain of Thought (CoT) manner. The sources of the dataset range from Chinese high school math exercises to US and international mathematics olympiad competition problems. The data were primarily collected from online exam paper PDFs and mathematics discussion forums. The processing steps include (a) OCR from the original PDFs, (b) segmentation into… See the full description on the dataset page: https://huggingface.co/datasets/AI-MO/NuminaMath-CoT.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Historical Dataset of Range View Elementary School is provided by PublicSchoolReview and contains statistics on the following metrics: Total Students Trends Over Years (2013-2023), Total Classroom Teachers Trends Over Years (2013-2023), Distribution of Students By Grade Trends, Student-Teacher Ratio Comparison Over Years (2013-2023), American Indian Student Percentage Comparison Over Years (2011-2023), Asian Student Percentage Comparison Over Years (2021-2022), Hispanic Student Percentage Comparison Over Years (2013-2023), Black Student Percentage Comparison Over Years (2019-2022), White Student Percentage Comparison Over Years (2013-2023), Two or More Races Student Percentage Comparison Over Years (2013-2023), Diversity Score Comparison Over Years (2013-2023), Free Lunch Eligibility Comparison Over Years (2013-2023), Reduced-Price Lunch Eligibility Comparison Over Years (2013-2023), Reading and Language Arts Proficiency Comparison Over Years (2011-2022), Math Proficiency Comparison Over Years (2011-2022), and Overall School Rank Trends Over Years (2011-2022).
MMLU (Massive Multitask Language Understanding) is a new benchmark designed to measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This makes the benchmark more challenging and more similar to how we evaluate humans. The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem solving ability. Subjects range from traditional areas, such as mathematics and history, to more specialized areas like law and ethics. The granularity and breadth of the subjects makes the benchmark ideal for identifying a model's blind spots.
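As an illustration, here is a minimal sketch of the commonly used MMLU few-shot prompt format (question, four lettered choices, "Answer: <letter>"); the exact template of the original evaluation code may differ slightly, and the example questions below are hypothetical:

```python
CHOICE_LETTERS = "ABCD"

def format_example(question, choices, answer_idx=None):
    # One question block; leave the answer blank for the test question.
    lines = [question]
    lines += [f"{CHOICE_LETTERS[i]}. {c}" for i, c in enumerate(choices)]
    lines.append("Answer:" if answer_idx is None
                 else f"Answer: {CHOICE_LETTERS[answer_idx]}")
    return "\n".join(lines)

def build_prompt(subject, shots, test_question, test_choices):
    header = (f"The following are multiple choice questions "
              f"(with answers) about {subject}.\n\n")
    body = "\n\n".join(format_example(q, c, a) for q, c, a in shots)
    return header + body + "\n\n" + format_example(test_question, test_choices)

# Hypothetical one-shot example:
shots = [("What is 2 + 2?", ["3", "4", "5", "6"], 1)]
print(build_prompt("elementary mathematics", shots,
                   "What is 7 * 8?", ["54", "56", "63", "49"]))
```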
CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)
Technological advances have steadily increased the detail of animal tracking datasets, yet fundamental data limitations exist for many species that cause substantial biases in home-range estimation. Specifically, the effective sample size of a range estimate is proportional to the number of observed range crossings, not the number of sampled locations. Currently, the most accurate home-range estimators condition on an autocorrelation model, for which the standard estimation frameworks are based on likelihood functions, even though these methods are known to underestimate variance, and therefore ranging area, when effective sample sizes are small. Residual maximum likelihood (REML) is a widely used method for reducing bias in maximum-likelihood (ML) variance estimation at small sample sizes. Unfortunately, we find that REML is too unstable for practical application to continuous-time movement models. When the effective sample size N is decreased to N ≲ O(10), which is common in tracking applications, REML undergoes a sudden divergence in variance estimation. To avoid this issue, while retaining REML's first-order bias correction, we derive a family of estimators that leverage REML to make a perturbative correction to ML. We also derive AIC values for REML and our estimators, including cases where model structures differ, which is not generally understood to be possible. Using both simulated data and GPS data from lowland tapir (Tapirus terrestris), we show how our perturbative estimators are more accurate than traditional ML and REML methods. Specifically, when O(5) home-range crossings are observed, REML is unreliable by orders of magnitude, ML home ranges are ~30% underestimated, and our perturbative estimators yield home ranges that are only ~10% underestimated. A parametric bootstrap can then reduce the ML and perturbative home-range underestimation to ~10% and ~3%, respectively. Home-range estimation is one of the primary reasons for collecting animal tracking data, and small effective sample sizes are a more common problem than is currently realized. The methods introduced here allow for more accurate movement-model and home-range estimation at small effective sample sizes, and thus fill an important role for animal movement analysis. Given REML's widespread use, our methods may also be useful in other contexts where effective sample sizes are small.
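The generic ML-vs-REML bias point can be illustrated with a toy i.i.d. Gaussian example (deliberately not the paper's continuous-time movement models): the ML variance estimator divides by N and is biased low by a factor (N-1)/N, while the REML estimator divides by N-1 and removes this first-order bias.

```python
import numpy as np

# Toy illustration: at small N, ML variance (divide by N) is biased low,
# while REML variance (divide by N - 1) is unbiased for i.i.d. Gaussian data.
rng = np.random.default_rng(0)
N, true_var, trials = 5, 4.0, 100_000

samples = rng.normal(0.0, np.sqrt(true_var), size=(trials, N))
ml = samples.var(axis=1, ddof=0)    # ML: divide by N
reml = samples.var(axis=1, ddof=1)  # REML: divide by N - 1

print(f"true variance:      {true_var}")
print(f"mean ML estimate:   {ml.mean():.3f}")    # ~3.2, biased low by (N-1)/N
print(f"mean REML estimate: {reml.mean():.3f}")  # ~4.0, unbiased
```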
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Overview:
The HindiMathQuest: A Dataset for Mathematical Reasoning and Problem-Solving in Hindi is designed to advance the capabilities of language models in understanding and solving mathematical problems presented in the Hindi language. The dataset covers a comprehensive range of question types, including logical reasoning, numeric calculations, translation-based problems, and complex mathematical tasks typically seen in competitive exams. This dataset is intended to fill a… See the full description on the dataset page: https://huggingface.co/datasets/dnyanesh/HindiMathQuest.
CC0 1.0 Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This book is written for statisticians, data analysts, programmers, researchers, teachers, students, professionals, and general consumers on how to perform different types of statistical data analysis for research purposes using the R programming language. R is an open-source software and object-oriented programming language, with a development environment (IDE) called RStudio, for computing statistics and graphical displays through data manipulation, modelling, and calculation. R packages and supported libraries provide a wide range of functions for programming and analyzing data. Unlike many existing statistical software packages, R has the added benefit of allowing users to write more efficient code by using command-line scripting and vectors. It has several built-in functions and libraries that are extensible, and it allows users to define their own (customized) functions for how they expect the program to behave while handling the data, which can also be stored in the simple object system.

For all intents and purposes, this book serves as both a textbook and a manual for R statistics, particularly in academic research, data analytics, and computer programming, targeted to help inform and guide the work of R users and statisticians. It describes the different types of statistical data analysis and methods, and the best scenarios for using each in R. It gives a hands-on, step-by-step practical guide on how to identify and conduct the different parametric and non-parametric procedures, including a description of the conditions or assumptions that are necessary for performing the various statistical methods or tests, and how to understand the results. The book also covers the different data formats and sources, and how to test for the reliability and validity of the available datasets. Different research experiments, case scenarios, and examples are explained in this book. It is the first book to provide a comprehensive description and step-by-step practical hands-on guide to carrying out the different types of statistical analysis in R, particularly for research purposes, with examples: from how to import and store datasets in R as objects, how to code and call the methods or functions for manipulating the datasets or objects, factorization, and vectorization, to better reasoning, interpretation, and storage of the results for future use, and graphical visualization and representation. Thus, the book brings statistics and computer programming together for research.
Merged HDR images of many multi-exposure datasets can be improved with accurate exposure estimation.
These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures in each week by subtracting off the median exposure amount in a given week and dividing by the interquartile range (IQR), as in the actual application to the true NC birth records data. The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way, while the medians and IQRs are not given. This further protects the identifiability of the spatial locations used in the analysis.

This dataset is not publicly accessible because EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual-level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed.

File format: R workspace file, "Simulated_Dataset.RData".

Metadata (including data dictionary):
- y: Vector of binary responses (1: adverse outcome, 0: control)
- x: Matrix of covariates; one row for each simulated individual
- z: Matrix of standardized pollution exposures
- n: Number of simulated individuals
- m: Number of exposure time periods (e.g., weeks of pregnancy)
- p: Number of columns in the covariate design matrix
- alpha_true: Vector of "true" critical window locations/magnitudes (i.e., the ground truth that we want to estimate)

Code abstract: We provide R statistical software code ("CWVS_LMC.txt") to fit the linear model of coregionalization (LMC) version of the Critical Window Variable Selection (CWVS) method developed in the manuscript. We also provide R code ("Results_Summary.txt") to summarize/plot the estimated critical windows and posterior marginal inclusion probabilities.

Description: "CWVS_LMC.txt" is delivered to the user as a .txt file containing R statistical software code. Once the "Simulated_Dataset.RData" workspace has been loaded into R, this code can be used to identify/estimate critical windows of susceptibility and posterior marginal inclusion probabilities. "Results_Summary.txt" is also delivered as a .txt file of R code. Once the "CWVS_LMC.txt" code has been applied to the simulated dataset and the program has completed, this code can be used to summarize and plot the identified/estimated critical windows and posterior marginal inclusion probabilities (similar to the plots shown in the manuscript).

Required R packages:
- For running "CWVS_LMC.txt": msm (sampling from the truncated normal distribution), mnormt (sampling from the multivariate normal distribution), BayesLogit (sampling from the Polya-Gamma distribution)
- For running "Results_Summary.txt": plotrix (plotting the posterior means and credible intervals)

Reproducibility: The data and code can be used to identify/estimate critical windows from one of the actual simulated datasets generated under setting E4 of the presented simulation study.
How to use the information:
- Load the "Simulated_Dataset.RData" workspace.
- Run the code contained in "CWVS_LMC.txt".
- Once the "CWVS_LMC.txt" code is complete, run "Results_Summary.txt".

Data: The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining the confidentiality of any actual pregnant women.

Availability: Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This also allows the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics and requires an appropriate data use agreement.

This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, Oxford, UK, 1-30, (2019).
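For reference, here is a minimal sketch of the described per-week median/IQR standardization, using hypothetical column names:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one exposure reading per individual per week.
df = pd.DataFrame({
    "week": [1, 1, 1, 2, 2, 2],
    "exposure": [10.0, 12.0, 20.0, 5.0, 7.0, 9.0],
})

def iqr(x):
    # Interquartile range: 75th percentile minus 25th percentile.
    return np.percentile(x, 75) - np.percentile(x, 25)

# For each week, subtract that week's median and divide by its IQR.
grouped = df.groupby("week")["exposure"]
df["z"] = (df["exposure"] - grouped.transform("median")) / grouped.transform(iqr)
print(df)
```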
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Advanced-Math Dataset
This Advanced-Math dataset is designed to support advanced studies and research in various mathematical fields. It encompasses a wide range of topics, including:
- Calculus
- Linear Algebra
- Probability
- Machine Learning
- Deep Learning
The dataset primarily focuses on computational problems, which constitute over 80% of the content. Additionally, it includes related logical concept questions to provide a… See the full description on the dataset page: https://huggingface.co/datasets/haijian06/Advanced-Math.
Our dataset provides detailed and precise insights into the business, commercial, and industrial aspects of any given area in the USA (including Point of Interest (POI) data and foot traffic). The dataset is divided into 150m x 150m areas (geohash 7) and has over 50 variables.

- Use it for different applications: Our combined dataset, which includes POI and foot traffic data, can be employed for various purposes. Different data teams use it to guide retailers and FMCG brands in site selection, fuel marketing intelligence, analyze trade areas, and assess company risk. Our dataset has also proven to be useful for real estate investment.
- Get reliable data: Our datasets have been processed, enriched, and tested so your data team can use them more quickly and accurately.
- Ideal for training ML models: The high quality of our geographic information layers results from more than seven years of work dedicated to the deep understanding and modeling of geospatial big data. Among the features that distinguish this dataset is the use of anonymized and user-compliant mobile device GPS locations, enriched with other alternative and public data.
- Easy to use: Our dataset is user-friendly and can be easily integrated into your current models. Also, we can deliver your data in different formats, like .csv, according to your analysis requirements.
- Get personalized guidance: In addition to providing reliable datasets, we advise your analysts on their correct implementation. Our data scientists can guide your internal team on the optimal algorithms and models to get the most out of the information we provide (without compromising the security of your internal data).

Answer questions like:
- What places does my target user visit in a particular area? Which are the best areas to place a new POS?
- What is the average yearly income of users in a particular area?
- What is the influx of visits that my competition receives?
- What is the volume of traffic surrounding my current POS?

This dataset is useful for getting insights from industries like:
- Retail & FMCG
- Banking, Finance, and Investment
- Car Dealerships
- Real Estate
- Convenience Stores
- Pharma and medical laboratories
- Restaurant chains and franchises
- Clothing chains and franchises

Our dataset includes more than 50 variables, such as:
- Number of pedestrians seen in the area.
- Number of vehicles seen in the area.
- Average speed of movement of the vehicles seen in the area.
- Points of Interest (POIs) (in number and type) seen in the area (supermarkets, pharmacies, recreational locations, restaurants, offices, hotels, parking lots, wholesalers, financial services, pet services, shopping malls, among others).
- Average yearly income range (anonymized and aggregated) of the devices seen in the area.

Notes to better understand this dataset:
- POI confidence means the average confidence of POIs in the area. In this case, POIs are any kind of location, such as a restaurant, a hotel, or a library.
- Category confidences, for example "food_drinks_tobacco_retail_confidence", indicate how confident we are in the existence of food/drink/tobacco retail locations in the area.
- We added predictions for The Home Depot and Lowe's Home Improvement stores in the dataset sample. These predictions were the result of a machine-learning model that was trained with the data.
Knowing where the current stores are, we can find the most similar areas for new stores to open. How efficient is a geohash? Geohash is a faster, cost-effective geofencing option that reduces input data load and provides actionable information. Its benefits include faster querying, reduced cost, minimal configuration, and ease of use. Geohashes range from 1 to 12 characters. The dataset can be split into variable-size geohashes, with the default being geohash 7 (150m x 150m).
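For illustration, here is a minimal sketch of mapping a coordinate to its geohash-7 cell, assuming the third-party pygeohash package (pip install pygeohash); the coordinate is a hypothetical example:

```python
import pygeohash as pgh

# Hypothetical point (roughly lower Manhattan, New York City).
lat, lon = 40.7128, -74.0060

# Precision 7 corresponds to a cell of roughly 150m x 150m.
cell = pgh.encode(lat, lon, precision=7)
print(cell)  # 7-character cell id, e.g. 'dr5regw'
```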
CC0 1.0 Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
Overview

The Human Vital Signs Dataset is a comprehensive collection of key physiological parameters recorded from patients. This dataset is designed to support research in medical diagnostics, patient monitoring, and predictive analytics. It includes both original attributes and derived features to provide a holistic view of patient health.

Attributes

Patient ID
- Description: A unique identifier assigned to each patient.
- Type: Integer
- Example: 1, 2, 3, ...

Heart Rate
- Description: The number of heartbeats per minute.
- Type: Integer
- Range: 60-100 bpm (for this dataset)
- Example: 72, 85, 90

Respiratory Rate
- Description: The number of breaths taken per minute.
- Type: Integer
- Range: 12-20 breaths per minute (for this dataset)
- Example: 16, 18, 15

Timestamp
- Description: The exact time at which the vital signs were recorded.
- Type: Datetime
- Format: YYYY-MM-DD HH:MM
- Example: 2023-07-19 10:15:30

Body Temperature
- Description: The body temperature measured in degrees Celsius.
- Type: Float
- Range: 36.0-37.5°C (for this dataset)
- Example: 36.7, 37.0, 36.5

Oxygen Saturation
- Description: The percentage of oxygen-bound hemoglobin in the blood.
- Type: Float
- Range: 95-100% (for this dataset)
- Example: 98.5, 97.2, 99.1

Systolic Blood Pressure
- Description: The pressure in the arteries when the heart beats (systolic pressure).
- Type: Integer
- Range: 110-140 mmHg (for this dataset)
- Example: 120, 130, 115

Diastolic Blood Pressure
- Description: The pressure in the arteries when the heart rests between beats (diastolic pressure).
- Type: Integer
- Range: 70-90 mmHg (for this dataset)
- Example: 80, 75, 85

Age
- Description: The age of the patient.
- Type: Integer
- Range: 18-90 years (for this dataset)
- Example: 25, 45, 60

Gender
- Description: The gender of the patient.
- Type: Categorical
- Categories: Male, Female
- Example: Male, Female

Weight (kg)
- Description: The weight of the patient in kilograms.
- Type: Float
- Range: 50-100 kg (for this dataset)
- Example: 70.5, 80.3, 65.2

Height (m)
- Description: The height of the patient in meters.
- Type: Float
- Range: 1.5-2.0 m (for this dataset)
- Example: 1.75, 1.68, 1.82

Derived Features

Derived_HRV (Heart Rate Variability)
- Description: A measure of the variation in time between heartbeats.
- Type: Float
- Formula: HRV = (standard deviation of heart rate over a period) / (mean heart rate over the same period)
- Example: 0.10, 0.12, 0.08

Derived_Pulse_Pressure (Pulse Pressure)
- Description: The difference between systolic and diastolic blood pressure.
- Type: Integer
- Formula: PP = systolic blood pressure - diastolic blood pressure
- Example: 40, 45, 30

Derived_BMI (Body Mass Index)
- Description: A measure of body fat based on weight and height.
- Type: Float
- Formula: BMI = weight (kg) / (height (m))^2
- Example: 22.8, 25.4, 20.3

Derived_MAP (Mean Arterial Pressure)
- Description: The average blood pressure in an individual during a single cardiac cycle.
- Type: Float
- Formula: MAP = diastolic blood pressure + (1/3) x (systolic blood pressure - diastolic blood pressure)
- Example: 93.3, 100.0, 88.7

Target Feature

Risk Category
- Description: Classification of patients into "High Risk" or "Low Risk" based on their vital signs.
- Type: Categorical
- Categories: High Risk, Low Risk
- Criteria for High Risk (any of the following):
  - Heart Rate: > 90 bpm or < 60 bpm
  - Respiratory Rate: > 20 breaths per minute or < 12 breaths per minute
  - Body Temperature: > 37.5°C or < 36.0°C
  - Oxygen Saturation: < 95%
  - Systolic Blood Pressure: > 140 mmHg or < 110 mmHg
  - Diastolic Blood Pressure: > 90 mmHg or < 70 mmHg
  - BMI: > 30 or < 18.5
- Low Risk: none of the above conditions
- Example: High Risk, Low Risk

This dataset, with a total of 200,000 samples, provides a robust foundation for various machine learning and statistical analysis tasks aimed at understanding and predicting patient health outcomes based on vital signs. The inclusion of both original attributes and derived features enhances the richness and utility of the dataset.
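As an illustration, here is a sketch of some of the derived features and part of the risk rule in pandas, using hypothetical column names (the HRV, temperature, respiratory, and oxygen criteria are omitted for brevity):

```python
import pandas as pd

# Hypothetical column names; the actual files may differ.
df = pd.DataFrame({
    "heart_rate": [72, 95],
    "systolic_bp": [120, 150],
    "diastolic_bp": [80, 95],
    "weight_kg": [70.5, 90.0],
    "height_m": [1.75, 1.68],
})

# Derived features, following the formulas above.
df["pulse_pressure"] = df["systolic_bp"] - df["diastolic_bp"]
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2
df["map"] = df["diastolic_bp"] + (df["systolic_bp"] - df["diastolic_bp"]) / 3

# Partial risk rule: flag any out-of-range heart rate, blood pressure, or BMI.
high_risk = (
    (df["heart_rate"] > 90) | (df["heart_rate"] < 60)
    | (df["systolic_bp"] > 140) | (df["systolic_bp"] < 110)
    | (df["diastolic_bp"] > 90) | (df["diastolic_bp"] < 70)
    | (df["bmi"] > 30) | (df["bmi"] < 18.5)
)
df["risk_category"] = high_risk.map({True: "High Risk", False: "Low Risk"})
print(df[["pulse_pressure", "bmi", "map", "risk_category"]])
```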
FutureBeeAI Data License Agreement (https://www.futurebeeai.com/policies/ai-data-license-agreement)
Welcome to the Tamil Chain of Thought prompt-response dataset, a meticulously curated collection containing 3000 comprehensive prompt and response pairs. This dataset is an invaluable resource for training Language Models (LMs) to generate well-reasoned answers and minimize inaccuracies. Its primary utility lies in enhancing LLMs' reasoning skills for solving arithmetic, common sense, symbolic reasoning, and complex problems.
Dataset Content:
This CoT dataset comprises a diverse set of instructions and questions paired with corresponding answers and rationales in the Tamil language. These prompts and completions cover a broad range of topics and questions, including mathematical concepts, common sense reasoning, complex problem-solving, scientific inquiries, puzzles, and more.
Each prompt is meticulously accompanied by a response and rationale, providing essential information and insights to enhance the language model training process. These prompts, completions, and rationales were manually curated by native Tamil people, drawing references from various sources, including open-source datasets, news articles, websites, and other reliable references.
Our chain-of-thought prompt-completion dataset includes various prompt types, such as instructional prompts, continuations, and in-context learning (zero-shot, few-shot) prompts. Additionally, the dataset contains prompts and completions enriched with various forms of rich text, such as lists, tables, code snippets, JSON, and more, with proper markdown format.
Prompt Diversity:
To ensure a wide-ranging dataset, we have included prompts from a plethora of topics related to mathematics, common sense reasoning, and symbolic reasoning. These topics encompass arithmetic, percentages, ratios, geometry, analogies, spatial reasoning, temporal reasoning, logic puzzles, patterns, and sequences, among others.
These prompts vary in complexity, spanning easy, medium, and hard levels. Various question types are included, such as multiple-choice, direct queries, and true/false assessments.
Response Formats:
To accommodate diverse learning experiences, our dataset incorporates different types of answers depending on the prompt, along with step-by-step rationales. The detailed rationale aids the language model in building a reasoning process for complex questions.
These responses encompass text strings, numerical values, and date and time formats, enhancing the language model's ability to generate reliable, coherent, and contextually appropriate answers.
Data Format and Annotation Details:
This fully labeled Tamil Chain of Thought Prompt Completion Dataset is available in JSON and CSV formats. It includes annotation details such as a unique ID, prompt, prompt type, prompt complexity, prompt category, domain, response, rationale, response type, and rich text presence.
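A hypothetical record illustrating these annotation fields (the field names and values below are illustrative, not taken from the actual files):

```python
# Illustrative sketch of one JSON record; the real schema may differ.
example_record = {
    "id": "ta_cot_00001",                 # unique ID (hypothetical)
    "prompt": "<Tamil question text>",
    "prompt_type": "instructional",
    "prompt_complexity": "medium",
    "prompt_category": "arithmetic",
    "domain": "mathematics",
    "response": "<Tamil answer text>",
    "rationale": "<step-by-step Tamil reasoning>",
    "response_type": "numeric",
    "rich_text_presence": False,
}
print(example_record["prompt_category"])
```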
Quality and Accuracy:
Our dataset upholds the highest standards of quality and accuracy. Each prompt undergoes meticulous validation, and the corresponding responses and rationales are thoroughly verified. We prioritize inclusivity, ensuring that the dataset incorporates prompts and completions representing diverse perspectives and writing styles, maintaining an unbiased and discrimination-free stance.
The Tamil version is grammatically accurate without any spelling or grammatical errors. No copyrighted, toxic, or harmful content is used during the construction of this dataset.
Continuous Updates and Customization:
The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Ongoing efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to gather custom chain of thought prompt completion data tailored to specific needs, providing flexibility and customization options.
License:
The dataset, created by FutureBeeAI, is now available for commercial use. Researchers, data scientists, and developers can leverage this fully labeled and ready-to-deploy Tamil Chain of Thought Prompt Completion Dataset to enhance the rationale and accurate response generation capabilities of their generative AI models and explore new approaches to NLP tasks.
The dataset comprises developer test results of Maven projects with flaky tests across a range of consecutive commits from the projects' git commit histories. The Maven projects are a subset of those investigated in an OOPSLA 2020 paper. The commit range for this dataset has been chosen as the flakiness-introducing commit (FIC) and iDFlakies-commit (see the OOPSLA paper for details). The commit hashes have been obtained from the IDoFT dataset.
The dataset will be presented at the 1st International Flaky Tests Workshop 2024 (FTW 2024). Please refer to our extended abstract for more details about the motivation for and context of this dataset.
The following table provides a summary of the data.
| Slug (Module) | FIC Hash | Tests | Commits | Av. Commits/Test | Flaky Tests | Tests w/ Consistent Failures | Total Distinct Histories |
|---|---|---|---|---|---|---|---|
| TooTallNate/Java-WebSocket | 822d40 | 146 | 75 | 75 | 24 | 1 | 2.6x10^9 |
| apereo/java-cas-client (cas-client-core) | 5e3655 | 157 | 65 | 61.7 | 3 | 2 | 1.0x10^7 |
| eclipse-ee4j/tyrus (tests/e2e/standard-config) | ce3b8c | 185 | 16 | 16 | 12 | 0 | 261 |
| feroult/yawp (yawp-testing/yawp-testing-appengine) | abae17 | 1 | 191 | 191 | 1 | 1 | 8 |
| fluent/fluent-logger-java | 5fd463 | 19 | 131 | 105.6 | 11 | 2 | 8.0x10^32 |
| fluent/fluent-logger-java | 87e957 | 19 | 160 | 122.4 | 11 | 3 | 2.1x10^31 |
| javadelight/delight-nashorn-sandbox | d0d651 | 81 | 113 | 100.6 | 2 | 5 | 4.2x10^10 |
| javadelight/delight-nashorn-sandbox | d19eee | 81 | 93 | 83.5 | 1 | 5 | 2.6x10^9 |
| sonatype-nexus-community/nexus-repository-helm | 5517c8 | 18 | 32 | 32 | 0 | 0 | 18 |
| spotify/helios (helios-services) | 23260 | 190 | 448 | 448 | 0 | 37 | 190 |
| spotify/helios (helios-testing) | 78a864 | 43 | 474 | 474 | 0 | 7 | 43 |
The columns are composed of the following variables:
Slug (Module): The project's GitHub slug (i.e., the project's URL is https://github.com/{Slug}) and, if specified, the module for which tests have been executed.
FIC Hash: The flakiness-introducing commit hash for a known flaky test as described in this OOPSLA 2020 paper. As different flaky tests have different FIC hashes, there may be multiple rows for the same slug/module with different FIC hashes.
Tests: The number of distinct test class and method combinations over the entire considered commit range.
Commits: The number of commits in the considered commit range.
Av. Commits/Test: The average number of commits per test class and method combination in the considered commit range. The number of commits may vary for each test class, as some tests may be added or removed within the considered commit range.
Flaky Tests: The number of distinct test class and method combinations that have more than one test result (passed/skipped/error/failure + exception type, if any + assertion message, if any) across 30 repeated test suite executions on at least one commit in the considered commit range.
Tests w/ Consistent Failures: The number of distinct test class and method combinations that have the same error or failure result (error/failure + exception type, if any + assertion message, if any) across all 30 repeated test suite executions on at least one commit in the considered commit range.
Total Distinct Histories: The number of distinct test results (passed/skipped/error/failure + exception type, if any + assertion message, if any) for all test class and method combinations along all commits for that test in the considered commit range.