28 datasets found
  1. OCR large data set

    • kaggle.com
    zip
    Updated Feb 15, 2023
    Cite
    James Mann (2023). OCR large data set [Dataset]. https://www.kaggle.com/datasets/jame5mann/ocr-large-data-set
    Available download formats: zip (264412 bytes)
    Dataset updated
    Feb 15, 2023
    Authors
    James Mann
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This is the large data set as featured in the OCR H240 exam series.

    Questions about this dataset will be featured in the statistics paper.

    The LDS is a .xlsx file containing five tables: four of data and one of information. The data is drawn from the UK censuses of 2001 and 2011 and is designed to support comparisons and analyses of changes in the demographic and behavioural features of the population, covering the age structure of each local authority and the method of travel within each local authority.
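
    Because the LDS ships as a single .xlsx workbook, a minimal sketch of loading all five sheets with pandas is shown below; the file name is a placeholder for whatever the downloaded workbook is called.

    ```python
    import pandas as pd

    WORKBOOK = "ocr-large-data-set.xlsx"  # placeholder: use the actual downloaded file name

    # sheet_name=None loads every sheet into a dict of DataFrames,
    # which suits a workbook holding four data tables and one information table.
    sheets = pd.read_excel(WORKBOOK, sheet_name=None)

    for name, frame in sheets.items():
        print(name, frame.shape)
    ```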

  2. Airoboros LLMs Math Dataset

    • kaggle.com
    zip
    Updated Nov 24, 2023
    Cite
    The Devastator (2023). Airoboros LLMs Math Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/airoboros-llms-math-dataset
    Available download formats: zip (36964941 bytes)
    Dataset updated
    Nov 24, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Airoboros LLMs Math Dataset

    Mastering Complex Mathematical Operations in Machine Learning

    By Huggingface Hub [source]

    About this dataset

    The Airoboros-3.1 dataset is designed to help machine learning models excel in the difficult realm of complicated mathematical operations. This data collection features thousands of conversations between machines and humans, formatted in ShareGPT for easy use with open-source fine-tuning tooling. The dataset’s focus on advanced subjects like factorials, trigonometry, and larger numerical values will help drive machine learning models to the next level, facilitating the acquisition of sophisticated mathematical skills that are essential for ML success. As AI technology advances at a rapid pace, training neural networks to keep up can be a daunting challenge, but with Airoboros-3.1’s datasets built around difficult mathematical operations, that goal is one step closer.


    How to use the dataset

    To get started, download the dataset from Kaggle and use the train.csv file. This file contains over two thousand examples of conversations between ML models and humans, formatted in ShareGPT so they work with fast and efficient open-source fine-tuning tools. The file includes two columns, category and conversations, both stored as strings.

    Once you have downloaded the train file, you can set up your own ML training environment with any of your preferred frameworks or methods. A natural task is to predict which kind of mathematical operation a conversation involves (the category column), using the dialogues in this dataset for reference. You can also create your own test sets from this data, adding new conversation topics either by modifying existing rows or creating entirely new ones with topics related to mathematics. Finally, compare your model’s results against other established models or algorithms that are already published online.
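
    A minimal loading sketch for the setup described above, assuming the category and conversations column names from this card; whether the conversations string parses as JSON is an assumption to verify against the actual file.

    ```python
    import json

    import pandas as pd

    df = pd.read_csv("train.csv")  # the file downloaded from this Kaggle dataset

    # Both columns are stored as strings; start by inspecting the operation categories.
    print(df["category"].value_counts().head())

    # The conversations column holds a serialized ShareGPT-style dialogue.
    raw = df.loc[0, "conversations"]
    try:
        dialogue = json.loads(raw)  # assumption: the export is JSON-encoded
        print(type(dialogue), len(dialogue))
    except (ValueError, TypeError):
        print(raw[:200])  # fall back to showing the raw string
    ```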

    Happy training!

    Research Ideas

    • It can be used to build custom neural networks or machine learning algorithms that are specifically designed for complex mathematical operations.
    • This data set can be used to teach and debug more general-purpose machine learning models to recognize large numbers and intricate calculations within natural language processing (NLP).
    • The Airoboros-3.1 dataset can also be used as a supervised learning task: models could learn from the conversations provided in the dataset how to respond correctly when presented with complex mathematical operations.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors (data source: Huggingface Hub).

    License

    License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication (No Copyright). You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: train.csv

    | Column name   | Description                                                                   |
    |:--------------|:------------------------------------------------------------------------------|
    | category      | The type of mathematical operation being discussed. (String)                   |
    | conversations | The conversations between the machine learning model and the human. (String)   |

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and Huggingface Hub.

  3. Math-RLVR

    • huggingface.co
    Updated Mar 31, 2025
    Cite
    Yi Su (2025). Math-RLVR [Dataset]. https://huggingface.co/datasets/virtuoussy/Math-RLVR
    Dataset updated
    Mar 31, 2025
    Authors
    Yi Su
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Math data for the paper "Expanding RL with Verifiable Rewards Across Diverse Domains". We use a large-scale dataset of 773k Chinese Question Answering (QA) pairs, collected under authorized licenses from educational websites. This dataset covers three educational levels: elementary, middle, and high school. Unlike well-structured yet small-scale benchmarks such as MATH (Hendrycks et al., 2021b) and GSM8K (Cobbe et al., 2021b), our reference answers are inherently free-form, often interwoven with… See the full description on the dataset page: https://huggingface.co/datasets/virtuoussy/Math-RLVR.

  4. Data from: MLFMF: Data Sets for Machine Learning for Mathematical Formalization

    • data.niaid.nih.gov
    Updated Oct 26, 2023
    Cite
    Bauer, Andrej; Petković, Matej; Todorovski, Ljupčo (2023). MLFMF: Data Sets for Machine Learning for Mathematical Formalization [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10041074
    Dataset updated
    Oct 26, 2023
    Dataset provided by
    Institute of Mathematics, Physics, and Mechanics
    University of Ljubljana
    Authors
    Bauer, Andrej; Petković, Matej; Todorovski, Ljupčo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    MLFMF

    MLFMF (Machine Learning for Mathematical Formalization) is a collection of data sets for benchmarking recommendation systems used to support the formalization of mathematics with proof assistants. These systems help humans identify which previous entries (theorems, constructions, datatypes, and postulates) are relevant in proving a new theorem or carrying out a new construction. The MLFMF data sets provide solid benchmarking support for further investigation of the numerous machine learning approaches to formalized mathematics. With more than 250,000 entries in total, this is currently the largest collection of formalized mathematical knowledge in machine-learnable format. In addition to benchmarking the recommendation systems, the data sets can also be used for benchmarking node classification and link prediction algorithms.

    The four data sets

    Each data set is derived from a library of formalized mathematics written in the proof assistants Agda or Lean. The collection includes:

    • the largest Lean 4 library, Mathlib, and
    • the three largest Agda libraries: the standard library, the library of univalent mathematics Agda-unimath, and the TypeTopology library.

    Each data set represents the corresponding library in two ways: as a heterogeneous network, and as a list of syntax trees of all the entries in the library. The network contains the (modular) structure of the library and the references between entries, while the syntax trees give complete and easily parsed information about each entry. The Lean library data set was obtained by converting .olean files into s-expressions (see the lean2sexp tool). The Agda data sets were obtained with an s-expression extension of the official Agda repository (use either the master-sexp or release-2.6.3-sexp branch). For more details, see our arXiv copy of the paper.

    Directory structure

    First, the mlfmf.zip archive needs to be unzipped. It contains a separate directory for every library (for example, the standard library of Agda can be found in the stdlib directory) and some auxiliary files. Every library directory contains:

    • the network file from which the heterogeneous network can be loaded, and
    • a zip of the entries directory that contains (many) files with abstract syntax trees; each of those files describes a single entry of the library.

    In addition to the auxiliary files which are used for loading the data (and described below), the zipped sources of lean2sexp and the Agda s-expression extension are present.

    Loading the data

    In addition to the data files, there is also a simple Python script main.py for loading the data. To run it, you will have to install the packages listed in the file requirements.txt: tqdm and networkx. The easiest way to do so is calling pip install -r requirements.txt. When running main.py for the first time, the script will unzip the entry files into the directory named entries. After that, the script loads the syntax trees of the entries (see the Entry class) and the network (as a networkx.MultiDiGraph object); a small illustrative sketch follows at the end of this entry. Note: the entry files have the extension .dag (directed acyclic graph), since Lean uses node sharing, which breaks the tree structure (a shared node has more than one parent node).

    More information

    For more information about the data collection process, detailed data (and data format) description, and baseline experiments that were already performed with these data, see our arXiv copy of the paper. For the code that was used to perform the experiments and the data format description, visit our GitHub repository: https://github.com/ul-fmf/mlfmf-data.

    Funding

    Since not all the funders are available in Zenodo's database, we list them here:

    This material is based upon work supported by the Air Force Office of Scientific Research under award number FA9550-21-1-0024. The authors also acknowledge the financial support of the Slovenian Research Agency via the research core funding No. P2-0103 and No. P1-0294.
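
    A small illustrative sketch of the kind of inspection that becomes possible once main.py has produced the networkx.MultiDiGraph described under "Loading the data" above. The graph constructed here is a stand-in for demonstration only, not the MLFMF loader itself.

    ```python
    import networkx as nx

    # Stand-in graph: in practice this object comes from the dataset's main.py loader.
    graph = nx.MultiDiGraph()
    graph.add_edge("theorem:foo", "lemma:bar", label="REFERENCE")
    graph.add_edge("theorem:foo", "def:baz", label="REFERENCE")

    print(graph.number_of_nodes(), "nodes,", graph.number_of_edges(), "edges")

    # Entries that are referenced most often are natural candidates for a recommender baseline.
    most_referenced = sorted(graph.in_degree, key=lambda pair: pair[1], reverse=True)
    print(most_referenced[:5])
    ```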

  5. MLFMF: Data Sets for Machine Learning for Mathematical Formalization

    • data-staging.niaid.nih.gov
    Updated Oct 26, 2023
    Cite
    The citation is currently not available for this dataset.
    Dataset updated
    Oct 26, 2023
    Dataset provided by
    Institute of Mathematics, Physics, and Mechanics
    University of Ljubljana
    Authors
    Bauer, Andrej; Petković, Matej; Todorovski, Ljupčo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    MLFMF

    MLFMF (Machine Learning for Mathematical Formalization) is a collection of data sets for benchmarking recommendation systems used to support the formalization of mathematics with proof assistants. These systems help humans identify which previous entries (theorems, constructions, datatypes, and postulates) are relevant in proving a new theorem or carrying out a new construction. The MLFMF data sets provide solid benchmarking support for further investigation of the numerous machine learning approaches to formalized mathematics. With more than 250,000 entries in total, this is currently the largest collection of formalized mathematical knowledge in machine-learnable format. In addition to benchmarking the recommendation systems, the data sets can also be used for benchmarking node classification and link prediction algorithms.

    The four data sets

    Each data set is derived from a library of formalized mathematics written in the proof assistants Agda or Lean. The collection includes:

    • the largest Lean 4 library, Mathlib, and
    • the three largest Agda libraries: the standard library, the library of univalent mathematics Agda-unimath, and the TypeTopology library.

    Each data set represents the corresponding library in two ways: as a heterogeneous network, and as a list of syntax trees of all the entries in the library. The network contains the (modular) structure of the library and the references between entries, while the syntax trees give complete and easily parsed information about each entry. The Lean library data set was obtained by converting .olean files into s-expressions (see the lean2sexp tool). The Agda data sets were obtained with an s-expression extension of the official Agda repository (use either the master-sexp or release-2.6.3-sexp branch). For more details, see our arXiv copy of the paper.

    Directory structure

    First, the mlfmf.zip archive needs to be unzipped. It contains a separate directory for every library (for example, the standard library of Agda can be found in the stdlib directory) and some auxiliary files. Every library directory contains:

    • the network file from which the heterogeneous network can be loaded, and
    • a zip of the entries directory that contains (many) files with abstract syntax trees; each of those files describes a single entry of the library.

    In addition to the auxiliary files which are used for loading the data (and described below), the zipped sources of lean2sexp and the Agda s-expression extension are present.

    Loading the data

    In addition to the data files, there is also a simple Python script main.py for loading the data. To run it, you will have to install the packages listed in the file requirements.txt: tqdm and networkx. The easiest way to do so is calling pip install -r requirements.txt. When running main.py for the first time, the script will unzip the entry files into the directory named entries. After that, the script loads the syntax trees of the entries (see the Entry class) and the network (as a networkx.MultiDiGraph object). Note: the entry files have the extension .dag (directed acyclic graph), since Lean uses node sharing, which breaks the tree structure (a shared node has more than one parent node).

    More information

    For more information about the data collection process, detailed data (and data format) description, and baseline experiments that were already performed with these data, see our arXiv copy of the paper. For the code that was used to perform the experiments and the data format description, visit our GitHub repository: https://github.com/ul-fmf/mlfmf-data.

    Funding

    Since not all the funders are available in Zenodo's database, we list them here:

    This material is based upon work supported by the Air Force Office of Scientific Research under award number FA9550-21-1-0024. The authors also acknowledge the financial support of the Slovenian Research Agency via the research core funding No. P2-0103 and No. P1-0294.

  6. Supplementary Information files for A gifted SNARC? Directional spatial-numerical associations in gifted children with high-level math skills do not differ from controls

    • repository.lboro.ac.uk
    docx
    Updated May 31, 2023
    Cite
    Yunfeng He; Hans-Christoph Nuerk; Alexander Derksen; Jiannong Shi; Xinlin Zhou; Krzysztof Cipora (2023). Supplementary Information files for A gifted SNARC? Directional spatial-numerical associations in gifted children with high-level math skills do not differ from controls [Dataset]. http://doi.org/10.17028/rd.lboro.12820673.v1
    Available download formats: docx
    Dataset updated
    May 31, 2023
    Dataset provided by
    Loughborough University
    Authors
    Yunfeng He; Hans-Christoph Nuerk; Alexander Derksen; Jiannong Shi; Xinlin Zhou; Krzysztof Cipora
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Supplementary Information files for "A gifted SNARC? Directional spatial-numerical associations in gifted children with high-level math skills do not differ from controls".

    The SNARC (Spatial-Numerical Association of Response Codes) effect (i.e., a tendency to associate small/large magnitude numbers with the left/right hand side) is prevalent across the whole lifespan. Because the ability to relate numbers to space has been viewed as a cornerstone in the development of mathematical skills, the relationship between the SNARC effect and math skills has been frequently examined. The results remain largely inconsistent. Studies testing groups of people with very low or very high skill levels in math sometimes found relationships between SNARC and math skills. So far, however, studies testing such extreme math skills level groups were mostly investigating the SNARC effect in individuals revealing math difficulties. Groups with above average math skills remain understudied, especially in regard to children. Here, we investigate the SNARC effect in gifted children, as compared to normally developing children (overall n=165). Frequentist and Bayesian analysis suggested that the groups did not differ from each other in the SNARC effect. These results are the first to provide evidence for the SNARC effect in a relatively large sample of gifted (and mathematically highly skilled) children. In sum, our study provides another piece of evidence for no direct link between the SNARC effect and mathematical ability in childhood.

  7. Big-Math-RL-Verified-Processed

    • huggingface.co
    Updated May 13, 2025
    Cite
    Open R1 (2025). Big-Math-RL-Verified-Processed [Dataset]. https://huggingface.co/datasets/open-r1/Big-Math-RL-Verified-Processed
    Dataset updated
    May 13, 2025
    Dataset authored and provided by
    Open R1
    Description

    Dataset Card for Big-Math-RL-Verified-Processed

    This is a processed version of SynthLabsAI/Big-Math-RL-Verified where we have applied the following filters:

    • Removed samples where llama8b_solve_rate is None
    • Removed samples that could not be parsed by math-verify (empty lists)

    We have also created 5 additional subsets to indicate difficulty level, similar to the MATH dataset. To do so, we computed quintiles on the llama8b_solve_rate values and then filtered the dataset into the… See the full description on the dataset page: https://huggingface.co/datasets/open-r1/Big-Math-RL-Verified-Processed.
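
    A minimal sketch of the quintile-based difficulty bucketing described above, assuming a pandas DataFrame with an llama8b_solve_rate column; the exact thresholds and subset labels used for the published splits may differ.

    ```python
    import pandas as pd

    # Toy frame standing in for the processed dataset.
    df = pd.DataFrame({"llama8b_solve_rate": [0.05, 0.1, 0.2, 0.35, 0.4, 0.5, 0.65, 0.7, 0.8, 0.95]})

    # Drop rows without a solve rate, then bucket the rest into five difficulty levels.
    df = df.dropna(subset=["llama8b_solve_rate"])
    df["difficulty"] = pd.qcut(
        df["llama8b_solve_rate"],
        q=5,
        labels=["level_5", "level_4", "level_3", "level_2", "level_1"],  # hardest = lowest solve rate
    )
    print(df["difficulty"].value_counts())
    ```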

  8. Education Attainment: Key Stage 4 - Dataset - data.gov.uk

    • ckan.publishing.service.gov.uk
    Updated Apr 11, 2018
    + more versions
    Cite
    (2018). Education Attainment: Key Stage 4 - Dataset - data.gov.uk [Dataset]. https://ckan.publishing.service.gov.uk/dataset/education-attainment-key-stage-4
    Dataset updated
    Apr 11, 2018
    License

    Open Government Licence 3.0 (http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/)
    License information was derived automatically

    Description

    This data shows Education Attainment at Key Stage 4. Numbers and percentages of pupils attaining at Key Stage 4 are shown by gender.

    Points to be aware of:
    • In 2016-2017, children were assessed under new school accountability standards with a new grading system of grades 9 to 1 instead of A* to G. This means data for the academic year ending in 2017 is not comparable with previous years' data. Analysis and comparisons between groups of pupils, types of schools and pupil characteristics are more likely to provide meaningful information than comparisons over time.
    • Two new headline standards are shown in this dataset: English and maths strong passes at grades 9 to 5, and the English Baccalaureate with strong passes at grades 9 to 5 in English and maths. In addition, we have also provided both statistics based on standard passes at grades 9 to 4, as these statistics should be comparable with historical A*-C measures.

    More information: see the Secondary Curriculum, key stage 3 and key stage 4 (GCSEs) website (a link to this is included as a Resource accompanying these datasets).

    Data is included for Wards, Lower Super Output Areas (LSOA), Districts, and Lincolnshire. The data has been aggregated based on pupil postcode and only includes those pupils living and educated within Lincolnshire. If you want Lincolnshire and District aggregations based on those pupils that are educated within Lincolnshire, irrespective of where they live, then please see the Department for Education Statistics website and School Performance Tables (links to these are included as Resources accompanying these datasets). Data is suppressed where appropriate for counts of 5 persons and below (this may be shown by missing data). That and any unmatched postcodes may mean numbers for small areas might not add up exactly to figures shown for larger areas. This data is updated annually.

    Data source: Lincolnshire County Council, Performance Services – Schools Performance. For any enquiries about this publication contact schoolperformancedata@lincolnshire.gov.uk

    Please note: National data for Key Stage 4 results are published via https://explore-education-statistics.service.gov.uk/find-statistics/key-stage-4-performance – GOV.UK (explore-education-statistics.service.gov.uk). There have been methodological changes since 2019 to cater for the issues seen during the pandemic. The DfE offer the following commentary via the link above:

    “Last academic year saw the return of the summer exam series, after they had been cancelled in 2020 and 2021 due to the impact of the COVID-19 pandemic, where alternative processes were set up to award grades (centre assessment grades, known as CAGs, and teacher assessed grades, known as TAGs). As part of the transition back to the summer exam series, adaptations were made to the exams (including advance information) and the approach to grading for 2022 exams broadly reflected a midpoint between results in 2019 and 2021. More information on these changes can be seen in the Guide to GCSE results for England, summer 2022. Given the unprecedented change in the way GCSE results were awarded in the summers of 2020 and 2021, as well as the changes to grade boundaries and methods of assessment for 2021/22, users need to exercise caution when considering comparisons over time, as they may not reflect changes in pupil performance alone.”

  9. Sigma Dolphin Filtered and Cleaned

    • kaggle.com
    zip
    Updated Jun 25, 2024
    Cite
    Ryan Mutiga (2024). Sigma Dolphin Filtered and Cleaned [Dataset]. https://www.kaggle.com/datasets/ryanmutiga/sigma-dolphin-filtered-and-cleaned
    Available download formats: zip (60569 bytes)
    Dataset updated
    Jun 25, 2024
    Authors
    Ryan Mutiga
    Description

    Dataset Description for Filtered Sigma Dolphin Dataset

    Overview

    This dataset is a cleaned and filtered version of the Sigma Dolphin dataset (https://www.kaggle.com/datasets/saurabhshahane/sigmadolphin), designed to aid in solving maths word problems using AI techniques. It was created as part of an effort to take part in the AI Mathematical Olympiad - Progress Prize 1 (https://www.kaggle.com/competitions/ai-mathematical-olympiad-prize/overview). The dataset was processed using TF-IDF vectorisation and K-means clustering, specifically targeting questions relevant to the AIME (American Invitational Mathematics Examination) and AMC 12 (American Mathematics Competitions).

    Context

    The Sigma Dolphin dataset is a project initiated by Microsoft Research Asia, aimed at building an intelligent system with natural language understanding and reasoning capacities to automatically solve maths word problems written in natural language. This project began in early 2013, and the dataset includes maths word problems from various sources, including community question-answering sites like Yahoo! Answers.

    Source and Original Dataset Details

    Content

    The filtered dataset includes problems that are relevant for preparing for maths competitions such as AIME and AMC. The data is structured to facilitate the training and evaluation of AI models aimed at solving these types of problems.

    Datasets:

    There are several filtered versions of the dataset based on different similarity thresholds (0.3 and 0.5). These thresholds were used to determine the relevance of problems from the original Sigma Dolphin dataset to the AIME and AMC problems.

    1. Number Word Problems Filtered at 0.3 Threshold:

      • File: number_word_test_filtered_0.3_Threshold.csv
      • Description: Contains problems filtered with a similarity threshold of 0.3, ensuring moderate relevance to AIME and AMC 12 problems.
    2. Number Word Problems Filtered at 0.5 Threshold:

      • File: number_word_std.test_filtered_0.5_Threshold.csv
      • Description: Contains problems filtered with a higher similarity threshold of 0.5, ensuring higher relevance to AIME and AMC 12 problems.
    3. Filtered Number Word Problems 2 at 0.3 Threshold:

      • File: filtered_number_word_problems2_Threshold.csv
      • Description: Another set of problems filtered at a 0.3 similarity threshold.
    4. Filtered Number Word Problems 2 at 0.5 Threshold:

      • File: filtered_number_word_problems_Threshold.csv
      • Description: Another set of problems filtered at a 0.5 similarity threshold.

    Why Different Similarity Thresholds?

    Different similarity thresholds (0.3 and 0.5) are used to provide flexibility in selecting problems based on their relevance to AIME and AMC problems. A lower threshold (0.3) includes a broader range of problems, ensuring a diverse set of questions, while a higher threshold (0.5) focuses on problems with stronger relevance, offering a more targeted and precise dataset. This allows users to choose the level of specificity that best fits their needs.

    For a detailed explanation of the preprocessing and filtering process, please refer to the Sigma Dolphin Filtered & Cleaned Notebook.
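
    As an illustration of the TF-IDF similarity filtering described above, here is a sketch only; the actual notebook also uses K-means clustering, and its exact pipeline may differ.

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def filter_by_similarity(problems, reference_problems, threshold=0.3):
        """Keep problems whose best TF-IDF cosine similarity to any
        AIME/AMC reference problem meets the chosen threshold."""
        vectorizer = TfidfVectorizer(stop_words="english")
        tfidf = vectorizer.fit_transform(list(problems) + list(reference_problems))
        problem_vecs = tfidf[: len(problems)]
        reference_vecs = tfidf[len(problems):]
        best_match = cosine_similarity(problem_vecs, reference_vecs).max(axis=1)
        return [p for p, score in zip(problems, best_match) if score >= threshold]

    # A 0.3 threshold keeps a broader set of problems than a stricter 0.5 threshold.
    ```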

    Acknowledgements

    We extend our gratitude to all the original authors of the Sigma Dolphin dataset and the creators of the AIME and AMC problems. This project leverages the work of numerous researchers and datasets to build a comprehensive resource for AI-based problem solving in mathematics.

    Usage

    This dataset is intended for research and educational purposes. It can be used to train AI models for natural language processing and problem-solving tasks, specifically targeting maths word problems in competitive environments like AIME and AMC.

    Licensing

    This dataset is shared under the Computational Use of Data Agreement v1.0.


  10. We-Math2.0-Pro

    • huggingface.co
    Updated Aug 15, 2025
    + more versions
    Cite
    We-Math (2025). We-Math2.0-Pro [Dataset]. https://huggingface.co/datasets/We-Math/We-Math2.0-Pro
    Dataset updated
    Aug 15, 2025
    Authors
    We-Math
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Dataset Card for We-Math 2.0

    GitHub | Paper | Website

    We-Math 2.0 is a unified system designed to comprehensively enhance the mathematical reasoning capabilities of Multimodal Large Language Models (MLLMs). It integrates a structured mathematical knowledge system, model-centric data space modeling, and a reinforcement learning (RL)-based training paradigm to achieve both broad conceptual coverage and robust reasoning performance across varying difficulty levels. The key… See the full description on the dataset page: https://huggingface.co/datasets/We-Math/We-Math2.0-Pro.

  11. Equations and Inequalities Making Mathematics Accessible to All

    • catalog.data.gov
    • s.cnmilf.com
    Updated Mar 30, 2021
    + more versions
    Cite
    U.S. Department of State (2021). Equations and Inequalities Making Mathematics Accessible to All [Dataset]. https://catalog.data.gov/dataset/equations-and-inequalities-making-mathematics-accessible-to-all
    Dataset updated
    Mar 30, 2021
    Dataset provided by
    United States Department of State (http://state.gov/)
    Description

    This report, based on results from PISA 2012, shows that one way forward is to ensure that all students spend more “engaged” time learning core mathematics concepts and solving challenging mathematics tasks. The opportunity to learn mathematics content – the time students spend learning mathematics topics and practising maths tasks at school – can accurately predict mathematics literacy. Differences in students’ familiarity with mathematics concepts explain a substantial share of performance disparities in PISA between socio-economically advantaged and disadvantaged students. Widening access to mathematics content can raise average levels of achievement and, at the same time, reduce inequalities in education and in society at large.

  12. 2007 - 2008 School Progress Reports - All Schools

    • catalog.data.gov
    • data.cityofnewyork.us
    • +2more
    Updated Nov 29, 2024
    + more versions
    Cite
    data.cityofnewyork.us (2024). 2007 - 2008 School Progress Reports - All Schools [Dataset]. https://catalog.data.gov/dataset/2007-2008-school-progress-reports-all-schools
    Dataset updated
    Nov 29, 2024
    Dataset provided by
    data.cityofnewyork.us
    Description

    2007/08 Progress Report results for all schools (data as of 1/13/09).

    Peer indices are calculated differently depending on School Level. Schools are only compared to other schools in the same School Level (e.g., Elementary, K-8, Middle, High).

    1) Elementary & K-8 - peer index is a value from 0-100. We use a composite demographic statistic based on % ELL, % SpEd, % Title I free lunch, and % Black/Hispanic. Higher values indicate student populations with higher need.
    2) Middle & High - peer index is a value from 1.00-4.50. For middle schools, we use the average 4th grade proficiency ratings in ELA and Math for all their students that have 4th grade test scores. For high schools, we use the average 8th grade proficiency ratings in ELA and Math for all their students that have 8th grade test scores, % SpEd, and % Overage. Lower values indicate student populations with higher need.
    3) Schools for Transfer Students - peer index is a value from 1.00-4.50. We use the average 8th grade proficiency ratings in ELA and Math for all their students that have 8th grade test scores and the % Overage/Under credited. Lower values indicate student populations with higher need.

    Unlike Elementary, Middle, and High School Progress Reports, the Environment Category is only composed of Survey Results.

  13. Data from: S1 Dataset -

    • plos.figshare.com
    xlsx
    Updated Dec 14, 2023
    Cite
    Danni Li; Jeffrey Liew; Dwayne Raymond; Tracy Hammond (2023). S1 Dataset - [Dataset]. http://doi.org/10.1371/journal.pone.0292844.s001
    Available download formats: xlsx
    Dataset updated
    Dec 14, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Danni Li; Jeffrey Liew; Dwayne Raymond; Tracy Hammond
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Students’ math motivation can predict engagement, achievement, and career interest in science, technology, engineering, and mathematics (STEM). However, it is not well understood how personality traits and math anxiety may be linked to different types or qualities of math motivation, particularly during high-stress times such as the COVID-19 pandemic. In this study, we examined how fearful or avoidant temperaments contribute to math anxiety and math motivations for college students during the COVID-19 pandemic. Ninety-six undergraduate students from a large public university were assessed on temperamental fear, math anxiety, and math motivation in an online math course. Results showed that higher levels of temperamental fear are directly linked to higher levels of math anxiety. In addition, temperamental fear is indirectly linked to higher levels of autonomous motivation (i.e., intrinsic motivation and identified regulation) and lower levels of controlled motivation (i.e., external regulation) through math anxiety. Results have implications for helping students at high risk for both high math anxiety and for low motivation to engage in math learning.

  14. Aida Calculus Math Handwriting Recognition Dataset

    • kaggle.com
    zip
    Updated Aug 20, 2020
    Cite
    Aida by Pearson (2020). Aida Calculus Math Handwriting Recognition Dataset [Dataset]. https://www.kaggle.com/aidapearson/ocr-data
    Available download formats: zip (10833406726 bytes)
    Dataset updated
    Aug 20, 2020
    Authors
    Aida by Pearson
    License

    https://cdla.io/sharing-1-0/

    Description

    Context

    The Aida Calculus Math Handwriting Recognition Dataset consists of 100,000 images in 10 batches. Each image contains a photo of a handwritten calculus math expression (specifically within the topic of limits) written with a dark utensil on plain paper. Each image is accompanied by ground truth math expression in LaTeX as well as bounding boxes and pixel-level masks per character. All images are synthetically generated.


    Motivation

    The complexity of handwriting recognition for math expressions can be decomposed into the following sources of variability:

    Image of Math = Math Expression x Math Characters x Location of Math Characters x Visual Qualities of the Math Characters (fonts, color) x Noise of Image (backgrounds, stray marks)

    It is the job of the recognition model to take the Image of Math as input and predict the Math Expression.
    Typical approaches to handwriting recognition tasks involve collecting and tagging large amounts of data, on which many iterations of models are trained. The "one dataset, many models" paradigm has specific drawbacks within the context of product development. As product requirements evolve, such as the addition of a new mathematical character into the prediction space, a new data collection and tagging effort must be undertaken. The cycle of adapting the handwriting recognition capability to new requirements is long and does not support agile product development.

    Here, we take a different approach by iteratively building a complex, synthetically generated dataset towards specific requirements. The generation process delivers exact control over the distribution of math expressions, characters, location of characters, specific visual qualities of the math, image noise, and image augmentations to the developer. The developer controls every aspect of the data, down to each pixel. In many ways, the data synthesis runs backwards to the handwriting recognition model, creating visual complexity that the model must then untangle to uncover the ground truth math expression. Thus, we can arrive at a "many datasets, one model" paradigm in which, as product requirements change, the data can quickly iterate and adapt on agile cycles.

    In addition to affording more control over the product development process, synthetic data allows for 100% correct pixel by pixel tagging that opens the door for new modeling possibilities. Every image is tagged with the ground truth LaTeX for the expressions, bounding boxes per math character, and exact pixel masks for each character.

    Our goal in releasing this dataset is to provide the data science and machine learning community with resources for undertaking the challenging computer vision task of extracting math expressions from images. The data offers something to all levels, from beginners building simple character recognition models to experts who wish to predict pixel-by-pixel masks and decode the complex structure of math expressions.

    Content

    The images contain math expressions of limits, a topic typically encountered by students learning Calculus I in the United States. Features of the writing such as font, writing utensils (type, color, pressure, consistency), angle and distance of photo, and size of writing are all simulated. Backgrounds features include shadows, various plain paper types, bleed throughs, other distortions, and noise typical of student taking photos of their math.

    The strategy in defining the populations from which images are synthesized is to be a superset of what we expect students to submit. Therefore, the math expressions are not in themselves pedagogical, but aim to encompass the potential variety of student submissions, both mathematically correct and incorrect. The image features and augmentations are similarly designed to cover the range of possible student handwriting qualities.


    Data consis...

  15. Maths-Grade-School

    • huggingface.co
    + more versions
    Cite
    Feynman Innovations, Maths-Grade-School [Dataset]. http://doi.org/10.57967/hf/3167
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Authors
    Feynman Innovations
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Maths-Grade-School

    I am releasing a large Grade School level Mathematics dataset. This extensive dataset, comprising nearly one million instructions in JSON format, encapsulates a diverse array of topics fundamental to building a strong mathematical foundation. This dataset is in instruction format so that model developers, researchers, etc. can easily use it. The following fields and sub-fields are covered: Calculus, Probability, Algebra, Linear Algebra, Trigonometry, Differential Equations… See the full description on the dataset page: https://huggingface.co/datasets/ajibawa-2023/Maths-Grade-School.
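
    A minimal sketch of reading an instruction-format JSON file such as the one described above; the file name and the instruction/output field names are placeholders, since the card does not spell out the exact schema.

    ```python
    import json

    with open("maths_grade_school.json", encoding="utf-8") as handle:
        records = json.load(handle)  # assumption: a single JSON array of instruction records

    sample = records[0]
    print(sample.get("instruction", "<no instruction field>"))
    print(sample.get("output", "<no output field>"))
    ```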

  16. GSM8K - Grade School Math 8K dataset for LLM

    • kaggle.com
    zip
    Updated May 21, 2024
    Cite
    Johnson chong (2024). GSM8K - Grade School Math 8K dataset for LLM [Dataset]. https://www.kaggle.com/datasets/johnsonhk88/gsm8k-grade-school-math-8k-dataset-for-llm
    Available download formats: zip (5156809 bytes)
    Dataset updated
    May 21, 2024
    Authors
    Johnson chong
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset Summary

    GSM8K (Grade School Math 8K) is a dataset of 8.5K high quality linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning.

    These problems take between 2 and 8 steps to solve. Solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − ×÷) to reach the final answer. A bright middle school student should be able to solve every problem: from the paper, "Problems require no concepts beyond the level of early Algebra, and the vast majority of problems can be solved without explicitly defining a variable." Solutions are provided in natural language, as opposed to pure math expressions. From the paper: "We believe this is the most generally useful data format, and we expect it to shed light on the properties of large language models’ internal monologues"
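
    In the original GSM8K release, the final answer follows a "####" marker at the end of each natural-language solution; the sketch below extracts it, assuming this copy of the dataset preserves that convention.

    ```python
    def extract_final_answer(solution: str) -> str:
        """Return the text after the final '####' marker, GSM8K's answer delimiter."""
        if "####" in solution:
            return solution.split("####")[-1].strip()
        return solution.strip()

    example = "She sold 48 clips in April and half as many in May. 48 / 2 = 24. 48 + 24 = 72. #### 72"
    print(extract_final_answer(example))  # -> "72"
    ```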

  17. cmath

    • huggingface.co
    • opendatalab.com
    Updated Aug 22, 2023
    Cite
    Wei Tianwen (2023). cmath [Dataset]. https://huggingface.co/datasets/weitianwen/cmath
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Aug 22, 2023
    Authors
    Wei Tianwen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    CMATH

      Introduction
    

    We present the Chinese Elementary School Math Word Problems (CMATH) dataset, comprising 1.7k elementary school-level math word problems with detailed annotations, sourced from actual Chinese workbooks and exams. This dataset aims to provide a benchmark tool for assessing the following question: to what grade level of elementary school math do the abilities of popular large language models (LLMs) correspond? We evaluate a variety of popular LLMs… See the full description on the dataset page: https://huggingface.co/datasets/weitianwen/cmath.

  18. Students Performance in Exams

    • kaggle.com
    Updated Oct 2, 2024
    Cite
    Timothy Adeyemi (2024). Students Performance in Exams [Dataset]. https://www.kaggle.com/datasets/timothyadeyemi/students-performance-in-exams
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Oct 2, 2024
    Dataset provided by
    Kaggle
    Authors
    Timothy Adeyemi
    Description

    This dataset provides a detailed snapshot of high school students' performance in exams, focusing on their scores in mathematics, reading, and writing. It includes essential demographic, social, and academic variables that are known to influence academic outcomes. The dataset consists of 1,000 observations, where each row represents a unique student, and includes various attributes such as gender, race/ethnicity, parental education levels, test preparation status, lunch type, and scores in three key academic subjects. This dataset can be leveraged to analyze trends, correlations, and disparities in academic performance based on socioeconomic and educational factors.

    Attributes:
    • Gender: This column categorizes students by their gender (Male, Female). Allows for the exploration of gender-based performance trends in math, reading, and writing scores.
    • Race/Ethnicity: Coded into five groups (Group A to Group E), this feature represents the racial or ethnic background of the student. Enables analysis of how ethnic backgrounds influence exam performance.
    • Parental Level of Education: Describes the highest educational attainment of the student’s parents (e.g., High School, Some College, Associate’s Degree, Bachelor’s Degree, Master’s Degree). This variable is useful in understanding the impact of parental education on students' academic achievements.
    • Lunch Type: Indicates whether the student receives a standard lunch or a free/reduced-price lunch. This feature can be used to study the relationship between socioeconomic status and academic performance.
    • Test Preparation Course: Describes whether the student completed a test preparation course (Completed or None). Examines the influence of structured test preparation on academic outcomes.
    • Math Score: This column records the student’s performance in mathematics (on a scale of 0-100). A key outcome variable for assessing performance in a core subject.
    • Reading Score: Similar to the math score, this feature captures the student’s performance in reading (on a scale of 0-100). Provides insight into students' literacy and comprehension abilities.
    • Writing Score: Represents the student’s performance in writing (on a scale of 0-100). Allows for analysis of written communication skills and overall language proficiency.
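
    A short sketch of the kind of analysis this table supports, assuming pandas-friendly column names such as "test preparation course" and "math score"; the exact headers in the CSV may differ.

    ```python
    import pandas as pd

    df = pd.read_csv("StudentsPerformance.csv")  # file name is an assumption

    # Compare mean scores for students who did and did not complete test preparation.
    summary = (
        df.groupby("test preparation course")[["math score", "reading score", "writing score"]]
        .mean()
        .round(1)
    )
    print(summary)
    ```
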
  19. Student Performance Data Set

    • kaggle.com
    zip
    Updated Mar 27, 2020
    + more versions
    Cite
    Data-Science Sean (2020). Student Performance Data Set [Dataset]. https://www.kaggle.com/datasets/larsen0966/student-performance-data-set
    Available download formats: zip (12353 bytes)
    Dataset updated
    Mar 27, 2020
    Authors
    Data-Science Sean
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    If this data set is useful, an upvote is appreciated. This dataset concerns student achievement in secondary education at two Portuguese schools. The data attributes include student grades, demographic, social and school-related features, and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd-period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).
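
    To make the note about G1/G2/G3 concrete, here is a hedged sketch comparing a regression that uses the period grades with one that excludes them; the file name and separator are assumptions, and the feature handling is deliberately simplified to numeric columns only.

    ```python
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    df = pd.read_csv("student-mat.csv", sep=";")  # assumed file name and separator

    numeric = df.select_dtypes("number")
    y = numeric["G3"]

    for label, drop in [("with G1/G2", ["G3"]), ("without G1/G2", ["G3", "G2", "G1"])]:
        X = numeric.drop(columns=drop)
        score = cross_val_score(LinearRegression(), X, y, cv=5).mean()
        print(f"{label}: mean R^2 = {score:.2f}")
    ```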

  20. gsm8k

    • huggingface.co
    Updated Aug 11, 2022
    + more versions
    Cite
    OpenAI (2022). gsm8k [Dataset]. https://huggingface.co/datasets/openai/gsm8k
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Aug 11, 2022
    Dataset authored and provided by
    OpenAI (http://openai.com/)
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset Card for GSM8K

      Dataset Summary
    

    GSM8K (Grade School Math 8K) is a dataset of 8.5K high quality linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning.

    These problems take between 2 and 8 steps to solve. Solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − ×÷) to reach the… See the full description on the dataset page: https://huggingface.co/datasets/openai/gsm8k.
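
    A minimal sketch of loading this dataset with the Hugging Face datasets library; the "main" configuration and the question/answer field names follow the standard GSM8K release, but check the dataset page if they have changed.

    ```python
    from datasets import load_dataset

    # "main" is the standard configuration; a "socratic" variant also exists.
    gsm8k = load_dataset("openai/gsm8k", "main")

    example = gsm8k["train"][0]
    print(example["question"])
    print(example["answer"])
    ```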
