8 datasets found
  1. OCR large data set

    • kaggle.com
    zip
    Updated Feb 15, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    James Mann (2023). OCR large data set [Dataset]. https://www.kaggle.com/datasets/jame5mann/ocr-large-data-set
    Explore at:
    zip(264412 bytes)Available download formats
    Dataset updated
    Feb 15, 2023
    Authors
    James Mann
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This is the large data set as featured in the OCR H240 exam series.

    Questions about this dataset will be featured in the statistics paper

    The LDS is a .xlsx file containing 5 tables, four data, one information. The data is drawn from the UK censuses from the years 2001 and 2011. It is designed for you to make comparisons and analyses of the changes in demographic and behavioural features of the populace. There is the age structure of each local authority and the method of travel within each local authority.

  2. Airoboros LLMs Math Dataset

    • kaggle.com
    zip
    Updated Nov 24, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    The Devastator (2023). Airoboros LLMs Math Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/airoboros-llms-math-dataset
    Explore at:
    zip(36964941 bytes)Available download formats
    Dataset updated
    Nov 24, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Airoboros LLMs Math Dataset

    Mastering Complex Mathematical Operations in Machine Learning

    By Huggingface Hub [source]

    About this dataset

    The Airoboros-3.1 dataset is the perfect tool to help machine learning models excel in the difficult realm of complicated mathematical operations. This data collection features thousands of conversations between machines and humans, formatted in ShareGPT to maximize optimization in an OS ecosystem. The dataset’s focus on advanced subjects like factorials, trigonometry, and larger numerical values will help drive machine learning models to the next level - facilitating critical acquisition of sophisticated mathematical skills that are essential for ML success. As AI technology advances at such a rapid pace, training neural networks to correspondingly move forward can be a daunting and complicated challenge - but with Airoboros-3.1’s powerful datasets designed around difficult mathematical operations it just became one step closer to achievable!

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    To get started, download the dataset from Kaggle and use the train.csv file. This file contains over two thousand examples of conversations between ML models and humans which have been formatted using ShareGPT - fast and efficient OS ecosystem fine-tuning tools designed to help with understanding mathematical operations more easily. The file includes two columns: category and conversations, both of which are marked as strings in the data itself.

    Once you have downloaded the train file you can begin setting up your own ML training environment by using any of your preferred frameworks or methods. Your model should focus on predicting what kind of mathematical operations will likely be involved in future conversations by referring back to previous dialogues within this dataset for reference (category column). You can also create your own test sets from this data, adding new conversation topics either by modifying existing rows or creating new ones entirely with conversation topics related to mathematics. Finally, compare your model’s results against other established models or algorithms that are already published online!

    Happy training!

    Research Ideas

    • It can be used to build custom neural networks or machine learning algorithms that are specifically designed for complex mathematical operations.
    • This data set can be used to teach and debug more general-purpose machine learning models to recognize large numbers, and intricate calculations within natural language processing (NLP).
    • The Airoboros-3.1 dataset can also be utilized as a supervised learning task: models could learn from the conversations provided in the dataset how to respond correctly when presented with complex mathematical operations

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: train.csv | Column name | Description | |:------------------|:-----------------------------------------------------------------------------| | category | The type of mathematical operation being discussed. (String) | | conversations | The conversations between the machine learning model and the human. (String) |

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. If you use this dataset in your research, please credit Huggingface Hub.

  3. u-math

    • huggingface.co
    Updated Dec 5, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Toloka (2024). u-math [Dataset]. https://huggingface.co/datasets/toloka/u-math
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 5, 2024
    Dataset authored and provided by
    Tolokahttps://www.toloka.ai/
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    U-MATH is a comprehensive benchmark of 1,100 unpublished university-level problems sourced from real teaching materials. It is designed to evaluate the mathematical reasoning capabilities of Large Language Models (LLMs). The dataset is balanced across six core mathematical topics and includes 20% of multimodal problems (involving visual elements such as graphs and diagrams). For fine-grained performance evaluation results and detailed discussion, check out our paper.

    📊 U-MATH benchmark at… See the full description on the dataset page: https://huggingface.co/datasets/toloka/u-math.

  4. N

    Individual Brain Charting dataset extension, second release of...

    • neurovault.org
    nifti
    Updated Feb 14, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2020). Individual Brain Charting dataset extension, second release of high-resolution fMRI data for cognitive mapping: sub-15_ses-02_task-hcp_language_dir-ap_story-math [Dataset]. http://identifiers.org/neurovault.image:363842
    Explore at:
    niftiAvailable download formats
    Dataset updated
    Feb 14, 2020
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    glassbrain

    Collection description

    The individual Brain Charting (IBC) Project is using high resolution fMRI to map 13 subjects that undergo a large number of tasks: the HCP tasks, the so-called ARCHI tasks, a specific language task, video watching, low-level visual stimulation etc. The native resolution of the data is 1.5mm isotropic. Their main value lies in the large number of contrasts probed, the level of detail and the high SNR per subject. This dataset is meant to provide the basis of a functional brain atlas. We upload here smoothed individual SPMs. The uploaded maps comprise session-specific and fixed effects across maps acquired with AP and PA phase encoding directions.

    Note that Neurovault collection #4438 is a subset of that one. In the present collections, some details have been fixed, including mroe accurate and unique file naming.

    Subject species

    homo sapiens

    Modality

    fMRI-BOLD

    Analysis level

    single-subject

    Cognitive paradigm (task)

    language processing fMRI task paradigm

    Map type

    Z

  5. Sigma Dolphin Filtered and Cleaned

    • kaggle.com
    zip
    Updated Jun 25, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ryan Mutiga (2024). Sigma Dolphin Filtered and Cleaned [Dataset]. https://www.kaggle.com/datasets/ryanmutiga/sigma-dolphin-filtered-and-cleaned
    Explore at:
    zip(60569 bytes)Available download formats
    Dataset updated
    Jun 25, 2024
    Authors
    Ryan Mutiga
    Description

    Dataset Description for Filtered Sigma Dolphin Dataset

    Overview

    This dataset is a cleaned and filtered version of the Sigma Dolphin dataset (https://www.kaggle.com/datasets/saurabhshahane/sigmadolphin), designed to aid in solving maths word problems using AI techniques. This was used as an effort towards taking part in the AI Mathematical Olympiad - Progress Prize 1 (https://www.kaggle.com/competitions/ai-mathematical-olympiad-prize/overview). The dataset was processed using TF-IDF vectorisation and K-means clustering, specifically targeting questions relevant to the AIME (American Invitational Mathematics Examination) and AMC 12 (American Mathematics Competitions).

    Context

    The Sigma Dolphin dataset is a project initiated by Microsoft Research Asia, aimed at building an intelligent system with natural language understanding and reasoning capacities to automatically solve maths word problems written in natural language. This project began in early 2013, and the dataset includes maths word problems from various sources, including community question-answering sites like Yahoo! Answers.

    Source and Original Dataset Details

    Content

    The filtered dataset includes problems that are relevant for preparing for maths competitions such as AIME and AMC. The data is structured to facilitate the training and evaluation of AI models aimed at solving these types of problems.

    Datasets:

    There are several filtered versions of the dataset based on different similarity thresholds (0.3 and 0.5). These thresholds were used to determine the relevance of problems from the original Sigma Dolphin dataset to the AIME and AMC problems.

    1. Number Word Problems Filtered at 0.3 Threshold:

      • File: number_word_test_filtered_0.3_Threshold.csv
      • Description: Contains problems filtered with a similarity threshold of 0.3, ensuring moderate relevance to AIME and AMC 12 problems.
    2. Number Word Problems Filtered at 0.5 Threshold:

      • File: number_word_std.test_filtered_0.5_Threshold.csv
      • Description: Contains problems filtered with a higher similarity threshold of 0.5, ensuring higher relevance to AIME and AMC 12 problems.
    3. Filtered Number Word Problems 2 at 0.3 Threshold:

      • File: filtered_number_word_problems2_Threshold.csv
      • Description: Another set of problems filtered at a 0.3 similarity threshold.
    4. Filtered Number Word Problems 2 at 0.5 Threshold:

      • File: filtered_number_word_problems_Threshold.csv
      • Description: Another set of problems filtered at a 0.5 similarity threshold.

    Why Different Similarity Thresholds?

    Different similarity thresholds (0.3 and 0.5) are used to provide flexibility in selecting problems based on their relevance to AIME and AMC problems. A lower threshold (0.3) includes a broader range of problems, ensuring a diverse set of questions, while a higher threshold (0.5) focuses on problems with stronger relevance, offering a more targeted and precise dataset. This allows users to choose the level of specificity that best fits their needs.

    For a detailed explanation of the preprocessing and filtering process, please refer to the Sigma Dolphin Filtered & Cleaned Notebook.

    Acknowledgements

    We extend our gratitude to all the original authors of the Sigma Dolphin dataset and the creators of the AIME and AMC problems. This project leverages the work of numerous researchers and datasets to build a comprehensive resource for AI-based problem solving in mathematics.

    Usage

    This dataset is intended for research and educational purposes. It can be used to train AI models for natural language processing and problem-solving tasks, specifically targeting maths word problems in competitive environments like AIME and AMC.

    Licensing

    This dataset is shared under the Computational Use of Data Agreement v1.0.

    This description provides an extensive overview of the dataset, its sources, contents, and usage. If any specific details or additional sections are needed, please let me know!

  6. gsm8k

    • huggingface.co
    Updated Aug 11, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    OpenAI (2022). gsm8k [Dataset]. https://huggingface.co/datasets/openai/gsm8k
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 11, 2022
    Dataset authored and provided by
    OpenAIhttp://openai.com/
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for GSM8K

      Dataset Summary
    

    GSM8K (Grade School Math 8K) is a dataset of 8.5K high quality linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning.

    These problems take between 2 and 8 steps to solve. Solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − ×÷) to reach the… See the full description on the dataset page: https://huggingface.co/datasets/openai/gsm8k.

  7. Aida Calculus Math Handwriting Recognition Dataset

    • kaggle.com
    zip
    Updated Aug 20, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Aida by Pearson (2020). Aida Calculus Math Handwriting Recognition Dataset [Dataset]. https://www.kaggle.com/aidapearson/ocr-data
    Explore at:
    zip(10833406726 bytes)Available download formats
    Dataset updated
    Aug 20, 2020
    Authors
    Aida by Pearson
    License

    https://cdla.io/sharing-1-0/https://cdla.io/sharing-1-0/

    Description

    Context

    The Aida Calculus Math Handwriting Recognition Dataset consists of 100,000 images in 10 batches. Each image contains a photo of a handwritten calculus math expression (specifically within the topic of limits) written with a dark utensil on plain paper. Each image is accompanied by ground truth math expression in LaTeX as well as bounding boxes and pixel-level masks per character. All images are synthetically generated.

    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F5602706%2F67bf0c680286baf2c979c8207a991bb2%2FScreen%20Shot%202020-08-19%20at%201.02.50%20PM.png?generation=1597868629120369&alt=media%20=500x100" alt="">

    Motivation

    The complexity of handwriting recognition for math expressions can be decomposed into the following sources of variability:

    Image of Math = Math Expression x Math Characters x Location of Math Characters x Visual Qualities of the Math Characters (fonts, color) x Noise of Image (backgrounds, stray marks)

    It is the job of the recognition model to take the Image of Math as input and predict the Math Expression.
    Typical approaches to handwritten recognition tasks involve collecting and tagging of large amounts of data, on which many iterations of models are trained. The "one dataset, many models" paradigm has specific drawbacks within the context of product development. As product requirements evolve, such as the addition of a new mathematical character into the prediction space, a new data collection and tagging effort must be undertaken. The cycle of adapting the handwriting recognition capability to new requirements is long and does not support agile product development.

    Here, we take a different approach by iteratively building a complex, synthetically generated dataset towards specific requirements. The generation process delivers exact control over the distribution of math expressions, characters, location of characters, specific visual qualities of the math, image noise, and image augmentations to the developer. The developer controls every aspect of the data, down to each pixel. In many ways, the data synthesis runs backwards to the handwriting recognition model, creating visual complexity that the model must then untangle to uncover the ground truth math expression. Thus, we can arrive at a "many datasets, one model" paradigm that as product requirements change, the data can quickly iterate and adapt on agile cycles.

    In addition to affording more control over the product development process, synthetic data allows for 100% correct pixel by pixel tagging that opens the door for new modeling possibilities. Every image is tagged with the ground truth LaTeX for the expressions, bounding boxes per math character, and exact pixel masks for each character.

    Our goal in releasing this dataset is to provide the data science and machine learning community with resources for undertaking the challenging computer vision task of extracting math expressions from images. The data offers something to all levels, from beginners building simple character recognition models to experts who wish to predict pixel-by-pixel masks and decode the complex structure of math expressions.

    Content

    The images contain math expressions of limits, a topic typically encountered by students learning Calculus I in the United States. Features of the writing such as font, writing utensils (type, color, pressure, consistency), angle and distance of photo, and size of writing are all simulated. Backgrounds features include shadows, various plain paper types, bleed throughs, other distortions, and noise typical of student taking photos of their math.

    The strategy in defining the populations from which images are synthesized is to be a superset of what we expect students to submit. Therefore, the math expressions are not in themselves pedagogical, but aim to encompass the potential variety of student submissions, both mathematically correct and incorrect. The image features and augmentations are similarly designed to cover the range of possible student handwriting qualities.

    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F5602706%2F78c49b9673f8d07c91cd5c929e50ed13%2FPicture2.png?generation=1597361067979205&alt=media" alt=""> https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F5602706%2F38f70b6a773709eb02578f20634e8433%2FPicture1.png?generation=1597361068613807&alt=media" alt=""> https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F5602706%2F17a3a78ac635cd728f9d6ef32609aee8%2FPicture3.png?generation=1597361068784034&alt=media" alt=""> https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F5602706%2Fc052749a8085d66aa7bf97c78a4b6c6a%2FPicture4.png?generation=1597361068949074&alt=media%20=250x100" alt="">

    Data consis...

  8. BeyondAIME

    • huggingface.co
    Updated Jun 17, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    ByteDance Seed (2025). BeyondAIME [Dataset]. https://huggingface.co/datasets/ByteDance-Seed/BeyondAIME
    Explore at:
    Dataset updated
    Jun 17, 2025
    Dataset provided by
    ByteDancehttps://www.bytedance.com/
    Authors
    ByteDance Seed
    License

    https://choosealicense.com/licenses/cc0-1.0/https://choosealicense.com/licenses/cc0-1.0/

    Description

    BeyondAIME: Advancing Math Reasoning Evaluation Beyond High School Olympiads

      Dataset Description
    

    BeyondAIME is a curated test set designed to benchmark advanced mathematical reasoning. Its creation was guided by the following core principles to ensure a fair and challenging evaluation:

    High Difficulty: Problems are sourced from high-school and university mathematics competitions, with a difficulty level greater than or equal to that of AIME Problems #11-15.… See the full description on the dataset page: https://huggingface.co/datasets/ByteDance-Seed/BeyondAIME.

  9. Not seeing a result you expected?
    Learn how you can add new datasets to our index.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
James Mann (2023). OCR large data set [Dataset]. https://www.kaggle.com/datasets/jame5mann/ocr-large-data-set
Organization logo

OCR large data set

The LDS used in the OCR A level maths exam (statistics)

Explore at:
156 scholarly articles cite this dataset (View in Google Scholar)
zip(264412 bytes)Available download formats
Dataset updated
Feb 15, 2023
Authors
James Mann
License

https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

Description

This is the large data set as featured in the OCR H240 exam series.

Questions about this dataset will be featured in the statistics paper

The LDS is a .xlsx file containing 5 tables, four data, one information. The data is drawn from the UK censuses from the years 2001 and 2011. It is designed for you to make comparisons and analyses of the changes in demographic and behavioural features of the populace. There is the age structure of each local authority and the method of travel within each local authority.

Search
Clear search
Close search
Google apps
Main menu