100+ datasets found
  1. Math Formula Retrieval

    • kaggle.com
    • huggingface.co
    zip
    Updated Dec 2, 2023
    Cite
    The Devastator (2023). Math Formula Retrieval [Dataset]. https://www.kaggle.com/datasets/thedevastator/math-formula-pair-classification-dataset/data
    Explore at:
    Available download formats: zip (2021716728 bytes)
    Dataset updated
    Dec 2, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Math Formula Retrieval

    Math Formula Pair Classification Dataset

    By ddrg (From Huggingface) [source]

    About this dataset

    With a total of six columns, including formula1, formula2, and label (binary format), the dataset provides the information needed for analysis and evaluation.

    The train.csv file contains a subset of the dataset specifically curated for training purposes. It includes an extensive range of math formula pairs along with their corresponding labels and unique ID names. This allows researchers and data scientists to construct models that can predict whether two given formulas fall within the same category or not.

    On the other hand, test.csv serves as an evaluation set. It consists of additional pairs of math formulas accompanied by their respective labels and unique IDs. By evaluating model performance on this test set after training it on train.csv data, researchers can assess how well their models generalize to unseen instances.

    By leveraging this dataset, researchers can explore applications in mathematics-related fields, such as developing pattern recognition algorithms or enhancing educational tools that automatically identify and categorize mathematical formulas.

    How to use the dataset

    Introduction

    Dataset Description

    train.csv

    The train.csv file contains a set of labeled math formula pairs along with their corresponding labels and formula name IDs. It consists of the following columns:

    • formula1: The first mathematical formula in the pair (text).
    • formula2: The second mathematical formula in the pair (text).
    • label: The classification label indicating whether the pair of formulas belong to the same category or not (binary). A label value of 1 indicates that both formulas belong to the same category, while a label value of 0 indicates different categories.

    test.csv

    The purpose of the test.csv file is to provide a set of formula pairs along with their labels and formula name IDs for testing and evaluation purposes. It has an identical structure to train.csv, containing columns like formula1, formula2, label, etc.

    Task

    The main task using this dataset is binary classification, where your objective is to predict whether two mathematical formulas belong to the same category or not based on their textual representation. You can use various machine learning algorithms such as logistic regression, decision trees, random forests, or neural networks for training models on this dataset.

    Exploring & Analyzing Data

    Before building your model, it's crucial to explore and analyze your data. Here are some steps you can take:

    • Load both CSV files (train.csv and test.csv) into your preferred data analysis framework or programming language (e.g., Python with libraries like pandas).
    • Examine the dataset's structure, including the number of rows, columns, and data types.
    • Check for missing values in the dataset and handle them accordingly.
    • Visualize the distribution of labels to understand whether it is balanced or imbalanced.
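
The exploration steps above can be sketched with Python's standard library alone; pandas, which the listing mentions, would work just as well. The snippet below uses a tiny in-memory sample in place of train.csv. The sample rows and the `summarize` helper are illustrative assumptions, not part of the dataset.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for train.csv; the real file is far larger.
SAMPLE_CSV = """formula1,formula2,label
a^2 + b^2 = c^2,c^2 = a^2 + b^2,1
a^2 + b^2 = c^2,E = m c^2,0
sin(x)^2 + cos(x)^2 = 1,,1
"""


def summarize(csv_file):
    """Return row count, column names, missing-value counts and label counts."""
    reader = csv.DictReader(csv_file)
    rows = list(reader)
    columns = reader.fieldnames or []
    # Treat empty or whitespace-only cells as missing values.
    missing = {
        col: sum(1 for row in rows if not (row[col] or "").strip())
        for col in columns
    }
    labels = Counter(row["label"] for row in rows)
    return {"n_rows": len(rows), "columns": columns,
            "missing": missing, "labels": labels}


summary = summarize(io.StringIO(SAMPLE_CSV))
print(summary)
```

For the real files, the same helper could be called on an open file handle, for example `with open("train.csv", newline="", encoding="utf-8") as f: summarize(f)`.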

    Model Building

    Once you have analyzed and preprocessed your dataset, you can start building your classification model using various machine learning algorithms:

    • Split your train.csv data into training and validation sets for model evaluation during training.
    • Choose a suitable
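
A minimal sketch of the training/validation split mentioned above, assuming the rows have already been parsed into dictionaries. It splits within each label so that both parts keep a similar label balance. The 80/20 ratio, the fixed seed and the `stratified_split` helper are illustrative choices, not something the listing prescribes.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key="label", val_fraction=0.2, seed=0):
    """Split rows into (train, validation), keeping label proportions similar."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    train, val = [], []
    for group in by_label.values():
        shuffled = group[:]          # copy so the caller's list is untouched
        rng.shuffle(shuffled)
        n_val = round(len(shuffled) * val_fraction)
        val.extend(shuffled[:n_val])
        train.extend(shuffled[n_val:])
    return train, val


# Hypothetical rows standing in for parsed train.csv records.
rows = [{"formula1": f"f{i}", "formula2": f"g{i}", "label": i % 2}
        for i in range(20)]
train_rows, val_rows = stratified_split(rows)
print(len(train_rows), len(val_rows))
```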

    Research Ideas

    • Math Formula Similarity: This dataset can be used to develop a model that classifies whether two mathematical formulas are similar or not. This can be useful in various applications such as plagiarism detection, identifying duplicate formulas in databases, or suggesting similar formulas based on user input.
    • Formula Categorization: The dataset can be used to train a model that categorizes mathematical formulas into different classes or categories. For example, the model can classify formulas into algebraic expressions, trigonometric equations, calculus problems, or geometric theorems. This categorization can help organize and search through large collections of mathematical formulas.
    • Formula Recommendation: Using this dataset, one could build a recommendation system that suggests related math formulas based on user input. By analyzing the similarities between different formula pairs and their corresponding labels, the system could provide recommendations for relevant mathematical concepts that users may need while solving problems or studying specific topics in mathematics

    Acknowle...

  2. Dataset for The effects of a number line intervention on calculation skills

    • researchdata.edu.au
    • figshare.mq.edu.au
    Updated May 18, 2023
    Cite
    Saskia Kohnen; Rebecca Bull; Carola Ruiz Hornblas (2023). Dataset for The effects of a number line intervention on calculation skills [Dataset]. http://doi.org/10.25949/22799717.V1
    Explore at:
    Dataset updated
    May 18, 2023
    Dataset provided by
    Macquarie University
    Authors
    Saskia Kohnen; Rebecca Bull; Carola Ruiz Hornblas
    Description

    Study information

    The sample included in this dataset represents five children who participated in a number line intervention study. Originally six children were included in the study, but one of them fulfilled the criterion for exclusion after missing several consecutive sessions. Thus, their data is not included in the dataset.

    All participants were attending Year 1 of primary school at an independent school in New South Wales, Australia. To be eligible to participate, children had to present with low mathematics achievement, performing at or below the 25th percentile in the Maths Problem Solving and/or Numerical Operations subtests from the Wechsler Individual Achievement Test III (WIAT III A & NZ, Wechsler, 2016). Children were excluded if, as reported by their parents, they had any other diagnosed disorders such as attention deficit hyperactivity disorder, autism spectrum disorder, intellectual disability, developmental language disorder, cerebral palsy or uncorrected sensory disorders.

    The study followed a multiple baseline case series design, with a baseline phase, a treatment phase, and a post-treatment phase. The baseline phase varied between two and three measurement points, the treatment phase varied between four and seven measurement points, and all participants had 1 post-treatment measurement point.

    The measurement points were distributed across participants as follows:

    Participant 1 – 3 baseline, 6 treatment, 1 post-treatment

    Participant 3 – 2 baseline, 7 treatment, 1 post-treatment

    Participant 5 – 2 baseline, 5 treatment, 1 post-treatment

    Participant 6 – 3 baseline, 4 treatment, 1 post-treatment

    Participant 7 – 2 baseline, 5 treatment, 1 post-treatment

    In each session across all three phases children were assessed in their performance on a number line estimation task, a single-digit computation task, a multi-digit computation task, a dot comparison task and a number comparison task. Furthermore, during the treatment phase, all children completed the intervention task after these assessments. The order of the assessment tasks varied randomly between sessions.


    Measures

    Number Line Estimation. Children completed a computerised bounded number line task (0-100). The number line is presented in the middle of the screen, and the target number is presented above the start point of the number line to avoid signalling the midpoint (Dackermann et al., 2018). Target numbers included two non-overlapping sets (trained and untrained) of 30 items each. Untrained items were assessed in all phases of the study. Trained items were assessed independent of the intervention during baseline and post-treatment phases, and performance on the intervention is used to index performance on the trained set during the treatment phase. Within each set, numbers were equally distributed throughout the number range, with three items within each ten (0-10, 11-20, 21-30, etc.). Target numbers were presented in random order. Participants did not receive performance-based feedback. Accuracy is indexed by percent absolute error (PAE): PAE = [|number estimated - target number| / scale of number line] x 100.
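
As a small worked example of the PAE measure, read with the absolute value that its name implies, the sketch below computes it for single estimates on a 0-100 line. The function names are illustrative, and averaging over items is an assumption about how a per-session score might be summarised; the description does not state it.

```python
def percent_absolute_error(estimate, target, scale=100):
    """PAE = |estimate - target| / scale x 100 for a number line of length `scale`."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    return abs(estimate - target) * 100 / scale


def mean_pae(pairs, scale=100):
    """Average PAE over (estimate, target) pairs (assumed aggregation)."""
    return sum(percent_absolute_error(e, t, scale) for e, t in pairs) / len(pairs)


# A child marks 40 when the target is 35 on the 0-100 line.
print(percent_absolute_error(40, 35))
print(mean_pae([(40, 35), (72, 75), (10, 10)]))
```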


    Single-Digit Computation. The task included ten additions with single-digit addends (1-9) and single-digit results (2-9). The order was counterbalanced so that half of the additions present the lowest addend first (e.g., 3 + 5) and half of the additions present the highest addend first (e.g., 6 + 3). This task also included ten subtractions with single-digit minuends (3-9), subtrahends (1-6) and differences (1-6). The items were presented horizontally on the screen accompanied by a sound and participants were required to give a verbal response. Participants did not receive performance-based feedback. Performance on this task was indexed by item-based accuracy.


    Multi-digit computational estimation. The task included eight additions and eight subtractions presented with double-digit numbers and three response options. None of the response options represent the correct result. Participants were asked to select the option that was closest to the correct result. In half of the items the calculation involved two double-digit numbers, and in the other half one double and one single digit number. The distance between the correct response option and the exact result of the calculation was two for half of the trials and three for the other half. The calculation was presented vertically on the screen with the three options shown below. The calculations remained on the screen until participants responded by clicking on one of the options on the screen. Participants did not receive performance-based feedback. Performance on this task is measured by item-based accuracy.


    Dot Comparison and Number Comparison. Both tasks included the same 20 items, which were presented twice, counterbalancing left and right presentation. Magnitudes to be compared were between 5 and 99, with four items for each of the following ratios: .91, .83, .77, .71, .67. Both quantities were presented horizontally side by side, and participants were instructed to press one of two keys (F or J), as quickly as possible, to indicate the larger one. Items were presented in random order and participants did not receive performance-based feedback. In the non-symbolic comparison task (dot comparison) the two sets of dots remained on the screen for a maximum of two seconds (to prevent counting). Overall area and convex hull for both sets of dots were kept constant following Guillaume et al. (2020). In the symbolic comparison task (Arabic numbers), the numbers remained on the screen until a response was given. Performance on both tasks was indexed by accuracy.


    The Number Line Intervention

    During the intervention sessions, participants estimated the position of 30 Arabic numbers on a 0-100 bounded number line. As a form of feedback, within each item, the participant’s estimate remained visible, and the correct position of the target number appeared on the number line. When the estimate’s PAE was lower than 2.5, a message appeared on the screen that read “Excellent job”; when PAE was between 2.5 and 5 the message read “Well done, so close!”; and when PAE was higher than 5 the message read “Good try!” Numbers were presented in random order.
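
The feedback rule described above can be sketched as a simple threshold function. The handling of the exact boundary values (2.5 and 5 both fall in the middle band) follows the wording of the description; the function name is illustrative.

```python
def feedback_message(pae):
    """Map a percent absolute error to the on-screen feedback message."""
    if pae < 2.5:
        return "Excellent job"
    if pae <= 5:
        # "between 2.5 and 5" is read here as inclusive at both ends.
        return "Well done, so close!"
    return "Good try!"


for value in (1.0, 2.5, 5.0, 7.5):
    print(value, feedback_message(value))
```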


    Variables in the dataset

    Age = age in ‘years, months’ at the start of the study

    Sex = female/male/non-binary or third gender/prefer not to say (as reported by parents)

    Math_Problem_Solving_raw = Raw score on the Math Problem Solving subtest from the WIAT III (WIAT III A & NZ, Wechsler, 2016).

    Math_Problem_Solving_Percentile = Percentile equivalent on the Math Problem Solving subtest from the WIAT III (WIAT III A & NZ, Wechsler, 2016).

    Num_Ops_Raw = Raw score on the Numerical Operations subtest from the WIAT III (WIAT III A & NZ, Wechsler, 2016).

    Math_Problem_Solving_Percentile = Percentile equivalent on the Numerical Operations subtest from the WIAT III (WIAT III A & NZ, Wechsler, 2016).


    The remaining variables refer to participants’ performance on the study tasks. Each variable name is composed of three parts. The first refers to the phase and session. For example, Base1 refers to the first measurement point of the baseline phase, Treat1 to the first measurement point of the treatment phase, and post1 to the first measurement point of the post-treatment phase.


    The second part of the variable name refers to the task, as follows:

    DC = dot comparison

    SDC = single-digit computation

    NLE_UT = number line estimation (untrained set)

    NLE_T = number line estimation (trained set)

    CE = multidigit computational estimation

    NC = number comparison

    The final part of the variable name refers to the type of measure being used (i.e., acc = total correct responses and pae = percent absolute error).


    Thus, variable Base2_NC_acc corresponds to accuracy on the number comparison task during the second measurement point of the baseline phase and Treat3_NLE_UT_pae refers to the percent absolute error on the untrained set of the number line task during the third session of the Treatment phase.
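
The naming scheme above can be decoded mechanically. The sketch below splits a variable name into phase, session, task and measure using the codes listed in the description. The `parse_variable` helper and its case-insensitive matching are illustrative assumptions.

```python
import re

TASKS = {
    "DC": "dot comparison",
    "SDC": "single-digit computation",
    "NLE_UT": "number line estimation (untrained set)",
    "NLE_T": "number line estimation (trained set)",
    "CE": "multidigit computational estimation",
    "NC": "number comparison",
}
MEASURES = {"acc": "total correct responses", "pae": "percent absolute error"}
PHASES = {"base": "baseline", "treat": "treatment", "post": "post-treatment"}

# <phase><session>_<task>_<measure>, e.g. Treat3_NLE_UT_pae
_PATTERN = re.compile(r"^(base|treat|post)(\d+)_(.+)_(acc|pae)$", re.IGNORECASE)


def parse_variable(name):
    """Decode a task variable name into its phase, session, task and measure."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognised variable name: {name!r}")
    phase, session, task, measure = match.groups()
    if task.upper() not in TASKS:
        raise ValueError(f"unknown task code: {task!r}")
    return {
        "phase": PHASES[phase.lower()],
        "session": int(session),
        "task": TASKS[task.upper()],
        "measure": MEASURES[measure.lower()],
    }


print(parse_variable("Base2_NC_acc"))
print(parse_variable("Treat3_NLE_UT_pae"))
```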





  3. Landmarks Dataset for sign recognition numbers

    • kaggle.com
    zip
    Updated Nov 4, 2022
    Cite
    Akshat Mittu (2022). Landmarks Dataset for sign recognition numbers [Dataset]. https://www.kaggle.com/datasets/akshatmittu/landmarks-dataset-for-sign-recognition-numbers
    Explore at:
    Available download formats: zip (50385 bytes)
    Dataset updated
    Nov 4, 2022
    Authors
    Akshat Mittu
    Description

    This dataset was created from images of hand signs: the hand landmarks detected in each image were used as the attributes of the dataset. It contains all 21 landmarks, each with its (x, y, z) coordinates, and 5 classes (1, 2, 3, 4, 5).

    You can add more classes to the dataset by running the following code. Make sure to either create an empty dataset or append to this one, and set the file path correctly.

    ```python
    import os

    import cv2
    import mediapipe as mp
    import pandas as pd

    mp_hands = mp.solutions.hands
    coords = ('x', 'y', 'z')
    rows = []  # one dict per detected hand

    # static_image_mode=True runs detection on every image independently,
    # which matches the original's fresh detector for each file.
    with mp_hands.Hands(static_image_mode=True, max_num_hands=1,
                        min_detection_confidence=0.8) as hands:
        for t in range(1, 6):  # class folders data/1/ ... data/5/
            path = 'data/' + str(t) + '/'
            for file_name in os.listdir(path):
                image = cv2.imread(path + file_name)
                if image is None:  # skip files OpenCV cannot read
                    continue
                image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
                results = hands.process(image)
                if not results.multi_hand_landmarks:
                    continue
                for hand_landmarks in results.multi_hand_landmarks:
                    row = {'label': t}
                    for i in range(21):
                        lm = hand_landmarks.landmark[i]
                        name = mp_hands.HandLandmark(i).name
                        for axis, value in zip(coords, (lm.x, lm.y, lm.z)):
                            row[name + '_' + axis] = value
                    rows.append(row)

    # DataFrame.append was removed in pandas 2.0, so build the frame in one step;
    # use pd.concat to add these rows to an existing dataset.
    df = pd.DataFrame(rows)
    ```
    
  4. MathInstruct Dataset: Hybrid Math Instruction

    • kaggle.com
    zip
    Updated Nov 30, 2023
    Cite
    The Devastator (2023). MathInstruct Dataset: Hybrid Math Instruction [Dataset]. https://www.kaggle.com/datasets/thedevastator/mathinstruct-dataset-hybrid-math-instruction-tun
    Explore at:
    Available download formats: zip (60239940 bytes)
    Dataset updated
    Nov 30, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    MathInstruct Dataset: Hybrid Math Instruction Tuning

    A curated dataset for math instruction tuning models

    By TIGER-Lab (From Huggingface) [source]

    About this dataset

    MathInstruct is a comprehensive and meticulously curated dataset specifically designed to facilitate the development and evaluation of models for math instruction tuning. This dataset consists of a total of 13 different math rationale datasets, out of which six have been exclusively curated for this project, ensuring a diverse range of instructional materials. The main objective behind creating this dataset is to provide researchers with an easily accessible and manageable resource that aids in enhancing the effectiveness and precision of math instruction.

    One noteworthy feature of MathInstruct is its lightweight nature, which makes it convenient for researchers to use. Its columns include source and output: from source, users can identify the origin or reference material from which each math instruction was obtained, and from output they can read the expected output or solution for each math problem or exercise.

    Overall, MathInstruct offers considerable potential for refining hybrid math instruction by supporting careful model development and rigorous evaluation. Researchers can use this diverse dataset to gain deeper insights into effective teaching methodologies while exploring new approaches to enhancing mathematical learning experiences.

    How to use the dataset

    Title: How to Use the MathInstruct Dataset for Hybrid Math Instruction Tuning

    Introduction: The MathInstruct dataset is a comprehensive collection of math instruction examples, designed to assist in developing and evaluating models for math instruction tuning. This guide will provide an overview of the dataset and explain how to make effective use of it.

    • Understanding the Dataset Structure: The dataset consists of a file named train.csv. This CSV file contains the training data, which includes various columns such as source and output. The source column represents the source of math instruction (textbook, online resource, or teacher), while the output column represents expected output or solution to a particular math problem or exercise.

    • Accessing the Dataset: To access the MathInstruct dataset, you can download it from Kaggle's website. Once downloaded, you can read and manipulate the data using programming languages like Python with libraries such as pandas.

    • Exploring the Columns: a) Source Column: The source column provides information about where each math instruction comes from. It may include references to specific textbooks, online resources, or even teachers who provided instructional material. b) Output Column: The output column specifies what students are expected to achieve as a result of each math instruction. It contains solutions or expected outputs for different math problems or exercises.

    • Utilizing Source Information: By analyzing the different sources mentioned in this dataset, researchers can understand which instructional materials are more effective in teaching specific topics within mathematics. They can also identify common strategies used by teachers across multiple sources.

    • Analyzing Expected Outputs: Researchers can study variations in expected outputs for similar types of problems across different sources. This analysis may help identify differences in approaches across textbooks/resources and enrich our understanding of various teaching methods.

    • Model Development and Evaluation: Researchers can utilize this dataset to develop machine learning models that automatically assess whether a given math instruction leads to the expected output. By training models on this data, one can create automated systems that provide feedback on math problems or suggest alternative instruction sources.

    • Scaling the Dataset: Due to its lightweight nature, the MathInstruct dataset is easily accessible and manageable. Researchers can scale up their training data by combining it with other instructional datasets or expand it further by labeling more examples based on similar guidelines.
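
As a minimal sketch of the source analysis described in the steps above, the snippet below counts how many examples come from each source. It uses Python's standard library and a tiny in-memory sample in the train.csv layout the guide describes (source and output columns). The sample rows, the source names and the `count_by_source` helper are hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical rows in the described layout; not taken from the dataset.
SAMPLE_CSV = """source,output
source_a,The answer is 4.
source_b,x = 3
source_a,12
"""


def count_by_source(csv_file):
    """Count how many examples each value of the source column contributes."""
    return Counter(row["source"] for row in csv.DictReader(csv_file))


counts = count_by_source(io.StringIO(SAMPLE_CSV))
for source, n in counts.most_common():
    print(source, n)
```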

    Conclusion: The MathInstruct dataset serves as a valuable resource for developing and evaluating models related to math instruction tuning. By analyzing the source information and expected outputs, researchers can gain insights into effective teaching methods and build automated assessment

    Research Ideas

    • Model development: This dataset can be used for developing and training models for math instruction...
  5. Prime Number Source Code with Dataset

    • figshare.com
    zip
    Updated Oct 12, 2024
    Cite
    Ayman Mostafa (2024). Prime Number Source Code with Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.27215508.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 12, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Ayman Mostafa
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This paper addresses the computational methods and challenges associated with prime number generation, a critical component in encryption algorithms for ensuring data security. Generating prime numbers efficiently is a critical challenge in various domains, including cryptography, number theory, and computer science. The quest to find more effective algorithms for prime number generation is driven by the increasing demand for secure communication and data storage and the need for efficient algorithms to solve complex mathematical problems. Our goal is to address this challenge by presenting two novel algorithms for generating prime numbers: one that generates primes up to a given limit and another that generates primes within a specified range. These algorithms are founded on the formulas of odd-composed numbers, allowing them to achieve performance improvements compared to existing prime number generation algorithms. Our experimental results show that our proposed algorithms outperform well-established prime number generation algorithms such as Miller-Rabin, Sieve of Atkin, Sieve of Eratosthenes, and Sieve of Sundaram in mean execution time. More notably, our algorithms can provide prime numbers from range to range with good performance. This enhancement in performance and adaptability can significantly impact the effectiveness of various applications that depend on prime numbers, from cryptographic systems to distributed computing. By providing an efficient and flexible method for generating prime numbers, our proposed algorithms can help develop more secure and reliable communication systems, enable faster computations in number theory, and support advanced computer science and mathematics research.
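
For orientation, the two tasks the abstract names (primes up to a limit, and primes within a range) can be sketched with the Sieve of Eratosthenes, one of the baselines the authors compare against. This is not the authors' proposed algorithm, which the listing does not describe in detail; it is the well-known baseline, with a segmented variant for the range case.

```python
import math


def sieve_of_eratosthenes(limit):
    """Return all primes <= limit using the classic sieve."""
    if limit < 2:
        return []
    is_prime = bytearray([1]) * (limit + 1)
    is_prime[0] = is_prime[1] = 0
    for p in range(2, math.isqrt(limit) + 1):
        if is_prime[p]:
            # Cross out multiples of p, starting at p*p.
            is_prime[p * p::p] = bytearray(len(range(p * p, limit + 1, p)))
    return [n for n, flag in enumerate(is_prime) if flag]


def primes_in_range(low, high):
    """Return all primes in [low, high] using a segmented sieve."""
    low = max(low, 2)
    if high < low:
        return []
    base_primes = sieve_of_eratosthenes(math.isqrt(high))
    segment = bytearray([1]) * (high - low + 1)
    for p in base_primes:
        # First multiple of p inside the segment, but never p itself.
        start = max(p * p, ((low + p - 1) // p) * p)
        for multiple in range(start, high + 1, p):
            segment[multiple - low] = 0
    return [low + i for i, flag in enumerate(segment) if flag]


print(sieve_of_eratosthenes(30))
print(primes_in_range(100, 120))
```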

  6. Math Problems with answers (AIME, IMO)

    • kaggle.com
    Updated Jan 15, 2025
    Cite
    Mike Shperling (2025). Math Problems with answers (AIME, IMO) [Dataset]. https://www.kaggle.com/datasets/dolbokostya/math-problems-with-answers-aime-imo
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 15, 2025
    Dataset provided by
    Kaggle
    Authors
    Mike Shperling
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    This dataset comprises curated mathematical problems and their answers sourced from prestigious competitions such as the American Invitational Mathematics Examination (AIME) and the International Mathematical Olympiad (IMO). Designed to challenge both human and machine intelligence, these problems cover a wide range of mathematical disciplines, including algebra, geometry, number theory, and combinatorics.

    The dataset is structured for use in validating and benchmarking large language models (LLMs), assessing their problem-solving abilities, reasoning, and logical inference skills.

  7. MNIST-100

    • kaggle.com
    zip
    Updated Jul 25, 2023
    Cite
    Marcin Wierzbiński (2023). MNIST-100 [Dataset]. https://www.kaggle.com/datasets/martininf1n1ty/mnist100
    Explore at:
    Available download formats: zip (23452456 bytes)
    Dataset updated
    Jul 25, 2023
    Authors
    Marcin Wierzbiński
    License

    http://www.gnu.org/licenses/lgpl-3.0.html

    Description

    The MNIST-100 dataset is a variation of the original MNIST dataset, consisting of 100 handwritten numbers extracted from the MNIST dataset. Unlike the traditional MNIST dataset, which contains 60,000 training images of digits from 0 to 9, the MNIST-100 dataset focuses on 100 numbers.

    Dataset Overview:

    • Dataset Name: MNIST-100
    • Total Number of Images: train: 60000, test: 1000
    • Classes: 100 (Numbers from 00 to 99)
    • Image Size: 28x56 pixels (grayscale)

    Data Collection: The MNIST-100 dataset was created by randomly selecting 10 unique digits from the original MNIST dataset. For each selected digit, 10 representative images were extracted, resulting in a total of 100 images. These images were carefully chosen to represent a diverse range of handwriting styles for each digit.

    Each image in the dataset is labeled with its corresponding numbers, ranging from 00 to 99, making it suitable for classification tasks. Researchers and practitioners can use this dataset to train and evaluate machine learning algorithms and neural networks for digit recognition and classification.
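
The listing does not say how the images were assembled. One reading consistent with the stated 28x56 image size and the 00 to 99 labels is that two 28x28 MNIST digits are placed side by side, with the label formed from the two digits. The sketch below illustrates that assumption only, using plain Python lists in place of real image arrays; `compose_pair` is a hypothetical helper.

```python
def compose_pair(left_img, right_img, left_digit, right_digit):
    """Join two 28x28 images side by side; the label is the two-digit number."""
    for img in (left_img, right_img):
        if len(img) != 28 or any(len(row) != 28 for row in img):
            raise ValueError("each input image must be 28x28")
    # Concatenate corresponding rows to get a 28-row, 56-column image.
    image = [l_row + r_row for l_row, r_row in zip(left_img, right_img)]
    label = 10 * left_digit + right_digit
    return image, label


# Placeholder images standing in for real MNIST digits.
blank = [[0] * 28 for _ in range(28)]
full = [[255] * 28 for _ in range(28)]
image, label = compose_pair(blank, full, 4, 2)
print(len(image), len(image[0]), f"{label:02d}")
```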

    Please note that the Modified MNIST-100 dataset is not intended to replace the original MNIST dataset but serves as a complementary resource for specific applications requiring a smaller and more focused subset of the MNIST data.

    Overall, the MNIST-100 dataset offers a compact and representative collection of 100 handwritten numbers, providing a convenient tool for experimentation and learning in computer vision and pattern recognition.

    Label Distribution for training set:

    Label  Occurrences    Label  Occurrences    Label  Occurrences
    0      561            34     629            68     606
    1      687            35     540            69     582
    2      582            36     588            70     566
    3      633            37     619            71     659
    4      588            38     584            72     572
    5      544            39     609            73     682
    6      582            40     570            74     627
    7      615            41     679            75     598
    8      584            42     544            76     605
    9      567            43     567            77     602
    10     641            44     574            78     595
    11     780            45     555            79     586
    12     720            46     550            80     569
    13     699            47     614            81     628
    14     630            48     614            82     578
    15     627            49     595            83     622
    16     684            50     505            84     569
    17     713            51     583            85     540
    18     743            52     512            86     557
    19     706            53     555            87     628
    20     527            54     504            88     562
    21     710            55     488            89     625
    22     586            56     531            90     600
    23     584            57     556            91     700
    24     568            58     497            92     622
    25     530            59     520            93     622
    26     612            60     556            94     591
    27     627            61     682            95     557
    28     618            62     594            96     580
    29     619            63     539            97     640
    30     622            64     610            98     577
    31     684            65     514            99     563
    32     606            66     587
    33     592            67     655

    Test data:


    Label  Occurrences    Label  Occurrences    Label  Occurrences
    00     96             34     100            68     90
    01     108            35     91             69     92
    02     91             36     107            70     102
    03     96             37     112            71     116
    04     75             38     97             72     101
    05     85             39     96             73     106
    06     88             40     103            74     98
    07     96             41     123            75     ...
  8. Direct (Income) Tax Data: Year-, Taxpayer-Type and Income-Range-wise Total...

    • dataful.in
    Updated Nov 20, 2025
    Cite
    Dataful (Factly) (2025). Direct (Income) Tax Data: Year-, Taxpayer-Type and Income-Range-wise Total Loss Set-off and Number of Returns Filed [Dataset]. https://dataful.in/datasets/20658
    Explore at:
    Available download formats: csv, application/x-parquet, xlsx
    Dataset updated
    Nov 20, 2025
    Dataset authored and provided by
    Dataful (Factly)
    License

    https://dataful.in/terms-and-conditions

    Area covered
    All India
    Variables measured
    Sum of Total Loss Set Off, Number of Income tax returns filed
    Description

    The dataset contains year-, taxpayer-type-, and income-range-wise total loss set-off and the number of returns filed among different types of direct (income) taxpayers such as Firms, Hindu Undivided Families (HUF), Associations of Persons (AOP) and Bodies of Individuals (BOI), Companies, Individuals, etc. The income ranges covered in the dataset run from zero rupees to 500 crores and above.

    Note: Total Loss Set Off is the sum of current year losses set off and brought forward losses set off against current year’s income in the “Computation of total income” Schedule of the return of income.

  9. TIGER/Line Shapefile, 2023, County, Somerset County, ME, Address Ranges...

    • catalog.data.gov
    • s.cnmilf.com
    Updated Aug 11, 2025
    Cite
    U.S. Department of Commerce, U.S. Census Bureau, Geography Division, Geospatial Products Branch (Point of Contact) (2025). TIGER/Line Shapefile, 2023, County, Somerset County, ME, Address Ranges Relationship File [Dataset]. https://catalog.data.gov/dataset/tiger-line-shapefile-2023-county-somerset-county-me-address-ranges-relationship-file
    Explore at:
    Dataset updated
    Aug 11, 2025
    Dataset provided by
    United States Census Bureau (http://census.gov/)
    Area covered
    Somerset County
    Description

    The TIGER/Line shapefiles and related database files (.dbf) are an extract of selected geographic and cartographic information from the U.S. Census Bureau's Master Address File / Topologically Integrated Geographic Encoding and Referencing (MAF/TIGER) Database (MTDB). The MTDB represents a seamless national file with no overlaps or gaps between parts; however, each TIGER/Line shapefile is designed to stand alone as an independent data set, or several can be combined to cover the entire nation. The Address Ranges Relationship File (ADDR.dbf) contains the attributes of each address range. Each address range applies to a single edge and has a unique address range identifier (ARID) value. The edge to which an address range applies can be determined by linking the address range to the All Lines Shapefile (EDGES.shp) using the permanent topological edge identifier (TLID) attribute. Multiple address ranges can apply to the same edge since an edge can have multiple address ranges. Note that the most inclusive address range associated with each side of a street edge already appears in the All Lines Shapefile (EDGES.shp). The TIGER/Line Files contain potential address ranges, not individual addresses. The term "address range" refers to the collection of all possible structure numbers from the first structure number to the last structure number and all numbers of a specified parity in between along an edge side relative to the direction in which the edge is coded. The address ranges in the TIGER/Line Files are potential ranges that include the full range of possible structure numbers even though the actual structures may not exist.
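
Two ideas from the description can be sketched in a few lines: expanding an address range into its potential structure numbers of one parity, and grouping address ranges by the edge (TLID) they apply to. The records below are hypothetical. TLID and ARID are named in the description, while FROMHN and TOHN are assumed field names for the range endpoints, and house numbers are treated as plain integers for simplicity.

```python
from collections import defaultdict


def potential_numbers(from_hn, to_hn):
    """All structure numbers from from_hn to to_hn that share their parity."""
    if (to_hn - from_hn) % 2 != 0:
        raise ValueError("range endpoints must have the same parity")
    # Ranges may run downward relative to the direction the edge is coded.
    step = 2 if to_hn >= from_hn else -2
    stop = to_hn + (1 if step > 0 else -1)
    return list(range(from_hn, stop, step))


def ranges_by_edge(addr_records):
    """Group address-range records by the TLID of the edge they apply to."""
    grouped = defaultdict(list)
    for record in addr_records:
        grouped[record["TLID"]].append(record)
    return dict(grouped)


# Hypothetical ADDR.dbf-style records; one edge carries two address ranges.
addr = [
    {"ARID": "A1", "TLID": 1001, "FROMHN": 101, "TOHN": 109},
    {"ARID": "A2", "TLID": 1001, "FROMHN": 100, "TOHN": 108},
    {"ARID": "A3", "TLID": 1002, "FROMHN": 20, "TOHN": 12},
]
edges = ranges_by_edge(addr)
print(len(edges[1001]))
print(potential_numbers(101, 109))
```

With real data, the grouped records would be matched against the TLID attribute of EDGES.shp to attach each range to its edge geometry.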

  10. TIGER/Line Shapefile, Current, County, DuPage County, IL, Address...

    • catalog.data.gov
    Updated Aug 8, 2025
    Cite
    U.S. Department of Commerce, U.S. Census Bureau, Geography Division (Point of Contact) (2025). TIGER/Line Shapefile, Current, County, DuPage County, IL, Address Range-Feature [Dataset]. https://catalog.data.gov/dataset/tiger-line-shapefile-current-county-dupage-county-il-address-range-feature
    Dataset updated
    Aug 8, 2025
    Dataset provided by
    United States Census Bureau: http://census.gov/
    Area covered
    DuPage County, Illinois
    Description

    The TIGER/Line shapefiles and related database files (.dbf) are an extract of selected geographic and cartographic information from the U.S. Census Bureau's Master Address File / Topologically Integrated Geographic Encoding and Referencing (MAF/TIGER) System (MTS). The MTS represents a seamless national file with no overlaps or gaps between parts; however, each TIGER/Line shapefile is designed to stand alone as an independent data set, or they can be combined to cover the entire nation. The Address Range Features shapefile contains the geospatial edge geometry and attributes of all unsuppressed address ranges for a county or county equivalent area. The term "address range" refers to the collection of all possible structure numbers from the first structure number to the last structure number and all numbers of a specified parity in between along an edge side relative to the direction in which the edge is coded. Single-address address ranges have been suppressed to maintain the confidentiality of the addresses they describe. Multiple coincident address range feature edge records are represented in the shapefile if more than one left or right address range is associated with the edge. This shapefile contains a record for each address range to street name combination. Address ranges associated with more than one street name are also represented by multiple coincident address range feature edge records. Note that this shapefile includes all unsuppressed address ranges, whereas the All Lines shapefile (edges.shp) includes only the most inclusive address range associated with each side of a street edge. The TIGER/Line shapefiles contain potential address ranges, not individual addresses. The address ranges in the TIGER/Line shapefiles are potential ranges that include the full range of possible structure numbers even though the actual structures may not exist.
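
    The definition of an address range given above (first number, last number, and every number of the same parity in between, in the direction the edge is coded) can be illustrated with a short Python sketch. The function name `potential_numbers` is hypothetical, and the sketch assumes purely numeric structure numbers whose two endpoints share the same parity; real TIGER/Line house numbers can be more varied than this.

```python
def potential_numbers(from_hn, to_hn):
    """List the potential structure numbers from from_hn to to_hn.

    Assumes integer endpoints of equal parity. The range may run downward,
    because it follows the direction in which the edge is coded.
    """
    if (from_hn - to_hn) % 2 != 0:
        raise ValueError("this sketch expects endpoints of the same parity")
    step = 2 if to_hn >= from_hn else -2
    stop = to_hn + (1 if step > 0 else -1)  # make the last number inclusive
    return list(range(from_hn, stop, step))


odd_side = potential_numbers(101, 109)
descending_even_side = potential_numbers(208, 200)
```

    The numbers produced are potential values only; as the description notes, a structure need not exist at each of them.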

  11. Median and (range) for FA, pennation angle, number of fibers, and fiber...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    • +1 more
    Updated May 26, 2015
    Cite
    Buck, Amanda K. W.; Damon, Bruce M.; Elder, Christopher P.; Ding, Zhaohua; Towse, Theodore F. (2015). Median and (range) for FA, pennation angle, number of fibers, and fiber length. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001943049
    Dataset updated
    May 26, 2015
    Authors
    Buck, Amanda K. W.; Damon, Bruce M.; Elder, Christopher P.; Ding, Zhaohua; Towse, Theodore F.
    Description
    * indicates a statistical difference (p = 0.009) from unsmoothed (0%) data for the group; ^ indicates a statistical difference (p = 0.0022) from unsmoothed (0%) data for the group; # indicates a statistical difference (p = 0.0043) from unsmoothed (0%) data for the group. Median and (range) for FA, pennation angle, number of fibers, and fiber length.
  12. South Range, MI Population Pyramid Dataset: Age Groups, Male and Female...

    • neilsberg.com
    csv, json
    Updated Sep 16, 2023
    Cite
    Neilsberg Research (2023). South Range, MI Population Pyramid Dataset: Age Groups, Male and Female Population, and Total Population for Demographics Analysis [Dataset]. https://www.neilsberg.com/research/datasets/63632866-3d85-11ee-9abe-0aa64bf2eeb2/
    Explore at:
    Available download formats: json, csv
    Dataset updated
    Sep 16, 2023
    Dataset authored and provided by
    Neilsberg Research
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Michigan, South Range
    Variables measured
    Male and Female Population Under 5 Years, Male and Female Population over 85 years, Male and Female Total Population for Age Groups, Male and Female Population Between 5 and 9 years, Male and Female Population Between 10 and 14 years, Male and Female Population Between 15 and 19 years, Male and Female Population Between 20 and 24 years, Male and Female Population Between 25 and 29 years, Male and Female Population Between 30 and 34 years, Male and Female Population Between 35 and 39 years, and 9 more
    Measurement technique
    The data presented in this dataset is derived from the latest U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates. To measure the three variables, namely (a) male population, (b) female population, and (c) total population, we first analyzed and categorized the data for each of the age groups. We divided ages between 0 and 85 into buckets of roughly 5 years each; for ages over 85, we aggregated the data into a single group. For further information regarding these estimates, please feel free to reach out to us via email at research@neilsberg.com.
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset tabulates the data for the South Range, MI population pyramid, which represents the South Range population distribution across age and gender, using estimates from the U.S. Census Bureau American Community Survey 5-Year estimates. It lists the male and female population for each age group, along with the total population for those age groups. Higher numbers at the bottom of the table suggest population growth, whereas higher numbers at the top indicate declining birth rates. Furthermore, the dataset can be utilized to understand the youth dependency ratio, old-age dependency ratio, total dependency ratio, and potential support ratio.

    Key observations

    • Youth dependency ratio, which is the number of children aged 0-14 per 100 persons aged 15-64, for South Range, MI, is 16.9.
    • Old-age dependency ratio, which is the number of persons aged 65 or over per 100 persons aged 15-64, for South Range, MI, is 24.6.
    • Total dependency ratio for South Range, MI is 41.5.
    • Potential support ratio, which is the number of working-age persons (aged 15-64) per elderly person (aged 65 or over), for South Range, MI, is 4.1.
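
    The four ratios listed above follow from three population counts. The Python sketch below shows the arithmetic; the function name `dependency_ratios` is hypothetical, and the input counts are invented solely so that the output reproduces the figures listed for South Range. They are not the actual ACS counts.

```python
def dependency_ratios(children, working_age, elderly):
    """Compute dependency ratios from counts of persons aged 0-14, 15-64, and 65+."""
    youth = 100 * children / working_age    # children per 100 working-age persons
    old_age = 100 * elderly / working_age   # elderly per 100 working-age persons
    return {
        "youth": round(youth, 1),
        "old_age": round(old_age, 1),
        "total": round(youth + old_age, 1),
        # working-age persons per elderly person
        "potential_support": round(working_age / elderly, 1),
    }


# Hypothetical counts, chosen only to match the ratios quoted in the list.
ratios = dependency_ratios(children=169, working_age=1000, elderly=246)
```

    With these inputs the total dependency ratio is the sum of the youth and old-age ratios (16.9 + 24.6 = 41.5), and the potential support ratio is 1000 / 246, or about 4.1.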
    Content

    When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates.

    Age groups:

    • Under 5 years
    • 5 to 9 years
    • 10 to 14 years
    • 15 to 19 years
    • 20 to 24 years
    • 25 to 29 years
    • 30 to 34 years
    • 35 to 39 years
    • 40 to 44 years
    • 45 to 49 years
    • 50 to 54 years
    • 55 to 59 years
    • 60 to 64 years
    • 65 to 69 years
    • 70 to 74 years
    • 75 to 79 years
    • 80 to 84 years
    • 85 years and over
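
    The 18 age groups above amount to 5-year buckets with a single open-ended group at the top. A minimal Python sketch of that bucketing follows; the function name `age_group` is hypothetical, and the sketch assumes ages are given in whole years.

```python
def age_group(age):
    """Return the age group label for an age in whole years."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age < 5:
        return "Under 5 years"
    if age >= 85:
        return "85 years and over"
    low = (age // 5) * 5  # lower bound of the 5-year bucket
    return f"{low} to {low + 4} years"


# One representative age per bucket yields every label exactly once.
labels = [age_group(age) for age in range(0, 90, 5)]
```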

    Variables / Data Columns

    • Age Group: The age group for the South Range population analysis. There are 18 expected values, defined above in the age groups section.
    • Population (Male): The male population of South Range for the selected age group.
    • Population (Female): The female population of South Range for the selected age group.
    • Total Population: The total population of South Range for the selected age group.

    Good to know

    Margin of Error

    Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

    If you need custom data for a research project, report, or presentation, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

    The Neilsberg Research team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

    Recommended for further research

    This dataset is a part of the main dataset for South Range Population by Age.

  13. Advanced-Math

    • huggingface.co
    Updated Apr 29, 2025
    Cite
    haijian (2025). Advanced-Math [Dataset]. https://huggingface.co/datasets/haijian06/Advanced-Math
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 29, 2025
    Authors
    haijian
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

      Advanced-Math Dataset
    

    This Advanced-Math dataset is designed to support advanced studies and research in various mathematical fields. It encompasses a wide range of topics, including:

    • Calculus
    • Linear Algebra
    • Probability
    • Machine Learning
    • Deep Learning

    The dataset primarily focuses on computational problems, which constitute over 80% of the content. Additionally, it includes related logical concept questions to provide a… See the full description on the dataset page: https://huggingface.co/datasets/haijian06/Advanced-Math.

  14. PA Child Care Workforce Age Range Current County Human Services

    • data.pa.gov
    csv, xlsx, xml
    Updated Jul 22, 2025
    Cite
    Office of Child Development and Early Learning (2025). PA Child Care Workforce Age Range Current County Human Services [Dataset]. https://data.pa.gov/K-12-Education/PA-Child-Care-Workforce-Age-Range-Current-County-H/sv3j-z3tn
    Explore at:
    Available download formats: xml, xlsx, csv
    Dataset updated
    Jul 22, 2025
    Dataset authored and provided by
    Office of Child Development and Early Learning
    License

    https://www.usa.gov/government-works

    Area covered
    Pennsylvania
    Description

    This data set shows the number of individuals in the Pennsylvania child care workforce within specific age ranges as reported in the Professional Development (PD) Registry. Individual ages are calculated based on the birthdate entered in the PD Registry. If the birthdate is blank, data will show as “unknown.” Data is included only for individuals working in family child care, group child care, and center child care. Data is current as of the last day of the quarter prior to the posted report. This report will be updated twice a year. To protect the confidentiality of participants in OCDEL’s programs, it is necessary to limit the amount of data that is available, even in aggregate form. Specifically, counts of 50 or fewer have been suppressed to protect the confidentiality of individuals (a number is not displayed when the count of individuals is less than 51). DISCLAIMER: OCDEL is not representing that this information is current or accurate beyond the day it was posted. OCDEL shall not be held liable for any improper or incorrect use of the information described and/or contained herein and assumes no responsibility for anyone's use of the information.
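
    The suppression rule in this description (counts below 51 are not displayed) can be expressed as a small Python sketch. The names `SUPPRESSION_THRESHOLD` and `suppress`, the age-range labels, and the counts are all hypothetical; how OCDEL actually implements the rule is not stated in the source.

```python
SUPPRESSION_THRESHOLD = 50  # counts at or below this value are not displayed


def suppress(count):
    """Return the count, or None when it is too small to publish."""
    return None if count <= SUPPRESSION_THRESHOLD else count


# Hypothetical counts per age range (labels and values invented).
counts = {"Under 25": 37, "25-34": 412, "unknown": 51}
published = {age_range: suppress(n) for age_range, n in counts.items()}
```

    Here a suppressed value is represented as `None`; a published file might instead leave the cell blank.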

  15. TIGER/Line Shapefile, 2023, County, Scott County, MN, Address Ranges...

    • catalog.data.gov
    • datasets.ai
    • +1 more
    Updated Aug 10, 2025
    Cite
    U.S. Department of Commerce, U.S. Census Bureau, Geography Division, Geospatial Products Branch (Point of Contact) (2025). TIGER/Line Shapefile, 2023, County, Scott County, MN, Address Ranges Relationship File [Dataset]. https://catalog.data.gov/dataset/tiger-line-shapefile-2023-county-scott-county-mn-address-ranges-relationship-file
    Dataset updated
    Aug 10, 2025
    Dataset provided by
    United States Census Bureau: http://census.gov/
    Area covered
    Scott County
    Description

    The TIGER/Line shapefiles and related database files (.dbf) are an extract of selected geographic and cartographic information from the U.S. Census Bureau's Master Address File / Topologically Integrated Geographic Encoding and Referencing (MAF/TIGER) Database (MTDB). The MTDB represents a seamless national file with no overlaps or gaps between parts; however, each TIGER/Line shapefile is designed to stand alone as an independent data set, or they can be combined to cover the entire nation. The Address Ranges Relationship File (ADDR.dbf) contains the attributes of each address range. Each address range applies to a single edge and has a unique address range identifier (ARID) value. The edge to which an address range applies can be determined by linking the address range to the All Lines Shapefile (EDGES.shp) using the permanent topological edge identifier (TLID) attribute. Multiple address ranges can apply to the same edge since an edge can have multiple address ranges. Note that the most inclusive address range associated with each side of a street edge already appears in the All Lines Shapefile (EDGES.shp). The TIGER/Line Files contain potential address ranges, not individual addresses. The term "address range" refers to the collection of all possible structure numbers from the first structure number to the last structure number and all numbers of a specified parity in between along an edge side relative to the direction in which the edge is coded. The address ranges in the TIGER/Line Files are potential ranges that include the full range of possible structure numbers even though the actual structures may not exist.

  16. 400k Augmented MNIST: Extended Handwritten Digits

    • kaggle.com
    zip
    Updated Mar 26, 2025
    Cite
    Alexandre Le Mercier (2025). 400k Augmented MNIST: Extended Handwritten Digits [Dataset]. https://www.kaggle.com/datasets/alexandrelemercier/400k-augmented-mnist-extended-handwritten-digits
    Explore at:
    Available download formats: zip (359213486 bytes)
    Dataset updated
    Mar 26, 2025
    Authors
    Alexandre Le Mercier
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Overview

    The 400k Augmented MNIST dataset is an extended version of the classic MNIST handwritten digits dataset. By applying a variety of augmentation techniques, I have increased the number of training images to 400,000 - roughly 40,000 per digit label. This large and diverse training set is designed to significantly improve the robustness and generalization of models trained on it, making them less susceptible to overfitting and more resilient against adversarial perturbations.

    Dataset Structure

    The dataset is organized into two main directories:

    • Augmented MNIST Training Set (400k):
      This directory contains 10 subdirectories, one for each digit label ("Label 0" through "Label 9"). Each subdirectory holds the corresponding JPEG images generated via augmentation. These images have been produced using techniques such as random rotation, shear, translation, scaling, reflection, spatial padding, Ben Graham transformation, Gaussian noise, salt-and-pepper noise, and random text overlay.
    • MNIST Validation Set (4k):
      This directory also contains subdirectories "Label 0" to "Label 9". However, the validation set consists solely of the original MNIST images (approximately 400 per label) that were not used for augmentation. This allows you to evaluate model performance on natural, unaltered digit images, providing a clear benchmark for generalization.

    How to Use This Dataset

    1. Training:
      Use the augmented training set to train your deep learning models. The 400k images offer a wide variety of conditions, helping your model learn robust features that generalize well.
    2. Validation:
      Evaluate your models on the validation set, which contains only the original MNIST images. This will help you measure performance on “natural” digits, ensuring that improvements in robustness do not come at the expense of real-world accuracy.
    3. Flexibility:
      You can also experiment with mixed training (combining augmented and original images) to study how different training strategies affect model robustness and accuracy.

    Augmentation Techniques Applied

    The following augmentation functions were used to generate the extended dataset:

    • Random Rotation: Randomly rotates images within a specified angle range.
    • Random Shear: Applies slight shearing transformations.
    • Random Translation: Shifts images horizontally and vertically.
    • Random Scale: Zooms in or out on the images.
    • Ben Graham Transform: Enhances image contrast and clarity using a weighted Gaussian blur.
    • Random Gaussian Noise: Adds Gaussian noise to simulate sensor or environmental disturbances.
    • Random Salt-and-Pepper Noise: Introduces random pixel-level corruption.

    A random number of transformations (between 1 and 6, in a random order) is applied to each image, with the goal of creating a diverse and challenging training set.
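
    The selection step described above (between 1 and 6 transformations, applied in a random order) can be sketched in Python as follows. This is not the dataset author's code: the transform functions are identity stand-ins named after the list above, and drawing distinct transformations without repetition is an assumption, since the description does not say whether a transformation can be applied twice.

```python
import random

# Names follow the list above; the functions themselves are placeholders.
TRANSFORM_NAMES = [
    "random_rotation", "random_shear", "random_translation", "random_scale",
    "ben_graham_transform", "gaussian_noise", "salt_and_pepper_noise",
]


def identity_transform(image):
    """Placeholder: a real pipeline would modify the pixels here."""
    return image


TRANSFORMS = {name: identity_transform for name in TRANSFORM_NAMES}


def augment(image, rng):
    """Apply between 1 and 6 distinct transforms in a random order."""
    count = rng.randint(1, 6)                    # inclusive on both ends
    chosen = rng.sample(TRANSFORM_NAMES, count)  # distinct names, random order
    for name in chosen:
        image = TRANSFORMS[name](image)
    return image, chosen


rng = random.Random(0)  # seeded only to make the example repeatable
image = [[0.0] * 28 for _ in range(28)]  # blank 28x28 stand-in for a digit
augmented, applied = augment(image, rng)
```

    Returning the list of applied names alongside the image makes it possible to record which transformations produced each augmented sample.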

    Citation

    If you use this dataset in your research, please cite it as follows:

    @misc{alexandre_le_mercier_2025,
      title={400k Augmented MNIST: Extended Handwritten Digits},
      url={https://www.kaggle.com/ds/6967763},
      DOI={10.34740/KAGGLE/DS/6967763},
      publisher={Kaggle},
      author={Alexandre Le Mercier},
      year={2025}
    }
    

    License

    This dataset is under the Apache 2.0 license.

    Contact

    For any questions or issues regarding this dataset, please send a message in the "Discussions" or "Suggestions" sections of the Kaggle dataset page.

    Good luck and happy coding! 🚀

  17. TIGER/Line Shapefile, 2023, County, Tulsa County, OK, Address Range-Feature

    • catalog.data.gov
    Updated Aug 10, 2025
    Cite
    U.S. Department of Commerce, U.S. Census Bureau, Geography Division, Geospatial Products Branch (Point of Contact) (2025). TIGER/Line Shapefile, 2023, County, Tulsa County, OK, Address Range-Feature [Dataset]. https://catalog.data.gov/dataset/tiger-line-shapefile-2023-county-tulsa-county-ok-address-range-feature
    Dataset updated
    Aug 10, 2025
    Dataset provided by
    United States Census Bureau: http://census.gov/
    Area covered
    Tulsa County, Oklahoma
    Description

    The TIGER/Line shapefiles and related database files (.dbf) are an extract of selected geographic and cartographic information from the U.S. Census Bureau's Master Address File / Topologically Integrated Geographic Encoding and Referencing (MAF/TIGER) Database (MTDB). The MTDB represents a seamless national file with no overlaps or gaps between parts; however, each TIGER/Line shapefile is designed to stand alone as an independent data set, or they can be combined to cover the entire nation. The Address Ranges Feature Shapefile (ADDRFEAT.dbf) contains the geospatial edge geometry and attributes of all unsuppressed address ranges for a county or county equivalent area. The term "address range" refers to the collection of all possible structure numbers from the first structure number to the last structure number and all numbers of a specified parity in between along an edge side relative to the direction in which the edge is coded. Single-address address ranges have been suppressed to maintain the confidentiality of the addresses they describe. Multiple coincident address range feature edge records are represented in the shapefile if more than one left or right address range is associated with the edge. The ADDRFEAT shapefile contains a record for each address range to street name combination. Address ranges associated with more than one street name are also represented by multiple coincident address range feature edge records. Note that the ADDRFEAT shapefile includes all unsuppressed address ranges, whereas the All Lines Shapefile (EDGES.shp) includes only the most inclusive address range associated with each side of a street edge. The TIGER/Line shapefiles contain potential address ranges, not individual addresses. The address ranges in the TIGER/Line Files are potential ranges that include the full range of possible structure numbers even though the actual structures may not exist.

  18. Grass Range, MT Population Pyramid Dataset: Age Groups, Male and Female...

    • neilsberg.com
    csv, json
    Updated Sep 16, 2023
    Cite
    Neilsberg Research (2023). Grass Range, MT Population Pyramid Dataset: Age Groups, Male and Female Population, and Total Population for Demographics Analysis [Dataset]. https://www.neilsberg.com/research/datasets/6282e8bf-3d85-11ee-9abe-0aa64bf2eeb2/
    Explore at:
    Available download formats: csv, json
    Dataset updated
    Sep 16, 2023
    Dataset authored and provided by
    Neilsberg Research
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Grass Range, Montana
    Variables measured
    Male and Female Population Under 5 Years, Male and Female Population over 85 years, Male and Female Total Population for Age Groups, Male and Female Population Between 5 and 9 years, Male and Female Population Between 10 and 14 years, Male and Female Population Between 15 and 19 years, Male and Female Population Between 20 and 24 years, Male and Female Population Between 25 and 29 years, Male and Female Population Between 30 and 34 years, Male and Female Population Between 35 and 39 years, and 9 more
    Measurement technique
    The data presented in this dataset is derived from the latest U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates. To measure the three variables, namely (a) male population, (b) female population, and (c) total population, we first analyzed and categorized the data for each of the age groups. We divided ages between 0 and 85 into buckets of roughly 5 years each; for ages over 85, we aggregated the data into a single group. For further information regarding these estimates, please feel free to reach out to us via email at research@neilsberg.com.
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset tabulates the data for the Grass Range, MT population pyramid, which represents the Grass Range population distribution across age and gender, using estimates from the U.S. Census Bureau American Community Survey 5-Year estimates. It lists the male and female population for each age group, along with the total population for those age groups. Higher numbers at the bottom of the table suggest population growth, whereas higher numbers at the top indicate declining birth rates. Furthermore, the dataset can be utilized to understand the youth dependency ratio, old-age dependency ratio, total dependency ratio, and potential support ratio.

    Key observations

    • Youth dependency ratio, which is the number of children aged 0-14 per 100 persons aged 15-64, for Grass Range, MT, is 18.6.
    • Old-age dependency ratio, which is the number of persons aged 65 or over per 100 persons aged 15-64, for Grass Range, MT, is 146.5.
    • Total dependency ratio for Grass Range, MT is 165.1.
    • Potential support ratio, which is the number of working-age persons (aged 15-64) per elderly person (aged 65 or over), for Grass Range, MT, is 0.7.
    Content

    When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates.

    Age groups:

    • Under 5 years
    • 5 to 9 years
    • 10 to 14 years
    • 15 to 19 years
    • 20 to 24 years
    • 25 to 29 years
    • 30 to 34 years
    • 35 to 39 years
    • 40 to 44 years
    • 45 to 49 years
    • 50 to 54 years
    • 55 to 59 years
    • 60 to 64 years
    • 65 to 69 years
    • 70 to 74 years
    • 75 to 79 years
    • 80 to 84 years
    • 85 years and over

    Variables / Data Columns

    • Age Group: The age group for the Grass Range population analysis. There are 18 expected values, defined above in the age groups section.
    • Population (Male): The male population of Grass Range for the selected age group.
    • Population (Female): The female population of Grass Range for the selected age group.
    • Total Population: The total population of Grass Range for the selected age group.

    Good to know

    Margin of Error

    Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

    If you need custom data for a research project, report, or presentation, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

    The Neilsberg Research team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

    Recommended for further research

    This dataset is a part of the main dataset for Grass Range Population by Age.

  19. Simulation Data Set

    • catalog.data.gov
    • s.cnmilf.com
    Updated Nov 12, 2020
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Simulation Data Set [Dataset]. https://catalog.data.gov/dataset/simulation-data-set
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agency: http://www.epa.gov/
    Description

    These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way while the medians and IQRs are not given. This further protects identifiability of the spatial locations used in the analysis.

    This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. It can be accessed through the following means: File format: R workspace file; “Simulated_Dataset.RData”.

    Metadata (including data dictionary)

    • y: Vector of binary responses (1: adverse outcome, 0: control)
    • x: Matrix of covariates; one row for each simulated individual
    • z: Matrix of standardized pollution exposures
    • n: Number of simulated individuals
    • m: Number of exposure time periods (e.g., weeks of pregnancy)
    • p: Number of columns in the covariate design matrix
    • alpha_true: Vector of “true” critical window locations/magnitudes (i.e., the ground truth that we want to estimate)

    Code Abstract

    We provide R statistical software code (“CWVS_LMC.txt”) to fit the linear model of coregionalization (LMC) version of the Critical Window Variable Selection (CWVS) method developed in the manuscript. We also provide R code (“Results_Summary.txt”) to summarize/plot the estimated critical windows and posterior marginal inclusion probabilities.

    Description

    “CWVS_LMC.txt”: This code is delivered to the user in the form of a .txt file that contains R statistical software code. Once the “Simulated_Dataset.RData” workspace has been loaded into R, the text in the file can be used to identify/estimate critical windows of susceptibility and posterior marginal inclusion probabilities.

    “Results_Summary.txt”: This code is also delivered to the user in the form of a .txt file that contains R statistical software code. Once the “CWVS_LMC.txt” code is applied to the simulated dataset and the program has completed, this code can be used to summarize and plot the identified/estimated critical windows and posterior marginal inclusion probabilities (similar to the plots shown in the manuscript).

    Optional Information (complete as necessary)

    Required R packages:

    • For running “CWVS_LMC.txt”:
      • msm: Sampling from the truncated normal distribution
      • mnormt: Sampling from the multivariate normal distribution
      • BayesLogit: Sampling from the Polya-Gamma distribution
    • For running “Results_Summary.txt”:
      • plotrix: Plotting the posterior means and credible intervals

    Instructions for Use

    Reproducibility (Mandatory)

    What can be reproduced: The data and code can be used to identify/estimate critical windows from one of the actual simulated datasets generated under setting E4 from the presented simulation study.

    How to use the information:

    • Load the “Simulated_Dataset.RData” workspace
    • Run the code contained in “CWVS_LMC.txt”
    • Once the “CWVS_LMC.txt” code is complete, run “Results_Summary.txt”.

    Format: Below is the replication procedure for the attached data set for the portion of the analyses using a simulated data set:

    Data

    The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining confidentiality of any actual pregnant women.

    Availability

    Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we will make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This will also allow the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics, and requires an appropriate data use agreement.

    Description

    Permissions: These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way while the medians and IQRs are not given. This further protects identifiability of the spatial locations used in the analysis.

    This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, OXFORD, UK, 1-30, (2019).
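
    The weekly standardization described in this entry (subtract the week's median exposure, then divide by the week's interquartile range) can be sketched in Python. The authors' own code is in R and is not reproduced here; the function name `standardize_week`, the sample exposures, and the choice of the "inclusive" quartile convention are all assumptions made for illustration.

```python
import statistics


def standardize_week(exposures):
    """Center one week's exposures on the median and scale by the IQR."""
    median = statistics.median(exposures)
    # Quartile convention is an assumption; other conventions give other IQRs.
    q1, _, q3 = statistics.quantiles(exposures, n=4, method="inclusive")
    iqr = q3 - q1
    return [(value - median) / iqr for value in exposures]


# Hypothetical exposures for a single week across five individuals.
week = [1.0, 2.0, 3.0, 4.0, 5.0]
standardized = standardize_week(week)
```

    For this sample week the median is 3.0 and the IQR is 2.0, so the standardized values run from -1.0 to 1.0. Because only the standardized values are released, the original medians and IQRs cannot be read back from the data.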

  20. New York State Math Test Results (2006-2012)

    • kaggle.com
    zip
    Updated May 21, 2024
    Cite
    Hassan (2024). New York State Math Test Results (2006-2012) [Dataset]. https://www.kaggle.com/datasets/msjahid/new-york-state-math-test-results-2006-2012
    Explore at:
    Available download formats: zip (6818531 bytes)
    Dataset updated
    May 21, 2024
    Authors
    Hassan
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Area covered
    New York
    Description

    New York State Math Test Results


    This dataset presents detailed information on math examination results administered in New York State from 2006 to 2012. It includes the following categories:

    • Report Category: The high-level grouping for each report.
    • Geographic Subdivision: The individual school or geographic subregion within each of the main categories.
    • Grade: The school grade for which the test was administered.
    • Year: The year the test was administered.
    • Student Category: Reflects the set of students who were tested.
    • Number Tested: Total number of students tested.
    • Mean Scale Score: Average scale score of all students tested.
    • Num Level 1: Number of students who scored in Level 1 range.
    • Pct Level 1: Percentage of students who scored in Level 1 range.
    • Num Level 2: Number of students who scored in Level 2 range.
    • Pct Level 2: Percentage of students who scored in Level 2 range.
    • Num Level 3: Number of students who scored in Level 3 range.
    • Pct Level 3: Percentage of students who scored in Level 3 range.
    • Num Level 4: Number of students who scored in Level 4 range.
    • Pct Level 4: Percentage of students who scored in Level 4 range.
    • Num Level 3 and 4: Number of students who scored in Level 3 and 4 range combined.
    • Pct Level 3 and 4: Percentage of students who scored in Level 3 and 4 range.
Cite
The Devastator (2023). Math Formula Retrieval [Dataset]. https://www.kaggle.com/datasets/thedevastator/math-formula-pair-classification-dataset/data

Math Formula Retrieval

Math Formula Pair Classification Dataset

Explore at:
25 scholarly articles cite this dataset (View in Google Scholar)
Available download formats: zip (2021716728 bytes)
Dataset updated
Dec 2, 2023
Authors
The Devastator
License

https://creativecommons.org/publicdomain/zero/1.0/

Description

Math Formula Retrieval

Math Formula Pair Classification Dataset

By ddrg (From Huggingface) [source]

About this dataset

The dataset is described as having six columns, though only three distinct names are listed, each appearing twice: formula1, formula2, and label (binary format). Together these columns provide the information needed for analysis and evaluation.

The train.csv file contains a subset of the dataset specifically curated for training purposes. It includes an extensive range of math formula pairs along with their corresponding labels and unique ID names. This allows researchers and data scientists to construct models that can predict whether two given formulas fall within the same category or not.

On the other hand, test.csv serves as an evaluation set. It consists of additional pairs of math formulas accompanied by their respective labels and unique IDs. By evaluating model performance on this test set after training it on train.csv data, researchers can assess how well their models generalize to unseen instances.

Using this dataset, researchers can explore mathematics-related applications such as developing pattern recognition algorithms or enhancing educational tools that automatically identify and categorize mathematical formulas.

How to use the dataset

Introduction

Dataset Description

train.csv

The train.csv file contains a set of labeled math formula pairs along with their corresponding labels and formula name IDs. It consists of the following columns:

  • formula1: The first mathematical formula in the pair (text).
  • formula2: The second mathematical formula in the pair (text).
  • label: The classification label indicating whether the pair of formulas belong to the same category or not (binary). A label value of 1 indicates that both formulas belong to the same category, while a label value of 0 indicates different categories.

test.csv

The purpose of the test.csv file is to provide a set of formula pairs along with their labels and formula name IDs for testing and evaluation purposes. It has an identical structure to train.csv, containing columns like formula1, formula2, label, etc.
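
Evaluating on test.csv comes down to comparing predicted labels against the file's label column. A minimal sketch of the usual binary classification metrics is given below, using only plain Python. The function name binary_metrics and the example label lists are hypothetical and are not taken from the dataset.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for 0/1 labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    total = len(pairs)
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical labels and predictions, for illustration only.
example_true = [1, 0, 1, 1, 0]
example_pred = [1, 0, 0, 1, 1]
print(binary_metrics(example_true, example_pred))
```

Reporting precision and recall alongside accuracy is useful when the labels turn out to be imbalanced, since accuracy alone can look high for a model that always predicts the majority class.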

Task

The main task using this dataset is binary classification, where your objective is to predict whether two mathematical formulas belong to the same category or not based on their textual representation. You can use various machine learning algorithms such as logistic regression, decision trees, random forests, or neural networks for training models on this dataset.
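
As one possible starting point for the task above, the sketch below fits a small logistic regression on two hand-picked text features of each pair: character trigram overlap and length ratio. It is a dependency-free illustration, not a method recommended by the dataset authors. The feature choices, function names, hyperparameters, and toy formula pairs are all assumptions, and surface overlap may be a weak signal when two formulas in the same category use different notation.

```python
import math


def char_ngrams(text, n=3):
    """Return the set of character n-grams of a whitespace-stripped string."""
    s = "".join(text.split())
    if len(s) < n:
        return {s} if s else set()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def pair_features(f1, f2):
    """Two illustrative features: trigram Jaccard overlap and length ratio."""
    a, b = char_ngrams(f1), char_ngrams(f2)
    union = a | b
    jaccard = len(a & b) / len(union) if union else 0.0
    longest = max(len(f1), len(f2))
    ratio = min(len(f1), len(f2)) / longest if longest else 0.0
    return [jaccard, ratio]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, b, x):
    """Probability that a pair with feature vector x has label 1."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_logreg(X, y, lr=1.0, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Toy pairs written for illustration only (not rows from train.csv).
TOY_PAIRS = [
    ("a^2 + b^2 = c^2", "c^2 = a^2 + b^2", 1),
    ("E = mc^2", "E = m c^2", 1),
    ("a^2 + b^2 = c^2", r"\int_0^1 x dx", 0),
    ("E = mc^2", r"\sin^2 x + \cos^2 x = 1", 0),
]

X = [pair_features(f1, f2) for f1, f2, _ in TOY_PAIRS]
y = [label for _, _, label in TOY_PAIRS]
w, b = train_logreg(X, y)
for (f1, f2, label), x in zip(TOY_PAIRS, X):
    print(label, round(predict_proba(w, b, x), 3), f1, "|", f2)
```

On the real data, the same feature function would be applied to the formula1 and formula2 columns, and a library implementation of logistic regression or another of the listed algorithms would likely be preferable to this hand-written loop.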

Exploring & Analyzing Data

Before building your model, it's crucial to explore and analyze your data. Here are some steps you can take:

  • Load both CSV files (train.csv and test.csv) into your preferred data analysis framework or programming language (e.g., Python with libraries like pandas).
  • Examine the dataset's structure, including the number of rows, columns, and data types.
  • Check for missing values in the dataset and handle them accordingly.
  • Visualize the distribution of labels to understand whether it is balanced or imbalanced.
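
The exploration steps above can be sketched with Python's standard csv module, used here in place of pandas only to keep the example dependency-free. The column names follow the description of train.csv. The helper names and the inline sample rows are hypothetical, and the real files may contain additional columns such as the formula name IDs.

```python
import csv
import io
from collections import Counter

EXPECTED_COLUMNS = ("formula1", "formula2", "label")


def load_pairs(handle):
    """Read formula pairs from an open CSV file into a list of dicts."""
    return list(csv.DictReader(handle))


def summarize(rows):
    """Report row count, column names, missing values, and label counts."""
    columns = list(rows[0].keys()) if rows else []
    missing = Counter()
    for row in rows:
        for col in EXPECTED_COLUMNS:
            value = row.get(col)
            if value is None or value.strip() == "":
                missing[col] += 1
    labels = Counter(row.get("label") for row in rows)
    return {
        "n_rows": len(rows),
        "columns": columns,
        "missing": dict(missing),
        "label_counts": dict(labels),
    }


# Inline sample standing in for train.csv (illustrative rows only).
SAMPLE_CSV = r"""formula1,formula2,label
a^2 + b^2 = c^2,c^2 = a^2 + b^2,1
E = mc^2,\int_0^1 x dx,0
x + y = y + x,,1
"""

rows = load_pairs(io.StringIO(SAMPLE_CSV))
summary = summarize(rows)
print(summary)

# With the downloaded file, the same helpers would be used as:
#   with open("train.csv", newline="", encoding="utf-8") as f:
#       rows = load_pairs(f)
```

The label_counts entry gives a quick view of class balance before any plotting, and the missing entry shows which columns need handling before model building.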

Model Building

Once you have analyzed and preprocessed your dataset, you can start building your classification model using various machine learning algorithms:

  • Split your train.csv data into training and validation sets for model evaluation during training.
  • Choose a suitable
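
The first step above, splitting train.csv into training and validation sets, could be sketched as follows. The split is stratified by label so that both parts keep roughly the same class proportions. This is an assumption about what is wanted, since the source does not specify a split strategy, and the function name, validation fraction, and seed are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key="label", val_fraction=0.2, seed=0):
    """Split rows into (train, val), preserving label proportions."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    train, val = [], []
    for group in by_label.values():
        group = group[:]  # copy so the caller's list is not reordered
        rng.shuffle(group)
        n_val = round(len(group) * val_fraction)
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    rng.shuffle(train)
    rng.shuffle(val)
    return train, val


# Toy rows for illustration: 80 pairs labeled 0 and 20 labeled 1.
toy_rows = [{"id": i, "label": 0} for i in range(80)]
toy_rows += [{"id": 80 + i, "label": 1} for i in range(20)]

train_rows, val_rows = stratified_split(toy_rows)
print(len(train_rows), len(val_rows))
```

The validation part is then used to compare candidate models during training, while test.csv stays untouched until the final evaluation.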

Research Ideas

  • Math Formula Similarity: This dataset can be used to develop a model that classifies whether two mathematical formulas are similar or not. This can be useful in various applications such as plagiarism detection, identifying duplicate formulas in databases, or suggesting similar formulas based on user input.
  • Formula Categorization: The dataset can be used to train a model that categorizes mathematical formulas into different classes or categories. For example, the model can classify formulas into algebraic expressions, trigonometric equations, calculus problems, or geometric theorems. This categorization can help organize and search through large collections of mathematical formulas.
  • Formula Recommendation: Using this dataset, one could build a recommendation system that suggests related math formulas based on user input. By analyzing the similarities between different formula pairs and their corresponding labels, the system could recommend relevant mathematical concepts that users may need while solving problems or studying specific topics in mathematics.

Acknowle...
