License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
By ddrg (From Huggingface) [source]
Each of the dataset's files provides three columns, formula1, formula2, and label (in binary format), which together supply all the information needed for comprehensive analysis and evaluation.
The train.csv file contains a subset of the dataset specifically curated for training purposes. It includes an extensive range of math formula pairs along with their corresponding labels and unique ID names. This allows researchers and data scientists to construct models that can predict whether two given formulas fall within the same category or not.
On the other hand, test.csv serves as an evaluation set. It consists of additional pairs of math formulas accompanied by their respective labels and unique IDs. By evaluating model performance on this test set after training it on train.csv data, researchers can assess how well their models generalize to unseen instances.
By leveraging this informative dataset, researchers can unlock new possibilities in mathematics-related fields, such as developing pattern recognition algorithms or enhancing educational tools that involve automatic identification and categorization of mathematical formulas.
Introduction
Dataset Description
train.csv
The train.csv file contains a set of labeled math formula pairs along with their corresponding labels and formula name IDs. It consists of the following columns:
- formula1: The first mathematical formula in the pair (text).
- formula2: The second mathematical formula in the pair (text).
- label: The classification label indicating whether the pair of formulas belong to the same category (binary). A label value of 1 indicates that both formulas belong to the same category, while a label value of 0 indicates different categories.

test.csv
The purpose of the test.csv file is to provide a set of formula pairs along with their labels and formula name IDs for testing and evaluation purposes. It has an identical structure to train.csv, containing the columns formula1, formula2, and label.

Task
The main task using this dataset is binary classification, where your objective is to predict whether two mathematical formulas belong to the same category or not based on their textual representation. You can use various machine learning algorithms such as logistic regression, decision trees, random forests, or neural networks for training models on this dataset.
Exploring & Analyzing Data
Before building your model, it's crucial to explore and analyze your data. Here are some steps you can take:
- Load both CSV files (train.csv and test.csv) into your preferred data analysis framework or programming language (e.g., Python with libraries like pandas).
- Examine the dataset's structure, including the number of rows, columns, and data types.
- Check for missing values in the dataset and handle them accordingly.
- Visualize the distribution of labels to understand whether it is balanced or imbalanced.
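A minimal sketch of these steps, assuming pandas and the file layout described above:

```python
import pandas as pd

# Load both splits (file names as documented above).
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# Structure: row/column counts and data types.
print(train.shape, test.shape)
print(train.dtypes)

# Missing values per column.
print(train.isna().sum())

# Label balance: a strong skew would call for stratified splits or class weights.
print(train['label'].value_counts(normalize=True))
```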
Model Building
Once you have analyzed and preprocessed your dataset, you can start building your classification model using various machine learning algorithms:
- Split your train.csv data into training and validation sets for model evaluation during training.
- Choose a suitable algorithm (e.g., one of those listed above), train it on the training split, and tune it against the validation split.
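As one hedged illustration of these steps (a sketch, not the only reasonable design), each pair can be joined into a single string and classified with character n-gram TF-IDF features and logistic regression; column names follow the documentation above:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

train = pd.read_csv('train.csv')

# Join each pair into one string; character n-grams handle formula text
# (operators, braces, LaTeX commands) better than word tokens.
pairs = train['formula1'] + ' [SEP] ' + train['formula2']

X_tr, X_val, y_tr, y_val = train_test_split(
    pairs, train['label'], test_size=0.2,
    stratify=train['label'], random_state=0)

model = make_pipeline(
    TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)
print('validation accuracy:', accuracy_score(y_val, model.predict(X_val)))
```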
- Math Formula Similarity: This dataset can be used to develop a model that classifies whether two mathematical formulas are similar or not. This can be useful in various applications such as plagiarism detection, identifying duplicate formulas in databases, or suggesting similar formulas based on user input.
- Formula Categorization: The dataset can be used to train a model that categorizes mathematical formulas into different classes or categories. For example, the model can classify formulas into algebraic expressions, trigonometric equations, calculus problems, or geometric theorems. This categorization can help organize and search through large collections of mathematical formulas.
- Formula Recommendation: Using this dataset, one could build a recommendation system that suggests related math formulas based on user input. By analyzing the similarities between different formula pairs and their corresponding labels, the system could provide recommendations for relevant mathematical concepts that users may need while solving problems or studying specific topics in mathematics.
The sample included in this dataset represents five children who participated in a number line intervention study. Originally six children were included in the study, but one of them fulfilled the criterion for exclusion after missing several consecutive sessions. Thus, their data is not included in the dataset.
All participants were currently attending Year 1 of primary school at an independent school in New South Wales, Australia. To be eligible to participate, children had to present with low mathematics achievement, performing at or below the 25th percentile in the Maths Problem Solving and/or Numerical Operations subtests from the Wechsler Individual Achievement Test III (WIAT III A & NZ, Wechsler, 2016). Participants were excluded if, as reported by their parents, they had any other diagnosed disorder such as attention deficit hyperactivity disorder, autism spectrum disorder, intellectual disability, developmental language disorder, cerebral palsy or uncorrected sensory disorders.
The study followed a multiple baseline case series design, with a baseline phase, a treatment phase, and a post-treatment phase. The baseline phase varied between two and three measurement points, the treatment phase varied between four and seven measurement points, and all participants had 1 post-treatment measurement point.
The number of measurement points was distributed across participants as follows:
Participant 1 – 3 baseline, 6 treatment, 1 post-treatment
Participant 3 – 2 baseline, 7 treatment, 1 post-treatment
Participant 5 – 2 baseline, 5 treatment, 1 post-treatment
Participant 6 – 3 baseline, 4 treatment, 1 post-treatment
Participant 7 – 2 baseline, 5 treatment, 1 post-treatment
In each session across all three phases children were assessed in their performance on a number line estimation task, a single-digit computation task, a multi-digit computation task, a dot comparison task and a number comparison task. Furthermore, during the treatment phase, all children completed the intervention task after these assessments. The order of the assessment tasks varied randomly between sessions.
Number Line Estimation. Children completed a computerised bounded number line task (0-100). The number line is presented in the middle of the screen, and the target number is presented above the start point of the number line to avoid signalling the midpoint (Dackermann et al., 2018). Target numbers included two non-overlapping sets (trained and untrained) of 30 items each. Untrained items were assessed on all phases of the study. Trained items were assessed independent of the intervention during baseline and post-treatment phases, and performance on the intervention is used to index performance on the trained set during the treatment phase. Within each set, numbers were equally distributed throughout the number range, with three items within each ten (0-10, 11-20, 21-30, etc.). Target numbers were presented in random order. Participants did not receive performance-based feedback. Accuracy is indexed by percent absolute error (PAE): PAE = (|estimated number - target number| / scale of the number line) x 100.
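For concreteness, a minimal computation of PAE (the function and variable names here are illustrative, not dataset columns):

```python
def percent_absolute_error(estimate, target, scale=100):
    """PAE on a bounded number line: |estimate - target| / scale * 100."""
    return abs(estimate - target) / scale * 100

# Placing the target 42 at position 48 on a 0-100 line gives a PAE of 6.0.
print(percent_absolute_error(48, 42))  # 6.0
```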
Single-Digit Computation. The task included ten additions with single-digit addends (1-9) and single-digit results (2-9). The order was counterbalanced so that half of the additions presented the lowest addend first (e.g., 3 + 5) and half presented the highest addend first (e.g., 6 + 3). This task also included ten subtractions with single-digit minuends (3-9), subtrahends (1-6) and differences (1-6). The items were presented horizontally on the screen accompanied by a sound, and participants were required to give a verbal response. Participants did not receive performance-based feedback. Performance on this task was indexed by item-based accuracy.
Multi-digit computational estimation. The task included eight additions and eight subtractions presented with double-digit numbers and three response options. None of the response options represent the correct result. Participants were asked to select the option that was closest to the correct result. In half of the items the calculation involved two double-digit numbers, and in the other half one double and one single digit number. The distance between the correct response option and the exact result of the calculation was two for half of the trials and three for the other half. The calculation was presented vertically on the screen with the three options shown below. The calculations remained on the screen until participants responded by clicking on one of the options on the screen. Participants did not receive performance-based feedback. Performance on this task is measured by item-based accuracy.
Dot Comparison and Number Comparison. Both tasks included the same 20 items, which were presented twice, counterbalancing left and right presentation. Magnitudes to be compared were between 5 and 99, with four items for each of the following ratios: .91, .83, .77, .71, .67. Both quantities were presented horizontally side by side, and participants were instructed to press one of two keys (F or J), as quickly as possible, to indicate the larger one. Items were presented in random order and participants did not receive performance-based feedback. In the non-symbolic comparison task (dot comparison) the two sets of dots remained on the screen for a maximum of two seconds (to prevent counting). Overall area and convex hull for both sets of dots were kept constant following Guillaume et al. (2020). In the symbolic comparison task (Arabic numbers), the numbers remained on the screen until a response was given. Performance on both tasks was indexed by accuracy.
During the intervention sessions, participants estimated the position of 30 Arabic numbers in a 0-100 bounded number line. As a form of feedback, within each item, the participants’ estimate remained visible, and the correct position of the target number appeared on the number line. When the estimate’s PAE was lower than 2.5, a message appeared on the screen that read “Excellent job”, when PAE was between 2.5 and 5 the message read “Well done, so close!” and when PAE was higher than 5 the message read “Good try!” Numbers were presented in random order.
Age = age in ‘years, months’ at the start of the study
Sex = female/male/non-binary or third gender/prefer not to say (as reported by parents)
Math_Problem_Solving_raw = Raw score on the Math Problem Solving subtest from the WIAT III (WIAT III A & NZ, Wechsler, 2016).
Math_Problem_Solving_Percentile = Percentile equivalent on the Math Problem Solving subtest from the WIAT III (WIAT III A & NZ, Wechsler, 2016).
Num_Ops_Raw = Raw score on the Numerical Operations subtest from the WIAT III (WIAT III A & NZ, Wechsler, 2016).
Num_Ops_Percentile = Percentile equivalent on the Numerical Operations subtest from the WIAT III (WIAT III A & NZ, Wechsler, 2016).
The remaining variables refer to participants’ performance on the study tasks. Each variable name is composed of three sections. The first one refers to the phase and session. For example, Base1 refers to the first measurement point of the baseline phase, Treat1 to the first measurement point of the treatment phase, and Post1 to the first measurement point of the post-treatment phase.
The second part of the variable name refers to the task, as follows:
DC = dot comparison
SDC = single-digit computation
NLE_UT = number line estimation (untrained set)
NLE_T = number line estimation (trained set)
CE = multidigit computational estimation
NC = number comparison
The final part of the variable name refers to the type of measure being used (i.e., acc = total correct responses and pae = percent absolute error).
Thus, variable Base2_NC_acc corresponds to accuracy on the number comparison task during the second measurement point of the baseline phase and Treat3_NLE_UT_pae refers to the percent absolute error on the untrained set of the number line task during the third session of the Treatment phase.
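A small sketch of how these variable names can be unpacked programmatically (task codes as defined above; the helper function is illustrative):

```python
TASKS = {
    'DC': 'dot comparison',
    'SDC': 'single-digit computation',
    'NLE_UT': 'number line estimation (untrained set)',
    'NLE_T': 'number line estimation (trained set)',
    'CE': 'multidigit computational estimation',
    'NC': 'number comparison',
}

def parse_variable(name):
    """Split e.g. 'Treat3_NLE_UT_pae' into phase/session, task, and measure."""
    phase_session, rest = name.split('_', 1)
    task_code, measure = rest.rsplit('_', 1)
    return phase_session, TASKS[task_code], measure

print(parse_variable('Base2_NC_acc'))
# ('Base2', 'number comparison', 'acc')
print(parse_variable('Treat3_NLE_UT_pae'))
# ('Treat3', 'number line estimation (untrained set)', 'pae')
```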
This dataset was created from images of hand signs; the hand landmarks extracted from each image were made into the attributes of the dataset. It contains all 21 landmarks, each with (x, y, z) coordinates, and 5 classes (1, 2, 3, 4, 5).
You can also add more classes to the dataset by running the following code; make sure to create an empty DataFrame (or append to the existing dataset) and to set the file paths correctly.
```python
import os

import cv2
import mediapipe as mp
import pandas as pd

mp_hands = mp.solutions.hands
mp_draw = mp.solutions.drawing_utils

rows = []
# static_image_mode=True because independent images, not video, are processed.
with mp_hands.Hands(static_image_mode=True, max_num_hands=1,
                    min_detection_confidence=0.8) as hands:
    for t in range(1, 6):  # class labels 1-5, one folder per class
        path = 'data/' + str(t) + '/'
        for i in os.listdir(path):
            image = cv2.imread(path + i)
            if image is None:
                continue
            # MediaPipe expects RGB input; OpenCV loads BGR.
            image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
            results = hands.process(image)
            if not results.multi_hand_landmarks:
                continue
            for hand_landmarks in results.multi_hand_landmarks:
                mp_draw.draw_landmarks(image=image, landmark_list=hand_landmarks,
                                       connections=mp_hands.HAND_CONNECTIONS)
                a = {'label': t}
                for j in range(21):
                    lm = hand_landmarks.landmark[j]
                    name = mp_hands.HandLandmark(j).name
                    a[name + '_x'] = lm.x
                    a[name + '_y'] = lm.y
                    a[name + '_z'] = lm.z
                rows.append(a)

# DataFrame.append was removed in pandas 2.0; build the frame in one go instead.
df = pd.DataFrame(rows)
```
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
By TIGER-Lab (From Huggingface) [source]
MathInstruct is a comprehensive and meticulously curated dataset specifically designed to facilitate the development and evaluation of models for math instruction tuning. This dataset consists of a total of 13 different math rationale datasets, out of which six have been exclusively curated for this project, ensuring a diverse range of instructional materials. The main objective behind creating this dataset is to provide researchers with an easily accessible and manageable resource that aids in enhancing the effectiveness and precision of math instruction.
One noteworthy feature of MathInstruct is its lightweight nature, making it highly convenient for researchers to utilize without any hassle. With the carefully selected columns source and output, users can readily identify the origin or reference material from which each math instruction was obtained, and refer to the expected output or solution corresponding to each specific math problem or exercise.
Overall, MathInstruct offers immense potential in refining hybrid math instruction by facilitating meticulous model development and rigorous evaluation processes. Researchers can leverage this diverse dataset to gain deeper insights into effective teaching methodologies while exploring innovative approaches towards enhancing mathematical learning experiences.
Title: How to Use the MathInstruct Dataset for Hybrid Math Instruction Tuning
Introduction: The MathInstruct dataset is a comprehensive collection of math instruction examples, designed to assist in developing and evaluating models for math instruction tuning. This guide will provide an overview of the dataset and explain how to make effective use of it.
Understanding the Dataset Structure: The dataset consists of a file named train.csv. This CSV file contains the training data, which includes columns such as source and output. The source column represents the source of the math instruction (textbook, online resource, or teacher), while the output column represents the expected output or solution to a particular math problem or exercise.
Accessing the Dataset: To access the MathInstruct dataset, you can download it from Kaggle's website. Once downloaded, you can read and manipulate the data using programming languages like Python with libraries such as pandas.
Exploring the Columns: a) Source Column: The source column provides information about where each math instruction comes from. It may include references to specific textbooks, online resources, or even teachers who provided instructional material. b) Output Column: The output column specifies what students are expected to achieve as a result of each math instruction. It contains solutions or expected outputs for different math problems or exercises.
Utilizing Source Information: By analyzing the different sources mentioned in this dataset, researchers can understand which instructional materials are more effective in teaching specific topics within mathematics. They can also identify common strategies used by teachers across multiple sources.
Analyzing Expected Outputs: Researchers can study variations in expected outputs for similar types of problems across different sources. This analysis may help identify differences in approaches across textbooks/resources and enrich our understanding of various teaching methods.
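A hedged starting point for both kinds of analysis, assuming only the documented source and output columns:

```python
import pandas as pd

df = pd.read_csv('train.csv')

# Which sources contribute the most instruction examples?
print(df['source'].value_counts().head(10))

# Output length per source: a rough proxy for how detailed each
# source's worked solutions are.
lengths = df['output'].str.len()
print(lengths.groupby(df['source']).describe())
```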
Model Development and Evaluation: Researchers can utilize this dataset to develop machine learning models that automatically assess whether a given math instruction leads to the expected output. By training models on this data, one can create automated systems that provide feedback on math problems or suggest alternative instruction sources.
Scaling the Dataset: Due to its lightweight nature, the MathInstruct dataset is easily accessible and manageable. Researchers can scale up their training data by combining it with other instructional datasets or expand it further by labeling more examples based on similar guidelines.
Conclusion: The MathInstruct dataset serves as a valuable resource for developing and evaluating models related to math instruction tuning. By analyzing the source information and expected outputs, researchers can gain insights into effective teaching methods and build automated assessment tools.
- Model development: This dataset can be used for developing and training models for math instruction...
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This paper addresses the computational methods and challenges associated with prime number generation, a critical component in encryption algorithms for ensuring data security. The generation of prime numbers efficiently is a critical challenge in various domains, including cryptography, number theory, and computer science. The quest to find more effective algorithms for prime number generation is driven by the increasing demand for secure communication and data storage and the need for efficient algorithms to solve complex mathematical problems. Our goal is to address this challenge by presenting two novel algorithms for generating prime numbers: one that generates primes up to a given limit and another that generates primes within a specified range. These innovative algorithms are founded on the formulas of odd-composed numbers, allowing them to achieve remarkable performance improvements compared to existing prime number generation algorithms. Our comprehensive experimental results reveal that our proposed algorithms outperform well-established prime number generation algorithms such as Miller-Rabin, Sieve of Atkin, Sieve of Eratosthenes, and Sieve of Sundaram regarding mean execution time. More notably, our algorithms exhibit the unique ability to provide prime numbers from range to range with a commendable performance. This substantial enhancement in performance and adaptability can significantly impact the effectiveness of various applications that depend on prime numbers, from cryptographic systems to distributed computing. By providing an efficient and flexible method for generating prime numbers, our proposed algorithms can develop more secure and reliable communication systems, enable faster computations in number theory, and support advanced computer science and mathematics research.
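The paper's own odd-composed-number algorithms are not reproduced here. For reference, a minimal version of one of the named baselines, the Sieve of Eratosthenes:

```python
def sieve_of_eratosthenes(limit):
    """Return all primes <= limit (one of the baseline algorithms named above)."""
    if limit < 2:
        return []
    is_prime = [True] * (limit + 1)
    is_prime[0] = is_prime[1] = False
    for p in range(2, int(limit ** 0.5) + 1):
        if is_prime[p]:
            # Multiples below p*p were already crossed off by smaller primes.
            for multiple in range(p * p, limit + 1, p):
                is_prime[multiple] = False
    return [n for n, prime in enumerate(is_prime) if prime]

print(sieve_of_eratosthenes(30))  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```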
License: MIT (https://opensource.org/licenses/MIT)
License information was derived automatically
This dataset comprises curated mathematical problems and their answers sourced from prestigious competitions such as the American Invitational Mathematics Examination (AIME) and the International Mathematical Olympiad (IMO). Designed to challenge both human and machine intelligence, these problems cover a wide range of mathematical disciplines, including algebra, geometry, number theory, and combinatorics.
The dataset is structured for use in validating and benchmarking large language models (LLMs), assessing their problem-solving abilities, reasoning, and logical inference skills.
License: LGPL 3.0 (http://www.gnu.org/licenses/lgpl-3.0.html)
The MNIST-100 dataset is a variation of the original MNIST dataset, consisting of handwritten numbers from 00 to 99 built from MNIST digits. Unlike the traditional MNIST dataset, which contains 60,000 training images of digits from 0 to 9, the MNIST-100 dataset focuses on 100 two-digit numbers.
Dataset Overview:
- Dataset Name: MNIST-100
- Total Number of Images: train: 60,000; test: 1,000
- Classes: 100 (numbers from 00 to 99)
- Image Size: 28x56 pixels (grayscale)
Data Collection: The MNIST-100 dataset was created by randomly selecting 10 unique digits from the original MNIST dataset. For each selected digit, 10 representative images were extracted, resulting in a total of 100 images. These images were carefully chosen to represent a diverse range of handwriting styles for each digit.
Each image in the dataset is labeled with its corresponding number, ranging from 00 to 99, making it suitable for classification tasks. Researchers and practitioners can use this dataset to train and evaluate machine learning algorithms and neural networks for digit recognition and classification.
Please note that the MNIST-100 dataset is not intended to replace the original MNIST dataset but serves as a complementary resource for specific applications requiring a smaller and more focused subset of the MNIST data.
Overall, the MNIST-100 dataset offers a compact and representative collection of 100 handwritten numbers, providing a convenient tool for experimentation and learning in computer vision and pattern recognition.
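The construction code is not published with the dataset; the following hypothetical sketch shows how a 28x56 two-digit image of this kind could be assembled from standard MNIST digits (loading via tensorflow.keras here; the pairing logic is an assumption, not the author's method):

```python
import numpy as np
from tensorflow.keras.datasets import mnist

(x_train, y_train), _ = mnist.load_data()  # 28x28 grayscale digit images

rng = np.random.default_rng(0)

def make_two_digit_image(tens, units):
    """Stack a random image of each digit side by side -> a 28x56 image."""
    left = x_train[rng.choice(np.flatnonzero(y_train == tens))]
    right = x_train[rng.choice(np.flatnonzero(y_train == units))]
    return np.hstack([left, right])

img = make_two_digit_image(4, 2)  # an image of the number "42", label 42
print(img.shape)                  # (28, 56)
```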
Label Distribution for training set:
| Label | Occurrences | Label | Occurrences | Label | Occurrences |
|---|---|---|---|---|---|
| 0 | 561 | 34 | 629 | 68 | 606 |
| 1 | 687 | 35 | 540 | 69 | 582 |
| 2 | 582 | 36 | 588 | 70 | 566 |
| 3 | 633 | 37 | 619 | 71 | 659 |
| 4 | 588 | 38 | 584 | 72 | 572 |
| 5 | 544 | 39 | 609 | 73 | 682 |
| 6 | 582 | 40 | 570 | 74 | 627 |
| 7 | 615 | 41 | 679 | 75 | 598 |
| 8 | 584 | 42 | 544 | 76 | 605 |
| 9 | 567 | 43 | 567 | 77 | 602 |
| 10 | 641 | 44 | 574 | 78 | 595 |
| 11 | 780 | 45 | 555 | 79 | 586 |
| 12 | 720 | 46 | 550 | 80 | 569 |
| 13 | 699 | 47 | 614 | 81 | 628 |
| 14 | 630 | 48 | 614 | 82 | 578 |
| 15 | 627 | 49 | 595 | 83 | 622 |
| 16 | 684 | 50 | 505 | 84 | 569 |
| 17 | 713 | 51 | 583 | 85 | 540 |
| 18 | 743 | 52 | 512 | 86 | 557 |
| 19 | 706 | 53 | 555 | 87 | 628 |
| 20 | 527 | 54 | 504 | 88 | 562 |
| 21 | 710 | 55 | 488 | 89 | 625 |
| 22 | 586 | 56 | 531 | 90 | 600 |
| 23 | 584 | 57 | 556 | 91 | 700 |
| 24 | 568 | 58 | 497 | 92 | 622 |
| 25 | 530 | 59 | 520 | 93 | 622 |
| 26 | 612 | 60 | 556 | 94 | 591 |
| 27 | 627 | 61 | 682 | 95 | 557 |
| 28 | 618 | 62 | 594 | 96 | 580 |
| 29 | 619 | 63 | 539 | 97 | 640 |
| 30 | 622 | 64 | 610 | 98 | 577 |
| 31 | 684 | 65 | 514 | 99 | 563 |
| 32 | 606 | 66 | 587 | ||
| 33 | 592 | 67 | 655 |
Test data:
| Label | Occurrences | Label | Occurrences | Label | Occurrences |
|---|---|---|---|---|---|
| 00 | 96 | 34 | 100 | 68 | 90 |
| 01 | 108 | 35 | 91 | 69 | 92 |
| 02 | 91 | 36 | 107 | 70 | 102 |
| 03 | 96 | 37 | 112 | 71 | 116 |
| 04 | 75 | 38 | 97 | 72 | 101 |
| 05 | 85 | 39 | 96 | 73 | 106 |
| 06 | 88 | 40 | 103 | 74 | 98 |
| 07 | 96 | 41 | 123 | 75 ... |
Terms: https://dataful.in/terms-and-conditions
The dataset contains year-, taxpayer-type-, and income-range-wise total loss set-off and the number of returns filed among different types of direct (income) taxpayers, such as Firms, Hindu Undivided Families (HUF), Associations of Persons (AOP) and Bodies of Individuals (BOI), Companies, and Individuals. The range of income covered in the dataset extends from zero rupees to 500 crores and above.
Note: Total Loss Set Off is the sum of current year losses set off and brought forward losses set off against current year’s income in the “Computation of total income” Schedule of return of income
The TIGER/Line shapefiles and related database files (.dbf) are an extract of selected geographic and cartographic information from the U.S. Census Bureau's Master Address File / Topologically Integrated Geographic Encoding and Referencing (MAF/TIGER) Database (MTDB). The MTDB represents a seamless national file with no overlaps or gaps between parts; however, each TIGER/Line shapefile is designed to stand alone as an independent data set, or they can be combined to cover the entire nation. The Address Ranges Relationship File (ADDR.dbf) contains the attributes of each address range. Each address range applies to a single edge and has a unique address range identifier (ARID) value. The edge to which an address range applies can be determined by linking the address range to the All Lines Shapefile (EDGES.shp) using the permanent topological edge identifier (TLID) attribute. Multiple address ranges can apply to the same edge since an edge can have multiple address ranges. Note that the most inclusive address range associated with each side of a street edge already appears in the All Lines Shapefile (EDGES.shp). The TIGER/Line Files contain potential address ranges, not individual addresses. The term "address range" refers to the collection of all possible structure numbers from the first structure number to the last structure number and all numbers of a specified parity in between along an edge side relative to the direction in which the edge is coded. The address ranges in the TIGER/Line Files are potential ranges that include the full range of possible structure numbers even though the actual structures may not exist.
The TIGER/Line shapefiles and related database files (.dbf) are an extract of selected geographic and cartographic information from the U.S. Census Bureau's Master Address File / Topologically Integrated Geographic Encoding and Referencing (MAF/TIGER) System (MTS). The MTS represents a seamless national file with no overlaps or gaps between parts; however, each TIGER/Line shapefile is designed to stand alone as an independent data set, or they can be combined to cover the entire nation. The Address Range Features shapefile contains the geospatial edge geometry and attributes of all unsuppressed address ranges for a county or county equivalent area. The term "address range" refers to the collection of all possible structure numbers from the first structure number to the last structure number and all numbers of a specified parity in between along an edge side relative to the direction in which the edge is coded. Single-address address ranges have been suppressed to maintain the confidentiality of the addresses they describe. Multiple coincident address range feature edge records are represented in the shapefile if more than one left or right address range is associated with the edge. This shapefile contains a record for each address range to street name combination. Address ranges associated with more than one street name are also represented by multiple coincident address range feature edge records. Note that this shapefile includes all unsuppressed address ranges, compared to the All Lines shapefile (edges.shp), which only includes the most inclusive address range associated with each side of a street edge. The TIGER/Line shapefiles contain potential address ranges, not individual addresses. The address ranges in the TIGER/Line shapefiles are potential ranges that include the full range of possible structure numbers even though the actual structures may not exist.
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Context
The dataset tabulates the data for the South Range, MI population pyramid, which represents the South Range population distribution across age and gender, using estimates from the U.S. Census Bureau American Community Survey 5-Year estimates. It lists the male and female population for each age group, along with the total population for those age groups. Higher numbers at the bottom of the table suggest population growth, whereas higher numbers at the top indicate declining birth rates. Furthermore, the dataset can be utilized to understand the youth dependency ratio, old-age dependency ratio, total dependency ratio, and potential support ratio.
Key observations
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates.
Age groups:
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on the estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for any of your research projects, reports or presentations, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research Team curates, analyzes and publishes demographics and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights is made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for South Range Population by Age. You can refer to the same here.
License: Apache 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Advanced-Math Dataset
This Advanced-Math dataset is designed to support advanced studies and research in various mathematical fields. It encompasses a wide range of topics, including:
- Calculus
- Linear Algebra
- Probability
- Machine Learning
- Deep Learning
The dataset primarily focuses on computational problems, which constitute over 80% of the content. Additionally, it includes related logical concept questions to provide a… See the full description on the dataset page: https://huggingface.co/datasets/haijian06/Advanced-Math.
Terms: https://www.usa.gov/government-works
This data set shows the number of individuals in the Pennsylvania child care workforce within specific age ranges as reported in the Professional Development (PD) Registry. Individual ages are calculated based on the birthdate entered in the PD Registry. If the birthdate is blank, data will show as “unknown.” Data is included only for individuals working in family child care, group child care, and center child care. Data is current as of the last day of the quarter prior to the posted report. This report will be updated twice a year. To protect the confidentiality of participants in OCDEL’s programs, it is necessary to limit the amount of data that is available, even in aggregate form. Specifically, counts of 50 or less have been suppressed to protect the confidentiality of individuals (Number is not displayed when count of individuals is less than 51.). DISCLAIMER: OCDEL is not representing that this information is current or accurate beyond the day it was posted. OCDEL shall not be held liable for any improper or incorrect use of the information described and/or contained herein and assumes no responsibility for anyone's use of the information.
License: Apache 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
The 400k Augmented MNIST dataset is an extended version of the classic MNIST handwritten digits dataset. By applying a variety of augmentation techniques, I have increased the number of training images to 400,000 - roughly 40,000 per digit label. This large and diverse training set is designed to significantly improve the robustness and generalization of models trained on it, making them less susceptible to overfitting and more resilient against adversarial perturbations.
The dataset is organized into two main directories:
The following augmentation functions were used to generate the extended dataset:
A random number of transformations (between 1 and 6, in a random order) is applied to each image, with the goal of creating a diverse and challenging training set.
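The list of augmentation functions did not survive extraction, so the sketch below uses four illustrative stand-ins (rotation, shift, noise, blur); the "random count in random order" protocol matches the description above, but the specific transforms are assumptions:

```python
import random

import cv2
import numpy as np

def rotate(img):
    h, w = img.shape
    m = cv2.getRotationMatrix2D((w / 2, h / 2), random.uniform(-15, 15), 1.0)
    return cv2.warpAffine(img, m, (w, h))

def shift(img):
    h, w = img.shape
    m = np.float32([[1, 0, random.randint(-3, 3)],
                    [0, 1, random.randint(-3, 3)]])
    return cv2.warpAffine(img, m, (w, h))

def add_noise(img):
    noisy = img.astype(np.float32) + np.random.normal(0, 15, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def blur(img):
    return cv2.GaussianBlur(img, (3, 3), 0)

TRANSFORMS = [rotate, shift, add_noise, blur]

def augment(img):
    """Apply a random number of transforms, in a random order."""
    k = random.randint(1, len(TRANSFORMS))
    for fn in random.sample(TRANSFORMS, k):
        img = fn(img)
    return img
```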
If you use this dataset in your research, please cite it as follows:
@misc{alexandre_le_mercier_2025,
title={400k Augmented MNIST: Extended Handwritten Digits},
url={https://www.kaggle.com/ds/6967763},
DOI={10.34740/KAGGLE/DS/6967763},
publisher={Kaggle},
author={Alexandre Le Mercier},
year={2025}
}
This dataset is under the Apache 2.0 license.
For any questions or issues regarding this dataset, please send a message in the "Discussions" or "Suggestions" sections of the Kaggle dataset page.
Good luck and happy coding! 🚀
The TIGER/Line shapefiles and related database files (.dbf) are an extract of selected geographic and cartographic information from the U.S. Census Bureau's Master Address File / Topologically Integrated Geographic Encoding and Referencing (MAF/TIGER) Database (MTDB). The MTDB represents a seamless national file with no overlaps or gaps between parts; however, each TIGER/Line shapefile is designed to stand alone as an independent data set, or they can be combined to cover the entire nation. The Address Ranges Feature Shapefile (ADDRFEAT.dbf) contains the geospatial edge geometry and attributes of all unsuppressed address ranges for a county or county equivalent area. The term "address range" refers to the collection of all possible structure numbers from the first structure number to the last structure number and all numbers of a specified parity in between along an edge side relative to the direction in which the edge is coded. Single-address address ranges have been suppressed to maintain the confidentiality of the addresses they describe. Multiple coincident address range feature edge records are represented in the shapefile if more than one left or right address range is associated with the edge. The ADDRFEAT shapefile contains a record for each address range to street name combination. Address ranges associated with more than one street name are also represented by multiple coincident address range feature edge records. Note that the ADDRFEAT shapefile includes all unsuppressed address ranges, compared to the All Lines Shapefile (EDGES.shp), which only includes the most inclusive address range associated with each side of a street edge. The TIGER/Line shapefiles contain potential address ranges, not individual addresses. The address ranges in the TIGER/Line Files are potential ranges that include the full range of possible structure numbers even though the actual structures may not exist.
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Context
The dataset tabulates the data for the Grass Range, MT population pyramid, which represents the Grass Range population distribution across age and gender, using estimates from the U.S. Census Bureau American Community Survey 5-Year estimates. It lists the male and female population for each age group, along with the total population for those age groups. Higher numbers at the bottom of the table suggest population growth, whereas higher numbers at the top indicate declining birth rates. Furthermore, the dataset can be utilized to understand the youth dependency ratio, old-age dependency ratio, total dependency ratio, and potential support ratio.
Key observations
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates.
Age groups:
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on the estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for any of your research projects, reports or presentations, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research Team curates, analyzes and publishes demographics and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights is made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for Grass Range Population by Age. You can refer to the same here.
These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way, while the medians and IQRs are not given. This further protects identifiability of the spatial locations used in the analysis.

This dataset is not publicly accessible because EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual-level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed.

File format: R workspace file; “Simulated_Dataset.RData”.

Metadata (including data dictionary):
- y: Vector of binary responses (1: adverse outcome, 0: control)
- x: Matrix of covariates; one row for each simulated individual
- z: Matrix of standardized pollution exposures
- n: Number of simulated individuals
- m: Number of exposure time periods (e.g., weeks of pregnancy)
- p: Number of columns in the covariate design matrix
- alpha_true: Vector of “true” critical window locations/magnitudes (i.e., the ground truth that we want to estimate)

Code abstract: We provide R statistical software code (“CWVS_LMC.txt”) to fit the linear model of coregionalization (LMC) version of the Critical Window Variable Selection (CWVS) method developed in the manuscript. We also provide R code (“Results_Summary.txt”) to summarize/plot the estimated critical windows and posterior marginal inclusion probabilities.

Description:
- “CWVS_LMC.txt”: delivered as a .txt file containing R statistical software code. Once the “Simulated_Dataset.RData” workspace has been loaded into R, this code can be used to identify/estimate critical windows of susceptibility and posterior marginal inclusion probabilities.
- “Results_Summary.txt”: also delivered as a .txt file containing R statistical software code. Once the “CWVS_LMC.txt” code has been applied to the simulated dataset and the program has completed, this code can be used to summarize and plot the identified/estimated critical windows and posterior marginal inclusion probabilities (similar to the plots shown in the manuscript).

Required R packages:
- For running “CWVS_LMC.txt”: msm (sampling from the truncated normal distribution), mnormt (sampling from the multivariate normal distribution), BayesLogit (sampling from the Polya-Gamma distribution)
- For running “Results_Summary.txt”: plotrix (plotting the posterior means and credible intervals)

What can be reproduced: The data and code can be used to identify/estimate critical windows from one of the actual simulated datasets generated under setting E4 from the presented simulation study.
How to use the information:
- Load the “Simulated_Dataset.RData” workspace.
- Run the code contained in “CWVS_LMC.txt”.
- Once the “CWVS_LMC.txt” code is complete, run “Results_Summary.txt”.

Below is the replication procedure for the attached data set for the portion of the analyses using a simulated data set.

Data: The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining confidentiality of any actual pregnant women.

Availability: Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we will make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This will also allow the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics, and requires an appropriate data use agreement.

This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, Oxford, UK, 1-30, (2019).
License: Apache 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
This dataset presents detailed information on the results of math examinations administered in New York State from 2006 to 2012. It includes the following categories: