https://creativecommons.org/publicdomain/zero/1.0/
By ddrg (From Huggingface) [source]
The dataset provides three columns in each file — formula1, formula2, and label (binary) — supplying all the information needed for comprehensive analysis and evaluation.
The train.csv file contains a subset of the dataset specifically curated for training purposes. It includes an extensive range of math formula pairs along with their corresponding labels and unique ID names. This allows researchers and data scientists to construct models that can predict whether two given formulas fall within the same category or not.
On the other hand, test.csv serves as an evaluation set. It consists of additional pairs of math formulas accompanied by their respective labels and unique IDs. By evaluating model performance on this test set after training it on train.csv data, researchers can assess how well their models generalize to unseen instances.
By leveraging this informative dataset, researchers can unlock new possibilities in mathematics-related fields, such as developing pattern-recognition algorithms or enhancing educational tools that automatically identify and categorize mathematical formulas.
Introduction
Dataset Description
train.csv
The train.csv file contains a set of labeled math formula pairs along with their corresponding labels and formula name IDs. It consists of the following columns:
- formula1: The first mathematical formula in the pair (text).
- formula2: The second mathematical formula in the pair (text).
- label: The classification label indicating whether the pair of formulas belongs to the same category (binary). A label value of 1 indicates that both formulas belong to the same category, while a value of 0 indicates different categories.
test.csv
The purpose of the test.csv file is to provide a set of formula pairs along with their labels and formula name IDs for testing and evaluation purposes. It has a structure identical to train.csv, containing the same columns (formula1, formula2, label).
Task
The main task using this dataset is binary classification, where your objective is to predict whether two mathematical formulas belong to the same category or not based on their textual representation. You can use various machine learning algorithms such as logistic regression, decision trees, random forests, or neural networks for training models on this dataset.
Exploring & Analyzing Data
Before building your model, it's crucial to explore and analyze your data. Here are some steps you can take:
- Load both CSV files (train.csv and test.csv) into your preferred data analysis framework or programming language (e.g., Python with libraries like pandas).
- Examine the dataset's structure, including the number of rows, columns, and data types.
- Check for missing values in the dataset and handle them accordingly.
- Visualize the distribution of labels to understand whether it is balanced or imbalanced.
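A minimal sketch of these exploration steps in Python with pandas, assuming train.csv and test.csv sit in the working directory and use the column names described above:

```python
import pandas as pd

# Load both splits
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Structure: number of rows/columns and data types
print(train.shape, test.shape)
print(train.dtypes)

# Missing values per column
print(train.isna().sum())

# Label balance (1 = same category, 0 = different categories)
print(train["label"].value_counts(normalize=True))
```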
Model Building
Once you have analyzed and preprocessed your dataset, you can start building your classification model using various machine learning algorithms:
- Split your train.csv data into training and validation sets for model evaluation during training.
- Choose a suitable algorithm (e.g., logistic regression, a tree ensemble, or a neural network), train it on the training split, and evaluate it on the validation split.
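One possible baseline is sketched below, under the assumption that the two formulas are simply concatenated into a single text field; this is an illustrative starting point, not the intended pipeline:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

train = pd.read_csv("train.csv")

# Represent each pair as the concatenation of its two formulas
X_text = train["formula1"].astype(str) + " " + train["formula2"].astype(str)
y = train["label"]

X_tr, X_val, y_tr, y_val = train_test_split(
    X_text, y, test_size=0.2, random_state=42, stratify=y
)

# Character n-grams are a reasonable choice for formula strings
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(2, 4))
X_tr_vec = vectorizer.fit_transform(X_tr)
X_val_vec = vectorizer.transform(X_val)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_tr_vec, y_tr)
print("Validation accuracy:", accuracy_score(y_val, clf.predict(X_val_vec)))
```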
- Math Formula Similarity: This dataset can be used to develop a model that classifies whether two mathematical formulas are similar or not. This can be useful in various applications such as plagiarism detection, identifying duplicate formulas in databases, or suggesting similar formulas based on user input.
- Formula Categorization: The dataset can be used to train a model that categorizes mathematical formulas into different classes or categories. For example, the model can classify formulas into algebraic expressions, trigonometric equations, calculus problems, or geometric theorems. This categorization can help organize and search through large collections of mathematical formulas.
- Formula Recommendation: Using this dataset, one could build a recommendation system that suggests related math formulas based on user input. By analyzing the similarities between different formula pairs and their corresponding labels, the system could provide recommendations for relevant mathematical concepts that users may need while solving problems or studying specific topics in mathematics
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Study information
The sample included in this dataset represents five children who participated in a number line intervention study. Originally six children were included, but one fulfilled the criterion for exclusion after missing several consecutive sessions, so their data are not included in the dataset. All participants were attending Year 1 of primary school at an independent school in New South Wales, Australia. To be eligible to participate, children had to present with low mathematics achievement, performing at or below the 25th percentile on the Maths Problem Solving and/or Numerical Operations subtests of the Wechsler Individual Achievement Test III (WIAT III A & NZ, Wechsler, 2016). Participants were excluded if, as reported by their parents, they had any other diagnosed disorder such as attention deficit hyperactivity disorder, autism spectrum disorder, intellectual disability, developmental language disorder, cerebral palsy, or uncorrected sensory disorders.
The study followed a multiple-baseline case series design, with a baseline phase, a treatment phase, and a post-treatment phase. The baseline phase varied between two and three measurement points, the treatment phase varied between four and seven measurement points, and all participants had one post-treatment measurement point. The measurement points were distributed across participants as follows:
- Participant 1 – 3 baseline, 6 treatment, 1 post-treatment
- Participant 3 – 2 baseline, 7 treatment, 1 post-treatment
- Participant 5 – 2 baseline, 5 treatment, 1 post-treatment
- Participant 6 – 3 baseline, 4 treatment, 1 post-treatment
- Participant 7 – 2 baseline, 5 treatment, 1 post-treatment
In each session across all three phases, children were assessed on a number line estimation task, a single-digit computation task, a multi-digit computation task, a dot comparison task, and a number comparison task. During the treatment phase, all children additionally completed the intervention task after these assessments. The order of the assessment tasks varied randomly between sessions.
Measures
Number Line Estimation. Children completed a computerised bounded number line task (0-100). The number line is presented in the middle of the screen, and the target number is presented above the start point of the number line to avoid signalling the midpoint (Dackermann et al., 2018). Target numbers included two non-overlapping sets (trained and untrained) of 30 items each. Untrained items were assessed in all phases of the study. Trained items were assessed independently of the intervention during the baseline and post-treatment phases, and performance on the intervention is used to index performance on the trained set during the treatment phase. Within each set, numbers were equally distributed throughout the number range, with three items within each ten (0-10, 11-20, 21-30, etc.). Target numbers were presented in random order. Participants did not receive performance-based feedback. Accuracy is indexed by percent absolute error (PAE): PAE = (|estimated number − target number| / scale of the number line) × 100.
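A small helper illustrating the PAE calculation (a sketch; the function name and the 0-100 scale default are assumptions based on the description above):

```python
def percent_absolute_error(estimate: float, target: float, scale: float = 100.0) -> float:
    """Percent absolute error on a bounded number line: |estimate - target| / scale * 100."""
    return abs(estimate - target) / scale * 100.0

# Example: estimating 72 when the target is 65 on a 0-100 line gives PAE = 7.0
print(percent_absolute_error(72, 65))
```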
Single-Digit Computation. The task included ten additions with single-digit addends (1-9) and single-digit results (2-9). The order was counterbalanced so that half of the additions present the lowest addend first (e.g., 3 + 5) and half of the additions present the highest addend first (e.g., 6 + 3). This task also included ten subtractions with single-digit minuends (3-9), subtrahends (1-6) and differences (1-6). The items were presented horizontally on the screen accompanied by a sound and participants were required to give a verbal response. Participants did not receive performance-based feedback. Performance on this task was indexed by item-based accuracy.
Multi-digit computational estimation. The task included eight additions and eight subtractions presented with double-digit numbers and three response options. None of the response options represent the correct result. Participants were asked to select the option that was closest to the correct result. In half of the items the calculation involved two double-digit numbers, and in the other half one double and one single digit number. The distance between the correct response option and the exact result of the calculation was two for half of the trials and three for the other half. The calculation was presented vertically on the screen with the three options shown below. The calculations remained on the screen until participants responded by clicking on one of the options on the screen. Participants did not receive performance-based feedback. Performance on this task is measured by item-based accuracy.
Dot Comparison and Number Comparison. Both tasks included the same 20 items, which were presented twice, counterbalancing left and right presentation. Magnitudes to be compared were between 5 and 99, with four items for each of the following ratios: .91, .83, .77, .71, .67. Both quantities were presented horizontally side by side, and participants were instructed to press one of two keys (F or J), as quickly as possible, to indicate the largest one. Items were presented in random order and participants did not receive performance-based feedback. In the non-symbolic comparison task (dot comparison) the two sets of dots remained on the screen for a maximum of two seconds (to prevent counting). Overall area and convex hull for both sets of dots is kept constant following Guillaume et al. (2020). In the symbolic comparison task (Arabic numbers), the numbers remained on the screen until a response was given. Performance on both tasks was indexed by accuracy.
The Number Line Intervention
During the intervention sessions, participants estimated the position of 30 Arabic numbers on a 0-100 bounded number line. As a form of feedback, within each item the participants’ estimate remained visible and the correct position of the target number appeared on the number line. When the estimate’s PAE was lower than 2.5, a message appeared on the screen that read “Excellent job”; when PAE was between 2.5 and 5, the message read “Well done, so close!”; and when PAE was higher than 5, the message read “Good try!” Numbers were presented in random order.
Variables in the dataset
- Age = age in ‘years, months’ at the start of the study
- Sex = female/male/non-binary or third gender/prefer not to say (as reported by parents)
- Math_Problem_Solving_raw = raw score on the Math Problem Solving subtest from the WIAT III (WIAT III A & NZ, Wechsler, 2016)
- Math_Problem_Solving_Percentile = percentile equivalent on the Math Problem Solving subtest from the WIAT III (WIAT III A & NZ, Wechsler, 2016)
- Num_Ops_Raw = raw score on the Numerical Operations subtest from the WIAT III (WIAT III A & NZ, Wechsler, 2016)
- Num_Ops_Percentile = percentile equivalent on the Numerical Operations subtest from the WIAT III (WIAT III A & NZ, Wechsler, 2016)
The remaining variables refer to participants’ performance on the study tasks. Each variable name is composed by three sections. The first one refers to the phase and session. For example, Base1 refers to the first measurement point of the baseline phase, Treat1 to the first measurement point on the treatment phase, and post1 to the first measurement point on the post-treatment phase.
The second part of the variable name refers to the task, as follows:
- DC = dot comparison
- SDC = single-digit computation
- NLE_UT = number line estimation (untrained set)
- NLE_T = number line estimation (trained set)
- CE = multidigit computational estimation
- NC = number comparison
The final part of the variable name refers to the type of measure being used (i.e., acc = total correct responses and pae = percent absolute error).
Thus, variable Base2_NC_acc corresponds to accuracy on the number comparison task during the second measurement point of the baseline phase and Treat3_NLE_UT_pae refers to the percent absolute error on the untrained set of the number line task during the third session of the Treatment phase.
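A hypothetical helper for splitting these variable names into their parts (the function and its output format are illustrative, not part of the dataset):

```python
import re

def parse_variable(name: str) -> dict:
    """Split a variable name such as "Treat3_NLE_UT_pae" into phase, session, task, and measure."""
    phase_session, *task_parts, measure = name.split("_")
    match = re.fullmatch(r"(Base|Treat|post)(\d+)", phase_session)
    return {
        "phase": match.group(1),          # Base / Treat / post
        "session": int(match.group(2)),   # measurement point within the phase
        "task": "_".join(task_parts),     # e.g. "NC", "NLE_UT", "CE"
        "measure": measure,               # "acc" or "pae"
    }

print(parse_variable("Base2_NC_acc"))
print(parse_variable("Treat3_NLE_UT_pae"))
```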
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
## Overview
Distance Calculation is a dataset for object detection tasks - it contains Vehicles annotations for 4,056 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
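If you prefer to pull the data programmatically, the Roboflow Python package can download a dataset export; the API key, workspace, project, and version identifiers below are placeholders that would need to be replaced with this project's actual values:

```python
from roboflow import Roboflow  # pip install roboflow

rf = Roboflow(api_key="YOUR_API_KEY")                                      # placeholder API key
project = rf.workspace("your-workspace").project("distance-calculation")  # placeholder identifiers
dataset = project.version(1).download("coco")                             # export format, e.g. COCO JSON

print(dataset.location)  # local folder containing images and annotations
```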
## License
This dataset is available under the [Public Domain license](https://creativecommons.org/publicdomain/zero/1.0/).
https://creativecommons.org/publicdomain/zero/1.0/
By TIGER-Lab (From Huggingface) [source]
MathInstruct is a comprehensive and meticulously curated dataset specifically designed to facilitate the development and evaluation of models for math instruction tuning. This dataset consists of a total of 13 different math rationale datasets, out of which six have been exclusively curated for this project, ensuring a diverse range of instructional materials. The main objective behind creating this dataset is to provide researchers with an easily accessible and manageable resource that aids in enhancing the effectiveness and precision of math instruction.
One noteworthy feature of MathInstruct is its lightweight nature, making it highly convenient for researchers to use without hassle. With its carefully selected columns — source and output — users can readily identify the origin or reference material from which each math instruction was obtained, and refer to the expected output or solution corresponding to each math problem or exercise.
Overall, MathInstruct offers immense potential for refining hybrid math instruction by facilitating meticulous model development and rigorous evaluation. Researchers can leverage this diverse dataset to gain deeper insights into effective teaching methodologies while exploring innovative approaches towards enhancing mathematical learning experiences.
Title: How to Use the MathInstruct Dataset for Hybrid Math Instruction Tuning
Introduction: The MathInstruct dataset is a comprehensive collection of math instruction examples, designed to assist in developing and evaluating models for math instruction tuning. This guide will provide an overview of the dataset and explain how to make effective use of it.
Understanding the Dataset Structure: The dataset consists of a file named train.csv. This CSV file contains the training data, which includes various columns such as source and output. The source column represents the source of math instruction (textbook, online resource, or teacher), while the output column represents expected output or solution to a particular math problem or exercise.
Accessing the Dataset: To access the MathInstruct dataset, you can download it from Kaggle's website. Once downloaded, you can read and manipulate the data using programming languages like Python with libraries such as pandas.
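For example, a quick look at the two described columns with pandas (assuming the downloaded file is named train.csv and lives in the working directory):

```python
import pandas as pd

# Load the MathInstruct training split
df = pd.read_csv("train.csv")

# Inspect the columns described above
print(df.columns.tolist())
print(df[["source", "output"]].head())

# How many examples come from each instruction source?
print(df["source"].value_counts().head(10))
```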
Exploring the Columns: a) Source Column: The source column provides information about where each math instruction comes from. It may include references to specific textbooks, online resources, or even teachers who provided instructional material. b) Output Column: The output column specifies what students are expected to achieve as a result of each math instruction. It contains solutions or expected outputs for different math problems or exercises.
Utilizing Source Information: By analyzing the different sources mentioned in this dataset, researchers can understand which instructional materials are more effective in teaching specific topics within mathematics. They can also identify common strategies used by teachers across multiple sources.
Analyzing Expected Outputs: Researchers can study variations in expected outputs for similar types of problems across different sources. This analysis may help identify differences in approaches across textbooks/resources and enrich our understanding of various teaching methods.
Model Development and Evaluation: Researchers can utilize this dataset to develop machine learning models that automatically assess whether a given math instruction leads to the expected output. By training models on this data, one can create automated systems that provide feedback on math problems or suggest alternative instruction sources.
Scaling the Dataset: Due to its lightweight nature, the MathInstruct dataset is easily accessible and manageable. Researchers can scale up their training data by combining it with other instructional datasets or expand it further by labeling more examples based on similar guidelines.
Conclusion: The MathInstruct dataset serves as a valuable resource for developing and evaluating models related to math instruction tuning. By analyzing the source information and expected outputs, researchers can gain insights into effective teaching methods and build automated assessment systems.
- Model development: This dataset can be used for developing and training models for math instruction...
Transient killer whales inhabit the West Coast of the United States. Their range and movement patterns are difficult to ascertain, but are vital to understanding killer whale population dynamics and abundance trends. Satellite tagging of West Coast transient killer whales to determine range and movement patterns will provide data to assist in understanding transient killer whale populations. L...
The following exercise contains questions based on the housing dataset.
How many houses have a waterfront? a. 21000 b. 21450 c. 163 d. 173
How many houses have 2 floors? a. 2692 b. 8241 c. 10680 d. 161
How many houses built before 1960 have a waterfront? a. 80 b. 7309 c. 90 d. 92
What is the price of the most expensive house having more than 4 bathrooms? a. 7700000 b. 187000 c. 290000 d. 399000
For instance, if the ‘price’ column contains outliers, how can you clean the data and remove them? a. Calculate the IQR range and drop the values outside the range. b. Calculate the p-value and remove the values less than 0.05. c. Calculate the correlation coefficient of the price column and remove the values less than the correlation coefficient. d. Calculate the Z-score of the price column and remove the values less than the z-score.
What are the various parameters that can be used to determine the dependent variables in the housing data to determine the price of the house? a. Correlation coefficients b. Z-score c. IQR Range d. Range of the Features
If we get the r2 score as 0.38, what inferences can we make about the model and its efficiency? a. The model is 38% accurate, and shows poor efficiency. b. The model is showing 0.38% discrepancies in the outcomes. c. Low difference between observed and fitted values. d. High difference between observed and fitted values.
If the metrics show that the p-value for the grade column is 0.092, what all inferences can we make about the grade column? a. Significant in presence of other variables. b. Highly significant in presence of other variables c. insignificance in presence of other variables d. None of the above
If the Variance Inflation Factor value for a feature is considerably higher than the other features, what can we say about that column/feature? a. High multicollinearity b. Low multicollinearity c. Both A and B d. None of the above
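One way to work through the questions above is with pandas; the snippet below assumes the standard King County housing column names (price, waterfront, floors, yr_built, bathrooms) and a placeholder file name, which may differ from the actual exercise file:

```python
import pandas as pd

df = pd.read_csv("housing.csv")  # placeholder file name

print((df["waterfront"] == 1).sum())                                   # houses with a waterfront
print((df["floors"] == 2).sum())                                       # houses with exactly 2 floors
print(df[(df["yr_built"] < 1960) & (df["waterfront"] == 1)].shape[0])  # pre-1960 waterfront houses
print(df.loc[df["bathrooms"] > 4, "price"].max())                      # priciest house with >4 bathrooms

# IQR-based outlier handling on 'price' (the approach in option a of the outlier question)
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
clean = df[(df["price"] >= q1 - 1.5 * iqr) & (df["price"] <= q3 + 1.5 * iqr)]
print(len(df) - len(clean), "rows dropped as outliers")
```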
The USDA Agricultural Research Service (ARS) recently established SCINet, which consists of a shared high-performance computing resource, Ceres, and the dedicated high-speed Internet2 network used to access Ceres. Current and potential SCINet users are using and generating very large datasets, so SCINet needs to be provisioned with adequate data storage for their active computing. It is not designed to hold data beyond active research phases. At the same time, the National Agricultural Library has been developing the Ag Data Commons, a research data catalog and repository designed for public data release and professional data curation. Ag Data Commons needs to anticipate the size and nature of data it will be tasked with handling. The ARS Web-enabled Databases Working Group, organized under the SCINet initiative, conducted a study to establish baseline data storage needs and practices, and to make projections that could inform future infrastructure design, purchases, and policies. The SCINet Web-enabled Databases Working Group helped develop the survey, which is the basis for an internal report. While the report was for internal use, the survey and resulting data may be generally useful and are being released publicly.
From October 24 to November 8, 2016 we administered a 17-question survey (Appendix A) by emailing a Survey Monkey link to all ARS Research Leaders, intending to cover data storage needs of all 1,675 SY (Category 1 and Category 4) scientists. We designed the survey to accommodate either individual researcher responses or group responses. Research Leaders could decide, based on their unit's practices or their management preferences, whether to delegate the response to a data management expert in their unit, to all members of their unit, or to collate responses from their unit themselves before reporting in the survey. Larger storage ranges cover vastly different amounts of data, so the implications here could be significant depending on whether the true amount is at the lower or higher end of the range. Therefore, we requested more detail from "Big Data users," those 47 respondents who indicated they had more than 10 to 100 TB or over 100 TB total current data (Q5). All other respondents are called "Small Data users." Because not all of these follow-up requests were successful, we used actual follow-up responses to estimate likely responses for those who did not respond. We defined active data as data that would be used within the next six months. All other data would be considered inactive, or archival. To calculate per-person storage needs we used the high end of the reported range divided by 1 for an individual response, or by G, the number of individuals in a group response. For Big Data users we used the actual reported values or estimated likely values.
Resources in this dataset:
- Resource Title: Appendix A: ARS data storage survey questions. File Name: Appendix A.pdf. Resource Description: The full list of questions asked with the possible responses. The survey was not administered using this PDF, but the PDF was generated directly from the administered survey using the Print option under Design Survey. Asterisked questions were required. A list of Research Units and their associated codes was provided in a drop-down not shown here. Resource Software Recommended: Adobe Acrobat, url: https://get.adobe.com/reader/
- Resource Title: CSV of Responses from ARS Researcher Data Storage Survey. File Name: Machine-readable survey response data.csv. Resource Description: CSV file including raw responses from the administered survey, as downloaded unfiltered from Survey Monkey, including incomplete responses. Also includes additional classification and calculations to support analysis. Individual email addresses and IP addresses have been removed. This is the same data as in the Excel spreadsheet (also provided).
- Resource Title: Responses from ARS Researcher Data Storage Survey. File Name: Data Storage Survey Data for public release.xlsx. Resource Description: MS Excel worksheet that includes raw responses from the administered survey, as downloaded unfiltered from Survey Monkey, including incomplete responses. Also includes additional classification and calculations to support analysis. Individual email addresses and IP addresses have been removed. Resource Software Recommended: Microsoft Excel, url: https://products.office.com/en-us/excel
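A tiny illustration of the per-person storage calculation described above (the function name and example values are made up for illustration):

```python
def per_person_storage_tb(range_high_tb: float, group_size: int = 1) -> float:
    """Per-person storage: high end of the reported range divided by the number of
    individuals covered by the response (1 for an individual response, G for a group)."""
    return range_high_tb / group_size

print(per_person_storage_tb(100.0))    # individual response reporting up to 100 TB -> 100 TB
print(per_person_storage_tb(10.0, 4))  # group of 4 reporting up to 10 TB -> 2.5 TB each
```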
The Address Coordinator in the Planning Department assigns new addresses during the application review process per the address manual. The status field can be used to filter out valid addresses. The True, Multi, Corner (status will be changed to true when the building configuration is identified), and Model values are all valid addresses.
Other statuses:
- Preliminary - subdivision addresses assigned during the review process; not official until the plat is recorded.
- Temporary - addresses assigned to power poles and construction trailers during the building process.
- Land - addresses for undeveloped properties.
- Range - used to aid the address coordinator in assigning new addresses when calculating the address range on a new street segment.
- Retired - former addresses that are no longer in use, such as shopping center reconfigurations.
Use field: this field helps clarify what the address represents, such as power meters for subdivision fountains or pump houses in multi-family developments.
The address field is a concatenation of the individual address fields except for the unit number (field name = Units). The address-with-suffix field concatenates the address number and any suffix for use in geocoders.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Golf Ball Distance Calculation is a dataset for object detection tasks - it contains Golf Balls annotations for 318 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The “Fused Image dataset for convolutional neural Network-based crack Detection” (FIND) is a large-scale image dataset with pixel-level ground truth crack data for deep learning-based crack segmentation analysis. It features four types of image data including raw intensity image, raw range (i.e., elevation) image, filtered range image, and fused raw image. The FIND dataset consists of 2500 image patches (dimension: 256x256 pixels) and their ground truth crack maps for each of the four data types.
The images contained in this dataset were collected from multiple bridge decks and roadways under real-world conditions. A laser scanning device was adopted for data acquisition such that the captured raw intensity and raw range images have pixel-to-pixel location correspondence (i.e., spatial co-registration feature). The filtered range data were generated by applying frequency domain filtering to eliminate image disturbances (e.g., surface variations, and grooved patterns) from the raw range data [1]. The fused image data were obtained by combining the raw range and raw intensity data to achieve cross-domain feature correlation [2,3]. Please refer to [4] for a comprehensive benchmark study performed using the FIND dataset to investigate the impact from different types of image data on deep convolutional neural network (DCNN) performance.
If you share or use this dataset, please cite [4] and [5] in any relevant documentation.
In addition, an image dataset for crack classification has also been published at [6].
References:
[1] Shanglian Zhou, & Wei Song. (2020). Robust Image-Based Surface Crack Detection Using Range Data. Journal of Computing in Civil Engineering, 34(2), 04019054. https://doi.org/10.1061/(asce)cp.1943-5487.0000873
[2] Shanglian Zhou, & Wei Song. (2021). Crack segmentation through deep convolutional neural networks and heterogeneous image fusion. Automation in Construction, 125. https://doi.org/10.1016/j.autcon.2021.103605
[3] Shanglian Zhou, & Wei Song. (2020). Deep learning–based roadway crack classification with heterogeneous image data fusion. Structural Health Monitoring, 20(3), 1274-1293. https://doi.org/10.1177/1475921720948434
[4] Shanglian Zhou, Carlos Canchila, & Wei Song. (2023). Deep learning-based crack segmentation for civil infrastructure: data types, architectures, and benchmarked performance. Automation in Construction, 146. https://doi.org/10.1016/j.autcon.2022.104678
[5] (This dataset) Shanglian Zhou, Carlos Canchila, & Wei Song. (2022). Fused Image dataset for convolutional neural Network-based crack Detection (FIND) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6383044
[6] Wei Song, & Shanglian Zhou. (2020). Laser-scanned roadway range image dataset (LRRD). Laser-scanned Range Image Dataset from Asphalt and Concrete Roadways for DCNN-based Crack Classification, DesignSafe-CI. https://doi.org/10.17603/ds2-bzv3-nc78
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
File name definitions:
'...v_50_175_250_300...' - dataset for velocity ranges [50, 175] + [250, 300] m/s
'...v_175_250...' - dataset for velocity range [175, 250] m/s
'ANNdevelop...' - used to perform 9 parametric sub-analyses where, in each one, many ANNs are developed (trained, validated and tested) and the one yielding the best results is selected
'ANNtest...' - used to test the best ANN from each aforementioned parametric sub-analysis, aiming to find the best ANN model; this dataset includes the 'ANNdevelop...' counterpart
Where to find the input (independent) and target (dependent) variable values for each dataset/Excel file?
input values in 'IN' sheet
target values in 'TARGET' sheet
Where to find the results from the best ANN model (for each target/output variable and each velocity range)?
open the corresponding excel file and the expected (target) vs ANN (output) results are written in 'TARGET vs OUTPUT' sheet
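For instance, the sheets can be read with pandas; the workbook name below is a placeholder following the naming pattern described above:

```python
import pandas as pd

xlsx = "ANNdevelop_v_175_250.xlsx"  # placeholder name following the described pattern

inputs = pd.read_excel(xlsx, sheet_name="IN")                  # independent (input) variables
targets = pd.read_excel(xlsx, sheet_name="TARGET")             # dependent (target) variables
results = pd.read_excel(xlsx, sheet_name="TARGET vs OUTPUT")   # expected (target) vs ANN output

print(inputs.shape, targets.shape)
print(results.head())
```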
Check reference below (to be added when the paper is published)
https://www.researchgate.net/publication/328849817_11_Neural_Networks_-_Max_Disp_-_Railway_Beams
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description:
This mmWave dataset (FA Dataset) is used for fitness activity identification and contains 14 common daily fitness activities. The data were captured with the TI AWR1642 mmWave radar. The dataset can be used by fellow researchers to reproduce the original work or to further explore other machine-learning problems in the domain of mmWave signals.
Format: .png
Section 1: Device Configuration
Section 2: Data Format
We provide our mmWave data as heatmaps in .png format. The details are shown in the following sections:
Section 3: Experimental Setup
Section 4: Data Description
14 common daily activities and their corresponding files
| File Name | Activity Type | File Name | Activity Type |
|---|---|---|---|
| FA1 | Crunches | FA8 | Squats |
| FA2 | Elbow plank and reach | FA9 | Burpees |
| FA3 | Leg raise | FA10 | Chest squeezes |
| FA4 | Lunges | FA11 | High knees |
| FA5 | Mountain climber | FA12 | Side leg raise |
| FA6 | Punches | FA13 | Side to side chops |
| FA7 | Push ups | FA14 | Turning kicks |
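A brief sketch of loading the heatmap PNGs, assuming one folder per activity code (FA1-FA14); the directory layout and root folder name are assumptions, not specified by the dataset:

```python
from pathlib import Path

import numpy as np
from PIL import Image

data_root = Path("FA_Dataset")  # placeholder root folder

samples, labels = [], []
for activity_dir in sorted(data_root.glob("FA*")):
    for png in activity_dir.glob("*.png"):
        samples.append(np.asarray(Image.open(png)))  # heatmap as a numpy array
        labels.append(activity_dir.name)              # e.g. "FA5" = Mountain climber

print(len(samples), "heatmaps loaded across", len(set(labels)), "activity classes")
```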
Section 5: Raw Data and Data Processing Algorithms
Section 6: Citations
If your paper is related to our work, please cite our papers as follows.
https://ieeexplore.ieee.org/document/9868878/
Xie, Yucheng, Ruizhe Jiang, Xiaonan Guo, Yan Wang, Jerry Cheng, and Yingying Chen. "mmFit: Low-Effort Personalized Fitness Monitoring Using Millimeter Wave." In 2022 International Conference on Computer Communications and Networks (ICCCN), pp. 1-10. IEEE, 2022.
Bibtex:
@inproceedings{xie2022mmfit,
title={mmFit: Low-Effort Personalized Fitness Monitoring Using Millimeter Wave},
author={Xie, Yucheng and Jiang, Ruizhe and Guo, Xiaonan and Wang, Yan and Cheng, Jerry and Chen, Yingying},
booktitle={2022 International Conference on Computer Communications and Networks (ICCCN)},
pages={1--10},
year={2022},
organization={IEEE}
}
The files contain simulation results for the ECOC 2023 submission "Analysis of the Scalar and Vector Random Coupling Models For a Four Coupled-Core Fiber". The "4CCF_eigenvectorsPol" file is the Mathematica code used to calculate the supermodes (eigenvectors of M(w)) and their propagation constants for a four-coupled-core fiber (4CCF). These results are loaded into the Python notebook "4CCF_modelingECOC" in order to plot them and obtain Fig. 2 in the paper. "TransferMatrix" is the Python file with functions used for modeling, simulation, and plotting. It is also loaded in the Python notebook "4CCF_modelingECOC", where all the calculations for the figures in the paper are presented.
! UPD 25.09.2023: There is an error in the formula of the birefringence calculation, in the function "CouplingCoefficients" in the "TransferMatrix" file. The variable "birefringence" has to be calculated according to formula (19) of [A. Ankiewicz, A. Snyder, and X.-H. Zheng, "Coupling between parallel optical fiber cores - critical examination", Journal of Lightwave Technology, vol. 4, no. 9, pp. 1317-1323, 1986]: (4*U**2*W*spec.k0(W)*spec.kn(2, W_)/(spec.k1(W)*V**4))*((spec.iv(1, W)/spec.k1(W))-(spec.iv(2, W)/spec.k0(W))). The correct formula gives almost the same result (the difference is about 10^-5), but the correct formula should be used nonetheless.
! UPD 9.12.2023: I have noticed that in the published version of the code I forgot to change the wavelength range for the impulse response calculation, so instead of the nice shape shown in the paper you will see a resolution-limited shape. To fix this, change the range of wavelengths: you can add "wl = [1545e-9, 1548e-9]" in the first cell after "Total power impulse response".
P.S. In case of any questions or suggestions you are welcome to write me an email: ekader@chalmers.se
https://api.github.com/licenses/mit
This dataset contains Python numerical computation code for studying acoustic superradiance and Hawking radiation in specific rotating acoustic black hole models. The code is based on the radial wave equation of a scalar field (acoustic disturbance) on the effective acoustic metric background derived in the analysis.
Dataset generation process and processing methods: The core code is written in Python, using the standard scientific computing libraries NumPy and SciPy. The main steps are: (1) define the model parameters (such as A, B, m) and the calculation ranges (frequency $\omega$ from 0.01 to 2.0, tortoise coordinate $r^*$ from -20 to 20); (2) implement the conversion functions between the radial coordinate $r$ and the tortoise coordinate $r^*$, where the inversion of $r^*(r)$ is solved numerically with SciPy's `optimize.root_scalar` function (e.g., Brent's method), with special care near the horizon $r_H = |A|/c$ to ensure stability; (3) compute the effective potential $V_0(r^*, \omega)$, which depends on $r(r^*)$; (4) convert the second-order radial wave equation into a system of four first-order real-valued ordinary differential equations; (5) solve the ODE system with SciPy's `integrate.solve_ivp` function (adaptive-step RK45 with relative and absolute tolerances of $10^{-8}$), applying purely ingoing boundary conditions (normalized unit transmission) at the horizon and the appropriate asymptotic behaviour at infinity; (6) extract the reflection coefficient $\mathcal{R}$ and transmission coefficient $\mathcal{T}$ from the numerical solution; (7) compute the Hawking radiation power spectrum $P_\omega$ from the derived Hawking temperature $T_H$, the horizon angular velocity $\Omega_H$, Bose-Einstein statistics, and the greybody factor $|\mathcal{T}|^2$. The calculations use natural units ($\hbar = k_B = c = 1$) and set the characteristic length $r_0 = 1$.
Dataset content: This dataset consists of a Python script (the code for the numerical study of superradiance and Hawking radiation of rotating acoustic black holes, .py) and a README documentation file (README.md). The Python script implements the complete calculation pipeline described above. The README explains the code's functionality, the required dependencies (Python 3, NumPy, SciPy), how to run it, and the meaning of the parameters. The dataset contains no raw experimental data; it is theoretical calculation code only.
Data accuracy and validation: The reliability of the code was checked against two key indicators: (1) the flux conservation relation $|\mathcal{R}|^2 + [(\omega - m\Omega_H)/\omega]\,|\mathcal{T}|^2 = 1$ holds numerically over the computed frequency range (deviations typically of order $10^{-8}$ or smaller); (2) in the superradiant regime $0 < \omega < m\Omega_H$, the reflection coefficient satisfies $|\mathcal{R}|^2 > 1$, consistent with theoretical expectations.
File format and software: The code is standard Python 3 (.py) and runs in any Python 3 environment with NumPy and SciPy installed. The README is in Markdown (.md) format and can be opened with any text editor or Markdown viewer. No special or niche software is required.
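A minimal sketch of steps (4)-(5), using a placeholder potential rather than the actual effective potential of the acoustic model; the variable names, toy barrier, and parameter values are assumptions for illustration only:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Toy version of the radial equation d^2 psi / dr*^2 + [ (omega - m*Omega_H)^2 - V0(r*) ] psi = 0,
# rewritten as four real first-order ODEs (real/imaginary parts of psi and dpsi/dr*).
omega, m, Omega_H = 0.5, 1, 0.3
k_H = omega - m * Omega_H          # near-horizon wavenumber

def V0(r_star):
    return np.exp(-r_star**2)      # placeholder potential barrier

def rhs(r_star, y):
    re_psi, im_psi, re_dpsi, im_dpsi = y
    k2 = (omega - m * Omega_H) ** 2 - V0(r_star)
    return [re_dpsi, im_dpsi, -k2 * re_psi, -k2 * im_psi]

# Purely ingoing boundary condition psi ~ exp(-i * k_H * r*) imposed at the left edge of the grid
r0 = -20.0
phase = k_H * r0
y0 = [np.cos(phase), -np.sin(phase), -k_H * np.sin(phase), -k_H * np.cos(phase)]

sol = solve_ivp(rhs, (r0, 20.0), y0, method="RK45", rtol=1e-8, atol=1e-8)
psi_inf = sol.y[0, -1] + 1j * sol.y[1, -1]
print("psi at r* = 20:", psi_inf)   # R and T would be extracted by matching asymptotics here
```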
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This book is written for statisticians, data analysts, programmers, researchers, teachers, students, professionals, and general consumers on how to perform different types of statistical data analysis for research purposes using the R programming language. R is an open-source software environment and object-oriented programming language with a development environment (IDE) called RStudio for computing statistics and graphical displays through data manipulation, modelling, and calculation. R packages and supported libraries provide a wide range of functions for programming and analyzing data. Unlike many existing statistical software packages, R has the added benefit of allowing users to write more efficient code by using command-line scripting and vectors. It has several built-in functions and libraries that are extensible and allow users to define their own (customized) functions specifying how the program should behave while handling the data, which can also be stored in the simple object system.
For all intents and purposes, this book serves as both a textbook and a manual for R statistics, particularly in academic research, data analytics, and computer programming, targeted to help inform and guide the work of R users and statisticians. It provides information about different types of statistical data analysis and methods, and the best scenarios for use of each case in R. It gives a hands-on, step-by-step practical guide on how to identify and conduct the different parametric and non-parametric procedures. This includes a description of the different conditions or assumptions that are necessary for performing the various statistical methods or tests, and how to understand the results of the methods. The book also covers the different data formats and sources, and how to test for reliability and validity of the available datasets. Different research experiments, case scenarios, and examples are explained in this book. It is the first book to provide a comprehensive description and step-by-step practical hands-on guide to carrying out the different types of statistical analysis in R, particularly for research purposes, with examples ranging from how to import and store datasets in R as objects, how to code and call the methods or functions for manipulating the datasets or objects, factorization, and vectorization, to better reasoning, interpretation, and storage of the results for future use, and graphical visualizations and representations. Thus, a congruence of statistics and computer programming for research.
The U.S. Geological Survey has been characterizing the regional variation in shear stress on the sea floor and sediment mobility through statistical descriptors. The purpose of this project is to identify patterns in stress in order to inform habitat delineation or decisions for anthropogenic use of the continental shelf. The statistical characterization spans the continental shelf from the coast to approximately 120 m water depth, at approximately 5 km resolution. Time-series of wave and circulation are created using numerical models, and near-bottom output of steady and oscillatory velocities and an estimate of bottom roughness are used to calculate a time-series of bottom shear stress at 1-hour intervals. Statistical descriptions such as the median and 95th percentile, which are the output included with this database, are then calculated to create a two-dimensional picture of the regional patterns in shear stress. In addition, time-series of stress are compared to critical stress values at select points calculated from observed surface sediment texture data to determine estimates of sea floor mobility.
Raw radio tracking data used to determine the precise distance to Venus (and improve knowledge of the Astronomical Unit) from the Galileo flyby on 10 February 1990.
The pathway representation consists of segments and intersection elements. A segment is a linear graphic element that represents a continuous physical travel path terminated by a path end (dead end) or a physical intersection with other travel paths. Segments have one street name, one address range, and one set of segment characteristics. A segment may have no or multiple alias street names. Segment types included are Freeways, Highways, Streets, Alleys (named only), Railroads, Walkways, and Bike lanes. SNDSEG_PV is a linear feature class representing the SND Segment Feature, with attributes for Street name, Address Range, Alias Street name, and segment Characteristics objects. Part of the Address Range and all of the Street name objects are logically shared with the Discrete Address Point-Master Address File layer.
Appropriate uses include:
- Cartography - used to depict the City's transportation network location and connections, typically on smaller-scale maps or images where a single-line representation is appropriate; to depict specific classifications of roadway use, also typically at smaller scales; and to label transportation network feature names and associated address ranges, typically on larger-scale maps.
- Geocode reference - used as a source for derived reference data for address validation and theoretical address location.
- Address Range data repository - this data store is the City's address range repository, defining address ranges in association with transportation network features.
- Polygon boundary reference - used to define various area boundaries in other feature classes where coincident with the transportation network. Does not contain polygon features.
- Address-based extracts - used to create flat-file extracts, typically indexed by address, with reference to business data typically associated with transportation network features.
- Thematic linear location reference - by providing unique, stable identifiers for each linear feature, thematic data is associated to specific transportation network features via these identifiers.
- Thematic intersection location reference - by providing unique, stable identifiers for each intersection feature, thematic data is associated to specific transportation network features via these identifiers.
- Network route tracing - used as a source for derived reference data used to determine point-to-point travel paths or optimal stop allocation along a travel path.
- Topological connections with segments - used to provide a specific definition of location for each transportation network feature, and a specific definition of connection between features (i.e., defines where the streets are and the relationships between them, e.g., 4th Ave is west of 5th Ave and 4th Ave intersects Cherry St).
- Event location reference - used as a source for derived reference data used to locate events and for linear referencing.
Data source is TRANSPO.SNDSEG_PV. Updated weekly.
https://doi.org/10.5061/dryad.djh9w0w7f
wb_dispersal_script.rmd - R markdown file containing scripts for R and PLINK2.
scandinavian_wb_56K.bed - PLINK format .bed file containing genotypes.
scandinavian_wb_56K.bim - PLINK format .bim file containing marker positions.
scandinavian_wb_56K.fam - PLINK format .fam file containing individual IDs.
KING_unfiltered.kin0 - output file from KING-robust algorithm in PLINK2.
scandinavian_wb_meta.txt - metadata for individuals containing PLINK names (row names), latitude, longitude, and group (county/area).
harvest.zip - zip file containing:
- ASFsweden.* - map layer for African swine fever (ASF) outbreak.
- wildboar10kvkm.* - map layer for wild boar harvest numbers.
The script was written for a...
https://www.elsevier.com/about/policies/open-access-licenses/elsevier-user-license/cpc-license/
Title of program: MARS-1-FOR-EFR-DWBA Catalogue Id: ABPB_v1_0
Nature of problem The package SATURN-MARS-1 consists of two programs SATURN and MARS for calculating cross sections of reactions transferring nucleon(s) primarily between two heavy ions. The calculations are made within the framework of the finite-range distorted wave Born approximation(DWBA). The first part, SATURN, prepares the form factor(s) either for exact finite (EFR) or for no-recoil (NR) approach. The prepared form factor is then used by the second part MARS to calculate either EFR-DWBA or NR-DWBA cross-s ...
Versions of this program held in the CPC repository in Mendeley Data abpb_v1_0; MARS-1-FOR-EFR-DWBA; 10.1016/0010-4655(74)90012-5
This program has been imported from the CPC Program Library held at Queen's University Belfast (1969-2019)