By math_dataset (From Huggingface) [source]
This dataset comprises a collection of mathematical problems and their solutions designed for training and testing purposes. Each problem is presented in the form of a question, followed by its corresponding answer. The dataset covers various mathematical topics such as arithmetic, polynomials, and prime numbers. For instance, the arithmetic_nearest_integer_root_test.csv file focuses on problems involving finding the nearest integer root of a given number. Similarly, the polynomials_simplify_power_test.csv file deals with problems related to simplifying polynomials with powers. Additionally, the dataset includes the numbers_is_prime_train.csv file containing math problems that require determining whether a specific number is prime or not. The questions and answers are provided in text format to facilitate analysis and experimentation with mathematical problem-solving algorithms or models.
Introduction: The Mathematical Problems Dataset contains a collection of various mathematical problems and their corresponding solutions or answers. This guide will provide you with all the necessary information on how to utilize this dataset effectively.
Understanding the columns: The dataset consists of several columns, each representing a different aspect of the mathematical problem and its solution. The key columns are:
- question: This column contains the text representation of the mathematical problem or equation.
- answer: This column contains the text representation of the solution or answer to the corresponding problem.
Exploring specific problem categories: To focus on specific types of mathematical problems, you can filter or search within the dataset using relevant keywords or terms related to your area of interest. For example, if you are interested in prime numbers, you can search for prime in the question column.
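For instance, a keyword filter over the question column might look like the following minimal pandas sketch. The toy rows are invented for illustration; in practice you would read one of the dataset's CSV files (e.g. numbers_is_prime_train.csv) with pd.read_csv.

```python
import pandas as pd

# Toy rows in the dataset's (question, answer) schema; in practice,
# load a real file with pd.read_csv("numbers_is_prime_train.csv").
df = pd.DataFrame({
    "question": ["Is 7 prime?", "Simplify x**2 * x**3.", "Is 91 a prime number?"],
    "answer": ["True", "x**5", "False"],
})

# Keep only rows whose question mentions "prime" (case-insensitive).
prime_problems = df[df["question"].str.contains("prime", case=False, na=False)]
```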
Applying machine learning techniques: This dataset can be used for training machine learning models related to natural language understanding and mathematics. You can explore various techniques such as text classification, sentiment analysis, or even sequence-to-sequence models for solving mathematical problems based on their textual representations.
Generating new questions and solutions: By analyzing patterns in this dataset, you can generate new questions and solutions programmatically using techniques like data augmentation or rule-based methods.
Validation and evaluation: As with any other machine learning task, it is essential to properly validate your models on separate validation sets not included in this dataset. You can also evaluate model performance by comparing predictions against the known answers provided in this dataset's answer column.
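A minimal sketch of such an evaluation, assuming predictions are compared to the answer column by exact string match (the dataset itself does not prescribe a metric):

```python
def exact_match_accuracy(predictions, answers):
    """Fraction of predictions that exactly match the reference answers."""
    assert len(predictions) == len(answers)
    hits = sum(p.strip() == a.strip() for p, a in zip(predictions, answers))
    return hits / len(answers)

# Toy model outputs vs. the dataset's answer column: two of three match.
acc = exact_match_accuracy(["5", "7", "x + 1"], ["5", "-7", "x + 1"])
```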
Sharing insights and findings: After working with this dataset, researchers and educators are encouraged to share their insights and the approaches taken during analysis and modelling as Kaggle notebooks, discussions, blog posts, or tutorials, so that others can benefit from these shared resources.
Note: Please note that the dataset does not include dates.
By following these guidelines, you can effectively explore and utilize the Mathematical Problems Dataset for various mathematical problem-solving tasks. Happy exploring!
- Developing machine learning algorithms for solving mathematical problems: This dataset can be used to train and test models that can accurately predict the solution or answer to different mathematical problems.
- Creating educational resources: The dataset can be used to create a wide variety of educational materials such as problem sets, worksheets, and quizzes for students studying mathematics.
- Research in mathematical problem-solving strategies: Researchers and educators can analyze the dataset to identify common patterns or strategies employed in solving different types of mathematical problems. This analysis can help improve teaching methodologies and develop effective problem-solving techniques.
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.
Apache License, v2.0 - https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description: This dataset contains images of numbers written in words from one to fifty, in mixed casing (one, ONE, One, two, TWO, Two, …). Each image is stored in a folder named after its word (one, two, three, …).
Content: Images: The dataset includes images of numbers written in words from one to one hundred in various formats and styles. Images are provided in JPG, JPEG, and PNG formats.
Usage: This dataset can be used to develop machine learning models for optical character recognition (OCR) tasks or Image Classification. The goal is to train a model that can predict what is written in words when given an image containing the word.
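One way to pair each image with its label before training is to derive the class from the parent folder name, following the folder layout described above. A minimal standard-library sketch (the example path is invented):

```python
from pathlib import PurePath

IMAGE_EXTS = {".jpg", ".jpeg", ".png"}

def label_from_path(path):
    """Derive the class label from the parent folder name (one/, two/, ...)."""
    p = PurePath(path)
    if p.suffix.lower() not in IMAGE_EXTS:
        raise ValueError(f"not an image file: {path}")
    return p.parent.name

# e.g. an image stored under the "seven" folder gets the label "seven"
label = label_from_path("dataset/seven/IMG_0042.JPG")
```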
Acknowledgements: This dataset was created for the purpose of solving the problem statement: "Develop a machine-learning model to train with images of numbers written in words from one to fifty."
Apache License, v2.0 - https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Problem Solving Math Dataset - Greedy vs Best-of-N
This dataset contains mathematical problems and their solutions generated using two decoding strategies:
- Greedy Decoding: Generates a single deterministic solution.
- Best-of-N Decoding: Generates N solutions and selects the best one based on a scoring metric.
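A minimal sketch of the best-of-N idea, with toy stand-ins for the generator and the scoring metric (neither the real decoding nor the real scorer is specified in the dataset description):

```python
def best_of_n(generate, score, prompt, n=8):
    """Sample n candidate solutions and keep the highest-scoring one.

    Greedy decoding corresponds to the degenerate case n=1 with a
    deterministic generator.
    """
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins: the "generator" replays canned answers, the "scorer"
# prefers shorter strings; a real system would use an LLM and a reward model.
answers = iter(["42", "6 * 7", "forty-two"])
best = best_of_n(lambda prompt: next(answers),
                 score=lambda s: -len(s),
                 prompt="What is 6 * 7?",
                 n=3)
```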
Dataset Structure
This dataset is created with a filtered subset of 20 level 1-3 problems from the MATH-500 dataset. To have a balance across the levels, the… See the full description on the dataset page: https://huggingface.co/datasets/Tandogan/math-problems-greedy-vs-best-of-n.
Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains all comments (comments and replies) of the YouTube vision video "Tunnels" by "The Boring Company", fetched on 2020-10-13 using the YouTube API. The comments were classified manually by three persons. We performed a single-class labeling of the video comments regarding their relevance for requirements engineering (RE) (ham/spam) and their polarity (positive/neutral/negative). Furthermore, we performed a multi-class labeling of the comments regarding their intention (feature request and problem report) and their topic (efficiency and safety). While a comment can only be relevant or not relevant and have only one polarity, a comment can have one or more intentions and also one or more topics.
For the replies, one person also classified them regarding their relevance for RE. However, the investigation of the replies is ongoing and future work.
Remark: For 126 comments and 26 replies, we could not determine the date and time since they were no longer accessible on YouTube at the time this data set was created. In the case of a missing date and time, we inserted "NULL" in the corresponding cell.
This data set includes the following files:
Dataset.xlsx contains the raw and labeled video comments and replies:
For each comment, the data set contains:
ID: An identification number generated by YouTube for the comment
Date: The date and time of the creation of the comment
Author: The username of the author of the comment
Likes: The number of likes of the comment
Replies: The number of replies to the comment
Comment: The written comment
Relevance: Label indicating the relevance of the comment for RE (ham = relevant, spam = irrelevant)
Polarity: Label indicating the polarity of the comment
Feature request: Label indicating that the comment requests a feature
Problem report: Label indicating that the comment reports a problem
Efficiency: Label indicating that the comment deals with the topic efficiency
Safety: Label indicating that the comment deals with the topic safety
For each reply, the data set contains:
ID: The identification number of the comment to which the reply belongs
Date: The date and time of the creation of the reply
Author: The username of the author of the reply
Likes: The number of likes of the reply
Comment: The written reply
Relevance: Label indicating the relevance of the reply for RE (ham = relevant, spam = irrelevant)
Detailed analysis results.xlsx contains the detailed results of all ten-times-repeated 10-fold cross-validation analyses for each of the considered combinations of machine learning algorithms and features
Guide Sheet - Multi-class labeling.pdf describes the coding task, defines the categories, and lists examples to reduce inconsistencies and increase the quality of manual multi-class labeling
Guide Sheet - Single-class labeling.pdf describes the coding task, defines the categories, and lists examples to reduce inconsistencies and increase the quality of manual single-class labeling
Python scripts for analysis.zip contains the scripts (as jupyter notebooks) and prepared data (as csv-files) for the analyses
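As a small illustration of working with the labels described above, the sketch below filters relevant (ham) comments from toy rows; the 0/1 encoding of the intention labels is an assumption for illustration, not taken from the dataset:

```python
# Toy rows in the comment schema described above (values invented;
# the 0/1 encoding of the multi-class labels is an assumption).
comments = [
    {"ID": "a1", "Relevance": "ham",  "Feature request": 1, "Problem report": 0},
    {"ID": "b2", "Relevance": "spam", "Feature request": 0, "Problem report": 0},
    {"ID": "c3", "Relevance": "ham",  "Feature request": 0, "Problem report": 1},
]

# Keep only comments labeled relevant for requirements engineering (ham).
relevant = [c for c in comments if c["Relevance"] == "ham"]
feature_requests = [c["ID"] for c in relevant if c["Feature request"]]
```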
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) - https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset contains information about a number of participants (participants.csv) to a workshop that need to be assigned to a number of rooms (rooms.csv).
Restrictions:
1. The workshop has 5 different activities.
2. Each participant has indicated their first, second and third preferences for the activities available (Priority1, Priority2 and Priority3 columns in participants.csv).
3. Participants are part of teams (Team column in participants.csv) and should be assigned together.
4. Each activity lasts for half a day, and each participant will take part in one activity in the morning and one activity in the afternoon.
5. Each room must contain the SAME activity in the morning and in the afternoon.
Requirements:
A. Define the way in which each participant should be assigned, through a CSV file in the format Name;ActivityAM;RoomAM;ActivityPM;RoomPM.
B. Maximize the number of people getting their 1st and 2nd preferences.
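A minimal sketch of writing an assignment file in the requested format (assuming semicolons throughout; the participants, activities, and rooms are invented):

```python
import csv
import io

# One assignment row per participant, in the semicolon-separated format
# Name;ActivityAM;RoomAM;ActivityPM;RoomPM (all values invented).
assignments = [
    ("Alice", "Robotics", "R1", "Painting", "R2"),
    ("Bob",   "Robotics", "R1", "Painting", "R2"),  # same team -> same rooms
]

buf = io.StringIO()  # in a real run, open("assignments.csv", "w", newline="")
writer = csv.writer(buf, delimiter=";")
writer.writerow(["Name", "ActivityAM", "RoomAM", "ActivityPM", "RoomPM"])
writer.writerows(assignments)
output = buf.getvalue()
```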
MIT License - https://opensource.org/licenses/MIT
License information was derived automatically
📌 GitHub Issues Dataset
📂 Dataset Name: github-issues-dataset
📊 Total Issues: 114,073
📜 Format: Parquet (.parquet)
🔍 Source: GitHub Repositories (Top 100 Repos)
📖 Overview
This dataset contains 114,073 GitHub issues collected from the top 100 repositories on GitHub. It is designed for issue classification, severity/priority prediction, and AI/ML training.
✅ This dataset is useful for:
AI/ML Training: Fine-tune models for issue classification &… See the full description on the dataset page: https://huggingface.co/datasets/sharjeelyunus/github-issues-dataset.
Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
General
For more details and the most up-to-date information please consult our project page: https://kainmueller-lab.github.io/fisbe.
Summary
A new dataset for neuron instance segmentation in 3d multicolor light microscopy data of fruit fly brains
30 completely labeled (segmented) images
71 partly labeled images
altogether comprising ∼600 expert-labeled neuron instances (labeling a single neuron takes between 30-60 min on average, yet a difficult one can take up to 4 hours)
To the best of our knowledge, the first real-world benchmark dataset for instance segmentation of long thin filamentous objects
A set of metrics and a novel ranking score for respective meaningful method benchmarking
An evaluation of three baseline methods in terms of the above metrics and score
Abstract
Instance segmentation of neurons in volumetric light microscopy images of nervous systems enables groundbreaking research in neuroscience by facilitating joint functional and morphological analyses of neural circuits at cellular resolution. Yet said multi-neuron light microscopy data exhibits extremely challenging properties for the task of instance segmentation: Individual neurons have long-ranging, thin filamentous and widely branching morphologies, multiple neurons are tightly inter-weaved, and partial volume effects, uneven illumination and noise inherent to light microscopy severely impede local disentangling as well as long-range tracing of individual neurons. These properties reflect a current key challenge in machine learning research, namely to effectively capture long-range dependencies in the data. While respective methodological research is buzzing, to date methods are typically benchmarked on synthetic datasets. To address this gap, we release the FlyLight Instance Segmentation Benchmark (FISBe) dataset, the first publicly available multi-neuron light microscopy dataset with pixel-wise annotations. In addition, we define a set of instance segmentation metrics for benchmarking that we designed to be meaningful with regard to downstream analyses. Lastly, we provide three baselines to kick off a competition that we envision to both advance the field of machine learning regarding methodology for capturing long-range data dependencies, and facilitate scientific discovery in basic neuroscience.
Dataset documentation:
We provide a detailed documentation of our dataset, following the Datasheet for Datasets questionnaire:
FISBe Datasheet
Our dataset originates from the FlyLight project, where the authors released a large image collection of nervous systems of ~74,000 flies, available for download under CC BY 4.0 license.
Files
fisbe_v1.0_{completely,partly}.zip
contains the image and ground truth segmentation data; there is one zarr file per sample, see below for more information on how to access zarr files.
fisbe_v1.0_mips.zip
maximum intensity projections of all samples, for convenience.
sample_list_per_split.txt
a simple list of all samples and the subset they are in, for convenience.
view_data.py
a simple python script to visualize samples, see below for more information on how to use it.
dim_neurons_val_and_test_sets.json
a list of instance ids per sample that are considered to be of low intensity/dim; can be used for extended evaluation.
Readme.md
general information
How to work with the image files
Each sample consists of a single 3d MCFO image of neurons of the fruit fly. For each image, we provide a pixel-wise instance segmentation for all separable neurons. Each sample is stored as a separate zarr file (zarr is a file storage format for chunked, compressed, N-dimensional arrays based on an open-source specification). The image data ("raw") and the segmentation ("gt_instances") are stored as two arrays within a single zarr file. The segmentation mask for each neuron is stored in a separate channel. The order of dimensions is CZYX.
We recommend working in a virtual environment, e.g., by using conda:
conda create -y -n flylight-env -c conda-forge python=3.9
conda activate flylight-env
How to open zarr files
Install the python zarr package:
pip install zarr
Open a zarr file with:
import zarr

raw = zarr.open("<path-to-zarr-file>", mode='r', path="volumes/raw")
seg = zarr.open("<path-to-zarr-file>", mode='r', path="volumes/gt_instances")
Zarr arrays are read lazily on-demand. Many functions that expect numpy arrays also work with zarr arrays. Optionally, the arrays can also explicitly be converted to numpy arrays.
How to view zarr image files
We recommend using napari to view the image data.
Install napari:
pip install "napari[all]"
Save the following Python script:
import zarr, sys, napari
raw = zarr.load(sys.argv[1], path="volumes/raw")
gts = zarr.load(sys.argv[1], path="volumes/gt_instances")

viewer = napari.Viewer(ndisplay=3)
for idx, gt in enumerate(gts):
    viewer.add_labels(gt, rendering='translucent', blending='additive', name=f'gt_{idx}')
viewer.add_image(raw[0], colormap="red", name='raw_r', blending='additive')
viewer.add_image(raw[1], colormap="green", name='raw_g', blending='additive')
viewer.add_image(raw[2], colormap="blue", name='raw_b', blending='additive')
napari.run()
Execute:
python view_data.py /R9F03-20181030_62_B5.zarr
Metrics
S: Average of avF1 and C
avF1: Average F1 Score
C: Average ground truth coverage
clDice_TP: Average true positives clDice
FS: Number of false splits
FM: Number of false merges
tp: Relative number of true positives
For more information on our selected metrics and formal definitions please see our paper.
Baseline
To showcase the FISBe dataset together with our selection of metrics, we provide evaluation results for three baseline methods, namely PatchPerPix (ppp), Flood Filling Networks (FFN) and a non-learnt application-specific color clustering from Duan et al. For detailed information on the methods and the quantitative results please see our paper.
License
The FlyLight Instance Segmentation Benchmark (FISBe) dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
Citation
If you use FISBe in your research, please use the following BibTeX entry:
@misc{mais2024fisbe,
  title         = {FISBe: A real-world benchmark dataset for instance segmentation of long-range thin filamentous structures},
  author        = {Lisa Mais and Peter Hirsch and Claire Managan and Ramya Kandarpa and Josef Lorenz Rumberger and Annika Reinke and Lena Maier-Hein and Gudrun Ihrke and Dagmar Kainmueller},
  year          = 2024,
  eprint        = {2404.00130},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV}
}
Acknowledgments
We thank Aljoscha Nern for providing unpublished MCFO images as well as Geoffrey W. Meissner and the entire FlyLight Project Team for valuable discussions. P.H., L.M. and D.K. were supported by the HHMI Janelia Visiting Scientist Program. This work was co-funded by Helmholtz Imaging.
Changelog
There have been no changes to the dataset so far. All future changes will be listed on the changelog page.
Contributing
If you would like to contribute, have encountered any issues or have any suggestions, please open an issue for the FISBe dataset in the accompanying github repository.
All contributions are welcome!
Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains 150 Math Word Problems (MWP). Each entry is a textual math problem whose resolution requires the application of a first- or second-degree equation. To create this set, academic and educational sources containing first- and second-degree math problems were selected, and some original problems were also included.
Each problem in the dataset is structured as follows:
- "question": A textual description of the math problem in Spanish.
- "mathml_equations": The corresponding equation for the problem, expressed in MathML format to facilitate processing and manipulation by machine learning models.
- "Difficulty": The number of variables in the equation.
- "Grade": The degree of the equation, with 1 indicating a linear equation and 2 indicating a quadratic equation.
- "Index": A unique identifier for each problem in the dataset.
- "Author": The creator or source of the problem.
- "Ref": The source or citation for the problem, if applicable.
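Since MathML is plain XML, the equations can be tokenized with the standard library. A minimal sketch with an invented record in the schema above:

```python
import xml.etree.ElementTree as ET

# One invented record in the schema described above.
problem = {
    "question": "Un número más cinco es igual a doce. ¿Cuál es el número?",
    "mathml_equations": "<math><mi>x</mi><mo>+</mo><mn>5</mn><mo>=</mo><mn>12</mn></math>",
    "Difficulty": 1,
    "Grade": 1,
}

# Walk the MathML children of <math> to recover the equation tokens.
tokens = [el.text for el in ET.fromstring(problem["mathml_equations"])]
```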
Subject Area: Text Mining
Description: This is the dataset used for the SIAM 2007 Text Mining competition. This competition focused on developing text mining algorithms for document classification. The documents in question were aviation safety reports that documented one or more problems that occurred during certain flights. The goal was to label the documents with respect to the types of problems that were described. This is a subset of the Aviation Safety Reporting System (ASRS) dataset, which is publicly available.
How Data Was Acquired: The data for this competition came from human-generated reports on incidents that occurred during a flight.
Sample Rates, Parameter Description, and Format: There is one document per incident. The datasets are in raw text format. All documents for each set will be contained in a single file. Each row in this file corresponds to a single document. The first characters on each line of the file are the document number, and a tilde separates the document number from the text itself.
Anomalies/Faults: This is a document category classification problem.
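A minimal sketch of splitting one line of such a file into document number and report text (the line content is invented):

```python
def parse_report_line(line):
    """Split one raw line into (document number, report text) at the tilde."""
    doc_id, _, text = line.partition("~")
    return doc_id.strip(), text.strip()

# Invented example line in the described "number~text" layout.
doc_id, text = parse_report_line("000123~ENGINE FIRE WARNING ON CLIMBOUT...")
```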
Dataset Card for math-problem-explanations-dataset
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/hk-gaianet/math-problem-explanations-dataset/raw/main/pipeline.yaml"
or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/harishkotra/math-problem-explanations-dataset.
Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The problem of training a deep neural network with a small set of positive samples is known as few-shot learning (FSL). It is widely known that traditional deep learning (DL) algorithms usually show very good performance when trained with large datasets. However, in many applications, it is not possible to obtain such a high number of samples. In the image domain, typical FSL applications are those related to face recognition. In the audio domain, music fraud or speaker recognition can be clearly benefited from FSL methods. This paper deals with the application of FSL to the detection of specific and intentional acoustic events given by different types of sound alarms, such as door bells or fire alarms, using a limited number of samples. These sounds typically occur in domestic environments where many events corresponding to a wide variety of sound classes take place. Therefore, the detection of such alarms in a practical scenario can be considered an open-set recognition (OSR) problem. To address the lack of a dedicated public dataset for audio FSL, researchers usually make modifications on other available datasets. This paper is aimed at providing the audio recognition community with a carefully annotated dataset for FSL and OSR comprised of 1360 clips from 34 classes divided into pattern sounds and unwanted sounds. To facilitate and promote research in this area, results with two baseline systems (one trained from scratch and another based on transfer learning), are presented.
This table shows the 10 most frequently recorded incident problem types as recorded by communications personnel for each fiscal year presented.
Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Project Documentation: Cucumber Disease Detection
Introduction: The "Cucumber Disease Detection" project aims to develop a machine learning model for the automatic detection of diseases in cucumber plants. This research is crucial because it tackles the issue of early disease identification in agriculture, which can increase crop yield and cut down on financial losses. To train and test the model, we use a dataset of pictures of cucumber plants.
Importance: Early disease diagnosis helps minimize crop losses, stop the spread of diseases, and better allocate resources in farming. Agriculture is a real-world application of this concept.
Goals and Objectives: Develop a machine learning model to classify cucumber plant images into healthy and diseased categories. Achieve a high level of accuracy in disease detection. Provide a tool for farmers to detect diseases early and take appropriate action.
Data Collection: Using cameras and smartphones, images from agricultural areas were gathered.
Data Preprocessing: Data cleaning to remove irrelevant or corrupted images. Handling missing values, if any, in the dataset. Removing outliers that may negatively impact model training. Data augmentation techniques applied to increase dataset diversity.
Exploratory Data Analysis (EDA): The dataset was examined using visuals like scatter plots and histograms and checked for patterns, trends, and correlations. EDA made it easier to understand the distribution of images of healthy and diseased plants.
Methodology Machine Learning Algorithms:
Convolutional Neural Networks (CNNs) were chosen for image classification due to their effectiveness in handling image data. Transfer learning using pre-trained models such as ResNet or MobileNet may be considered. Train-Test Split:
The dataset was split into training and testing sets with a suitable ratio. Cross-validation may be used to assess model performance robustly.
Model Development The CNN model's architecture consists of layers, units, and activation operations. On the basis of experimentation, hyperparameters including learning rate, batch size, and optimizer were chosen. To avoid overfitting, regularization methods like dropout and L2 regularization were used.
Model Training: During training, the model was fed the prepared dataset across a number of epochs. The loss function was minimized using an optimization method. To ensure convergence, early stopping and model checkpoints were used.
Model Evaluation Evaluation Metrics:
Accuracy, precision, recall, F1-score, and confusion matrix were used to assess model performance. Results were computed for both training and test datasets. Performance Discussion:
The model's performance was analyzed in the context of disease detection in cucumber plants. Strengths and weaknesses of the model were identified.
Results and Discussion: Key project findings include model performance and disease detection precision, a comparison of the models employed (showing the benefits and drawbacks of each), and the challenges faced throughout the project together with the methods used to solve them.
Conclusion: A recap of the project's key learnings. The project's importance to early disease detection in agriculture is highlighted, and future enhancements and potential research directions are suggested.
References
Libraries: Pillow, Roboflow, YOLO, scikit-learn, matplotlib
Datasets: https://data.mendeley.com/datasets/y6d3z6f8z9/1
Code Repository https://universe.roboflow.com/hakuna-matata/cdd-g8a6g
Rafiur Rahman Rafit EWU 2018-3-60-111
Author: Isabelle Guyon
Source: UCI
Please cite: Isabelle Guyon, Steve R. Gunn, Asa Ben-Hur, Gideon Dror, 2004. Result analysis of the NIPS 2003 feature selection challenge.
MADELON is an artificial dataset, which was part of the NIPS 2003 feature selection challenge. This is a two-class classification problem with continuous input variables. The difficulty is that the problem is multivariate and highly non-linear.
Isabelle Guyon Clopinet 955 Creston Road Berkeley, CA 90708 isabelle '@' clopinet.com
MADELON is an artificial dataset containing data points grouped in 32 clusters placed on the vertices of a five-dimensional hypercube and randomly labeled +1 or -1. The five dimensions constitute 5 informative features. 15 linear combinations of those features were added to form a set of 20 (redundant) informative features. Based on those 20 features one must separate the examples into the 2 classes (corresponding to the ±1 labels). A number of distractor features called 'probes', with no predictive power, were added. The order of the features and patterns was randomized.
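A MADELON-style dataset can be sketched with scikit-learn's make_classification, whose generator is adapted from MADELON. The parameters below mirror the description (5 informative features, 15 redundant combinations, 32 clusters across 2 classes); the sample count and total feature count are illustrative assumptions:

```python
from sklearn.datasets import make_classification

# MADELON-like synthetic data: 5 informative features, 15 redundant linear
# combinations, remaining columns are uninformative "probes", and the
# clusters sit at hypercube vertices. Sizes here are illustrative.
X, y = make_classification(
    n_samples=600,
    n_features=500,
    n_informative=5,
    n_redundant=15,
    n_clusters_per_class=16,  # 2 classes x 16 = 32 clusters
    flip_y=0.01,
    shuffle=True,
    random_state=0,
)
```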
This dataset is one of five datasets used in the NIPS 2003 feature selection challenge. The original data was split into training, validation and test sets. Target values are provided only for the first two sets (not for the test set). So, this dataset version contains all the examples from the training and validation partitions.
There is no attribute information provided to avoid biasing the feature selection process.
The best challenge entrants wrote papers collected in the book: Isabelle Guyon, Steve Gunn, Masoud Nikravesh, Lotfi Zadeh (Eds.), Feature Extraction, Foundations and Applications. Studies in Fuzziness and Soft Computing. Physica-Verlag, Springer.
Isabelle Guyon, et al, 2007. Competitive baseline methods set new standards for the NIPS 2003 feature selection benchmark. Pattern Recognition Letters 28 (2007) 1438–1444.
Isabelle Guyon, et al. 2006. Feature selection with the CLOP package. Technical Report.
Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This Zenodo repository contains the main dataset for the GENEA 2022 challenge, which is based on the Talking With Hands 16.2M data.
Contents:
The "trn" and "val" zip files contain audio files (in WAV format), time-aligned transcriptions (in TSV format), and motion files (in BVH format) for the training and validation datasets, respectively.
The "tst" zip file contains audio files (in WAV format) and transcriptions (in TSV format) for the test set, but no motion. The corresponding test motion is available at:
https://doi.org/10.5281/zenodo.6976463
Each zip file also contains a "metadata.csv" file that contains information for all files regarding the speaker ID and whether or not the motion files contain finger motion.
Note that the speech audio in the data sometimes has been replaced by silence for the purpose of anonymisation.
Data processing scripts:
We provide a number of optional scripts for encoding and processing the challenge data:
Audio: Scripts for extracting basic audio features, such as spectrograms, prosodic features, and mel-frequency cepstral coefficients (MFCCs) can be found at this link.
Text: A script to encode text transcriptions to word vectors using FastText is available: tsv2wordvectors.py
Motion: If you wish to encode the joint angles from the BVH files to and from an exponential map representation, you can use scripts by Simon Alexanderson based on the PyMo library, which are available here:
Attribution:
If you use this material, please cite our latest paper on the GENEA Challenge 2022. At the time of writing (2022-08-16), that is our ACM ICMI 2022 paper:
Youngwoo Yoon, Pieter Wolfert, Taras Kucherenko, Carla Viegas, Teodor Nikolov, Mihail Tsakov, and Gustav Eje Henter. 2022. The GENEA Challenge 2022: A large evaluation of data-driven co-speech gesture generation. In Proceedings of the ACM International Conference on Multimodal Interaction (ICMI '22). ACM.
You can find the latest information and a BibTeX file on the project website:
https://youngwoo-yoon.github.io/GENEAchallenge2022/
Also cite the paper about the original dataset from Meta Research:
Gilwoo Lee, Zhiwei Deng, Shugao Ma, Takaaki Shiratori, Siddhartha S. Srinivasa, and Yaser Sheikh. 2019. Talking With Hands 16.2M: A large-scale dataset of synchronized body-finger motion and audio for conversational motion analysis and synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV ’19). IEEE, 763–772.
The motion and audio files are based on the Talking With Hands 16.2M dataset at https://github.com/facebookresearch/TalkingWithHands32M/. All material is available under a CC BY NC 4.0 international license, with the text provided in LICENSE.txt.
To find more GENEA Challenge 2022 material on the web, please see:
* https://youngwoo-yoon.github.io/GENEAchallenge2022/
* https://genea-workshop.github.io/2022/challenge/
If you have any questions or comments, please contact:
* The GENEA Challenge & Workshop organisers
Description of dataset
This repository contains the datasets used in our study "Permutation Flow Shop Scheduling with Multiple Lines and Demand Plans Using Reinforcement Learning".
The repository provides three datasets. The main dataset in the folder data contains 1050 problem instances for the multi-line permutation flow shop problem. The dataset extends the dataset introduced by Taillard (1993) to multiple production lines and demand plans. The processing times were sampled from the uniform interval [1,99]. In addition, we sampled the demand plans from a multinomial distribution with equal probability for each job type, which leads to rather balanced demand plans.
The folders data_lin and data_exp each contain 150 problem instances with more imbalanced demand plans, where the probabilities of each job type decrease linearly or exponentially. However, the quantity to be produced for each job type is greater than zero.
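As a sketch of the sampling described above, a balanced demand plan (multinomial with equal probability per job type) and a processing-time matrix (uniform integers in [1, 99]) could be generated like this. The job-type count, total demand, and machine count are illustrative assumptions, not values from the dataset:

```python
import random
from collections import Counter

random.seed(42)

# Example values (assumptions for illustration): 5 job types, 20 jobs in total.
n_job_types = 5
total_demand = 20

# Multinomial sampling with equal probability per job type: draw each of the
# `total_demand` jobs uniformly at random among the job types and count them.
draws = random.choices(range(n_job_types), k=total_demand)
counts = Counter(draws)
demand_plan = [counts.get(j, 0) for j in range(n_job_types)]

# Processing times sampled uniformly from the integer interval [1, 99].
n_machines = 5
processing_times = [
    [random.randint(1, 99) for _ in range(n_job_types)]
    for _ in range(n_machines)
]
```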
Dataset structure
Each dataset of the main study is structured in 15 subfolders. Each folder contains problem instances for a combination of layout and processing time variation.
Folder name notation: Tai_PFSP_L_
* A: Number of production lines (1-3)
* B: Processing time variation (1-5)
All of these folders contain 70 problem instances. A problem file is a combination of the problem layout (sequence length, number of machines and stations) and the demand plan variation. The processing times are fixed for one problem characteristic, but the ten demand plans are different.
File name notation: t
* C: Number of production lines (1-3)
* D: Identifier of the layout (1-7)
* E: Sequence length (20, 100, 500)
* F: Number of machines per line (5, 10, 20)
* G: Number of job types (5, 10, 20)
* H: Demand plan variation (1-10)
Each file represents a different problem in text format with the following notation:
* Line 1: Demand plan
* Line 2: Layout type
* Line 3: Number of machines
* Line 4: Number of machines per line
* Line 5: Number of total machines including the synchronization machine
* Lines 6: Number of job types
* Lines 7-end: Processing times in matrix form for machines (rows) and job types (columns)
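A minimal parser for this file layout might look as follows; the field names and the assumption that numbers are whitespace-separated are mine, not taken from the actual files:

```python
def parse_instance(text):
    """Parse a problem file following the line layout described above.

    Hypothetical sketch: dictionary keys and the whitespace-separated
    number format are assumptions, not taken from the actual files.
    """
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    return {
        "demand_plan": [int(x) for x in lines[0].split()],
        "layout_type": int(lines[1]),
        "n_machines": int(lines[2]),
        "machines_per_line": int(lines[3]),
        "total_machines_with_sync": int(lines[4]),
        "n_job_types": int(lines[5]),
        # Rows are machines, columns are job types.
        "processing_times": [[int(x) for x in row.split()] for row in lines[6:]],
    }

# Invented example instance: 5 job types, 2 machines on one line.
sample = """4 6 5 3 2
1
2
2
3
5
10 20 30 40 50
15 25 35 45 55"""
inst = parse_instance(sample)
```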
Data Access: The data in this research collection may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so they must be used for research purposes only. Due to these restrictions, the collection is not open data. Please download the Data Sharing Agreement and send the signed form to fakenewstask@gmail.com.
Citation
Please cite our work as
@article{shahi2021overview,
title={Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection},
author={Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas},
journal={Working Notes of CLEF},
year={2021}
}
Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English.
Subtask 3A: Multi-class fake news detection of news articles (English). Subtask 3A is designed as a four-class classification problem. The training data will be released in batches of roughly 900 articles with their respective labels. Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. Our definitions for the categories are as follows:
False - The main claim made in an article is untrue.
Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.
True - This rating indicates that the primary elements of the main claim are demonstrably true.
Other - An article that cannot be categorised as true, false, or partially false due to a lack of evidence about its claims. This category includes articles in dispute and unproven articles.
Subtask 3B: Topical Domain Classification of News Articles (English). Fact-checkers require background expertise to identify the truthfulness of an article, and this categorisation will help to automate the sampling process from a stream of data. Given the text of a news article, determine its topical domain (English). This is a classification problem: the task is to categorise fake news articles into six topical categories, such as health, election, crime, climate, and education. This task will be offered for a subset of the data of Subtask 3A.
Input Data
The data will be provided in the format of Id, title, text, rating, and domain; the description of the columns is as follows:
Task 3a
Task 3b
Output data format
Task 3a
Sample File
public_id, predicted_rating
1, false
2, true
Task 3b
Sample file
public_id, predicted_domain
1, health
2, crime
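For illustration, a submission file in the Task 3a format shown above could be written with the standard csv module; the predictions themselves are made up:

```python
import csv
import io

# Hypothetical predictions; the header row matches the sample file above.
predictions_3a = [(1, "false"), (2, "true")]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["public_id", "predicted_rating"])
writer.writerows(predictions_3a)
submission = buf.getvalue()  # write this string to a .csv file for upload
```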
Additional data for Training
To train your model, participants can use additional data with a similar format; some datasets are available on the web. We do not provide the ground truth for those datasets. For testing, we will not use any articles from other datasets. Some possible sources:
IMPORTANT!
Evaluation Metrics
This task is evaluated as a classification task. We will use the F1-macro measure for the ranking of teams. There is a limit of 5 runs (in total, not per day), and only one person per team is allowed to submit runs.
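The F1-macro measure is the unweighted mean of the per-class F1 scores. A plain-Python sketch (not the official evaluation script) could look like this:

```python
def f1_macro(y_true, y_pred):
    """Macro-averaged F1: mean of per-class F1 scores, each class weighted equally."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)
```

In practice the same number can be obtained with scikit-learn's `f1_score(y_true, y_pred, average="macro")`.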
Submission Link: https://competitions.codalab.org/competitions/31238
Related Work
The QoG Institute is an independent research institute within the Department of Political Science at the University of Gothenburg. Around 30 researchers conduct and promote research on the causes, consequences, and nature of Good Governance and the Quality of Government - that is, trustworthy, reliable, impartial, uncorrupted, and competent government institutions.
The main objective of our research is to address the theoretical and empirical problem of how political institutions of high quality can be created and maintained. A second objective is to study the effects of Quality of Government on a number of policy areas, such as health, the environment, social policy, and poverty.
The QoG Standard Dataset is our largest dataset, consisting of more than 2,000 variables from sources related to the Quality of Government. The data exist in both time-series (1946 and onwards) and cross-section (2020) versions. Many of the variables are available in both datasets, but some are not. The datasets draw on a number of freely available data sources related to QoG and its correlates.
In the QoG Standard CS dataset, data from and around 2020 is included. Data from 2020 is prioritized; however, if no data is available for a country for 2020, data for 2021 is included. If no data exists for 2021, data for 2019 is included, and so on up to a maximum of +/- 3 years.
In the QoG Standard TS dataset, data from 1946 and onwards is included and the unit of analysis is country-year (e.g., Sweden-1946, Sweden-1947, etc.).
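The year-selection rule for the CS dataset can be sketched as follows; the function name and the exact search order beyond 2021/2019 (later year preferred at each offset) are assumptions based on the description above:

```python
def pick_cs_value(values_by_year, target=2020, max_offset=3):
    """Pick the observation closest to the target year within +/- 3 years.

    Hypothetical sketch: searches 2020, then 2021, 2019, 2022, 2018,
    2023, 2017, mirroring the fallback rule described above.
    Returns a (year, value) pair, or None if nothing usable exists.
    """
    for offset in range(max_offset + 1):
        for year in (target + offset, target - offset):
            if year in values_by_year:
                return year, values_by_year[year]
    return None  # no observation within +/- 3 years of the target
```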
License: https://creativecommons.org/publicdomain/zero/1.0/
By camel-ai (From Huggingface) [source]
To ensure diversity and coverage across various aspects of chemistry, this dataset spans 25 main topics, encompassing a wide range of subtopics within each. Each main-topic and subtopic combination contains a set of 32 distinct problems for analysis and study.
In order to facilitate efficient data exploration and analysis, the dataset is structured with essential columns including 'role_1' which signifies the role or identity responsible for presenting either the problem statement or solution. Additionally, 'sub_topic' denotes the specific subarea within each main topic to which both problem and solution belong.
This expansive dataset contains problem statements and their corresponding solutions from diverse topics in chemistry, categorized into distinct domains (both main topics and subtopics). Users can therefore navigate directly to specific areas of interest and make informed decisions about which subsets to explore further, based on their project requirements or learning objectives.
Please note that since this dataset was generated using the GPT-4 model, it is critical to conduct careful validation checks before using these data points in real-life scenarios or academic research where precision plays a vital role.
About the Dataset
The dataset contains 20,000 pairs of problem statements and their corresponding solutions, covering a wide range of topics within the field of chemistry. These pairs have been generated using the GPT-4 model, ensuring that they are diverse and representative of various concepts in chemistry.
Main Topics and Subtopics
The dataset is organized into 25 main topics, with each topic having 25 subtopics. The main topics represent broader areas within chemistry, while the subtopics narrow down to specific subjects within each main topic. This hierarchical structure allows for better categorization and navigation through different aspects of chemistry problems.
Problem Statement
The problem statement (message_1) column provides a concise description or statement of a specific chemistry problem. It sets up the context for understanding what needs to be solved or analyzed.
Solution
The solution (message_2) column contains the respective answer or solution to each problem statement. It offers insights into how to approach and solve specific types of chemistry problems.
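As an illustration of working with these columns, the sketch below groups problem-solution pairs by subtopic; the sample rows are invented, and only the column names (role_1, sub_topic, message_1, message_2) follow the description above:

```python
import csv
import io
from collections import defaultdict

# A tiny in-memory stand-in for the dataset CSV; the rows are invented.
sample_csv = """role_1,sub_topic,message_1,message_2
Chemist,Acid-base titrations,What is the pH of 0.1 M HCl?,The pH is 1.
Chemist,Thermochemistry,Define enthalpy.,Enthalpy is the heat content at constant pressure."""

# Group (problem, solution) pairs by their subtopic.
by_subtopic = defaultdict(list)
for row in csv.DictReader(io.StringIO(sample_csv)):
    by_subtopic[row["sub_topic"]].append((row["message_1"], row["message_2"]))
```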
How to Utilize this Dataset
Here are some ways you can leverage this dataset:
Study Specific Topics: Since there are 25 main topics with multiple subtopics in this dataset, you can focus on exploring certain areas that interest you or align with your learning goals in chemistry.
Develop Learning Resources: As an educator or content creator, you can use this dataset as inspiration for creating educational materials such as textbooks, online courses, or lesson plans focused on different topics within chemistry.
Build Intelligent Systems: If you're working on developing AI-powered systems related to solving chemistry problems or providing chemical insights, this dataset can serve as training data for your models.
Evaluate Existing Models: If you have a chemistry problem-solving model or algorithm, you can use this dataset to evaluate its performance and fine-tune it further.
Generate New Problem-Solution Pairs: You can use the existing problem-solution pairs as a starting point and leverage them to generate new problem-solution pairs by applying techniques like data augmentation or natural language processing.
Limitations
It's important to consider the following limitations of the dataset:
- The dataset is AI-generated using the GPT-4 model, which means some solutions may contain errors or inaccuracies and should be validated before use.
- Educational Resource: This dataset can be used to create an educational resource for chemistry students. The problem-solution pairs can be used as practice questions, allowing students to test their understanding and problem-solving skills.
- AI Model Training: The dataset can be utilized to train AI models in the field of chemistry education. By feeding the problem-solution pairs into the model, it can learn to generate accurate solutions for various chemistry problems.
- Research Analysis: Researchers in the field of chemistry education or n...
License: https://spdx.org/licenses/etalab-2.0.html
Dataset for the Assembly Line Preliminary Design Optimization Problem presented at the CP'23 conference. The Assembly Line Preliminary Design Problem consists in defining, for a given aircraft design, the best assembly line layout and the type and number of machines equipping each workstation. This dataset contains data coming from a real assembly line manufacturer. Each instance describes a scenario for building an aircraft (a build process) and details which tasks should be performed and which types of machines are used. The objective of the problem considered in the original paper is to design a set of Pareto-optimal assembly lines (number of stations, rate, number of machines) with regard to the investment cost, the time to build the entire aircraft, and the rate.
License: https://creativecommons.org/publicdomain/zero/1.0/
By math_dataset (From Huggingface) [source]
This dataset comprises a collection of mathematical problems and their solutions designed for training and testing purposes. Each problem is presented in the form of a question, followed by its corresponding answer. The dataset covers various mathematical topics such as arithmetic, polynomials, and prime numbers. For instance, the arithmetic_nearest_integer_root_test.csv file focuses on problems involving finding the nearest integer root of a given number. Similarly, the polynomials_simplify_power_test.csv file deals with problems related to simplifying polynomials with powers. Additionally, the dataset includes the numbers_is_prime_train.csv file containing math problems that require determining whether a specific number is prime or not. The questions and answers are provided in text format to facilitate analysis and experimentation with mathematical problem-solving algorithms or models.
Introduction: The Mathematical Problems Dataset contains a collection of various mathematical problems and their corresponding solutions or answers. This guide will provide you with all the necessary information on how to utilize this dataset effectively.
Understanding the columns: The dataset consists of several columns, each representing a different aspect of the mathematical problem and its solution. The key columns are:
- question: This column contains the text representation of the mathematical problem or equation.
- answer: This column contains the text representation of the solution or answer to the corresponding problem.
Exploring specific problem categories: To focus on specific types of mathematical problems, you can filter or search within the dataset using relevant keywords or terms related to your area of interest. For example, if you are interested in prime numbers, you can search for prime in the question column.
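For example, the keyword filtering described above can be done with the standard csv module; the sample rows here are invented for illustration:

```python
import csv
import io

# A tiny in-memory stand-in for a file such as numbers_is_prime_train.csv.
sample_csv = """question,answer
Is 7 a prime number?,True
What is 2 + 2?,4
Is 10 a prime number?,False"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Keep only the rows whose question mentions "prime".
prime_rows = [r for r in rows if "prime" in r["question"].lower()]
```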
Applying machine learning techniques: This dataset can be used for training machine learning models related to natural language understanding and mathematics. You can explore various techniques such as text classification, sentiment analysis, or even sequence-to-sequence models for solving mathematical problems based on their textual representations.
Generating new questions and solutions: By analyzing patterns in this dataset, you can generate new questions and solutions programmatically using techniques like data augmentation or rule-based methods.
Validation and evaluation: As with any machine learning task, it is essential to properly validate your models on separate validation sets not included in this dataset. You can also evaluate model performance by comparing predictions against the known answers provided in this dataset's answer column.
Sharing insights and findings: After working with this dataset, researchers and educators are encouraged to share the insights and approaches from their analysis and modelling as Kaggle notebooks, discussions, blog posts, or tutorials, so that others can benefit from these shared resources.
Note: Please note that the dataset does not include dates.
By following these guidelines, you can effectively explore and utilize the Mathematical Problems Dataset for various mathematical problem-solving tasks. Happy exploring!
- Developing machine learning algorithms for solving mathematical problems: This dataset can be used to train and test models that can accurately predict the solution or answer to different mathematical problems.
- Creating educational resources: The dataset can be used to create a wide variety of educational materials such as problem sets, worksheets, and quizzes for students studying mathematics.
- Research in mathematical problem-solving strategies: Researchers and educators can analyze the dataset to identify common patterns or strategies employed in solving different types of mathematical problems. This analysis can help improve teaching methodologies and develop effective problem-solving techniques.
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purpos...