100+ datasets found
  1. Mathematical Problems Dataset: Various

    • kaggle.com
    zip
    Updated Dec 2, 2023
    Cite
    The Devastator (2023). Mathematical Problems Dataset: Various [Dataset]. https://www.kaggle.com/datasets/thedevastator/mathematical-problems-dataset-various-mathematic/code
    Explore at:
    zip (2498203187 bytes). Available download formats
    Dataset updated
    Dec 2, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Mathematical Problems Dataset: Various Mathematical Problems and Solutions

    Mathematical Problems Dataset: Questions and Answers

    By math_dataset (From Huggingface) [source]

    About this dataset

    This dataset comprises a collection of mathematical problems and their solutions designed for training and testing purposes. Each problem is presented in the form of a question, followed by its corresponding answer. The dataset covers various mathematical topics such as arithmetic, polynomials, and prime numbers. For instance, the arithmetic_nearest_integer_root_test.csv file focuses on problems involving finding the nearest integer root of a given number. Similarly, the polynomials_simplify_power_test.csv file deals with problems related to simplifying polynomials with powers. Additionally, the dataset includes the numbers_is_prime_train.csv file containing math problems that require determining whether a specific number is prime or not. The questions and answers are provided in text format to facilitate analysis and experimentation with mathematical problem-solving algorithms or models.

    How to use the dataset

    • Introduction: The Mathematical Problems Dataset contains a collection of various mathematical problems and their corresponding solutions or answers. This guide will provide you with all the necessary information on how to utilize this dataset effectively.

    • Understanding the columns: The dataset consists of several columns, each representing a different aspect of the mathematical problem and its solution. The key columns are:

      • question: This column contains the text representation of the mathematical problem or equation.
      • answer: This column contains the text representation of the solution or answer to the corresponding problem.
    • Exploring specific problem categories: To focus on specific types of mathematical problems, you can filter or search within the dataset using relevant keywords or terms related to your area of interest. For example, if you are interested in prime numbers, you can search for prime in the question column.

    • Applying machine learning techniques: This dataset can be used for training machine learning models related to natural language understanding and mathematics. You can explore various techniques such as text classification, sentiment analysis, or even sequence-to-sequence models for solving mathematical problems based on their textual representations.

    • Generating new questions and solutions: By analyzing patterns in this dataset, you can generate new questions and solutions programmatically using techniques like data augmentation or rule-based methods.

    • Validation and evaluation: As with any other machine learning task, it is essential to properly validate your models on separate validation sets not included in this dataset. You can also evaluate model performance by comparing predictions against the known answers provided in this dataset's answer column.

    • Sharing insights and findings: After working with this dataset, researchers and educators are encouraged to share their insights and the approaches they took during analysis and modelling as Kaggle notebooks, discussions, blog posts, or tutorials, so that others can benefit from these shared resources.

    Note: the dataset does not include dates.

    By following these guidelines, you can effectively explore and utilize the Mathematical Problems Dataset for various mathematical problem-solving tasks. Happy exploring!
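    As an illustration of the keyword-filtering step described above, the snippet below runs a minimal search for "prime" over the question column. The rows here are invented stand-ins; the real CSV files (e.g. numbers_is_prime_train.csv) would be loaded with pd.read_csv instead.

```python
import pandas as pd

# Invented rows shaped like the dataset's `question`/`answer` columns.
df = pd.DataFrame({
    "question": [
        "Is 7 a prime number?",
        "Simplify (x**2)**3.",
        "Is 91 a prime number?",
    ],
    "answer": ["True", "x**6", "False"],
})

# Keyword filter on the question column, as suggested above.
prime_problems = df[df["question"].str.contains("prime", case=False)]
print(len(prime_problems))  # 2
```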

    Research Ideas

    • Developing machine learning algorithms for solving mathematical problems: This dataset can be used to train and test models that can accurately predict the solution or answer to different mathematical problems.
    • Creating educational resources: The dataset can be used to create a wide variety of educational materials such as problem sets, worksheets, and quizzes for students studying mathematics.
    • Research in mathematical problem-solving strategies: Researchers and educators can analyze the dataset to identify common patterns or strategies employed in solving different types of mathematical problems. This analysis can help improve teaching methodologies and develop effective problem-solving techniques.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purpos...

  2. Number Words Dataset

    • kaggle.com
    zip
    Updated Apr 25, 2024
    Cite
    Ashutosh_kun (2024). Number Words Dataset [Dataset]. https://www.kaggle.com/datasets/ashutoshkun/number-words-dataset
    Explore at:
    zip (4704889 bytes). Available download formats
    Dataset updated
    Apr 25, 2024
    Authors
    Ashutosh_kun
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Description: This dataset contains images of numbers written in words from one to fifty (one, ONE, One, two, TWO, Two, …). Each image is stored in the folder named after its number word (one, two, three, …).

    Content: The dataset includes images of numbers written in words from one to fifty in various formats and styles. Images are provided in JPG, JPEG, and PNG formats.

    Usage: This dataset can be used to develop machine learning models for optical character recognition (OCR) tasks or Image Classification. The goal is to train a model that can predict what is written in words when given an image containing the word.

    Acknowledgements: This dataset was created for the purpose of solving the problem statement: "Develop a machine-learning model to train with images of numbers written in words from one to fifty."
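    Since labels come from the folder layout described above, a minimal sketch for collecting (image, label) pairs might look like this. The function name and directory layout are assumptions for illustration; the real dataset's folders are named after the number words.

```python
from pathlib import Path

# Image formats the dataset description mentions.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}

def collect_samples(root):
    """Return (image_path, label) pairs, labeling each image by its folder name."""
    return [
        (str(p), p.parent.name)
        for p in sorted(Path(root).glob("*/*"))
        if p.suffix.lower() in IMAGE_SUFFIXES
    ]
```

    The resulting pairs can feed any standard image-classification or OCR training pipeline.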

  3. math-problems-greedy-vs-best-of-n

    • huggingface.co
    Updated Jan 29, 2025
    Cite
    Zeynep (2025). math-problems-greedy-vs-best-of-n [Dataset]. https://huggingface.co/datasets/Tandogan/math-problems-greedy-vs-best-of-n
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 29, 2025
    Authors
    Zeynep
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Problem Solving Math Dataset - Greedy vs Best-of-N

    This dataset contains mathematical problems and their solutions generated using two decoding strategies:

    • Greedy Decoding: generates a single deterministic solution.
    • Best-of-N Decoding: generates N solutions and selects the best one based on a scoring metric.

      Dataset Structure
    

    This dataset is created with a filtered subset of 20 level 1-3 problems from the MATH-500 dataset. To have a balance across the levels, the… See the full description on the dataset page: https://huggingface.co/datasets/Tandogan/math-problems-greedy-vs-best-of-n.
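    The two decoding strategies this dataset compares can be sketched as follows. The candidate solutions, scores, and function names are invented for illustration; in practice the scoring metric would be a reward model or verifier.

```python
def greedy_decode(candidates):
    # Greedy decoding commits to the single deterministic generation.
    return candidates[0]

def best_of_n(candidates, score):
    # Best-of-N samples several generations and keeps the highest-scoring one.
    return max(candidates, key=score)

# Invented candidates and scores for a toy math problem.
candidates = ["x = 3", "x = 4", "x = 3 (verified)"]
scores = {"x = 3": 0.6, "x = 4": 0.2, "x = 3 (verified)": 0.9}

print(greedy_decode(candidates))          # x = 3
print(best_of_n(candidates, scores.get))  # x = 3 (verified)
```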

  4. Dataset of Video Comments of a Vision Video Classified by Their Relevance,...

    • data.niaid.nih.gov
    Updated Jul 19, 2024
    Cite
    Karras, Oliver; Kristo, Eklekta (2024). Dataset of Video Comments of a Vision Video Classified by Their Relevance, Polarity, Intention, and Topic [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4533301
    Explore at:
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Leibniz University Hannover
    TIB - Leibniz Information Centre for Science and Technology
    Authors
    Karras, Oliver; Kristo, Eklekta
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains all comments (comments and replies) of the YouTube vision video "Tunnels" by "The Boring Company", fetched on 2020-10-13 using the YouTube API. The comments were classified manually by three persons. We performed a single-class labeling of the video comments regarding their relevance for requirements engineering (RE) (ham/spam) and their polarity (positive/neutral/negative). Furthermore, we performed a multi-class labeling of the comments regarding their intention (feature request and problem report) and their topic (efficiency and safety). While a comment can only be relevant or not relevant and have only one polarity, a comment can have one or more intentions and also one or more topics.

    For the replies, one person also classified them regarding their relevance for RE. However, the investigation of the replies is ongoing future work.

    Remark: For 126 comments and 26 replies, we could not determine the date and time since they were no longer accessible on YouTube at the time this data set was created. In the case of a missing date and time, we inserted "NULL" in the corresponding cell.

    This data set includes the following files:

    Dataset.xlsx contains the raw and labeled video comments and replies:

    For each comment, the data set contains:

    ID: An identification number generated by YouTube for the comment

    Date: The date and time of the creation of the comment

    Author: The username of the author of the comment

    Likes: The number of likes of the comment

    Replies: The number of replies to the comment

    Comment: The written comment

    Relevance: Label indicating the relevance of the comment for RE (ham = relevant, spam = irrelevant)

    Polarity: Label indicating the polarity of the comment

    Feature request: Label indicating that the comment requests a feature

    Problem report: Label indicating that the comment reports a problem

    Efficiency: Label indicating that the comment deals with the topic efficiency

    Safety: Label indicating that the comment deals with the topic safety

    For each reply, the data set contains:

    ID: The identification number of the comment to which the reply belongs

    Date: The date and time of the creation of the reply

    Author: The username of the author of the reply

    Likes: The number of likes of the reply

    Comment: The written reply

    Relevance: Label indicating the relevance of the reply for RE (ham = relevant, spam = irrelevant)

    Detailed analysis results.xlsx contains the detailed results of all ten times repeated 10-fold cross validation analyses for each of all considered combinations of machine learning algorithms and features

    Guide Sheet - Multi-class labeling.pdf describes the coding task, defines the categories, and lists examples to reduce inconsistencies and increase the quality of manual multi-class labeling

    Guide Sheet - Single-class labeling.pdf describes the coding task, defines the categories, and lists examples to reduce inconsistencies and increase the quality of manual single-class labeling

    Python scripts for analysis.zip contains the scripts (as jupyter notebooks) and prepared data (as csv-files) for the analyses
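    As a sketch of working with the label columns documented above: a comment is first filtered by its single-class relevance label, then by its multi-class intention and topic labels. The rows below are invented; the real labels live in Dataset.xlsx, which pandas can read with read_excel.

```python
import pandas as pd

# Invented rows mirroring the label columns described above.
df = pd.DataFrame({
    "Comment": ["Please add bike lanes", "The tunnel looks unsafe", "first!!!"],
    "Relevance": ["ham", "ham", "spam"],
    "Feature request": [1, 0, 0],
    "Problem report": [0, 1, 0],
    "Efficiency": [0, 0, 0],
    "Safety": [0, 1, 0],
})

# Keep only RE-relevant comments, then select problem reports about safety.
relevant = df[df["Relevance"] == "ham"]
safety_reports = relevant[(relevant["Problem report"] == 1) & (relevant["Safety"] == 1)]
print(len(safety_reports))  # 1
```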

  5. Room Assignment problem

    • kaggle.com
    zip
    Updated Oct 19, 2022
    Cite
    Daniel Sepulveda (2022). Room Assignment problem [Dataset]. https://www.kaggle.com/datasets/kathuman/room-assignment-problem
    Explore at:
    zip (6030 bytes). Available download formats
    Dataset updated
    Oct 19, 2022
    Authors
    Daniel Sepulveda
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This dataset contains information about a number of participants (participants.csv) to a workshop that need to be assigned to a number of rooms (rooms.csv).

    Restrictions:
    1. The workshop has 5 different activities.
    2. Each participant has indicated their first, second and third preferences for the available activities (Priority1, Priority2 and Priority3 columns in participants.csv).
    3. Participants are part of teams (Team column in participants.csv) and should be assigned together.
    4. Each activity lasts for half a day, and each participant will take part in one activity in the morning and one in the afternoon.
    5. Each room must host the SAME activity in the morning and in the afternoon.

    Requirements:
    A. Define the way in which each participant should be assigned, as a csv file in the format Name;ActivityAM;RoomAM;ActivityPM;RoomPM.
    B. Maximize the number of people getting their 1st and 2nd preferences.
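    A minimal sketch of emitting the required result file, assuming semicolon separators throughout; the participant name, activities, and rooms below are invented placeholders, not data from participants.csv or rooms.csv.

```python
import csv
import io

# Column order of the required output format.
FIELDS = ["Name", "ActivityAM", "RoomAM", "ActivityPM", "RoomPM"]

# Placeholder assignment (a real solver would produce one row per participant).
assignments = [
    {"Name": "Alice", "ActivityAM": "Act1", "RoomAM": "Room1",
     "ActivityPM": "Act3", "RoomPM": "Room2"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=FIELDS, delimiter=";")
writer.writeheader()
writer.writerows(assignments)
print(buf.getvalue())
```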

  6. github-issues-dataset

    • huggingface.co
    Updated Jan 29, 2025
    Cite
    Sharjeel Yunus (2025). github-issues-dataset [Dataset]. https://huggingface.co/datasets/sharjeelyunus/github-issues-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 29, 2025
    Authors
    Sharjeel Yunus
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    📌 GitHub Issues Dataset

    📂 Dataset Name: github-issues-dataset
    📊 Total Issues: 114073
    📜 Format: Parquet (.parquet)
    🔍 Source: GitHub Repositories (Top 100 Repos)

      📖 Overview
    

    This dataset contains 114,073 GitHub issues collected from the top 100 repositories on GitHub. It is designed for issue classification, severity/priority prediction, and AI/ML training.

      ✅ This dataset is useful for:
    

    AI/ML Training: Fine-tune models for issue classification &… See the full description on the dataset page: https://huggingface.co/datasets/sharjeelyunus/github-issues-dataset.

  7. Data from: FISBe: A real-world benchmark dataset for instance segmentation...

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    • +1more
    Updated Apr 2, 2024
    Cite
    Mais, Lisa; Hirsch, Peter; Managan, Claire; Kandarpa, Ramya; Rumberger, Josef Lorenz; Reinke, Annika; Maier-Hein, Lena; Ihrke, Gudrun; Kainmueller, Dagmar (2024). FISBe: A real-world benchmark dataset for instance segmentation of long-range thin filamentous structures [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10875062
    Explore at:
    Dataset updated
    Apr 2, 2024
    Dataset provided by
    German Cancer Research Center
    Max Delbrück Center
    Howard Hughes Medical Institute - Janelia Research Campus
    Max Delbrück Center for Molecular Medicine
    Authors
    Mais, Lisa; Hirsch, Peter; Managan, Claire; Kandarpa, Ramya; Rumberger, Josef Lorenz; Reinke, Annika; Maier-Hein, Lena; Ihrke, Gudrun; Kainmueller, Dagmar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    General

    For more details and the most up-to-date information please consult our project page: https://kainmueller-lab.github.io/fisbe.

    Summary

    A new dataset for neuron instance segmentation in 3d multicolor light microscopy data of fruit fly brains

    30 completely labeled (segmented) images

    71 partly labeled images

    altogether comprising ∼600 expert-labeled neuron instances (labeling a single neuron takes 30 to 60 minutes on average, yet a difficult one can take up to 4 hours)

    To the best of our knowledge, the first real-world benchmark dataset for instance segmentation of long thin filamentous objects

    A set of metrics and a novel ranking score for respective meaningful method benchmarking

    An evaluation of three baseline methods in terms of the above metrics and score

    Abstract

    Instance segmentation of neurons in volumetric light microscopy images of nervous systems enables groundbreaking research in neuroscience by facilitating joint functional and morphological analyses of neural circuits at cellular resolution. Yet said multi-neuron light microscopy data exhibits extremely challenging properties for the task of instance segmentation: Individual neurons have long-ranging, thin filamentous and widely branching morphologies, multiple neurons are tightly inter-weaved, and partial volume effects, uneven illumination and noise inherent to light microscopy severely impede local disentangling as well as long-range tracing of individual neurons. These properties reflect a current key challenge in machine learning research, namely to effectively capture long-range dependencies in the data. While respective methodological research is buzzing, to date methods are typically benchmarked on synthetic datasets. To address this gap, we release the FlyLight Instance Segmentation Benchmark (FISBe) dataset, the first publicly available multi-neuron light microscopy dataset with pixel-wise annotations. In addition, we define a set of instance segmentation metrics for benchmarking that we designed to be meaningful with regard to downstream analyses. Lastly, we provide three baselines to kick off a competition that we envision to both advance the field of machine learning regarding methodology for capturing long-range data dependencies, and facilitate scientific discovery in basic neuroscience.

    Dataset documentation:

    We provide a detailed documentation of our dataset, following the Datasheet for Datasets questionnaire:

    FISBe Datasheet

    Our dataset originates from the FlyLight project, where the authors released a large image collection of nervous systems of ~74,000 flies, available for download under CC BY 4.0 license.

    Files

    fisbe_v1.0_{completely,partly}.zip

    contains the image and ground truth segmentation data; there is one zarr file per sample, see below for more information on how to access zarr files.

    fisbe_v1.0_mips.zip

    maximum intensity projections of all samples, for convenience.

    sample_list_per_split.txt

    a simple list of all samples and the subset they are in, for convenience.

    view_data.py

    a simple python script to visualize samples, see below for more information on how to use it.

    dim_neurons_val_and_test_sets.json

    a list of instance ids per sample that are considered to be of low intensity/dim; can be used for extended evaluation.

    Readme.md

    general information

    How to work with the image files

    Each sample consists of a single 3d MCFO image of neurons of the fruit fly. For each image, we provide a pixel-wise instance segmentation for all separable neurons. Each sample is stored as a separate zarr file (zarr is a file storage format for chunked, compressed, N-dimensional arrays based on an open-source specification). The image data ("raw") and the segmentation ("gt_instances") are stored as two arrays within a single zarr file. The segmentation mask for each neuron is stored in a separate channel. The order of dimensions is CZYX.

    We recommend working in a virtual environment, e.g., by using conda:

    conda create -y -n flylight-env -c conda-forge python=3.9
    conda activate flylight-env

    How to open zarr files

    Install the python zarr package:

    pip install zarr

    Open a zarr file with:

    import zarr
    raw = zarr.open(<zarr_file>, mode='r', path="volumes/raw")
    seg = zarr.open(<zarr_file>, mode='r', path="volumes/gt_instances")

    Optional:
    import numpy as np
    raw_np = np.array(raw)

    Zarr arrays are read lazily on demand. Many functions that expect numpy arrays also work with zarr arrays. Optionally, the arrays can also be explicitly converted to numpy arrays.

    How to view zarr image files

    We recommend using napari to view the image data.

    Install napari:

    pip install "napari[all]"

    Save the following Python script:

    import zarr, sys, napari

    raw = zarr.load(sys.argv[1], mode='r', path="volumes/raw")
    gts = zarr.load(sys.argv[1], mode='r', path="volumes/gt_instances")

    viewer = napari.Viewer(ndisplay=3)
    for idx, gt in enumerate(gts):
        viewer.add_labels(gt, rendering='translucent', blending='additive', name=f'gt_{idx}')
    viewer.add_image(raw[0], colormap="red", name='raw_r', blending='additive')
    viewer.add_image(raw[1], colormap="green", name='raw_g', blending='additive')
    viewer.add_image(raw[2], colormap="blue", name='raw_b', blending='additive')
    napari.run()

    Execute:

    python view_data.py /R9F03-20181030_62_B5.zarr

    Metrics

    S: Average of avF1 and C

    avF1: Average F1 Score

    C: Average ground truth coverage

    clDice_TP: Average true positives clDice

    FS: Number of false splits

    FM: Number of false merges

    tp: Relative number of true positives

    For more information on our selected metrics and formal definitions please see our paper.

    Baseline

    To showcase the FISBe dataset together with our selection of metrics, we provide evaluation results for three baseline methods, namely PatchPerPix (ppp), Flood Filling Networks (FFN) and a non-learnt application-specific color clustering from Duan et al. For detailed information on the methods and the quantitative results please see our paper.

    License

    The FlyLight Instance Segmentation Benchmark (FISBe) dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

    Citation

    If you use FISBe in your research, please use the following BibTeX entry:

    @misc{mais2024fisbe,
      title = {FISBe: A real-world benchmark dataset for instance segmentation of long-range thin filamentous structures},
      author = {Lisa Mais and Peter Hirsch and Claire Managan and Ramya Kandarpa and Josef Lorenz Rumberger and Annika Reinke and Lena Maier-Hein and Gudrun Ihrke and Dagmar Kainmueller},
      year = 2024,
      eprint = {2404.00130},
      archivePrefix = {arXiv},
      primaryClass = {cs.CV}
    }

    Acknowledgments

    We thank Aljoscha Nern for providing unpublished MCFO images as well as Geoffrey W. Meissner and the entire FlyLight Project Team for valuable discussions. P.H., L.M. and D.K. were supported by the HHMI Janelia Visiting Scientist Program. This work was co-funded by Helmholtz Imaging.

    Changelog

    There have been no changes to the dataset so far. All future changes will be listed on the changelog page.

    Contributing

    If you would like to contribute, have encountered any issues or have any suggestions, please open an issue for the FISBe dataset in the accompanying github repository.

    All contributions are welcome!

  8. Dataset of Math Word Problems In Spanish and MathML

    • data.mendeley.com
    Updated May 30, 2024
    + more versions
    Cite
    Kevin Sossa (2024). Dataset of Math Word Problems In Spanish and MathML [Dataset]. http://doi.org/10.17632/skbvhkz5th.1
    Explore at:
    Dataset updated
    May 30, 2024
    Authors
    Kevin Sossa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains 150 Math Word Problems (MWP). Each problem is a textual math problem whose resolution involves a first- or second-degree equation. To create this set, academic and educational sources containing first- and second-degree math problems were selected, and some original problems were also included.

    Each problem in the dataset is structured as follows:

    • "question": A textual description of the math problem in Spanish.
    • "mathml_equations": The corresponding equation for the problem, expressed in MathML format to facilitate processing and manipulation by machine learning models.
    • "Difficulty": The number of variables in the equation.
    • "Grade": The degree of the equation, with 1 indicating a linear equation and 2 indicating a quadratic equation.
    • "Index": A unique identifier for each problem in the dataset.
    • "Author": The creator or source of the problem.
    • "Ref": The source or citation for the problem, if applicable.
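    As a sketch of working with this field layout, the snippet below splits problems by the Grade field (1 = linear, 2 = quadratic). The records are invented placeholders, not real rows from the dataset.

```python
# Invented example records following the field layout described above.
records = [
    {"question": "problema lineal de ejemplo", "mathml_equations": "<math>...</math>",
     "Difficulty": 1, "Grade": 1, "Index": 1, "Author": "n/a", "Ref": "n/a"},
    {"question": "problema cuadratico de ejemplo", "mathml_equations": "<math>...</math>",
     "Difficulty": 2, "Grade": 2, "Index": 2, "Author": "n/a", "Ref": "n/a"},
]

# Split linear (Grade 1) from quadratic (Grade 2) problems.
linear = [r for r in records if r["Grade"] == 1]
quadratic = [r for r in records if r["Grade"] == 2]
print(len(linear), len(quadratic))  # 1 1
```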

  9. SIAM 2007 Text Mining Competition dataset

    • catalog.data.gov
    • data.nasa.gov
    • +1more
    Updated Apr 11, 2025
    + more versions
    Cite
    Dashlink (2025). SIAM 2007 Text Mining Competition dataset [Dataset]. https://catalog.data.gov/dataset/siam-2007-text-mining-competition-dataset
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Dashlink
    Description

    Subject Area: Text Mining

    Description: This is the dataset used for the SIAM 2007 Text Mining competition. This competition focused on developing text mining algorithms for document classification. The documents in question were aviation safety reports that documented one or more problems that occurred during certain flights. The goal was to label the documents with respect to the types of problems that were described. This is a subset of the Aviation Safety Reporting System (ASRS) dataset, which is publicly available.

    How Data Was Acquired: The data for this competition came from human-generated reports on incidents that occurred during a flight.

    Sample Rates, Parameter Description, and Format: There is one document per incident. The datasets are in raw text format. All documents for each set are contained in a single file, with one document per row. The first characters on each line are the document number; a tilde separates the document number from the text itself.

    Anomalies/Faults: This is a document category classification problem.
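    Given the line format described above (document number, a tilde, then the report text), each line can be split on the first tilde. The function name and the sample line are invented for illustration.

```python
def parse_report(line):
    """Split one raw-text line into (document number, report text)."""
    doc_id, text = line.split("~", 1)
    return doc_id.strip(), text.strip()

# Invented sample line in the described format.
doc_id, text = parse_report("1042~ACFT EXPERIENCED A HYD PROBLEM DURING CLIMB.")
print(doc_id)  # 1042
```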

  10. math-problem-explanations-dataset

    • huggingface.co
    Updated Dec 23, 2024
    + more versions
    Cite
    harish (2024). math-problem-explanations-dataset [Dataset]. https://huggingface.co/datasets/harishkotra/math-problem-explanations-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 23, 2024
    Authors
    harish
    Description

    Dataset Card for math-problem-explanations-dataset

    This dataset has been created with distilabel.

      Dataset Summary
    

    This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel, using the distilabel CLI:

    distilabel pipeline run --config "https://huggingface.co/datasets/hk-gaianet/math-problem-explanations-dataset/raw/main/pipeline.yaml"

    or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/harishkotra/math-problem-explanations-dataset.

  11. Data from: An Open-set Recognition and Few-Shot Learning Dataset for Audio...

    • data.niaid.nih.gov
    • data.europa.eu
    Updated May 21, 2024
    Cite
    Javier Naranjo-Alcazar; Sergi Perez-Castanos; Pedro Zuccarello; Maximo Cobos (2024). An Open-set Recognition and Few-Shot Learning Dataset for Audio Event Classification in Domestic Environments [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3689287
    Explore at:
    Dataset updated
    May 21, 2024
    Dataset provided by
    Visualfy
    Universitat de Valencia
    Authors
    Javier Naranjo-Alcazar; Sergi Perez-Castanos; Pedro Zuccarello; Maximo Cobos
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The problem of training a deep neural network with a small set of positive samples is known as few-shot learning (FSL). It is widely known that traditional deep learning (DL) algorithms usually show very good performance when trained with large datasets. However, in many applications, it is not possible to obtain such a high number of samples. In the image domain, typical FSL applications are those related to face recognition. In the audio domain, music fraud or speaker recognition can clearly benefit from FSL methods. This paper deals with the application of FSL to the detection of specific and intentional acoustic events given by different types of sound alarms, such as door bells or fire alarms, using a limited number of samples. These sounds typically occur in domestic environments where many events corresponding to a wide variety of sound classes take place. Therefore, the detection of such alarms in a practical scenario can be considered an open-set recognition (OSR) problem. To address the lack of a dedicated public dataset for audio FSL, researchers usually make modifications on other available datasets. This paper is aimed at providing the audio recognition community with a carefully annotated dataset for FSL and OSR comprised of 1360 clips from 34 classes divided into pattern sounds and unwanted sounds. To facilitate and promote research in this area, results with two baseline systems (one trained from scratch and another based on transfer learning) are presented.

  12. EMS - Top Ten Dispatch Problems by Fiscal Year

    • catalog.data.gov
    • data.austintexas.gov
    Updated Oct 25, 2025
    + more versions
    Cite
    data.austintexas.gov (2025). EMS - Top Ten Dispatch Problems by Fiscal Year [Dataset]. https://catalog.data.gov/dataset/ems-top-ten-dispatch-problems-by-fiscal-year
    Explore at:
    Dataset updated
    Oct 25, 2025
    Dataset provided by
    data.austintexas.gov
    Description

    This table shows the 10 most frequently recorded incident problem types as recorded by communications personnel for each fiscal year presented.

  13. Cdd Dataset

    • universe.roboflow.com
    zip
    Updated Sep 5, 2023
    Cite
    hakuna matata (2023). Cdd Dataset [Dataset]. https://universe.roboflow.com/hakuna-matata/cdd-g8a6g/model/3
    Explore at:
    zip. Available download formats
    Dataset updated
    Sep 5, 2023
    Dataset authored and provided by
    hakuna matata
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Cucumber Disease Detection Bounding Boxes
    Description

    Project Documentation: Cucumber Disease Detection

    1. Title and Introduction Title: Cucumber Disease Detection

    Introduction: The "Cucumber Disease Detection" project develops a machine learning model for the automatic detection of diseases in cucumber plants. This research is crucial because it tackles the issue of early disease identification in agriculture, which can increase crop yield and cut down on financial losses. To train and test the model, we use a dataset of pictures of cucumber plants.

    2. Problem Statement Problem Definition: The project uses image analysis methods to automate the identification of diseases, such as downy mildew, in cucumber plants. Effective disease management in agriculture depends on early disease identification.

    Importance: Early disease diagnosis helps minimize crop losses, stop the spread of diseases, and better allocate resources in farming. Agriculture is a real-world application of this concept.

    Goals and Objectives: Develop a machine learning model to classify cucumber plant images into healthy and diseased categories. Achieve a high level of accuracy in disease detection. Provide a tool for farmers to detect diseases early and take appropriate action.

    3. Data Collection and Preprocessing Data Sources: The dataset comprises pictures of cucumber plants from various sources, including both healthy and diseased specimens.

    Data Collection: Images were gathered from agricultural areas using cameras and smartphones.

    Data Preprocessing: Data cleaning to remove irrelevant or corrupted images. Handling missing values, if any, in the dataset. Removing outliers that may negatively impact model training. Data augmentation techniques applied to increase dataset diversity.

    4. Exploratory Data Analysis (EDA) The dataset was examined using visualizations such as scatter plots and histograms, and was inspected for patterns, trends, and correlations. EDA helped clarify the distribution of images of healthy and diseased plants.

    5. Methodology Machine Learning Algorithms:

    Convolutional Neural Networks (CNNs) were chosen for image classification due to their effectiveness in handling image data. Transfer learning using pre-trained models such as ResNet or MobileNet may be considered. Train-Test Split:

    The dataset was split into training and testing sets with a suitable ratio. Cross-validation may be used to assess model performance robustly.

    6. Model Development The CNN model's architecture consists of layers, units, and activation functions. Hyperparameters, including learning rate, batch size, and optimizer, were chosen based on experimentation. To avoid overfitting, regularization methods like dropout and L2 regularization were used.

    7. Model Training During training, the model was fed the prepared dataset over a number of epochs. The loss function was minimized using an optimization method. To ensure convergence, early stopping and model checkpoints were used.
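The early-stopping logic described above can be sketched in a few lines, independent of any specific framework. The patience value is an assumption, since the documentation does not state one:

```python
def train_with_early_stopping(val_losses, patience=3):
    """Illustrative early-stopping logic: stop when the validation loss
    has not improved for `patience` epochs, and remember the best epoch
    as the 'checkpoint' to restore."""
    best_loss, best_epoch, waited = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best: this is where a model checkpoint would be saved.
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # early stop: no improvement for `patience` epochs

    return best_epoch, best_loss

# Losses improve, then plateau: training halts after 3 non-improving epochs,
# and the checkpoint from the best epoch is kept.
print(train_with_early_stopping([0.9, 0.7, 0.6, 0.65, 0.66, 0.67, 0.5]))
```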

    8. Model Evaluation Evaluation Metrics:

    Accuracy, precision, recall, F1-score, and confusion matrix were used to assess model performance. Results were computed for both training and test datasets. Performance Discussion:

    The model's performance was analyzed in the context of disease detection in cucumber plants. Strengths and weaknesses of the model were identified.

    9. Results and Discussion Key project findings include model performance and disease detection precision, a comparison of the models employed showing the benefits and drawbacks of each, and the challenges faced throughout the project together with the methods used to address them.

    10. Conclusion A recap of the project's key learnings, highlighting the project's importance for early disease detection in agriculture. Future enhancements and potential research directions are suggested.

    11. References Libraries: Pillow, Roboflow, YOLO, scikit-learn, matplotlib. Datasets: https://data.mendeley.com/datasets/y6d3z6f8z9/1

    12. Code Repository https://universe.roboflow.com/hakuna-matata/cdd-g8a6g

    Rafiur Rahman Rafit EWU 2018-3-60-111

  14. madelon

    • openml.org
    Updated May 22, 2015
    Cite
    (2015). madelon [Dataset]. https://www.openml.org/d/1485
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    May 22, 2015
    Description

    Author: Isabelle Guyon
    Source: UCI
    Please cite: Isabelle Guyon, Steve R. Gunn, Asa Ben-Hur, Gideon Dror, 2004. Result analysis of the NIPS 2003 feature selection challenge.

    Abstract:

    MADELON is an artificial dataset, which was part of the NIPS 2003 feature selection challenge. This is a two-class classification problem with continuous input variables. The difficulty is that the problem is multivariate and highly non-linear.

    Source:

    Isabelle Guyon Clopinet 955 Creston Road Berkeley, CA 90708 isabelle '@' clopinet.com

    Data Set Information:

    MADELON is an artificial dataset containing data points grouped in 32 clusters placed on the vertices of a five-dimensional hypercube and randomly labeled +1 or -1. The five dimensions constitute 5 informative features. 15 linear combinations of those features were added to form a set of 20 (redundant) informative features. Based on those 20 features one must separate the examples into the 2 classes (corresponding to the +/-1 labels). A number of distractor features called 'probes', having no predictive power, were added. The order of the features and patterns were randomized.

    This dataset is one of five datasets used in the NIPS 2003 feature selection challenge. The original data was split into training, validation and test sets. Target values are provided only for the first two sets (not for the test set). So, this dataset version contains all the examples from the training and validation partitions.

    There is no attribute information provided to avoid biasing the feature selection process.
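For experimentation, a MADELON-like dataset can be synthesized following the construction described above. A NumPy sketch under stated assumptions (the cluster noise level, sample count, and mixing weights are invented here; this is not the official challenge generator):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_madelon_like(n=100, informative=5, redundant=15, probes=480):
    """Sketch of the MADELON construction: points clustered near the
    vertices of a hypercube, random +/-1 labels per cluster, redundant
    linear combinations, and uninformative 'probe' features."""
    # 2**informative vertices of the hypercube, each randomly labeled.
    vertices = np.array([[int(b) for b in f"{v:0{informative}b}"]
                         for v in range(2 ** informative)], dtype=float)
    labels_per_cluster = rng.choice([-1, 1], size=len(vertices))

    # Sample points near randomly chosen vertices (noise scale assumed).
    idx = rng.integers(0, len(vertices), size=n)
    X_inf = vertices[idx] + 0.1 * rng.standard_normal((n, informative))
    y = labels_per_cluster[idx]

    # Redundant features: linear combinations of the informative ones.
    W = rng.standard_normal((informative, redundant))
    X_red = X_inf @ W

    # Probes: pure noise with no predictive power.
    X_probe = rng.standard_normal((n, probes))

    X = np.hstack([X_inf, X_red, X_probe])
    X = X[:, rng.permutation(X.shape[1])]  # randomize feature order
    return X, y

X, y = make_madelon_like()
print(X.shape)  # (100, 500): 5 informative + 15 redundant + 480 probes
```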

    Relevant Papers:

    The best challenge entrants wrote papers collected in the book: Isabelle Guyon, Steve Gunn, Masoud Nikravesh, Lotfi Zadeh (Eds.), Feature Extraction, Foundations and Applications. Studies in Fuzziness and Soft Computing. Physica-Verlag, Springer.

    Isabelle Guyon, et al, 2007. Competitive baseline methods set new standards for the NIPS 2003 feature selection benchmark. Pattern Recognition Letters 28 (2007) 1438–1444.

    Isabelle Guyon, et al. 2006. Feature selection with the CLOP package. Technical Report.

  15. GENEA Challenge 2022 Dataset Files

    • zenodo.org
    txt, zip
    Updated Oct 17, 2022
    Cite
    Pieter Wolfert; Pieter Wolfert (2022). GENEA Challenge 2022 Dataset Files [Dataset]. http://doi.org/10.5281/zenodo.6998231
    Explore at:
    Available download formats: zip, txt
    Dataset updated
    Oct 17, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Pieter Wolfert; Pieter Wolfert
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This Zenodo repository contains the main dataset for the GENEA 2022 challenge, which is based on the Talking With Hands 16.2M data.

    Contents:

    The "trn" and "val" zip files contain audio files (in WAV format), time-aligned transcriptions (in TSV format), and motion files (in BVH format) for the training and validation datasets, respectively.

    The "tst" zip file contains audio files (in WAV format) and transcriptions (in TSV format) for the test set, but no motion. The corresponding test motion is available at:

    https://doi.org/10.5281/zenodo.6976463

    Each zip file also contains a "metadata.csv" file that contains information for all files regarding the speaker ID and whether or not the motion files contain finger motion.

    Note that the speech audio in the data sometimes has been replaced by silence for the purpose of anonymisation.

    Data processing scripts:

    We provide a number of optional scripts for encoding and processing the challenge data:

    Audio: Scripts for extracting basic audio features, such as spectrograms, prosodic features, and mel-frequency cepstral coefficients (MFCCs) can be found at this link.

    Text: A script to encode text transcriptions to word vectors using FastText is available: tsv2wordvectors.py

    Motion: If you wish to encode the joint angles from the BVH files to and from an exponential map representation, you can use scripts by Simon Alexanderson based on the PyMo library, which are available here:

    Attribution:

    If you use this material, please cite our latest paper on the GENEA Challenge 2022. At the time of writing (2022-08-16), that is our ACM ICMI 2022 paper:

    Youngwoo Yoon, Pieter Wolfert, Taras Kucherenko, Carla Viegas, Teodor Nikolov, Mihail Tsakov, and Gustav Eje Henter. 2022. The GENEA Challenge 2022: A large evaluation of data-driven co-speech gesture generation. In Proceedings of the ACM International Conference on Multimodal Interaction (ICMI '22). ACM.

    You can find the latest information and a BibTeX file on the project website:

    https://youngwoo-yoon.github.io/GENEAchallenge2022/

    Also cite the paper about the original dataset from Meta Research:

    Gilwoo Lee, Zhiwei Deng, Shugao Ma, Takaaki Shiratori, Siddhartha S. Srinivasa, and Yaser Sheikh. 2019. Talking With Hands 16.2M: A large-scale dataset of synchronized body-finger motion and audio for conversational motion analysis and synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV ’19). IEEE, 763–772.

    The motion and audio files are based on the Talking With Hands 16.2M dataset at https://github.com/facebookresearch/TalkingWithHands32M/. All material is available under a CC BY NC 4.0 international license, with the text provided in LICENSE.txt.

    To find more GENEA Challenge 2022 material on the web, please see:

    * https://youngwoo-yoon.github.io/GENEAchallenge2022/

    * https://genea-workshop.github.io/2022/challenge/

    If you have any questions or comments, please contact:

    * The GENEA Challenge & Workshop organisers

  16. Data for: Permutation Flow Shop Scheduling with Multiple Lines and Demand...

    • narcis.nl
    • data.mendeley.com
    Updated Jun 1, 2021
    + more versions
    Cite
    Brammer, J (via Mendeley Data) (2021). Data for: Permutation Flow Shop Scheduling with Multiple Lines and Demand Plans Using Reinforcement Learning [Dataset]. http://doi.org/10.17632/5txxwj2g6b.3
    Explore at:
    Dataset updated
    Jun 1, 2021
    Dataset provided by
    Data Archiving and Networked Services (DANS)
    Authors
    Brammer, J (via Mendeley Data)
    Description

    Description of dataset

    This repository contains the datasets used in our study "Permutation Flow Shop Scheduling with Multiple Lines and Demand Plans Using Reinforcement Learning".

    The repository provides three datasets. The main dataset in the folder data contains 1050 problem instances for the multi-line permutation flow shop problem. The dataset extends the dataset introduced by Taillard (1993) to multiple production lines and demand plans. The processing times were sampled from the uniform interval [1,99]. In addition, we sampled the demand plans from a multinomial distribution with equal probability for each job type, which leads to rather balanced demand plans.

    The folders data_lin and data_exp each contain 150 problem instances with more imbalanced demand plans, where the probabilities of each job type decrease linearly or exponentially. However, the quantity to be produced for each job type is greater than zero.

    Dataset structure

    Each dataset of the main study is structured in 15 subfolders. Each folder contains problem instances for a combination of layout and processing time variation.

    Folder name notation: Tai_PFSP_L_

    A: Number of production lines (1-3)
    B: Processing time variation (1-5)

    All of these folders contain 70 problem instances. A problem file is a combination of the problem layout (sequence length, number of machines and stations) and the demand plan variation. The processing times are fixed for one problem characteristic, but the ten demand plans are different.

    File name notation: t

    C: Number of production lines (1-3)
    D: Identifier of the layout (1-7)
    E: Sequence length (20, 100, 500)
    F: Number of machines per line (5, 10, 20)
    G: Number of job types (5, 10, 20)
    H: Number of the demand plan variation (1-10)

    Each file represents a different problem in text format with the following notation:

    Line 1: Demand plan
    Line 2: Layout type
    Line 3: Number of machines
    Line 4: Number of machines per line
    Line 5: Number of total machines with synchronization machine
    Line 6: Number of job types
    Line 7-end: Processing times in matrix form for machines (rows) and job types (columns)
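A small parser for this text format might look as follows. This is a sketch: the field order follows the notation above, but the exact whitespace layout of the files is an assumption:

```python
def parse_instance(text):
    """Parse a problem file in the line format described above.
    Field names follow that description; whitespace-separated values
    are assumed for the demand plan and the processing-time matrix."""
    lines = [l.strip() for l in text.strip().splitlines()]
    return {
        "demand_plan": [int(x) for x in lines[0].split()],
        "layout_type": int(lines[1]),
        "num_machines": int(lines[2]),
        "machines_per_line": int(lines[3]),
        "machines_with_sync": int(lines[4]),
        "num_job_types": int(lines[5]),
        # Rows: machines; columns: job types.
        "processing_times": [[int(x) for x in l.split()] for l in lines[6:]],
    }

# Invented toy instance: 3 job types, 2 machine rows.
sample = """4 3 3
1
10
5
11
3
7 12 9
5 8 14
"""
instance = parse_instance(sample)
print(instance["demand_plan"], instance["processing_times"][0])
```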

  17. CT-FAN-21 corpus: A dataset for Fake News Detection

    • zenodo.org
    Updated Oct 23, 2022
    + more versions
    Cite
    Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl; Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl (2022). CT-FAN-21 corpus: A dataset for Fake News Detection [Dataset]. http://doi.org/10.5281/zenodo.4714517
    Explore at:
    Dataset updated
    Oct 23, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl; Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl
    Description

    Data Access: The data in the research collection provided may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so you must be careful to use it only for research purposes. Due to these restrictions, the collection is not open data. Please download the Agreement at Data Sharing Agreement and send the signed form to fakenewstask@gmail.com .

    Citation

    Please cite our work as

    @article{shahi2021overview,
     title={Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection},
     author={Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas},
     journal={Working Notes of CLEF},
     year={2021}
    }

    Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English.

    Subtask 3A: Multi-class fake news detection of news articles (English) Subtask 3A is designed as a four-class classification problem. The training data will be released in batches, comprising roughly 900 articles with their respective labels. Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. Our definitions for the categories are as follows:

    • False - The main claim made in an article is untrue.

    • Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.

    • True - This rating indicates that the primary elements of the main claim are demonstrably true.

    • Other- An article that cannot be categorised as true, false, or partially false due to lack of evidence about its claims. This category includes articles in dispute and unproven articles.

    Subtask 3B: Topical Domain Classification of News Articles (English) Fact-checkers require background expertise to identify the truthfulness of an article. The categorisation will help to automate the sampling process from a stream of data. Given the text of a news article, determine its topical domain (English). This is a classification problem: the task is to categorise fake news articles into six topical categories, such as health, election, crime, climate, and education. This task will be offered for a subset of the data of Subtask 3A.

    Input Data

    The data will be provided in the format of Id, title, text, rating, the domain; the description of the columns is as follows:

    Task 3a

    • ID- Unique identifier of the news article
    • Title- Title of the news article
    • text- Text mentioned inside the news article
    • our rating - class of the news article as false, partially false, true, other

    Task 3b

    • public_id- Unique identifier of the news article
    • Title- Title of the news article
    • text- Text mentioned inside the news article
    • domain - domain of the given news article(applicable only for task B)

    Output data format

    Task 3a

    • public_id- Unique identifier of the news article
    • predicted_rating- predicted class

    Sample File

    public_id, predicted_rating
    1, false
    2, true

    Task 3b

    • public_id- Unique identifier of the news article
    • predicted_domain- predicted domain

    Sample file

    public_id, predicted_domain
    1, health
    2, crime
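Producing a submission in the sample-file format shown above is straightforward with the csv module. A sketch: plain CSV is emitted here, since the task description does not say whether the space after the comma in the sample header is significant:

```python
import csv
import io

def write_submission(predictions, fieldnames):
    """Write predictions in the two-column sample-file format above
    (public_id plus a predicted label column)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(predictions)
    return buf.getvalue()

# Task 3a example rows, matching the sample file.
rows = [{"public_id": 1, "predicted_rating": "false"},
        {"public_id": 2, "predicted_rating": "true"}]
print(write_submission(rows, ["public_id", "predicted_rating"]))
```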

    Additional data for Training

    To train your model, participants can use additional data with a similar format; some datasets are available over the web. We don't provide the ground truth for those datasets. For testing, we will not use any articles from other datasets. Some of the possible sources:

    IMPORTANT!

    1. The fake news articles used for Task 3b are a subset of Task 3a.
    2. We have used the data from 2010 to 2021, and the content of fake news is mixed up with several topics like election, COVID-19 etc.

    Evaluation Metrics

    This task is evaluated as a classification task. We will use the F1-macro measure for the ranking of teams. There is a limit of 5 runs (total and not per day), and only one person from a team is allowed to submit runs.
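For reference, the F1-macro measure used for ranking averages the per-class F1 scores with equal weight. A minimal sketch of that computation (scikit-learn's `f1_score` with `average='macro'` computes the same quantity):

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute F1 per class, then average the
    per-class scores with equal weight."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        f1s.append(f1)
    return sum(f1s) / len(f1s)

# Toy two-class example with labels from Task 3a.
y_true = ["false", "false", "true", "true"]
y_pred = ["false", "true", "true", "true"]
print(round(macro_f1(y_true, y_pred), 4))
```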

    Submission Link: https://competitions.codalab.org/competitions/31238

    Related Work

    • Shahi GK. AMUSED: An Annotation Framework of Multi-modal Social Media Data. arXiv preprint arXiv:2010.00502. 2020 Oct 1.https://arxiv.org/pdf/2010.00502.pdf
    • G. K. Shahi and D. Nandini, “FakeCovid – a multilingualcross-domain fact check news dataset for covid-19,” inWorkshop Proceedings of the 14th International AAAIConference on Web and Social Media, 2020. http://workshop-proceedings.icwsm.org/abstract?id=2020_14
    • Shahi, G. K., Dirkson, A., & Majchrzak, T. A. (2021). An exploratory study of covid-19 misinformation on twitter. Online Social Networks and Media, 22, 100104. doi: 10.1016/j.osnem.2020.100104
  18. r

    QoG Standard Dataset

    • researchdata.se
    • gimi9.com
    Updated Aug 6, 2024
    + more versions
    Cite
    Jan Teorell; Aksel Sundström; Sören Holmberg; Bo Rothstein; Natalia Alvarado Pachon; Cem Mert Dalli (2024). QoG Standard Dataset [Dataset]. http://doi.org/10.18157/QoGStdJan22
    Explore at:
    Available download formats: (129777582)
    Dataset updated
    Aug 6, 2024
    Dataset provided by
    University of Gothenburg
    Authors
    Jan Teorell; Aksel Sundström; Sören Holmberg; Bo Rothstein; Natalia Alvarado Pachon; Cem Mert Dalli
    Time period covered
    1946
    Description

    The QoG Institute is an independent research institute within the Department of Political Science at the University of Gothenburg. Overall 30 researchers conduct and promote research on the causes, consequences and nature of Good Governance and the Quality of Government - that is, trustworthy, reliable, impartial, uncorrupted and competent government institutions.

    The main objective of our research is to address the theoretical and empirical problem of how political institutions of high quality can be created and maintained. A second objective is to study the effects of Quality of Government on a number of policy areas, such as health, the environment, social policy, and poverty.

    The QoG Standard Dataset is the largest of the QoG datasets, consisting of more than 2,000 variables from sources related to the Quality of Government. The data exist in both time-series (year 1946 and onwards) and cross-section (year 2020) versions. Many of the variables are available in both datasets, but some are not. The datasets draw on a number of freely available data sources related to QoG and its correlates.

    In the QoG Standard CS dataset, data from and around 2020 is included. Data from 2020 is prioritized; however, if no data is available for a country for 2020, data for 2021 is included. If no data exists for 2021, data for 2019 is included, and so on up to a maximum of +/- 3 years.
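The +/- 3-year fallback rule for the cross-section dataset can be expressed as a small selection function. This is a sketch of the rule as described above, not code from the QoG Institute:

```python
def pick_cs_value(values_by_year, target=2020, max_offset=3):
    """Cross-section selection rule: prefer the target year, then try
    +1, -1, +2, -2, ... up to +/- max_offset years, returning the first
    year with data (or (None, None) if nothing is within range)."""
    offsets = [0]
    for d in range(1, max_offset + 1):
        offsets += [d, -d]
    for off in offsets:
        year = target + off
        if values_by_year.get(year) is not None:
            return year, values_by_year[year]
    return None, None

# A country with no 2020 or 2021 observation falls back to 2019.
print(pick_cs_value({2019: 0.62, 2017: 0.60}))
```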

    In the QoG Standard TS dataset, data from 1946 and onwards is included and the unit of analysis is country-year (e.g., Sweden-1946, Sweden-1947, etc.).

  19. Chemistry Problem-Solution

    • kaggle.com
    zip
    Updated Dec 1, 2023
    Cite
    The Devastator (2023). Chemistry Problem-Solution [Dataset]. https://www.kaggle.com/datasets/thedevastator/chemistry-problem-solution-dataset
    Explore at:
    Available download formats: zip (9075076 bytes)
    Dataset updated
    Dec 1, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Chemistry Problem-Solution

    Chemistry Problem-Solution Dataset: 20K pairs across 25 topics and subtopics

    By camel-ai (From Huggingface) [source]

    About this dataset

    To ensure diversity and coverage across various aspects of chemistry, this dataset spans 25 main topics, encompassing a wide range of subtopics within each main topic. Each main topic and subtopic combination contains an extensive set of 32 distinct problems for analysis and study.

    In order to facilitate efficient data exploration and analysis, the dataset is structured with essential columns including 'role_1' which signifies the role or identity responsible for presenting either the problem statement or solution. Additionally, 'sub_topic' denotes the specific subarea within each main topic to which both problem and solution belong.

    This expansive dataset contains problem statements and their corresponding solutions from diverse topics in chemistry, categorized into distinct domains (both main topics and subtopics). Users can therefore navigate to specific areas of interest and make informed decisions about which subsets to explore further, based on their project requirements or learning objectives.

    Please note that, since this dataset was generated with the GPT-4 model, it is critical to perform careful validation checks before using these data points in real-life scenarios or academic research where precision plays a vital role.

    How to use the dataset

    About the Dataset

    The dataset contains 20,000 pairs of problem statements and their corresponding solutions, covering a wide range of topics within the field of chemistry. These pairs have been generated using the GPT-4 model, ensuring that they are diverse and representative of various concepts in chemistry.

    Main Topics and Subtopics

    The dataset is organized into 25 main topics, with each topic having 25 subtopics. The main topics represent broader areas within chemistry, while the subtopics narrow down to specific subjects within each main topic. This hierarchical structure allows for better categorization and navigation through different aspects of chemistry problems.

    Problem Statement

    The problem statement (message_1) column provides a concise description or statement of a specific chemistry problem. It sets up the context for understanding what needs to be solved or analyzed.

    Solution

    The solution (message_2) column contains the respective answer or solution to each problem statement. It offers insights into how to approach and solve specific types of chemistry problems.

    How to Utilize this Dataset

    Here are some ways you can leverage this dataset:

    • Study Specific Topics: Since there are 25 main topics with multiple subtopics in this dataset, you can focus on exploring certain areas that interest you or align with your learning goals in chemistry.

    • Develop Learning Resources: As an educator or content creator, you can use this dataset as inspiration for creating educational materials such as textbooks, online courses, or lesson plans focused on different topics within chemistry.

    • Build Intelligent Systems: If you're working on developing AI-powered systems related to solving chemistry problems or providing chemical insights, this dataset can serve as training data for your models.

    • Evaluate Existing Models: If you have a chemistry problem-solving model or algorithm, you can use this dataset to evaluate its performance and fine-tune it further.

    • Generate New Problem-Solution Pairs: You can use the existing problem-solution pairs as a starting point and leverage them to generate new problem-solution pairs by applying techniques like data augmentation or natural language processing.

    Limitations

    It's important to consider the following limitations of the dataset:

    • The dataset is AI-generated using the GPT-4 model, which means some solutions may

    Research Ideas

    • Educational Resource: This dataset can be used to create an educational resource for chemistry students. The problem-solution pairs can be used as practice questions, allowing students to test their understanding and problem-solving skills.
    • AI Model Training: The dataset can be utilized to train AI models in the field of chemistry education. By feeding the problem-solution pairs into the model, it can learn to generate accurate solutions for various chemistry problems.
    • Research Analysis: Researchers in the field of chemistry education or n...
  20. R

    Dataset for the Assembly Line Preliminary Design Optimization Problem

    • entrepot.recherche.data.gouv.fr
    text/x-fixed-field
    Updated Sep 22, 2025
    Cite
    Stéphanie Roussel; Stéphanie Roussel (2025). Dataset for the Assembly Line Preliminary Design Optimization Problem [Dataset]. http://doi.org/10.57745/IQLQ7A
    Explore at:
    Available download formats: text/x-fixed-field (108906), text/x-fixed-field (31997), text/x-fixed-field (34410)
    Dataset updated
    Sep 22, 2025
    Dataset provided by
    Recherche Data Gouv
    Authors
    Stéphanie Roussel; Stéphanie Roussel
    License

    https://spdx.org/licenses/etalab-2.0.html

    Description

    Dataset for the Assembly Line Preliminary Design Optimization Problem presented at the CP'23 conference. The Assembly Line Preliminary Design Problem consists in defining, for a given aircraft design, the best assembly line layout and the type and number of machines equipping each workstation. This dataset contains data coming from a real assembly line manufacturer. Each instance describes a scenario for building an aircraft (a build process) and details which tasks should be performed and which types of machines are used. The objective of the problem considered in the original paper is to design a set of Pareto-optimal assembly lines (number of stations, rate, number of machines) with regard to the investment cost, the time to build the entire aircraft, and the rate.


How to use the dataset

  • Introduction: The Mathematical Problems Dataset contains a collection of various mathematical problems and their corresponding solutions or answers. This guide will provide you with all the necessary information on how to utilize this dataset effectively.

  • Understanding the columns: The dataset consists of several columns, each representing a different aspect of the mathematical problem and its solution. The key columns are:

    • question: This column contains the text representation of the mathematical problem or equation.
    • answer: This column contains the text representation of the solution or answer to the corresponding problem.
  • Exploring specific problem categories: To focus on specific types of mathematical problems, you can filter or search within the dataset using relevant keywords or terms related to your area of interest. For example, if you are interested in prime numbers, you can search for prime in the question column.

  • Applying machine learning techniques: This dataset can be used for training machine learning models related to natural language understanding and mathematics. You can explore various techniques such as text classification, sentiment analysis, or even sequence-to-sequence models for solving mathematical problems based on their textual representations.

  • Generating new questions and solutions: By analyzing patterns in this dataset, you can generate new questions and solutions programmatically using techniques like data augmentation or rule-based methods.

  • Validation and evaluation: As with any other machine learning task, it is essential to properly validate your models on separate validation sets not included in this dataset. You can also evaluate model performance by comparing predictions against the known answers provided in this dataset's answer column.

  • Sharing insights and findings: After working with this dataset, it would be beneficial for researchers or educators to share their insights and the approaches taken during analysis/modelling as Kaggle notebooks, discussions, blogs, tutorials, etc., so that others can benefit from such shared resources too.
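As suggested above, specific problem categories can be explored by filtering on keywords in the question column. A minimal sketch with the standard csv module; the two-column layout matches the question/answer columns described earlier, while the sample rows are invented:

```python
import csv
import io

# A tiny stand-in for one of the dataset's CSV files (invented rows).
sample_csv = """question,answer
Is 7 prime?,True
Simplify (x**2)**3.,x**6
Is 15 prime?,False
"""

def filter_questions(csv_text, keyword):
    """Return the rows whose 'question' column contains the keyword
    (case-insensitive), for exploring a specific problem category."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if keyword in row["question"].lower()]

prime_rows = filter_questions(sample_csv, "prime")
print(len(prime_rows))  # 2
```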

Note: the dataset does not include dates.

By following these guidelines, you can effectively explore and utilize the Mathematical Problems Dataset for various mathematical problem-solving tasks. Happy exploring!

Research Ideas

  • Developing machine learning algorithms for solving mathematical problems: This dataset can be used to train and test models that can accurately predict the solution or answer to different mathematical problems.
  • Creating educational resources: The dataset can be used to create a wide variety of educational materials such as problem sets, worksheets, and quizzes for students studying mathematics.
  • Research in mathematical problem-solving strategies: Researchers and educators can analyze the dataset to identify common patterns or strategies employed in solving different types of mathematical problems. This analysis can help improve teaching methodologies and develop effective problem-solving techniques

Acknowledgements

If you use this dataset in your research, please credit the original authors. Data Source

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purpos...
