92 datasets found
  1. Data from: A New Bayesian Approach to Increase Measurement Accuracy Using a...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Feb 25, 2025
    Cite
    Domjan, Peter; Angyal, Viola; Bertalan, Adam; Vingender, Istvan; Dinya, Elek (2025). A New Bayesian Approach to Increase Measurement Accuracy Using a Precision Entropy Indicator [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_14417120
    Explore at:
    Dataset updated
    Feb 25, 2025
    Dataset provided by
    Semmelweis University
    Authors
    Domjan, Peter; Angyal, Viola; Bertalan, Adam; Vingender, Istvan; Dinya, Elek
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    "We believe that by accounting for the inherent uncertainty in the system during each measurement, the relationship between cause and effect can be assessed more accurately, potentially reducing the duration of research."

    Short description

    This dataset was created as part of a research project investigating the efficiency and learning mechanisms of a Bayesian adaptive search algorithm supported by the Imprecision Entropy Indicator (IEI) as a novel method. It includes detailed statistical results, posterior probability values, and the weighted averages of IEI across multiple simulations aimed at target localization within a defined spatial environment. Control experiments, including random search, random walk, and genetic algorithm-based approaches, were also performed to benchmark the system's performance and validate its reliability.

    The task involved locating a target area centered at (100; 100) within a radius of 10 units (Research_area.png), inside a circular search space with a radius of 100 units. The search process continued until 1,000 successful target hits were achieved.
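
    As a minimal illustration of this geometry, a hit test for the stated target (radius 10, centred at (100; 100)) might look like the sketch below; the centre of the outer 100-unit search circle is an assumption here, not stated in the description.

    ```python
    # Minimal sketch of the stated hit geometry; the centre of the outer search
    # circle is an assumption (taken here to coincide with the target centre).
    import math

    TARGET_CENTRE = (100.0, 100.0)
    TARGET_RADIUS = 10.0
    SEARCH_RADIUS = 100.0

    def is_target_hit(x: float, y: float) -> bool:
        """True if the point (x, y) falls inside the 10-unit target disc."""
        return math.dist((x, y), TARGET_CENTRE) <= TARGET_RADIUS

    def is_in_search_area(x: float, y: float) -> bool:
        """True if the point lies inside the outer search circle (assumed centre)."""
        return math.dist((x, y), TARGET_CENTRE) <= SEARCH_RADIUS

    print(is_target_hit(105.0, 103.0), is_target_hit(150.0, 150.0))  # True False
    ```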

    To benchmark the algorithm's performance and validate its reliability, control experiments were conducted using alternative search strategies, including random search, random walk, and genetic algorithm-based approaches. These control datasets serve as baselines, enabling comprehensive comparisons of efficiency, randomness, and convergence behavior across search methods, thereby demonstrating the effectiveness of our novel approach.

    Uploaded files

    The first dataset contains the average IEI values, generated by randomly simulating 300 x 1 hits for 10 bins per quadrant (4 quadrants in total) using the Python programming language, and calculating the corresponding IEI values. This resulted in a total of 4 x 10 x 300 x 1 = 12,000 data points. The summary of the IEI values by quadrant and bin is provided in the file results_1_300.csv. The calculation of IEI values for averages is based on likelihood, using an absolute difference-based approach for the likelihood probability computation. IEI_Likelihood_Based_Data.zip
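
    The layout of this simulation (4 quadrants x 10 bins x 300 hits = 12,000 points) can be sketched as follows; the iei() helper is a hypothetical placeholder, since the published Imprecision Entropy Indicator formula is not reproduced here.

    ```python
    # Hedged sketch of the simulation layout described above (not the authors' code).
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    N_QUADRANTS, N_BINS, N_HITS = 4, 10, 300   # 4 x 10 x 300 x 1 = 12,000 data points

    def iei(hits: np.ndarray) -> float:
        """Hypothetical placeholder for the Imprecision Entropy Indicator of a bin."""
        return float(np.std(hits))  # stand-in only; replace with the published formula

    rows = []
    for quadrant in range(1, N_QUADRANTS + 1):
        for bin_id in range(1, N_BINS + 1):
            hits = rng.uniform(0.0, 100.0, size=N_HITS)   # 300 simulated hits per bin
            rows.append({"quadrant": quadrant, "bin": bin_id, "average_IEI": iei(hits)})

    # Summary of IEI values by quadrant and bin, analogous to results_1_300.csv.
    pd.DataFrame(rows).to_csv("results_1_300_sketch.csv", sep=";", index=False)
    ```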

    The weighted IEI average values for likelihood calculation (Bayes formula) are provided in the file Weighted_IEI_Average_08_01_2025.xlsx

    This dataset contains the results of a simulated target search experiment using Bayesian posterior updates and Imprecision Entropy Indicators (IEI). Each row represents a hit during the search process, including metrics such as Shannon entropy (H), Gini index (G), average distance, angular deviation, and calculated IEI values. The dataset also includes bin-specific posterior probability updates and likelihood calculations for each iteration. The simulation explores adaptive learning and posterior penalization strategies to optimize search efficiency. Our Bayesian adaptive searching system source code (search algorithm, 1,000 target searches) is provided in IEI_Self_Learning_08_01_2025.py. This dataset contains the results of 1,000 iterations of a successful target search simulation; each iteration runs until the target is successfully located. The dataset includes three further main outputs: a) Results files (results{iteration_number}.csv): details of each hit during the search process, including entropy measures, Gini index, average distance and angle, Imprecision Entropy Indicators (IEI), coordinates, and the bin number of the hit. b) Posterior updates (Pbin_all_steps_{iter_number}.csv): tracks the posterior probability updates for all bins during the search process across multiple steps. c) Likelihood analysis (likelihood_analysis_{iteration_number}.csv): contains the calculated likelihood values for each bin at every step, based on the difference between the measured IEI and pre-defined IEI bin averages. IEI_Self_Learning_08_01_2025.py
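
    A minimal sketch of one posterior update step of this kind is shown below, assuming a likelihood inversely related to the absolute difference between the measured IEI and the pre-defined bin averages; the authors' exact likelihood and penalization scheme is defined in IEI_Self_Learning_08_01_2025.py and may differ.

    ```python
    # Hedged sketch of a single Bayesian update over bins (not the authors' code).
    import numpy as np

    def update_posterior(prior: np.ndarray, iei_measured: float,
                         iei_bin_averages: np.ndarray, eps: float = 1e-6) -> np.ndarray:
        # Absolute-difference-based likelihood: bins whose pre-defined average IEI
        # is close to the measured IEI receive more weight.
        likelihood = 1.0 / (np.abs(iei_bin_averages - iei_measured) + eps)
        posterior = prior * likelihood          # Bayes' rule, unnormalized
        return posterior / posterior.sum()      # normalize to a probability distribution

    # Usage: 40 bins, uniform prior, one measured IEI value per search step.
    prior = np.full(40, 1.0 / 40)
    iei_bin_averages = np.linspace(0.1, 4.0, 40)   # stand-in for the published averages
    posterior = update_posterior(prior, iei_measured=1.2, iei_bin_averages=iei_bin_averages)
    print(int(posterior.argmax()))                 # bin favoured for the next step
    ```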

    Based on the mentioned Python source code (see point 3, Bayesian adaptive searching method with IEI values), we performed 1,000 successful target searches, and the outputs were saved in the Self_learning_model_test_output.zip file.

    Bayesian Search (IEI) from different quadrants. This dataset contains the results of Bayesian adaptive target search simulations, including various outputs that represent the performance and analysis of the search algorithm. The dataset includes: a) Heatmaps (Heatmap_I_Quadrant, Heatmap_II_Quadrant, Heatmap_III_Quadrant, Heatmap_IV_Quadrant): These heatmaps represent the search results and the paths taken from each quadrant during the simulations. They indicate how frequently the system selected each bin during the search process. b) Posterior Distributions (All_posteriors, Probability_distribution_posteriors_values, CDF_posteriors_values): Generated based on posterior values, these files track the posterior probability updates, including cumulative distribution functions (CDF) and probability distributions. c) Macro Summary (summary_csv_macro): This file aggregates metrics and key statistics from the simulation. It summarizes the results from the individual results.csv files. d) Heatmap Searching Method Documentation (Bayesian_Heatmap_Searching_Method_05_12_2024): This document visualizes the search algorithm's path, showing how frequently each bin was selected during the 1,000 successful target searches. e) One-Way ANOVA Analysis (Anova_analyze_dataset, One_way_Anova_analysis_results): This includes the database and SPSS calculations used to examine whether the starting quadrant influences the number of search steps required. The analysis was conducted at a 5% significance level, followed by a Games-Howell post hoc test [43] to identify which target-surrounding quadrants differed significantly in terms of the number of search steps. Results were saved in the Self_learning_model_test_results.zip file.

    This dataset contains randomly generated sequences of bin selections (1-40) from a control search algorithm (random search) used to benchmark the performance of Bayesian-based methods. The process iteratively generates random numbers until a stopping condition is met (reaching target bins 1, 11, 21, or 31). This dataset serves as a baseline for analyzing the efficiency, randomness, and convergence of non-adaptive search strategies. The dataset includes the following: a) The Python source code of the random search algorithm. b) A file (summary_random_search.csv) containing the results of 1000 successful target hits. c) A heatmap visualizing the frequency of search steps for each bin, providing insight into the distribution of steps across the bins. Random_search.zip
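
    A minimal sketch of such a random-search control, assuming uniform draws from bins 1-40 and the stated stopping bins (1, 11, 21, 31):

    ```python
    # Hedged sketch of the random-search baseline (not the dataset's exact script).
    import random

    TARGET_BINS = {1, 11, 21, 31}

    def random_search(rng: random.Random) -> list[int]:
        """Draw bins uniformly from 1-40 until a target bin is reached."""
        steps = []
        while True:
            bin_choice = rng.randint(1, 40)
            steps.append(bin_choice)
            if bin_choice in TARGET_BINS:
                return steps

    rng = random.Random(42)
    runs = [random_search(rng) for _ in range(1000)]          # 1,000 successful hits
    print(sum(len(r) for r in runs) / len(runs))              # mean steps per hit
    ```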

    This dataset contains the results of a random walk search algorithm, designed as a control mechanism to benchmark adaptive search strategies (Bayesian-based methods). The random walk operates within a defined space of 40 bins, where each bin has a set of neighboring bins. The search begins from a randomly chosen starting bin and proceeds iteratively, moving to a randomly selected neighboring bin, until one of the stopping conditions is met (bins 1, 11, 21, or 31). The dataset provides detailed records of 1,000 random walk iterations, with the following key components: a) Individual Iteration Results: Each iteration's search path is saved in a separate CSV file (random_walk_results_.csv), listing the sequence of steps taken and the corresponding bin at each step. b) Summary File: A combined summary of all iterations is available in random_walk_results_summary.csv, which aggregates the step-by-step data for all 1,000 random walks. c) Heatmap Visualization: A heatmap file is included to illustrate the frequency distribution of steps across bins, highlighting the relative visit frequencies of each bin during the random walks. d) Python Source Code: The Python script used to generate the random walk dataset is provided, allowing reproducibility and customization for further experiments. Random_walk.zip
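
    The sketch below illustrates such a random-walk control under the assumption of a simple ring neighbourhood over the 40 bins; the actual neighbour sets are defined in the dataset's Python source and may differ.

    ```python
    # Hedged sketch of the random-walk baseline; the ring topology is an assumption.
    import random

    TARGET_BINS = {1, 11, 21, 31}
    NEIGHBOURS = {b: [(b - 2) % 40 + 1, b % 40 + 1] for b in range(1, 41)}  # placeholder ring

    def random_walk(rng: random.Random) -> list[int]:
        """Start from a random bin and hop to random neighbours until a target bin."""
        current = rng.randint(1, 40)
        path = [current]
        while current not in TARGET_BINS:
            current = rng.choice(NEIGHBOURS[current])
            path.append(current)
        return path

    rng = random.Random(0)
    paths = [random_walk(rng) for _ in range(1000)]           # 1,000 iterations
    ```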

    This dataset contains the results of a genetic search algorithm implemented as a control method to benchmark adaptive Bayesian-based search strategies. The algorithm operates in a 40-bin search space with predefined target bins (1, 11, 21, 31) and evolves solutions through random initialization, selection, crossover, and mutation over 1,000 successful runs. Dataset Components: a) Run Results: Individual run data is stored in separate files (genetic_algorithm_run_.csv), detailing: Generation: The generation number. Fitness: The fitness score of the solution. Steps: The path length in bins. Solution: The sequence of bins visited. b) Summary File: summary.csv consolidates the best solutions from all runs, including their fitness scores, path lengths, and sequences. c) All Steps File: summary_all_steps.csv records all bins visited during the runs for distribution analysis. d) A heatmap was also generated for the genetic search algorithm, illustrating the frequency of bins chosen during the search process as a representation of the search pathways. Genetic_search_algorithm.zip
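
    A compact, hedged sketch of a genetic-algorithm control in this spirit (random initialization, truncation selection, one-point crossover, per-gene mutation, fitness rewarding short paths that reach a target bin); the authors' operators and parameters may differ.

    ```python
    # Hedged sketch of a genetic-algorithm baseline (illustrative parameters only).
    import random

    TARGET_BINS = {1, 11, 21, 31}
    SEQ_LEN, POP_SIZE, GENERATIONS, MUT_RATE = 20, 50, 100, 0.05
    rng = random.Random(1)

    def fitness(seq: list[int]) -> float:
        """Reward sequences that reach a target bin early; 0 if no target is reached."""
        for i, b in enumerate(seq):
            if b in TARGET_BINS:
                return 1.0 / (i + 1)
        return 0.0

    def crossover(a: list[int], b: list[int]) -> list[int]:
        cut = rng.randint(1, SEQ_LEN - 1)                     # one-point crossover
        return a[:cut] + b[cut:]

    def mutate(seq: list[int]) -> list[int]:
        return [rng.randint(1, 40) if rng.random() < MUT_RATE else b for b in seq]

    population = [[rng.randint(1, 40) for _ in range(SEQ_LEN)] for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        population.sort(key=fitness, reverse=True)
        parents = population[: POP_SIZE // 2]                 # truncation selection
        children = [mutate(crossover(rng.choice(parents), rng.choice(parents)))
                    for _ in range(POP_SIZE - len(parents))]
        population = parents + children

    best = max(population, key=fitness)
    ```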

    Technical Information

    The dataset files have been compressed into a standard ZIP archive using Total Commander (version 9.50). The ZIP format ensures compatibility across various operating systems and tools.

    The XLSX files were created using Microsoft Excel Standard 2019 (Version 1808, Build 10416.20027)

    The Python program was developed using Visual Studio Code (Version 1.96.2, user setup), with the following environment details: Commit fabd6a6b30b49f79a7aba0f2ad9df9b399473380f, built on 2024-12-19. The Electron version is 32.6, and the runtime environment includes Chromium 128.0.6263.186, Node.js 20.18.1, and V8 12.8.374.38-electron.0. The operating system is Windows NT x64 10.0.19045.

    The statistical analysis included in this dataset was partially conducted using IBM SPSS Statistics, Version 29.0.1.0

    The CSV files in this dataset were created following European standards, using a semicolon (;) as the delimiter instead of a comma, and are encoded in UTF-8 to ensure compatibility with a wide range of software tools and regional settings.
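
    For example, such a file can be loaded as follows (file name taken from the list above):

    ```python
    # Minimal loading sketch assuming the semicolon delimiter and UTF-8 encoding stated above.
    import pandas as pd

    df = pd.read_csv("results_1_300.csv", sep=";", encoding="utf-8")
    print(df.head())
    ```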

  2. Data from: Methodology to filter out outliers in high spatial density data...

    • scielo.figshare.com
    jpeg
    Updated Jun 4, 2023
    Cite
    Leonardo Felipe Maldaner; José Paulo Molin; Mark Spekken (2023). Methodology to filter out outliers in high spatial density data to improve maps reliability [Dataset]. http://doi.org/10.6084/m9.figshare.14305658.v1
    Explore at:
    Available download formats: jpeg
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    SciELO journals
    Authors
    Leonardo Felipe Maldaner; José Paulo Molin; Mark Spekken
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ABSTRACT The considerable volume of data generated by sensors in the field presents systematic errors; thus, it is extremely important to exclude these errors to ensure mapping quality. The objective of this research was to develop and test a methodology to identify and exclude outliers in high-density spatial data sets, and to determine whether the developed filter process could help decrease the nugget effect and improve the spatial variability characterization of high sampling data. We created a filter composed of a global analysis, an anisotropic analysis, and an anisotropic local analysis of the data, which considered the respective neighborhood values. For that purpose, we used the median as the main statistical parameter to classify a given spatial point in the data set, taking into account its neighbors within a radius. The filter was tested using raw data sets of corn yield, soil electrical conductivity (ECa), and the sensor vegetation index (SVI) in sugarcane. The results showed an improvement in the accuracy of spatial variability characterization within the data sets. The methodology reduced RMSE by 85 %, 97 %, and 79 % for corn yield, soil ECa, and SVI, respectively, compared to interpolation errors of the raw data sets. The filter excluded the local outliers, which considerably reduced the nugget effects, reducing the estimation error of the interpolated data. The methodology proposed in this work had a better performance in removing outlier data when compared to two other methodologies from the literature.
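
    As a rough illustration of the local, median-based idea behind such a filter, the sketch below flags points whose value deviates too strongly from the median of their neighbours within a radius; the thresholds and the global and anisotropic stages of the published methodology are not reproduced here.

    ```python
    # Hedged sketch of a local median-based outlier filter (simplified, illustrative).
    import numpy as np
    from scipy.spatial import cKDTree

    def local_median_filter(xy: np.ndarray, values: np.ndarray,
                            radius: float, max_rel_dev: float) -> np.ndarray:
        """Return a boolean mask: True = keep, False = flag as local outlier."""
        tree = cKDTree(xy)
        keep = np.ones(len(values), dtype=bool)
        for i, neighbours in enumerate(tree.query_ball_point(xy, r=radius)):
            others = [j for j in neighbours if j != i]
            if others:
                med = np.median(values[others])
                keep[i] = abs(values[i] - med) <= max_rel_dev * abs(med)
            # points with no neighbours inside the radius are kept by default
        return keep

    # Usage: keep yield points within 30 % of the local neighbourhood median.
    rng = np.random.default_rng(0)
    xy = rng.uniform(0, 100, size=(500, 2))
    vals = rng.normal(10.0, 1.0, size=500)
    mask = local_median_filter(xy, vals, radius=5.0, max_rel_dev=0.3)
    ```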

  3. Data from: Precision, Accuracy, and Aliasing of Sea Ice Thickness from...

    • arcticdata.io
    • dataone.org
    Updated Oct 21, 2016
    + more versions
    Cite
    Cathleen Geiger (2016). Precision, Accuracy, and Aliasing of Sea Ice Thickness from Multiple Observing Platforms [Dataset]. https://arcticdata.io/catalog/view/urn%3Auuid%3A561f3583-952d-4769-a7f7-0535aaedabfe
    Explore at:
    Dataset updated
    Oct 21, 2016
    Dataset provided by
    Arctic Data Center
    Authors
    Cathleen Geiger
    Time period covered
    Sep 1, 2011 - Aug 31, 2015
    Area covered
    Description

    Much attention on the extent, and changes therein, of Arctic sea ice focuses on areal coverage. Equally if not more important to consider is the thickness of the ice, which varies with age and also due to movement of the ice. Thickness is geometrically rough with ridges, rubble fields, and open-water leads. These features are asymmetrically shaped from centimeters to meters in one horizontal direction and meters to kilometers along the other. Instruments used to sense ice thickness typically have wide footprints that alias these rough features into smoother flatter features. Accurate thickness distribution of these deformed areas is needed to reduce uncertainties in global thickness data archives. Such results lead to higher accuracy in regional and global sea ice volume estimates. High precision and consistency from a single instrument cannot quantify the impact of aliasing. The sea ice community currently seeks an integrated-instrument approach to measure sea ice thickness from its components of draft, freeboard, and surface elevation (including snow loads), and thickness archives are being developed. This project would address the central question, "What is the impact of spatial aliasing when measuring sea ice thickness, its distribution, and resulting volume?" The approach is to work with datasets that include measurements made by two or more instruments with different footprints in the same location, based on a recently discovered relationship between the roughness of 5m and 40m footprints stemming from one field experiment. The investigator proposes a generalized solution to track all types of sea ice thickness measurements as a function of instrument footprint size and shape. The investigator will isolate climate data records containing coincident sea ice thickness measurements from instruments of different footprints to demonstrate the utility of a general solution to track a phenomenon called "Resolution Error". Results will be evaluated through a power law which can easily be reproduced with an explicit and simple algorithm so that anyone with a structured programming language can examine the degree of resolution error between two data sets. The main scientific contribution is an improved metadata relationship to characterize properties that contribute to thickness uncertainties in climate data archive records as a function of scale. A Ph.D. thesis will advance the aliasing hypothesis using a full physics 3D electromagnetic induction model as a guide to develop a new calibration technique for ground-based electromagnetic (EM) measurements. The bonus is the calculation of a bulk material conductivity for local sea ice and sea water as a geophysical representation of the calibration coefficients. The broader impacts of the work include a simple Resolution Error tracking algorithm to improve thickness archive data integration best practices and the training of a Ph.D. student.
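
    As a generic illustration of the kind of "explicit and simple algorithm" the summary refers to, a power law y = a * x**b can be fitted by least squares in log-log space; the specific variables compared (e.g. roughness at 5 m versus 40 m footprints) are assumptions here, not values from the dataset.

    ```python
    # Generic power-law fit sketch; the data below are synthetic, for illustration only.
    import numpy as np

    def fit_power_law(x: np.ndarray, y: np.ndarray) -> tuple[float, float]:
        """Fit y ~ a * x**b by ordinary least squares in log-log space; returns (a, b)."""
        b, log_a = np.polyfit(np.log(x), np.log(y), deg=1)
        return float(np.exp(log_a)), float(b)

    x = np.array([0.5, 1.0, 2.0, 4.0, 8.0])
    y = 2.0 * x ** 0.7 * np.exp(np.random.default_rng(0).normal(0.0, 0.05, x.size))
    a, b = fit_power_law(x, y)
    print(a, b)
    ```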

  4. Data from: Statistical significance, selection accuracy, and experimental...

    • datasetcatalog.nlm.nih.gov
    • scielo.figshare.com
    Updated Oct 29, 2022
    Cite
    de Resende, Marcos Deon Vilela; Alves, Rodrigo Silva (2022). Statistical significance, selection accuracy, and experimental precision in plant breeding [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000370794
    Explore at:
    Dataset updated
    Oct 29, 2022
    Authors
    de Resende, Marcos Deon Vilela; Alves, Rodrigo Silva
    Description

    Abstract Genetic selection efficiency is measured by accuracy. Model selection relies on hypothesis testing, with effectiveness given by statistical significance (p-value). Estimates of selection accuracy are based on variance parameters and precision. Model selection considers the amount of genetic variability and the significance of effects. The question arises as to which one to use: accuracy or p-value? We show there is a link between the two and that both may be used. We derive equations for accuracy in multi-environment trials and determine the numbers of repetitions and environments needed to reach a given accuracy. We propose a new methodology for accuracy classification based on p-values. This enables a better understanding of the level of accuracy being accepted when a certain p-value is used. An accuracy of 90% is associated with a p-value of 2%. P-values up to 20% (accuracies above 50%) are acceptable to verify the significance of genetic effects. Sample sizes for desired p-values are found via accuracy values.

  5. Data from: Particle-size analysis results for a variety of natural and...

    • catalog.data.gov
    • data.usgs.gov
    Updated Nov 21, 2025
    Cite
    U.S. Geological Survey (2025). Particle-size analysis results for a variety of natural and man-made materials used to assess the precision and accuracy of laboratory laser-diffraction particle-size analysis of fluvial sediment [Dataset]. https://catalog.data.gov/dataset/particle-size-analysis-results-for-a-variety-of-natural-and-man-made-materials-used-to-ass
    Explore at:
    Dataset updated
    Nov 21, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Description

    The dataset documents results from testing of 1) vendor-supplied reference materials, 2) NIST-traceable polydisperse glass bead reference materials, 3) mixtures of commercially available glass beads, and 4) mixtures of internal reference materials prepared from geologic material.

  6. Dataset for "A simple and accurate method to determine fluid-crystal phase...

    • data.niaid.nih.gov
    Updated May 23, 2024
    Cite
    Smallenburg, Frank (2024). Dataset for "A simple and accurate method to determine fluid-crystal phase boundaries from direct coexistence simulations" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11259761
    Explore at:
    Dataset updated
    May 23, 2024
    Dataset provided by
    Centre National de la Recherche Scientifique
    Authors
    Smallenburg, Frank
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a dataset for the article "A simple and accurate method to determine fluid-crystal phase boundaries from direct coexistence simulations", available at https://arxiv.org/abs/2403.10891 (Full citation data will be added upon final publication of the article.)

    This package provides figure data and representative configuration files associated with the systems studied in the article above. Additionally, for the hard sphere system, this package includes direct coexistence data for all reported system sizes and crystal orientations.

  7. Accuracy and Precision of Methods to Test Ebola-Relevant Chlorine...

    • datasets.ai
    21
    Updated Nov 10, 2020
    + more versions
    Cite
    US Agency for International Development (2020). Accuracy and Precision of Methods to Test Ebola-Relevant Chlorine Concentrations [Dataset]. https://datasets.ai/datasets/accuracy-and-precision-of-methods-to-test-ebola-relevant-chlorine-concentrations
    Explore at:
    Available download formats: 21
    Dataset updated
    Nov 10, 2020
    Dataset authored and provided by
    US Agency for International Development
    Description

    USAID, in partnership with the host governments and international donors, is implementing a robust set of development programs to address the second order impacts and ensure that Guinea, Sierra Leone, Liberia and other nations in the region are prepared to effectively prevent, detect, and respond to future outbreaks. This data asset includes data on the accuracy and precision of different test kit methods commonly used in the field in emergency response to test chlorine at the 0.5% and 0.05% levels, in comparison to gold standard methods.

  8. Data from: Uncertainty-Informed Screening for Safer Solvents Used in the...

    • acs.figshare.com
    • figshare.com
    xlsx
    Updated Jul 22, 2025
    Cite
    Arpan Mukherjee; Deepesh Giri; Krishna Rajan (2025). Uncertainty-Informed Screening for Safer Solvents Used in the Synthesis of Perovskites via Language Models [Dataset]. http://doi.org/10.1021/acs.jcim.5c00612.s002
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jul 22, 2025
    Dataset provided by
    ACS Publications
    Authors
    Arpan Mukherjee; Deepesh Giri; Krishna Rajan
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Automated data curation for niche scientific topics, where data quality and contextual accuracy are paramount, poses significant challenges. Bidirectional contextual models such as BERT and ELMo excel in contextual understanding and determinism. However, they are constrained by their narrower training corpora and inability to synthesize information across fragmented or sparse contexts. Conversely, autoregressive generative models like GPT can synthesize dispersed information by leveraging broader contextual knowledge and yet often generate plausible but incorrect (“hallucinated”) information. To address these complementary limitations, we propose an ensemble approach that combines the deterministic precision of BERT/ELMo with the contextual depth of GPT. We have developed a hierarchical knowledge extraction framework to identify perovskites and their associated solvents in perovskite synthesis, progressing from broad topics to narrower details using two complementary methods. The first method leverages deterministic models like BERT/ELMo for precise entity extraction, while the second employs GPT for broader contextual synthesis and generalization. Outputs from both methods are validated through structure-matching and entity normalization, ensuring consistency and traceability. In the absence of benchmark data sets for this domain, we hold out a subset of papers for manual verification to serve as a reference set for tuning the rules for entity normalization. This enables quantitative evaluation of model precision, recall, and structural adherence while also providing a grounded estimate of model confidence. By intersecting the outputs from both methods, we generate a list of solvents with maximum confidence, combining precision with contextual depth to ensure accuracy and reliability. This approach increases precision at the expense of recall, a trade-off we accept given that, in high-trust scientific applications, minimizing hallucinations is often more critical than achieving full coverage, especially when downstream reliability is paramount. As a case study, the curated data set is used to predict the endocrine-disrupting (ED) potential of solvents with a pretrained deep learning model. Recognizing that machine learning models may not be trained on niche data sets such as perovskite-related solvents, we have quantified epistemic uncertainty using Shannon entropy. This measure evaluates the confidence of the ML model predictions, independent of uncertainties in the NLP-based data curation process, and identifies high-risk solvents requiring further validation. Additionally, the manual verification pipeline addresses ethical considerations around trust, structure, and transparency in AI-curated data sets.
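
    A minimal sketch of the Shannon-entropy uncertainty measure mentioned above, applied to a classifier's predicted class probabilities for a single solvent (values purely illustrative):

    ```python
    # Shannon entropy over predicted class probabilities; higher entropy = less confident.
    import numpy as np

    def shannon_entropy(probs: np.ndarray, eps: float = 1e-12) -> float:
        p = np.clip(probs, eps, 1.0)
        return float(-(p * np.log2(p)).sum())

    print(shannon_entropy(np.array([0.95, 0.05])))   # confident prediction, low entropy
    print(shannon_entropy(np.array([0.55, 0.45])))   # uncertain prediction, high entropy
    ```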

  9. Landmark Detection for Tsetse Fly

    • kaggle.com
    zip
    Updated Jan 15, 2023
    Cite
    The Devastator (2023). Landmark Detection for Tsetse Fly [Dataset]. https://www.kaggle.com/datasets/thedevastator/automated-landmark-detection-for-14354-tsetse-fl
    Explore at:
    Available download formats: zip (4496352 bytes)
    Dataset updated
    Jan 15, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Automated Landmark Detection for 14,354 Tsetse Fly Wings

    Accurate Morphometric Data

    By [source]

    About this dataset

    This dataset contains the coordinates of 11 anatomical landmarks on 14,354 pairs of field-collected tsetse fly wings. Accurately located with automatic deep learning by a two-tier method, this identification process is essential for those conducting morphological or biological research on the species Glossina pallidipes and G. m. morsitans. An accurate capture of these data points is both difficult and time-consuming — making our two-tier method an invaluable resource for any researchers in need! Columns include morphology data such as wing length measurements, landmark locations, host collections, collection dates/months/years, morphometric data strings and more — allowing you to uncover new insights into these fascinating insects through detailed analysis! Unlock new discoveries within the natural world by exploring this exciting dataset today — from gaining insight into tsetse fly wing characteristics to larger implications regarding biology and evolution — you never know what exciting findings await!

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    Step 1: Download the data from Kaggle. Unzip it and open it in your favorite spreadsheet software (e.g., Excel or Google Sheets).

    Step 2: Become familiar with the two available data fields in ALDTTFW — wing length measurement 'wlm' and distance between left and right wings 'dis_l'. These two pieces of information are extremely helpful when analyzing wing-pair morphology within a larger sample size, as they allow researchers to identify discrepancies between multiple sets of wings in a given group quickly and easily.

    Step 3: Take note of each wing's landmark coordinates, which can be found under columns lmkl through lmkr — there are 11 areas measured in total per individual left and right wing (e.g., 'L x1': X coordinate of the first landmark on the left wing), providing anatomical precision.

    Step 4: Make sure that both wings have been labeled accurately by checking their respective quality grades, found under columns 'left_good' and 'right_good'. A grade of 0 or 1 indicates whether background noise is present, which could result in an inaccurate set of landmark points later during analysis; the grade should therefore always be 1 before continuing with further steps.

    Step 5: Calculate pertinent averages from the given values, such as overall wing span or distances between anatomical landmarks; these averages indicate whether particular traits distinguish the groups gathered for comparison.

    Lastly – always double check accuracy! It is advised that you reference previously collected literature regarding the locations of specific anatomical landmarks prior to making any final conclusions from your analysis; a brief pandas sketch of Steps 2-5 is shown below.
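
    The sketch uses only the column names quoted in this description (wlm, left_good, right_good, g); everything else is an assumption.

    ```python
    # Hedged sketch of the quality filtering and averaging steps described above.
    import pandas as pd

    df = pd.read_csv("morphometric_data.csv")

    # Step 4: keep only wing pairs whose left and right images passed the quality check.
    clean = df[(df["left_good"] == 1) & (df["right_good"] == 1)]

    # Step 5: pertinent averages, e.g. mean wing length measurement per genus.
    print(clean.groupby("g")["wlm"].mean())
    ```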

    Research Ideas

    • Comparing the morphology of tsetse fly wings across different host species, locations, and/or collections.
    • Creating classification algorithms for morphometric analysis that use deep learning architectures for automatic landmark detection.
    • Developing high-resolution identification methods (or markers) to distinguish between tsetse fly species and subspecies based on their wing anatomy landmarks.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: morphometric_data.csv

    | Column name | Description |
    |:------------|:------------|
    | vpn | Unique identifier for the wing pair. (String) |
    | cd | Collection date. (Date) |
    | cm | Collection month. (Integer) |
    | cy | Collection year. (Integer) |
    | md | Morphometric data. (String) |
    | g | Genus. (String) |
    | wlm | Wing length measurem... |

  10. Data from: Accuracy and precision of an umbilical-based method for...

    • search.dataone.org
    • datadryad.org
    Updated Dec 13, 2024
    Cite
    Anne Ju Laberge; Pier-Olivier Cusson; Joanie Van de Walle; Xavier Bordeleau; Mike Hammill; Fanie Pelletier (2024). Accuracy and precision of an umbilical-based method for estimating birthdates of pre-weaned harbour seal pups [Dataset]. http://doi.org/10.5061/dryad.2jm63xt08
    Explore at:
    Dataset updated
    Dec 13, 2024
    Dataset provided by
    Dryad Digital Repository
    Authors
    Anne Ju Laberge; Pier-Olivier Cusson; Joanie Van de Walle; Xavier Bordeleau; Mike Hammill; Fanie Pelletier
    Description

    We evaluated the accuracy and precision of an umbilical-based method for fine-scale age and birthdate estimation in a wild population of harbour seal pups (Phoca vitulina) in the St. Lawrence Estuary, Quebec, Canada. The method consists of attributing umbilicus degeneration scores to estimate pup age in days. To assess its validity, we first constructed a score attribution test with field pictures of pup umbilical cords at various stages of degeneration. This test was completed by 8 observers, and we measured the accuracy and precision of the umbilicus degeneration score attribution. We then used data from 758 pups (captured between 1998 and 2023) for which an umbilical degeneration score was assigned in the field to evaluate the efficiency of this score to estimate birthdate. We show that observers can accurately and precisely attribute umbilicus scores, and that this umbilical-based method can accurately estimate pup birthdates. Here are the two datasets used, as well as a README.md file.

    # Data from: Accuracy and precision of an umbilical-based method for estimating birthdates of pre-weaned harbour seal pups

    https://doi.org/10.5061/dryad.2jm63xt08

    Description of the data and file structure

    The data was collected by capturing wild harbour seal pups in the St. Lawrence Estuary. We used data on the umbilicus degeneration scores in this study, from which we took pictures directly in the field. Those pictures were used to construct the picture test data set.

    Files and variables

    File: repetability_test_data_anonyme.csv

    Description: This dataset was used to evaluate the accuracy and precision when attributing an umbilicus degeneration score. The data in this dataset comes from pictures of umbilical cords taken in the field during the 2023 field season. We used 25 pictures, and 8 different observers completed our picture test.

    Variables
    • ID_photo: The individual identification of the photo, from 1...
  11. A Multilaboratory Comparison of Calibration Accuracy and the Performance of...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated May 21, 2015
    Cite
    Tripathy, Ashutosh; Fischle, Wolfgang; Perdue, Erby E.; Solovyova, Alexandra S.; Brennerman, William; Narlikar, Geeta J.; Ma, Jia; Unzai, Satoru; Peverelli, Martin G.; Scott, David J.; Staunton, David; Kokona, Bashkim; Wolberger, Cynthia; Dean, William L.; Besong, Tabot M. D.; Modrak-Wojcik, Anna; Herr, Andrew B.; Kosek, Dalibor; Kwon, Hyewon; Bakhtina, Marina M.; Richter, Klaus; Ghirlando, Rodolfo; Luger, Karolin; Maynard, Ernest L.; Finn, Ron M.; Wolff, Martin; Luque-Ortega, Juan R.; Leech, Andrew P.; Uchiyama, Susumu; Wang, Szu-Huan; Bzowska, Agnieszka; Prevelige, Peter E.; Mok, Yee-Foong; Bedwell, Gregory J.; Harding, Stephen E.; Schuck, Peter; Eckert, Debra M.; Mücke, Norbert; Noda, Masanori; Seravalli, Javier G.; Fagan, Jeffrey A.; Chaton, Catherine T.; Chaires, Jonathan B.; Song, Renjie; Connaghan, Keith D.; Le Roy, Aline; Rowe, Arthur J.; Wubben, Jacinta M.; Wielgus-Kutrowska, Beata; Gruber, Anna Vitlin; Ringel, Alison E.; Rufer, Arne C.; Larson, Adam; Ebel, Christine; Perkins, Stephen J.; Pawelek, Peter D.; Tessmer, Ingrid; Wright, Edward; Eisenstein, Edward; Cifre, José G. Hernández; Becker, Donald F.; Bekdemir, Ahmet; Piszczek, Grzegorz; Lilie, Hauke; von Hippel, Peter H.; Crowley, Kimberly A.; Alfonso, Carlos; Uebel, Stephan F. W.; Jose, Davis; Wu, Yu-Sung; Nourse, Amanda; Birck, Catherine; Curth, Ute; Brautigam, Chad A.; Kusznir, Eric A.; Rezabkova, Lenka; England, Patrick; Perugini, Matthew A.; Weitzel, Steven E.; Wandrey, Christine; Peterson, Craig L.; Zhao, Huaying; Eisele, Leslie E.; Byron, Olwyn; Obsil, Tomas; de la Torre, José García; Hall, Damien; Bain, David L.; Díez, Ana I.; Nagel-Steger, Luitgard; Escalante, Carlos; Kornblatt, Jack A.; Streicher, Werner W.; Toth IV, Ronald T.; May, Carrie A.; Cölfen, Helmut; Gustafsson, Henning; Kim, Soon-Jong; Sumida, John P.; Jao, Shu-Chuan; Isaac, Richard S.; Krayukhina, Elena; Raynal, Bertrand D. E.; Howell, Elizabeth E.; Rosenberg, Rose; Laue, Thomas M.; Szczepanowski, Roman H.; Krzizike, Daniel; Strauss, Holger M.; Swygert, Sarah G.; Arisaka, Fumio; Stott, Katherine; Park, Chad K.; Attali, Ilan; Prag, Gali; Gor, Jayesh; Stoddard, Caitlin; Daviter, Tina; Park, Jin-Ku; Fairman, Robert (2015). A Multilaboratory Comparison of Calibration Accuracy and the Performance of External References in Analytical Ultracentrifugation [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001940750
    Explore at:
    Dataset updated
    May 21, 2015
    Authors
    Tripathy, Ashutosh; Fischle, Wolfgang; Perdue, Erby E.; Solovyova, Alexandra S.; Brennerman, William; Narlikar, Geeta J.; Ma, Jia; Unzai, Satoru; Peverelli, Martin G.; Scott, David J.; Staunton, David; Kokona, Bashkim; Wolberger, Cynthia; Dean, William L.; Besong, Tabot M. D.; Modrak-Wojcik, Anna; Herr, Andrew B.; Kosek, Dalibor; Kwon, Hyewon; Bakhtina, Marina M.; Richter, Klaus; Ghirlando, Rodolfo; Luger, Karolin; Maynard, Ernest L.; Finn, Ron M.; Wolff, Martin; Luque-Ortega, Juan R.; Leech, Andrew P.; Uchiyama, Susumu; Wang, Szu-Huan; Bzowska, Agnieszka; Prevelige, Peter E.; Mok, Yee-Foong; Bedwell, Gregory J.; Harding, Stephen E.; Schuck, Peter; Eckert, Debra M.; Mücke, Norbert; Noda, Masanori; Seravalli, Javier G.; Fagan, Jeffrey A.; Chaton, Catherine T.; Chaires, Jonathan B.; Song, Renjie; Connaghan, Keith D.; Le Roy, Aline; Rowe, Arthur J.; Wubben, Jacinta M.; Wielgus-Kutrowska, Beata; Gruber, Anna Vitlin; Ringel, Alison E.; Rufer, Arne C.; Larson, Adam; Ebel, Christine; Perkins, Stephen J.; Pawelek, Peter D.; Tessmer, Ingrid; Wright, Edward; Eisenstein, Edward; Cifre, José G. Hernández; Becker, Donald F.; Bekdemir, Ahmet; Piszczek, Grzegorz; Lilie, Hauke; von Hippel, Peter H.; Crowley, Kimberly A.; Alfonso, Carlos; Uebel, Stephan F. W.; Jose, Davis; Wu, Yu-Sung; Nourse, Amanda; Birck, Catherine; Curth, Ute; Brautigam, Chad A.; Kusznir, Eric A.; Rezabkova, Lenka; England, Patrick; Perugini, Matthew A.; Weitzel, Steven E.; Wandrey, Christine; Peterson, Craig L.; Zhao, Huaying; Eisele, Leslie E.; Byron, Olwyn; Obsil, Tomas; de la Torre, José García; Hall, Damien; Bain, David L.; Díez, Ana I.; Nagel-Steger, Luitgard; Escalante, Carlos; Kornblatt, Jack A.; Streicher, Werner W.; Toth IV, Ronald T.; May, Carrie A.; Cölfen, Helmut; Gustafsson, Henning; Kim, Soon-Jong; Sumida, John P.; Jao, Shu-Chuan; Isaac, Richard S.; Krayukhina, Elena; Raynal, Bertrand D. E.; Howell, Elizabeth E.; Rosenberg, Rose; Laue, Thomas M.; Szczepanowski, Roman H.; Krzizike, Daniel; Strauss, Holger M.; Swygert, Sarah G.; Arisaka, Fumio; Stott, Katherine; Park, Chad K.; Attali, Ilan; Prag, Gali; Gor, Jayesh; Stoddard, Caitlin; Daviter, Tina; Park, Jin-Ku; Fairman, Robert
    Description

    Analytical ultracentrifugation (AUC) is a first principles based method to determine absolute sedimentation coefficients and buoyant molar masses of macromolecules and their complexes, reporting on their size and shape in free solution. The purpose of this multi-laboratory study was to establish the precision and accuracy of basic data dimensions in AUC and validate previously proposed calibration techniques. Three kits of AUC cell assemblies containing radial and temperature calibration tools and a bovine serum albumin (BSA) reference sample were shared among 67 laboratories, generating 129 comprehensive data sets. These allowed for an assessment of many parameters of instrument performance, including accuracy of the reported scan time after the start of centrifugation, the accuracy of the temperature calibration, and the accuracy of the radial magnification. The range of sedimentation coefficients obtained for BSA monomer in different instruments and using different optical systems was from 3.655 S to 4.949 S, with a mean and standard deviation of (4.304 ± 0.188) S (4.4%). After the combined application of correction factors derived from the external calibration references for elapsed time, scan velocity, temperature, and radial magnification, the range of s-values was reduced 7-fold with a mean of 4.325 S and a 6-fold reduced standard deviation of ± 0.030 S (0.7%). In addition, the large data set provided an opportunity to determine the instrument-to-instrument variation of the absolute radial positions reported in the scan files, the precision of photometric or refractometric signal magnitudes, and the precision of the calculated apparent molar mass of BSA monomer and the fraction of BSA dimers. These results highlight the necessity and effectiveness of independent calibration of basic AUC data dimensions for reliable quantitative studies.

  12. SZ dataset

    • kaggle.com
    zip
    Updated Mar 22, 2023
    Cite
    khurram (2023). SZ dataset [Dataset]. https://www.kaggle.com/datasets/khurramejaz/sz-dataset
    Explore at:
    Available download formats: zip (22110107 bytes)
    Dataset updated
    Mar 22, 2023
    Authors
    khurram
    Description

    Image segmentation is a challenging task in the field of medical image processing. Magnetic resonance imaging helps doctors detect human brain tumors in three image planes (axial, coronal, sagittal). MR images are noisy, and detecting the brain tumor location as a feature is complicated. Level set methods have been applied, but they are affected by human interaction; an appropriate contour is therefore generated in discontinuous regions, and the pathological brain tumor portion is highlighted after applying binarization and removing unessential objects, from which the contour is generated. To classify the tumor for segmentation, a hybrid Fuzzy K-Means-Self-Organizing Map (FKM-SOM) approach is used for variation of intensities. To improve segmentation accuracy, classification was performed: features are extracted using the Discrete Wavelet Transform (DWT) and then reduced using Principal Component Analysis (PCA). Thirteen features from every image of the dataset were classified using Support Vector Machine (SVM) kernels (RBF, linear, polynomial), and results were evaluated with parameters such as F-score, precision, accuracy, specificity, and recall.
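
    A hedged sketch of the described pipeline (DWT feature extraction, PCA reduction to 13 components, SVM classification with an RBF kernel) on purely illustrative data; the dataset's actual preprocessing and parameters are not reproduced here.

    ```python
    # Hedged DWT -> PCA -> SVM sketch; images and labels below are synthetic placeholders.
    import numpy as np
    import pywt
    from sklearn.decomposition import PCA
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC

    def dwt_features(image: np.ndarray) -> np.ndarray:
        """Single-level 2-D discrete wavelet transform; flattened approximation coefficients."""
        cA, (cH, cV, cD) = pywt.dwt2(image, "haar")
        return cA.ravel()

    rng = np.random.default_rng(0)
    images = rng.random((40, 64, 64))              # stand-in for MR slices
    labels = rng.integers(0, 2, size=40)           # stand-in for tumour / no-tumour labels

    X = np.stack([dwt_features(img) for img in images])
    model = make_pipeline(PCA(n_components=13), SVC(kernel="rbf"))
    model.fit(X, labels)
    ```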

  13. Eyes-Detection

    • huggingface.co
    Updated Sep 17, 2025
    Cite
    Andrea Porri (2025). Eyes-Detection [Dataset]. https://huggingface.co/datasets/AndreaPorri/Eyes-Detection
    Explore at:
    Dataset updated
    Sep 17, 2025
    Authors
    Andrea Porri
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Accurate Eyes Detection Dataset

      Dataset Description
    

    This COCO dataset is designed for real-time "precise eye detection" task under various conditions. It contains highly accurate bounding box annotations, manually retraced with Roboflow to ensure maximum precision.

      Dataset Splits
    

    This dataset is intentionally provided as a single training split containing all 72,317 examples. This design choice allows researchers to:

    Create custom split ratios tailored to… See the full description on the dataset page: https://huggingface.co/datasets/AndreaPorri/Eyes-Detection.

  14. LLM Feedback Collection

    • kaggle.com
    zip
    Updated Nov 23, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    The Devastator (2023). LLM Feedback Collection [Dataset]. https://www.kaggle.com/datasets/thedevastator/fine-grained-gpt-4-evaluation
    Explore at:
    Available download formats: zip (159502027 bytes)
    Dataset updated
    Nov 23, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    LLM Feedback Collection

    Induce fine-grained evaluation capabilities into language models

    By Huggingface Hub [source]

    About this dataset

    This dataset contains 100,000 feedback responses from GPT-4 AI models along with rubrics designed to evaluate both absolute and ranking scores. Each response is collected through a comprehensive evaluation process that takes into account the model's feedback, instruction, criteria for scoring, referenced answer and input given. This data provides researchers and developers with valuable insights into the performance of their AI models on various tasks as well as the ability to compare them against one another using precise and accurate measures. Each response is accompanied by five descriptive scores that give a detailed overview of its quality in terms of relevance to the input given, accuracy in reference to the reference answer provided, coherence between different parts of the output such as grammar and organization, fluency in expression of ideas without errors or unnecessary repetitions, and overall productivity accounting for all other factors combined. With this dataset at your disposal, you will be able to evaluate each output qualitatively without having to manually inspect every single response.

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    This dataset contains feedback from GPT-4 models, along with associated rubrics for absolute and ranking scoring. It can be used to evaluate the performance of GPT-4 models on different challenging tasks.

    In order to use this dataset effectively, it is important to understand the data provided in each column:
    - orig_feedback – Feedback given by the original GPT-4 model
    - orig_score2_description – Description of the second score given to the original GPT-4 model
    - orig_reference_answer – Reference answer used to evaluate the original GPT-4 model
    - output – Output from the fine-grained evaluation
    - orig_response – Response from the original GPT-4 model
    - orig_criteria – Criteria used to evaluate the original GPT-4 model
    - orig_instruction – Instruction given to the original GPT-4 model
    - orig_score3_description – Description of the third score given to

    Research Ideas

    • Data-driven evaluation of GPT-4 models using the absolute and ranking scores collected from this dataset.
    • Training a deep learning model to automate the assessment of GPT-4 responses based on the rubrics provided in this dataset.
    • Building a semantic search engine using GPT-4 that is able to identify relevant responses more accurately with the help of this dataset's data collection metrics and rubrics for scoring

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: train.csv

    | Column name | Description |
    |:------------|:------------|
    | orig_feedback | Feedback from the evaluator. (Text) |
    | orig_score2_description | Description of the second score given by the evaluator. (Text) |
    | orig_reference_answer | Reference answer used to evaluate the model response. (Text) |
    | output | Output from the GPT-4 model. (Text) |
    | orig_response | Original response from the GPT-4 model. (Text) |
    | orig_criteria | Criteria used by the evaluator to rate the response. (Text) |
    | orig_instruction | Instructions provided by the evaluator. (Text) |
    | orig_score3_description | Description of the third score given by the evaluator. (Text) |
    | orig_score5_description | Description of the fifth score given by the evaluator. (Text) |
    | orig_score1_description | Description of the first score given by the evaluator. (Text) |
    | input | Input given to the evaluation. (Text) |
    | orig_score4_description | Description of the fourth score given by the evalua... |

  15. Data from: Quantifying accuracy and precision from continuous response data...

    • fdr.uni-hamburg.de
    csv, r
    Updated May 4, 2022
    Cite
    Bruns, Patrick (2022). Quantifying accuracy and precision from continuous response data in studies of spatial perception and crossmodal recalibration [Dataset]. http://doi.org/10.25592/uhhfdm.10183
    Explore at:
    Available download formats: csv, r
    Dataset updated
    May 4, 2022
    Dataset provided by
    Universität Hamburg
    Authors
    Bruns, Patrick
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains data and code associated with the study "Quantifying accuracy and precision from continuous response data in studies of spatial perception and crossmodal recalibration" by Patrick Bruns, Caroline Thun, and Brigitte Röder.

    example_code.R contains analysis code that can be used to calculate error-based and regression-based localization performance metrics from single-subject response data, with a working example in R. It requires as inputs a numeric vector containing the stimulus location (true value) in each trial and a numeric vector containing the corresponding localization response (perceived value) in each trial.

    example_data.csv contains the data used in the working example of the analysis code.

    localization.csv contains extracted localization performance metrics from 188 subjects which were analyzed in the study to assess the agreement between error-based and regression-based measures of accuracy and precision. The subjects had all naively performed an azimuthal sound localization task (see related identifiers for the underlying raw data).

    recalibration.csv contains extracted localization performance metrics from a subsample of 57 subjects in whom data from a second sound localization test, performed after exposure to audiovisual stimuli in which the visual stimulus was consistently presented 13.5° to the right of the sound source, were available. The file contains baseline performance (pre) and changes in performance after audiovisual exposure relative to baseline (delta) in each of the localization performance metrics.

    Localization performance metrics were either derived from the single-trial localization errors (error-based approach) or from a linear regression of localization responses on the actual target locations (regression-based approach). The following localization performance metrics were included in the study (a Python sketch computing them follows the list):

    bias: overall bias of localization responses to the left (negative values) or to the right (positive values), equivalent to constant error (CE) in error-based approaches and intercept in regression-based approaches

    absolute constant error (aCE): absolute value of bias (or CE), indicates the amount of bias irrespective of direction

    mean absolute constant error (maCE): mean of the aCE per target location, reflects over- or underestimation of peripheral target locations

    variable error (VE): mean of the standard deviations (SD) of the single-trial localization errors at each target location

    pooled variable error (pVE): SD of the single-trial localization errors pooled across trials from all target locations

    absolute error (AE): mean of the absolute values of the single-trial localization errors, sensitive to both bias and variability of the localization responses

    slope: slope of the regression model function, indicates an overestimation (values > 1) or underestimation (values < 1) of peripheral target locations

    R2: coefficient of determination of the regression model, indicates the goodness of the fit of the localization responses to the regression line
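
    The following Python sketch computes the error-based and regression-based metrics listed above from a vector of true target locations and a vector of localization responses (the dataset itself ships R code for this in example_code.R; this translation is illustrative only):

    ```python
    # Hedged Python sketch of the listed localization metrics (maCE omitted for brevity).
    import numpy as np

    def localization_metrics(stimulus: np.ndarray, response: np.ndarray) -> dict:
        error = response - stimulus
        per_target_sd = [error[stimulus == s].std(ddof=1) for s in np.unique(stimulus)]
        slope, intercept = np.polyfit(stimulus, response, deg=1)
        predicted = intercept + slope * stimulus
        r2 = 1.0 - ((response - predicted) ** 2).sum() / ((response - response.mean()) ** 2).sum()
        return {
            "CE (bias)": error.mean(),             # constant error / bias
            "aCE": abs(error.mean()),              # absolute constant error
            "VE": float(np.mean(per_target_sd)),   # mean of per-target SDs
            "pVE": error.std(ddof=1),              # SD pooled across all trials
            "AE": np.abs(error).mean(),            # mean absolute error
            "slope": slope, "intercept (bias)": intercept, "R2": r2,
        }

    stim = np.tile(np.array([-20.0, -10.0, 0.0, 10.0, 20.0]), 20)
    resp = 0.9 * stim + np.random.default_rng(0).normal(0.0, 2.0, stim.size)
    print(localization_metrics(stim, resp))
    ```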

  16. Data from: Fast and Accurate Machine Learning Strategy for Calculating...

    • datasetcatalog.nlm.nih.gov
    • acs.figshare.com
    Updated Mar 19, 2021
    Cite
    Haranczyk, Maciej; Snurr, Randall Q.; Gopalan, Arun; Kancharlapalli, Srinivasu (2021). Fast and Accurate Machine Learning Strategy for Calculating Partial Atomic Charges in Metal–Organic Frameworks [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000804198
    Explore at:
    Dataset updated
    Mar 19, 2021
    Authors
    Haranczyk, Maciej; Snurr, Randall Q.; Gopalan, Arun; Kancharlapalli, Srinivasu
    Description

    Computational high-throughput screening using molecular simulations is a powerful tool for identifying top-performing metal–organic frameworks (MOFs) for gas storage and separation applications. Accurate partial atomic charges are often required to model the electrostatic interactions between the MOF and the adsorbate, especially when the adsorption involves molecules with dipole or quadrupole moments such as water and CO2. Although ab initio methods can be used to calculate accurate partial atomic charges, these methods are impractical for screening large material databases because of the high computational cost. We developed a random forest machine learning model to predict the partial atomic charges in MOFs using a small yet meaningful set of features that represent both the elemental properties and the local environment of each atom. The model was trained and tested on a collection of about 320 000 density-derived electrostatic and chemical (DDEC) atomic charges calculated on a subset of the Computation-Ready Experimental Metal–Organic Framework (CoRE MOF-2019) database and separately on charge model 5 (CM5) charges. The model predicts accurate atomic charges for MOFs at a fraction of the computational cost of periodic density functional theory (DFT) and is found to be transferable to other porous molecular crystals and zeolites. A strong correlation is observed between the partial atomic charge and the average electronegativity difference between the central atom and its bonded neighbors.
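
    A hedged sketch of this modelling approach with scikit-learn's random forest regressor on illustrative per-atom features; the paper's actual feature set, charge data, and hyperparameters are not reproduced here.

    ```python
    # Random-forest regression of partial atomic charges; all data below are synthetic.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n_atoms, n_features = 5000, 8          # e.g. electronegativity, neighbour counts, ...
    X = rng.random((n_atoms, n_features))  # stand-in for elemental + local-environment features
    y = rng.normal(0.0, 0.5, n_atoms)      # stand-in for DDEC partial charges

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
    print(model.score(X_test, y_test))     # R^2 on held-out atoms
    ```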

  17. Break (Question Decomposition Meaning)

    • kaggle.com
    zip
    Updated Dec 1, 2022
    Cite
    The Devastator (2022). Break (Question Decomposition Meaning) [Dataset]. https://www.kaggle.com/datasets/thedevastator/unlock-the-mysteries-of-language-understanding-w/discussion
    Explore at:
    Available download formats: zip (15724648 bytes)
    Dataset updated
    Dec 1, 2022
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Break (Question Decomposition Meaning)

    Human annotated dataset of natural language questions and their Question Decompositions

    By Huggingface Hub [source]

    About this dataset

    Welcome to BreakData, an innovative and cutting-edge dataset devoted to exploring language understanding. This dataset contains a wealth of information related to question decomposition, operators, splits, sources, and allowed tokens, and can be used to answer questions with precision. With deep insights into how humans comprehend and interpret language, BreakData provides immense value for researchers developing sophisticated models that can help advance AI technologies. Our goal is to enable the development of more complex natural language processing which can be used in various applications such as automated customer support, chatbots for health care advice, or automated marketing campaigns. Dive into this intriguing dataset now and discover how your work could change the world!

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    This dataset provides an exciting opportunity to explore and understand the complexities of language understanding. With this dataset, you can train models for natural language processing (NLP) activities such as question answering, text analytics, automated dialog systems, and more.

    In order to make the most effective use of the BreakData dataset, it’s important to know how it is organized and what types of data are included in each file. The BreakData dataset is broken down into nine different files:
    - QDMR_train.csv
    - QDMR_validation.csv
    - QDMR-highlevel_train.csv
    - QDMR-highlevel_test.csv
    - logicalforms_train.csv
    - logicalforms_validation.csv
    - QDMRlexicon_train.csv
    - QDMRLexicon_test.csv
    - QDHMLexiconHighLevelTest.csv

    Each file contains a different set of data. Together, these files can be used to train models for natural language understanding tasks, or to analyze existing questions and commands by decomposing them into their component parts (decompositions and operators) and examining how those parts relate to one another:

    1) The QDMR files include questions or statements from common domains such as health care or banking that need to be interpreted according to a series of operators (elements such as verbs). The task requires identifying the keywords in a statement or question that indicate variables and their values, so any model trained on these files must accurately recognize entities such as time references (dates/times), monetary amounts, and Boolean values (yes/no), as well as the relationships between those entities, while following the rule set of the specific domain language. Training rigorously on this kind of multi-step, context-dependent data helps models draw more accurate inferences in human-facing interactions such as conversations, supporting better next-best-action decisions and higher-quality customer engagement.

    2) The LogicalForms files include logical forms containing the building blocks (elements such as operators) for linking ideas together across different sets of incoming variables.
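
As a quick-start illustration, the QDMR split can be loaded with pandas; the file names follow the listing above, while the extraction directory and column names are assumptions about this Kaggle re-packaging.

```python
# Minimal loading sketch with pandas. File names follow the listing above;
# paths and column names are assumptions about this Kaggle re-packaging.
import pandas as pd

train = pd.read_csv("QDMR_train.csv")
valid = pd.read_csv("QDMR_validation.csv")

print(train.shape, valid.shape)
print(train.columns.tolist())   # inspect the available fields
print(train.iloc[0])            # one question with its decomposition
```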

    Research Ideas

    • Developing advanced natural language processing models to analyze questions using decompositions, operators, and splits.
    • Training a machine learning algorithm to predict the semantic meaning of questions based on their decomposition and split.
    • Conducting advanced text analytics by using the allowed tokens dataset to map out how people communicate specific concepts in different contexts or topics.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. [See O...

  18. Data from: Meeting Measurement Precision Requirements for Effective...

    • datasetcatalog.nlm.nih.gov
    • acs.figshare.com
    Updated Feb 14, 2022
    Cite
    Tabor, Jeffrey J.; DeLateur, Nicholas A.; Samineni, Meher; Teague, Brian; Sexton, John T.; Weiss, Ron; Beal, Jacob; Castillo-Hair, Sebastian (2022). Meeting Measurement Precision Requirements for Effective Engineering of Genetic Regulatory Networks [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000306287
    Explore at:
    Dataset updated
    Feb 14, 2022
    Authors
    Tabor, Jeffrey J.; DeLateur, Nicholas A.; Samineni, Meher; Teague, Brian; Sexton, John T.; Weiss, Ron; Beal, Jacob; Castillo-Hair, Sebastian
    Description

    Reliable, predictable engineering of cellular behavior is one of the key goals of synthetic biology. As the field matures, biological engineers will become increasingly reliant on computer models that allow for the rapid exploration of design space prior to the more costly construction and characterization of candidate designs. The efficacy of such models, however, depends on the accuracy of their predictions, the precision of the measurements used to parametrize the models, and the tolerance of biological devices for imperfections in modeling and measurement. To better understand this relationship, we have derived an Engineering Error Inequality that provides a quantitative mathematical bound on the relationship between predictability of results, model accuracy, measurement precision, and device characteristics. We apply this relation to estimate measurement precision requirements for engineering genetic regulatory networks given current model and device characteristics, recommending a target standard deviation of 1.5-fold. We then compare these requirements with the results of an interlaboratory study to validate that these requirements can be met via flow cytometry with matched instrument channels and an independent calibrant. On the basis of these results, we recommend a set of best practices for quality control of flow cytometry data and discuss how these might be extended to other measurement modalities and applied to support further development of genetic regulatory network engineering.
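
One practical reading of the 1.5-fold target is as a geometric (multiplicative) standard deviation of replicate measurements. The sketch below, with made-up calibrated values, shows how such a fold-change spread can be computed and checked against the target; this is an illustrative interpretation, not the authors' analysis code.

```python
# Minimal sketch: express replicate-to-replicate variability as a geometric
# (fold-change) standard deviation, the natural scale for a "1.5-fold" target.
# The sample values are fictitious calibrated fluorescence measurements.
import numpy as np

replicates = np.array([1040.0, 870.0, 1210.0, 990.0, 1130.0])  # hypothetical values
geo_sd_fold = np.exp(np.std(np.log(replicates), ddof=1))       # multiplicative spread

print(f"geometric SD: {geo_sd_fold:.2f}-fold")
print("meets 1.5-fold target:", geo_sd_fold <= 1.5)
```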

  19. Urdu Brainstorming Prompt & Response Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Urdu Brainstorming Prompt & Response Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/urdu-brainstorming-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Welcome to the Urdu Brainstorming Prompt-Response Dataset, a meticulously curated collection of 2000 prompt and response pairs. This dataset is a valuable resource for enhancing the creative and generative abilities of Language Models (LMs), a critical aspect in advancing generative AI.

    Dataset Content

    This brainstorming dataset comprises a diverse set of prompts and responses, where each prompt contains the instruction, context, constraints, and restrictions, while the completion contains the most accurate list of responses for the given prompt. Both the prompts and completions are available in the Urdu language.

    These prompt and completion pairs cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more. Each prompt is accompanied by a response, providing valuable information and insights to enhance the language model training process. Both the prompts and responses were manually curated by native Urdu speakers, with references taken from diverse sources such as books, news articles, websites, and other reliable references.

    This dataset encompasses various prompt types, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. Additionally, you'll find prompts and responses containing rich text elements, such as tables, code, JSON, etc., all in proper markdown format.

    Prompt Diversity

    To ensure diversity, our brainstorming dataset features prompts of varying complexity levels, ranging from easy to medium and hard. The prompts also vary in length, including short, medium, and long prompts, providing a comprehensive range. Furthermore, the dataset includes prompts with constraints and persona restrictions, making it exceptionally valuable for LLM training.

    Response Formats

    Our dataset accommodates diverse learning experiences, offering responses across different domains depending on the prompt. For these brainstorming prompts, responses are generally provided in list format. These responses encompass text strings, numerical values, and dates, enhancing the language model's ability to generate reliable, coherent, and contextually appropriate answers.

    Data Format and Annotation Details

    This fully labeled Urdu Brainstorming Prompt Completion Dataset is available in both JSON and CSV formats. It includes comprehensive annotation details, including a unique ID, prompt, prompt type, prompt length, prompt complexity, domain, response, and the presence of rich text.
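
To make the annotation schema concrete, the sketch below shows one hypothetical record built from the fields named above and loads a JSON release with the standard library; the exact key names, values, and file name are assumptions rather than FutureBeeAI's published schema.

```python
# Sketch of one hypothetical record using the annotation fields listed above,
# plus loading a JSON release. Key names and the file name are assumptions.
import json

example_record = {
    "id": "ur-brainstorm-00001",           # unique ID
    "prompt": "<Urdu prompt text>",         # instruction + context + constraints
    "prompt_type": "instruction",           # instruction / continuation / in-context
    "prompt_length": "short",               # short / medium / long
    "prompt_complexity": "easy",            # easy / medium / hard
    "domain": "science",
    "response": "<Urdu response as a markdown list>",
    "rich_text": False,                     # tables, code, JSON, etc. present?
}
print(json.dumps(example_record, ensure_ascii=False, indent=2))

# Hypothetical file name; the dataset ships in both JSON and CSV formats.
with open("urdu_brainstorming_prompt_response.json", encoding="utf-8") as fh:
    records = json.load(fh)
print(len(records), "prompt-completion pairs loaded")
```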

    Quality and Accuracy

    Our dataset upholds the highest standards of quality and accuracy. Each prompt undergoes meticulous validation, and the corresponding responses are thoroughly verified. We prioritize inclusivity, ensuring that the dataset incorporates prompts and completions representing diverse perspectives and writing styles, maintaining an unbiased and discrimination-free stance.

    The Urdu version is grammatically accurate without any spelling or grammatical errors. No copyrighted, toxic, or harmful content is used during the construction of this dataset.

    Continuous Updates and Customization

    The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. We continuously work to expand this dataset, ensuring its ongoing growth and relevance. Additionally, FutureBeeAI offers the flexibility to curate custom brainstorming prompt and completion datasets tailored to specific requirements, providing you with customization options.

    License

    This dataset, created by FutureBeeAI, is now available for commercial use. Researchers, data scientists, and developers can leverage this fully labeled and ready-to-deploy Urdu Brainstorming Prompt-Completion Dataset to enhance the creative and accurate response generation capabilities of their generative AI models and explore new approaches to NLP tasks.

  20. Data from: Accurate Thermochemistry with Small Data Sets: A Bond Additivity...

    • datasetcatalog.nlm.nih.gov
    • acs.figshare.com
    Updated Jun 27, 2019
    Cite
    Grambow, Colin A.; Green, William H.; Li, Yi-Pei (2019). Accurate Thermochemistry with Small Data Sets: A Bond Additivity Correction and Transfer Learning Approach [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000116833
    Explore at:
    Dataset updated
    Jun 27, 2019
    Authors
    Grambow, Colin A.; Green, William H.; Li, Yi-Pei
    Description

    Machine learning provides promising new methods for accurate yet rapid prediction of molecular properties, including thermochemistry, which is an integral component of many computer simulations, particularly automated reaction mechanism generation. Often, very large data sets with tens of thousands of molecules are required for training the models, but most data sets of experimental or high-accuracy quantum mechanical quality are much smaller. To overcome these limitations, we calculate new high-level data sets and derive bond additivity corrections to significantly improve enthalpies of formation. We adopt a transfer learning technique to train neural network models that achieve good performance even with a relatively small set of high-accuracy data. The training data for the entropy model are carefully selected so that important conformational effects are captured. The resulting models are generally applicable thermochemistry predictors for organic compounds with oxygen and nitrogen heteroatoms that approach experimental and coupled cluster accuracy while only requiring molecular graph inputs. Due to their versatility and the ease of adding new training data, they are poised to replace conventional estimation methods for thermochemical parameters in reaction mechanism generation. Since high-accuracy data are often sparse, similar transfer learning approaches are expected to be useful for estimating many other molecular properties.
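
To illustrate the bond additivity correction idea in general terms (not the authors' data or code), per-bond corrections can be fit by least squares so that corrected enthalpies match reference values; the molecules, bond counts, and residual errors below are made up for demonstration only.

```python
# Illustrative bond additivity correction (BAC): fit one additive correction
# per bond type so that corrected enthalpies of formation match reference data.
import numpy as np

bond_types = ["C-H", "C-C", "C-O", "O-H"]
# Rows: molecules (CH4, C2H6, CH3OH, H2O); columns: counts of each bond type.
bond_counts = np.array([
    [4, 0, 0, 0],
    [6, 1, 0, 0],
    [3, 0, 1, 1],
    [0, 0, 0, 2],
], dtype=float)
# Error of the uncorrected method relative to reference values (kcal/mol, fictitious).
residual = np.array([1.2, 1.9, 2.4, 0.8])

corrections, *_ = np.linalg.lstsq(bond_counts, residual, rcond=None)
for bt, c in zip(bond_types, corrections):
    print(f"{bt}: {c:+.3f} kcal/mol")

# Corrected enthalpy for a new molecule: H_corrected = H_computed - counts @ corrections
```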

Data from: A New Bayesian Approach to Increase Measurement Accuracy Using a Precision Entropy Indicator


Based on the mentioned Python source code (see point 3, the Bayesian adaptive searching method with IEI values), we performed 1,000 successful target searches, and the outputs were saved in the Self_learning_model_test_output.zip file.

Bayesian search (IEI) starting from different quadrants. This dataset contains the results of Bayesian adaptive target search simulations, including various outputs that represent the performance and analysis of the search algorithm. The dataset includes: a) Heatmaps (Heatmap_I_Quadrant, Heatmap_II_Quadrant, Heatmap_III_Quadrant, Heatmap_IV_Quadrant): these represent the search results and the paths taken from each quadrant during the simulations, indicating how frequently the system selected each bin during the search process. b) Posterior Distributions (All_posteriors, Probability_distribution_posteriors_values, CDF_posteriors_values): generated from the posterior values, these files track the posterior probability updates, including cumulative distribution functions (CDF) and probability distributions. c) Macro Summary (summary_csv_macro): this file aggregates metrics and key statistics from the simulation, summarizing the results of the individual results.csv files. d) Heatmap Searching Method Documentation (Bayesian_Heatmap_Searching_Method_05_12_2024): this document visualizes the search algorithm's path, showing how frequently each bin was selected during the 1,000 successful target searches. e) One-Way ANOVA Analysis (Anova_analyze_dataset, One_way_Anova_analysis_results): the database and SPSS calculations used to examine whether the starting quadrant influences the number of search steps required. The analysis was conducted at a 5% significance level, followed by a Games-Howell post hoc test [43] to identify which target-surrounding quadrants differed significantly in the number of search steps. Results were saved in Self_learning_model_test_results.zip.
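
For readers who prefer Python over SPSS, a minimal sketch of the same kind of analysis is shown below: a one-way ANOVA on search steps grouped by starting quadrant, followed by a Games-Howell post hoc test via the pingouin package. The column names and the CSV export of the ANOVA dataset are assumptions; the published analysis was run in SPSS.

```python
# Sketch: one-way ANOVA on search steps per starting quadrant, then a
# Games-Howell post hoc test. Column names and CSV export are assumptions.
import pandas as pd
from scipy.stats import f_oneway
import pingouin as pg

df = pd.read_csv("Anova_analyze_dataset.csv", sep=";")   # semicolon-delimited, see Technical Information
groups = [g["steps"].values for _, g in df.groupby("quadrant")]

f_stat, p_value = f_oneway(*groups)
print(f"one-way ANOVA: F = {f_stat:.3f}, p = {p_value:.4f}")

if p_value < 0.05:                                        # 5% significance level
    posthoc = pg.pairwise_gameshowell(data=df, dv="steps", between="quadrant")
    print(posthoc[["A", "B", "diff", "pval"]])
```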

This dataset contains randomly generated sequences of bin selections (1-40) from a control search algorithm (random search) used to benchmark the performance of Bayesian-based methods. The process iteratively generates random numbers until a stopping condition is met (reaching target bins 1, 11, 21, or 31). This dataset serves as a baseline for analyzing the efficiency, randomness, and convergence of non-adaptive search strategies. The dataset includes the following: a) The Python source code of the random search algorithm. b) A file (summary_random_search.csv) containing the results of 1000 successful target hits. c) A heatmap visualizing the frequency of search steps for each bin, providing insight into the distribution of steps across the bins. Random_search.zip
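
A minimal re-implementation sketch of this random-search control, assuming uniform draws over bins 1-40 and the stated stopping condition (not the authors' exact script):

```python
# Random-search control: draw bins uniformly from 1-40 until a target bin
# (1, 11, 21, or 31) is hit; record the number of steps per successful run.
import random

TARGET_BINS = {1, 11, 21, 31}

def random_search(rng: random.Random) -> int:
    steps = 0
    while True:
        steps += 1
        if rng.randrange(1, 41) in TARGET_BINS:
            return steps

rng = random.Random(42)
step_counts = [random_search(rng) for _ in range(1000)]   # 1,000 successful hits
print("mean steps:", sum(step_counts) / len(step_counts))
```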

This dataset contains the results of a random walk search algorithm, designed as a control mechanism to benchmark adaptive search strategies (Bayesian-based methods). The random walk operates within a defined space of 40 bins, where each bin has a set of neighboring bins. The search begins from a randomly chosen starting bin and proceeds iteratively, moving to a randomly selected neighboring bin, until one of the stopping conditions is met (bins 1, 11, 21, or 31). The dataset provides detailed records of 1,000 random walk iterations, with the following key components: a) Individual Iteration Results: Each iteration's search path is saved in a separate CSV file (random_walk_results_.csv), listing the sequence of steps taken and the corresponding bin at each step. b) Summary File: A combined summary of all iterations is available in random_walk_results_summary.csv, which aggregates the step-by-step data for all 1,000 random walks. c) Heatmap Visualization: A heatmap file is included to illustrate the frequency distribution of steps across bins, highlighting the relative visit frequencies of each bin during the random walks. d) Python Source Code: The Python script used to generate the random walk dataset is provided, allowing reproducibility and customization for further experiments. Random_walk.zip
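
A minimal sketch of the random-walk control follows; since the neighbourhood structure of the 40 bins is not spelled out here, a simple ring topology is assumed purely for illustration.

```python
# Random-walk control: start in a random bin and move to a uniformly chosen
# neighbouring bin until a target bin is reached. Ring neighbours are an
# assumption standing in for the dataset's actual neighbour definition.
import random

TARGET_BINS = {1, 11, 21, 31}
NEIGHBOURS = {b: [(b - 2) % 40 + 1, b % 40 + 1] for b in range(1, 41)}  # ring: b-1 and b+1

def random_walk(rng: random.Random) -> list[int]:
    path = [rng.randrange(1, 41)]
    while path[-1] not in TARGET_BINS:
        path.append(rng.choice(NEIGHBOURS[path[-1]]))
    return path

rng = random.Random(0)
paths = [random_walk(rng) for _ in range(1000)]
print("mean path length:", sum(len(p) for p in paths) / len(paths))
```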

This dataset contains the results of a genetic search algorithm implemented as a control method to benchmark adaptive Bayesian-based search strategies. The algorithm operates in a 40-bin search space with predefined target bins (1, 11, 21, 31) and evolves solutions through random initialization, selection, crossover, and mutation over 1,000 successful runs. Dataset Components: a) Run Results: individual run data is stored in separate files (genetic_algorithm_run_.csv), detailing: Generation: the generation number. Fitness: the fitness score of the solution. Steps: the path length in bins. Solution: the sequence of bins visited. b) Summary File: summary.csv consolidates the best solutions from all runs, including their fitness scores, path lengths, and sequences. c) All Steps File: summary_all_steps.csv records all bins visited during the runs for distribution analysis. d) A heatmap was also generated for the genetic search algorithm, illustrating the frequency of bins chosen during the search process as a representation of the search pathways. Genetic_search_algorithm.zip
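
A compact sketch of a genetic-algorithm search in the same 40-bin space, using tournament selection, one-point crossover, and point mutation; the representation and parameter values are illustrative assumptions, not the authors' implementation.

```python
# Genetic-algorithm control sketch: individuals are candidate bin sequences,
# fitness rewards reaching a target bin in few steps, and new generations are
# built by tournament selection, one-point crossover, and point mutation.
import random

TARGET_BINS = {1, 11, 21, 31}
SEQ_LEN, POP_SIZE, GENERATIONS, MUT_RATE = 20, 50, 100, 0.05
rng = random.Random(1)

def fitness(seq):
    for i, b in enumerate(seq):
        if b in TARGET_BINS:
            return 1.0 / (i + 1)          # earlier hit -> higher fitness
    return 0.0

def tournament(pop):
    return max(rng.sample(pop, 3), key=fitness)

population = [[rng.randrange(1, 41) for _ in range(SEQ_LEN)] for _ in range(POP_SIZE)]
for gen in range(GENERATIONS):
    nxt = []
    while len(nxt) < POP_SIZE:
        p1, p2 = tournament(population), tournament(population)
        cut = rng.randrange(1, SEQ_LEN)
        child = p1[:cut] + p2[cut:]       # one-point crossover
        child = [rng.randrange(1, 41) if rng.random() < MUT_RATE else b for b in child]
        nxt.append(child)
    population = nxt

best = max(population, key=fitness)
print("best fitness:", fitness(best))
```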

Technical Information

The dataset files have been compressed into a standard ZIP archive using Total Commander (version 9.50). The ZIP format ensures compatibility across various operating systems and tools.

The XLSX files were created using Microsoft Excel Standard 2019 (Version 1808, Build 10416.20027).

The Python program was developed using Visual Studio Code (Version 1.96.2, user setup), with the following environment details: Commit fabd6a6b30b49f79a7aba0f2ad9df9b399473380f, built on 2024-12-19. The Electron version is 32.6, and the runtime environment includes Chromium 128.0.6263.186, Node.js 20.18.1, and V8 12.8.374.38-electron.0. The operating system is Windows NT x64 10.0.19045.

The statistical analysis included in this dataset was partially conducted using IBM SPSS Statistics, Version 29.0.1.0

The CSV files in this dataset were created following European conventions, using a semicolon (;) as the delimiter instead of a comma, and are encoded in UTF-8 to ensure compatibility with a wide range of tools and platforms.
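
For example, the semicolon-delimited files can be read directly with pandas (the file name below is a placeholder for any of the results files in this dataset):

```python
# Read a semicolon-delimited, UTF-8 encoded CSV file from this dataset.
import pandas as pd

df = pd.read_csv("results_1.csv", sep=";", encoding="utf-8")
print(df.head())
```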
