48 datasets found
  1. Table1_Enhancing biomechanical machine learning with limited data:...

    • frontiersin.figshare.com
    pdf
    Updated Feb 14, 2024
    Cite
    Carlo Dindorf; Jonas Dully; Jürgen Konradi; Claudia Wolf; Stephan Becker; Steven Simon; Janine Huthwelker; Frederike Werthmann; Johanna Kniepert; Philipp Drees; Ulrich Betz; Michael Fröhlich (2024). Table1_Enhancing biomechanical machine learning with limited data: generating realistic synthetic posture data using generative artificial intelligence.pdf [Dataset]. http://doi.org/10.3389/fbioe.2024.1350135.s001
    Dataset provided by
    Frontiers
    Authors
    Carlo Dindorf; Jonas Dully; Jürgen Konradi; Claudia Wolf; Stephan Becker; Steven Simon; Janine Huthwelker; Frederike Werthmann; Johanna Kniepert; Philipp Drees; Ulrich Betz; Michael Fröhlich
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Objective: Biomechanical Machine Learning (ML) models, particularly deep-learning models, demonstrate the best performance when trained using extensive datasets. However, biomechanical data are frequently limited due to diverse challenges. Effective methods for augmenting data in developing ML models, specifically in the human posture domain, are scarce. Therefore, this study explored the feasibility of leveraging generative artificial intelligence (AI) to produce realistic synthetic posture data by utilizing three-dimensional posture data.

    Methods: Data were collected from 338 subjects through surface topography. A Variational Autoencoder (VAE) architecture was employed to generate and evaluate synthetic posture data, examining its distinguishability from real data by domain experts, ML classifiers, and Statistical Parametric Mapping (SPM). The benefits of incorporating augmented posture data into the learning process were exemplified by a deep autoencoder (AE) for automated feature representation.

    Results: Our findings highlight the challenge of differentiating synthetic data from real data for both experts and ML classifiers, underscoring the quality of synthetic data. This observation was also confirmed by SPM. By integrating synthetic data into AE training, the reconstruction error can be reduced compared to using only real data samples. Moreover, this study demonstrates the potential for reduced latent dimensions, while maintaining a reconstruction accuracy comparable to AEs trained exclusively on real data samples.

    Conclusion: This study emphasizes the prospects of harnessing generative AI to enhance ML tasks in the biomechanics domain.
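
    A minimal sketch of the VAE-based generation step described above, in PyTorch. This is not the authors' implementation: the feature count, layer sizes, and latent dimension are placeholders.

    import torch
    import torch.nn as nn

    class PostureVAE(nn.Module):
        """Toy VAE for fixed-length posture feature vectors."""
        def __init__(self, n_features=50, latent_dim=8):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU())
            self.to_mu = nn.Linear(64, latent_dim)
            self.to_logvar = nn.Linear(64, latent_dim)
            self.decoder = nn.Sequential(
                nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, n_features))

        def forward(self, x):
            h = self.encoder(x)
            mu, logvar = self.to_mu(h), self.to_logvar(h)
            z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
            return self.decoder(z), mu, logvar

    def vae_loss(recon, x, mu, logvar):
        # Reconstruction error plus KL divergence to the unit Gaussian prior.
        recon_err = nn.functional.mse_loss(recon, x, reduction="sum")
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return recon_err + kl

    # After training, synthetic postures are drawn by decoding latent samples:
    # z = torch.randn(100, 8); synthetic = model.decoder(z)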

  2. Synthetic Document Dataset for AI - Jpeg, PNG & PDF formats

    • datarade.ai
    Updated Sep 18, 2022
    + more versions
    Cite
    Ainnotate (2022). Synthetic Document Dataset for AI - Jpeg, PNG & PDF formats [Dataset]. https://datarade.ai/data-products/synthetic-document-dataset-for-ai-jpeg-png-pdf-formats-ainnotate
    Dataset authored and provided by
    Ainnotate
    Area covered
    Tokelau, Denmark, Tonga, Canada, Cabo Verde, Korea (Democratic People's Republic of), Germany, Syrian Arab Republic, Brazil, Ireland
    Description

    Ainnotate’s proprietary dataset generation methodology, based on large-scale generative modelling and domain randomization, provides well-balanced data with consistent sampling, accommodating rare events, enabling superior simulation and training of your models.

    Ainnotate currently provides synthetic datasets in the following domains and use cases.

    Internal Services - Visa application, Passport validation, License validation, Birth certificates
    Financial Services - Bank checks, Bank statements, Pay slips, Invoices, Tax forms, Insurance claims and Mortgage/Loan forms
    Healthcare - Medical ID cards

  3. DataSheet1_Generating synthetic multidimensional molecular time series data...

    • figshare.com
    pdf
    Updated Jul 25, 2023
    Cite
    Gary An; Chase Cockrell (2023). DataSheet1_Generating synthetic multidimensional molecular time series data for machine learning: considerations.PDF [Dataset]. http://doi.org/10.3389/fsysb.2023.1188009.s001
    Dataset provided by
    Frontiers
    Authors
    Gary An; Chase Cockrell
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The use of synthetic data is recognized as a crucial step in the development of neural network-based Artificial Intelligence (AI) systems. While the methods for generating synthetic data for AI applications in other domains have a role in certain biomedical AI systems, primarily related to image processing, there is a critical gap in the generation of time series data for AI tasks where it is necessary to know how the system works. This is most pronounced in the ability to generate synthetic multi-dimensional molecular time series data (subsequently referred to as synthetic mediator trajectories or SMTs); this is the type of data that underpins research into biomarkers and mediator signatures for forecasting various diseases and is an essential component of the drug development pipeline. We argue that the insufficiency of statistical and data-centric machine learning (ML) means of generating this type of synthetic data is due to a combination of factors: perpetual data sparsity due to the Curse of Dimensionality, the inapplicability of the Central Limit Theorem in terms of making assumptions about the statistical distributions of this type of data, and the inability to use ab initio simulations due to the state of perpetual epistemic incompleteness in cellular/molecular biology. Alternatively, we present a rationale for using complex multi-scale mechanism-based simulation models, constructed and operated on to account for perpetual epistemic incompleteness and the need to provide maximal expansiveness in concordance with the Maximal Entropy Principle. These procedures provide for the generation of SMTs that minimize the known shortcomings associated with neural network AI systems, namely overfitting and lack of generalizability. The generation of synthetic data that accounts for the identified factors of multi-dimensional time series data is an essential capability for the development of mediator-biomarker based AI forecasting systems, and for therapeutic control development and optimization.

  4. Synthetic Data Generation Market Analysis, Size, and Forecast 2025-2029:...

    • technavio.com
    pdf
    Updated May 3, 2025
    Cite
    Technavio (2025). Synthetic Data Generation Market Analysis, Size, and Forecast 2025-2029: North America (US, Canada, and Mexico), Europe (France, Germany, Italy, and UK), APAC (China, India, and Japan), and Rest of World (ROW) [Dataset]. https://www.technavio.com/report/synthetic-data-generation-market-analysis
    Dataset provided by
    TechNavio
    Authors
    Technavio
    License

    https://www.technavio.com/content/privacy-notice

    Time period covered
    2025 - 2029
    Area covered
    Canada, United States
    Description


    Synthetic Data Generation Market Size 2025-2029

    The synthetic data generation market size is forecast to increase by USD 4.39 billion, at a CAGR of 61.1% between 2024 and 2029.

    The market is experiencing significant growth, driven by the escalating demand for data privacy protection. With increasing concerns over data security and the potential risks associated with using real data, synthetic data is gaining traction as a viable alternative. Furthermore, the deployment of large language models is fueling market expansion, as these models can generate vast amounts of realistic and diverse data, reducing the reliance on real-world data sources. However, high costs associated with high-end generative models pose a challenge for market participants. These models require substantial computational resources and expertise to develop and implement effectively. Companies seeking to capitalize on market opportunities must navigate these challenges by investing in research and development to create more cost-effective solutions or partnering with specialists in the field. Overall, the market presents significant potential for innovation and growth, particularly in industries where data privacy is a priority and large language models can be effectively utilized.

    What will be the Size of the Synthetic Data Generation Market during the forecast period?

    Explore in-depth regional segment analysis with market size data - historical 2019-2023 and forecasts 2025-2029 - in the full report.
    The market continues to evolve, driven by the increasing demand for data-driven insights across various sectors. Data processing is a crucial aspect of this market, with a focus on ensuring data integrity, privacy, and security. Data privacy-preserving techniques, such as data masking and anonymization, are essential in maintaining confidentiality while enabling data sharing. Real-time data processing and data simulation are key applications of synthetic data, enabling predictive modeling and data consistency. Data management and workflow automation are integral components of synthetic data platforms, with cloud computing and model deployment facilitating scalability and flexibility. Data governance frameworks and compliance regulations play a significant role in ensuring data quality and security. Deep learning models, variational autoencoders (VAEs), and neural networks are essential tools for model training and optimization, while API integration and batch data processing streamline the data pipeline. Machine learning models and data visualization provide valuable insights, while edge computing enables data processing at the source. Data augmentation and data transformation are essential techniques for enhancing the quality and quantity of synthetic data. Data warehousing and data analytics provide a centralized platform for managing and deriving insights from large datasets. Synthetic data generation continues to unfold, with ongoing research and development in areas such as federated learning, homomorphic encryption, statistical modeling, and software development. The market's dynamic nature reflects the evolving needs of businesses and the continuous advancements in data technology.

    How is this Synthetic Data Generation Industry segmented?

    The synthetic data generation industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.

    End-user: Healthcare and life sciences, Retail and e-commerce, Transportation and logistics, IT and telecommunication, BFSI and others
    Type: Agent-based modelling, Direct modelling
    Application: AI and ML model training, Data privacy, Simulation and testing, Others
    Product: Tabular data, Text data, Image and video data, Others
    Geography: North America (US, Canada, Mexico), Europe (France, Germany, Italy, UK), APAC (China, India, Japan), Rest of World (ROW)

    By End-user Insights

    The healthcare and life sciences segment is estimated to witness significant growth during the forecast period. In the rapidly evolving data landscape, the market is gaining significant traction, particularly in the healthcare and life sciences sector. With a growing emphasis on data-driven decision-making and stringent data privacy regulations, synthetic data has emerged as a viable alternative to real data for various applications. This includes data processing, data preprocessing, data cleaning, data labeling, data augmentation, and predictive modeling, among others. Medical imaging data, such as MRI scans and X-rays, are essential for diagnosis and treatment planning. However, sharing real patient data for research purposes or training machine learning algorithms can pose significant privacy risks. Synthetic data generation addresses this challenge by producing realistic medical imaging data, ensuring data privacy while enabling research and development. Moreover

  5. CVD Risk Prediction Synthetic Dataset

    • figshare.com
    pdf
    Updated Sep 25, 2017
    Cite
    Ted Laderas; David Dorr; Nicole Vasilevsky; Shannon McWeeney; Melissa Haendel; Bjorn Pederson (2017). CVD Risk Prediction Synthetic Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.5439991.v1
    Dataset provided by
    figshare
    Authors
    Ted Laderas; David Dorr; Nicole Vasilevsky; Shannon McWeeney; Melissa Haendel; Bjorn Pederson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a synthetic dataset to teach students about using clinical and genetic covariates to predict cardiovascular risk in a realistic (but synthetic) dataset. For the workshop materials, please go here: https://github.com/laderast/cvdNight1

    Contents:
    1) dataDictionary.pdf - pdf file describing all covariates in the synthetic dataset.
    2) fullPatientData.csv - csv file with multiple covariates.
    3) genoData.csv - subset of patients in fullPatientData.csv with additional SNP calls.
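
    A small usage sketch for the files listed above (pandas). The join key "patientID" is a guess; check dataDictionary.pdf for the actual column names.

    import pandas as pd

    patients = pd.read_csv("fullPatientData.csv")
    geno = pd.read_csv("genoData.csv")

    # genoData.csv covers only a subset of patients, so a left join keeps all
    # patients and leaves the SNP columns missing where no calls exist.
    merged = patients.merge(geno, on="patientID", how="left")
    print(merged.shape)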

  6. Data from: Dataset for publication: Usefulness of synthetic datasets for...

    • data.niaid.nih.gov
    Updated Oct 31, 2024
    + more versions
    Cite
    Laviale, Martin (2024). Dataset for publication: Usefulness of synthetic datasets for diatom automatic detection using a deep-learning approach [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_14018390
    Dataset provided by
    Laviale, Martin
    Venkataramanan, Aishwarya
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains the dataset and code used to generate synthetic datasets as explained in the paper "Usefulness of synthetic datasets for diatom automatic detection using a deep-learning approach".

    Dataset: The dataset consists of two components: individual diatom images extracted from publicly available diatom atlases [1,2,3] and individual debris images.
    - Individual diatom images: currently, the repository contains 166 diatom species, totalling 9,230 images. These images were automatically extracted from the atlases using PDF scraping, then cleaned and verified by diatom taxonomists. The subfolders within each diatom species indicate the origin of the images: RA [1], IDF [2], BRG [3]. Additional diatom species and images will be added to the repository regularly.
    - Individual debris images: the debris images were extracted from real microscopy images. The repository contains 600 debris objects.

    Code: Contains the code used to generate synthetic microscopy images. For details on how to use the code, kindly refer to the README file available in synthetic_data_generator/.
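
    The generator composes individual diatom and debris cut-outs into full synthetic micrographs. A simplified illustration of that idea follows (PIL); the function and its arguments are hypothetical, and the actual pipeline lives in synthetic_data_generator/.

    import random
    from PIL import Image

    def compose(objects, canvas_size=(1024, 768), n_objects=20):
        """Paste randomly chosen RGBA cut-outs (diatoms or debris) onto a blank
        canvas and record bounding boxes for detection-style annotations."""
        canvas = Image.new("RGB", canvas_size, "white")
        boxes = []
        for _ in range(n_objects):
            crop, label = random.choice(objects)  # (RGBA image, class name)
            x = random.randint(0, canvas_size[0] - crop.width)
            y = random.randint(0, canvas_size[1] - crop.height)
            canvas.paste(crop, (x, y), crop)  # alpha channel used as paste mask
            boxes.append((label, x, y, crop.width, crop.height))
        return canvas, boxes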

  7. datasheet1_Causal Datasheet for Datasets: An Evaluation Guide for Real-World...

    • frontiersin.figshare.com
    pdf
    Updated Jun 3, 2023
    Cite
    Bradley Butcher; Vincent S. Huang; Christopher Robinson; Jeremy Reffin; Sema K. Sgaier; Grace Charles; Novi Quadrianto (2023). datasheet1_Causal Datasheet for Datasets: An Evaluation Guide for Real-World Data Analysis and Data Collection Design Using Bayesian Networks.pdf [Dataset]. http://doi.org/10.3389/frai.2021.612551.s001
    Dataset provided by
    Frontiers
    Authors
    Bradley Butcher; Vincent S. Huang; Christopher Robinson; Jeremy Reffin; Sema K. Sgaier; Grace Charles; Novi Quadrianto
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Developing data-driven solutions that address real-world problems requires understanding of these problems’ causes and how their interaction affects the outcome–often with only observational data. Causal Bayesian Networks (BN) have been proposed as a powerful method for discovering and representing the causal relationships from observational data as a Directed Acyclic Graph (DAG). BNs could be especially useful for research in global health in Lower and Middle Income Countries, where there is an increasing abundance of observational data that could be harnessed for policy making, program evaluation, and intervention design. However, BNs have not been widely adopted by global health professionals, and in real-world applications, confidence in the results of BNs generally remains inadequate. This is partially due to the inability to validate against some ground truth, as the true DAG is not available. This is especially problematic if a learned DAG conflicts with pre-existing domain doctrine. Here we conceptualize and demonstrate an idea of a “Causal Datasheet” that could approximate and document BN performance expectations for a given dataset, aiming to provide confidence and sample size requirements to practitioners. To generate results for such a Causal Datasheet, a tool was developed which can generate synthetic Bayesian networks and their associated synthetic datasets to mimic real-world datasets. The results given by well-known structure learning algorithms and a novel implementation of the OrderMCMC method using the Quotient Normalized Maximum Likelihood score were recorded. These results were used to populate the Causal Datasheet, and recommendations could be made dependent on whether expected performance met user-defined thresholds. We present our experience in the creation of Causal Datasheets to aid analysis decisions at different stages of the research process. First, one was deployed to help determine the appropriate sample size of a planned study of sexual and reproductive health in Madhya Pradesh, India. Second, a datasheet was created to estimate the performance of an existing maternal health survey we conducted in Uttar Pradesh, India. Third, we validated generated performance estimates and investigated current limitations on the well-known ALARM dataset. Our experience demonstrates the utility of the Causal Datasheet, which can help global health practitioners gain more confidence when applying BNs.
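
    The core generation step (a synthetic Bayesian network plus a synthetic dataset sampled from it) can be illustrated with pgmpy; the three-node DAG and CPD values below are invented for illustration and are not from the paper.

    from pgmpy.models import BayesianNetwork
    from pgmpy.factors.discrete import TabularCPD
    from pgmpy.sampling import BayesianModelSampling

    # Toy ground-truth DAG: intervention -> knowledge -> behaviour.
    model = BayesianNetwork([("intervention", "knowledge"),
                             ("knowledge", "behaviour")])
    model.add_cpds(
        TabularCPD("intervention", 2, [[0.5], [0.5]]),
        TabularCPD("knowledge", 2, [[0.8, 0.3], [0.2, 0.7]],
                   evidence=["intervention"], evidence_card=[2]),
        TabularCPD("behaviour", 2, [[0.9, 0.4], [0.1, 0.6]],
                   evidence=["knowledge"], evidence_card=[2]))
    model.check_model()

    # Forward-sample a synthetic dataset; a structure learner can then be
    # scored by how well it recovers the known DAG at a given sample size.
    data = BayesianModelSampling(model).forward_sample(size=1000)
    print(data.head())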

  8. Synthetic Data for an Imaginary Country, Sample, 2023 - World

    • microdata.worldbank.org
    • nada-demo.ihsn.org
    Updated Jul 7, 2023
    + more versions
    Cite
    Development Data Group, Data Analytics Unit (2023). Synthetic Data for an Imaginary Country, Sample, 2023 - World [Dataset]. https://microdata.worldbank.org/index.php/catalog/5906
    Dataset authored and provided by
    Development Data Group, Data Analytics Unit
    Time period covered
    2023
    Area covered
    World
    Description

    Abstract

    The dataset is a relational dataset of 8,000 households, representing a sample of the population of an imaginary middle-income country. The dataset contains two data files: one with variables at the household level, the other with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, asset ownership). The data only include ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. The dataset was created for the purpose of training and simulation and is not intended to be representative of any specific country.

    The full-population dataset (with about 10 million individuals) is also distributed as open data.
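
    REaLTabFormer is an open-source package, so the generation step can be sketched roughly as follows; the input file name, epoch count, and sample size here are placeholders rather than the configuration actually used.

    import pandas as pd
    from realtabformer import REaLTabFormer

    df = pd.read_csv("household_level.csv")  # hypothetical input table

    # Fit a tabular REaLTabFormer model, then sample synthetic households.
    rtf = REaLTabFormer(model_type="tabular", epochs=10)
    rtf.fit(df)
    synthetic = rtf.sample(n_samples=8000)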

    Geographic coverage

    The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.

    Analysis unit

    Household, Individual

    Universe

    The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.

    Kind of data

    ssd

    Sampling procedure

    The sample size was set to 8,000 households. The fixed number of households to be selected from each enumeration area was set to 25. In the first stage, the number of enumeration areas to be selected in each stratum was calculated, proportional to the size of each stratum (stratification by geo_1 and urban/rural). Then 25 households were randomly selected within each enumeration area. The R script used to draw the sample is provided as an external resource.
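
    The distributed sampling script is written in R; the following Python sketch mirrors the same two-stage design, with assumed column names (geo_1, urban_rural, ea_id, n_households) based on the description above.

    import pandas as pd

    HH_PER_EA = 25

    def draw_sample(ea_frame, hh_frame, n_households=8000):
        """Stage 1: allocate EAs to strata proportionally to stratum size.
        Stage 2: simple random sample of 25 households per selected EA."""
        n_eas = n_households // HH_PER_EA
        sizes = ea_frame.groupby(["geo_1", "urban_rural"])["n_households"].sum()
        alloc = (sizes / sizes.sum() * n_eas).round().astype(int)
        chosen = []
        for stratum, k in alloc.items():
            pool = ea_frame.set_index(["geo_1", "urban_rural"]).loc[[stratum]]
            chosen.append(pool.sample(n=k))
        selected_eas = pd.concat(chosen)["ea_id"]
        in_sample = hh_frame[hh_frame["ea_id"].isin(selected_eas)]
        return in_sample.groupby("ea_id", group_keys=False).sample(n=HH_PER_EA)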

    Mode of data collection

    other

    Research instrument

    The dataset is a synthetic dataset. Although it contains variables typically collected in sample surveys or population censuses, no questionnaire is available for this dataset. A "fake" questionnaire was, however, created for the sample dataset extracted from this dataset, to be used as training material.

    Cleaning operations

    The synthetic data generation process included a set of "validators" (consistency checks, based on which synthetic observations were assessed and rejected/replaced when needed). Also, some post-processing was applied to the data to produce the distributed data files.

    Response rate

    This is a synthetic dataset; the "response rate" is 100%.

  9. OpenResume: Advancing Career Trajectory Modeling with Anonymized and...

    • zenodo.org
    Updated Feb 24, 2025
    Cite
    Michiharu Yamashita; Thanh Tran; Dongwon Lee (2025). OpenResume: Advancing Career Trajectory Modeling with Anonymized and Synthetic Resume Datasets [Dataset]. http://doi.org/10.1109/bigdata62323.2024.10825519
    Dataset provided by
    Institute of Electrical and Electronics Engineers (http://www.ieee.ro/)
    Authors
    Michiharu Yamashita; Thanh Tran; Dongwon Lee
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    The OpenResume dataset is designed for researchers and practitioners in career trajectory modeling and job-domain machine learning, as described in the IEEE BigData 2024 paper. It includes both anonymized realistic resumes and synthetically generated resumes, offering a comprehensive resource for developing and benchmarking predictive models across a variety of career-related tasks. By employing anonymization and differential privacy techniques, OpenResume ensures that research can be conducted while maintaining privacy. The dataset is available in this repository. Please see the paper for more details: 10.1109/BigData62323.2024.10825519

    If you find this paper useful in your research or use this dataset in any publications, projects, tools, or other forms, please cite:

    @inproceedings{yamashita2024openresume,
      title={{OpenResume: Advancing Career Trajectory Modeling with Anonymized and Synthetic Resume Datasets}},
      author={Yamashita, Michiharu and Tran, Thanh and Lee, Dongwon},
      booktitle={2024 IEEE International Conference on Big Data (BigData)},
      year={2024},
      organization={IEEE}
    }

    @inproceedings{yamashita2023james,
      title={{JAMES: Normalizing Job Titles with Multi-Aspect Graph Embeddings and Reasoning}},
      author={Yamashita, Michiharu and Shen, Jia Tracy and Tran, Thanh and Ekhtiari, Hamoon and Lee, Dongwon},
      booktitle={2023 IEEE International Conference on Data Science and Advanced Analytics (DSAA)},
      year={2023},
      organization={IEEE}
    }

    Data Contents and Organization

    The dataset consists of two primary components:

    • Realistic Data: An anonymized dataset utilizing differential privacy techniques.
    • Synthetic Data: A synthetic dataset generated from real-world job transition graphs.

    The dataset includes the following features:

    • Anonymized User Identifiers: Unique IDs for anonymized users.
    • Anonymized Company Identifiers: Unique IDs for anonymized companies.
    • Normalized Job Titles: Job titles standardized into the ESCO taxonomy.
    • Job Durations: Start and end dates, either anonymized or synthetically generated with differential privacy.

    Detailed information on how the OpenResume dataset is constructed can be found in our paper.

    Dataset Extension

    Job titles in the OpenResume dataset are normalized into the ESCO occupation taxonomy. You can easily integrate the OpenResume dataset with ESCO job and skill databases to perform additional downstream tasks.

    • Applicable Tasks:
      • Next Job Title Prediction (Career Path Prediction)
      • Next Company Prediction (Career Path Prediction)
      • Turnover Prediction
      • Link Prediction
      • Required Skill Prediction (with ESCO dataset integration)
      • Existing Skill Prediction (with ESCO dataset integration)
      • Job Description Classification (with ESCO dataset integration)
      • Job Title Classification (with ESCO dataset integration)
      • Text Feature-Based Model Development (with ESCO dataset integration)
      • LLM Development for Resume-Related Tasks (with ESCO dataset integration)
      • And more!

    Intended Uses

    The primary objective of OpenResume is to provide an open resource for:

    1. Evaluating and comparing newly developed career models in a standardized manner.
    2. Fostering AI advancements in career trajectory modeling and job market analytics.

    With its manageable size, the dataset allows for quick validation of model performance, accelerating innovation in the field. It is particularly useful for researchers who face barriers in accessing proprietary datasets.

    While OpenResume is an excellent tool for research and model development, it is not intended for commercial, real-world applications. Companies and job platforms are expected to rely on proprietary data for their operational systems. By excluding sensitive attributes such as race and gender, OpenResume minimizes the risk of bias propagation during model training.

    Our goal is to support transparent, open research by providing this dataset. We encourage responsible use to ensure fairness and integrity in research, particularly in the context of ethical AI practices.

    Ethical and Responsible Use

    The OpenResume dataset was developed with a strong emphasis on privacy and ethical considerations. Personal identifiers and company names have been anonymized, and differential privacy techniques have been applied to protect individual privacy. We expect all users to adhere to ethical research practices and respect the privacy of data subjects.

    Related Work

    JAMES: Normalizing Job Titles with Multi-Aspect Graph Embeddings and Reasoning
    Michiharu Yamashita, Jia Tracy Shen, Thanh Tran, Hamoon Ekhtiari, and Dongwon Lee
    IEEE Int'l Conf. on Data Science and Advanced Analytics (DSAA), 2023

    Fake Resume Attacks: Data Poisoning on Online Job Platforms
    Michiharu Yamashita, Thanh Tran, and Dongwon Lee
    The ACM Web Conference 2024 (WWW), 2024

  10. NADA-SynShapes: A synthetic shape benchmark for testing probabilistic deep...

    • zenodo.org
    text/x-python, zip
    Updated Apr 16, 2025
    Cite
    Giulio Del Corso; Volpini Federico; Claudia Caudai; Davide Moroni; Sara Colantonio (2025). NADA-SynShapes: A synthetic shape benchmark for testing probabilistic deep learning models [Dataset]. http://doi.org/10.5281/zenodo.15194187
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Giulio Del Corso; Volpini Federico; Claudia Caudai; Davide Moroni; Sara Colantonio
    License

    Attribution-NonCommercial-NoDerivs 2.5 (CC BY-NC-ND 2.5): https://creativecommons.org/licenses/by-nc-nd/2.5/
    License information was derived automatically

    Time period covered
    Dec 18, 2024
    Description

    NADA (Not-A-Database) is an easy-to-use geometric shape data generator that allows users to define non-uniform multivariate parameter distributions to test novel methodologies. The full open-source package is provided at GIT:NA_DAtabase. See Technical Report for details on how to use the provided package.

    This database includes 3 repositories:

    • NADA_Dis: Is the model able to correctly characterize/Disentangle a complex latent space?
      The repository contains 3x100,000 synthetic black and white images to test the ability of the models to correctly define a proper latent space (e.g., autoencoders) and disentangle it. The first 100,000 images contain 4 shapes and uniform parameter space distributions, while the other images have a more complex underlying distribution (truncated Gaussian and correlated marginal variables).

    • NADA_OOD: Does the model identify Out-Of-Distribution images?
      The repository contains 100,000 training images (4 different shapes with 3 possible colors located in the upper left corner of the canvas) and 6x100,000 increasingly different sets of images (changing the color class balance, reducing the radius of the shape, moving the shape to the lower left corner) providing increasingly challenging out-of-distribution images.
      This can help to test not only the capability of a model, but also methods that produce reliability estimates and should correctly classify OOD elements as "unreliable" as they are far from the original distributions.

    • NADA_AlEp: Does the model distinguish between different types (Aleatoric/Epistemic) of uncertainties?
      The repository contains 5x100,000 images with different types of noise/uncertainty:
      • NADA_AlEp_0_Clean: Dataset free of noise, to use as a possible training set.
      • NADA_AlEp_1_White_Noise: Epistemic white-noise dataset. Each image is perturbed with an amount of white noise randomly sampled from 0% to 90% (see the sketch after this list).
      • NADA_AlEp_2_Deformation: Dataset with epistemic deformation noise. Each image is deformed by a random amount uniformly sampled between 0% and 90%; 0% corresponds to the original image, while 100% is a full deformation to the circumscribing circle.
      • NADA_AlEp_3_Label: Dataset with label noise. Formally, 20% of triangles of a given color are misclassified as a square with a random color (among blue, orange, and brown) and vice versa (squares to triangles). Label noise introduces aleatoric uncertainty because it is inherent in the data and cannot be reduced.
      • NADA_AlEp_4_Combined: Combined dataset with all previous sources of uncertainty.
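
    As referenced in the white-noise item above, a rough numpy illustration of that perturbation; the exact noise model used by NADA may differ (see the Technical Report), and an 8-bit image array is assumed.

    import numpy as np

    def add_white_noise(img, max_fraction=0.9, rng=None):
        """Replace a randomly chosen fraction (0%..90%) of pixels with uniform
        random values, approximating the NADA_AlEp_1 corruption (uint8 input)."""
        rng = rng or np.random.default_rng()
        fraction = rng.uniform(0.0, max_fraction)
        mask = rng.random(img.shape) < fraction
        noisy = img.copy()
        noisy[mask] = rng.integers(0, 256, size=int(mask.sum()), dtype=img.dtype)
        return noisy, fraction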

    Each image can be used for classification (shape/color) or regression (radius/area) tasks.

    All datasets can be modified and adapted to the user's research question using the included open source data generator.

  11. DataSheet4_Forecasting SARS-CoV-2 spike protein evolution from small data by...

    • datasetcatalog.nlm.nih.gov
    Updated Apr 9, 2024
    + more versions
    Cite
    Hallam, Steven J.; Ng, Sarah W. S.; King, Samuel; Hahn, Samuel V.; Salman, Paarsa; J. Ma, Eric; Kagieva, Madina; Hong, Ryan J.; Qi, Ruo Chen; Schwab, Janella C.; Sekhon, Parneet; Reilly, Taylor; Chen, Xinyi E.; Roberts, Tylo; Rostin, Kimia (2024). DataSheet4_Forecasting SARS-CoV-2 spike protein evolution from small data by deep learning and regression.pdf [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001320027
    Authors
    Hallam, Steven J.; Ng, Sarah W. S.; King, Samuel; Hahn, Samuel V.; Salman, Paarsa; J. Ma, Eric; Kagieva, Madina; Hong, Ryan J.; Qi, Ruo Chen; Schwab, Janella C.; Sekhon, Parneet; Reilly, Taylor; Chen, Xinyi E.; Roberts, Tylo; Rostin, Kimia
    Description

    The emergence of SARS-CoV-2 variants during the COVID-19 pandemic caused frequent global outbreaks that confounded public health efforts across many jurisdictions, highlighting the need for better understanding and prediction of viral evolution. Predictive models have been shown to support disease prevention efforts, such as with the seasonal influenza vaccine, but they require abundant data. For emerging viruses of concern, such models should ideally function with relatively sparse data typically encountered at the early stages of a viral outbreak. Conventional discrete approaches have proven difficult to develop due to the spurious and reversible nature of amino acid mutations and the overwhelming number of possible protein sequences adding computational complexity. We hypothesized that these challenges could be addressed by encoding discrete protein sequences into continuous numbers, effectively reducing the data size while enhancing the resolution of evolutionarily relevant differences. To this end, we developed a viral protein evolution prediction model (VPRE), which reduces amino acid sequences into continuous numbers by using an artificial neural network called a variational autoencoder (VAE) and models their most statistically likely evolutionary trajectories over time using Gaussian process (GP) regression. To demonstrate VPRE, we used a small number of early SARS-CoV-2 spike protein sequences. We show that the VAE can be trained on a synthetic dataset based on this data. To recapitulate evolution along a phylogenetic path, we used only 104 spike protein sequences and trained the GP regression with the numerical variables to project evolution up to 5 months into the future. Our predictions contained novel variants and the most frequent prediction mapped primarily to a sequence that differed by only a single amino acid from the most reported spike protein within the prediction timeframe. Novel variants in the spike receptor binding domain (RBD) were capable of binding human angiotensin-converting enzyme 2 (ACE2) in silico, with comparable or better binding than previously resolved RBD-ACE2 complexes. Together, these results indicate the utility and tractability of combining deep learning and regression to model viral protein evolution with relatively sparse datasets, toward developing more effective medical interventions.
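
    The second stage of VPRE (GP regression over VAE latent coordinates) can be sketched with scikit-learn. The latent codes, kernel, and time horizon below are placeholders, not the published configuration; the VAE encoder/decoder are assumed to exist separately.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel

    t = np.linspace(0, 120, 104).reshape(-1, 1)   # collection day of each sequence
    z = np.random.randn(104, 8)                   # stand-in VAE latent codes

    kernel = RBF(length_scale=30.0) + WhiteKernel(noise_level=0.1)
    future = np.linspace(120, 270, 50).reshape(-1, 1)  # roughly 5 months ahead
    z_future = np.column_stack([
        GaussianProcessRegressor(kernel=kernel, normalize_y=True)
        .fit(t, z[:, d]).predict(future)
        for d in range(z.shape[1])])
    # z_future would then be decoded back into spike sequences by the VAE.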

  12. Data Sheet 1_End-to-end 3D instance segmentation of synthetic data and...

    • frontiersin.figshare.com
    pdf
    Updated Jan 29, 2025
    Cite
    Gabriel David; Emmanuel Faure (2025). Data Sheet 1_End-to-end 3D instance segmentation of synthetic data and embryo microscopy images with a 3D Mask R-CNN.pdf [Dataset]. http://doi.org/10.3389/fbinf.2024.1497539.s001
    Dataset provided by
    Frontiers
    Authors
    Gabriel David; Emmanuel Faure
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In recent years, the exploitation of three-dimensional (3D) data in deep learning has gained momentum despite its inherent challenges. The necessity of 3D approaches arises from the limitations of two-dimensional (2D) techniques when applied to 3D data due to the lack of global context. A critical task in medical and microscopy 3D image analysis is instance segmentation, which is inherently complex due to the need for accurately identifying and segmenting multiple object instances in an image. Here, we introduce a 3D adaptation of the Mask R-CNN, a powerful end-to-end network designed for instance segmentation. Our implementation adapts a widely used 2D TensorFlow Mask R-CNN by developing custom TensorFlow operations for 3D Non-Max Suppression and 3D Crop And Resize, facilitating efficient training and inference on 3D data. We validate our 3D Mask R-CNN in two experiments. The first uses a controlled environment of synthetic data with instances exhibiting a wide range of anisotropy and noise. Our model achieves good results while illustrating the limits of the 3D Mask R-CNN on the noisiest objects. Second, applying it to real-world data involving cell instance segmentation during the morphogenesis of the ascidian embryo Phallusia mammillata, we show that our 3D Mask R-CNN outperforms the state-of-the-art method, achieving high recall and precision scores. The model preserves cell connectivity, which is crucial for applications in quantitative studies. Our implementation is open source, ensuring reproducibility and facilitating further research in 3D deep learning.
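
    One of the custom operations mentioned above, 3D Non-Max Suppression, reduces to the following reference logic for axis-aligned boxes (numpy sketch; the paper implements this as an optimized TensorFlow op).

    import numpy as np

    def iou_3d(a, b):
        """IoU of two axis-aligned 3D boxes given as (z1, y1, x1, z2, y2, x2)."""
        lo, hi = np.maximum(a[:3], b[:3]), np.minimum(a[3:], b[3:])
        inter = np.prod(np.clip(hi - lo, 0, None))
        vol = lambda box: np.prod(box[3:] - box[:3])
        return inter / (vol(a) + vol(b) - inter)

    def nms_3d(boxes, scores, iou_threshold=0.5):
        """Greedily keep the highest-scoring boxes, dropping overlapping ones."""
        order = np.argsort(scores)[::-1]
        keep = []
        while order.size:
            i = order[0]
            keep.append(i)
            rest = order[1:]
            order = rest[[iou_3d(boxes[i], boxes[j]) <= iou_threshold
                          for j in rest]]
        return keep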

  13. DataSheet1_Training Deep Neural Networks to Reconstruct Nanoporous...

    • frontiersin.figshare.com
    pdf
    Updated Jun 16, 2023
    Cite
    Trushal Sardhara; Roland C. Aydin; Yong Li; Nicolas Piché; Raynald Gauvin; Christian J. Cyron; Martin Ritter (2023). DataSheet1_Training Deep Neural Networks to Reconstruct Nanoporous Structures From FIB Tomography Images Using Synthetic Training Data.pdf [Dataset]. http://doi.org/10.3389/fmats.2022.837006.s001
    Dataset provided by
    Frontiers
    Authors
    Trushal Sardhara; Roland C. Aydin; Yong Li; Nicolas Piché; Raynald Gauvin; Christian J. Cyron; Martin Ritter
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Focused ion beam (FIB) tomography is a destructive technique used to collect three-dimensional (3D) structural information at a resolution of a few nanometers. For FIB tomography, a material sample is degraded by layer-wise milling. After each layer, the current surface is imaged by a scanning electron microscope (SEM), providing a consecutive series of cross-sections of the three-dimensional material sample. Especially for nanoporous materials, the reconstruction of the 3D microstructure of the material, from the information collected during FIB tomography, is impaired by the so-called shine-through effect. This effect prevents a unique mapping between voxel intensity values and material phase (e.g., solid or void). It often substantially reduces the accuracy of conventional methods for image segmentation. Here we demonstrate how machine learning can be used to tackle this problem. A bottleneck in doing so is the availability of sufficient training data. To overcome this problem, we present a novel approach to generate synthetic training data in the form of FIB-SEM images generated by Monte Carlo simulations. Based on this approach, we compare the performance of different machine learning architectures for segmenting FIB tomography data of nanoporous materials. We demonstrate that two-dimensional (2D) convolutional neural network (CNN) architectures processing a group of adjacent slices as input data as well as 3D CNN perform best and can enhance the segmentation performance significantly.
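
    A sketch of the input construction behind the best-performing 2D CNN variant, which processes a group of adjacent slices as channels (numpy; the half-width and the clamped edge handling are assumptions).

    import numpy as np

    def slabs_from_stack(volume, half_width=2):
        """Turn a (depth, H, W) FIB-SEM stack into one training input per slice,
        of shape (2*half_width + 1, H, W): the slice plus its neighbours as
        channels. Edge slices are handled by clamping the slice index."""
        depth = volume.shape[0]
        slabs = []
        for z in range(depth):
            neighbours = np.clip(np.arange(z - half_width, z + half_width + 1),
                                 0, depth - 1)
            slabs.append(volume[neighbours])
        return np.stack(slabs)  # (depth, 2*half_width + 1, H, W)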

  14. DataSheet1_Training machine learning models with synthetic data improves the...

    • frontiersin.figshare.com
    pdf
    Updated Jun 13, 2023
    Cite
    Ruben Doste; Miguel Lozano; Guillermo Jimenez-Perez; Lluis Mont; Antonio Berruezo; Diego Penela; Oscar Camara; Rafael Sebastian (2023). DataSheet1_Training machine learning models with synthetic data improves the prediction of ventricular origin in outflow tract ventricular arrhythmias.PDF [Dataset]. http://doi.org/10.3389/fphys.2022.909372.s001
    Dataset provided by
    Frontiers
    Authors
    Ruben Doste; Miguel Lozano; Guillermo Jimenez-Perez; Lluis Mont; Antonio Berruezo; Diego Penela; Oscar Camara; Rafael Sebastian
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In order to determine the site of origin (SOO) in outflow tract ventricular arrhythmias (OTVAs) before an ablation procedure, several algorithms based on manual identification of electrocardiogram (ECG) features have been developed. However, the reported accuracy decreases when tested with different datasets. Machine learning algorithms can automatize the process and improve generalization, but their performance is hampered by the lack of large enough OTVA databases. We propose the use of detailed electrophysiological simulations of OTVAs to train a machine learning classification model to predict the ventricular origin of the SOO of ectopic beats. We generated a synthetic database of 12-lead ECGs (2,496 signals) by running multiple simulations from the most typical OTVA SOO in 16 patient-specific geometries. Two types of input data were considered in the classification: raw and feature ECG signals. From the simulated raw 12-lead ECG, we analyzed the contribution of each lead to the predictions, keeping the best ones for the training process. For feature-based analysis, we used entropy-based methods to rank the obtained features. A cross-validation process was included to evaluate the machine learning model. Subsequently, two clinical OTVA databases from different hospitals, including ECGs from 365 patients, were used as test sets to assess the generalization of the proposed approach. The results show that V2 was the best lead for classification. Prediction of the SOO in OTVA, using either raw signals or features for classification, presented high accuracy values (>0.96). Generalization of the network trained on simulated data was good for both patient datasets (accuracy of 0.86 and 0.84, respectively) and presented better values than using exclusively real ECGs for classification (accuracy of 0.84 and 0.76 for each dataset). The use of simulated ECG data for training machine learning-based classification algorithms is critical to obtaining good SOO predictions in OTVA compared to real data alone. The fast implementation and generalization of the proposed methodology may contribute towards its application in clinical routine.
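
    A schematic of the feature-based branch described above (scikit-learn). Mutual information stands in for the paper's entropy-based ranking, and the classifier, feature count, and random data are placeholders.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import mutual_info_classif
    from sklearn.model_selection import cross_val_score

    X = np.random.rand(2496, 40)        # stand-in ECG-derived features
    y = np.random.randint(0, 2, 2496)   # stand-in SOO labels (LV=0, RV=1)

    mi = mutual_info_classif(X, y)
    top = np.argsort(mi)[::-1][:10]     # keep the 10 most informative features
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    print(cross_val_score(clf, X[:, top], y, cv=5, scoring="accuracy").mean())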

  15. Spacecraft Thruster Firing Test Dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 16, 2024
    Cite
    Patrick Fleith (2024). Spacecraft Thruster Firing Test Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7137929
    Dataset authored and provided by
    Patrick Fleith
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    WARNING

    This version of the dataset is not recommended for the anomaly detection use case. We discovered discrepancies in the anomalous sequences. A new version will be released. In the meantime, please ignore all sequences marked as anomalous.

    CONTEXT

    Testing hardware to qualify it for spaceflight is critical to model and verify performance. Hot fire tests (also known as life-tests) are typically run during the qualification campaigns of satellite thrusters, but results remain proprietary data, making it difficult for the machine learning community to develop suitable data-driven predictive models. This synthetic dataset was generated partially based on the real-world physics of monopropellant chemical thrusters, to foster the development and benchmarking of new data-driven analytical methods (machine learning, deep learning, etc.).

    The PDF document "STFT Dataset Description" describes in detail the structure, context, use cases, and domain knowledge about thrusters, in order for ML practitioners to use the dataset.

    PROPOSED TASKS

    Supervised:

    Performance Modelling: Prediction of thruster performance (the target can be thrust, mass flow rate, and/or the average specific impulse)

    Acceptance Test for Individualised Performance Model Refinement: Taking the acceptance test of an individual thruster into account can help generate an individualised predictive model for that thruster

    Uncertainty Quantification for Thruster-to-Thruster Reproducibility Verification: evaluating the prediction variability between several thrusters in order to construct uncertainty bounds (predictive intervals) around the predicted thrust and mass flow rate of future thrusters that may be used during an actual space mission

    Unsupervised / Anomaly Detection

    Anomaly Detection: Anomalies can be detected in an unsupervised setting (outlier detection) or in a semi-supervised setting (novelty detection). The dataset includes a total of 270 anomalies. A simple approach is to predict whether a firing test sequence is anomalous or nominal. A more advanced approach is to predict which portion of a time series is anomalous. The dataset also provides detailed information about each time point being anomalous or nominal. In the case of an anomaly, a code is provided which allows diagnosing the detection system's performance on the different types of anomalies contained in the dataset.
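
    For the unsupervised setting, a minimal sequence-level baseline might look like this (scikit-learn; the summary features and contamination rate are illustrative, not tuned to this dataset).

    import numpy as np
    from sklearn.ensemble import IsolationForest

    def summarise(seq):
        # Crude per-sequence features; real work would use richer descriptors.
        return [seq.mean(), seq.std(), seq.min(), seq.max()]

    sequences = [np.random.randn(1000) for _ in range(300)]  # stand-in data
    X = np.array([summarise(s) for s in sequences])

    detector = IsolationForest(contamination=0.05, random_state=0).fit(X)
    labels = detector.predict(X)  # -1 = anomalous sequence, 1 = nominal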

  16. AI And Machine Learning In Business Market Analysis, Size, and Forecast...

    • technavio.com
    pdf
    Updated Aug 6, 2025
    Cite
    Technavio (2025). AI And Machine Learning In Business Market Analysis, Size, and Forecast 2025-2029: North America (US and Canada), Europe (France, Germany, and UK), APAC (Australia, China, India, Japan, and South Korea), and Rest of World (ROW) [Dataset]. https://www.technavio.com/report/ai-and-machine-learning-in-business-market-industry-analysis
    Dataset provided by
    TechNavio
    Authors
    Technavio
    License

    https://www.technavio.com/content/privacy-notice

    Time period covered
    2025 - 2029
    Area covered
    United Kingdom, Canada, United States
    Description


    AI And Machine Learning In Business Market Size 2025-2029

    The AI and machine learning in business market size is forecast to increase by USD 240.3 billion, at a CAGR of 24.9% between 2024 and 2029.

    The market is experiencing significant momentum, driven by the unprecedented advancements in AI technology and the proliferation of generative AI copilots and embedded AI in enterprise platforms. These developments are revolutionizing business processes, enabling automation, and enhancing productivity. However, the market faces a notable challenge: the scarcity of specialized talent required to effectively implement and manage these advanced technologies. As AI continues to evolve and become increasingly integral to business operations, there is an imperative for workforce transformation, necessitating a focus on upskilling and reskilling initiatives.
    Companies seeking to capitalize on market opportunities and navigate challenges effectively must prioritize talent development and collaboration with AI experts. The strategic landscape of this dynamic market presents both opportunities and obstacles, requiring agile and forward-thinking approaches. Additionally, edge computing solutions, data governance policies, and knowledge graph creation are essential for ensuring maintainability and regulatory compliance.
    

    What will be the Size of the AI And Machine Learning In Business Market during the forecast period?


    The artificial intelligence (AI) and machine learning (ML) market continues to evolve, with new applications and advancements emerging across various sectors. Businesses are increasingly leveraging AI-powered technologies to optimize their supply chains, enhancing efficiency and reducing costs. For instance, a leading retailer reported a 15% increase in on-time deliveries by implementing AI-driven supply chain optimization. Natural language processing (NLP) and generative adversarial networks (GANs) are transforming customer relationship management (CRM) and business process optimization. NLP tools enable companies to analyze customer interactions, improving customer service and personalizing marketing efforts. GANs, on the other hand, facilitate the creation of realistic synthetic data, enhancing the accuracy of ML models.
    Fraud detection systems and computer vision systems are revolutionizing risk management and data privacy regulations. Predictive maintenance, unsupervised learning methods, and time series forecasting help businesses maintain their infrastructure, while deep learning models and AI ethics considerations ensure data privacy and security. Moreover, AI-powered automation, predictive modeling techniques, and speech recognition software are streamlining operations and improving decision-making processes. Reinforcement learning applications, data mining processes, image recognition technology, and sentiment analysis tools further expand the potential of AI in business. According to recent industry reports, the global AI market is expected to grow by over 20% annually, underscoring its transformative potential.
    This continuous unfolding of market activities and evolving patterns underscores the importance of staying informed and adaptable for businesses looking to harness the power of AI and ML. A single example of the impact of AI in business: A manufacturing company reduced its maintenance costs by 12% by implementing predictive maintenance using machine learning algorithms and process mining techniques. This proactive approach to maintenance allowed the company to address potential issues before they escalated, saving time and resources.
    

    How is this AI And Machine Learning In Business Industry segmented?

    The AI and machine learning in business industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.

    Component: Solutions, Services
    Sector: Large enterprises, SMEs
    Application: Data analytics, Predictive analytics, Cyber security, Supply chain and inventory management, Others
    End-user: IT and telecom, BFSI, Retail and manufacturing, Healthcare, Others
    Geography: North America (US, Canada), Europe (France, Germany, UK), APAC (Australia, China, India, Japan, South Korea), Rest of World (ROW)

    By Component Insights

    The Solutions segment is estimated to witness significant growth during the forecast period. The AI and machine learning market in business continues to evolve, with significant advancements in various applications. Generative adversarial networks (GANs) are revolutionizing supply chain optimization, enabling more accurate forecasting and demand planning. In the realm of busine

  17. Data from: Smart metering and energy access programs: an approach to energy...

    • esango.cput.ac.za
    Updated May 31, 2023
    Cite
    Bennour Bacar (2023). Smart metering and energy access programs: an approach to energy poverty reduction in sub-Saharan Africa [Dataset]. http://doi.org/10.25381/cput.22264042.v1
    Dataset provided by
    Cape Peninsula University of Technology
    Authors
    Bennour Bacar
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Area covered
    Africa, Sub-Saharan Africa
    Description

    Ethical clearance reference number: refer to the uploaded document Ethics Certificate.pdf.

    General (0)

    0 - Built diagrams and figures.pdf: diagrams and figures used for the thesis

    Analysis of country data (1)

    0 - Country selection.xlsx: In this analysis the sub-Saharan country (Niger) is selected based on the kWh per capita data obtained from sources such as the United Nations and the World Bank. Other data used from these sources includes household size and electricity access. Some household data was projected using linear regression. Sample sizes VS error margins were also analyzed for the selection of a smaller area within the country.

    Smart metering experiment (2)

    The figures (PNG, JPG, PDF) include:

        - The experiment components and assembly
        - The use of device (meter and modem) software tools to program and analyse data
        - Phasor and meter detail
        - Extracted reports and graphs from the MDMS
    

    The datasets (CSV, XLSX) include:

        - Energy load profile and register data recorded by the smart meter and collected by both meter configuration and MDM applications.
        - Data collected also includes events, alarm and QoS data.
    

    Data applicability to SEAP (3)

    3 - Energy data and SEAP.pdf: as part of the Smart Metering VS SEAP framework analysis, a comparison between SEAP's data requirements, the applicable energy data for those requirements, the benefits, and the calculation of indicators where applicable.
    3 - SEAP indicators.xlsx: as part of the Smart Metering VS SEAP framework analysis, the applicable calculation of indicators for SEAP's data requirements.

    Load prediction by machine learning (4)

    The coding (IPYNB, PY, HTML, ZIP) shows the preparation and exploration of the energy data to train the machine learning model. The datasets (CSV, XLSX), sequentially named, are part of the process of extracting, transforming and loading the data into a machine learning algorithm, identifying the best regression model based on metrics, and predicting the data.
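
    A condensed sketch of that model-selection step (scikit-learn; the file and column names are hypothetical stand-ins for the prepared load-profile data).

    import pandas as pd
    from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    data = pd.read_csv("load_profile_prepared.csv")
    X, y = data.drop(columns=["load_kwh"]), data["load_kwh"]

    for name, model in [("linear", LinearRegression()),
                        ("random_forest", RandomForestRegressor(random_state=0)),
                        ("gbdt", GradientBoostingRegressor(random_state=0))]:
        rmse = -cross_val_score(model, X, y, cv=5,
                                scoring="neg_root_mean_squared_error").mean()
        print(f"{name}: RMSE = {rmse:.3f}")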

    HRES analysis and optimization (5)

    The figures (PNG, JPG, PDF) include:

        - Household load, based on the energy data from the smart metering experiment and the machine learning exercise
        - Pre-defined/synthetic load, provided by the software when no external data (household load) is available
        - The designed HRES
        - Application-generated reports with the results of the analysis, for both the best-case HRES and the fully renewable scenarios.
    

    The datasets (XLSX) include the 12-month input load for the simulation, and the input/output analysis and calculations.

    5 - Gorou_Niger_20220529_v3.homer: Homer Pro software file containing the simulated HRES.

    Conferences (6)

    6 - IEEE_MISTA_2022_paper_51.pdf: paper (research in progress) presented at the IEEE MISTA 2022 conference, held in March 2022, and published in the respective proceedings, 6 - IEEE_MISTA_2022_proceeding.pdf.

    6 - ITAS_2023.pdf: paper (final research) presented at the ITAS 2023 conference in Doha, Qatar, in March 2023.

    6 - Smart Energy Seminar 2023.pptx: PowerPoint slide version of the paper, presented at the Smart Energy Seminar held at CPUT in March 2023.

  18. f

    Data_Sheet_1_EVMP: enhancing machine learning models for synthetic promoter strength prediction by Extended Vision Mutant Priority framework

    • frontiersin.figshare.com
    pdf
    Updated Jul 5, 2023
    Cite
    Weiqin Yang; Dexin Li; Ranran Huang (2023). Data_Sheet_1_EVMP: enhancing machine learning models for synthetic promoter strength prediction by Extended Vision Mutant Priority framework.PDF [Dataset]. http://doi.org/10.3389/fmicb.2023.1215609.s001
    Explore at:
    Available download formats: PDF
    Dataset updated
    Jul 5, 2023
    Dataset provided by
    Frontiers
    Authors
    Weiqin Yang; Dexin Li; Ranran Huang
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction: In metabolic engineering and synthetic biology applications, promoters with appropriate strengths are critical. However, annotating promoter strength experimentally is time-consuming and laborious. Constructing mutation-based synthetic promoter libraries that span multiple orders of magnitude of promoter strength is therefore receiving increasing attention. A number of machine learning (ML) methods have been applied to synthetic promoter strength prediction, but existing models are limited by the excessive proximity between synthetic promoters.

    Methods: To enhance ML models to better predict synthetic promoter strength, we propose EVMP (Extended Vision Mutant Priority), a universal framework which utilizes mutation information more effectively. In EVMP, synthetic promoters are equivalently transformed into a base promoter and the corresponding k-mer mutations, which are input into the BaseEncoder and the VarEncoder, respectively. EVMP also provides optional data augmentation, which generates multiple copies of the data by selecting different base promoters for the same synthetic promoter.

    Results: On the Trc synthetic promoter library, EVMP was applied to multiple ML models and enhanced them to varying extents, up to 61.30% (MAE), while the SOTA (state-of-the-art) record was improved by 15.25% (MAE) and 4.03% (R2). Data augmentation based on multiple base promoters further improved model performance by 17.95% (MAE) and 7.25% (R2) compared with the non-EVMP SOTA record.

    Discussion: In further study, extended vision (or k-mer) is shown to be essential for EVMP. We also found that EVMP can alleviate the over-smoothing phenomenon, which may contribute to its effectiveness. Our work suggests that EVMP can highlight the mutation information of synthetic promoters and significantly improve the prediction accuracy of strength. The source code is publicly available on GitHub: https://github.com/Tiny-Snow/EVMP.
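
    The authors' implementation is available at the GitHub link above; purely as an illustrative reconstruction of the core transformation (not the authors' code), representing a synthetic promoter as its base promoter plus the k-mer windows around each mutation might look like this:

        def kmer_mutations(base: str, variant: str, k: int = 5):
            """Decompose an aligned variant into (position, extended-vision k-mer)
            pairs; the base sequence would go to the BaseEncoder and the k-mers
            to the VarEncoder."""
            assert len(base) == len(variant), "sequences must be aligned"
            half = k // 2
            return [
                (i, variant[max(0, i - half): i + half + 1])
                for i, (b, v) in enumerate(zip(base, variant))
                if b != v
            ]

        base    = "TTGACAATTAATCATCCGGCTCGTATAATG"  # hypothetical base promoter
        variant = "TTGACAATTAATCATGCGGCTCGTTTAATG"  # two point mutations
        print(kmer_mutations(base, variant))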

  19. r

    Handwritten synthetic dataset from the IAM

    • researchdata.edu.au
    • research-repository.rmit.edu.au
    Updated Nov 20, 2023
    Cite
    Hiqmat Nisa (2023). Handwritten synthetic dataset from the IAM [Dataset]. http://doi.org/10.25439/RMT.24309730.V1
    Explore at:
    Dataset updated
    Nov 20, 2023
    Dataset provided by
    RMIT University, Australia
    Authors
    Hiqmat Nisa
    Description

    This dataset was generated by randomly crossing out words from the IAM database using several types of strokes. The ratio of crossed-out words to regular words in handwritten documents can vary greatly depending on the document and context; typically, however, the number of crossed-out words is small compared with regular words. To ensure a realistic ratio of regular to crossed-out words in our synthetic database, 30% of the samples from the IAM training set were selected. First, the bounding box of each word in a line was detected; the bounding box covers the core area of the word. Then a word was crossed out at random within its core area. Each line contains a randomly struck-out word at a different position. The annotation of each struck-out word was replaced with the symbol #.
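
    The generation code is not part of this record; a minimal sketch of the procedure just described, drawing a random stroke across one word's core bounding box with Pillow, could look like this (the actual stroke types used may differ):

        import random
        from PIL import Image, ImageDraw

        def strike_out_word(line_img, word_boxes):
            """Cross out one randomly chosen word inside its core bounding box
            and return the modified image plus the index of the struck word."""
            img = line_img.convert("L").copy()
            draw = ImageDraw.Draw(img)
            idx = random.randrange(len(word_boxes))
            x0, y0, x1, y1 = word_boxes[idx]
            cy, jitter = (y0 + y1) // 2, max(1, (y1 - y0) // 4)
            # A wavy horizontal stroke through the core area of the word.
            points = [(x, cy + random.randint(-jitter, jitter))
                      for x in range(x0, x1, 8)]
            draw.line(points, fill=0, width=3)
            return img, idx  # the idx-th word's transcription becomes '#'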

    The folder contains:

        - s-s0 images
        - Syn-trainset
        - Syn-validset
        - Syn_IAM_testset

    The transcription files use the format: filename,threshold label-of-handwritten-line. For example:

    s-s0-0,157 A # to stop Mr. Gaitskell from

    Here, s-s0-0 is the filename, 157 is the threshold, and the remainder is the transcription of the line, with # marking the crossed-out word.
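
    Assuming one record per line in that format, a small parser (a hypothetical helper, not shipped with the dataset) could read it as:

        def parse_transcription(line: str):
            """Split 'filename,threshold label' records."""
            head, _, label = line.partition(" ")
            filename, _, threshold = head.partition(",")
            return filename, int(threshold), label

        name, thr, label = parse_transcription("s-s0-0,157 A # to stop Mr. Gaitskell from")
        print(name, thr, label.split().count("#"))  # '#' counts the crossed-out words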

    Please cite the following work if you use this dataset:
    "A deep learning approach to handwritten text recognition in the presence of struck-out text"
    https://ieeexplore.ieee.org/document/8961024


  20. Loan Approval Classification Dataset

    • kaggle.com
    Updated Oct 29, 2024
    Cite
    Ta-wei Lo (2024). Loan Approval Classification Dataset [Dataset]. https://www.kaggle.com/datasets/taweilo/loan-approval-classification-data
    Explore at:
    Available download formats: Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Oct 29, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ta-wei Lo
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    1. Data Source

    This dataset is a synthetic version inspired by the original Credit Risk dataset on Kaggle, enriched with additional variables based on the Financial Risk for Loan Approval data. SMOTENC was used to simulate new data points and enlarge the number of instances. The dataset contains both categorical and continuous features.
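
    The exact augmentation setup is not published with the dataset, but a minimal sketch of oversampling mixed-type data with imbalanced-learn's SMOTENC (which handles categorical and continuous features together) might look like this; the filename is a placeholder:

        import pandas as pd
        from imblearn.over_sampling import SMOTENC

        df = pd.read_csv("loan_data.csv")  # hypothetical filename
        X, y = df.drop(columns=["loan_status"]), df["loan_status"]

        # Positions of the categorical columns in X (see the metadata table below).
        cat_idx = [X.columns.get_loc(c) for c in
                   ["person_gender", "person_education", "person_home_ownership",
                    "loan_intent", "previous_loan_defaults_on_file"]]

        smote = SMOTENC(categorical_features=cat_idx, random_state=42)
        X_res, y_res = smote.fit_resample(X, y)
        print(len(X), "->", len(X_res), "rows after oversampling")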

    2. Metadata

    The dataset contains 45,000 records and 14 variables, each described below:

    Column                          | Description                                        | Type
    --------------------------------|----------------------------------------------------|------------
    person_age                      | Age of the person                                  | Float
    person_gender                   | Gender of the person                               | Categorical
    person_education                | Highest education level                            | Categorical
    person_income                   | Annual income                                      | Float
    person_emp_exp                  | Years of employment experience                     | Integer
    person_home_ownership           | Home ownership status (e.g., rent, own, mortgage)  | Categorical
    loan_amnt                       | Loan amount requested                              | Float
    loan_intent                     | Purpose of the loan                                | Categorical
    loan_int_rate                   | Loan interest rate                                 | Float
    loan_percent_income             | Loan amount as a percentage of annual income       | Float
    cb_person_cred_hist_length      | Length of credit history in years                  | Float
    credit_score                    | Credit score of the person                         | Integer
    previous_loan_defaults_on_file  | Indicator of previous loan defaults                | Categorical
    loan_status (target variable)   | Loan approval status: 1 = approved; 0 = rejected   | Integer

    3. Data Usage

    The dataset can be used for multiple purposes:

    • Exploratory Data Analysis (EDA): Analyze key features, distribution patterns, and relationships to understand credit risk factors.
    • Classification: Build predictive models to classify the loan_status variable (approved/not approved) for potential applicants (see the sketch after this list).
    • Regression: Develop regression models to predict the credit_score variable based on individual and loan-related attributes.
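
    As a hedged illustration of the classification use case, a simple gradient-boosting baseline on loan_status might be sketched as follows (filename and preprocessing are assumptions, not part of the dataset):

        import pandas as pd
        from sklearn.ensemble import HistGradientBoostingClassifier
        from sklearn.metrics import roc_auc_score
        from sklearn.model_selection import train_test_split

        df = pd.read_csv("loan_data.csv")  # hypothetical filename
        X = pd.get_dummies(df.drop(columns=["loan_status"]))  # one-hot encode categoricals
        y = df["loan_status"]
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y,
                                                  test_size=0.2, random_state=42)

        clf = HistGradientBoostingClassifier(random_state=42).fit(X_tr, y_tr)
        print("ROC AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))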

    Mind the data issues inherited from the original data, such as instances with an age greater than 100 years.

    This dataset provides a rich basis for understanding financial risk factors and simulating predictive modeling processes for loan approval and credit scoring.

    Feel free to leave comments in the discussion section. I'd appreciate your upvote if you find this dataset useful! 😀
