51 datasets found
1. Dataset of article: Synthetic Datasets Generator for Testing Information...

    • ieee-dataport.org
    Updated Mar 13, 2020
    + more versions
    Cite
    Sandro Mendonça (2020). Dataset of article: Synthetic Datasets Generator for Testing Information Visualization and Machine Learning Techniques and Tools [Dataset]. http://doi.org/10.21227/5aeq-rr34
    Explore at:
    Dataset updated
    Mar 13, 2020
    Dataset provided by
    IEEE Dataport
    Authors
    Sandro Mendonça
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset used in the article entitled 'Synthetic Datasets Generator for Testing Information Visualization and Machine Learning Techniques and Tools'. These datasets can be used to test several characteristics in machine learning and data processing algorithms.

  2. Synthetic Data Generation Market Size, Share, Trend Analysis by 2033

    • emergenresearch.com
    pdf
    Updated Oct 8, 2024
    Cite
    Emergen Research (2024). Synthetic Data Generation Market Size, Share, Trend Analysis by 2033 [Dataset]. https://www.emergenresearch.com/industry-report/synthetic-data-generation-market
    Explore at:
    Available download formats: pdf
    Dataset updated
    Oct 8, 2024
    Dataset authored and provided by
    Emergen Research
    License

    https://www.emergenresearch.com/purpose-of-privacy-policy

    Time period covered
    2022 - 2032
    Area covered
    Global
    Description

    The Synthetic Data Generation Market is expected to reach a valuation of USD 36.09 billion in 2033, growing at a CAGR of 39.45%. The research report classifies the market by share, trend, and demand, with segmentation by Data Type, Modeling Type, Offering, Application, End Use, and Regional Outlook.

  3. Data Sheet 2_Large language models generating synthetic clinical datasets: a...

    • frontiersin.figshare.com
    xlsx
    Updated Feb 5, 2025
    + more versions
    Cite
    Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin (2025). Data Sheet 2_Large language models generating synthetic clinical datasets: a feasibility and comparative analysis with real-world perioperative data.xlsx [Dataset]. http://doi.org/10.3389/frai.2025.1533508.s002
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Feb 5, 2025
    Dataset provided by
    Frontiers
    Authors
    Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Clinical data is instrumental to medical research, machine learning (ML) model development, and advancing surgical care, but access is often constrained by privacy regulations and missing data. Synthetic data offers a promising solution to preserve privacy while enabling broader data access. Recent advances in large language models (LLMs) provide an opportunity to generate synthetic data with reduced reliance on domain expertise, computational resources, and pre-training.

    Objective: This study aims to assess the feasibility of generating realistic tabular clinical data with OpenAI's GPT-4o using zero-shot prompting, and to evaluate the fidelity of LLM-generated data by comparing its statistical properties to the Vital Signs DataBase (VitalDB), a real-world open-source perioperative dataset.

    Methods: In Phase 1, GPT-4o was prompted to generate a dataset with qualitative descriptions of 13 clinical parameters. The resultant data was assessed for general errors, plausibility of outputs, and cross-verification of related parameters. In Phase 2, GPT-4o was prompted to generate a dataset using descriptive statistics of the VitalDB dataset. Fidelity was assessed using two-sample t-tests, two-sample proportion tests, and 95% confidence interval (CI) overlap.

    Results: In Phase 1, GPT-4o generated a complete and structured dataset comprising 6,166 case files. The dataset was plausible in range and correctly calculated body mass index for all case files based on respective heights and weights. Statistical comparison between the LLM-generated datasets and VitalDB revealed that Phase 2 data achieved significant fidelity: statistical similarity was demonstrated in 12/13 (92.31%) parameters, with no statistically significant differences observed in 6/6 (100.0%) categorical/binary and 6/7 (85.71%) continuous parameters. Overlap of 95% CIs was observed in 6/7 (85.71%) continuous parameters.

    Conclusion: Zero-shot prompting with GPT-4o can generate realistic tabular synthetic datasets that replicate key statistical properties of real-world perioperative data. This study highlights the potential of LLMs as a novel and accessible modality for synthetic data generation, which may address critical barriers in clinical data access and eliminate the need for technical expertise, extensive computational resources, and pre-training. Further research is warranted to enhance fidelity and investigate the use of LLMs to amplify and augment datasets, preserve multivariate relationships, and train robust ML models.
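    The Phase 2 fidelity checks described above (two-sample t-tests and 95% CI overlap for continuous parameters) can be sketched as follows; this is a minimal illustration with simulated data, not the study's actual code, and the function name and parameter values are invented.

    ```python
    import numpy as np
    from scipy import stats

    def ci95(x):
        """95% confidence interval for the mean of a sample."""
        m, se = np.mean(x), stats.sem(x)
        h = se * stats.t.ppf(0.975, len(x) - 1)
        return m - h, m + h

    def compare_continuous(real, synthetic, alpha=0.05):
        """Welch two-sample t-test plus 95% CI overlap for one continuous parameter."""
        t, p = stats.ttest_ind(real, synthetic, equal_var=False)
        lo_r, hi_r = ci95(real)
        lo_s, hi_s = ci95(synthetic)
        return {
            "p_value": p,
            "similar": bool(p >= alpha),          # no significant difference
            "ci_overlap": bool(lo_r <= hi_s and lo_s <= hi_r),
        }

    # Illustrative example: "synthetic" heights drawn from the same distribution
    # as the "real" ones, so the two samples should usually look similar.
    rng = np.random.default_rng(0)
    real_height = rng.normal(165, 9, 500)
    synth_height = rng.normal(165, 9, 500)
    print(compare_continuous(real_height, synth_height))
    ```

    A parameter counts as statistically similar when the t-test finds no significant difference; CI overlap is the secondary check reported in the abstract.
    
    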

  4. Synthetic Data Generation Market Size, Share, Trends & Insights Report, 2035...

    • rootsanalysis.com
    Updated Oct 1, 2024
    Cite
    Roots Analysis (2024). Synthetic Data Generation Market Size, Share, Trends & Insights Report, 2035 [Dataset]. https://www.rootsanalysis.com/synthetic-data-generation-market
    Explore at:
    Dataset updated
    Oct 1, 2024
    Dataset authored and provided by
    Roots Analysis
    License

    https://www.rootsanalysis.com/privacy.html

    Time period covered
    2021 - 2031
    Area covered
    Global
    Description

    The global synthetic data market size is projected to grow from USD 0.4 billion in the current year to USD 19.22 billion by 2035, representing a CAGR of 42.14% over the forecast period to 2035.

  5. Global Test Data Management Market Size By Component (Software/Solutions and...

    • verifiedmarketresearch.com
    Cite
    VERIFIED MARKET RESEARCH, Global Test Data Management Market Size By Component (Software/Solutions and Services), By Deployment Mode (Cloud-based and On-Premises), By Enterprise Level (Large Enterprises and SMEs), By Application (Synthetic Test Data Generation, Data Masking), By End User (BFSI, IT & telecom, Retail & Agriculture), By Geographic Scope And Forecast [Dataset]. https://www.verifiedmarketresearch.com/product/test-data-management-market/
    Explore at:
    Dataset provided by
    Verified Market Research (https://www.verifiedmarketresearch.com/)
    Authors
    VERIFIED MARKET RESEARCH
    License

    https://www.verifiedmarketresearch.com/privacy-policy/

    Time period covered
    2024 - 2031
    Area covered
    Global
    Description

    Test Data Management Market size was valued at USD 1.54 Billion in 2024 and is projected to reach USD 2.97 Billion by 2031, growing at a CAGR of 11.19% from 2024 to 2031.

    Test Data Management Market Drivers

    Increasing Data Volumes: The exponential growth in data generated by businesses necessitates efficient management of test data. Effective TDM solutions help organizations handle large volumes of data, ensuring accurate and reliable testing processes.

    Need for Regulatory Compliance: Stringent data privacy regulations, such as GDPR, HIPAA, and CCPA, require organizations to protect sensitive data. TDM solutions help ensure compliance by masking or anonymizing sensitive data used in testing environments.

    Adoption of DevOps and Agile Methodologies: The shift towards DevOps and Agile development practices increases the demand for TDM solutions. These methodologies require continuous testing and integration, necessitating efficient management of test data to maintain quality and speed.

  6. Synthetic Data Generation Market Size | CAGR of 35.9%

    • market.us
    csv, pdf
    Updated Mar 17, 2025
    + more versions
    Cite
    Market.us (2025). Synthetic Data Generation Market Size | CAGR of 35.9% [Dataset]. https://market.us/report/synthetic-data-generation-market/
    Explore at:
    Available download formats: pdf, csv
    Dataset updated
    Mar 17, 2025
    Dataset provided by
    Market.us
    License

    https://market.us/privacy-policy/

    Time period covered
    2022 - 2032
    Area covered
    Global
    Description

    The Synthetic Data Generation Market is estimated to reach USD 6,637.9 million by 2034, riding on a strong 35.9% CAGR during the forecast period.

  7. A Study of the Synthetic Data Generation Market by Tabular Data and Direct...

    • futuremarketinsights.com
    pdf
    Updated Mar 8, 2024
    Cite
    A Study of the Synthetic Data Generation Market by Tabular Data and Direct Modeling from 2024 to 2034 [Dataset]. https://www.futuremarketinsights.com/reports/synthetic-data-generation-market
    Explore at:
    Available download formats: pdf
    Dataset updated
    Mar 8, 2024
    Dataset authored and provided by
    Future Market Insights
    License

    https://www.futuremarketinsights.com/privacy-policy

    Time period covered
    2024 - 2034
    Area covered
    Worldwide
    Description

    The synthetic data generation market is projected to be worth US$ 300 million in 2024. The market is anticipated to reach US$ 13.0 billion by 2034. The market is further expected to surge at a CAGR of 45.9% during the forecast period 2024 to 2034.

    Attributes and Key Insights
    Synthetic Data Generation Market Estimated Size in 2024: US$ 300 million
    Projected Market Value in 2034: US$ 13.0 billion
    Value-based CAGR from 2024 to 2034: 45.9%
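    As a sanity check, the figures above are consistent with the standard CAGR formula, CAGR = (end/start)^(1/years) − 1; a minimal sketch (the function name is ours, and rounding explains the small gap from the stated 45.9%):

    ```python
    def cagr(start, end, years):
        """Compound annual growth rate: (end / start) ** (1 / years) - 1."""
        return (end / start) ** (1 / years) - 1

    # Table values: US$ 0.3 billion in 2024 growing to US$ 13.0 billion in 2034.
    rate = cagr(0.3, 13.0, 2034 - 2024)
    print(f"{rate:.1%}")  # close to the table's stated 45.9%
    ```
    
    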

    Country-wise Insights

    Countries and Forecast CAGRs from 2024 to 2034
    The United States: 46.2%
    The United Kingdom: 47.2%
    China: 46.8%
    Japan: 47.0%
    Korea: 47.3%

    Category-wise Insights

    Category and CAGR through 2034
    Tabular Data: 45.7%
    Sandwich Assays: 45.5%

    Report Scope

    Attribute and Details
    Estimated Market Size in 2024: US$ 0.3 billion
    Projected Market Valuation in 2034: US$ 13.0 billion
    Value-based CAGR 2024 to 2034: 45.9%
    Forecast Period: 2024 to 2034
    Historical Data Available for: 2019 to 2023
    Market Analysis: Value in US$ Billion
    Key Regions Covered
    • North America
    • Latin America
    • Western Europe
    • Eastern Europe
    • South Asia and Pacific
    • East Asia
    • The Middle East & Africa
    Key Market Segments Covered
    • Data Type
    • Modeling Type
    • Offering
    • Application
    • End Use
    • Region
    Key Countries Profiled
    • The United States
    • Canada
    • Brazil
    • Mexico
    • Germany
    • France
    • Spain
    • Italy
    • Russia
    • Poland
    • Czech Republic
    • Romania
    • India
    • Bangladesh
    • Australia
    • New Zealand
    • China
    • Japan
    • South Korea
    • GCC countries
    • South Africa
    • Israel
    Key Companies Profiled
    • Mostly AI
    • CVEDIA Inc.
    • Gretel Labs
    • Datagen
    • NVIDIA Corporation
    • Synthesis AI
    • Amazon.com, Inc.
    • Microsoft Corporation
    • IBM Corporation
    • Meta

  8. Synthea synthetic patient generator data in OMOP Common Data Model

    • registry.opendata.aws
    Updated Jan 4, 2023
    Cite
    Amazon Web Services (2023). Synthea synthetic patient generator data in OMOP Common Data Model [Dataset]. https://registry.opendata.aws/synthea-omop/
    Explore at:
    Dataset updated
    Jan 4, 2023
    Dataset provided by
    Amazon.com (http://amazon.com/)
    Description

    The Synthea generated data is provided here as 1,000 person (1k), 100,000 person (100k), and 2,800,000 person (2.8m) data sets in the OMOP Common Data Model format. Synthea™ is a synthetic patient generator that models the medical history of synthetic patients. Our mission is to output high-quality synthetic, realistic but not real, patient data and associated health records covering every aspect of healthcare. The resulting data is free from cost, privacy, and security restrictions. It can be used without restriction for a variety of secondary uses in academia, research, industry, and government (although a citation would be appreciated). You can read our first academic paper here: https://doi.org/10.1093/jamia/ocx079

  9. Synthetic nursing handover training and development data set - text files

    • data.csiro.au
    • researchdata.edu.au
    Updated Mar 21, 2017
    + more versions
    Cite
    Maricel Angel; Hanna Suominen; Liyuan Zhou; Leif Hanlen (2017). Synthetic nursing handover training and development data set - text files [Dataset]. http://doi.org/10.4225/08/58d097ee92e95
    Explore at:
    Dataset updated
    Mar 21, 2017
    Dataset provided by
    CSIRO (http://www.csiro.au/)
    Authors
    Maricel Angel; Hanna Suominen; Liyuan Zhou; Leif Hanlen
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Dataset funded by
    NICTA (http://nicta.com.au/)
    Description

    This is one of two collection records. Please see the link below for the other collection of associated audio files.

    Both collections together comprise an open clinical dataset of three sets of 101 nursing handover records, very similar to real documents in Australian English. Each record consists of a patient profile, spoken free-form text document, written free-form text document, and written structured document.

    This collection contains 3 sets of text documents.

    Data Set 1 for Training and Development

    The data set, released in June 2014, includes the following documents:

    Folder initialisation: initialisation details for speech recognition using Dragon Medical 11.0 (i.e., i) DOCX for the written, free-form text document that originates from the Dragon software release and ii) WMA for the spoken, free-form text document by the RN)
    Folder 100profiles: 100 patient profiles (DOCX)
    Folder 101writtenfreetextreports: 101 written, free-form text documents (TXT)
    Folder 100x6speechrecognised: 100 speech-recognized, written, free-form text documents for six Dragon vocabularies (TXT)
    Folder 101informationextraction: 101 written, structured documents for information extraction that include i) the reference standard text, ii) features used by our best system, iii) form categories with respect to the reference standard and iv) form categories with respect to our best information extraction system (TXT in CRF++ format).

    An Independent Data Set 2

    The aforementioned data set was supplemented in April 2015 with an independent set that was used as a test set in the CLEFeHealth 2015 Task 1a on clinical speech recognition and can be used as a validation set in the CLEFeHealth 2016 Task 1 on handover information extraction. Hence, when using this set, please avoid its repeated use in evaluation – we do not wish to overfit to these data sets.

    The set released in April 2015 consists of 100 patient profiles (DOCX), 100 written, and 100 speech-recognized, written, free-form text documents for the Dragon vocabulary of Nursing (TXT). The set released in November 2015 consists of the respective 100 written free-form text documents (TXT) and 100 written, structured documents for information extraction.

    An Independent Data Set 3

    For evaluation purposes, the aforementioned data sets were supplemented in April 2016 with an independent set of another 100 synthetic cases.

    Lineage: Data creation included the following steps: generation of patient profiles; creation of written, free form text documents; development of a structured handover form, using this form and the written, free-form text documents to create written, structured documents; creation of spoken, free-form text documents; using a speech recognition engine with different vocabularies to convert the spoken documents to written, free-form text; and using an information extraction system to fill out the handover form from the written, free-form text documents.

    See Suominen et al (2015) in the links below for a detailed description and examples.

  10. Data from: Generation of synthetic whole-slide image tiles of tumours from...

    • search-dev.test.dataone.org
    • search.dataone.org
    • +2 more
    Updated Apr 12, 2024
    Cite
    Francisco Carrillo-Perez; Marija Pizurica; Yuanning Zheng; Tarak Nath Nandi; Ravi Madduri; Jeanne Shen; Olivier Gevaert (2024). Generation of synthetic whole-slide image tiles of tumours from RNA-sequencing data via cascaded diffusion models [Dataset]. http://doi.org/10.5061/dryad.6djh9w174
    Explore at:
    Dataset updated
    Apr 12, 2024
    Dataset provided by
    Dryad Digital Repository
    Authors
    Francisco Carrillo-Perez; Marija Pizurica; Yuanning Zheng; Tarak Nath Nandi; Ravi Madduri; Jeanne Shen; Olivier Gevaert
    Time period covered
    Jan 1, 2023
    Description

    Data scarcity presents a significant obstacle in the field of biomedicine, where acquiring diverse and sufficient datasets can be costly and challenging. Synthetic data generation offers a potential solution to this problem by expanding dataset sizes, thereby enabling the training of more robust and generalizable machine learning models. Although previous studies have explored synthetic data generation for cancer diagnosis, they have predominantly focused on single-modality settings, such as whole-slide image tiles or RNA-Seq data. To bridge this gap, we propose a novel approach, RNA-Cascaded-Diffusion-Model or RNA-CDM, for performing RNA-to-image synthesis in a multi-cancer context, drawing inspiration from successful text-to-image synthesis models used in natural images. In our approach, we employ a variational auto-encoder to reduce the dimensionality of a patient's gene expression profile, effectively distinguishing between different types of cancer. Subsequently, we employ a cascad...

    RNA-CDM Generated One Million Synthetic Images

    https://doi.org/10.5061/dryad.6djh9w174

    One million synthetic digital pathology images were generated using the RNA-CDM model presented in the paper "RNA-to-image multi-cancer synthesis using cascaded diffusion models".

    Description of the data and file structure

    There are ten different h5 files per cancer type (TCGA-CESC, TCGA-COAD, TCGA-KIRP, TCGA-GBM, TCGA-LUAD). Each h5 file contains 20,000 images. The key is the tile number, ranging from 0-20,000 in the first file and from 180,000-200,000 in the last file. The tiles are saved as numpy arrays.
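    Assuming the h5 files follow the layout described above (string tile-number keys mapping to numpy-array tiles), they could be read with h5py roughly as below; the demo file built here is a tiny stand-in, not one of the actual per-cancer-type files.

    ```python
    import numpy as np
    import h5py  # assumed reader for the dataset's .h5 files

    # Build a tiny stand-in file mirroring the described layout:
    # keys are tile numbers, values are numpy-array image tiles.
    with h5py.File("demo_tiles.h5", "w") as f:
        for i in range(3):
            f.create_dataset(str(i), data=np.zeros((256, 256, 3), dtype=np.uint8))

    # Reading it back mirrors how a real per-cancer-type file
    # (e.g. TCGA-LUAD, 20,000 tiles per file) could be iterated.
    with h5py.File("demo_tiles.h5", "r") as f:
        for key in sorted(f.keys(), key=int):
            tile = f[key][()]  # one tile as a numpy array
            print(key, tile.shape, tile.dtype)
    ```

    The tile shape and dtype above are invented for the demo; the real tile dimensions are not stated in this listing.
    
    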

    Code/Software

    The code used to generate this data is available under academic license at https://rna-cdm.stanford.edu.

    Manuscript citation

    Carrillo-Perez, F., Pizurica, M., Zheng, Y. et al. Generation of synthetic whole-slide image tiles of tumours from RNA-sequencing data via cascaded diffusion models...

  11. Insider Threat Test Dataset

    • kilthub.cmu.edu
    application/bzip2 +3
    Updated May 30, 2023
    + more versions
    Cite
    Brian Lindauer (2023). Insider Threat Test Dataset [Dataset]. http://doi.org/10.1184/R1/12841247.v1
    Explore at:
    Available download formats: bz2, bin, application/bzip2, txt
    Dataset updated
    May 30, 2023
    Dataset provided by
    Carnegie Mellon University
    Authors
    Brian Lindauer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Insider Threat Test Dataset is a collection of synthetic insider threat test datasets that provide both background and malicious actor synthetic data.

    The CERT Division, in partnership with ExactData, LLC, and under sponsorship from DARPA I2O, generated a collection of synthetic insider threat test datasets. These datasets provide both synthetic background data and data from synthetic malicious actors. For more background on this data, please see the paper, Bridging the Gap: A Pragmatic Approach to Generating Insider Threat Data. Datasets are organized according to the data generator release that created them. Most releases include multiple datasets (e.g., r3.1 and r3.2). Generally, later releases include a superset of the data generation functionality of earlier releases. Each dataset file contains a readme file that provides detailed notes about the features of that release. The answer key file answers.tar.bz2 contains the details of the malicious activity included in each dataset, including descriptions of the scenarios enacted and the identifiers of the synthetic users involved.

  12. Test Sequence 000013

    • data.researchdatafinder.qut.edu.au
    Updated Sep 4, 2024
    + more versions
    Cite
    (2024). Test Sequence 000013 [Dataset]. https://data.researchdatafinder.qut.edu.au/es/dataset/0e08ddc4-ea84-4a27-904b-4965cdeabb44/resource/202ce8d4-a48c-4f57-bd23-b85288deda94
    Explore at:
    Dataset updated
    Sep 4, 2024
    License

    https://researchdatafinder.qut.edu.au/display/n6417

    Description

    Test data created for the ACRV Robotic Vision Challenge 1 (see https://competitions.codalab.org/competitions/20940). Synthetic image data generated from Unreal Engine. QUT Research Data Repository dataset resource available for download.

  13. Test Data Management Market Size, Share, Trends, Scope And Forecast

    • marketresearchintellect.com
    Updated Mar 15, 2025
    Cite
    Test Data Management Market Size, Share, Trends, Scope And Forecast [Dataset]. https://www.marketresearchintellect.com/product/global-test-data-management-market-size-forecast/
    Explore at:
    Dataset updated
    Mar 15, 2025
    Dataset authored and provided by
    Market Research Intellect
    License

    https://www.marketresearchintellect.com/privacy-policy

    Area covered
    Global
    Description

    The size and share of the market is categorized based on Type (Implementation, Consulting, Support and Maintenance) and Application (Data subsetting, Data masking, Data profiling and analysis, Data compliance and security, Synthetic test data generation, Others) and geographical regions (North America, Europe, Asia-Pacific, South America, and Middle-East and Africa).

  14. SYNTHETIC Norwegian Colorectal Cancer genomic dataset generated in...

    • catalogue.portal.dev.gdi.lu
    • ckan-test.healthdata.nl
    Updated Apr 30, 2024
    Cite
    (2024). SYNTHETIC Norwegian Colorectal Cancer genomic dataset generated in EOSC4Cancer - Dataset - CKAN [Dataset]. https://catalogue.portal.dev.gdi.lu/dataset/synthetic-norwegian-colorectal-cancer-genomic-dataset-generated-in-eosc4cancer
    Explore at:
    Dataset updated
    Apr 30, 2024
    Description

    SYNTHETIC. This dataset contains 10 tumor-normal pairs of synthetic WGS data for colorectal cancer, simulated in the standard format of Illumina paired-end reads. The NEAT read simulator (version 3.0, https://github.com/zstephens/neat-genreads) was utilized to synthesize these 10 pairs of tumor and normal WGS data. In the data generation procedure, simulated parameters (i.e., sequencing error statistics, read fragment length distribution and GC% coverage bias) were learned from data models provided by NEAT. The average sequencing depth for tumor and normal samples aims to reach around 110X and 60X, respectively. To generate the synthetic normal WGS data for each sample, a germline variant profile from a real patient is down-sampled randomly, retaining 50% of that patient's germline variants. It is then mixed with an in silico germline variant profile modelled randomly using an average mutation rate (0.001), finally constituting a full germline profile for the normal synthetic WGS data. To generate the synthetic tumor WGS data for each sample, a pre-defined somatic short variant profile (SNVs+Indels) learned from a real CRC patient is added to the germline variant profile used for creating the normal synthetic WGS data of the same patient, and this combined profile is used to produce the simulated sequences. Neither a copy number profile nor a structural variation profile is introduced into the tumor synthetic WGS data. Tumor content and ploidy are assumed to be 100% and 2, respectively.
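    The 50% germline down-sampling step described above can be sketched as a seeded random draw over a variant list; the records and function below are illustrative placeholders, not the actual NEAT inputs or pipeline code.

    ```python
    import random

    def downsample_germline(variants, fraction=0.5, seed=42):
        """Randomly keep a fraction of germline variants, mirroring the 50%
        down-sampling used to build each synthetic normal profile."""
        rng = random.Random(seed)  # seeded for reproducibility
        k = round(len(variants) * fraction)
        return rng.sample(variants, k)

    # Placeholder variant records (chrom, pos, ref, alt) standing in for a real profile.
    profile = [("chr1", 1000 + i, "A", "G") for i in range(10)]
    kept = downsample_germline(profile)
    print(len(kept))  # → 5
    ```
    
    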

  15. Data from: MS Ana: Improving Sensitivity in Peptide Identification with...

    • acs.figshare.com
    zip
    Updated May 31, 2023
    Cite
    Sebastian Dorl; Stephan Winkler; Karl Mechtler; Viktoria Dorfer (2023). MS Ana: Improving Sensitivity in Peptide Identification with Spectral Library Search [Dataset]. http://doi.org/10.1021/acs.jproteome.2c00658.s002
    Explore at:
    Available download formats: zip
    Dataset updated
    May 31, 2023
    Dataset provided by
    ACS Publications
    Authors
    Sebastian Dorl; Stephan Winkler; Karl Mechtler; Viktoria Dorfer
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Spectral library search can enable more sensitive peptide identification in tandem mass spectrometry experiments. However, its drawbacks are the limited availability of high-quality libraries and the added difficulty of creating decoy spectra for result validation. We describe MS Ana, a new spectral library search engine that enables high sensitivity peptide identification using either curated or predicted spectral libraries as well as robust false discovery control through its own decoy library generation algorithm. MS Ana identifies on average 36% more spectrum matches and 4% more proteins than database search in a benchmark test on single-shot human cell-line data. Further, we demonstrate the quality of the result validation with tests on synthetic peptide pools and show the importance of library selection through a comparison of library search performance with different configurations of publicly available human spectral libraries.

  16. Synthetic datasets

    • generated.photos
    Updated Jun 25, 2024
    Cite
    Generated Media, Inc. (2024). Synthetic datasets [Dataset]. https://generated.photos/datasets
    Explore at:
    Dataset updated
    Jun 25, 2024
    Dataset authored and provided by
    Generated Media, Inc.
    Description

    100% synthetic, based on model-released photos. Can be used for any purpose except those violating the law. Worldwide coverage. Different backgrounds: colored, transparent, photographic. Diversity across ethnicity, demographics, facial expressions, and poses.

  17. INDIGO Change Detection Reference Dataset

    • b2find.dkrz.de
    Updated Dec 21, 2023
    + more versions
    Cite
    (2023). INDIGO Change Detection Reference Dataset - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/b1c3a103-a208-5983-9784-e866b64aee1e
    Explore at:
    Dataset updated
    Dec 21, 2023
    Description

    The INDIGO Change Detection Reference Dataset Description This graffiti-centred change detection dataset was developed in the context of INDIGO, a research project focusing on the documentation, analysis and dissemination of graffiti along Vienna's Donaukanal. The dataset aims to support the development and assessment of change detection algorithms. The dataset was collected from a test site approximately 50 meters in length along Vienna's Donaukanal during 11 days between 2022/10/21 and 2022/12/01. Various cameras with different settings were used, resulting in a total of 29 data collection sessions or "epochs" (see "EpochIDs.jpg" for details). Each epoch contains 17 images generated from 29 distinct 3D models with different textures. In total, the dataset comprises 6,902 unique image pairs, along with corresponding reference change maps. Additionally, exclusion masks are provided to ignore parts of the scene that might be irrelevant, such as the background. To summarise, the dataset, labelled as "Data.zip," includes the following: Synthetic Images: These are colour images created within Agisoft Metashape Professional 1.8.4, generated by rendering views from 17 artificial cameras observing 29 differently textured versions of the same 3D surface model. Change Maps: Binary images that were manually and programmatically generated, using a Python script, from two synthetic graffiti images. These maps highlight the areas where changes have occurred. Exclusion Masks: Binary images are manually created from synthetic graffiti images to identify "no data" areas or irrelevant ground pixels. Image Acquisition Image acquisition involved the use of two different camera setups. The first two datasets (ID 1 and 2; cf. "EpochIDs.jpg") were obtained using a Nikon Z 7II camera with a pixel count of 45.4 MP, paired with a Nikon NIKKOR Z 20 mm lens. For the remaining image datasets (ID 3-29), a triple GoPro setup was employed. 
This triple setup featured three GoPro cameras, comprising two GoPro HERO 10 cameras and one GoPro HERO 11, all securely mounted within a frame. This triple-camera setup was utilised on nine different days with varying camera settings, resulting in the acquisition of 27 image datasets in total (nine days with three datasets each). Data Structure The "Data.zip" file contains two subfolders: 1_ImagesAndChangeMaps: This folder contains the primary dataset. Each subfolder corresponds to a specific epoch. Within each epoch folder resides a subfolder for every other epoch with which a distinct epoch pair can be created. It is important to note that the pairs "Epoch Y and Epoch Z" are equivalent to "Epoch Z and Epoch Y", so the latter combinations are not included in this dataset. Each sub-subfolder, organised by epoch, contains 17 more subfolders, which hold the image data. These subfolders consist of: Two synthetic images rendered from the same synthetic camera ("X_Y.jpg" and "X_Z.jpg") The corresponding binary reference change map depicting the graffiti-related differences between the two images ("X_YZ.png"). Black areas denote new graffiti (i.e. "change"), and white denotes "no change". "DataStructure.png" provides a visual explanation concerning the creation of the dataset. The filenames follow the following pattern: X - Is the ID number of the synthetic camera. In total, 17 synthetic cameras were placed along the test site Y - Corresponds to the reference epoch (i.e. the "older epoch") Z - Corresponds to the "new epoch" 2_ExclusionMasks: This folder contains the binary exclusion masks. They were manually created from synthetic graffiti images and identify "no data" areas or areas considered irrelevant, such as "ground pixels". Two exclusion masks were generated for each of the 17 synthetic cameras: "groundMasks": depict ground pixels which are usually irrelevant for the detection of graffiti "noDataMasks": depict "background" for which no data is available. 
    A detailed dataset description (including detailed explanations of the data creation) is part of a journal paper currently in preparation. The paper will be linked here for further clarification as soon as it is available.

    Licensing

    Due to the nature of the three image types, this dataset comes with two licenses:

    Synthetic images: These come with an In Copyright license (for the rights usage terms, see https://rightsstatements.org/page/InC/1.0/?language=en). The copyright lies with:

    - the Ludwig Boltzmann Gesellschaft (https://d-nb.info/gnd/1024204324)
    - the TU Wien (https://d-nb.info/gnd/55426-1)
    - one or more anonymous graffiti creator(s) upon whose work these images are based

    The first two entities are also the licensors of these images.

    Change maps and masks: These are openly licensed via CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0). In this case, the copyright lies with:

    - the Ludwig Boltzmann Gesellschaft (https://d-nb.info/gnd/1024204324)
    - the TU Wien (https://d-nb.info/gnd/55426-1)

    Both institutions are also the licensors of these images.

    Every synthetic image, change map and mask has this licensing information embedded as IPTC photo metadata. In addition, the images' IPTC metadata also provide a short image description, the image creator and the creator's identity (in the form of an ORCiD).

  18. m

    Global Test Data Management TDM Market Size, Trends and Projections

    • marketresearchintellect.com
    Updated Jun 25, 2024
    + more versions
    Cite
    Market Research Intellect (2024). Global Test Data Management TDM Market Size, Trends and Projections [Dataset]. https://www.marketresearchintellect.com/product/test-data-management-tdm-market/
    Explore at:
    Dataset updated
    Jun 25, 2024
    Dataset authored and provided by
    Market Research Intellect
    License

    https://www.marketresearchintellect.com/privacy-policyhttps://www.marketresearchintellect.com/privacy-policy

    Area covered
    Global
    Description

    The size and share of the market are categorized based on Type (Implementation, Consulting, Support and Maintenance, Training and Education), Application (Data subsetting, Data masking, Data profiling and analysis, Data compliance and security, Synthetic test data generation, Others (data provisioning and data monitoring)) and geographical region (North America, Europe, Asia-Pacific, South America, and Middle-East and Africa).

  19. c

    Synthetic Population for Agent-based Modelling in Canada, 2016-2030

    • datacatalogue.cessda.eu
    • beta.ukdataservice.ac.uk
    Updated Mar 14, 2025
    Cite
    Manley, E; Predhumeau, M (2025). Synthetic Population for Agent-based Modelling in Canada, 2016-2030 [Dataset]. http://doi.org/10.5255/UKDA-SN-857535
    Explore at:
    Dataset updated
    Mar 14, 2025
    Dataset provided by
    University of Leeds
    Authors
    Manley, E; Predhumeau, M
    Time period covered
    Feb 1, 2020 - Jan 31, 2024
    Area covered
    Canada
    Variables measured
    Geographic Unit
    Measurement technique
    Synthetic population data projections, derived from Canadian census data.
    Description

    In order to anticipate the impact of local public policies, a synthetic population reflecting the characteristics of the local population provides a valuable test bed. While synthetic population datasets are now available for several countries, there is no open-source synthetic population for Canada. We propose an open-source synthetic population of individuals and households at a fine geographical level for Canada for the years 2021, 2023 and 2030. Based on 2016 census data and population projections, the synthetic individuals have detailed socio-demographic attributes, including age, sex, income, education level, employment status and geographic locations, and are related into households. A comparison of the 2021 synthetic population with 2021 census data over various geographical areas validates the reliability of the synthetic dataset. Users can extract populations from the dataset for specific zones, to explore ‘what if’ scenarios on present and future populations. They can extend the dataset using local survey data to add new characteristics to individuals. Users can also run the code to generate populations for years up to 2042.
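    As the description above notes, users can extract populations for specific zones while keeping individuals grouped into households. A minimal sketch of such an extraction with pandas; the column names ("zone_id", "household_id", "age") are hypothetical and must be checked against the dataset's actual codebook:

    ```python
    import pandas as pd

    def extract_zone(population: pd.DataFrame, zone: str) -> pd.DataFrame:
        """Return all synthetic individuals living in the given zone,
        keeping households intact (every member of a matching household)."""
        households = population.loc[population["zone_id"] == zone, "household_id"].unique()
        return population[population["household_id"].isin(households)]

    # Tiny mock population standing in for one of the released files
    pop = pd.DataFrame({
        "person_id": [1, 2, 3, 4],
        "household_id": [10, 10, 11, 12],
        "zone_id": ["A", "A", "B", "B"],
        "age": [34, 2, 71, 45],
    })
    print(len(extract_zone(pop, "B")))  # 2
    ```

    Filtering by household rather than by individual rows avoids splitting a household across zone boundaries, which matters for agent-based models that simulate household-level behaviour.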

    To capture the full social and economic benefits of AI, new technologies must be sensitive to the diverse needs of the whole population. This means understanding and reflecting the complexity of individual needs, the variety of perceptions, and the constraints that might guide interaction with AI. This challenge is no more relevant than in building AI systems for older populations, where the role, potential, and outstanding challenges are all highly significant.

    The RAIM (Responsible Automation for Inclusive Mobility) project will address how on-demand, electric autonomous vehicles (EAVs) might be integrated within public transport systems in the UK and Canada to meet the complex needs of older populations, resulting in improved social, economic, and health outcomes. The research integrates a multidisciplinary methodology - integrating qualitative perspectives and quantitative data analysis into AI-generated population simulations and supply optimisation. Throughout the project, there is a firm commitment to interdisciplinary interaction and learning, with researchers being drawn from urban geography, ageing population health, transport planning and engineering, and artificial intelligence.

    The RAIM project will produce a diverse set of outputs that are intended to promote change and discussion in transport policymaking and planning. As a primary goal, the project will simulate and evaluate the feasibility of an on-demand EAV system for older populations. This requires advances around the understanding and prediction of the complex interaction of physical and cognitive constraints, preferences, locations, lifestyles and mobility needs within older populations, which differs significantly from other portions of society. With these patterns of demand captured and modelled, new methods for meeting this demand through optimisation of on-demand EAVs will be required. The project will adopt a forward-looking, interdisciplinary approach to the application of AI within these research domains, including using Deep Learning to model human behaviour, Deep Reinforcement Learning to optimise the supply of EAVs, and generative modelling to estimate population distributions.

    A second component of the research involves exploring the potential adoption of on-demand EAVs for ageing populations within two regions of interest. The two areas of interest - Manitoba, Canada, and the West Midlands, UK - are facing the combined challenge of increasing older populations with service issues and reducing patronage on existing services for older travellers. The RAIM project has established partnerships with key local partners, including local transport authorities - Winnipeg Transit in Canada, and Transport for West Midlands in the UK - in addition to local support groups and industry bodies. These partnerships will provide insights and guidance into the feasibility of new AV-based mobility interventions, and a direct route to influencing future transport policy. As part of this work, the project will propose new approaches for assessing the economic case for transport infrastructure investment, by addressing the wider benefits of improved mobility in older populations.

    At the heart of the project is a commitment to enhancing collaboration between academic communities in the UK and Canada. RAIM puts in place opportunities for cross-national learning and collaboration between partner organisations, ensuring that the challenges faced in relation to ageing mobility and AI are shared. RAIM furthermore will support the development of a next generation of researchers, through interdisciplinary mentoring, training, and networking opportunities.

  20. u

    Data from: Synthetic realistic noise-corrupted PPG database and noise...

    • produccioncientifica.ucm.es
    • zenodo.org
    Updated 2021
    Cite
    Masinelli, Giulio; Dell'Agnola, Fabio; Valdés, Adriana; Atienza, David (2021). Synthetic realistic noise-corrupted PPG database and noise generator for the evaluation of PPG denoising and delineation algorithms [Dataset]. https://produccioncientifica.ucm.es/documentos/668fc441b9e7c03b01bd7ece
    Explore at:
    Dataset updated
    2021
    Authors
    Masinelli, Giulio; Dell'Agnola, Fabio; Valdés, Adriana; Atienza, David
    Description

    Overview

    This database is meant to evaluate the performance of denoising and delineation algorithms for PPG signals affected by noise. The noise generator allows applying the algorithms under test to an artificially corrupted reference PPG signal and comparing their output to the output obtained with the original signal. Moreover, the noise generator can produce artifacts of variable intensities, permitting the evaluation of the algorithms' performance against different noise levels. The reference signal is a PPG sample of a healthy subject at rest during a relaxing session.

    Database

    The database includes one recording of 72 seconds of synchronous PPG and ECG signals sampled at 250 Hz using a Medicom device, ABP-10 module (Medicom MTD Ltd., Russia). It was collected from a healthy subject during an induced relaxation by guided autogenic relaxation. For more information about the data collection, please refer to the following publication: https://pubmed.ncbi.nlm.nih.gov/30094756/. In addition, PPG signals corrupted by the noise generator at different levels are also included in the database.

    Realistic noise generator

    Motion artifacts in PPG signals generally appear in the form of sudden spikes (coinciding with the subject's movement) and slowly varying offsets (baseline wander) caused by the changes in distance between the skin and the sensor after every sudden movement. For this reason, conventional noise generators, which draw random noise from distributions such as Gaussian or Poissonian, do not allow a proper evaluation of an algorithm's performance: the noise they produce is unrealistic compared to the noise commonly found in PPG signals. To overcome this issue, we designed a more realistic synthetic noise generator that can simulate those two behaviours, enabling us to corrupt a reference signal with different noise levels. The details about noise generation are available in the reference paper.
    Data Files

    The reference PPG signal can be found in Datasets\GoodSignals\PPG and the simultaneously acquired ECG in Datasets\GoodSignals\ECG. The folder Datasets\NoisySignals contains 340 noisy PPG signals affected by different levels of noise. The file names describe the intensity of the noise, evaluated in terms of the standard deviation of the random noise used as input for the noise generator (see reference paper). Five noisy signals are produced for every noise level by running the noise generator with five random seeds each.

    Naming convention: ppg_stdx_y denotes the y-th noisy PPG signal produced using noise with a standard deviation of x.

    Datasets\BPMs contains the ground truth for the heart-rate estimation, computed in windows of 8 s with an overlap of 2 s.

    Code

    The folder Code contains the MATLAB scripts that generate the noisy files by producing realistic noise with the function noiseGenerator.

    When referencing this material, please cite: Masinelli, G.; Dell'Agnola, F.; Valdés, A.A.; Atienza, D. SPARE: A Spectral Peak Recovery Algorithm for PPG Signals Pulsewave Reconstruction in Multimodal Wearable Devices. Sensors 2021, 21, 2725. https://doi.org/10.3390/s21082725
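    The two artifact behaviours described above (sudden spikes followed by slowly varying baseline offsets) can be sketched as follows. This is an illustrative Python analogue only, with assumed parameters; the actual generator is the MATLAB function noiseGenerator specified in the reference paper:

    ```python
    import numpy as np

    def realistic_ppg_noise(n_samples, fs=250.0, std=0.1, spike_rate=0.2, seed=0):
        """Illustrative motion-artifact model: sudden spikes at random instants,
        each followed by a slowly decaying baseline offset (baseline wander).
        All parameter values here are assumptions for demonstration."""
        rng = np.random.default_rng(seed)
        noise = np.zeros(n_samples)
        n_spikes = rng.poisson(spike_rate * n_samples / fs)  # ~spike_rate spikes per second
        for t in rng.integers(0, n_samples, n_spikes):
            amp = rng.normal(0.0, std) * 10.0     # sudden spike amplitude
            tau = rng.uniform(0.5, 3.0) * fs      # decay constant in samples
            decay = amp * np.exp(-np.arange(n_samples - t) / tau)
            noise[t:] += decay                    # offset persists and decays after the spike
        return noise

    # Corrupt a 72 s, 250 Hz surrogate for the reference signal (~72 bpm sinusoid)
    ppg_clean = np.sin(2 * np.pi * 1.2 * np.arange(18000) / 250.0)
    ppg_noisy = ppg_clean + realistic_ppg_noise(18000, std=0.1)
    ```

    Varying the std argument mirrors the role of the noise standard deviation encoded in the ppg_stdx_y file names, so an algorithm can be swept across noise levels in the same way the database does.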

