100+ datasets found
  1. Dataset of article: Synthetic Datasets Generator for Testing Information...

    • ieee-dataport.org
    Updated Mar 13, 2020
    Cite
    Carlos Santos (2020). Dataset of article: Synthetic Datasets Generator for Testing Information Visualization and Machine Learning Techniques and Tools [Dataset]. https://ieee-dataport.org/open-access/dataset-article-synthetic-datasets-generator-testing-information-visualization-and
    Dataset updated
    Mar 13, 2020
    Authors
    Carlos Santos
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset used in the article entitled 'Synthetic Datasets Generator for Testing Information Visualization and Machine Learning Techniques and Tools'. These datasets can be used to test several characteristics in machine learning and data processing algorithms.

  2. SynthAer - a synthetic dataset of semantically annotated aerial images

    • figshare.com
    zip
    Updated Sep 13, 2018
    Cite
    Maria Scanlon (2018). SynthAer - a synthetic dataset of semantically annotated aerial images [Dataset]. http://doi.org/10.6084/m9.figshare.7083242.v1
    Available download formats: zip
    Dataset updated
    Sep 13, 2018
    Dataset provided by
    figshare
    Authors
    Maria Scanlon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SynthAer is a dataset consisting of synthetic aerial images with pixel-level semantic annotations from a suburban scene generated using the 3D modelling tool Blender. SynthAer contains three time-of-day variations for each image - one for lighting conditions at dawn, one for midday, and one for dusk.

  3. Unimelb Corridor Synthetic dataset

    • figshare.unimelb.edu.au
    png
    Updated May 30, 2023
    Cite
    Debaditya Acharya; KOUROSH KHOSHELHAM; STEPHAN WINTER (2023). Unimelb Corridor Synthetic dataset [Dataset]. http://doi.org/10.26188/5dd8b8085b191
    Available download formats: png
    Dataset updated
    May 30, 2023
    Dataset provided by
    The University of Melbourne
    Authors
    Debaditya Acharya; KOUROSH KHOSHELHAM; STEPHAN WINTER
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data-set is supplementary material related to the generation of synthetic images of a corridor in the University of Melbourne, Australia, from a building information model (BIM). It was generated to check the ability of deep learning algorithms to learn the task of indoor localisation from synthetic images when tested on real images.

    The following naming convention is used for the data-sets. The brackets show the number of images in each data-set.

    REAL DATA
    Real -----------------> Real images (949 images)
    Gradmag-Real ---------> Gradmag of real data (949 images)

    SYNTHETIC DATA
    Syn-Car --------------> Cartoonish images (2500 images)
    Syn-pho-real ---------> Synthetic photo-realistic images (2500 images)
    Syn-pho-real-tex -----> Synthetic photo-realistic textured (2500 images)
    Syn-Edge -------------> Edge render images (2500 images)
    Gradmag-Syn-Car ------> Gradmag of Cartoonish images (2500 images)

    Each folder contains the images and their respective ground-truth poses in the following format: [ImageName X Y Z w p q r].

    To generate the synthetic data-set, we define a trajectory in the 3D indoor model. The points in the trajectory serve as the ground truth poses of the synthetic images. The height of the trajectory was kept in the range of 1.5–1.8 m from the floor, which is the usual height of holding a camera in hand. Artificial point light sources were placed to illuminate the corridor (except for Edge render images). The length of the trajectory was approximately 30 m. A virtual camera was moved along the trajectory to render four different sets of synthetic images in Blender*. The intrinsic parameters of the virtual camera were kept identical to the real camera (VGA resolution, focal length of 3.5 mm, no distortion modeled). We have rendered images along the trajectory at 0.05 m intervals and ± 10° tilt.

    The main difference between the cartoonish (Syn-car) and photo-realistic images (Syn-pho-real) is the model of rendering. Photo-realistic rendering is a physics-based model that traces the path of light rays in the scene, which is similar to the real world, whereas the cartoonish rendering only roughly traces the path of light rays. The photorealistic textured images (Syn-pho-real-tex) were rendered by adding repeating synthetic textures to the 3D indoor model, such as the textures of brick, carpet and wooden ceiling. The realism of the photo-realistic rendering comes at the cost of rendering time. However, the rendering times of the photo-realistic data-sets were considerably reduced with the help of a GPU. Note that the naming convention used for the data-sets (e.g. Cartoonish) follows Blender terminology.

    An additional data-set (Gradmag-Syn-car) was derived from the cartoonish images by taking the edge gradient magnitude of the images and suppressing weak edges below a threshold. The edge rendered images (Syn-edge) were generated by rendering only the edges of the 3D indoor model, without taking into account the lighting conditions. This data-set is similar to the Gradmag-Syn-car data-set but does not contain the effects of scene illumination, such as reflections and shadows.

    *Blender is an open-source 3D computer graphics software and finds its applications in video games, animated films, simulation and visual art. For more information please visit: http://www.blender.org

    Please cite the papers if you use the data-set:
    1) Acharya, D., Khoshelham, K., and Winter, S., 2019. BIM-PoseNet: Indoor camera localisation using a 3D indoor model and deep learning from synthetic images. ISPRS Journal of Photogrammetry and Remote Sensing. 150: 245-258.
    2) Acharya, D., Singha Roy, S., Khoshelham, K. and Winter, S. 2019. Modelling uncertainty of single image indoor localisation using a 3D model and deep learning. In ISPRS Annals of Photogrammetry, Remote Sensing & Spatial Information Sciences, IV-2/W5, pages 247-254.
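
    The ground-truth pose format [ImageName X Y Z w p q r] can be read with a few lines of Python. This is a minimal sketch, assuming whitespace-separated fields with one record per line; the file layout, the record used below, and the names Pose and parse_pose_line are illustrative assumptions, not part of the data-set.

```python
from typing import NamedTuple, Tuple


class Pose(NamedTuple):
    image_name: str
    position: Tuple[float, float, float]            # X, Y, Z
    orientation: Tuple[float, float, float, float]  # w, p, q, r


def parse_pose_line(line: str) -> Pose:
    """Parse one 'ImageName X Y Z w p q r' record (whitespace-separated, assumed)."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}: {line!r}")
    name = fields[0]
    x, y, z, w, p, q, r = (float(v) for v in fields[1:])
    return Pose(name, (x, y, z), (w, p, q, r))


# Hypothetical record, for illustration only (not taken from the data-set).
example = parse_pose_line("img_0001.png 1.0 2.5 1.6 1.0 0.0 0.0 0.0")
print(example.position)
```

    Keeping position and orientation as separate tuples makes it easy to compute position and orientation errors independently when comparing predicted and ground-truth poses.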

  4. pGAN Synthetic Dataset: A Deep Learning Approach to Private Data Sharing of...

    • explore.openaire.eu
    • zenodo.org
    Updated Jun 25, 2021
    Cite
    Jason Plawinski; Hanxi Sun; Sajanth Subramaniam; Amir Jamaludin; Timor Kadir; Aimee Readie; Gregory Ligozio; David Ohlssen; Thibaud Coroller; Mark Baillie (2021). pGAN Synthetic Dataset: A Deep Learning Approach to Private Data Sharing of Medical Images Using Conditional GANs [Dataset]. http://doi.org/10.5281/zenodo.5031880
    Dataset updated
    Jun 25, 2021
    Authors
    Jason Plawinski; Hanxi Sun; Sajanth Subramaniam; Amir Jamaludin; Timor Kadir; Aimee Readie; Gregory Ligozio; David Ohlssen; Thibaud Coroller; Mark Baillie
    Description

    Synthetic dataset for "A Deep Learning Approach to Private Data Sharing of Medical Images Using Conditional GANs".

    Dataset specification:
    • MRI images of vertebral units, labelled by region
    • The dataset comprises 10000 pairs of images and labels
    • Image and label pair number k can be selected with synthetic_dataset['images'][k] and synthetic_dataset['regions'][k]
    • Images are 3D, of size (9, 64, 64)
    • Regions are stored as an integer. Mapping: 0: cervical, 1: thoracic, 2: lumbar

    Arxiv paper: https://arxiv.org/abs/2106.13199
    Github code: https://github.com/tcoroller/pGAN/

    Abstract: Sharing data from clinical studies can facilitate innovative data-driven research and ultimately lead to better public health. However, sharing biomedical data can put sensitive personal information at risk. This is usually solved by anonymization, which is a slow and expensive process. An alternative to anonymization is sharing a synthetic dataset that behaves similarly to the real data but preserves privacy. As part of the collaboration between Novartis and the Oxford Big Data Institute, we generate a synthetic dataset based on the COSENTYX Ankylosing Spondylitis (AS) clinical study. We apply an Auxiliary Classifier GAN (ac-GAN) to generate synthetic magnetic resonance images (MRIs) of vertebral units (VUs). The images are conditioned on the VU location (cervical, thoracic and lumbar). In this paper, we present a method for generating a synthetic dataset and conduct an in-depth analysis of its properties along three key metrics: image fidelity, sample diversity and dataset privacy.
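
    The access pattern given in the dataset specification can be sketched as follows. How the published file is stored and loaded is not stated, so this example builds a small in-memory stand-in with the same keys and image shape; the stand-in data, REGION_NAMES, and get_pair are assumptions for illustration.

```python
import numpy as np

# Region codes as given in the dataset description.
REGION_NAMES = {0: "cervical", 1: "thoracic", 2: "lumbar"}

# Stand-in for the real container (assumption): a dict holding an array of
# images and an array of integer labels. Only 4 pairs are created here,
# whereas the published dataset has 10000.
rng = np.random.default_rng(0)
synthetic_dataset = {
    "images": rng.random((4, 9, 64, 64), dtype=np.float32),
    "regions": np.array([0, 2, 1, 2]),
}


def get_pair(dataset, k):
    """Return image k together with the name of its region."""
    image = dataset["images"][k]
    region = int(dataset["regions"][k])
    return image, REGION_NAMES[region]


image, region_name = get_pair(synthetic_dataset, 1)
print(image.shape, region_name)  # (9, 64, 64) lumbar
```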

  5. Data from: A large synthetic dataset for machine learning applications in...

    • zenodo.org
    csv, json, png, zip
    Updated Mar 25, 2025
    Cite
    Marc Gillioz; Guillaume Dubuis; Philippe Jacquod (2025). A large synthetic dataset for machine learning applications in power transmission grids [Dataset]. http://doi.org/10.5281/zenodo.13378476
    Available download formats: zip, png, csv, json
    Dataset updated
    Mar 25, 2025
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Marc Gillioz; Guillaume Dubuis; Philippe Jacquod
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    With the ongoing energy transition, power grids are evolving fast. They operate more and more often close to their technical limit, under more and more volatile conditions. Fast, essentially real-time computational approaches to evaluate their operational safety, stability and reliability are therefore highly desirable. Machine Learning methods have been advocated to solve this challenge, however they are heavy consumers of training and testing data, while historical operational data for real-world power grids are hard if not impossible to access.

    This dataset contains long time series for production, consumption, and line flows, amounting to 20 years of data with a time resolution of one hour, for several thousand loads and several hundred generators of various types representing the ultra-high-voltage transmission grid of continental Europe. The synthetic time series have been statistically validated against real-world data.

    Data generation algorithm

    The algorithm is described in a Nature Scientific Data paper. It relies on the PanTaGruEl model of the European transmission network -- the admittance of its lines as well as the location, type and capacity of its power generators -- and aggregated data gathered from the ENTSO-E transparency platform, such as power consumption aggregated at the national level.

    Network

    The network information is encoded in the file europe_network.json. It is given in PowerModels format, which is itself derived from MatPower and compatible with PandaPower. The network features 7822 power lines and 553 transformers connecting 4097 buses, to which are attached 815 generators of various types.
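
    As a quick sanity check, the component counts quoted above can be compared against the network file. The sketch below assumes the usual PowerModels layout (top-level tables named 'bus', 'branch', 'gen' and 'load', with a boolean 'transformer' flag on each branch); these key names are assumptions and should be verified against the actual file. It runs on a tiny stand-in network, and summarize_network is a hypothetical helper.

```python
import json
import os
import tempfile


def summarize_network(path):
    """Count components in a PowerModels-style JSON file.

    Assumes top-level tables 'bus', 'branch', 'gen' and 'load', and a boolean
    'transformer' flag on each branch (check the actual file before relying
    on this).
    """
    with open(path) as f:
        net = json.load(f)
    branches = net.get("branch", {})
    n_transformers = sum(1 for b in branches.values() if b.get("transformer", False))
    return {
        "buses": len(net.get("bus", {})),
        "lines": len(branches) - n_transformers,
        "transformers": n_transformers,
        "generators": len(net.get("gen", {})),
        "loads": len(net.get("load", {})),
    }


# Tiny stand-in network so the sketch runs without the real europe_network.json.
toy = {
    "bus": {"1": {}, "2": {}, "3": {}},
    "branch": {"1": {"transformer": False}, "2": {"transformer": True}},
    "gen": {"1": {}},
    "load": {"1": {}, "2": {}},
}
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy_network.json")
    with open(toy_path, "w") as f:
        json.dump(toy, f)
    summary = summarize_network(toy_path)
print(summary)
```

    If the assumptions hold, calling summarize_network('europe_network.json') should report 4097 buses, 7822 lines, 553 transformers and 815 generators.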

    Time series

    The time series forming the core of this dataset are given in CSV format. Each CSV file is a table with 8736 rows, one for each hourly time step of a 364-day year. All years are truncated to exactly 52 weeks of 7 days, and start on a Monday (the load profiles are typically different during weekdays and weekends). The number of columns depends on the type of table: there are 4097 columns in load files, 815 for generators, and 8375 for lines (including transformers). Each column is described by a header corresponding to the element identifier in the network file. All values are given in per-unit, both in the model file and in the tables, i.e. they are multiples of a base unit taken to be 100 MW.
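
    Because all values are in per-unit with a 100 MW base, converting a table to megawatts is a single multiplication. A minimal sketch on a stand-in table (the column names "1" and "2" are hypothetical identifiers):

```python
import pandas as pd

BASE_MW = 100.0             # base unit stated in the description
HOURS_PER_YEAR = 24 * 364   # 8736 rows per table

# The lines tables cover power lines and transformers together.
N_LINES, N_TRANSFORMERS = 7822, 553
N_LINE_COLUMNS = N_LINES + N_TRANSFORMERS  # 8375 columns

# Stand-in for the first rows of a load table, in per-unit.
loads_pu = pd.DataFrame({"1": [0.5, 0.25], "2": [1.5, 1.0]})

# Per-unit values are multiples of the base, so MW = per-unit * 100.
loads_mw = loads_pu * BASE_MW
print(loads_mw)
```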

    There are 20 tables of each type, labeled with a reference year (2016 to 2020) and an index (1 to 4), zipped into archive files arranged by year. This amounts to a total of 20 years of synthetic data. When using loads, generators, and lines profiles together, it is important to use the same label: for instance, the files loads_2020_1.csv, gens_2020_1.csv, and lines_2020_1.csv represent the same year of the dataset, whereas gens_2020_2.csv is unrelated (it actually shares some features, such as nuclear profiles, but it is based on a dispatch with distinct loads).
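
    One way to avoid mixing labels is to build the three file names from a single (year, index) pair. The helper below is a hypothetical convenience based on the file names quoted above, not part of the dataset's own tooling.

```python
from itertools import product

YEARS = range(2016, 2021)   # reference years 2016 to 2020
INDICES = range(1, 5)       # indices 1 to 4
KINDS = ("loads", "gens", "lines")


def matched_files(year, index):
    """Return the loads, gens and lines file names that share one label."""
    return {kind: f"{kind}_{year}_{index}.csv" for kind in KINDS}


all_labels = list(product(YEARS, INDICES))
print(len(all_labels))  # 20
print(matched_files(2020, 1))
```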

    Usage

    The time series can be used without reference to the network file, simply using all or a selection of columns of the CSV files, depending on the needs. We show below how to select series from a particular country, or how to aggregate hourly time steps into days or weeks. These examples use Python and the data analysis library pandas, but other frameworks can be used as well (Matlab, Julia). Since all the yearly time series are periodic, it is always possible to define a coherent time window modulo the length of the series.
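
    The remark about periodic series can be made concrete with modular indexing: a window that runs past the last row simply continues from the first row. This is a minimal sketch on a stand-in table; periodic_window and the column name load_1 are hypothetical.

```python
import numpy as np
import pandas as pd


def periodic_window(df, start, length):
    """Return `length` consecutive hourly rows starting at row `start`,
    wrapping around the end of the table."""
    positions = (start + np.arange(length)) % len(df)
    return df.iloc[positions].reset_index(drop=True)


# Stand-in table with 8736 hourly rows whose values equal the row number.
n_hours = 24 * 364
table = pd.DataFrame({"load_1": np.arange(n_hours, dtype=float)})

# One week starting three days before the end of the year.
week = periodic_window(table, start=n_hours - 72, length=24 * 7)
print(len(week), week["load_1"].iloc[71], week["load_1"].iloc[72])
```

    In the printed output, row 71 of the window is the last hour of the year and row 72 is the first hour again.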

    Selecting a particular country

    This example illustrates how to select generation data for Switzerland in Python. This can be done without parsing the network file, but using instead gens_by_country.csv, which contains a list of all generators for any country in the network. We start by importing the pandas library, and read the column of the file corresponding to Switzerland (country code CH):

    import pandas as pd
    CH_gens = pd.read_csv('gens_by_country.csv', usecols=['CH'], dtype=str)

    The object created in this way is a DataFrame with some null values (not all countries have the same number of generators). It can be turned into a list with:

    CH_gens_list = CH_gens.dropna().squeeze().to_list()

    Finally, we can import all the time series of Swiss generators from a given data table with

    pd.read_csv('gens_2016_1.csv', usecols=CH_gens_list)

    The same procedure can be applied to loads using the list contained in the file loads_by_country.csv.

    Averaging over time

    This second example shows how to change the time resolution of the series. Suppose that we are interested in all the loads from a given table, which are given by default with a one-hour resolution:

    hourly_loads = pd.read_csv('loads_2018_3.csv')

    To get a daily average of the loads, we can use:

    daily_loads = hourly_loads.groupby([t // 24 for t in range(24 * 364)]).mean()

    This results in series of length 364. To average further over entire weeks and get series of length 52, we use:

    weekly_loads = hourly_loads.groupby([t // (24 * 7) for t in range(24 * 364)]).mean()
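
    The same aggregation can be written without hard-coding the table length, which also guards against tables whose length is not a whole number of blocks. A sketch on stand-in data (average_blocks and the column name load_1 are hypothetical):

```python
import numpy as np
import pandas as pd


def average_blocks(df, hours_per_block):
    """Average consecutive blocks of `hours_per_block` rows."""
    if len(df) % hours_per_block != 0:
        raise ValueError("table length is not a multiple of the block size")
    block_id = np.arange(len(df)) // hours_per_block
    return df.groupby(block_id).mean()


# Stand-in hourly table: 8736 rows whose values equal the row number.
hourly = pd.DataFrame({"load_1": np.arange(24 * 364, dtype=float)})

daily = average_blocks(hourly, 24)
weekly = average_blocks(hourly, 24 * 7)
print(daily.shape, weekly.shape)  # (364, 1) (52, 1)
```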

    Source code

    The code used to generate the dataset is freely available at https://github.com/GeeeHesso/PowerData. It consists of two packages and several documentation notebooks. The first package, written in Python, provides functions to handle the data and to generate synthetic series based on historical data. The second package, written in Julia, is used to perform the optimal power flow. The documentation in the form of Jupyter notebooks contains numerous examples on how to use both packages. The entire workflow used to create this dataset is also provided, starting from raw ENTSO-E data files and ending with the synthetic dataset given in the repository.

    Funding

    This work was supported by the Cyber-Defence Campus of armasuisse and by an internal research grant of the Engineering and Architecture domain of HES-SO.

  6. Synthetic dataset - Using data-driven ML towards improving diagnosis of ACS

    • healthdatagateway.org
    unknown
    Updated Oct 9, 2023
    Cite
    This publication uses data from PIONEER, an ethically approved database and analytical environment (East Midlands Derby Research Ethics 20/EM/0158) (2023). Synthetic dataset - Using data-driven ML towards improving diagnosis of ACS [Dataset]. https://healthdatagateway.org/dataset/138
    Available download formats: unknown
    Dataset updated
    Oct 9, 2023
    Dataset authored and provided by
    This publication uses data from PIONEER, an ethically approved database and analytical environment (East Midlands Derby Research Ethics 20/EM/0158)
    License

    https://www.pioneerdatahub.co.uk/data/data-request-process/

    Description

    Background: Acute compartment syndrome (ACS) is an emergency orthopaedic condition wherein a rapid rise in compartmental pressure compromises blood perfusion to the tissues, leading to ischaemia and muscle necrosis. This serious condition is often misdiagnosed or associated with significant diagnostic delay, and can lead to limb amputations and death.

    The most common causes of ACS are high impact trauma, especially fractures of the lower limbs which account for 40% of ACS cases. ACS is a challenge to diagnose and treat effectively, with differing clinical thresholds being utilised which can result in unnecessary osteotomy. The highly granular synthetic data for over 900 patients with ACS provide the following key parameters to support critical research into this condition:

    1. Patient data (injury type, location, age, sex, pain levels, pre-injury status and comorbidities)
    2. Physiological parameters (intracompartmental pressure, pH, tissue oxygenation, compartment hardness)
    3. Muscle biomarkers (creatine kinase, myoglobin, lactate dehydrogenase)
    4. Blood vessel damage biomarkers (glycocalyx shedding markers, endothelial permeability markers)

    PIONEER geography: The West Midlands (WM) has a population of 5.9 million & includes a diverse ethnic & socio-economic mix. UHB is one of the largest NHS Trusts in England, providing direct acute services & specialist care across four hospital sites, with 2.2 million patient episodes per year, 2750 beds & an expanded 250 ITU bed capacity during COVID. UHB runs a fully electronic healthcare record (EHR) (PICS; Birmingham Systems), a shared primary & secondary care record (Your Care Connected) & a patient portal “My Health”.

    Scope: Enabling data-driven research and machine learning models towards improving the diagnosis of Acute compartment syndrome. Longitudinal & individually linked, so that the preceding & subsequent health journey can be mapped & healthcare utilisation prior to & after admission understood. The dataset includes highly granular patient demographics, physiological parameters, muscle biomarkers, blood biomarkers and co-morbidities taken from ICD-10 & SNOMED-CT codes. Serial, structured data pertaining to process of care (timings and admissions), presenting complaint, lab analysis results (eGFR, troponin, CRP, INR, ABG glucose), systolic and diastolic blood pressures, procedures and surgery details.

    Available supplementary data: ACS cohort, Matched controls; ambulance, OMOP data. Available supplementary support: Analytics, Model build, validation & refinement; A.I.; Data partner support for ETL (extract, transform & load) process, Clinical expertise, Patient & end-user access, Purchaser access, Regulatory requirements, Data-driven trials, “fast screen” services.

  7. Data from: Virtual SAR: A Synthetic Dataset for Deep Learning based Speckle...

    • ieee-dataport.org
    Updated Jun 12, 2020
    Cite
    Shrey Dabhi (2020). Virtual SAR: A Synthetic Dataset for Deep Learning based Speckle Noise Reduction Algorithms [Dataset]. https://ieee-dataport.org/open-access/virtual-sar-synthetic-dataset-deep-learning-based-speckle-noise-reduction-algorithms
    Dataset updated
    Jun 12, 2020
    Authors
    Shrey Dabhi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We propose a standard synthetic data set for the training of speckle reduction algorithms.

  8. Data from: Synthetic time series data generation for edge analytics

    • zenodo.org
    bin
    Updated Nov 25, 2021
    Cite
    Subarmaniam Kannan (2021). Synthetic time series data generation for edge analytics [Dataset]. http://doi.org/10.5281/zenodo.5673806
    Available download formats: bin
    Dataset updated
    Nov 25, 2021
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Subarmaniam Kannan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In this research, we create synthetic data with features that resemble data from IoT devices. We use an existing air quality dataset that includes temperature and gas sensor measurements. This real-time dataset includes component values for the Air Quality Index (AQI) and ppm concentrations of various polluting gases. We build a JavaScript Object Notation (JSON) model to capture the distribution of variables and structure of this real dataset to generate the synthetic data. Based on the synthetic dataset and original dataset, we create a comparative predictive model. Analysis of the synthetic dataset predictive model shows that it can be successfully used for edge analytics purposes, replacing real-world datasets. There is no significant difference between the real-world dataset and the synthetic dataset. The generated synthetic data requires no modification to suit the edge computing requirements. The framework can generate correct synthetic datasets based on JSON schema attributes. The accuracy, precision, and recall values for the real and synthetic datasets indicate that the logistic regression model is capable of successfully classifying data.
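
    The idea of generating records from a JSON description of each variable can be illustrated with a small sketch. The schema below is entirely hypothetical (the variable names, distribution types and parameters are invented for illustration) and is not the model used by the dataset's author.

```python
import json
import random

# Hypothetical JSON model: one distribution specification per variable.
MODEL_JSON = """
{
  "temperature_c": {"type": "normal", "mean": 25.0, "std": 3.0},
  "co_ppm": {"type": "uniform", "low": 0.0, "high": 10.0},
  "aqi_category": {"type": "categorical",
                   "values": ["good", "moderate", "poor"],
                   "weights": [0.6, 0.3, 0.1]}
}
"""


def sample_record(model, rng):
    """Draw one synthetic record according to the per-variable specs."""
    record = {}
    for name, spec in model.items():
        kind = spec["type"]
        if kind == "normal":
            record[name] = rng.gauss(spec["mean"], spec["std"])
        elif kind == "uniform":
            record[name] = rng.uniform(spec["low"], spec["high"])
        elif kind == "categorical":
            record[name] = rng.choices(spec["values"], weights=spec["weights"])[0]
        else:
            raise ValueError(f"unknown variable type: {kind}")
    return record


model = json.loads(MODEL_JSON)
rng = random.Random(42)  # fixed seed so the run is repeatable
records = [sample_record(model, rng) for _ in range(5)]
print(records[0])
```

    Sampling each variable independently ignores correlations between sensors; a model meant to replace real data would also need to capture those.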

  9. Cardiovascular synthetic tabular data

    • data.niaid.nih.gov
    Updated Jul 8, 2024
    Cite
    Arroni del Riego, Sergio (2024). Cardiovascular synthetic tabular data [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_12575888
    Dataset updated
    Jul 8, 2024
    Dataset authored and provided by
    Arroni del Riego, Sergio
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset focuses on cardiovascular diseases. It is generated using a hybrid machine learning model that combines diffusion models with Transformers, emphasizing data privacy. The dataset has been validated for quality and utility, with promising results.

    Validation and Metrics: The dataset has undergone validation processes to ensure quality, utility, and privacy. These validations involved:

    Distance to the Closest Record (DCR): The dataset achieved a DCR of 1.2879. The DCR measures the distance from each generated record to the closest record in the original dataset. A higher DCR indicates that synthetic records lie farther from the real records, which points to a lower risk of reproducing individual patients rather than to closer statistical similarity.

    Membership Inference Attack Accuracy: The dataset scored 0.6780 in this metric. Membership inference attack accuracy measures how often an attacker correctly infers whether a particular data point was part of the training dataset. A score of 0.5 corresponds to random guessing, so values closer to 0.5 indicate stronger privacy; at 0.6780, the attack performs moderately better than chance.

    Statistical Tests: Comprehensive statistical tests were conducted to compare the synthetic data with real data. These tests ensure that the synthetic data has similar statistical properties and distributions to the original data.

    Machine Learning Efficiency: The utility of the dataset was also validated using machine learning models to ensure that the synthetic data is effective for training and can produce reliable predictive models. The results showed that models trained on this dataset performed well, reinforcing the practical utility of the data.

    The high DCR value and the membership inference attack accuracy highlight the balance between data utility and privacy, making this dataset an invaluable resource for researchers and practitioners focusing on cardiovascular diseases and machine learning.
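
    For readers unfamiliar with the DCR metric, a minimal sketch of one common way to compute it is shown below. It assumes purely numeric features and Euclidean distance; the description does not state which distance, feature scaling, or aggregation (mean, median) produced the reported 1.2879, so this is an illustration rather than the authors' procedure.

```python
import numpy as np


def distance_to_closest_record(synthetic, real):
    """For each synthetic row, Euclidean distance to its nearest real row."""
    diffs = synthetic[:, None, :] - real[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    return dists.min(axis=1)


# Toy data: the first synthetic row copies a real row, the second does not.
real = np.array([[0.0, 0.0], [1.0, 1.0]])
synthetic = np.array([[0.0, 0.0], [4.0, 5.0]])

dcr = distance_to_closest_record(synthetic, real)
print(dcr)  # [0. 5.]
```

    Under this definition, a value of 0 means a synthetic row coincides with a real row.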

  10. Tuberculosis X-Ray Dataset (Synthetic)

    • kaggle.com
    Updated Mar 12, 2025
    Cite
    Arif Miah (2025). Tuberculosis X-Ray Dataset (Synthetic) [Dataset]. https://www.kaggle.com/datasets/miadul/tuberculosis-x-ray-dataset-synthetic
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 12, 2025
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Arif Miah
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    📝 Dataset Summary

    This synthetic dataset contains 20,000 records of X-ray data labeled as "Normal" or "Tuberculosis". It is specifically created for training and evaluating classification models in the field of medical image analysis. The dataset aims to aid in building machine learning and deep learning models for detecting tuberculosis from X-ray data.

    💡 Context

    Tuberculosis (TB) is a highly infectious disease that primarily affects the lungs. Accurate detection of TB using chest X-rays can significantly enhance medical diagnostics. However, real-world datasets are often scarce or restricted due to privacy concerns. This synthetic dataset bridges that gap by providing simulated patient data while maintaining realistic distributions and patterns commonly observed in TB cases.

    🗃️ Dataset Details

    • Number of Rows: 20,000
    • Number of Columns: 15
    • File Format: CSV
    • Resolution: Simulated patient data, not real X-ray images
    • Size: Approximately 10 MB

    🏷️ Columns and Descriptions

    Column Name          Description
    Patient_ID           Unique ID for each patient (e.g., PID000001)
    Age                  Age of the patient (in years)
    Gender               Gender of the patient (Male/Female)
    Chest_Pain           Presence of chest pain (Yes/No)
    Cough_Severity       Severity of cough (Scale: 0-9)
    Breathlessness       Severity of breathlessness (Scale: 0-4)
    Fatigue              Level of fatigue experienced (Scale: 0-9)
    Weight_Loss          Weight loss (in kg)
    Fever                Level of fever (Mild, Moderate, High)
    Night_Sweats         Whether night sweats are present (Yes/No)
    Sputum_Production    Level of sputum production (Low, Medium, High)
    Blood_in_Sputum      Presence of blood in sputum (Yes/No)
    Smoking_History      Smoking status (Never, Former, Current)
    Previous_TB_History  Previous tuberculosis history (Yes/No)
    Class                Target variable indicating the condition (Normal, Tuberculosis)

    🔍 Data Generation Process

    The dataset was generated using Python with the following libraries:
    - Pandas: To create and save the dataset as a CSV file
    - NumPy: To generate random numbers and simulate realistic data
    - Random Seed: Set to ensure reproducibility

    The target variable "Class" has a 70-30 distribution between Normal and Tuberculosis cases. The data is randomly generated with realistic patterns that mimic typical TB symptoms and demographic distributions.
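
    A generation process of this kind can be sketched with NumPy and pandas. The example below is not the creator's script: it covers only a subset of the columns, draws every column independently, and assumes an age range of 18 to 89 and that "Normal" is the 70% class.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed for reproducibility
n_rows = 1000                    # the published file has 20,000 rows

df = pd.DataFrame({
    "Patient_ID": [f"PID{i:06d}" for i in range(1, n_rows + 1)],
    "Age": rng.integers(18, 90, size=n_rows),            # age range is an assumption
    "Gender": rng.choice(["Male", "Female"], size=n_rows),
    "Cough_Severity": rng.integers(0, 10, size=n_rows),  # scale 0-9
    "Breathlessness": rng.integers(0, 5, size=n_rows),   # scale 0-4
    "Fever": rng.choice(["Mild", "Moderate", "High"], size=n_rows),
    # Assumed split: 70% Normal, 30% Tuberculosis.
    "Class": rng.choice(["Normal", "Tuberculosis"], size=n_rows, p=[0.7, 0.3]),
})

print(df["Class"].value_counts(normalize=True))
```

    Because the columns here are sampled independently, symptoms carry no information about the class; making symptom distributions depend on the class would be needed for a classifier to learn anything from the data.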

    🔧 Usage

    This dataset is intended for:
    - Machine Learning and Deep Learning classification tasks
    - Data exploration and feature analysis
    - Model evaluation and comparison
    - Educational and research purposes

    📊 Potential Applications

    1. Tuberculosis Detection Models: Train CNNs or other classification algorithms to detect TB.
    2. Healthcare Research: Analyze the correlation between symptoms and TB outcomes.
    3. Data Visualization: Perform EDA to uncover patterns and insights.
    4. Model Benchmarking: Compare various algorithms for TB detection.

    📑 License

    This synthetic dataset is open for educational and research use. Please credit the creator if used in any public or academic work.

    🙌 Acknowledgments

    This dataset was generated as a synthetic alternative to real-world data to help developers and researchers practice building and fine-tuning classification models without the constraints of sensitive patient data.

  11. Supplementary information files for A genetically-optimised artificial life...

    • repository.lboro.ac.uk
    pdf
    Updated Jun 1, 2023
    Cite
    Andrew Houston; Georgina Cosma (2023). Supplementary information files for A genetically-optimised artificial life algorithm for complexity-based synthetic dataset generation [Dataset]. http://doi.org/10.17028/rd.lboro.22354462.v1
    Available download formats: pdf
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Loughborough University
    Authors
    Andrew Houston; Georgina Cosma
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Supplementary files for article A genetically-optimised artificial life algorithm for complexity-based synthetic dataset generation

    Algorithmic evaluation is a vital step in developing new approaches to machine learning and relies on the availability of existing datasets. However, real-world datasets often do not cover the necessary complexity space required to understand an algorithm’s domains of competence. As such, the generation of synthetic datasets to fill gaps in the complexity space has gained attention, offering a means of evaluating algorithms when data is unavailable. Existing approaches to complexity-focused data generation are limited in their ability to generate solutions that invoke similar classification behaviour to real data. The present work proposes a novel method (Sy:Boid) for complexity-based synthetic data generation, adapting and extending the Boid algorithm that was originally intended for computer graphics simulations. Sy:Boid embeds the modified Boid algorithm within an evolutionary multi-objective optimisation algorithm to generate synthetic datasets which satisfy predefined magnitudes of complexity measures. Sy:Boid is evaluated and compared to labelling-based and sampling-based approaches to data generation to understand its ability to generate a wide variety of realistic datasets. Results demonstrate Sy:Boid is capable of generating datasets across a greater portion of the complexity space than existing approaches. Furthermore, the produced datasets were observed to invoke very similar classification behaviours to that of real data.

  12. f

    CK4Gen, High Utility Synthetic Survival Datasets

    • figshare.com
    zip
    Updated Nov 5, 2024
    Cite
    Nicholas Kuo (2024). CK4Gen, High Utility Synthetic Survival Datasets [Dataset]. http://doi.org/10.6084/m9.figshare.27611388.v1
    Available download formats: zip
    Dataset updated
    Nov 5, 2024
    Dataset provided by
    figshare
    Authors
    Nicholas Kuo
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview:
    This repository provides high-utility synthetic survival datasets generated using the CK4Gen framework, optimised to retain critical clinical characteristics for use in research and educational settings. Each dataset is based on a carefully curated ground truth dataset, processed with standardised variable definitions and analytical approaches, ensuring a consistent baseline for survival analysis.

    Description:
    The repository includes synthetic versions of four widely utilised and publicly accessible survival analysis datasets, each anchored in foundational studies and aligned with established ground truth variations to support robust clinical research and training.

    GBSG2: Based on Schumacher et al. [1]. The study evaluated the effects of hormonal treatment and chemotherapy duration in node-positive breast cancer patients, tracking recurrence-free and overall survival among 686 women over a median of 5 years. Our synthetic version is derived from a variation of the GBSG2 dataset available in the lifelines package [2], formatted to match the descriptions in Sauerbrei et al. [3], which we treat as the ground truth.

    ACTG320: Based on Hammer et al. [4]. The study investigates the impact of adding the protease inhibitor indinavir to a standard two-drug regimen for HIV-1 treatment. The original clinical trial involved 1,151 patients with prior zidovudine exposure and low CD4 cell counts, tracking outcomes over a median follow-up of 38 weeks. Our synthetic dataset is derived from a variation of the ACTG320 dataset available in the sksurv package [5], which we treat as the ground truth dataset.

    WHAS500: Based on Goldberg et al. [6]. The study follows 500 patients to investigate survival rates following acute myocardial infarction (MI), capturing a range of factors influencing MI incidence and outcomes. Our synthetic data replicates a ground truth variation from the sksurv package, which we treat as the ground truth dataset.

    FLChain: Based on Dispenzieri et al. [7]. The study assesses the prognostic relevance of serum immunoglobulin free light chains (FLCs) for overall survival in a large cohort of 15,859 participants. Our synthetic version is based on a variation available in the sksurv package, which we treat as the ground truth dataset.

    Notes:
    Please find an in-depth discussion on these datasets, as well as their generation process, in the link below, to our paper:
    https://arxiv.org/abs/2410.16872
    Kuo, et al. "CK4Gen: A Knowledge Distillation Framework for Generating High-Utility Synthetic Survival Datasets in Healthcare." arXiv preprint arXiv:2410.16872 (2024).

    References:
    [1]: Schumacher, et al. “Randomized 2 x 2 trial evaluating hormonal treatment and the duration of chemotherapy in node-positive breast cancer patients. German breast cancer study group.”, Journal of Clinical Oncology, 1994.
    [2]: Davidson-Pilon “lifelines: Survival Analysis in Python”, Journal of Open Source Software, 2019.
    [3]: Sauerbrei, et al. “Modelling the effects of standard prognostic factors in node-positive breast cancer”, British Journal of Cancer, 1999.
    [4]: Hammer, et al. “A controlled trial of two nucleoside analogues plus indinavir in persons with human immunodeficiency virus infection and cd4 cell counts of 200 per cubic millimeter or less”, New England Journal of Medicine, 1997.
    [5]: Pölsterl “scikit-survival: A library for time-to-event analysis built on top of scikit-learn”, Journal of Machine Learning Research, 2020.
    [6]: Goldberg, et al. “Incidence and case fatality rates of acute myocardial infarction (1975–1984): the Worcester heart attack study”, American Heart Journal, 1988.
    [7]: Dispenzieri, et al. “Use of nonclonal serum immunoglobulin free light chains to predict overall survival in the general population”, in Mayo Clinic Proceedings, 2012.

  13. f

    Aggregated economic and management sciences synthetic dataset for the UFS

    • ufs.figshare.com
    Updated Feb 22, 2023
    Cite
    Herkulaas Combrink (2023). Aggregated economic and management sciences synthetic dataset for the UFS [Dataset]. http://doi.org/10.38140/ufs.22128449.v1
    Dataset updated
    Feb 22, 2023
    Dataset provided by
    University of the Free State
    Authors
    Herkulaas Combrink
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is an aggregation of synthetic data created from an Economic and Management Sciences dataset, intended for use in analytics.

  14. Z

    Data from: A Large-scale Synthetic Pathological Dataset for Deep...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 9, 2023
    Cite
    Shaoting Zhang (2023). A Large-scale Synthetic Pathological Dataset for Deep Learning-enabled Segmentation of Breast Cancer [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6633720
    Dataset updated
    Jun 9, 2023
    Dataset provided by
    Kexin Ding
    Olivier Gevaert
    Mu Zhou
    He Wang
    Dimitris Metaxas
    Shaoting Zhang
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset access for the paper: A Large-scale Synthetic Pathological Dataset for Deep Learning-enabled Segmentation of Breast Cancer

  15. v

    Synthetic Data Generation Market By Offering (Solution/Platform, Services),...

    • verifiedmarketresearch.com
    Updated Mar 5, 2025
    Cite
    VERIFIED MARKET RESEARCH (2025). Synthetic Data Generation Market By Offering (Solution/Platform, Services), Data Type (Tabular, Text, Image, Video), Application (AI/ML Training & Development, Test Data Management), & Region for 2026-2032 [Dataset]. https://www.verifiedmarketresearch.com/product/synthetic-data-generation-market/
    Dataset updated
    Mar 5, 2025
    Dataset authored and provided by
    VERIFIED MARKET RESEARCH
    License

    https://www.verifiedmarketresearch.com/privacy-policy/

    Time period covered
    2026 - 2032
    Area covered
    Global
    Description

    Synthetic Data Generation Market size was valued at USD 0.4 Billion in 2024 and is projected to reach USD 9.3 Billion by 2032, growing at a CAGR of 46.5% from 2026 to 2032.

    The Synthetic Data Generation Market is driven by the rising demand for AI and machine learning, where high-quality, privacy-compliant data is crucial for model training. Businesses seek synthetic data to overcome real-data limitations, ensuring security, diversity, and scalability without regulatory concerns. Industries like healthcare, finance, and autonomous vehicles increasingly adopt synthetic data to enhance AI accuracy while complying with stringent privacy laws.

    Additionally, cost efficiency and faster data availability fuel market growth, reducing dependency on expensive, time-consuming real-world data collection. Advancements in generative AI, deep learning, and simulation technologies further accelerate adoption, enabling realistic synthetic datasets for robust AI model development.
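
    As a rough cross-check of the growth figures quoted in this description, the sketch below computes the compound annual growth rate implied by the two stated market sizes. The function name `implied_cagr` is a hypothetical helper, and annual compounding over the full 2024 to 2032 span is an assumption. The description quotes its 46.5% rate for the 2026 to 2032 window, whose starting value is not given, so the two rates are not directly comparable.

```python
def implied_cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate taking start_value to end_value over `years`."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Figures quoted in the description (USD billions).
value_2024 = 0.4
value_2032 = 9.3

# Assumption: growth compounds once per year across the whole 2024-2032 span.
rate_2024_2032 = implied_cagr(value_2024, value_2032, 2032 - 2024)
print(f"Implied CAGR 2024-2032: {rate_2024_2032:.1%}")
```

    The same helper can be applied to the 2026 to 2032 window once a 2026 base value is available.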

  16. Synthetic dataset of user interactions - postpartum depression.csv

    • zenodo.org
    • portalcientifico.uvigo.gal
    Updated Nov 7, 2024
    Cite
    Silvia García Méndez; Francisco de Arriba Pérez (2024). Synthetic dataset of user interactions - postpartum depression.csv [Dataset]. http://doi.org/10.5281/zenodo.14049634
    Dataset updated
    Nov 7, 2024
    Dataset provided by
    Zenodo http://zenodo.org/
    Authors
    Silvia García Méndez; Francisco de Arriba Pérez
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A synthetic data set composed of 200 users' utterances as possible answers to questions related to these topics:

    (i) Feeling sad or Tearful
    (ii) Irritable towards baby & partner
    (iii) Trouble sleeping at night
    (iv) Problems concentrating or making decisions
    (v) Overeating or loss of appetite
    (vi) Feeling of guilt
    (vii) Problems of bonding with baby
    (viii) Suicide attempt

  17. Z

    DJIN model of aging synthetic dataset

    • data.niaid.nih.gov
    Updated May 4, 2021
    Cite
    Arnold Mitnitski (2021). DJIN model of aging synthetic dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4733385
    Dataset updated
    May 4, 2021
    Dataset provided by
    Arnold Mitnitski
    Andrew Rutenberg
    Kenneth Rockwood
    Spencer Farrell
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The DJIN model of aging was trained on the English Longitudinal Study of Aging (ELSA). Here we have used the model to generate a large synthetic population of 9 million individuals. There are 3 million individuals for each baseline age of 65, 75, and 85 years simulated for 20 years. For each individual, we supply a health trajectory with 29 tracked health variables with mortality. Demographic and background health variables have been sampled based on the ELSA population demographics.

    Each Data_part includes 1.8 million individuals. The file_description.txt file describes the files, and health_columns.csv and background_columns.csv indicate the columns of the files.

    The ELSA dataset itself can be accessed at https://www.elsa-project.ac.uk/accessing-elsa-data.

    Code for the model is available at https://github.com/Spencerfar/djin-aging.
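
    The population figures in the DJIN description above can be checked with a few lines of arithmetic. This is only a sketch: the constant names are hypothetical, and the number of Data_part files is inferred from the stated totals rather than given in the description.

```python
# Figures stated in the dataset description.
BASELINE_AGES = (65, 75, 85)        # baseline ages in years
INDIVIDUALS_PER_AGE = 3_000_000     # simulated individuals per baseline age
INDIVIDUALS_PER_PART = 1_800_000    # individuals in each Data_part

# Total synthetic population across all baseline ages.
total_individuals = len(BASELINE_AGES) * INDIVIDUALS_PER_AGE

# Inferred (not stated in the description): how many Data_part files
# are needed to hold the whole population, and whether any are partial.
full_parts, leftover = divmod(total_individuals, INDIVIDUALS_PER_PART)

print(f"Total individuals: {total_individuals:,}")
print(f"Full parts: {full_parts}, leftover individuals: {leftover:,}")
```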

  18. D

    TiCaM: Synthetic Images Dataset

    • datasetninja.com
    Updated May 23, 2021
    Cite
    Jigyasa Katrolia; Jason Raphael Rambach; Bruno Mirbach (2021). TiCaM: Synthetic Images Dataset [Dataset]. https://datasetninja.com/ticam-synthetic-images
    Dataset updated
    May 23, 2021
    Dataset provided by
    Dataset Ninja
    Authors
    Jigyasa Katrolia; Jason Raphael Rambach; Bruno Mirbach
    License

    https://spdx.org/licenses/

    Description

    TiCaM Synthetic Images: A Time-of-Flight In-Car Cabin Monitoring Dataset is a time-of-flight dataset of car in-cabin images that provides a means to test extensive car cabin monitoring systems based on deep learning methods. The authors provide a synthetic image dataset of car cabin images similar to the real dataset, leveraging advanced simulation software’s capability to generate abundant data with little effort. This can be used to test domain adaptation between synthetic and real data for select classes. For both datasets the authors provide ground truth annotations for 2D and 3D object detection, as well as for instance segmentation.

  19. f

    Table1_Enhancing biomechanical machine learning with limited data:...

    • frontiersin.figshare.com
    pdf
    Updated Feb 14, 2024
    Cite
    Carlo Dindorf; Jonas Dully; Jürgen Konradi; Claudia Wolf; Stephan Becker; Steven Simon; Janine Huthwelker; Frederike Werthmann; Johanna Kniepert; Philipp Drees; Ulrich Betz; Michael Fröhlich (2024). Table1_Enhancing biomechanical machine learning with limited data: generating realistic synthetic posture data using generative artificial intelligence.pdf [Dataset]. http://doi.org/10.3389/fbioe.2024.1350135.s001
    Available download formats: pdf
    Dataset updated
    Feb 14, 2024
    Dataset provided by
    Frontiers
    Authors
    Carlo Dindorf; Jonas Dully; Jürgen Konradi; Claudia Wolf; Stephan Becker; Steven Simon; Janine Huthwelker; Frederike Werthmann; Johanna Kniepert; Philipp Drees; Ulrich Betz; Michael Fröhlich
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Objective: Biomechanical Machine Learning (ML) models, particularly deep-learning models, demonstrate the best performance when trained using extensive datasets. However, biomechanical data are frequently limited due to diverse challenges. Effective methods for augmenting data in developing ML models, specifically in the human posture domain, are scarce. Therefore, this study explored the feasibility of leveraging generative artificial intelligence (AI) to produce realistic synthetic posture data by utilizing three-dimensional posture data.

    Methods: Data were collected from 338 subjects through surface topography. A Variational Autoencoder (VAE) architecture was employed to generate and evaluate synthetic posture data, examining its distinguishability from real data by domain experts, ML classifiers, and Statistical Parametric Mapping (SPM). The benefits of incorporating augmented posture data into the learning process were exemplified by a deep autoencoder (AE) for automated feature representation.

    Results: Our findings highlight the challenge of differentiating synthetic data from real data for both experts and ML classifiers, underscoring the quality of synthetic data. This observation was also confirmed by SPM. By integrating synthetic data into AE training, the reconstruction error can be reduced compared to using only real data samples. Moreover, this study demonstrates the potential for reduced latent dimensions, while maintaining a reconstruction accuracy comparable to AEs trained exclusively on real data samples.

    Conclusion: This study emphasizes the prospects of harnessing generative AI to enhance ML tasks in the biomechanics domain.

  20. d

    Synthea lung cancer synthetic patient data series for ML

    • search.dataone.org
    Updated Nov 8, 2023
    Cite
    Chen, AJ (2023). Synthea lung cancer synthetic patient data series for ML [Dataset]. http://doi.org/10.7910/DVN/Q5LK5A
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Chen, AJ
    Description

    These synthetic patient datasets were created for machine learning (ML) study of lung cancer risk prediction in simulation of ML-enabled learning health systems (LHS). Five populations of 30K patients were generated by the Synthea patient generator. They were combined sequentially to form 5 populations of different sizes, from 30K to 150K patients. Patients with or without lung cancer were selected at a ratio of roughly 1:3, and their electronic health records (EHR) were processed into data table files ready for machine learning. The ML-ready table files also have the continuous numeric values converted to categorical values. Because Synthea patients closely resemble real patients, these ML-ready datasets can be used to develop and test ML algorithms and to train researchers. Unlike real patient data, these Synthea datasets can be shared with collaborators anywhere without privacy concerns. The first use of these datasets was in an LHS simulation study, which was published in Nature Scientific Reports (see https://www.nature.com/articles/s41598-022-23011-4).
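
    Two steps in the description above lend themselves to a short illustration: combining the five 30K batches sequentially, and converting continuous values to categories. The sketch below is not the authors' pipeline. The helper `to_category` and the age bin edges and labels are hypothetical choices made only for illustration.

```python
from bisect import bisect_right
from itertools import accumulate

BATCH_SIZE = 30_000   # patients per generated population (from the description)
N_BATCHES = 5         # number of generated populations (from the description)

# Sequentially combining batches yields populations of growing size.
cumulative_sizes = list(accumulate([BATCH_SIZE] * N_BATCHES))
print(cumulative_sizes)


def to_category(value: float, edges: list[float], labels: list[str]) -> str:
    """Map a continuous value to the label of the bin it falls into.

    `edges` must be sorted ascending; a value equal to an edge goes to
    the bin that starts at that edge.
    """
    if len(labels) != len(edges) + 1:
        raise ValueError("need exactly one more label than edges")
    return labels[bisect_right(edges, value)]


# Hypothetical bins, not taken from the dataset documentation.
AGE_EDGES = [40, 65]
AGE_LABELS = ["<40", "40-64", ">=65"]

print(to_category(58, AGE_EDGES, AGE_LABELS))
```

    Binning with sorted edges keeps the mapping explicit, so each categorical column can be traced back to its thresholds.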
