15 datasets found
  1. Synthetic Data Generation for Hard Drive Failure Prediction in Large-scale...

    • figshare.com
    zip
    Updated Apr 27, 2025
    Cite
    Chandranil Chakraborttii (2025). Synthetic Data Generation for Hard Drive Failure Prediction in Large-scale Systems [Dataset]. http://doi.org/10.6084/m9.figshare.28878830.v1
    Available download formats: zip
    Dataset updated
    Apr 27, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Chandranil Chakraborttii
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Accurate failure prediction is critical for the reliability of HPC facilities and data center storage systems. This study addresses data scarcity, privacy concerns, and class imbalance in HDD failure datasets by leveraging synthetic data generation. We propose an end-to-end framework to generate synthetic storage data using Generative Adversarial Networks and Diffusion models. We implement a data segmentation approach that accounts for temporal variation in disk access to generate high-fidelity synthetic data that replicates the nuanced temporal and feature-specific patterns of disk failures. Experimental results show that synthetic data achieves similarity scores of 0.81–0.89 and enhances failure prediction performance, with up to 3% improvement in accuracy and 2% in ROC-AUC. With only minor performance drops versus real-data training, synthetically trained models prove viable for predictive maintenance.

  2. Bitext Synthetic Data - Moving and storage (English language)

    • catalog.elra.info
    Updated Jul 18, 2023
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2023). Bitext Synthetic Data - Moving and storage (English language) [Dataset]. https://catalog.elra.info/en-us/repository/browse/ELRA-L0174/
    Dataset updated
    Jul 18, 2023
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    Description

    The Bitext Synthetic Data consists of pre-built training data for intent detection and is provided for 20 verticals in English (see ELRA-L0162 to ELRA-L0181). The data covers the most common intents for each vertical and includes a large number of example utterances for each intent, with optional entity/slot annotations for each utterance. The Moving and storage domain comprises 29 intents for English. Data is distributed as models or open text files.

  3. Synthetic dataset of hospitalised patients with an acute exacerbation of...

    • healthdatagateway.org
    unknown
    Updated Dec 17, 2024
    Cite
    This publication uses data from PIONEER, an ethically approved database and analytical environment (East Midlands Derby Research Ethics 20/EM/0158) (2024). Synthetic dataset of hospitalised patients with an acute exacerbation of asthma [Dataset]. https://healthdatagateway.org/dataset/1015
    Available download formats: unknown
    Dataset updated
    Dec 17, 2024
    Dataset authored and provided by
    This publication uses data from PIONEER, an ethically approved database and analytical environment (East Midlands Derby Research Ethics 20/EM/0158)
    License

    https://www.pioneerdatahub.co.uk/data/data-request-process/

    Description

    To support respiratory research, a synthetic asthma dataset was generated based on real-world data, originally documenting 381 patients with physician-confirmed asthma who were admitted to secondary care at a single centre in 2019. The dataset is highly detailed, covering demographics, structured physiological data, medication records, and clinical outcomes. The synthetic version extends to 561 patients admitted over a year, offering insights into patient patterns, risk factors, and treatment strategies.

    The dataset was created using the Synthetic Data Vault package, specifically employing the GAN synthesizer. Real data was first read and pre-processed, ensuring datetime columns were correctly parsed and identifiers were handled as strings. Metadata was defined to capture the schema, specifying field types and primary keys. This metadata guided the synthesizer in understanding the structure of the data. The GAN synthesizer was then fitted to the real data, learning the distributions and dependencies within. After fitting, the synthesizer generated synthetic data that mirrors the statistical properties and relationships of the original dataset.
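
    The pre-processing and metadata steps described above can be sketched in plain Python. This is only an illustration under stated assumptions: the column names, sample rows, and the shape of the metadata dictionary are hypothetical, and the sketch deliberately does not use or imitate the Synthetic Data Vault API. Fitting the GAN synthesizer and sampling from it are not shown.

```python
import csv
import io
from datetime import datetime

# Hypothetical extract; column names and values are invented for illustration.
RAW_CSV = """patient_id,admission_time,age
00123,2019-03-01 14:30:00,54
00456,2019-07-19 08:05:00,37
"""

# Hypothetical schema description: field types plus the primary key.
METADATA = {
    "primary_key": "patient_id",
    "columns": {
        "patient_id": "id",
        "admission_time": "datetime",
        "age": "numerical",
    },
}


def preprocess(text, metadata):
    """Parse rows so that each column matches the type declared in the metadata."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        row = {}
        for name, kind in metadata["columns"].items():
            value = raw[name]
            if kind == "datetime":
                row[name] = datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
            elif kind == "numerical":
                row[name] = float(value)
            else:
                # Identifiers stay strings, which preserves leading zeros.
                row[name] = value
        rows.append(row)
    return rows


rows = preprocess(RAW_CSV, METADATA)
print(rows[0])
```

    Keeping identifiers as strings matters because a value such as 00123 would otherwise be read as the number 123 and no longer match the original key.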

    Geography: The West Midlands has a population of 6 million & includes a diverse ethnic & socio-economic mix. UHB is one of the largest NHS Trusts in England, providing direct acute services & specialist care across four hospital sites, with 2.2 million patient episodes per year, 2750 beds & > 120 ITU bed capacity. UHB runs a fully electronic healthcare record (PICS; Birmingham Systems), a shared primary & secondary care record (Your Care Connected) & a patient portal “My Health”.

    Data set availability: Data access is available via the PIONEER Hub for projects which will benefit the public or patients. This can be by developing a new understanding of disease, by providing insights into how to improve care, or by developing new models, tools, treatments, or care processes. Data access can be provided to NHS, academic, commercial, policy and third sector organisations. Applications from SMEs are welcome. There is a single data access process, with public oversight provided by our public review committee, the Data Trust Committee. Contact pioneer@uhb.nhs.uk or visit www.pioneerdatahub.co.uk for more details.

    Available supplementary data: Real world data. Matched controls; ambulance and community data. Unstructured data (images). We can provide the dataset in OMOP and other common data models and can provide real-world data upon request.

    Available supplementary support: Analytics, model build, validation & refinement; A.I. support. Data partner support for ETL (extract, transform & load) processes. Bespoke and “off the shelf” Trusted Research Environment build and run. Consultancy with clinical, patient & end-user and purchaser access/ support. Support for regulatory requirements. Cohort discovery. Data-driven trials and “fast screen” services to assess population size.

  4. Synthetic Dataset of Hospital Admissions for an acute Stroke

    • healthdatagateway.org
    unknown
    Updated Dec 4, 2024
    Cite
    This publication uses data from PIONEER, an ethically approved database and analytical environment (East Midlands Derby Research Ethics 20/EM/0158) (2024). Synthetic Dataset of Hospital Admissions for an acute Stroke [Dataset]. https://healthdatagateway.org/en/dataset/1003
    Available download formats: unknown
    Dataset updated
    Dec 4, 2024
    Dataset authored and provided by
    This publication uses data from PIONEER, an ethically approved database and analytical environment (East Midlands Derby Research Ethics 20/EM/0158)
    License

    https://www.pioneerdatahub.co.uk/data/data-request-process/

    Description

    Strokes can be ischaemic or haemorrhagic in nature, leading to debilitating symptoms which are dependent on the location of the stroke in the brain and the severity of the insult. Stroke care is centred around Hyper-acute Stroke Units (HASU), Acute Stroke and Brain Injury Units (ASU/ABIU) and specialist stroke services. Early presentation enables the use of more invasive treatments to clear blood clots, but commonly strokes present late, preventing their use.

    This synthetic dataset represents approximately 29,000 stroke patients. Data includes demography, socioeconomic status, co-morbidities, “time stamped” serial acuity, physiology and treatments, investigations (structured and unstructured data), hospital care processes, and outcomes.

    The dataset was created using the Synthetic Data Vault (SDV) package, specifically employing the GAN synthesizer. Real data was first read and pre-processed, ensuring datetime columns were correctly parsed and identifiers were handled as strings. Metadata was defined to capture the schema, specifying field types and primary keys. This metadata guided the synthesizer in understanding the structure of the data. The GAN synthesizer was then fitted to the real data, learning the distributions and dependencies within. After fitting, the synthesizer generated synthetic data that mirrors the statistical properties and relationships of the original dataset.

    Geography: The West Midlands (WM) has a population of 6 million & includes a diverse ethnic & socio-economic mix. UHB is one of the largest NHS Trusts in England, providing direct acute stroke services & specialist care across four hospital sites.

    Data set availability: Data access is available via the PIONEER Hub for projects which will benefit the public or patients. Data access can be provided to NHS, academic, commercial, policy and third sector organisations. Applications from SMEs are welcome. There is a single data access process, with public oversight provided by our public review committee, the Data Trust Committee. Contact pioneer@uhb.nhs.uk or visit www.pioneerdatahub.co.uk for more details.

    Available supplementary data: Matched controls; ambulance and community data. Unstructured data (images). We can provide the dataset in OMOP and other common data models and can build synthetic data to meet bespoke requirements.

    Available supplementary support: Analytics, model build, validation & refinement; A.I. support. Data partner support for ETL (extract, transform & load) processes. Bespoke and “off the shelf” Trusted Research Environment (TRE) build and run. Consultancy with clinical, patient & end-user and purchaser access/ support. Support for regulatory requirements. Cohort discovery. Data-driven trials and “fast screen” services to assess population size.

  5. Simpson's paradox (synthetic data).

    • plos.figshare.com
    xls
    Updated Jun 17, 2023
    Cite
    Lars Ole Schwen; Sabrina Rueschenbaum (2023). Simpson's paradox (synthetic data). [Dataset]. http://doi.org/10.1371/journal.pcbi.1006141.t002
    Available download formats: xls
    Dataset updated
    Jun 17, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Lars Ole Schwen; Sabrina Rueschenbaum
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Two treatments (A and B) were applied in two groups (1 and 2) of patients. Treatment A seems to be more successful in each of the groups viewed separately (100 > 87.5 and 66.7 > 50). However, evaluated for the combined group of patients, treatment B appears to be more successful (75 < 80).
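
    The reversal can be checked with a short calculation. The patient counts below are hypothetical: they were chosen only because they reproduce the percentages quoted in the description, and the counts in the published table may differ.

```python
from fractions import Fraction

# Hypothetical (successes, patients) counts per treatment and group.
# Chosen to reproduce the quoted percentages; not taken from the source table.
counts = {
    "A": {1: (1, 1), 2: (2, 3)},
    "B": {1: (7, 8), 2: (1, 2)},
}


def success_rate(successes, patients):
    """Return the success rate in percent."""
    return float(Fraction(successes, patients) * 100)


# Success rate of each treatment within each group.
per_group = {
    treatment: {group: success_rate(*sp) for group, sp in groups.items()}
    for treatment, groups in counts.items()
}

# Success rate of each treatment over both groups pooled together.
combined = {
    treatment: success_rate(
        sum(s for s, _ in groups.values()),
        sum(n for _, n in groups.values()),
    )
    for treatment, groups in counts.items()
}

for treatment in counts:
    print(treatment, per_group[treatment], combined[treatment])
```

    With these counts, most patients receiving treatment A fall in group 2, where both treatments succeed less often, while most patients receiving treatment B fall in group 1. That uneven allocation is what lets B lead in the pooled figures despite trailing in each group.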

  6. Synthetic Heart Disease Dataset

    • opendatabay.com
    Updated May 19, 2025
    Cite
    Opendatabay Labs (2025). Synthetic Heart Disease Dataset [Dataset]. https://www.opendatabay.com/data/synthetic/9969a415-c090-4564-99d6-eca151e9884d
    Dataset updated
    May 19, 2025
    Dataset authored and provided by
    Opendatabay Labs
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Clinical Trials & Research
    Description

    A synthetic heart disease dataset has been generated to serve as an educational resource for data science, machine learning, and data analysis applications in the healthcare industry. It simulates patient records related to heart disease, allowing users to practice data manipulation and develop analytical skills in a healthcare context.

    Dataset Features:

    • Age: Age of the patient at admission (in years).
    • Country: Country of residence, specified as the USA.
    • State: Random assignments of U.S. states for geographic analysis.
    • Blood Pressure: Simulated values reflecting typical hypertension ranges (in mmHg).
    • Cholesterol: Values adjusted to fall within common cholesterol levels (in mg/dL).
    • BMI: Calculated to represent healthy to overweight classifications.
    • Glucose Level: Simulated to represent fasting glucose levels (in mg/dL).
    • Gender: Randomly assigned to simulate demographic diversity.
    • Hospital: Randomly assigned hospitals to represent different healthcare facilities.
    • Treatment Options: Various treatment methods including Physiotherapy, Medication, Surgery, Rehabilitation, and Counseling.
    • Treatment Date: Randomly generated dates for when treatments were administered.
    • Heart Disease: A binary indicator (0 = No, 1 = Yes) representing the presence of heart disease.

    Data Distribution and Outliers:

    [Image: Synthetic Heart Disease Data] https://storage.googleapis.com/opendatabay_public/images/image_88c9876e-c5a3-48be-837e-f1ea77d11693.png

    [Image: Synthetic Heart Disease Patient Records Dataset] https://storage.googleapis.com/opendatabay_public/images/image_041922c7-f3dc-49c9-bfbf-16cdf98d6bd8.png

    [Image: Synthetic Heart Disease Statistics] https://storage.googleapis.com/opendatabay_public/images/hearr_disease_09f51ed4-86d0-4ac4-b6c0-b7b376a9f7f2.png

    [Image: Synthetic Heart Disease Data Distribution] https://storage.googleapis.com/opendatabay_public/images/heart_disease3_abb20b90-1bbd-4e2c-87ce-a47f1e414583.png

    [Image: Synthetic Heart Disease Dataset Heatmap and Correlation] https://storage.googleapis.com/opendatabay_public/images/heart_disease4_64b65bf1-9b53-4ab1-a7ea-3486c050f607.png

    Usage:

    This dataset can be used for:

    • Healthcare research: To explore trends and patterns in cardiovascular health, treatment efficacy, and patient demographics.
    • Educational training: To teach data cleaning, transformation, and visualisation techniques specific to healthcare data.
    • Predictive modelling: To develop models that predict heart disease risk based on various patient and demographic factors.

    Coverage:

    This dataset is synthetic and anonymized, making it a safe tool for experimentation and learning without compromising real patient privacy.

    License:

    CC0 (Public Domain)

    Who can use it:

    • Researchers and educators: For studies or teaching purposes in healthcare analytics and data science.
    • Data science enthusiasts: For learning, practising, and applying healthcare data manipulation and analysis techniques.

  7. DNA Digital Data Storage Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Jun 15, 2025
    Cite
    Data Insights Market (2025). DNA Digital Data Storage Report [Dataset]. https://www.datainsightsmarket.com/reports/dna-digital-data-storage-1944936
    Available download formats: ppt, doc, pdf
    Dataset updated
    Jun 15, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The DNA digital data storage market is experiencing rapid growth, driven by the increasing demand for long-term, secure, and high-capacity data storage solutions. The market's inherent advantages over traditional methods, such as significantly higher storage density and exceptional longevity, are fueling its expansion. While currently a niche market, projections suggest substantial growth over the next decade. Key players like Twist Bioscience, Illumina, and Western Digital are actively investing in R&D and infrastructure development, leading to advancements in encoding techniques and cost reduction. The market is segmented by technology (e.g., synthetic DNA, natural DNA), application (e.g., archival storage, cold storage), and geographic region. North America and Europe currently dominate the market share, benefiting from robust research ecosystems and early adoption, but Asia-Pacific is poised for significant expansion due to its growing data centers and increasing technological investments. Challenges include the relatively high cost of DNA synthesis and sequencing, along with technological hurdles associated with error correction and data retrieval. Despite these challenges, ongoing research focuses on streamlining the synthesis and sequencing processes, reducing costs, and improving error-correction capabilities. This is resulting in increased accessibility of this technology to a wider range of users. Government initiatives supporting the development of advanced data storage solutions are further bolstering market growth. The forecast period (2025-2033) anticipates a strong CAGR, driven by factors such as increasing data generation across various sectors (healthcare, genomics, finance) and the growing need for secure and sustainable data archiving. The longer-term outlook remains positive, with the potential for DNA data storage to become a mainstream technology in the coming decades, revolutionizing data management and storage capacity across industries.

  8. CoSense3D

    • data.uni-hannover.de
    json, zip
    Updated Dec 12, 2024
    Cite
    Institut für Kartographie und Geoinformatik (2024). CoSense3D [Dataset]. https://data.uni-hannover.de/dataset/cosense3d
    Available download formats: json (4774), zip (152892382), zip (27096446), zip (101805677)
    Dataset updated
    Dec 12, 2024
    Dataset authored and provided by
    Institut für Kartographie und Geoinformatik
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This repository provides all datasets, or external links to them, for the project CoSense3D. The related datasets are:

    • COMAP: A synthetic dataset generated by CARLA for cooperative perception.

    • OPV2Vt: A synthetic dataset generated by CARLA with the replay files provided by the dataset OPV2V, for the purpose of globally time-aligned cooperative object detection (TA-COOD). The original replay files are interpolated to obtain the object and sensor locations at sub-frames. Each frame is split into 10 sub-frames for simulation.

    • DairV2Xt: Newly generated meta files based on the dataset DAIR-V2X for the project CoSense3D, with localization correction and ground truth generated for TA-COOD.

    • OPV2Va: A synthetic dataset generated by CARLA with the replay files provided by the dataset OPV2V, augmented with semantic labels.
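
    The sub-frame interpolation mentioned for OPV2Vt can be illustrated with a minimal sketch. It assumes plain linear interpolation between two consecutive frames; the description does not say which interpolation scheme the project uses, and the function name and sample positions are hypothetical.

```python
def interpolate_subframes(start, end, n_sub=10):
    """Linearly interpolate n_sub positions from start (inclusive) toward end (exclusive)."""
    return [
        tuple(s + (e - s) * k / n_sub for s, e in zip(start, end))
        for k in range(n_sub)
    ]


# Hypothetical (x, y, z) positions of one object in two consecutive frames.
frame_0 = (0.0, 0.0, 0.0)
frame_1 = (1.0, 2.0, 0.0)

subframes = interpolate_subframes(frame_0, frame_1)
for k, position in enumerate(subframes):
    print(k, position)
```

    The end point is excluded so that the first sub-frame of the next frame is not produced twice when consecutive frame pairs are processed in sequence.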

  9. Nanopore sequencing results of a 64-bit Z-DNA encryption key and its...

    • figshare.com
    application/gzip
    Updated Jan 6, 2025
    Cite
    Lifu Song (2025). Nanopore sequencing results of a 64-bit Z-DNA encryption key and its amplified counterpart [Dataset]. http://doi.org/10.6084/m9.figshare.28016012.v1
    Available download formats: application/gzip
    Dataset updated
    Jan 6, 2025
    Dataset provided by
    figshare
    Authors
    Lifu Song
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Nanopore sequencing results of a 64-bit Z-DNA encryption key and its amplified counterpart.

    • File 64-bit-key.fq.gz: This file contains the Nanopore sequencing results of the original 64-bit Z-DNA key.
    • File 64-bit-key-PCR.fq.gz: This file includes the Nanopore sequencing results of the PCR-amplified sample derived from the original 64-bit Z-DNA key.

  10. Illumina sequencing reads of the Babel-DNA storage

    • figshare.com
    application/gzip
    Updated Apr 6, 2023
    Cite
    Lifu Song; Gaoli Wang; Yunkun Zhang; Xuesong Liu; Yingjin Yuan; Yunzi Luo; Yan Zhang (2023). Illumina sequencing reads of the Babel-DNA storage [Dataset]. http://doi.org/10.6084/m9.figshare.20424126.v1
    Available download formats: application/gzip
    Dataset updated
    Apr 6, 2023
    Dataset provided by
    figshare
    Authors
    Lifu Song; Gaoli Wang; Yunkun Zhang; Xuesong Liu; Yingjin Yuan; Yunzi Luo; Yan Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Sequencing reads of Babel-DNA storage.

  11. Global DNA Data Storage Market Size By Type (Synthetic DNA-Based Storage,...

    • verifiedmarketresearch.com
    Updated Mar 26, 2025
    Cite
    VERIFIED MARKET RESEARCH (2025). Global DNA Data Storage Market Size By Type (Synthetic DNA-Based Storage, Natural DNA-Based Storage), By Application (Archival Storage, Data Backup And Recovery), By End-User (Healthcare And Biotechnology, Government And Defense), By Geographic Scope And Forecast [Dataset]. https://www.verifiedmarketresearch.com/product/dna-data-storage-market/
    Dataset updated
    Mar 26, 2025
    Dataset provided by
    Verified Market Research (https://www.verifiedmarketresearch.com/)
    Authors
    VERIFIED MARKET RESEARCH
    License

    https://www.verifiedmarketresearch.com/privacy-policy/

    Time period covered
    2026 - 2032
    Area covered
    Global
    Description

    DNA Data Storage Market size was valued at USD 126.76 Million in 2024 and is projected to reach USD 6,241.39 Million by 2032, growing at a CAGR of 74.48% from 2026 to 2032.
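
    The stated growth rate covers 2026 to 2032, while the stated base value is for 2024, so the three figures cannot be checked against each other directly. A minimal sketch using the standard compound annual growth rate relation, end = start × (1 + rate)^years, shows what they imply. The 2026 value it derives is an inference from that formula, not a number given in the report.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values, as a fraction."""
    return (end_value / start_value) ** (1 / years) - 1


def discount(end_value, rate, years):
    """Starting value that grows to end_value at the given annual rate."""
    return end_value / (1 + rate) ** years


value_2024 = 126.76    # USD million, from the description
value_2032 = 6241.39   # USD million, from the description
stated_cagr = 0.7448   # 74.48%, stated for 2026 to 2032

# Market size in 2026 implied by the stated rate and the 2032 projection.
implied_2026 = discount(value_2032, stated_cagr, 2032 - 2026)

# Growth rate implied by the 2024 and 2032 values alone.
rate_2024_2032 = cagr(value_2024, value_2032, 2032 - 2024)

print(f"Implied 2026 value: USD {implied_2026:.2f} million")
print(f"CAGR 2024 to 2032: {rate_2024_2032:.2%}")
```

    Under these assumptions the implied 2026 value comes out at roughly USD 221 million, and the rate over the full 2024 to 2032 span is roughly 63%, lower than the rate quoted for 2026 to 2032.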

    Global DNA Data Storage Market Overview

    The exponential growth in global data generation is shaping new trends in storage technology. With the proliferation of artificial intelligence (AI), the Internet of Things (IoT), high-resolution video content, and expansive cloud services, global data volume is projected to exceed 149 zettabytes by 2024. This data explosion is pushing current storage infrastructures like hard drives and data centers beyond their capacity due to their limitations in scalability, energy efficiency, and sustainability. In response, the tech industry is increasingly turning toward innovative storage mediums, with DNA data storage gaining significant traction for its potential to offer high density, long lifespan, and minimal energy requirements.

  12. DNA Digital Data Storage Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Jan 13, 2025
    Cite
    Data Insights Market (2025). DNA Digital Data Storage Report [Dataset]. https://www.datainsightsmarket.com/reports/dna-digital-data-storage-1982431
    Available download formats: pdf, doc, ppt
    Dataset updated
    Jan 13, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The DNA digital data storage market is projected to experience significant growth over the forecast period, with a CAGR of XX%. The market is driven by the increasing demand for data storage solutions that are more reliable, efficient, and cost-effective than traditional methods. DNA digital data storage offers a unique combination of advantages, including its ability to store vast amounts of data in a compact form, its long-term stability, and its resistance to environmental factors. Key drivers of the market growth include the increasing demand for data storage in various industries, such as healthcare, finance, and manufacturing, as well as the technological advancements in DNA sequencing and synthesis techniques. The growing adoption of cloud computing and the Internet of Things (IoT) is also driving the demand for scalable and cost-effective data storage solutions. The market is segmented by application, type, and region. The major applications of DNA digital data storage include healthcare, research, and enterprise data storage. The types of DNA digital data storage include synthetic DNA and natural DNA. The major companies operating in the market include Twist Bioscience, Western Digital, Microsoft, Illumina, Thermo Fisher Scientific, Siemens, Beckman Coulter, F. Hoffmann-La Roche, and Catalog.

  13. De Bruijn Graph Partitioning for Scalable and Accurate DNA Storage...

    • zenodo.org
    bin
    Updated Apr 15, 2025
    Cite
    Olivier Boullé; Olivier Boullé (2025). De Bruijn Graph Partitioning for Scalable and Accurate DNA Storage Processing; oligo references for synthetic datasets [Dataset]. http://doi.org/10.5281/zenodo.15211943
    Available download formats: bin
    Dataset updated
    Apr 15, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Olivier Boullé; Olivier Boullé
    Description

    These files contain the references for the oligos used in the synthetic dataset tested in the paper "De Bruijn Graph Partitioning for Scalable and Accurate DNA Storage Processing".

    These oligos were generated from images from the project ConCluD https://gitlab.inria.fr/pim/org.pim.dnarxiv using paper_scripts/datasets/IM-1-10-100/img_to_oligos.py

  14. Synthetic Sleep and Lifestyle Behavior Dataset

    • opendatabay.com
    .csv
    Updated Apr 30, 2025
    Cite
    Opendatabay Labs (2025). Synthetic Sleep and Lifestyle Behavior Dataset [Dataset]. https://www.opendatabay.com/data/synthetic/addc0552-dd20-4a86-bdb5-ee1a95594b77
    Available download formats: .csv
    Dataset updated
    Apr 30, 2025
    Dataset authored and provided by
    Opendatabay Labs
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Mental Health & Wellness
    Description

    The Sleep and Lifestyle Behavior Dataset is a synthetic dataset designed to provide insights into the relationship between daily habits and sleep health. It comprises 100,000 rows and 14 columns, offering a comprehensive view of various lifestyle and health factors. This dataset includes key variables such as gender, age, occupation, sleep patterns, physical activity, stress levels, and cardiovascular health metrics, as well as the presence of sleep disorders.

    Dataset Features:

    • Gender: Gender of the person (e.g., "Female", "Male").
    • Age: Age of the person (in years).
    • Occupation: Occupation or profession of the person (e.g., "Accountant", "Doctor", "Engineer").
    • Sleep Duration: The number of hours the person sleeps per night.
    • Quality of Sleep: A subjective rating of the person's sleep quality, on a scale of 1 to 10.
    • Physical Activity Level: The number of minutes the person engages in physical activity daily.
    • Stress Level: A subjective rating of the person's stress level, on a scale of 1 to 10.
    • BMI Category: The BMI category of the person (e.g., "Underweight", "Normal", "Overweight").
    • Heart Rate: The resting heart rate of the person, measured in beats per minute (bpm).
    • Daily Steps: The number of steps the person takes per day.
    • Sleep Disorder: The presence or absence of a sleep disorder (e.g., "None", "Insomnia", "Sleep Apnea").
    • Systolic: The systolic blood pressure measurement (in mmHg).
    • Diastolic: The diastolic blood pressure measurement (in mmHg).

    Data Distributions and Outliers

    [Image: Synthetic Sleep and Lifestyle Behavior Data] https://storage.googleapis.com/opendatabay_public/images/download_94f7fbeb-90ac-4ae7-acb7-3e9e27fa20cf.png

    [Image: Synthetic Sleep and Lifestyle Behaviour Data Distribution] https://storage.googleapis.com/opendatabay_public/images/download_(1)_a67becb2-c287-4d74-8215-5926a870221b.png

    [Image: Synthetic Sleep and Lifestyle Behavioral Stats] https://storage.googleapis.com/opendatabay_public/images/download_(2)_5d7db1cd-d15a-40b8-b484-8cda6bf09685.png

    [Image: Synthetic Sleep and Lifestyle Behavior Dataset Metrics] https://storage.googleapis.com/opendatabay_public/images/download_(3)_e0c63a0b-ffac-45d6-b42a-a011dc17a297.png

    Usage:

    This dataset can be used for:

    • Healthcare research: Investigate the relationships between daily habits, lifestyle factors, and sleep health, and explore trends in the occurrence of sleep disorders.
    • Educational training: Use it for teaching data analysis, machine learning, and statistical techniques, with a focus on health and wellness.
    • Predictive modelling: Build models to predict sleep quality, the likelihood of sleep disorders, or cardiovascular health based on daily activities and stress levels.

    Coverage:

    This dataset is synthetic and anonymized, making it a safe tool for experimentation and learning without compromising real patient privacy.

    License:

    CC0 (Public Domain)

    Who can use it:

    • Researchers and educators: For studies in health analytics, machine learning, and data science, or as an educational resource in healthcare-related courses.
    • Data science enthusiasts: Perfect for practising data manipulation, cleaning, and predictive modelling in the context of health and wellness.

  15. Synthetic Diabetes Patient Records Dataset

    • opendatabay.com
    .csv
    Updated May 6, 2025
    Cite
    Opendatabay Labs (2025). Synthetic Diabetes Patient Records Dataset [Dataset]. https://www.opendatabay.com/data/synthetic/97a5e494-ba32-4c95-b5dc-e099628b966c
    Available download formats: .csv
    Dataset updated
    May 6, 2025
    Dataset authored and provided by
    Opendatabay Labs
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Patient Health Records & Digital Health
    Description

    The Diabetes Dataset is a comprehensive resource designed to support researchers, data scientists, and healthcare professionals interested in diabetes risk assessment and prediction. With a broad spectrum of health-related attributes, this dataset is ideal for developing predictive models and exploring factors associated with diabetes risk. By providing this dataset, we aim to encourage collaboration and innovation in data science and healthcare, potentially leading to more accurate early diagnoses and personalized diabetes treatment strategies.

    Dataset Features:

    • Id: Unique identifier for each data entry.
    • Pregnancies: Number of times the patient has been pregnant.
    • Glucose: Plasma glucose concentration 2 hours into an oral glucose tolerance test.
    • Blood_Pressure: Diastolic blood pressure in mm Hg.
    • Skin_Thickness: Thickness of the triceps skinfold, measured in mm.
    • Insulin: Serum insulin level after 2 hours (µU/mL).
    • BMI: Body mass index, calculated as weight in kg divided by height in m².
    • Diabetes_Pedigree: Genetic risk score for diabetes, indicating familial history.
    • Age: Age of the patient in years.
    • Outcome: A binary variable indicating diabetes status; 1 indicates diabetes presence, while 0 indicates its absence.
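
    The feature list above can be read as a record schema. The following is a minimal sketch, assuming the downloaded .csv uses exactly the column names listed and stores every value as a plain number; neither assumption is confirmed by the listing. `PatientRecord`, `parse_records`, `bmi`, and the sample rows are hypothetical names and made-up values for illustration, not part of the dataset.

```python
import csv
import io
from dataclasses import dataclass

# Column names as given in the feature list (assumed to match the CSV header).
COLUMNS = [
    "Id", "Pregnancies", "Glucose", "Blood_Pressure", "Skin_Thickness",
    "Insulin", "BMI", "Diabetes_Pedigree", "Age", "Outcome",
]


@dataclass
class PatientRecord:
    id: int
    pregnancies: int
    glucose: float
    blood_pressure: float   # diastolic, mm Hg
    skin_thickness: float   # triceps skinfold, mm
    insulin: float
    bmi: float
    diabetes_pedigree: float
    age: int
    outcome: int            # 1 = diabetes present, 0 = absent


def bmi(weight_kg: float, height_m: float) -> float:
    """BMI as defined in the list: weight in kg divided by height in m squared."""
    return weight_kg / height_m ** 2


def parse_records(csv_text: str) -> list[PatientRecord]:
    """Parse CSV text into records, rejecting missing columns or bad labels."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    records = []
    for row in reader:
        outcome = int(row["Outcome"])
        if outcome not in (0, 1):
            raise ValueError(f"Outcome must be 0 or 1, got {outcome}")
        records.append(PatientRecord(
            id=int(row["Id"]),
            pregnancies=int(row["Pregnancies"]),
            glucose=float(row["Glucose"]),
            blood_pressure=float(row["Blood_Pressure"]),
            skin_thickness=float(row["Skin_Thickness"]),
            insulin=float(row["Insulin"]),
            bmi=float(row["BMI"]),
            diabetes_pedigree=float(row["Diabetes_Pedigree"]),
            age=int(row["Age"]),
            outcome=outcome,
        ))
    return records


# Made-up rows for illustration only; NOT taken from the dataset.
SAMPLE = """Id,Pregnancies,Glucose,Blood_Pressure,Skin_Thickness,Insulin,BMI,Diabetes_Pedigree,Age,Outcome
1,2,120,70,25,80,27.5,0.35,34,0
2,0,165,82,33,150,33.1,0.72,51,1
"""

records = parse_records(SAMPLE)
# Share of records labelled as diabetes-positive (Outcome == 1).
positive_rate = sum(r.outcome for r in records) / len(records)
```

    To try this on the real file, replace `SAMPLE` with the contents of the downloaded .csv. If the actual header names or value formats differ from the list, adjust `COLUMNS` and the conversions accordingly.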

    Data distribution and Outliers:

    Image: Synthetic Diabetes Patient Records Data (https://storage.googleapis.com/opendatabay_public/images/image_61dbe587-4a7d-4307-99ab-e46e63ca0e5b.png)

    Image: Synthetic Diabetes Patient Records Distribution (https://storage.googleapis.com/opendatabay_public/images/image_2d5fea3b-d555-4d4a-b109-0e350ae156d5.png)

    Correlations and Relationships:

    Image: Synthetic Diabetes Patient Records Data Correlation (https://storage.googleapis.com/opendatabay_public/images/image_3c7351ac-01c8-489c-823d-bae6ca7fe202.png)

    Image: Synthetic Diabetes Patient Records Statistic (https://storage.googleapis.com/opendatabay_public/images/diabetes2_85a01003-2848-495d-89d9-2d6a21fd77c0.png)

    Image: Synthetic Diabetes Patient Records EMR (https://storage.googleapis.com/opendatabay_public/images/diabetes_copy_f0b0dfc6-56de-42ed-851a-249affebe105.jpg)

    Usage:

    This dataset can be used for:

    • Diabetes research: To analyze and uncover patterns in diabetes risk factors and demographics.
    • Educational purposes: Teaching data science skills such as cleaning, transformation, visualization, and model development within a healthcare context.
    • Predictive modelling: Building models that assess diabetes risk, support feature selection, and enable insights into the health indicators of diabetes.

    Coverage:

    As a synthetic and anonymized dataset, it offers a secure environment for experimentation and learning without compromising individual privacy.

    License:

    CC0 (Public Domain)

    Who can use it:

    • Researchers and educators: Ideal for studies and teaching diabetes analytics and healthcare data science.
    • Data science enthusiasts and professionals: For practising data manipulation, feature engineering, and machine learning modelling focused on diabetes prediction.

