15 datasets found

Synthetic Data Generation for Hard Drive Failure Prediction in Large-scale...
figshare.com
zip
Updated Apr 27, 2025
Cite
Chandranil Chakraborttii (2025). Synthetic Data Generation for Hard Drive Failure Prediction in Large-scale Systems [Dataset]. http://doi.org/10.6084/m9.figshare.28878830.v1
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.28878830.v1
Dataset updated
Apr 27, 2025
Dataset provided by
Figshare (http://figshare.com/)
Authors
Chandranil Chakraborttii
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Accurate failure prediction is critical for the reliability of HPC facilities and data center storage systems. This study addresses data scarcity, privacy concerns, and class imbalance in HDD failure datasets by leveraging synthetic data generation. We propose an end-to-end framework to generate synthetic storage data using Generative Adversarial Networks and Diffusion models. We implement a data segmentation approach that considers temporal variation in disk access to generate high-fidelity synthetic data that replicates the nuanced temporal and feature-specific patterns of disk failures. Experimental results show that synthetic data achieves similarity scores of 0.81–0.89 and enhances failure prediction performance, with up to 3% improvement in accuracy and 2% in ROC-AUC. With only minor performance drops versus real-data training, synthetically trained models prove viable for predictive maintenance.
Bitext Synthetic Data - Moving and storage (English language)
catalog.elra.info
Updated Jul 18, 2023
Cite
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2023). Bitext Synthetic Data - Moving and storage (English language) [Dataset]. https://catalog.elra.info/en-us/repository/browse/ELRA-L0174/
Dataset updated
Jul 18, 2023
Dataset provided by
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
ELRA (European Language Resources Association)
License
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
Description
The Bitext Synthetic Data consist of pre-built training data for intent detection and are provided for 20 verticals for English language (see ELRA-L0162 to ELRA-L0181). They cover the most common intents for each vertical and include a large number of example utterances for each intent, with optional entity/slot annotations for each utterance. The Moving and storage domain comprises 29 intents for English. Data is distributed as models or open text files.
Synthetic dataset of hospitalised patients with an acute exacerbation of...
healthdatagateway.org
unknown
Updated Dec 17, 2024
Cite
This publication uses data from PIONEER, an ethically approved database and analytical environment (East Midlands Derby Research Ethics 20/EM/0158) (2024). Synthetic dataset of hospitalised patients with an acute exacerbation of asthma [Dataset]. https://healthdatagateway.org/dataset/1015
Available download formats: unknown
Dataset updated
Dec 17, 2024
Dataset authored and provided by
This publication uses data from PIONEER, an ethically approved database and analytical environment (East Midlands Derby Research Ethics 20/EM/0158)
License
https://www.pioneerdatahub.co.uk/data/data-request-process/
Description
To support respiratory research, a synthetic asthma dataset was generated based on real-world data, originally documenting 381 patients with physician-confirmed asthma who were admitted to secondary care at a single centre in 2019. The dataset is highly detailed, covering demographics, structured physiological data, medication records, and clinical outcomes. The synthetic version extends to 561 patients admitted over a year, offering insights into patient patterns, risk factors, and treatment strategies.

The dataset was created using the Synthetic Data Vault package, specifically employing the GAN synthesizer. Real data was first read and pre-processed, ensuring datetime columns were correctly parsed and identifiers were handled as strings. Metadata was defined to capture the schema, specifying field types and primary keys. This metadata guided the synthesizer in understanding the structure of the data. The GAN synthesizer was then fitted to the real data, learning the distributions and dependencies within. After fitting, the synthesizer generated synthetic data that mirrors the statistical properties and relationships of the original dataset.
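The pre-processing and metadata steps described above can be sketched with the Python standard library alone. This is a minimal illustration, not the PIONEER pipeline: the column names, sample values, and the layout of the `METADATA` dictionary are all hypothetical, and the actual fitting and sampling with the Synthetic Data Vault GAN synthesizer is not shown.

```python
import csv
import io
from datetime import datetime

# Hypothetical raw extract; column names and values are illustrative only.
RAW_CSV = """patient_id,admission_time,age
00123,2019-03-01 14:05:00,54
00456,2019-07-19 02:40:00,31
"""

DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def load_rows(text):
    """Read rows, keeping identifiers as strings and parsing datetime columns."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        rows.append({
            "patient_id": rec["patient_id"],  # stays a string, so leading zeros survive
            "admission_time": datetime.strptime(rec["admission_time"], DATETIME_FORMAT),
            "age": int(rec["age"]),
        })
    return rows


# Schema description of the kind a synthesizer needs: field types plus a primary key.
# The dictionary layout is an assumption made for this sketch.
METADATA = {
    "primary_key": "patient_id",
    "columns": {
        "patient_id": {"type": "id"},
        "admission_time": {"type": "datetime", "format": DATETIME_FORMAT},
        "age": {"type": "numerical"},
    },
}


def validate(rows, metadata):
    """Check that the primary key is unique and every declared column is present."""
    keys = [row[metadata["primary_key"]] for row in rows]
    if len(keys) != len(set(keys)):
        raise ValueError("primary key values must be unique")
    for row in rows:
        if set(row) != set(metadata["columns"]):
            raise ValueError("row columns do not match the declared schema")
    return True


rows = load_rows(RAW_CSV)
validate(rows, METADATA)
```

Keeping identifiers as strings matters because values such as `00123` would otherwise lose their leading zeros and could be treated as numeric quantities by a synthesizer.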

Geography: The West Midlands has a population of 6 million & includes a diverse ethnic & socio-economic mix. UHB is one of the largest NHS Trusts in England, providing direct acute services & specialist care across four hospital sites, with 2.2 million patient episodes per year, 2750 beds & > 120 ITU bed capacity. UHB runs a fully electronic healthcare record (PICS; Birmingham Systems), a shared primary & secondary care record (Your Care Connected) & a patient portal “My Health”.

Data set availability: Data access is available via the PIONEER Hub for projects which will benefit the public or patients. This can be by developing a new understanding of disease, by providing insights into how to improve care, or by developing new models, tools, treatments, or care processes. Data access can be provided to NHS, academic, commercial, policy and third sector organisations. Applications from SMEs are welcome. There is a single data access process, with public oversight provided by our public review committee, the Data Trust Committee. Contact pioneer@uhb.nhs.uk or visit www.pioneerdatahub.co.uk for more details.

Available supplementary data: Real world data. Matched controls; ambulance and community data. Unstructured data (images). We can provide the dataset in OMOP and other common data models and can provide real-world data upon request.

Available supplementary support: Analytics, model build, validation & refinement; A.I. support. Data partner support for ETL (extract, transform & load) processes. Bespoke and “off the shelf” Trusted Research Environment build and run. Consultancy with clinical, patient & end-user and purchaser access/ support. Support for regulatory requirements. Cohort discovery. Data-driven trials and “fast screen” services to assess population size.
Synthetic Dataset of Hospital Admissions for an acute Stroke
healthdatagateway.org
unknown
Updated Dec 4, 2024
Cite
This publication uses data from PIONEER, an ethically approved database and analytical environment (East Midlands Derby Research Ethics 20/EM/0158) (2024). Synthetic Dataset of Hospital Admissions for an acute Stroke [Dataset]. https://healthdatagateway.org/en/dataset/1003
Available download formats: unknown
Dataset updated
Dec 4, 2024
Dataset authored and provided by
This publication uses data from PIONEER, an ethically approved database and analytical environment (East Midlands Derby Research Ethics 20/EM/0158)
License
https://www.pioneerdatahub.co.uk/data/data-request-process/
Description
Strokes can be ischaemic or haemorrhagic in nature, leading to debilitating symptoms which are dependent on the location of the stroke in the brain and the severity of the insult. Stroke care is centred around Hyper-acute Stroke Units (HASU), Acute Stroke and Brain Injury Units (ASU/ABIU) and specialist stroke services. Early presentation enables the use of more invasive treatments to clear blood clots, but commonly strokes present late, preventing their use.

This synthetic dataset represents approximately 29,000 stroke patients. Data includes demography, socioeconomic status, co-morbidities, “time stamped” serial acuity, physiology and treatments, investigations (structured and unstructured data), hospital care processes, and outcomes.

The dataset was created using the Synthetic Data Vault (SDV) package, specifically employing the GAN synthesizer. Real data was first read and pre-processed, ensuring datetime columns were correctly parsed and identifiers were handled as strings. Metadata was defined to capture the schema, specifying field types and primary keys. This metadata guided the synthesizer in understanding the structure of the data. The GAN synthesizer was then fitted to the real data, learning the distributions and dependencies within. After fitting, the synthesizer generated synthetic data that mirrors the statistical properties and relationships of the original dataset.

Geography: The West Midlands (WM) has a population of 6 million & includes a diverse ethnic & socio-economic mix. UHB is one of the largest NHS Trusts in England, providing direct acute stroke services & specialist care across four hospital sites.

Data set availability: Data access is available via the PIONEER Hub for projects which will benefit the public or patients. Data access can be provided to NHS, academic, commercial, policy and third sector organisations. Applications from SMEs are welcome. There is a single data access process, with public oversight provided by our public review committee, the Data Trust Committee. Contact pioneer@uhb.nhs.uk or visit www.pioneerdatahub.co.uk for more details.

Available supplementary data: Matched controls; ambulance and community data. Unstructured data (images). We can provide the dataset in OMOP and other common data models and can build synthetic data to meet bespoke requirements.

Available supplementary support: Analytics, model build, validation & refinement; A.I. support. Data partner support for ETL (extract, transform & load) processes. Bespoke and “off the shelf” Trusted Research Environment (TRE) build and run. Consultancy with clinical, patient & end-user and purchaser access/ support. Support for regulatory requirements. Cohort discovery. Data-driven trials and “fast screen” services to assess population size.
Simpson's paradox (synthetic data).
plos.figshare.com
xls
Updated Jun 17, 2023
Cite
Lars Ole Schwen; Sabrina Rueschenbaum (2023). Simpson's paradox (synthetic data). [Dataset]. http://doi.org/10.1371/journal.pcbi.1006141.t002
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pcbi.1006141.t002
Dataset updated
Jun 17, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Lars Ole Schwen; Sabrina Rueschenbaum
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Two treatments (A and B) were applied in two groups (1 and 2) of patients. Treatment A seems to be more successful in each of the groups viewed separately (100 > 87.5 and 66.7 > 50). However, evaluated for the combined group of patients, treatment B appears to be more successful (75 < 80).
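The reversal described above can be reproduced with a few lines of arithmetic. The patient counts below are hypothetical: they were chosen only because they yield the quoted percentages (100, 87.5, 66.7, 50, 75, 80) and are not taken from the dataset itself.

```python
from fractions import Fraction

# Hypothetical (successes, patients) counts chosen only to reproduce the quoted percentages.
counts = {
    "A": {"group 1": (1, 1), "group 2": (2, 3)},
    "B": {"group 1": (7, 8), "group 2": (1, 2)},
}


def rate(successes, patients):
    """Success rate in percent."""
    return float(Fraction(successes, patients) * 100)


def combined_rate(groups):
    """Success rate in percent after pooling all groups of one treatment."""
    successes = sum(s for s, _ in groups.values())
    patients = sum(n for _, n in groups.values())
    return rate(successes, patients)


for treatment, groups in counts.items():
    per_group = {name: round(rate(*sn), 1) for name, sn in groups.items()}
    print(treatment, per_group, "combined:", round(combined_rate(groups), 1))
```

With these counts, most of treatment B's patients fall in group 1, where success rates are high for both treatments, while most of treatment A's patients fall in group 2. That unequal split is what lets B lead in the pooled comparison despite trailing within each group.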
Synthetic Heart Disease Dataset
opendatabay.com
Updated May 19, 2025
Cite
Opendatabay Labs (2025). Synthetic Heart Disease Dataset [Dataset]. https://www.opendatabay.com/data/synthetic/9969a415-c090-4564-99d6-eca151e9884d
Dataset updated
May 19, 2025
Dataset authored and provided by
Opendatabay Labs
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Clinical Trials & Research
Description
A synthetic heart disease dataset has been generated to serve as an educational resource for data science, machine learning, and data analysis applications in the healthcare industry. It simulates patient records related to heart disease, allowing users to practice data manipulation and develop analytical skills in a healthcare context.

Dataset Features:

Age: Age of the patient at admission (in years).

Country: Country of residence, specified as the USA.

State: Random assignments of U.S. states for geographic analysis.

Blood Pressure: Simulated values reflecting typical hypertension ranges (in mmHg).

Cholesterol: Values adjusted to fall within common cholesterol levels (in mg/dL).

BMI: Calculated to represent healthy to overweight classifications.

Glucose Level: Simulated to represent fasting glucose levels (in mg/dL).

Gender: Randomly assigned to simulate demographic diversity.

Hospital: Randomly assigned hospitals to represent different healthcare facilities.

Treatment Options: Various treatment methods including Physiotherapy, Medication, Surgery, Rehabilitation, and Counseling.

Treatment Date: Randomly generated dates for when treatments were administered.

Heart Disease: A binary indicator (0 = No, 1 = Yes) representing the presence of heart disease.

Data Distribution and Outliers:

Figure: Synthetic Heart Disease Data (https://storage.googleapis.com/opendatabay_public/images/image_88c9876e-c5a3-48be-837e-f1ea77d11693.png)

Figure: Synthetic Heart Disease Patient Records Dataset (https://storage.googleapis.com/opendatabay_public/images/image_041922c7-f3dc-49c9-bfbf-16cdf98d6bd8.png)

Figure: Synthetic Heart Disease Statistics (https://storage.googleapis.com/opendatabay_public/images/hearr_disease_09f51ed4-86d0-4ac4-b6c0-b7b376a9f7f2.png)

Figure: Synthetic Heart Disease Data Distribution (https://storage.googleapis.com/opendatabay_public/images/heart_disease3_abb20b90-1bbd-4e2c-87ce-a47f1e414583.png)

Figure: Synthetic Heart Disease Dataset Heatmap and Correlation (https://storage.googleapis.com/opendatabay_public/images/heart_disease4_64b65bf1-9b53-4ab1-a7ea-3486c050f607.png)

Usage:

This dataset can be used for:

Healthcare research: To explore trends and patterns in cardiovascular health, treatment efficacy, and patient demographics.

Educational training: To teach data cleaning, transformation, and visualisation techniques specific to healthcare data.

Predictive modelling: To develop models that predict heart disease risk based on various patient and demographic factors.

Coverage:

This dataset is synthetic and anonymized, making it a safe tool for experimentation and learning without compromising real patient privacy.

License:

CC0 (Public Domain)

Who can use it:

Researchers and educators: For studies or teaching purposes in healthcare analytics and data science.

Data science enthusiasts: For learning, practising, and applying healthcare data manipulation and analysis techniques.
DNA Digital Data Storage Report
datainsightsmarket.com
doc, pdf, ppt
Updated Jun 15, 2025
Cite
Data Insights Market (2025). DNA Digital Data Storage Report [Dataset]. https://www.datainsightsmarket.com/reports/dna-digital-data-storage-1944936
Available download formats: ppt, doc, pdf
Dataset updated
Jun 15, 2025
Dataset authored and provided by
Data Insights Market
License
https://www.datainsightsmarket.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The DNA digital data storage market is experiencing rapid growth, driven by the increasing demand for long-term, secure, and high-capacity data storage solutions. The market's inherent advantages over traditional methods, such as significantly higher storage density and exceptional longevity, are fueling its expansion. While currently a niche market, projections suggest substantial growth over the next decade. Key players like Twist Bioscience, Illumina, and Western Digital are actively investing in R&D and infrastructure development, leading to advancements in encoding techniques and cost reduction. The market is segmented by technology (e.g., synthetic DNA, natural DNA), application (e.g., archival storage, cold storage), and geographic region. North America and Europe currently dominate the market share, benefiting from robust research ecosystems and early adoption, but Asia-Pacific is poised for significant expansion due to its growing data centers and increasing technological investments. Challenges include the relatively high cost of DNA synthesis and sequencing, along with technological hurdles associated with error correction and data retrieval. Despite these challenges, ongoing research focuses on streamlining the synthesis and sequencing processes, reducing costs, and improving error-correction capabilities. This is resulting in increased accessibility of this technology to a wider range of users. Government initiatives supporting the development of advanced data storage solutions are further bolstering market growth. The forecast period (2025-2033) anticipates a strong CAGR, driven by factors such as increasing data generation across various sectors (healthcare, genomics, finance) and the growing need for secure and sustainable data archiving. The longer-term outlook remains positive, with the potential for DNA data storage to become a mainstream technology in the coming decades, revolutionizing data management and storage capacity across industries.
CoSense3D
data.uni-hannover.de
json, zip
Updated Dec 12, 2024
Cite
Institut für Kartographie und Geoinformatik (2024). CoSense3D [Dataset]. https://data.uni-hannover.de/dataset/cosense3d
Available download formats: json(4774), zip(152892382), zip(27096446), zip(101805677)
Dataset updated
Dec 12, 2024
Dataset authored and provided by
Institut für Kartographie und Geoinformatik
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
This Repo provides all datasets or their external links for the project CoSense3D. The related datasets are:

COMAP: A synthetic dataset generated with CARLA for cooperative perception.

OPV2Vt: A synthetic dataset generated with CARLA from the replay files provided by the OPV2V dataset, intended for globally time-aligned cooperative object detection (TA-COOD). The original replay files are interpolated to obtain the object and sensor locations at sub-frames. Each frame is split into 10 sub-frames for simulation.

DairV2Xt: Newly generated meta files based on the DAIR-V2X dataset for the CoSense3D project, with localization correction and ground truth generated for TA-COOD.

OPV2Va: A synthetic dataset generated with CARLA from the replay files provided by the OPV2V dataset, augmented with semantic labels.
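The sub-frame step described for OPV2Vt can be illustrated with a small sketch. The description says only that replay files are interpolated into 10 sub-frames per frame; linear interpolation of positions is an assumption made here, and the function name and sample coordinates are hypothetical.

```python
def interpolate_subframes(p0, p1, n_sub=10):
    """Linearly interpolate a position between two consecutive frames.

    Returns n_sub positions: sub-frame 0 equals p0, and the last sub-frame
    stops one step short of p1, which is sub-frame 0 of the next frame.
    """
    return [
        tuple(a + (b - a) * k / n_sub for a, b in zip(p0, p1))
        for k in range(n_sub)
    ]


frame_a = (0.0, 0.0, 0.0)   # hypothetical object location at frame t
frame_b = (10.0, 5.0, 0.0)  # hypothetical object location at frame t + 1
subframes = interpolate_subframes(frame_a, frame_b)
```

Orientation angles would need separate handling (for example, wrapping around 360 degrees), which this sketch does not attempt.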
Nanopore sequencing results of a 64-bit Z-DNA encryption key and its...
figshare.com
application/gzip
Updated Jan 6, 2025
Cite
Lifu Song (2025). Nanopore sequencing results of a 64-bit Z-DNA encryption key and its amplified counterpart [Dataset]. http://doi.org/10.6084/m9.figshare.28016012.v1
Available download formats: application/gzip
Unique identifier
https://doi.org/10.6084/m9.figshare.28016012.v1
Dataset updated
Jan 6, 2025
Dataset provided by
figshare
Authors
Lifu Song
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Nanopore sequencing results of a 64-bit Z-DNA encryption key and its amplified counterpart.

File 64-bit-key.fq.gz: This file contains the Nanopore sequencing results of the original 64-bit Z-DNA key.

File 64-bit-key-PCR.fq.gz: This file includes the Nanopore sequencing results of the PCR-amplified sample derived from the original 64-bit Z-DNA key.
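As a first sanity check on gzip-compressed FASTQ files such as these, one might count the reads they contain. The sketch below assumes the common four-lines-per-record FASTQ layout (header, sequence, `+`, qualities) with no wrapped sequence lines; the function name is hypothetical and nothing here is specific to the files in this dataset.

```python
import gzip


def count_fastq_reads(path):
    """Count records in a gzip-compressed FASTQ file, assuming 4 lines per record."""
    n_lines = 0
    with gzip.open(path, "rt") as handle:
        for _ in handle:
            n_lines += 1
    if n_lines % 4 != 0:
        raise ValueError("expected a multiple of 4 lines in a FASTQ file")
    return n_lines // 4
```

A call such as `count_fastq_reads("64-bit-key.fq.gz")` would then give the number of reads in the original-key file, provided the file has been downloaded to the working directory.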
Illumina sequencing reads of the Babel-DNA storage
figshare.com
application/gzip
Updated Apr 6, 2023
Cite
Lifu Song; Gaoli Wang; Yunkun Zhang; Xuesong Liu; Yingjin Yuan; Yunzi Luo; Yan Zhang (2023). Illumina sequencing reads of the Babel-DNA storage [Dataset]. http://doi.org/10.6084/m9.figshare.20424126.v1
Available download formats: application/gzip
Unique identifier
https://doi.org/10.6084/m9.figshare.20424126.v1
Dataset updated
Apr 6, 2023
Dataset provided by
figshare
Authors
Lifu Song; Gaoli Wang; Yunkun Zhang; Xuesong Liu; Yingjin Yuan; Yunzi Luo; Yan Zhang
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Sequencing reads of Babel-DNA storage.
Global DNA Data Storage Market Size By Type (Synthetic DNA-Based Storage,...
verifiedmarketresearch.com
Updated Mar 26, 2025
Cite
VERIFIED MARKET RESEARCH (2025). Global DNA Data Storage Market Size By Type (Synthetic DNA-Based Storage, Natural DNA-Based Storage), By Application (Archival Storage, Data Backup And Recovery), By End-User (Healthcare And Biotechnology, Government And Defense), By Geographic Scope And Forecast [Dataset]. https://www.verifiedmarketresearch.com/product/dna-data-storage-market/
Dataset updated
Mar 26, 2025
Dataset provided by
Verified Market Research (https://www.verifiedmarketresearch.com/)
Authors
VERIFIED MARKET RESEARCH
License
https://www.verifiedmarketresearch.com/privacy-policy/
Time period covered
2026 - 2032
Area covered
Global
Description
DNA Data Storage Market size was valued at USD 126.76 Million in 2024 and is projected to reach USD 6,241.39 Million by 2032, growing at a CAGR of 74.48% from 2026 to 2032.
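The relationship between the quoted start value, end value, and CAGR can be checked with the standard compound-growth formula, end = start × (1 + rate)^periods. The sketch below uses only the three figures quoted above; the function names are hypothetical, and the report's own choice of base year is not stated here.

```python
import math


def cagr(start, end, periods):
    """Compound annual growth rate as a fraction (0.10 means 10%)."""
    return (end / start) ** (1 / periods) - 1


def implied_periods(start, end, rate):
    """Number of compounding periods needed to grow from start to end at the given rate."""
    return math.log(end / start) / math.log(1 + rate)


start_musd = 126.76    # quoted 2024 valuation, USD million
end_musd = 6241.39     # quoted 2032 projection, USD million
quoted_rate = 0.7448   # quoted CAGR of 74.48%

n = implied_periods(start_musd, end_musd, quoted_rate)
print(f"implied compounding periods: {n:.2f}")
```

Under these figures the implied number of compounding periods comes out at roughly seven, so the quoted rate and the two valuations are mutually consistent only if growth is compounded over about seven years.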

Global DNA Data Storage Market Overview

The exponential growth in global data generation is shaping new trends in storage technology. With the proliferation of artificial intelligence (AI), the Internet of Things (IoT), high-resolution video content, and expansive cloud services, global data volume is projected to exceed 149 zettabytes by 2024. This data explosion is pushing current storage infrastructures like hard drives and data centers beyond their capacity due to their limitations in scalability, energy efficiency, and sustainability. In response, the tech industry is increasingly turning toward innovative storage mediums, with DNA data storage gaining significant traction for its potential to offer high density, long lifespan, and minimal energy requirements.
DNA Digital Data Storage Report
datainsightsmarket.com
doc, pdf, ppt
Updated Jan 13, 2025
Cite
Data Insights Market (2025). DNA Digital Data Storage Report [Dataset]. https://www.datainsightsmarket.com/reports/dna-digital-data-storage-1982431
Available download formats: pdf, doc, ppt
Dataset updated
Jan 13, 2025
Dataset authored and provided by
Data Insights Market
License
https://www.datainsightsmarket.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The DNA digital data storage market is projected to experience significant growth over the forecast period, with a CAGR of XX%. The market is driven by the increasing demand for data storage solutions that are more reliable, efficient, and cost-effective than traditional methods. DNA digital data storage offers a unique combination of advantages, including its ability to store vast amounts of data in a compact form, its long-term stability, and its resistance to environmental factors. Key drivers of the market growth include the increasing demand for data storage in various industries, such as healthcare, finance, and manufacturing, as well as the technological advancements in DNA sequencing and synthesis techniques. The growing adoption of cloud computing and the Internet of Things (IoT) is also driving the demand for scalable and cost-effective data storage solutions. The market is segmented by application, type, and region. The major applications of DNA digital data storage include healthcare, research, and enterprise data storage. The types of DNA digital data storage include synthetic DNA and natural DNA. The major companies operating in the market include Twist Bioscience, Western Digital, Microsoft, Illumina, Thermo Fisher Scientific, Siemens, Beckman Coulter, F. Hoffmann-La Roche, and Catalog.
De Bruijn Graph Partitioning for Scalable and Accurate DNA Storage...
zenodo.org
bin
Updated Apr 15, 2025
Cite
Olivier Boullé (2025). De Bruijn Graph Partitioning for Scalable and Accurate DNA Storage Processing; oligo references for synthetic datasets [Dataset]. http://doi.org/10.5281/zenodo.15211943
Available download formats: bin
Unique identifier
https://doi.org/10.5281/zenodo.15211943
Dataset updated
Apr 15, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Olivier Boullé
Description
These files contain the references for the oligos used in the synthetic data set tested in the paper "De Bruijn Graph Partitioning for Scalable and Accurate DNA Storage Processing".

These oligos were generated from images from the project ConCluD (https://gitlab.inria.fr/pim/org.pim.dnarxiv) using paper_scripts/datasets/IM-1-10-100/img_to_oligos.py
Synthetic Sleep and Lifestyle Behavior Dataset
opendatabay.com
.csv
Updated Apr 30, 2025
Cite
Opendatabay Labs (2025). Synthetic Sleep and Lifestyle Behavior Dataset [Dataset]. https://www.opendatabay.com/data/synthetic/addc0552-dd20-4a86-bdb5-ee1a95594b77
Available download formats: .csv
Dataset updated
Apr 30, 2025
Dataset authored and provided by
Opendatabay Labs
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Mental Health & Wellness
Description
The Sleep and Lifestyle Behavior Dataset is a synthetic dataset designed to provide insights into the relationship between daily habits and sleep health. It comprises 100,000 rows and 14 columns, offering a comprehensive view of various lifestyle and health factors. This dataset includes key variables such as gender, age, occupation, sleep patterns, physical activity, stress levels, and cardiovascular health metrics, as well as the presence of sleep disorders.

Dataset Features:

Gender: Gender of the person (e.g., "Female", "Male").

Age: Age of the person (in years).

Occupation: Occupation or profession of the person (e.g., "Accountant", "Doctor", "Engineer").

Sleep Duration: The number of hours the person sleeps per night.

Quality of Sleep: A subjective rating of the person's sleep quality, on a scale of 1 to 10.

Physical Activity Level: The number of minutes the person engages in physical activity daily.

Stress Level: A subjective rating of the person's stress level, on a scale of 1 to 10.

BMI Category: The BMI category of the person (e.g., "Underweight", "Normal", "Overweight").

Heart Rate: The resting heart rate of the person, measured in beats per minute (bpm).

Daily Steps: The number of steps the person takes per day.

Sleep Disorder: The presence or absence of a sleep disorder (e.g., "None", "Insomnia", "Sleep Apnea").

Systolic: The systolic blood pressure measurement (in mmHg).

Diastolic: The diastolic blood pressure measurement (in mmHg).

Data Distributions and Outliers

Figure: Synthetic Sleep and Lifestyle Behavior Data (https://storage.googleapis.com/opendatabay_public/images/download_94f7fbeb-90ac-4ae7-acb7-3e9e27fa20cf.png)

Figure: Synthetic Sleep and Lifestyle Behaviour Data Distribution (https://storage.googleapis.com/opendatabay_public/images/download_(1)_a67becb2-c287-4d74-8215-5926a870221b.png)

Figure: Synthetic Sleep and Lifestyle Behavioral Stats (https://storage.googleapis.com/opendatabay_public/images/download_(2)_5d7db1cd-d15a-40b8-b484-8cda6bf09685.png)

Figure: Synthetic Sleep and Lifestyle Behavior Dataset Metrics (https://storage.googleapis.com/opendatabay_public/images/download_(3)_e0c63a0b-ffac-45d6-b42a-a011dc17a297.png)

Usage:

This dataset can be used for:

Healthcare research: Investigate the relationships between daily habits, lifestyle factors, and sleep health, and explore trends in the occurrence of sleep disorders.

Educational training: Use it for teaching data analysis, machine learning, and statistical techniques, with a focus on health and wellness.

Predictive modelling: Build models to predict sleep quality, the likelihood of sleep disorders, or cardiovascular health based on daily activities and stress levels.

Coverage:

This dataset is synthetic and anonymized, making it a safe tool for experimentation and learning without compromising real patient privacy.

License:

CC0 (Public Domain)

Who can use it:

Researchers and educators: For studies in health analytics, machine learning, and data science, or as an educational resource in healthcare-related courses.

Data science enthusiasts: Perfect for practising data manipulation, cleaning, and predictive modelling in the context of health and wellness.
Synthetic Diabetes Patient Records Dataset
opendatabay.com
.csv
Updated May 6, 2025
Cite
Opendatabay Labs (2025). Synthetic Diabetes Patient Records Dataset [Dataset]. https://www.opendatabay.com/data/synthetic/97a5e494-ba32-4c95-b5dc-e099628b966c
Available download formats: .csv
Dataset updated
May 6, 2025
Dataset authored and provided by
Opendatabay Labs
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Patient Health Records & Digital Health
Description
The Diabetes Dataset is a comprehensive resource designed to support researchers, data scientists, and healthcare professionals interested in diabetes risk assessment and prediction. With a broad spectrum of health-related attributes, this dataset is ideal for developing predictive models and exploring factors associated with diabetes risk. By providing this dataset, we aim to encourage collaboration and innovation in data science and healthcare, potentially leading to more accurate early diagnoses and personalized diabetes treatment strategies.

Dataset Features:

Id: Unique identifier for each data entry.

Pregnancies: Number of times the patient has been pregnant.

Glucose: Plasma glucose concentration measured over a 2-hour period during an oral glucose tolerance test.

Blood_Pressure: Diastolic blood pressure in mm Hg.

Skin_Thickness: Thickness of the triceps skinfold, measured in mm.

Insulin: Serum insulin level after 2 hours (mu U/ml).

BMI: Body mass index, calculated as weight in kg divided by height in m².

Diabetes_Pedigree: Genetic risk score for diabetes, indicating familial history.

Age: Age of the patient in years.

Outcome: A binary variable indicating diabetes status; 1 indicates diabetes presence, while 0 indicates its absence.
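The BMI definition in the feature list (weight in kg divided by height in m²) can be written out as a short worked example. The function name and the sample weight and height are hypothetical and are not drawn from the dataset.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by the square of height in metres."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


# Hypothetical person: 70 kg and 1.75 m tall.
example_bmi = bmi(70, 1.75)
print(f"{example_bmi:.1f}")
```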

Data distribution and Outliers:

Figure: Synthetic Diabetes Patient Records Data (https://storage.googleapis.com/opendatabay_public/images/image_61dbe587-4a7d-4307-99ab-e46e63ca0e5b.png)

Figure: Synthetic Diabetes Patient Records Distribution (https://storage.googleapis.com/opendatabay_public/images/image_2d5fea3b-d555-4d4a-b109-0e350ae156d5.png)

Correlations and Relationships:

Figure: Synthetic Diabetes Patient Records Data Correlation (https://storage.googleapis.com/opendatabay_public/images/image_3c7351ac-01c8-489c-823d-bae6ca7fe202.png)

Figure: Synthetic Diabetes Patient Records Statistic (https://storage.googleapis.com/opendatabay_public/images/diabetes2_85a01003-2848-495d-89d9-2d6a21fd77c0.png)

Figure: Synthetic Diabetes Patient Records EMR (https://storage.googleapis.com/opendatabay_public/images/diabetes_copy_f0b0dfc6-56de-42ed-851a-249affebe105.jpg)

Usage:

This dataset can be used for:

Diabetes research: To analyze and uncover patterns in diabetes risk factors and demographics.

Educational purposes: Teaching data science skills such as cleaning, transformation, visualization, and model development within a healthcare context.

Predictive modelling: Building models that assess diabetes risk, support feature selection, and enable insights into the health indicators of diabetes.

Coverage:

As a synthetic and anonymized dataset, it offers a secure environment for experimentation and learning without compromising individual privacy.

License:

CC0 (Public Domain)

Who can use it:

Researchers and educators: Ideal for studies and teaching diabetes analytics and healthcare data science.

Data science enthusiasts and professionals: For practising data manipulation, feature engineering, and machine learning modelling focused on diabetes prediction.