Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Accurate failure prediction is critical for the reliability of storage systems in HPC facilities and data centers. This study addresses data scarcity, privacy concerns, and class imbalance in HDD failure datasets by leveraging synthetic data generation. We propose an end-to-end framework to generate synthetic storage data using Generative Adversarial Networks and Diffusion models. We implement a data segmentation approach that accounts for temporal variation in disk access to generate high-fidelity synthetic data that replicates the nuanced temporal and feature-specific patterns of disk failures. Experimental results show that synthetic data achieves similarity scores of 0.81–0.89 and enhances failure prediction performance, with up to 3% improvement in accuracy and 2% in ROC-AUC. With only minor performance drops versus real-data training, synthetically trained models prove viable for predictive maintenance.
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
The Bitext Synthetic Data consist of pre-built training data for intent detection and are provided for 20 verticals in English (see ELRA-L0162 to ELRA-L0181). They cover the most common intents for each vertical and include a large number of example utterances for each intent, with optional entity/slot annotations for each utterance. The Moving and storage domain comprises 29 intents for English. Data is distributed as models or open text files.
https://www.pioneerdatahub.co.uk/data/data-request-process/
To support respiratory research, a synthetic asthma dataset was generated based on real-world data, originally documenting 381 patients with physician-confirmed asthma who were admitted to secondary care at a single centre in 2019. The dataset is highly detailed, covering demographics, structured physiological data, medication records, and clinical outcomes. The synthetic version extends to 561 patients admitted over a year, offering insights into patient patterns, risk factors, and treatment strategies.
The dataset was created using the Synthetic Data Vault package, specifically employing the GAN synthesizer. Real data was first read and pre-processed, ensuring datetime columns were correctly parsed and identifiers were handled as strings. Metadata was defined to capture the schema, specifying field types and primary keys. This metadata guided the synthesizer in understanding the structure of the data. The GAN synthesizer was then fitted to the real data, learning the distributions and dependencies within. After fitting, the synthesizer generated synthetic data that mirrors the statistical properties and relationships of the original dataset.
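The pre-processing and schema steps described above can be sketched in plain Python. This is only an illustration under stated assumptions: the column names, the sample rows, and the `SCHEMA` dictionary layout are hypothetical, and the sketch does not call the Synthetic Data Vault package or fit any synthesizer. It shows datetime columns being parsed, identifiers being kept as strings, and a primary key being declared and checked.

```python
import csv
import io
from datetime import datetime

# Hypothetical extract; column names and values are invented for illustration.
RAW_CSV = """patient_id,admission_time,age,peak_flow
00123,2019-03-01 14:05:00,34,310
00456,2019-07-19 02:40:00,61,250
"""

# Hand-written schema in the spirit of the metadata step: one type per field
# plus the primary key. This is a plain dict, not an SDV metadata object.
SCHEMA = {
    "primary_key": "patient_id",
    "columns": {
        "patient_id": "id",
        "admission_time": "datetime",
        "age": "numerical",
        "peak_flow": "numerical",
    },
}


def parse_value(raw, kind):
    """Convert one raw CSV string according to its declared field type."""
    if kind == "datetime":
        return datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")
    if kind == "numerical":
        return float(raw)
    return raw  # identifiers stay strings, so leading zeros survive


def load_rows(text, schema):
    """Read CSV text, type each column, and check primary-key uniqueness."""
    reader = csv.DictReader(io.StringIO(text))
    rows = [
        {col: parse_value(row[col], kind) for col, kind in schema["columns"].items()}
        for row in reader
    ]
    keys = [row[schema["primary_key"]] for row in rows]
    if len(keys) != len(set(keys)):
        raise ValueError("primary key values must be unique")
    return rows


rows = load_rows(RAW_CSV, SCHEMA)
```

Keeping identifiers as strings matters because a numeric parse would turn `00123` into `123` and could merge distinct patients.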
Geography: The West Midlands has a population of 6 million & includes a diverse ethnic & socio-economic mix. UHB is one of the largest NHS Trusts in England, providing direct acute services & specialist care across four hospital sites, with 2.2 million patient episodes per year, 2750 beds & > 120 ITU bed capacity. UHB runs a fully electronic healthcare record (PICS; Birmingham Systems), a shared primary & secondary care record (Your Care Connected) & a patient portal “My Health”.
Data set availability: Data access is available via the PIONEER Hub for projects which will benefit the public or patients. This can be by developing a new understanding of disease, by providing insights into how to improve care, or by developing new models, tools, treatments, or care processes. Data access can be provided to NHS, academic, commercial, policy and third sector organisations. Applications from SMEs are welcome. There is a single data access process, with public oversight provided by our public review committee, the Data Trust Committee. Contact pioneer@uhb.nhs.uk or visit www.pioneerdatahub.co.uk for more details.
Available supplementary data: Real world data. Matched controls; ambulance and community data. Unstructured data (images). We can provide the dataset in OMOP and other common data models and can provide real-world data upon request.
Available supplementary support: Analytics, model build, validation & refinement; A.I. support. Data partner support for ETL (extract, transform & load) processes. Bespoke and “off the shelf” Trusted Research Environment build and run. Consultancy with clinical, patient & end-user and purchaser access/ support. Support for regulatory requirements. Cohort discovery. Data-driven trials and “fast screen” services to assess population size.
https://www.pioneerdatahub.co.uk/data/data-request-process/
Strokes can be ischaemic or haemorrhagic in nature, leading to debilitating symptoms which are dependent on the location of the stroke in the brain and the severity of the insult. Stroke care is centred around Hyper-acute Stroke Units (HASU), Acute Stroke and Brain Injury Units (ASU/ABIU) and specialist stroke services. Early presentation enables the use of more invasive treatments to clear blood clots, but commonly strokes present late, preventing their use.
This synthetic dataset represents approximately 29,000 stroke patients. Data includes demography, socioeconomic status, co-morbidities, “time stamped” serial acuity, physiology and treatments, investigations (structured and unstructured data), hospital care processes, and outcomes.
The dataset was created using the Synthetic Data Vault (SDV) package, specifically employing the GAN synthesizer. Real data was first read and pre-processed, ensuring datetime columns were correctly parsed and identifiers were handled as strings. Metadata was defined to capture the schema, specifying field types and primary keys. This metadata guided the synthesizer in understanding the structure of the data. The GAN synthesizer was then fitted to the real data, learning the distributions and dependencies within. After fitting, the synthesizer generated synthetic data that mirrors the statistical properties and relationships of the original dataset.
Geography: The West Midlands (WM) has a population of 6 million & includes a diverse ethnic & socio-economic mix. UHB is one of the largest NHS Trusts in England, providing direct acute stroke services & specialist care across four hospital sites.
Data set availability: Data access is available via the PIONEER Hub for projects which will benefit the public or patients. Data access can be provided to NHS, academic, commercial, policy and third sector organisations. Applications from SMEs are welcome. There is a single data access process, with public oversight provided by our public review committee, the Data Trust Committee. Contact pioneer@uhb.nhs.uk or visit www.pioneerdatahub.co.uk for more details.
Available supplementary data: Matched controls; ambulance and community data. Unstructured data (images). We can provide the dataset in OMOP and other common data models and can build synthetic data to meet bespoke requirements.
Available supplementary support: Analytics, model build, validation & refinement; A.I. support. Data partner support for ETL (extract, transform & load) processes. Bespoke and “off the shelf” Trusted Research Environment (TRE) build and run. Consultancy with clinical, patient & end-user and purchaser access/ support. Support for regulatory requirements. Cohort discovery. Data-driven trials and “fast screen” services to assess population size.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Two treatments (A and B) were applied in two groups (1 and 2) of patients. Treatment A seems to be more successful in each of the groups viewed separately (100 > 87.5 and 66.7 > 50). However, evaluated for the combined group of patients, treatment B appears to be more successful (75 < 80).
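The reversal described above (Simpson's paradox) can be checked with a few lines of arithmetic. The patient counts below are not given in the description; they are hypothetical, chosen as the smallest counts that reproduce the quoted percentages, so treat this as a sketch rather than the dataset's actual contents.

```python
from fractions import Fraction

# Hypothetical (successes, patients) counts per treatment and group,
# chosen only because they reproduce the percentages quoted in the text.
counts = {
    "A": {1: (1, 1), 2: (2, 3)},
    "B": {1: (7, 8), 2: (1, 2)},
}


def rate(successes, patients):
    """Success rate in percent, kept exact with Fraction."""
    return 100 * Fraction(successes, patients)


def per_group(treatment):
    """Success rate of a treatment within each group."""
    return {group: rate(*pair) for group, pair in counts[treatment].items()}


def pooled(treatment):
    """Success rate of a treatment over both groups combined."""
    total_successes = sum(pair[0] for pair in counts[treatment].values())
    total_patients = sum(pair[1] for pair in counts[treatment].values())
    return rate(total_successes, total_patients)


for treatment in ("A", "B"):
    groups = ", ".join(
        f"group {g}: {float(r):.1f}%" for g, r in per_group(treatment).items()
    )
    print(f"Treatment {treatment}: {groups}, combined: {float(pooled(treatment)):.1f}%")
```

With these assumed counts, most of treatment B's patients fall in group 1, where both treatments do well, while most of treatment A's patients fall in group 2. That difference in group sizes is what lets B lead in the combined figure while trailing in each group.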
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
A synthetic heart disease dataset has been generated to serve as an educational resource for data science, machine learning, and data analysis applications in the healthcare industry. It simulates patient records related to heart disease, allowing users to practice data manipulation and develop analytical skills in a healthcare context.
Image: Synthetic Heart Disease Data (https://storage.googleapis.com/opendatabay_public/images/image_88c9876e-c5a3-48be-837e-f1ea77d11693.png)
Image: Synthetic Heart Disease Patient Records Dataset (https://storage.googleapis.com/opendatabay_public/images/image_041922c7-f3dc-49c9-bfbf-16cdf98d6bd8.png)
Image: Synthetic Heart Disease Statistics (https://storage.googleapis.com/opendatabay_public/images/hearr_disease_09f51ed4-86d0-4ac4-b6c0-b7b376a9f7f2.png)
Image: Synthetic Heart Disease Data Distribution (https://storage.googleapis.com/opendatabay_public/images/heart_disease3_abb20b90-1bbd-4e2c-87ce-a47f1e414583.png)
Image: Synthetic Heart Disease Dataset Heatmap and Correlation (https://storage.googleapis.com/opendatabay_public/images/heart_disease4_64b65bf1-9b53-4ab1-a7ea-3486c050f607.png)
This dataset can be used for:
- Healthcare research: To explore trends and patterns in cardiovascular health, treatment efficacy, and patient demographics.
- Educational training: To teach data cleaning, transformation, and visualisation techniques specific to healthcare data.
- Predictive modelling: To develop models that predict heart disease risk based on various patient and demographic factors.
This dataset is synthetic and anonymized, making it a safe tool for experimentation and learning without compromising real patient privacy.
CC0 (Public Domain)
https://www.datainsightsmarket.com/privacy-policy
The DNA digital data storage market is experiencing rapid growth, driven by the increasing demand for long-term, secure, and high-capacity data storage solutions. The market's inherent advantages over traditional methods, such as significantly higher storage density and exceptional longevity, are fueling its expansion. While currently a niche market, projections suggest substantial growth over the next decade. Key players like Twist Bioscience, Illumina, and Western Digital are actively investing in R&D and infrastructure development, leading to advancements in encoding techniques and cost reduction. The market is segmented by technology (e.g., synthetic DNA, natural DNA), application (e.g., archival storage, cold storage), and geographic region. North America and Europe currently dominate the market share, benefiting from robust research ecosystems and early adoption, but Asia-Pacific is poised for significant expansion due to its growing data centers and increasing technological investments. Challenges include the relatively high cost of DNA synthesis and sequencing, along with technological hurdles associated with error correction and data retrieval. Despite these challenges, ongoing research focuses on streamlining the synthesis and sequencing processes, reducing costs, and improving error-correction capabilities. This is resulting in increased accessibility of this technology to a wider range of users. Government initiatives supporting the development of advanced data storage solutions are further bolstering market growth. The forecast period (2025-2033) anticipates a strong CAGR, driven by factors such as increasing data generation across various sectors (healthcare, genomics, finance) and the growing need for secure and sustainable data archiving. The longer-term outlook remains positive, with the potential for DNA data storage to become a mainstream technology in the coming decades, revolutionizing data management and storage capacity across industries.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This repository provides all datasets, or external links to them, for the project CoSense3D. The related datasets are:
COMAP: A synthetic dataset generated by CARLA for cooperative perception.
OPV2Vt: A synthetic dataset generated by CARLA with the replay files provided by the dataset OPV2V for the purpose of globally time-aligned cooperative object detection (TA-COOD). The original replay files are interpolated to obtain the object and sensor locations at sub-frames. Each frame is split into 10 sub-frames for simulation.
DairV2Xt: Newly generated meta files based on the dataset DAIR-V2X for the project CoSense3D, with localization correction and ground truth generated for TA-COOD.
OPV2Va: A synthetic dataset generated by CARLA with the replay files provided by the dataset OPV2V, augmented with semantic labels.
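The sub-frame step mentioned for OPV2Vt can be illustrated with a short sketch. The description does not say which interpolation scheme is used, so linear interpolation of positions is an assumption here, and the function name `interpolate_subframes` and the sample coordinates are hypothetical. Orientation is ignored for brevity.

```python
def interpolate_subframes(start, end, n_sub=10):
    """Return n_sub positions linearly spaced from start (inclusive) to end (exclusive).

    start and end are coordinate tuples of equal length for two consecutive
    frames; linear interpolation is an assumption, not the project's
    documented method.
    """
    return [
        tuple(s + (e - s) * k / n_sub for s, e in zip(start, end))
        for k in range(n_sub)
    ]


# Hypothetical (x, y) positions in metres at two consecutive frames.
frame_0 = (0.0, 0.0)
frame_1 = (1.0, 2.0)
subframes = interpolate_subframes(frame_0, frame_1)
```

The end frame is excluded so that concatenating the sub-frames of consecutive frames does not repeat a position.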
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Nanopore sequencing results of a 64-bit Z-DNA encryption key and its amplified counterpart.
File 64-bit-key.fq.gz: This file contains the Nanopore sequencing results of the original 64-bit Z-DNA key.
File 64-bit-key-PCR.fq.gz: This file includes the Nanopore sequencing results of the PCR-amplified sample derived from the original 64-bit Z-DNA key.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Sequencing reads of Babel-DNA storage.
https://www.verifiedmarketresearch.com/privacy-policy/
DNA Data Storage Market size was valued at USD 126.76 Million in 2024 and is projected to reach USD 6,241.39 Million by 2032, growing at a CAGR of 74.48% from 2026 to 2032.
Global DNA Data Storage Market Overview
The exponential growth in global data generation is shaping new trends in storage technology. With the proliferation of artificial intelligence (AI), the Internet of Things (IoT), high-resolution video content, and expansive cloud services, global data volume is projected to exceed 149 zettabytes by 2024. This data explosion is pushing current storage infrastructures like hard drives and data centers beyond their capacity due to their limitations in scalability, energy efficiency, and sustainability. In response, the tech industry is increasingly turning toward innovative storage mediums, with DNA data storage gaining significant traction for its potential to offer high density, long lifespan, and minimal energy requirements.
https://www.datainsightsmarket.com/privacy-policy
The DNA digital data storage market is projected to experience significant growth over the forecast period, with a CAGR of XX%. The market is driven by the increasing demand for data storage solutions that are more reliable, efficient, and cost-effective than traditional methods. DNA digital data storage offers a unique combination of advantages, including its ability to store vast amounts of data in a compact form, its long-term stability, and its resistance to environmental factors. Key drivers of the market growth include the increasing demand for data storage in various industries, such as healthcare, finance, and manufacturing, as well as the technological advancements in DNA sequencing and synthesis techniques. The growing adoption of cloud computing and the Internet of Things (IoT) is also driving the demand for scalable and cost-effective data storage solutions. The market is segmented by application, type, and region. The major applications of DNA digital data storage include healthcare, research, and enterprise data storage. The types of DNA digital data storage include synthetic DNA and natural DNA. The major companies operating in the market include Twist Bioscience, Western Digital, Microsoft, Illumina, Thermo Fisher Scientific, Siemens, Beckman Coulter, F. Hoffmann-La Roche, and Catalog.
These files contain the references for the oligos used in the synthetic data set tested in the paper:
"De Bruijn Graph Partitioning for Scalable and Accurate DNA Storage Processing"
These oligos have been generated from images from the project ConCluD https://gitlab.inria.fr/pim/org.pim.dnarxiv
using paper_scripts/datasets/IM-1-10-100/img_to_oligos.py
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The Sleep and Lifestyle Behavior Dataset is a synthetic dataset designed to provide insights into the relationship between daily habits and sleep health. It comprises 100,000 rows and 14 columns, offering a comprehensive view of various lifestyle and health factors. This dataset includes key variables such as gender, age, occupation, sleep patterns, physical activity, stress levels, and cardiovascular health metrics, as well as the presence of sleep disorders.
Image: Synthetic Sleep and Lifestyle Behavior Data (https://storage.googleapis.com/opendatabay_public/images/download_94f7fbeb-90ac-4ae7-acb7-3e9e27fa20cf.png)
Image: Synthetic Sleep and Lifestyle Behaviour Data Distribution (https://storage.googleapis.com/opendatabay_public/images/download_(1)_a67becb2-c287-4d74-8215-5926a870221b.png)
Image: Synthetic Sleep and Lifestyle Behavioral Stats (https://storage.googleapis.com/opendatabay_public/images/download_(2)_5d7db1cd-d15a-40b8-b484-8cda6bf09685.png)
Image: Synthetic Sleep and Lifestyle Behavior Dataset Metrics (https://storage.googleapis.com/opendatabay_public/images/download_(3)_e0c63a0b-ffac-45d6-b42a-a011dc17a297.png)
This dataset can be used for:
- Healthcare research: Investigate the relationships between daily habits, lifestyle factors, and sleep health, and explore trends in the occurrence of sleep disorders.
- Educational training: Use it for teaching data analysis, machine learning, and statistical techniques, with a focus on health and wellness.
- Predictive modelling: Build models to predict sleep quality, the likelihood of sleep disorders, or cardiovascular health based on daily activities and stress levels.
Coverage: This dataset is synthetic and anonymized, making it a safe tool for experimentation and learning without compromising real patient privacy.
CC0 (Public Domain)
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The Diabetes Dataset is a comprehensive resource designed to support researchers, data scientists, and healthcare professionals interested in diabetes risk assessment and prediction. With a broad spectrum of health-related attributes, this dataset is ideal for developing predictive models and exploring factors associated with diabetes risk. By providing this dataset, we aim to encourage collaboration and innovation in data science and healthcare, potentially leading to more accurate early diagnoses and personalized diabetes treatment strategies.
Image: Synthetic Diabetes Patient Records Data (https://storage.googleapis.com/opendatabay_public/images/image_61dbe587-4a7d-4307-99ab-e46e63ca0e5b.png)
Image: Synthetic Diabetes Patient Records Distribution (https://storage.googleapis.com/opendatabay_public/images/image_2d5fea3b-d555-4d4a-b109-0e350ae156d5.png)
Image: Synthetic Diabetes Patient Records Data Correlation (https://storage.googleapis.com/opendatabay_public/images/image_3c7351ac-01c8-489c-823d-bae6ca7fe202.png)
Image: Synthetic Diabetes Patient Records Statistic (https://storage.googleapis.com/opendatabay_public/images/diabetes2_85a01003-2848-495d-89d9-2d6a21fd77c0.png)
Image: Synthetic Diabetes Patient Records EMR (https://storage.googleapis.com/opendatabay_public/images/diabetes_copy_f0b0dfc6-56de-42ed-851a-249affebe105.jpg)
This dataset can be used for:
As a synthetic and anonymized dataset, it offers a secure environment for experimentation and learning without compromising individual privacy.
CC0 (Public Domain)