License: https://www.nist.gov/open/license
SDNist v2 is a Python package that provides benchmark data and evaluation metrics for deidentified data generators. This version of SDNist supports the NIST Diverse Communities Data Excerpts, a geographically partitioned, limited-feature data set. Given a deidentified dataset, SDNist evaluates its utility and privacy and generates a summary quality report that enumerates and illustrates the dataset's performance on each utility and privacy metric.
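SDNist's headline utility score is based on comparing feature distributions between the target and deidentified data (its k-marginal metric). The sketch below is not SDNist's implementation, only a toy illustration of the simplest (1-marginal) case, using total variation distance between the marginal distributions of a single feature:

```python
from collections import Counter

def marginal_tvd(target_col, deid_col):
    """Total variation distance between the marginal distributions of one
    feature in the target vs. deidentified data (0 = identical, 1 = disjoint)."""
    t, d = Counter(target_col), Counter(deid_col)
    nt, nd = len(target_col), len(deid_col)
    support = set(t) | set(d)
    return 0.5 * sum(abs(t[v] / nt - d[v] / nd) for v in support)

# hypothetical categorical feature (e.g. a geography code)
target = ["NH", "NH", "MA", "TX", "TX", "TX"]
deid   = ["NH", "MA", "MA", "TX", "TX", "TX"]
score = marginal_tvd(target, deid)  # small score = deid data preserves the marginal
```

A k-marginal metric generalizes this by scoring joint distributions over random subsets of k features, which is far more sensitive to lost correlations.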
License: https://dataintelo.com/privacy-and-policy
According to our latest research, the global Sandbox Data Generator market size reached USD 1.41 billion in 2024 and is projected to grow at a robust CAGR of 11.2% from 2025 to 2033. By the end of the forecast period, the market is expected to attain a value of USD 3.71 billion by 2033. This remarkable growth is primarily driven by the increasing demand for secure, reliable, and scalable test data generation solutions across industries such as BFSI, healthcare, and IT and telecommunications, as organizations strive to enhance their data privacy and compliance capabilities in an era of heightened regulatory scrutiny and digital transformation.
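The projection above is straight compound-growth arithmetic. As a quick sanity check on the reported figures (the small gap from USD 3.71 billion comes from rounding in the report):

```python
def cagr_projection(base_value, cagr, years):
    """Compound a base value forward at a constant annual growth rate."""
    return base_value * (1 + cagr) ** years

# USD 1.41B in 2024 compounded at 11.2% over 2025-2033 (9 growth years)
projected = cagr_projection(1.41, 0.112, 9)  # roughly 3.67, near the quoted 3.71
```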
A major growth factor propelling the Sandbox Data Generator market is the intensifying focus on data privacy and regulatory compliance across global enterprises. With stringent regulations such as GDPR, CCPA, and HIPAA becoming the norm, organizations are under immense pressure to ensure that non-production environments do not expose sensitive information. Sandbox data generators, which enable the creation of realistic yet anonymized or masked data sets for testing and development, are increasingly being adopted to address these compliance challenges. Furthermore, the rise of DevOps and agile methodologies has led to a surge in demand for efficient test data management, as businesses seek to accelerate software development cycles without compromising on data security. The integration of advanced data masking, subsetting, and anonymization features within sandbox data generation platforms is therefore a critical enabler for organizations aiming to achieve both rapid innovation and regulatory adherence.
Another significant driver for the Sandbox Data Generator market is the exponential growth of digital transformation initiatives across various industry verticals. As enterprises migrate to cloud-based infrastructures and adopt advanced technologies such as AI, machine learning, and big data analytics, the need for high-quality, production-like test data has never been more acute. Sandbox data generators play a pivotal role in supporting these digital initiatives by supplying synthetic yet realistic datasets that facilitate robust testing, model training, and system validation. This, in turn, helps organizations minimize the risks associated with deploying new applications or features, while reducing the time and costs associated with traditional data provisioning methods. The rise of microservices architecture and API-driven development further amplifies the necessity for dynamic, scalable, and automated test data generation solutions.
Additionally, the proliferation of data breaches and cyber threats has underscored the importance of robust data protection strategies, further fueling the adoption of sandbox data generators. Enterprises are increasingly recognizing that using real production data in test environments can expose them to significant security vulnerabilities and compliance risks. By leveraging sandbox data generators, organizations can create safe, de-identified datasets that maintain the statistical properties of real data, enabling comprehensive testing without jeopardizing sensitive information. This trend is particularly pronounced in sectors such as BFSI and healthcare, where data sensitivity and compliance requirements are paramount. As a result, vendors are investing heavily in enhancing the security, scalability, and automation capabilities of their sandbox data generation solutions to cater to the evolving needs of these high-stakes industries.
From a regional perspective, North America is anticipated to maintain its dominance in the global Sandbox Data Generator market, driven by the presence of leading technology providers, a mature regulatory landscape, and high digital adoption rates among enterprises. However, the Asia Pacific region is poised for the fastest growth, fueled by rapid digitalization, increasing investments in IT infrastructure, and growing awareness of data privacy and compliance issues. Europe also represents a significant market, supported by stringent data protection regulations and a strong focus on innovation across key industries. As organizations worldwide continue to prioritize data security and agile development, the demand for advanced sandbox data generation solutions is expected to witness sustained growth across all major regions.
The Sandbox Data Genera
According to our latest research, the global Test Data Generation as a Service market size reached USD 1.36 billion in 2024, reflecting a dynamic surge in demand for efficient and scalable test data solutions. The market is expected to expand at a robust CAGR of 18.1% from 2025 to 2033, reaching a projected value of USD 5.41 billion by the end of the forecast period. This remarkable growth is primarily driven by the accelerated adoption of digital transformation initiatives, increasing complexity in software development, and the critical need for secure and compliant data management practices across industries.
One of the primary growth factors for the Test Data Generation as a Service market is the rapid digitalization of enterprises across diverse verticals. As organizations intensify their focus on delivering high-quality software products and services, the need for realistic, secure, and diverse test data has become paramount. Modern software development methodologies, such as Agile and DevOps, necessitate continuous testing cycles that depend on readily available and reliable test data. This demand is further amplified by the proliferation of cloud-native applications, microservices architectures, and the integration of artificial intelligence and machine learning in business processes. Consequently, enterprises are increasingly turning to Test Data Generation as a Service solutions to streamline their testing workflows, reduce manual effort, and accelerate time-to-market for their digital offerings.
Another significant driver propelling the market is the stringent regulatory landscape governing data privacy and security. With regulations such as GDPR, HIPAA, and CCPA becoming more prevalent, organizations face immense pressure to ensure that sensitive information is not exposed during software testing. Test Data Generation as a Service providers offer advanced data masking and anonymization capabilities, enabling enterprises to generate synthetic or de-identified data sets that comply with regulatory requirements. This not only mitigates the risk of data breaches but also fosters a culture of compliance and trust among stakeholders. Furthermore, the increasing frequency of cyber threats and data breaches has heightened the emphasis on robust security testing, further boosting the adoption of these services across sectors like BFSI, healthcare, and government.
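As a rough illustration of the kind of masking these services automate (the field names and surrogate scheme here are hypothetical, not any vendor's API), a deterministic hash-based mask hides real values while keeping them consistent, so joins across test tables still line up:

```python
import hashlib

def mask_record(record, sensitive_fields):
    """Replace sensitive field values with deterministic surrogates.

    The same input always maps to the same surrogate, preserving
    referential integrity across masked test tables."""
    masked = dict(record)
    for field in sensitive_fields:
        if field in masked:
            digest = hashlib.sha256(str(masked[field]).encode()).hexdigest()[:8]
            masked[field] = f"{field}_{digest}"
    return masked

patient = {"name": "Jane Doe", "ssn": "123-45-6789", "age": 54}
safe = mask_record(patient, ["name", "ssn"])  # age passes through unmasked
```

Production tools add format-preserving masking, subsetting, and synthetic value generation on top of this basic idea.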
The growing complexity of IT environments and the need for seamless integration across legacy and modern systems also contribute to the expansion of the Test Data Generation as a Service market. Enterprises are grappling with heterogeneous application landscapes, comprising on-premises, cloud, and hybrid deployments. Test Data Generation as a Service solutions offer the flexibility to generate and provision data across these environments, ensuring consistent and reliable testing outcomes. Additionally, the scalability of cloud-based offerings allows organizations to handle large volumes of test data without significant infrastructure investments, making these solutions particularly attractive for small and medium enterprises (SMEs) seeking cost-effective testing alternatives.
From a regional perspective, North America continues to dominate the Test Data Generation as a Service market, accounting for the largest share in 2024, followed closely by Europe and Asia Pacific. The region's leadership is attributed to the presence of major technology providers, early adoption of advanced software testing practices, and a mature regulatory environment. However, Asia Pacific is poised to exhibit the highest CAGR during the forecast period, driven by the rapid expansion of the IT and telecommunications sector, increasing digital initiatives by governments, and a burgeoning startup ecosystem. Latin America and the Middle East & Africa are also witnessing steady growth, supported by rising investments in digital infrastructure and heightened awareness about data security and compliance.
This dataset contains the core data to be used in projects for the textbook Introduction to Biomedical Data Science, edited by Robert Hoyt MD FACP ABPM-CI and Robert Muenchen MS PSTAT (2019).
Data was generated using Synthea, a synthetic patient generator that models the medical history of synthetic patients. Synthea's mission is to output high-quality synthetic, realistic but not real, patient data and associated health records covering every aspect of healthcare. The resulting data is free from cost, privacy, and security restrictions, enabling research with Health IT data that is otherwise legally or practically unavailable. De-identified real data still presents a challenge in the medical field because some people excel at re-identifying such data; for that reason, the average medical center will not share its patient data, and most governmental data is aggregated at the hospital level. NHANES data is an exception.
You can read Synthea's first academic paper here.
284 scholarly articles cite this dataset.
Authors: Brenda Griffith
License: https://www.nist.gov/open/license
The NIST Excerpts Benchmark Data are a set of target data for deidentification algorithms. The data are configured to work with "SDNist: Synthetic Data Report Tool", a package for evaluating synthetic data generators: https://github.com/usnistgov/SDNist. An installation of SDNist will download the data resources automatically. Jan 2025 -- Benchmark Excerpts:
- NIST American Community Survey (ACS) Data Excerpts: 24 demographic features over 40k records
- NIST Survey of Business Owners (SBO) Data Excerpts: 130 demographic and financial features over 161k records
The data are curated subsets of U.S. Census Bureau products.
According to our latest research, the global synthetic medical image data services market size stood at USD 452 million in 2024, reflecting robust adoption across healthcare and life sciences sectors. The market is expected to grow at a remarkable CAGR of 33.7% from 2025 to 2033, reaching a projected value of USD 5.4 billion by 2033. This exponential growth is primarily driven by the escalating demand for high-quality, diverse, and annotated medical imaging datasets to power artificial intelligence (AI) and machine learning (ML) algorithms for diagnostics, research, and training purposes. As per our comprehensive analysis, the rapid integration of synthetic data solutions is revolutionizing medical imaging workflows, enabling healthcare stakeholders to overcome data scarcity and privacy concerns while accelerating innovation.
The synthetic medical image data services market is experiencing significant growth due to the increasing need for large, annotated datasets to train and validate AI-driven diagnostic tools. Traditional approaches to medical image acquisition are often hampered by regulatory restrictions, data privacy concerns, and the inherent variability and scarcity of rare disease cases. Synthetic data generation addresses these challenges by creating realistic, customizable, and privacy-compliant datasets that enhance the performance and generalizability of AI models. Furthermore, the adoption of synthetic data accelerates the development cycle for new imaging technologies and supports the validation of medical devices, fostering a more agile and innovative healthcare ecosystem. The growing sophistication of generative adversarial networks (GANs) and other deep learning techniques has further improved the realism and utility of synthetic images, making them increasingly indispensable for modern medical imaging applications.
Another key growth factor for the synthetic medical image data services market is the rising emphasis on data privacy and compliance with regulations such as HIPAA in the United States and GDPR in Europe. These regulations impose stringent requirements on the use and sharing of patient data, often limiting the availability of real-world medical images for research and commercial purposes. Synthetic data offers a compelling solution by generating de-identified datasets that closely mimic real patient data without exposing sensitive information. This not only facilitates collaborative research and cross-institutional projects but also enables companies to scale their AI development efforts globally without the risk of data breaches or legal repercussions. As the healthcare industry continues to prioritize patient confidentiality, the demand for synthetic data services is expected to surge.
The market is further propelled by the expanding applications of synthetic medical image data in education, training, and research. Medical professionals, students, and researchers increasingly rely on diverse and complex datasets to hone their diagnostic skills, test new hypotheses, and develop innovative imaging solutions. Synthetic data bridges the gap where real-world datasets are insufficient or unavailable, providing a cost-effective and scalable alternative for simulation-based training and validation. This capability is especially valuable in regions with limited access to advanced imaging resources or rare clinical cases. As academic and research institutions intensify their focus on AI and machine learning in healthcare, synthetic data services are poised to become a cornerstone of medical education and innovation.
From a regional perspective, North America currently leads the synthetic medical image data services market, accounting for the largest share due to its advanced healthcare infrastructure, strong presence of AI technology providers, and supportive regulatory environment. Europe follows closely, driven by robust investments in digital health and a proactive stance on data privacy. The Asia Pacific region is emerging as a high-growth market, fueled by rapid digital transformation, increasing healthcare expenditure, and a burgeoning ecosystem of AI startups. Latin America and the Middle East & Africa, while still nascent, are expected to witness accelerated adoption as healthcare modernization initiatives gain momentum. Overall, the global market landscape is characterized by dynamic growth opportunities, with both developed and emerging regions contributing to the expansion of synthetic medical image da
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
To generate the evidence needed to understand, improve, and share what works to help refugee children learn and succeed in school, the International Rescue Committee (IRC) and NYU Global TIES for Children (TIES/NYU) established a strategic partnership, the Evidence for Action: Education in Emergencies (3EA) initiative. 3EA in Niger was designed and delivered to help strengthen the public education system in Niger and to serve refugee, IDP, and host-community children in the hard-hit Diffa region. It strove to achieve this through a remedial tutoring program infused with climate-targeted social-emotional learning (SEL) principles and practices (Tutoring in Healing Classrooms, HCT) and through added skill-targeted SEL interventions (Mindfulness activities, Brain Games). Each year, the program was designed to be implemented with approximately 2000 students in second to fourth grades attending 28 Nigerien public schools across Diffa. Ninety tutors were enlisted per year to serve this group, with each tutoring class averaging about 20 students. A series of cluster randomized control trials over the course of two years evaluated the effectiveness of the HCT and skill-targeted SEL programming. This dataset does not contain treatment indicators. Please contact the authors for access if interested in using those variables.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PATRON is a human ethics approved program of research incorporating an enduring de-identified repository of primary care data, facilitating research and knowledge generation. PATRON is part of the 'Data for Decisions' initiative of the Department of General Practice, University of Melbourne. 'Data for Decisions' is a research initiative in partnership with general practices. It is an exciting undertaking that makes possible primary care research projects to increase knowledge and improve healthcare practices and policy.
Principal Researcher: Jon Emery
Data Custodian: Lena Sanci
Data Steward: Douglas Boyle
Manager: Rachel Canaway
More information about Data for Decisions and utilising PATRON data is available from the Data for Decisions website.
Participants meeting eligibility criteria were asked to provide at least 6 mL of sputum, either in one or two samples collected on day 1 and day 2. Samples were homogenized, decontaminated, and re-suspended in a 4 mL final volume for all downstream testing. MTB/RIF, acid-fast bacilli (AFB) smear, Hain MTBDRplus and MTBDRsl, Mycobacteria Growth Indicator Tube (MGIT), and Löwenstein–Jensen (LJ) medium culture were performed on the sediment for standard-of-care testing. MGIT pDST was performed for all culture-positive samples for RIF, INH, FQ (MFX, LFX), PZA, AMK, CAP, KAN, BDQ, LZD, CLF, STR, and EMB at WHO-endorsed critical concentrations.
This dataset was created to be the base of the data.world SQL tutorial exercises. Data was generated using Synthea, a synthetic patient generator that models the medical history of synthetic patients. Synthea's mission is to output high-quality synthetic, realistic but not real, patient data and associated health records covering every aspect of healthcare. The resulting data is free from cost, privacy, and security restrictions, enabling research with Health IT data that is otherwise legally or practically unavailable. De-identified real data still presents a challenge in the medical field because some people excel at re-identifying such data; for that reason, the average medical center will not share its patient data, and most governmental data is aggregated at the hospital level. NHANES data is an exception.
You can read Synthea's first academic paper here.
Photo by Rubaitul Azad on Unsplash
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
To generate the evidence needed to understand, improve, and share what works to help refugee children learn and succeed in school, the International Rescue Committee (IRC) and NYU Global TIES for Children (TIES/NYU) established a strategic partnership, the Evidence for Action: Education in Emergencies (3EA) initiative. In Lebanon, this program was designed and delivered to complement the Lebanese public education system and enhance learning and retention of Syrian refugee children through remedial tutoring programs infused with climate-targeted social-emotional learning (SEL) principles and practices (Tutoring in Healing Classrooms, HCT) and skill-targeted SEL interventions (Mindfulness activities, Brain Games, 5-Component SEL Curriculum). An estimated 5000 Syrian refugee children enrolled in Lebanese public schools, and the teachers working with them, participated in the program. These students attended 2.5-hour tutoring sessions three times a week, with each session consisting of three lessons (Arabic, French/English, mathematics) and each lesson lasting 30 to 40 minutes. A series of cluster randomized control trials over the course of two years evaluated the effectiveness of the HCT and skill-targeted SEL programming. This dataset does not contain treatment indicators. Please contact the authors for access if interested in using those variables.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Clinical letters contain sensitive information, limiting their use in model training, medical research, and education. This study aims to generate reliable, diverse, and de-identified synthetic clinical letters to support these tasks. We investigated multiple pre-trained language models for text masking and generation, focusing on Bio_ClinicalBERT, and applied different masking strategies. Evaluation included qualitative and quantitative assessments, downstream named entity recognition (NER) tasks, and clinically focused evaluations using BioGPT and GPT-3.5-turbo. The experiments show: (1) encoder-only models perform better than encoder–decoder models; (2) models trained on general corpora perform comparably to clinical-domain models if clinical entities are preserved; (3) preserving clinical entities and document structure aligns with the task objectives; (4) masking strategies have a noticeable impact on the quality of synthetic clinical letters: masking stopwords has a positive impact, while masking nouns or verbs has a negative effect; (5) BERTScore should be the primary quantitative evaluation metric, with other metrics serving as supplementary references; (6) contextual information has only a limited effect on the models' understanding, suggesting that synthetic letters can effectively substitute for real ones in downstream NER tasks; (7) although the model occasionally generates hallucinated content, it appears to have little effect on overall clinical performance. Unlike previous research, which primarily focuses on reconstructing original letters by training language models, this paper provides a foundational framework for generating diverse, de-identified clinical letters. It offers a direction for utilizing the model to process real-world clinical letters, thereby helping to expand datasets in the clinical domain. Our codes and trained models are available at https://github.com/HECTA-UoM/Synthetic4Health.
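Finding (4) can be illustrated with a minimal masking pass. This is illustrative only, not the paper's pipeline: in the study, a fill-mask model such as Bio_ClinicalBERT then regenerates the masked positions, and the stopword list and example sentence here are assumptions:

```python
import re

# small illustrative stopword list, not the one used in the paper
STOPWORDS = {"the", "a", "an", "was", "is", "of", "to", "and", "on", "with"}

def mask_tokens(text, should_mask, mask_token="[MASK]"):
    """Replace selected tokens with a mask token, leaving the rest
    (e.g. clinical entities and document structure) intact."""
    tokens = re.findall(r"\w+|\W+", text)  # words and the separators between them
    return "".join(mask_token if should_mask(t) else t for t in tokens)

letter = "The patient was admitted with pneumonia and treated on the ward."
masked = mask_tokens(letter, lambda t: t.lower() in STOPWORDS)
# clinical content ("patient", "pneumonia", "ward") survives; filler words are masked
```

Masking nouns or verbs instead would hand the generator exactly the clinically meaningful slots, which is why that strategy degraded quality in the study.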
License: https://github.com/MIT-LCP/license-and-dua/tree/master/drafts
The advent of large, open access text databases has driven advances in state-of-the-art model performance in natural language processing (NLP). The relatively limited amount of clinical data available for NLP has been cited as a significant barrier to the field's progress. Here we describe MIMIC-IV-Note: a collection of deidentified free-text clinical notes for patients included in the MIMIC-IV clinical database. MIMIC-IV-Note contains 331,794 deidentified discharge summaries from 145,915 patients admitted to the hospital and emergency department at the Beth Israel Deaconess Medical Center in Boston, MA, USA. The database also contains 2,321,355 deidentified radiology reports for 237,427 patients. All notes have had protected health information removed in accordance with the Health Insurance Portability and Accountability Act (HIPAA) Safe Harbor provision. All notes are linkable to MIMIC-IV, providing important context to the clinical data therein. The database is intended to stimulate research in clinical natural language processing and associated areas.
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Summarized (deidentified) qualitative data for Specific Aim (SA) 2 of K18MD019159 (PI: Goldstein). One accompanying file is enclosed, which describes the methods and interview questions used to generate the summarized data. Please see the following for additional information about the summarized qualitative data provided in this dataset: Goldstein, E.V., Sanger, A. & Hill, J.L. Firearm experiences and safe storage challenges among a sample of Black adults: a rapid qualitative analysis. Inj. Epidemiol. 12, 79 (2025). https://doi.org/10.1186/s40621-025-00634-5
This dataset was collected from first-generation immigrants between 2022 and 2023. Over a 28-day period, 39 participants aged 18 to 65, fluent in English and experiencing loneliness (UCLA Loneliness Scale score ≥ 28), contributed to the study. Data collection utilized the Samsung Watch Active 2, Oura Ring, AWARE, and the Centralive smartphone application. This dataset contains raw data from photoplethysmogram (PPG) and inertial measurement unit (IMU) readings and air pressure, and processed data on heart rate, heart rate variability, sleep metrics (bedtime, stages, quality), physical activity (steps, active calories, activity types), and smartphone usage patterns (screen time, notifications, call and message logs). Participants also completed ecological momentary assessments (EMA) and weekly surveys, including instruments like the Beck Depression Inventory (BDI), Patient Health Questionnaire-9 (PHQ-9), Perceived Stress Scale, Sense of Coherence Scale, Social Connectedness Scale, Twente Engagement with...
Design and set-up: This study was designed to create a longitudinal dataset capturing physiological, behavioral, and psychological data from first-generation immigrants living in Finland. The dataset aims to support research on the relationship between mental health and daily lifestyle factors, providing a foundation for further detection-algorithm development. To achieve this, the study collected multimodal data over a 28-day period from every participant. Objective data were gathered from wearable devices, which recorded sleep patterns, physical activity, cardiovascular health metrics, and raw PPG signals. Passive smartphone data, such as screen usage, notifications, calls, and messages, were also collected to capture digital behavior patterns. Subjective data were collected through EMAs delivered via push notifications and weekly self-report surveys. These assessments measured daily emotional states: loneliness, stress, depression, and social connectedness.
By integrating multiple d...
Loneliness and well-being in Finnish immigrants: A multimodal dataset from wearables and passive data collection
The dataset consists of longitudinal physiological, behavioral, and self-reported data collected from first-generation immigrants in Finland during 2022 and 2023. The study included 39 participants aged 18–65, all fluent in English and experiencing loneliness (UCLA Loneliness Scale score ≥28). Data were collected over a 28-day period using multimodal sources, including the Samsung Watch Active 2, Oura Ring, and the AWARE smartphone application.
The dataset includes raw and processed data on cardiovascular health, sleep patterns, physical activity, smartphone usage, and mental health assessments. Daily and weekly ecological momentary assessments (EMA) captured momentary emotional states, while structured surveys administered through Centralive provided insights into participants' mental health and well-being.
At the root of the dat...
All participants provided written informed consent to share their de-identified data for public research purposes at the time of enrollment.
To protect participant privacy and minimize the risk of re-identification, we applied the following de-identification procedures:
License: https://dataintelo.com/privacy-and-policy
According to our latest research, the global market size for Pseudonymized Sandboxes for Data Science reached USD 1.14 billion in 2024, reflecting a robust demand for secure and privacy-compliant data environments. The market is growing at a CAGR of 17.2% and is projected to reach USD 5.03 billion by 2033. This remarkable growth is primarily driven by increasing regulatory requirements for data privacy, the proliferation of sensitive data across industries, and the rising adoption of advanced analytics and artificial intelligence in business operations.
The surge in data privacy regulations such as GDPR, HIPAA, and CCPA has become a significant growth driver for the Pseudonymized Sandboxes for Data Science market. Enterprises are under immense pressure to ensure that their data science and AI initiatives do not compromise personal or sensitive information. Pseudonymized sandboxes provide a secure, controlled environment where data scientists can work with de-identified data, minimizing the risk of data breaches and unauthorized access. This approach enables organizations to maintain compliance while accelerating analytics-driven innovation, making these sandboxes indispensable in regulated sectors such as healthcare, finance, and government. The demand is further amplified by the increasing frequency of cyber threats and the need for robust data governance frameworks.
Another key factor fueling the market’s expansion is the exponential growth of big data and the adoption of cloud-based analytics solutions. As businesses generate and collect vast amounts of data, the need to analyze this information without exposing sensitive details has become paramount. Pseudonymized sandboxes offer a pragmatic solution, allowing organizations to leverage data for advanced analytics, machine learning, and AI model training while safeguarding privacy. The flexibility to deploy these sandboxes either on-premises or in the cloud caters to diverse enterprise needs, supporting scalability and cost-efficiency. This capability is especially attractive to industries like retail and IT & telecom, where rapid innovation and customer-centricity are critical.
The market is also benefiting from the increasing collaboration between data science teams and business units. As organizations strive to become more data-driven, cross-functional teams require access to data without violating privacy norms. Pseudonymized sandboxes enable secure data sharing and experimentation, fostering a culture of innovation. Additionally, advances in pseudonymization technologies, such as tokenization and differential privacy, are enhancing the effectiveness and reliability of these sandboxes. The integration of automation and AI-driven data masking further streamlines the process, reducing manual intervention and operational risk. These trends collectively contribute to the sustained growth and adoption of pseudonymized sandboxes across various sectors.
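Of the technologies named above, differential privacy is the most concrete to demonstrate. The sketch below is the textbook Laplace mechanism for releasing a count (not any particular vendor's implementation); the epsilon value and count are illustrative:

```python
import math
import random

def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution via inverse CDF."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def dp_count(true_count, epsilon, rng):
    """Release a count with epsilon-differential privacy.

    A counting query has sensitivity 1 (one person changes the count
    by at most 1), so noise is drawn from Laplace(0, 1/epsilon)."""
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(42)
noisy = dp_count(1000, epsilon=0.5, rng=rng)  # close to 1000, but never exact
```

Smaller epsilon means more noise and stronger privacy; inside a sandbox, analysts see only such noised aggregates rather than row-level data.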
Regionally, North America dominates the Pseudonymized Sandboxes for Data Science market, accounting for the largest revenue share in 2024, followed by Europe and Asia Pacific. The presence of stringent regulatory frameworks, mature data science ecosystems, and a high concentration of technology-driven enterprises are key factors underpinning North America’s leadership. Meanwhile, Asia Pacific is witnessing the fastest growth, driven by rapid digitalization, increasing awareness of data privacy, and government initiatives to enhance cybersecurity. Europe’s growth is anchored in its robust regulatory landscape and strong emphasis on data protection, while Latin America and the Middle East & Africa are gradually embracing pseudonymized sandboxes as digital transformation accelerates in these regions.
The Pseudonymized Sandboxes for Data Science market is segmented by component into software and services. The software segment comprises the core platforms and tools that enable pseudonymization, data masking, tokenization, and sandboxing functionalities. This segment is witnessing significant growth as organizations increasingly invest in advanced software solutions to automate and streamline their data privacy processes. Modern pseudonymization software leverages artificial intelligence and machine learning to enhance data security, ensure regulatory compliance, and facilitate seamless integration with existing analytics infrastructure.
This study includes a synthetically-generated version of the Ministry of Justice Data First Probation datasets. Synthetic versions of all 43 tables in the MoJ Data First data ecosystem have been created. These versions can be used and joined in the same way as the real datasets. As well as underpinning training, synthetic datasets should enable researchers to explore research questions and to design research proposals prior to submitting these for approval. The code created during this exploration and design process should then enable initial results to be obtained as soon as data access is granted.
The Ministry of Justice Data First probation dataset provides data on people under the supervision of the probation service in England and Wales from 2014 onwards. This is a statutory criminal justice service that supervises high-risk offenders released into the community. The data has been extracted from the national Delius (nDelius) management information system, used by His Majesty's Prison and Probation Service (HMPPS) to manage people on probation.
Information is included on service users' characteristics and offence, and on their pre-sentence reports, sentence requirements, licence conditions and post-sentence supervision; for example, age, gender, ethnicity, offence category, key dates relating to sentence and recalls, activities and programmes required as part of rehabilitation (e.g. drug and alcohol treatment, skills training) and limitations set on their activities (e.g. curfew, location monitoring, drugs testing).
Each record in the dataset gives information about a single person and probation journey. As part of Data First, records have been deidentified and deduplicated, using our probabilistic record linkage package, Splink, so that a unique identifier is assigned to all records believed to relate to the same person, allowing for longitudinal analysis and investigation of repeat interactions with probation. This aims to improve on links already made within probation services. This opens up the potential to better understand probation service users and address questions on, for example, what works to reduce reoffending.
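The final step described above, assigning one identifier to all records believed to relate to the same person, amounts to grouping records connected by matched pairs. The sketch below illustrates that grouping with a simple union-find; the record IDs and matched pairs are invented for illustration, and this is not Splink's actual implementation (Splink scores candidate pairs probabilistically before any such clustering).

```python
def assign_person_ids(record_ids, matched_pairs):
    """Give every record in a connected group of matches the same ID (union-find)."""
    parent = {r: r for r in record_ids}

    def find(r):
        while parent[r] != r:
            parent[r] = parent[parent[r]]  # path compression
            r = parent[r]
        return r

    for a, b in matched_pairs:
        parent[find(a)] = find(b)  # merge the two groups

    return {r: find(r) for r in record_ids}

records = ["rec1", "rec2", "rec3", "rec4"]
# Pairs judged to be the same person, e.g. match probability above a threshold
pairs = [("rec1", "rec2"), ("rec2", "rec3")]
ids = assign_person_ids(records, pairs)
# rec1, rec2 and rec3 share one identifier; rec4 keeps its own
```

Note that linkage is transitive here: rec1 and rec3 receive the same identifier even though they were never directly matched, which is exactly what enables longitudinal analysis across separate probation records.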
The Ministry of Justice Data First linking dataset can be used in combination with this and other Data First datasets to join up administrative records about people from across justice services (courts, prisons and probation) to increase understanding around users' interactions, pathways and outcomes.
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Abstract MIMIC-III is a large, freely-available database comprising deidentified health-related data associated with over 40,000 patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012 [1]. The MIMIC-III Clinical Database is available on PhysioNet (doi: 10.13026/C2XW26). Though deidentified, MIMIC-III contains detailed information regarding the care of real patients, and as such requires credentialing before access. To allow researchers to ascertain whether the database is suitable for their work, we have manually curated a demo subset, which contains information for 100 patients also present in the MIMIC-III Clinical Database. Notably, the demo dataset does not include free-text notes.
Background In recent years there has been a concerted move towards the adoption of digital health record systems in hospitals. Despite this advance, interoperability of digital systems remains an open issue, leading to challenges in data integration. As a result, the potential that hospital data offers in terms of understanding and improving care is yet to be fully realized.
MIMIC-III integrates deidentified, comprehensive clinical data of patients admitted to the Beth Israel Deaconess Medical Center in Boston, Massachusetts, and makes it widely accessible to researchers internationally under a data use agreement. The open nature of the data allows clinical studies to be reproduced and improved in ways that would not otherwise be possible.
The MIMIC-III database was populated with data that had been acquired during routine hospital care, so there was no associated burden on caregivers and no interference with their workflow. For more information on the collection of the data, see the MIMIC-III Clinical Database page.
Methods The demo dataset contains all intensive care unit (ICU) stays for 100 patients. These patients were selected randomly from the subset of patients in the dataset who eventually die. Consequently, all patients will have a date of death (DOD). However, patients do not necessarily die during an individual hospital admission or ICU stay.
This project was approved by the Institutional Review Boards of Beth Israel Deaconess Medical Center (Boston, MA) and the Massachusetts Institute of Technology (Cambridge, MA). Requirement for individual patient consent was waived because the project did not impact clinical care and all protected health information was deidentified.
Data Description MIMIC-III is a relational database consisting of 26 tables. For a detailed description of the database structure, see the MIMIC-III Clinical Database page. The demo shares an identical schema, except all rows in the NOTEEVENTS table have been removed.
The data files are distributed in comma separated value (CSV) format following the RFC 4180 standard. Notably, string fields which contain commas, newlines, and/or double quotes are encapsulated by double quotes ("). Actual double quotes in the data are escaped using an additional double quote. For example, the string she said "the patient was notified at 6pm" would be stored in the CSV as "she said ""the patient was notified at 6pm""". More detail is provided on the RFC 4180 description page: https://tools.ietf.org/html/rfc4180
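Python's `csv` module implements the RFC 4180 quoting rules described above, so the round-trip can be checked directly. The `row_id` field name here is invented for illustration.

```python
import csv
import io

original = 'she said "the patient was notified at 6pm"'

# Write a row: the field containing quotes is wrapped in double quotes,
# and each embedded double quote is doubled.
buf = io.StringIO()
csv.writer(buf).writerow(["row_id", original])
encoded = buf.getvalue()
# encoded is: row_id,"she said ""the patient was notified at 6pm"""

# Read it back: the reader undoes the quoting and restores the exact string.
buf.seek(0)
row = next(csv.reader(buf))
assert row[1] == original
```

The same behaviour applies when loading the MIMIC-III demo files: any RFC 4180-compliant parser will handle the embedded commas, newlines, and quotes transparently.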
Usage Notes The MIMIC-III demo provides researchers with an opportunity to review the structure and content of MIMIC-III before deciding whether or not to carry out an analysis on the full dataset.
CSV files can be opened directly in any text editor or spreadsheet program. However, some tables are large, and it may be preferable to navigate the data stored in a relational database. One alternative is to create an SQLite database from the CSV files. SQLite is a lightweight database format which stores all constituent tables in a single file, and SQLite databases interoperate well with a number of software tools.
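Loading the CSV files into SQLite can be done with Python's standard library alone. The sketch below uses a tiny invented stand-in for a MIMIC-III table (real demo files have many more columns) and an in-memory database; pass a file path to `sqlite3.connect` for a persistent one.

```python
import csv
import io
import sqlite3

# Hypothetical mini-table standing in for one of the demo CSV files.
csv_text = "subject_id,gender\n10006,F\n10011,M\n"

conn = sqlite3.connect(":memory:")  # use a filename for a persistent database
conn.execute("CREATE TABLE patients (subject_id INTEGER, gender TEXT)")

# csv.DictReader yields one dict per row, which maps onto named parameters.
reader = csv.DictReader(io.StringIO(csv_text))
conn.executemany(
    "INSERT INTO patients (subject_id, gender) VALUES (:subject_id, :gender)",
    reader,
)
conn.commit()

count, = conn.execute("SELECT COUNT(*) FROM patients").fetchone()
```

For the real files, replace `io.StringIO(csv_text)` with `open("PATIENTS.csv", newline="")` and create one table per CSV; SQLite's type affinity converts integer-valued text such as `subject_id` into integers on insert.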
DB Browser for SQLite is a high quality, visual, open source tool to create, design, and edit database files compatible with SQLite. We have found this tool to be useful for navigating SQLite files. Information regarding installation of the software and creation of the database can be found online: https://sqlitebrowser.org/
Release Notes Release notes for the demo follow the release notes for the MIMIC-III database.
Acknowledgements This research and development was supported by grants NIH-R01-EB017205, NIH-R01-EB001659, and NIH-R01-GM104987 from the National Institutes of Health. The authors would also like to thank Philips Healthcare and staff at the Beth Israel Deaconess Medical Center, Boston, for supporting database development, and Ken Pierce for providing ongoing support for the MIMIC research community.
Conflicts of Interest The authors declare no competing financial interests.
References Johnson, A. E. W., Pollard, T. J., Shen, L., Lehman, L. H., Feng, M., Ghassemi, M., Mo...
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Analytic code and deidentified data set used to generate the findings presented in the manuscript "Economic Outcomes Among Microfinance Group Members Receiving Integrated HIV Care: Cluster Randomized Trial Evidence From Kenya"
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
All data used to generate the findings of this study. (CSV)