22 datasets found
  1. Simulation Data Set

    • catalog.data.gov
    • s.cnmilf.com
    Updated Nov 12, 2020
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Simulation Data Set [Dataset]. https://catalog.data.gov/dataset/simulation-data-set
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR), as in the actual application to the true NC birth records data (see the sketch at the end of this entry). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way; the medians and IQRs are not given. This further protects the identifiability of the spatial locations used in the analysis.

    This dataset is not publicly accessible because EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual-level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. It can be accessed through the following means: File format: R workspace file; "Simulated_Dataset.RData".

    Metadata (including data dictionary):

    • y: Vector of binary responses (1: adverse outcome, 0: control)
    • x: Matrix of covariates; one row for each simulated individual
    • z: Matrix of standardized pollution exposures
    • n: Number of simulated individuals
    • m: Number of exposure time periods (e.g., weeks of pregnancy)
    • p: Number of columns in the covariate design matrix
    • alpha_true: Vector of "true" critical window locations/magnitudes (i.e., the ground truth that we want to estimate)

    Code Abstract: We provide R statistical software code ("CWVS_LMC.txt") to fit the linear model of coregionalization (LMC) version of the Critical Window Variable Selection (CWVS) method developed in the manuscript. We also provide R code ("Results_Summary.txt") to summarize and plot the estimated critical windows and posterior marginal inclusion probabilities.

    Description:

    • "CWVS_LMC.txt": This code is delivered to the user as a .txt file containing R statistical software code. Once the "Simulated_Dataset.RData" workspace has been loaded into R, the code in the file can be used to identify/estimate critical windows of susceptibility and posterior marginal inclusion probabilities.
    • "Results_Summary.txt": This code is also delivered as a .txt file containing R statistical software code. Once the "CWVS_LMC.txt" code has been applied to the simulated dataset and the program has completed, this code can be used to summarize and plot the identified/estimated critical windows and posterior marginal inclusion probabilities (similar to the plots shown in the manuscript).

    Required R packages:

    • For running "CWVS_LMC.txt":
      • msm: Sampling from the truncated normal distribution
      • mnormt: Sampling from the multivariate normal distribution
      • BayesLogit: Sampling from the Polya-Gamma distribution
    • For running "Results_Summary.txt":
      • plotrix: Plotting the posterior means and credible intervals

    Instructions for Use (Reproducibility): The data and code can be used to identify/estimate critical windows from one of the simulated datasets generated under setting E4 of the presented simulation study.
    How to use the information:

    • Load the "Simulated_Dataset.RData" workspace.
    • Run the code contained in "CWVS_LMC.txt".
    • Once the "CWVS_LMC.txt" code is complete, run "Results_Summary.txt".

    Format: Below is the replication procedure for the attached data set, for the portion of the analyses using a simulated data set.

    Data: The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining confidentiality of any actual pregnant women.

    Availability: Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we will make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This will also allow the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics and requires an appropriate data use agreement.

    Permissions: These are simulated data without any identifying information or informative birth-level covariates. The pollution exposures were standardized on each week by subtracting off the median exposure amount on a given week and dividing by the IQR (as in the actual application to the true NC birth records data); the provided weekly average pregnancy exposures have already been standardized in this way, and the medians and IQRs are not given, which further protects the identifiability of the spatial locations used in the analysis.

    This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, Oxford, UK, 1-30, (2019).
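    To make the week-by-week standardization concrete, here is a minimal Python sketch (the archive itself ships R code; the function name, matrix shape, and values below are hypothetical, for illustration only):

```python
import numpy as np

def standardize_exposures(z_raw):
    """Standardize an (n individuals x m weeks) exposure matrix column by
    column: subtract each week's median and divide by that week's IQR."""
    medians = np.median(z_raw, axis=0)                  # median exposure per week
    q75, q25 = np.percentile(z_raw, [75, 25], axis=0)
    iqr = q75 - q25                                     # interquartile range per week
    return (z_raw - medians) / iqr

# Hypothetical example: 100 simulated individuals, 37 weeks of pregnancy
rng = np.random.default_rng(0)
z = standardize_exposures(rng.gamma(shape=2.0, scale=5.0, size=(100, 37)))
print(z.shape)  # (100, 37); each column now has median 0 and IQR 1
```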

  2. Number of licensed day care center slots per 1,000 children aged 0-5 years

    • healthdata.gov
    • data.ca.gov
    • +3 more
    application/rdfxml +5 more
    Updated Apr 8, 2025
    + more versions
    Cite
    chhs.data.ca.gov (2025). Number of licensed day care center slots per 1,000 children aged 0-5 years [Dataset]. https://healthdata.gov/State/Number-of-licensed-day-care-center-slots-per-1-000/35q2-2jky/data
    Explore at:
    Available download formats: csv, tsv, application/rssxml, xml, application/rdfxml, json
    Dataset updated
    Apr 8, 2025
    Dataset provided by
    chhs.data.ca.gov
    Description

    This table contains data on the number of licensed day care center slots (facility capacity) per 1,000 children aged 0-5 years in California, its regions, counties, cities, towns, and census tracts. The table contains 2015 data, and includes type of facility (day care center or infant center).

    Access to child care has become a critical support for working families. Many working families find high-quality child care unaffordable, and the increasing cost of child care can be crippling for low-income families and single parents. These barriers can impact parental choices of child care. Increased availability of child care facilities can positively impact families by providing more choices of child care in terms of price and quality. Estimates for this indicator are provided for the total population, and are not available by race/ethnicity. More information on the data table and a data dictionary can be found in the Data and Resources section.

    The licensed day care centers table is part of a series of indicators in the Healthy Communities Data and Indicators Project (HCI) of the Office of Health Equity. The goal of HCI is to enhance public health by providing data, a standardized set of statistical measures, and tools that a broad array of sectors can use for planning healthy communities and evaluating the impact of plans, projects, policy, and environmental changes on community health. The creation of healthy social, economic, and physical environments that promote healthy behaviors and healthy outcomes requires coordination and collaboration across multiple sectors, including transportation, housing, education, agriculture, and others. Statistical metrics, or indicators, are needed to help local, regional, and state public health and partner agencies assess community environments and plan for healthy communities that optimize public health. More information on HCI can be found here: https://www.cdph.ca.gov/Programs/OHE/CDPH%20Document%20Library/Accessible%202%20CDPH_Healthy_Community_Indicators1pager5-16-12.pdf

    The format of the licensed day care centers table is based on the standardized data format for all HCI indicators. As a result, this data table contains certain variables used in the HCI project (e.g., indicator ID, and indicator definition). Some of these variables may contain the same value for all observations.
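    The indicator itself is a simple rate. Below is a minimal Python sketch of the calculation, assuming hypothetical tract-level capacity and population figures (the column and file names are not the table's actual schema):

```python
import pandas as pd

# Hypothetical inputs: licensed facility capacity and child population by tract
facilities = pd.DataFrame({
    "tract": ["06001400100", "06001400200"],
    "capacity": [120, 45],      # licensed day care center slots
})
population = pd.DataFrame({
    "tract": ["06001400100", "06001400200"],
    "pop_0_5": [850, 510],      # children aged 0-5 years
})

merged = facilities.merge(population, on="tract")
# Indicator: slots per 1,000 children aged 0-5
merged["slots_per_1000"] = merged["capacity"] / merged["pop_0_5"] * 1000
print(merged[["tract", "slots_per_1000"]])
```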

  3. New Lung Lesions in Low-dose CT: a newly annotated longitudinal dataset derived from the National Lung Screening Trial

    • cancerimagingarchive.net
    csv, n/a, nifti, xlsx
    Updated Jul 8, 2025
    Cite
    The Cancer Imaging Archive (2025). New Lung Lesions in Low-dose CT: a newly annotated longitudinal dataset derived from the National Lung Screening Trial [Dataset]. http://doi.org/10.7937/eyvh-ag54
    Explore at:
    Available download formats: xlsx, csv, nifti, n/a
    Dataset updated
    Jul 8, 2025
    Dataset authored and provided by
    The Cancer Imaging Archive
    License

    https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/

    Time period covered
    Jul 8, 2025
    Dataset funded by
    National Cancer Institute (http://www.cancer.gov/)
    Description

    Abstract

    The National Lung Screening Trial is an influential publicly available medical image dataset that has fueled a breadth of work in lesion detection. Strengths of this trial include its multi-institutional nature and the standardization of data collection. One limitation of the original dataset, however, is the lack of image labels such as lesion annotations. We introduce an annotated derivative dataset of 152 lung lesions in 126 scans from 122 participants. These lesions were identified by radiologists during the trial as new compared to prior imaging and suspicious for malignancy. We provide point annotations, image coordinates, and registered baseline CT images for each identified new lesion. This addresses a gap in labeled longitudinal public datasets that capture the development of new lesions, and supports the development of automated tools in lesion detection that leverage temporal information in longitudinal imaging studies.

    Introduction

    The detection of new lesions in cross-sectional imaging is a time- and resource-intensive task in cancer imaging and has a pivotal role in a variety of clinical applications, including lung cancer screening. Automated tools have enormous potential to improve the efficacy and efficiency of new lesion detection in clinical practice; however, there remains a gap in the labeled longitudinal public image datasets that are critical to the development and evaluation of such tools. While public data are available to train convolutional neural networks (CNNs) for lesion detection at a single time point, there is a relative paucity of large, annotated, longitudinal public datasets (multiple timepoints) with new lesions arising from a lesion-negative baseline image. One essential strength of the National Lung Screening Trial (NLST) (https://doi.org/10.7937/TCIA.HMQ8-J677; see also 10.1056/NEJMoa1102873 and CDAS NLST) was the standardized collection of clinical, image, and lesion data, in addition to the size and multi-institutional nature of the trial. We utilized standardized characteristics of lesions that were identified by radiologists during the trial to select a subset of lesions that were marked as new compared to prior imaging timepoints. We provide point annotations for each lesion in this subset, as well as the corresponding baseline CT image, registered to the follow-up time point of interest for ease of comparison. By capturing the development of new lesions, our newly annotated dataset helps to address the gap in labeled longitudinal public image datasets; this may support the development of automated tools for lesion detection that leverage temporal information in longitudinal (multi-time point) imaging studies.

    Methods

    The following subsections provide information about how the data were selected, acquired, and prepared for publication, as well as the approximate date range of the imaging studies.

    Subject Inclusion and Exclusion Criteria

    To identify participants, scans, and lesions of interest, we utilized the standardized clinical datasets (described by dataset dictionaries) that are available in the original TCIA collection (see also 10.7937/TCIA.HMQ8-J677). This clinical dataset describes abnormalities on low-dose CT that were identified by radiologists during the trial, and includes lesion location, size, attenuation, and findings on comparison with prior scans. The 122 participants in the final cohort with annotated new lesions were 61.8 +/- 5.1 (mean +/- SD) years of age; 57 were male and 65 were female. 116 (95.1%), 4 (3.2%), 1 (0.8%), and 1 (0.8%) participants reported their race as "White", "Black", "Native Hawaiian or Other Pacific Islander", and "More than one race", respectively; none reported their race as "Asian".

    Data Analysis

    Lesion Selection: Overview

    To identify lesions of interest, we used a subset of clinical variables defined by the original trial in data dictionaries (Figure 1). We included lesions that were identified on diagnostic-quality CT scans (ctdxqual). Of three screening timepoints (study years 0, 1, and 2), we selected the subset of lesions that were identified at study years 1 or 2 (follow-up), that were new compared to imaging acquired earlier in the study (i.e., did not preexist), and that included a specified slice number and lung lobe (sct_slice_num and sct_epi_loc). All lesions in this subset were "Non-calcified nodule or mass (opacity >= 4 mm diameter)", as slice number was only recorded for lesions of this type. We characterized the selected lesion subset by longest and perpendicular diameter in the indicated slice (sct_long_dia and sct_perp_dia), lesion margins (sct_margins; e.g., spiculated, smooth, poorly defined), and attenuation/subtype (sct_pre_att; e.g., "ground glass", "soft tissue", "mixed"). We describe participant demographics of this cohort by age, sex, and race, based on clinical data available in the TCIA collection.
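    A minimal pandas sketch of these lesion-level filters, using the clinical variable names quoted above (the file name, the study-year column, and the "not preexisting" flag are assumptions; the trial's data dictionaries define the actual columns):

```python
import pandas as pd

# Hypothetical abnormality table exported from the NLST clinical datasets
ab = pd.read_csv("sct_abnormalities.csv")

candidates = ab[
    ab["study_yr"].isin([1, 2])        # follow-up screening timepoints T1/T2
    & ab["sct_slice_num"].notna()      # documented slice number
    & ab["sct_epi_loc"].notna()        # documented lung lobe
    # ...plus the trial's "new vs. preexisting" indicator (column name not
    # given in this description) and the scan-level ctdxqual quality filter
]
print(len(candidates), "candidate new lesions")
```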

    Lesion Selection: Detailed

    At the lesion level, 177,487 total abnormalities were identified in 24,517 participants throughout the study. 59,283 and 60,438 abnormalities were documented at T1 and T2, respectively; of those, 11,726 and 11,892 abnormalities were non-calcified nodules > 4 mm in longest diameter, all of which included a documented slice number and lung lobe. Of those nodules with a documented slice number, 1,372 and 1,282 were new compared to prior imaging (not preexisting) at T1 and T2, respectively; of these, 2,587 (97.5%) lesions were identified on scans that were of diagnostic quality. We then applied criteria at the level of the screening time point. Of 26,453 participants in the low-dose CT arm of the trial, 954 (1.8%) participants had a screen that was positive and suspicious for lung cancer at T1, and none had positive screens suspicious for lung cancer at T2. Of the 954 scans at T1, 749 (78.5%) were of diagnostic quality. Notably, of 75,138 total scans, 73,062 scans (97.2%) were of diagnostic quality, with similar proportions at T1. In total, 194 lesions in 152 participants satisfied all lesion- and screen-level criteria. Only a subset of these were included in the final set, owing to the number of scans available on download from the original NLST collection, as well as to lesions that were described in the clinical dataset but not identifiable in the labeled image. Participants with missing screening time points were excluded from our dataset to ensure that the CT images used were accurately correlated with the time points and lesions described in the clinical datasets. 170 lesions in 132 participants had all expected time points available on download from TCIA. Of those, 152 lesions in 122 participants were identified on the CT series indicated by the clinical dataset and were subsequently annotated with a point marking. The selection of the 152 annotated new lesions in our derivative dataset, as described in the figure, was determined by three sets of criteria: lesion-level, scan-level, and data availability. "Missing data on download" refers to the exclusion of CT data from patients that were missing CT data from at least one of the three imaging timepoints; these timepoints were documented in the clinical data forms but not present on download from the original collection. "Mis-labeled data" refers to CT images whose files correlated with screening timepoint descriptors in the clinical data forms but were inconsistent with the described lesions when reviewed by expert radiologist K.R.

    Lesion Characteristics

    103 (67.8%), 34 (22.4%), 14 (9.2%), and 1 (0.7%) of the lesions were described as soft tissue, ground glass, mixed, or other, respectively; there were no fat or fluid/water lesions in this derivative dataset. The longest diameter for the annotated lesions was 9.4 +/- 5.8 mm (mean +/- SD) and ranged from 4.0 to 41.0 mm (Table 1 under Detailed Description). All lesions in this subset had a longest diameter >= 4.0 mm, per the selection criteria.

    Selection of CT reconstruction filter

    We used the CT reconstruction filter specified in the DICOM header to select one image series from each timepoint for registration. A standard reconstruction (including GE "Standard", Philips "C", Siemens "B30", and Toshiba "FC10") was selected when available, to improve the quality of registration and optimize homogeneity in image reconstruction parameters in the provided registered baseline images. Importantly, the choice of reconstruction kernel does not affect the new-lesion image coordinates for this dataset. Rather, consistent reconstruction filters between baseline and follow-up improve the quality of registration between pairs of timepoints. Homogeneity in image reconstruction parameters across baseline images may also benefit downstream development of automated tools using this dataset.

    Image Annotations

    New lesion image coordinates were manually identified and marked on each image by A.G. and K.R. At the time of annotation, A.G. was a senior medical student with oversight from board-certified radiologist K.R., who is fellowship-trained in abdominal and thoracic imaging and has 8 years of clinical experience. Point annotations were made using the open-source image segmentation software ITK-SNAP (http://www.itksnap.org/).

    Image Preprocessing and Registration with Baseline CT

    For each CT scan containing a new lesion (follow-up), we identified the baseline CT

  4. Data from: Using a standardized sound set to help characterize misophonia: The international affective digitized sounds

    • search.dataone.org
    • data.niaid.nih.gov
    • +1more
    Updated Apr 5, 2024
    Cite
    Jacqueline Trumbull (2024). Using a standardized sound set to help characterize misophonia: The international affective digitized sounds [Dataset]. http://doi.org/10.5061/dryad.gtht76hv4
    Dataset updated
    Apr 5, 2024
    Dataset provided by
    Dryad Digital Repository
    Authors
    Jacqueline Trumbull
    Description

    Misophonia is a condition characterized by negative affect, intolerance, and functional impairment in response to particular repetitive sounds usually made by others (e.g., chewing, sniffing, pen tapping) and associated stimuli. To date, researchers have largely studied misophonia using self-report measures. As the field is quickly expanding, assessment approaches need to advance to include more objective measures capable of differentiating those with and without misophonia. Although several studies have used sounds as experimental stimuli, few have used standardized stimulus sets with demonstrated reliability or validity. To conduct rigorous research to better understand misophonia, it is important to have an easily accessible, standardized set of acoustic stimuli for use across studies. Accordingly, in the present study, the International Affective Digitized Sounds (IADS-2), developed by Bradley and Lang [1], were used to determine whether participants with misophonia responded to cert...

    Group differences in sound ratings were examined using a two-way, mixed analysis of covariance (2 groups x 3 sound types, where "group" corresponds to participants with misophonia or controls, and "sound type" refers to positive, negative, or neutral sounds) on four dependent variables (ratings of valence, arousal, similarity, and avoidance). When statistically significant interactions were observed for sound type, pairwise comparisons were used to determine group differences on each dependent variable, as well as mean differences between sound types on each dependent variable. All analyses were conducted using IBM SPSS 27 statistical software. The first step in the data analytic plan included cleaning and screening the dataset by (a) inspecting all variables for data entry errors (none were observed), and (b) examining the normality of distributions across study variables. Next, bivariate correlations were explored to examine the relationships among variables and determine whether it wou...

    Using a standardized sound set to help characterize misophonia: The international affective digitized sounds

    https://doi.org/10.5061/dryad.kh18932fd

    Description of the data and file structure

    MQincluded is the group variable. MQincluded=1 describes all participants who meet misophonia criteria and were included in the dataset. MQincluded=0 describes healthy controls. All variable names for the measures have descriptors in the "label" column of SPSS. Average ratings for the dependent variables are found at the end of the variable view in SPSS, as well as PANAS positive and negative scores and the AIM total score.

    Sharing/Access information

    Data was derived from the following sources:

    • Amazon Mechanical Turk

    Code/Software

    SPSS syntax is included with the data upload.

  5. CURVAS dataset

    • zenodo.org
    zip
    Updated May 17, 2024
    Cite
    Meritxell Riera-Marín; Joy-Marie Kleiß; Anton Aubanell; Andreu Antolín (2024). CURVAS dataset [Dataset]. http://doi.org/10.5281/zenodo.11147560
    Explore at:
    Available download formats: zip
    Dataset updated
    May 17, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Meritxell Riera-Marín; Joy-Marie Kleiß; Anton Aubanell; Andreu Antolín
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Clinical Problem

    In medical imaging, DL models are often tasked with delineating structures or abnormalities within complex anatomical structures, such as tumors, blood vessels, or organs. Uncertainty arises from the inherent complexity and variability of these structures, leading to challenges in precisely defining their boundaries. This uncertainty is further compounded by interrater variability, as different medical experts may have varying opinions on where the true boundaries lie. DL models must grapple with these discrepancies, leading to inconsistencies in segmentation results across different annotators and potentially impacting diagnosis and treatment decisions. Addressing interrater variability in DL for medical segmentation involves the development of robust algorithms capable of capturing and quantifying uncertainty, as well as standardizing annotation practices and promoting collaboration among medical experts to reduce variability and improve the reliability of DL-based medical image analysis. Interrater variability poses significant challenges in the field of DL for medical image segmentation.

    Furthermore, achieving model calibration, a fundamental aspect of reliable predictions, becomes notably challenging when dealing with multiple classes and raters. Calibration is pivotal for ensuring that predicted probabilities align with the true likelihood of events, enhancing the model's reliability. It must also be considered that, even when it is not immediately apparent, having multiple classes introduces uncertainties arising from their interactions. Moreover, incorporating annotations from multiple raters adds another layer of complexity, as differing expert opinions contribute to a broader spectrum of variability and computational complexity.

    Consequently, the development of robust algorithms capable of effectively capturing and quantifying variability and uncertainty, while also accommodating the nuances of multi-class and multi-rater scenarios, becomes imperative. Striking a balance between model calibration, accurate segmentation and handling variability in medical annotations is crucial for the success and reliability of DL-based medical image analysis.

    CURVAS Challenge Goal

    Due to all the previously stated reasons, we have created a challenge that considers all of the above. In this challenge, we will work with abdominal CT scans. Each of them will have three different annotations obtained from different experts and each of the annotations will have three classes: pancreas, kidney and liver.

    The main idea is to evaluate the results while taking the multi-rater information into account. There will be three separate evaluations: first, a classical Dice score evaluation together with an uncertainty study; second, a volumetric assessment that provides clinically relevant information; and finally, an assessment of whether the model is calibrated. All of these evaluations will be performed considering all three annotations.
    For more information about the challenge, visit our website to join CURVAS (Calibration and Uncertainty for multiRater Volume Assessment in multiorgan Segmentation). This challenge will be held in MICCAI 2024.
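    As an illustration of the first, Dice-based evaluation against multiple raters, here is a minimal Python sketch (synthetic label volumes and placeholder shapes; this is not the challenge's official evaluation code):

```python
import numpy as np

def dice_score(pred, truth, label):
    """Dice similarity coefficient for one class label in two label maps."""
    p = (pred == label)
    t = (truth == label)
    denom = p.sum() + t.sum()
    return 2.0 * np.logical_and(p, t).sum() / denom if denom else 1.0

# Synthetic 3D label volumes: 0 background, 1 pancreas, 2 kidney, 3 liver
rng = np.random.default_rng(42)
pred = rng.integers(0, 4, size=(8, 64, 64))
raters = [rng.integers(0, 4, size=(8, 64, 64)) for _ in range(3)]

# Evaluate against each of the three expert annotations, per organ class
for organ, label in [("pancreas", 1), ("kidney", 2), ("liver", 3)]:
    scores = [dice_score(pred, r, label) for r in raters]
    print(organ, [round(s, 3) for s in scores])
```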

    Dataset Cohort

    The challenge cohort consists of 90 CT images prospectively gathered at the University Hospital Erlangen between August 2023 and October 2023. Each CT has four classes: background (0), pancreas (1), kidney (2), and liver (3). In addition, each CT has three different annotations from three different experts, each containing the four classes specified previously.

    • Training Phase cohort:

    20 CT scans belonging to group A, with the respective annotations, will be given. Participants are encouraged to leverage publicly available external data annotated by multiple raters. The idea of providing a small training set while allowing the use of public datasets is to make the challenge more inclusive, giving participants the option to develop a method using data that is within anyone's reach. Furthermore, training on this data and evaluating on other data makes methods more robust to shifts and other sources of variability between datasets.

    • Validation Phase cohort:

    5 CT scans belonging to group A will be used for this phase.

    • Test Phase cohort:

    65 CT scans will be used for evaluation: 20 CTs belonging to group A, 22 to group B, and 23 to group C.

    Neither the validation nor the test CT scan cohorts will be published until the end of the challenge. Furthermore, the group to which each CT scan belongs will not be revealed until after the challenge.

    Clinical Specifications

    Inclusion criteria were a maximum of 10 cysts, each with a diameter of less than 2.0 cm. Furthermore, CT scans with major artifacts (e.g., breathing artifacts) or incomplete registrations were excluded.

    Participants were required to be over 18 years old and to provide both verbal and written consent for the use of their CT images in the challenge. Both study-specific and broad consent were obtained. Among the 90 patients, there were 51 males and 39 females, aged between 37 and 94 years, with an average age of 65.7 years. All patients received treatment at the University Hospital Erlangen in Bavaria, Germany. No additional selection criteria were set, to ensure a representative sample of a typical patient cohort.

    Our overall data consist of 90 CTs split into three different groups:

    • Group A: cases with 2 cysts or fewer and no contour-altering pathologies - 45 CTs

    • Group B: cases with 3-5 cysts and no contour-altering pathologies - 22 CTs

    • Group C: cases with 6-10 cysts, with some pathologies included (liver metastases, hydronephrosis, adrenal gland metastases, missing kidney) - 23 CTs

    In any case, participants will not know which case belongs to which group. This information will be released after the challenge, together with the whole dataset.

    Annotation Protocol

    The first step in obtaining the labels was to use TotalSegmentator [1] [2] to generate rough annotations. The labels were then sent to three radiologists (R1, R2, R3) to correct the automatic annotations and add any missing organs. One of the three labeling radiologists, an MD PhD candidate, had previously defined both the dataset cohort and the criteria for what belongs to the parenchyma and what does not; these criteria were given to the other two labeling radiologists so that all three would be coherent with each other [3]. Separately, two other clinicians (C1, C2) supervised the cohort criteria defined by the MD PhD candidate but had no involvement in the labeling itself; hence, there is no bias between the annotations of the different radiologists.

    Each labeled class for this challenge has specific instructions, listed per organ below.

    • Liver:
      Generally speaking, we define the liver 'as the entire liver tissue including all internal structures like vessel systems, tumors etc.' [4] Thus, the portal vein itself is excluded from contouring. The two main branches of the portal vein are excluded from the segmentation. Any branch of the following generations is included. 'In case of partial enclosure (occurring where large vessels as Vena Cava and portal vein enter or leave the liver), the parts enclosed by liver tissue are included in the segmentation, thus forming the convex hull of the liver shape.' [4] Any fatty tissue that pulls into the liver is excluded. The gallbladder should not be marked. Wide and especially pathologically widened bile ducts are included in the segmentation of the liver.
    • Kidney:
      The right and left kidney will be segmented. Included in the segmentation is the kidney parenchyma, including the renal medulla. Excluded are the renal pelvis [5] and the ureter, as urinary stasis could alter the original volume.
    • Pancreas:
      When segmenting the pancreas, we will not differentiate between head, body, and tail. Moreover, neither the splenic vein nor the mesenteric vein will be included in the segmentation [6]. However, it is important that the whole pancreas is tracked and marked along its entire course.

    Technical Specifications

    The CTs used needed to be contrast-enhanced CT scans in a portal venous phase, acquired with thin slices ranging from 0.6 to 1 mm. Thoracic-abdominal CT images were taken during the patients' hospital stay, motivated by various medical needs. Given the focus on abdominal organs, the Br40 soft kernel was employed. CT examinations were conducted using SIEMENS CT scanners at the University Hospital Erlangen, with rotation times of 0.25 or 0.5 s. Detector collimation varied from 128x0.6 mm single-source to 98x0.6x2 and 144x0.4x2 dual-source configurations. Spiral pitch factors ranged from 0.3 to 1.3. The mean reference tube current was set at 200 mAs, adjustable to 120 mAs. Automated tube voltage adaptation and tube current modulation were implemented in all instances. Contrast agent administration was standard practice, with an injection rate of 3-4 mL/s and a body-weight-adjusted dosage of 400 mg(iodine)/kg (equivalent to 1.14 mL/kg Iomeprol 350 mg/mL). All images underwent reconstruction using soft convolution kernels and iterative techniques.

    Ethical Approval and Data Usage

  6. WESN-emulated motor execution EEG data

    • zenodo.org
    bin, json
    Updated Apr 3, 2024
    Cite
    Thomas Strypsteen (2024). WESN-emulated motor execution EEG data [Dataset]. http://doi.org/10.5281/zenodo.10907610
    Explore at:
    Available download formats: bin, json
    Dataset updated
    Apr 3, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Thomas Strypsteen
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains EEG measured during a motor execution task and processed as to emulate EEG originating from a wireless EEG sensor network composed of mini-EEG devices, as presented in [1]. It is a processed version of the original High Gamma dataset of [2].

    In mini-EEG devices, we cannot measure the potential between a given electrode and a distant reference (e.g., a mastoid or the Cz electrode), as we would in traditional EEG caps. Instead, we can only record the local potential between two nearby electrodes belonging to the same sensor device. To emulate this setting using a standard cap-EEG recording, we consider each pair of electrodes within a certain maximum distance as a candidate electrode pair or node. By subtracting one channel from the other, we remove the common far-distance reference and obtain a signal that emulates the local potential of the node.

    We applied this method to the High Gamma dataset as follows. First, the 44 channels covering the motor cortex were selected. These channels are indicated in the channel_labels.json file. Then, the rereferencing between channels with a distance threshold of 3 cm was applied, yielding a set of 286 candidate electrode pairs or nodes. The nodes.json file indicates the specific pair of channels composing each of these nodes. These have an average inter-electrode distance of 1.98 cm and a standard deviation of 0.59 cm. Finally, we applied the preprocessing described in [2], i.e., resampling at 250 Hz, highpass filtering above 4 Hz, standardizing the per-node mean and variance to 0 and 1 respectively, and extracting a window of 4.5 seconds for each trial.
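    A minimal Python sketch of this emulation pipeline (the channel pairs, the cap sampling rate, and the filter design below are placeholders for illustration; the published nodes.json defines the actual 286 pairs):

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, resample_poly

def emulate_nodes(eeg, pairs):
    """Subtract nearby channel pairs to remove the common distant reference,
    emulating local mini-EEG potentials. eeg: (channels x samples)."""
    return np.stack([eeg[i] - eeg[j] for i, j in pairs])

def preprocess(nodes, fs_in=500, fs_out=250, hp_hz=4.0):
    x = resample_poly(nodes, fs_out, fs_in, axis=1)   # resample to 250 Hz
    sos = butter(4, hp_hz, btype="highpass", fs=fs_out, output="sos")
    x = sosfiltfilt(sos, x, axis=1)                   # highpass above 4 Hz
    x -= x.mean(axis=1, keepdims=True)                # per-node mean 0
    x /= x.std(axis=1, keepdims=True)                 # per-node variance 1
    return x

rng = np.random.default_rng(0)
cap = rng.standard_normal((44, 500 * 10))   # 10 s of hypothetical cap EEG
pairs = [(0, 1), (2, 5), (10, 11)]          # placeholder node definitions
print(preprocess(emulate_nodes(cap, pairs)).shape)
```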

    [1] Strypsteen, Thomas, and Alexander Bertrand. "A distributed neural network architecture for dynamic sensor selection with application to bandwidth-constrained body-sensor networks." arXiv preprint arXiv:2308.08379 (2023).

    [2] Schirrmeister, Robin Tibor, et al. "Deep learning with convolutional neural networks for EEG decoding and visualization." Human brain mapping 38.11 (2017): 5391-5420.

  7. TRACIv2.1 for FEDEFLv1

    • catalog.data.gov
    • datasets.ai
    • +1 more
    Updated Jul 25, 2023
    + more versions
    Cite
    U.S. EPA Office of Research and Development (ORD) (2023). TRACIv2.1 for FEDEFLv1 [Dataset]. https://catalog.data.gov/dataset/traciv2-1-for-fedeflv1
    Dataset updated
    Jul 25, 2023
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    TRACIv2.1 (Bare 2012) is a life cycle impact assessment (LCIA) method. LCIA methods are collections of characterization factors, which are measures of relative potency or potential impact for a given flow (e.g., NH3 to air) for a set of impact categories (e.g., acidification), provided in units of potency or impact equivalents per unit mass of the flowable associated with a given context (e.g., 1.88 kg SO2 eq/kg NH3 emitted to air). LCIA methods are typically used along with life cycle inventory data to estimate potential impacts in life cycle assessment (LCA). The FEDEFL, or Federal LCA Commons Elementary Flow List (EPA 2019), is the standardized elementary flow list for use with data meeting the US Federal LCA Commons data guidelines. In this dataset, TRACIv2.1 is applied to FEDEFL v1.0.7 flows.

    This dataset was created by the LCIA Formatter v1.0 (https://github.com/USEPA/LCIAformatter). The LCIA Formatter is a tool for providing standardized life cycle impact assessment methods, with characterization factors transparently applied to flows from an authoritative flow list like the FEDEFL. The LCIA Formatter draws from the original TRACIv2.1 source file and the TRACI->FEDEFL flow mapping, accessing the mapping file through the fedelemflowlist tool available at https://github.com/USEPA/Federal-LCA-Commons-Elementary-Flow-List. This mapping file and a note about the mapping are provided separately. Where a flow context is less specific in the FEDEFL (e.g., air) relative to the TRACIv2.1 flow contexts (e.g., air/rural), the LCIA Formatter applies the average of the relevant characterization factors from TRACIv2.1 to the FEDEFL flow.

    The zip file is a compressed archive of JSON files following the openLCA schema at https://greendelta.github.io/olca-schema. Usage notes for the zip file: This file was tested to import correctly into an openLCA v1.10 database already containing flows from the FEDEFL v1.0.7. It will provide matching characterization factors for any FEDEFL v1.0 to v1.0.7 elementary flow already present in the database. This file itself does not contain the elementary flows. The complete FEDEFL v1.0.7 flow list may be retrieved from the Federal LCA Commons elementary flow list repository at https://www.lcacommons.gov

    The .parquet file is in the LCIA Formatter's LCIAmethod format (https://github.com/USEPA/LCIAformatter/blob/v1.0.0/format%20specs/LCIAmethod.md). Usage notes for the parquet file: The .parquet file can be read by any Apache Parquet reader.

    References: Bare, J. C. 2012. Tool for the Reduction and Assessment of Chemical and Other Environmental Impacts (TRACI), Version 2.1 - User's Manual. https://www.epa.gov/chemical-research/tool-reduction-and-assessment-chemicals-and-other-environmental-impacts-traci EPA 2019. The Federal LCA Commons Elementary Flow List: Background, Approach, Description and Recommendations for Use. https://cfpub.epa.gov/si/si_public_record_Report.cfm?dirEntryId=347251

    This dataset is associated with the following publication: Young, B., M. Srocka, W. Ingwersen, B. Morelli, S. Cashman, and A. Henderson. LCIA Formatter. Journal of Open Source Software, 6(66): 3392, (2021).
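    For example, the .parquet file can be inspected with pandas (a minimal sketch; the file name and the exact column labels, which follow the LCIAmethod format spec linked above, are assumptions):

```python
import pandas as pd

# Read the LCIAmethod-format parquet file (requires pyarrow or fastparquet);
# the path is hypothetical.
cfs = pd.read_parquet("TRACI_2.1_v1.parquet")

# Column names assumed from the LCIAmethod format spec: e.g., look up
# acidification characterization factors for ammonia emitted to air.
mask = (cfs["Indicator"] == "Acidification") & (cfs["Flowable"] == "Ammonia")
print(cfs.loc[mask, ["Context", "Characterization Factor", "Unit"]])
```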

  8. NCAR's Database of Upper Air Observations

    • rda.ucar.edu
    • cmr.earthdata.nasa.gov
    Updated Jan 26, 2011
    Cite
    Research Data Archive/Computational and Information Systems Laboratory/National Center for Atmospheric Research/University Corporation for Atmospheric Research (2011). NCAR's Database of Upper Air Observations [Dataset]. https://rda.ucar.edu/datasets/ds370.0/#!docs
    Dataset updated
    Jan 26, 2011
    Dataset provided by
    University Corporation for Atmospheric Research
    Authors
    Research Data Archive/Computational and Information Systems Laboratory/National Center for Atmospheric Research/University Corporation for Atmospheric Research
    Time period covered
    Jan 1, 1973 - Nov 2, 2011
    Description

    This dataset represents a large collection of global upper air soundings starting in 1973 and is updated to within about 1 month of real time. The archive contains all upper air soundings from DSS datasets ds353.4 and ds351.0; however, this dataset can be output in a simple ASCII format for the entire period of record, with all variable units standardized.

    This dataset is superseded by The NCAR Upper Air Database, 1920-continuing (ds370.1).

  9. Classification and Quantification of Strawberry Fruit Shape

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 24, 2020
    + more versions
    Cite
    Feldmann, Mitchell J. (2020). Classification and Quantification of Strawberry Fruit Shape [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3365714
    Dataset updated
    Apr 24, 2020
    Dataset authored and provided by
    Feldmann, Mitchell J.
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    "Classification and Quantification of Strawberry Fruit Shape" is a dataset that includes raw RGB images and binary images of strawberry fruit. These folders contain JPEG images taken from the same experimental units on 2 different harvest dates. Images in each folder are labeled according to the 4 digit plot ID from the field experiment (####_) and the 10 digit individual ID (_##########).

    "H1" and "H2" folders contain RGB images of multiple fruits. Each fruit was extracted and binarized to become the images in "H1_indiv" and "H2_indiv".

    "H1_indiv" and "H2_indiv" folders contain images of individual fruit. Each fruit is bordered by ten white pixels. There are a total of 6,874 images between these two folders. The images were used then resized and scaled to be the images in "ReSized".

    "ReSized" contains 6,874 binary images of individual berries. These images are all square images (1000x1000px) with the object represented by black pixels (0) and background represented with white pixels (1). Each image was scaled so that it would take up the maximum number of pixels in a 1000 x 1000px image and would maintain the aspect ratio.

    "Fruit_image_data.csv" contains all of the morphometric features extracted from individual images including intermediate values.

    All images title with the form "B##_NA" were discarded prior to any analyses. These images come from the buffer plots, not the experimental units of the study.

    "PPKC_Figures.zip" contains all figures (F1-F7) and supplemental figures (S1-S7_ from the manuscript. Captions for the main figures are found in the manuscript. Captions for Supplemental figures are below.

    Fig. S1 Results of PPKC against original cluster assignments. Ordered centroids from k = 2 to k = 8. On the left are the unordered assignments from k-means, and on the right are the ordered assignments following PPKC. Cluster position is indicated on the right [1, 8].

    Fig. S2 Optimal value of k. (A) Total within-cluster sum of squares. (B) The inverse of the adjusted R². (C) Akaike information criterion (AIC). (D) Bayesian information criterion (BIC). All metrics were calculated on a random sample of 3,437 images (50%); 10 samples were randomly drawn. The vertical dashed line in each plot represents the optimal value of k. Reported metrics are standardized to be between [0, 1].

    Fig. S3 Hierarchical clustering and distance between classes on PC1. The relationship between clusters at each value of k is represented as both a dendrogram and a bar plot. The labels on the dendrogram (i.e., V1, V2, V3, ..., V10) represent the original cluster assignments from k-means. The barplot to the right of each dendrogram depicts the elements of the eigenvector associated with the largest eigenvalue from PPKC. The labels above each line represent the original cluster assignment.

    Fig. S4 BLUPs for 13 selected features. For each plot, the X-axis is the index and the Y-axis is the BLUP value estimated from a linear mixed model. Grey points represent the mean feature value for each individual. Each point is the BLUP for a single genotype.

    Fig. S5 Effects of Eigenfruit, Vertical Biomass, and Horizontal Biomass Analyses. (A) Effects of PC [1, 7] from the Eigenfruit analysis on the mean shape (center column). The left column is the mean shape minus 1.5× the standard deviation. Right is the mean shape plus 1.5× the standard deviation. The horizontal axis is the horizontal pixel position. The vertical axis is the vertical pixel position. (B) Effects of PC [1, 3] from the Horizontal Biomass analysis on the mean shape (center column). The left column is the mean shape minus 1.5× the standard deviation. Right is the mean shape plus 1.5× the standard deviation. The horizontal axis is the vertical position from the image (height). The vertical axis is the number of activated pixels (RowSum) at the given vertical position. (C) Effects of PC [1, 3] from the Vertical Biomass analysis on the mean shape (center column). The left column is the mean shape minus 1.5× the standard deviation. Right is the mean shape plus 1.5× the standard deviation. The horizontal axis is the horizontal position from the image (width). The vertical axis is the number of activated pixels (ColSum) at the given horizontal position.

    Fig. S6 PPKC with variable sample size. Ordered centroids from k = 2 to k = 5 using different image sets for clustering. For all k = [2, 5], k-means clustering was performed using either 100%, 80%, 50%, or 20% of the total number of images (6,874, 5,500, 3,437, and 1,374, respectively). Cluster position is indicated on the right [1, 5].

    Fig. S7 Comparison of scale and continuous features. (A.) PPKC 4-unit ordinal scale. (B.) Distributions of the selected features with each level of k = 4 from the PPKC 4-unit ordinal scale. The light gray line is cluster 1, the medium gray line is cluster 2, the dark gray line is cluster 3, and the black line is cluster 4.

  10. Machine learning models on the WDBC dataset

    • scidb.cn
    Updated Apr 15, 2025
    Cite
    Mahdi Aghaziarati (2025). machine learning models on the WDBC dataset [Dataset]. http://doi.org/10.57760/sciencedb.23537
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Apr 15, 2025
    Dataset provided by
    Science Data Bank
    Authors
    Mahdi Aghaziarati
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset used in this study is the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, originally provided by the University of Wisconsin and obtained via Kaggle. It consists of 569 observations, each corresponding to a digitized image of a fine needle aspirate (FNA) of a breast mass. The dataset contains 32 attributes: one identifier column (discarded during preprocessing), one diagnosis label (malignant or benign), and 30 continuous real-valued features that describe the morphology of cell nuclei. These features are grouped into three statistical descriptors—mean, standard error (SE), and worst (mean of the three largest values)—for ten morphological properties including radius, perimeter, area, concavity, and fractal dimension.

    All feature values were normalized using z-score standardization to ensure uniform scale across models sensitive to input ranges. No missing values were present in the original dataset. Label encoding was applied to the diagnosis column, assigning 1 to malignant and 0 to benign cases. The dataset was split into training (80%) and testing (20%) sets while preserving class balance via stratified sampling.

    The accompanying Python source code (breast_cancer_classification_models.py) performs data loading, preprocessing, model training, evaluation, and result visualization. Four lightweight classifiers—Decision Tree, Naïve Bayes, Perceptron, and K-Nearest Neighbors (KNN)—were implemented using the scikit-learn library (version 1.2 or later). Performance metrics including Accuracy, Precision, Recall, F1-score, and ROC-AUC were calculated for each model. Confusion matrices and ROC curves were generated and saved as PNG files for interpretability. All results are saved in a structured CSV file (classification_results.csv) that contains the performance metrics for each model. Supplementary visualizations include all_feature_histograms.png (distribution plots for all standardized features), model_comparison.png (metric-wise bar plot), and feature_correlation_heatmap.png (Pearson correlation matrix of all 30 features). The data files are in standard CSV and PNG formats and can be opened using any spreadsheet or image viewer, respectively. No rare file types are used, and all scripts are compatible with any Python 3.x environment. This data package enables reproducibility and offers a transparent overview of how baseline machine learning models perform in the domain of breast cancer diagnosis using a clinically relevant dataset.
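    The described workflow maps onto a few lines of scikit-learn; below is a minimal sketch (not the authors' breast_cancer_classification_models.py). It uses the copy of WDBC bundled with scikit-learn, which codes the labels the opposite way (0 = malignant, 1 = benign):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score

# Same 569 x 30 WDBC data; note the inverted label coding vs. the text above
X, y = load_breast_cancer(return_X_y=True)

# 80/20 stratified split, then z-score standardization fit on training data
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
    "Perceptron": Perceptron(random_state=0),
    "KNN": KNeighborsClassifier(),
}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    print(f"{name}: acc={accuracy_score(y_te, pred):.3f} "
          f"f1={f1_score(y_te, pred):.3f}")
```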

  11. Kolipsi-1 Corpus v1.0

    • b2find.eudat.eu
    Updated Oct 22, 2023
    + more versions
    Cite
    (2023). Kolipsi-1 Corpus v1.0 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/2c3090fc-b198-5b8d-be02-52648ed86ec6
    Dataset updated
    Oct 22, 2023
    Description

    The Kolipsi-1 L2 corpus is a written learner corpus of German and Italian L2 speakers originating from South Tyrol (Italy). It was developed as a by-product of the KOLIPSI project "South-Tyrolean pupils and the second language: a linguistic and socio-psychological investigation". In addition, data from L1 pupils were collected exclusively for the creation of a native-speaker reference corpus. The data collection took place in autumn 2007 and is based on two standardized tests for written production. The two tasks consisted of (1) writing an e-mail to a friend retelling a given event at the supermarket based on a picture story (narrative text genre) and (2) writing a letter to a friend discussing holiday plans (argumentative text genre). For both tasks a time limit of 30 minutes was fixed and no additional reference material was allowed. CEFR levels have been assigned to all L2 learner texts, providing a holistic score as well as evaluations of coherence, lexis, grammar, and sociolinguistic appropriateness.

  12. LAGOS-NE-LIMNO v1.087.3: A module for LAGOS-NE, a multi-scaled geospatial and temporal database of lake ecological context and water quality for thousands of U.S. lakes: 1925-2013

    • portal.edirepository.org
    bin, csv, pdf
    Updated Jul 22, 2019
    Cite
    Patricia Soranno; Noah Lottig; Austin Delany; Kendra Cheruvelil (2019). LAGOS-NE-LIMNO v1.087.3: A module for LAGOS-NE, a multi-scaled geospatial and temporal database of lake ecological context and water quality for thousands of U.S. Lakes: 1925-2013 [Dataset]. http://doi.org/10.6073/pasta/08c6f9311929f4874b01bcc64eb3b2d7
    Explore at:
    Available download formats: csv (32100 bytes), pdf (436777 bytes), pdf (103659 bytes), pdf (137561 bytes), bin (4860 bytes), bin (2872 bytes), pdf (94489 bytes), bin (37699 bytes), bin (5256 bytes), csv (9321882 bytes), csv (272653196 bytes), pdf (218689 bytes), bin (2069 bytes), bin (1563 bytes)
    Dataset updated
    Jul 22, 2019
    Dataset provided by
    EDI
    Authors
    Patricia Soranno; Noah Lottig; Austin Delany; Kendra Cheruvelil
    Time period covered
    Jul 24, 1925 - Oct 27, 2013
    Area covered
    Variables measured
    tn, tp, dkn, doc, nh4, no2, srp, tdn, tdp, tkn, and 108 more
    Description

    This data package, LAGOS-NE-LIMNO v1.087.3, is 1 of 5 data packages associated with the LAGOS-NE database: the LAke multi-scaled GeOSpatial and temporal database. With this release, only this data package is being updated; users are expected to use prior releases of the other types of data. Please see the attached additional documentation for a full description of the changes that have been made for this new release.

    The data packages that make up LAGOS-NE include the following information on lakes and reservoirs in 17 lake-rich states in the Northeastern and upper Midwestern U.S.:

    (1) LAGOS-NE-LOCUS v1.01: lake location and physical characteristics for all lakes greater than one hectare.

    (2) LAGOS-NE-GEO v1.05: ecological context (i.e., the land use, geologic, climatic, and hydrologic setting of lakes) for all lakes and for all spatial resolutions, also called 'zones' (i.e., ecoregions, states, counties). These geospatial data were created by processing national-scale and publicly accessible datasets to quantify numerous metrics at multiple spatial resolutions.

    (3) LAGOS-NE-LIMNO v1.087.3: in-situ measurements of lake water quality from the past three decades for approximately 2,600-12,000 lakes, depending on the variable. This module was created by harmonizing 87 water quality datasets from federal, state, tribal, and non-profit agencies, university researchers, and citizen scientists. This module includes variables that are most commonly measured by state agencies and researchers for studying eutrophication. For each water quality data value, we also include metadata related to the sampling program, methods, qualifiers with data flags from the original program (qual, not standardized for LAGOS-NE), censor codes from our quality control procedures (censorcode, standardized for LAGOS-NE), and the date of each sample.

    (4) LAGOS-NE-GIS v1.0: the GIS data layers for lakes, wetlands, and streams, as well as the spatial resolutions that were used to create the LAGOS-NE-GEO module.

    (5) LAGOS-NE-RAWDATA: the original 87 datasets of lake water quality prior to processing, the R code that converts the original data formats into the LAGOS-NE data format, and the log file from this procedure. This latter data package supports the reproducibility of the LAGOS-NE-LIMNO data module.

    Citation for the full documentation of this database: Soranno, P.A., E.G. Bissell, K.S. Cheruvelil, S.T. Christel, S.M. Collins, C.E. Fergus, C.T. Filstrup, J.F. Lapierre, N.R. Lottig, S.K. Oliver, C.E. Scott, N.J. Smith, S. Stopyak, S. Yuan, M.T. Bremigan, J.A. Downing, C. Gries, E.N. Henry, N.K. Skaff, E.H. Stanley, C.A. Stow, P.-N. Tan, T. Wagner, K.E. Webster. 2015. Building a multi-scaled geospatial temporal ecology database from disparate data sources: Fostering open science and data reuse. GigaScience 4:28. https://doi.org/10.1186/s13742-015-0067-4

    Citation for the data paper for this database: Soranno, P.A., L.C. Bacon, M. Beauchene, K.E. Bednar, E.G. Bissell, C.K. Boudreau, M.G. Boyer, M.T. Bremigan, S.R. Carpenter, J.W. Carr, K.S. Cheruvelil, S.T. Christel, M. Claucherty, S.M. Collins, J.D. Conroy, J.A. Downing, J. Dukett, C.E. Fergus, C.T. Filstrup, C. Funk, M.J. Gonzalez, L.T. Green, C. Gries, J.D. Halfman, S.K. Hamilton, P.C. Hanson, E.N. Henry, E.M. Herron, C. Hockings, J.R. Jackson, K. Jacobson-Hedin, L.L. Janus, W.W. Jones, J.R. Jones, C.M. Keson, K.B.S. King, S.A. Kishbaugh, J.F. Lapierre, B. Lathrop, J.A. Latimore, Y. Lee, N.R. Lottig, J.A. Lynch, L.J. Matthews, W.H. McDowell, K.E.B. Moore, B.P. Neff, S.J. Nelson, S.K. Oliver, M.L. Pace, D.C. Pierson, A.C. Poisson, A.I. Pollard, D.M. Post, P.O. Reyes, D.O. Rosenberry, K.M. Roy, L.G. Rudstam, O. Sarnelle, N.J. Schuldt, C.E. Scott, N.K. Skaff, N.J. Smith, N.R. Spinelli, J.J. Stachelek, E.H. Stanley, J.L. Stoddard, S.B. Stopyak, C.A. Stow, J.M. Tallant, P.-N. Tan, A.P. Thorpe, M.J. Vanni, T. Wagner, G. Watkins, K.C. Weathers, K.E. Webster, J.D. White, M.K. Wilmes, S. Yuan. 2017. LAGOS-NE: A multi-scaled geospatial and temporal database of lake ecological context and water quality for thousands of U.S. lakes. GigaScience 6(12). https://doi.org/10.1093/gigascience/gix101

  13. Data from: A multi-sensor gait dataset collected under non-standardized dual-task conditions

    • search.dataone.org
    • data.niaid.nih.gov
    Updated Apr 26, 2025
    Yuanyuan Liao; Junjie Cao; Lisha Yu; Jianbang Xiang; Yang Zhao (2025). A multi-sensor gait dataset collected under non-standardized dual-task conditions [Dataset]. http://doi.org/10.5061/dryad.2rbnzs7z3
    Explore at:
    Dataset updated
    Apr 26, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Yuanyuan Liao; Junjie Cao; Lisha Yu; Jianbang Xiang; Yang Zhao
    Description

    Non-standardized dual-tasks have recently gained attention in health monitoring and post-operative rehabilitation. By collecting data with multiple sensors, we can quantify motion characteristics from different perspectives and explore the complementarity and interchangeability between sensors. Currently, there is a lack of publicly available non-standardized dual-task gait datasets collected with multiple sensors, so we propose a dataset (NONSD-Gait) of 23 healthy adults walking back and forth over 7 meters under three dual-task conditions, collected by three types of sensors: an optical motion capture (MOCAP) system, a depth camera, and an inertial measurement unit (IMU). The MOCAP system captured the 3D trajectories of 22 markers attached to the subject using 8 optical cameras, while the depth camera recorded the 3D trajectories of 25 joints without contact. The IMU was placed on the left ankle to record 3-axis acceleration and angular velocity data. Each participant underwent tw..., This study recruited 23 healthy adults aged between 21 and 30 years (9 males, 14 females). All participants had no neuromuscular diseases or skeletal injuries and no hearing or vision impairments. In the experiment, participants were required to walk back and forth on a 7 m × 1 m mat under a single-task condition and three non-standardized dual-task conditions. This study utilized eight MOCAP cameras (NOKOV MARS 2H HD cameras, 100 Hz, resolution 2048 × 1088), one depth camera (Microsoft Kinect V2.0, 30 Hz, resolution 512 × 424), and one inertial sensor (WitMotion BWT901BLECL5.0C, 100 Hz) for data collection.

    For the walking data from each task recorded by MOCAP, the 22 reflective markers were first labeled using the MOCAP software, and the 3D trajectories of the markers were then exported. The data collected by the IMU were exported using the IMU's software, and the data collected by the Kinect were exported using its proprietary SDK. The data collected by the three sensors we...

    A multi-sensor gait dataset collected under non-standardized dual-task conditions

    https://doi.org/10.5061/dryad.2rbnzs7z3

    Description of the data and file structure

    File: Data.zip

    Description:

    • The "Raw" folder contains the complete experimental data collected by the three sensors.
    • The "Processed" folder includes segmented data from the three sensors.
    • The "Parameters" folder contains the spatio-temporal gait parameters and kinematic parameters extracted separately by the three sensors.
    • The demographic information of the subjects is stored in a file named "Demographics.csv". All participants signed informed consent and agreed to the open publication of the data. This study does not involve sensitive data.

    Each participant’s folder is named sampleID, with ID ranging from 01 to 23.

    The folders for the two repeated experiments are named timeID, with ID ranging from 1 to 2.

    The tasks are na...,
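
    Given the layout above (Raw/Processed/Parameters folders, participant folders sample01-sample23, repeat folders time1-time2), a directory walk over the processed data might look like the sketch below. The per-sensor file names are hypothetical, since the task naming is truncated above; consult the dataset README for the actual convention.

    from pathlib import Path

    import pandas as pd

    root = Path("Data/Processed")  # segmented data from the three sensors

    # Iterate participants (sample01..sample23) and the two repeated sessions.
    for pid in range(1, 24):
        for rep in (1, 2):
            session = root / f"sample{pid:02d}" / f"time{rep}"
            imu_file = session / "imu.csv"  # hypothetical name: 3-axis accel + gyro at 100 Hz
            if imu_file.exists():
                imu = pd.read_csv(imu_file)
                print(session, imu.shape)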

  14. OCR large data set

    • kaggle.com
    Updated May 3, 2023
    James Mann (2023). OCR large data set [Dataset]. https://www.kaggle.com/datasets/jame5mann/ocr-large-data-set/suggestions
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 3, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    James Mann
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This is the large data set as featured in the OCR H240 exam series.

    Questions about this dataset will be featured in the statistics paper.

    The LDS is a .xlsx file containing 5 tables: four of data and one of information. The data are drawn from the UK censuses of 2001 and 2011. It is designed for you to make comparisons and analyses of the changes in demographic and behavioural features of the populace. Included are the age structure of each local authority and the method of travel within each local authority.
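
    A minimal loading sketch with pandas; the sheet names are read from the file itself, so none need to be hard-coded (the commented comparison uses hypothetical sheet and column names):

    import pandas as pd

    # sheet_name=None reads all five tables into a dict of DataFrames.
    sheets = pd.read_excel("ocr_h240_lds.xlsx", sheet_name=None)  # hypothetical file name

    for name, df in sheets.items():
        print(name, df.shape)  # four data tables plus one information table

    # Example 2001-vs-2011 comparison, assuming hypothetical sheet/column names:
    # age = sheets["Age structure"]
    # age["change"] = age["2011"] - age["2001"]  # change per local authority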

  15. Data from: WiBB: An integrated method for quantifying the relative importance of predictive variables

    • data.niaid.nih.gov
    • datadryad.org
    • +1 more
    zip
    Updated Aug 20, 2021
    Qin Li; Xiaojun Kou (2021). WiBB: An integrated method for quantifying the relative importance of predictive variables [Dataset]. http://doi.org/10.5061/dryad.xsj3tx9g1
    Explore at:
    zip (available download formats)
    Dataset updated
    Aug 20, 2021
    Dataset provided by
    Beijing Normal University
    Field Museum of Natural History
    Authors
    Qin Li; Xiaojun Kou
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    This dataset contains simulated datasets, empirical data, and R scripts described in the paper: “Li, Q. and Kou, X. (2021) WiBB: An integrated method for quantifying the relative importance of predictive variables. Ecography (DOI: 10.1111/ecog.05651)”.

    A fundamental goal of scientific research is to identify the underlying variables that govern crucial processes of a system. Here we proposed a new index, WiBB, which integrates the merits of several existing methods: a model-weighting method from information theory (Wi), a standardized regression coefficient method measured by ß* (B), and the bootstrap resampling technique (B). We applied WiBB to simulated datasets with known correlation structures, for both linear models (LM) and generalized linear models (GLM), to evaluate its performance. We also applied two other methods, the relative sum of weights (SWi) and the standardized beta (ß*), to evaluate their performance in ranking predictor importance in comparison with the WiBB method under various scenarios. We further applied WiBB to an empirical dataset of the plant genus Mimulus to select bioclimatic predictors of species’ presence across the landscape. Results on the simulated datasets showed that the WiBB method outperformed the ß* and SWi methods in scenarios with small and large sample sizes, respectively, and that the bootstrap resampling technique significantly improved the discriminant ability. When testing WiBB on the empirical dataset with GLM, it sensibly identified four important predictors with high credibility out of six candidates in modeling the geographical distributions of 71 Mimulus species. This integrated index has great advantages in evaluating predictor importance and hence reducing the dimensionality of data, without losing interpretive power. The simplicity of calculating the new metric, compared with more sophisticated statistical procedures, makes it a handy tool in the statistical toolbox.

    Methods: To simulate independent datasets (size = 1000), we adopted Galipaud et al.’s (2014) approach with custom modifications of the data.simulation function, which used the multivariate normal distribution function rmvnorm in the R package mvtnorm (v1.0-5, Genz et al. 2016). Each dataset was simulated with a preset correlation structure between a response variable (y) and four predictors (x1, x2, x3, x4). The first three (genuine) predictors were set to be strongly, moderately, and weakly correlated with the response variable, respectively (denoted by large, medium, and small Pearson correlation coefficients, r), while the correlation between the response and the last (spurious) predictor was set to zero. We simulated datasets with three levels of difference between the correlation coefficients of consecutive predictors, ∆r = 0.1, 0.2, and 0.3. These three levels of ∆r yielded three correlation structures between the response and the four predictors: (0.3, 0.2, 0.1, 0.0), (0.6, 0.4, 0.2, 0.0), and (0.8, 0.6, 0.3, 0.0). We repeated the simulation procedure 200 times for each of the three preset correlation structures (600 datasets in total) for later LM fitting. For GLM fitting, we modified the simulation procedure with additional steps, converting the continuous response into binary data O (e.g., occurrence data having 0 for absence and 1 for presence). We tested the WiBB method, along with the relative sum of weights (SWi) and the standardized beta (ß*), to evaluate the ability to correctly rank predictor importance under various scenarios. The empirical dataset of 71 Mimulus species was compiled from occurrence coordinates and corresponding values extracted from climatic layers of the WorldClim dataset (www.worldclim.org), and we applied the WiBB method to infer important predictors of their geographical distributions.
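
    The simulation design above is concrete enough to reproduce in outline: draw (y, x1, x2, x3, x4) from a multivariate normal whose first row carries the preset correlations with the response. A numpy sketch under the middle structure (r = 0.6, 0.4, 0.2, 0.0); for simplicity it assumes uncorrelated predictors, which may differ from the paper's data.simulation routine:

    import numpy as np

    rng = np.random.default_rng(42)

    # Preset correlations between the response and predictors x1..x4
    # (strong, moderate, weak, spurious), one of the structures described above.
    r = np.array([0.6, 0.4, 0.2, 0.0])

    # Correlation matrix with y in position 0; predictor-predictor correlations
    # are zero here (a simplifying assumption).
    C = np.eye(5)
    C[0, 1:] = r
    C[1:, 0] = r

    data = rng.multivariate_normal(mean=np.zeros(5), cov=C, size=1000)
    y, X = data[:, 0], data[:, 1:]

    # GLM variant: convert the continuous response to binary occurrence data.
    o = (y > 0).astype(int)

    print(np.corrcoef(data, rowvar=False)[0, 1:])  # should approximate r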

  16. Optum ZIP5 OMOP

    • redivis.com
    application/jsonl +7
    Updated Mar 3, 2021
    Stanford Center for Population Health Sciences (2021). Optum ZIP5 OMOP [Dataset]. http://doi.org/10.57761/e54r-bg69
    Explore at:
    application/jsonl, arrow, parquet, spss, stata, avro, csv, sas (available download formats)
    Dataset updated
    Mar 3, 2021
    Dataset provided by
    Redivis Inc.
    Authors
    Stanford Center for Population Health Sciences
    Description

    Abstract

    Optum ZIP5 v8.0 database in the OMOP data model (https://www.ohdsi.org/data-standardization/the-common-data-model/). This dataset covers 2003-Q1 to 2020-Q2.

    Section 10

    A Condition Era is defined as a span of time when the Person is assumed to have a given condition. Similar to Drug Eras, Condition Eras are chronological periods of Condition Occurrence. Combining individual Condition Occurrences into a single Condition Era serves two purposes:

    • It allows aggregation of chronic conditions that require frequent ongoing care, instead of treating each Condition Occurrence as an independent event.
    • It allows aggregation of multiple, closely timed doctor visits for the same Condition to avoid double-counting the Condition Occurrences.


    For example, consider a Person who visits her Primary Care Physician (PCP) and is referred to a specialist. At a later time, the Person visits the specialist, who confirms the PCP's original diagnosis and provides the appropriate treatment to resolve the condition. These two independent doctor visits should be aggregated into one Condition Era.

    Conventions

    • Condition Era records will be derived from the records in the CONDITION_OCCURRENCE table using a standardized algorithm.
    • Each Condition Era corresponds to one or many Condition Occurrence records that form a continuous interval.
    • Condition Eras are built with a Persistence Window of 30 days, meaning that if no occurrence of the same condition_concept_id happens within 30 days of a given occurrence, that occurrence's end is taken as the condition_era_end_date.


    The text above is taken from the OMOP CDM v5.3 Specification document.
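
    The persistence-window rule lends itself to a compact implementation. A minimal Python sketch (an illustration of the rule described above, not the official OMOP algorithm): sort one person's occurrence dates for a single condition_concept_id and start a new era whenever the gap since the previous occurrence exceeds 30 days.

    from datetime import date, timedelta

    PERSISTENCE_WINDOW = timedelta(days=30)

    def build_condition_eras(occurrence_dates):
        """Collapse occurrence dates of one condition_concept_id for one person
        into (era_start, era_end) spans using a 30-day persistence window."""
        dates = sorted(occurrence_dates)
        eras = []
        start = end = dates[0]
        for d in dates[1:]:
            if d - end <= PERSISTENCE_WINDOW:
                end = d                    # within the window: extend the era
            else:
                eras.append((start, end))  # gap > 30 days: close era, open a new one
                start = end = d
        eras.append((start, end))
        return eras

    # Two visits 10 days apart merge into one era; a visit 90 days later starts another.
    print(build_condition_eras([date(2020, 1, 1), date(2020, 1, 11), date(2020, 4, 10)]))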

    Section 8

    The DOMAIN table includes a list of OMOP-defined Domains the Concepts of the Standardized Vocabularies can belong to. A Domain defines the set of allowable Concepts for the standardized fields in the CDM tables. For example, the "Condition" Domain contains Concepts that describe a condition of a patient, and these Concepts can only be stored in the condition_concept_id field of the CONDITION_OCCURRENCE and CONDITION_ERA tables. This reference table is populated with a single record for each Domain and includes a descriptive name for the Domain.

    Conventions

    • There is one record for each Domain. The domains are defined by the tables and fields in the OMOP CDM that can contain Concepts describing all the various aspects of the healthcare experience of a patient.
    • The domain_id field contains an alphanumeric identifier that can also be used as the abbreviation of the Domain.
    • The domain_name field contains the unabbreviated name of the Domain.
    • Each Domain also has an entry in the Concept table, which is recorded in the domain_concept_id field. This is for purposes of creating a closed Information Model, where all entities in the OMOP CDM are covered by a unique Concept.


    The text above is taken from the OMOP CDM v5.3 Specification document.

    Section 12

    A Drug Era is defined as a span of time when the Person is assumed to be exposed to a particular active ingredient. A Drug Era is not the same as a Drug Exposure: Exposures are individual records corresponding to source data on when a Drug was delivered to the Person, while successive periods of Drug Exposure are combined under certain rules to produce continuous Drug Eras.

    Conventions

    • Drug Eras are derived from records in the DRUG_EXPOSURE table using a standardized algorithm.
    • Each Drug Era corresponds to one or many Drug Exposures that form a continuous interval and contain the same Drug Ingredient (active compound).
    • The drug_concept_id field only contains Concepts that have the concept_class 'Ingredient'. The Ingredient is derived from the Drug Concepts in the DRUG_EXPOSURE table that are aggregated into the Drug Era record.
    • The Drug Era Start Date is the start date of the first Drug Exposure.
    • The Drug Era End Date is the end date of the last Drug Exposure. The End Date of each Drug Exposure is either taken from the field drug_exposure_end_date or, as it is typically not available, inferred using the following rules:
    • The Gap Days determine how many total drug-free days are observed between all Drug Exposure events that contribute to a DRUG_ERA record. It is assumed that the drugs are "not stockpiled" by the patient, i.e. that if a new drug prescription or refill is observed (a new DRUG_EXPOSURE record is written), the remaining supply from the previous events is abandoned.
    • The difference between Persistence Window and Gap Days is that the former is the maximum drug-free time allowed between two subsequent DRUG_EXPOSURE records, while the latter is the sum of actual drug-free days for the given Drug Era under the abo
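
    Reading the bullets above, Gap Days can be sketched as the total uncovered days between consecutive exposure intervals already grouped into one Drug Era (an illustrative reading, not the exact OMOP derivation):

    from datetime import date

    def gap_days(exposures):
        """exposures: (start, end) date pairs of one Drug Era, sorted by start.
        Returns total drug-free days, assuming no stockpiling (each new exposure
        supersedes any remaining supply from earlier ones)."""
        total = 0
        covered_until = exposures[0][1]
        for start, end in exposures[1:]:
            if start > covered_until:
                total += (start - covered_until).days  # uncovered span
            covered_until = max(covered_until, end)
        return total

    print(gap_days([(date(2020, 1, 1), date(2020, 1, 30)),
                    (date(2020, 2, 10), date(2020, 3, 10))]))  # -> 11
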
  17. Latest Site Treatments - Multi-Agency Ground Plot (MAGPlot) Database: A Repository for pan-Canadian Forest Ground Plot Data

    • open.canada.ca
    • catalogue.arctic-sdi.org
    • +1 more
    pdf, wms, zip
    Updated May 29, 2025
    Natural Resources Canada (2025). Latest Site Treatments - Multi-Agency Ground Plot (MAGPlot) Database: A Repository for pan-Canadian Forest Ground Plot Data [Dataset]. https://open.canada.ca/data/dataset/60f9ab40-58be-4b6a-acf1-a7b97313e853
    Explore at:
    pdf, zip, wms (available download formats)
    Dataset updated
    May 29, 2025
    Dataset provided by
    Ministry of Natural Resources of Canada (https://www.nrcan.gc.ca/)
    License

    Open Government Licence - Canada 2.0 (https://open.canada.ca/en/open-government-licence-canada)
    License information was derived automatically

    Area covered
    Canada
    Description

    The Multi-Agency Ground Plot (MAGPlot) database (DB) is a pan-Canadian forest ground-plot data repository. The database synthesizes forest ground plot data from various agencies, including the National Forest Inventory (NFI) and 12 Canadian jurisdictions: Alberta (AB), British Columbia (BC), Manitoba (MB), New Brunswick (NB), Newfoundland and Labrador (NL), Nova Scotia (NS), Northwest Territories (NT), Ontario (ON), Prince Edward Island (PE), Quebec (QC), Saskatchewan (SK), and Yukon Territory (YT), contributed in their original formats. These datasets underwent data cleaning and quality assessment using the rules and standards set by the contributors and the associated documentation, and were standardized, harmonized, and integrated into a single, centralized, analysis-ready database. The primary objective of the MAGPlot project is to collate and harmonize forest ground plot data and to present the data in a findable, accessible, interoperable, and reusable (FAIR) format for pan-Canadian forest research. The current version includes both historical and contemporary forest ground plot data provided by data contributors.

    The standardized and harmonized dataset includes eight data tables (five site-related and three tree-measurement tables) in a relational database schema. Site-related tables contain information on geographical locations, treatments (e.g., stand tending, regeneration, and cutting), and disturbances caused by abiotic factors (e.g., weather, wildfires) or biotic factors (e.g., disease, insects, animals). Tree-related tables, on the other hand, focus on measured tree attributes, including biophysical and growth parameters (e.g., DBH, height, crown class), species, status, stem conditions (e.g., broken or dead tops), and health conditions. While most contributors provided large and small tree plot measurements, only NFI, AB, MB, and SK contributed datasets reported at the regeneration plot level (e.g., stem count, regeneration species). Future versions are expected to include updated and/or new measurement records as well as additional tables and measured and compiled (e.g., tree volume and biomass) attributes. MAGPlot is hosted through Canada’s National Forest Information System (https://nfi.nfis.org/en/maps).

    ---------------------------------------------------
    LATEST SITE TREATMENTS LAYER:
    ---------------------------------------------------
    Shows the most recently applied treatment class for each MAGPlot site. These treatment classes are broad categories, with more specific treatment details available in the full dataset (a small sketch of this latest-per-site selection appears at the end of this description).

    -----------
    NOTES:
    -----------
    The MAGPlot release (v1.0 and v1.1) does not include the NL and SK datasets due to pending Data Sharing Agreements, ongoing data processing, or restrictions on third-party sharing. These datasets will be included in future releases. While certain jurisdictions permit open or public data sharing, provided that the requestor signs and adheres to the Data Use Agreement, some jurisdictions require a jurisdiction-specific request form to be signed in addition to the Data Use Agreement form. For the MAGPlot Data Dictionary, other metadata, datasets available for open sharing (with approximate locations), data requests (for other datasets or exact coordinates), and available data visualization products, please check the folders in the “Data and Resources” section below.

    Coordinates in web services have been randomized within 5 km of true locations to preserve site integrity. Access the WMS (Web Map Service) layers from the “Data and Resources” section below. A data request must be submitted to access historical datasets, datasets restricted by data-use agreements, or exact plot coordinates using the link below. NFI Data Request Form: https://nfi.nfis.org/en/datarequestform

    ---------------------------------
    ACKNOWLEDGEMENT:
    ---------------------------------
    We acknowledge and recognize the following agencies that have contributed data to the MAGPlot database:
    Government of Alberta - Ministry of Agriculture, Forestry, and Rural Economic Development - Forest Stewardship and Trade Branch
    Government of British Columbia - Ministry of Forests - Forest Analysis and Inventory Branch
    Government of Manitoba - Ministry of Economic Development, Investment, Trade, and Natural Resources - Forestry and Peatlands Branch
    Government of New Brunswick - Ministry of Natural Resources and Energy Development - Forestry Division, Forest Planning and Stewardship Branch
    Government of Newfoundland & Labrador - Department of Fisheries, Forestry and Agriculture - Forestry Branch
    Government of Nova Scotia - Ministry of Natural Resources and Renewables - Department of Natural Resources and Renewables
    Government of Northwest Territories - Department of Environment & Climate Change - Forest Management Division
    Government of Ontario - Ministry of Natural Resources and Forestry - Science and Research Branch, Forest Resources Inventory Unit
    Government of Prince Edward Island - Department of Environment, Energy, and Climate Action - Forests, Fish, and Wildlife Division
    Government of Quebec - Ministry of Natural Resources and Forests - Forestry Sector
    Government of Saskatchewan - Ministry of Environment - Forest Service Branch
    Government of Yukon - Ministry of Energy, Mines, and Resources - Forest Management Branch
    Government of Canada - Natural Resources Canada - Canadian Forest Service - National Forest Inventory Projects Office
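
    The "latest site treatments" semantics (most recently applied treatment class per site) reduce to a sort-and-take-last per site. A pandas sketch with hypothetical column names (see the MAGPlot Data Dictionary for the actual schema):

    import pandas as pd

    # Hypothetical treatment records; column names are illustrative only.
    treatments = pd.DataFrame({
        "site_id": ["A1", "A1", "B7"],
        "treatment_class": ["stand tending", "cutting", "regeneration"],
        "treatment_date": pd.to_datetime(["2001-06-01", "2015-08-15", "2010-05-20"]),
    })

    # Most recently applied treatment class for each site.
    latest = (treatments.sort_values("treatment_date")
                        .groupby("site_id")
                        .tail(1))
    print(latest)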

  18. ECHAM6-HAM2 model experiments to characterize the effect of parameterized autoconversion/warm rain on cloud lifetime

    • b2find.eudat.eu
    Updated Oct 3, 2023
    (2023). ECHAM6-HAM2 model experiments to characterize the effect of parameterized autoconversion/warm rain on cloud lifetime - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/23c161c9-b65e-500d-8d10-71d402c1d917
    Explore at:
    Dataset updated
    Oct 3, 2023
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This experiment contains the model output from a series of sensitivity simulations, called "rain", carried out with the global aerosol-climate model ECHAM6-HAM2 (model version ECHAM6.1-HAM2.2-MOZ0.9). The simulations were performed within the scope of the AeroCom project (https://aerocom.met.no/). In general, the "rain" sensitivity study aims to provide a process-based observational constraint on the cloud lifetime effect by examining the parameterized precipitation stemming from warm rain. Aerosol (precursor) emission estimates for the year 2000 from the AEROCOM-II ACCMIP dataset were used as forcing. Details can be found in the associated publication, Mülmenstädt et al. (2020). The sensitivity simulations investigate the effect of changing the autoconversion tuning factor (gamma) and the critical effective radius (rc) in the parameterization of autoconversion (see Fig. 3 and Eq. 3 in Mülmenstädt et al., 2020). The respective settings of these parameters are indicated in the dataset group name (e.g., AeroCom ECHAM6-HAM2 warm rain sensitivity simulation gamma${x_gamma} rc${x_rc}). In the default setting of ECHAM6-HAM2, gamma is 4 and rc is -1; rc=-1 is used for simulations where no critical effective radius is applied. In addition to microphysical variables, the model output includes simulated CloudSat radar reflectivities created with the satellite simulator COSP (Cloud Feedback Model Intercomparison Project Observational Simulator Package; see Bodas-Salcedo et al., 2011). The radar reflectivities are output on so-called subcolumns to include information on the subgrid variability of hydrometeors. The model output is provided as global fields on a reduced Gaussian grid (N48) with 3-hourly temporal resolution and covers the period January 2000 to December 2004. The dataset is well suited for evaluating the sensitivity of the warm rain parameterization in ECHAM6-HAM2. The data publication is standardized according to the ATMODAT Standard (v3.0) (Ganske et al. 2021). The data standardization was funded within the framework of "Forschungsvorhaben zur Entwicklung und Erprobung von Kurationskriterien und Qualitätsstandards von Forschungsdaten" by the German Federal Ministry of Education and Research (BMBF; FKZ: 16QK02B).
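
    Since the group names encode the two swept parameters, enumerating them is mechanical. A trivial sketch with illustrative (gamma, rc) values only (the actual combinations simulated are listed in Mülmenstädt et al., 2020; the defaults are gamma=4 and rc=-1):

    # Illustrative parameter values, not the full experiment matrix.
    for gamma in (1, 2, 4, 8):
        for rc in (-1, 10):
            print(f"AeroCom ECHAM6-HAM2 warm rain sensitivity simulation "
                  f"gamma{gamma} rc{rc}")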

  19. Holocene pollen dataset for China

    • data.tpdc.ac.cn
    • tpdc.ac.cn
    zip
    Updated May 17, 2022
    Xianyong CAO; Fang TIAN; Jian NI; Ulrike HERZSCHUH (2022). Holocene pollen dataset for China [Dataset]. http://doi.org/10.11888/Paleoenv.tpdc.272379
    Explore at:
    zip (available download formats)
    Dataset updated
    May 17, 2022
    Dataset provided by
    TPDC
    Authors
    Xianyong CAO; Fang TIAN; Jian NI; Ulrike HERZSCHUH
    Area covered
    China
    Description

    Past vegetation and climate investigations using pollen assemblages archived in various sediments have been performed for more than a century; to date, pollen remains the most suitable proxy for reconstructing spatial-temporal patterns of past vegetation and climate at centennial and global scales, and a taxonomically harmonized and temporally standardized fossil pollen dataset is essential for these reconstructions. Following pollen data collection, taxonomic homogenization, and age-depth model revision, the pollen spectra were interpolated at a 100-year resolution, and the Holocene fossil pollen dataset was established for China. The Holocene pollen dataset includes 254 pollen spectra and 217 pollen taxa. Although the density of available pollen records is higher in the forest-steppe transition zone, available pollen records are well distributed over all main vegetation types and climatic zones of China. The temporal range of the dataset covers the Holocene (from 11.5 to 0 cal. ka BP), with abundant pollen sites available between 8 and 2 cal. ka BP. The Holocene pollen dataset relates to the following publication: Cao, X., Tian, F., Herzschuh, U., Ni, J., Xu, Q., Li, W., Zhang, Y., Luo, M., Chen, F., 2022. Human activities have reduced plant diversity in eastern China over the last two millennia. Global Change Biology (accepted). More detail on processing is provided in this publication.
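
    The 100-year interpolation step can be sketched directly: resample an irregularly dated pollen series onto a regular 100-year grid spanning the Holocene (0-11,500 cal yr BP). The values below are illustrative only; the dataset's own processing is described in the cited publication.

    import numpy as np

    # Irregular sample ages (cal yr BP) and one taxon's pollen percentages.
    ages = np.array([150.0, 480.0, 1020.0, 2310.0, 5600.0, 11200.0])
    pct = np.array([12.0, 15.5, 9.8, 22.1, 30.4, 5.2])

    # Regular 100-year grid over the Holocene.
    grid = np.arange(0.0, 11501.0, 100.0)

    # Linear interpolation onto the grid (one simple choice of interpolant).
    pct_100yr = np.interp(grid, ages, pct)
    print(grid[:3], pct_100yr[:3])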

  20. [MedMNIST+] 18x Standardized Datasets for 2D and 3D Biomedical Image Classification with Multiple Size Options: 28 (MNIST-Like), 64, 128, and 224

    • data.niaid.nih.gov
    Updated Nov 28, 2024
    Rui Shi (2024). [MedMNIST+] 18x Standardized Datasets for 2D and 3D Biomedical Image Classification with Multiple Size Options: 28 (MNIST-Like), 64, 128, and 224 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5208229
    Explore at:
    Dataset updated
    Nov 28, 2024
    Dataset provided by
    Jiancheng Yang
    Donglai Wei
    Lin Zhao
    Bilian Ke
    Bingbing Ni
    Hanspeter Pfister
    Zequan Liu
    Rui Shi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Code [GitHub] | Publication [Nature Scientific Data'23 / ISBI'21] | Preprint [arXiv]

    Abstract

    We introduce MedMNIST, a large-scale MNIST-like collection of standardized biomedical images, including 12 datasets for 2D and 6 datasets for 3D. All images are pre-processed into 28x28 (2D) or 28x28x28 (3D) with the corresponding classification labels, so that no background knowledge is required for users. Covering primary data modalities in biomedical images, MedMNIST is designed to perform classification on lightweight 2D and 3D images with various data scales (from 100 to 100,000) and diverse tasks (binary/multi-class, ordinal regression and multi-label). The resulting dataset, consisting of approximately 708K 2D images and 10K 3D images in total, could support numerous research and educational purposes in biomedical image analysis, computer vision and machine learning. We benchmark several baseline methods on MedMNIST, including 2D / 3D neural networks and open-source / commercial AutoML tools. The data and code are publicly available at https://medmnist.com/.

    Disclaimer: The only official distribution link for the MedMNIST dataset is Zenodo. We kindly request users to refer to this original dataset link for accurate and up-to-date data.

    Update: We are thrilled to release MedMNIST+ with larger sizes: 64x64, 128x128, and 224x224 for 2D, and 64x64x64 for 3D. As a complement to the previous 28-size MedMNIST, the large-size version could serve as a standardized benchmark for medical foundation models. Install the latest API to try it out!

    Python Usage

    We recommend our official code to download, parse and use the MedMNIST dataset:

    % pip install medmnist
    % python

    To use the standard 28-size (MNIST-like) version utilizing the downloaded files:

    from medmnist import PathMNIST

    train_dataset = PathMNIST(split="train")

    To enable automatic downloading by setting download=True:

    from medmnist import NoduleMNIST3D

    val_dataset = NoduleMNIST3D(split="val", download=True)

    Alternatively, you can access MedMNIST+ with larger image sizes by specifying the size parameter:

    from medmnist import ChestMNIST

    test_dataset = ChestMNIST(split="test", download=True, size=224)
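
    For model training, the datasets drop into a standard PyTorch pipeline; a short sketch (the normalization constants here are illustrative, not officially recommended values):

    import torchvision.transforms as transforms
    from torch.utils.data import DataLoader
    from medmnist import PathMNIST

    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.5], std=[0.5]),  # illustrative constants
    ])

    train_dataset = PathMNIST(split="train", transform=transform, download=True)
    train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)

    images, labels = next(iter(train_loader))
    print(images.shape, labels.shape)  # e.g. torch.Size([128, 3, 28, 28]), torch.Size([128, 1])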

    Citation

    If you find this project useful, please cite both the v1 and v2 papers as:

    Jiancheng Yang, Rui Shi, Donglai Wei, Zequan Liu, Lin Zhao, Bilian Ke, Hanspeter Pfister, Bingbing Ni. "MedMNIST v2 - A large-scale lightweight benchmark for 2D and 3D biomedical image classification." Scientific Data, 2023.

    Jiancheng Yang, Rui Shi, Bingbing Ni. "MedMNIST Classification Decathlon: A Lightweight AutoML Benchmark for Medical Image Analysis". IEEE 18th International Symposium on Biomedical Imaging (ISBI), 2021.

    or using bibtex:

    @article{medmnistv2,
      title={MedMNIST v2 - A large-scale lightweight benchmark for 2D and 3D biomedical image classification},
      author={Yang, Jiancheng and Shi, Rui and Wei, Donglai and Liu, Zequan and Zhao, Lin and Ke, Bilian and Pfister, Hanspeter and Ni, Bingbing},
      journal={Scientific Data},
      volume={10},
      number={1},
      pages={41},
      year={2023},
      publisher={Nature Publishing Group UK London}
    }

    @inproceedings{medmnistv1,
      title={MedMNIST Classification Decathlon: A Lightweight AutoML Benchmark for Medical Image Analysis},
      author={Yang, Jiancheng and Shi, Rui and Ni, Bingbing},
      booktitle={IEEE 18th International Symposium on Biomedical Imaging (ISBI)},
      pages={191--195},
      year={2021}
    }

    Please also cite the corresponding paper(s) of source data if you use any subset of MedMNIST as per the description on the project website.

    License

    The MedMNIST dataset is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0), except DermaMNIST under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).

    The code is under Apache-2.0 License.

    Changelog

    v3.0 (this repository): Released MedMNIST+ featuring larger sizes: 64x64, 128x128, and 224x224 for 2D, and 64x64x64 for 3D.

    v2.2: Removed a small number of mistakenly included blank samples in OrganAMNIST, OrganCMNIST, OrganSMNIST, OrganMNIST3D, and VesselMNIST3D.

    v2.1: Addressed an issue in the NoduleMNIST3D file (i.e., nodulemnist3d.npz). Further details can be found in this issue.

    v2.0: Launched the initial repository of MedMNIST v2, adding 6 datasets for 3D and 2 for 2D.

    v1.0: Established the initial repository (in a separate repository) of MedMNIST v1, featuring 10 datasets for 2D.

    Note: This dataset is NOT intended for clinical use.

Simulation Data Set

U.S. EPA Office of Research and Development (ORD) (2020). Simulation Data Set [Dataset]. https://catalog.data.gov/dataset/simulation-data-set
How to use the information:
• Load the “Simulated_Dataset.RData” workspace.
• Run the code contained in “CWVS_LMC.txt”.
• Once the “CWVS_LMC.txt” code is complete, run “Results_Summary.txt”.

Format: Below is the replication procedure for the attached data set, for the portion of the analyses using a simulated data set.

Data: The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining confidentiality of any actual pregnant women.

Availability: Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we will make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This will also allow the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics and requires an appropriate data use agreement.

This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, Oxford, UK, 1-30, (2019).
