Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data normalization is a crucial step in gene expression analysis because it ensures the validity of downstream analyses. Although many metrics have been designed to evaluate existing normalization methods, different metrics, or the same metric applied to different datasets, yield inconsistent results, particularly for single-cell RNA sequencing (scRNA-seq) data. In the worst cases, a method evaluated as the best by one metric is evaluated as the poorest by another metric, or a method evaluated as the best on one dataset is evaluated as the poorest on another dataset. This raises an open question: principles need to be established to guide the evaluation of normalization methods. In this study, we propose the principle that a normalization method evaluated as the best by one metric should also be evaluated as the best by another metric (the consistency of metrics), and that a method evaluated as the best using scRNA-seq data should also be evaluated as the best using bulk RNA-seq data or microarray data (the consistency of datasets). We then designed a new metric, Area Under normalized CV threshold Curve (AUCVC), and applied it together with another metric, mSCC, to evaluate 14 commonly used normalization methods on both scRNA-seq and bulk RNA-seq data, satisfying the consistency of metrics and the consistency of datasets. Our findings pave the way for future studies on the normalization of gene expression data and its evaluation. The raw gene expression data, normalization methods, and evaluation metrics used in this study are included in an R package named NormExpression. NormExpression provides a framework and a fast, simple way for researchers to select the best method for the normalization of their gene expression data, based on the evaluation of different methods (particularly data-driven methods or their own methods) under the principles of the consistency of metrics and the consistency of datasets.
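For readers who want to experiment with a CV-threshold-style evaluation before installing the package, a minimal illustrative sketch in R is given below; it is not the NormExpression implementation of AUCVC, and the function name and threshold grid are our own choices.
# Illustrative sketch only (not the NormExpression implementation of AUCVC):
# approximate an area under a normalized CV threshold curve for a normalized matrix.
auc_cv_curve <- function(expr_mat, thresholds = seq(0.1, 2, by = 0.1)) {
  # coefficient of variation per gene (rows = genes, columns = samples/cells)
  cv <- apply(expr_mat, 1, function(x) sd(x) / mean(x))
  cv <- cv[is.finite(cv)]
  # fraction of genes whose CV falls below each threshold
  frac_below <- sapply(thresholds, function(t) mean(cv <= t))
  # trapezoidal area under the threshold curve, scaled to [0, 1]
  sum(diff(thresholds) * (head(frac_below, -1) + tail(frac_below, -1)) / 2) /
    diff(range(thresholds))
}
# Example with random counts standing in for a normalized expression matrix
set.seed(1)
auc_cv_curve(matrix(rpois(2000, lambda = 10), nrow = 200))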
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Standardized data from Mobilise-D participants (YAR dataset) and pre-existing datasets (ICICLE, MSIPC2, Gait in Lab and real-life settings, MS project, UNISS-UNIGE) are provided in the shared folder as an example of the procedures proposed in the publication "Mobility recorded by wearable devices and gold standards: the Mobilise-D procedure for data standardization", currently under review at Scientific Data. Please refer to that publication for further information, and please cite it if using these data.
The code to standardize an example subject (from the ICICLE dataset) and to open the standardized Matlab files in other languages (Python, R) is available on GitHub (https://github.com/luca-palmerini/Procedure-wearable-data-standardization-Mobilise-D).
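As a minimal sketch of opening one of the standardized Matlab files in R (assuming the R.matlab package; the file name below is a placeholder, and the repository above contains the project's own loading code):
# install.packages("R.matlab")
library(R.matlab)
standardized <- readMat("example_standardized_subject.mat")  # placeholder file name
str(standardized, max.level = 1)  # inspect the top-level structure of the loaded data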
Video on normalizing microbiome data from the Research Experiences in Microbiomes Network
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data for reproducing the analysis in the manuscript: "Normalizing and denoising protein expression data from droplet-based single cell profiling". Link to manuscript: https://www.biorxiv.org/content/10.1101/2020.02.24.963603v1
Data deposited here are for the purposes of reproducing the analysis results and figures reported in the manuscript above. These data are all publicly available; they were downloaded and converted to R datasets prior to Dec 4, 2020. For a full description of all the data included in this repository and instructions for reproducing all analysis results and figures, please see the repository: https://github.com/niaid/dsb_manuscript.
For usage of the dsb R package for normalizing CITE-seq data please see the repository: https://github.com/niaid/dsb
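A minimal sketch of a typical dsb call is shown below; the matrix objects and isotype-control names are placeholders, and the repository above documents the full workflow.
library(dsb)
# cell_protein_matrix: proteins x cells (cell-containing droplets)
# empty_drop_matrix:   proteins x droplets (background/empty droplets)
norm_adt <- DSBNormalizeProtein(
  cell_protein_matrix = raw_adt_cells,   # placeholder object
  empty_drop_matrix   = raw_adt_empty,   # placeholder object
  denoise.counts      = TRUE,
  use.isotype.control = TRUE,
  isotype.control.name.vec = c("IgG1_isotype", "IgG2a_isotype")  # placeholder names
)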
If you use the dsb R package in your work, please cite: Mulè MP, Martins AJ, Tsang JS. Normalizing and denoising protein expression data from droplet-based single cell profiling. bioRxiv. 2020;2020.02.24.963603.
General contact: John Tsang (john.tsang AT nih.gov)
Questions about software/code: Matt Mulè (mulemp AT nih.gov)
These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way, while the medians and IQRs are not given. This further protects identifiability of the spatial locations used in the analysis.

This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual-level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed.

It can be accessed through the following means: File format: R workspace file, "Simulated_Dataset.RData".

Metadata (including data dictionary):
• y: Vector of binary responses (1: adverse outcome, 0: control)
• x: Matrix of covariates; one row for each simulated individual
• z: Matrix of standardized pollution exposures
• n: Number of simulated individuals
• m: Number of exposure time periods (e.g., weeks of pregnancy)
• p: Number of columns in the covariate design matrix
• alpha_true: Vector of "true" critical window locations/magnitudes (i.e., the ground truth that we want to estimate)

Code Abstract: We provide R statistical software code ("CWVS_LMC.txt") to fit the linear model of coregionalization (LMC) version of the Critical Window Variable Selection (CWVS) method developed in the manuscript. We also provide R code ("Results_Summary.txt") to summarize/plot the estimated critical windows and posterior marginal inclusion probabilities.

Description:
• "CWVS_LMC.txt": This code is delivered to the user in the form of a .txt file that contains R statistical software code. Once the "Simulated_Dataset.RData" workspace has been loaded into R, the code in the file can be used to identify/estimate critical windows of susceptibility and posterior marginal inclusion probabilities.
• "Results_Summary.txt": This code is also delivered to the user in the form of a .txt file that contains R statistical software code. Once the "CWVS_LMC.txt" code has been applied to the simulated dataset and the program has completed, this code can be used to summarize and plot the identified/estimated critical windows and posterior marginal inclusion probabilities (similar to the plots shown in the manuscript).

Required R packages:
• For running "CWVS_LMC.txt":
  • msm: Sampling from the truncated normal distribution
  • mnormt: Sampling from the multivariate normal distribution
  • BayesLogit: Sampling from the Polya-Gamma distribution
• For running "Results_Summary.txt":
  • plotrix: Plotting the posterior means and credible intervals

Instructions for Use, Reproducibility: What can be reproduced: The data and code can be used to identify/estimate critical windows from one of the simulated datasets generated under setting E4 of the presented simulation study.
How to use the information:
• Load the "Simulated_Dataset.RData" workspace.
• Run the code contained in "CWVS_LMC.txt".
• Once the "CWVS_LMC.txt" code is complete, run "Results_Summary.txt".

Format: Below is the replication procedure for the attached dataset, for the portion of the analyses using a simulated dataset.

Data: The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining confidentiality of any actual pregnant women.

Availability: Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we will make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This will also allow the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics and requires an appropriate data use agreement.

Permissions: These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way, while the medians and IQRs are not given. This further protects identifiability of the spatial locations used in the analysis.

This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, Oxford, UK, 1-30, (2019).
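The replication steps above amount to the following minimal R sketch (file names as given in this record; the provided .txt files contain R code, so they can be run with source()):
# install.packages(c("msm", "mnormt", "BayesLogit", "plotrix"))
load("Simulated_Dataset.RData")   # provides y, x, z, n, m, p, alpha_true
source("CWVS_LMC.txt")            # fit the LMC version of the CWVS model
source("Results_Summary.txt")     # summarize/plot critical windows and inclusion probabilities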
CC0 1.0 Universal: https://spdx.org/licenses/CC0-1.0.html
To standardize NEON organismal data for major taxonomic groups, we first systematically reviewed NEON's documentation for each taxonomic group. We then discussed, as a group and with NEON staff, how to wrangle and standardize NEON organismal data. See Li et al. 2022 for more details. All R code to process NEON data products can be obtained through the R package 'ecocomDP'. Once the data are in ecocomDP format, we further processed them into long data frames using code on GitHub (https://github.com/daijiang/neonDivData/tree/master/data-raw), which is also archived here.
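A minimal sketch of that workflow is given below; it assumes the ecocomDP read_data()/flatten_data() interface, and the dataset id is a placeholder (the exact processing code is in the neonDivData repository archived with this record).
# install.packages("ecocomDP")
library(ecocomDP)
dat  <- read_data(id = "neon.ecocomdp.20120.001.001")  # placeholder NEON data package id
long <- flatten_data(dat)   # combine the ecocomDP tables into one long data frame
head(long)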
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The global spectrum of plant form and function dataset (Diaz et al. 2022; Diaz et al. 2016; TRY 2022, accessed 15-May-2025) provides mean trait values for (i) plant height; (ii) stem specific density; (iii) leaf area; (iv) leaf mass per area; (v) leaf nitrogen content per dry mass; and (vi) diaspore (seed or spore) mass for 46,047 taxa.
Here I provide a dataset in which the taxa covered by that database were standardized to World Flora Online (Borsch et al. 2020; taxonomic backbone version 2023.12) by matching names against those in the Agroforestry Species Switchboard (Kindt et al. 2025; version 4). Taxa for which no matches could be found were standardized with the WorldFlora package (Kindt 2020), using similar R scripts and the same taxonomic backbone data as those used to standardize species names for the Switchboard. Where still no matches could be found, taxa were matched against those previously matched in a harmonized data set for TRY 6.0 (Kindt 2024).
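A minimal sketch of the WorldFlora matching step is shown below; the file path, input names, and column name are placeholders, and the Switchboard scripts referenced above are the authoritative version.
library(WorldFlora)
WFO.backbone <- data.table::fread("classification.csv")  # WFO taxonomic backbone (placeholder path)
spec <- data.frame(spec.name = c("Faidherbia albida", "Acacia albida"))  # placeholder names
matches <- WFO.match(spec.data = spec, WFO.data = WFO.backbone, spec.name = "spec.name")
best <- WFO.one(matches)  # retain a single best match per input name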
Funding
The development of this dataset was supported by the German International Climate Initiative (IKI) to the regional tree seed programme on The Right Tree for the Right Place for the Right Purpose in Africa, by Norway’s International Climate and Forest Initiative through the Royal Norwegian Embassy in Ethiopia to the Provision of Adequate Tree Seed Portfolio project in Ethiopia, and by the Bezos Earth Fund to the Quality Tree Seed for Africa in Kenya and Rwanda project.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
With the improvement of -omics and next-generation sequencing (NGS) methodologies, along with the lowered cost of generating these types of data, the analysis of high-throughput biological data has become standard both for forming and for testing biomedical hypotheses. Our knowledge of how to normalize datasets to remove latent undesirable variances has grown extensively, making for standardized data that are easily compared between studies. Here we present the CAncer bioMarker Prediction Pipeline (CAMPP), an open-source R-based wrapper (https://github.com/ELELAB/CAncer-bioMarker-Prediction-Pipeline-CAMPP) intended to aid bioinformatics software users with data analyses. CAMPP is called from a terminal command line and is supported by a user-friendly manual. The pipeline may be run on a local computer and requires little or no knowledge of programming. To avoid issues relating to R-package updates, a renv lock file is provided to ensure R-package stability. Data management includes missing-value imputation, data normalization, and distributional checks. CAMPP performs (I) k-means clustering, (II) differential expression/abundance analysis, (III) elastic-net regression, (IV) correlation and co-expression network analyses, (V) survival analysis, and (VI) protein-protein/miRNA-gene interaction networks. The pipeline returns tabular files and graphical representations of the results. We hope that CAMPP will assist in streamlining bioinformatic analysis of quantitative biological data, whilst ensuring an appropriate bio-statistical framework.
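A minimal sketch of restoring the pinned package environment before running CAMPP (assuming the lock file ships in the CAMPP project directory, as the description suggests):
# install.packages("renv")
renv::restore(lockfile = "renv.lock")  # reinstall the package versions recorded in the lock file
# CAMPP itself is then invoked from the terminal command line as described in its manual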
Public Domain: https://creativecommons.org/licenses/publicdomain/
This repository is associated with NSF DBI 2033973, RAPID Grant: Rapid Creation of a Data Product for the World's Specimens of Horseshoe Bats and Relatives, a Known Reservoir for Coronaviruses (https://www.nsf.gov/awardsearch/showAward?AWD_ID=2033973). Specifically, this repository contains (1) raw data from iDigBio (http://portal.idigbio.org) and GBIF (https://www.gbif.org), (2) R code for reproducible data wrangling and improvement, (3) protocols associated with data enhancements, and (4) enhanced versions of the dataset published at various project milestones. Additional code associated with this grant can be found in the BIOSPEX repository (https://github.com/iDigBio/Biospex). Long-term data management of the enhanced specimen data created by this project is expected to be accomplished by the natural history collections curating the physical specimens, a list of which can be found in this Zenodo resource.
Grant abstract: "The award to Florida State University will support research contributing to the development of georeferenced, vetted, and versioned data products of the world's specimens of horseshoe bats and their relatives for use by researchers studying the origins and spread of SARS-like coronaviruses, including the causative agent of COVID-19. Horseshoe bats and other closely related species are reported to be reservoirs of several SARS-like coronaviruses. Species of these bats are primarily distributed in regions where these viruses have been introduced to populations of humans. Currently, data associated with specimens of these bats are housed in natural history collections that are widely distributed both nationally and globally. Additionally, information tying these specimens to localities are mostly vague, or in many instances missing. This decreases the utility of the specimens for understanding the source, emergence, and distribution of SARS-COV-2 and similar viruses. This project will provide quality georeferenced data products through the consolidation of ancillary information linked to each bat specimen, using the extended specimen model. The resulting product will serve as a model of how data in biodiversity collections might be used to address emerging diseases of zoonotic origin. Results from the project will be disseminated widely in opensource journals, at scientific meetings, and via websites associated with the participating organizations and institutions. Support of this project provides a quality resource optimized to inform research relevant to improving our understanding of the biology and spread of SARS-CoV-2. The overall objectives are to deliver versioned data products, in formats used by the wider research and biodiversity collections communities, through an open-access repository; project protocols and code via GitHub and described in a peer-reviewed paper, and; sustained engagement with biodiversity collections throughout the project for reintegration of improved data into their local specimen data management systems improving long-term curation.
This RAPID award will produce and deliver a georeferenced, vetted and consolidated data product for horseshoe bats and related species to facilitate understanding of the sources, distribution, and spread of SARS-CoV-2 and related viruses, a timely response to the ongoing global pandemic caused by SARS-CoV-2 and an important contribution to the global effort to consolidate and provide quality data that are relevant to understanding emergent and other properties of the current pandemic. This RAPID award is made by the Division of Biological Infrastructure (DBI) using funds from the Coronavirus Aid, Relief, and Economic Security (CARES) Act.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria."
Files included in this resource
9d4b9069-48c4-4212-90d8-4dd6f4b7f2a5.zip: Raw data from iDigBio, DwC-A format
0067804-200613084148143.zip: Raw data from GBIF, DwC-A format
0067806-200613084148143.zip: Raw data from GBIF, DwC-A format
1623690110.zip: Full export of this project's data (enhanced and raw) from BIOSPEX, CSV format
bionomia-datasets-attributions.zip: Directory containing 103 Frictionless Data packages for datasets with attributions involving Rhinolophids or Hipposiderids; each package also contains a CSV file for mismatches between a person's date of birth/death and the specimen eventDate. The file bionomia-datasets-attributions-key_2021-02-25.csv, included in this directory, provides a key between dataset identifier (how the Frictionless Data package files are named) and dataset name.
bionomia-problem-dates-all-datasets_2021-02-25.csv: List of 21 Hipposiderid or Rhinolophid records whose eventDate or dateIdentified mismatches a wikidata recipient’s date of birth or death across all datasets.
flagEventDate.txt: file containing term definition to reference in DwC-A
flagExclude.txt: file containing term definition to reference in DwC-A
flagGeoreference.txt: file containing term definition to reference in DwC-A
flagTaxonomy.txt: file containing term definition to reference in DwC-A
georeferencedByID.txt: file containing term definition to reference in DwC-A
identifiedByNames.txt: file containing term definition to reference in DwC-A
instructions-to-get-people-data-from-bionomia-via-datasetKey: instructions given to data providers
RAPID-code_collection-date.R: code associated with enhancing collection dates
RAPID-code_compile-deduplicate.R: code associated with compiling and deduplicating raw data
RAPID-code_external-linkages-bold.R: code associated with enhancing external linkages
RAPID-code_external-linkages-genbank.R: code associated with enhancing external linkages
RAPID-code_external-linkages-standardize.R: code associated with enhancing external linkages
RAPID-code_people.R: code associated with enhancing data about people
RAPID-code_standardize-country.R: code associated with standardizing country data
RAPID-data-dictionary.pdf: metadata about terms included in this project’s data, in PDF format
RAPID-data-dictionary.xlsx: metadata about terms included in this project’s data, in spreadsheet format
rapid-data-providers_2021-05-03.csv: list of data providers and number of records provided to rapid-joined-records_country-cleanup_2020-09-23.csv
rapid-final-data-product_2021-06-29.zip: Enhanced data from BIOSPEX, DwC-A format
rapid-final-gazetteer.zip: Gazetteer providing georeference data and metadata for 10,341 localities assessed as part of this project
rapid-joined-records_country-cleanup_2020-09-23.csv: data product initial version where raw data has been compiled and deduplicated, and country data has been standardized
RAPID-protocol_collection-date.pdf: protocol associated with enhancing collection dates
RAPID-protocol_compile-deduplicate.pdf: protocol associated with compiling and deduplicating raw data
RAPID-protocol_external-linkages.pdf: protocol associated with enhancing external linkages
RAPID-protocol_georeference.pdf: protocol associated with georeferencing
RAPID-protocol_people.pdf: protocol associated with enhancing data about people
RAPID-protocol_standardize-country.pdf: protocol associated with standardizing country data
RAPID-protocol_taxonomic-names.pdf: protocol associated with enhancing taxonomic name data
RAPIDAgentStrings1_archivedCopy_30March2021.ods: resource used in conjunction with RAPID people protocol
recordedByNames.txt: file containing term definition to reference in DwC-A
Rhinolophid-HipposideridAgentStrings_and_People2_archivedCopy_30March2021.ods: resource used in conjunction with RAPID people protocol
wikidata-notes-for-bat-collectors_leachman_2020: please see https://zenodo.org/record/4724139 for this resource
This dataset contains data and code used for the paper "Carbon flux estimates are sensitive to data source: A comparison of field and lab temperature sensitivity data" [DOI COMING SOON].

A large literature exists on mechanisms driving soil production of the greenhouse gases CO2 and CH4. Measurements of these gases' fluxes are often performed using closed-chamber incubations in the laboratory or in situ, i.e., in the field. Although it is common knowledge that measurements obtained through field studies vs. laboratory incubations can diverge because of the vastly different conditions of these environments, few studies have systematically examined these patterns. It is crucial to understand the magnitude and reasons for any differences, as these data are used to parametrize and benchmark ecosystem- to global-scale models, which are then susceptible to the biases of the source data. Here, we specifically examine how greenhouse gas measurements may be influenced by whether the measurement/incubation was conducted in the field vs. laboratory, focusing on CO2 and CH4 measurements. We use Q10 of greenhouse gas flux (temperature sensitivity) for our analyses, because of the ubiquity of this metric in biological and Earth system sciences and its importance to many modeling frameworks. We predicted that laboratory measurements would be less variable, but also less representative of true field conditions. However, there was greater variability in the Q10 values calculated from lab-based measurements of CO2 fluxes, because lab experiments explore extremes rarely seen in situ, and reflect the physical and chemical disturbances occurring during sampling, transport, and incubation. Overall, respiration Q10 values were significantly greater in laboratory incubations (mean = 4.19) than field measurements (mean = 3.05), with strong influences of incubation temperature and climate region/biome. However, this was in part because field measurements typically represent total respiration (Rs), whereas lab incubations typically represent heterotrophic respiration (Rh), making direct comparisons difficult to interpret. Focusing only on Rh-derived Q10, these values showed almost identical distributions across laboratory (n = 1110) and field (n = 581) experiments, providing strong support for using the former as an experimental proxy for the latter, although we caution that geographic biases in the extant data make this conclusion tentative. Due to a smaller sample size of CH4 Q10 data, we were unable to perform a comparable robust analysis, but we expect similar interactions with soil temperature, moisture, and environmental/climatic variables. Our results here suggest the need for more concerted efforts to document and standardize these data, including sample and site metadata.

This dataset contains a compressed (.zip) archive of the data and R scripts used for this manuscript. The dataset includes files in .csv format, which can be accessed and processed using MS Excel or R. This archive can also be accessed on GitHub at https://github.com/kaizadp/field_lab_q10 (DOI: 10.5281/zenodo.7106554).
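The Q10 metric used throughout follows the standard temperature-sensitivity definition; a minimal R sketch with illustrative values is given below (the archived scripts compute Q10 from the compiled dataset).
# Q10 = (flux2 / flux1) ^ (10 / (temp2 - temp1)), temperatures in degrees C
q10 <- function(flux1, flux2, temp1, temp2) (flux2 / flux1)^(10 / (temp2 - temp1))
q10(flux1 = 2.1, flux2 = 4.6, temp1 = 10, temp2 = 20)  # ~2.2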
The following datasets are used for the Water Rights Demand Analysis project and are formatted to be used in the calculations. The State Water Resources Control Board Division of Water Rights (Division) has developed a methodology to standardize and improve the accuracy of water diversion and use data that is used to determine water availability and inform water management and regulatory decisions. The Water Rights Demand Data Analysis Methodology (Methodology https://www.waterboards.ca.gov/drought/drought_tools_methods/demandanalysis.html ) is a series of data pre-processing steps, R Scripts, and data processing modules that identify and help address data quality issues related to both the self-reported water diversion and use data from water right holders or their agents and the Division of Water Rights electronic water rights data.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Normalization
# Generate a resting state (rs) timeseries (ts)
# Install / load package to make fake fMRI ts
# install.packages("neuRosim")
library(neuRosim)
# Generate a ts
ts.rs <- simTSrestingstate(nscan=2000, TR=1, SNR=1)
# 3dDetrend -normalize
# R command version for 3dDetrend -normalize -polort 0 which normalizes by making "the sum-of-squares equal to 1"
# Do for the full timeseries
ts.normalised.long <- (ts.rs-mean(ts.rs))/sqrt(sum((ts.rs-mean(ts.rs))^2));
# Do this again for a shorter version of the same timeseries
ts.shorter.length <- length(ts.normalised.long)/4
ts.normalised.short <- (ts.rs[1:ts.shorter.length]- mean(ts.rs[1:ts.shorter.length]))/sqrt(sum((ts.rs[1:ts.shorter.length]- mean(ts.rs[1:ts.shorter.length]))^2));
# By looking at the summaries, it can be seen that the median values become larger
summary(ts.normalised.long)
summary(ts.normalised.short)
# Plot results for the long and short ts
# Truncate the longer ts for plotting only
ts.normalised.long.made.shorter <- ts.normalised.long[1:ts.shorter.length]
# Give the plot a title
title <- "3dDetrend -normalize for long (blue) and short (red) timeseries";
plot(x=0, y=0, main=title, xlab="", ylab="", xaxs='i', xlim=c(1,length(ts.normalised.short)), ylim=c(min(ts.normalised.short),max(ts.normalised.short)));
# Add zero line
lines(x=c(-1,ts.shorter.length), y=rep(0,2), col='grey');
# 3dDetrend -normalize -polort 0 for long timeseries
lines(ts.normalised.long.made.shorter, col='blue');
# 3dDetrend -normalize -polort 0 for short timeseries
lines(ts.normalised.short, col='red');
Standardization/modernization
New afni_proc.py command line
afni_proc.py \
-subj_id "$sub_id_name_1" \
-blocks despike tshift align tlrc volreg mask blur scale regress \
-radial_correlate_blocks tcat volreg \
-copy_anat anatomical_warped/anatSS.1.nii.gz \
-anat_has_skull no \
-anat_follower anat_w_skull anat anatomical_warped/anatU.1.nii.gz \
-anat_follower_ROI aaseg anat freesurfer/SUMA/aparc.a2009s+aseg.nii.gz \
-anat_follower_ROI aeseg epi freesurfer/SUMA/aparc.a2009s+aseg.nii.gz \
-anat_follower_ROI fsvent epi freesurfer/SUMA/fs_ap_latvent.nii.gz \
-anat_follower_ROI fswm epi freesurfer/SUMA/fs_ap_wm.nii.gz \
-anat_follower_ROI fsgm epi freesurfer/SUMA/fs_ap_gm.nii.gz \
-anat_follower_erode fsvent fswm \
-dsets media_?.nii.gz \
-tcat_remove_first_trs 8 \
-tshift_opts_ts -tpattern alt+z2 \
-align_opts_aea -cost lpc+ZZ -giant_move -check_flip \
-tlrc_base "$basedset" \
-tlrc_NL_warp \
-tlrc_NL_warped_dsets \
anatomical_warped/anatQQ.1.nii.gz \
anatomical_warped/anatQQ.1.aff12.1D \
anatomical_warped/anatQQ.1_WARP.nii.gz \
-volreg_align_to MIN_OUTLIER \
-volreg_post_vr_allin yes \
-volreg_pvra_base_index MIN_OUTLIER \
-volreg_align_e2a \
-volreg_tlrc_warp \
-mask_opts_automask -clfrac 0.10 \
-mask_epi_anat yes \
-blur_to_fwhm -blur_size $blur \
-regress_motion_per_run \
-regress_ROI_PC fsvent 3 \
-regress_ROI_PC_per_run fsvent \
-regress_make_corr_vols aeseg fsvent \
-regress_anaticor_fast \
-regress_anaticor_label fswm \
-regress_censor_motion 0.3 \
-regress_censor_outliers 0.1 \
-regress_apply_mot_types demean deriv \
-regress_est_blur_epits \
-regress_est_blur_errts \
-regress_run_clustsim no \
-regress_polort 2 \
-regress_bandpass 0.01 1 \
-html_review_style pythonic
We used similar command lines to generate the 'blurred and not censored' and the 'not blurred and not censored' timeseries files (described more fully below). We will make the code used to generate all derivative files available on our GitHub site (https://github.com/lab-lab/nndb). We made one choice above that is different enough from our original pipeline that it is worth mentioning here. Specifically, we have quite long runs, with the average being ~40 minutes, though this number is variable (thus leading to the above issue with 3dDetrend's -normalize). A discussion on the AFNI message board with one of our team (starting here: https://afni.nimh.nih.gov/afni/community/board/read.php?1,165243,165256#msg-165256) led to the suggestion that '-regress_polort 2' with '-regress_bandpass 0.01 1' be used for long runs. We had previously used only a variable polort with the suggested 1 + int(D/150) approach (e.g., for a ~40-minute run, D = 2400 s gives polort = 1 + int(2400/150) = 17). Our new polort 2 + bandpass approach has the added benefit of working well with afni_proc.py.
Which timeseries file you use is up to you but I have been encouraged by Rick and Paul to include a sort of PSA about this. In Paul's own words:
* Blurred data should not be used for ROI-based analyses (and potentially not for ICA? I am not certain about standard practice).
* Unblurred data for ISC might be pretty noisy for voxelwise analyses, since blurring should effectively boost the SNR of active regions (and even good alignment won't be perfect everywhere).
* For uncensored data, one should be concerned about motion effects being left in the data (e.g., spikes in the data).
* For censored data:
  * Performing ISC requires the users to unionize the censoring patterns during the correlation calculation.
  * If wanting to calculate power spectra or spectral parameters like ALFF/fALFF/RSFA etc. (which some people might do for naturalistic tasks still), then standard FT-based methods can't be used because sampling is no longer uniform. Instead, people could use something like 3dLombScargle+3dAmpToRSFC, which calculates power spectra (and RSFC params) based on a generalization of the FT that can handle non-uniform sampling, as long as the censoring pattern is mostly random and, say, only up to about 10-15% of the data.
In sum, think very carefully about which files you use. If you find you need a file we have not provided, we can happily generate different versions of the timeseries upon request and can generally do so in a week or less.
Effect on results
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
DBNorm installation. Describes how to install DBNorm via devtools in R. (TXT 4 kb)
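A hedged sketch of a devtools-based installation is shown below; the GitHub path is a placeholder, and the archived TXT file above is the authoritative set of instructions.
# install.packages("devtools")
library(devtools)
install_github("username/DBNorm")  # placeholder repository path; see the archived installation file
library(DBNorm)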
Anyone looking to process these data should standardize them, then process them using AICc in R.
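As a minimal sketch of that workflow (variable names are placeholders for whatever is in these data; AICc is computed manually here rather than with a helper package):
df <- data.frame(y = rnorm(50), x1 = rnorm(50), x2 = rnorm(50))  # placeholder data
df[c("x1", "x2")] <- scale(df[c("x1", "x2")])                    # standardize predictors
aicc <- function(fit) {
  k <- length(coef(fit)) + 1                   # estimated parameters, including residual variance
  n <- nobs(fit)
  AIC(fit) + (2 * k * (k + 1)) / (n - k - 1)   # small-sample corrected AIC
}
aicc(lm(y ~ x1, data = df))
aicc(lm(y ~ x1 + x2, data = df))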
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Effective data management plays a key role in oceanographic research as cruise-based data, collected from different laboratories and expeditions, are commonly compiled to investigate regional to global oceanographic processes. Here we describe new and updated best practice data standards for discrete chemical oceanographic observations, specifically those dealing with column header abbreviations, quality control flags, missing value indicators, and standardized calculation of certain properties. These data standards have been developed with the goals of improving the current practices of the scientific community and promoting their international usage. These guidelines are intended to standardize data files for data sharing and submission into permanent archives. They will facilitate future quality control and synthesis efforts and lead to better data interpretation. In turn, this will promote research in ocean biogeochemistry, such as studies of carbon cycling and ocean acidification, on regional to global scales. These best practice standards are not mandatory. Agencies, institutes, universities, or research vessels can continue using different data standards if it is important for them to maintain historical consistency. However, it is hoped that they will be adopted as widely as possible to facilitate consistency and to achieve the goals stated above.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
DBNorm test script. Code showing how we test the DBNorm package. (TXT 2 kb)
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
1. Plant fire syndromes are usually defined as combinations of fire response traits, the most common being resprouting (R) and seeding (S). Plant flammability (F), on the other hand, refers to a plant's effects on communities and ecosystems. Despite its important ecological and evolutionary implications, F has rarely been considered to define plant fire syndromes and, if so, usually separated from response syndromes.
2. We propose a three-dimensional model that combines R, S and F, encapsulating both plant response to fire regimes and the capacity to promote them. Each axis is divided into three possible standardized categories, reflecting low, medium and high values of each variable, with a total of 27 possible combinations of R, S and F.
3. We hypothesized that different fire histories should be reflected in the position of species within the three-dimensional space and that this should help assess the importance of fire as an evolutionary force in determining R-S-F syndromes.
4. To illustrate our approach we compiled information on the fire syndromes of 24 dominant species of different growth forms from the Chaco seasonally-dry forest of central Argentina, and we compared them to 33 species from different Mediterranean-type climate ecosystems (MTCEs) of the world.
5. Chaco and MTCEs species differed in the range (seven syndromes vs. thirteen syndromes, respectively) and proportion of extreme syndromes (i.e. species with extreme values of R, S and/or F) representing 29% of species in the Chaco vs. 45% in the MTCEs.
6. Additionally, we explored the patterns of R, S and F of 4032 species from seven regions with contrasting fire histories, and found significantly higher frequencies of extreme values (predominantly high) of all three variables in MTCEs compared to the other regions, where intermediate and low values predominated, broadly supporting our general hypothesis.
7. The proposed three-dimensional approach should help standardize comparisons of fire syndromes across taxa, growth forms and regions with different fire histories. This will contribute to the understanding of the role of fire in the evolution of plant traits and assist vegetation modelling in the face of changes in fire regimes.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In 2015, the Ethiopian Federal Ministry of Health (FMOH) developed the Saving Lives through Safe Surgery (SaLTS) initiative to improve national surgical care. Previous work led to development and implementation of 15 surgical key performance indicators (KPIs) to standardize surgical data practices. The objective of this project is to investigate current practices of KPI data collection and assess quality to improve data management and strengthen surgical systems. The first portion of the study documented the surgical data collection process including methods, instruments, and effectiveness at 10 hospitals across 2 regions in Ethiopia. Secondly, data for KPIs of focus [1. Surgical Volume, 2. Perioperative Mortality Rate (POMR), 3. Adverse Anesthetic Outcome (AAO), 4. Surgical Site Infection (SSI), and 5. Safe Surgery Checklist (SSC) Utilization] were compared between registries, KPI reporting forms, and the DHIS2 (district health information system) electronic database for a 6-month period (January—June 2022). Quality was assessed based on data completeness and consistency. The data collection process involved hospital staff recording data elements in registries, quality officers calculating KPIs, completing monthly KPI reporting forms, and submitting data into DHIS2 for the national and regional health bureaus. Data quality verifications revealed discrepancies in consistency at all hospitals, ranging from 1–3 indicators. For all hospitals, average monthly surgical volume was 57 cases, POMR was 0.38% (13/3399), inpatient SSI rate was 0.79% (27/3399), AAO rate was 0.15% (5/3399), and mean SSC utilization monthly was 93% (100% median). Half of the hospitals had incomplete data within the registries, ranging from 2–5 indicators. AAO, SSC, and SSI were commonly missing data in registries. Non-standardized KPI reporting forms contributed significantly to the findings. Facilitators to quality data collection included continued use of registries from previous interventions and use of a separate logbook to document specific KPIs. Delayed rollout of these indicators in each region contributed to issues in data quality. Barriers involved variable indicator recording from different personnel, data collection tools that generate false positives (i.e. completeness of SSC defined as paper form filled out prior to patient discharge) or missing data because of reporting time period (i.e. monthly SSI may miss infections outside of one month), inadequate data elements in registries, and lack of standardized monthly KPI reporting forms. As the FMOH introduces new indicators and changes, we recommend continuous and consistent quality checks and data capacity building, including the use of routinely generated health information for quality improvement projects at the department level.