Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Data normalization is a crucial step in gene expression analysis because it ensures the validity of downstream analyses. Although many metrics have been designed to evaluate existing normalization methods, different metrics, or the same metric applied to different datasets, yield inconsistent results, particularly for single-cell RNA sequencing (scRNA-seq) data. In the worst cases, a method ranked best by one metric is ranked poorest by another, or a method ranked best on one dataset is ranked poorest on another. This raises an open question: principles need to be established to guide the evaluation of normalization methods. In this study, we propose the principle that a normalization method evaluated as the best by one metric should also be evaluated as the best by another metric (the consistency of metrics), and that a method evaluated as the best using scRNA-seq data should also be evaluated as the best using bulk RNA-seq or microarray data (the consistency of datasets). We then designed a new metric, the Area Under the normalized CV threshold Curve (AUCVC), and applied it together with another metric, mSCC, to evaluate 14 commonly used normalization methods on both scRNA-seq and bulk RNA-seq data, satisfying the consistency of metrics and the consistency of datasets. Our findings pave the way for future studies on the normalization of gene expression data and its evaluation. The raw gene expression data, normalization methods, and evaluation metrics used in this study are included in an R package named NormExpression. NormExpression provides a framework and a fast, simple way for researchers to select the best method for normalizing their gene expression data, based on the evaluation of different methods (particularly data-driven methods or their own methods) under the principles of the consistency of metrics and the consistency of datasets.
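As a rough illustration of the intended workflow in R (a sketch only: the function name below is a hypothetical placeholder, not NormExpression's documented API; consult the package manual for the actual calls):
# install.packages("NormExpression")
library(NormExpression)
# counts: a genes x samples matrix of raw expression values
# The paper's workflow: score the 14 methods by AUCVC and mSCC on both
# scRNA-seq and bulk data, then pick the method that ranks best under both.
# scores <- evaluateMethods(counts, metrics = c("AUCVC", "mSCC"))  # hypothetical name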
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/ (SPDX: https://spdx.org/licenses/CC0-1.0.html)
Normalization
# Generate a resting state (rs) timeseries (ts)
# Install / load package to make fake fMRI ts
# install.packages("neuRosim")
library(neuRosim)
# Generate a ts
ts.rs <- simTSrestingstate(nscan=2000, TR=1, SNR=1)
# 3dDetrend -normalize
# R command version for 3dDetrend -normalize -polort 0 which normalizes by making "the sum-of-squares equal to 1"
# Do for the full timeseries
ts.normalised.long <- (ts.rs - mean(ts.rs)) / sqrt(sum((ts.rs - mean(ts.rs))^2))
# Do this again for a shorter version of the same timeseries
ts.shorter.length <- length(ts.normalised.long) / 4
ts.rs.short <- ts.rs[1:ts.shorter.length]
ts.normalised.short <- (ts.rs.short - mean(ts.rs.short)) / sqrt(sum((ts.rs.short - mean(ts.rs.short))^2))
# The summaries show that the normalized values are larger in magnitude for the
# shorter timeseries (the same unit sum of squares is spread over fewer timepoints)
summary(ts.normalised.long)
summary(ts.normalised.short)
# Plot results for the long and short ts
# Truncate the longer ts for plotting only
ts.normalised.long.made.shorter <- ts.normalised.long[1:ts.shorter.length]
# Give the plot a title
title <- "3dDetrend -normalize for long (blue) and short (red) timeseries"
plot(x = 0, y = 0, main = title, xlab = "", ylab = "", xaxs = 'i',
     xlim = c(1, length(ts.normalised.short)),
     ylim = c(min(ts.normalised.short), max(ts.normalised.short)))
# Add zero line
lines(x = c(-1, ts.shorter.length), y = rep(0, 2), col = 'grey')
# 3dDetrend -normalize -polort 0 for long timeseries
lines(ts.normalised.long.made.shorter, col = 'blue')
# 3dDetrend -normalize -polort 0 for short timeseries
lines(ts.normalised.short, col = 'red')
Standardization/modernization
New afni_proc.py command line
afni_proc.py \
-subj_id "$sub_id_name_1" \
-blocks despike tshift align tlrc volreg mask blur scale regress \
-radial_correlate_blocks tcat volreg \
-copy_anat anatomical_warped/anatSS.1.nii.gz \
-anat_has_skull no \
-anat_follower anat_w_skull anat anatomical_warped/anatU.1.nii.gz \
-anat_follower_ROI aaseg anat freesurfer/SUMA/aparc.a2009s+aseg.nii.gz \
-anat_follower_ROI aeseg epi freesurfer/SUMA/aparc.a2009s+aseg.nii.gz \
-anat_follower_ROI fsvent epi freesurfer/SUMA/fs_ap_latvent.nii.gz \
-anat_follower_ROI fswm epi freesurfer/SUMA/fs_ap_wm.nii.gz \
-anat_follower_ROI fsgm epi freesurfer/SUMA/fs_ap_gm.nii.gz \
-anat_follower_erode fsvent fswm \
-dsets media_?.nii.gz \
-tcat_remove_first_trs 8 \
-tshift_opts_ts -tpattern alt+z2 \
-align_opts_aea -cost lpc+ZZ -giant_move -check_flip \
-tlrc_base "$basedset" \
-tlrc_NL_warp \
-tlrc_NL_warped_dsets \
anatomical_warped/anatQQ.1.nii.gz \
anatomical_warped/anatQQ.1.aff12.1D \
anatomical_warped/anatQQ.1_WARP.nii.gz \
-volreg_align_to MIN_OUTLIER \
-volreg_post_vr_allin yes \
-volreg_pvra_base_index MIN_OUTLIER \
-volreg_align_e2a \
-volreg_tlrc_warp \
-mask_opts_automask -clfrac 0.10 \
-mask_epi_anat yes \
-blur_to_fwhm -blur_size $blur \
-regress_motion_per_run \
-regress_ROI_PC fsvent 3 \
-regress_ROI_PC_per_run fsvent \
-regress_make_corr_vols aeseg fsvent \
-regress_anaticor_fast \
-regress_anaticor_label fswm \
-regress_censor_motion 0.3 \
-regress_censor_outliers 0.1 \
-regress_apply_mot_types demean deriv \
-regress_est_blur_epits \
-regress_est_blur_errts \
-regress_run_clustsim no \
-regress_polort 2 \
-regress_bandpass 0.01 1 \
-html_review_style pythonic
We used similar command lines to generate the 'blurred and not censored' and the 'not blurred and not censored' timeseries files (described more fully below). We will make the code used to generate all derivative files available on our GitHub site (https://github.com/lab-lab/nndb). We made one choice above that differs enough from our original pipeline to be worth mentioning here. Specifically, we have quite long runs, averaging ~40 minutes but variable in length (thus leading to the issue with 3dDetrend's -normalize illustrated above). A discussion on the AFNI message board with one of our team (starting here: https://afni.nimh.nih.gov/afni/community/board/read.php?1,165243,165256#msg-165256) led to the suggestion that '-regress_polort 2' be used with '-regress_bandpass 0.01 1' for long runs. We had previously used only a variable polort with the suggested 1 + int(D/150) approach. The new polort 2 + bandpass approach has the added benefit of working well with afni_proc.py.
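For concreteness, a small R illustration of why the variable-polort heuristic becomes unwieldy for long runs (a sketch; D is the run duration in seconds):
# The 1 + int(D/150) heuristic mentioned above
polort_default <- function(D) 1 + floor(D / 150)
polort_default(2400)  # a ~40-minute run -> 17 baseline polynomials
# With '-regress_polort 2' plus '-regress_bandpass 0.01 1', slow drifts are
# instead absorbed by the bandpass regressors, keeping the baseline model small.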
Which timeseries file you use is up to you, but I have been encouraged by Rick and Paul to include a sort of PSA about this. In Paul's own words:
* Blurred data should not be used for ROI-based analyses (and potentially not for ICA? I am not certain about standard practice).
* Unblurred data for ISC might be pretty noisy for voxelwise analyses, since blurring should effectively boost the SNR of active regions (and even good alignment won't be perfect everywhere).
* For uncensored data, one should be concerned about motion effects being left in the data (e.g., spikes in the data).
* For censored data:
  * Performing ISC requires users to take the union of the censoring patterns during the correlation calculation.
  * If you want to calculate power spectra or spectral parameters like ALFF/fALFF/RSFA etc. (which some people might still do for naturalistic tasks), then standard FT-based methods can't be used because sampling is no longer uniform. Instead, people could use something like 3dLombScargle+3dAmpToRSFC, which calculates power spectra (and RSFC parameters) based on a generalization of the FT that can handle non-uniform sampling, as long as the censoring pattern is mostly random and, say, only up to about 10-15% of the data is censored.
In sum, think very carefully about which files you use. If you find you need a file we have not provided, we can happily generate different versions of the timeseries upon request and can generally do so in a week or less.
Effect on results
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Standardized data from Mobilise-D participants (YAR dataset) and from pre-existing datasets (ICICLE, MSIPC2, Gait in Lab and real-life settings, MS project, UNISS-UNIGE) are provided in the shared folder as an example of the procedures proposed in the publication "Mobility recorded by wearable devices and gold standards: the Mobilise-D procedure for data standardization", currently under review at Scientific Data. Please refer to that publication for further information, and please cite it if you use these data.
The code to standardize an example subject (from the ICICLE dataset) and to open the standardized MATLAB files in other languages (Python, R) is available on GitHub (https://github.com/luca-palmerini/Procedure-wearable-data-standardization-Mobilise-D).
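For example, opening one of the standardized files in R might look like the following (a minimal sketch, assuming the files are ordinary MATLAB v5 .mat files; the file name is illustrative, and the repository above contains the canonical loaders):
# install.packages("R.matlab")
library(R.matlab)
# readMat() returns a named list of the MATLAB variables in the file
standardized <- readMat("data.mat")
str(standardized, max.level = 1)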
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
With the improvement of omics and next-generation sequencing (NGS) methodologies, along with the lowered cost of generating these types of data, the analysis of high-throughput biological data has become standard both for forming and for testing biomedical hypotheses. Our knowledge of how to normalize datasets to remove latent undesirable variance has grown extensively, making for standardized data that are easily compared between studies. Here we present the CAncer bioMarker Prediction Pipeline (CAMPP), an open-source R-based wrapper (https://github.com/ELELAB/CAncer-bioMarker-Prediction-Pipeline-CAMPP) intended to aid users of bioinformatic software with data analyses. CAMPP is called from a terminal command line and is supported by a user-friendly manual. The pipeline may be run on a local computer and requires little or no knowledge of programming. To avoid issues relating to R package updates, an renv.lock file is provided to ensure R package stability. Data management includes missing-value imputation, data normalization, and distributional checks. CAMPP performs (I) k-means clustering, (II) differential expression/abundance analysis, (III) elastic-net regression, (IV) correlation and co-expression network analyses, (V) survival analysis, and (VI) protein-protein/miRNA-gene interaction networks. The pipeline returns tabular files and graphical representations of the results. We hope that CAMPP will assist in streamlining bioinformatic analyses of quantitative biological data, whilst ensuring an appropriate bio-statistical framework.
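In practice, restoring the pinned package versions from a lockfile is a one-liner (a sketch, run from inside the cloned CAMPP directory where the renv.lock file lives):
# install.packages("renv")
renv::restore()  # installs the exact package versions recorded in renv.lock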
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Methods
The dataset is the output of a comprehensive literature-based search that aims to collate all the evidence on where ecosystem service (ES) relationships have been mentioned and addressed. We applied systematic mapping, which is based on the "Guidelines for Systematic Review in Environmental Management" developed by the Centre for Evidence-Based Conservation at Bangor University (Pullin and Stewart 2006).
The methodological framework followed the standard stages outlined for systematic mapping in environmental sciences (James et al. 2016). Briefly, we defined the scope and objectives:
· We comprehensively reviewed and further explored the global evidence on ES trade-offs and synergies, focusing on all systems, including terrestrial, freshwater, and marine.
· We compiled the evidence on trade-offs and synergies among multiple ES interacting across various ecosystems.
· We performed a geographical and temporal trend analysis, exploring the distribution of studies across the world and examining how the focus on various ecosystem types and ES categories has evolved, to highlight gaps and biases.
Then we set the criteria for study inclusion (Table 1), searched the evidence, coded it, and produced the database. The article information extracted, including the specific criteria, is detailed in Table 1.
The first step was to search the ISI Web of Knowledge core collection (http://apps.webofknowledge.com) database, targeting the ecosystem services literature and studies dealing with trade-offs/synergies, win-win outcomes, or bundles when managing different ecosystem services in the landscape/seascape. All peer-reviewed journal articles written in English or Spanish were considered for review.
The peer-reviewed literature from 2005 to 2021 was reviewed, identifying relevant studies according to specific search terms. The search terms and descriptive words were derived from Howe et al. (2014), adding "bundles" and "co-benefits". The Boolean wildcard '*' (matching any sequence of letters after the root) was used on word roots where several different endings applied (Figure 1). The search terms used were:
(“*ecosystem service*” OR “environment* service*” OR “ecosystem* approach*” OR “ecosystem good*” OR “environment* good*”)
AND
(“*trade-off*” OR “tradeoff*” OR “synerg*” OR “win-win*” OR “bundle*” OR “cost*and benefit*” OR “co-benefit*”) n=5194
In the second step (Figure 1), papers were preliminarily coded with a semantic analysis using the R package Bibliometrix (http://www.bibliometrix.org). Papers were classified according to three systems: terrestrial, marine, and freshwater (Table 1). Papers with multiple systems, transitional habitats, or those that could not otherwise be classified were assigned to "other" (Mazor et al. 2018). Articles were classified based on the occurrence of the most frequent system words in their title, keywords, and abstract (Mazor et al. 2018). The set of system-specific words was determined by extracting the 250 most frequently used keywords from all considered articles and assigning each word to one system (articles could fall into just one of the four categories). Using this technique, we classified 100% of the papers. To further enrich the dataset and make it a useful repository for science and policy, an additional sub-classification was performed, categorizing papers as Coastal, Urban, Wetlands, Forest, Mountain, Freshwater, Agroecosystems, or Others (the last mainly representing multiple ecosystems; Table S1). This comprehensive classification enhances the dataset's utility for scientific and policy-making applications.
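As a rough sketch of this keyword-frequency step in R (the export file name is illustrative; convert2df() and tableTag() are standard bibliometrix functions):
library(bibliometrix)
# Import a Web of Science plain-text export into a bibliographic data frame
M <- convert2df("wos_export.txt", dbsource = "wos", format = "plaintext")
# The 250 most frequent author keywords (tag "DE"); "ID" gives Keywords Plus
top_keywords <- head(tableTag(M, Tag = "DE"), 250)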
In the third step (Figure 1), applying the same technique, we classified the papers into four ES categories: habitat (supporting, biodiversity-related), provisioning, regulating, and cultural services (De Groot et al. 2010; MEA 2005; Sukhdev 2010; Wallace 2007). For the classification into ES categories, articles could fall into one or more of the four categories (see Table 1 for examples of the keywords used to classify ecosystems, ES categories, and countries). Applying this technique, we excluded 2,149 papers that were not classified into any ES category, resulting in 3,629 papers (see Figure 1).
In the fourth step (Figure 1), an initial screening was conducted to identify papers that did not align with the review objective of assessing ES trade-offs and synergies to inform policy and management decisions. We manually reviewed the title of each paper in the dataset, excluding those from other fields or misaligned with the review objectives. In this initial assessment, we excluded 347 papers, leaving 3,286 papers for further review. A descriptive analysis of this 3,286-article dataset was performed to examine the distribution of ES categories within each ecosystem type over the study period. This analysis allowed us to determine the prevalence of each ES category in different ecosystem types and to identify temporal trends and patterns. The number of occurrences of each ES category within each ecosystem type was calculated, expressed as counts, allowing for the comparison of ES distributions across the selected ecosystem types.
In the fifth step (Figure 1), we visually represented the geographical distribution and focus of ES studies across the world. With studies classified by ES category and ecosystem type, the papers were coded according to the country where the study was performed. A specific country could be assigned to 2,636 studies; the 650 studies that did not specify a country of study were removed. Of these 2,636 classified papers, 499 were global studies considering several countries.
We developed global maps (Figure 1), each offering a unique perspective on the ES research landscape. The first map presents the total number of ES trade-off studies conducted worldwide, illustrating the geographical spread and concentration of research effort and providing a clear overview of regions that have been extensively studied and of those that may require more attention in future research. Additionally, we calculated two key metrics to assess research productivity more comprehensively: the number of research papers per capita and the number of research papers relative to gross domestic product (GDP). For population and GDP, we used the most recent available data from the World Bank (https://data.worldbank.org). These alternative metrics normalize the counts by population size and economic output, providing a more balanced view of research activity across countries (Figure S3).
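The normalization itself is straightforward; a minimal R sketch (assuming a data frame `papers` with illustrative columns iso2c and n_papers; the indicator codes are the World Bank's population and current-USD GDP series):
library(WDI)
wb <- WDI(indicator = c("SP.POP.TOTL", "NY.GDP.MKTP.CD"), start = 2021, end = 2021)
m <- merge(papers, wb, by = "iso2c")
m$papers_per_million_people <- m$n_papers / (m$SP.POP.TOTL / 1e6)
m$papers_per_billion_gdp <- m$n_papers / (m$NY.GDP.MKTP.CD / 1e9)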
Detailed maps were created featuring pie charts that show the ES categories and ecosystem types addressed in each country. These charts offer an understanding of how various ES categories and ecosystems are represented in different parts of the world. Finally, we assigned ES trade-off studies to world regions (Africa, Antarctica, Asia, Australasia, Europe, Latin America, and North America), looking at the relationships between ES categories. We considered both papers that evaluated more than one ES category and papers that considered only one. This country-level analysis offers insights into regional research trends and priorities, contributing to a more localized understanding of ES studies.
In the sixth step (Figure 1), each publication was critically appraised to evaluate its quality. The foundation for our critical appraisal is the comprehensive and multidimensional approach of Belcher et al. (2016), a framework for evaluating research quality that aligns well with the interdisciplinary nature of our study. Belcher et al. (2016) developed a robust framework that incorporates essential principles and criteria for assessing the quality of transdisciplinary research. This is particularly relevant for ecosystem services science and for our review, which contributes to advancing current knowledge by systematically synthesizing evidence on relationships among ES across these diverse systems.
The Belcher et al. (2016) framework emphasizes four main principles: relevance, credibility (which we adapted as methodological transparency), legitimacy (generalizability in our context), and effectiveness (significance). A continuous scoring system (ranging from 0 to 1) was applied for the four main criteria to maintain simplicity and consistency across the large number of studies. In this system, a value closer to 0 indicates that a criterion is not met, while a value closer to 1 indicates that it is more closely met. This scoring method served as a useful indicator of the overall quality of a paper and of how well the article met the review's goals.
Methodological transparency was assessed based on the clarity and completeness of methodological descriptions, including data availability, the rigor of statistical analyses, methodological detail, and the reproducibility of the findings. This criterion assesses the transparency and rigor of the study's methodology, including data collection, analysis, and reporting (Belcher et al. 2016). Relevance was evaluated by the study's alignment with the review's objectives, its importance to the field, and its practical applicability, including the extent to which the study addresses pertinent research questions.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Analysis of bulk RNA sequencing (RNA-Seq) data is a valuable tool for understanding transcription at the genome scale. Targeted sequencing of RNA has emerged as a practical means of assessing the majority of the transcriptomic space with less reliance on large resources for consumables and bioinformatics. TempO-Seq is a templated, multiplexed RNA-Seq platform that interrogates a panel of sentinel genes representative of genome-wide transcription. Nuances of the technology require proper preprocessing of the data. Various methods have been proposed and compared for normalizing bulk RNA-Seq data, but there has been little to no investigation of how these methods perform on TempO-Seq data. We simulated count data into two groups (treated vs. untreated) at seven fold-change (FC) levels (including no change), using control samples from human HepaRG cells run on TempO-Seq, and normalized the data using seven normalization methods. Upper quartile (UQ) normalization performed the best with regard to maintaining FC levels as detected by a limma contrast between the treated and untreated groups. For all FC levels, the specificity of UQ normalization was greater than 0.84 and the sensitivity greater than 0.90, except for the no-change and +1.5 levels. Furthermore, k-means clustering of the simulated genes normalized by UQ agreed the most with the FC assignments [adjusted Rand index (ARI) = 0.67]. Despite its assumption that the majority of genes are unchanged, the DESeq2 scaling-factor normalization method performed reasonably well, as did the simple normalization procedures counts per million (CPM) and total counts (TC). These results suggest that for two-class comparisons of TempO-Seq data, UQ, CPM, TC, or DESeq2 normalization should provide reasonably reliable results at absolute FC levels >= 2.0. These findings will help guide researchers in normalizing TempO-Seq gene expression data for more reliable results.
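For reference, upper-quartile normalization is simple to implement; a minimal R sketch (assuming a genes x samples count matrix, and following the common convention of taking the 75th percentile of nonzero counts):
uq_normalize <- function(counts) {
  # Per-sample 75th percentile of nonzero counts
  uq <- apply(counts, 2, function(x) quantile(x[x > 0], 0.75))
  # Rescale the size factors to a mean of 1, then divide each column
  sweep(counts, 2, uq / mean(uq), FUN = "/")
}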
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Along-track temperature, salinity, backscatter, chlorophyll fluorescence, and normalized water-leaving radiance (nLw).
On the bow of the R/V Roger Revelle was a Satlantic SeaWiFS Aircraft Simulator (MicroSAS) system, used to estimate water-leaving radiance from the ship, analogous to the nLw derived by the SeaWiFS and MODIS satellite sensors, but free from atmospheric error (hence, it can provide data below clouds).
The system consisted of a down-looking radiance sensor and a sky-viewing radiance sensor, both mounted on a steerable holder on the bow. A downwelling irradiance sensor was mounted at the top of the ship's meteorological mast, on the bow, far from any potentially shading structures. These data were used to estimate normalized water-leaving radiance as a function of wavelength. The radiance detector was set to view the water at 40deg from nadir, as recommended by Mueller et al. [2003b]. The water radiance sensor was able to view over an azimuth range of ~180deg across the ship's heading with no viewing of the ship's wake. The direction of the sensor was adjusted to view the water 90-120deg from the sun's azimuth to minimize sun glint, and was continually updated as the time and the ship's gyro heading were used to calculate the sun's position using an astronomical solar-position subroutine interfaced with a stepping motor attached to the radiometer mount (designed and fabricated at Bigelow Laboratory for Ocean Sciences). Protocols for operation and calibration followed Mueller [Mueller et al., 2003a; Mueller et al., 2003b; Mueller et al., 2003c]. Before 1000h and after 1400h, data quality was poorer because the solar elevation was too low. Post-cruise, the 10 Hz data were filtered to remove as much residual whitecap and glint as possible (we accepted the lowest 5% of the data). Reflectance plaque measurements were made several times at local apparent noon on sunny days to verify the radiometer calibrations.
Within an hour of local apparent noon each day, a Satlantic OCP sensor was deployed off the stern of the R/V Revelle after the ship was oriented so that the sun was off the stern. The ship would secure the starboard Z-drive and use the port Z-drive and bow thruster to move ahead at about 25 cm s-1. The OCP was then trailed aft and brought to the surface ~100 m aft of the ship, then allowed to sink to 100 m while downwelling spectral irradiance and upwelling spectral radiance were recorded continuously along with temperature and salinity. This procedure ensured there were no ship-shadow effects in the radiometry.
Instruments included a WETLabs WETStar fluorometer, a WETLabs ECO Triplet, and a SeaBird microTSG.
Radiometry was done using a Satlantic seven-channel MicroSAS system with Es, Lt, and Li sensors.
Chl data are based on intercalibrating the discrete surface chlorophyll measurements with the temporally closest fluorescence measurements and applying the regression results to all fluorescence data.
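A minimal R sketch of that intercalibration (data frame and column names are illustrative):
# Regress discrete chlorophyll on the temporally closest fluorescence matchups
fit <- lm(chl_discrete ~ fluorescence, data = matchups)
# Apply the fitted regression to the full underway fluorescence record
underway$chl_estimated <- predict(fit, newdata = underway)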
Data have been corrected for instrument biofouling and drift based on weekly pure-water calibrations of the system. Radiometric data have been processed using standard Satlantic processing software and checked with periodic plaque measurements using a 2% Spectralon standard.
Lw is calculated from Lt and Lsky and is "what Lt would be if the sensor were looking straight down". Since our sensors are mounted at 40deg, based on various NASA protocols, we need to do that conversion. Lwn adds Es to the mix: Es is used to normalize Lw. nLw is related to Rrs, the remote-sensing reflectance.
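For reference, the relations commonly used to compute these quantities (a sketch; rho is the effective sea-surface reflectance factor, roughly 0.02-0.03 for this viewing geometry, and F0 is the mean extraterrestrial solar irradiance):
$$L_w(\lambda) = L_t(\lambda) - \rho\,L_{sky}(\lambda), \qquad R_{rs}(\lambda) = \frac{L_w(\lambda)}{E_s(\lambda)}, \qquad nLw(\lambda) = R_{rs}(\lambda)\,F_0(\lambda)$$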
Techniques used are as described in:
Balch WM, Drapeau DT, Bowler BC, Booth ES, Windecker LA, Ashe A (2008) Space-time variability of carbon standing stocks and fixation rates in the Gulf of Maine, along the GNATS transect between Portland, ME, USA, and Yarmouth, Nova Scotia, Canada. J Plankton Res 30:119-139
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Summary
10 primary GBM and 8 recurrent GBM samples (14/18 matched) profiled using single-nucleus RNA sequencing (sci-RNA-seq3 protocol).
Data Format
Data are provided as a preprocessed dataset, stored in a Seurat object.
Sample processing, sci-RNA-seq3 library generation, and sequencing
Snap-frozen patient pGBM and rGBM tissues were chopped with a razor blade or scissors before nucleus isolation. Nuclei extraction and fixation were performed as previously described (Cao 2019), except for the use of a modified CST lysis buffer [50] plus 1% SUPERase-In RNase Inhibitor (Invitrogen, #AM2696). Lysis time and washing steps were further optimized for human GBM tissue. Nuclei quality was checked with DAPI and Wheat Germ Agglutinin (WGA) staining. sci-RNA-seq3 libraries were generated as previously described [49] using three-level combinatorial indexing. The final libraries were sequenced on an Illumina NovaSeq as follows: read 1: 34 bp; read 2: >=69 bp; index 1: 10 bp; index 2: 10 bp.
Demultiplexing and read alignment
Raw sequencing reads were first demultiplexed based on i5/i7 PCR barcodes. FASTQ files were then processed using the sci-RNA-seq3 pipeline. After barcodes and unique molecular identifiers (UMIs) were extracted from read 1 of the FASTQ files, read alignment was performed using the STAR short-read aligner (v2.5.2b) with the human genome (hg19) and Gencode v24 gene annotations. After removing duplicate reads based on UMI, barcode, chromosome, and alignment position, reads were summarized into a count matrix of M genes × N nuclei.
Filtering, normalization, integration, and dimensional reduction
Raw count matrices were loaded into a Seurat object (version 4.0.1) and filtered to retain cells with (i) 200-9,000 recovered genes per cell, (ii) less than 60% mitochondrial content, and (iii) an unmatched rate within 3 median absolute deviations of the median. To normalize the count matrix, we adopted the modeling framework previously described and implemented in sctransform (R package, version 0.3.2). In brief, count data were modelled by regularized negative binomial regression, using sequencing depth as a model covariate to regress out technical effects, and Pearson residuals were used as the normalized, variance-stabilized biological signal for downstream analysis. Data from each patient were integrated with the reciprocal PCA method (Seurat) using the top 2000 variable features. PCA was performed on the integrated dataset, and the top N components that accounted for 90% of the observed variance were used for UMAP embedding: RunUMAP(max_components = 2, n_neighbours = 50, min_dist = 0.1, metric = cosine).
Contact
Contact Dr. Nicholas Mikolajewicz with any questions about the data or analysis (n.mikolajewicz@utoronto.ca).
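A condensed R sketch of this Seurat workflow (parameter values mirror the description above; object names and the choice of dims are illustrative, and the per-patient reciprocal-PCA integration step is omitted for brevity; this is not the authors' exact script):
library(Seurat)
obj <- CreateSeuratObject(counts = count_matrix)  # M genes x N nuclei
obj[["percent.mt"]] <- PercentageFeatureSet(obj, pattern = "^MT-")
obj <- subset(obj, subset = nFeature_RNA >= 200 & nFeature_RNA <= 9000 & percent.mt < 60)
obj <- SCTransform(obj)  # regularized NB regression; Pearson residuals
obj <- RunPCA(obj)
obj <- RunUMAP(obj, dims = 1:30, n.neighbors = 50, min.dist = 0.1, metric = "cosine")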
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Experimental data can broadly be divided into discrete or continuous data. Continuous data are obtained from measurements performed as a function of another quantitative variable, e.g., time, length, concentration, or wavelength. The results from these types of experiments are often used to generate plots that visualize the measured variable on a continuous, quantitative scale. To simplify state-of-the-art data visualization and annotation of data from such experiments, an open-source tool was created with R/Shiny that does not require coding skills to operate. The freely available web app accepts wide (spreadsheet) and tidy data and offers a range of options to normalize the data. The data from individual objects can be shown in three different ways: (1) lines with unique colors, (2) small multiples, and (3) heatmap-style display. In addition, the mean can be displayed with a 95% confidence interval for the visual comparison of different conditions. Several color-blind-friendly palettes are available to label the data and/or statistics. The plots can be annotated with graphical features and/or text to indicate any perturbations that are relevant. All user-defined settings can be stored for reproducibility of the data visualization. The app is dubbed PlotTwist and runs locally or online: https://huygens.science.uva.nl/PlotTwist
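To run the app locally, something like the following should work (a sketch; the GitHub account and repository names below are assumptions, so check the app's site above for the canonical source):
# install.packages("shiny")
shiny::runGitHub("PlotTwist", "JoachimGoedhart")  # repo/user names are assumed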