Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Data normalization is a crucial step in gene expression analysis, as it ensures the validity of downstream analyses. Although many metrics have been designed to evaluate existing normalization methods, different metrics, or the same metric applied to different datasets, yield inconsistent results, particularly for single-cell RNA sequencing (scRNA-seq) data. In the worst cases, a method evaluated as the best by one metric is evaluated as the poorest by another metric, or a method evaluated as the best using one dataset is evaluated as the poorest using another dataset. This raises an open question: principles need to be established to guide the evaluation of normalization methods. In this study, we propose the principle that a normalization method evaluated as the best by one metric should also be evaluated as the best by another metric (the consistency of metrics), and that a method evaluated as the best using scRNA-seq data should also be evaluated as the best using bulk RNA-seq data or microarray data (the consistency of datasets). We then designed a new metric named Area Under the normalized CV threshold Curve (AUCVC) and applied it, together with another metric, mSCC, to evaluate 14 commonly used normalization methods using both scRNA-seq and bulk RNA-seq data, satisfying the consistency of metrics and the consistency of datasets. Our findings pave the way for future studies on the normalization of gene expression data and its evaluation. The raw gene expression data, normalization methods, and evaluation metrics used in this study have been included in an R package named NormExpression. NormExpression provides a framework and a fast, simple way for researchers to select the best method for normalizing their gene expression data, based on the evaluation of different methods (particularly data-driven methods or their own methods) under the principles of the consistency of metrics and the consistency of datasets.
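The text defines AUCVC only at a high level; as a rough illustration of the CV-threshold idea in R (a loose reading with a hypothetical helper, not the NormExpression implementation):

# Rough sketch of the CV-threshold curve behind AUCVC (assumption: AUCVC
# integrates the fraction of low-variation genes across CV cutoffs; see the
# NormExpression package for the authors' actual implementation).
aucvc_sketch <- function(norm_counts, thresholds = seq(0.1, 1, by = 0.01)) {
  cv <- apply(norm_counts, 1, function(g) sd(g) / mean(g))  # per-gene CV
  frac_below <- sapply(thresholds, function(t) mean(cv <= t, na.rm = TRUE))
  # Area under the (threshold, fraction) curve via the trapezoidal rule
  sum(diff(thresholds) * (head(frac_below, -1) + tail(frac_below, -1)) / 2)
}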
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This file contains a preprocessed subset of the MIMIC-IV dataset (Medical Information Mart for Intensive Care, Version IV), specifically focusing on laboratory event data related to glucose levels. It has been curated and processed for research on data normalization and integration within Clinical Decision Support Systems (CDSS) to improve Human-Computer Interaction (HCI) elements.
The dataset includes the following key features:
This data has been used to analyze the impact of normalization and integration techniques on improving data accuracy and usability in CDSS environments. The file is provided as part of ongoing research on enhancing clinical decision-making and user interaction in healthcare systems.
The data originates from the publicly available MIMIC-IV database, developed and maintained by the Massachusetts Institute of Technology (MIT). Proper ethical guidelines for accessing and preprocessing the dataset have been followed.
MIMIC-IV_LabEvents_Subset_Normalization.xlsx
Terms of use: https://doi.org/10.4121/resource:terms_of_use
This is a normalized dataset derived from the original RNA-seq dataset downloaded from the Genotype-Tissue Expression (GTEx) project (www.gtexportal.org): RNA-SeQCv1.1.8 gene rpkm Pilot V3 patch1. The data was used to analyze how tissue samples relate to each other in terms of gene expression, and can be used to gain insight into how gene expression levels behave in the different human tissues.
Background
The Infinium EPIC array measures the methylation status of > 850,000 CpG sites. The EPIC BeadChip uses a two-array design: Infinium Type I and Type II probes. These probe types exhibit different technical characteristics which may confound analyses. Numerous normalization and pre-processing methods have been developed to reduce probe type bias as well as other issues such as background and dye bias.
Methods
This study evaluates the performance of various normalization methods using 16 replicated samples and three metrics: absolute beta-value difference, overlap of non-replicated CpGs between replicate pairs, and effect on beta-value distributions. Additionally, we carried out Pearson’s correlation and intraclass correlation coefficient (ICC) analyses using both raw and SeSAMe 2 normalized data.
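For orientation, a minimal R sketch of the first metric as we understand it (per-probe absolute beta-value difference between replicate pairs); the authors' exact computation may differ:

# Assumed form of the first metric: per-probe absolute beta-value difference
# between the two members of each replicate pair ('betas' is probes x samples,
# 'pairs' a two-column matrix of sample indices; names are illustrative).
replicate_abs_diff <- function(betas, pairs) {
  diffs <- sapply(seq_len(nrow(pairs)),
                  function(i) abs(betas[, pairs[i, 1]] - betas[, pairs[i, 2]]))
  rowMeans(diffs)  # mean absolute difference per probe across pairs
}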
Results
The method we define as SeSAMe 2, which consists of applying the regular SeSAMe pipeline with an additional round of QC (pOOBAH masking), was found to be the b...,
Study Participants and Samples
The whole blood samples were obtained from the Health, Well-being and Aging (Saúde, Ben-estar e Envelhecimento, SABE) study cohort. SABE is a cohort of census-withdrawn elderly from the city of São Paulo, Brazil, followed up every five years since the year 2000, with DNA first collected in 2010. Samples from 24 elderly adults were collected at two time points for a total of 48 samples. The first time point is the 2010 collection wave, performed from 2010 to 2012, and the second time point was set in 2020 in a COVID-19 monitoring project (9±0.71 years apart). The 24 individuals were 67.41±5.52 years of age (mean ± standard deviation) at time point one; and 76.41±6.17 at time point two and comprised 13 men and 11 women.
All individuals enrolled in the SABE cohort provided written consent, and the ethics protocols were approved by local and national institutional review boards COEP/FSP/USP OF.COEP/23/10, CONEP 2044/2014, CEP HIAE 1263-10, University o... We provide data in an Excel file, with absolute differences in beta values between replicate samples for each probe provided in different tabs for raw data and the different normalization methods.
The zip file contains supplementary files (normalized data sets and R code) to reproduce the analyses presented in the paper "Use of pre-transformation to cope with extreme values in important candidate features" by Boulesteix, Guillemot & Sauerbrei (Biometrical Journal, 2011). The raw data (CEL files) are publicly available and described in the following papers:
- Ancona et al., 2006. On the statistical assessment of classifiers using DNA microarray data. BMC Bioinformatics 7, 387.
- Miller et al., 2005. An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival. Proceedings of the National Academy of Sciences 102, 13550–13555.
- Minn et al., 2005. Genes that mediate breast cancer metastasis to lung. Nature 436, 518–524.
- Pawitan et al., 2005. Gene expression profiling spares early breast cancer patients from adjuvant therapy: derived and validated in two population-based cohorts. Breast Cancer Research 7, R953–964.
- Scherzer et al., 2007. Molecular markers of early Parkinson's disease based on gene expression in blood. Proceedings of the National Academy of Sciences 104, 955–960.
- Singh et al., 2002. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1, 203–209.
- Sotiriou et al., 2006. Gene expression profiling in breast cancer: understanding the molecular basis of histologic grade to improve prognosis. Journal of the National Cancer Institute 98, 262–272.
- Tang et al., 2009. Gene-expression profiling of peripheral blood mononuclear cells in sepsis. Critical Care Medicine 37, 882–888.
- Wang et al., 2005. Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet 365, 671–679.
- Irizarry, 2003. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Research 31 (4), e15.
- Irizarry et al., 2006. Comparison of Affymetrix GeneChip expression measures. Bioinformatics 22 (7), 789–794.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Analysis of bulk RNA sequencing (RNA-Seq) data is a valuable tool to understand transcription at the genome scale. Targeted sequencing of RNA has emerged as a practical means of assessing the majority of the transcriptomic space with less reliance on large resources for consumables and bioinformatics. TempO-Seq is a templated, multiplexed RNA-Seq platform that interrogates a panel of sentinel genes representative of genome-wide transcription. Nuances of the technology require proper preprocessing of the data. Various methods have been proposed and compared for normalizing bulk RNA-Seq data, but there has been little to no investigation of how the methods perform on TempO-Seq data. We simulated count data into two groups (treated vs. untreated) at seven-fold change (FC) levels (including no change) using control samples from human HepaRG cells run on TempO-Seq and normalized the data using seven normalization methods. Upper Quartile (UQ) performed the best with regard to maintaining FC levels as detected by a limma contrast between treated vs. untreated groups. For all FC levels, specificity of the UQ normalization was greater than 0.84 and sensitivity greater than 0.90 except for the no change and +1.5 levels. Furthermore, K-means clustering of the simulated genes normalized by UQ agreed the most with the FC assignments [adjusted Rand index (ARI) = 0.67]. Despite having an assumption of the majority of genes being unchanged, the DESeq2 scaling factors normalization method performed reasonably well as did simple normalization procedures counts per million (CPM) and total counts (TCs). These results suggest that for two class comparisons of TempO-Seq data, UQ, CPM, TC, or DESeq2 normalization should provide reasonably reliable results at absolute FC levels ≥2.0. These findings will help guide researchers to normalize TempO-Seq gene expression data for more reliable results.
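For readers who want to try this on their own counts, a minimal R sketch of upper-quartile normalization in its generic form (each sample scaled by the 75th percentile of its nonzero counts); this is not the exact code used in the study:

# Hedged sketch of upper-quartile (UQ) normalization for a genes x samples
# count matrix; generic form, not the study's exact implementation.
uq_normalize <- function(counts) {
  uq <- apply(counts, 2, function(s) quantile(s[s > 0], 0.75))  # per-sample UQ
  sweep(counts, 2, uq / mean(uq), "/")  # rescale samples to a common UQ
}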
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Dr. Kevin Bronson provides a second year of nitrogen and water management in wheat agricultural research data for computational use. Ten irrigation treatments from a linear sprinkler were combined with nitrogen treatments. This dataset includes notation of field events and operations; an intermediate analysis mega-table of correlated and calculated parameters, including laboratory analysis results generated during the experimentation; high-resolution plot-level intermediate data tables of SAS process output; and the complete raw data sensor records and logger outputs.
This proximal terrestrial high-throughput plant phenotyping dataset exemplifies our early tri-metric field method, in which geo-referenced 5Hz crop canopy height, temperature, and spectral signature are recorded coincidently to indicate plant health status. In this development period, our Proximal Sensing Cart Mark1 (PSCM1) platform suspended a single cluster of sensors on a dual sliding vertical placement armature.
Experimental design and operational details of the research conducted are contained in related published articles; however, further description of the measured data signals, as well as germane commentary, is offered here.
The primary component of this dataset is the Holland Scientific (HS) CropCircle ACS-470 reflectance values, which, as derived here, consist of raw active optical band-pass values digitized onboard the sensor. Data is delivered as sequential serialized text output including the associated GPS information. Typically this is a production-agriculture support technology enabling efficient precision application of nitrogen fertilizer. We used this optical reflectance sensor technology to investigate plant agronomic biology, as the ACS-470 is not only rugged and reliable but also actively illuminated and filter-customizable.
Individualized ACS-470 sensor detector behavior, and its influence on subsequent index calculation, can be understood through analysis of white-panel and other known-target measurements. When a sensor is held 120 cm from a titanium-dioxide white painted panel, a normalized unity value of 1.0 is set for each detector. To generate this dataset we used a Holland Scientific SC-1 device and set the 1.0 unity value (field normalization) on each sensor individually, before each data collection, without using any channel gain boost. The SC-1 field normalization device connects to a Windows machine, where company-provided sensor control software runs the necessary sensor normalization routine and gives a real-time view of streaming sensor data.
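The arithmetic this routine implies is simple; a minimal sketch of our assumed relationship (not vendor firmware):

# Sketch of the white-panel normalization arithmetic implied above (an
# assumption about the relationship, not Holland Scientific's code): after
# the SC-1 routine, a detector reading is interpreted relative to the 1.0
# unity response set against the white panel at 120 cm.
panel_normalize <- function(raw_reading, panel_reading) {
  raw_reading / panel_reading   # 1.0 == white-panel response
}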
This type of active proximal multi-spectral reflectance data may be perceived as inherently “noisy”; however, basic analytical description consistently resolves biological patterning, and more advanced statistical analysis is suggested for discovery. Sources of polychromatic reflectance are inherent in the environment and can be influenced by surface features such as wax, water, or crystal mineralization; varying bi-directional reflectance in the proximal space is a modeled reality, and directed-energy emission-reflection sampling is expected to support physical understanding of the underlying passive environmental system.
Soil in view of the sensor does decrease the raw detection amplitude of the returned target color and can add a soil reflection signal component. Yet that return accurately represents a largely two-dimensional cover-and-intensity signal of the target material present within each view. It does not, however, represent a reflection of the plant material alone, because additional features can be in view. Expect NDVI values greater than 0.1 when sensing plants, saturating around 0.8 rather than the typical 0.9 of passive NDVI.
The active signal does not transmit enough energy to penetrate the canopy beyond perhaps an LAI of 2.1 or less, compared to what a solar-induced passive reflectance sensor would encounter. However, the focus of our active sensor scan is the uppermost expanded canopy leaves, which are positioned to intercept the major solar energy. Active-energy sensors are easier to direct; in our capture method we target a consistent sensor height of 1 m above the average canopy height and maintain a rig travel speed of about 1.5 mph, with sensors parallel to the ground in a nadir view.
We consider these CropCircle raw detector returns to be more “instant” in generation, and “less filtered” electronically while onboard the “black-box” device, than other reflectance products that produce vegetation indices as averages of multiple detector samples in time.
It is known through internal sensor performance tracking across our entire location inventory that sensor body temperature change affects raw detector returns in minor, undescribed, yet apparently consistent ways.
Holland Scientific 5Hz CropCircle ACS-470 active optical reflectance sensors, measured on the GeoScout proprietary digital serial data logger, have a stable output format as defined by firmware version. Fifteen collection events are presented.
Different numbers of csv data files were generated depending on field operations, and there were a few short instances where the GPS signal was lost. When present, multiple raw data files, including white-panel measurements taken before or after field collections, were combined into one file, with the null placeholder value -9999 inserted. Two CropCircle sensors, numbered 2 and 3, supplied data in a lined format in which variables are repeated for each sensor, creating a discrete data row for each individual sensor measurement instance.
We offer six high-throughput single-pixel spectral colors, recorded at 530, 590, 670, 730, 780, and 800 nm. The filtered band-pass was 10 nm, except for the NIR, which was set to 20 nm and supplied an increased signal (along with increased noise).
Using the paired CropCircle sensors in a dual, or tandem, approach enables additional vegetation index calculations, such as the following (an R sketch of these calculations appears after the list):
DATT = (r800 - r730) / (r800 - r670)
DATTA = (r800 - r730) / (r800 - r590)
MTCI = (r800 - r730) / (r730 - r670)
CIRE = (r800 / r730) - 1
CI = (r800 / r590) - 1
CCCI = NDRE / NDVI (both computed with r800 as the NIR band)
PRI = (r590 - r530) / (r590 + r530)
CI800 = (r800 / r590) - 1
CI780 = (r780 / r590) - 1
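Written out in R, with the six band reflectances as inputs; the NDRE/NDVI forms used inside CCCI are the standard definitions and are our assumption here:

# The paired-sensor indices above, written out in R. Inputs r530..r800 are
# the six 5Hz channel reflectance values; the NDRE and NDVI definitions
# used for CCCI are the standard forms (an assumption, not stated above).
vegetation_indices <- function(r530, r590, r670, r730, r780, r800) {
  ndre <- (r800 - r730) / (r800 + r730)
  ndvi <- (r800 - r670) / (r800 + r670)
  list(
    DATT  = (r800 - r730) / (r800 - r670),
    DATTA = (r800 - r730) / (r800 - r590),
    MTCI  = (r800 - r730) / (r730 - r670),
    CIRE  = (r800 / r730) - 1,
    CI    = (r800 / r590) - 1,
    CCCI  = ndre / ndvi,
    PRI   = (r590 - r530) / (r590 + r530),
    CI800 = (r800 / r590) - 1,
    CI780 = (r780 / r590) - 1
  )
}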
The Campbell Scientific (CS) environmental data recordings of small-range (0 to 5 V) voltage sensor signals are accurate and largely shielded by design from electronic thermally induced influence and other such factors. They were used as descriptively recommended by the company. High-precision clock timing and a recorded confluence of custom metrics give the Campbell Scientific raw data signal acquisitions high research value generally, and they have delivered baseline metrics in our plant phenotyping program. Raw electrical sensor signal captures were recorded at the maximum digital resolution and could be re-processed in whole, while the subsequent onboard calculated metrics were often data-typed at a lower memory precision and served our research analysis.
Improved Campbell Scientific data at 5Hz are presented for nine collection events, where thermal, ultrasonic displacement, and additional GPS metrics were recorded. Ultrasonic height metrics generated by the Honeywell sensor and present in this dataset represent successful phenotypic recordings. The Honeywell ultrasonic displacement sensor has worked well in this application because of its 180 kHz signal frequency, which ranges a 2 m space. Air temperature is still a developing metric; a thermocouple wire junction (TC) placed in free air with a solar shade produced a low-confidence passive ambient air temperature.
Campbell Scientific logger-derived data output is structured in a column format, with multiple sensor data values present in each data row. One data row represents one program output cycle recorded across the sensing array, as there was no onboard logger data averaging or down-sampling. Campbell Scientific data is first recorded in binary format onboard the data logger and then, upon data retrieval, converted to ASCII text via the PC-based LoggerNet CardConvert application. Here, our full CS raw data output, which includes a four-line header structure, was truncated to a typical single-row header of variable names. The -9999 placeholder value was inserted for null instances.
There is canopy thermal data from three view vantages: a nadir view, and views looking forward and backward down the plant row at a 30-degree angle off nadir. The high-confidence Apogee Instruments SI-111 infrared radiometer (a non-contact thermometer) serial number 1022 was in the front position looking forward away from the platform, number 1023 had the nadir view in the middle position, and number 1052 was in the rear position looking back toward the platform frame. We have a long and successful history of testing, benchmarking, and deploying Apogee Instruments infrared radiometers in field experimentation. They are sensors relevant to the biological spectral window and return a fast-update, 0.2 °C accurate average surface temperature derived from what is (geometrically weighted) in their field of view.
Data gaps do exist beyond the -9999 null designations: there are some instances when the GPS signal was lost and, rarely, an HS GeoScout logger error. GPS information may be missing at the start of data recording; once the receiver supplies a signal, the values populate. Likewise, information may be missing at the end of a collection, where the GPS signal was lost but the sensors continued to record along with the data logger timestamping.
In the raw CS data, collections 1 through 7 are represented by only one table file, where the UTC from the GPS
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
This data set contains energy use data from 2009-2014 for 139 municipally operated buildings. Metrics include: Site & Source EUI, annual electricity, natural gas and district steam consumption, greenhouse gas emissions and energy cost. Weather-normalized data enable building performance comparisons over time, despite unusual weather events.
(1) qPCR Gene Expression Data

The THP-1 cell line was sub-cloned and one clone (#5) was selected for its ability to differentiate relatively homogeneously in response to phorbol 12-myristate-13-acetate (PMA) (Sigma). THP-1.5 was used for all subsequent experiments. THP-1.5 cells were cultured in RPMI, 10% FBS, Penicillin/Streptomycin, 10mM HEPES, 1mM Sodium Pyruvate, 50uM 2-Mercaptoethanol. THP-1.5 cells were treated with 30ng/ml PMA over a time-course of 96h. Total cell lysates were harvested in TRIzol reagent at 1, 2, 4, 6, 12, 24, 48, 72, and 96 hours, including an undifferentiated control. Undifferentiated cells were harvested in TRIzol reagent at the beginning of the LPS time-course. One biological replicate was prepared for each time point. Total RNA was purified from TRIzol lysates according to the manufacturer’s instructions.

Gene-specific primer pairs were designed using Primer3 software, with an optimal primer size of 20 bases, amplification size of 140bp, and annealing temperature of 60°C. Primer sequences were designed for 2,396 candidate genes including four potential controls: GAPDH, beta actin (ACTB), beta-2-microglobulin (B2M), and phosphoglycerate kinase 1 (PGK1). The RNA samples were reverse transcribed to produce cDNA and then subjected to quantitative PCR using SYBR Green (Molecular Probes) on the ABI Prism 7900HT system (Applied Biosystems, Foster City, CA, USA) with a 384-well amplification plate; genes for each sample were assayed in triplicate. Reactions were carried out in 20μL volumes in 384-well plates; each reaction contained: 0.5 U of HotStar Taq DNA polymerase (Qiagen) and the manufacturer’s 1× amplification buffer adjusted to a final concentration of 1mM MgCl2, 160μM dNTPs, 1/38000 SYBR Green I (Molecular Probes), 7% DMSO, 0.4% ROX Reference Dye (Invitrogen), 300 nM of each primer (forward and reverse), and 2μL of 40-fold diluted first-strand cDNA synthesis reaction mixture (12.5ng total RNA equivalent). Polymerase activation at 95ºC for 15 min was followed by 40 cycles of 15 s at 94ºC, 30 s at 60ºC, and 30 s at 72ºC. The dissociation curve analysis, which verifies that each PCR product is amplified from a single cDNA, was carried out in accordance with the manufacturer’s protocol. Expression levels were reported as Ct values.

The large number of genes assayed and the replicate measures required that samples be distributed across multiple amplification plates, with an average of twelve plates per sample. Because it was envisioned that GAPDH would serve as a single-gene normalization control, this gene was included on each plate. All primer pairs were replicated in triplicate. Raw qPCR expression measures were quantified using Applied Biosystems SDS software and reported as Ct values. The Ct value represents the number of cycles, or rounds of amplification, required for the fluorescence of a gene or primer pair to surpass an arbitrary threshold. The magnitude of the Ct value is inversely proportional to the expression level, so that a gene expressed at a high level will have a low Ct value and vice versa.

Replicate Ct values were combined by averaging, with additional quality-control constraints imposed by a standard filtering method developed by the RIKEN group for the preprocessing of their qPCR data. Briefly, this method entails:
1. Sort the triplicate Ct values in ascending order: Ct1, Ct2, Ct3. Calculate differences between consecutive Ct values: difference1 = Ct2 - Ct1 and difference2 = Ct3 - Ct2.
2. Four regions are defined (where Region4 overrides the other regions): Region1: difference ≤ 0.2; Region2: 0.2 < difference ≤ 1.0; Region3: 1.0 < difference; Region4: one of the Ct values in the difference calculation is 40. If difference1 and difference2 fall in the same region, the three replicate Ct values are averaged to give a final representative measure. If difference1 and difference2 are in different regions, the two replicate Ct values in the smaller-numbered region are averaged instead.
This particular filtering method is specific to the dataset used here and is not part of the normalization procedure itself; alternate filtering methods can be applied if appropriate prior to normalization. Moreover, while the presentation in this manuscript uses Ct values as an example, any measure of transcript abundance, including measures corrected for primer efficiency, can be used as input to our data-driven methods.
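A direct R transcription of this filter (a sketch for illustration; the RIKEN implementation may differ in details):

# Sketch of the RIKEN triplicate filter described above. Regions follow the
# text: R1 d<=0.2, R2 0.2<d<=1.0, R3 d>1.0, R4 if a Ct in the pair is 40.
filter_triplicate <- function(ct) {
  ct <- sort(ct)                       # Ct1 <= Ct2 <= Ct3
  region <- function(a, b) {
    if (a == 40 || b == 40) return(4)  # Region4 overrides the others
    d <- b - a
    if (d <= 0.2) 1 else if (d <= 1.0) 2 else 3
  }
  r1 <- region(ct[1], ct[2]); r2 <- region(ct[2], ct[3])
  if (r1 == r2) mean(ct)               # same region: average all three
  else if (r1 < r2) mean(ct[1:2])      # keep the pair in the lower region
  else mean(ct[2:3])
}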
(2) Quantile Normalization Algorithm

Quantile normalization proceeds in two stages. First, if samples are distributed across multiple plates, normalization is applied to all of the genes assayed for each sample to remove plate-to-plate effects by enforcing the same quantile distribution on each plate. Then, an overall quantile normalization is applied between samples, ensuring that each sample has the same distribution of expression values as all of the other samples to be compared. A similar approach using quantile normalization has been previously described in the context of microarray normalization. Briefly, our method entails the following steps:
i) qPCR data from a single RNA sample are stored in a matrix M of dimension k (maximum number of genes or primer pairs on a plate) rows by p (number of plates) columns. Plates with differing numbers of genes are made equivalent by padding plates with missing values to constrain M to a rectangular structure.
ii) Each column is sorted into ascending order and stored in matrix M’. The sorted columns correspond to the quantile distribution of each plate. The missing values are placed at the end of each ordered column. All calculations in quantile normalization are performed on non-missing values.
iii) The average quantile distribution is calculated by taking the average of each row in M’. Each column in M’ is replaced by this average quantile distribution and rearranged to have the same ordering as the original row order in M. This gives the within-sample normalized data from one RNA sample.
iv) Steps analogous to i)–iii) are repeated for each sample. Between-sample normalization is performed by storing the within-normalized data as a new matrix N of dimension k (total number of genes, in our example k = 2,396) rows by n (number of samples) columns. Steps ii) and iii) are then applied to this matrix.
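A compact R sketch of steps i)–iii), assuming an NA-padded k-by-p matrix M (our reading of the procedure, not the authors' code):

# Sketch of steps i)-iii): columns are plates (or samples), NA-padded.
quantile_normalize <- function(M) {
  ranks <- apply(M, 2, rank, ties.method = "first", na.last = "keep")
  sorted <- apply(M, 2, sort, na.last = TRUE)   # step ii): per-column quantiles
  avg <- rowMeans(sorted, na.rm = TRUE)         # step iii): average quantiles
  out <- M
  for (j in seq_len(ncol(M))) {
    ok <- !is.na(M[, j])
    out[ok, j] <- avg[ranks[ok, j]]             # restore original ordering
  }
  out
}

Step iv) is then the same function applied again, to the k-by-n matrix N of within-normalized samples.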
(3) Rank-Invariant Set Normalization Algorithm

We describe an extension of this method for use on qPCR data with any number of experimental conditions or samples, in which we identify a set of stably expressed genes from within the measured expression data and then use these to adjust expression between samples. Briefly:
i) qPCR data from all samples are stored in a matrix R of dimension g (total number of genes or primer pairs used for all plates) rows by s (total number of samples) columns.
ii) We first select gene sets that are rank-invariant across a single sample compared to a common reference. The reference may be chosen in a variety of ways, depending on the experimental design and aims of the experiment. As described in Tseng et al., the reference may be designated as a particular sample from the experiment (e.g. time zero in a time-course experiment), the average or median of all samples, or the sample closest to the average or median of all samples. Genes are considered rank-invariant if they retain their ordering, or rank, with respect to expression in the experimental sample versus the common reference sample. We collect sets of rank-invariant genes for all of the s pairwise comparisons relative to the common reference, and take the intersection of all s sets to obtain the final set of rank-invariant genes used for normalization.
iii) Let αj represent the average expression value of the rank-invariant genes in sample j. (α1, …, αs) then represents the vector of rank-invariant average expression values for all conditions 1 to s.
iv) We calculate the scale f
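The description truncates at the scale-factor step, so in the following R sketch the final adjustment (scaling each sample so its invariant-gene average matches the grand mean) is our assumption, as is the rank-movement tolerance used to define invariance:

# Hedged R sketch of rank-invariant set normalization.
rank_invariant_normalize <- function(R, ref = rowMeans(R), tol = 0.05) {
  g <- nrow(R)
  ref_rank <- rank(ref)
  invariant <- rep(TRUE, g)
  for (j in seq_len(ncol(R)))  # intersect invariant sets over all s samples
    invariant <- invariant & (abs(rank(R[, j]) - ref_rank) <= tol * g)
  alpha <- colMeans(R[invariant, , drop = FALSE])  # alpha_j, step iii)
  f <- mean(alpha) / alpha                         # assumed scale factors
  sweep(R, 2, f, "*")
}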
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This dataset includes intermediate data from RiboBase that generates translation efficiency (TE). The code to generate the files can be found at https://github.com/CenikLab/TE_model.
We uploaded demo HeLa .ribo files, but due to the large storage requirements of the full dataset, I recommend contacting Dr. Can Cenik directly to request access to the complete version of RiboBase if you need the original data.
The detailed explanation for each file:
human_flatten_ribo_clr.rda: ribosome profiling clr normalized data with GEO GSM ids in columns and genes in rows in human.
human_flatten_rna_clr.rda: matched RNA-seq clr normalized data with GEO GSM ids in columns and genes in rows in human.
human_flatten_te_clr.rda: TE clr data with GEO GSM ids in columns and genes in rows in human.
human_TE_cellline_all_plain.csv: TE clr data with genes in rows and cell lines in columns in human.
human_RNA_rho_new.rda: matched RNA-seq proportional similarity data as genes by genes matrix in human.
human_TE_rho.rda: TE proportional similarity data as genes by genes matrix in human.
mouse_flatten_ribo_clr.rda: ribosome profiling clr normalized data with GEO GSM ids in columns and genes in rows in mouse.
mouse_flatten_rna_clr.rda: matched RNA-seq clr normalized data with GEO GSM ids in columns and genes in rows in mouse.
mouse_flatten_te_clr.rda: TE clr data with GEO GSM ids in columns and genes in rows in mouse.
mouse_TE_cellline_all_plain.csv: TE clr data with genes in rows and cell lines in columns in mouse.
mouse_RNA_rho_new.rda: matched RNA-seq proportional similarity data as genes by genes matrix in mouse.
mouse_TE_rho.rda: TE proportional similarity data as genes by genes matrix in mouse.
All data passed quality control. There are 1054 human samples and 835 mouse samples meeting:
* coverage > 0.1 X
* CDS percentage > 70%
* R2 between RNA and RIBO >= 0.188 (remove outliers)
All ribosome profiling data here are non-deduplicated, winsorized data paired with deduplicated, non-winsorized RNA-seq data (although the file names say "flatten", that refers only to the naming format).
Code
If you need to read .rda data, use load("rdaname.rda") in R.
If you need to calculate proportional similarity from clr data:
library(propr)
# lr2rho() converts a clr-transformed matrix into a rho proportionality matrix
human_TE_homo_rho <- propr:::lr2rho(as.matrix(clr_data))
# label the square gene-by-gene output with the gene names from the input
rownames(human_TE_homo_rho) <- colnames(human_TE_homo_rho) <- rownames(clr_data)
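For context, the *_clr.rda matrices store centred log-ratio (clr) transformed values; a standard clr for a single sample vector looks like the following (RiboBase's exact pipeline, e.g. its pseudocount handling, may differ):

# Standard centred log-ratio (clr) transform of a count vector.
clr <- function(x, pseudo = 0.5) {
  lx <- log(x + pseudo)   # pseudocount guards against log(0)
  lx - mean(lx)
}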
MIT License: https://opensource.org/licenses/MIT
This dataset provides a simulated retail data warehouse designed using star schema modeling principles.
It includes both normalized and denormalized versions of a retail sales star schema, making it a valuable resource for data engineers, analysts, and data warehouse enthusiasts who want to explore real-world scenarios, performance tuning, and modeling strategies.
This dataset has two fact tables (a normalized and a denormalized version of the sales fact):
- fact_sales_normalized.csv – no columns from the dim_* tables are embedded; dimensions are referenced by foreign keys.
(Image: Normalized Retail Star Schema.)
However, the dim_* tables stay the same for both:
- Dim_Customers.csv
- Dim_Products.csv
- Dim_Stores.csv
- Dim_Dates.csv
- Dim_Salesperson
- Dim_Campaign
Explore how denormalization affects storage, redundancy, and performance
All data is synthetic and randomly generated via Python scripts that use the Polars library for data manipulation; no real customer or business data is included.
Ideal for use with tools like SQL engines, Redshift, BigQuery, Snowflake, or even DuckDB.
Shrinivas Vishnupurikar, Data Engineer @Velotio Technologies.
UniCourt’s PACER API provides you with a real-time interface and bulk access to the entire PACER database of civil and criminal federal court data from U.S. District Courts, Bankruptcy Courts, Courts of Appeal, and more.
Our PACER API fully integrates with PACER data so you can streamline pulling the court data you need to automate your internal workflows while saving money on outrageous fees.
Leave behind PACER’s outdated search tools for a modern case search with the precision you need.
Search Smarter and Curb Costs
• With UniCourt’s PACER API you can download the court data you need and lower your PACER costs by pulling data smarter.
• When you search for court cases using our API for PACER, your search results show (1) which cases are already available in UniCourt, (2) when they were added to our database and last updated, and (3) the UniCourt Case IDs for each case so you can easily pull any additional data you need.
• Don’t pay for PACER data when you don’t have to, and stop wasting time logging into PACER every day when there’s a smarter way to search.
Bulk Access to PACER Data and Documents
• Get the complete historical data set you need for criminal and civil PACER data seamlessly integrated with all your internal applications and client-facing solutions.
• Leverage UniCourt's extensive free repository of case metadata, docket entries, and court documents to get bulk API access to PACER data without breaking your budget.
• Get bulk court data from PACER that has been normalized with our artificial intelligence and enriched with other public data sets like attorney bar data, Secretary of State data, and judicial data.
Track PACER Litigation at Scale
• Combine the power of UniCourt’s PACER API with our Court Data API to track your litigation at scale.
• Automatically track PACER cases with ease and receive alerts when new docket updates are available so you never miss a federal court filing.
• Save money on outrageous PACER fees by leveraging the sophisticated algorithms we’ve developed to intelligently track court cases in bulk without incurring over-the-top fees.
Sichkar V. N. Effect of various dimension convolutional layer filters on traffic sign classification accuracy. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2019, vol. 19, no. 3, pp. 546–552. DOI: 10.17586/2226-1494-2019-19-3-546-552 (full text available at ResearchGate.net/profile/Valentyn_Sichkar)
Test online with custom Traffic Sign here: https://valentynsichkar.name/mnist.html
Design, Train & Test deep CNN for Image Classification. Join the course & enjoy new opportunities to get deep learning skills: https://www.udemy.com/course/convolutional-neural-networks-for-image-classification/
(Images: classification slideshow and concept map, hosted on the author's GitHub.)
This is ready-to-use preprocessed data saved into a pickle file.
Preprocessing stages are as follows:
- Normalizing the whole data by dividing by 255.0.
- Dividing the whole data into three datasets: train, validation, and test.
- Normalizing the whole data by subtracting the mean image and dividing by the standard deviation.
- Transposing every dataset to make channels come first.
The mean image and standard deviation were calculated from the train dataset and applied to all datasets. When using a user's image for classification, it has to be preprocessed first in the same way: normalized, mean-image subtracted, and divided by the standard deviation (see the sketch below).
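The mean/standard-deviation stage translates to very little code; a generic R illustration of the arithmetic (the dataset itself ships as a Python pickle, and the array sizes and names here are illustrative):

# Generic illustration of the mean-image normalization stage described above.
x_train <- array(runif(10 * 28 * 28, 0, 255), dim = c(10, 28, 28)) / 255
mean_image <- apply(x_train, c(2, 3), mean)      # per-pixel mean, train only
std_dev <- sd(as.vector(x_train))                # overall standard deviation
normalize <- function(x) sweep(x, c(2, 3), mean_image, "-") / std_dev
x_train_n <- normalize(x_train)                  # reuse mean/std for val/test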
Data is written as a dictionary with the following keys:
x_train: (59000, 1, 28, 28)
y_train: (59000,)
x_validation: (1000, 1, 28, 28)
y_validation: (1000,)
x_test: (1000, 1, 28, 28)
y_test: (1000,)
Contains pretrained weights model_params_ConvNet1.pickle for the model with the following architecture:
Input --> Conv --> ReLU --> Pool --> Affine --> ReLU --> Affine --> Softmax
Parameters: Pool is 2 and height = width = 2.
The architecture can also be understood as follows:
(Figure: Model 1 architecture for MNIST.)
Initial data is MNIST, collected by Yann LeCun, Corinna Cortes, and Christopher J.C. Burges.
Normalized Digital Surface Model - 1m resolution. The dataset contains the 1m Normalized Digital Surface Model for the District of Columbia. Some areas have limited data. The lidar dataset redaction was conducted under the guidance of the United States Secret Service. All data returns were removed from the dataset within the United States Secret Service redaction boundary except for classified ground points and classified water points.
This dataset was developed to model habitat suitability for two ungulate species on the island of Lanai. This includes raster data derived from WorldView-2 data to create a normalized difference vegetation index (NDVI). This index, in addition to other datasets, was used to develop habitat suitability models for Axis deer and mouflon sheep. Datasets and indices derived for use in modeling efforts, as well as suitability models, are included within this data release.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Data and supplementary information for the paper entitled "Adapting Phrase-based Machine Translation to Normalise Medical Terms in Social Media Messages" to be published at EMNLP 2015: Conference on Empirical Methods in Natural Language Processing — September 17–21, 2015 — Lisboa, Portugal.
ABSTRACT: Previous studies have shown that health reports in social media, such as DailyStrength and Twitter, have potential for monitoring health conditions (e.g. adverse drug reactions, infectious diseases) in particular communities. However, in order for a machine to understand and make inferences on these health conditions, the ability to recognise when laymen's terms refer to a particular medical concept (i.e. text normalisation) is required. To achieve this, we propose to adapt an existing phrase-based machine translation (MT) technique and a vector representation of words to map between a social media phrase and a medical concept. We evaluate our proposed approach using a collection of phrases from tweets related to adverse drug reactions. Our experimental results show that the combination of a phrase-based MT technique and the similarity between word vector representations outperforms the baselines that apply only either of them by up to 55%.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Normalization
# Generate a resting state (rs) timeseries (ts)
# Install / load package to make fake fMRI ts
# install.packages("neuRosim")
library(neuRosim)
# Generate a ts
ts.rs <- simTSrestingstate(nscan=2000, TR=1, SNR=1)
# 3dDetrend -normalize
# R command version for 3dDetrend -normalize -polort 0 which normalizes by making "the sum-of-squares equal to 1"
# Do for the full timeseries
ts.normalised.long <- (ts.rs-mean(ts.rs))/sqrt(sum((ts.rs-mean(ts.rs))^2));
# Do this again for a shorter version of the same timeseries
ts.shorter.length <- length(ts.normalised.long)/4
ts.normalised.short <- (ts.rs[1:ts.shorter.length]- mean(ts.rs[1:ts.shorter.length]))/sqrt(sum((ts.rs[1:ts.shorter.length]- mean(ts.rs[1:ts.shorter.length]))^2));
# By looking at the summaries, it can be seen that the median values become larger
summary(ts.normalised.long)
summary(ts.normalised.short)
# Plot results for the long and short ts
# Truncate the longer ts for plotting only
ts.normalised.long.made.shorter <- ts.normalised.long[1:ts.shorter.length]
# Give the plot a title
title <- "3dDetrend -normalize for long (blue) and short (red) timeseries";
plot(x=0, y=0, main=title, xlab="", ylab="", xaxs='i', xlim=c(1,length(ts.normalised.short)), ylim=c(min(ts.normalised.short),max(ts.normalised.short)));
# Add zero line
lines(x=c(-1,ts.shorter.length), y=rep(0,2), col='grey');
# 3dDetrend -normalize -polort 0 for long timeseries
lines(ts.normalised.long.made.shorter, col='blue');
# 3dDetrend -normalize -polort 0 for short timeseries
lines(ts.normalised.short, col='red');
Standardization/modernization
New afni_proc.py command line
afni_proc.py \
-subj_id "$sub_id_name_1" \
-blocks despike tshift align tlrc volreg mask blur scale regress \
-radial_correlate_blocks tcat volreg \
-copy_anat anatomical_warped/anatSS.1.nii.gz \
-anat_has_skull no \
-anat_follower anat_w_skull anat anatomical_warped/anatU.1.nii.gz \
-anat_follower_ROI aaseg anat freesurfer/SUMA/aparc.a2009s+aseg.nii.gz \
-anat_follower_ROI aeseg epi freesurfer/SUMA/aparc.a2009s+aseg.nii.gz \
-anat_follower_ROI fsvent epi freesurfer/SUMA/fs_ap_latvent.nii.gz \
-anat_follower_ROI fswm epi freesurfer/SUMA/fs_ap_wm.nii.gz \
-anat_follower_ROI fsgm epi freesurfer/SUMA/fs_ap_gm.nii.gz \
-anat_follower_erode fsvent fswm \
-dsets media_?.nii.gz \
-tcat_remove_first_trs 8 \
-tshift_opts_ts -tpattern alt+z2 \
-align_opts_aea -cost lpc+ZZ -giant_move -check_flip \
-tlrc_base "$basedset" \
-tlrc_NL_warp \
-tlrc_NL_warped_dsets \
anatomical_warped/anatQQ.1.nii.gz \
anatomical_warped/anatQQ.1.aff12.1D \
anatomical_warped/anatQQ.1_WARP.nii.gz \
-volreg_align_to MIN_OUTLIER \
-volreg_post_vr_allin yes \
-volreg_pvra_base_index MIN_OUTLIER \
-volreg_align_e2a \
-volreg_tlrc_warp \
-mask_opts_automask -clfrac 0.10 \
-mask_epi_anat yes \
-blur_to_fwhm -blur_size $blur \
-regress_motion_per_run \
-regress_ROI_PC fsvent 3 \
-regress_ROI_PC_per_run fsvent \
-regress_make_corr_vols aeseg fsvent \
-regress_anaticor_fast \
-regress_anaticor_label fswm \
-regress_censor_motion 0.3 \
-regress_censor_outliers 0.1 \
-regress_apply_mot_types demean deriv \
-regress_est_blur_epits \
-regress_est_blur_errts \
-regress_run_clustsim no \
-regress_polort 2 \
-regress_bandpass 0.01 1 \
-html_review_style pythonic
We used similar command lines to generate the ‘blurred and not censored’ and the ‘not blurred and not censored’ timeseries files (described more fully below). We will provide the code used to make all derivative files on our GitHub site (https://github.com/lab-lab/nndb). We made one choice above that is different enough from our original pipeline that it is worth mentioning here. Specifically, we have quite long runs, with the average being ~40 minutes, but this number can be variable (thus leading to the above issue with 3dDetrend's -normalize). A discussion on the AFNI message board with one of our team (starting here: https://afni.nimh.nih.gov/afni/community/board/read.php?1,165243,165256#msg-165256) led to the suggestion that '-regress_polort 2' with '-regress_bandpass 0.01 1' be used for long runs. We had previously used only a variable polort with the suggested 1 + int(D/150) approach. Our new polort 2 + bandpass approach has the added benefit of working well with afni_proc.py.
Which timeseries file you use is up to you, but I have been encouraged by Rick and Paul to include a sort of PSA about this. In Paul’s own words:
* Blurred data should not be used for ROI-based analyses (and potentially not for ICA? I am not certain about standard practice).
* Unblurred data for ISC might be pretty noisy for voxelwise analyses, since blurring should effectively boost the SNR of active regions (and even good alignment won't be perfect everywhere).
* For uncensored data, one should be concerned about motion effects being left in the data (e.g., spikes in the data).
* For censored data:
  * Performing ISC requires the users to unionize the censoring patterns during the correlation calculation.
  * If wanting to calculate power spectra or spectral parameters like ALFF/fALFF/RSFA etc. (which some people might do for naturalistic tasks still), then standard FT-based methods can't be used because sampling is no longer uniform. Instead, people could use something like 3dLombScargle+3dAmpToRSFC, which calculates power spectra (and RSFC params) based on a generalization of the FT that can handle non-uniform sampling, as long as the censoring pattern is mostly random and, say, only up to about 10-15% of the data.
In sum, think very carefully about which files you use. If you find you need a file we have not provided, we can happily generate different versions of the timeseries upon request and can generally do so in a week or less.
Effect on results
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
Synthetic end-use-specific electric household load profiles for four weather years in 29 European countries. For 2011, the profiles are normalized to an annual sum of 1000 to enable the user to scale them to a preferred annual consumption. For other years, a value other than 1000 is caused by the influence of weather and indicates a higher or lower consumption (e.g. higher consumption for space heating in cold years). For user convenience, we provide heat pump COPs and annual consumption values.
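Scaling works as the description implies; for example, in R (the flat profile and the 3500 kWh target are illustrative):

# A 2011 profile sums to 1000, so scaling by target_kwh / 1000 yields a
# load profile for a household consuming target_kwh per year.
profile <- rep(1000 / 8760, 8760)   # illustrative flat hourly profile
target_kwh <- 3500                  # desired annual consumption in kWh
scaled_profile <- profile * target_kwh / 1000
sum(scaled_profile)                 # 3500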
--This data has been withdrawn by the author.-- The corrected dataset can be found here: https://doi.org/10.20387/bonares-2657-1NP3
The data has been withdrawn for the following reason: "The published R-script was revised in the course of an external review process." The data is no longer available for free reuse and will only be released by the data centre if there is a reasonable interest.
Summary

Scaling with ranked subsampling (SRS) is an algorithm for the normalization of species count data in ecology. So far, SRS has successfully been applied to microbial community data. Normalization by SRS reduces the number of reads in each sample in such a way that (i) the total count equals Cmin, (ii) each removed OTU is no more abundant than any preserved OTU, and (iii) the relative frequencies of OTUs remaining in the sample after normalization are as close as possible to the frequencies in the original sample.

The algorithm consists of two steps. In the first step, the counts for all OTUs are divided by a scaling factor chosen in such a way that the sum of the scaled counts (Cscaled, with integer or non-integer values) equals Cmin. The relative frequencies of all OTUs remain unchanged. In the second step, the non-integer count values are converted into integers by an algorithm that we dub ranked subsampling. The scaled count Cscaled for each OTU is split into the integer part Cint, obtained by truncating the digits after the decimal separator (Cint = floor(Cscaled)), and the fractional part Cfrac (Cfrac = Cscaled - Cint). Since ΣCint ≤ Cmin, an additional ∆C = Cmin - ΣCint reads have to be added to the library to reach the total count of Cmin. This is achieved as follows. OTUs are ranked in descending order of their Cfrac values, which lie in the open interval (0, 1). Beginning with the OTU of the highest rank, a single count per OTU is added to the normalized library until the total number of added counts reaches ∆C and the sum of all counts in the normalized library equals Cmin.

For example, if ∆C = 5 and the seven top Cfrac values are 0.96, 0.96, 0.88, 0.55, 0.55, 0.55, and 0.55, the following counts are added: a single count for each OTU with Cfrac of 0.96; a single count for the OTU with Cfrac of 0.88; and a single count each for two OTUs among those with Cfrac of 0.55. When the lowest Cfrac involved in picking ∆C counts is shared by several OTUs, the OTUs used for adding a single count to the library are selected in the order of their Cint values. This selection minimizes the effect of normalization on the relative frequencies of OTUs. OTUs with identical Cfrac as well as Cint are sampled randomly without replacement.
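A direct R transcription of the two steps for a single sample (a sketch; the published SRS implementation should be preferred for real use):

# Sketch of SRS for one sample vector of OTU counts, following the steps above.
srs_sample <- function(counts, cmin) {
  scaled <- counts * cmin / sum(counts)   # step 1: scaled counts sum to cmin
  cint  <- floor(scaled)                  # integer parts
  cfrac <- scaled - cint                  # fractional parts
  delta <- round(cmin - sum(cint))        # counts still to be added
  if (delta > 0) {
    # Rank by descending cfrac, break ties by descending cint; remaining
    # ties are resolved randomly without replacement, as described above.
    ord <- order(-cfrac, -cint, sample(seq_along(counts)))
    cint[ord[seq_len(delta)]] <- cint[ord[seq_len(delta)]] + 1
  }
  cint
}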