91 datasets found
  1. MIMIC-IV Lab Events Subset - Preprocessed for Data Normalization...

    • zenodo.org
    bin, text/x-python +1
    Updated Jan 13, 2025
    Cite
    Ali Azadi (2025). MIMIC-IV Lab Events Subset - Preprocessed for Data Normalization Analysis.xlsx [Dataset]. http://doi.org/10.5281/zenodo.14641824
    Explore at:
    Available download formats: txt, bin, text/x-python
    Dataset updated
    Jan 13, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ali Azadi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This file contains a preprocessed subset of the MIMIC-IV dataset (Medical Information Mart for Intensive Care, Version IV), specifically focusing on laboratory event data related to glucose levels. It has been curated and processed for research on data normalization and integration within Clinical Decision Support Systems (CDSS) to improve Human-Computer Interaction (HCI) elements.

    The dataset includes the following key features:

    • Raw Lab Data: Original values of glucose levels as recorded in the clinical setting.
    • Normalized Data: Glucose levels transformed into a standardized range for comparison and analysis.
    • Demographic Information: Includes patient age and gender to support subgroup analyses.

    This data has been used to analyze the impact of normalization and integration techniques on improving data accuracy and usability in CDSS environments. The file is provided as part of ongoing research on enhancing clinical decision-making and user interaction in healthcare systems.

    Key Applications:

    • Research on the effects of data normalization on clinical outcomes.
    • Study of demographic variations in laboratory values to support personalized healthcare.
    • Exploration of data integration and its role in reducing cognitive load in CDSS.

    Data Source:

    The data originates from the publicly available MIMIC-IV database, developed and maintained by the Massachusetts Institute of Technology (MIT). Proper ethical guidelines for accessing and preprocessing the dataset have been followed.

    File Content:

    • Filename: MIMIC-IV_LabEvents_Subset_Normalization.xlsx
    • File Format: Microsoft Excel
    • Number of Rows: 100 samples for demonstration purposes.
    • Fields Included: Patient ID, Age, Gender, Raw Glucose Value, Normalized Glucose Value, and additional derived statistics.
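
    The listing does not state which normalization was applied to produce the Normalized Glucose Value column; below is a minimal sketch that assumes a simple min-max scaling and assumes the Excel column headers match the field names listed above (both are assumptions, not documented facts).

    ```
    import pandas as pd

    # Assumed column headers, mirroring the "Fields Included" list above.
    df = pd.read_excel("MIMIC-IV_LabEvents_Subset_Normalization.xlsx")

    # Min-max rescale the raw glucose values into [0, 1].
    lo, hi = df["Raw Glucose Value"].min(), df["Raw Glucose Value"].max()
    df["Glucose (min-max rescaled)"] = (df["Raw Glucose Value"] - lo) / (hi - lo)

    # Example subgroup view for the demographic fields mentioned in the description.
    print(df.groupby("Gender")["Glucose (min-max rescaled)"].describe())
    ```
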
  2. Hospital Management System

    • kaggle.com
    zip
    Updated Jun 9, 2025
    Cite
    Muhammad Shamoon Butt (2025). Hospital Management System [Dataset]. https://www.kaggle.com/mshamoonbutt/hospital-management-system
    Explore at:
    Available download formats: zip (1,049,391 bytes)
    Dataset updated
    Jun 9, 2025
    Authors
    Muhammad Shamoon Butt
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This Hospital Management System project features a fully normalized relational database designed to manage hospital data including patients, doctors, appointments, diagnoses, medications, and billing. The schema applies database normalization (1NF, 2NF, 3NF) to reduce redundancy and maintain data integrity, providing an efficient, scalable structure for healthcare data management. Included are SQL scripts to create tables and insert sample data, making it a useful resource for learning practical database design and normalization in a healthcare context.
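
    As a minimal sketch of the kind of normalized (3NF) schema such a project describes, here is a small example using Python's sqlite3; the table and column names are illustrative and are not taken from the project's actual SQL scripts.

    ```
    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Patients and doctors live in their own tables; appointments reference both by
    # foreign key instead of repeating names, which is the redundancy 2NF/3NF removes.
    cur.executescript("""
    CREATE TABLE patients (
        patient_id INTEGER PRIMARY KEY,
        full_name  TEXT NOT NULL,
        birth_date TEXT
    );
    CREATE TABLE doctors (
        doctor_id INTEGER PRIMARY KEY,
        full_name TEXT NOT NULL,
        specialty TEXT
    );
    CREATE TABLE appointments (
        appointment_id INTEGER PRIMARY KEY,
        patient_id   INTEGER NOT NULL REFERENCES patients(patient_id),
        doctor_id    INTEGER NOT NULL REFERENCES doctors(doctor_id),
        scheduled_at TEXT NOT NULL
    );
    """)

    cur.execute("INSERT INTO patients VALUES (1, 'Jane Doe', '1980-02-17')")
    cur.execute("INSERT INTO doctors VALUES (1, 'Dr. Smith', 'Cardiology')")
    cur.execute("INSERT INTO appointments VALUES (1, 1, 1, '2025-06-09 10:00')")

    # Reassemble the human-readable view with joins rather than duplicated columns.
    for row in cur.execute("""
        SELECT a.scheduled_at, p.full_name, d.full_name
        FROM appointments a
        JOIN patients p ON p.patient_id = a.patient_id
        JOIN doctors d  ON d.doctor_id  = a.doctor_id
    """):
        print(row)
    ```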

  3. File S1 - Normalization of RNA-Sequencing Data from Samples with Varying...

    • datasetcatalog.nlm.nih.gov
    Updated Feb 25, 2014
    Cite
    Collas, Philippe; Rognes, Torbjørn; Aanes, Håvard; Winata, Cecilia; Moen, Lars F.; Aleström, Peter; Østrup, Olga; Mathavan, Sinnakaruppan (2014). File S1 - Normalization of RNA-Sequencing Data from Samples with Varying mRNA Levels [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001266682
    Explore at:
    Dataset updated
    Feb 25, 2014
    Authors
    Collas, Philippe; Rognes, Torbjørn; Aanes, Håvard; Winata, Cecilia; Moen, Lars F.; Aleström, Peter; Østrup, Olga; Mathavan, Sinnakaruppan
    Description

    Table S1 and Figures S1–S6. Table S1. List of primers. Forward and reverse primers used for qPCR. Figure S1. Changes in total and polyA+ RNA during development. a) Amount of total RNA per embryo at different developmental stages. b) Amount of polyA+ RNA per 100 embryos at different developmental stages. Vertical bars represent standard errors. Figure S2. The TMM scaling factor. a) The TMM scaling factor estimated using dataset 1 and 2. We observe very similar values. b) The TMM scaling factor obtained using the replicates in dataset 2. The TMM values are very reproducible. c) The TMM scale factor when RNA-seq data based on total RNA was used. Figure S3. Comparison of scales. We either square-root transformed or used that scales directly and compared the normalized fold-changes to RT-qPCR results. a) Transcripts with dynamic change pre-ZGA. b) Transcripts with decreased abundance post-ZGA. c) Transcripts with increased expression post-ZGA. Vertical bars represent standard deviations. Figure S4. Comparison of RT-qPCR results depending on RNA template (total or poly+ RNA) and primers (random or oligo(dT) primers) for setd3 (a), gtf2e2 (b) and yy1a (c). The increase pre-ZGA is dependent on template (setd3 and gtf2e2) and not primer type. Figure S5. Efficiency calibrated fold-changes for a subset of transcripts. Vertical bars represent standard deviations. Figure S6. Comparison normalization methods using dataset 2 for transcripts with decreased expression post-ZGA (a) and increased expression post-ZGA (b). Vertical bars represent standard deviations. (PDF)

  4. Data_Sheet_1_SMIXnorm: Fast and Accurate RNA-Seq Data Normalization for...

    • frontiersin.figshare.com
    pdf
    Updated Jun 11, 2023
    Cite
    Shen Yin; Xiaowei Zhan; Bo Yao; Guanghua Xiao; Xinlei Wang; Yang Xie (2023). Data_Sheet_1_SMIXnorm: Fast and Accurate RNA-Seq Data Normalization for Formalin-Fixed Paraffin-Embedded Samples.PDF [Dataset]. http://doi.org/10.3389/fgene.2021.650795.s001
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 11, 2023
    Dataset provided by
    Frontiers
    Authors
    Shen Yin; Xiaowei Zhan; Bo Yao; Guanghua Xiao; Xinlei Wang; Yang Xie
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    RNA-sequencing (RNA-seq) provides a comprehensive quantification of transcriptomic activities in biological samples. Formalin-Fixed Paraffin-Embedded (FFPE) samples are collected as part of routine clinical procedure, and are the most widely available biological sample format in medical research and patient care. Normalization is an essential step in RNA-seq data analysis. A number of normalization methods, though developed for RNA-seq data from fresh frozen (FF) samples, can be used with FFPE samples as well. MIXnorm, the only extant normalization method specifically designed for FFPE RNA-seq data, has been shown to outperform these methods, but at the cost of a complex mixture model and a high computational burden. It is therefore important to adapt MIXnorm for simplicity and computational efficiency while maintaining superior performance. Furthermore, it is critical to develop an integrated tool that performs commonly used normalization methods for both FF and FFPE RNA-seq data. We developed a new normalization method for FFPE RNA-seq data, named SMIXnorm, based on a simplified two-component mixture model compared to MIXnorm to facilitate computation. The expression levels of expressed genes are modeled by normal distributions without truncation, and those of non-expressed genes are modeled by zero-inflated Poisson distributions. The maximum likelihood estimates of the model parameters are obtained by a nested Expectation-Maximization algorithm with a less complicated latent variable structure, and closed-form updates are available within each iteration. Real data applications and simulation studies show that SMIXnorm greatly reduces computing time compared to MIXnorm, without sacrificing performance. More importantly, we developed a web-based tool, RNA-seq Normalization (RSeqNorm), that offers a simple workflow to compute normalized RNA-seq data for both FFPE and FF samples. It includes SMIXnorm and MIXnorm for FFPE RNA-seq data, together with five commonly used normalization methods for FF RNA-seq data. Users can easily upload a raw RNA-seq count matrix and select one of the seven normalization methods to produce a downloadable normalized expression matrix for any downstream analysis. The R package is available at https://github.com/S-YIN/RSEQNORM. The web-based tool, RSeqNorm, is available at http://lce.biohpc.swmed.edu/rseqnorm with no restriction to use or redistribute.

  5. Data from: A systematic evaluation of normalization methods and probe...

    • nde-dev.biothings.io
    zip
    Updated May 30, 2023
    Cite
    H. Welsh; C. M. P. F. Batalha; W. Li; K. L. Mpye; N. C. Souza-Pinto; M. S. Naslavsky; E. J. Parra (2023). A systematic evaluation of normalization methods and probe replicability using infinium EPIC methylation data [Dataset]. http://doi.org/10.5061/dryad.cnp5hqc7v
    Explore at:
    Available download formats: zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    Universidade de São Paulo
    Hospital for Sick Children
    University of Toronto
    Authors
    H. Welsh; C. M. P. F. Batalha; W. Li; K. L. Mpye; N. C. Souza-Pinto; M. S. Naslavsky; E. J. Parra
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    Background The Infinium EPIC array measures the methylation status of > 850,000 CpG sites. The EPIC BeadChip uses a two-array design: Infinium Type I and Type II probes. These probe types exhibit different technical characteristics which may confound analyses. Numerous normalization and pre-processing methods have been developed to reduce probe type bias as well as other issues such as background and dye bias.
    Methods This study evaluates the performance of various normalization methods using 16 replicated samples and three metrics: absolute beta-value difference, overlap of non-replicated CpGs between replicate pairs, and effect on beta-value distributions. Additionally, we carried out Pearson’s correlation and intraclass correlation coefficient (ICC) analyses using both raw and SeSAMe 2 normalized data.
    Results The method we define as SeSAMe 2, which consists of applying the regular SeSAMe pipeline with an additional round of QC (pOOBAH masking), was found to be the best-performing normalization method, while quantile-based methods were the worst performing. Whole-array Pearson’s correlations were found to be high. However, in agreement with previous studies, a substantial proportion of the probes on the EPIC array showed poor reproducibility (ICC < 0.50). The majority of poor-performing probes have beta values close to either 0 or 1, and relatively low standard deviations. These results suggest that probe reliability is largely the result of limited biological variation rather than technical measurement variation. Importantly, normalizing the data with SeSAMe 2 dramatically improved ICC estimates, with the proportion of probes with ICC values > 0.50 increasing from 45.18% (raw data) to 61.35% (SeSAMe 2).

    Methods

    Study Participants and Samples

    The whole blood samples were obtained from the Health, Well-being and Aging (Saúde, Bem-Estar e Envelhecimento, SABE) study cohort. SABE is a cohort of census-withdrawn elderly from the city of São Paulo, Brazil, followed up every five years since the year 2000, with DNA first collected in 2010. Samples from 24 elderly adults were collected at two time points for a total of 48 samples. The first time point is the 2010 collection wave, performed from 2010 to 2012, and the second time point was set in 2020 in a COVID-19 monitoring project (9±0.71 years apart). The 24 individuals were 67.41±5.52 years of age (mean ± standard deviation) at time point one and 76.41±6.17 at time point two, and comprised 13 men and 11 women.

    All individuals enrolled in the SABE cohort provided written consent, and the ethics protocols were approved by local and national institutional review boards (COEP/FSP/USP OF.COEP/23/10, CONEP 2044/2014, CEP HIAE 1263-10, University of Toronto RIS 39685).

    Blood Collection and Processing

    Genomic DNA was extracted from whole peripheral blood samples collected in EDTA tubes. DNA extraction and purification followed manufacturer’s recommended protocols, using Qiagen AutoPure LS kit with Gentra automated extraction (first time point) or manual extraction (second time point), due to discontinuation of the equipment but using the same commercial reagents. DNA was quantified using Nanodrop spectrometer and diluted to 50ng/uL. To assess the reproducibility of the EPIC array, we also obtained technical replicates for 16 out of the 48 samples, for a total of 64 samples submitted for further analyses. Whole Genome Sequencing data is also available for the samples described above.

    Characterization of DNA Methylation using the EPIC array

    Approximately 1,000ng of human genomic DNA was used for bisulphite conversion. Methylation status was evaluated using the MethylationEPIC array at The Centre for Applied Genomics (TCAG, Hospital for Sick Children, Toronto, Ontario, Canada), following protocols recommended by Illumina (San Diego, California, USA).

    Processing and Analysis of DNA Methylation Data

    The R/Bioconductor packages Meffil (version 1.1.0), RnBeads (version 2.6.0), minfi (version 1.34.0) and wateRmelon (version 1.32.0) were used to import, process and perform quality control (QC) analyses on the methylation data. Starting with the 64 samples, we first used Meffil to infer the sex of the 64 samples and compared the inferred sex to reported sex. Utilizing the 59 SNP probes that are available as part of the EPIC array, we calculated concordance between the methylation intensities of the samples and the corresponding genotype calls extracted from their WGS data. We then performed comprehensive sample-level and probe-level QC using the RnBeads QC pipeline. Specifically, we (1) removed probes if their target sequences overlap with a SNP at any base, (2) removed known cross-reactive probes, (3) used the iterative Greedycut algorithm to filter out samples and probes, using a detection p-value threshold of 0.01, and (4) removed probes for which more than 5% of the samples had a missing value. Since RnBeads does not have a function to perform probe filtering based on bead number, we used the wateRmelon package to extract bead numbers from the IDAT files and calculated the proportion of samples with bead number < 3. Probes with more than 5% of samples having a low bead number (< 3) were removed. For the comparison of normalization methods, we also computed detection p-values using the out-of-band probes' empirical distribution with the pOOBAH() function in the SeSAMe (version 1.14.2) R package, with a p-value threshold of 0.05 and the combine.neg parameter set to TRUE. In the scenario where pOOBAH filtering was carried out, it was done in parallel with the previously mentioned QC steps, and the resulting probes flagged in both analyses were combined and removed from the data.

    Normalization Methods Evaluated

    The normalization methods compared in this study were implemented using different R/Bioconductor packages and are summarized in Figure 1. All data was read into the R workspace as RG Channel Sets using minfi’s read.metharray.exp() function. One sample that was flagged during QC was removed, and further normalization steps were carried out in the remaining set of 63 samples. Prior to all normalizations with minfi, probes that did not pass QC were removed. Noob, SWAN, Quantile, Funnorm and Illumina normalizations were implemented using minfi. BMIQ normalization was implemented with ChAMP (version 2.26.0), using as input Raw data produced by minfi’s preprocessRaw() function. In the combination of Noob with BMIQ (Noob+BMIQ), BMIQ normalization was carried out using as input minfi’s Noob normalized data. Noob normalization was also implemented with SeSAMe, using a nonlinear dye bias correction. For SeSAMe normalization, two scenarios were tested. For both, the inputs were unmasked SigDF Sets converted from minfi’s RG Channel Sets. In the first, which we call “SeSAMe 1”, SeSAMe’s pOOBAH masking was not executed, and the only probes filtered out of the dataset prior to normalization were the ones that did not pass QC in the previous analyses. In the second scenario, which we call “SeSAMe 2”, pOOBAH masking was carried out in the unfiltered dataset, and masked probes were removed. This removal was followed by further removal of probes that did not pass previous QC, and that had not been removed by pOOBAH. Therefore, SeSAMe 2 has two rounds of probe removal. Noob normalization with nonlinear dye bias correction was then carried out in the filtered dataset. Methods were then compared by subsetting the 16 replicated samples and evaluating the effects that the different normalization methods had on the absolute difference of beta values (|β|) between replicated samples.
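
    To make the replicate-pair metrics concrete, here is a minimal Python sketch (the study itself used R/Bioconductor) of the per-probe absolute beta-value difference and a one-way ICC across replicate pairs; the simulated beta values are invented.

    ```
    import numpy as np

    def abs_beta_difference(beta_rep1, beta_rep2):
        """Per-probe mean absolute beta difference across replicate pairs.

        Inputs are arrays of shape (n_probes, n_pairs) containing beta values.
        """
        return np.abs(beta_rep1 - beta_rep2).mean(axis=1)

    def icc_oneway(beta_rep1, beta_rep2):
        """One-way ICC(1,1) per probe for paired technical replicates (k = 2)."""
        x = np.stack([beta_rep1, beta_rep2], axis=2)        # (n_probes, n_pairs, 2)
        n, k = x.shape[1], 2
        pair_mean = x.mean(axis=2)                          # (n_probes, n_pairs)
        grand_mean = pair_mean.mean(axis=1, keepdims=True)
        msb = k * ((pair_mean - grand_mean) ** 2).sum(axis=1) / (n - 1)
        msw = ((x - pair_mean[..., None]) ** 2).sum(axis=(1, 2)) / (n * (k - 1))
        return (msb - msw) / (msb + (k - 1) * msw)

    # Toy example: 5 probes measured on 16 replicate pairs, beta values in [0, 1]
    rng = np.random.default_rng(1)
    truth = rng.uniform(0, 1, size=(5, 16))
    rep1 = np.clip(truth + rng.normal(0, 0.02, truth.shape), 0, 1)
    rep2 = np.clip(truth + rng.normal(0, 0.02, truth.shape), 0, 1)
    print(abs_beta_difference(rep1, rep2))
    print(icc_oneway(rep1, rep2))   # values near 1 indicate reproducible probes
    ```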

  6. Methods for normalizing microbiome data: an ecological perspective

    • nde-dev.biothings.io
    zip
    Updated Oct 30, 2018
    Cite
    Donald T. McKnight; Roger Huerlimann; Deborah S. Bower; Lin Schwarzkopf; Ross A. Alford; Kyall R. Zenger (2018). Methods for normalizing microbiome data: an ecological perspective [Dataset]. http://doi.org/10.5061/dryad.tn8qs35
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 30, 2018
    Dataset provided by
    James Cook University
    University of New England
    Authors
    Donald T. McKnight; Roger Huerlimann; Deborah S. Bower; Lin Schwarzkopf; Ross A. Alford; Kyall R. Zenger
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description
    1. Microbiome sequencing data often need to be normalized due to differences in read depths, and recommendations for microbiome analyses generally warn against using proportions or rarefying to normalize data, instead advocating alternatives such as upper quartile, CSS, edgeR-TMM, or DESeq-VS. Those recommendations are, however, based on studies that focused on differential abundance testing and variance standardization rather than community-level comparisons (i.e., beta diversity). Also, standardizing the within-sample variance across samples may suppress differences in species evenness, potentially distorting community-level patterns. Furthermore, the recommended methods use log transformations, which we expect to exaggerate the importance of differences among rare OTUs while suppressing the importance of differences among common OTUs.
    2. We tested these theoretical predictions via simulations and a real-world data set.
    3. Proportions and rarefying produced more accurate comparisons among communities and were the only methods that fully normalized read depths across samples. Additionally, upper quartile, CSS, edgeR-TMM, and DESeq-VS often masked differences among communities when common OTUs differed, and they produced false positives when rare OTUs differed.
    4. Based on our simulations, normalizing via proportions may be superior to other commonly used methods for comparing ecological communities.
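
    As an illustration of the two approaches the authors favour for community-level comparisons, here is a minimal sketch of total-sum scaling (proportions) and rarefying (random subsampling without replacement to an even depth) on an invented OTU table.

    ```
    import numpy as np

    rng = np.random.default_rng(42)

    def to_proportions(counts):
        """Total-sum scaling: divide each sample's OTU counts by its read depth."""
        counts = np.asarray(counts, dtype=float)
        return counts / counts.sum(axis=1, keepdims=True)

    def rarefy(counts, depth, rng=rng):
        """Subsample each sample without replacement down to a common read depth."""
        counts = np.asarray(counts, dtype=int)
        out = np.zeros_like(counts)
        for i, row in enumerate(counts):
            reads = np.repeat(np.arange(row.size), row)      # one entry per read, labelled by OTU
            keep = rng.choice(reads, size=depth, replace=False)
            out[i] = np.bincount(keep, minlength=row.size)
        return out

    # Toy OTU table: 3 samples x 4 OTUs with unequal read depths
    otu = np.array([[120, 30, 40, 10],
                    [600, 150, 200, 50],
                    [ 60, 15, 20,  5]])
    print(to_proportions(otu))
    print(rarefy(otu, depth=100))
    ```
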
  7. GSE58095 Data Normalization Subtype Analysis R

    • kaggle.com
    zip
    Updated Nov 29, 2025
    Cite
    Dr. Nagendra (2025). GSE58095 Data Normalization Subtype Analysis R [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/gse58095-data-normalization-subtype-analysis-r
    Explore at:
    Available download formats: zip (26,134,446 bytes)
    Dataset updated
    Nov 29, 2025
    Authors
    Dr. Nagendra
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset contains processed and normalized gene expression data from the public GEO series GSE58095.

    The dataset is prepared to support downstream analyses such as subtype classification, differential expression, and exploratory visualization.

    The content includes R scripts and processed matrices that guide users through normalization, quality control, and biological interpretation steps.

    Gene expression data were aligned, filtered, quality-checked, and normalized using widely accepted bioinformatics pipelines.

    The dataset aids researchers working in cancer genomics, transcriptomics, and molecular subtype discovery.

    Included analyses demonstrate how to classify samples into biologically meaningful subtypes using clustering and statistical approaches.

    The workflow supports reproducible research with clear steps for importing raw data, preprocessing, normalization, and generating subtype assignments.

    The dataset is intended for educational, research, and benchmarking purposes within computational biology and bioinformatics.

    All scripts are written in R for transparency and adaptability to various research workflows.

  8. The time-series gene expression data in PMA stimulated THP-1

    • datamed.org
    + more versions
    Cite
    The time-series gene expression data in PMA stimulated THP-1 [Dataset]. https://datamed.org/display-item.php?repository=0044&idName=ID&id=5841d9165152c649505fbb31
    Explore at:
    Description

    (1) qPCR Gene Expression Data
    The THP-1 cell line was sub-cloned and one clone (#5) was selected for its ability to differentiate relatively homogeneously in response to phorbol 12-myristate-13-acetate (PMA) (Sigma). THP-1.5 was used for all subsequent experiments. THP-1.5 cells were cultured in RPMI, 10% FBS, Penicillin/Streptomycin, 10mM HEPES, 1mM Sodium Pyruvate, 50uM 2-Mercaptoethanol. THP-1.5 cells were treated with 30ng/ml PMA over a time-course of 96h. Total cell lysates were harvested in TRIzol reagent at 1, 2, 4, 6, 12, 24, 48, 72 and 96 hours, including an undifferentiated control. Undifferentiated cells were harvested in TRIzol reagent at the beginning of the LPS time-course. One biological replicate was prepared for each time point. Total RNA was purified from TRIzol lysates according to the manufacturer’s instructions. Gene-specific primer pairs were designed using Primer3 software, with an optimal primer size of 20 bases, amplification size of 140bp, and annealing temperature of 60°C. Primer sequences were designed for 2,396 candidate genes including four potential controls: GAPDH, beta actin (ACTB), beta-2-microglobulin (B2M), and phosphoglycerate kinase 1 (PGK1). The RNA samples were reverse transcribed to produce cDNA and then subjected to quantitative PCR using SYBR Green (Molecular Probes) on the ABI Prism 7900HT system (Applied Biosystems, Foster City, CA, USA) with a 384-well amplification plate; genes for each sample were assayed in triplicate. Reactions were carried out in 20μL volumes in 384-well plates; each reaction contained 0.5 U of HotStar Taq DNA polymerase (Qiagen) and the manufacturer’s 1× amplification buffer adjusted to a final concentration of 1mM MgCl2, 160μM dNTPs, 1/38000 SYBR Green I (Molecular Probes), 7% DMSO, 0.4% ROX Reference Dye (Invitrogen), 300 nM of each primer (forward and reverse), and 2μL of 40-fold diluted first-strand cDNA synthesis reaction mixture (12.5ng total RNA equivalent). Polymerase activation at 95ºC for 15 min was followed by 40 cycles of 15 s at 94ºC, 30 s at 60ºC, and 30 s at 72ºC. Dissociation curve analysis, which verifies that each PCR product is amplified from a single cDNA, was carried out in accordance with the manufacturer’s protocol. Expression levels were reported as Ct values. The large number of genes assayed and the replicate measures required that samples be distributed across multiple amplification plates, with an average of twelve plates per sample. Because it was envisioned that GAPDH would serve as a single-gene normalization control, this gene was included on each plate. All primer pairs were replicated in triplicate. Raw qPCR expression measures were quantified using Applied Biosystems SDS software and reported as Ct values. The Ct value represents the number of cycles or rounds of amplification required for the fluorescence of a gene or primer pair to surpass an arbitrary threshold. The magnitude of the Ct value is inversely proportional to the expression level, so a gene expressed at a high level will have a low Ct value and vice versa. Replicate Ct values were combined by averaging, with additional quality control constraints imposed by a standard filtering method developed by the RIKEN group for the preprocessing of their qPCR data. Briefly, this method entails: 1. Sort the triplicate Ct values in ascending order: Ct1, Ct2, Ct3. Calculate differences between consecutive Ct values: difference1 = Ct2 – Ct1 and difference2 = Ct3 – Ct2. 2. Four regions are defined (where Region4 overrides the other regions): Region1: difference ≤ 0.2; Region2: 0.2 < difference ≤ 1.0; Region3: 1.0 < difference; Region4: one of the Ct values in the difference calculation is 40. If difference1 and difference2 fall in the same region, then the three replicate Ct values are averaged to give a final representative measure. If difference1 and difference2 are in different regions, then the two replicate Ct values that are in the lower-numbered region are averaged instead. This particular filtering method is specific to the data set used here and does not represent a part of the normalization procedure itself; alternate methods of filtering can be applied if appropriate prior to normalization. Moreover, while the presentation in this manuscript has used Ct values as an example, any measure of transcript abundance, including those corrected for primer efficiency, can be used as input to our data-driven methods.

    (2) Quantile Normalization Algorithm
    Quantile normalization proceeds in two stages. First, if samples are distributed across multiple plates, normalization is applied to all of the genes assayed for each sample to remove plate-to-plate effects by enforcing the same quantile distribution on each plate. Then, an overall quantile normalization is applied between samples, assuring that each sample has the same distribution of expression values as all of the other samples to be compared. A similar approach using quantile normalization has been previously described in the context of microarray normalization. Briefly, our method entails the following steps: i) qPCR data from a single RNA sample are stored in a matrix M of dimension k (maximum number of genes or primer pairs on a plate) rows by p (number of plates) columns. Plates with differing numbers of genes are made equivalent by padding plates with missing values to constrain M to a rectangular structure. ii) Each column is sorted into ascending order and stored in matrix M'. The sorted columns correspond to the quantile distribution of each plate. The missing values are placed at the end of each ordered column; all calculations in quantile normalization are performed on non-missing values. iii) The average quantile distribution is calculated by taking the average of each row in M'. Each column in M' is replaced by this average quantile distribution and rearranged to have the same ordering as the original row order in M. This gives the within-sample normalized data from one RNA sample. iv) Steps analogous to i–iii are repeated for each sample. Between-sample normalization is performed by storing the within-normalized data as a new matrix N of dimension k (total number of genes, in our example k = 2,396) rows by n (number of samples) columns. Steps ii and iii are then applied to this matrix.

    (3) Rank-Invariant Set Normalization Algorithm
    We describe an extension of this method for use on qPCR data with any number of experimental conditions or samples, in which we identify a set of stably expressed genes from within the measured expression data and then use these to adjust expression between samples. Briefly: i) qPCR data from all samples are stored in a matrix R of dimension g (total number of genes or primer pairs used for all plates) rows by s (total number of samples) columns. ii) We first select gene sets that are rank-invariant across a single sample compared to a common reference. The reference may be chosen in a variety of ways, depending on the experimental design and aims of the experiment. As described in Tseng et al., the reference may be designated as a particular sample from the experiment (e.g. time zero in a time-course experiment), the average or median of all samples, or the sample closest to the average or median of all samples. Genes are considered to be rank-invariant if they retain their ordering, or rank, with respect to expression in the experimental sample versus the common reference sample. We collect sets of rank-invariant genes for all of the s pairwise comparisons relative to the common reference, and take the intersection of all s sets to obtain the final set of rank-invariant genes that is used for normalization. iii) Let αj represent the average expression value of the rank-invariant genes in sample j; (α1, …, αs) then represents the vector of rank-invariant average expression values for all conditions 1 to s. iv) We calculate the scale f
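
    Below is a minimal sketch of the between-sample quantile normalization step outlined in (2), for a complete genes-by-samples Ct matrix with no missing values; the published procedure additionally handles plate padding and missing entries.

    ```
    import numpy as np

    def quantile_normalize(ct):
        """Force every column (sample) of a genes x samples Ct matrix onto the same distribution.

        Each column is sorted, the mean quantile distribution is computed across columns,
        and the mean quantiles are mapped back to each column's original row order.
        """
        ct = np.asarray(ct, dtype=float)
        order = np.argsort(ct, axis=0)                     # sort order within each column
        mean_quantiles = np.sort(ct, axis=0).mean(axis=1)  # average quantile distribution
        normalized = np.empty_like(ct)
        for j in range(ct.shape[1]):
            normalized[order[:, j], j] = mean_quantiles
        return normalized

    # Toy Ct matrix: 5 genes x 3 samples
    ct = np.array([[22.1, 23.4, 21.8],
                   [30.0, 31.2, 29.5],
                   [18.5, 19.0, 18.2],
                   [25.3, 26.1, 24.9],
                   [35.0, 36.5, 34.0]])
    print(quantile_normalize(ct))   # all columns now share the same set of values
    ```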

  9. Raw and Normalized Foraminiferal Data for Chincoteague Bay and the Marshes...

    • catalog.data.gov
    Updated Jan 6, 2026
    + more versions
    Cite
    U.S. Geological Survey (2026). Raw and Normalized Foraminiferal Data for Chincoteague Bay and the Marshes of Assateague Island and the Adjacent Vicinity, Maryland and Virginia- July 2014 [Dataset]. https://catalog.data.gov/dataset/raw-and-normalized-foraminiferal-data-for-chincoteague-bay-and-the-marshes-of-assateague-i-e83d4
    Explore at:
    Dataset updated
    Jan 6, 2026
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Assateague Island, Maryland, Chincoteague Bay, Virginia
    Description

    Foraminiferal samples were collected from Chincoteague Bay, Newport Bay, and Tom’s Cove as well as the marshes on the back-barrier side of Assateague Island and the Delmarva (Delaware-Maryland-Virginia) mainland by U.S. Geological Survey (USGS) researchers from the St. Petersburg Coastal and Marine Science Center in March, April (14CTB01), and October (14CTB02) 2014. Samples were also collected by the Woods Hole Coastal and Marine Science Center (WHCMSC) in July 2014 and shipped to the St. Petersburg office for processing. The dataset includes raw foraminiferal and normalized counts for the estuarine grab samples (G), terrestrial surface samples (S), and inner shelf grab samples (G). For further information regarding data collection and sample site coordinates, processing methods, or related datasets, please refer to USGS Data Series 1060 (https://doi.org/10.3133/ds1060), USGS Open-File Report 2015–1219 (https://doi.org/10.3133/ofr20151219), and USGS Open-File Report 2015-1169 (https://doi.org/10.3133/ofr20151169). Downloadable data are available as Excel spreadsheets, comma-separated values text files, and formal Federal Geographic Data Committee metadata.

  10. Tick Data Normalization Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Oct 7, 2025
    Cite
    Growth Market Reports (2025). Tick Data Normalization Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/tick-data-normalization-market
    Explore at:
    Available download formats: pptx, csv, pdf
    Dataset updated
    Oct 7, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Tick Data Normalization Market Outlook




    According to our latest research, the global Tick Data Normalization market size reached USD 1.02 billion in 2024, reflecting robust expansion driven by the increasing complexity and volume of financial market data. The market is expected to grow at a CAGR of 13.1% during the forecast period, reaching approximately USD 2.70 billion by 2033. This growth is fueled by the rising adoption of algorithmic trading, regulatory demands for accurate and consistent data, and the proliferation of advanced analytics across financial institutions. As per our analysis, the market’s trajectory underscores the critical role of data normalization in ensuring data integrity and operational efficiency in global financial markets.




    The primary growth driver for the tick data normalization market is the exponential surge in financial data generated by modern trading platforms and electronic exchanges. With the proliferation of high-frequency trading and the integration of diverse market data feeds, financial institutions face the challenge of processing vast amounts of tick-by-tick data from multiple sources, each with unique formats and structures. Tick data normalization solutions address this complexity by transforming disparate data streams into consistent, standardized formats, enabling seamless downstream processing for analytics, trading algorithms, and compliance reporting. This standardization is particularly vital in the context of regulatory mandates such as MiFID II and Dodd-Frank, which require accurate data lineage and auditability, further propelling market growth.




    Another significant factor contributing to market expansion is the growing reliance on advanced analytics and artificial intelligence within the financial sector. As firms seek to extract actionable insights from historical and real-time tick data, the need for high-quality, normalized datasets becomes paramount. Data normalization not only enhances the accuracy and reliability of predictive models but also facilitates the integration of machine learning algorithms for tasks such as anomaly detection, risk assessment, and portfolio optimization. The increasing sophistication of trading strategies, coupled with the demand for rapid, data-driven decision-making, is expected to sustain robust demand for tick data normalization solutions across asset classes and geographies.




    Furthermore, the transition to cloud-based infrastructure has transformed the operational landscape for banks, hedge funds, and asset managers. Cloud deployment offers scalability, flexibility, and cost-efficiency, enabling firms to manage large-scale tick data normalization workloads without the constraints of on-premises hardware. This shift is particularly relevant for smaller institutions and emerging markets, where cloud adoption lowers entry barriers and accelerates the deployment of advanced data management capabilities. At the same time, the availability of managed services and API-driven platforms is fostering innovation and expanding the addressable market, as organizations seek to outsource complex data normalization tasks to specialized vendors.




    Regionally, North America continues to dominate the tick data normalization market, accounting for the largest share in terms of revenue and technology adoption. The presence of leading financial centers, advanced IT infrastructure, and a strong regulatory framework underpin the region’s leadership. Meanwhile, Asia Pacific is emerging as the fastest-growing market, driven by rapid digitalization of financial services, burgeoning capital markets, and increasing participation of retail and institutional investors. Europe also maintains a significant market presence, supported by stringent compliance requirements and a mature financial ecosystem. Latin America and the Middle East & Africa are witnessing steady growth, albeit from a lower base, as financial modernization initiatives gain momentum.





    Component Analysis




    The tick data normalizati

  11. UniCourt Legal Analytics API - USA Legal Data (AI Normalized)

    • datarade.ai
    Cite
    UniCourt, UniCourt Legal Analytics API - USA Legal Data (AI Normalized) [Dataset]. https://datarade.ai/data-products/unicourt-legal-analytics-api-usa-legal-data-ai-normalized-unicourt
    Explore at:
    Dataset authored and provided by
    UniCourt
    Area covered
    United States
    Description

    UniCourt provides easy access to normalized legal analytics data via our Attorney Analytics API, Law Firm Analytics API, Judge Analytics API, Party Analytics API, and Court Analytics API, giving you the flexibility you need to intuitively move between interconnected data points. This structure can be used for AI & ML Training Data.

    Build the Best Legal Analytics Possible

    • UniCourt collects court data from hundreds of state and federal trial court databases, as well as attorney bar data, judicial records data, and Secretary of State data.
    • We then combine all of those data sets together through our entity normalization process to identify who’s who in litigation, so you can download structured data via our APIs and build the best legal analytics possible.

    Flexible Analytics APIs for Meaningful Integrations

    • UniCourt’s Legal Analytics APIs put billions of data points at your fingertips and give you the flexibility you need to integrate analytics into your matter management systems, BI dashboards, data lakes, CRMs, and other data management tools.
    • Create on-demand, self-service reporting options within your internal applications and set up automated data feeds to keep your mission critical analytics reports regularly refreshed with updated data.

    What Legal Analytics APIs Are Available?

    UniCourt offers a wide range of Legal Analytics APIs and various end-points to provide the data you need. Here are the core analytics APIs we provide:

    • Attorney Analytics API
    • Law Firm Analytics API
    • Judge Analytics API
    • Party Analytics API
    • Case Analytics API

  12. Single non-normalized data of electron probe analyses of all glass shard...

    • doi.pangaea.de
    zip
    Updated Apr 13, 2016
    Cite
    Josefine Lenz; Sebastian Wetterich; Benjamin M Jones; Hanno Meyer; Guido Grosse; Anatoly A Bobrov (2016). Single non-normalized data of electron probe analyses of all glass shard samples from the Seward Peninsula and the Lipari obsidian reference standard [Dataset]. http://doi.org/10.1594/PANGAEA.859554
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 13, 2016
    Dataset provided by
    PANGAEA
    Authors
    Josefine Lenz; Sebastian Wetterich; Benjamin M Jones; Hanno Meyer; Guido Grosse; Anatoly A Bobrov
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Time period covered
    Jul 1, 2008 - Apr 22, 2009
    Description

    Permafrost degradation influences the morphology, biogeochemical cycling and hydrology of Arctic landscapes over a range of time scales. To reconstruct temporal patterns of early to late Holocene permafrost and thermokarst dynamics, site-specific palaeo-records are needed. Here we present a multi-proxy study of a 350-cm-long permafrost core from a drained lake basin on the northern Seward Peninsula, Alaska, revealing Lateglacial to Holocene thermokarst lake dynamics in a central location of Beringia. Use of radiocarbon dating, micropalaeontology (ostracods and testaceans), sedimentology (grain-size analyses, magnetic susceptibility, tephra analyses), geochemistry (total nitrogen and carbon, total organic carbon, d13Corg) and stable water isotopes (d18O, dD, d excess) of ground ice allowed the reconstruction of several distinct thermokarst lake phases. These include a pre-lacustrine environment at the base of the core characterized by the Devil Mountain Maar tephra (22 800±280 cal. a BP, Unit A), which has vertically subsided in places due to subsequent development of a deep thermokarst lake that initiated around 11 800 cal. a BP (Unit B). At about 9000 cal. a BP this lake transitioned from a stable depositional environment to a very dynamic lake system (Unit C) characterized by fluctuating lake levels, potentially intermediate wetland development, and expansion and erosion of shore deposits. Complete drainage of this lake occurred at 1060 cal. a BP, including post-drainage sediment freezing from the top down to 154 cm and gradual accumulation of terrestrial peat (Unit D), as well as uniform upward talik refreezing. This core-based reconstruction of multiple thermokarst lake generations since 11 800 cal. a BP improves our understanding of the temporal scales of thermokarst lake development from initiation to drainage, demonstrates complex landscape evolution in the ice-rich permafrost regions of Central Beringia during the Lateglacial and Holocene, and enhances our understanding of biogeochemical cycles in thermokarst-affected regions of the Arctic.

  13. Data from: Filtration and Normalization of Sequencing Read Data in...

    • datasetcatalog.nlm.nih.gov
    Updated Oct 20, 2016
    Cite
    Losada, Patricia Moran; Chouvarine, Philippe; DeLuca, David S.; Wiehlmann, Lutz; Tümmler, Burkhard (2016). Filtration and Normalization of Sequencing Read Data in Whole-Metagenome Shotgun Samples [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001539099
    Explore at:
    Dataset updated
    Oct 20, 2016
    Authors
    Losada, Patricia Moran; Chouvarine, Philippe; DeLuca, David S.; Wiehlmann, Lutz; Tümmler, Burkhard
    Description

    Ever-increasing affordability of next-generation sequencing makes whole-metagenome sequencing an attractive alternative to traditional 16S rDNA, RFLP, or culturing approaches for the analysis of microbiome samples. The advantage of whole-metagenome sequencing is that it allows direct inference of the metabolic capacity and physiological features of the studied metagenome without reliance on the knowledge of genotypes and phenotypes of the members of the bacterial community. It also makes it possible to overcome problems of 16S rDNA sequencing, such as unknown copy number of the 16S gene and lack of sufficient sequence similarity of the “universal” 16S primers to some of the target 16S genes. On the other hand, next-generation sequencing suffers from biases resulting in non-uniform coverage of the sequenced genomes. To overcome this difficulty, we present a model of GC-bias in sequencing metagenomic samples as well as filtration and normalization techniques necessary for accurate quantification of microbial organisms. While there has been substantial research in normalization and filtration of read-count data in such techniques as RNA-seq or Chip-seq, to our knowledge, this has not been the case for the field of whole-metagenome shotgun sequencing. The presented methods assume that complete genome references are available for most microorganisms of interest present in metagenomic samples. This is often a valid assumption in such fields as medical diagnostics of patient microbiota. Testing the model on two validation datasets showed four-fold reduction in root-mean-square error compared to non-normalized data in both cases. The presented methods can be applied to any pipeline for whole metagenome sequencing analysis relying on complete microbial genome references. We demonstrate that such pre-processing reduces the number of false positive hits and increases accuracy of abundance estimates.
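
    To illustrate the general idea (not the paper's specific model), here is a minimal sketch of GC-bias normalization that bins positions by the GC content of the reference and rescales observed coverage by each bin's bias factor; all inputs are simulated.

    ```
    import numpy as np

    def gc_bias_factors(gc_content, coverage, n_bins=10):
        """Estimate a per-GC-bin bias factor: mean coverage in the bin / overall mean coverage."""
        bins = np.clip((gc_content * n_bins).astype(int), 0, n_bins - 1)
        overall = coverage.mean()
        factors = np.ones(n_bins)
        for b in range(n_bins):
            mask = bins == b
            if mask.any():
                factors[b] = coverage[mask].mean() / overall
        return bins, factors

    def gc_normalize(gc_content, coverage, n_bins=10):
        """Divide each position's coverage by the bias factor of its GC bin."""
        bins, factors = gc_bias_factors(gc_content, coverage, n_bins)
        return coverage / factors[bins]

    # Toy data: coverage artificially depressed at GC extremes
    rng = np.random.default_rng(7)
    gc = rng.uniform(0.2, 0.8, size=1000)
    cov = rng.poisson(lam=100 * (1 - 2 * np.abs(gc - 0.5)) + 20)
    corrected = gc_normalize(gc, cov)
    print(corrected.mean())   # per-GC-bin mean coverage is now roughly constant
    ```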

  14. Khmer Word Image Patches For Training OCR

    • kaggle.com
    zip
    Updated Oct 12, 2025
    Cite
    Chanveasna ENG (2025). Khmer Word Image Patches For Training OCR [Dataset]. https://www.kaggle.com/datasets/veasnaecevilsna/khwordpatches
    Explore at:
    Available download formats: zip (5,065,136,281 bytes)
    Dataset updated
    Oct 12, 2025
    Authors
    Chanveasna ENG
    Description

    Synthetic Khmer OCR - Pre-processed Chunks

    This dataset is a pre-processed and optimized version of the original "Synthetic Khmer OCR" dataset. All word images have been cropped, resized with padding to a uniform size, and stored in highly efficient PyTorch tensor chunks for extremely fast loading during model training.

    This format is designed to completely eliminate the I/O bottleneck that comes from reading millions of individual small image files, allowing you to feed a powerful GPU without any waiting.

    Why This Format?

    Extreme Speed: Loading a single chunk of 100,000 images from one file is hundreds of times faster than loading 100,000 individual PNG files.
    
    No More Pre-processing: All images are already cropped and resized. The data is ready for training right out of the box.
    
    Memory Efficient: The dataset is split into manageable chunks, so you don't need to load all ~34GB of data into RAM at once.
    

    Data Structure

    The dataset is organized into two main folders: train and val.

    /
    ├── train/
    │   ├── train_chunk_0.pt
    │   ├── train_chunk_1.pt
    │   └── ... (and so on for all training chunks)
    └── val/
        ├── val_chunk_0.pt
        ├── val_chunk_1.pt
        └── ... (and so on for all validation chunks)

    Inside Each Chunk File (.pt)

    Each .pt file is a standard PyTorch file containing a single Python dictionary with two keys: 'images' and 'labels'.

    'images':
    
      Type: torch.Tensor
    
      Shape: (N, 3, 40, 80), where N is the number of samples in the chunk (typically 100,000).
    
      Data Type (dtype): torch.uint8 (values from 0-255). This is done to save a massive amount of disk space. You will need to convert this to float and normalize it before feeding it to a model.
    
      Description: This tensor contains N raw, uncompressed image pixels. Each image is a 3-channel (RGB) color image with a height of 40 pixels and a width of 64 pixels.
    
    'labels':
    
      Type: list of str
    
      Length: N (matches the number of images in the tensor).
    
      Description: This is a simple Python list of strings. The label at labels[i] corresponds to the image at images[i].
    

    How to Use This Dataset in PyTorch

    Here is a simple example of how to load a chunk and access the data.

    ```
    import torch
    from torchvision import transforms
    from PIL import Image

    # --- 1. Load a single chunk file ---
    chunk_path = 'train/train_chunk_0.pt'
    data_chunk = torch.load(chunk_path)

    image_tensor_chunk = data_chunk['images']
    labels_list = data_chunk['labels']

    print(f"Loaded chunk: {chunk_path}")
    print(f"Image tensor shape: {image_tensor_chunk.shape}")
    print(f"Number of labels: {len(labels_list)}")

    # --- 2. Get a single sample (e.g., the 42nd item in this chunk) ---
    index = 42
    image_uint8 = image_tensor_chunk[index]
    label = labels_list[index]

    print(f"--- Sample at index {index} ---")
    print(f"Label: {label}")
    print(f"Image tensor shape (as saved): {image_uint8.shape}")
    print(f"Image data type (as saved): {image_uint8.dtype}")

    # --- 3. Prepare the image for a model ---
    # You need to convert the uint8 tensor (0-255) to a float tensor (0.0-1.0)
    # and then normalize it.

    # a. Convert to float
    image_float = image_uint8.float() / 255.0

    # b. Define the normalization (must be the same as used in training)
    normalize_transform = transforms.Normalize(
        mean=[0.5, 0.5, 0.5],
        std=[0.5, 0.5, 0.5]
    )

    # c. Apply normalization
    normalized_image = normalize_transform(image_float)

    print(f"Image tensor shape (normalized): {normalized_image.shape}")
    print(f"Image data type (normalized): {normalized_image.dtype}")
    print(f"Min value: {normalized_image.min():.2f}, Max value: {normalized_image.max():.2f}")

    # --- (Optional) 4. How to view the image ---
    # To convert a tensor back to an image you can view, we would need to
    # un-normalize it first to see the original colors. For simplicity, just
    # convert the float tensor from before normalization.
    image_to_view = transforms.ToPILImage()(image_float)

    # You can now display or save image_to_view:
    # image_to_view.show()
    # image_to_view.save('sample_image.png')

    print("Successfully prepared a sample for model input and viewing!")
    ```

  15. GSE65194 Data Normalization and Subtype Analysis

    • kaggle.com
    zip
    Updated Nov 29, 2025
    Cite
    Dr. Nagendra (2025). GSE65194 Data Normalization and Subtype Analysis [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/gse65194-data-normalization-and-subtype-analysis
    Explore at:
    Available download formats: zip (54,989,436 bytes)
    Dataset updated
    Nov 29, 2025
    Authors
    Dr. Nagendra
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Raw and preprocessed microarray expression data from the GSE65194 cohort.

    Includes samples from triple-negative breast cancer (TNBC), other breast cancer subtypes, and normal breast tissues.

    Expression profiles generated using the “Affymetrix Human Genome U133 Plus 2.0 Array (GPL570)” platform.

    Provides normalized gene expression values suitable for downstream analyses such as differential expression, subtype classification, and clustering.

    Supports the identification of differentially expressed genes (DEGs) between TNBC, non-TNBC subtypes, and normal tissue.

    Useful for transcriptomic analyses in breast cancer research, including subtype analysis, biomarker discovery, and comparative studies.

  16. LEMming: A Linear Error Model to Normalize Parallel Quantitative Real-Time...

    • plos.figshare.com
    pdf
    Updated Jun 2, 2023
    Cite
    Ronny Feuer; Sebastian Vlaic; Janine Arlt; Oliver Sawodny; Uta Dahmen; Ulrich M. Zanger; Maria Thomas (2023). LEMming: A Linear Error Model to Normalize Parallel Quantitative Real-Time PCR (qPCR) Data as an Alternative to Reference Gene Based Methods [Dataset]. http://doi.org/10.1371/journal.pone.0135852
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Ronny Feuer; Sebastian Vlaic; Janine Arlt; Oliver Sawodny; Uta Dahmen; Ulrich M. Zanger; Maria Thomas
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background Gene expression analysis is an essential part of biological and medical investigations. Quantitative real-time PCR (qPCR) is characterized by excellent sensitivity, dynamic range and reproducibility, and is still regarded as the gold standard for quantifying transcript abundance. Parallelization of qPCR, such as on the microfluidic TaqMan Fluidigm Biomark platform, enables evaluation of multiple transcripts in samples treated under various conditions. Despite advanced technologies, correct evaluation of the measurements remains challenging. The most widely used methods for evaluating or calculating gene expression data include geNorm and ΔΔCt, respectively. They rely on one or several stable reference genes (RGs) for normalization, thus potentially causing biased results. We therefore applied multivariable regression with a tailored error model to overcome the necessity of stable RGs.
    Results We developed an RG-independent data normalization approach based on a tailored linear error model for parallel qPCR data, called LEMming. It uses the assumption that the mean Ct values within samples of similarly treated groups are equal. Performance of LEMming was evaluated in three data sets with different stability patterns of RGs and compared to the results of geNorm normalization. Data set 1 showed that both methods give similar results if stable RGs are available. Data set 2 included RGs which are stable according to geNorm criteria but became differentially expressed in normalized data evaluated by a t-test. geNorm-normalized data showed an effect of a shifted mean per gene per condition, whereas LEMming-normalized data did not. Comparing the decrease of standard deviation from raw data to geNorm and to LEMming, the latter was superior. In data set 3, stable RGs were available according to geNorm's calculated average expression stability and pairwise variation, but t-tests of raw data contradicted this. Normalization with RGs resulted in distorted data contradicting the literature, while LEMming-normalized data did not.
    Conclusions If RGs are coexpressed but are not independent of the experimental conditions, the stability criteria based on inter- and intragroup variation fail. The linear error model developed here, LEMming, overcomes the dependency on RGs for parallel qPCR measurements, besides resolving biases of both technical and biological nature in qPCR. However, to distinguish systematic errors per treated group from a global treatment effect, an additional measurement is needed. Quantification of total cDNA content per sample helps to identify systematic errors.
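
    For context, here is a minimal sketch of the reference-gene-based ΔΔCt calculation that LEMming is positioned as an alternative to; the Ct values are invented and a perfect amplification efficiency of 2 is assumed.

    ```
    # Minimal illustration of the classic delta-delta-Ct (ΔΔCt) method that
    # reference-gene-based normalization relies on. All values are invented.
    def delta_delta_ct(ct_target_treated, ct_ref_treated,
                       ct_target_control, ct_ref_control):
        """Return the fold change of a target gene, normalized to a reference gene."""
        d_ct_treated = ct_target_treated - ct_ref_treated   # ΔCt in the treated sample
        d_ct_control = ct_target_control - ct_ref_control   # ΔCt in the control sample
        dd_ct = d_ct_treated - d_ct_control                 # ΔΔCt
        return 2 ** (-dd_ct)                                # fold change, assuming 100% efficiency

    # Example: the target gene Ct drops by ~2 cycles relative to the reference gene
    print(delta_delta_ct(24.0, 18.0, 26.0, 18.0))  # ≈ 4.0-fold up-regulation
    ```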

  17. Data_Sheet_2_Ensuring That Fundamentals of Quantitative Microbiology Are...

    • figshare.com
    txt
    Updated Jun 6, 2023
    Cite
    Philip J. Schmidt; Ellen S. Cameron; Kirsten M. Müller; Monica B. Emelko (2023). Data_Sheet_2_Ensuring That Fundamentals of Quantitative Microbiology Are Reflected in Microbial Diversity Analyses Based on Next-Generation Sequencing.csv [Dataset]. http://doi.org/10.3389/fmicb.2022.728146.s002
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 6, 2023
    Dataset provided by
    Frontiers
    Authors
    Philip J. Schmidt; Ellen S. Cameron; Kirsten M. Müller; Monica B. Emelko
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Diversity analysis of amplicon sequencing data has mainly been limited to plug-in estimates calculated using normalized data to obtain a single value of an alpha diversity metric or a single point on a beta diversity ordination plot for each sample. As recognized for count data generated using classical microbiological methods, amplicon sequence read counts obtained from a sample are random data linked to source properties (e.g., proportional composition) by a probabilistic process. Thus, diversity analysis has focused on diversity exhibited in (normalized) samples rather than probabilistic inference about source diversity.

    This study applies fundamentals of statistical analysis for quantitative microbiology (e.g., microscopy, plating, and most probable number methods) to sample collection and processing procedures of amplicon sequencing methods to facilitate inference reflecting the probabilistic nature of such data and evaluation of uncertainty in diversity metrics. Following description of types of random error, mechanisms such as clustering of microorganisms in the source, differential analytical recovery during sample processing, and amplification are found to invalidate a multinomial relative abundance model. The zeros often abounding in amplicon sequencing data and their implications are addressed, and Bayesian analysis is applied to estimate the source Shannon index given unnormalized data (both simulated and experimental).

    Inference about source diversity is found to require knowledge of the exact number of unique variants in the source, which is practically unknowable due to library size limitations and the inability to differentiate zeros corresponding to variants that are actually absent in the source from zeros corresponding to variants that were merely not detected. Given these problems with estimation of diversity in the source even when the basic multinomial model is valid, diversity analysis at the level of samples with normalized library sizes is discussed.
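
    As a point of reference for the plug-in estimates the study critiques, the sketch below computes a plug-in Shannon index directly from raw read counts; the variant counts are invented for illustration, and this is not the Bayesian source-diversity estimator developed in the paper.

    import numpy as np

    def shannon_plugin(counts):
        """Plug-in Shannon index from a vector of amplicon read counts (zeros are dropped)."""
        counts = np.asarray(counts, dtype=float)
        p = counts[counts > 0] / counts.sum()
        return -(p * np.log(p)).sum()

    # Hypothetical read counts for the same source sequenced at two library sizes.
    shallow = np.array([500, 300, 150, 40, 10, 0, 0])
    deep = np.array([5000, 2900, 1600, 380, 90, 20, 10])

    print(shannon_plugin(shallow))  # diversity exhibited in the shallow sample
    print(shannon_plugin(deep))     # rare variants detected only at higher depth shift the estimate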

  18. Data from: Breast cancer patient-derived whole-tumor cell culture model for...

    • figshare.scilifelab.se
    Updated Jan 15, 2025
    Cite
    Xinsong Chen; Emmanouil Sifakis; Johan Hartman (2025). Data from: Breast cancer patient-derived whole-tumor cell culture model for efficient drug profiling and treatment response prediction [Dataset]. http://doi.org/10.17044/scilifelab.21516993.v1
    Explore at:
    Dataset updated
    Jan 15, 2025
    Dataset provided by
    Karolinska Institutet
    Authors
    Xinsong Chen; Emmanouil Sifakis; Johan Hartman
    License

    https://www.scilifelab.se/data/restricted-access/

    Description

    Dataset Description

    This record is a collection of whole-genome sequencing (WGS), RNA sequencing (RNA-seq), NanoString nCounter® Breast Cancer 360 (BC360) Panel, and cell viability assay data generated as part of the study “Breast cancer patient-derived whole-tumor cell culture model for efficient drug profiling and treatment response prediction” by Chen et al., 2022.

    The WGS dataset contains raw sequencing data (BAM files) from tumor scraping cells (TSCs) collected at the time of surgical resection, derived whole-tumor cell (WTC) cultures from each patient's specimen, and a normal skin biopsy used as germline control, from five (5) breast cancer (BC) patients. Genomic DNA was isolated using the QIAamp DNA mini kit (QIAGEN). Libraries were prepared using Illumina TruSeq PCR-free (350 bp) according to the manufacturer's protocol. The bulk DNA samples were then sequenced on an Illumina HiSeq X and processed via the Science for Life Laboratory CAW workflow version 1.2.362 (Stockholm, Sweden; https://github.com/SciLifeLab/Sarek).

    The RNA-seq dataset contains raw sequencing data (FASTQ files) from the TSC pellets collected at the time of surgical resection and from the pellets of derived WTC cultures with or without tamoxifen metabolite treatment (1 nM 4OHT and 25 nM Z-Endoxifen), from 16 BC patients. From each sample, 2000 ng of RNA was extracted using the RNeasy mini kit (QIAGEN), and 1 μg of total RNA was used for rRNA depletion with RiboZero (Illumina). Stranded RNA-seq libraries were constructed using the TruSeq Stranded Total RNA Library Prep Kit (Illumina), and paired-end sequencing was performed on a HiSeq 2500 with a 2 x 126 setup at the Science for Life Laboratory platform (Stockholm, Sweden).

    The NanoString nCounter® BC360 Panel dataset contains normalized data from FFPE tissue samples of 43 BC patients. RNA was extracted from the macrodissected sections using the High Pure FFPET RNA Isolation Kit (Roche) following the manufacturer's protocols. Then, 200 ng of RNA per sample was loaded and analyzed according to the manufacturer's recommendations on a NanoString nCounter® system using the Breast Cancer 360 code set, which comprises 18 housekeeping genes and 752 target genes covering key pathways in tumor biology, the tumor microenvironment, and the immune response. Raw data was assessed with several quality assurance (QA) metrics measuring imaging quality, oversaturation, and overall signal-to-noise ratio. All samples satisfying the QA metric checks were background corrected (background thresholding) using the negative probes and normalized with their mean minus two standard deviations. The background-corrected data were then normalized to the geometric mean of five housekeeping genes, namely ACTB, MRPL19, PSMC4, RPLP0, and SF3A1.

    The cell viability assay dataset for the main study contains drug sensitivity score (DSS) values for each tested drug, derived from the WTC spheroids of 45 BC patients. For patient DP-45, multiple regions were sampled to establish WTCs and perform drug profiling. For the neoadjuvant-setting validation study, the DSS values correspond to WTCs of 15 BC patients. In the drug profiling assay, each compound covered five concentrations ranging from 10 μM to 1 nM (2 μM to 0.2 nM for trastuzumab and pertuzumab) in 10-fold dilutions and was dispensed with the Echo 550 acoustic liquid handling system (Labcyte Inc.) to prepare spotted 384-well plates.

    For the neoadjuvant-setting validation assay, cyclophosphamide was replaced with its active metabolite, 4-hydroperoxy cyclophosphamide (4-OOH-cyclophosphamide). Each relevant compound covered eight concentrations ranging from 10 μM to 1 nM (2 μM to 0.2 nM for trastuzumab and pertuzumab) and was dispensed with the Tecan D300e Digital Dispenser (Tecan) to prepare spotted 384-well plates. In both experimental settings, a total volume of 40 nl of each compound condition was dispensed into each well to limit the final DMSO concentration to 0.1% during the treatment period. Further details on the cell viability assay and the DSS estimation are available in the Materials & Methods section of Chen et al., 2022.
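
    A minimal sketch of the housekeeping-gene normalization summarized above (scaling each sample to the geometric mean of ACTB, MRPL19, PSMC4, RPLP0, and SF3A1 after background correction with the negative probes) is shown below. The counts are made up and the background step is reduced to a simple per-sample floor; this illustrates the idea only and is not the nCounter analysis pipeline used in the study.

    import numpy as np
    import pandas as pd

    # Hypothetical raw nCounter counts: rows = probes, columns = samples.
    raw = pd.DataFrame(
        {"S1": [1200, 800, 950, 700, 500, 15, 9, 30],
         "S2": [2400, 1500, 1900, 1500, 1100, 20, 12, 55]},
        index=["ACTB", "MRPL19", "PSMC4", "RPLP0", "SF3A1", "NEG_A", "NEG_B", "TARGET_X"],
    )
    housekeepers = ["ACTB", "MRPL19", "PSMC4", "RPLP0", "SF3A1"]
    negatives = ["NEG_A", "NEG_B"]

    # 1. Simplified background correction: floor each sample at its mean negative-probe count.
    background = raw.loc[negatives].mean()
    corrected = raw.clip(lower=background, axis=1)

    # 2. Scale each sample by the geometric mean of its housekeeping genes so samples are comparable.
    geo_mean = np.exp(np.log(corrected.loc[housekeepers]).mean())
    scale = geo_mean.mean() / geo_mean
    normalized = corrected.mul(scale, axis=1)
    print(normalized.round(1))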

  19. Communities and Crime Data Set (Normalized)

    • kaggle.com
    zip
    Updated Dec 1, 2018
    Cite
    Rohit Kanrar (2018). Communities and Crime Data Set (Normalized) [Dataset]. https://www.kaggle.com/anonymous13635/communities-and-crime-data-set-normalized
    Explore at:
    zip, 959436 bytes (available download formats)
    Dataset updated
    Dec 1, 2018
    Authors
    Rohit Kanrar
    Description

    Many variables are included so that algorithms that select or learn weights for attributes can be tested. Clearly unrelated attributes were not included; attributes were picked if there was any plausible connection to crime (N=122), plus the attribute to be predicted (per capita violent crimes). The variables in the dataset describe the community, such as the percent of the population considered urban and the median family income, and law enforcement, such as the per capita number of police officers and the percent of officers assigned to drug units.

    The per capita violent crimes variable was calculated using population and the sum of the crime variables considered violent crimes in the United States: murder, rape, robbery, and assault. There was apparently some controversy in some states concerning the counting of rapes, which led to missing values for rape and, in turn, incorrect values for per capita violent crime; those communities are not included in the dataset. Many of the omitted communities were from the midwestern USA.

    Data is described below based on the original values. All numeric data was normalized into the decimal range 0.00-1.00 using an unsupervised, equal-interval binning method. Attributes retain their distribution and skew (hence, for example, the population attribute has a mean value of 0.06 because most communities are small). An attribute described as 'mean people per household' is therefore actually the normalized (0-1) version of that value.

    The normalization preserves rough ratios of values WITHIN an attribute (e.g., double the value for double the population, within the available precision), except for extreme values: all values more than 3 SD above the mean are normalized to 1.00, and all values more than 3 SD below the mean are normalized to 0.00.

    However, the normalization does not preserve relationships BETWEEN attributes (e.g., it would not be meaningful to compare the value for whitePerCap with the value for blackPerCap for a community).
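
    A rough reconstruction of the kind of per-attribute scaling described above (values beyond 3 SD from the mean pushed to 0.00 or 1.00, everything else mapped into 0.00-1.00) is sketched below; it illustrates the clipping-and-rescaling idea and is not the exact equal-interval binning procedure used to build the dataset.

    import numpy as np

    def normalize_attribute(values):
        """Scale one attribute to [0, 1], sending values beyond +/-3 SD to 1.0 / 0.0."""
        x = np.asarray(values, dtype=float)
        lo, hi = x.mean() - 3 * x.std(), x.mean() + 3 * x.std()
        return (np.clip(x, lo, hi) - lo) / (hi - lo)

    # Hypothetical community populations: one large city dominates the scale,
    # so most normalized values end up near the low end of [0, 1].
    population = np.array([1200, 3400, 5600, 8000, 12000, 950000])
    print(normalize_attribute(population).round(2))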

    A limitation is that the LEMAS survey covered only police departments with at least 100 officers, plus a random sample of smaller departments. For our purposes, communities not found in both the census and crime datasets were omitted. Many communities are missing LEMAS data.

  20. Khmer Subsyllables Image Patches For Training OCR

    • kaggle.com
    zip
    Updated Nov 22, 2025
    Cite
    Chanveasna ENG (2025). Khmer Subsyllables Image Patches For Training OCR [Dataset]. https://www.kaggle.com/datasets/veasnaecevilsna/khsubsyllablespatches
    Explore at:
    zip, 3561443645 bytes (available download formats)
    Dataset updated
    Nov 22, 2025
    Authors
    Chanveasna ENG
    Description

    Synthetic Khmer OCR - Pre-processed Chunks

    This dataset is a pre-processed and optimized version of the original "Synthetic Khmer OCR" dataset. All word images have been cropped, resized with padding to a uniform size, and stored in highly efficient PyTorch tensor chunks for extremely fast loading during model training.

    This format is designed to eliminate the I/O bottleneck that comes from reading millions of individual small image files, allowing you to keep a powerful GPU fed without waiting.

    Why This Format?

    Extreme Speed: Loading a single chunk of 100,000 images from one file is hundreds of times faster than loading 100,000 individual PNG files.
    
    No More Pre-processing: All images are already cropped and resized. The data is ready for training right out of the box.
    
    Memory Efficient: The dataset is split into manageable chunks, so you don't need to load all ~34GB of data into RAM at once.
    

    Data Structure

    The dataset is organized into two main folders: train and val.

    /
    ├── train/
    │   ├── train_chunk_0.pt
    │   ├── train_chunk_1.pt
    │   └── ... (and so on for all training chunks)
    └── val/
        ├── val_chunk_0.pt
        ├── val_chunk_1.pt
        └── ... (and so on for all validation chunks)

    Inside Each Chunk File (.pt)

    Each .pt file is a standard PyTorch file containing a single Python dictionary with two keys: 'images' and 'labels'.

    'images':
    
      Type: torch.Tensor
    
      Shape: (N, 3, 40, 64), where N is the number of samples in the chunk (typically 100,000).
    
      Data Type (dtype): torch.uint8 (values from 0-255). This is done to save a massive amount of disk space. You will need to convert this to float and normalize it before feeding it to a model.
    
      Description: This tensor contains the raw, uncompressed pixel data for N images. Each image is a 3-channel (RGB) color image with a height of 40 pixels and a width of 64 pixels.
    
    'labels':
    
      Type: list of str
    
      Length: N (matches the number of images in the tensor).
    
      Description: This is a simple Python list of strings. The label at labels[i] corresponds to the image at images[i].
    

    How to Use This Dataset in PyTorch

    Here is a simple example of how to load a chunk and access the data.

    import torch
    from torchvision import transforms
    from PIL import Image
    
    # --- 1. Load a single chunk file ---
    chunk_path = 'train/train_chunk_0.pt'
    data_chunk = torch.load(chunk_path)
    
    image_tensor_chunk = data_chunk['images']
    labels_list = data_chunk['labels']
    
    print(f"Loaded chunk: {chunk_path}")
    print(f"Image tensor shape: {image_tensor_chunk.shape}")
    print(f"Number of labels: {len(labels_list)}")
    
    # --- 2. Get a single sample (e.g., the 42nd item in this chunk) ---
    index = 42
    image_uint8 = image_tensor_chunk[index]
    label = labels_list[index]
    
    print(f"\n--- Sample at index {index} ---")
    print(f"Label: {label}")
    print(f"Image tensor shape (as saved): {image_uint8.shape}")
    print(f"Image data type (as saved): {image_uint8.dtype}")
    
    
    # --- 3. Prepare the image for a model ---
    # You need to convert the uint8 tensor (0-255) to a float tensor (0.0-1.0)
    # and then normalize it.
    
    # a. Convert to float
    image_float = image_uint8.float() / 255.0
    
    # b. Define the normalization (must be the same as used in training)
    normalize_transform = transforms.Normalize(
      mean=[0.5, 0.5, 0.5],
      std=[0.5, 0.5, 0.5]
    )
    
    # c. Apply normalization
    normalized_image = normalize_transform(image_float)
    
    print(f"\nImage tensor shape (normalized): {normalized_image.shape}")
    print(f"Image data type (normalized): {normalized_image.dtype}")
    print(f"Min value: {normalized_image.min():.2f}, Max value: {normalized_image.max():.2f}")
    
    
    # --- (Optional) 4. How to view the image ---
    # To convert a tensor back to an image you can view:
    # We need to un-normalize it first if we want to see the original colors.
    # For simplicity, let's just convert the float tensor before normalization.
    image_to_view = transforms.ToPILImage()(image_float)
    
    # You can now display this image_to_view
    # image_to_view.show() 
    # image_to_view.save('sample_image.png')
    print("\nSuccessfully prepared a sample for model input and viewing!")
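
    For training, the chunk files can also be wrapped in a torch.utils.data.Dataset so a DataLoader handles batching and shuffling. The sketch below keeps a single chunk in memory and uses a hypothetical path; it is one possible way to consume the format described above, not part of the published dataset.

    import torch
    from torch.utils.data import Dataset, DataLoader
    from torchvision import transforms

    class ChunkDataset(Dataset):
        """Serves (normalized image, label) pairs from one pre-processed chunk file."""

        def __init__(self, chunk_path):
            chunk = torch.load(chunk_path)  # dict with 'images' (uint8 tensor) and 'labels' (list of str)
            self.images = chunk['images']
            self.labels = chunk['labels']
            self.normalize = transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])

        def __len__(self):
            return len(self.labels)

        def __getitem__(self, idx):
            img = self.images[idx].float() / 255.0  # uint8 (0-255) -> float (0.0-1.0)
            return self.normalize(img), self.labels[idx]

    # Usage (path is hypothetical):
    # ds = ChunkDataset('train/train_chunk_0.pt')
    # loader = DataLoader(ds, batch_size=256, shuffle=True)
    # images, labels = next(iter(loader))  # images: (256, 3, 40, 64) float tensor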
    