A robust semi-parametric normalization technique has been developed, based on the assumption that the large majority of genes will not have their relative expression levels changed from one treatment group to the next, and on the assumption that departures of the response from linearity are small and slowly varying. The method was tested using data simulated under various error models and it performs well.
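For orientation, a minimal sketch of the kind of intensity-dependent, loess-style adjustment such a semi-parametric approach implies, assuming two-channel log-ratio data; the function name, smoothing span and simulated inputs are illustrative, not the authors' implementation:

```python
# Illustrative only: loess-style normalization of log-ratios (M) against
# average log-intensity (A), assuming most genes are unchanged and the
# departure from linearity is smooth and slowly varying.
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def loess_normalize(red, green, frac=0.4):
    """Return loess-normalized log-ratios for two intensity vectors."""
    M = np.log2(red) - np.log2(green)           # log-ratio per gene
    A = 0.5 * (np.log2(red) + np.log2(green))   # average log-intensity
    trend = lowess(M, A, frac=frac, return_sorted=False)  # smooth bias curve
    return M - trend                            # subtract the fitted trend

# toy example: 1000 genes with a mild intensity-dependent bias
rng = np.random.default_rng(0)
green = rng.lognormal(8, 1, 1000)
red = green * 2 ** (0.2 * np.log2(green) / 10 + rng.normal(0, 0.1, 1000))
print(loess_normalize(red, green)[:5])
```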
Dataset Title: Data and Code for: "Universal Adaptive Normalization Scale (AMIS): Integration of Heterogeneous Metrics into a Unified System"
Description: This dataset contains source data and processing results for validating the Adaptive Multi-Interval Scale (AMIS) normalization method. Includes educational performance data (student grades), economic statistics (World Bank GDP), and Python implementation of the AMIS algorithm with graphical interface.
Contents:
- Source data: educational grades and GDP statistics
- AMIS normalization results (3, 5, 9, 17-point models)
- Comparative analysis with linear normalization
- Ready-to-use Python code for data processing
Applications:
- Educational data normalization and analysis
- Economic indicators comparison
- Development of unified metric systems
- Methodology research in data scaling
Technical info: Python code with pandas, numpy, scipy, matplotlib dependencies. Data in Excel format.
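The AMIS implementation itself ships with the dataset; the following is only a minimal sketch of the linear (min-max) baseline it is compared against, mapping raw scores onto a k-point scale (function name and k value are illustrative):

```python
# Illustrative linear baseline only (not the AMIS algorithm): min-max
# rescaling of raw scores onto a k-point scale, e.g. k = 5.
import numpy as np

def linear_rescale(values, k=5):
    """Map values linearly onto [1, k]."""
    v = np.asarray(values, dtype=float)
    lo, hi = v.min(), v.max()
    return 1 + (k - 1) * (v - lo) / (hi - lo)

grades = [52, 61, 74, 88, 95]
print(linear_rescale(grades, k=5))  # [1.  1.84 3.05 4.35 5. ] approximately
```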
Label-free proteomics expression data sets often exhibit data heterogeneity and missing values, necessitating the development of effective normalization and imputation methods. The selection of appropriate normalization and imputation methods is inherently data-specific, and choosing the optimal approach from the available options is critical for ensuring robust downstream analysis. This study aimed to identify the most suitable combination of these methods for quality control and accurate identification of differentially expressed proteins. In this study, we developed nine combinations by integrating three normalization methods, locally weighted linear regression (LOESS), variance stabilization normalization (VSN), and robust linear regression (RLR) with three imputation methods: k-nearest neighbors (k-NN), local least-squares (LLS), and singular value decomposition (SVD). We utilized statistical measures, including the pooled coefficient of variation (PCV), pooled estimate of variance (PEV), and pooled median absolute deviation (PMAD), to assess intragroup and intergroup variation. The combinations yielding the lowest values corresponding to each statistical measure were chosen as the data set’s suitable normalization and imputation methods. The performance of this approach was tested using two spiked-in standard label-free proteomics benchmark data sets. The identified combinations returned a low NRMSE and showed better performance in identifying spiked-in proteins. The developed approach can be accessed through the R package named ’lfproQC’ and a user-friendly Shiny web application (https://dabiniasri.shinyapps.io/lfproQC and http://omics.icar.gov.in/lfproQC), making it a valuable resource for researchers looking to apply this method to their data sets.
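As a rough illustration of one of the selection criteria, the sketch below computes a pooled coefficient of variation as the per-protein CV within each replicate group, averaged over proteins and groups; the exact pooling used in lfproQC may differ:

```python
# Simplified illustration of a pooled coefficient of variation (PCV):
# per-protein CV within each replicate group, averaged over proteins,
# then pooled over groups. The exact formula in lfproQC may differ.
import numpy as np
import pandas as pd

def pooled_cv(df, groups):
    """df: proteins x samples intensity matrix; groups: sample -> group label."""
    cvs = []
    for g in set(groups.values()):
        cols = [s for s, lab in groups.items() if lab == g]
        sub = df[cols]
        cv = sub.std(axis=1, ddof=1) / sub.mean(axis=1)  # CV per protein
        cvs.append(cv.mean())                            # average over proteins
    return float(np.mean(cvs))                           # pool over groups

df = pd.DataFrame(np.random.lognormal(10, 0.3, (100, 6)),
                  columns=[f"s{i}" for i in range(6)])
groups = {f"s{i}": ("ctrl" if i < 3 else "treat") for i in range(6)}
print(pooled_cv(df, groups))
```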
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Reverse transcription and real-time PCR (RT-qPCR) has been widely used for rapid quantification of relative gene expression. To offset technical confounding variations, stably expressed internal reference genes are measured simultaneously along with target genes for data normalization. Statistical methods have been developed for reference validation; however, normalization of RT-qPCR data still remains arbitrary because particular reference genes are chosen before the experiment. To establish a method for determining the most stable normalizing factor (NF) across samples for robust data normalization, we measured the expression of 20 candidate reference genes and 7 target genes in 15 Drosophila head cDNA samples using RT-qPCR. The 20 reference genes exhibit sample-specific variation in their expression stability. Unexpectedly, the NF variation across samples does not decrease continuously with pairwise inclusion of more reference genes, suggesting that either too few or too many reference genes may compromise the robustness of data normalization. The optimal number of reference genes predicted by the minimal and most stable NF variation differs greatly, from 1 to more than 10, depending on the particular sample set. We also found that GstD1, InR and Hsp70 expression exhibits an age-dependent increase in fly heads; however, their relative expression levels are significantly affected by NFs built from different numbers of reference genes. Because the outcome depends strongly on the actual data, RT-qPCR reference genes therefore have to be validated and selected at the post-experimental data analysis stage rather than determined before the experiment.
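For illustration, the sketch below builds a normalization factor per sample as the geometric mean of the relative quantities of the selected reference genes and tracks its variation across samples as more genes are included; this geometric-mean construction is an assumption in the spirit of geNorm, not necessarily the study's exact NF definition:

```python
# Illustrative only: normalization factor (NF) per sample as the geometric
# mean of the relative quantities of the included reference genes, and the
# variation of NF across samples as more reference genes are added.
import numpy as np

def normalization_factor(ref_matrix):
    """ref_matrix: samples x reference-genes relative quantities."""
    return np.exp(np.log(ref_matrix).mean(axis=1))  # per-sample geometric mean

rng = np.random.default_rng(1)
rel_q = rng.lognormal(0, 0.2, (15, 20))  # 15 head samples, 20 candidate genes

for n_ref in (2, 5, 10, 20):
    nf = normalization_factor(rel_q[:, :n_ref])
    cv = nf.std(ddof=1) / nf.mean()      # NF variation across samples
    print(f"{n_ref:2d} reference genes: NF CV = {cv:.3f}")
```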
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Based on open access data, 79 Mediterranean passenger ports are analyzed to compare their infrastructure, hinterland accessibility and offered multi-modality. A comparative geo-spatial analysis is also carried out using a data normalization method in order to visualize the ports' performance on maps. These data-driven analytical results can bring added value to sustainable development policy and planning initiatives in the Mediterranean Region. The analyzed elements can also contribute to the development of passenger port performance indicators. The empirical research methods used for the Mediterranean passenger ports can be replicated for transport nodes of any region around the world to determine their relative performance on selected criteria for improvement and planning.
The Mediterranean passenger ports were initially categorized into cruise and ferry ports. The cruise ports were identified from the member list of the Association for the Mediterranean Cruise Ports (MedCruise), representing more than 80% of the cruise tourism activities per country. The identified cruise ports were mapped by selecting the corresponding geo-referenced ports from the map layer developed by the European Marine Observation and Data Network (EMODnet). The United Nations (UN) Code for Trade and Transport Locations (LOCODE) was used for each of the cruise ports as the common criterion to carry out the selection. The identified cruise ports not listed by the EMODnet were added to the geo-database by using, under license, the editing function of the ArcMap (version 10.1) geographic information system software. The ferry ports were identified from the open access industry initiative data provided by Ferrylines, and were mapped in the same way as the cruise ports (Figure 1).
Based on the available data from the identified cruise ports, a database (see Tables A1–A3) was created for a Mediterranean-scale analysis. The ferry ports were excluded due to the unavailability of relevant information on the selected criteria (Table 2). However, the cruise ports that also serve as ferry passenger ports were identified in order to maximize the scope of the analysis. Port infrastructure and hinterland accessibility data were collected from the recent statistical reports published by MedCruise, which are a compilation of data provided by its individual member port authorities and cruise terminal operators. Other supplementary sources were the European Sea Ports Organization (ESPO) and Global Ports Holding, a cruise terminal operator with an established presence in the Mediterranean. Additionally, open access data sources (e.g., Google Maps and TripAdvisor) were consulted in order to identify the multi-modal transport options and bridge the data gaps on hinterland accessibility by measuring approximate distances.
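As a simple illustration of putting heterogeneous port indicators on a common scale before mapping, the sketch below applies min-max normalization; the indicator columns are placeholders and the study's actual normalization formula is not reproduced here:

```python
# Illustrative min-max normalization of heterogeneous port indicators onto
# a common 0-1 scale for comparison and mapping. Indicator names are
# placeholders, not the study's actual criteria columns.
import pandas as pd

ports = pd.DataFrame({
    "berth_length_m": [1200, 450, 800],
    "distance_airport_km": [12, 35, 8],
    "rail_connection": [1, 0, 1],
}, index=["Port A", "Port B", "Port C"])

normalized = (ports - ports.min()) / (ports.max() - ports.min())
print(normalized.round(2))
```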
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This updated version includes a Python script (glucose_analysis.py) that performs statistical evaluation of the glucose normalization process described in the associated thesis. The script supports key analyses, including normality assessment (Shapiro–Wilk test), variance homogeneity (Levene’s test), mean comparison (ANOVA), effect size estimation (Cohen’s d), and calculation of confidence intervals for the mean difference. These results validate the impact of Min-Max normalization on clinical data structure and usability within CDSS workflows. The script is designed to be reproducible and complements the processed dataset already included in this repository.
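A sketch of the same battery of checks using scipy.stats is shown below; the synthetic data, group split and variable names are illustrative and not taken from glucose_analysis.py:

```python
# Illustrative reproduction of the reported checks (not the thesis script itself):
# normality (Shapiro-Wilk), variance homogeneity (Levene), mean comparison
# (one-way ANOVA), Cohen's d, and a 95% CI for the difference in means.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
raw = rng.normal(120, 25, 200)                        # synthetic glucose, mg/dL
scaled = (raw - raw.min()) / (raw.max() - raw.min())  # Min-Max normalization
a, b = raw[:100], raw[100:]                           # two illustrative groups

print("Shapiro-Wilk p:", stats.shapiro(scaled).pvalue)
print("Levene p:      ", stats.levene(a, b).pvalue)
print("ANOVA p:       ", stats.f_oneway(a, b).pvalue)

pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
d = (a.mean() - b.mean()) / pooled_sd                 # Cohen's d
se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
t_crit = stats.t.ppf(0.975, len(a) + len(b) - 2)
ci = (a.mean() - b.mean()) + np.array([-1, 1]) * t_crit * se
print(f"Cohen's d = {d:.3f}, 95% CI for mean difference = [{ci[0]:.2f}, {ci[1]:.2f}]")
```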
A number of procedures for normalization and detection of differentially expressed genes have been proposed. Four different normalization methods and all possible combinations with three different statistical algorithms have been used for detection of differentially expressed genes on a dataset. The number of genes detected as differentially expressed differs by a factor of about three.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
In the rapidly moving proteomics field, a diverse patchwork of data analysis pipelines and algorithms for data normalization and differential expression analysis is used by the community. We generated a mass spectrometry downstream analysis pipeline (MS-DAP) that integrates both popular and recently developed algorithms for normalization and statistical analyses; additional algorithms can easily be added in the future as plugins. MS-DAP is open source and facilitates transparent and reproducible proteome science by generating extensive data visualizations and quality reporting, provided as standardized PDF reports. Second, we performed a systematic evaluation of methods for normalization and statistical analysis on a large variety of data sets, including additional data generated in this study, which revealed key differences. Commonly used approaches for differential testing based on moderated t-statistics were consistently outperformed by more recent statistical models, all integrated in MS-DAP. Third, we introduced a novel normalization algorithm that rescues deficiencies observed in commonly used normalization methods. Finally, we used the MS-DAP platform to reanalyze a recently published large-scale proteomics data set of cerebrospinal fluid (CSF) from Alzheimer's disease (AD) patients. This revealed increased sensitivity, resulting in additional significant target proteins that improved overlap with results reported in related studies and include a large set of new potential AD biomarkers in addition to those previously reported.
Study comparing scaling with ranked subsampling (SRS) and rarefying for the normalization of species count data.
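For reference, the sketch below implements plain rarefying only (subsampling each sample without replacement to a common depth); the SRS algorithm itself, which scales counts before ranked subsampling, is not reproduced here:

```python
# Plain rarefying only (the comparison baseline), not the SRS algorithm:
# subsample a sample's species counts without replacement to a common depth.
import numpy as np

def rarefy(counts, depth, seed=0):
    """counts: 1-D array of species counts for one sample."""
    rng = np.random.default_rng(seed)
    pool = np.repeat(np.arange(len(counts)), counts)     # one entry per read
    keep = rng.choice(pool, size=depth, replace=False)   # draw without replacement
    return np.bincount(keep, minlength=len(counts))

sample = np.array([120, 30, 0, 5, 845])
print(rarefy(sample, depth=500))   # counts rescaled to 500 reads total
```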
https://creativecommons.org/publicdomain/zero/1.0/
In 2015/2016, PDX Monthly began to collect statistical summary data about Portland real estate on a neighborhood by neighborhood basis.
This data set aggregates and cleans up that content. The data normalization code can be found on GitHub.
In general, the dataset contains statistical aggregations (average sale price, average days on market, etc.). The PDX Monthly crowd did not provide the underlying data, and they changed their columns over the years, so the data is a bit messy.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In this article, we propose a class of test statistics for a change point in the mean of high-dimensional independent data. Our test integrates the U-statistic based approach in a recent work by Wang et al. and the Lq-norm based high-dimensional test in a recent work by He et al., and inherits several appealing features such as being tuning parameter free and asymptotic independence for test statistics corresponding to even q’s. A simple combination of test statistics corresponding to several different q’s leads to a test with adaptive power property, that is, it can be powerful against both sparse and dense alternatives. On the estimation front, we obtain the convergence rate of the maximizer of our test statistic standardized by sample size when there is one change-point in mean and q = 2, and propose to combine our tests with a wild binary segmentation algorithm to estimate the change-point number and locations when there are multiple change-points. Numerical comparisons using both simulated and real data demonstrate the advantage of our adaptive test and its corresponding estimation method.
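As a toy illustration of the q = 2 case, the sketch below locates a single mean change point as the maximizer of a squared CUSUM-type statistic; it is a simplified stand-in, not the authors' U-statistic or Lq-norm construction:

```python
# Toy illustration only: single change-point location via the maximizer of a
# squared CUSUM-type statistic (the q = 2 case). Not the paper's U-statistic test.
import numpy as np

def cusum_changepoint(X):
    """X: n x p data matrix; returns the estimated change-point index."""
    n = X.shape[0]
    scores = []
    for t in range(1, n):
        diff = X[:t].mean(axis=0) - X[t:].mean(axis=0)
        scale = t * (n - t) / n                 # balances the two segments
        scores.append(scale * np.sum(diff ** 2))  # squared L2 norm of mean difference
    return int(np.argmax(scores)) + 1

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (60, 50)), rng.normal(0.5, 1, (40, 50))])
print(cusum_changepoint(X))   # expected to be near 60
```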
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data Analysis is the process that supports decision-making and informs arguments in empirical studies. Descriptive statistics, Exploratory Data Analysis (EDA), and Confirmatory Data Analysis (CDA) are the approaches that compose Data Analysis (Xia & Gong, 2014). EDA comprises a set of statistical and data mining procedures to describe data. We ran EDA to provide statistical facts and inform conclusions. The mined facts help build the arguments that shape the Systematic Literature Review (SLR) of DL4SE.
The Systematic Literature Review of DL4SE requires formal statistical modeling to refine the answers to the proposed research questions and to formulate new hypotheses to be addressed in the future. Hence, we introduce DL4SE-DA, a set of statistical processes and data mining pipelines that uncover hidden relationships in the Deep Learning literature reported in Software Engineering. Such hidden relationships are collected and analyzed to illustrate the state of the art of DL techniques employed in the software engineering context.
Our DL4SE-DA is a simplified version of the classical Knowledge Discovery in Databases, or KDD, process (Fayyad et al., 1996). The KDD process extracts knowledge from a DL4SE structured database, which was the product of multiple iterations of data gathering and collection from the inspected literature. The KDD process involves five stages:
Selection. This stage was led by the taxonomy process explained in section xx of the paper. After collecting all the papers and creating the taxonomies, we organized the data into the 35 features or attributes that you find in the repository. In fact, we manually engineered the features from the DL4SE papers. Some of the features are venue, year published, type of paper, metrics, data scale, type of tuning, learning algorithm, SE data, and so on.
Preprocessing. The preprocessing applied consisted of transforming the features into the correct type (nominal), removing outliers (papers that do not belong to DL4SE), and re-inspecting the papers to fill in missing information produced by the normalization process. For instance, we normalized the feature “metrics” into “MRR”, “ROC or AUC”, “BLEU Score”, “Accuracy”, “Precision”, “Recall”, “F1 Measure”, and “Other Metrics”, where “Other Metrics” refers to unconventional metrics found during the extraction. The same normalization was applied to other features such as “SE Data” and “Reproducibility Types”. This separation into more detailed classes contributes to a better understanding and classification of the papers by the data mining tasks or methods.
Transformation. In this stage, we did not apply any data transformation method except for the clustering analysis, where we performed a Principal Component Analysis (PCA) to reduce the 35 features to 2 components for visualization purposes. PCA also allowed us to identify the number of clusters that exhibits the maximum reduction in variance; in other words, it helped us identify the number of clusters to use when tuning the explainable models.
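A minimal sketch of this transformation step is shown below, assuming the nominal paper features are one-hot encoded before projecting onto two principal components; the feature columns are placeholders for the 35 engineered attributes:

```python
# Illustrative sketch of the transformation step: one-hot encode the nominal
# paper features, then project to 2 principal components for visualization.
# Feature names and values are placeholders for the 35 engineered attributes.
import pandas as pd
from sklearn.decomposition import PCA

papers = pd.DataFrame({
    "venue": ["ICSE", "FSE", "ICSE"],
    "learning_algorithm": ["CNN", "RNN", "CNN"],
    "se_data": ["source code", "issues", "source code"],
})
encoded = pd.get_dummies(papers)                 # nominal -> binary indicators
components = PCA(n_components=2).fit_transform(encoded)
print(components)                                # 2-D coordinates per paper
```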
Data Mining. In this stage, we used three distinct data mining tasks: Correlation Analysis, Association Rule Learning, and Clustering. We decided that the goal of the KDD process should be oriented to uncovering hidden relationships among the extracted features (Correlations and Association Rules) and to categorizing the DL4SE papers for a better segmentation of the state of the art (Clustering). A detailed explanation is provided in the subsection “Data Mining Tasks for the SLR of DL4SE”.
Interpretation/Evaluation. We used the knowledge discovery process to automatically find patterns in our papers that resemble “actionable knowledge”. This actionable knowledge was generated by conducting a reasoning process on the data mining outcomes. This reasoning process produces an argument support analysis (see this link).
We used RapidMiner as our software tool to conduct the data analysis. The procedures and pipelines were published in our repository.
Overview of the most meaningful Association Rules. Rectangles are both Premises and Conclusions. An arrow connecting a Premise with a Conclusion implies that, given the premise, the conclusion is associated with it. For example, given that an author used Supervised Learning, we can conclude that their approach is irreproducible, with a certain Support and Confidence.
Support = the number of records (papers) for which the whole statement (premise and conclusion together) holds, divided by the total number of records.
Confidence = the support of the statement divided by the support of the premise.
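A small worked example of these two measures on a toy set of papers (the counts are invented):

```python
# Toy computation of support and confidence for the rule
# "Supervised Learning -> Irreproducible" over a small set of papers.
papers = [
    {"supervised": True,  "irreproducible": True},
    {"supervised": True,  "irreproducible": True},
    {"supervised": True,  "irreproducible": False},
    {"supervised": False, "irreproducible": False},
]
n = len(papers)
both = sum(p["supervised"] and p["irreproducible"] for p in papers)   # premise and conclusion
premise = sum(p["supervised"] for p in papers)                         # premise alone

support = both / n            # 2 / 4 = 0.50
confidence = both / premise   # 2 / 3 = 0.67 (approximately)
print(f"support = {support:.2f}, confidence = {confidence:.2f}")
```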
CERN-SPS. NA4/BCDMS Collaboration. Plab 120–280 GeV/c. These are data from the BCDMS Collaboration on F2 and R = SIG(L)/SIG(T) with a deuterium target. The ranges of X and Q**2 are 0.06 < X < 0.8 and 8 < Q**2 < 260 GeV**2. The publication lists values of F2 corresponding to R = 0 and R = R(QCD) at each of the three energies, 120, 200 and 280 GeV. As well as the statistical errors, also given are 5 factors representing the effects of estimated systematic errors on F2 associated with (1) beam momentum calibration, (2) magnetic field calibration, (3) spectrometer resolution, (4) detector and trigger inefficiencies, and (5) relative normalization uncertainty of data taken from external and internal targets.
This record contains our attempt to merge the data taken at the different energies using the statistical errors as weight factors. The final one-sigma systematic errors given here have been calculated using a prescription from the authors: new merged F2 values are computed with each of the systematic errors applied individually, and the differences between these new merged F2 values and the original merged F2 are then combined in quadrature.
The individual F2 values at each energy (Plab 120, 200 and 280 GeV/c) are given in separate database records. In those records, the systematic error shown in the tables is the result of combining the 5 individual errors according to the same prescription provided by the authors, i.e. taking the quadratic sum of the errors from each source.
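As a rough illustration of the two ingredients above, the sketch below performs a statistical-error-weighted merge of F2 values measured at different beam energies and combines per-source systematic shifts in quadrature; all numbers are invented:

```python
# Illustration only (invented numbers): merge F2 values measured at different
# beam energies using inverse statistical variances as weights, and combine
# the per-source systematic shifts of the merged value in quadrature.
import numpy as np

f2 = np.array([0.352, 0.348, 0.355])     # F2 at 120, 200, 280 GeV (same x, Q^2 bin)
stat = np.array([0.006, 0.005, 0.007])   # statistical errors

w = 1.0 / stat**2
f2_merged = np.sum(w * f2) / np.sum(w)   # statistical-error-weighted merge
stat_merged = 1.0 / np.sqrt(np.sum(w))

# shifts of the merged F2 when each systematic source is applied individually
sys_shifts = np.array([0.002, 0.001, 0.0015, 0.001, 0.003])
sys_merged = np.sqrt(np.sum(sys_shifts**2))   # quadratic sum over sources

print(f"F2 = {f2_merged:.4f} +- {stat_merged:.4f} (stat) +- {sys_merged:.4f} (sys)")
```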
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Statistical mirroring is the measure of the proximity or deviation of transformed data points from a specified location estimate within a given distribution [2]. Within the framework of Kabirian-based optinalysis [1], statistical mirroring is conceptualized as the isoreflectivity of the transformed data points to a defined statistical mirror. This statistical mirror is an amplified location estimate of the distribution, achieved through a specified size or length. The location estimate may include parameters such as the mean, median, mode, maximum, minimum, or reference value [2]. The process of statistical mirroring comprises two distinct phases:
a) Preprocessing phase [2]: This involves applying preprocessing transformations, such as compulsory theoretical ordering, with or without centering the data. It also encompasses tasks like statistical mirror design and optimizations within the established optinalytic construction. These optimizations include selecting an efficient pairing style, central normalization, and establishing an isoreflective pair between the preprocessed data and its designed statistical mirror.
b) Optinalytic model calculation phase [1]: This phase is focused on computing estimates based on Kabirian-based isomorphic optinalysis models.
References:
[1] K.B. Abdullahi, Kabirian-based optinalysis: A conceptually grounded framework for symmetry/asymmetry, similarity/dissimilarity, and identity/unidentity estimations in mathematical structures and biological sequences, MethodsX 11 (2023) 102400. https://doi.org/10.1016/j.mex.2023.102400
[2] K.B. Abdullahi, Statistical mirroring: A robust method for statistical dispersion estimation, MethodsX 12 (2024) 102682. https://doi.org/10.1016/j.mex.2024.102682
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Raw data corresponding to Gabay, Y. & Holt, L. L. (submitted). Short-Term Adaptation to Sound Statistics is Unimpaired in Developmental Dyslexia. The data support a study examining the influence of the long-term average spectrum of preceding speech and nonspeech sounds on speech categorization among adult listeners with developmental dyslexia and typical control listeners.
In this project, I performed exploratory data analysis on the UCI Automobile dataset available at https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data
This dataset consists of data from the 1985 Ward's Automotive Yearbook. Here are the sources:
1) 1985 Model Import Car and Truck Specifications, 1985 Ward's Automotive Yearbook.
2) Personal Auto Manuals, Insurance Services Office, 160 Water Street, New York, NY 10038.
3) Insurance Collision Report, Insurance Institute for Highway Safety, Watergate 600, Washington, DC 20037.
Number of Instances: 205; Number of Attributes: 26
Attribute Information:
symboling (the insurance risk rating), normalized-losses, and 24 specification attributes covering make, fuel type, aspiration, number of doors, body style, drive wheels, engine location and characteristics, dimensions, curb weight, horsepower, peak rpm, city/highway mpg, and price.
This data set consists of three types of entities:
I - The specification of an auto in terms of various characteristics
II - Its assigned insurance risk rating. This corresponds to the degree to which the auto is riskier than its price indicates. Cars are initially assigned a risk factor symbol associated with their price; then, if a car is more (or less) risky, this symbol is adjusted by moving it up (or down) the scale. Actuaries call this process "symboling".
III - Its normalized losses in use as compared to other cars. This is the relative average loss payment per insured vehicle year. This value is normalized for all autos within a particular size classification (two-door small cars, station wagons, sports/specialty, etc.) and represents the average loss per car per year.
The analysis is divided into two parts:
Data Wrangling
Exploratory Data Analysis
Descriptive statistics
Groupby
Analysis of variance
Correlation
Correlation stats
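A brief sketch of these steps with pandas and scipy is shown below; the column names follow the imports-85 data dictionary, and the exact notebook code may differ:

```python
# Short sketch of the listed steps: load imports-85, treat "?" as missing,
# then descriptive statistics, groupby, one-way ANOVA, and correlation.
import pandas as pd
from scipy import stats

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"
cols = ["symboling", "normalized-losses", "make", "fuel-type", "aspiration",
        "num-of-doors", "body-style", "drive-wheels", "engine-location",
        "wheel-base", "length", "width", "height", "curb-weight", "engine-type",
        "num-of-cylinders", "engine-size", "fuel-system", "bore", "stroke",
        "compression-ratio", "horsepower", "peak-rpm", "city-mpg",
        "highway-mpg", "price"]
df = pd.read_csv(url, names=cols, na_values="?")

print(df[["horsepower", "price"]].describe())                 # descriptive statistics
print(df.groupby("body-style")["price"].mean())               # groupby
groups = [g["price"].dropna() for _, g in df.groupby("drive-wheels")]
print(stats.f_oneway(*groups))                                # analysis of variance
print(df[["engine-size", "horsepower", "price"]].corr())      # correlation
```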
Acknowledgment
Dataset: UCI Machine Learning Repository
Data link: https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data
Background: High throughput molecular-interaction studies using immunoprecipitations (IP) or affinity purifications are powerful and widely used in biology research. One of many important applications of this method is to identify the set of RNAs that interact with a particular RNA-binding protein (RBP). Here, the unique statistical challenge presented is to delineate a specific set of RNAs that are enriched in one sample relative to another, typically a specific IP compared to a non-specific control to model background. The choice of normalization procedure critically impacts the number of RNAs that will be identified as interacting with an RBP at a given significance threshold – yet existing normalization methods make assumptions that are often fundamentally inaccurate when applied to IP enrichment data.
Methods: In this paper, we present a new normalization methodology that is specifically designed for identifying enriched RNA or DNA sequences in an IP. The normalization (called adaptive or AD normalization) uses a basic model of the IP experiment and is not a variant of mean, quantile, or other methodology previously proposed. The approach is evaluated statistically and tested with simulated and empirical data.
Results and Conclusions: The adaptive (AD) normalization method results in a greatly increased range in the number of enriched RNAs identified, fewer false positives, and overall better concordance with independent biological evidence, for the RBPs we analyzed, compared to median normalization. The approach is also applicable to the study of pairwise RNA, DNA and protein interactions such as the analysis of transcription factors via chromatin immunoprecipitation (ChIP) or any other experiments where samples from two conditions, one of which contains an enriched subset of the other, are studied.
https://entrepot.recherche.data.gouv.fr/api/datasets/:persistentId/versions/3.0/customlicense?persistentId=doi:10.15454/1.4811121736910142E12
Study: Characterization of the physiological variations of the metabolome in biofluids is critical to understand human physiology and to avoid confounding effects in cohort studies aiming at biomarker discovery.
Dataset: In this study conducted by the MetaboHUB French Infrastructure for Metabolomics, urine samples from 184 volunteers were analyzed by reversed-phase (C18) ultrahigh performance liquid chromatography (UPLC) coupled to high-resolution mass spectrometry (LTQ-Orbitrap). A total of 258 metabolites were identified at confidence levels 1 or 2 of the metabolomics standards initiative (MSI).
Workflow: This history describes the statistical analysis of the data set from the negative ionization mode (113 identified metabolites at MSI levels 1 or 2): correction of signal drift (loess model built on QC pools) and batch effects (two batches), variable filtering (QC coefficient of variation < 30%), normalization by the sample osmolality, log10 transformation, sample filtering (Hotelling, decile and missing-value p-values > 0.001) resulting in the HU_096 sample being discarded, univariate hypothesis testing of significant variations with age, BMI, or between genders (FDR < 0.05), and OPLS(-DA) modeling of age, BMI and gender.
Comments: The ‘sacurine’ data set (after normalization and filtering) is also available in the ropls R package from the Bioconductor repository. For a comprehensive analysis of the dataset (starting from the preprocessing of the raw files and including all detected features in the subsequent steps), please see the companion ‘W4M00002_Sacurine-comprehensive’ reference history.
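For illustration, the sketch below reproduces three of the listed steps, CV-based variable filtering on the QC pools, normalization by sample osmolality, and log10 transformation; the data layout and thresholds shown are assumptions, not the W4M history itself:

```python
# Illustrative sketch (not the W4M workflow itself): filter variables by QC
# coefficient of variation, normalize intensities by sample osmolality, and
# apply a log10 transformation.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
X = pd.DataFrame(rng.lognormal(10, 0.5, (20, 113)))   # samples x metabolites
is_qc = np.array([i % 5 == 0 for i in range(20)])     # every 5th sample is a QC pool
osmolality = pd.Series(rng.normal(700, 150, 20))      # mOsm/kg, one value per sample

qc_cv = X[is_qc].std(ddof=1) / X[is_qc].mean()        # CV computed on QC pools
X = X.loc[:, qc_cv < 0.30]                            # keep variables with CV < 30%

X = X.div(osmolality, axis=0)                         # osmolality normalization
X = np.log10(X)                                       # log10 transformation
print(X.shape)
```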
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The advent of high-throughput technologies such as ChIP-seq has made possible the study of histone modifications. A problem of particular interest is the identification of regions of the genome where different cell types from the same organism exhibit different patterns of histone enrichment. This problem turns out to be surprisingly difficult, even in simple pairwise comparisons, because of the significant level of noise in ChIP-seq data. In this paper we propose a two-stage statistical method, called ChIPnorm, to normalize ChIP-seq data and to find differential regions in the genome, given two libraries of histone modifications from different cell types. We show that the ChIPnorm method removes most of the noise and bias in the data and outperforms other normalization methods. We correlate the histone marks with gene expression data and confirm that the histone modifications H3K27me3 and H3K4me3 act, respectively, as a repressor and an activator of genes. Compared to what was previously reported in the literature, we find that a substantially higher fraction of bivalent marks in ES cells for H3K27me3 and H3K4me3 move into a K27-only state. We find that most of the promoter regions in protein-coding genes have differential histone-modification sites. The software for this work can be downloaded from http://lcbb.epfl.ch/software.html.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Peru Social Security: Number of Contributors data was reported at 568,178.000 Person in 2017. This records an increase from the previous number of 559,250.000 Person for 2016. Peru Social Security: Number of Contributors data is updated yearly, averaging 473,923.000 Person from Dec 1995 (Median) to 2017, with 23 observations. The data reached an all-time high of 568,178.000 Person in 2017 and a record low of 319,254.000 Person in 1995. Peru Social Security: Number of Contributors data remains active status in CEIC and is reported by Social Security Normalization Office. The data is categorized under Global Database’s Peru – Table PE.G010: Social Security Statistics.