60 datasets found
  1. Data_Sheet_1_NormExpression: An R Package to Normalize Gene Expression Data...

    • frontiersin.figshare.com
    application/cdfv2
    Updated Jun 1, 2023
    + more versions
    Cite
    Zhenfeng Wu; Weixiang Liu; Xiufeng Jin; Haishuo Ji; Hua Wang; Gustavo Glusman; Max Robinson; Lin Liu; Jishou Ruan; Shan Gao (2023). Data_Sheet_1_NormExpression: An R Package to Normalize Gene Expression Data Using Evaluated Methods.doc [Dataset]. http://doi.org/10.3389/fgene.2019.00400.s001
    Explore at:
    application/cdfv2 (available download formats)
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers
    Authors
    Zhenfeng Wu; Weixiang Liu; Xiufeng Jin; Haishuo Ji; Hua Wang; Gustavo Glusman; Max Robinson; Lin Liu; Jishou Ruan; Shan Gao
    License

    Attribution 4.0 (CC BY 4.0)
    https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data normalization is a crucial step in gene expression analysis, as it ensures the validity of downstream analyses. Although many metrics have been designed to evaluate existing normalization methods, different metrics, or the same metric applied to different datasets, yield inconsistent results, particularly for single-cell RNA sequencing (scRNA-seq) data. In the worst cases, a method evaluated as the best by one metric is evaluated as the poorest by another metric, or a method evaluated as the best using one dataset is evaluated as the poorest using another dataset. This raises an open question: principles need to be established to guide the evaluation of normalization methods. In this study, we propose the principle that a normalization method evaluated as the best by one metric should also be evaluated as the best by another metric (the consistency of metrics), and that a method evaluated as the best using scRNA-seq data should also be evaluated as the best using bulk RNA-seq data or microarray data (the consistency of datasets). We then designed a new metric named Area Under normalized CV threshold Curve (AUCVC) and applied it, together with another metric, mSCC, to evaluate 14 commonly used normalization methods using both scRNA-seq and bulk RNA-seq data, satisfying the consistency of metrics and the consistency of datasets. Our findings pave the way for future studies on the normalization of gene expression data and its evaluation. The raw gene expression data, normalization methods, and evaluation metrics used in this study are included in an R package named NormExpression. NormExpression provides a framework and a fast, simple way for researchers to select the best method for normalizing their gene expression data, based on the evaluation of different methods (particularly data-driven methods or their own methods) under the principles of the consistency of metrics and the consistency of datasets.

  2. Identification of Novel Reference Genes Suitable for qRT-PCR Normalization...

    • plos.figshare.com
    tiff
    Updated May 31, 2023
    Cite
    Yu Hu; Shuying Xie; Jihua Yao (2023). Identification of Novel Reference Genes Suitable for qRT-PCR Normalization with Respect to the Zebrafish Developmental Stage [Dataset]. http://doi.org/10.1371/journal.pone.0149277
    Explore at:
    tiff (available download formats)
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Yu Hu; Shuying Xie; Jihua Yao
    License

    Attribution 4.0 (CC BY 4.0)
    https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Reference genes used to normalize qRT-PCR data are critical for the accuracy of gene expression analysis. However, many traditional reference genes used in zebrafish early development are not appropriate because of their variable expression levels during embryogenesis. In the present study, we used our previous RNA-Seq dataset to identify novel reference genes suitable for gene expression analysis during zebrafish early developmental stages. We first selected the 197 most stably expressed genes from an RNA-Seq dataset (29,291 genes in total), according to the ratio of their maximum to minimum RPKM values. Among these 197 genes, 4 genes with moderate expression levels and the least variation across 9 developmental stages were identified as candidate reference genes. Using four independent statistical algorithms (delta-CT, geNorm, BestKeeper and NormFinder), the stability of qRT-PCR expression of these candidates was then evaluated and compared to that of actb1 and actb2, two commonly used zebrafish reference genes. Stability rankings showed that two genes, mobk13 (mob4) and lsm12b, were more stable than actb1 and actb2 in most cases. To further test the suitability of mobk13 and lsm12b as novel reference genes, they were used to normalize three well-studied target genes. The results showed that mobk13 and lsm12b were more suitable than actb1 and actb2 with respect to zebrafish early development. We recommend mobk13 and lsm12b as new optimal reference genes for zebrafish qRT-PCR analysis during embryogenesis and early larval stages.
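
    A minimal R sketch of the selection step described above, on a hypothetical genes-by-stages RPKM matrix; a simple coefficient of variation stands in here for the four stability algorithms named in the abstract.

    rpkm <- matrix(rexp(1000 * 9, rate = 0.1), nrow = 1000,
                   dimnames = list(paste0("gene", 1:1000), paste0("stage", 1:9)))
    ratio <- apply(rpkm, 1, max) / apply(rpkm, 1, min)   # max/min RPKM ratio per gene
    stable197 <- rpkm[order(ratio)[1:197], ]             # 197 most stably expressed genes
    cv <- apply(stable197, 1, sd) / rowMeans(stable197)  # stand-in stability score
    head(sort(cv), 4)                                    # four least-variable candidates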

  3. Methods for normalizing microbiome data: an ecological perspective

    • search.dataone.org
    • data.niaid.nih.gov
    • +1 more
    Updated Apr 11, 2025
    Cite
    Donald T. McKnight; Roger Huerlimann; Deborah S. Bower; Lin Schwarzkopf; Ross A. Alford; Kyall R. Zenger (2025). Methods for normalizing microbiome data: an ecological perspective [Dataset]. http://doi.org/10.5061/dryad.tn8qs35
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Donald T. McKnight; Roger Huerlimann; Deborah S. Bower; Lin Schwarzkopf; Ross A. Alford; Kyall R. Zenger
    Time period covered
    Oct 24, 2019
    Description
    1. Microbiome sequencing data often need to be normalized due to differences in read depths, and recommendations for microbiome analyses generally warn against using proportions or rarefying to normalize data, instead advocating alternatives such as upper quartile, CSS, edgeR-TMM, or DESeq-VS. Those recommendations are, however, based on studies that focused on differential abundance testing and variance standardization, rather than community-level comparisons (i.e., beta diversity). Also, standardizing the within-sample variance across samples may suppress differences in species evenness, potentially distorting community-level patterns. Furthermore, the recommended methods use log transformations, which we expect to exaggerate the importance of differences among rare OTUs while suppressing the importance of differences among common OTUs.
    2. We tested these theoretical predictions via simulations and a real-world data set.
    3. Proportions and rarefying produced more accurate compariso...
  4. Data from: Adapting Phrase-based Machine Translation to Normalise Medical...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 24, 2020
    Cite
    Collier, Nigel (2020). Adapting Phrase-based Machine Translation to Normalise Medical Terms in Social Media Messages [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_27354
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Limsopatham, Nut
    Collier, Nigel
    License

    CC0 1.0 Universal Public Domain Dedication
    https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Data and supplementary information for the paper entitled "Adapting Phrase-based Machine Translation to Normalise Medical Terms in Social Media Messages" to be published at EMNLP 2015: Conference on Empirical Methods in Natural Language Processing — September 17–21, 2015 — Lisboa, Portugal.

    ABSTRACT: Previous studies have shown that health reports in social media, such as DailyStrength and Twitter, have potential for monitoring health conditions (e.g. adverse drug reactions, infectious diseases) in particular communities. However, in order for a machine to understand and make inferences on these health conditions, the ability to recognise when laymen's terms refer to a particular medical concept (i.e. text normalisation) is required. To achieve this, we propose to adapt an existing phrase-based machine translation (MT) technique and a vector representation of words to map between a social media phrase and a medical concept. We evaluate our proposed approach using a collection of phrases from tweets related to adverse drug reactions. Our experimental results show that the combination of a phrase-based MT technique and the similarity between word vector representations outperforms the baselines that apply only either of them by up to 55%.

  5. Normalization techniques for PARAFAC modeling of urine metabolomics data

    • data.niaid.nih.gov
    xml
    Updated May 11, 2017
    Cite
    Radana Karlikova (2017). Normalization techniques for PARAFAC modeling of urine metabolomics data [Dataset]. https://data.niaid.nih.gov/resources?id=mtbls290
    Explore at:
    xml (available download formats)
    Dataset updated
    May 11, 2017
    Dataset provided by
    IMTM, Faculty of Medicine and Dentistry, Palacky University Olomouc, Hnevotinska 5, 775 15 Olomouc, Czech Republic
    Authors
    Radana Karlikova
    Variables measured
    Sample type, Metabolomics, Sample collection time
    Description

    One of the body fluids often used in metabolomics studies is urine. The peak intensities of metabolites in urine are affected by an individual's urine history, resulting in dilution differences. This therefore requires normalization of the data to correct for such differences. Two normalization techniques are commonly applied to urine samples prior to further statistical analysis. First, AUC normalization standardizes the area under the curve (AUC) within a sample to the median, mean, or another suitable representation of the amount of dilution. The second approach uses specific end-product metabolites such as creatinine: all intensities within a sample are expressed relative to the creatinine intensity. Another way of looking at urine metabolomics data is to recognize that the ratios between peak intensities are the information-carrying features. This opens up the possibility of using another class of data analysis techniques designed to deal with such ratios: compositional data analysis. In this approach, special transformations are defined to deal with the ratio problem. In essence, it comes down to using a distance measure other than the Euclidean distance used in the conventional analysis of metabolomics data. We illustrate this type of approach in combination with three-way methods (i.e., PARAFAC) for cases where samples of some biological material are measured at multiple time points. The aim of the paper is to develop PARAFAC modeling of three-way metabolomics data in the context of compositional data and to compare this with standard normalization techniques for the specific case of urine metabolomics data.
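
    A minimal R sketch of the two normalizations described above, plus the centered log-ratio transform used in compositional data analysis; 'peaks' is a hypothetical samples-by-metabolites intensity matrix with a creatinine column.

    peaks <- matrix(runif(5 * 10, 1, 100), nrow = 5,
                    dimnames = list(paste0("s", 1:5),
                                    c("creatinine", paste0("m", 1:9))))
    auc_norm  <- peaks / rowSums(peaks) * median(rowSums(peaks))  # AUC normalization
    crea_norm <- peaks / peaks[, "creatinine"]                    # creatinine ratios
    clr       <- log(peaks) - rowMeans(log(peaks))                # compositional (CLR) view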

  6. $\pi^{-} + p$ elastic scattering in the neighbourhood of $N^{*}_{1/2}$ (2190)

    • hepdata.net
    Updated Sep 2, 2015
    Cite
    (2015). $\pi^{-} + p$ elastic scattering in the neighbourhood of $N^{*}_{1/2}$ (2190) [Dataset]. http://doi.org/10.17182/hepdata.37568.v1
    Explore at:
    Dataset updated
    Sep 2, 2015
    Description

    THE FOLLOWING COMMENTS ARE TAKEN FROM THE PI N COMPILATION OF R.L. KELLY. THEY ARE THAT COMPILATION'S COMPLETE SET OF COMMENTS FOR PAPERS RELATED TO THE SAME EXPERIMENT (DESIGNATED BUSZA69) AS THE CURRENT PAPER. (THE IDENTIFIER PRECEDING THE REFERENCE AND COMMENT FOR EACH PAPER IS FOR CROSS-REFERENCING WITHIN THESE COMMENTS ONLY AND DOES NOT NECESSARILY AGREE WITH THE SHORT CODE USED ELSEWHERE IN THE PRESENT COMPILATION.) /// BELLAMY65 [E. H. BELLAMY, PROC. ROY. SOC. (LONDON) 289, 509 (1965)] -- /// BUSZA67 [W. BUSZA, NC 52A, 331 (1967)] -- PI- P DCS FROM 2K ELASTIC EVENTS AT EACH OF 5 MOMENTA BETWEEN 1.72 AND 2.46 GEV/C. DONE AT NIMROD WITH OPTICAL SPARK CHAMBERS. THE APPARATUS IS DESCRIBED IN BELLAMY65, THE RESULTS IN BUSZA67. /// BUSZA69 [W. BUSZA, PR 180, 1339 (1969)] -- PI+ P DCS AT 10 MOMENTA BETWEEN 1.72 AND 2.80 GEV/C, AND PI- P DCS AT 5 MOMENTA BETWEEN 2.17 AND 2.80 GEV/C. THE DATA REPORTED IN BUSZA67 ARE ALSO REPEATED HERE. THE NEW MEASUREMENTS WERE DONE WITH AN IMPROVED VERSION OF THE APPARATUS USED BY BUSZA67. THE PI- DATA (INCLUDING BUSZA67) ARE NORMALIZED TO FORWARD DISPERSION RELATIONS; THE PI+ DATA HAS ITS OWN EXPERIMENTAL NORMALIZATION BUT NO NE IS GIVEN. WE HAVE INCREASED THE ERROR OF THE MOST FORWARD PI+ POINT AT 1.72 GEV/C BECAUSE OF AN AMBIGUOUS FOOTNOTE CONCERNING THIS POINT. /// COMMENTS FROM LOVELACE71 COMPILATION OF THESE DATA -- LOVELACE71 CLAIMS SOME USE WAS MADE OF FORWARD DISPERSION RELATIONS TO NORMALIZE THE PI+ DATA AS WELL AS THE PI-. THE FOLLOWING NORMALIZATION ERRORS AND RENORMALIZATION FACTORS ARE RECOMMENDED FOR THE PI+ P AND PI- P DIFFERENTIAL CROSS SECTIONS -- PLAB=1720 MEV/C -- NE(PI+ P)=INFIN, NE(PI- P)=INFIN. PLAB=1890 MEV/C -- RF(PI+ P)=1.245, RF(PI- P)=0.941. PLAB=2070 MEV/C -- NE(PI+ P)=INFIN, RF(PI- P)=1.224. PLAB=2170 MEV/C -- NE(PI+ P)=0.1, NE(PI- P)=0.1. PLAB=2270 MEV/C -- NE(PI+ P)=0.1, NE(PI- P)=INFIN. PLAB=2360 MEV/C -- NE(PI+ P)=0.1, NE(PI- P)=0.1. PLAB=2460 MEV/C -- NE(PI+ P)=0.1, NE(PI- P)=INFIN. PLAB=2560 MEV/C -- NE(PI+ P)=0.1, NE(PI- P)=0.1. PLAB=2650 MEV/C -- NE(PI+ P)=0.1, NE(PI- P)=0.1. PLAB=2800 MEV/C -- NE(PI+ P)=0.1, NE(PI- P)=0.1. /// COMMENTS ON MODIFICATIONS TO LOVELACE71 COMPILATION BY KELLY -- WE HAVE TAKEN ALL PI- NES TO BE INFINITE, AND ALL PI+ NES TO BE UNKNOWN. ALSO ONE MINOR MISTAKE IN THE PI- (PI+) DATA AT 2.36 (2.65) GEV/C HAS BEEN CORRECTED. DATA ARE UNNORMALIZED OR NORMALIZED TO OTHER DATA.

  7. Benchmarking Synchronization Techniques for Distributed Energy Sources:...

    • data.mendeley.com
    Updated May 3, 2022
    + more versions
    Cite
    Ahmed Safa (2022). Benchmarking Synchronization Techniques for Distributed Energy Sources: Application to Open Loop Synchronization Techniques [Dataset]. http://doi.org/10.17632/y2yc5kjmkf.3
    Explore at:
    Dataset updated
    May 3, 2022
    Authors
    Ahmed Safa
    License

    Attribution 4.0 (CC BY 4.0)
    https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the raw data of the performance of three open-loop synchronization techniques using a proposed benchmark. The RMSE folder contains the data of Table 2 of the article. In the main script, the data has been normalized. The folder contains the original data for each case of the benchmark. The radar chart file contains the original data of the radar chart (Fig. 14). The data has been manipulated to better present it in a radar chart format. First, the values are inverted. Then, we normalize the data according to the highest value: the method with the highest value (best performance) takes the value 10, and the other methods fall below it. Formulas are included in the MS Excel file.
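
    A minimal R sketch of the radar-chart scaling just described, using hypothetical RMSE values; lower RMSE is better, so values are inverted before scaling so that the best method scores 10.

    rmse  <- c(method_A = 0.12, method_B = 0.35, method_C = 0.20)  # hypothetical values
    inv   <- 1 / rmse               # invert: higher now means better
    radar <- inv / max(inv) * 10    # best method takes the value 10
    radar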

  8. JNCC Sentinel-2 indices Analysis Ready Data (ARD) Normalised Burn Ratio...

    • catalogue.ceda.ac.uk
    • data-search.nerc.ac.uk
    Updated Dec 3, 2022
    Cite
    Joint Nature Conservation Committee (JNCC) (2022). JNCC Sentinel-2 indices Analysis Ready Data (ARD) Normalised Burn Ratio (NBR) v1 [Dataset]. https://catalogue.ceda.ac.uk/uuid/6df6b803c2784b8ab9e03834bf9a4337
    Explore at:
    Dataset updated
    Dec 3, 2022
    Dataset provided by
    Centre for Environmental Data Analysis
    http://www.ceda.ac.uk/
    Authors
    Joint Nature Conservation Committee (JNCC)
    License

    Open Government Licence 3.0
    http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
    License information was derived automatically

    Description

    Sentinel Hub NBR description: to detect burned areas, the NBR-RAW index is the most appropriate choice. Using bands 8 and 12, it highlights burnt areas in large fire zones greater than 500 acres. To observe burn severity, you may subtract the post-fire NBR image from the pre-fire NBR image. Darker pixels indicate burned areas.

    NBR = (NIR – SWIR) / (NIR + SWIR)

    Sentinel-2 NBR = (B08 - B12) / (B08 + B12)
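
    A minimal R sketch of the NBR computation above, using the terra package; the file names are hypothetical, and the 20 m SWIR band must be resampled onto the 10 m NIR grid first.

    library(terra)
    b08 <- rast("S2_B08.tif")   # NIR band, 10 m (hypothetical file)
    b12 <- rast("S2_B12.tif")   # SWIR band, 20 m (hypothetical file)
    b12 <- resample(b12, b08)   # bring SWIR onto the NIR grid
    nbr <- (b08 - b12) / (b08 + b12)
    # burn severity (dNBR): subtract a post-fire NBR raster from a pre-fire one
    writeRaster(nbr, "S2_NBR.tif")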

    These data have been created by the Joint Nature Conservation Committee (JNCC) as part of a Defra Natural Capital & Ecosystem Assessment (NCEA) project to produce a regional, and ultimately national, system for detecting a change in habitat condition at a land parcel level. The first stage of the project is focused on Yorkshire, UK, and therefore the dataset includes granules and scenes covering Yorkshire and surrounding areas only. The dataset contains the following indices derived from Defra and JNCC Sentinel-2 Analysis Ready Data.

    NDVI, NDMI, NDWI, NBR, and EVI files are generated for the following Sentinel-2 granules: • T30UWE • T30UXF • T30UWF • T30UXE • T31UCV • T30UYE • T31UCA

    As the project continues, JNCC will expand the geographical coverage of this dataset and will provide continuous updates as ARD becomes available.

  9. Brain tumor MRI and CT scan

    • kaggle.com
    Updated Oct 4, 2022
    Cite
    chenghan pu (2022). Brain tumor MRI and CT scan [Dataset]. https://www.kaggle.com/datasets/chenghanpu/brain-tumor-mri-and-ct-scan/data
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Oct 4, 2022
    Dataset provided by
    Kaggle
    Authors
    chenghan pu
    Description

    A novel brain tumor dataset containing 4500 2D MRI-CT slices. The original MRI and CT scans are also contained in this dataset.

    Pre-processing strategy: the pre-processing pipeline includes pairing MRI and CT scans according to a specific time interval between the CT and MRI scans of the same patient, MRI image registration to a standard template, MRI-CT image registration, intensity normalization, and extraction of 2D slices from 3D volumes. The pipeline can be used to obtain classic 2D MRI-CT images from 3D DICOM-format MRI and CT scans, which can be used directly as training data for end-to-end synthetic-CT deep learning networks.

    Pairing MRI and CT scans: if the time interval between the MRI and CT scans is too long, the information in the MRI and CT images will not match. Therefore, we pair MRI and CT scans of the same patient according to a time interval that should not exceed half a year.

    MRI image registration: considering differences both in the human brain and in the spatial coordinates of images during scanning, the dataset must avoid individual differences and unify the coordinates, meaning all CT and MRI images are registered to a standard template. The generated images are more accurate after registration. The template, proposed by the Montreal Neurological Institute, is the MNI ICBM 152 non-linear 6th Generation Symmetric Average Brain Stereotaxic Registration Model (MNI152) (Grabner et al., 2006). Affine registration is first applied to register MRI scans to the MNI152 template.

    Intensity normalization: the registered scans contain some extreme values, which introduce errors that would affect generation accuracy. We normalize the image data and eliminate these extreme values by clipping: pixel values in the top 1% and bottom 1% are replaced with the 99th- and 1st-percentile values, respectively.

    Extracting 2D slices from 3D volumes: after registration, the 3D MRI and CT scans are represented as 237×197×189 matrices. To ensure compatibility between training models and inputs, each 3D image is sliced, and 4500 2D MRI-CT image pairs are selected as the final training data.
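
    A minimal R sketch of the percentile clipping described above, applied to a hypothetical 3D intensity array.

    vol <- array(rnorm(64^3), dim = c(64, 64, 64))   # hypothetical registered volume
    q <- quantile(vol, c(0.01, 0.99))
    vol[vol < q[1]] <- q[1]   # bottom 1% -> 1st-percentile value
    vol[vol > q[2]] <- q[2]   # top 1%    -> 99th-percentile value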

    Source database: 1. https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=33948305 2. https://wiki.cancerimagingarchive.net/display/Public/CPTAC-GBM 3. https://wiki.cancerimagingarchive.net/display/Public/TCGA-GBM

    Patient information: Number of patients: 41

    Introduction of each file:
    Dicom: contains the source files collected from the three websites above.
    data(processed): contains the processed data, saved as .npy files. You can use train_input.npy and train_output.npy as the input and output of the encoder-decoder structure to train the model. The test and validation inputs and outputs can be used as test and validation datasets.

  10. Pi-plus-minus p elastic scattering in the 2-GeV region

    • doi.org
    • hepdata.net
    Updated Sep 2, 2015
    Cite
    (2015). Pi-plus-minus p elastic scattering in the 2-GeV region [Dataset]. http://doi.org/10.17182/hepdata.6227.v1
    Explore at:
    Dataset updated
    Sep 2, 2015
    Description

    ALL DATA IN THIS RECORD ARE REDUNDANT, I.E., THEY WERE OBTAINED DIRECTLY FROM OTHER DATA IN THIS FILE, USUALLY BY EXTRAPOLATION OR INTEGRATION. THE FOLLOWING COMMENTS ARE TAKEN FROM THE PI N COMPILATION OF R.L. KELLY. THEY ARE THAT COMPILATION'S COMPLETE SET OF COMMENTS FOR PAPERS RELATED TO THE SAME EXPERIMENT (DESIGNATED BUSZA69) AS THE CURRENT PAPER. (THE IDENTIFIER PRECEDING THE REFERENCE AND COMMENT FOR EACH PAPER IS FOR CROSS-REFERENCING WITHIN THESE COMMENTS ONLY AND DOES NOT NECESSARILY AGREE WITH THE SHORT CODE USED ELSEWHERE IN THE PRESENT COMPILATION.) /// BELLAMY65 [E. H. BELLAMY, PROC. ROY. SOC. (LONDON) 289, 509 (1965)] -- /// BUSZA67 [W. BUSZA, NC 52A, 331 (1967)] -- PI- P DCS FROM 2K ELASTIC EVENTS AT EACH OF 5 MOMENTA BETWEEN 1.72 AND 2.46 GEV/C. DONE AT NIMROD WITH OPTICAL SPARK CHAMBERS. THE APPARATUS IS DESCRIBED IN BELLAMY65, THE RESULTS IN BUSZA67. /// BUSZA69 [W. BUSZA, PR 180, 1339 (1969)] -- PI+ P DCS AT 10 MOMENTA BETWEEN 1.72 AND 2.80 GEV/C, AND PI- P DCS AT 5 MOMENTA BETWEEN 2.17 AND 2.80 GEV/C. THE DATA REPORTED IN BUSZA67 ARE ALSO REPEATED HERE. THE NEW MEASUREMENTS WERE DONE WITH AN IMPROVED VERSION OF THE APPARATUS USED BY BUSZA67. THE PI- DATA (INCLUDING BUSZA67) ARE NORMALIZED TO FORWARD DISPERSION RELATIONS; THE PI+ DATA HAS ITS OWN EXPERIMENTAL NORMALIZATION BUT NO NE IS GIVEN. WE HAVE INCREASED THE ERROR OF THE MOST FORWARD PI+ POINT AT 1.72 GEV/C BECAUSE OF AN AMBIGUOUS FOOTNOTE CONCERNING THIS POINT. /// COMMENTS FROM LOVELACE71 COMPILATION OF THESE DATA -- LOVELACE71 CLAIMS SOME USE WAS MADE OF FORWARD DISPERSION RELATIONS TO NORMALIZE THE PI+ DATA AS WELL AS THE PI-. THE FOLLOWING NORMALIZATION ERRORS AND RENORMALIZATION FACTORS ARE RECOMMENDED FOR THE PI+ P AND PI- P DIFFERENTIAL CROSS SECTIONS -- PLAB=1720 MEV/C -- NE(PI+ P)=INFIN, NE(PI- P)=INFIN. PLAB=1890 MEV/C -- RF(PI+ P)=1.245, RF(PI- P)=0.941. PLAB=2070 MEV/C -- NE(PI+ P)=INFIN, RF(PI- P)=1.224. PLAB=2170 MEV/C -- NE(PI+ P)=0.1, NE(PI- P)=0.1. PLAB=2270 MEV/C -- NE(PI+ P)=0.1, NE(PI- P)=INFIN. PLAB=2360 MEV/C -- NE(PI+ P)=0.1, NE(PI- P)=0.1. PLAB=2460 MEV/C -- NE(PI+ P)=0.1, NE(PI- P)=INFIN. PLAB=2560 MEV/C -- NE(PI+ P)=0.1, NE(PI- P)=0.1. PLAB=2650 MEV/C -- NE(PI+ P)=0.1, NE(PI- P)=0.1. PLAB=2800 MEV/C -- NE(PI+ P)=0.1, NE(PI- P)=0.1. /// COMMENTS ON MODIFICATIONS TO LOVELACE71 COMPILATION BY KELLY -- WE HAVE TAKEN ALL PI- NES TO BE INFINITE, AND ALL PI+ NES TO BE UNKNOWN. ALSO ONE MINOR MISTAKE IN THE PI- (PI+) DATA AT 2.36 (2.65) GEV/C HAS BEEN CORRECTED.

  11. Discord-Data-Preprocessed

    • huggingface.co
    Updated Jun 24, 2025
    Cite
    Toaster AI (2025). Discord-Data-Preprocessed [Dataset]. https://huggingface.co/datasets/toasterai/Discord-Data-Preprocessed
    Explore at:
    Dataset updated
    Jun 24, 2025
    Dataset authored and provided by
    Toaster AI
    License

    GNU AGPL 3.0
    https://choosealicense.com/licenses/agpl-3.0/

    Description

    A pre-processed version of Discord-Data. We use the detoxified and filtered variant of the dataset's 3rd version, with the following modifications:

    The format is a raw IRC-like conversation (like our previous Bluesky dataset). We normalize mentions into a format without the @ symbol. We convert emojis where possible and trim out the custom ones. We don't use the continuations of the files (part 2, 3, etc.).
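
    A minimal R sketch of the mention normalization described above, on a hypothetical message.

    msg <- "hey @bob, ping @alice when the export finishes"
    gsub("@(\\w+)", "\\1", msg)   # -> "hey bob, ping alice when the export finishes"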

    Part of the SambaDialog project.

  12. Identification of parameters in normal error component logit-mixture (NECLM)...

    • journaldata.zbw.eu
    • jda-test.zbw.eu
    pdf, txt, zip
    Updated Dec 8, 2022
    Cite
    Joan L. Walker; Moshe Ben-Akiva; Denis Bolduc (2022). Identification of parameters in normal error component logit-mixture (NECLM) models (replication data) [Dataset]. http://doi.org/10.15456/jae.2022319.0717541002
    Explore at:
    zip(162861), zip(100325), txt(952), pdf(22305) (available download formats)
    Dataset updated
    Dec 8, 2022
    Dataset provided by
    ZBW - Leibniz Informationszentrum Wirtschaft
    Authors
    Joan L. Walker; Moshe Ben-Akiva; Denis Bolduc
    License

    Attribution 4.0 (CC BY 4.0)
    https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Although the basic structure of logit-mixture models is well understood, important identification and normalization issues often get overlooked. This paper addresses issues related to the identification of parameters in logit-mixture models containing normally distributed error components associated with alternatives or nests of alternatives (normal error component logit mixture, or NECLM, models). NECLM models include special cases such as unrestricted, fixed covariance matrices; alternative-specific variances; nesting and cross-nesting structures; and some applications to panel data. A general framework is presented for determining which parameters are identified as well as what normalization to impose when specifying NECLM models. It is generally necessary to specify and estimate NECLM models at the levels, or structural, form. This precludes working with utility differences, which would otherwise greatly simplify the identification and normalization process. Our results show that identification is not always intuitive; for example, normalization issues present in logit-mixture models are not present in analogous probit models. To identify and properly normalize the NECLM, we introduce the equality condition, an addition to the standard order and rank conditions. The identifying conditions are worked through for a number of special cases, and our findings are demonstrated with empirical examples using both synthetic and real data.

  13. County Hurr Risk

    • hub.arcgis.com
    Updated Jun 1, 2020
    Cite
    FEMA AGOL (2020). County Hurr Risk [Dataset]. https://hub.arcgis.com/maps/FEMA::county-hurr-risk
    Explore at:
    Dataset updated
    Jun 1, 2020
    Dataset authored and provided by
    FEMA AGOL
    Description

    The hurricane risk index is simply the product of the cumulative hurricane strikes per coastal county and the CDC Overall Social Vulnerability Index (SVI) for the given county. We normalize the hurricane strikes data to match the SVI data classification scheme (i.e., max value at 1); however, using the raw or normalized values of hurricane strikes has no impact on the spatial pattern of the risk index. A risk index of 1 therefore indicates that the county has the highest hurricane strikes of all counties and is the most vulnerable county in the nation according to the SVI index. Because the analysis spans multiple states, we use the ‘United States’ SVI dataset at the county level. Values of the index are unevenly distributed, so we classify intervals using the Jenks method; the first break, at 0.08, is roughly equal to the median index value. For counties north of North Carolina, the low hurricane risk is mostly driven by the low number of hurricane strikes. The vast majority of counties fall in the lowest risk category, and any in the second-lowest category are there because of high social vulnerability.
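
    A minimal R sketch of the index as described, with hypothetical county values; SVI is already on a 0-1 scale.

    strikes <- c(county_A = 12, county_B = 3, county_C = 7)      # hypothetical strike counts
    svi     <- c(county_A = 0.9, county_B = 0.4, county_C = 0.7) # hypothetical SVI values
    risk <- (strikes / max(strikes)) * svi   # normalize strikes to max = 1, multiply by SVI
    risk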

  14. KITAB Text Reuse Data

    • live.european-language-grid.eu
    Updated Jan 2, 2020
    Cite
    (2020). KITAB Text Reuse Data [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7699
    Explore at:
    Dataset updated
    Jan 2, 2020
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)
    https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    KITAB's text reuse data is generated by running passim on the OpenITI corpus (DOI: 10.5281/zenodo.3082463). Each version is the output of a separate run. To prepare the corpus for a passim run, we chunk texts into passages of 300 tokens (~words) in length. We also normalize texts and remove all non-Arabic characters. The chunks, called milestones, are identified by unique ids. This dataset represents the reuse cases that have been identified among milestones. The dataset contains folders for each book. Each folder includes alignment files between that book and all other books in which passim has found instances of reuse. The reuse cases between a pair of books are represented as a list of records. Each record is an alignment that shows a pair of matched passages between two books, together with statistics, such as the algorithm score, and contextual information, such as the start and end positions of the aligned passages, so that one can find those passages in the books. A description of the alignment fields is given in the release notes.

    For each dataset, we generate statistical data on the alignments between the book pairs. The data is published in an application that facilitates search, filtering, and visualizations. The link to the corresponding application is given in the release notes.

    KITAB is funded by the European Research Council under the European Union's Horizon 2020 research and innovation programme, awarded to the KITAB project (Grant Agreement No. 772989, PI Sarah Bowen Savant), hosted at Aga Khan University, London. In addition, it has received funding from the Qatar National Library to aid in the adaptation of the passim algorithm for Arabic.

    Note on release numbering: in version 2020.1.1, 2020 is the year of the release, the first dotted number (.1) is the ordinal release number in 2020, and the second dotted number (.1) is the overall release number. The first dotted number resets every year, while the second one keeps increasing.

    Note: the very first release of the KITAB text reuse data (2019.1.1) is published here as it was too big to publish on Zenodo. To receive more information on the complete datasets, please contact us via kitab-project@outlook.com (or other team members). Future releases may include part of the generated data if the whole data is too big to publish on Zenodo. However, the data is open access for anyone to use. We provide detailed information on the datasets in the corresponding release notes.
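
    A minimal R sketch of the 300-token milestone chunking described above, on hypothetical text; the real preprocessing also normalizes the text and strips non-Arabic characters.

    book_text <- paste(rep("token", 1000), collapse = " ")         # hypothetical book
    tokens <- strsplit(book_text, "\\s+")[[1]]
    milestones <- split(tokens, ceiling(seq_along(tokens) / 300))  # 300-token chunks
    names(milestones) <- sprintf("ms%04d", seq_along(milestones))  # unique milestone ids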

  15. ‘The Bronson Files, Dataset 4, Field 105, 2013’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Aug 1, 2013
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2013). ‘The Bronson Files, Dataset 4, Field 105, 2013’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/data-gov-the-bronson-files-dataset-4-field-105-2013-7c96/latest
    Explore at:
    Dataset updated
    Aug 1, 2013
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0)
    https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘The Bronson Files, Dataset 4, Field 105, 2013’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://catalog.data.gov/dataset/392f69f2-aa43-4e90-970d-33c36e011c19 on 11 February 2022.

    --- Dataset description provided by original source is as follows ---

    Dr. Kevin Bronson provides this unique dataset, from nitrogen and water management research in wheat, for computational use. Ten irrigation treatments from a linear sprinkler were combined with nitrogen treatments. This dataset includes notation of field events and operations, an intermediate-analysis mega-table of correlated and calculated parameters (including laboratory analysis results generated during the experimentation), high-resolution plot-level intermediate data tables of SAS process output, and the complete raw sensor records and logger outputs.

    This data was collected during the early period of our USDA Maricopa terrestrial proximal high-throughput plant phenotyping tri-metric method development, in which 5 Hz crop canopy height, temperature, and spectral signature are recorded coincidently to indicate plant health status. In this early development period, our Proximal Sensing Cart Mark1 (PSCM1) platform supplanted people carrying the CropCircle (CC) sensors, with improved viewing and mechanical performance.

    Experimental design and operational details of the research conducted are contained in related published articles; however, further description of the measured data signals, as well as germane commentary, is offered here.

    The primary component of this dataset is the Holland Scientific (HS) CropCircle ACS-470 reflectance numbers, which, as derived here, consist of raw active optical band-pass values digitized onboard the sensor. Data is delivered as sequential serialized text output, including the associated GPS information. Typically this is a production-agriculture support technology, enabling efficient precision application of nitrogen fertilizer. We used this optical reflectance sensor technology to investigate plant agronomic biology, as the ACS-470 is a unique product: not only rugged and reliable, but also illumination-active and filter-customizable.

    Individualized ACS-470 sensor detector behavior and subsequent index calculation influence can be understood through analysis of white-panel and other known target measurements. When a sensor is held 120cm from a titanium dioxide white painted panel, a normalized unity value of 1.0 is set for each detector. To generate this dataset we used a Holland Scientific SC-1 device and set the 1.0 unity value (field normalize) on each sensor individually, before each data collection, and without using any channel gain boost. The SC-1 field normalization device allows a communications connection to a Windows machine, where company provided sensor control software enables the necessary sensor normalization routine, and a real-time view of streaming sensor data.

    This type of active proximal multi-spectral reflectance data may be perceived as inherently “noisy”; however, basic analytical description consistently resolves a biological patterning, and more advanced statistical analysis is suggested to achieve discovery. Sources of polychromatic reflectance are inherent in the environment and can be influenced by surface features like wax or water, or the presence of crystal mineralization; varying bi-directional reflectance in the proximal space is a model reality, and directed-energy emission reflection sampling is expected to support physical understanding of the underlying passive environmental system.

    Soil in view of the sensor does decrease the raw detection amplitude of the target color returned and can add a soil reflection signal component. Yet that return accurately represents a largely two-dimensional cover and intensity signal of the target material present within each view. It does not, however, represent a reflection of the plant material alone, because additional features can be in view. Expect NDVI values greater than 0.1 when sensing plants, saturating around 0.8 rather than the typical 0.9 of passive NDVI.

    The active signal does not transmit enough energy to penetrate perhaps past LAI 2.1 or less, compared with what a solar-induced passive reflectance sensor would encounter. However, the focus of our active sensor scan is on the uppermost expanded canopy leaves, which are positioned to intercept the major solar energy. Active energy sensors are easier to direct; in our capture method we target a consistent sensor height of 1 m above the average canopy height and maintain a rig travel speed of around 1.5 mph, with sensors parallel to the ground in a nadir view.

    We consider these CropCircle raw detector returns to be more “instant” in generation, and less electronically filtered while onboard the “black-box” device, than other reflectance products that deliver vegetation indices as averages of multiple detector samples in time.

    It is known, through internal sensor performance tracking across our entire location inventory, that sensor body temperature change affects sensor raw detector returns in minor, undescribed, yet apparently consistent ways.

    Holland Scientific 5Hz CropCircle active optical reflectance ACS-470 sensors, that were measured on the GeoScout digital propriety serial data logger, have a stable output format as defined by firmware version.

    Different numbers of csv data files were generated depending on field operations, and there were a few short instances where the GPS signal was lost. Multiple raw data files, when present (including white-panel measurements taken before or after field collections), were combined into one file, with the null value placeholder -9999 inserted. Two CropCircle sensors, numbered 2 and 3, supplied data in a lined format in which variables are repeated for each sensor, creating a discrete data row for each individual sensor measurement instance.

    We offer six high-throughput single-pixel spectral colors, recorded at 530, 590, 670, 730, 780, and 800 nm. The filtered band-pass was 10 nm, except for the NIR, which was set to 20 nm and supplied an increased signal (including increased noise).

    Dual, or tandem, CropCircle sensor usage empowers additional vegetation index calculations (a short sketch follows these formulas), such as:
    DATT = (r800-r730)/(r800-r670)
    DATTA = (r800-r730)/(r800-r590)
    MTCI = (r800-r730)/(r730-r670)
    CIRE = (r800/r730)-1
    CI = (r800/r590)-1
    CCCI = NDRE/NDVIR800
    PRI = (r590-r530)/(r590+r530)
    CI800 = ((r800/r590)-1)
    CI780 = ((r780/r590)-1)
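
    A minimal R sketch computing a few of the indices above from a hypothetical data frame with one column per band.

    cc <- data.frame(r530 = 0.08, r590 = 0.10, r670 = 0.07,
                     r730 = 0.30, r780 = 0.45, r800 = 0.50)   # hypothetical reflectances
    cc$DATT <- (cc$r800 - cc$r730) / (cc$r800 - cc$r670)
    cc$MTCI <- (cc$r800 - cc$r730) / (cc$r730 - cc$r670)
    cc$CIRE <- (cc$r800 / cc$r730) - 1
    cc$PRI  <- (cc$r590 - cc$r530) / (cc$r590 + cc$r530)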

    The Campbell Scientific (CS) environmental data recording of small-range (0 to 5 V) voltage sensor signals is accurate and largely shielded from electronic thermally induced influence, and other such factors, by design. The sensors were used as descriptively recommended by the company. High-precision clock timing and a recorded confluence of custom metrics give the Campbell Scientific raw data signal acquisitions high research value generally, and they have delivered baseline metrics in our plant phenotyping program. Raw electrical sensor signal captures were recorded at the maximum digital resolution and could be re-processed in whole, while the subsequent onboard calculated metrics were often data-typed at a lower memory precision and served our research analysis.

    Improved Campbell Scientific data at 5 Hz is presented for nine collection events, where thermal, ultrasonic displacement, and additional GPS metrics were recorded. Ultrasonic height metrics generated by the Honeywell sensor and present in this dataset represent successful phenotypic recordings. The Honeywell ultrasonic displacement sensor has worked well in this application because of its 180 kHz signal frequency, which ranges over a 2 m space. Air temperature is still a developing metric: a thermocouple wire junction (TC) placed in free air with a solar shade produced a low-confidence passive ambient air temperature.

    Campbell Scientific logger-derived data output is structured in a column format, with multiple sensor data values present in each data row. One data row represents one program output cycle recorded across the sensing array, as there was no onboard logger data averaging or down-sampling. Campbell Scientific data is first recorded in binary format onboard the data logger and then, upon data retrieval, converted to ASCII text via the PC-based LoggerNet CardConvert application. Here, our full CS raw data output, which includes a four-line header structure, was truncated to a typical single-row header of variable names. The -9999 placeholder value was inserted for null instances.

    There is canopy thermal data from three view vantages: a nadir sensor view, and views looking forward and backward down the plant row at a 30-degree angle off nadir. The high-confidence Apogee Instruments SI-111 type infrared radiometer (a non-contact thermometer) with serial number 1052 was in the front position looking forward away from the platform, number 1023 with a nadir view was in the middle position, and number 1022 was in the rear position looking back toward the platform frame, until after 4/10/2013 when the order was reversed. We have a long and successful history of testing, benchmarking, and deploying Apogee Instruments infrared radiometers in field experimentation. They are biologically relevant spectral-window sensors and return a fast-update, 0.2 C accurate average surface temperature derived from what is (geometrically weighted) in their field of view.

    Data gaps do exist beyond the null value -9999 designations: there are some instances when the GPS signal was lost and, rarely, an HS GeoScout logger error. GPS information may be missing at the start of data recording.

  16. Data from: Urdu Summary Corpus

    • live.european-language-grid.eu
    txt
    Updated Oct 19, 2017
    Cite
    (2017). Urdu Summary Corpus [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/1391
    Explore at:
    txt (available download formats)
    Dataset updated
    Oct 19, 2017
    License

    MIT License
    https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Urdu Summary Corpus

    The Urdu summary corpus consists of 50 articles collected from various blogs. From the original HTML documents, only the unformatted text content was kept; everything else was removed. We provide abstractive summaries of these 50 articles. After normalization, we further applied different NLP tools to the articles to generate part-of-speech tagged, morphologically analyzed, lemmatized, and stemmed articles.

    Urdu Summary Corpus Tools

    + Normalization is taken from [2]; diacritic marks are also removed in this step.

    + The table-lookup based morphological analyzer and lemmatizer is built from [3].

    + The stemmer is built from [1].

    + The table-lookup based POS tagger is built from [4]. We used unigram and bigram counts.

    Commands:

    Unzip USCTools.zip

    Open Console

    Go to the USCTools directory by typing: cd USCTools

    For Normalization

    $ java -cp bin USCTools normalize input.txt output.txt

    For Lemmatization

    $ java -cp bin USCTools lemmatize input.txt output.txt

    For Morphological analysis

    $ java -cp bin USCTools morph_analysis input.txt output.txt

    For stemming by Assas-Band

    $ java -cp bin USCTools stemming input.txt output.txt

    For POS tagging

    $ java -cp bin USCTools tagging input.txt output.txt

    [1] Q.-u.-A. Akram, A. Naseer, and S. Hussain. Proceedings of the 7th Workshop on Asian Language Resources (ALR7), chapter Assas-band, an Affix-Exception-List Based Urdu Stemmer, pages 40-47. Association for Computational Linguistics, 2009.

    [2] A. Gulzar. Urdu normalization utility v1.0. Technical report, Center for Language Engineering, Al-Khawarizmi Institute of Computer Science (KICS), University of Engineering and Technology, Lahore, Pakistan. http://www.cle.org.pk/software/langproc/urdunormalization.htm, 2007.

    [3] M. Humayoun, H. Hammarström, and A. Ranta. Urdu morphology, orthography and lexicon extraction. CAASL-2: The Second Workshop on Computational Approaches to Arabic Script-based Languages, LSA Linguistic Institute, Stanford University, California, USA, pages 21-22, 2007. http://www.lama.univ-savoie.fr/humayoun/UrduMorph/

    [4] B. Jawaid, A. Kamran, and O. Bojar. A tagged corpus and a tagger for Urdu. In N. Calzolari (Conference Chair), K. Choukri, T. Declerck, H. Loftsson, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, and S. Piperidis, editors, Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland, May 2014. European Language Resources Association (ELRA). https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0023-65A9-5

  17. Data from: Best-Matched Internal Standard Normalization in Liquid...

    • acs.figshare.com
    xlsx
    Updated Jun 3, 2023
    Cite
    Angela K. Boysen; Katherine R. Heal; Laura T. Carlson; Anitra E. Ingalls (2023). Best-Matched Internal Standard Normalization in Liquid Chromatography–Mass Spectrometry Metabolomics Applied to Environmental Samples [Dataset]. http://doi.org/10.1021/acs.analchem.7b04400.s002
    Explore at:
    xlsx (available download formats)
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    ACS Publications
    Authors
    Angela K. Boysen; Katherine R. Heal; Laura T. Carlson; Anitra E. Ingalls
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)
    https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The goal of metabolomics is to measure the entire range of small organic molecules in biological samples. In liquid chromatography–mass spectrometry-based metabolomics, formidable analytical challenges remain in removing the nonbiological factors that affect chromatographic peak areas. These factors include sample matrix-induced ion suppression, chromatographic quality, and analytical drift. The combination of these factors is referred to as obscuring variation. Some metabolomics samples can exhibit intense obscuring variation due to matrix-induced ion suppression, rendering large amounts of data unreliable and difficult to interpret. Existing normalization techniques have limited applicability to these sample types. Here we present a data normalization method to minimize the effects of obscuring variation. We normalize peak areas using a batch-specific normalization process, which matches measured metabolites with isotope-labeled internal standards that behave similarly during the analysis. This method, called best-matched internal standard (B-MIS) normalization, can be applied to targeted or untargeted metabolomics data sets and yields relative concentrations. We evaluate and demonstrate the utility of B-MIS normalization using marine environmental samples and laboratory grown cultures of phytoplankton. In untargeted analyses, B-MIS normalization allowed for inclusion of mass features in downstream analyses that would have been considered unreliable without normalization due to obscuring variation. B-MIS normalization for targeted or untargeted metabolomics is freely available at https://github.com/IngallsLabUW/B-MIS-normalization.
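
    A minimal R sketch of the matching idea described above: for one measured metabolite, choose the internal standard whose peak-area ratio is most stable (lowest relative standard deviation) across pooled runs. The data and names are hypothetical; the authors' actual implementation is at the GitHub link above.

    set.seed(1)
    met  <- runif(8, 50, 150)   # hypothetical metabolite peak areas across 8 pooled runs
    istd <- matrix(runif(8 * 3, 50, 150), ncol = 3,
                   dimnames = list(NULL, c("IS1", "IS2", "IS3")))  # internal standards
    rsd  <- apply(istd, 2, function(s) { r <- met / s; sd(r) / mean(r) })
    best <- names(which.min(rsd))                         # best-matched internal standard
    norm_area <- met / istd[, best] * mean(istd[, best])  # normalized peak areas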

  18. Normalization of High Dimensional Genomics Data Where the Distribution of...

    • plos.figshare.com
    tiff
    Updated Jun 1, 2023
    Cite
    Mattias Landfors; Philge Philip; Patrik Rydén; Per Stenberg (2023). Normalization of High Dimensional Genomics Data Where the Distribution of the Altered Variables Is Skewed [Dataset]. http://doi.org/10.1371/journal.pone.0027942
    Explore at:
    tiff (available download formats)
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Mattias Landfors; Philge Philip; Patrik Rydén; Per Stenberg
    License

    Attribution 4.0 (CC BY 4.0)
    https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Genome-wide analysis of gene expression or protein binding patterns using different array or sequencing based technologies is now routinely performed to compare different populations, such as treatment and reference groups. It is often necessary to normalize the data obtained to remove technical variation introduced in the course of conducting experimental work, but standard normalization techniques are not capable of eliminating technical bias in cases where the distribution of the truly altered variables is skewed, i.e. when a large fraction of the variables are either positively or negatively affected by the treatment. However, several experiments are likely to generate such skewed distributions, including ChIP-chip experiments for the study of chromatin, gene expression experiments for the study of apoptosis, and SNP-studies of copy number variation in normal and tumour tissues. A preliminary study using spike-in array data established that the capacity of an experiment to identify altered variables and generate unbiased estimates of the fold change decreases as the fraction of altered variables and the skewness increases. We propose the following work-flow for analyzing high-dimensional experiments with regions of altered variables: (1) Pre-process raw data using one of the standard normalization techniques. (2) Investigate if the distribution of the altered variables is skewed. (3) If the distribution is not believed to be skewed, no additional normalization is needed. Otherwise, re-normalize the data using a novel HMM-assisted normalization procedure. (4) Perform downstream analysis. Here, ChIP-chip data and simulated data were used to evaluate the performance of the work-flow. It was found that skewed distributions can be detected by using the novel DSE-test (Detection of Skewed Experiments). Furthermore, applying the HMM-assisted normalization to experiments where the distribution of the truly altered variables is skewed results in considerably higher sensitivity and lower bias than can be attained using standard and invariant normalization methods.

  19. (high-temp) No 8. Metadata Analysis (16S rRNA/ITS) Output

    • search.dataone.org
    Updated Aug 15, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jarrod Scott (2024). (high-temp) No 8. Metadata Analysis (16S rRNA/ITS) Output [Dataset]. https://search.dataone.org/view/urn%3Auuid%3A718e0794-b5ff-4919-95ef-4a90a7890a5b
    Explore at:
    Dataset updated
    Aug 15, 2024
    Dataset provided by
    Smithsonian Research Data Repository
    Authors
    Jarrod Scott
    Description

    Output files from the 8. Metadata Analysis Workflow page of the SWELTR high-temp study. In this workflow, we compared environmental metadata with microbial communities. The workflow is split into two parts.

    metadata_ssu18_wf.rdata : Part 1 contains all variables and objects for the 16S rRNA analysis. To see the Objects, in R run _load("metadata_ssu18_wf.rdata", verbose=TRUE)_

    metadata_its18_wf.rdata : Part 2 contains all variables and objects for the ITS analysis. To see the Objects, in R run _load("metadata_its18_wf.rdata", verbose=TRUE)_

    In both workflows, we run the following steps (a condensed sketch follows the list):

    1) Metadata Normality Tests: Shapiro-Wilk Normality Test to test whether each metadata parameter is normally distributed.
    2) Normalize Parameters: R package bestNormalize to find and execute the best normalizing transformation.
    3) Split Metadata parameters into groups: a) Environmental and edaphic properties, b) Microbial functional responses, and c) Temperature adaptation properties.
    4) Autocorrelation Tests: Test all possible pair-wise comparisons, on both normalized and non-normalized data sets, for each group.
    5) Remove autocorrelated parameters from each group.
    6) Dissimilarity Correlation Tests: Use Mantel Tests to see if any of the metadata groups are significantly correlated with the community data.
    7) Best Subset of Variables: Determine which of the metadata parameters from each group are the most strongly correlated with the community data. For this we use the bioenv function from the vegan package.
    8) Distance-based Redundancy Analysis: Ordination analysis of samples and metadata vector overlays using capscale, also from the vegan package.
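
    A condensed R sketch of steps 1, 2, 6, 7, and 8 on hypothetical data; the workflow's actual code is at the link below.

    library(vegan); library(bestNormalize)
    env  <- data.frame(pH = rnorm(10, 7), moisture = runif(10))  # hypothetical metadata
    comm <- matrix(rpois(10 * 20, 5), nrow = 10)                 # hypothetical community table
    shapiro.test(env$pH)                        # 1) normality test per parameter
    env$pH_bn <- bestNormalize(env$pH)$x.t      # 2) best normalizing transformation
    mantel(vegdist(comm), dist(scale(env)))     # 6) dissimilarity correlation test
    bioenv(comm, scale(env))                    # 7) best subset of variables
    ord <- capscale(comm ~ pH_bn, data = env)   # 8) distance-based redundancy analysis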

    Source code for the workflow can be found here:
    https://github.com/sweltr/high-temp/blob/master/metadata.Rmd

  20. Data from: Interaction of Positive Pions with Hydrogen at 600 MeV

    • hepdata.net
    Updated 1970
    Cite
    Newcomb, Peter C.A. (1970). Interaction of Positive Pions with Hydrogen at 600 MeV [Dataset]. http://doi.org/10.17182/hepdata.26734.v1
    Explore at:
    Dataset updated
    1970
    Dataset provided by
    HEPData
    Authors
    Newcomb, Peter C.A.
    Description

    THE FOLLOWING COMMENTS ARE TAKEN FROM THE PI N COMPILATION OF R.L. KELLY. THEY ARE THAT COMPILATION'S COMPLETE SET OF COMMENTS FOR PAPERS RELATED TO THE SAME EXPERIMENT (DESIGNATED NEWCOMB63) AS THE CURRENT PAPER. (THE IDENTIFIER PRECEDING THE REFERENCE AND COMMENT FOR EACH PAPER IS FOR CROSS-REFERENCING WITHIN THESE COMMENTS ONLY AND DOES NOT NECESSARILY AGREE WITH THE SHORT CODE USED ELSEWHERE IN THE PRESENT COMPILATION.) /// NEWCOMB63 [P. C. A. NEWCOMB, PR 132, 1283 (1963).] -- PI+ P DCS AT 725 MEV/C FROM 1245 ELASTIC EVENTS IN THE 15 INCH LRL HBC AT THE BEVATRON. DATA PRESENTED AS A TABLE OF NUMBERS OF EVENTS AND A HISTOGRAM NORMALIZED TO A TOTAL CS OF 16.1+/-0.8 MB. THE MB RECORDED HERE INCLUDES THE SPREAD IN BEAM MOMENTUM OVER THE FIDUCIAL VOLUME. /// NEWCOMB63T [P. C. A. NEWCOMB, UCB THESIS, UCRL-10563, 1963.] -- A LARGER VERSION OF THE HISTOGRAM IS GIVEN HERE. WE USED THE NORMALIZATION READ FROM THIS HISTOGRAM, 1 EVENT/.1 COS(THETA) INTERVAL = .0141 MB/STER, TO NORMALIZE THE TABLE IN NEWCOMB63. /// COMMENTS ON MODIFICATIONS TO LOVELACE71 COMPILATION BY KELLY -- WE NORMALIZED THE TABLE IN NEWCOMB63 AS DESCRIBED ABOVE, RESULTING IN ONLY MINOR DIFFERENCES FROM THE LOVELACE71 VERSION, WHICH WAS APPARENTLY READ DIRECTLY FROM THE HISTOGRAM IN NEWCOMB63. DATA ARE UNNORMALIZED OR NORMALIZED TO OTHER DATA.
