100+ datasets found

Data from: Workflow for Evaluating Normalization Tools for Omics Data Using...
acs.figshare.com
txt
Updated Oct 28, 2023
Cite
Aleesa E. Chua; Leah D. Pfeifer; Emily R. Sekera; Amanda B. Hummon; Heather Desaire (2023). Workflow for Evaluating Normalization Tools for Omics Data Using Supervised and Unsupervised Machine Learning [Dataset]. http://doi.org/10.1021/jasms.3c00295.s001
Available download formats: txt
Unique identifier
https://doi.org/10.1021/jasms.3c00295.s001
Dataset updated
Oct 28, 2023
Dataset provided by
ACS Publications
Authors
Aleesa E. Chua; Leah D. Pfeifer; Emily R. Sekera; Amanda B. Hummon; Heather Desaire
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
To achieve high quality omics results, systematic variability in mass spectrometry (MS) data must be adequately addressed. Effective data normalization is essential for minimizing this variability. The abundance of approaches and the data-dependent nature of normalization have led some researchers to develop open-source academic software for choosing the best approach. While these tools are certainly beneficial to the community, none of them meet all of the needs of all users, particularly users who want to test new strategies that are not available in these products. Herein, we present a simple and straightforward workflow that facilitates the identification of optimal normalization strategies using straightforward evaluation metrics, employing both supervised and unsupervised machine learning. The workflow offers a “DIY” aspect, where the performance of any normalization strategy can be evaluated for any type of MS data. As a demonstration of its utility, we apply this workflow on two distinct datasets, an ESI-MS dataset of extracted lipids from latent fingerprints and a cancer spheroid dataset of metabolites ionized by MALDI-MSI, for which we identified the best-performing normalization strategies.
Data from: A systematic evaluation of normalization methods and probe...
data.niaid.nih.gov
datadryad.org
zip
Updated May 30, 2023
Cite
H. Welsh; C. M. P. F. Batalha; W. Li; K. L. Mpye; N. C. Souza-Pinto; M. S. Naslavsky; E. J. Parra (2023). A systematic evaluation of normalization methods and probe replicability using infinium EPIC methylation data [Dataset]. http://doi.org/10.5061/dryad.cnp5hqc7v
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.cnp5hqc7v
Dataset updated
May 30, 2023
Dataset provided by
University of Toronto
Universidade de São Paulo
Hospital for Sick Children
Authors
H. Welsh; C. M. P. F. Batalha; W. Li; K. L. Mpye; N. C. Souza-Pinto; M. S. Naslavsky; E. J. Parra
License
https://spdx.org/licenses/CC0-1.0.html
Description
Background: The Infinium EPIC array measures the methylation status of > 850,000 CpG sites. The EPIC BeadChip uses a two-array design: Infinium Type I and Type II probes. These probe types exhibit different technical characteristics which may confound analyses. Numerous normalization and pre-processing methods have been developed to reduce probe type bias as well as other issues such as background and dye bias.
Methods: This study evaluates the performance of various normalization methods using 16 replicated samples and three metrics: absolute beta-value difference, overlap of non-replicated CpGs between replicate pairs, and effect on beta-value distributions. Additionally, we carried out Pearson’s correlation and intraclass correlation coefficient (ICC) analyses using both raw and SeSAMe 2 normalized data.
Results: The method we define as SeSAMe 2, which consists of the application of the regular SeSAMe pipeline with an additional round of QC, pOOBAH masking, was found to be the best-performing normalization method, while quantile-based methods were found to be the worst performing methods. Whole-array Pearson’s correlations were found to be high. However, in agreement with previous studies, a substantial proportion of the probes on the EPIC array showed poor reproducibility (ICC < 0.50). The majority of poor-performing probes have beta values close to either 0 or 1, and relatively low standard deviations. These results suggest that probe reliability is largely the result of limited biological variation rather than technical measurement variation. Importantly, normalizing the data with SeSAMe 2 dramatically improved ICC estimates, with the proportion of probes with ICC values > 0.50 increasing from 45.18% (raw data) to 61.35% (SeSAMe 2).

Methods

Study Participants and Samples

The whole blood samples were obtained from the Health, Well-being and Aging (Saúde, Bem-estar e Envelhecimento, SABE) study cohort. SABE is a cohort of census-withdrawn elderly from the city of São Paulo, Brazil, followed up every five years since the year 2000, with DNA first collected in 2010. Samples from 24 elderly adults were collected at two time points for a total of 48 samples. The first time point is the 2010 collection wave, performed from 2010 to 2012, and the second time point was set in 2020 in a COVID-19 monitoring project (9±0.71 years apart). The 24 individuals were 67.41±5.52 years of age (mean ± standard deviation) at time point one and 76.41±6.17 at time point two, and comprised 13 men and 11 women.

All individuals enrolled in the SABE cohort provided written consent, and the ethics protocols were approved by local and national institutional review boards: COEP/FSP/USP OF.COEP/23/10, CONEP 2044/2014, CEP HIAE 1263-10, and University of Toronto RIS 39685.

Blood Collection and Processing

Genomic DNA was extracted from whole peripheral blood samples collected in EDTA tubes. DNA extraction and purification followed the manufacturer’s recommended protocols, using the Qiagen AutoPure LS kit with Gentra automated extraction (first time point) or manual extraction (second time point, owing to discontinuation of the equipment, but using the same commercial reagents). DNA was quantified using a Nanodrop spectrometer and diluted to 50 ng/µL. To assess the reproducibility of the EPIC array, we also obtained technical replicates for 16 out of the 48 samples, for a total of 64 samples submitted for further analyses. Whole Genome Sequencing data is also available for the samples described above.

Characterization of DNA Methylation using the EPIC array

Approximately 1,000ng of human genomic DNA was used for bisulphite conversion. Methylation status was evaluated using the MethylationEPIC array at The Centre for Applied Genomics (TCAG, Hospital for Sick Children, Toronto, Ontario, Canada), following protocols recommended by Illumina (San Diego, California, USA).

Processing and Analysis of DNA Methylation Data

The R/Bioconductor packages Meffil (version 1.1.0), RnBeads (version 2.6.0), minfi (version 1.34.0) and wateRmelon (version 1.32.0) were used to import, process and perform quality control (QC) analyses on the methylation data. Starting with the 64 samples, we first used Meffil to infer the sex of each sample and compared the inferred sex to the reported sex. Utilizing the 59 SNP probes that are available as part of the EPIC array, we calculated concordance between the methylation intensities of the samples and the corresponding genotype calls extracted from their WGS data. We then performed comprehensive sample-level and probe-level QC using the RnBeads QC pipeline. Specifically, we (1) removed probes if their target sequences overlap with a SNP at any base, (2) removed known cross-reactive probes, (3) used the iterative Greedycut algorithm to filter out samples and probes, using a detection p-value threshold of 0.01, and (4) removed probes if more than 5% of the samples had a missing value. Since RnBeads does not have a function to perform probe filtering based on bead number, we used the wateRmelon package to extract bead numbers from the IDAT files and calculated the proportion of samples with bead number < 3. Probes with more than 5% of samples having low bead number (< 3) were removed. For the comparison of normalization methods, we also computed detection p-values using the empirical distribution of out-of-band probes with the pOOBAH() function in the SeSAMe (version 1.14.2) R package, with a p-value threshold of 0.05, and the combine.neg parameter set to TRUE. In the scenario where pOOBAH filtering was carried out, it was done in parallel with the previously mentioned QC steps, and the resulting probes flagged in both analyses were combined and removed from the data.
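
The bead-number filter described above (drop a probe when more than 5% of samples have fewer than 3 beads) can be illustrated with a short sketch. This is not the authors' R code; the function name `low_bead_probes` and the list-of-rows input layout are assumptions made for illustration only.

```python
def low_bead_probes(bead_counts, min_beads=3, max_fraction=0.05):
    """Return indices of probes that fail the bead-number filter.

    bead_counts: one row per probe, each row holding the bead count
    observed for that probe in every sample (hypothetical layout).
    A probe is flagged when the fraction of samples with fewer than
    min_beads beads is strictly greater than max_fraction.
    """
    flagged = []
    for index, row in enumerate(bead_counts):
        low = sum(1 for count in row if count < min_beads)
        if low / len(row) > max_fraction:
            flagged.append(index)
    return flagged


# Three probes measured in 20 samples each.
counts = [
    [5] * 20,            # no low-bead samples: kept
    [2] + [5] * 19,      # 1/20 = 5%, not more than 5%: kept
    [2, 2] + [5] * 18,   # 2/20 = 10%: flagged
]
print(low_bead_probes(counts))
```

The strict "more than 5%" comparison follows the wording of the description; a probe sitting exactly at the threshold is retained.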

Normalization Methods Evaluated

The normalization methods compared in this study were implemented using different R/Bioconductor packages and are summarized in Figure 1. All data were read into the R workspace as RG Channel Sets using minfi’s read.metharray.exp() function. One sample that was flagged during QC was removed, and further normalization steps were carried out on the remaining set of 63 samples. Prior to all normalizations with minfi, probes that did not pass QC were removed. Noob, SWAN, Quantile, Funnorm and Illumina normalizations were implemented using minfi. BMIQ normalization was implemented with ChAMP (version 2.26.0), using as input the raw data produced by minfi’s preprocessRaw() function. In the combination of Noob with BMIQ (Noob+BMIQ), BMIQ normalization was carried out using as input minfi’s Noob normalized data. Noob normalization was also implemented with SeSAMe, using a nonlinear dye bias correction. For SeSAMe normalization, two scenarios were tested. For both, the inputs were unmasked SigDF Sets converted from minfi’s RG Channel Sets. In the first, which we call “SeSAMe 1”, SeSAMe’s pOOBAH masking was not executed, and the only probes filtered out of the dataset prior to normalization were the ones that did not pass QC in the previous analyses. In the second scenario, which we call “SeSAMe 2”, pOOBAH masking was carried out on the unfiltered dataset, and masked probes were removed. This removal was followed by further removal of probes that did not pass previous QC, and that had not been removed by pOOBAH. Therefore, SeSAMe 2 has two rounds of probe removal. Noob normalization with nonlinear dye bias correction was then carried out on the filtered dataset. Methods were then compared by subsetting the 16 replicated samples and evaluating the effects that the different normalization methods had on the absolute difference of beta values (|β|) between replicated samples.
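
As a rough illustration of the replicate comparison described above, the sketch below computes the mean absolute beta-value difference between one pair of technical replicates. It is a simplified stand-in, not the study's pipeline; the function name `mean_abs_beta_diff` and the use of `None` for masked probes are assumptions.

```python
from statistics import mean


def mean_abs_beta_diff(replicate_a, replicate_b):
    """Mean absolute difference of beta values between two replicates.

    Both inputs are equal-length lists of beta values (0..1), one entry
    per probe. Probes that are None in either replicate (assumed here to
    mean "masked or missing") are skipped.
    """
    diffs = [
        abs(a - b)
        for a, b in zip(replicate_a, replicate_b)
        if a is not None and b is not None
    ]
    return mean(diffs)


rep_a = [0.10, 0.50, 0.90, None]
rep_b = [0.12, 0.45, 0.90, 0.30]
print(round(mean_abs_beta_diff(rep_a, rep_b), 4))
```

Under this metric a normalization method that makes technical replicates agree more closely yields a smaller value.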
GTEx (Genotype-Tissue Expression) data normalized
data.4tu.nl
figshare.com
zip
Cite
Erdogan Taskesen, GTEx (Genotype-Tissue Expression) data normalized [Dataset]. http://doi.org/10.4121/uuid:ec5bfa66-5531-482a-904f-b693aa999e8b
Available download formats: zip
Unique identifier
https://doi.org/10.4121/uuid:ec5bfa66-5531-482a-904f-b693aa999e8b
Dataset provided by
TU Delft
Authors
Erdogan Taskesen
License
https://doi.org/10.4121/resource:terms_of_use
Description
This is a normalized dataset derived from the original RNAseq dataset downloaded from the Genotype-Tissue Expression (GTEx) project (www.gtexportal.org): RNA-SeQCv1.1.8 gene rpkm Pilot V3 patch1. The data were used to analyze how tissue samples are related to each other in terms of gene expression. The data can be used to gain insight into how gene expression levels behave in the different human tissues.
Differenced Normalized Burn Ratio (dNBR) data of wildfires in the Sky Island...
catalog.data.gov
data.usgs.gov
Updated Jul 6, 2024
Cite
Differenced Normalized Burn Ratio (dNBR) data of wildfires in the Sky Island Mountains of the southwestern US and northern Mexico from 2011-2017 [Dataset]. https://catalog.data.gov/dataset/differenced-normalized-burn-ratio-dnbr-data-of-wildfires-in-the-sky-island-mountains-2011-
Dataset updated
Jul 6, 2024
Dataset provided by
United States Geological Survey http://www.usgs.gov/
Area covered
Southwestern United States, Mexico
Description
This dataset is composed of 97 Differenced Normalized Burn Ratio (dNBR) images. Each dNBR represents a rough measure of fire-related vegetation change for wildfires (>400 ha) that occurred in the Sky Island Mountains within the Madrean Archipelago Ecoregion of the United States and Northern Mexico. These fires occurred between 2011 and 2017 and were mapped using Landsat 7 and 8 satellite imagery.
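
For readers unfamiliar with the index, the sketch below shows the standard definition of dNBR: the Normalized Burn Ratio, (NIR - SWIR) / (NIR + SWIR), computed before and after the fire and then differenced. It works on single reflectance values for clarity; the function names are illustrative, and the listing does not describe the exact processing chain used to produce these images.

```python
def nbr(nir, swir):
    """Normalized Burn Ratio from near-infrared and shortwave-infrared reflectance."""
    total = nir + swir
    if total == 0:
        return float("nan")  # undefined where both bands are zero
    return (nir - swir) / total


def dnbr(pre_nir, pre_swir, post_nir, post_swir):
    """Differenced NBR: pre-fire NBR minus post-fire NBR.

    Larger positive values indicate a larger loss of vegetation signal.
    """
    return nbr(pre_nir, pre_swir) - nbr(post_nir, post_swir)


# Healthy vegetation before the fire, burned surface afterwards.
print(round(dnbr(0.4, 0.1, 0.2, 0.3), 3))
```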
A spatio-temporal normalization method for geophysical data - Dataset -...
b2find.dkrz.de
Updated Sep 11, 2024
Cite
(2024). A spatio-temporal normalization method for geophysical data - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/de207154-c175-59c6-b928-a88cbb2618f8
Dataset updated
Sep 11, 2024
Description
The objective of the study was to introduce a normalization algorithm which highlights short-term, localized, non-periodic fluctuations in hyper-temporal satellite data by dividing each pixel by the mean value of its spatial neighbourhood set. The algorithm was designed to suppress signal patterns that are common in the central and surrounding pixels, utilizing spatial and temporal information at different scales. Two folders ('Normalized_different_framesizes' and 'Retrieval_different_anomalies') are too large to upload and will be sent separately via SURF Filesender.
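
The core idea, dividing each pixel by the mean of its spatial neighbourhood, can be sketched for a single image frame as follows. This is only an approximation of the published algorithm: the square window, the exclusion of the centre pixel from its own neighbourhood, and the name `neighbourhood_normalize` are all assumptions, and the temporal dimension is omitted.

```python
def neighbourhood_normalize(grid, radius=1):
    """Divide each pixel by the mean of the surrounding pixels.

    grid: 2D list of numeric pixel values (one frame).
    radius: half-width of the square window (assumed shape).
    Pixels whose neighbourhood is empty or has mean zero become NaN.
    """
    rows, cols = len(grid), len(grid[0])
    out = [[float("nan")] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            neighbours = [
                grid[rr][cc]
                for rr in range(max(0, r - radius), min(rows, r + radius + 1))
                for cc in range(max(0, c - radius), min(cols, c + radius + 1))
                if (rr, cc) != (r, c)  # assumption: centre pixel excluded
            ]
            if neighbours:
                local_mean = sum(neighbours) / len(neighbours)
                if local_mean != 0:
                    out[r][c] = grid[r][c] / local_mean
    return out


frame = [
    [1, 1, 1],
    [1, 4, 1],
    [1, 1, 1],
]
result = neighbourhood_normalize(frame)
print(result[1][1], result[0][0])
```

A pixel that matches its surroundings ends up near 1, while a localized anomaly such as the centre value here stands out with a ratio well above 1.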
Luecken Cite-seq human bone marrow 2021 preprocessing
figshare.com
hdf
Updated Oct 5, 2023
Cite
Single-cell best practices (2023). Luecken Cite-seq human bone marrow 2021 preprocessing [Dataset]. http://doi.org/10.6084/m9.figshare.23623950.v2
Available download formats: hdf
Unique identifier
https://doi.org/10.6084/m9.figshare.23623950.v2
Dataset updated
Oct 5, 2023
Dataset provided by
figshare
Authors
Single-cell best practices
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset published by Luecken et al. 2021 which contains data from human bone marrow measured through joint profiling of single-nucleus RNA and Antibody-Derived Tags (ADTs) using the 10X 3' Single-Cell Gene Expression kit with Feature Barcoding in combination with the BioLegend TotalSeq B Universal Human Panel v1.0.

File Description
- cite_quality_control.h5mu: Filtered cell by feature MuData object after quality control.
- cite_normalization.h5mu: MuData object of normalized data using DSB (denoised and scaled by background) normalization.
- cite_doublet_removal_xdbt.h5mu: MuData of data after doublet removal based on known cell type markers. Cells were removed if they were double positive for mutually exclusive markers with a DSB value >2.5.
- cite_dimensionality_reduction.h5mu: MuData of data after dimensionality reduction.
- cite_batch_correction.h5mu: MuData of data after batch correction.

Citation
Luecken, M. D. et al. A sandbox for prediction and integration of DNA, RNA, and proteins in single cells. Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2021).

Original data link
https://openproblems.bio/neurips_docs/data/dataset/
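
The doublet-removal rule in the description (a cell is removed when it is double positive, with DSB value > 2.5, for a pair of mutually exclusive markers) could be expressed roughly as below. The function name, the dictionary layout, and the example marker pair are hypothetical; the actual marker pairs used for this dataset are not listed in the description.

```python
def is_marker_doublet(dsb_values, exclusive_pairs, threshold=2.5):
    """Return True if a cell is double positive for any exclusive marker pair.

    dsb_values: dict mapping marker name to the cell's DSB-normalized value.
    exclusive_pairs: list of (marker_a, marker_b) tuples assumed to be
    mutually exclusive in single cells (illustrative input).
    """
    return any(
        dsb_values.get(a, 0.0) > threshold and dsb_values.get(b, 0.0) > threshold
        for a, b in exclusive_pairs
    )


pairs = [("CD3", "CD19")]  # example pair chosen for illustration only
print(is_marker_doublet({"CD3": 3.1, "CD19": 2.9}, pairs))
print(is_marker_doublet({"CD3": 3.1, "CD19": 1.0}, pairs))
```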
Cell type labels for all clustering and normalization combinations compared...
data.niaid.nih.gov
datadryad.org
zip
Updated Nov 17, 2022
Cite
John Hickey (2022). Cell type labels for all clustering and normalization combinations compared for CODEX multiplexed imaging [Dataset]. http://doi.org/10.5061/dryad.dfn2z352c
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.dfn2z352c
Dataset updated
Nov 17, 2022
Dataset provided by
Stanford University
Authors
John Hickey
License
https://spdx.org/licenses/CC0-1.0.html
Description
We performed CODEX (co-detection by indexing) multiplexed imaging on four sections of the human colon (ascending, transverse, descending, and sigmoid) using a panel of 47 oligonucleotide-barcoded antibodies. Subsequently, images underwent standard CODEX image processing (tile stitching, drift compensation, cycle concatenation, background subtraction, deconvolution, and determination of best focal plane) and single-cell segmentation. The output of this process was a dataframe of nearly 130,000 cells with fluorescence values quantified for each marker. We used this dataframe as input to five normalization conditions: we compared z, double-log(z), min/max, and arcsinh normalizations to the original unmodified dataset. We used these normalized dataframes as inputs for 4 unsupervised clustering algorithms: k-means, Leiden, X-shift Euclidean, and X-shift angular.
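
To make three of the named normalizations concrete, here is a minimal sketch of z, min/max, and arcsinh normalization applied to the values of one marker across cells. The double-log(z) variant is left out because the description does not define it. The function names, the use of the population standard deviation, and the arcsinh cofactor are assumptions, not the manuscript's settings.

```python
import math
from statistics import mean, pstdev


def z_normalize(values):
    """Centre on the mean and scale by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant marker: no spread to scale
    return [(v - mu) / sigma for v in values]


def min_max_normalize(values):
    """Rescale linearly so the smallest value is 0 and the largest is 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def arcsinh_normalize(values, cofactor=1.0):
    """Inverse hyperbolic sine transform; cofactor is an assumed tuning value."""
    return [math.asinh(v / cofactor) for v in values]


marker = [2.0, 4.0, 6.0]
print(z_normalize(marker))
print(min_max_normalize(marker))
print(arcsinh_normalize(marker))
```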

From the clustering outputs, we then labeled the clusters that resulted for cells observed in the data, producing 20 unique cell type labels. We also labeled cell types by hierarchical hand-gating of the data within CellEngine (cellengine.com). We also created another gold standard for comparison by overclustering unnormalized data with X-shift angular clustering. Finally, we created one last label as the major cell type call for each cell from all 21 cell type labels in the dataset.

Consequently, each row of the dataset is an individual segmented cell. There are columns for the X, Y position in pixels in the overall montage image of the dataset, and columns to indicate which region the data came from (4 total). The remaining columns are labels generated by all the clustering and normalization techniques used in the manuscript, which were compared to each other. These were also the data used for the neighborhood analysis in the last figure of the manuscript. They are provided at all four levels of cell type granularity (from 7 cell types to 35 cell types).
Municipal Building Energy Usage
data.wprdc.org
datadiscoverystudio.org
csv, xlsx
Updated Jun 28, 2024
Cite
City of Pittsburgh (2024). Municipal Building Energy Usage [Dataset]. https://data.wprdc.org/dataset/municipal-building-energy-usage
Available download formats: csv, xlsx
Dataset updated
Jun 28, 2024
Dataset provided by
City of Pittsburgh
License
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This data set contains energy use data from 2009-2014 for 139 municipally operated buildings. Metrics include: Site & Source EUI, annual electricity, natural gas and district steam consumption, greenhouse gas emissions and energy cost. Weather-normalized data enable building performance comparisons over time, despite unusual weather events.
Database – all data for all years
open.canada.ca
ouvert.canada.ca
doc, html, png, zip
Updated Nov 28, 2024
Cite
Environment and Climate Change Canada (2024). Database – all data for all years [Dataset]. https://open.canada.ca/data/en/dataset/06022cc0-a31e-4b4c-850d-d4dccda5f3ac
Available download formats: html, doc, png, zip
Dataset updated
Nov 28, 2024
Dataset provided by
Environment and Climate Change Canada
License
Open Government Licence - Canada 2.0 https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
Time period covered
Jan 1, 1993 - Dec 31, 2023
Description
The National Pollutant Release Inventory (NPRI) is Canada's public inventory of pollutant releases (to air, water and land), disposals and transfers for recycling. This database contains the full NPRI dataset from 1993 to the current reporting year. To help you navigate, a Microsoft Word file provides information on the database’s structure and schema. The database is available in Microsoft Access format (accdb). The data are in normalized or “list” format and are optimized for pivot table analyses. The data are also available in CSV format: https://open.canada.ca/data/en/dataset/40e01423-7728-429c-ac9d-2954385ccdfb.

Please consult the following resources to enhance your analysis:
- Guide on using and interpreting NPRI data: https://www.canada.ca/en/environment-climate-change/services/national-pollutant-release-inventory/using-interpreting-data.html
- Access additional data from the NPRI, including datasets and mapping products: https://www.canada.ca/en/environment-climate-change/services/national-pollutant-release-inventory/tools-resources-data/exploredata.html

Supplemental Information
This data is also available in non-proprietary CSV format on the Bulk Data page: http://open.canada.ca/data/en/dataset/40e01423-7728-429c-ac9d-2954385ccdfb. These files contain data from 1993 to the latest reporting year available. These datasets are in normalized or ‘list’ format and are optimized for pivot table analyses.

Supporting Projects: National Pollutant Release Inventory (NPRI)
GC/MS Simulated Data Sets normalized using quantile normalization
search.dataone.org
dataverse.harvard.edu
Updated Nov 21, 2023
Cite
Scholtens, Denise (2023). GC/MS Simulated Data Sets normalized using quantile normalization [Dataset]. https://search.dataone.org/view/sha256%3Ac3b94a68005c6bac4212457d403eedc6d12c76d960c0b0d171bd8ec5386d9cd5
Dataset updated
Nov 21, 2023
Dataset provided by
Harvard Dataverse
Authors
Scholtens, Denise
Description
1000 simulated data sets stored in a list of R dataframes used in support of Reisetter et al. (submitted) 'Mixture model normalization for non-targeted gas chromatography / mass spectrometry metabolomics data'. These are results after normalization using quantile normalization (Bolstad et al. 2003).
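
For context, quantile normalization forces every sample to share the same empirical distribution: values are ranked within each sample, and each value is replaced by the mean across samples of the values at that rank. The sketch below is a simplified version that breaks ties by position rather than averaging them; the function name and list-of-lists layout are illustrative and unrelated to the R objects in this dataset.

```python
def quantile_normalize(samples):
    """Quantile-normalize equal-length samples.

    samples: list of lists, one inner list per sample, aligned by feature.
    Returns a new list of lists in the original feature order.
    Ties within a sample are broken by position (a simplification).
    """
    n_features = len(samples[0])
    sorted_samples = [sorted(sample) for sample in samples]
    # Mean across samples of the value at each rank.
    rank_means = [
        sum(s[i] for s in sorted_samples) / len(samples)
        for i in range(n_features)
    ]
    normalized = []
    for sample in samples:
        order = sorted(range(n_features), key=lambda i: sample[i])
        row = [0.0] * n_features
        for rank, feature_index in enumerate(order):
            row[feature_index] = rank_means[rank]
        normalized.append(row)
    return normalized


print(quantile_normalize([[5, 2, 3], [4, 1, 6]]))
```

After normalization every sample contains exactly the same set of values, only arranged according to that sample's own ranking.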
Dataset for: A graph-based algorithm for RNA-seq data normalization
zenodo.org
data.niaid.nih.gov
bin
Updated Jan 24, 2020
Cite
Diem-Trang Tran (2020). Dataset for: A graph-based algorithm for RNA-seq data normalization [Dataset]. http://doi.org/10.5281/zenodo.2667314
Available download formats: bin
Unique identifier
https://doi.org/10.5281/zenodo.2667314
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodo http://zenodo.org/
Authors
Diem-Trang Tran
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
mRNA-seq assays on mouse tissues were downloaded from the ENCODE project and consolidated into matrices of expression
GC/MS Simulated Data Sets normalized using median scaling
dataverse.harvard.edu
Updated Jan 25, 2017
Cite
Denise Scholtens (2017). GC/MS Simulated Data Sets normalized using median scaling [Dataset]. http://doi.org/10.7910/DVN/OYOLXD
Unique identifier
https://doi.org/10.7910/DVN/OYOLXD
Dataset updated
Jan 25, 2017
Dataset provided by
Harvard Dataverse
Authors
Denise Scholtens
License
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
1000 simulated data sets stored in a list of R dataframes used in support of Reisetter et al. (submitted) 'Mixture model normalization for non-targeted gas chromatography / mass spectrometry metabolomics data'. These are results after normalization using median scaling as described in Reisetter et al.
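
The exact median scaling procedure is defined in Reisetter et al. and is not reproduced in this listing. As a generic illustration only, one common form rescales each sample so that its median matches a shared target, taken here to be the median of the per-sample medians. The function name and that choice of target are assumptions.

```python
from statistics import median


def median_scale(samples):
    """Scale each sample so that all samples share the same median.

    samples: list of lists of positive intensities, one list per sample.
    The common target is the median of the per-sample medians (assumed).
    A sample with median zero cannot be scaled this way and would raise
    ZeroDivisionError.
    """
    medians = [median(sample) for sample in samples]
    target = median(medians)
    return [
        [value * target / sample_median for value in sample]
        for sample, sample_median in zip(samples, medians)
    ]


print(median_scale([[1, 2, 3], [2, 4, 6]]))
```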
LiDAR - Normalized Digital Surface Model - Tiles
catalog.data.gov
opendata.dc.gov
Updated Feb 4, 2025
Cite
D.C. Office of the Chief Technology Officer (2025). LiDAR - Normalized Digital Surface Model - Tiles [Dataset]. https://catalog.data.gov/dataset/lidar-normalized-digital-surface-model-tiles
Dataset updated
Feb 4, 2025
Dataset provided by
D.C. Office of the Chief Technology Officer
Description
Normalized Digital Surface Model - 1m resolution. The dataset contains the 1m Digital Surface Model for the District of Columbia. Some areas have limited data. The lidar dataset redaction was conducted under the guidance of the United States Secret Service. Except for classified ground points and classified water points, all lidar data returns and collected data were removed from the dataset within the United States Secret Service 1m redaction boundary generated for the 2017 orthophoto flight.
NOAA Climate Data Record (CDR) of Normalized Difference Vegetation Index...
catalog.data.gov
ncei.noaa.gov
Updated Nov 2, 2023
Cite
DOC/NOAA/NESDIS/NCEI > National Centers for Environmental Information, NESDIS, NOAA, U.S. Department of Commerce (Point of Contact) (2023). NOAA Climate Data Record (CDR) of Normalized Difference Vegetation Index (NDVI), Version 4 (Version Superseded) [Dataset]. https://catalog.data.gov/dataset/noaa-climate-data-record-cdr-of-normalized-difference-vegetation-index-ndvi-version-4-version-s3
Dataset updated
Nov 2, 2023
Dataset provided by
National Centers for Environmental Information https://www.ncei.noaa.gov/
National Oceanic and Atmospheric Administration http://www.noaa.gov/
United States Department of Commerce http://www.commerce.gov/
National Environmental Satellite, Data, and Information Service
Description
Note: This dataset version has been superseded by a newer version. It is highly recommended that users access the current version. Users should only use this version for special cases, such as reproducing studies that used this version. This dataset contains gridded daily Normalized Difference Vegetation Index (NDVI) derived from the NOAA Climate Data Record (CDR) of Advanced Very High Resolution Radiometer (AVHRR) Surface Reflectance. The data record spans from 1981 to 10 days before the present using data from eight NOAA polar orbiting satellites: NOAA-7, -9, -11, -14, -16, -17, -18 and -19. The data are projected on a 0.05 degree x 0.05 degree global grid. This dataset is one of the Land Surface CDR products produced by the NASA Goddard Space Flight Center (GSFC) and the University of Maryland (UMD). Improvements made for Version 4 include 1) additional data from NOAA satellites extending the time period, 2) improved geolocation accuracy from use of OLE instead of TLE, 3) center of the grid is used as the reference, and 4) data value of a grid cell is computed as an average of available good observations. The dataset is in the netCDF-4 file format following ACDD and CF Conventions. The dataset is accompanied by algorithm documentation, data flow diagram and source code for the NOAA CDR Program.
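
As a reminder of what the index measures, NDVI is conventionally defined as (NIR - Red) / (NIR + Red) on surface reflectance. The sketch below applies that standard formula to single values; the function name is illustrative, and the CDR's own quality screening and gridding steps are not represented.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index from NIR and red reflectance.

    Returns NaN where both reflectances are zero, since the ratio is
    undefined there.
    """
    total = nir + red
    if total == 0:
        return float("nan")
    return (nir - red) / total


print(round(ndvi(0.5, 0.1), 3))   # dense vegetation: strongly positive
print(round(ndvi(0.2, 0.2), 3))   # equal reflectance: zero
```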
Residential Existing Homes (One to Four Units) Energy Efficiency Meter...
catalog.data.gov
datasets.ai
Updated Sep 15, 2023
Cite
data.ny.gov (2023). Residential Existing Homes (One to Four Units) Energy Efficiency Meter Evaluated Project Data: 2007 – 2012 [Dataset]. https://catalog.data.gov/dataset/residential-existing-homes-one-to-four-units-energy-efficiency-meter-evaluated-projec-2007
Dataset updated
Sep 15, 2023
Dataset provided by
data.ny.gov
Description
IMPORTANT! PLEASE READ DISCLAIMER BEFORE USING DATA. This dataset backcasts estimated modeled savings for a subset of 2007-2012 completed projects in the Home Performance with ENERGY STAR® Program against normalized savings calculated by an open source energy efficiency meter available at https://www.openee.io/. Open source code uses utility-grade metered consumption to weather-normalize the pre- and post-consumption data using standard methods with no discretionary independent variables. The open source energy efficiency meter allows private companies, utilities, and regulators to calculate energy savings from energy efficiency retrofits with increased confidence and replicability of results. This dataset is intended to lay a foundation for future innovation and deployment of the open source energy efficiency meter across the residential energy sector, and to help inform stakeholders interested in pay for performance programs, where providers are paid for realizing measurable weather-normalized results. To download the open source code, please visit the website at https://github.com/openeemeter/eemeter/releases
DISCLAIMER: Normalized Savings using open source OEE meter. Several data elements, including Evaluated Annual Electric Savings (kWh), Evaluated Annual Gas Savings (MMBtu), Pre-retrofit Baseline Electric (kWh), Pre-retrofit Baseline Gas (MMBtu), Post-retrofit Usage Electric (kWh), and Post-retrofit Usage Gas (MMBtu), are direct outputs from the open source OEE meter. Home Performance with ENERGY STAR® Estimated Savings. Several data elements, including Estimated Annual kWh Savings, Estimated Annual MMBtu Savings, and Estimated First Year Energy Savings, represent contractor-reported savings derived from energy modeling software calculations and not actual realized energy savings. The accuracy of the Estimated Annual kWh Savings and Estimated Annual MMBtu Savings for projects has been evaluated by an independent third party.
The results of the Home Performance with ENERGY STAR impact analysis indicate that, on average, actual savings amount to 35 percent of the Estimated Annual kWh Savings and 65 percent of the Estimated Annual MMBtu Savings. For more information, please refer to the Evaluation Report published on NYSERDA’s website at: http://www.nyserda.ny.gov/-/media/Files/Publications/PPSER/Program-Evaluation/2012ContractorReports/2012-HPwES-Impact-Report-with-Appendices.pdf. This dataset includes the following data points for a subset of projects completed in 2007-2012: Contractor ID, Project County, Project City, Project ZIP, Climate Zone, Weather Station, Weather Station-Normalization, Project Completion Date, Customer Type, Size of Home, Volume of Home, Number of Units, Year Home Built, Total Project Cost, Contractor Incentive, Total Incentives, Amount Financed through Program, Estimated Annual kWh Savings, Estimated Annual MMBtu Savings, Estimated First Year Energy Savings, Evaluated Annual Electric Savings (kWh), Evaluated Annual Gas Savings (MMBtu), Pre-retrofit Baseline Electric (kWh), Pre-retrofit Baseline Gas (MMBtu), Post-retrofit Usage Electric (kWh), Post-retrofit Usage Gas (MMBtu), Central Hudson, Consolidated Edison, LIPA, National Grid, National Fuel Gas, New York State Electric and Gas, Orange and Rockland, Rochester Gas and Electric. How does your organization use this dataset? What other NYSERDA or energy-related datasets would you like to see on Open NY? Let us know by emailing OpenNY@nyserda.ny.gov.
Dataset of normalised Slovene text KonvNormSl 1.0
live.european-language-grid.eu
clarin.si
binary format
Updated Sep 18, 2016
Cite
(2016). Dataset of normalised Slovene text KonvNormSl 1.0 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/8217
Available download formats: binary format
Dataset updated
Sep 18, 2016
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Data used in the experiments described in:

Nikola Ljubešić, Katja Zupan, Darja Fišer and Tomaž Erjavec: Normalising Slovene data: historical texts vs. user-generated content. Proceedings of KONVENS 2016, September 19–21, 2016, Bochum, Germany.
https://www.linguistics.rub.de/konvens16/pub/19_konvensproc.pdf
(https://www.linguistics.rub.de/konvens16/)

Data are split into the "token" folder (experiment on normalising individual tokens) and "segment" folder (experiment on normalising whole segments of text, i.e. sentences or tweets). Each experiment folder contains the "train", "dev" and "test" subfolders. Each subfolder contains two files for each sample, the original data (.orig.txt) and the data with hand-normalised words (.norm.txt). The files are aligned by lines.

There are four datasets:
- goo300k-bohoric: historical Slovene, hard case (<1850)
- goo300k-gaj: historical Slovene, easy case (1850 - 1900)
- tweet-L3: Slovene tweets, hard case (non-standard language)
- tweet-L1: Slovene tweets, easy case (mostly standard language)

The goo300k data come from http://hdl.handle.net/11356/1025, while the tweet data originate from the JANES project (http://nl.ijs.si/janes/english/).

The text in the files has been split by inserting spaces between characters, with underscore (_) substituting the space character. Tokens not relevant for normalisation (e.g. URLs, hashtags) have been substituted by the inverted question mark '¿' character.
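The character-level encoding described above can be reversed with a few lines of code. The following is a minimal sketch, not part of the dataset's own tooling: the function names `decode_line` and `pair_lines` are hypothetical, and the example lines are made up for illustration. It assumes that every underscore in a file stands for a space, so a literal underscore in the source text (if any exist) could not be told apart.

```python
from typing import Iterable, Iterator, Tuple


def decode_line(line: str) -> str:
    """Join the space-separated characters and turn '_' back into a space."""
    return "".join(" " if ch == "_" else ch for ch in line.split())


def pair_lines(
    orig_lines: Iterable[str], norm_lines: Iterable[str]
) -> Iterator[Tuple[str, str]]:
    """Yield (original, normalised) pairs from two line-aligned sequences."""
    for orig, norm in zip(orig_lines, norm_lines):
        yield decode_line(orig), decode_line(norm)


# Made-up example lines in the described format (not taken from the dataset).
orig_example = ["j e s t _ ¿"]
norm_example = ["j a z _ ¿"]
pairs = list(pair_lines(orig_example, norm_example))
```

In practice the two sequences would be the lines of a matching .orig.txt and .norm.txt file. The '¿' placeholder passes through unchanged, so it can still be filtered out afterwards if needed.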
2020 LiDAR - Normalized Digital Surface Model
catalog.data.gov
opendata.dc.gov
Updated Nov 5, 2024
D.C. Office of the Chief Technology Officer (2024). 2020 LiDAR - Normalized Digital Surface Model [Dataset]. https://catalog.data.gov/dataset/2020-lidar-normalized-digital-surface-model
Dataset updated
Nov 5, 2024
Dataset provided by
D.C. Office of the Chief Technology Officer
Description
Normalized Digital Surface Model - 1m resolution. The dataset contains the 1m Digital Surface Model for the District of Columbia. Some areas have limited data. The lidar dataset redaction was conducted under the guidance of the United States Secret Service. Except for classified ground points and classified water points, all lidar data returns and collected data were removed from the dataset within the United States Secret Service 1m redaction boundary generated for the 2017 orthophoto flight. This dataset is provided as an ArcGIS Image service. Please note, the download feature for this image service in Open Data DC provides a compressed PNG, JPEG or TIFF. The compressed GeoTIFF mosaic raster dataset is available under additional options when viewing downloads. Requests for the individual GeoTIFF set of images should be sent to open.data@dc.gov.
Post-processed and normalized data sets for the data processing, analysis,...
b2find.dkrz.de
Updated Jan 21, 2025
(2025). Post-processed and normalized data sets for the data processing, analysis, and evaluation methods for co-design of coreless filament-wound structures - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/ab86210e-83e9-548a-b4e4-f5a9f72d1593
Dataset updated
Jan 21, 2025
Description
Post-processed and normalized data sets for specimens S2-0, S2-1, S2-2, S2-4, S2-8 and S2-9, used in Figure 14 of the publication "Data processing, analysis, and evaluation methods for co-design of coreless filament-wound building systems", in the Journal of Computational Design and Engineering. The data allow the comparison of different geometrical, fabrication and structural parameters per segment of each specimen. The raw data were obtained during the robotic fabrication and mechanical testing of specimens S1, S2 and S3 for the publication "Computational co-design framework for coreless wound fibre-polymer composite structures. Journal of Computational Design and Engineering 9(2), 310-32", and the complete raw data are published in the data set "Object model data sets of the case study specimens for the computational co-design framework for coreless wound fibre-polymer composite structures (V1)". To extend the research, 6 specimens of the series S2 were chosen for further post-processing. A representative value per segment was calculated for each data set. The fabrication data, originally produced per wound layer, are either accumulated or averaged over all layers in one segment. For the geometrical and structural data, the average or maximum value over all bar elements in one segment was chosen. These decisions were made to find representative values based on the experience of the researchers, and they are described in the data set. Finally, because all data are normalized with respect to the 6 specimens and all segments, they can be analyzed in the same format, which makes the geometrical, structural and fabrication data comparable and helps to find interrelations and possible reasons for the failure of the specimens during the mechanical test.
Data from: Online spatial normalization for real-time fMRI
datadryad.org
explore.openaire.eu
zip
Updated Jul 9, 2015
Xiaofei Li; Li Yao; Qing Ye; Xiaojie Zhao (2015). Online spatial normalization for real-time fMRI [Dataset]. http://doi.org/10.5061/dryad.1642b
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.1642b
Dataset updated
Jul 9, 2015
Dataset provided by
Dryad
Authors
Xiaofei Li; Li Yao; Qing Ye; Xiaojie Zhao
Time period covered
2015
Description
fMRI_Data_From_A_Finger_Tapping_Run_Subject_01-20: The data were acquired from a finger tapping run in an rtfMRI experiment, which consisted of eight on-going runs. Twenty volunteers (age 22.3 ± 1.6, 8 females) participated in the experiment, which was approved by the Institutional Review Board (IRB) of the State Key Laboratory of Cognitive Neuroscience and Learning in Beijing Normal University; all of the subjects signed informed consent prior to scanning. The run, which lasted 4.5 min, consisted of five rest blocks and four task blocks, each lasting 30 s. During the rest blocks, a text cue “REST” was shown in the center of the screen and the subjects were instructed to rest. In the task blocks, a text cue “PUSH” was shown, and the subjects were instructed to tap their right-hand fingers.
Sample dataset for the models trained and tested in the paper 'Can AI be...
zenodo.org
zip
Updated Aug 1, 2024
Elena Tomasi; Gabriele Franch; Marco Cristoforetti (2024). Sample dataset for the models trained and tested in the paper 'Can AI be enabled to dynamical downscaling? Training a Latent Diffusion Model to mimic km-scale COSMO-CLM downscaling of ERA5 over Italy' [Dataset]. http://doi.org/10.5281/zenodo.12934521
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.12934521
Dataset updated
Aug 1, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Elena Tomasi; Gabriele Franch; Marco Cristoforetti
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
This repository contains a sample of the input data for the models of the preprint "Can AI be enabled to dynamical downscaling? Training a Latent Diffusion Model to mimic km-scale COSMO-CLM downscaling of ERA5 over Italy". It allows the user to test and train the models on a reduced dataset (45GB).

This sample dataset comprises ~3 years of normalized hourly data for both low-resolution predictors and high-resolution target variables. Data has been randomly picked from the whole dataset, from 2000 to 2020, with 70% of data coming from the original training dataset, 15% from the original validation dataset, and 15% from the original test dataset. Low-resolution data are preprocessed ERA5 data while high-resolution data are preprocessed VHR-REA CMCC data. Details on the performed preprocessing are available in the paper.

This sample dataset also includes files for metadata, static data, normalization, and plotting.

To use the data, clone the corresponding repository and unzip this zip file in the data folder.



