100+ datasets found

Data from: Normalized data
figshare.com
txt
Updated Jun 15, 2022
Cite
Yalbi Balderas (2022). Normalized data [Dataset]. http://doi.org/10.6084/m9.figshare.20076047.v1
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.20076047.v1
Dataset updated
Jun 15, 2022
Dataset provided by
Figshare (http://figshare.com/)
figshare
Authors
Yalbi Balderas
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Normalize data
Normalized Dataset
kaggle.com
zip
Updated Jun 15, 2022
Cite
Hemanth S (2022). Normalized Dataset [Dataset]. https://www.kaggle.com/datasets/hemanth012/normalized-dataset
Explore at:
Available download formats: zip (1009250933 bytes)
Dataset updated
Jun 15, 2022
Authors
Hemanth S
Description
Dataset

This dataset was created by Hemanth S

Contents
Data from: LVMED: Dataset of Latvian text normalisation samples for the...
repository.clarin.lv
Updated May 30, 2023
Cite
Viesturs Jūlijs Lasmanis; Normunds Grūzītis (2023). LVMED: Dataset of Latvian text normalisation samples for the medical domain [Dataset]. https://repository.clarin.lv/repository/xmlui/handle/20.500.12574/85
Explore at:
Dataset updated
May 30, 2023
Authors
Viesturs Jūlijs Lasmanis; Normunds Grūzītis
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
The CSV dataset contains sentence pairs for a text-to-text transformation task: given a sentence that contains 0..n abbreviations, rewrite (normalize) the sentence in full words (word forms).

Training dataset: 64,665 sentence pairs. Validation dataset: 7,185 sentence pairs. Testing dataset: 7,984 sentence pairs.

All sentences are extracted from a public web corpus (https://korpuss.lv/id/Tīmeklis2020) and contain at least one medical term.
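
As a quick sanity check on the split sizes quoted above, the fractions can be computed directly. This is only a sketch: the variable names are illustrative, and the counts are the ones given in the description.

```python
# Sentence-pair counts as stated in the dataset description.
splits = {"train": 64_665, "validation": 7_185, "test": 7_984}

total = sum(splits.values())
fractions = {name: count / total for name, count in splits.items()}

for name, frac in fractions.items():
    print(f"{name}: {splits[name]} pairs ({frac:.1%})")
print(f"total: {total} pairs")
```

The counts correspond to roughly an 81/9/10 split across training, validation, and testing.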
WLCI - Important Agricultural Lands Assessment (Input Raster: Normalized...
catalog.data.gov
data.usgs.gov
Updated Oct 30, 2025
Cite
U.S. Geological Survey (2025). WLCI - Important Agricultural Lands Assessment (Input Raster: Normalized Antelope Damage Claims) [Dataset]. https://catalog.data.gov/dataset/wlci-important-agricultural-lands-assessment-input-raster-normalized-antelope-damage-claim
Explore at:
Dataset updated
Oct 30, 2025
Dataset provided by
U.S. Geological Survey
Description
The values in this raster are unit-less scores ranging from 0 to 1 that represent normalized dollars per acre damage claims from antelope on Wyoming lands. This raster is one of 9 inputs used to calculate the "Normalized Importance Index."
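
The description does not say how the 0 to 1 scores were derived. One common way to obtain unit-less scores in that range is min-max scaling, sketched below under that assumption. The function name and the sample values are hypothetical and are not taken from the USGS data.

```python
import numpy as np

def min_max_normalize(values: np.ndarray) -> np.ndarray:
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = np.nanmin(values), np.nanmax(values)
    if hi == lo:
        # A constant raster carries no contrast; return zeros instead of dividing by zero.
        return np.zeros_like(values, dtype=float)
    return (values - lo) / (hi - lo)

# Hypothetical dollars-per-acre claims on a tiny 2x3 grid (made-up numbers).
claims = np.array([[0.0, 2.5, 5.0],
                   [10.0, 7.5, 1.0]])
scores = min_max_normalize(claims)
print(scores)
```

Because every input raster ends up on the same 0 to 1 scale, rasters measured in different units can then be combined into a single index.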
Identification of Novel Reference Genes Suitable for qRT-PCR Normalization...
plos.figshare.com
tiff
Updated May 31, 2023
Cite
Yu Hu; Shuying Xie; Jihua Yao (2023). Identification of Novel Reference Genes Suitable for qRT-PCR Normalization with Respect to the Zebrafish Developmental Stage [Dataset]. http://doi.org/10.1371/journal.pone.0149277
Explore at:
Available download formats: tiff
Unique identifier
https://doi.org/10.1371/journal.pone.0149277
Dataset updated
May 31, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Yu Hu; Shuying Xie; Jihua Yao
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Reference genes used in normalizing qRT-PCR data are critical for the accuracy of gene expression analysis. However, many traditional reference genes used in zebrafish early development are not appropriate because of their variable expression levels during embryogenesis. In the present study, we used our previous RNA-Seq dataset to identify novel reference genes suitable for gene expression analysis during zebrafish early developmental stages. We first selected 197 most stably expressed genes from an RNA-Seq dataset (29,291 genes in total), according to the ratio of their maximum to minimum RPKM values. Among the 197 genes, 4 genes with moderate expression levels and the least variation throughout 9 developmental stages were identified as candidate reference genes. Using four independent statistical algorithms (delta-CT, geNorm, BestKeeper and NormFinder), the stability of qRT-PCR expression of these candidates was then evaluated and compared to that of actb1 and actb2, two commonly used zebrafish reference genes. Stability rankings showed that two genes, namely mobk13 (mob4) and lsm12b, were more stable than actb1 and actb2 in most cases. To further test the suitability of mobk13 and lsm12b as novel reference genes, they were used to normalize three well-studied target genes. The results showed that mobk13 and lsm12b were more suitable than actb1 and actb2 with respect to zebrafish early development. We recommend mobk13 and lsm12b as new optimal reference genes for zebrafish qRT-PCR analysis during embryogenesis and early larval stages.
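
The first selection step described above, ranking genes by the ratio of their maximum to minimum RPKM, can be sketched as follows. This is an illustration only: the gene names, stage labels, and values are invented, and the study's actual filtering may include steps not shown here.

```python
import pandas as pd

# Hypothetical RPKM table: rows are genes, columns are developmental stages.
rpkm = pd.DataFrame(
    {
        "stage_1": [10.0, 50.0, 5.0],
        "stage_2": [12.0, 5.0, 5.5],
        "stage_3": [11.0, 100.0, 6.5],
    },
    index=["gene_a", "gene_b", "gene_c"],
)

# Stability score: max/min RPKM across stages (closer to 1 means more stable).
ratio = rpkm.max(axis=1) / rpkm.min(axis=1)

# Keep the n most stable genes (the study kept 197; n=2 for this toy table).
most_stable = ratio.nsmallest(2).index.tolist()
print(most_stable)
```

A gene with zero expression at any stage would produce an infinite ratio, so real data would likely need such genes filtered out first.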
Data_Sheet_1_NormExpression: An R Package to Normalize Gene Expression Data...
frontiersin.figshare.com
application/cdfv2
Updated Jun 1, 2023
Cite
Zhenfeng Wu; Weixiang Liu; Xiufeng Jin; Haishuo Ji; Hua Wang; Gustavo Glusman; Max Robinson; Lin Liu; Jishou Ruan; Shan Gao (2023). Data_Sheet_1_NormExpression: An R Package to Normalize Gene Expression Data Using Evaluated Methods.doc [Dataset]. http://doi.org/10.3389/fgene.2019.00400.s001
Explore at:
Available download formats: application/cdfv2
Unique identifier
https://doi.org/10.3389/fgene.2019.00400.s001
Dataset updated
Jun 1, 2023
Dataset provided by
Frontiers
Authors
Zhenfeng Wu; Weixiang Liu; Xiufeng Jin; Haishuo Ji; Hua Wang; Gustavo Glusman; Max Robinson; Lin Liu; Jishou Ruan; Shan Gao
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data normalization is a crucial step in the gene expression analysis as it ensures the validity of its downstream analyses. Although many metrics have been designed to evaluate the existing normalization methods, different metrics or different datasets by the same metric yield inconsistent results, particularly for the single-cell RNA sequencing (scRNA-seq) data. The worst situations could be that one method evaluated as the best by one metric is evaluated as the poorest by another metric, or one method evaluated as the best using one dataset is evaluated as the poorest using another dataset. Here raises an open question: principles need to be established to guide the evaluation of normalization methods. In this study, we propose a principle that one normalization method evaluated as the best by one metric should also be evaluated as the best by another metric (the consistency of metrics) and one method evaluated as the best using scRNA-seq data should also be evaluated as the best using bulk RNA-seq data or microarray data (the consistency of datasets). Then, we designed a new metric named Area Under normalized CV threshold Curve (AUCVC) and applied it with another metric mSCC to evaluate 14 commonly used normalization methods using both scRNA-seq data and bulk RNA-seq data, satisfying the consistency of metrics and the consistency of datasets. Our findings paved the way to guide future studies in the normalization of gene expression data with its evaluation. The raw gene expression data, normalization methods, and evaluation metrics used in this study have been included in an R package named NormExpression. NormExpression provides a framework and a fast and simple way for researchers to select the best method for the normalization of their gene expression data based on the evaluation of different methods (particularly some data-driven methods or their own methods) in the principle of the consistency of metrics and the consistency of datasets.
Data from: proteiNorm – A User-Friendly Tool for Normalization and Analysis...
datasetcatalog.nlm.nih.gov
Updated Sep 30, 2020
Cite
Byrd, Alicia K; Zafar, Maroof K; Graw, Stefan; Tang, Jillian; Byrum, Stephanie D; Peterson, Eric C.; Bolden, Chris (2020). proteiNorm – A User-Friendly Tool for Normalization and Analysis of TMT and Label-Free Protein Quantification [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000568582
Explore at:
Dataset updated
Sep 30, 2020
Authors
Byrd, Alicia K; Zafar, Maroof K; Graw, Stefan; Tang, Jillian; Byrum, Stephanie D; Peterson, Eric C.; Bolden, Chris
Description
The technological advances in mass spectrometry allow us to collect more comprehensive data with higher quality and increasing speed. With the rapidly increasing amount of data generated, the need for streamlining analyses becomes more apparent. Proteomics data is known to be often affected by systemic bias from unknown sources, and failing to adequately normalize the data can lead to erroneous conclusions. To allow researchers to easily evaluate and compare different normalization methods via a user-friendly interface, we have developed “proteiNorm”. The current implementation of proteiNorm accommodates preliminary filters on peptide and sample levels followed by an evaluation of several popular normalization methods and visualization of the missing value. The user then selects an adequate normalization method and one of the several imputation methods used for the subsequent comparison of different differential expression methods and estimation of statistical power. The application of proteiNorm and interpretation of its results are demonstrated on two tandem mass tag multiplex (TMT6plex and TMT10plex) and one label-free spike-in mass spectrometry example data set. The three data sets reveal how the normalization methods perform differently on different experimental designs and the need for evaluation of normalization methods for each mass spectrometry experiment. With proteiNorm, we provide a user-friendly tool to identify an adequate normalization method and to select an appropriate method for differential expression analysis.
DataSheet1_TimeNorm: a novel normalization method for time course microbiome...
datasetcatalog.nlm.nih.gov
frontiersin.figshare.com
Updated Sep 24, 2024
Cite
An, Lingling; Lu, Meng; Butt, Hamza; Luo, Qianwen; Du, Ruofei; Lytal, Nicholas; Jiang, Hongmei (2024). DataSheet1_TimeNorm: a novel normalization method for time course microbiome data.pdf [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001407445
Explore at:
Dataset updated
Sep 24, 2024
Authors
An, Lingling; Lu, Meng; Butt, Hamza; Luo, Qianwen; Du, Ruofei; Lytal, Nicholas; Jiang, Hongmei
Description
Metagenomic time-course studies provide valuable insights into the dynamics of microbial systems and have become increasingly popular alongside the reduction in costs of next-generation sequencing technologies. Normalization is a common but critical preprocessing step before proceeding with downstream analysis. To the best of our knowledge, currently there is no reported method to appropriately normalize microbial time-series data. We propose TimeNorm, a novel normalization method that considers the compositional property and time dependency in time-course microbiome data. It is the first method designed for normalizing time-series data within the same time point (intra-time normalization) and across time points (bridge normalization), separately. Intra-time normalization normalizes microbial samples under the same condition based on common dominant features. Bridge normalization detects and utilizes a group of most stable features across two adjacent time points for normalization. Through comprehensive simulation studies and application to a real study, we demonstrate that TimeNorm outperforms existing normalization methods and boosts the power of downstream differential abundance analysis.
Methods for normalizing microbiome data: an ecological perspective
data.niaid.nih.gov
datadryad.org
zip
Updated Oct 30, 2018
Cite
Donald T. McKnight; Roger Huerlimann; Deborah S. Bower; Lin Schwarzkopf; Ross A. Alford; Kyall R. Zenger (2018). Methods for normalizing microbiome data: an ecological perspective [Dataset]. http://doi.org/10.5061/dryad.tn8qs35
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.tn8qs35
Dataset updated
Oct 30, 2018
Dataset provided by
University of New England
James Cook University
Authors
Donald T. McKnight; Roger Huerlimann; Deborah S. Bower; Lin Schwarzkopf; Ross A. Alford; Kyall R. Zenger
License
https://spdx.org/licenses/CC0-1.0.html
Description
1. Microbiome sequencing data often need to be normalized due to differences in read depths, and recommendations for microbiome analyses generally warn against using proportions or rarefying to normalize data and instead advocate alternatives, such as upper quartile, CSS, edgeR-TMM, or DESeq-VS. Those recommendations are, however, based on studies that focused on differential abundance testing and variance standardization, rather than community-level comparisons (i.e., beta diversity). Also, standardizing the within-sample variance across samples may suppress differences in species evenness, potentially distorting community-level patterns. Furthermore, the recommended methods use log transformations, which we expect to exaggerate the importance of differences among rare OTUs, while suppressing the importance of differences among common OTUs.
2. We tested these theoretical predictions via simulations and a real-world data set.
3. Proportions and rarefying produced more accurate comparisons among communities and were the only methods that fully normalized read depths across samples. Additionally, upper quartile, CSS, edgeR-TMM, and DESeq-VS often masked differences among communities when common OTUs differed, and they produced false positives when rare OTUs differed.
4. Based on our simulations, normalizing via proportions may be superior to other commonly used methods for comparing ecological communities.
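
The two read-depth normalizations favored in this abstract, proportions and rarefying, can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' code: the function names and the toy OTU table are hypothetical, and rarefying is implemented here as random subsampling without replacement to a common depth.

```python
import numpy as np

def to_proportions(counts: np.ndarray) -> np.ndarray:
    """Divide each sample (row) by its total read count so every row sums to 1."""
    totals = counts.sum(axis=1, keepdims=True)
    return counts / totals

def rarefy(counts: np.ndarray, depth: int, seed: int = 0) -> np.ndarray:
    """Subsample each sample without replacement down to a common read depth."""
    rng = np.random.default_rng(seed)
    return np.vstack([rng.multivariate_hypergeometric(row, depth) for row in counts])

# Toy OTU table: 2 samples (rows) x 3 OTUs (columns), with unequal read depths.
otu = np.array([[80, 15, 5],
                [400, 500, 100]])
props = to_proportions(otu)
rarefied = rarefy(otu, depth=100)
print(props)
print(rarefied)
```

Both outputs have the same total per sample (1 for proportions, 100 reads for the rarefied table), which is what normalizing read depth across samples means here.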
🔢🖊️ Digital Recognition: MNIST Dataset
kaggle.com
zip
Updated Nov 13, 2025
Cite
Wasiq Ali (2025). 🔢🖊️ Digital Recognition: MNIST Dataset [Dataset]. https://www.kaggle.com/datasets/wasiqaliyasir/digital-mnist-dataset
Explore at:
Available download formats: zip (2278207 bytes)
Dataset updated
Nov 13, 2025
Authors
Wasiq Ali
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Handwritten Digits Pixel Dataset - Documentation

Overview

The Handwritten Digits Pixel Dataset is a collection of numerical data representing handwritten digits from 0 to 9. Unlike image datasets that store actual image files, this dataset contains pixel intensity values arranged in a structured tabular format, making it ideal for machine learning and data analysis applications.

Dataset Description

Basic Information

Format: CSV (Comma-Separated Values)

Total Samples: [Number of rows based on your dataset]

Features: 784 pixel columns (28×28 pixels) + 1 label column

Label Range: Digits 0-9

Pixel Value Range: 0-255 (grayscale intensity)

File Structure

Column Description

label: The target variable representing the digit (0-9)

pixel columns: 784 columns named in the format [row]x[column]

Each pixel column contains integer values from 0-255 representing grayscale intensity

Data Characteristics

Label Distribution

The dataset contains handwritten digit samples with the following distribution:

Digit 0: [X] samples

Digit 1: [X] samples

Digit 2: [X] samples

Digit 3: [X] samples

Digit 4: [X] samples

Digit 5: [X] samples

Digit 6: [X] samples

Digit 7: [X] samples

Digit 8: [X] samples

Digit 9: [X] samples

(Note: Actual distribution counts would be calculated from your specific dataset)

Data Quality

Missing Values: No missing values detected

Data Type: All values are integers

Normalization: Pixel values range from 0-255 (can be normalized to 0-1 for ML models)

Consistency: Uniform 28×28 grid structure across all samples

Technical Specifications

Data Preprocessing Requirements

Normalization: Scale pixel values from 0-255 to 0-1 range

Reshaping: Convert 1D pixel arrays to 2D 28×28 matrices for visualization

Train-Test Split: Recommended 80-20 or 70-30 split for model development
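
The three preprocessing steps listed above can be sketched with NumPy alone. The arrays below are synthetic stand-ins for the CSV contents (an assumption made so the example runs without the file), and the 80-20 split is one of the two ratios the documentation suggests.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for the CSV contents: 10 samples x 784 pixel values in 0-255.
X = rng.integers(0, 256, size=(10, 784))
y = rng.integers(0, 10, size=10)

# Normalization: scale 0-255 intensities into the 0-1 range.
X_scaled = X / 255.0

# Reshaping: turn each flat 784-vector into a 28x28 matrix.
images = X_scaled.reshape(-1, 28, 28)

# Train-test split (80-20): shuffle indices, then cut at 80%.
indices = rng.permutation(len(X_scaled))
cut = int(0.8 * len(indices))
train_idx, test_idx = indices[:cut], indices[cut:]
X_train, X_test = X_scaled[train_idx], X_scaled[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print(images.shape, X_train.shape, X_test.shape)
```

Shuffling before the cut matters when the rows of the file are ordered by label, since an unshuffled split could leave some digits out of the test set.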

Recommended Machine Learning Approaches

Classification Algorithms:

Random Forest

Support Vector Machines (SVM)

Neural Networks

K-Nearest Neighbors (KNN)

Deep Learning Architectures:

Convolutional Neural Networks (CNNs)

Multi-layer Perceptrons (MLPs)

Dimensionality Reduction:

PCA (Principal Component Analysis)

t-SNE for visualization

Usage Examples

Loading the Dataset

import pandas as pd

# Load the dataset
df = pd.read_csv('/kaggle/input/handwritten_digits_pixel_dataset/mnist.csv')

# Separate features and labels
X = df.drop('label', axis=1)
y = df['label']

# Normalize pixel values
X_normalized = X / 255.0
LEMming: A Linear Error Model to Normalize Parallel Quantitative Real-Time...
plos.figshare.com
pdf
Updated Jun 2, 2023
Cite
Ronny Feuer; Sebastian Vlaic; Janine Arlt; Oliver Sawodny; Uta Dahmen; Ulrich M. Zanger; Maria Thomas (2023). LEMming: A Linear Error Model to Normalize Parallel Quantitative Real-Time PCR (qPCR) Data as an Alternative to Reference Gene Based Methods [Dataset]. http://doi.org/10.1371/journal.pone.0135852
Explore at:
Available download formats: pdf
Unique identifier
https://doi.org/10.1371/journal.pone.0135852
Dataset updated
Jun 2, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Ronny Feuer; Sebastian Vlaic; Janine Arlt; Oliver Sawodny; Uta Dahmen; Ulrich M. Zanger; Maria Thomas
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Background

Gene expression analysis is an essential part of biological and medical investigations. Quantitative real-time PCR (qPCR) is characterized with excellent sensitivity, dynamic range, reproducibility and is still regarded to be the gold standard for quantifying transcripts abundance. Parallelization of qPCR such as by microfluidic Taqman Fluidigm Biomark Platform enables evaluation of multiple transcripts in samples treated under various conditions. Despite advanced technologies, correct evaluation of the measurements remains challenging. Most widely used methods for evaluating or calculating gene expression data include geNorm and ΔΔCt, respectively. They rely on one or several stable reference genes (RGs) for normalization, thus potentially causing biased results. We therefore applied multivariable regression with a tailored error model to overcome the necessity of stable RGs.

Results

We developed a RG independent data normalization approach based on a tailored linear error model for parallel qPCR data, called LEMming. It uses the assumption that the mean Ct values within samples of similarly treated groups are equal. Performance of LEMming was evaluated in three data sets with different stability patterns of RGs and compared to the results of geNorm normalization. Data set 1 showed that both methods gave similar results if stable RGs are available. Data set 2 included RGs which are stable according to geNorm criteria, but became differentially expressed in normalized data evaluated by a t-test. geNorm-normalized data showed an effect of a shifted mean per gene per condition whereas LEMming-normalized data did not. Comparing the decrease of standard deviation from raw data to geNorm and to LEMming, the latter was superior. In data set 3 according to geNorm calculated average expression stability and pairwise variation, stable RGs were available, but t-tests of raw data contradicted this. Normalization with RGs resulted in distorted data contradicting literature, while LEMming normalized data did not.

Conclusions

If RGs are coexpressed but are not independent of the experimental conditions the stability criteria based on inter- and intragroup variation fail. The linear error model developed, LEMming, overcomes the dependency of using RGs for parallel qPCR measurements, besides resolving biases of both technical and biological nature in qPCR. However, to distinguish systematic errors per treated group from a global treatment effect an additional measurement is needed. Quantification of total cDNA content per sample helps to identify systematic errors.
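
For context, the reference-gene-based ΔΔCt calculation that the abstract contrasts with LEMming can be sketched as below. This is the widely used 2^(-ΔΔCt) form, which assumes roughly 100% amplification efficiency. It is not the LEMming model itself, and the function name and Ct values are made up for illustration.

```python
def delta_delta_ct_fold_change(ct_target_treated, ct_ref_treated,
                               ct_target_control, ct_ref_control):
    """Relative expression by the 2^(-ddCt) method, assuming ~100% PCR efficiency."""
    # Normalize the target gene to the reference gene within each condition.
    delta_ct_treated = ct_target_treated - ct_ref_treated
    delta_ct_control = ct_target_control - ct_ref_control
    # Compare the two conditions.
    delta_delta_ct = delta_ct_treated - delta_ct_control
    return 2.0 ** (-delta_delta_ct)

# Reference gene stable across conditions (Ct 18 in both).
fold_stable_ref = delta_delta_ct_fold_change(22.0, 18.0, 24.0, 18.0)

# Same target Ct values, but the reference gene shifts by one cycle under treatment.
fold_shifted_ref = delta_delta_ct_fold_change(22.0, 19.0, 24.0, 18.0)

print(fold_stable_ref, fold_shifted_ref)
```

With a stable reference the toy example gives a 4-fold change. A one-cycle shift in the reference gene alone doubles the reported value to 8-fold, which is the kind of bias from unstable RGs that the abstract describes.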
FER_Data Smile DataSet
kaggle.com
zip
Updated Nov 19, 2025
Cite
Muhammad Faheem Iqbal (2025). FER_Data Smile DataSet [Dataset]. https://www.kaggle.com/datasets/faheem113141/fer-data-smile-dataset
Explore at:
Available download formats: zip (303917050 bytes)
Dataset updated
Nov 19, 2025
Authors
Muhammad Faheem Iqbal
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
📌 FER_Data Smile Dataset — Pixel-Based Facial Expression Data (CSV Format)

This dataset contains facial expression data (specifically smiling vs. non-smiling) represented in pixel format, stored inside CSV files. It is designed for training and evaluating machine learning and deep learning models for facial expression recognition (FER).

📁 Dataset Structure

The dataset includes two files:

train.csv: Contains labeled pixel-based image data for training.

test.csv: Contains unlabeled or labeled pixel-based image data for testing/evaluation.

📄 File Format

Each CSV file stores image data in the following structure:

pixels → A string or sequence of pixel values (grayscale), typically flattened into a single row per image.

label (in training file only) → Indicates whether the image represents Smile / Non-Smile (or other classes if applicable).

🖼 Image Details

The dataset consists of pixel-intensity values for each image.

Images are stored as flattened grayscale arrays (e.g., 48×48 = 2304 pixels).

Can be reshaped into image matrices for visualization or model training.

🎯 Use Cases

Facial Expression Recognition (FER)

Smile Detection

Emotion Classification

CNN/RNN/GNN computer vision pipelines

Pixel-based model experimentation

💡 Recommended Preprocessing

Convert pixel strings into NumPy arrays

Normalize values (e.g., divide by 255)

Reshape into required format (e.g., 48×48 for CNN)

Apply augmentations for improved model performance
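
The first three preprocessing steps above can be sketched as a small helper. It assumes the pixel string is space-separated and describes a 48×48 image, which the description only gives as an example. The function name and the synthetic input are hypothetical.

```python
import numpy as np

def pixels_to_image(pixel_string: str, side: int = 48) -> np.ndarray:
    """Parse a space-separated pixel string into a normalized side x side array."""
    values = np.array([int(v) for v in pixel_string.split()], dtype=np.float32)
    if values.size != side * side:
        raise ValueError(f"expected {side * side} pixels, got {values.size}")
    # Scale 0-255 intensities to 0-1, then restore the 2D layout.
    return (values / 255.0).reshape(side, side)

# Synthetic example row: 2304 grayscale values cycling through 0-255.
example = " ".join(str(i % 256) for i in range(48 * 48))
image = pixels_to_image(example)
print(image.shape, image.dtype)
```

If the actual files use a different delimiter or image size, the `split()` call and the `side` argument would need to change accordingly.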
Data from: A radiometric normalization dataset of Shandong Province based on...
scidb.cn
Updated Feb 20, 2020
Cite
黄莉婷; 焦伟利; 龙腾飞 (2020). A radiometric normalization dataset of Shandong Province based on Gaofen-1 WFV image (2018) [Dataset]. http://doi.org/10.11922/sciencedb.947
Explore at:
Unique identifier
https://doi.org/10.11922/sciencedb.947
Dataset updated
Feb 20, 2020
Dataset provided by
Science Data Bank
Authors
黄莉婷; 焦伟利; 龙腾飞
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Shandong
Description
Surface reflectance is a critical physical variable that affects the energy budget in land-atmosphere interactions, feature recognition and classification, and climate change research. This dataset uses the relative radiometric normalization method, and takes the Landsat-8 Operational Land Imager (OLI) surface reflectance products as the reference image to normalize the GF-1 satellite WFV sensor cloud-free images of Shandong Province in 2018. Relative radiometric normalization processing mainly includes atmospheric correction, image resampling, image registration, masking, extraction of the no-change pixels, and calculation of the normalization coefficients. After relative radiometric normalization, for the no-change pixels of each GF-1 WFV image and its reference image, R2 is above 0.7295 and RMSE is below 0.0172. The surface reflectance accuracy of the GF-1 WFV images is improved, so they can be used together with Landsat data to provide data support for remote sensing quantitative inversion. This dataset is in GeoTIFF format, and the spatial resolution of the images is 16 m.
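
The last step in that processing chain, calculating normalization coefficients from no-change pixels and reporting R2 and RMSE, is often done with a per-band linear regression. The sketch below assumes that approach, since the description does not give the exact model. The function name and the synthetic pixel values are hypothetical.

```python
import numpy as np

def fit_normalization(subject: np.ndarray, reference: np.ndarray):
    """Least-squares gain/offset mapping subject reflectance onto the reference."""
    gain, offset = np.polyfit(subject, reference, deg=1)
    predicted = gain * subject + offset
    residual = reference - predicted
    rmse = float(np.sqrt(np.mean(residual ** 2)))
    ss_res = float(np.sum(residual ** 2))
    ss_tot = float(np.sum((reference - reference.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    return gain, offset, r2, rmse

# Synthetic no-change pixels: the reference is a linear function of the subject plus noise.
rng = np.random.default_rng(0)
subject = rng.uniform(0.05, 0.4, size=500)
reference = 0.9 * subject + 0.01 + rng.normal(0.0, 0.005, size=500)

gain, offset, r2, rmse = fit_normalization(subject, reference)
print(f"gain={gain:.3f} offset={offset:.3f} R2={r2:.3f} RMSE={rmse:.4f}")
```

The fitted gain and offset would then be applied to every pixel of the subject image, while R2 and RMSE on the no-change pixels indicate how well the two images agree afterwards.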
Residential Existing Homes (One to Four Units) Energy Efficiency Meter...
data.ny.gov
datasets.ai
csv, xlsx, xml
Updated Feb 12, 2019
Cite
The New York State Energy Research and Development Authority, New York Residential Existing Homes Program (2019). Residential Existing Homes (One to Four Units) Energy Efficiency Meter Evaluated Project Data: 2007 – 2012 [Dataset]. https://data.ny.gov/Energy-Environment/Residential-Existing-Homes-One-to-Four-Units-Energ/5vqm-4rpf
Explore at:
Available download formats: xlsx, xml, csv
Dataset updated
Feb 12, 2019
Dataset provided by
New York State Energy Research and Development Authority (https://www.nyserda.ny.gov/)
Authors
The New York State Energy Research and Development Authority, New York Residential Existing Homes Program
Description
IMPORTANT! PLEASE READ DISCLAIMER BEFORE USING DATA. This dataset backcasts estimated modeled savings for a subset of 2007-2012 completed projects in the Home Performance with ENERGY STAR® Program against normalized savings calculated by an open source energy efficiency meter available at https://www.openee.io/. Open source code uses utility-grade metered consumption to weather-normalize the pre- and post-consumption data using standard methods with no discretionary independent variables. The open source energy efficiency meter allows private companies, utilities, and regulators to calculate energy savings from energy efficiency retrofits with increased confidence and replicability of results. This dataset is intended to lay a foundation for future innovation and deployment of the open source energy efficiency meter across the residential energy sector, and to help inform stakeholders interested in pay for performance programs, where providers are paid for realizing measurable weather-normalized results. To download the open source code, please visit the website at https://github.com/openeemeter/eemeter/releases

DISCLAIMER: Normalized Savings using open source OEE meter. Several data elements, including Evaluated Annual Electric Savings (kWh), Evaluated Annual Gas Savings (MMBtu), Pre-retrofit Baseline Electric (kWh), Pre-retrofit Baseline Gas (MMBtu), Post-retrofit Usage Electric (kWh), and Post-retrofit Usage Gas (MMBtu), are direct outputs from the open source OEE meter.

Home Performance with ENERGY STAR® Estimated Savings. Several data elements, including, Estimated Annual kWh Savings, Estimated Annual MMBtu Savings, and Estimated First Year Energy Savings represent contractor-reported savings derived from energy modeling software calculations and not actual realized energy savings. The accuracy of the Estimated Annual kWh Savings and Estimated Annual MMBtu Savings for projects has been evaluated by an independent third party. The results of the Home Performance with ENERGY STAR impact analysis indicate that, on average, actual savings amount to 35 percent of the Estimated Annual kWh Savings and 65 percent of the Estimated Annual MMBtu Savings. For more information, please refer to the Evaluation Report published on NYSERDA’s website at: http://www.nyserda.ny.gov/-/media/Files/Publications/PPSER/Program-Evaluation/2012ContractorReports/2012-HPwES-Impact-Report-with-Appendices.pdf.
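
To make the two percentages above concrete, the sketch below applies them to a hypothetical project. The constant and function names and the project figures are invented. The rates are program-wide averages from the disclaimer, so the result is only a rough expectation and not a prediction for any single home.

```python
# Average realization rates reported in the disclaimer above.
KWH_REALIZATION_RATE = 0.35    # actual kWh savings as a share of the estimate
MMBTU_REALIZATION_RATE = 0.65  # actual MMBtu savings as a share of the estimate

def expected_actual_savings(estimated_kwh, estimated_mmbtu):
    """Scale contractor-reported estimates by the program-wide average rates."""
    return (estimated_kwh * KWH_REALIZATION_RATE,
            estimated_mmbtu * MMBTU_REALIZATION_RATE)

# Hypothetical project: 1,000 kWh and 20 MMBtu of estimated annual savings.
actual_kwh, actual_mmbtu = expected_actual_savings(1000.0, 20.0)
print(f"{actual_kwh:.0f} kWh, {actual_mmbtu:.0f} MMBtu")
```

For this example the expected realized savings are about 350 kWh and 13 MMBtu per year.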

This dataset includes the following data points for a subset of projects completed in 2007-2012: Contractor ID, Project County, Project City, Project ZIP, Climate Zone, Weather Station, Weather Station-Normalization, Project Completion Date, Customer Type, Size of Home, Volume of Home, Number of Units, Year Home Built, Total Project Cost, Contractor Incentive, Total Incentives, Amount Financed through Program, Estimated Annual kWh Savings, Estimated Annual MMBtu Savings, Estimated First Year Energy Savings, Evaluated Annual Electric Savings (kWh), Evaluated Annual Gas Savings (MMBtu), Pre-retrofit Baseline Electric (kWh), Pre-retrofit Baseline Gas (MMBtu), Post-retrofit Usage Electric (kWh), Post-retrofit Usage Gas (MMBtu), Central Hudson, Consolidated Edison, LIPA, National Grid, National Fuel Gas, New York State Electric and Gas, Orange and Rockland, Rochester Gas and Electric.

How does your organization use this dataset? What other NYSERDA or energy-related datasets would you like to see on Open NY? Let us know by emailing OpenNY@nyserda.ny.gov.
ARCS White Beam Vanadium Normalization Data for SNS Cycle 2022B (May 15 -...
osti.gov
Updated May 30, 2025
Cite
Spallation Neutron Source (SNS) (2025). ARCS White Beam Vanadium Normalization Data for SNS Cycle 2022B (May 15 - Jun., 14, 2022) [Dataset]. http://doi.org/10.14461/oncat.data/2568320
Explore at:
Unique identifier
https://doi.org/10.14461/oncat.data/2568320
Dataset updated
May 30, 2025
Dataset provided by
Department of Energy Basic Energy Sciences Program (http://science.energy.gov/user-facilities/basic-energy-sciences/)
Office of Science (http://www.er.doe.gov/)
Spallation Neutron Source (SNS)
Description
A data set used to normalize the detector response of the ARCS instrument. See ARCS_226797.md in the data set for more details.
Hydroponic Thai Basil Growth
data.mendeley.com
kaggle.com
Updated Feb 27, 2025
Cite
Vinaya Gohokar (2025). Hydroponic Thai Basil Growth [Dataset]. http://doi.org/10.17632/vx4jy7wyvd.1
Explore at:
Unique identifier
https://doi.org/10.17632/vx4jy7wyvd.1
Dataset updated
Feb 27, 2025
Authors
Vinaya Gohokar
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset consists of key environmental and physiological parameters influencing plant growth. This includes temperature, humidity, solar radiation, pH, total dissolved solids (TDS), leaves green area, and plant height. These features were collected across 24 recorded instances from January 2, 2025, to February 3, 2025. The data was preprocessed to remove inconsistencies and normalize values to ensure model stability and robustness. The dataset includes two sets of Thai Basil images for 24 instances. It also includes csv data for features.
Table_2_Comparison of Normalization Methods for Analysis of TempO-Seq...
frontiersin.figshare.com
datasetcatalog.nlm.nih.gov
xlsx
Updated Jun 2, 2023
Cite
Pierre R. Bushel; Stephen S. Ferguson; Sreenivasa C. Ramaiahgari; Richard S. Paules; Scott S. Auerbach (2023). Table_2_Comparison of Normalization Methods for Analysis of TempO-Seq Targeted RNA Sequencing Data.xlsx [Dataset]. http://doi.org/10.3389/fgene.2020.00594.s002
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.3389/fgene.2020.00594.s002
Dataset updated
Jun 2, 2023
Dataset provided by
Frontiers
Authors
Pierre R. Bushel; Stephen S. Ferguson; Sreenivasa C. Ramaiahgari; Richard S. Paules; Scott S. Auerbach
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of bulk RNA sequencing (RNA-Seq) data is a valuable tool to understand transcription at the genome scale. Targeted sequencing of RNA has emerged as a practical means of assessing the majority of the transcriptomic space with less reliance on large resources for consumables and bioinformatics. TempO-Seq is a templated, multiplexed RNA-Seq platform that interrogates a panel of sentinel genes representative of genome-wide transcription. Nuances of the technology require proper preprocessing of the data. Various methods have been proposed and compared for normalizing bulk RNA-Seq data, but there has been little to no investigation of how the methods perform on TempO-Seq data. We simulated count data into two groups (treated vs. untreated) at seven-fold change (FC) levels (including no change) using control samples from human HepaRG cells run on TempO-Seq and normalized the data using seven normalization methods. Upper Quartile (UQ) performed the best with regard to maintaining FC levels as detected by a limma contrast between treated vs. untreated groups. For all FC levels, specificity of the UQ normalization was greater than 0.84 and sensitivity greater than 0.90 except for the no change and +1.5 levels. Furthermore, K-means clustering of the simulated genes normalized by UQ agreed the most with the FC assignments [adjusted Rand index (ARI) = 0.67]. Despite having an assumption of the majority of genes being unchanged, the DESeq2 scaling factors normalization method performed reasonably well as did simple normalization procedures counts per million (CPM) and total counts (TCs). These results suggest that for two class comparisons of TempO-Seq data, UQ, CPM, TC, or DESeq2 normalization should provide reasonably reliable results at absolute FC levels ≥2.0. These findings will help guide researchers to normalize TempO-Seq gene expression data for more reliable results.
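
Two of the normalizations named in this abstract, counts per million (CPM) and upper quartile (UQ), can be sketched as follows. The UQ function is a simplified variant that divides each sample by the 75th percentile of its nonzero counts, so the study's exact implementation may differ. The function names and the count matrix are made up.

```python
import numpy as np

def cpm(counts: np.ndarray) -> np.ndarray:
    """Counts per million: scale each sample (column) by its total count."""
    return counts / counts.sum(axis=0, keepdims=True) * 1e6

def upper_quartile(counts: np.ndarray) -> np.ndarray:
    """Divide each sample by the 75th percentile of its nonzero counts."""
    scaled = np.empty(counts.shape, dtype=float)
    for j in range(counts.shape[1]):
        column = counts[:, j]
        uq = np.percentile(column[column > 0], 75)
        scaled[:, j] = column / uq
    return scaled

# Toy count matrix: 4 genes (rows) x 2 samples (columns).
# Sample 2 has the same composition as sample 1 but twice the sequencing depth.
counts = np.array([[10, 20],
                   [0, 0],
                   [30, 60],
                   [60, 120]])
cpm_values = cpm(counts)
uq_values = upper_quartile(counts)
print(cpm_values)
print(uq_values)
```

In this toy case both methods remove the depth difference, so the two columns become identical after normalization. The methods diverge on real data when a few highly expressed genes dominate the total count.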
Ames Housing Dataset Engineered
kaggle.com
zip
Updated Sep 30, 2020
Cite
anish pai (2020). Ames Housing Dataset Engineered [Dataset]. https://www.kaggle.com/anishpai/ames-housing-dataset-missing
Explore at:
Available download formats: zip (196917 bytes)
Dataset updated
Sep 30, 2020
Authors
anish pai
Area covered
Ames
Description
Iowa Housing Data

The original Ames data used for the competition House Prices: Advanced Regression Techniques (predicting sale price) has been edited and engineered so that a beginner can apply a model and focus on the features without worrying too much about missing data.

Contents

The train data has shape 1460x80 and the test data has shape 1458x79; 'SalePrice' is the feature to be predicted for the test set. The train data contains both categorical and numerical features.

Detailed information about the data can be found in the Data Description file included among the data files.

Transformations

a. Handling Missing Values: Some variables, such as 'PoolQC', 'MiscFeature' and 'Alley', have over 90% missing values. However, the data description implies that a missing value indicates the absence of that feature in a particular house. On further inspection of the dataset and the data description, this holds for most of the missing data.

Similarly, missing values in features such as 'GarageType', 'GarageYrBuilt', 'BsmtExposure', etc. indicate that the house has no such feature (e.g. no garage), and the corresponding attributes such as 'GarageCars', 'GarageArea', 'BsmtCond', etc. are set to 0.

A house's front lawn area is likely to be similar to that of other houses in the same neighborhood, so its missing values can be filled with the median of the values in that neighborhood.

Missing values in features such as 'SaleType', 'KitchenCond', etc have been imputed with the mode of the feature.

b. Dropping Variables: 'Utilities' attribute should be dropped from the data frame because almost all the houses have all public Utilities (E,G,W,& S) available.

c. Further exploration: The feature 'Electrical' has one missing value. The first intuition would be to drop the row. On further inspection, however, the missing value belongs to a house built in 2006, and all houses built after the 1970s have Standard Circuit Breakers & Romex ('SBrkr') installed. So, the value can be inferred from this observation.

d. Transformation: Some variables that are really categorical were represented numerically, such as 'MSSubClass', 'OverallCond' and 'YearSold'/'MonthSold', as they are discrete in nature. These have been transformed to categorical variables.

e. Normalizing the 'SalePrice' Variable: During EDA it was discovered that the sale price of homes is right-skewed. Normalizing it decreases the skewness, and the (linear) models fit better. The feature is left for the user to normalize.

Finally, the train and test sets were split and the sale price was appended to the train set.
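The two imputation rules described under (a), group median and mode, can be sketched in plain Python as follows. This is only an illustration, not the code used to build the dataset: the helper names are hypothetical, the rows are toy values, and the column names 'LotFrontage' and 'Neighborhood' are my assumption for the "front lawn area" and neighborhood features mentioned above.

```python
from collections import Counter, defaultdict
from statistics import median


def impute_group_median(rows, value_key, group_key):
    """Fill missing (None) values with the median of the same group."""
    observed = defaultdict(list)
    for row in rows:
        if row[value_key] is not None:
            observed[row[group_key]].append(row[value_key])
    medians = {group: median(values) for group, values in observed.items()}
    filled = []
    for row in rows:
        new_row = dict(row)
        if new_row[value_key] is None:
            # Stays None if the whole group has no observed value.
            new_row[value_key] = medians.get(new_row[group_key])
        filled.append(new_row)
    return filled


def impute_mode(rows, key):
    """Fill missing (None) values with the most frequent observed value."""
    observed = [row[key] for row in rows if row[key] is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [{**row, key: mode if row[key] is None else row[key]} for row in rows]


# Toy rows (hypothetical values, assumed column names).
houses = [
    {"Neighborhood": "A", "LotFrontage": 60, "SaleType": "WD"},
    {"Neighborhood": "A", "LotFrontage": 80, "SaleType": "WD"},
    {"Neighborhood": "A", "LotFrontage": None, "SaleType": None},
    {"Neighborhood": "B", "LotFrontage": 50, "SaleType": "New"},
]
houses = impute_group_median(houses, "LotFrontage", "Neighborhood")
houses = impute_mode(houses, "SaleType")
print(houses[2])
```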

Acknowledgements

The Ames Housing dataset was compiled by Dean De Cock for use in data science education. It's an incredible alternative for data scientists looking for a modernized and expanded version of the often cited Boston Housing dataset.

Inspiration

The data, after the transformations I have applied, can easily be fitted to a model after label encoding and normalizing features to reduce skewness. The main variable to be predicted is 'SalePrice' for the TestData csv file.
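As a rough illustration of normalizing a right-skewed feature, the sketch below compares the sample skewness of a price column before and after a log transform. The description does not say which transform is intended, so the use of log1p is an assumption, and the prices are made-up toy values rather than rows from the dataset.

```python
import math


def skewness(values):
    """Moment-based sample skewness: third central moment over the
    1.5 power of the second central moment."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed sale prices (a few expensive houses in the tail).
prices = [100_000, 120_000, 130_000, 150_000, 160_000,
          180_000, 250_000, 400_000, 750_000]
log_prices = [math.log1p(p) for p in prices]

print("skewness before:", skewness(prices))
print("skewness after: ", skewness(log_prices))
```

If a model is trained on the transformed target, predictions need to be mapped back with math.expm1 before they are compared with actual prices.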
Naturalistic Neuroimaging Database
openneuro.org
Updated Apr 20, 2021
+ more versions
Cite
Sarah Aliko; Jiawen Huang; Florin Gheorghiu; Stefanie Meliss; Jeremy I Skipper (2021). Naturalistic Neuroimaging Database [Dataset]. http://doi.org/10.18112/openneuro.ds002837.v1.1.3
Explore at:
Unique identifier
https://doi.org/10.18112/openneuro.ds002837.v1.1.3
Dataset updated
Apr 20, 2021
Dataset provided by
OpenNeuro https://openneuro.org/
Authors
Sarah Aliko; Jiawen Huang; Florin Gheorghiu; Stefanie Meliss; Jeremy I Skipper
License
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Overview

The Naturalistic Neuroimaging Database (NNDb v2.0) contains datasets from 86 human participants doing the NIH Toolbox and then watching one of 10 full-length movies during functional magnetic resonance imaging (fMRI). The participants were all right-handed, native English speakers, with no history of neurological/psychiatric illnesses, with no hearing impairments, unimpaired or corrected vision and taking no medication. Each movie was stopped at 40-50 minute intervals or when participants asked for a break, resulting in 2-6 runs of BOLD-fMRI. A 10-minute high-resolution defaced T1-weighted anatomical MRI scan (MPRAGE) is also provided.

The NNDb V2.0 is now on Neuroscout, a platform for fast and flexible re-analysis of (naturalistic) fMRI studies. See: https://neuroscout.org/

v2.0 Changes

Overview

We have replaced our own preprocessing pipeline with that implemented in AFNI’s afni_proc.py, thus changing only the derivative files. This introduces a fix for an issue with our normalization (i.e., scaling) step and modernizes and standardizes the preprocessing applied to the NNDb derivative files. We have done a bit of testing and have found that results in both pipelines are quite similar in terms of the resulting spatial patterns of activity but with the benefit that the afni_proc.py results are 'cleaner' and statistically more robust.

Normalization

Emily Finn and Clare Grall at Dartmouth, and Rick Reynolds and Paul Taylor at AFNI, discovered and showed us that the normalization procedure we used for the derivative files was less than ideal for timeseries runs of varying lengths. Specifically, the 3dDetrend flag -normalize makes 'the sum-of-squares equal to 1'. We had not thought through that an implication of this is that the resulting normalized timeseries amplitudes will be affected by run length, increasing as run length decreases (and maybe this should go in 3dDetrend’s help text). To demonstrate this, I wrote a version of 3dDetrend’s -normalize for R so you can see for yourselves by running the following code:

# Generate a resting state (rs) timeseries (ts)
# Install / load package to make fake fMRI ts
# install.packages("neuRosim")
library(neuRosim)
# Generate a ts
ts.rs <- simTSrestingstate(nscan=2000, TR=1, SNR=1)

# 3dDetrend -normalize
# R command version for 3dDetrend -normalize -polort 0 which normalizes by making "the sum-of-squares equal to 1"
# Do for the full timeseries
ts.normalised.long <- (ts.rs-mean(ts.rs))/sqrt(sum((ts.rs-mean(ts.rs))^2));
# Do this again for a shorter version of the same timeseries
ts.shorter.length <- length(ts.normalised.long)/4
ts.normalised.short <- (ts.rs[1:ts.shorter.length]- mean(ts.rs[1:ts.shorter.length]))/sqrt(sum((ts.rs[1:ts.shorter.length]- mean(ts.rs[1:ts.shorter.length]))^2));
# By looking at the summaries, it can be seen that the median values become larger
summary(ts.normalised.long)
summary(ts.normalised.short)

# Plot results for the long and short ts
# Truncate the longer ts for plotting only
ts.normalised.long.made.shorter <- ts.normalised.long[1:ts.shorter.length]
# Give the plot a title
title <- "3dDetrend -normalize for long (blue) and short (red) timeseries";
plot(x=0, y=0, main=title, xlab="", ylab="", xaxs='i', xlim=c(1,length(ts.normalised.short)), ylim=c(min(ts.normalised.short),max(ts.normalised.short)));
# Add zero line
lines(x=c(-1,ts.shorter.length), y=rep(0,2), col='grey');
# 3dDetrend -normalize -polort 0 for long timeseries
lines(ts.normalised.long.made.shorter, col='blue');
# 3dDetrend -normalize -polort 0 for short timeseries
lines(ts.normalised.short, col='red');
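The same effect can be checked numerically without plotting. The following is a minimal Python sketch, not part of the NNDb code: it swaps the neuRosim resting-state simulation for plain Gaussian noise (an assumption made only to keep the example self-contained) and applies the same demean-and-scale step to a 2000-point series and to its first quarter.

```python
import math
import random


def normalize_sum_of_squares(ts):
    """Demean a series and scale it so that its sum of squares equals 1."""
    mean = sum(ts) / len(ts)
    centered = [x - mean for x in ts]
    norm = math.sqrt(sum(x * x for x in centered))
    return [x / norm for x in centered]


def rms(ts):
    """Root-mean-square amplitude of a series."""
    return math.sqrt(sum(x * x for x in ts) / len(ts))


random.seed(0)
long_run = [random.gauss(0.0, 1.0) for _ in range(2000)]  # stand-in for a long run
short_run = long_run[:500]                                # a quarter of the length

rms_long = rms(normalize_sum_of_squares(long_run))
rms_short = rms(normalize_sum_of_squares(short_run))
print(rms_long, rms_short, rms_short / rms_long)
```

Because the sum of squares is pinned to 1, the RMS amplitude of a normalized run of N points is 1/sqrt(N), so a run a quarter as long ends up with twice the amplitude regardless of the underlying signal.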

Standardization/modernization

The above individuals also encouraged us to implement the afni_proc.py script over our own pipeline. It introduces at least three additional improvements: First, we now use Bob’s @SSwarper to align our anatomical files with an MNI template (now MNI152_2009_template_SSW.nii.gz) and this, in turn, integrates nicely into the afni_proc.py pipeline. This seems to result in a generally better or more consistent alignment, though this is only a qualitative observation. Second, all the transformations / interpolations and detrending are now done in fewer steps compared to our pipeline. This is preferable because, e.g., there is less chance of inadvertently reintroducing noise back into the timeseries (see Lindquist, Geuter, Wager, & Caffo 2019). Finally, many groups are advocating using tools like fMRIPrep or afni_proc.py to increase standardization of analysis practices in our neuroimaging community. This presumably results in less error, less heterogeneity and more interpretability of results across studies. Along these lines, the quality control (‘QC’) html pages generated by afni_proc.py are a real help in assessing data quality and almost a joy to use.

New afni_proc.py command line

The following is the afni_proc.py command line that we used to generate blurred and censored timeseries files. The afni_proc.py tool comes with extensive help and examples. As such, you can quickly understand our preprocessing decisions by scrutinising the below. Specifically, the following command is most similar to Example 11 for ‘Resting state analysis’ in the help file (see https://afni.nimh.nih.gov/pub/dist/doc/program_help/afni_proc.py.html):

afni_proc.py \
  -subj_id "$sub_id_name_1" \
  -blocks despike tshift align tlrc volreg mask blur scale regress \
  -radial_correlate_blocks tcat volreg \
  -copy_anat anatomical_warped/anatSS.1.nii.gz \
  -anat_has_skull no \
  -anat_follower anat_w_skull anat anatomical_warped/anatU.1.nii.gz \
  -anat_follower_ROI aaseg anat freesurfer/SUMA/aparc.a2009s+aseg.nii.gz \
  -anat_follower_ROI aeseg epi freesurfer/SUMA/aparc.a2009s+aseg.nii.gz \
  -anat_follower_ROI fsvent epi freesurfer/SUMA/fs_ap_latvent.nii.gz \
  -anat_follower_ROI fswm epi freesurfer/SUMA/fs_ap_wm.nii.gz \
  -anat_follower_ROI fsgm epi freesurfer/SUMA/fs_ap_gm.nii.gz \
  -anat_follower_erode fsvent fswm \
  -dsets media_?.nii.gz \
  -tcat_remove_first_trs 8 \
  -tshift_opts_ts -tpattern alt+z2 \
  -align_opts_aea -cost lpc+ZZ -giant_move -check_flip \
  -tlrc_base "$basedset" \
  -tlrc_NL_warp \
  -tlrc_NL_warped_dsets \
      anatomical_warped/anatQQ.1.nii.gz \
      anatomical_warped/anatQQ.1.aff12.1D \
      anatomical_warped/anatQQ.1_WARP.nii.gz \
  -volreg_align_to MIN_OUTLIER \
  -volreg_post_vr_allin yes \
  -volreg_pvra_base_index MIN_OUTLIER \
  -volreg_align_e2a \
  -volreg_tlrc_warp \
  -mask_opts_automask -clfrac 0.10 \
  -mask_epi_anat yes \
  -blur_to_fwhm -blur_size $blur \
  -regress_motion_per_run \
  -regress_ROI_PC fsvent 3 \
  -regress_ROI_PC_per_run fsvent \
  -regress_make_corr_vols aeseg fsvent \
  -regress_anaticor_fast \
  -regress_anaticor_label fswm \
  -regress_censor_motion 0.3 \
  -regress_censor_outliers 0.1 \
  -regress_apply_mot_types demean deriv \
  -regress_est_blur_epits \
  -regress_est_blur_errts \
  -regress_run_clustsim no \
  -regress_polort 2 \
  -regress_bandpass 0.01 1 \
  -html_review_style pythonic

We used similar command lines to generate ‘blurred and not censored’ and the ‘not blurred and not censored’ timeseries files (described more fully below). We will provide the code used to make all derivative files available on our github site (https://github.com/lab-lab/nndb).

We made one choice above that is different enough from our original pipeline that it is worth mentioning here. Specifically, we have quite long runs, with the average being ~40 minutes but this number can be variable (thus leading to the above issue with 3dDetrend’s -normalize). A discussion on the AFNI message board with one of our team (starting here, https://afni.nimh.nih.gov/afni/community/board/read.php?1,165243,165256#msg-165256), led to the suggestion that '-regress_polort 2' with '-regress_bandpass 0.01 1' be used for long runs. We had previously used only a variable polort with the suggested 1 + int(D/150) approach. Our new polort 2 + bandpass approach has the added benefit of working well with afni_proc.py.
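To see why a variable polort grows large for long runs, the 1 + int(D/150) rule can be worked through in a few lines of Python. This is only an arithmetic sketch: the function name is hypothetical, and it assumes D is the run duration in seconds, as in AFNI's description of its default polynomial order.

```python
def variable_polort(duration_seconds):
    """Polynomial detrending order from the 1 + int(D/150) rule,
    with D taken to be the run duration in seconds (assumption)."""
    return 1 + int(duration_seconds / 150)


# Run lengths in minutes chosen for illustration only.
for minutes in (10, 20, 40, 50):
    print(minutes, "min ->", "polort", variable_polort(minutes * 60))
```

Under this assumption a ~40 minute run would call for a polynomial of order 17, whereas the approach described above fixes the order at 2 and leaves the removal of slow drifts to the bandpass filter.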

Which timeseries file you use is up to you but I have been encouraged by Rick and Paul to include a sort of PSA about this. In Paul’s own words:

* Blurred data should not be used for ROI-based analyses (and potentially not for ICA? I am not certain about standard practice).
* Unblurred data for ISC might be pretty noisy for voxelwise analyses, since blurring should effectively boost the SNR of active regions (and even good alignment won't be perfect everywhere).
* For uncensored data, one should be concerned about motion effects being left in the data (e.g., spikes in the data).
* For censored data:
  * Performing ISC requires the users to unionize the censoring patterns during the correlation calculation.
  * If wanting to calculate power spectra or spectral parameters like ALFF/fALFF/RSFA etc. (which some people might do for naturalistic tasks still), then standard FT-based methods can't be used because sampling is no longer uniform. Instead, people could use something like 3dLombScargle+3dAmpToRSFC, which calculates power spectra (and RSFC params) based on a generalization of the FT that can handle non-uniform sampling, as long as the censoring pattern is mostly random and, say, only up to about 10-15% of the data.

In sum, think very carefully about which files you use. If you find you need a file we have not provided, we can happily generate different versions of the timeseries upon request and can generally do so in a week or less.
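The point about combining censoring patterns for ISC can be illustrated with a small Python sketch. It is not AFNI code and not the authors' method: the function name and toy values are hypothetical, and it assumes each participant's censoring is available as a list of booleans in which True means the time point was kept. A time point enters the correlation only if it was kept for both participants, which is the same as censoring the union of the two censor lists.

```python
import math


def censored_correlation(ts_a, keep_a, ts_b, keep_b):
    """Pearson correlation over time points kept in BOTH series."""
    pairs = [(a, b) for a, b, ka, kb in zip(ts_a, ts_b, keep_a, keep_b)
             if ka and kb]
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    return cov / math.sqrt(var_a * var_b)


# Toy series (hypothetical): each participant has one motion spike,
# censored at a different time point.
subj_a = [1.0, 2.0, 3.0, 4.0, 100.0, 6.0]
keep_a = [True, True, True, True, False, True]
subj_b = [2.0, 4.0, 6.0, 8.0, 10.0, -50.0]
keep_b = [True, True, True, True, True, False]

r = censored_correlation(subj_a, keep_a, subj_b, keep_b)
print(r)
```

Using only one participant's censor list would leave the other participant's spike in the calculation, which is why the patterns have to be combined before correlating.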

Effect on results

From numerous tests on our own analyses, we have qualitatively found that results using our old vs the new afni_proc.py preprocessing pipeline do not change all that much in terms of general spatial patterns. There is, however, an
JNCC Sentinel-2 indices Analysis Ready Data (ARD) Normalised Burn Ratio...
catalogue.ceda.ac.uk
data-search.nerc.ac.uk
Updated Mar 23, 2024
Cite
Joint Nature Conservation Committee (JNCC) (2024). JNCC Sentinel-2 indices Analysis Ready Data (ARD) Normalised Burn Ratio (NBR) v1 [Dataset]. https://catalogue.ceda.ac.uk/uuid/6df6b803c2784b8ab9e03834bf9a4337
Explore at:
Dataset updated
Mar 23, 2024
Dataset provided by
Centre for Environmental Data Analysis http://www.ceda.ac.uk/
Authors
Joint Nature Conservation Committee (JNCC)
License
Open Government Licence 3.0 http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
Area covered

Description
Sentinel Hub NBR description: To detect burned areas, the NBR-RAW index is the most appropriate choice. Using bands 8 and 12 it highlights burnt areas in large fire zones greater than 500 acres. To observe burn severity, you may subtract the post-fire NBR image from the pre-fire NBR image. Darker pixels indicate burned areas.

NBR = (NIR - SWIR) / (NIR + SWIR)

Sentinel-2 NBR = (B08 - B12) / (B08 + B12)
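The index formula and the pre-fire minus post-fire difference described above can be sketched for a single pixel in Python. This is an illustration only: the function names are hypothetical, the reflectance values are made up, and returning NaN when both bands are zero is my own choice rather than something specified by the dataset.

```python
import math


def nbr(b08, b12):
    """Normalised Burn Ratio from NIR (B08) and SWIR (B12) reflectance."""
    denominator = b08 + b12
    if denominator == 0:
        return float("nan")  # assumed handling of empty pixels
    return (b08 - b12) / denominator


def burn_difference(pre_fire_nbr, post_fire_nbr):
    """Pre-fire NBR minus post-fire NBR, as described in the text."""
    return pre_fire_nbr - post_fire_nbr


# Hypothetical reflectances for one pixel before and after a fire.
pre = nbr(0.4, 0.1)
post = nbr(0.2, 0.3)
print(pre, post, burn_difference(pre, post))
```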

These data have been created by the Joint Nature Conservation Committee (JNCC) as part of a Defra Natural Capital & Ecosystem Assessment (NCEA) project to produce a regional, and ultimately national, system for detecting a change in habitat condition at a land parcel level. The first stage of the project is focused on Yorkshire, UK, and therefore the dataset includes granules and scenes covering Yorkshire and surrounding areas only. The dataset contains the following indices derived from Defra and JNCC Sentinel-2 Analysis Ready Data.

NDVI, NDMI, NDWI, NBR, and EVI files are generated for the following Sentinel-2 granules:
• T30UWE
• T30UXF
• T30UWF
• T30UXE
• T31UCV
• T30UYE
• T31UCA

As the project continues, JNCC will expand the geographical coverage of this dataset and will provide continuous updates as ARD becomes available.