68 datasets found
  1. Data from: LVMED: Dataset of Latvian text normalisation samples for the...

    • repository.clarin.lv
    Updated May 30, 2023
    Cite
    Viesturs Jūlijs Lasmanis; Normunds Grūzītis (2023). LVMED: Dataset of Latvian text normalisation samples for the medical domain [Dataset]. https://repository.clarin.lv/repository/xmlui/handle/20.500.12574/85
    Dataset updated
    May 30, 2023
    Authors
    Viesturs Jūlijs Lasmanis; Normunds Grūzītis
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The CSV dataset contains sentence pairs for a text-to-text transformation task: given a sentence that contains 0..n abbreviations, rewrite (normalize) the sentence in full words (word forms).

    Training dataset: 64,665 sentence pairs. Validation dataset: 7,185 sentence pairs. Testing dataset: 7,984 sentence pairs.

    All sentences are extracted from a public web corpus (https://korpuss.lv/id/Tīmeklis2020) and contain at least one medical term.
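
    As a rough illustration of how one split might be read in R, here is a minimal sketch; the file and column names below are assumptions for illustration, not taken from the dataset record.

    # Minimal sketch (assumed file/column names): load one split of the
    # sentence-pair CSV and inspect the abbreviated vs. normalized text columns.
    train <- read.csv("lvmed_train.csv", stringsAsFactors = FALSE)
    str(train)       # expect one column with abbreviations, one with full word forms
    head(train, 3)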

  2. Data from: proteiNorm – A user-friendly tool for normalization and analysis...

    • data.niaid.nih.gov
    xml
    Updated Sep 9, 2021
    + more versions
    Cite
    Stephanie Byrum; Stephanie Diane Byrum (2021). proteiNorm – A user-friendly tool for normalization and analysis of TMT and label-free protein quantification [Dataset]. https://data.niaid.nih.gov/resources?id=pxd018152
    Explore at: xml
    Dataset updated
    Sep 9, 2021
    Dataset provided by
    UAMS
    1. Department of Biochemistry and Molecular Biology, University of Arkansas for Medical Sciences, Little Rock, AR 2. Arkansas Children's Research Institute
    Authors
    Stephanie Byrum; Stephanie Diane Byrum
    Variables measured
    Proteomics
    Description

    The technological advances in mass spectrometry allow us to collect more comprehensive data with higher quality and increasing speed. With the rapidly increasing amount of data generated, the need for streamlining analyses becomes more apparent. Proteomic data are often affected by systematic bias from unknown sources, and failing to adequately normalize the data can lead to erroneous conclusions. To allow researchers to easily evaluate and compare different normalization methods via a user-friendly interface, we have developed "proteiNorm". The current implementation of proteiNorm accommodates preliminary filtering at the peptide and sample level, followed by an evaluation of several popular normalization methods and visualization of missing values. The user then selects an adequate normalization method and one of several imputation methods used for the subsequent comparison of different differential abundance/expression methods and estimation of statistical power. The application of proteiNorm and the interpretation of its results are demonstrated on a Tandem Mass Tag mass spectrometry example data set, in which the proteome of three different breast cancer cell lines was profiled with and without hydroxyurea treatment. With proteiNorm, we provide a user-friendly tool to identify an adequate normalization method and to select an appropriate method for differential abundance/expression analysis.

  3. Methods for normalizing microbiome data: an ecological perspective

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Oct 30, 2018
    Cite
    Donald T. McKnight; Roger Huerlimann; Deborah S. Bower; Lin Schwarzkopf; Ross A. Alford; Kyall R. Zenger (2018). Methods for normalizing microbiome data: an ecological perspective [Dataset]. http://doi.org/10.5061/dryad.tn8qs35
    Explore at: zip
    Dataset updated
    Oct 30, 2018
    Dataset provided by
    University of New England
    James Cook University
    Authors
    Donald T. McKnight; Roger Huerlimann; Deborah S. Bower; Lin Schwarzkopf; Ross A. Alford; Kyall R. Zenger
    License

    CC0 1.0 Universal (https://spdx.org/licenses/CC0-1.0.html)

    Description
    1. Microbiome sequencing data often need to be normalized due to differences in read depths, and recommendations for microbiome analyses generally warn against using proportions or rarefying to normalize data, instead advocating alternatives such as upper quartile, CSS, edgeR-TMM, or DESeq-VS. Those recommendations are, however, based on studies that focused on differential abundance testing and variance standardization, rather than community-level comparisons (i.e., beta diversity). Also, standardizing the within-sample variance across samples may suppress differences in species evenness, potentially distorting community-level patterns. Furthermore, the recommended methods use log transformations, which we expect to exaggerate the importance of differences among rare OTUs, while suppressing the importance of differences among common OTUs.
    2. We tested these theoretical predictions via simulations and a real-world data set.
    3. Proportions and rarefying produced more accurate comparisons among communities and were the only methods that fully normalized read depths across samples. Additionally, upper quartile, CSS, edgeR-TMM, and DESeq-VS often masked differences among communities when common OTUs differed, and they produced false positives when rare OTUs differed.
    4. Based on our simulations, normalizing via proportions may be superior to other commonly used methods for comparing ecological communities.
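
    For readers who want to see what the two recommended options look like in practice, here is a minimal sketch on synthetic counts (not the study's data) of total-sum scaling to proportions and of rarefying with the vegan package.

    # Minimal sketch: normalize an OTU count table (samples x OTUs) by
    # (a) converting counts to proportions and (b) rarefying to an even depth.
    library(vegan)
    set.seed(1)
    counts <- matrix(rnbinom(5 * 20, mu = 30, size = 1), nrow = 5,
                     dimnames = list(paste0("sample", 1:5), paste0("OTU", 1:20)))
    props <- counts / rowSums(counts)       # proportions (total-sum scaling)
    depth <- min(rowSums(counts))           # rarefy to the smallest read depth
    rared <- rrarefy(counts, depth)         # one random rarefaction per sample
    rowSums(props); rowSums(rared)          # 1 for proportions; equal depths after rarefying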
  4. Spatial Normalization of Reverse Phase Protein Array Data

    • plos.figshare.com
    tiff
    Updated May 30, 2023
    Cite
    Poorvi Kaushik; Evan J. Molinelli; Martin L. Miller; Weiqing Wang; Anil Korkut; Wenbin Liu; Zhenlin Ju; Yiling Lu; Gordon Mills; Chris Sander (2023). Spatial Normalization of Reverse Phase Protein Array Data [Dataset]. http://doi.org/10.1371/journal.pone.0097213
    Explore at: tiff
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Poorvi Kaushik; Evan J. Molinelli; Martin L. Miller; Weiqing Wang; Anil Korkut; Wenbin Liu; Zhenlin Ju; Yiling Lu; Gordon Mills; Chris Sander
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Reverse phase protein arrays (RPPA) are an efficient, high-throughput, cost-effective method for the quantification of specific proteins in complex biological samples. The quality of RPPA data may be affected by various sources of error. One of these, spatial variation, is caused by uneven exposure of different parts of an RPPA slide to the reagents used in protein detection. We present a method for the determination and correction of systematic spatial variation in RPPA slides using positive control spots printed on each slide. The method uses a simple bi-linear interpolation technique to obtain a surface representing the spatial variation occurring across the dimensions of a slide. This surface is used to calculate correction factors that can normalize the relative protein concentrations of the samples on each slide. The adoption of the method results in increased agreement between technical and biological replicates of various tumor and cell-line derived samples. Further, in data from a study of the melanoma cell-line SKMEL-133, several slides that had previously been rejected because they had a coefficient of variation (CV) greater than 15% are rescued by reduction of the CV below this threshold in each case. The method is implemented in the R statistical programming language. It is compatible with MicroVigene and SuperCurve, packages commonly used in RPPA data analysis. The method is made available, along with suggestions for implementation, at http://bitbucket.org/rppa_preprocess/rppa_preprocess/src.
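
    To make the correction idea concrete, here is a minimal sketch under assumed inputs (control-spot intensities on a coarse grid, and sample spots with row/column coordinates); it is illustrative only and is not the published package, which should be obtained from the repository linked above.

    # Sketch: estimate a spatial-variation surface from positive-control spots by
    # bilinear interpolation, then rescale each sample spot by the local factor.
    bilinear_surface <- function(ctrl, ctrl_rows, ctrl_cols, row, col) {
      # interpolate along columns within each control row, then across rows
      at_rows <- sapply(seq_along(ctrl_rows), function(i)
        approx(ctrl_cols, ctrl[i, ], xout = col, rule = 2)$y)
      approx(ctrl_rows, at_rows, xout = row, rule = 2)$y
    }
    correct_spots <- function(spots, ctrl, ctrl_rows, ctrl_cols) {
      surface <- mapply(function(r, c) bilinear_surface(ctrl, ctrl_rows, ctrl_cols, r, c),
                        spots$row, spots$col)
      # correction factor: mean control signal over the locally interpolated signal
      spots$corrected <- spots$intensity * mean(ctrl) / surface
      spots
    }
    # toy example: a slide with a smooth intensity gradient
    ctrl_rows <- c(1, 20, 40); ctrl_cols <- c(1, 20, 40)
    ctrl <- outer(ctrl_rows, ctrl_cols, function(r, c) 1 + 0.01 * r + 0.005 * c)
    spots <- data.frame(row = c(5, 35), col = c(10, 30), intensity = c(10, 12))
    correct_spots(spots, ctrl, ctrl_rows, ctrl_cols)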

  5. Flight Data Normalization Platform Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Oct 4, 2025
    Cite
    Growth Market Reports (2025). Flight Data Normalization Platform Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/flight-data-normalization-platform-market
    Explore at: pptx, csv, pdf
    Dataset updated
    Oct 4, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Flight Data Normalization Platform Market Outlook



    According to our latest research, the global flight data normalization platform market size reached USD 1.12 billion in 2024, exhibiting robust industry momentum. The market is projected to grow at a CAGR of 10.3% from 2025 to 2033, reaching an estimated value of USD 2.74 billion by 2033. This growth is primarily driven by the increasing adoption of advanced analytics in aviation, the rising need for operational efficiency, and the growing emphasis on regulatory compliance and safety enhancements across the aviation sector.
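
    As a quick arithmetic check of the stated projection (simply the standard CAGR compounding formula applied to the figures above, not additional data):

    # compound the 2024 base over the 9-year forecast window 2025-2033
    base_2024 <- 1.12            # USD billion (from the text)
    cagr      <- 0.103
    base_2024 * (1 + cagr)^9     # ~= 2.71, in line with the reported USD 2.74 billion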




    A key growth factor for the flight data normalization platform market is the rapid digital transformation within the aviation industry. Airlines, airports, and maintenance organizations are increasingly relying on digital platforms to aggregate, process, and normalize vast volumes of flight data generated by modern aircraft systems. The transition from legacy systems to integrated digital solutions is enabling real-time data analysis, predictive maintenance, and enhanced situational awareness. This shift is not only improving operational efficiency but also reducing downtime and maintenance costs, making it an essential strategy for airlines and operators aiming to remain competitive in a highly regulated environment.




    Another significant driver fueling the expansion of the flight data normalization platform market is the stringent regulatory landscape governing aviation safety and compliance. Aviation authorities worldwide, such as the Federal Aviation Administration (FAA) and the European Union Aviation Safety Agency (EASA), are mandating the adoption of advanced flight data monitoring and normalization solutions to ensure adherence to safety protocols and to facilitate incident investigation. These regulatory requirements are compelling aviation stakeholders to invest in platforms that can seamlessly normalize and analyze data from diverse sources, thereby supporting proactive risk management and compliance reporting.




    Additionally, the growing complexity of aircraft systems and the proliferation of connected devices in aviation have led to an exponential increase in the volume and variety of flight data. The need to harmonize disparate data formats and sources into a unified, actionable format is driving demand for sophisticated flight data normalization platforms. These platforms enable stakeholders to extract actionable insights from raw flight data, optimize flight operations, and support advanced analytics use cases such as fuel efficiency optimization, fleet management, and predictive maintenance. As the aviation industry continues to embrace data-driven decision-making, the demand for robust normalization solutions is expected to intensify.




    Regionally, North America continues to dominate the flight data normalization platform market owing to the presence of major airlines, advanced aviation infrastructure, and early adoption of digital technologies. Europe is also witnessing significant growth, driven by stringent safety regulations and increasing investments in aviation digitization. Meanwhile, the Asia Pacific region is emerging as a lucrative market, fueled by rapid growth in air travel, expanding airline fleets, and government initiatives to modernize aviation infrastructure. Latin America and the Middle East & Africa are gradually embracing these platforms, supported by ongoing efforts to enhance aviation safety and operational efficiency.





    Component Analysis



    The component segment of the flight data normalization platform market is broadly categorized into software, hardware, and services. The software segment accounts for the largest share, driven by the increasing adoption of advanced analytics, machine learning, and artificial intelligence technologies for data processing and normalization. Software solutions are essential for aggregating raw flight data from multiple sources, standardizing formats, and providing actionable insights for decision-makers. With the rise of clou

  6. Identification of parameters in normal error component logit-mixture (NECLM)...

    • resodate.org
    Updated Oct 2, 2025
    Cite
    Joan L. Walker (2025). Identification of parameters in normal error component logit-mixture (NECLM) models (replication data) [Dataset]. https://resodate.org/resources/aHR0cHM6Ly9qb3VybmFsZGF0YS56YncuZXUvZGF0YXNldC9pZGVudGlmaWNhdGlvbi1vZi1wYXJhbWV0ZXJzLWluLW5vcm1hbC1lcnJvci1jb21wb25lbnQtbG9naXRtaXh0dXJlLW5lY2xtLW1vZGVscw==
    Dataset updated
    Oct 2, 2025
    Dataset provided by
    ZBW
    ZBW Journal Data Archive
    Journal of Applied Econometrics
    Authors
    Joan L. Walker
    Description

    Although the basic structure of logit-mixture models is well understood, important identification and normalization issues often get overlooked. This paper addresses issues related to the identification of parameters in logit-mixture models containing normally distributed error components associated with alternatives or nests of alternatives (normal error component logit mixture, or NECLM, models). NECLM models include special cases such as unrestricted, fixed covariance matrices; alternative-specific variances; nesting and cross-nesting structures; and some applications to panel data. A general framework is presented for determining which parameters are identified as well as what normalization to impose when specifying NECLM models. It is generally necessary to specify and estimate NECLM models at the levels, or structural, form. This precludes working with utility differences, which would otherwise greatly simplify the identification and normalization process. Our results show that identification is not always intuitive; for example, normalization issues present in logit-mixture models are not present in analogous probit models. To identify and properly normalize the NECLM, we introduce the equality condition, an addition to the standard order and rank conditions. The identifying conditions are worked through for a number of special cases, and our findings are demonstrated with empirical examples using both synthetic and real data.

  7. Equipment Runtime Normalization Analytics Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Oct 7, 2025
    Cite
    Growth Market Reports (2025). Equipment Runtime Normalization Analytics Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/equipment-runtime-normalization-analytics-market
    Explore at: pdf, csv, pptx
    Dataset updated
    Oct 7, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Equipment Runtime Normalization Analytics Market Outlook



    As per our latest research, the global Equipment Runtime Normalization Analytics market size was valued at USD 2.43 billion in 2024, exhibiting a robust year-on-year growth trajectory. The market is expected to reach USD 7.12 billion by 2033, growing at a remarkable CAGR of 12.7% during the forecast period from 2025 to 2033. This significant expansion is primarily driven by the escalating adoption of data-driven maintenance strategies across industries, the surge in digital transformation initiatives, and the increasing necessity for optimizing equipment utilization and operational efficiency.




    One of the primary growth factors fueling the Equipment Runtime Normalization Analytics market is the rapid proliferation of industrial automation and the Industrial Internet of Things (IIoT). As organizations strive to minimize downtime and maximize asset performance, the need to collect, normalize, and analyze runtime data from diverse equipment becomes critical. The integration of advanced analytics platforms allows businesses to gain actionable insights, predict equipment failures, and optimize maintenance schedules. This not only reduces operational costs but also extends the lifecycle of critical assets. The convergence of big data analytics with traditional equipment monitoring is enabling organizations to transition from reactive to predictive maintenance strategies, thereby driving market growth.




    Another significant growth driver is the increasing emphasis on regulatory compliance and sustainability. Industries such as energy, manufacturing, and healthcare are under mounting pressure to comply with stringent operational standards and environmental regulations. Equipment Runtime Normalization Analytics solutions offer robust capabilities to monitor and report on equipment performance, energy consumption, and emissions. By normalizing runtime data, these solutions provide a standardized view of equipment health and efficiency, facilitating better decision-making and compliance reporting. The ability to benchmark performance across multiple sites and equipment types further enhances an organization's ability to meet regulatory requirements while pursuing sustainability goals.




    The evolution of cloud computing and edge analytics technologies also plays a pivotal role in the expansion of the Equipment Runtime Normalization Analytics market. Cloud-based platforms offer scalable and flexible deployment options, enabling organizations to centralize data management and analytics across geographically dispersed operations. Edge analytics complements this by providing real-time data processing capabilities at the source, reducing latency and enabling immediate response to equipment anomalies. This hybrid approach is particularly beneficial in sectors with remote or critical infrastructure, such as oil & gas, utilities, and transportation. The synergy between cloud and edge solutions is expected to further accelerate market adoption, as organizations seek to harness the full potential of real-time analytics for operational excellence.




    From a regional perspective, North America currently leads the Equipment Runtime Normalization Analytics market, owing to its advanced industrial base, high adoption of digital technologies, and strong presence of key market players. However, Asia Pacific is anticipated to witness the fastest growth over the forecast period, driven by rapid industrialization, increasing investments in smart manufacturing, and supportive government initiatives for digital transformation. Europe remains a significant market due to its focus on energy efficiency and sustainability, while Latin America and the Middle East & Africa are gradually catching up as industrial modernization accelerates in these regions.





    Component Analysis



    The Equipment Runtime Normalization Analytics market is segmented by component into software, hardware, and services. The software segment holds the largest share, accounti

  8. Data for A Systemic Framework for Assessing the Risk of Decarbonization to...

    • zenodo.org
    txt
    Updated Sep 18, 2025
    Cite
    Soheil Shayegh; Soheil Shayegh; Giorgia Coppola; Giorgia Coppola (2025). Data for A Systemic Framework for Assessing the Risk of Decarbonization to Regional Manufacturing Activities in the European Union [Dataset]. http://doi.org/10.5281/zenodo.17152310
    Explore at: txt
    Dataset updated
    Sep 18, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Soheil Shayegh; Soheil Shayegh; Giorgia Coppola; Giorgia Coppola
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Sep 18, 2025
    Area covered
    European Union
    Description

    README — Code and data
    Project: LOCALISED

    Work Package 7, Task 7.1

    Paper: A Systemic Framework for Assessing the Risk of Decarbonization to Regional Manufacturing Activities in the European Union

    What this repo does
    -------------------
    Builds the Transition-Risk Index (TRI) for EU manufacturing at NUTS-2 × NACE Rev.2, and reproduces the article's Figures 3–6:
    • Exposure (emissions by region/sector)
    • Vulnerability (composite index)
    • Risk = Exposure ⊗ Vulnerability
    Outputs include intermediate tables, the final analysis dataset, and publication figures.

    Folder of interest
    ------------------
    Code and data/
    ├─ Code/                   # R scripts (run in order 1A → 5)
    │  └─ Create Initial Data/ # scripts to (re)build Initial data/ from Eurostat API with imputation
    ├─ Initial data/           # Eurostat inputs imputed for missing values
    ├─ Derived data/           # intermediates
    ├─ Final data/             # final analysis-ready tables
    └─ Figures/                # exported figures

    Quick start
    -----------
    1) Open R (or RStudio) and set the working directory to "Code and data/Code".
    Example: setwd(".../Code and data/Code")
    2) Initial data/ contains the required Eurostat inputs referenced by the scripts.
    To reproduce the inputs in Initial data/, run the scripts in Code/Create Initial Data/.
    These scripts download the required datasets from the respective API and impute missing values; outputs are written to ../Initial data/.
    3) Run scripts sequentially (they use relative paths to ../Raw data, ../Derived data, etc.):
    1A-non-sector-data.R → 1B-sector-data.R → 1C-all-data.R → 2-reshape-data.R → 3-normalize-data-by-n-enterpr.R → 4-risk-aggregation.R → 5A-results-maps.R, 5B-results-radar.R

    What each script does
    ---------------------
    Create Initial Data — Recreate inputs
    • Download source tables from the Eurostat API or the Localised DSP, apply light cleaning, and impute missing values.
    • Write the resulting inputs to Initial data/ for the analysis pipeline.

    1A / 1B / 1C — Build the unified base
    • Read individual Eurostat datasets (some sectoral, some only regional).
    • Harmonize, aggregate, and align them into a single analysis-ready schema.
    • Write aggregated outputs to Derived data/ (and/or Final data/ as needed).

    2 — Reshape and enrich
    • Reshapes the combined data and adds metadata.
    • Output: Derived data/2_All_data_long_READY.xlsx (all raw indicators in tidy long format, with indicator names and values).

    3 — Normalize (enterprises & min–max)
    • Divide selected indicators by number of enterprises.
    • Apply min–max normalization to [0.01, 0.99] (see the sketch below).
    • Exposure keeps real zeros (zeros remain zero).
    • Write normalized tables to Derived data/ or Final data/.
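
    A minimal sketch of that scaling step (not the project's script; the function and argument names are illustrative):

    # Rescale an indicator to [0.01, 0.99]; optionally keep real zeros at zero,
    # as described for the Exposure indicators.
    minmax_01_99 <- function(x, keep_zeros = FALSE) {
      z <- 0.01 + 0.98 * (x - min(x, na.rm = TRUE)) /
           (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
      if (keep_zeros) z[x == 0] <- 0
      z
    }
    minmax_01_99(c(0, 2, 5, 10), keep_zeros = TRUE)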

    4 — Aggregate indices
    • Vulnerability: build dimension scores (Energy, Labour, Finance, Supply Chain, Technology).
    – Within each dimension: equal-weight mean of directionally aligned, [0.01, 0.99]-scaled indicators.
    – Dimension scores are re-scaled to [0.01, 0.99].
    • Aggregate Vulnerability: equal-weight mean of the five dimensions.
    • TRI (Risk): combine Exposure (E) and Vulnerability (V) via a weighted geometric rule with α = 0.5 in the baseline (see the sketch below).
    – Policy-intuitive properties: high E & high V → high risk; imbalances penalized (non-compensatory).
    • Output: Final data/ (main analysis tables).
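
    A minimal sketch of this aggregation rule as described above (α = 0.5 comes from the text; the exact implementation in the scripts may differ):

    # Equal-weight mean of the five dimension scores, then a weighted geometric
    # combination of Exposure (E) and Vulnerability (V).
    vulnerability <- function(energy, labour, finance, supply_chain, technology) {
      rowMeans(cbind(energy, labour, finance, supply_chain, technology))
    }
    tri <- function(E, V, alpha = 0.5) E^alpha * V^(1 - alpha)
    tri(E = 0.8, V = 0.2)   # 0.4: an imbalance between E and V is penalized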

    5A / 5B — Visualize results
    • 5A: maps and distribution plots for Exposure, Vulnerability, and Risk → Figures 3 & 4.
    • 5B: comparative/radar profiles for selected countries/regions/subsectors → Figures 5 & 6.
    • Outputs saved to Figures/.

    Data flow (at a glance)
    -----------------------
    Initial data → (1A–1C) Aggregated base → (2) Tidy long file → (3) Normalized indicators → (4) Composite indices → (5) Figures
    Intermediate outputs go to Derived data/ (including 2_All_data_long_READY.xlsx); final tables and figures go to Final data/ and Figures/.

    Assumptions & conventions
    -------------------------
    • Geography: EU NUTS-2 regions; Sector: NACE Rev.2 manufacturing subsectors.
    • Equal weights by default where no evidence supports alternatives.
    • All indicators directionally aligned so that higher = greater transition difficulty.
    • Relative paths assume working directory = Code/.

    Reproducing the article
    -----------------------
    • Optionally run the scripts in the Code/Create Initial Data subfolder
    • Run 1A → 5B without interruption to regenerate:
    – Figure 3: Exposure, Vulnerability, Risk maps (total manufacturing).
    – Figure 4: Vulnerability dimensions (Energy, Labour, Finance, Supply Chain, Technology).
    – Figure 5: Drivers of risk—highest vs. lowest risk regions (example: Germany & Greece).
    – Figure 6: Subsector case (e.g., basic metals) by selected regions.
    • Final tables for the paper live in Final data/. Figures export to Figures/.

    Requirements
    ------------
    • R (version per your environment).
    • Install any missing packages listed at the top of each script (e.g., install.packages("...")).

    Troubleshooting
    ---------------
    • "File not found": check that the previous script finished and wrote its outputs to the expected folder.
    • Paths: confirm getwd() ends with /Code so relative paths resolve to ../Raw data, ../Derived data, etc.
    • Reruns: optionally clear Derived data/, Final data/, and Figures/ before a clean rebuild.

    Provenance & citation
    ---------------------
    • Inputs: Eurostat and related sources cited in the paper and headers of the scripts.
    • Methods: OECD composite-indicator guidance; IPCC AR6 risk framing (see paper references).
    • If you use this code, please cite the article:
    A Systemic Framework for Assessing the Risk of Decarbonization to Regional Manufacturing Activities in the European Union.

  9. DataSheet1_TimeNorm: a novel normalization method for time course microbiome...

    • datasetcatalog.nlm.nih.gov
    • frontiersin.figshare.com
    Updated Sep 24, 2024
    Cite
    An, Lingling; Lu, Meng; Butt, Hamza; Luo, Qianwen; Du, Ruofei; Lytal, Nicholas; Jiang, Hongmei (2024). DataSheet1_TimeNorm: a novel normalization method for time course microbiome data.pdf [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001407445
    Dataset updated
    Sep 24, 2024
    Authors
    An, Lingling; Lu, Meng; Butt, Hamza; Luo, Qianwen; Du, Ruofei; Lytal, Nicholas; Jiang, Hongmei
    Description

    Metagenomic time-course studies provide valuable insights into the dynamics of microbial systems and have become increasingly popular alongside the reduction in costs of next-generation sequencing technologies. Normalization is a common but critical preprocessing step before proceeding with downstream analysis. To the best of our knowledge, currently there is no reported method to appropriately normalize microbial time-series data. We propose TimeNorm, a novel normalization method that considers the compositional property and time dependency in time-course microbiome data. It is the first method designed for normalizing time-series data within the same time point (intra-time normalization) and across time points (bridge normalization), separately. Intra-time normalization normalizes microbial samples under the same condition based on common dominant features. Bridge normalization detects and utilizes a group of most stable features across two adjacent time points for normalization. Through comprehensive simulation studies and application to a real study, we demonstrate that TimeNorm outperforms existing normalization methods and boosts the power of downstream differential abundance analysis.

  10. LEMming: A Linear Error Model to Normalize Parallel Quantitative Real-Time...

    • plos.figshare.com
    pdf
    Updated Jun 2, 2023
    Cite
    Ronny Feuer; Sebastian Vlaic; Janine Arlt; Oliver Sawodny; Uta Dahmen; Ulrich M. Zanger; Maria Thomas (2023). LEMming: A Linear Error Model to Normalize Parallel Quantitative Real-Time PCR (qPCR) Data as an Alternative to Reference Gene Based Methods [Dataset]. http://doi.org/10.1371/journal.pone.0135852
    Explore at: pdf
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Ronny Feuer; Sebastian Vlaic; Janine Arlt; Oliver Sawodny; Uta Dahmen; Ulrich M. Zanger; Maria Thomas
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Gene expression analysis is an essential part of biological and medical investigations. Quantitative real-time PCR (qPCR) is characterized by excellent sensitivity, dynamic range and reproducibility, and is still regarded as the gold standard for quantifying transcript abundance. Parallelization of qPCR, such as on the microfluidic TaqMan Fluidigm Biomark platform, enables evaluation of multiple transcripts in samples treated under various conditions. Despite advanced technologies, correct evaluation of the measurements remains challenging. The most widely used methods for evaluating or calculating gene expression data include geNorm and ΔΔCt, respectively. They rely on one or several stable reference genes (RGs) for normalization, thus potentially causing biased results. We therefore applied multivariable regression with a tailored error model to overcome the necessity of stable RGs.

    Results: We developed an RG-independent data normalization approach based on a tailored linear error model for parallel qPCR data, called LEMming. It uses the assumption that the mean Ct values within samples of similarly treated groups are equal. Performance of LEMming was evaluated in three data sets with different stability patterns of RGs and compared to the results of geNorm normalization. Data set 1 showed that both methods gave similar results if stable RGs are available. Data set 2 included RGs which are stable according to geNorm criteria, but which became differentially expressed in normalized data evaluated by a t-test. geNorm-normalized data showed an effect of a shifted mean per gene per condition whereas LEMming-normalized data did not. Comparing the decrease of standard deviation from raw data to geNorm and to LEMming, the latter was superior. In data set 3, stable RGs were available according to geNorm's average expression stability and pairwise variation, but t-tests of raw data contradicted this. Normalization with RGs resulted in distorted data contradicting the literature, while LEMming-normalized data did not.

    Conclusions: If RGs are coexpressed but are not independent of the experimental conditions, the stability criteria based on inter- and intragroup variation fail. The linear error model developed, LEMming, overcomes the dependency on RGs for parallel qPCR measurements, besides resolving biases of both technical and biological nature in qPCR. However, to distinguish systematic errors per treated group from a global treatment effect, an additional measurement is needed. Quantification of total cDNA content per sample helps to identify systematic errors.
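
    As a rough illustration of the stated assumption (equal mean Ct within a treatment group), the following sketch removes sample-specific offsets without reference genes; it is illustrative only and is not the LEMming implementation.

    # Toy Ct matrix: genes x samples, two treatment groups of three samples each.
    set.seed(1)
    ct <- matrix(rnorm(5 * 6, mean = 25), nrow = 5,
                 dimnames = list(paste0("gene", 1:5), paste0("sample", 1:6)))
    group <- rep(c("ctrl", "treated"), each = 3)

    sample_means <- colMeans(ct)
    group_means  <- ave(sample_means, group)          # target mean per group
    offsets      <- sample_means - group_means        # sample-specific deviations
    ct_norm      <- sweep(ct, 2, offsets, FUN = "-")  # remove the offsets
    colMeans(ct_norm)                                 # now equal within each group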

  11. 🔢🖊️ Digital Recognition: MNIST Dataset

    • kaggle.com
    zip
    Updated Nov 13, 2025
    Cite
    Wasiq Ali (2025). 🔢🖊️ Digital Recognition: MNIST Dataset [Dataset]. https://www.kaggle.com/datasets/wasiqaliyasir/digital-mnist-dataset
    Explore at: zip (2,278,207 bytes)
    Dataset updated
    Nov 13, 2025
    Authors
    Wasiq Ali
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Handwritten Digits Pixel Dataset - Documentation

    Overview

    The Handwritten Digits Pixel Dataset is a collection of numerical data representing handwritten digits from 0 to 9. Unlike image datasets that store actual image files, this dataset contains pixel intensity values arranged in a structured tabular format, making it ideal for machine learning and data analysis applications.

    Dataset Description

    Basic Information

    • Format: CSV (Comma-Separated Values)
    • Total Samples: [Number of rows based on your dataset]
    • Features: 784 pixel columns (28×28 pixels) + 1 label column
    • Label Range: Digits 0-9
    • Pixel Value Range: 0-255 (grayscale intensity)

    File Structure

    Column Description

    • label: The target variable representing the digit (0-9)
    • pixel columns: 784 columns named in the format [row]x[column]
    • Each pixel column contains integer values from 0-255 representing grayscale intensity

    Data Characteristics

    Label Distribution

    The dataset contains handwritten digit samples with the following distribution:

    • Digit 0: [X] samples
    • Digit 1: [X] samples
    • Digit 2: [X] samples
    • Digit 3: [X] samples
    • Digit 4: [X] samples
    • Digit 5: [X] samples
    • Digit 6: [X] samples
    • Digit 7: [X] samples
    • Digit 8: [X] samples
    • Digit 9: [X] samples

    (Note: Actual distribution counts would be calculated from your specific dataset)

    Data Quality

    • Missing Values: No missing values detected
    • Data Type: All values are integers
    • Normalization: Pixel values range from 0-255 (can be normalized to 0-1 for ML models)
    • Consistency: Uniform 28×28 grid structure across all samples

    Technical Specifications

    Data Preprocessing Requirements

    • Normalization: Scale pixel values from 0-255 to 0-1 range
    • Reshaping: Convert 1D pixel arrays to 2D 28×28 matrices for visualization
    • Train-Test Split: Recommended 80-20 or 70-30 split for model development

    Recommended Machine Learning Approaches

    Classification Algorithms:

    • Random Forest
    • Support Vector Machines (SVM)
    • Neural Networks
    • K-Nearest Neighbors (KNN)

    Deep Learning Architectures:

    • Convolutional Neural Networks (CNNs)
    • Multi-layer Perceptrons (MLPs)

    Dimensionality Reduction:

    • PCA (Principal Component Analysis)
    • t-SNE for visualization

    Usage Examples

    Loading the Dataset

    import pandas as pd
    
    # Load the dataset
    df = pd.read_csv('/kaggle/input/handwritten_digits_pixel_dataset/mnist.csv')
    
    # Separate features and labels
    X = df.drop('label', axis=1)
    y = df['label']
    
    # Normalize pixel values
    X_normalized = X / 255.0
    
  12. Naturalistic Neuroimaging Database

    • openneuro.org
    Updated Apr 20, 2021
    + more versions
    Cite
    Sarah Aliko; Jiawen Huang; Florin Gheorghiu; Stefanie Meliss; Jeremy I Skipper (2021). Naturalistic Neuroimaging Database [Dataset]. http://doi.org/10.18112/openneuro.ds002837.v1.1.3
    Dataset updated
    Apr 20, 2021
    Dataset provided by
    OpenNeuro (https://openneuro.org/)
    Authors
    Sarah Aliko; Jiawen Huang; Florin Gheorghiu; Stefanie Meliss; Jeremy I Skipper
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Overview

    • The Naturalistic Neuroimaging Database (NNDb v2.0) contains datasets from 86 human participants doing the NIH Toolbox and then watching one of 10 full-length movies during functional magnetic resonance imaging (fMRI). The participants were all right-handed, native English speakers, with no history of neurological/psychiatric illnesses, no hearing impairments, unimpaired or corrected vision, and taking no medication. Each movie was stopped in 40-50 minute intervals or when participants asked for a break, resulting in 2-6 runs of BOLD-fMRI. A 10 minute high-resolution defaced T1-weighted anatomical MRI scan (MPRAGE) is also provided.
    • The NNDb V2.0 is now on Neuroscout, a platform for fast and flexible re-analysis of (naturalistic) fMRI studies. See: https://neuroscout.org/

    v2.0 Changes

    • Overview
      • We have replaced our own preprocessing pipeline with that implemented in AFNI's afni_proc.py, thus changing only the derivative files. This introduces a fix for an issue with our normalization (i.e., scaling) step and modernizes and standardizes the preprocessing applied to the NNDb derivative files. We have done a bit of testing and have found that results in both pipelines are quite similar in terms of the resulting spatial patterns of activity but with the benefit that the afni_proc.py results are 'cleaner' and statistically more robust.
    • Normalization

      • Emily Finn and Clare Grall at Dartmouth and Rick Reynolds and Paul Taylor at AFNI discovered and showed us that the normalization procedure we used for the derivative files was less than ideal for timeseries runs of varying lengths. Specifically, the 3dDetrend flag -normalize makes 'the sum-of-squares equal to 1'. We had not considered the implication that the resulting normalized timeseries amplitudes are affected by run length, increasing as run length decreases (perhaps this should be noted in 3dDetrend's help text). To demonstrate this, I wrote a version of 3dDetrend's -normalize for R so you can see for yourselves by running the following code:
      # Generate a resting state (rs) timeseries (ts)
      # Install / load package to make fake fMRI ts
      # install.packages("neuRosim")
      library(neuRosim)
      # Generate a ts
      ts.rs <- simTSrestingstate(nscan=2000, TR=1, SNR=1)
      # 3dDetrend -normalize
      # R command version for 3dDetrend -normalize -polort 0 which normalizes by making "the sum-of-squares equal to 1"
      # Do for the full timeseries
      ts.normalised.long <- (ts.rs-mean(ts.rs))/sqrt(sum((ts.rs-mean(ts.rs))^2));
      # Do this again for a shorter version of the same timeseries
      ts.shorter.length <- length(ts.normalised.long)/4
      ts.normalised.short <- (ts.rs[1:ts.shorter.length]- mean(ts.rs[1:ts.shorter.length]))/sqrt(sum((ts.rs[1:ts.shorter.length]- mean(ts.rs[1:ts.shorter.length]))^2));
      # By looking at the summaries, it can be seen that the median values become larger
      summary(ts.normalised.long)
      summary(ts.normalised.short)
      # Plot results for the long and short ts
      # Truncate the longer ts for plotting only
      ts.normalised.long.made.shorter <- ts.normalised.long[1:ts.shorter.length]
      # Give the plot a title
      title <- "3dDetrend -normalize for long (blue) and short (red) timeseries";
      plot(x=0, y=0, main=title, xlab="", ylab="", xaxs='i', xlim=c(1,length(ts.normalised.short)), ylim=c(min(ts.normalised.short),max(ts.normalised.short)));
      # Add zero line
      lines(x=c(-1,ts.shorter.length), y=rep(0,2), col='grey');
      # 3dDetrend -normalize -polort 0 for long timeseries
      lines(ts.normalised.long.made.shorter, col='blue');
      # 3dDetrend -normalize -polort 0 for short timeseries
      lines(ts.normalised.short, col='red');
      
    • Standardization/modernization

      • The above individuals also encouraged us to implement the afni_proc.py script over our own pipeline. It introduces at least three additional improvements: First, we now use Bob's @SSwarper to align our anatomical files with an MNI template (now MNI152_2009_template_SSW.nii.gz) and this, in turn, integrates nicely into the afni_proc.py pipeline. This seems to result in a generally better or more consistent alignment, though this is only a qualitative observation. Second, all the transformations / interpolations and detrending are now done in fewer steps compared to our pipeline. This is preferable because, e.g., there is less chance of inadvertently reintroducing noise back into the timeseries (see Lindquist, Geuter, Wager, & Caffo 2019). Finally, many groups are advocating using tools like fMRIPrep or afni_proc.py to increase standardization of analysis practices in our neuroimaging community. This presumably results in less error, less heterogeneity and more interpretability of results across studies. Along these lines, the quality control ('QC') html pages generated by afni_proc.py are a real help in assessing data quality and almost a joy to use.
    • New afni_proc.py command line

      • The following is the afni_proc.py command line that we used to generate blurred and censored timeseries files. The afni_proc.py tool comes with extensive help and examples. As such, you can quickly understand our preprocessing decisions by scrutinising the below. Specifically, the following command is most similar to Example 11 for 'Resting state analysis' in the help file (see https://afni.nimh.nih.gov/pub/dist/doc/program_help/afni_proc.py.html):

        afni_proc.py \
          -subj_id "$sub_id_name_1" \
          -blocks despike tshift align tlrc volreg mask blur scale regress \
          -radial_correlate_blocks tcat volreg \
          -copy_anat anatomical_warped/anatSS.1.nii.gz \
          -anat_has_skull no \
          -anat_follower anat_w_skull anat anatomical_warped/anatU.1.nii.gz \
          -anat_follower_ROI aaseg anat freesurfer/SUMA/aparc.a2009s+aseg.nii.gz \
          -anat_follower_ROI aeseg epi freesurfer/SUMA/aparc.a2009s+aseg.nii.gz \
          -anat_follower_ROI fsvent epi freesurfer/SUMA/fs_ap_latvent.nii.gz \
          -anat_follower_ROI fswm epi freesurfer/SUMA/fs_ap_wm.nii.gz \
          -anat_follower_ROI fsgm epi freesurfer/SUMA/fs_ap_gm.nii.gz \
          -anat_follower_erode fsvent fswm \
          -dsets media_?.nii.gz \
          -tcat_remove_first_trs 8 \
          -tshift_opts_ts -tpattern alt+z2 \
          -align_opts_aea -cost lpc+ZZ -giant_move -check_flip \
          -tlrc_base "$basedset" \
          -tlrc_NL_warp \
          -tlrc_NL_warped_dsets \
            anatomical_warped/anatQQ.1.nii.gz \
            anatomical_warped/anatQQ.1.aff12.1D \
            anatomical_warped/anatQQ.1_WARP.nii.gz \
          -volreg_align_to MIN_OUTLIER \
          -volreg_post_vr_allin yes \
          -volreg_pvra_base_index MIN_OUTLIER \
          -volreg_align_e2a \
          -volreg_tlrc_warp \
          -mask_opts_automask -clfrac 0.10 \
          -mask_epi_anat yes \
          -blur_to_fwhm -blur_size $blur \
          -regress_motion_per_run \
          -regress_ROI_PC fsvent 3 \
          -regress_ROI_PC_per_run fsvent \
          -regress_make_corr_vols aeseg fsvent \
          -regress_anaticor_fast \
          -regress_anaticor_label fswm \
          -regress_censor_motion 0.3 \
          -regress_censor_outliers 0.1 \
          -regress_apply_mot_types demean deriv \
          -regress_est_blur_epits \
          -regress_est_blur_errts \
          -regress_run_clustsim no \
          -regress_polort 2 \
          -regress_bandpass 0.01 1 \
          -html_review_style pythonic

      We used similar command lines to generate the 'blurred and not censored' and the 'not blurred and not censored' timeseries files (described more fully below). We will provide the code used to make all derivative files available on our github site (https://github.com/lab-lab/nndb).

      We made one choice above that is different enough from our original pipeline that it is worth mentioning here. Specifically, we have quite long runs, with the average being ~40 minutes, but this number can be variable (thus leading to the above issue with 3dDetrend's -normalize). A discussion on the AFNI message board with one of our team (starting here, https://afni.nimh.nih.gov/afni/community/board/read.php?1,165243,165256#msg-165256), led to the suggestion that '-regress_polort 2' with '-regress_bandpass 0.01 1' be used for long runs. We had previously used only a variable polort with the suggested 1 + int(D/150) approach. Our new polort 2 + bandpass approach has the added benefit of working well with afni_proc.py.

      Which timeseries file you use is up to you but I have been encouraged by Rick and Paul to include a sort of PSA about this. In Paul's own words:
      • Blurred data should not be used for ROI-based analyses (and potentially not for ICA? I am not certain about standard practice).
      • Unblurred data for ISC might be pretty noisy for voxelwise analyses, since blurring should effectively boost the SNR of active regions (and even good alignment won't be perfect everywhere).
      • For uncensored data, one should be concerned about motion effects being left in the data (e.g., spikes in the data).
      • For censored data:
        • Performing ISC requires the users to unionize the censoring patterns during the correlation calculation.
        • If wanting to calculate power spectra or spectral parameters like ALFF/fALFF/RSFA etc. (which some people might do for naturalistic tasks still), then standard FT-based methods can't be used because sampling is no longer uniform. Instead, people could use something like 3dLombScargle+3dAmpToRSFC, which calculates power spectra (and RSFC params) based on a generalization of the FT that can handle non-uniform sampling, as long as the censoring pattern is mostly random and, say, only up to about 10-15% of the data.
      In sum, think very carefully about which files you use. If you find you need a file we have not provided, we can happily generate different versions of the timeseries upon request and can generally do so in a week or less.

    • Effect on results

      • From numerous tests on our own analyses, we have qualitatively found that results using our old vs the new afni_proc.py preprocessing pipeline do not change all that much in terms of general spatial patterns. There is, however, an
  13. UniCourt Court Data API - USA Court Records (AI Normalized)

    • datarade.ai
    .json, .csv, .xls
    Updated Jul 8, 2022
    Cite
    UniCourt (2022). UniCourt Court Data API - USA Court Records (AI Normalized) [Dataset]. https://datarade.ai/data-products/court-data-api-unicourt-2c86
    Explore at: .json, .csv, .xls
    Dataset updated
    Jul 8, 2022
    Dataset provided by
    Unicourt
    Authors
    UniCourt
    Area covered
    United States
    Description

    UniCourt simplifies access to structured court records with our Court Data API, so you can search court cases via API, get real-time alerts with webhooks, streamline your account management, and get bulk access to the AI normalized court data you need.

    Search Court Cases with APIs

    • Leverage UniCourt's easy API integrations to search state and federal (PACER) court records directly from your own internal applications and systems.
    • Access the docket entries and case details you need on the parties, attorneys, law firms, and judges involved in litigation.
    • Conduct the same detailed case searches you can in our app with our APIs and easily narrow your search results using our jurisdiction, case type, and case status filters.
    • Use our Related Cases API to search for and download all of the court data for consolidated cases from the Judicial Panel on Multidistrict Litigation, as well as associated civil and criminal cases from U.S. District Courts.

    Get Real-Time Alerts with Webhooks

    • UniCourt's webhooks provide you with industry leading automation tools for real-time push notifications to your internal applications for all your case tracking needs.
    • Get daily court data feeds with new case results for your automated court searches pushed directly to your applications in a structured format.
    • Use our custom search file webhook to search for and track thousands of entities at once and receive your results packaged into a custom CSV file.
    • Avoid making multiple API calls to figure out if a case has updates or not and remove the need to continuously check the status of large document orders and updates.

    Bulk Access to Court Data

    • UniCourt downloads thousands of new cases every day from state and federal courts, and we structure them, normalize them with our AI, and make them accessible in bulk via our Court Data API.
    • Our rapidly growing CrowdSourced Library™ provides you with a massive free repository of 100+ million court cases, tens of millions of court documents, and billions of docket entries all at your fingertips.
    • Leverage your bulk access to AI normalized court data that's been enriched with other public data sets to build your own analytics, competitive intelligence, and machine learning models.

    Streamlined Account Management

    • Easily manage your UniCourt account with information on your billing cycle and billing usage delivered to you via API.
    • Eliminate the requirement of logging in to your account to get a list of all of your invoices and use our APIs to directly download the invoices you need.
    • Get detailed data on which cases are being tracked by the users for your account and access all of the related tracking schedules for cases your users are tracking.
    • Gather complete information on the saved searches being run by your account, including the search parameters, filters, and much more.

  14. Quantitative analysis of murine T-cells

    • ebi.ac.uk
    • data.niaid.nih.gov
    Cite
    Alexander Schmidt, Quantitative analysis of murine T-cells [Dataset]. https://www.ebi.ac.uk/pride/archive/projects/PXD000543
    Authors
    Alexander Schmidt
    Variables measured
    Proteomics
    Description

    1D-LC-MS/MS analysis of OGE-prefractionated and TMT-labeled mouse samples consisting of 2 unstimulated and 2 stimulated T-cell samples. The acquired raw files were converted to the Mascot generic file (mgf) format using the msconvert tool (part of ProteoWizard, version 3.0.4624 (2013-6-3)). Using the MASCOT algorithm (Matrix Science, Version 2.4.0), the mgf files were searched against a decoy database containing normal and reverse sequences of the predicted SwissProt entries of Mus musculus (www.ebi.ac.uk, release date 16/05/2012) and commonly observed contaminants (in total 33,832 sequences) generated using the SequenceReverser tool from the MaxQuant software (Version 1.0.13.13). The precursor ion tolerance was set to 10 ppm and the fragment ion tolerance was set to 0.01 Da. The search criteria were set as follows: full tryptic specificity was required (cleavage after lysine or arginine residues unless followed by proline), 2 missed cleavages were allowed, carbamidomethylation (C) and TMT6plex (K and peptide N-terminus) were set as fixed modifications, and oxidation (M) as a variable modification.

    Next, the database search results were imported to the Scaffold Q+ software (version 4.1.1, Proteome Software Inc., Portland, OR) and the protein false identification rate was set to 1% based on the number of decoy hits. Specifically, peptide identifications were accepted if they could be established at greater than 94.0% probability to achieve an FDR less than 1.0% by the Scaffold local FDR algorithm. Protein identifications were accepted if they could be established at greater than 6.0% probability to achieve an FDR less than 1.0% and contained at least 1 identified peptide. Protein probabilities were assigned by the Protein Prophet program (Nesvizhskii, et al, Anal. Chem. 2003; 75(17):4646-58). Proteins that contained similar peptides and could not be differentiated based on MS/MS analysis alone were grouped to satisfy the principles of parsimony. Proteins sharing significant peptide evidence were grouped into clusters.

    For quantification, acquired reporter ion intensities in the experiment were globally normalized across all acquisition runs. Individual quantitative samples were normalized within each acquisition run. Intensities for each peptide identification were normalized within the assigned protein. The reference channels were normalized to produce a 1:1 fold change. All normalization calculations were performed using medians to multiplicatively normalize data. A list of identified and quantified proteins is available in the xls file.
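
    For readers unfamiliar with median-based multiplicative normalization, here is a minimal sketch of the idea on a synthetic intensity matrix; it is illustrative only and is not the Scaffold Q+ implementation used above.

    # Scale each sample's reporter-ion intensities so all sample medians coincide.
    normalize_by_median <- function(mat) {
      med <- apply(mat, 2, median, na.rm = TRUE)       # per-sample medians
      sweep(mat, 2, med / mean(med), FUN = "/")        # multiplicative scaling
    }
    set.seed(1)
    m <- matrix(rlnorm(40, meanlog = 10), nrow = 10,
                dimnames = list(NULL, paste0("TMT_", 1:4)))
    apply(normalize_by_median(m), 2, median)           # medians now (nearly) equal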

  15. Table_2_Comparison of Normalization Methods for Analysis of TempO-Seq...

    • figshare.com
    • datasetcatalog.nlm.nih.gov
    • +1more
    xlsx
    Updated Jun 2, 2023
    + more versions
    Cite
    Pierre R. Bushel; Stephen S. Ferguson; Sreenivasa C. Ramaiahgari; Richard S. Paules; Scott S. Auerbach (2023). Table_2_Comparison of Normalization Methods for Analysis of TempO-Seq Targeted RNA Sequencing Data.xlsx [Dataset]. http://doi.org/10.3389/fgene.2020.00594.s002
    Explore at: xlsx
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    Frontiers
    Authors
    Pierre R. Bushel; Stephen S. Ferguson; Sreenivasa C. Ramaiahgari; Richard S. Paules; Scott S. Auerbach
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of bulk RNA sequencing (RNA-Seq) data is a valuable tool to understand transcription at the genome scale. Targeted sequencing of RNA has emerged as a practical means of assessing the majority of the transcriptomic space with less reliance on large resources for consumables and bioinformatics. TempO-Seq is a templated, multiplexed RNA-Seq platform that interrogates a panel of sentinel genes representative of genome-wide transcription. Nuances of the technology require proper preprocessing of the data. Various methods have been proposed and compared for normalizing bulk RNA-Seq data, but there has been little to no investigation of how the methods perform on TempO-Seq data. We simulated count data into two groups (treated vs. untreated) at seven fold-change (FC) levels (including no change) using control samples from human HepaRG cells run on TempO-Seq and normalized the data using seven normalization methods. Upper Quartile (UQ) performed the best with regard to maintaining FC levels as detected by a limma contrast between treated vs. untreated groups. For all FC levels, specificity of the UQ normalization was greater than 0.84 and sensitivity greater than 0.90, except for the no change and +1.5 levels. Furthermore, K-means clustering of the simulated genes normalized by UQ agreed the most with the FC assignments [adjusted Rand index (ARI) = 0.67]. Despite having an assumption of the majority of genes being unchanged, the DESeq2 scaling factors normalization method performed reasonably well, as did the simple normalization procedures counts per million (CPM) and total counts (TC). These results suggest that for two-class comparisons of TempO-Seq data, UQ, CPM, TC, or DESeq2 normalization should provide reasonably reliable results at absolute FC levels ≄ 2.0. These findings will help guide researchers to normalize TempO-Seq gene expression data for more reliable results.
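
    As a quick reference for two of the simpler methods compared above, here is a minimal sketch on synthetic counts (illustrative only; it is not the study's code, and the exact scaling conventions may differ from the packages used in the paper).

    # Counts per million (CPM) and upper-quartile (UQ) scaling for a genes x samples matrix.
    cpm <- function(counts) t(t(counts) / colSums(counts)) * 1e6
    uq  <- function(counts) {
      uq_factor <- apply(counts, 2, function(x) quantile(x[x > 0], 0.75))
      t(t(counts) / uq_factor) * mean(uq_factor)   # rescale to keep magnitudes comparable
    }
    set.seed(1)
    counts <- matrix(rnbinom(60, mu = 50, size = 1), nrow = 10,
                     dimnames = list(paste0("gene", 1:10), paste0("s", 1:6)))
    head(cpm(counts)); head(uq(counts))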

  16. (high-temp) No 8. Metadata Analysis (16S rRNA/ITS) Output

    • search.dataone.org
    • smithsonian.figshare.com
    • +1more
    Updated Aug 15, 2024
    Cite
    Jarrod Scott (2024). (high-temp) No 8. Metadata Analysis (16S rRNA/ITS) Output [Dataset]. https://search.dataone.org/view/urn%3Auuid%3A718e0794-b5ff-4919-95ef-4a90a7890a5b
    Dataset updated
    Aug 15, 2024
    Dataset provided by
    Smithsonian Research Data Repository
    Authors
    Jarrod Scott
    Description

    Output files from the 8. Metadata Analysis Workflow page of the SWELTR high-temp study. In this workflow, we compared environmental metadata with microbial communities. The workflow is split into two parts.

    metadata_ssu18_wf.rdata: Part 1 contains all variables and objects for the 16S rRNA analysis. To list the objects, run load("metadata_ssu18_wf.rdata", verbose=TRUE) in R.

    metadata_its18_wf.rdata: Part 2 contains all variables and objects for the ITS analysis. To list the objects, run load("metadata_its18_wf.rdata", verbose=TRUE) in R.
    Additional files:

    In both workflows, we run the following steps:

    1) Metadata Normality Tests: run a Shapiro-Wilk test to check whether each metadata parameter is normally distributed.
    2) Normalize Parameters: use the R package bestNormalize to find and apply the best normalizing transformation (a Python sketch of steps 1 and 2 appears after the source-code link below).
    3) Split Metadata Parameters into Groups: a) environmental and edaphic properties, b) microbial functional responses, and c) temperature adaptation properties.
    4) Autocorrelation Tests: test all possible pair-wise comparisons, on both normalized and non-normalized data sets, for each group.
    5) Remove Autocorrelated Parameters: drop the autocorrelated parameters from each group.
    6) Dissimilarity Correlation Tests: use Mantel tests to see whether any of the metadata groups are significantly correlated with the community data.
    7) Best Subset of Variables: determine which of the metadata parameters from each group are the most strongly correlated with the community data, using the bioenv function from the vegan package.
    8) Distance-based Redundancy Analysis: ordination analysis of samples with metadata vector overlays using capscale, also from the vegan package.

    Source code for the workflow can be found here:
    https://github.com/sweltr/high-temp/blob/master/metadata.Rmd
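
    The workflow itself is written in R (bestNormalize, vegan); as a rough, assumed analogue of steps 1 and 2 only, the Python sketch below runs a Shapiro-Wilk test per metadata column and log-transforms the columns that fail it. The column names and data are invented for illustration, and the real workflow searches over several candidate transformations rather than defaulting to a log.

    import numpy as np
    import pandas as pd
    from scipy import stats

    # Hypothetical metadata table; the column names are invented.
    rng = np.random.default_rng(1)
    meta = pd.DataFrame({
        "soil_temp": rng.normal(25, 3, 40),
        "respiration": rng.lognormal(0.5, 0.4, 40),
    })

    # Step 1: Shapiro-Wilk normality test for each parameter.
    for col in meta.columns:
        result = stats.shapiro(meta[col])
        verdict = "approx. normal" if result.pvalue > 0.05 else "non-normal"
        print(f"{col}: W={result.statistic:.3f}, p={result.pvalue:.3g} ({verdict})")

    # Step 2 (simplified): log-transform the parameters that fail the test.
    normalized = meta.copy()
    for col in meta.columns:
        if stats.shapiro(meta[col]).pvalue <= 0.05:
            normalized[col] = np.log1p(meta[col])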

  17. G

    Energy Baseline Normalization Tool Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Oct 4, 2025
    Growth Market Reports (2025). Energy Baseline Normalization Tool Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/energy-baseline-normalization-tool-market
    Explore at:
    pdf, pptx, csvAvailable download formats
    Dataset updated
    Oct 4, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Energy Baseline Normalization Tool Market Outlook



    According to our latest research, the global Energy Baseline Normalization Tool market size reached USD 1.42 billion in 2024, with a robust compound annual growth rate (CAGR) of 12.8% projected through 2033. By the end of this forecast period, the market is expected to expand significantly, reaching a value of USD 4.09 billion by 2033. The primary growth factor driving this market is the increasing demand for precise energy management solutions across industries, propelled by regulatory pressures and a global shift toward sustainability and decarbonization.




    The growth of the Energy Baseline Normalization Tool market is primarily fueled by the growing need for organizations to accurately measure, monitor, and optimize their energy consumption. With energy costs representing a significant portion of operational expenses, businesses are under mounting pressure to adopt advanced solutions that provide actionable insights into energy usage patterns. These tools facilitate the normalization of energy baselines, accounting for variables such as weather, occupancy, and production changes, thereby enabling organizations to make informed decisions, validate energy savings, and comply with international energy standards. This trend is particularly pronounced in energy-intensive sectors, such as manufacturing, utilities, and commercial real estate, where even marginal improvements in energy efficiency can yield substantial cost savings and competitive advantages.




    Another critical factor driving the expansion of the Energy Baseline Normalization Tool market is the proliferation of energy efficiency regulations and sustainability reporting requirements worldwide. Governments and regulatory bodies are increasingly mandating rigorous energy tracking and reporting, particularly in the context of climate change mitigation and net-zero commitments. As a result, organizations are compelled to invest in sophisticated tools that not only track energy consumption but also normalize baselines to reflect true performance improvements. This regulatory environment is fostering widespread adoption of baseline normalization solutions, especially among large enterprises and public sector entities aiming to enhance transparency, accountability, and compliance with evolving energy standards.




    Technological advancements and digital transformation initiatives are further accelerating the adoption of Energy Baseline Normalization Tools. The integration of artificial intelligence, machine learning, and IoT-enabled sensors has revolutionized the capabilities of these tools, allowing for real-time data collection, advanced analytics, and predictive modeling. This technological evolution has made energy normalization more accurate, scalable, and accessible for organizations of all sizes. Additionally, the increasing availability of cloud-based deployment options has reduced barriers to entry, enabling even small and medium-sized enterprises to leverage these solutions without significant upfront investments in IT infrastructure. This democratization of technology is expected to continue driving market growth over the coming years.




    Regionally, North America remains the dominant market for Energy Baseline Normalization Tools, accounting for the largest share in 2024, followed closely by Europe and Asia Pacific. The strong presence of leading technology providers, coupled with stringent energy efficiency regulations and a mature industrial base, has positioned North America at the forefront of market adoption. However, Asia Pacific is emerging as the fastest-growing region, driven by rapid industrialization, urbanization, and increasing investment in smart infrastructure and sustainability initiatives. Latin America and the Middle East & Africa are also witnessing steady growth, albeit from a smaller base, as governments and businesses in these regions ramp up efforts to enhance energy efficiency and reduce carbon emissions.





    Component Analysis




  18. o

    The Challenge of Stability in High-Throughput Gene Expression Analysis:...

    • omicsdi.org
    xml
    + more versions
    Emma Carmelo, The Challenge of Stability in High-Throughput Gene Expression Analysis: Comprehensive Selection and Evaluation of Reference Genes for BALB/c Mice Spleen Samples in the Leishmania infantum Infection Model. [Dataset]. https://www.omicsdi.org/dataset/geo/GSE80709
    Explore at:
    xmlAvailable download formats
    Authors
    Emma Carmelo
    Variables measured
    Other
    Description

    The interaction of Leishmania with BALB/c mice induces dramatic changes in transcriptome patterns in the parasite, but also in the target organs (spleen, liver…) due to the host response against infection. Real-time quantitative PCR (qPCR) is an interesting approach to analyze these changes and understand the immunological pathways that lead to protection or progression of disease. However, qPCR results need to be normalized against one or more reference genes (RG) to correct for non-specific experimental variation. The development of technical platforms for high-throughput qPCR analysis, and of powerful software for analysis of qPCR data, has highlighted the problem that some reference genes widely used because of their known or suspected "housekeeping" roles should be avoided due to high expression variability across different tissues or experimental conditions. In this paper we evaluated the stability of 112 genes using three different algorithms (geNorm, NormFinder and RefFinder) in spleen samples from BALB/c mice under different experimental conditions (control and Leishmania infantum-infected mice). Despite minor discrepancies in the stability rankings produced by the three methods, most genes show very similar performance as RG (either good or poor) across this massive data set. Our results show that some of the genes traditionally used as RG in this model (i.e. B2m, Polr2a and Tbp) are clearly outperformed by others. In particular, the combination of Il2rg + Itgb2 was identified among the best-scoring candidate RG for every group of mice and every algorithm used in this experimental model. Finally, we demonstrated that using "traditional" vs. rationally selected RG for normalization of gene expression data may lead to loss of statistical significance of gene expression changes when using large-scale platforms, and therefore to misinterpretation of results. Taken together, our results highlight the need for a comprehensive, high-throughput search for the most stable reference genes in each particular experimental model.

    Overall design: 47 BALB/c mice (14-15 weeks old) were used in this study. Mice were randomly separated into two groups: (i) 23 control mice and (ii) 24 mice infected with 10^6 stationary-phase L. infantum promastigotes via the tail vein. Mice were euthanized by cervical dislocation, and spleens were removed and immediately stored in RNAlater at -70 °C. After RNA extraction and reverse transcription, real-time PCR was performed using the QuantStudio 12K Flex Real-Time PCR System following the manufacturer's instructions, with Custom TaqMan OpenArray Real-Time PCR Plates. Ct values obtained from RT-qPCR were analyzed with the three algorithms (geNorm, NormFinder and RefFinder) to evaluate the stability of the 112 genes and identify those most suitable for normalization of gene expression. Please note that the three algorithms were used only for the identification of the best reference genes in all samples; once identified, those RG were used to normalize gene expression using geNorm. The sample data table includes the data normalized using Il2rg + Itgb2 as reference genes, as identified and validated in the associated publication. The file 'geNorm_Polr2a_Tbp_normalized.txt' includes the data normalized using Polr2a + Tbp as reference genes, two reference genes traditionally used in the literature for this model.
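
    For readers unfamiliar with reference-gene normalization, a common approach is the ΔCt method: a target gene is expressed relative to the mean Ct of the chosen reference genes, with relative quantity = 2^(-ΔCt) under an assumed 100% amplification efficiency. The Python sketch below illustrates this with the Il2rg + Itgb2 pair named above; the Ct values and the target gene are invented, and this is not the geNorm procedure used in the study.

    import pandas as pd

    # Invented Ct values for three mice; reference-gene names follow the entry above.
    ct = pd.DataFrame(
        {"Il2rg": [22.1, 22.4, 21.9], "Itgb2": [23.0, 23.2, 22.8], "Ifng": [28.5, 26.9, 30.1]},
        index=["mouse_1", "mouse_2", "mouse_3"],
    )

    # Normalization factor: mean Ct of the two reference genes per sample.
    ref_ct = ct[["Il2rg", "Itgb2"]].mean(axis=1)

    # Delta-Ct and relative expression of the (hypothetical) target gene.
    delta_ct = ct["Ifng"] - ref_ct
    relative_expression = 2.0 ** (-delta_ct)
    print(relative_expression)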

  19. Thyroid Gland Dataset

    • kaggle.com
    zip
    Updated Sep 2, 2024
    Abdelaziz Sami (2024). Thyroid Gland Dataset [Dataset]. https://www.kaggle.com/abdelazizsami/thyroid-gland-dataset
    Explore at:
    zip(49962 bytes)Available download formats
    Dataset updated
    Sep 2, 2024
    Authors
    Abdelaziz Sami
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    The Thyroid Gland Dataset, commonly used in machine learning and data science projects, is designed for analyzing and predicting thyroid-related conditions. Here's an overview of what you might find in such a dataset:

    Dataset Overview

    1. Purpose:

      • To predict thyroid conditions based on various features such as age, symptoms, and test results.
      • To explore relationships between different features and thyroid disease.
    2. Common Columns:

      • Age: Age of the patient.
      • Lithium: Indicates if the patient is taking lithium (often used as a mood stabilizer).
      • Goitre: Presence of goitre (an enlarged thyroid gland).
      • Tumor: Presence of a thyroid tumor.
      • Hypopituitary: Indicates if there is hypopituitarism (a condition where the pituitary gland is underactive).
      • Psych: Psychological condition of the patient.
      • TSH (Thyroid Stimulating Hormone): A hormone that stimulates thyroid function.
      • T3 (Triiodothyronine): A thyroid hormone.
      • TT4 (Total Thyroxine): A measure of the total thyroxine level in the blood.
      • T4U (Thyroxine Uptake): A measure of how well the thyroid is functioning.
      • FTI (Free Thyroxine Index): An index used to assess thyroid function.
    3. Target Variable:

      • Target: Indicates the presence or absence of thyroid disease. This could be binary (e.g., 0 for no disease, 1 for disease) or categorical.
    4. Possible Features and Analysis:

      • Exploratory Data Analysis (EDA): Understand the distribution of features, check for missing values, outliers, and correlations.
      • Feature Encoding: Convert categorical features into numerical values if necessary for machine learning models.
      • Data Visualization: Create histograms, heatmaps, and scatter plots to visualize relationships between features.

    Example Dataset Information

    Column         Description
    age            Age of the patient
    lithium        Indicates if the patient is taking lithium (0 or 1)
    goitre         Presence of goitre (0 or 1)
    tumor          Presence of a thyroid tumor (0 or 1)
    hypopituitary  Indicates if there is hypopituitarism (0 or 1)
    psych          Psychological condition (0 or 1)
    TSH            Thyroid Stimulating Hormone level
    T3             Triiodothyronine level
    TT4            Total Thyroxine level
    T4U            Thyroxine Uptake
    FTI            Free Thyroxine Index
    target         Presence of thyroid disease (0 or 1)

    Usage

    • Predictive Modeling: Train models to predict thyroid conditions based on the features.
    • Feature Importance: Determine which features are most important for predicting the target variable.
    • Data Cleaning and Preparation: Handle missing values, encode categorical variables, and normalize data if required.

    Example Code to Display Dataset Info

    import pandas as pd
    
    # Load the dataset
    df = pd.read_csv('/kaggle/input/thyroid-gland-dataset/hypothyroid.csv')
    
    # Display dataset information
    print("Dataset Info:")
    print(df.info())
    
    # Display the first few rows
    print("\nFirst few rows of the dataset:")
    print(df.head())
    
    # Describe the dataset
    print("\nDataset Description:")
    print(df.describe())
    

    This code will give you an overview of the dataset, including data types, missing values, and basic statistics.
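
    Building on the usage notes above, the following sketch shows one way to handle missing values, scale the numeric features, and fit a baseline classifier with scikit-learn. It assumes the numeric columns listed in the table (age, TSH, T3, TT4, T4U, FTI) and a binary target column exist under those exact names in hypothyroid.csv, which may not match the actual file precisely.

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    df = pd.read_csv('/kaggle/input/thyroid-gland-dataset/hypothyroid.csv')

    # Assumed numeric features and binary target, per the column table above.
    features = ['age', 'TSH', 'T3', 'TT4', 'T4U', 'FTI']
    X = df[features].apply(pd.to_numeric, errors='coerce')
    y = df['target']

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y)

    # Impute missing values, standardize, and fit a baseline model in one pipeline.
    model = Pipeline([
        ('impute', SimpleImputer(strategy='median')),
        ('scale', StandardScaler()),
        ('clf', LogisticRegression(max_iter=1000)),
    ])
    model.fit(X_train, y_train)
    print("Held-out accuracy:", round(model.score(X_test, y_test), 3))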

  20. E

    Data from: Dataset of normalised Slovene text KonvNormSl 1.0

    • live.european-language-grid.eu
    binary format
    Updated Sep 18, 2016
    (2016). Dataset of normalised Slovene text KonvNormSl 1.0 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/8217
    Explore at:
    binary formatAvailable download formats
    Dataset updated
    Sep 18, 2016
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Data used in the experiments described in:

    Nikola LjubeÅ”ić, Katja Zupan, Darja FiÅ”er and Tomaž Erjavec: Normalising Slovene data: historical texts vs. user-generated content. Proceedings of KONVENS 2016, September 19–21, 2016, Bochum, Germany.

    https://www.linguistics.rub.de/konvens16/pub/19_konvensproc.pdf

    (https://www.linguistics.rub.de/konvens16/)

    Data are split into the "token" folder (experiment on normalising individual tokens) and "segment" folder (experiment on normalising whole segments of text, i.e. sentences or tweets). Each experiment folder contains the "train", "dev" and "test" subfolders. Each subfolder contains two files for each sample, the original data (.orig.txt) and the data with hand-normalised words (.norm.txt). The files are aligned by lines.

    There are four datasets:

    - goo300k-bohoric: historical Slovene, hard case (<1850)

    - goo300k-gaj: historical Slovene, easy case (1850 - 1900)

    - tweet-L3: Slovene tweets, hard case (non-standard language)

    - tweet-L1: Slovene tweets, easy case (mostly standard language)

    The goo300k data come from http://hdl.handle.net/11356/1025, while the tweet data originate from the JANES project (http://nl.ijs.si/janes/english/).

    The text in the files has been split by inserting spaces between characters, with an underscore (_) substituting the space character. Tokens not relevant for normalisation (e.g. URLs, hashtags) have been substituted by the inverted question mark character ('Āæ').
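
    Given that description of the file format, a few lines of Python are enough to convert between the character-spaced representation and plain text; the example sentence below is invented, and any 'Āæ' placeholder tokens are left untouched.

    def decode(line: str) -> str:
        """Turn 'd a n e s _ j e _ l e p _ d a n' back into 'danes je lep dan'."""
        return ''.join(' ' if c == '_' else c for c in line.strip().split(' '))

    def encode(text: str) -> str:
        """Inverse: one space between characters, '_' in place of spaces."""
        return ' '.join('_' if c == ' ' else c for c in text)

    spaced, plain = "d a n e s _ j e _ l e p _ d a n", "danes je lep dan"
    assert decode(spaced) == plain
    assert encode(plain) == spaced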
