This project is a data analysis package to analyze and normalize SHAPE data. Traditionally, SHAPE requires the addition of a 3' hairpin to the RNA for normalization. noRNAlize eliminates the need for this experimental step by performing a global analysis of the SHAPE data and establishing mean protection values. This is particularly important when SHAPE analysis is used to map crystal contacts in crystal structures, as illustrated here.
This project includes the following software/data packages:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Gene expression analysis is an essential part of biological and medical investigations. Quantitative real-time PCR (qPCR) is characterized by excellent sensitivity, dynamic range, and reproducibility, and is still regarded as the gold standard for quantifying transcript abundance. Parallelization of qPCR, such as on the microfluidic TaqMan Fluidigm Biomark platform, enables evaluation of multiple transcripts in samples treated under various conditions. Despite advanced technologies, correct evaluation of the measurements remains challenging. The most widely used methods for evaluating or calculating gene expression data, geNorm and ΔΔCt respectively, rely on one or several stable reference genes (RGs) for normalization, thus potentially causing biased results. We therefore applied multivariable regression with a tailored error model to overcome the necessity of stable RGs.
Results: We developed an RG-independent data normalization approach based on a tailored linear error model for parallel qPCR data, called LEMming. It uses the assumption that the mean Ct values within samples of similarly treated groups are equal. Performance of LEMming was evaluated in three data sets with different stability patterns of RGs and compared to the results of geNorm normalization. Data set 1 showed that both methods give similar results if stable RGs are available. Data set 2 included RGs which are stable according to geNorm criteria but became differentially expressed in normalized data evaluated by a t-test: geNorm-normalized data showed an effect of a shifted mean per gene per condition, whereas LEMming-normalized data did not. Comparing the decrease in standard deviation from raw data achieved by geNorm and by LEMming, the latter was superior. In data set 3, stable RGs were available according to geNorm's average expression stability and pairwise variation criteria, but t-tests of the raw data contradicted this. Normalization with these RGs produced distorted data contradicting the literature, while LEMming-normalized data did not.
Conclusions: If RGs are coexpressed but are not independent of the experimental conditions, the stability criteria based on inter- and intragroup variation fail. The linear error model developed, LEMming, overcomes the dependency on RGs for parallel qPCR measurements, besides resolving biases of both technical and biological nature in qPCR. However, to distinguish systematic errors per treated group from a global treatment effect, an additional measurement is needed; quantification of total cDNA content per sample helps to identify systematic errors.
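The core assumption can be sketched in a few lines. This is not the published LEMming implementation (which fits a multivariable regression with a tailored error model); it shows only the mean-equality idea the method rests on, with hypothetical toy Ct values:

```python
import numpy as np

# Sketch only: if mean Ct values are assumed equal across samples of
# similarly treated groups, each sample can be centered on its own mean
# instead of on reference genes.
def center_ct_per_sample(ct):
    """ct: (genes x samples) array of raw Ct values."""
    ct = np.asarray(ct, dtype=float)
    return ct - ct.mean(axis=0, keepdims=True)

# hypothetical toy data: 3 genes x 4 samples with sample-wise offsets
ct = np.array([[20.0, 21.0, 20.5, 21.5],
               [25.0, 26.0, 25.5, 26.5],
               [30.0, 31.0, 30.5, 31.5]])
normalized = center_ct_per_sample(ct)
print(normalized.mean(axis=0))  # every sample mean is now 0
```

After centering, the per-sample technical offsets (e.g., input amount) are removed without any reference-gene choice, which is the property the data sets above probe.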
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This is a dataset of spectrogram images created from the train_spectrograms parquet data from the Harvard Medical School Harmful Brain Activity Classification competition. The parquet files have been transformed with the following code, referencing the HMS-HBAC: KerasCV Starter Notebook:
# path, SPEC_DIR, and unique_df are defined elsewhere in the referenced notebook
import math
import numpy as np
import pandas as pd
from PIL import Image
from fastai.vision.core import PILImage  # PILImage comes from fastai

def process_spec(spec_id, split="train"):
    # read the data
    data = pd.read_parquet(path/f'{split}_spectrograms'/f'{spec_id}.parquet')
    # read the label
    label = unique_df[unique_df.spectrogram_id == spec_id]["target"].item()
    # replace NA with 0
    data = data.fillna(0)
    # convert DataFrame to array, dropping the time column
    data = data.values[:, 1:]
    # transpose
    data = data.T
    data = data.astype("float32")
    # clip data to avoid log(0)
    data = np.clip(data, math.exp(-4), math.exp(8))
    # take log data to magnify differences
    data = np.log(data)
    # standardize (epsilon belongs in the denominator to guard against a zero std)
    data = (data - data.mean()) / (data.std() + 1e-6)
    # convert to 3 channels
    data = np.tile(data[..., None], (1, 1, 3))
    # convert array to PILImage and save
    im = PILImage.create(Image.fromarray((data * 255).astype(np.uint8)))
    im.save(f"{SPEC_DIR}/{split}_spectrograms/{label}/{spec_id}.png")
One of the body fluids often used in metabolomics studies is urine. The peak intensities of metabolites in urine are affected by the urine history of an individual, resulting in dilution differences. This therefore requires normalization of the data to correct for such differences. Two normalization techniques are commonly applied to urine samples prior to further statistical analysis. First, AUC normalization standardizes the area under the curve (AUC) within a sample to the median, mean, or another suitable representation of the amount of dilution. The second approach uses specific end-product metabolites such as creatinine, and all intensities within a sample are expressed relative to the creatinine intensity. Another way of looking at urine metabolomics data is by realizing that the ratios between peak intensities are the information-carrying features. This opens up possibilities to use another class of data analysis techniques designed to deal with such ratios: compositional data analysis. In this approach, special transformations are defined to deal with the ratio problem. In essence, it comes down to using another distance measure than the Euclidean distance used in the conventional analysis of metabolomics data. We illustrate this type of approach in combination with three-way methods (i.e., PARAFAC) for cases where samples of some biological material are measured at multiple time points. The aim of the paper is to develop PARAFAC modeling of three-way metabolomics data in the context of compositional data and compare this with standard normalization techniques for the specific case of urine metabolomics data.
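A minimal sketch of the compositional idea, assuming the standard centered log-ratio (clr) transform (the paper's three-way PARAFAC modeling is beyond a few lines). Because clr works on ratios between peaks, a pure dilution factor cancels out:

```python
import numpy as np

def clr(intensities):
    """Centered log-ratio (clr) transform: subtract each sample's mean
    log-intensity, so only the ratios between peaks carry information."""
    x = np.asarray(intensities, dtype=float)
    logx = np.log(x)
    return logx - logx.mean(axis=-1, keepdims=True)

# a urine sample and the same sample diluted 2x give identical clr values
sample = np.array([4.0, 8.0, 2.0])
diluted = sample / 2.0
print(np.allclose(clr(sample), clr(diluted)))  # True
```

This dilution invariance is exactly what AUC or creatinine normalization try to approximate, but here it holds by construction.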
Synthetic Khmer OCR - Pre-processed Chunks
This dataset is a pre-processed and optimized version of the original "Synthetic Khmer OCR" dataset. All word images have been cropped, resized with padding to a uniform size, and stored in highly efficient PyTorch tensor chunks for extremely fast loading during model training.
This format is designed to completely eliminate the I/O bottleneck that comes from reading millions of individual small image files, allowing you to feed a powerful GPU without any waiting.
Why This Format?
Extreme Speed: Loading a single chunk of 100,000 images from one file is hundreds of times faster than loading 100,000 individual PNG files.
No More Pre-processing: All images are already cropped and resized. The data is ready for training right out of the box.
Memory Efficient: The dataset is split into manageable chunks, so you don't need to load all ~34GB of data into RAM at once.
Data Structure
The dataset is organized into two main folders: train and val.
/
├── train/
│   ├── train_chunk_0.pt
│   ├── train_chunk_1.pt
│   └── ... (and so on for all training chunks)
└── val/
    ├── val_chunk_0.pt
    ├── val_chunk_1.pt
    └── ... (and so on for all validation chunks)
Inside Each Chunk File (.pt)
Each .pt file is a standard PyTorch file containing a single Python dictionary with two keys: 'images' and 'labels'.
'images':
Type: torch.Tensor
Shape: (N, 3, 40, 64), where N is the number of samples in the chunk (typically 100,000).
Data Type (dtype): torch.uint8 (values from 0-255). This is done to save a massive amount of disk space. You will need to convert this to float and normalize it before feeding it to a model.
Description: This tensor contains N raw, uncompressed image pixels. Each image is a 3-channel (RGB) color image with a height of 40 pixels and a width of 64 pixels.
'labels':
Type: list of str
Length: N (matches the number of images in the tensor).
Description: This is a simple Python list of strings. The label at labels[i] corresponds to the image at images[i].
How to Use This Dataset in PyTorch
Here is a simple example of how to load a chunk and access the data.
```
import torch
from torchvision import transforms

# --- 1. Load a single chunk file ---
chunk_path = 'train/train_chunk_0.pt'
data_chunk = torch.load(chunk_path)

image_tensor_chunk = data_chunk['images']
labels_list = data_chunk['labels']

print(f"Loaded chunk: {chunk_path}")
print(f"Image tensor shape: {image_tensor_chunk.shape}")
print(f"Number of labels: {len(labels_list)}")

# --- 2. Get a single sample (e.g., the 42nd item in this chunk) ---
index = 42
image_uint8 = image_tensor_chunk[index]
label = labels_list[index]

print(f"\n--- Sample at index {index} ---")
print(f"Label: {label}")
print(f"Image tensor shape (as saved): {image_uint8.shape}")
print(f"Image data type (as saved): {image_uint8.dtype}")

# --- 3. Prepare the image for a model ---
# Convert the uint8 tensor (0-255) to a float tensor (0.0-1.0) ...
image_float = image_uint8.float() / 255.0

# ... then normalize it (must match the normalization used in training)
normalize_transform = transforms.Normalize(
    mean=[0.5, 0.5, 0.5],
    std=[0.5, 0.5, 0.5]
)
normalized_image = normalize_transform(image_float)

print(f"\nImage tensor shape (normalized): {normalized_image.shape}")
print(f"Image data type (normalized): {normalized_image.dtype}")
print(f"Min value: {normalized_image.min():.2f}, Max value: {normalized_image.max():.2f}")

# --- (Optional) 4. View the sample ---
# Use the float tensor from before normalization to see the original colors.
image_to_view = transforms.ToPILImage()(image_float)
# image_to_view.show()
# image_to_view.save('sample_image.png')

print("\nSuccessfully prepared a sample for model input and viewing!")
```
Analysis of bulk RNA sequencing (RNA-Seq) data is a valuable tool to understand transcription at the genome scale. Targeted sequencing of RNA has emerged as a practical means of assessing the majority of the transcriptomic space with less reliance on large resources for consumables and bioinformatics. TempO-Seq is a templated, multiplexed RNA-Seq platform that interrogates a panel of sentinel genes representative of genome-wide transcription. Nuances of the technology require proper preprocessing of the data. Various methods have been proposed and compared for normalizing bulk RNA-Seq data, but there has been little to no investigation of how these methods perform on TempO-Seq data. We simulated count data into two groups (treated vs. untreated) at seven fold-change (FC) levels (including no change) using control samples from human HepaRG cells run on TempO-Seq, and normalized the data using seven normalization methods. Upper Quartile (UQ) performed best with regard to maintaining FC levels as detected by a limma contrast between the treated and untreated groups. For all FC levels, specificity of the UQ normalization was greater than 0.84, and sensitivity was greater than 0.90 except at the no-change and +1.5 levels. Furthermore, K-means clustering of the simulated genes normalized by UQ agreed the most with the FC assignments [adjusted Rand index (ARI) = 0.67]. Despite assuming that the majority of genes are unchanged, the DESeq2 scaling-factor normalization method performed reasonably well, as did the simple normalization procedures counts per million (CPM) and total counts (TC). These results suggest that for two-class comparisons of TempO-Seq data, UQ, CPM, TC, or DESeq2 normalization should provide reasonably reliable results at absolute FC levels ≥2.0. These findings will help guide researchers in normalizing TempO-Seq gene expression data for more reliable results.
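As an illustrative sketch of what upper-quartile normalization does (not the exact implementation used in the study; the edgeR/limma versions differ in details such as how the scale factors are centered):

```python
import numpy as np

def upper_quartile_normalize(counts):
    """counts: (genes x samples) matrix of raw counts.
    Scale each sample by the 75th percentile of its nonzero counts,
    centering the factors on their geometric mean so the overall
    count scale is preserved."""
    counts = np.asarray(counts, dtype=float)
    uq = np.array([np.percentile(col[col > 0], 75) for col in counts.T])
    factors = uq / np.exp(np.mean(np.log(uq)))  # geometric-mean centering
    return counts / factors

# toy check: a sample sequenced at 2x depth normalizes back to its twin
base = np.array([10.0, 20.0, 30.0, 40.0, 0.0])
counts = np.column_stack([base, 2 * base])
norm = upper_quartile_normalize(counts)
print(np.allclose(norm[:, 0], norm[:, 1]))  # True
```

Like CPM and TC, this is a global scaling method; it differs only in which per-sample summary statistic defines the scale factor.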
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Normalization
# Generate a resting state (rs) timeseries (ts)
# Install / load package to make fake fMRI ts
# install.packages("neuRosim")
library(neuRosim)
# Generate a ts
ts.rs <- simTSrestingstate(nscan=2000, TR=1, SNR=1)
# 3dDetrend -normalize
# R command version for 3dDetrend -normalize -polort 0 which normalizes by making "the sum-of-squares equal to 1"
# Do for the full timeseries
ts.normalised.long <- (ts.rs-mean(ts.rs))/sqrt(sum((ts.rs-mean(ts.rs))^2));
# Do this again for a shorter version of the same timeseries
ts.shorter.length <- length(ts.normalised.long)/4
ts.normalised.short <- (ts.rs[1:ts.shorter.length]- mean(ts.rs[1:ts.shorter.length]))/sqrt(sum((ts.rs[1:ts.shorter.length]- mean(ts.rs[1:ts.shorter.length]))^2));
# By looking at the summaries, it can be seen that the values become larger
# for the shorter timeseries (the same sum-of-squares is spread over fewer points)
summary(ts.normalised.long)
summary(ts.normalised.short)
# Plot results for the long and short ts
# Truncate the longer ts for plotting only
ts.normalised.long.made.shorter <- ts.normalised.long[1:ts.shorter.length]
# Give the plot a title
title <- "3dDetrend -normalize for long (blue) and short (red) timeseries";
plot(x=0, y=0, main=title, xlab="", ylab="", xaxs='i', xlim=c(1,length(ts.normalised.short)), ylim=c(min(ts.normalised.short),max(ts.normalised.short)));
# Add zero line
lines(x=c(-1,ts.shorter.length), y=rep(0,2), col='grey');
# 3dDetrend -normalize -polort 0 for long timeseries
lines(ts.normalised.long.made.shorter, col='blue');
# 3dDetrend -normalize -polort 0 for short timeseries
lines(ts.normalised.short, col='red');
Standardization/modernization
New afni_proc.py command line
afni_proc.py \
-subj_id "$sub_id_name_1" \
-blocks despike tshift align tlrc volreg mask blur scale regress \
-radial_correlate_blocks tcat volreg \
-copy_anat anatomical_warped/anatSS.1.nii.gz \
-anat_has_skull no \
-anat_follower anat_w_skull anat anatomical_warped/anatU.1.nii.gz \
-anat_follower_ROI aaseg anat freesurfer/SUMA/aparc.a2009s+aseg.nii.gz \
-anat_follower_ROI aeseg epi freesurfer/SUMA/aparc.a2009s+aseg.nii.gz \
-anat_follower_ROI fsvent epi freesurfer/SUMA/fs_ap_latvent.nii.gz \
-anat_follower_ROI fswm epi freesurfer/SUMA/fs_ap_wm.nii.gz \
-anat_follower_ROI fsgm epi freesurfer/SUMA/fs_ap_gm.nii.gz \
-anat_follower_erode fsvent fswm \
-dsets media_?.nii.gz \
-tcat_remove_first_trs 8 \
-tshift_opts_ts -tpattern alt+z2 \
-align_opts_aea -cost lpc+ZZ -giant_move -check_flip \
-tlrc_base "$basedset" \
-tlrc_NL_warp \
-tlrc_NL_warped_dsets \
anatomical_warped/anatQQ.1.nii.gz \
anatomical_warped/anatQQ.1.aff12.1D \
anatomical_warped/anatQQ.1_WARP.nii.gz \
-volreg_align_to MIN_OUTLIER \
-volreg_post_vr_allin yes \
-volreg_pvra_base_index MIN_OUTLIER \
-volreg_align_e2a \
-volreg_tlrc_warp \
-mask_opts_automask -clfrac 0.10 \
-mask_epi_anat yes \
-blur_to_fwhm -blur_size $blur \
-regress_motion_per_run \
-regress_ROI_PC fsvent 3 \
-regress_ROI_PC_per_run fsvent \
-regress_make_corr_vols aeseg fsvent \
-regress_anaticor_fast \
-regress_anaticor_label fswm \
-regress_censor_motion 0.3 \
-regress_censor_outliers 0.1 \
-regress_apply_mot_types demean deriv \
-regress_est_blur_epits \
-regress_est_blur_errts \
-regress_run_clustsim no \
-regress_polort 2 \
-regress_bandpass 0.01 1 \
-html_review_style pythonic
We used similar command lines to generate the ‘blurred and not censored’ and the ‘not blurred and not censored’ timeseries files (described more fully below). We will make the code used to create all derivative files available on our GitHub site (https://github.com/lab-lab/nndb). We made one choice above that is different enough from our original pipeline that it is worth mentioning here. Specifically, we have quite long runs, averaging ~40 minutes but variable in length (thus leading to the above issue with 3dDetrend's -normalize). A discussion on the AFNI message board with one of our team (starting here: https://afni.nimh.nih.gov/afni/community/board/read.php?1,165243,165256#msg-165256) led to the suggestion that '-regress_polort 2' with '-regress_bandpass 0.01 1' be used for long runs. We had previously used only a variable polort with the suggested 1 + int(D/150) approach. Our new polort 2 + bandpass approach has the added benefit of working well with afni_proc.py.
Which timeseries file you use is up to you, but I have been encouraged by Rick and Paul to include a sort of PSA about this. In Paul's own words:
* Blurred data should not be used for ROI-based analyses (and potentially not for ICA? I am not certain about standard practice).
* Unblurred data for ISC might be pretty noisy for voxelwise analyses, since blurring should effectively boost the SNR of active regions (and even good alignment won't be perfect everywhere).
* For uncensored data, one should be concerned about motion effects being left in the data (e.g., spikes in the data).
* For censored data:
  * Performing ISC requires the users to unionize the censoring patterns during the correlation calculation.
  * If wanting to calculate power spectra or spectral parameters like ALFF/fALFF/RSFA etc. (which some people might do for naturalistic tasks still), then standard FT-based methods can't be used because sampling is no longer uniform. Instead, people could use something like 3dLombScargle+3dAmpToRSFC, which calculates power spectra (and RSFC params) based on a generalization of the FT that can handle non-uniform sampling, as long as the censoring pattern is mostly random and, say, only up to about 10-15% of the data.
In sum, think very carefully about which files you use. If you find you need a file we have not provided, we can happily generate different versions of the timeseries upon request and can generally do so in a week or less.
Effect on results
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
README — Code and data
Project: LOCALISED
Work Package 7, Task 7.1
Paper: A Systemic Framework for Assessing the Risk of Decarbonization to Regional Manufacturing Activities in the European Union
What this repo does
-------------------
Builds the Transition‑Risk Index (TRI) for EU manufacturing at NUTS‑2 × NACE Rev.2, and reproduces the article’s Figures 3–6:
• Exposure (emissions by region/sector)
• Vulnerability (composite index)
• Risk = Exposure ⊗ Vulnerability
Outputs include intermediate tables, the final analysis dataset, and publication figures.
Folder of interest
------------------
Code and data/
├─ Code/ # R scripts (run in order 1A → 5)
│ └─ Create Initial Data/ # scripts to (re)build Initial data/ from Eurostat API with imputation
├─ Initial data/ # Eurostat inputs imputed for missing values
├─ Derived data/ # intermediates
├─ Final data/ # final analysis-ready tables
└─ Figures/ # exported figures
Quick start
-----------
1) Open R (or RStudio) and set the working directory to “Code and data/Code”.
Example: setwd(".../Code and data/Code")
2) Initial data/ contains the required Eurostat inputs referenced by the scripts.
To reproduce the inputs in Initial data/, run the scripts in Code/Create Initial Data/.
These scripts download the required datasets from the respective API and impute missing values; outputs are written to ../Initial data/.
3) Run scripts sequentially (they use relative paths to ../Raw data, ../Derived data, etc.):
1A-non-sector-data.R → 1B-sector-data.R → 1C-all-data.R → 2-reshape-data.R → 3-normalize-data-by-n-enterpr.R → 4-risk-aggregation.R → 5A-results-maps.R, 5B-results-radar.R
What each script does
---------------------
Create Initial Data — Recreate inputs
• Download source tables from the Eurostat API or the Localised DSP, apply light cleaning, and impute missing values.
• Write the resulting inputs to Initial data/ for the analysis pipeline.
1A / 1B / 1C — Build the unified base
• Read individual Eurostat datasets (some sectoral, some only regional).
• Harmonize, aggregate, and align them into a single analysis-ready schema.
• Write aggregated outputs to Derived data/ (and/or Final data/ as needed).
2 — Reshape and enrich
• Reshapes the combined data and adds metadata.
• Output: Derived data/2_All_data_long_READY.xlsx (all raw indicators in tidy long format, with indicator names and values).
3 — Normalize (enterprises & min–max)
• Divide selected indicators by number of enterprises.
• Apply min–max normalization to [0.01, 0.99].
• Exposure keeps real zeros (zeros remain zero).
• Write normalized tables to Derived data/ or Final data/.
4 — Aggregate indices
• Vulnerability: build dimension scores (Energy, Labour, Finance, Supply Chain, Technology).
– Within each dimension: equal‑weight mean of directionally aligned, [0.01,0.99]‑scaled indicators.
– Dimension scores are re‑scaled to [0.01,0.99].
• Aggregate Vulnerability: equal‑weight mean of the five dimensions.
• TRI (Risk): combine Exposure (E) and Vulnerability (V) via a weighted geometric rule with α = 0.5 in the baseline.
– Policy‑intuitive properties: high E & high V → high risk; imbalances penalized (non‑compensatory).
• Output: Final data/ (main analysis tables).
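The scaling and aggregation in steps 3–4 can be sketched as follows. This is a hedged illustration in Python rather than the repo's R, and it assumes the weighted geometric rule is Risk = E^α · V^(1−α) with α = 0.5, consistent with the description above:

```python
import numpy as np

def minmax_01_99(x):
    """Min-max scale an indicator to [0.01, 0.99] (step 3)."""
    x = np.asarray(x, dtype=float)
    return 0.01 + 0.98 * (x - x.min()) / (x.max() - x.min())

def tri(exposure, vulnerability, alpha=0.5):
    """Weighted geometric aggregation (step 4, assumed form):
    Risk = E**alpha * V**(1 - alpha).  Multiplicative, so imbalances
    are penalized (non-compensatory): a region needs BOTH high
    exposure and high vulnerability to score high risk."""
    return exposure ** alpha * vulnerability ** (1 - alpha)

scores = minmax_01_99(np.array([3.0, 7.0, 11.0]))
print(scores)          # [0.01, 0.5, 0.99]
print(tri(0.25, 1.0))  # 0.5: high V alone does not fully compensate low E
```

With an arithmetic mean, tri(0.25, 1.0) would be 0.625; the geometric rule pulls it down to 0.5, which is the non-compensatory property named above.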
5A / 5B — Visualize results
• 5A: maps and distribution plots for Exposure, Vulnerability, and Risk → Figures 3 & 4.
• 5B: comparative/radar profiles for selected countries/regions/subsectors → Figures 5 & 6.
• Outputs saved to Figures/.
Data flow (at a glance)
-----------------------
Initial data → (1A–1C) Aggregated base → (2) Tidy long file → (3) Normalized indicators → (4) Composite indices → (5) Figures
  • (1A–1C) write the aggregated base to Derived data/
  • (2) writes Derived data/2_All_data_long_READY.xlsx
  • (4)–(5) write to Final data/ and Figures/
Assumptions & conventions
-------------------------
• Geography: EU NUTS‑2 regions; Sector: NACE Rev.2 manufacturing subsectors.
• Equal weights by default where no evidence supports alternatives.
• All indicators directionally aligned so that higher = greater transition difficulty.
• Relative paths assume working directory = Code/.
Reproducing the article
-----------------------
• Optionally run the codes from the Code/Create Initial Data subfolder
• Run 1A → 5B without interruption to regenerate:
– Figure 3: Exposure, Vulnerability, Risk maps (total manufacturing).
– Figure 4: Vulnerability dimensions (Energy, Labour, Finance, Supply Chain, Technology).
– Figure 5: Drivers of risk—highest vs. lowest risk regions (example: Germany & Greece).
– Figure 6: Subsector case (e.g., basic metals) by selected regions.
• Final tables for the paper live in Final data/. Figures export to Figures/.
Requirements
------------
• R (version per your environment).
• Install any missing packages listed at the top of each script (e.g., install.packages("...")).
Troubleshooting
---------------
• “File not found”: check that the previous script finished and wrote its outputs to the expected folder.
• Paths: confirm getwd() ends with /Code so relative paths resolve to ../Raw data, ../Derived data, etc.
• Reruns: optionally clear Derived data/, Final data/, and Figures/ before a clean rebuild.
Provenance & citation
---------------------
• Inputs: Eurostat and related sources cited in the paper and headers of the scripts.
• Methods: OECD composite‑indicator guidance; IPCC AR6 risk framing (see paper references).
• If you use this code, please cite the article:
A Systemic Framework for Assessing the Risk of Decarbonization to Regional Manufacturing Activities in the European Union.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Experimental data can broadly be divided in discrete or continuous data. Continuous data are obtained from measurements that are performed as a function of another quantitative variable, e.g., time, length, concentration, or wavelength. The results from these types of experiments are often used to generate plots that visualize the measured variable on a continuous, quantitative scale. To simplify state-of-the-art data visualization and annotation of data from such experiments, an open-source tool was created with R/shiny that does not require coding skills to operate it. The freely available web app accepts wide (spreadsheet) and tidy data and offers a range of options to normalize the data. The data from individual objects can be shown in 3 different ways: (1) lines with unique colors, (2) small multiples, and (3) heatmap-style display. Next to this, the mean can be displayed with a 95% confidence interval for the visual comparison of different conditions. Several color-blind-friendly palettes are available to label the data and/or statistics. The plots can be annotated with graphical features and/or text to indicate any perturbations that are relevant. All user-defined settings can be stored for reproducibility of the data visualization. The app is dubbed PlotTwist and runs locally or online: https://huygens.science.uva.nl/PlotTwist
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Explore how public SaaS valuations now match private markets, with forward revenue multiples hitting 9x. Key analysis of ARR metrics and growth trends in 2018.
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
Normalized NanoString transcriptomic data (log2 fold-changes and Benjamini–Hochberg adjusted p-values) were exported as an .xlsx file. Groups include vehicle control (N=3), treated fatal (N=2), and treated survivor subjects administered ODV for 5 (N=3) or 10 days (N=5), compared against a pre-challenge baseline (0 DPI) at each collection timepoint. Any differentially expressed transcripts with a Benjamini–Hochberg false discovery rate (FDR) corrected p-value less than 0.05 were deemed significant. ODV, obeldesivir; DPI, days post infection.
Methods: NHPV2_Immunology reporter and capture probe sets (NanoString Technologies) were hybridized with 3 µL of each RNA sample for ~24 hours at 65°C. The RNA:probe set complexes were subsequently loaded onto an nCounter microfluidics cartridge and assayed on a NanoString nCounter SPRINT Profiler. Samples with an image binding density greater than 2.0 were re-analyzed with 1 µL of RNA to meet quality control criteria. Briefly, nCounter .RCC files were imported into NanoString nSolver 4.0 software. To compensate for varying RNA inputs and reaction efficiencies, an array of 10 housekeeping genes and spiked-in positive and negative controls were used to normalize the raw read counts. The array and number of housekeeping mRNAs are selected by default within the NanoString nSolver Advanced Analysis module. As both sample input and reaction efficiency are expected to affect all probes uniformly, normalization for run-to-run and sample-to-sample variability is performed by dividing counts within a lane by the geometric mean of the reference/normalizer probes from the same lane (i.e., all probes/count levels within a lane are adjusted by the same factor). The ideal normalization genes are determined automatically by selecting those that minimize the pairwise variation statistic, using the widely used geNorm algorithm as implemented in the Bioconductor package NormqPCR.
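The lane-normalization rule described above is simple to sketch. This is an illustration of the stated rule, not NanoString's nSolver code; the probe layout and counts are hypothetical:

```python
import numpy as np

def normalize_lanes(counts, hk_rows):
    """counts: (probes x lanes) raw read counts; hk_rows: row indices of
    the housekeeping/normalizer probes.  Every count in a lane is divided
    by the geometric mean of that lane's housekeeping probes, i.e. one
    factor rescales the whole lane, as described above."""
    counts = np.asarray(counts, dtype=float)
    hk = counts[hk_rows, :]
    lane_factor = np.exp(np.log(hk).mean(axis=0))  # geometric mean per lane
    return counts / lane_factor

# hypothetical toy run: lane 2 received exactly twice the input of lane 1
counts = np.array([[100.0, 200.0],   # housekeeping probe 1
                   [400.0, 800.0],   # housekeeping probe 2
                   [ 50.0, 100.0]])  # target probe
norm = normalize_lanes(counts, hk_rows=[0, 1])
print(np.allclose(norm[:, 0], norm[:, 1]))  # True: lanes now comparable
```

Because one factor scales the whole lane, relative differences between probes within a lane are untouched; only between-lane input and efficiency differences are removed.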
The data were analyzed with the NanoString nSolver Advanced Analysis 2.0 package for differential expression. Normalized data (log2 fold-changes and Benjamini–Hochberg adjusted p-values) were exported as an .xlsx file (Data S1). Groups include vehicle control (N=3), treated fatal (N=2), and treated survivor subjects administered ODV for 5 (N=3) or 10 days (N=5), compared against a pre-challenge baseline (0 DPI) at each collection timepoint. Any differentially expressed transcripts with a Benjamini–Hochberg false discovery rate (FDR) corrected p-value less than 0.05 were deemed significant. Human annotations were added for each respective mRNA to perform immune cell profiling within nSolver (Data S2). For the heatmaps, groups of vehicle control (N=3), treated fatal (N=2), and treated survivor subjects administered ODV for 5 (N=3) or 10 days (N=5) were compared against their pre-challenge baseline (0 DPI) at each collection timepoint. For enrichment analysis, differentially expressed transcripts and adjusted p-values from the Data S1 file were imported into Ingenuity Pathway Analysis (IPA; Qiagen) for canonical pathway, upstream, disease and function, and tox function analyses with respect to a pre-challenge baseline (Data S3). The topmost significant pathways based on z-scores were imported into GraphPad Prism version 10.0.1 to produce heatmaps.
A novel brain tumor dataset containing 4500 2D MRI-CT slices. The original MRI and CT scans are also contained in this dataset.
Pre-processing strategy: The pre-processing pipeline includes pairing MRI and CT scans according to a specific time interval between the CT and MRI scans of the same patient, MRI registration to a standard template, MRI-CT registration, intensity normalization, and extraction of 2D slices from the 3D volumes. The pipeline converts 3D DICOM-format MRI and CT scans into classic 2D MRI-CT image pairs that can be used directly as training data for end-to-end synthetic-CT deep learning networks.
Details:
Pairing MRI and CT scans: If the time interval between the MRI and CT scans is too long, the information in the two images will no longer match. We therefore pair MRI and CT scans of the same patient only if the interval between them does not exceed half a year.
MRI image registration: Because brains differ between individuals and image coordinates differ between scanning sessions, all CT and MRI images must be registered to a standard template to remove individual differences and unify the coordinate system; registration makes the generated images more accurate. The template, proposed by the Montreal Neurological Institute, is the MNI ICBM 152 non-linear 6th Generation Symmetric Average Brain Stereotaxic Registration Model (MNI152) (Grabner et al., 2006). Affine registration is first applied to register the MRI scans to the MNI152 template.
Intensity normalization: The registered scans contain extreme values that would reduce generation accuracy. We normalize the image data and eliminate these extremes by clipping: pixel values in the top 1% and bottom 1% are replaced with the 99th and 1st percentile values, respectively.
Extracting 2D slices from 3D volumes: After registration, the 3D MRI and CT scans are represented as 237×197×189 matrices.
To ensure the compatibility between training models and inputs, each 3D image is sliced, and 4500 2D MRI-CT image pairs are selected as the final training data.
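The intensity-normalization step described above (percentile clipping followed by rescaling) can be sketched in NumPy; `clip_and_normalize` is a hypothetical name, not part of the released pipeline:

```python
import numpy as np

def clip_and_normalize(volume, lower_pct=1.0, upper_pct=99.0):
    """Clip intensities to the [1st, 99th] percentile range and rescale to [0, 1].

    Mirrors the described step: pixels in the top/bottom 1% are replaced by
    the 99th/1st percentile values before normalization.
    """
    lo, hi = np.percentile(volume, [lower_pct, upper_pct])
    clipped = np.clip(volume, lo, hi)          # extreme values collapse to lo/hi
    return (clipped - lo) / (hi - lo)          # min-max rescale to [0, 1]
```

Applied to a registered 237×197×189 volume, this removes the extreme values that would otherwise dominate the dynamic range seen by the network.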
Source database: 1. https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=33948305 2. https://wiki.cancerimagingarchive.net/display/Public/CPTAC-GBM 3. https://wiki.cancerimagingarchive.net/display/Public/TCGA-GBM
Patient information: Number of patients: 41
Introduction of each file: Dicom: contains the source files collected from the three websites above. data(processed): contains the processed data saved as .npy files. You can use train_input.npy and train_output.npy as the input and output of the encoder-decoder model for training; the test and val input/output files can be used as the test and validation datasets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
RNA half-life estimates from uniformly reprocessed and reanalyzed, published, high-quality nucleotide-recoding RNA-seq (NR-seq; namely SLAM-seq and TimeLapse-seq) datasets. Twelve human cell lines are represented. The data can be browsed at this website.
Analysis notes:
Relevant data provided in this repository are as follows:
Datasets included:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
With the improvement of -omics and next-generation sequencing (NGS) methodologies, along with the lowered cost of generating these types of data, the analysis of high-throughput biological data has become standard both for forming and testing biomedical hypotheses. Our knowledge of how to normalize datasets to remove latent undesirable variance has grown extensively, making for standardized data that are easily compared between studies. Here we present the CAncer bioMarker Prediction Pipeline (CAMPP), an open-source R-based wrapper (https://github.com/ELELAB/CAncer-bioMarker-Prediction-Pipeline-CAMPP) intended to aid bioinformatic software users with data analyses. CAMPP is called from a terminal command line and is supported by a user-friendly manual. The pipeline may be run on a local computer and requires little or no knowledge of programming. To avoid issues related to R-package updates, a renv.lock file is provided to ensure R-package stability. Data management includes missing-value imputation, data normalization, and distributional checks. CAMPP performs (I) k-means clustering, (II) differential expression/abundance analysis, (III) elastic-net regression, (IV) correlation and co-expression network analyses, (V) survival analysis, and (VI) protein-protein/miRNA-gene interaction networks. The pipeline returns tabular files and graphical representations of the results. We hope that CAMPP will assist in streamlining bioinformatic analysis of quantitative biological data, whilst ensuring an appropriate biostatistical framework.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The "Zurich Summer v1.0" dataset is a collection of 20 chips (crops) taken from a QuickBird acquisition of the city of Zurich (Switzerland) in August 2002. QuickBird images are composed of 4 channels (NIR-R-G-B) and were pansharpened to the PAN resolution of about 0.62 m GSD. We manually annotated 8 different urban and periurban classes: Roads, Buildings, Trees, Grass, Bare Soil, Water, Railways and Swimming pools. The cumulative number of class samples is highly unbalanced, reflecting real-world situations. Note that the annotations are not perfect and not ultra-dense (not every pixel is annotated), and there might be some errors as well. We performed the annotations by jointly selecting superpixels (SLIC) and drawing (freehand) over regions to which we could confidently assign an object class.
The dataset is composed of 20 image-ground truth pairs in GeoTIFF format. Images are distributed as raw DN values. We provide a rough-and-dirty MATLAB script (preprocess.m) to:
i) extract basic statistics from the images (min, max, mean and average std), which should be used to globally normalize the data (note that the class distribution of the chips is highly uneven, so single-frame normalization would shift the class distributions).
ii) Visualize raw DN images (with unsaturated values) and a corresponding stretched version (good for illustration purposes). It also saves a raw and adjusted image version in MATLAB format (.mat) in a local subfolder.
iii) Convert RGB annotations to index mask (CLASS \in {1,...,C}) (via rgb2label.m provided).
iv) Convert index mask to georeferenced RGB annotations (via rgb2label.m provided). Useful if you want to see the final maps of the tiles in some GIS software (coordinate system copied from original geotiffs).
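Step (i) can also be sketched in Python; this is a minimal analogue of that part of preprocess.m under the assumption of simple global min-max scaling, with hypothetical function names:

```python
import numpy as np

def global_minmax_stats(chips):
    """Pool statistics over all chips so normalization is global, not per-frame."""
    mins = [c.min() for c in chips]
    maxs = [c.max() for c in chips]
    means = [c.mean() for c in chips]
    return min(mins), max(maxs), float(np.mean(means))

def global_normalize(chip, gmin, gmax):
    """Map raw DN values to [0, 1] using the dataset-wide min/max."""
    return (chip - gmin) / (gmax - gmin)
```

Using dataset-wide statistics keeps the relative radiometry of the chips intact; per-chip normalization would shift the class distributions exactly as the note in step (i) warns.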
Some requests from you
We encourage researchers to report the IDs of the images used for training / validation / test (e.g. train: zh1 to zh7, validation: zh8 to zh12, test: zh13 to zh20). The purpose of distributing datasets is to encourage reproducibility of experiments.
Acknowledgements
We release this data after a kind agreement with DigitalGlobe, co. The data can be redistributed freely, provided that this document and the corresponding license are part of the distribution. Since the dataset could be updated over time, I suggest distributing it via the official link from which this archive was downloaded.
We would like to thank (a lot) Nathan Longbotham @ DigitalGlobe and the whole DG team for their help in granting the distribution of the dataset.
We release this dataset hoping that it will help researchers working on semantic classification / segmentation of remote sensing data, both for comparing against other state-of-the-art methods that use this dataset and for testing models on a larger and more complete set of images (with respect to most benchmarks available in our community). As you can imagine, preparing everything was tedious work. Just for you.
If you are using the data please cite the following work
Volpi, M. & Ferrari, V.; Semantic segmentation of urban scenes by learning local class interactions, In IEEE CVPR 2015 Workshop "Looking from above: when Earth observation meets vision" (EARTHVISION), Boston, USA, 2015.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Verbal and Quantitative Reasoning GRE scores and percentiles were collected by querying the student database for the appropriate information. Any student records that were missing data such as GRE scores or grade point average were removed from the study before the data were analyzed. The GRE scores of entering doctoral students from 2007-2012 were collected and analyzed. A total of 528 student records were reviewed. Ninety-six records were removed from the data because of a lack of GRE scores. Thirty-nine of these records belonged to MD/PhD applicants who were not required to take the GRE to be reviewed for admission. Fifty-seven more records were removed because they did not have an admissions committee score in the database. After 2011, the GRE's scoring system was changed from a scale of 200-800 points per section to 130-170 points per section. As a result, 12 more records were removed because their scores reflected the new scoring system and could not be compared to the older scores on the basis of raw score. After removal of these 108 records from our analyses, a total of 420 student records remained, which included students who were currently enrolled, left the doctoral program without a degree, or left the doctoral program with an MS degree. To maintain consistency among the participants, we removed 100 additional records so that our analyses only considered students who had graduated with a doctoral degree. In addition, thirty-nine admissions scores were identified as outliers by statistical analysis software and removed, for a final data set of 286 (see Outliers below). Outliers: We used the automated ROUT method included in the PRISM software to test the data for the presence of outliers which could skew our data. The false discovery rate for outlier detection (Q) was set to 1%. After removing the 96 students without a GRE score, 432 students were reviewed for the presence of outliers.
ROUT detected 39 outliers that were removed before statistical analysis was performed. Sample: See the detailed description in the Participants section. Linear regression analysis was used to examine potential trends between GRE scores, GRE percentiles, normalized admissions scores, or GPA and outcomes between selected student groups. The D'Agostino & Pearson omnibus and Shapiro-Wilk normality tests were used to test for normality of outcomes in the sample. The Pearson correlation coefficient was calculated to determine the relationship between GRE scores, GRE percentiles, admissions scores, or GPA (undergraduate and graduate) and time to degree. Candidacy exam results were divided into students who either passed or failed the exam. A Mann-Whitney test was then used to test for statistically significant differences in mean GRE scores, percentiles, and undergraduate GPA between candidacy exam results. Other variables were also observed, such as gender, race, ethnicity, and citizenship status within the samples. Predictive Metrics: The input variables used in this study were GPA and the scores and percentiles of applicants on both the Quantitative and Verbal Reasoning GRE sections. GRE percentiles were examined alongside scores to normalize variance that could occur between tests. Performance Metrics: The output variables used in the statistical analyses of each data set were either the amount of time it took for each student to earn their doctoral degree, or the student's candidacy examination result.
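The pass/fail comparison described above can be sketched with SciPy's Mann-Whitney test; the scores below are made-up illustrative values, not the study's data:

```python
from scipy.stats import mannwhitneyu

# Hypothetical quantitative GRE scores for students who passed / failed candidacy
passed = [720, 680, 700, 750, 690, 710]
failed = [640, 660, 620, 650, 630]

# Two-sided test of whether the two groups' score distributions differ
stat, p = mannwhitneyu(passed, failed, alternative="two-sided")
print(f"U = {stat}, p = {p:.4f}")
```

With these illustrative values the groups do not overlap, so U equals its maximum (n1 x n2 = 30) and the test rejects equality of distributions.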
Text (SOS Messages)
Clean, tokenize, embed using BERT/DistilBERT.
Output: vector embedding (e.g., [768]).
IoT Sensor Data (Time Series)
Normalize values (0–1).
Extract features (anomaly score, mean, variance, FFT features).
Output: vector embedding (e.g., [64]).
Images (Aerial/Drone)
Use CNN (YOLOv8, ResNet, EfficientNet) for feature extraction.
Output: feature map → vector (e.g., [1024]).
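The IoT branch above can be sketched in NumPy under assumed feature choices (min-max normalization, peak z-score as the anomaly score, leading FFT magnitudes); `iot_features` is a hypothetical name:

```python
import numpy as np

def iot_features(series, n_fft=4):
    """Turn a raw sensor time series into a fixed-length feature vector:
    normalize to [0, 1], then compute mean, variance, a simple z-score
    based anomaly score, and the leading FFT magnitudes."""
    s = np.asarray(series, dtype=float)
    s = (s - s.min()) / (s.max() - s.min())                 # normalize values (0-1)
    mean, var = s.mean(), s.var()
    anomaly = float(np.abs(s - mean).max() / (s.std() + 1e-9))  # peak z-score
    fft_mag = np.abs(np.fft.rfft(s))[:n_fft]                # leading spectral components
    return np.concatenate([[mean, var, anomaly], fft_mag])
```

The result is a small fixed-length vector (here 3 + n_fft values) that can be concatenated with the text and image embeddings downstream.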
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract
Introduction: Long-term electrocardiogram (ECG) recordings are widely employed to assist the diagnosis of cardiac and sleep disorders. However, variability of the ECG amplitude during the recordings hampers the detection of QRS complexes by algorithms. This work presents a simple electronic circuit to automatically normalize the ECG amplitude, improving its sampling by analog-to-digital converters (ADCs). Methods: The proposed circuit consists of an analog divider that normalizes the ECG amplitude using its absolute peak value as reference. The reference value is obtained by means of a full-wave rectifier and a peak voltage detector. The circuit and the tasks of its different stages are described. Results: An example of the circuit's performance for a bradycardia ECG signal (40 bpm) is presented; the signal has its amplitude suddenly halved and later restored. The signal is automatically normalized within 5 heart beats of the amplitude drop; for the amplitude increase, the signal is normalized promptly. Conclusion: The proposed circuit adjusts the ECG amplitude to the input voltage range of the ADC, avoiding signal-to-noise ratio degradation of the sampled waveform and allowing better performance of processing algorithms.
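A software emulation (not the authors' analog circuit) can illustrate the three stages: full-wave rectification, a peak detector with slow decay, and division by the tracked peak. The decay constant here is an assumption chosen for illustration:

```python
import numpy as np

def normalize_ecg(signal, decay=0.999):
    """Digital sketch of the described stages: rectifier, decaying peak
    detector, and analog-divider, so the output stays within a fixed range."""
    peak = 1e-6                              # small floor avoids division by zero
    out = np.empty(len(signal), dtype=float)
    for i, x in enumerate(signal):
        rectified = abs(x)                   # full-wave rectifier
        peak = max(rectified, peak * decay)  # peak detector with slow decay
        out[i] = x / peak                    # divider: normalize by tracked peak
    return out
```

When the input amplitude is suddenly halved, the tracked peak decays until it reaches the new envelope, after which the output again spans the full range, mirroring the adaptation the abstract describes.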
Public Domain Dedication (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/
This dataset is a synthetic collection of student performance data created for data preprocessing, cleaning, and analysis practice in Data Mining and Machine Learning courses. It contains information about 1,020 students, including their study habits, attendance, and test performance, with intentionally introduced missing values, duplicates, and outliers to simulate real-world data issues.
The dataset is suitable for laboratory exercises, assignments, and demonstration of key preprocessing techniques such as:
| Column Name | Description |
|---|---|
| Student_ID | Unique identifier for each student (e.g., S0001, S0002, …) |
| Age | Age of the student (between 18 and 25 years) |
| Gender | Gender of the student (Male/Female) |
| Study_Hours | Average number of study hours per day (contains missing values and outliers) |
| Attendance(%) | Percentage of class attendance (contains missing values) |
| Test_Score | Final exam score (0–100 scale) |
| Grade | Letter grade derived from test scores (F, C, B, A, A+) |
Prediction task: predict the student's test score (Test_Score) from study hours, attendance percentage, age, and gender.
🧠 Sample Features: X = ['Age', 'Gender', 'Study_Hours', 'Attendance(%)'] y = ['Test_Score']
You can use:
And analyze feature influence using correlation or SHAP/LIME explainability.
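A minimal preprocessing sketch with pandas, using a made-up frame that mimics the table's columns and the described issues (missing values, duplicates, outliers); the exact imputation and clipping choices are assumptions, not part of the dataset:

```python
import numpy as np
import pandas as pd

# Tiny synthetic frame mimicking the dataset's issues (column names from the table above)
df = pd.DataFrame({
    "Student_ID": ["S0001", "S0002", "S0002", "S0003", "S0004"],
    "Study_Hours": [2.5, np.nan, np.nan, 3.0, 40.0],    # missing values + an outlier
    "Attendance(%)": [90.0, 75.0, 75.0, np.nan, 80.0],  # missing value
    "Test_Score": [65, 70, 70, 80, 55],
})

df = df.drop_duplicates()                                # remove the duplicated row
for col in ["Study_Hours", "Attendance(%)"]:
    df[col] = df[col].fillna(df[col].median())           # median imputation
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1                                        # IQR-based outlier clipping
    df[col] = df[col].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
print(df)
```

After these steps the frame is deduplicated, fully populated in the numeric columns, and free of the extreme Study_Hours value, i.e. ready for the regression task above.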