27 datasets found
  1. noRNAlize: SHAPE data normalization software

    • simtk.org
    Updated Nov 14, 2006
    Cite
    Quentin Vicens; Alain Laederach (2006). noRNAlize: SHAPE data normalization software [Dataset]. https://simtk.org/frs/?group_id=159
    Explore at:
    application/zip, application/pdf, text/plain (37 MB)
    Dataset updated
    Nov 14, 2006
    Dataset provided by
    UNC-Chapel Hill
    Authors
    Quentin Vicens; Alain Laederach
    Description

    This project is a data analysis package to analyze and normalize SHAPE data. Traditionally, SHAPE requires the addition of a 3' hairpin to the RNA for normalization. noRNAlize eliminates the need for this experimental step by performing a global analysis of the SHAPE data and establishing mean protection values. This is particularly important when SHAPE analysis is used to map crystal contacts in crystal structures, as illustrated here.



    This project includes the following software/data packages:

    • v1.0 : Alpha Release of noRNAlize

  2. LEMming: A Linear Error Model to Normalize Parallel Quantitative Real-Time...

    • plos.figshare.com
    pdf
    Updated Jun 2, 2023
    Cite
    Ronny Feuer; Sebastian Vlaic; Janine Arlt; Oliver Sawodny; Uta Dahmen; Ulrich M. Zanger; Maria Thomas (2023). LEMming: A Linear Error Model to Normalize Parallel Quantitative Real-Time PCR (qPCR) Data as an Alternative to Reference Gene Based Methods [Dataset]. http://doi.org/10.1371/journal.pone.0135852
    Explore at:
    pdf
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Ronny Feuer; Sebastian Vlaic; Janine Arlt; Oliver Sawodny; Uta Dahmen; Ulrich M. Zanger; Maria Thomas
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background
    Gene expression analysis is an essential part of biological and medical investigations. Quantitative real-time PCR (qPCR) is characterized by excellent sensitivity, dynamic range, and reproducibility and is still regarded as the gold standard for quantifying transcript abundance. Parallelization of qPCR, such as on the microfluidic TaqMan Fluidigm Biomark platform, enables evaluation of multiple transcripts in samples treated under various conditions. Despite advanced technologies, correct evaluation of the measurements remains challenging. The most widely used methods for evaluating or calculating gene expression data include geNorm and ΔΔCt. They rely on one or several stable reference genes (RGs) for normalization, thus potentially causing biased results. We therefore applied multivariable regression with a tailored error model to overcome the necessity of stable RGs.

    Results
    We developed an RG-independent data normalization approach based on a tailored linear error model for parallel qPCR data, called LEMming. It uses the assumption that the mean Ct values within samples of similarly treated groups are equal. Performance of LEMming was evaluated in three data sets with different stability patterns of RGs and compared to the results of geNorm normalization. Data set 1 showed that both methods give similar results if stable RGs are available. Data set 2 included RGs which are stable according to geNorm criteria but became differentially expressed in normalized data evaluated by a t-test. geNorm-normalized data showed an effect of a shifted mean per gene per condition, whereas LEMming-normalized data did not. Comparing the decrease in standard deviation from raw data, LEMming was superior to geNorm. In data set 3, stable RGs were available according to geNorm's average expression stability and pairwise variation, but t-tests of raw data contradicted this. Normalization with RGs resulted in distorted data contradicting the literature, while LEMming-normalized data did not.

    Conclusions
    If RGs are coexpressed but are not independent of the experimental conditions, the stability criteria based on inter- and intragroup variation fail. The linear error model developed, LEMming, overcomes the dependency on RGs for parallel qPCR measurements, besides resolving biases of both technical and biological nature in qPCR. However, to distinguish systematic errors per treated group from a global treatment effect, an additional measurement is needed. Quantification of total cDNA content per sample helps to identify systematic errors.
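    To make the stated assumption concrete, here is a minimal sketch of group-wise mean-Ct alignment. It is not the authors' LEMming implementation (which fits a multivariable regression with a tailored error model); the sample names and values are made up.

    # Minimal illustration of the equal-group-mean assumption behind LEMming.
    # NOT the published implementation; it only shows the idea of removing
    # per-sample offsets so that sample means agree within a treatment group.
    import numpy as np
    import pandas as pd

    # Hypothetical Ct matrix: rows = genes, columns = samples
    ct = pd.DataFrame(
        np.random.normal(25, 2, size=(50, 6)),
        columns=["ctrl_1", "ctrl_2", "ctrl_3", "trt_1", "trt_2", "trt_3"],
    )
    groups = {"ctrl": ["ctrl_1", "ctrl_2", "ctrl_3"],
              "trt": ["trt_1", "trt_2", "trt_3"]}

    normalized = ct.copy()
    for name, cols in groups.items():
        sample_means = ct[cols].mean(axis=0)      # mean Ct per sample
        group_mean = sample_means.mean()          # target mean for the group
        # shift each sample so its mean Ct equals the group mean
        normalized[cols] = ct[cols] - (sample_means - group_mean)

    print(normalized[groups["ctrl"]].mean(axis=0))  # now (approximately) equal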

  3. HMS HBAC Train Spectrograms 2

    • kaggle.com
    zip
    Updated Apr 3, 2024
    Cite
    Vishal (2024). HMS HBAC Train Spectrograms 2 [Dataset]. https://www.kaggle.com/datasets/vishalbakshi/hms-hbac-train-spectrograms-2
    Explore at:
    zip (2848189987 bytes)
    Dataset updated
    Apr 3, 2024
    Authors
    Vishal
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This is a dataset of spectrogram images created from the train_spectrograms parquet data from the Harvard Medical School Harmful Brain Activity Classification competition. The parquet files have been transformed with the following code, referencing the HMS-HBAC: KerasCV Starter Notebook

    import math
    import numpy as np
    import pandas as pd
    from PIL import Image
    from fastai.vision.all import PILImage  # assumed: PILImage.create below is the fastai helper

    # Note: `path` (competition data directory), `unique_df` (one row per
    # spectrogram_id with its target label), and `SPEC_DIR` (output directory)
    # are defined elsewhere in the source notebook.
    def process_spec(spec_id, split="train"):
      # read the spectrogram parquet for this id
      data = pd.read_parquet(path/f'{split}_spectrograms'/f'{spec_id}.parquet')

      # read the label
      label = unique_df[unique_df.spectrogram_id == spec_id]["target"].item()

      # replace NA with 0
      data = data.fillna(0)

      # convert DataFrame to array (dropping the first, time-index column)
      data = data.values[:, 1:]

      # transpose
      data = data.T
      data = data.astype("float32")

      # clip data to avoid 0s before taking the log
      data = np.clip(data, math.exp(-4), math.exp(8))

      # take log data to magnify differences
      data = np.log(data)

      # normalize data (epsilon guards against a zero standard deviation)
      data = (data - data.mean()) / (data.std() + 1e-6)

      # convert to 3 channels
      data = np.tile(data[..., None], (1, 1, 3))

      # convert array to PILImage and save under the label's folder
      im = PILImage.create(Image.fromarray((data * 255).astype(np.uint8)))
      im.save(f"{SPEC_DIR}/{split}_spectrograms/{label}/{spec_id}.png")
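    For context, a hypothetical driver loop over all training spectrograms might look like the following; `unique_df` and `SPEC_DIR` are assumed to be defined as in the source notebook, and the folder layout simply mirrors the save path used in process_spec above.

    # Hypothetical usage: create the per-label output folders, then process every id.
    import os

    for label in unique_df["target"].unique():
        os.makedirs(f"{SPEC_DIR}/train_spectrograms/{label}", exist_ok=True)

    for spec_id in unique_df["spectrogram_id"]:
        process_spec(spec_id, split="train")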
    
  4. Normalization techniques for PARAFAC modeling of urine metabolomics data

    • data.niaid.nih.gov
    xml
    Updated May 11, 2017
    Cite
    Radana Karlikova (2017). Normalization techniques for PARAFAC modeling of urine metabolomics data [Dataset]. https://data.niaid.nih.gov/resources?id=mtbls290
    Explore at:
    xml
    Dataset updated
    May 11, 2017
    Dataset provided by
    IMTM, Faculty of Medicine and Dentistry, Palacky University Olomouc, Hnevotinska 5, 775 15 Olomouc, Czech Republic
    Authors
    Radana Karlikova
    Variables measured
    Sample type, Metabolomics, Sample collection time
    Description

    One of the body fluids often used in metabolomics studies is urine. The peak intensities of metabolites in urine are affected by the urine history of an individual, resulting in dilution differences. This therefore requires normalization of the data to correct for such differences. Two normalization techniques are commonly applied to urine samples prior to further statistical analysis. First, AUC normalization standardizes the area under the curve (AUC) within a sample to the median, mean, or another suitable representation of the amount of dilution. The second approach uses specific end-product metabolites such as creatinine, and all intensities within a sample are expressed relative to the creatinine intensity. Another way of looking at urine metabolomics data is by realizing that the ratios between peak intensities are the information-carrying features. This opens up possibilities to use another class of data analysis techniques designed to deal with such ratios: compositional data analysis. In this approach, special transformations are defined to deal with the ratio problem. In essence, it comes down to using a different distance measure than the Euclidean distance used in the conventional analysis of metabolomics data. We illustrate the use of this type of approach in combination with three-way methods (i.e., PARAFAC) for cases where samples of some biological material are measured at multiple time points. The aim of the paper is to develop PARAFAC modeling of three-way metabolomics data in the context of compositional data and to compare this with standard normalization techniques for the specific case of urine metabolomics data.
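    As a concrete illustration of the two common techniques described above, here is a minimal sketch; it is not the paper's code, and the metabolite names and values are hypothetical.

    # Minimal sketch of AUC (total-area) and creatinine normalization of urine data.
    import numpy as np
    import pandas as pd

    # rows = samples, columns = metabolite peak intensities (made-up values)
    X = pd.DataFrame(np.random.lognormal(size=(8, 5)),
                     columns=["m1", "m2", "m3", "m4", "creatinine"])

    # 1) AUC normalization: scale each sample so its summed intensity
    #    matches the median total intensity across samples.
    total = X.sum(axis=1)
    X_auc = X.div(total, axis=0) * total.median()

    # 2) Creatinine normalization: express every intensity relative to the
    #    sample's creatinine intensity.
    X_crea = X.div(X["creatinine"], axis=0)

    print(X_auc.sum(axis=1).round(3))   # all totals equal to the median
    print(X_crea["creatinine"].values)  # creatinine column becomes 1.0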

  5. Khmer Word Image Patches For Training OCR

    • kaggle.com
    zip
    Updated Oct 12, 2025
    Cite
    Chanveasna ENG (2025). Khmer Word Image Patches For Training OCR [Dataset]. https://www.kaggle.com/datasets/veasnaecevilsna/khwordpatches
    Explore at:
    zip (5065136281 bytes)
    Dataset updated
    Oct 12, 2025
    Authors
    Chanveasna ENG
    Description

    Synthetic Khmer OCR - Pre-processed Chunks

    This dataset is a pre-processed and optimized version of the original "Synthetic Khmer OCR" dataset. All word images have been cropped, resized with padding to a uniform size, and stored in highly efficient PyTorch tensor chunks for extremely fast loading during model training.

    This format is designed to completely eliminate the I/O bottleneck that comes from reading millions of individual small image files, allowing you to feed a powerful GPU without any waiting.

    Why This Format?

    Extreme Speed: Loading a single chunk of 100,000 images from one file is hundreds of times faster than loading 100,000 individual PNG files.
    
    No More Pre-processing: All images are already cropped and resized. The data is ready for training right out of the box.
    
    Memory Efficient: The dataset is split into manageable chunks, so you don't need to load all ~34GB of data into RAM at once.
    

    Data Structure

    The dataset is organized into two main folders: train and val.

    /
    ├── train/
    │   ├── train_chunk_0.pt
    │   ├── train_chunk_1.pt
    │   └── ... (and so on for all training chunks)
    └── val/
        ├── val_chunk_0.pt
        ├── val_chunk_1.pt
        └── ... (and so on for all validation chunks)

    Inside Each Chunk File (.pt)

    Each .pt file is a standard PyTorch file containing a single Python dictionary with two keys: 'images' and 'labels'.

    'images':
    
      Type: torch.Tensor
    
      Shape: (N, 3, 40, 80), where N is the number of samples in the chunk (typically 100,000).
    
      Data Type (dtype): torch.uint8 (values from 0-255). This is done to save a massive amount of disk space. You will need to convert this to float and normalize it before feeding it to a model.
    
      Description: This tensor contains N raw, uncompressed images. Each image is a 3-channel (RGB) color image with a height of 40 pixels and a width of 80 pixels, matching the (N, 3, 40, 80) shape above.
    
    'labels':
    
      Type: list of str
    
      Length: N (matches the number of images in the tensor).
    
      Description: This is a simple Python list of strings. The label at labels[i] corresponds to the image at images[i].
    

    How to Use This Dataset in PyTorch

    Here is a simple example of how to load a chunk and access the data.

    ```
    import torch
    from torchvision import transforms
    from PIL import Image

    # --- 1. Load a single chunk file ---
    chunk_path = 'train/train_chunk_0.pt'
    data_chunk = torch.load(chunk_path)

    image_tensor_chunk = data_chunk['images']
    labels_list = data_chunk['labels']

    print(f"Loaded chunk: {chunk_path}")
    print(f"Image tensor shape: {image_tensor_chunk.shape}")
    print(f"Number of labels: {len(labels_list)}")

    # --- 2. Get a single sample (e.g., the 42nd item in this chunk) ---
    index = 42
    image_uint8 = image_tensor_chunk[index]
    label = labels_list[index]

    print(f"\n--- Sample at index {index} ---")
    print(f"Label: {label}")
    print(f"Image tensor shape (as saved): {image_uint8.shape}")
    print(f"Image data type (as saved): {image_uint8.dtype}")

    # --- 3. Prepare the image for a model ---
    # Convert the uint8 tensor (0-255) to a float tensor (0.0-1.0)
    # and then normalize it.

    # a. Convert to float
    image_float = image_uint8.float() / 255.0

    # b. Define the normalization (must be the same as used in training)
    normalize_transform = transforms.Normalize(
        mean=[0.5, 0.5, 0.5],
        std=[0.5, 0.5, 0.5]
    )

    # c. Apply normalization
    normalized_image = normalize_transform(image_float)

    print(f"\nImage tensor shape (normalized): {normalized_image.shape}")
    print(f"Image data type (normalized): {normalized_image.dtype}")
    print(f"Min value: {normalized_image.min():.2f}, Max value: {normalized_image.max():.2f}")

    # --- (Optional) 4. How to view the image ---
    # To convert a tensor back to an image you can view, un-normalize it first
    # if you want the original colors. For simplicity, just convert the float
    # tensor from before normalization.
    image_to_view = transforms.ToPILImage()(image_float)

    # You can now display or save image_to_view
    # image_to_view.show()
    # image_to_view.save('sample_image.png')

    print("\nSuccessfully prepared a sample for model input and viewing!")
    ```
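    Beyond single-sample access, one possible way to wrap a chunk for training is sketched below. The Dataset class, batch size, and on-the-fly normalization are illustrative choices and not part of this dataset's documentation.

    # Minimal sketch: wrap one chunk in a PyTorch Dataset and DataLoader.
    # Paths, batch size, and the normalization constants are illustrative.
    import torch
    from torch.utils.data import Dataset, DataLoader

    class ChunkDataset(Dataset):
        def __init__(self, chunk_path):
            chunk = torch.load(chunk_path)
            self.images = chunk['images']      # uint8, (N, 3, 40, 80)
            self.labels = chunk['labels']      # list of str, length N

        def __len__(self):
            return len(self.labels)

        def __getitem__(self, idx):
            # convert to float in [0, 1] and normalize to roughly [-1, 1]
            img = self.images[idx].float() / 255.0
            img = (img - 0.5) / 0.5
            return img, self.labels[idx]

    ds = ChunkDataset('train/train_chunk_0.pt')
    loader = DataLoader(ds, batch_size=256, shuffle=True, num_workers=2)
    images, labels = next(iter(loader))
    print(images.shape, len(labels))   # e.g. torch.Size([256, 3, 40, 80]), 256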

  6. Table_1_Comparison of Normalization Methods for Analysis of TempO-Seq...

    • datasetcatalog.nlm.nih.gov
    Updated Jun 23, 2020
    + more versions
    Cite
    Bushel, Pierre R.; Ramaiahgari, Sreenivasa C.; Auerbach, Scott S.; Paules, Richard S.; Ferguson, Stephen S. (2020). Table_1_Comparison of Normalization Methods for Analysis of TempO-Seq Targeted RNA Sequencing Data.XLSX [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000579045
    Explore at:
    Dataset updated
    Jun 23, 2020
    Authors
    Bushel, Pierre R.; Ramaiahgari, Sreenivasa C.; Auerbach, Scott S.; Paules, Richard S.; Ferguson, Stephen S.
    Description

    Analysis of bulk RNA sequencing (RNA-Seq) data is a valuable tool to understand transcription at the genome scale. Targeted sequencing of RNA has emerged as a practical means of assessing the majority of the transcriptomic space with less reliance on large resources for consumables and bioinformatics. TempO-Seq is a templated, multiplexed RNA-Seq platform that interrogates a panel of sentinel genes representative of genome-wide transcription. Nuances of the technology require proper preprocessing of the data. Various methods have been proposed and compared for normalizing bulk RNA-Seq data, but there has been little to no investigation of how the methods perform on TempO-Seq data. We simulated count data into two groups (treated vs. untreated) at seven fold-change (FC) levels (including no change) using control samples from human HepaRG cells run on TempO-Seq and normalized the data using seven normalization methods. Upper Quartile (UQ) performed the best with regard to maintaining FC levels as detected by a limma contrast between treated vs. untreated groups. For all FC levels, specificity of the UQ normalization was greater than 0.84 and sensitivity greater than 0.90 except for the no change and +1.5 levels. Furthermore, K-means clustering of the simulated genes normalized by UQ agreed the most with the FC assignments [adjusted Rand index (ARI) = 0.67]. Despite having an assumption of the majority of genes being unchanged, the DESeq2 scaling factors normalization method performed reasonably well as did simple normalization procedures counts per million (CPM) and total counts (TCs). These results suggest that for two class comparisons of TempO-Seq data, UQ, CPM, TC, or DESeq2 normalization should provide reasonably reliable results at absolute FC levels ≥2.0. These findings will help guide researchers to normalize TempO-Seq gene expression data for more reliable results.
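    For readers unfamiliar with the normalization methods named above, a minimal sketch of counts-per-million (CPM) and upper-quartile (UQ) scaling of a simulated count matrix follows. It is illustrative only and not the authors' analysis code, which relied on established implementations such as limma and DESeq2.

    # Minimal sketch of CPM and upper-quartile (UQ) scaling for a count matrix.
    import numpy as np

    rng = np.random.default_rng(0)
    counts = rng.poisson(20, size=(1000, 6))          # genes x samples

    # Counts per million: scale each sample by its library size.
    lib_size = counts.sum(axis=0)
    cpm = counts / lib_size * 1e6

    # Upper quartile: scale each sample by the 75th percentile of its
    # non-zero counts, re-centered on the mean of those factors.
    uq = np.array([np.percentile(c[c > 0], 75) for c in counts.T])
    uq_factors = uq / uq.mean()
    counts_uq = counts / uq_factors

    print(cpm.sum(axis=0))        # each column now sums to 1e6
    print(uq_factors.round(3))    # per-sample scaling factors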

  7. Khmer Subsyllables Image Patches For Training OCR

    • kaggle.com
    zip
    Updated Nov 22, 2025
    Cite
    Chanveasna ENG (2025). Khmer Subsyllables Image Patches For Training OCR [Dataset]. https://www.kaggle.com/datasets/veasnaecevilsna/khsubsyllablespatches
    Explore at:
    zip (3561443645 bytes)
    Dataset updated
    Nov 22, 2025
    Authors
    Chanveasna ENG
    Description

    Synthetic Khmer OCR - Pre-processed Chunks

    This dataset is a pre-processed and optimized version of the original "Synthetic Khmer OCR" dataset. All word images have been cropped, resized with padding to a uniform size, and stored in highly efficient PyTorch tensor chunks for extremely fast loading during model training.

    This format is designed to completely eliminate the I/O bottleneck that comes from reading millions of individual small image files, allowing you to feed a powerful GPU without any waiting.

    Why This Format?

    Extreme Speed: Loading a single chunk of 100,000 images from one file is hundreds of times faster than loading 100,000 individual PNG files.
    
    No More Pre-processing: All images are already cropped and resized. The data is ready for training right out of the box.
    
    Memory Efficient: The dataset is split into manageable chunks, so you don't need to load all ~34GB of data into RAM at once.
    

    Data Structure

    The dataset is organized into two main folders: train and val.

    /
    ├── train/
    │   ├── train_chunk_0.pt
    │   ├── train_chunk_1.pt
    │   └── ... (and so on for all training chunks)
    └── val/
        ├── val_chunk_0.pt
        ├── val_chunk_1.pt
        └── ... (and so on for all validation chunks)

    Inside Each Chunk File (.pt)

    Each .pt file is a standard PyTorch file containing a single Python dictionary with two keys: 'images' and 'labels'.

    'images':
    
      Type: torch.Tensor
    
      Shape: (N, 3, 40, 64), where N is the number of samples in the chunk (typically 100,000).
    
      Data Type (dtype): torch.uint8 (values from 0-255). This is done to save a massive amount of disk space. You will need to convert this to float and normalize it before feeding it to a model.
    
      Description: This tensor contains N raw, uncompressed image pixels. Each image is a 3-channel (RGB) color image with a height of 40 pixels and a width of 64 pixels.
    
    'labels':
    
      Type: list of str
    
      Length: N (matches the number of images in the tensor).
    
      Description: This is a simple Python list of strings. The label at labels[i] corresponds to the image at images[i].
    

    How to Use This Dataset in PyTorch

    Here is a simple example of how to load a chunk and access the data.

    import torch
    from torchvision import transforms
    from PIL import Image
    
    # --- 1. Load a single chunk file ---
    chunk_path = 'train/train_chunk_0.pt'
    data_chunk = torch.load(chunk_path)
    
    image_tensor_chunk = data_chunk['images']
    labels_list = data_chunk['labels']
    
    print(f"Loaded chunk: {chunk_path}")
    print(f"Image tensor shape: {image_tensor_chunk.shape}")
    print(f"Number of labels: {len(labels_list)}")
    
    # --- 2. Get a single sample (e.g., the 42nd item in this chunk) ---
    index = 42
    image_uint8 = image_tensor_chunk[index]
    label = labels_list[index]
    
    print(f"\n--- Sample at index {index} ---")
    print(f"Label: {label}")
    print(f"Image tensor shape (as saved): {image_uint8.shape}")
    print(f"Image data type (as saved): {image_uint8.dtype}")
    
    
    # --- 3. Prepare the image for a model ---
    # You need to convert the uint8 tensor (0-255) to a float tensor (0.0-1.0)
    # and then normalize it.
    
    # a. Convert to float
    image_float = image_uint8.float() / 255.0
    
    # b. Define the normalization (must be the same as used in training)
    normalize_transform = transforms.Normalize(
      mean=[0.5, 0.5, 0.5],
      std=[0.5, 0.5, 0.5]
    )
    
    # c. Apply normalization
    normalized_image = normalize_transform(image_float)
    
    print(f"\nImage tensor shape (normalized): {normalized_image.shape}")
    print(f"Image data type (normalized): {normalized_image.dtype}")
    print(f"Min value: {normalized_image.min():.2f}, Max value: {normalized_image.max():.2f}")
    
    
    # --- (Optional) 4. How to view the image ---
    # To convert a tensor back to an image you can view:
    # We need to un-normalize it first if we want to see the original colors.
    # For simplicity, let's just convert the float tensor before normalization.
    image_to_view = transforms.ToPILImage()(image_float)
    
    # You can now display this image_to_view
    # image_to_view.show() 
    # image_to_view.save('sample_image.png')
    print("\nSuccessfully prepared a sample for model input and viewing!")
    
  8. Naturalistic Neuroimaging Database

    • openneuro.org
    Updated Apr 20, 2021
    + more versions
    Cite
    Sarah Aliko; Jiawen Huang; Florin Gheorghiu; Stefanie Meliss; Jeremy I Skipper (2021). Naturalistic Neuroimaging Database [Dataset]. http://doi.org/10.18112/openneuro.ds002837.v1.1.3
    Explore at:
    Dataset updated
    Apr 20, 2021
    Dataset provided by
    OpenNeuro (https://openneuro.org/)
    Authors
    Sarah Aliko; Jiawen Huang; Florin Gheorghiu; Stefanie Meliss; Jeremy I Skipper
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Overview

    • The Naturalistic Neuroimaging Database (NNDb v2.0) contains datasets from 86 human participants doing the NIH Toolbox and then watching one of 10 full-length movies during functional magnetic resonance imaging (fMRI). The participants were all right-handed, native English speakers, with no history of neurological/psychiatric illnesses, with no hearing impairments, unimpaired or corrected vision, and taking no medication. Each movie was stopped in 40-50 minute intervals or when participants asked for a break, resulting in 2-6 runs of BOLD-fMRI. A 10 minute high-resolution defaced T1-weighted anatomical MRI scan (MPRAGE) is also provided.
    • The NNDb V2.0 is now on Neuroscout, a platform for fast and flexible re-analysis of (naturalistic) fMRI studies. See: https://neuroscout.org/

    v2.0 Changes

    • Overview
      • We have replaced our own preprocessing pipeline with that implemented in AFNI’s afni_proc.py, thus changing only the derivative files. This introduces a fix for an issue with our normalization (i.e., scaling) step and modernizes and standardizes the preprocessing applied to the NNDb derivative files. We have done a bit of testing and have found that results in both pipelines are quite similar in terms of the resulting spatial patterns of activity but with the benefit that the afni_proc.py results are 'cleaner' and statistically more robust.
    • Normalization

      • Emily Finn and Clare Grall at Dartmouth and Rick Reynolds and Paul Taylor at AFNI, discovered and showed us that the normalization procedure we used for the derivative files was less than ideal for timeseries runs of varying lengths. Specifically, the 3dDetrend flag -normalize makes 'the sum-of-squares equal to 1'. We had not thought through that an implication of this is that the resulting normalized timeseries amplitudes will be affected by run length, increasing as run length decreases (and maybe this should go in 3dDetrend’s help text). To demonstrate this, I wrote a version of 3dDetrend’s -normalize for R so you can see for yourselves by running the following code:
      # Generate a resting state (rs) timeseries (ts)
      # Install / load package to make fake fMRI ts
      # install.packages("neuRosim")
      library(neuRosim)
      # Generate a ts
      ts.rs <- simTSrestingstate(nscan=2000, TR=1, SNR=1)
      # 3dDetrend -normalize
      # R command version for 3dDetrend -normalize -polort 0 which normalizes by making "the sum-of-squares equal to 1"
      # Do for the full timeseries
      ts.normalised.long <- (ts.rs-mean(ts.rs))/sqrt(sum((ts.rs-mean(ts.rs))^2));
      # Do this again for a shorter version of the same timeseries
      ts.shorter.length <- length(ts.normalised.long)/4
      ts.normalised.short <- (ts.rs[1:ts.shorter.length]- mean(ts.rs[1:ts.shorter.length]))/sqrt(sum((ts.rs[1:ts.shorter.length]- mean(ts.rs[1:ts.shorter.length]))^2));
      # By looking at the summaries, it can be seen that the median values become  larger
      summary(ts.normalised.long)
      summary(ts.normalised.short)
      # Plot results for the long and short ts
      # Truncate the longer ts for plotting only
      ts.normalised.long.made.shorter <- ts.normalised.long[1:ts.shorter.length]
      # Give the plot a title
      title <- "3dDetrend -normalize for long (blue) and short (red) timeseries";
      plot(x=0, y=0, main=title, xlab="", ylab="", xaxs='i', xlim=c(1,length(ts.normalised.short)), ylim=c(min(ts.normalised.short),max(ts.normalised.short)));
      # Add zero line
      lines(x=c(-1,ts.shorter.length), y=rep(0,2), col='grey');
      # 3dDetrend -normalize -polort 0 for long timeseries
      lines(ts.normalised.long.made.shorter, col='blue');
      # 3dDetrend -normalize -polort 0 for short timeseries
      lines(ts.normalised.short, col='red');
      
    • Standardization/modernization

      • The above individuals also encouraged us to implement the afni_proc.py script over our own pipeline. It introduces at least three additional improvements: First, we now use Bob’s @SSwarper to align our anatomical files with an MNI template (now MNI152_2009_template_SSW.nii.gz) and this, in turn, integrates nicely into the afni_proc.py pipeline. This seems to result in a generally better or more consistent alignment, though this is only a qualitative observation. Second, all the transformations / interpolations and detrending are now done in fewer steps compared to our pipeline. This is preferable because, e.g., there is less chance of inadvertently reintroducing noise back into the timeseries (see Lindquist, Geuter, Wager, & Caffo 2019). Finally, many groups are advocating using tools like fMRIPrep or afni_proc.py to increase standardization of analysis practices in our neuroimaging community. This presumably results in less error, less heterogeneity and more interpretability of results across studies. Along these lines, the quality control (‘QC’) html pages generated by afni_proc.py are a real help in assessing data quality and almost a joy to use.
    • New afni_proc.py command line

      • The following is the afni_proc.py command line that we used to generate blurred and censored timeseries files. The afni_proc.py tool comes with extensive help and examples. As such, you can quickly understand our preprocessing decisions by scrutinising the below. Specifically, the following command is most similar to Example 11 for ‘Resting state analysis’ in the help file (see https://afni.nimh.nih.gov/pub/dist/doc/program_help/afni_proc.py.html):

        afni_proc.py \
          -subj_id "$sub_id_name_1" \
          -blocks despike tshift align tlrc volreg mask blur scale regress \
          -radial_correlate_blocks tcat volreg \
          -copy_anat anatomical_warped/anatSS.1.nii.gz \
          -anat_has_skull no \
          -anat_follower anat_w_skull anat anatomical_warped/anatU.1.nii.gz \
          -anat_follower_ROI aaseg anat freesurfer/SUMA/aparc.a2009s+aseg.nii.gz \
          -anat_follower_ROI aeseg epi freesurfer/SUMA/aparc.a2009s+aseg.nii.gz \
          -anat_follower_ROI fsvent epi freesurfer/SUMA/fs_ap_latvent.nii.gz \
          -anat_follower_ROI fswm epi freesurfer/SUMA/fs_ap_wm.nii.gz \
          -anat_follower_ROI fsgm epi freesurfer/SUMA/fs_ap_gm.nii.gz \
          -anat_follower_erode fsvent fswm \
          -dsets media_?.nii.gz \
          -tcat_remove_first_trs 8 \
          -tshift_opts_ts -tpattern alt+z2 \
          -align_opts_aea -cost lpc+ZZ -giant_move -check_flip \
          -tlrc_base "$basedset" \
          -tlrc_NL_warp \
          -tlrc_NL_warped_dsets \
            anatomical_warped/anatQQ.1.nii.gz \
            anatomical_warped/anatQQ.1.aff12.1D \
            anatomical_warped/anatQQ.1_WARP.nii.gz \
          -volreg_align_to MIN_OUTLIER \
          -volreg_post_vr_allin yes \
          -volreg_pvra_base_index MIN_OUTLIER \
          -volreg_align_e2a \
          -volreg_tlrc_warp \
          -mask_opts_automask -clfrac 0.10 \
          -mask_epi_anat yes \
          -blur_to_fwhm -blur_size $blur \
          -regress_motion_per_run \
          -regress_ROI_PC fsvent 3 \
          -regress_ROI_PC_per_run fsvent \
          -regress_make_corr_vols aeseg fsvent \
          -regress_anaticor_fast \
          -regress_anaticor_label fswm \
          -regress_censor_motion 0.3 \
          -regress_censor_outliers 0.1 \
          -regress_apply_mot_types demean deriv \
          -regress_est_blur_epits \
          -regress_est_blur_errts \
          -regress_run_clustsim no \
          -regress_polort 2 \
          -regress_bandpass 0.01 1 \
          -html_review_style pythonic

        We used similar command lines to generate ‘blurred and not censored’ and the ‘not blurred and not censored’ timeseries files (described more fully below). We will provide the code used to make all derivative files available on our github site (https://github.com/lab-lab/nndb).

      We made one choice above that is different enough from our original pipeline that it is worth mentioning here. Specifically, we have quite long runs, with the average being ~40 minutes but this number can be variable (thus leading to the above issue with 3dDetrend’s -normalise). A discussion on the AFNI message board with one of our team (starting here, https://afni.nimh.nih.gov/afni/community/board/read.php?1,165243,165256#msg-165256), led to the suggestion that '-regress_polort 2' with '-regress_bandpass 0.01 1' be used for long runs. We had previously used only a variable polort with the suggested 1 + int(D/150) approach. Our new polort 2 + bandpass approach has the added benefit of working well with afni_proc.py.

      Which timeseries file you use is up to you but I have been encouraged by Rick and Paul to include a sort of PSA about this. In Paul’s own words:

      • Blurred data should not be used for ROI-based analyses (and potentially not for ICA? I am not certain about standard practice).
      • Unblurred data for ISC might be pretty noisy for voxelwise analyses, since blurring should effectively boost the SNR of active regions (and even good alignment won't be perfect everywhere).
      • For uncensored data, one should be concerned about motion effects being left in the data (e.g., spikes in the data).
      • For censored data:
        • Performing ISC requires the users to unionize the censoring patterns during the correlation calculation.
        • If wanting to calculate power spectra or spectral parameters like ALFF/fALFF/RSFA etc. (which some people might do for naturalistic tasks still), then standard FT-based methods can't be used because sampling is no longer uniform. Instead, people could use something like 3dLombScargle+3dAmpToRSFC, which calculates power spectra (and RSFC params) based on a generalization of the FT that can handle non-uniform sampling, as long as the censoring pattern is mostly random and, say, only up to about 10-15% of the data.

      In sum, think very carefully about which files you use. If you find you need a file we have not provided, we can happily generate different versions of the timeseries upon request and can generally do so in a week or less.

    • Effect on results

      • From numerous tests on our own analyses, we have qualitatively found that results using our old vs the new afni_proc.py preprocessing pipeline do not change all that much in terms of general spatial patterns. There is, however, an
  9. Data for A Systemic Framework for Assessing the Risk of Decarbonization to...

    • zenodo.org
    txt
    Updated Sep 18, 2025
    Cite
    Soheil Shayegh; Soheil Shayegh; Giorgia Coppola; Giorgia Coppola (2025). Data for A Systemic Framework for Assessing the Risk of Decarbonization to Regional Manufacturing Activities in the European Union [Dataset]. http://doi.org/10.5281/zenodo.17152310
    Explore at:
    txt
    Dataset updated
    Sep 18, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Soheil Shayegh; Soheil Shayegh; Giorgia Coppola; Giorgia Coppola
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Sep 18, 2025
    Area covered
    European Union
    Description

    README — Code and data
    Project: LOCALISED

    Work Package 7, Task 7.1

    Paper: A Systemic Framework for Assessing the Risk of Decarbonization to Regional Manufacturing Activities in the European Union

    What this repo does
    -------------------
    Builds the Transition‑Risk Index (TRI) for EU manufacturing at NUTS‑2 × NACE Rev.2, and reproduces the article’s Figures 3–6:
    • Exposure (emissions by region/sector)
    • Vulnerability (composite index)
    • Risk = Exposure ⊗ Vulnerability
    Outputs include intermediate tables, the final analysis dataset, and publication figures.

    Folder of interest
    ------------------
    Code and data/
    ├─ Code/ # R scripts (run in order 1A → 5)
    │ └─ Create Initial Data/ # scripts to (re)build Initial data/ from Eurostat API with imputation
    ├─ Initial data/ # Eurostat inputs imputed for missing values
    ├─ Derived data/ # intermediates
    ├─ Final data/ # final analysis-ready tables
    └─ Figures/ # exported figures

    Quick start
    -----------
    1) Open R (or RStudio) and set the working directory to “Code and data/Code”.
    Example: setwd(".../Code and data/Code")
    2) Initial data/ contains the required Eurostat inputs referenced by the scripts.
    To reproduce the inputs in Initial data/, run the scripts in Code/Create Initial Data/.
    These scripts download the required datasets from the respective API and impute missing values; outputs are written to ../Initial data/.
    3) Run scripts sequentially (they use relative paths to ../Raw data, ../Derived data, etc.):
    1A-non-sector-data.R → 1B-sector-data.R → 1C-all-data.R → 2-reshape-data.R → 3-normalize-data-by-n-enterpr.R → 4-risk-aggregation.R → 5A-results-maps.R, 5B-results-radar.R

    What each script does
    ---------------------
    Create Initial Data — Recreate inputs
    • Download source tables from the Eurostat API or the Localised DSP, apply light cleaning, and impute missing values.
    • Write the resulting inputs to Initial data/ for the analysis pipeline.

    1A / 1B / 1C — Build the unified base
    • Read individual Eurostat datasets (some sectoral, some only regional).
    • Harmonize, aggregate, and align them into a single analysis-ready schema.
    • Write aggregated outputs to Derived data/ (and/or Final data/ as needed).

    2 — Reshape and enrich
    • Reshapes the combined data and adds metadata.
    • Output: Derived data/2_All_data_long_READY.xlsx (all raw indicators in tidy long format, with indicator names and values).

    3 — Normalize (enterprises & min–max)
    • Divide selected indicators by number of enterprises.
    • Apply min–max normalization to [0.01, 0.99].
    • Exposure keeps real zeros (zeros remain zero).
    • Write normalized tables to Derived data/ or Final data/.

    4 — Aggregate indices
    • Vulnerability: build dimension scores (Energy, Labour, Finance, Supply Chain, Technology).
    – Within each dimension: equal‑weight mean of directionally aligned, [0.01,0.99]‑scaled indicators.
    – Dimension scores are re‑scaled to [0.01,0.99].
    • Aggregate Vulnerability: equal‑weight mean of the five dimensions.
    • TRI (Risk): combine Exposure (E) and Vulnerability (V) via a weighted geometric rule with α = 0.5 in the baseline (see the sketch after this list).
    – Policy‑intuitive properties: high E & high V → high risk; imbalances penalized (non‑compensatory).
    • Output: Final data/ (main analysis tables).
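    To make the scaling and aggregation rules above concrete, here is a small illustrative sketch (in Python, although the pipeline itself is written in R). The indicator values are made up; only the min–max rescaling, the equal-weight mean, and the α = 0.5 weighted geometric combination described above are reproduced.

    import numpy as np

    def minmax(x, lo=0.01, hi=0.99):
        # rescale a vector to the [lo, hi] range used in the pipeline
        x = np.asarray(x, dtype=float)
        return lo + (x - x.min()) / (x.max() - x.min()) * (hi - lo)

    # hypothetical per-region values for one manufacturing subsector
    exposure_raw = np.array([120.0, 40.0, 300.0, 10.0])   # e.g. emissions per enterprise
    dimensions = {                                        # dimension scores, already [0.01, 0.99]-scaled
        "Energy":       np.array([0.80, 0.20, 0.60, 0.10]),
        "Labour":       np.array([0.50, 0.30, 0.90, 0.20]),
        "Finance":      np.array([0.40, 0.60, 0.70, 0.10]),
        "Supply Chain": np.array([0.70, 0.40, 0.50, 0.30]),
        "Technology":   np.array([0.60, 0.50, 0.80, 0.20]),
    }

    E = minmax(exposure_raw)                              # Exposure
    V = np.mean(list(dimensions.values()), axis=0)        # equal-weight Vulnerability

    alpha = 0.5                                           # baseline weighting
    TRI = E**alpha * V**(1 - alpha)                       # weighted geometric rule

    print(np.round(TRI, 3))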

    5A / 5B — Visualize results
    • 5A: maps and distribution plots for Exposure, Vulnerability, and Risk → Figures 3 & 4.
    • 5B: comparative/radar profiles for selected countries/regions/subsectors → Figures 5 & 6.
    • Outputs saved to Figures/.

    Data flow (at a glance)
    -----------------------
    Initial data → (1A–1C) Aggregated base → (2) Tidy long file → (3) Normalized indicators → (4) Composite indices → (5) Figures
    (outputs along the way: Derived data/, 2_All_data_long_READY.xlsx, and Final data/ & Figures/)

    Assumptions & conventions
    -------------------------
    • Geography: EU NUTS‑2 regions; Sector: NACE Rev.2 manufacturing subsectors.
    • Equal weights by default where no evidence supports alternatives.
    • All indicators directionally aligned so that higher = greater transition difficulty.
    • Relative paths assume working directory = Code/.

    Reproducing the article
    -----------------------
    • Optionally run the codes from the Code/Create Initial Data subfolder
    • Run 1A → 5B without interruption to regenerate:
    – Figure 3: Exposure, Vulnerability, Risk maps (total manufacturing).
    – Figure 4: Vulnerability dimensions (Energy, Labour, Finance, Supply Chain, Technology).
    – Figure 5: Drivers of risk—highest vs. lowest risk regions (example: Germany & Greece).
    – Figure 6: Subsector case (e.g., basic metals) by selected regions.
    • Final tables for the paper live in Final data/. Figures export to Figures/.

    Requirements
    ------------
    • R (version per your environment).
    • Install any missing packages listed at the top of each script (e.g., install.packages("...")).

    Troubleshooting
    ---------------
    • “File not found”: check that the previous script finished and wrote its outputs to the expected folder.
    • Paths: confirm getwd() ends with /Code so relative paths resolve to ../Raw data, ../Derived data, etc.
    • Reruns: optionally clear Derived data/, Final data/, and Figures/ before a clean rebuild.

    Provenance & citation
    ---------------------
    • Inputs: Eurostat and related sources cited in the paper and headers of the scripts.
    • Methods: OECD composite‑indicator guidance; IPCC AR6 risk framing (see paper references).
    • If you use this code, please cite the article:
    A Systemic Framework for Assessing the Risk of Decarbonization to Regional Manufacturing Activities in the European Union.

  10. PlotTwist: A web app for plotting and annotating continuous data

    • plos.figshare.com
    • figshare.com
    docx
    Updated Jan 24, 2020
    Cite
    Joachim Goedhart (2020). PlotTwist: A web app for plotting and annotating continuous data [Dataset]. http://doi.org/10.1371/journal.pbio.3000581
    Explore at:
    docx
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    PLOS Biology
    Authors
    Joachim Goedhart
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Experimental data can broadly be divided into discrete or continuous data. Continuous data are obtained from measurements that are performed as a function of another quantitative variable, e.g., time, length, concentration, or wavelength. The results from these types of experiments are often used to generate plots that visualize the measured variable on a continuous, quantitative scale. To simplify state-of-the-art data visualization and annotation of data from such experiments, an open-source tool was created with R/shiny that does not require coding skills to operate it. The freely available web app accepts wide (spreadsheet) and tidy data and offers a range of options to normalize the data. The data from individual objects can be shown in 3 different ways: (1) lines with unique colors, (2) small multiples, and (3) heatmap-style display. Next to this, the mean can be displayed with a 95% confidence interval for the visual comparison of different conditions. Several color-blind-friendly palettes are available to label the data and/or statistics. The plots can be annotated with graphical features and/or text to indicate any perturbations that are relevant. All user-defined settings can be stored for reproducibility of the data visualization. The app is dubbed PlotTwist and runs locally or online: https://huygens.science.uva.nl/PlotTwist

  11. Normalization of Valuations in the Public and Private Software Markets -...

    • tomtunguz.com
    Updated Oct 9, 2018
    Cite
    Tomasz Tunguz (2018). Normalization of Valuations in the Public and Private Software Markets - Data Analysis [Dataset]. https://tomtunguz.com/publi-arr-multiples/
    Explore at:
    Dataset updated
    Oct 9, 2018
    Dataset provided by
    Theory Ventures
    Authors
    Tomasz Tunguz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Explore how public SaaS valuations now match private markets, with forward revenue multiples hitting 9x. Key analysis of ARR metrics and growth trends in 2018.

  12. Transcriptional changes in macaques exposed to Sudan virus and treated with...

    • nde-dev.biothings.io
    • datadryad.org
    zip
    Updated Feb 7, 2024
    Cite
    Courtney Woolsey (2024). Transcriptional changes in macaques exposed to Sudan virus and treated with a vehicle controls or obeldesivir for 5 or 10 days [Dataset]. http://doi.org/10.5061/dryad.wdbrv15vn
    Explore at:
    zip
    Dataset updated
    Feb 7, 2024
    Dataset provided by
    The University of Texas Medical Branch at Galveston
    Authors
    Courtney Woolsey
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Normalized NanoString transcriptomic data (fold2-change- and Benjamini–Hochberg adjusted p-values) were exported as an .xlsx file. Groups include vehicle control (N=3), treated fatal (N=2), and treated survivor subjects administered ODV for 5 (N=3) or 10 days (N=5) compared against a pre-challenge baseline (0 DPI) at each collection timepoint. Any differentially expressed transcripts with a Benjamini-Hochberg false discovery rate (FDR) corrected p-value less than 0.05 were deemed significant. ODV, obeldesivir; DPI, days post infection.

    Methods

    NHPV2_Immunology reporter and capture probe sets (NanoString Technologies) were hybridized with 3 µL of each RNA sample for ~24 hours at 65°C. The RNA:probe set complexes were subsequently loaded onto an nCounter microfluidics cartridge and assayed using a NanoString nCounter SPRINT Profiler. Samples with an image binding density greater than 2.0 were re-analyzed with 1 µL of RNA to meet quality control criteria. Briefly, nCounter .RCC files were imported into NanoString nSolver 4.0 software. To compensate for varying RNA inputs and reaction efficiency, an array of 10 housekeeping genes and spiked-in positive and negative controls were used to normalize the raw read counts. The array and number of housekeeping mRNAs are selected by default within the NanoString nSolver Advanced Analysis module. As both sample input and reaction efficiency are expected to affect all probes uniformly, normalization for run-to-run and sample-to-sample variability is performed by dividing counts within a lane by the geometric mean of the reference/normalizer probes from the same lane (i.e., all probes/count levels within a lane are adjusted by the same factor). The ideal normalization genes are automatically determined by selecting those that minimize the pairwise variation statistic, using the widely used geNorm algorithm as implemented in the Bioconductor package NormqPCR. The data were analyzed with the NanoString nSolver Advanced Analysis 2.0 package for differential expression.

    Normalized data (fold2-change- and Benjamini–Hochberg adjusted p-values) were exported as an .xlsx file (Data S1). Groups include vehicle control (N=3), treated fatal (N=2), and treated survivor subjects administered ODV for 5 (N=3) or 10 days (N=5) compared against a pre-challenge baseline (0 DPI) at each collection timepoint. Any differentially expressed transcripts with a Benjamini-Hochberg false discovery rate (FDR) corrected p-value less than 0.05 were deemed significant. Human annotations were added for each respective mRNA to perform immune cell profiling within nSolver (Data S2). For the heatmaps, groups of vehicle control (N=3), treated fatal (N=2), and treated survivor subjects administered ODV for 5 (N=3) or 10 days (N=5) were compared against their pre-challenge baseline (0 DPI) at each collection timepoint. For enrichment analysis, differentially expressed transcripts and adjusted p-values from the Data S1 file were imported into Ingenuity Pathway Analysis (IPA; Qiagen) for canonical pathway, upstream analysis, disease and function, and tox function analyses with respect to a pre-challenge baseline (Data S3). The topmost significant pathways based on z-scores were imported into GraphPad Prism version 10.0.1 to produce heatmaps.
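    As a rough illustration of the lane-level normalization described above (scaling each lane by the geometric mean of its reference probes), consider the following sketch. It is not the nSolver implementation; the probe names and counts are made up.

    # Sketch of geometric-mean reference (housekeeping) normalization.
    import numpy as np
    import pandas as pd

    counts = pd.DataFrame(
        {"lane_1": [500, 80, 1200, 950, 40],
         "lane_2": [900, 150, 2300, 1800, 85]},
        index=["GeneA", "GeneB", "HK1", "HK2", "HK3"],
    )
    housekeepers = ["HK1", "HK2", "HK3"]

    # Geometric mean of reference probes per lane
    geo_mean = np.exp(np.log(counts.loc[housekeepers]).mean(axis=0))

    # Scale every lane so its reference geometric mean matches the overall mean
    factors = geo_mean.mean() / geo_mean
    normalized = counts * factors

    print(factors.round(3))
    print(normalized.round(1))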

  13. Brain tumor MRI and CT scan

    • kaggle.com
    zip
    Updated Oct 3, 2022
    Cite
    chenghan pu (2022). Brain tumor MRI and CT scan [Dataset]. https://www.kaggle.com/datasets/chenghanpu/brain-tumor-mri-and-ct-scan/code
    Explore at:
    zip (3685497430 bytes)
    Dataset updated
    Oct 3, 2022
    Authors
    chenghan pu
    Description

    A novel brain tumor dataset containing 4500 2D MRI-CT slices. The original MRI and CT scans are also contained in this dataset.

    Pre-processing strategy: The pre-processing pipeline includes pairing MRI and CT scans according to a specific time interval between the CT and MRI scans of the same patient, MRI image registration to a standard template, MRI-CT image registration, intensity normalization, and extraction of 2D slices from the 3D volumes. The pipeline can be used to obtain classic 2D MRI-CT images from 3D DICOM-format MRI and CT scans, which can be directly used as training data for end-to-end synthetic-CT deep learning networks.

    Details:
    • Pairing MRI and CT scans: If the time interval between the MRI and CT scans is too long, the information in the MRI and CT images will not match. We therefore pair MRI and CT scans according to a certain time interval between the CT and MRI scans of the same patient, which should not exceed half a year.
    • MRI image registration: Considering the differences both in the human brain and in the spatial coordinates of the images during scanning, the dataset must avoid individual differences and unify the coordinates, meaning all CT and MRI images should be registered to a standard template. The generated images can be more accurate after registration. The template proposed by the Montreal Neurological Institute is the MNI ICBM 152 non-linear 6th Generation Symmetric Average Brain Stereotaxic Registration Model (MNI 152) (Grabner et al., 2006). Affine registration is first applied to register MRI scans to the MNI152 template.
    • Intensity normalization: The registered scans contain some extreme values, which introduce errors that would affect generation accuracy. We normalize the image data and eliminate these extreme values by clipping pixel values below the 1st percentile and above the 99th percentile to those percentile values.
    • Extracting 2D slices from 3D volumes: After registration, the 3D MRI and CT scans can be represented as 237×197×189 matrices. To ensure compatibility between training models and inputs, each 3D image is sliced, and 4500 2D MRI-CT image pairs are selected as the final training data.
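    The percentile-based intensity normalization step can be sketched as follows (illustrative only, not the dataset authors' code; the random volume simply stands in for a registered scan of the stated dimensions).

    # Sketch: clip intensities at the 1st and 99th percentiles, then rescale.
    import numpy as np

    volume = np.random.normal(100, 30, size=(237, 197, 189)).astype(np.float32)

    p1, p99 = np.percentile(volume, [1, 99])
    clipped = np.clip(volume, p1, p99)

    # optionally rescale to [0, 1] afterwards
    normalized = (clipped - p1) / (p99 - p1)
    print(normalized.min(), normalized.max())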

    Source database: 1. https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=33948305 2. https://wiki.cancerimagingarchive.net/display/Public/CPTAC-GBM 3. https://wiki.cancerimagingarchive.net/display/Public/TCGA-GBM

    Patient information: Number of patients: 41

    Introduction of each file:
    • Dicom: contains the source files collected from the three websites above.
    • data(processed): contains the processed data saved as .npy files. You can use train_input.npy and train_output.npy as the input and output of the encoder-decoder structure to train the model. The test and validation input/output files can be used as test and validation datasets.

  14. RNAdecayCafe: a uniformly reprocessed atlas of human RNA half-lives across...

    • zenodo.org
    application/gzip, bin +1
    Updated Jul 2, 2025
    Cite
    Isaac Vock; Isaac Vock (2025). RNAdecayCafe: a uniformly reprocessed atlas of human RNA half-lives across 12 cell lines [Dataset]. http://doi.org/10.5281/zenodo.15785218
    Explore at:
    csv, bin, application/gzip
    Dataset updated
    Jul 2, 2025
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Isaac Vock; Isaac Vock
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    RNA half-life estimates from uniformly reprocessed/reanalyzed, published, high quality nucleotide recoding RNA-seq (NR-seq; namely SLAM-seq and TimeLapse-seq) datasets. 12 human cell lines are represented. Data can be browsed at this website.

    Analysis notes:

    • Data was processed using fastq2EZbakR. All config files used are provided in fastq2EZbakR_config.tar.gz (as well as some for data not included in the final RNAdecayCafe due to QC issues). Some general notes:
      • Multi-mapping reads were filtered out completely. It is difficult to do anything accurate/intelligent with such reads (see the discussion here, for instance), so it is better to just get rid of them completely. This does mean that some classes of features rich in repetitive sequences will be underrepresented in this database.
      • Adapters were trimmed. For 3'-end data, 12 additional nucleotides were trimmed from the 5' end of the reads, as suggested by the developers of the popular Quant-seq kit, and as is done by default in SLAMDUNK. Quality score end trimming and polyX trimming is also done for all samples.
    • Data was analyzed using EZbakR. Some general notes:
      • If -s4U data was included, this data was used to infer a global pold to stabilize pnew estimation (see Methods here for a brief discussion). Dropout was also corrected using a previously developed strategy now implemented in EZbakR's CorrectDropout() function.
      • Half-lives are estimated on a gene-level. That is, all reads that map to exonic regions of a gene (i.e., regions that are exonic in at least one annotated isoform) are combined and used to estimate a half-life for that gene. Thus, you should think of these half-life estimates as a weighted average over all isoforms expressed from that gene, weighted by the relative abundances of those isoforms. Future releases may include isoform-resolution estimates as well, given that EZbakR can now perform this type of analysis.
    • A unique feature of RNAdecayCafe is that it includes what I am referring to as "dropout normalized" half-life and kdeg estimates. As not all datasets analyzed include -s4U data, dropout correction is not possible for all samples. This can lead to global biases in the average time scale of half-lives that are unlikely to represent real biology (that is, two different K562 datasets may have median half-life estimates of 4 hours and 15 hours). To address this problem and facilitate comparison across datasets, I developed a strategy (implemented in EZbakR's NormalizeForDropout() function) that uses a model of dropout to normalize estimates with respect to a low dropout sample.
      • These "donorm" estimates will often be a more accurate reflection of rate constants and half-lives in a given cell line.
      • This strategy can normalize out real global differences in turnover kinetics though, so interpret these values with care.

    Relevant data provided in this repository are as follows:

    1. hg38_Strict.gtf: annotation used for analysis. Filtered similarly to how is described here.
    2. AvgKdegs_genes_v1.csv: Table of cell-line average half-lives and degradation rate constants (kdegs). Average log(kdeg)'s are calculated for all samples from a given cell line, weighting by the uncertainty in the log(kdeg) estimate. Columns in this table are as follows:
      1. feature_ID: Gene ID (symbol) from hg38_Strict.gtf
      2. cell_line: Human cell line for which averages are calculated
      3. avg_log_kdeg: Weighted log(kdeg) average
      4. avg_donorm_log_kdeg: Weighted dropout normalized log(kdeg) average.
      5. avg_log_RPKM_total: Average log(RPKM) value from total RNA data. A value of exactly 0 means that there was no total RNA data for this cell line (i.e., all data was 3'-end data).
      6. avg_log_RPKM_3pend: Average log(RPKM) value from 3'-end data. Technically no length normalization is performed as this is 3'-end data, so it is really a log(RPM). A value of exactly 0 means that there was no 3'-end data for this cell line (i.e., all data was for total RNA).
      7. avg_kdeg: e^avg_log_kdeg
      8. avg_donorm_kdeg: e^avg_donorm_log_kdeg
      9. avg_halflife: log(2)/avg_kdeg; can be thought of as average lifetime of the RNA.
      10. avg_donorm_halflife: log(2)/avg_donorm_kdeg
      11. avg_RPKM_total: e^avg_log_RPKM_total
      12. avg_RPKM_3pend: e^avg_log_RPKM_3pend.
    3. FeatureDetails_gene_v1.csv: Table of details about each gene measured; information comes from hg38_Strict.gtf and the corresponding hg38 genome FASTA file.
      1. seqnames: chromosome name
      2. strand: strand on which gene is transcribed
      3. start: genomic start position for gene (most 5'-end coordinate; will be location of TES for - strand genes).
      4. end: end position for gene
      5. type: all "gene" for now, as all analyses are currently gene-level average half-life calculations
      6. exon_length: length of union of exons for a given gene. A read is considered exonic, and thus used for half-life estimation, if it exclusively overlaps with the region defined by the union of all annotated exons for that gene.
      7. exon_GC_fraction: fraction of nucleotides in union of exons that are Gs or Cs.
      8. end_GC_fraction: fraction of nucleotides in last 1000 nts of 3'end of transcript that are Gs or Cs. Useful for assessing GC biases in 3'-end data.
      9. feature_ID: Gene ID (symbol) from hg38_Strict.gtf
    4. SampleDetails_v1.csv: Table of details about all samples represented in RNAdecayCafe
      1. sample: SRA accession ID for sample
      2. dataset: Citation-esque summary of the dataset of origin
      3. pnew: EZbakR estimated T-to-C mutation rate in reads from new (labeled) RNA. You can see these blogs (here and here) for some intuition as to how to interpret these. More technical explanations of the models involved can be found here and here.
      4. pold: EZbakR estimated T-to-C mutation rate in reads from old (unlabeled) RNA. Same citations for pnew apply. Best samples are those with the largest gap between the pold and pnew; can think of this like a signal-to-noise ratio
      5. label_time: How long (in hours) were cells labeled with s4U for?
      6. cell_line: Cell line used for that sample.
      7. threePseq: TRUE or FALSE; TRUE if 3'-end sequencing was used.
      8. total_reads: Total number of aligned, exonic reads in the sample.
      9. median_halflife: Median, unnormalized half-life estimate. Differences between cell lines could represent real biology, but could also be evidence of dropout (see here and here for discussion of this phenomenon).
    5. RateConstants_gene_v1.csv
      1. sample: SRA accession ID for sample.
      2. kdeg: e ^ log_kdeg
      3. halflife: log(2) / kdeg. Can be thought of as the average lifetime of the RNA.
      4. donorm_kdeg: e ^ donorm_log_kdeg
      5. donorm_halflife: log(2) / donorm_kdeg
      6. log_kdeg: log degradation rate constant estimated by EZbakR
      7. donorm_log_kdeg: dropout normalized log degradation rate constant
      8. reads: number of reads that contributed to estimates
      9. donorm_reads: dropout normalization corrected read count
      10. feature_ID: Gene ID (symbol) from hg38_Strict.gtf
    6. RNAdecayCafe_database_v1.rds: compressed RDS file that stores a list containing the above 4 tables in the following entries:
      1. kdegs = RateConstants_gene_v1.csv
      2. sample_metadata = SampleDetails_v1.csv
      3. feature_metadata = FeatureDetails_v1.csv
      4. average_kdegs = AvgKdegs_gene_v1.csv
    7. RNAdecayCafe_v1_onetable.csv: An inner join of all tables in this database except the averages table; thus, it is one mega-table containing all sample-specific estimates, sample metadata, and feature information.
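
    The column definitions above (for both the per-sample RateConstants table and the cell-line averages) reduce to a small amount of arithmetic. The following is a minimal sketch, not code shipped with the repository, of how the derived quantities relate and how a per-gene, per-cell-line weighted average of log(kdeg) could be formed; the uncertainty column name (se_log_kdeg) and the inverse-variance form of the weighting are assumptions, since the tables describe weighting by uncertainty without fixing either.

    ```python
    import numpy as np
    import pandas as pd

    rates = pd.read_csv("RateConstants_gene_v1.csv")
    samples = pd.read_csv("SampleDetails_v1.csv")

    # Derived columns, matching the definitions above.
    rates["kdeg"] = np.exp(rates["log_kdeg"])       # kdeg = e^log_kdeg
    rates["halflife"] = np.log(2) / rates["kdeg"]   # halflife = log(2)/kdeg

    # Attach cell_line so averages can be taken per gene and cell line.
    merged = rates.merge(samples[["sample", "cell_line"]], on="sample")

    def weighted_avg_log_kdeg(group):
        """Inverse-variance weighted mean of log(kdeg); one plausible reading
        of 'weighting by the uncertainty in the log(kdeg) estimate'."""
        if "se_log_kdeg" in group:                  # hypothetical uncertainty column
            w = 1.0 / np.square(group["se_log_kdeg"])
        else:
            w = np.ones(len(group))                 # fall back to an unweighted mean
        return np.average(group["log_kdeg"], weights=w)

    avg = (merged.groupby(["feature_ID", "cell_line"])
                 .apply(weighted_avg_log_kdeg)
                 .rename("avg_log_kdeg")
                 .reset_index())
    avg["avg_kdeg"] = np.exp(avg["avg_log_kdeg"])
    avg["avg_halflife"] = np.log(2) / avg["avg_kdeg"]
    ```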

    Datasets included:

    1. Finkel et al. 2021 (Calu3 cells; PMID: 35313595)
    2. Harada et al. 2022 (MV411 cells; PMID: 35301220)
    3. Ietswaart et al. 2024 (K562 cells; PMID: 38964322); whole-cell data used
    4. Luo et al. 2020 (HEK293T cells; PMID: 33357462)
    5. Mabin et al. 2025 (HEK293T cells; PMID: 40161772); the only dataset for which the data are not yet publicly available

  15. CAncer bioMarker Prediction Pipeline (CAMPP)—A standardized framework for...

    • plos.figshare.com
    pdf
    Updated Jun 1, 2023
    Thilde Terkelsen; Anders Krogh; Elena Papaleo (2023). CAncer bioMarker Prediction Pipeline (CAMPP)—A standardized framework for the analysis of quantitative biological data [Dataset]. http://doi.org/10.1371/journal.pcbi.1007665
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Thilde Terkelsen; Anders Krogh; Elena Papaleo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    With the improvement of -omics and next-generation sequencing (NGS) methodologies, along with the lowered cost of generating these types of data, the analysis of high-throughput biological data has become standard both for forming and testing biomedical hypotheses. Our knowledge of how to normalize datasets to remove latent undesirable variances has grown extensively, making for standardized data that are easily compared between studies. Here we present the CAncer bioMarker Prediction Pipeline (CAMPP), an open-source R-based wrapper (https://github.com/ELELAB/CAncer-bioMarker-Prediction-Pipeline-CAMPP) intended to aid bioinformatics software users with data analyses. CAMPP is called from a terminal command line and is supported by a user-friendly manual. The pipeline may be run on a local computer and requires little or no knowledge of programming. To avoid issues relating to R-package updates, an renv.lock file is provided to ensure R-package stability. Data management includes missing-value imputation, data normalization, and distributional checks. CAMPP performs (I) k-means clustering, (II) differential expression/abundance analysis, (III) elastic-net regression, (IV) correlation and co-expression network analyses, (V) survival analysis, and (VI) protein-protein/miRNA-gene interaction networks. The pipeline returns tabular files and graphical representations of the results. We hope that CAMPP will assist in streamlining bioinformatic analysis of quantitative biological data, whilst ensuring an appropriate bio-statistical framework.

  16. Zurich Summer Dataset

    • zenodo.org
    zip
    Updated Jan 31, 2022
    Vittorio Ferrari; Michele Volpi (2022). Zurich Summer Dataset [Dataset]. http://doi.org/10.5281/zenodo.5914759
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 31, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Vittorio Ferrari; Michele Volpi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Zürich
    Description

    The "Zurich Summer v1.0" dataset is a collection of 20 chips (crops), taken from a QuickBird acquisition of the city of Zurich (Switzerland) in August 2002. QuickBird images are composed by 4 channels (NIR-R-G-B) and were pansharpened to the PAN resolution of about 0.62 cm GSD. We manually annotated 8 different urban and periurban classes : Roads, Buildings, Trees, Grass, Bare Soil, Water, Railways and Swimming pools. The cumulative number of class samples is highly unbalanced, to reflect real world situations. Note that annotations are not perfect, are not ultradense (not every pixel is annotated) and there might be some errors as well. We performed annotations by jointly selecting superpixels (SLIC) and drawing (freehand) over regions which we could confidently assign an object class.

    The dataset is composed of 20 image / ground-truth pairs in GeoTIFF format. Images are distributed as raw DN values. We provide a rough-and-ready MATLAB script (preprocess.m) to:

    i) extract basic statistics from the images (min, max, mean, and average std), which should be used to globally normalize the data (note that the class distribution across chips is highly uneven, so single-frame normalization would shift the class distributions); a sketch of this global normalization is given after this list.

    ii) Visualize raw DN images (with unsaturated values) and a corresponding stretched version (good for illustration purposes). It also saves a raw and adjusted image version in MATLAB format (.mat) in a local subfolder.

    iii) Convert RGB annotations to an index mask (CLASS ∈ {1, ..., C}) (via the provided rgb2label.m).

    iv) Convert an index mask back to georeferenced RGB annotations (via the provided rgb2label.m). Useful if you want to see the final maps of the tiles in GIS software (coordinate system copied from the original GeoTIFFs).
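
    A minimal Python sketch of the global normalization recommended above (the repository itself ships preprocess.m in MATLAB). The zh*.tif file names, the tifffile reader, and the z-score form of the normalization are illustrative assumptions; the key point is that statistics are pooled over all chips rather than computed per frame.

    ```python
    import numpy as np
    import tifffile

    # Load all 20 chips; the zhN.tif naming is an assumption for illustration.
    chips = [tifffile.imread(f"zh{i}.tif").astype(np.float64) for i in range(1, 21)]

    # Pool per-channel statistics over every chip (each chip is H x W x 4, NIR-R-G-B).
    pixels = np.concatenate([c.reshape(-1, c.shape[-1]) for c in chips], axis=0)
    global_mean = pixels.mean(axis=0)
    global_std = pixels.std(axis=0)

    # Standardize every chip with the same global statistics (one possible choice;
    # min-max scaling with the global min/max would work analogously).
    normalized = [(c - global_mean) / global_std for c in chips]
    ```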

    Some requests from you

    We encourage researchers to report the ID of images used for training / validation / test (e.g. train: zh1 to zh7, validation zh8 to zh12 and test zh13 to zh20). The purpose of distributing datasets is to encourage reproducibility of experiments.

    Acknowledgements

    We release this data following a kind agreement with DigitalGlobe, Co. The data can be redistributed freely, provided that this document and the corresponding license are part of the distribution. Ideally, since the dataset could be updated over time, I suggest distributing the dataset via the official link from which this archive was downloaded.

    We would like to thank (a lot) Nathan Longbotham @ DigitalGlobe and the whole DG team for their help in granting permission to distribute the dataset.

    We release this dataset hoping that it will help researchers working on semantic classification / segmentation of remote sensing data, both in comparing against other state-of-the-art methods that use this dataset and in testing models on a larger and more complete set of images than most benchmarks available in our community. As you can imagine, preparing everything has been tedious work. Just for you.

    If you are using the data please cite the following work

  17. Predictive Validity Data Set

    • figshare.com
    txt
    Updated Dec 18, 2022
    Antonio Abeyta (2022). Predictive Validity Data Set [Dataset]. http://doi.org/10.6084/m9.figshare.17030021.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Dec 18, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Antonio Abeyta
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Verbal and Quantitative Reasoning GRE scores and percentiles were collected by querying the student database for the appropriate information. Any student records that were missing data, such as GRE scores or grade point average, were removed from the study before the data were analyzed. The GRE scores of entering doctoral students from 2007-2012 were collected and analyzed; a total of 528 student records were reviewed. Ninety-six records were removed from the data because of a lack of GRE scores. Thirty-nine of these records belonged to MD/PhD applicants, who were not required to take the GRE to be reviewed for admission. Fifty-seven more records were removed because they did not have an admissions committee score in the database. After 2011, the GRE's scoring system was changed from a scale of 200-800 points per section to 130-170 points per section; as a result, 12 more records were removed because their scores were reported on the new scoring system and therefore could not be compared to the older scores on a raw-score basis. After removal of these 96 records from our analyses, a total of 420 student records remained, which included students who were currently enrolled, left the doctoral program without a degree, or left the doctoral program with an MS degree. To maintain consistency among the participants, we removed 100 additional records so that our analyses only considered students who had graduated with a doctoral degree. In addition, thirty-nine admissions scores were identified as outliers by statistical analysis software and removed, for a final data set of 286 (see Outliers below).

    Outliers
    We used the automated ROUT method included in the PRISM software to test the data for the presence of outliers, which could skew our results. The false discovery rate for outlier detection (Q) was set to 1%. After removing the 96 students without a GRE score, 432 students were reviewed for the presence of outliers. ROUT detected 39 outliers that were removed before statistical analysis was performed.

    Sample
    See the detailed description in the Participants section. Linear regression analysis was used to examine potential trends between GRE scores, GRE percentiles, normalized admissions scores, or GPA and outcomes between selected student groups. The D'Agostino & Pearson omnibus and Shapiro-Wilk normality tests were used to test outcomes in the sample for normality. The Pearson correlation coefficient was calculated to determine the relationship between GRE scores, GRE percentiles, admissions scores, or GPA (undergraduate and graduate) and time to degree. Candidacy exam results were divided into students who either passed or failed the exam; a Mann-Whitney test was then used to test for statistically significant differences in mean GRE scores, percentiles, and undergraduate GPA between candidacy exam results. Other variables such as gender, race, ethnicity, and citizenship status were also observed within the samples. (A brief sketch of these tests is given after this description.)

    Predictive Metrics
    The input variables used in this study were GPA and the scores and percentiles of applicants on both the Quantitative and Verbal Reasoning GRE sections. GRE scores and percentiles were examined to normalize variances that could occur between tests.

    Performance Metrics
    The output variables used in the statistical analyses of each data set were either the amount of time it took for each student to earn their doctoral degree, or the student's candidacy examination result.
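
    The workflow above maps onto standard library calls. Purely as an illustration, and with hypothetical column and file names (the study's underlying records are not distributed with this entry), the two central tests could look like this:

    ```python
    import pandas as pd
    from scipy import stats

    df = pd.read_csv("doctoral_records.csv")  # hypothetical file

    # Pearson correlation between GRE quantitative score and time to degree.
    r, p_corr = stats.pearsonr(df["gre_quant"], df["time_to_degree_years"])

    # Mann-Whitney U test comparing GRE scores of students who passed vs. failed
    # the candidacy exam.
    passed = df.loc[df["passed_candidacy"], "gre_quant"]
    failed = df.loc[~df["passed_candidacy"], "gre_quant"]
    u, p_mw = stats.mannwhitneyu(passed, failed)

    print(f"Pearson r = {r:.2f} (p = {p_corr:.3g}); Mann-Whitney U = {u:.0f} (p = {p_mw:.3g})")
    ```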

  18. DISASTER-SWARM EDGE SIMULATED_DATASAET

    • kaggle.com
    zip
    Updated Sep 4, 2025
    RAHUL BALAJI (2025). DISASTER-SWARM EDGE SIMULATED_DATASAET [Dataset]. https://www.kaggle.com/datasets/rahul13289/disaster-swarm-edge-simulated-datasaet/discussion
    Explore at:
    Available download formats: zip (98,530 bytes)
    Dataset updated
    Sep 4, 2025
    Authors
    RAHUL BALAJI
    Description

    Text (SOS Messages)

    • Clean, tokenize, embed using BERT/DistilBERT.

    • Output: vector embedding (e.g., [768]).

    IoT Sensor Data (Time Series)

    • Normalize values (0–1).

    • Extract features (anomaly score, mean, variance, FFT features); a sketch of this step is given after this section.

    • Output: vector embedding (e.g., [64]).

    Images (Aerial/Drone)

    • Use CNN (YOLOv8, ResNet, EfficientNet) for feature extraction.

    • Output: feature map → vector (e.g., [1024]).
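
    As a rough sketch of the IoT sensor branch described above: values are min-max normalized to 0-1 and then summarized into a short feature vector. The window length, the z-score-style anomaly score, and the number of FFT terms are illustrative choices rather than part of the dataset's documentation.

    ```python
    import numpy as np

    def sensor_features(window: np.ndarray, n_fft: int = 8) -> np.ndarray:
        """Turn one 1-D sensor window into a fixed-length feature vector."""
        lo, hi = window.min(), window.max()
        norm = (window - lo) / (hi - lo + 1e-9)        # normalize values to 0-1

        mean, var = norm.mean(), norm.var()
        anomaly = np.abs(norm - mean).max() / (np.sqrt(var) + 1e-9)  # crude anomaly score

        fft_mag = np.abs(np.fft.rfft(norm))[:n_fft]    # leading FFT magnitudes
        return np.concatenate([[mean, var, anomaly], fft_mag])

    rng = np.random.default_rng(0)
    window = rng.normal(size=256)           # stand-in for one real sensor window
    features = sensor_features(window)      # ~11 values; pad or project to [64] as needed
    ```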

  19. Data from: An automatic gain control circuit to improve ECG acquisition

    • scielo.figshare.com
    jpeg
    Updated Jun 2, 2023
    Marco Rovetta; João Fernando Refosco Baggio; Raimes Moraes (2023). An automatic gain control circuit to improve ECG acquisition [Dataset]. http://doi.org/10.6084/m9.figshare.5668756.v1
    Explore at:
    Available download formats: jpeg
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    SciELO (http://www.scielo.org/)
    Authors
    Marco Rovetta; João Fernando Refosco Baggio; Raimes Moraes
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract Introduction Long-term electrocardiogram (ECG) recordings are widely employed to assist the diagnosis of cardiac and sleep disorders. However, variability of the ECG amplitude during the recordings hampers the detection of QRS complexes by algorithms. This work presents a simple electronic circuit to automatically normalize the ECG amplitude, improving its sampling by analog-to-digital converters (ADCs). Methods The proposed circuit consists of an analog divider that normalizes the ECG amplitude using its absolute peak value as reference. The reference value is obtained by means of a full-wave rectifier and a peak voltage detector. The circuit and the tasks of its different stages are described. Results An example of the circuit's performance for a bradycardia ECG signal (40 bpm) is presented: the signal has its amplitude suddenly halved and later restored. After the amplitude drop, the signal is automatically normalized within 5 heartbeats; after the amplitude increase, it is normalized promptly. Conclusion The proposed circuit adjusts the ECG amplitude to the input voltage range of the ADC, avoiding signal-to-noise ratio degradation of the sampled waveform and allowing better performance of processing algorithms. (A digital sketch of this normalization principle is given below.)
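
    The normalization principle (divide the signal by a slowly updated estimate of its own peak) can also be illustrated in software. The sketch below is a digital analogue of the analog stages described in the abstract, not the circuit itself; the decay constant and the toy input are arbitrary illustrative choices.

    ```python
    import numpy as np

    def agc_normalize(ecg: np.ndarray, decay: float = 0.999) -> np.ndarray:
        """Divide each sample by a slowly decaying running estimate of the peak."""
        peak = 1e-6
        out = np.empty_like(ecg, dtype=float)
        for i, x in enumerate(ecg):
            rectified = abs(x)                    # full-wave rectification
            peak = max(rectified, peak * decay)   # peak detector with slow discharge
            out[i] = x / peak                     # divider: output stays roughly within +/-1
        return out

    # Toy input: a sine "ECG" whose amplitude is suddenly halved halfway through.
    t = np.linspace(0, 10, 5000)
    ecg = np.sin(2 * np.pi * t) * np.where(t < 5, 1.0, 0.5)
    normalized = agc_normalize(ecg)
    ```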

  20. Student Academic Performance (Synthetic Dataset)

    • kaggle.com
    zip
    Updated Oct 10, 2025
    Mamun Hasan (2025). Student Academic Performance (Synthetic Dataset) [Dataset]. https://www.kaggle.com/datasets/mamunhasan2cs/student-academic-performance-synthetic-dataset
    Explore at:
    Available download formats: zip (9,287 bytes)
    Dataset updated
    Oct 10, 2025
    Authors
    Mamun Hasan
    License

    https://creativecommons.org/publicdomain/zero/1.0/ (CC0 1.0 Public Domain Dedication)

    Description

    This dataset is a synthetic collection of student performance data created for data preprocessing, cleaning, and analysis practice in Data Mining and Machine Learning courses. It contains information about 1,020 students, including their study habits, attendance, and test performance, with intentionally introduced missing values, duplicates, and outliers to simulate real-world data issues.

    The dataset is suitable for laboratory exercises, assignments, and demonstration of key preprocessing techniques such as:

    • Handling missing values
    • Removing duplicates
    • Detecting and treating outliers
    • Data normalization and transformation
    • Encoding categorical variables
    • Exploratory data analysis (EDA)
    • Regression Analysis

    📊 Columns Description

    Column Name | Description
    Student_ID | Unique identifier for each student (e.g., S0001, S0002, …)
    Age | Age of the student (between 18 and 25 years)
    Gender | Gender of the student (Male/Female)
    Study_Hours | Average number of study hours per day (contains missing values and outliers)
    Attendance(%) | Percentage of class attendance (contains missing values)
    Test_Score | Final exam score (0–100 scale)
    Grade | Letter grade derived from test scores (F, C, B, A, A+)

    🧠 Example Lab Tasks Using This Dataset:

    • Identify and impute missing values using mean/median.
    • Detect and remove duplicate records.
    • Use IQR or Z-score methods to handle outliers.
    • Normalize Study_Hours and Test_Score using Min-Max scaling.
    • Encode categorical variables (Gender, Grade) for model input.
    • Prepare a clean dataset ready for classification/regression analysis.
    • Can be used for limited regression analysis (a preprocessing sketch follows this list)
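
    A brief sketch of the lab tasks above, assuming a file named student_academic_performance.csv (the actual file name inside the Kaggle archive may differ); column names follow the table in this description.

    ```python
    import pandas as pd

    df = pd.read_csv("student_academic_performance.csv")

    # Impute missing values and drop duplicate records.
    df["Study_Hours"] = df["Study_Hours"].fillna(df["Study_Hours"].median())
    df["Attendance(%)"] = df["Attendance(%)"].fillna(df["Attendance(%)"].mean())
    df = df.drop_duplicates()

    # Treat Study_Hours outliers with the IQR rule.
    q1, q3 = df["Study_Hours"].quantile([0.25, 0.75])
    iqr = q3 - q1
    df = df[df["Study_Hours"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

    # Min-Max scale the numeric scores and encode the categorical columns.
    for col in ["Study_Hours", "Test_Score"]:
        df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())
    df = pd.get_dummies(df, columns=["Gender", "Grade"], drop_first=True)
    ```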

    🎯 Possible Regression Targets

    Test_Score → Predict test score based on study hours, attendance, age, and gender.

    🧩 Example Regression Problem

    Predict the student’s test score using their study hours, attendance percentage, and age.

    🧠 Sample Features: X = ['Age', 'Gender', 'Study_Hours', 'Attendance(%)'] y = ['Test_Score']

    You can use:

    • Linear Regression (for simplicity)
    • Polynomial Regression (to explore nonlinear patterns)
    • Decision Tree Regressor or Random Forest Regressor

    And analyze feature influence using correlation or SHAP/LIME explainability.
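
    One way to set up the regression problem itself with scikit-learn, again as a sketch under the same file-name assumption rather than an official solution:

    ```python
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("student_academic_performance.csv")
    df = df.dropna(subset=["Age", "Study_Hours", "Attendance(%)", "Test_Score"])
    df = pd.get_dummies(df, columns=["Gender"], drop_first=True)   # encode Gender

    features = ["Age", "Study_Hours", "Attendance(%)"] + [c for c in df.columns if c.startswith("Gender_")]
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df["Test_Score"], test_size=0.2, random_state=42)

    for model in (LinearRegression(), RandomForestRegressor(random_state=42)):
        model.fit(X_train, y_train)
        print(type(model).__name__, "test R^2:", round(r2_score(y_test, model.predict(X_test)), 3))
    ```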
