100+ datasets found

c
Data from: LVMED: Dataset of Latvian text normalisation samples for the...
repository.clarin.lv
Updated May 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Viesturs Jūlijs Lasmanis; Normunds Grūzītis (2023). LVMED: Dataset of Latvian text normalisation samples for the medical domain [Dataset]. https://repository.clarin.lv/repository/xmlui/handle/20.500.12574/85
Explore at:
Dataset updated
May 30, 2023
Authors
Viesturs Jūlijs Lasmanis; Normunds Grūzītis
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
The CSV dataset contains sentence pairs for a text-to-text transformation task: given a sentence that contains 0..n abbreviations, rewrite (normalize) the sentence in full words (word forms).

Training dataset: 64,665 sentence pairs Validation dataset: 7,185 sentence pairs. Testing dataset: 7,984 sentence pairs.

All sentences are extracted from a public web corpus (https://korpuss.lv/id/Tīmeklis2020) and contain at least one medical term.
d
Data from: A systematic evaluation of normalization methods and probe...
dataone.org
data.niaid.nih.gov
+2more
Updated May 1, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
H. Welsh; C. M. P. F. Batalha; W. Li; K. L. Mpye; N. C. Souza-Pinto; M. S. Naslavsky; E. J. Parra (2025). A systematic evaluation of normalization methods and probe replicability using infinium EPIC methylation data [Dataset]. http://doi.org/10.5061/dryad.cnp5hqc7v
Explore at:
Unique identifier
https://doi.org/10.5061/dryad.cnp5hqc7v
Dataset updated
May 1, 2025
Dataset provided by
Dryad Digital Repository
Authors
H. Welsh; C. M. P. F. Batalha; W. Li; K. L. Mpye; N. C. Souza-Pinto; M. S. Naslavsky; E. J. Parra
Time period covered
Jan 1, 2022
Description
Background The Infinium EPIC array measures the methylation status ofâ€‰>â€‰850,000 CpG sites. The EPIC BeadChip uses a two-array design: Infinium Type I and Type II probes. These probe types exhibit different technical characteristics which may confound analyses. Numerous normalization and pre-processing methods have been developed to reduce probe type bias as well as other issues such as background and dye bias.
Methods This study evaluates the performance of various normalization methods using 16 replicated samples and three metrics: absolute beta-value difference, overlap of non-replicated CpGs between replicate pairs, and effect on beta-value distributions. Additionally, we carried out Pearsonâ€™s correlation and intraclass correlation coefficient (ICC) analyses using both raw and SeSAMe 2 normalized data.Â
Results The method we define as SeSAMe 2, which consists of the application of the regular SeSAMe pipeline with an additional round of QC, pOOBAH masking, was found to be the b...
A comparison of per sample global scaling and per gene normalization methods...
plos.figshare.com
pdf
Updated Jun 5, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Xiaohong Li; Guy N. Brock; Eric C. Rouchka; Nigel G. F. Cooper; Dongfeng Wu; Timothy E. O’Toole; Ryan S. Gill; Abdallah M. Eteleeb; Liz O’Brien; Shesh N. Rai (2023). A comparison of per sample global scaling and per gene normalization methods for differential expression analysis of RNA-seq data [Dataset]. http://doi.org/10.1371/journal.pone.0176185
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0176185
Dataset updated
Jun 5, 2023
Dataset provided by
PLOShttp://plos.org/
Authors
Xiaohong Li; Guy N. Brock; Eric C. Rouchka; Nigel G. F. Cooper; Dongfeng Wu; Timothy E. O’Toole; Ryan S. Gill; Abdallah M. Eteleeb; Liz O’Brien; Shesh N. Rai
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Normalization is an essential step with considerable impact on high-throughput RNA sequencing (RNA-seq) data analysis. Although there are numerous methods for read count normalization, it remains a challenge to choose an optimal method due to multiple factors contributing to read count variability that affects the overall sensitivity and specificity. In order to properly determine the most appropriate normalization methods, it is critical to compare the performance and shortcomings of a representative set of normalization routines based on different dataset characteristics. Therefore, we set out to evaluate the performance of the commonly used methods (DESeq, TMM-edgeR, FPKM-CuffDiff, TC, Med UQ and FQ) and two new methods we propose: Med-pgQ2 and UQ-pgQ2 (per-gene normalization after per-sample median or upper-quartile global scaling). Our per-gene normalization approach allows for comparisons between conditions based on similar count levels. Using the benchmark Microarray Quality Control Project (MAQC) and simulated datasets, we performed differential gene expression analysis to evaluate these methods. When evaluating MAQC2 with two replicates, we observed that Med-pgQ2 and UQ-pgQ2 achieved a slightly higher area under the Receiver Operating Characteristic Curve (AUC), a specificity rate > 85%, the detection power > 92% and an actual false discovery rate (FDR) under 0.06 given the nominal FDR (≤0.05). Although the top commonly used methods (DESeq and TMM-edgeR) yield a higher power (>93%) for MAQC2 data, they trade off with a reduced specificity (
Hospital Management System
kaggle.com
zip
Updated Jun 9, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Muhammad Shamoon Butt (2025). Hospital Management System [Dataset]. https://www.kaggle.com/mshamoonbutt/hospital-management-system
Explore at:
zip(1049391 bytes)Available download formats
Dataset updated
Jun 9, 2025
Authors
Muhammad Shamoon Butt
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
This Hospital Management System project features a fully normalized relational database designed to manage hospital data including patients, doctors, appointments, diagnoses, medications, and billing. The schema applies database normalization (1NF, 2NF, 3NF) to reduce redundancy and maintain data integrity, providing an efficient, scalable structure for healthcare data management. Included are SQL scripts to create tables and insert sample data, making it a useful resource for learning practical database design and normalization in a healthcare context.
GSE206848 Data Normalization and Subtype Analysis
kaggle.com
zip
Updated Nov 29, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dr. Nagendra (2025). GSE206848 Data Normalization and Subtype Analysis [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/gse206848-data-normalization-and-subtype-analysis
Explore at:
zip(2631363 bytes)Available download formats
Dataset updated
Nov 29, 2025
Authors
Dr. Nagendra
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Dataset for human osteoarthritis (OA) — microarray gene expression (Affymetrix GPL570) PMC +1

Contains expression data for 7 healthy control (normal) tissue samples and 7 osteoarthritis patient tissue samples from synovial / joint tissue. PMC +1

Pre-processed for normalization (background correction, log-transformation, normalization) to remove technical variation.

Suitable for downstream analyses: differential gene expression (normal vs OA), subtype- or phenotype-based classification, machine learning.

Can act as a validation dataset when combining with other GEO datasets to increase sample size or test reproducibility. SpringerLink +1

Useful for biomarker discovery, pathway enrichment analysis (e.g., GO, KEGG), immune infiltration analysis, and subtype analysis in osteoarthritis research.
f
Data from: proteiNorm – A User-Friendly Tool for Normalization and Analysis...
datasetcatalog.nlm.nih.gov
Updated Sep 30, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Byrd, Alicia K; Zafar, Maroof K; Graw, Stefan; Tang, Jillian; Byrum, Stephanie D; Peterson, Eric C.; Bolden, Chris (2020). proteiNorm – A User-Friendly Tool for Normalization and Analysis of TMT and Label-Free Protein Quantification [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000568582
Explore at:
Dataset updated
Sep 30, 2020
Authors
Byrd, Alicia K; Zafar, Maroof K; Graw, Stefan; Tang, Jillian; Byrum, Stephanie D; Peterson, Eric C.; Bolden, Chris
Description
The technological advances in mass spectrometry allow us to collect more comprehensive data with higher quality and increasing speed. With the rapidly increasing amount of data generated, the need for streamlining analyses becomes more apparent. Proteomics data is known to be often affected by systemic bias from unknown sources, and failing to adequately normalize the data can lead to erroneous conclusions. To allow researchers to easily evaluate and compare different normalization methods via a user-friendly interface, we have developed “proteiNorm”. The current implementation of proteiNorm accommodates preliminary filters on peptide and sample levels followed by an evaluation of several popular normalization methods and visualization of the missing value. The user then selects an adequate normalization method and one of the several imputation methods used for the subsequent comparison of different differential expression methods and estimation of statistical power. The application of proteiNorm and interpretation of its results are demonstrated on two tandem mass tag multiplex (TMT6plex and TMT10plex) and one label-free spike-in mass spectrometry example data set. The three data sets reveal how the normalization methods perform differently on different experimental designs and the need for evaluation of normalization methods for each mass spectrometry experiment. With proteiNorm, we provide a user-friendly tool to identify an adequate normalization method and to select an appropriate method for differential expression analysis.
English Wikipedia People Dataset
kaggle.com
zip
Updated Jul 31, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Wikimedia (2025). English Wikipedia People Dataset [Dataset]. https://www.kaggle.com/datasets/wikimedia-foundation/english-wikipedia-people-dataset
Explore at:
zip(4293465577 bytes)Available download formats
Dataset updated
Jul 31, 2025
Dataset provided by
Wikimedia Foundationhttp://www.wikimedia.org/
Authors
Wikimedia
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Summary

This dataset contains biographical information derived from articles on English Wikipedia as it stood in early June 2024. It was created as part of the Structured Contents initiative at Wikimedia Enterprise and is intended for evaluation and research use.

The beta sample dataset is a subset of the Structured Contents Snapshot focusing on people with infoboxes in EN wikipedia; outputted as json files (compressed in tar.gz).

We warmly welcome any feedback you have. Please share your thoughts, suggestions, and any issues you encounter on the discussion page for this dataset here on Kaggle.

Data Structure

File name: wme_people_infobox.tar.gz

Size of compressed file: 4.12 GB

Size of uncompressed file: 21.28 GB

Noteworthy Included Fields: - name - title of the article. - identifier - ID of the article. - image - main image representing the article's subject. - description - one-sentence description of the article for quick reference. - abstract - lead section, summarizing what the article is about. - infoboxes - parsed information from the side panel (infobox) on the Wikipedia article. - sections - parsed sections of the article, including links. Note: excludes other media/images, lists, tables and references or similar non-prose sections.

The Wikimedia Enterprise Data Dictionary explains all of the fields in this dataset.

Stats

Infoboxes - Compressed: 2GB - Uncompressed: 11GB

Infoboxes + sections + short description - Size of compressed file: 4.12 GB - Size of uncompressed file: 21.28 GB

Article analysis and filtering breakdown: - total # of articles analyzed: 6,940,949 - # people found with QID: 1,778,226 - # people found with Category: 158,996 - people found with Biography Project: 76,150 - Total # of people articles found: 2,013,372 - Total # people articles with infoboxes: 1,559,985 End stats - Total number of people articles in this dataset: 1,559,985 - that have a short description: 1,416,701 - that have an infobox: 1,559,985 - that have article sections: 1,559,921

This dataset includes 235,146 people articles that exist on Wikipedia but aren't yet tagged on Wikidata as instance of:human.

Maintenance and Support

This dataset was originally extracted from the Wikimedia Enterprise APIs on June 5, 2024. The information in this dataset may therefore be out of date. This dataset isn't being actively updated or maintained, and has been shared for community use and feedback. If you'd like to retrieve up-to-date Wikipedia articles or data from other Wikiprojects, get started with Wikimedia Enterprise's APIs

Initial Data Collection and Normalization

The dataset is built from the Wikimedia Enterprise HTML “snapshots”: https://enterprise.wikimedia.com/docs/snapshot/ and focuses on the Wikipedia article namespace (namespace 0 (main)).

Who are the source language producers?

Wikipedia is a human generated corpus of free knowledge, written, edited, and curated by a global community of editors since 2001. It is the largest and most accessed educational resource in history, accessed over 20 billion times by half a billion people each month. Wikipedia represents almost 25 years of work by its community; the creation, curation, and maintenance of millions of articles on distinct topics. This dataset includes the biographical contents of English Wikipedia language editions: English https://en.wikipedia.org/, written by the community.

Attribution

Terms and conditions

Wikimedia Enterprise provides this dataset under the assumption that downstream users will adhere to the relevant free culture licenses when the data is reused. In situations where attribution is required, reusers should identify the Wikimedia project from which the content was retrieved as the source of the content. Any attribution should adhere to Wikimedia’s trademark policy (available at https://foundation.wikimedia.org/wiki/Trademark_policy) and visual identity guidelines (ava...
f
File S1 - Normalization of RNA-Sequencing Data from Samples with Varying...
datasetcatalog.nlm.nih.gov
figshare.com
Updated Feb 25, 2014
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Collas, Philippe; Rognes, Torbjørn; Aanes, Håvard; Winata, Cecilia; Moen, Lars F.; Aleström, Peter; Østrup, Olga; Mathavan, Sinnakaruppan (2014). File S1 - Normalization of RNA-Sequencing Data from Samples with Varying mRNA Levels [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001266682
Explore at:
Dataset updated
Feb 25, 2014
Authors
Collas, Philippe; Rognes, Torbjørn; Aanes, Håvard; Winata, Cecilia; Moen, Lars F.; Aleström, Peter; Østrup, Olga; Mathavan, Sinnakaruppan
Description
Table S1 and Figures S1–S6. Table S1. List of primers. Forward and reverse primers used for qPCR. Figure S1. Changes in total and polyA+ RNA during development. a) Amount of total RNA per embryo at different developmental stages. b) Amount of polyA+ RNA per 100 embryos at different developmental stages. Vertical bars represent standard errors. Figure S2. The TMM scaling factor. a) The TMM scaling factor estimated using dataset 1 and 2. We observe very similar values. b) The TMM scaling factor obtained using the replicates in dataset 2. The TMM values are very reproducible. c) The TMM scale factor when RNA-seq data based on total RNA was used. Figure S3. Comparison of scales. We either square-root transformed or used that scales directly and compared the normalized fold-changes to RT-qPCR results. a) Transcripts with dynamic change pre-ZGA. b) Transcripts with decreased abundance post-ZGA. c) Transcripts with increased expression post-ZGA. Vertical bars represent standard deviations. Figure S4. Comparison of RT-qPCR results depending on RNA template (total or poly+ RNA) and primers (random or oligo(dT) primers) for setd3 (a), gtf2e2 (b) and yy1a (c). The increase pre-ZGA is dependent on template (setd3 and gtf2e2) and not primer type. Figure S5. Efficiency calibrated fold-changes for a subset of transcripts. Vertical bars represent standard deviations. Figure S6. Comparison normalization methods using dataset 2 for transcripts with decreased expression post-ZGA (a) and increased expression post-ZGA (b). Vertical bars represent standard deviations. (PDF)
Arabic OCR Project Dataset
kaggle.com
zip
Updated Nov 2, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yousef Gomaa (2025). Arabic OCR Project Dataset [Dataset]. https://www.kaggle.com/datasets/yousefgomaa43/arabic-ocr-project-dataset
Explore at:
zip(1873466285 bytes)Available download formats
Dataset updated
Nov 2, 2025
Authors
Yousef Gomaa
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Summary

Arabic handwritten paragraph dataset to be used for text normalization and generation using conditional deep generative models, such as:

Conditional Variational Autoencoder (CVAE)

Conditional Generative Adversarial Network (cGAN) (any GAN variant such as Pix2Pix, CycleGAN, or StyleGAN2)

Transformer-based generator (e.g., Vision Transformer with autoregressive decoding or text-to-image transformer)

Data Example

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F17351483%2Fe1f10b4e62e5186c26dbe1f6741e3bdc%2F43.jpg?generation=1761401307913748&alt=media" alt="43.jpg">

Usage Examples

1. Preprocessing & Data Analysis:

Dataset exploration and cleaning

Character/word-level segmentation and normalization

Data augmentation (e.g., rotation, distortion)

2. Model Implementation:

Model 1: Conditional Variational Autoencoder (CVAE)

Model 2: Conditional GAN (any variant such as Pix2Pix or StyleGAN2)

Model 3: Transformer-based handwriting generator

3. Evaluation:

Quantitative metrics:

FID (Fréchet Inception Distance)

SSIM (Structural Similarity Index)

PSNR (Peak Signal-to-Noise Ratio)

Qualitative comparison:

Visual quality and handwriting consistency

Accuracy in representing Arabic characters and diacritics
d
Data from: Evaluation of normalization procedures for oligonucleotide array...
catalog.data.gov
odgavaprod.ogopendata.com
+1more
Updated Sep 6, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
National Institutes of Health (2025). Evaluation of normalization procedures for oligonucleotide array data based on spiked cRNA controls [Dataset]. https://catalog.data.gov/dataset/evaluation-of-normalization-procedures-for-oligonucleotide-array-data-based-on-spiked-crna
Explore at:
Dataset updated
Sep 6, 2025
Dataset provided by
National Institutes of Health
Description
Background Affymetrix oligonucleotide arrays simultaneously measure the abundances of thousands of mRNAs in biological samples. Comparability of array results is necessary for the creation of large-scale gene expression databases. The standard strategy for normalizing oligonucleotide array readouts has practical drawbacks. We describe alternative normalization procedures for oligonucleotide arrays based on a common pool of known biotin-labeled cRNAs spiked into each hybridization. Results We first explore the conditions for validity of the 'constant mean assumption', the key assumption underlying current normalization methods. We introduce 'frequency normalization', a 'spike-in'-based normalization method which estimates array sensitivity, reduces background noise and allows comparison between array designs. This approach does not rely on the constant mean assumption and so can be effective in conditions where standard procedures fail. We also define 'scaled frequency', a hybrid normalization method relying on both spiked transcripts and the constant mean assumption while maintaining all other advantages of frequency normalization. We compare these two procedures to a standard global normalization method using experimental data. We also use simulated data to estimate accuracy and investigate the effects of noise. We find that scaled frequency is as reproducible and accurate as global normalization while offering several practical advantages. Conclusions Scaled frequency quantitation is a convenient, reproducible technique that performs as well as global normalization on serial experiments with the same array design, while offering several additional features. Specifically, the scaled-frequency method enables the comparison of expression measurements across different array designs, yields estimates of absolute message abundance in cRNA and determines the sensitivity of individual arrays.
S
Sample of Yidu-N7K data set
scidb.cn
Updated Aug 31, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zengtao Jiao (2021). Sample of Yidu-N7K data set [Dataset]. http://doi.org/10.11922/sciencedb.j00104.00095
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.11922/sciencedb.j00104.00095
Dataset updated
Aug 31, 2021
Dataset provided by
Science Data Bank
Authors
Zengtao Jiao
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
[instructions for use] 1. This data set is manually edited by Yidu cloud medicine according to the real medical record distribution; 2. This dataset is an example of the yidu-n7k dataset on openkg. Yidu-n7k dataset can only be used for academic research of natural language processing, not for commercial purposes. ———————————————— Yidu-n4k data set is derived from chip 2019 evaluation task 1, that is, the data set of "clinical terminology standardization task". The standardization of clinical terms is an indispensable task in medical statistics. Clinically, there are often hundreds of different ways to write about the same diagnosis, operation, medicine, examination, test and symptoms. The problem to be solved in Standardization (normalization) is to find the corresponding standard statement for various clinical statements. With the basis of terminology standardization, researchers can carry out subsequent statistical analysis of EMR. In essence, the task of clinical terminology standardization is also a kind of semantic similarity matching task. However, due to the diversity of original word expressions, a single matching model is difficult to achieve good results. Yidu cloud, a leading medical artificial intelligence technology company in the industry, is also the first Unicorn company to drive medical innovation solutions with data intelligence. With the mission of "data intelligence and green medical care" and the goal of "improving the relationship between human beings and diseases", Yidu cloud uses data artificial intelligence to help the government, hospitals and the whole industry fully tap the intelligent political and civil value of medical big data, and build a big data ecological platform for the medical industry that can cover the whole country, make overall utilization and unified access. Since its establishment in 2013, Yidu cloud has gathered world-renowned scientists and the best people in the professional field to form a strong talent team. The company has invested hundreds of millions of yuan in R & D and service system establishment every year, built a medical data intelligent platform with large data processing capacity, high data integrity and transparent development process, and has obtained more than dozens of software copyrights and national invention patents.
d
Data from: Accurate normalization of real-time quantitative RT-PCR data by...
catalog.data.gov
odgavaprod.ogopendata.com
+1more
Updated Sep 6, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
National Institutes of Health (2025). Accurate normalization of real-time quantitative RT-PCR data by geometric averaging of multiple internal control genes [Dataset]. https://catalog.data.gov/dataset/accurate-normalization-of-real-time-quantitative-rt-pcr-data-by-geometric-averaging-of-mul
Explore at:
Dataset updated
Sep 6, 2025
Dataset provided by
National Institutes of Health
Description
Using real-time reverse transcription PCR ten housekeeping genes from different abundance and functional classes in various human tissues were evaluated. The conventional use of a single gene for normalization leads to relatively large errors in a significant proportion of samples tested.
avila_dataset
kaggle.com
zip
Updated May 10, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
HRITABAN GHOSH (2022). avila_dataset [Dataset]. https://www.kaggle.com/datasets/hritaban02/avila-dataset
Explore at:
zip(604026 bytes)Available download formats
Dataset updated
May 10, 2022
Authors
HRITABAN GHOSH
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset is made from the Avila dataset obtained from the UCI Machine Learning Repository. Here is the description of the data from the above source:

Data Set Information:

Data have been normalized by using the Z-normalization method and divided into two data sets: a training set containing 10430 samples, and a test set containing the 10437 samples.

CLASS DISTRIBUTION (training set) A: 4286 B: 5 C: 103 D: 352 E: 1095 F: 1961 G: 446 H: 519 I: 831 W: 44 X: 522 Y: 266

Attribute Information:

F1: intercolumnar distance F2: upper margin F3: lower margin F4: exploitation F5: row number F6: modular ratio F7: interlinear spacing F8: weight F9: peak number F10: modular ratio/ interlinear spacing Class: A, B, C, D, E, F, G, H, I, W, X, Y
Sample dataset for the models trained and tested in the paper 'Can AI be...
zenodo.org
zip
Updated Aug 1, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Elena Tomasi; Elena Tomasi; Gabriele Franch; Gabriele Franch; Marco Cristoforetti; Marco Cristoforetti (2024). Sample dataset for the models trained and tested in the paper 'Can AI be enabled to dynamical downscaling? Training a Latent Diffusion Model to mimic km-scale COSMO-CLM downscaling of ERA5 over Italy' [Dataset]. http://doi.org/10.5281/zenodo.12934521
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.12934521
Dataset updated
Aug 1, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Elena Tomasi; Elena Tomasi; Gabriele Franch; Gabriele Franch; Marco Cristoforetti; Marco Cristoforetti
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
This repository contains a sample of the input data for the models of the preprint "Can AI be enabled to dynamical downscaling? Training a Latent Diffusion Model to mimic km-scale COSMO-CLM downscaling of ERA5 over Italy". It allows the user to test and train the models on a reduced dataset (45GB).

This sample dataset comprises ~3 years of normalized hourly data for both low-resolution predictors and high-resolution target variables. Data has been randomly picked from the whole dataset, from 2000 to 2020, with 70% of data coming from the original training dataset, 15% from the original validation dataset, and 15% from the original test dataset. Low-resolution data are preprocessed ERA5 data while high-resolution data are preprocessed VHR-REA CMCC data. Details on the performed preprocessing are available in the paper.

This sample dataset also includes files relative to metadata, static data, normalization, and plotting.

To use the data, clone the corresponding repository and unzip this zip file in the data folder.
Comparison of normalization approaches for gene expression studies completed...
plos.figshare.com
tiff
Updated Jun 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Farnoosh Abbas-Aghababazadeh; Qian Li; Brooke L. Fridley (2023). Comparison of normalization approaches for gene expression studies completed with high-throughput sequencing [Dataset]. http://doi.org/10.1371/journal.pone.0206312
Explore at:
tiffAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0206312
Dataset updated
Jun 1, 2023
Dataset provided by
PLOShttp://plos.org/
Authors
Farnoosh Abbas-Aghababazadeh; Qian Li; Brooke L. Fridley
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Normalization of RNA-Seq data has proven essential to ensure accurate inferences and replication of findings. Hence, various normalization methods have been proposed for various technical artifacts that can be present in high-throughput sequencing transcriptomic studies. In this study, we set out to compare the widely used library size normalization methods (UQ, TMM, and RLE) and across sample normalization methods (SVA, RUV, and PCA) for RNA-Seq data using publicly available data from The Cancer Genome Atlas (TCGA) cervical cancer study. Additionally, an extensive simulation study was completed to compare the performance of the across sample normalization methods in estimating technical artifacts. Lastly, we investigated the effect of reduction in degrees of freedom in the normalized data and their impact on downstream differential expression analysis results. Based on this study, the TMM and RLE library size normalization methods give similar results for CESC dataset. In addition, the simulated datasets results show that the SVA (“BE”) method outperforms the other methods (SVA “Leek”, PCA) by correctly estimating the number of latent artifacts. Moreover, ignoring the loss of degrees of freedom due to normalization results in an inflated type I error rates. We recommend adjusting not only for library size differences but also the assessment of known and unknown technical artifacts in the data, and if needed, complete across sample normalization. In addition, we suggest that one includes the known and estimated latent artifacts in the design matrix to correctly account for the loss in degrees of freedom, as opposed to completing the analysis on the post-processed normalized data.
f
DataSheet1_TimeNorm: a novel normalization method for time course microbiome...
datasetcatalog.nlm.nih.gov
frontiersin.figshare.com
Updated Sep 24, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
An, Lingling; Lu, Meng; Butt, Hamza; Luo, Qianwen; Du, Ruofei; Lytal, Nicholas; Jiang, Hongmei (2024). DataSheet1_TimeNorm: a novel normalization method for time course microbiome data.pdf [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001407445
Explore at:
Dataset updated
Sep 24, 2024
Authors
An, Lingling; Lu, Meng; Butt, Hamza; Luo, Qianwen; Du, Ruofei; Lytal, Nicholas; Jiang, Hongmei
Description
Metagenomic time-course studies provide valuable insights into the dynamics of microbial systems and have become increasingly popular alongside the reduction in costs of next-generation sequencing technologies. Normalization is a common but critical preprocessing step before proceeding with downstream analysis. To the best of our knowledge, currently there is no reported method to appropriately normalize microbial time-series data. We propose TimeNorm, a novel normalization method that considers the compositional property and time dependency in time-course microbiome data. It is the first method designed for normalizing time-series data within the same time point (intra-time normalization) and across time points (bridge normalization), separately. Intra-time normalization normalizes microbial samples under the same condition based on common dominant features. Bridge normalization detects and utilizes a group of most stable features across two adjacent time points for normalization. Through comprehensive simulation studies and application to a real study, we demonstrate that TimeNorm outperforms existing normalization methods and boosts the power of downstream differential abundance analysis.
m
An Extensive Dataset for the Heart Disease Classification System
data.mendeley.com
Updated Feb 15, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sozan S. Maghdid (2022). An Extensive Dataset for the Heart Disease Classification System [Dataset]. http://doi.org/10.17632/65gxgy2nmg.1
Explore at:
Unique identifier
https://doi.org/10.17632/65gxgy2nmg.1
Dataset updated
Feb 15, 2022
Authors
Sozan S. Maghdid
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Finding a good data source is the first step toward creating a database. Cardiovascular illnesses (CVDs) are the major cause of death worldwide. CVDs include coronary heart disease, cerebrovascular disease, rheumatic heart disease, and other heart and blood vessel problems. According to the World Health Organization, 17.9 million people die each year. Heart attacks and strokes account for more than four out of every five CVD deaths, with one-third of these deaths occurring before the age of 70 A comprehensive database for factors that contribute to a heart attack has been constructed , The main purpose here is to collect characteristics of Heart Attack or factors that contribute to it. As a result, a form is created to accomplish this. Microsoft Excel was used to create this form. Figure 1 depicts the form which It has nine fields, where eight fields for input fields and one field for output field. Age, gender, heart rate, systolic BP, diastolic BP, blood sugar, CK-MB, and Test-Troponin are representing the input fields, while the output field pertains to the presence of heart attack, which is divided into two categories (negative and positive).negative refers to the absence of a heart attack, while positive refers to the presence of a heart attack.Table 1 show the detailed information and max and min of values attributes for 1319 cases in the whole database.To confirm the validity of this data, we looked at the patient files in the hospital archive and compared them with the data stored in the laboratories system. On the other hand, we interviewed the patients and specialized doctors. Table 2 is a sample for 1320 cases, which shows 44 cases and the factors that lead to a heart attack in the whole database,After collecting this data, we checked the data if it has null values (invalid values) or if there was an error during data collection. The value is null if it is unknown. Null values necessitate special treatment. This value is used to indicate that the target isn’t a valid data element. When trying to retrieve data that isn't present, you can come across the keyword null in Processing. If you try to do arithmetic operations on a numeric column with one or more null values, the outcome will be null. An example of a null values processing is shown in Figure 2.The data used in this investigation were scaled between 0 and 1 to guarantee that all inputs and outputs received equal attention and to eliminate their dimensionality. Prior to the use of AI models, data normalization has two major advantages. The first is to avoid overshadowing qualities in smaller numeric ranges by employing attributes in larger numeric ranges. The second goal is to avoid any numerical problems throughout the process.After completion of the normalization process, we split the data set into two parts - training and test sets. In the test, we have utilized1060 for train 259 for testing Using the input and output variables, modeling was implemented.
d
Data from: Normalized Foraminiferal Data for Chincoteague Bay and the...
catalog.data.gov
datasets.ai
Updated Nov 12, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2025). Normalized Foraminiferal Data for Chincoteague Bay and the Marshes of Assateague Island and the Adjacent Vicinity, Maryland and Virginia-Spring 2014 [Dataset]. https://catalog.data.gov/dataset/normalized-foraminiferal-data-for-chincoteague-bay-and-the-marshes-of-assateague-island-an
Explore at:
Dataset updated
Nov 12, 2025
Dataset provided by
United States Geological Surveyhttp://www.usgs.gov/
Area covered
Assateague Island, Virginia, Maryland, Chincoteague Bay
Description
Foraminiferal samples were collected from Chincoteague Bay, Newport Bay, and Tom’s Cove as well as the marshes on the back-barrier side of Assateague Island and the Delmarva (Delaware-Maryland-Virginia) mainland by U.S. Geological Survey (USGS) researchers from the St. Petersburg Coastal and Marine Science Center in March, April (14CTB01), and October (14CTB02) 2014. Samples were also collected by the Woods Hole Coastal and Marine Science Center (WHCMSC) in July 2014 and shipped to the St. Petersburg office for processing. The dataset includes raw foraminiferal and normalized counts for the estuarine grab samples (G), terrestrial surface samples (S), and inner shelf grab samples (G). For further information regarding data collection and sample site coordinates, processing methods, or related datasets, please refer to USGS Data Series 1060 (https://doi.org/10.3133/ds1060), USGS Open-File Report 2015–1219 (https://doi.org/10.3133/ofr20151219), and USGS Open-File Report 2015-1169 (https://doi.org/10.3133/ofr20151169). Downloadable data are available as Excel spreadsheets, comma-separated values text files, and formal Federal Geographic Data Committee metadata.
E
Data from: Dataset of normalised Slovene text KonvNormSl 1.0
live.european-language-grid.eu
binary format
Updated Sep 18, 2016
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2016). Dataset of normalised Slovene text KonvNormSl 1.0 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/8217
Explore at:
binary formatAvailable download formats
Dataset updated
Sep 18, 2016
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Data used in the experiments described in:

Nikola Ljubešić, Katja Zupan, Darja Fišer and Tomaž Erjavec: Normalising Slovene data: historical texts vs. user-generated content. Proceedings of KONVENS 2016, September 19–21, 2016, Bochum, Germany.
https://www.linguistics.rub.de/konvens16/pub/19_konvensproc.pdf
(https://www.linguistics.rub.de/konvens16/)

Data are split into the "token" folder (experiment on normalising individual tokens) and "segment" folder (experiment on normalising whole segments of text, i.e. sentences or tweets). Each experiment folder contains the "train", "dev" and "test" subfolders. Each subfolder contains two files for each sample, the original data (.orig.txt) and the data with hand-normalised words (.norm.txt). The files are aligned by lines.

There are four datasets:
- goo300k-bohoric: historical Slovene, hard case (<1850)
- goo300k-gaj: historical Slovene, easy case (1850 - 1900)
- tweet-L3: Slovene tweets, hard case (non-standard language)
- tweet-L1: Slovene tweets, easy case (mostly standard language)

The goo300k data come from http://hdl.handle.net/11356/1025, while the tweet data originate from the JANES project (http://nl.ijs.si/janes/english/).

The text in the files has been split by inserting spaces between characters, with underscore (_) substituting the space character. Tokens not relevant for normalisation (e.g. URLs, hashtags) have been substituted by the inverted question mark '¿' character.
e
Isobaric matching between runs and novel PSM-level normalization in MaxQuant...
ebi.ac.uk
data-staging.niaid.nih.gov
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sung-Huan Yu, Isobaric matching between runs and novel PSM-level normalization in MaxQuant strongly improve reporter ion-based quantification - Peli1 knock-out mice dataset [Dataset]. https://www.ebi.ac.uk/pride/archive/projects/PXD019881
Explore at:
Authors
Sung-Huan Yu
Variables measured
Proteomics
Description
Isobaric labeling has the promise of combining high sample multiplexing with precise quantification. However, normalization issues and the missing value problem of complete n-plexes hamper quantification across more than one n-plex. Here we introduce two novel algorithms implemented in MaxQuant that substantially improve the data analysis with multiple n-plexes. First, isobaric matching between runs (IMBR) makes use of the three-dimensional MS1 features to transfer identifications from identified to unidentified MS/MS spectra between LC-MS runs in order to utilize reporter ion intensities in unidentified spectra for quantification. On typical datasets, we observe a significant gain in quantifiable n-plexesMS/MS spectra that can be used for quantification. Second, we introduce a novel PSM-level normalization, applicable to data with and without common reference channel. It is a weighted median-based method, in which the weights reflect the number of ions that were used for fragmentation. On a typical dataset, we observe complete removal of batch effects and dominance of the biological sample grouping after normalization. This dataset is one of the datasets used for the study. It is TMT 10-plex with a reference channel.

Facebook

Twitter

Click to copy link

Link copied

Cite

Viesturs Jūlijs Lasmanis; Normunds Grūzītis (2023). LVMED: Dataset of Latvian text normalisation samples for the medical domain [Dataset]. https://repository.clarin.lv/repository/xmlui/handle/20.500.12574/85

Data from: LVMED: Dataset of Latvian text normalisation samples for the medical domain

Explore at:

Dataset updated

May 30, 2023

Authors

Viesturs Jūlijs Lasmanis; Normunds Grūzītis

License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically

Description

The CSV dataset contains sentence pairs for a text-to-text transformation task: given a sentence that contains 0..n abbreviations, rewrite (normalize) the sentence in full words (word forms).

Training dataset: 64,665 sentence pairs Validation dataset: 7,185 sentence pairs. Testing dataset: 7,984 sentence pairs.

All sentences are extracted from a public web corpus (https://korpuss.lv/id/Tīmeklis2020) and contain at least one medical term.

Clear search

Close search

Google apps

Main menu

Data from: LVMED: Dataset of Latvian text normalisation samples for the...

Data from: A systematic evaluation of normalization methods and probe...

A comparison of per sample global scaling and per gene normalization methods...

Hospital Management System

GSE206848 Data Normalization and Subtype Analysis

Data from: proteiNorm – A User-Friendly Tool for Normalization and Analysis...

English Wikipedia People Dataset

Summary

Data Structure

Stats

Maintenance and Support

Initial Data Collection and Normalization

Who are the source language producers?

Attribution

File S1 - Normalization of RNA-Sequencing Data from Samples with Varying...

Arabic OCR Project Dataset

Summary

Data Example

Usage Examples

1. Preprocessing & Data Analysis:

2. Model Implementation:

3. Evaluation:

Quantitative metrics:

Qualitative comparison:

Data from: Evaluation of normalization procedures for oligonucleotide array...

Sample of Yidu-N7K data set

Data from: Accurate normalization of real-time quantitative RT-PCR data by...

avila_dataset

Sample dataset for the models trained and tested in the paper 'Can AI be...

Comparison of normalization approaches for gene expression studies completed...

DataSheet1_TimeNorm: a novel normalization method for time course microbiome...

An Extensive Dataset for the Heart Disease Classification System

Data from: Normalized Foraminiferal Data for Chincoteague Bay and the...

Data from: Dataset of normalised Slovene text KonvNormSl 1.0

Isobaric matching between runs and novel PSM-level normalization in MaxQuant...

Data from: LVMED: Dataset of Latvian text normalisation samples for the medical domain