24 datasets found
  1. Data from: A systematic evaluation of normalization methods and probe replicability using Infinium EPIC methylation data

    • data.niaid.nih.gov
    • dataone.org
    • +2 more
    zip
    Updated May 30, 2023
    Cite
    H. Welsh; C. M. P. F. Batalha; W. Li; K. L. Mpye; N. C. Souza-Pinto; M. S. Naslavsky; E. J. Parra (2023). A systematic evaluation of normalization methods and probe replicability using infinium EPIC methylation data [Dataset]. http://doi.org/10.5061/dryad.cnp5hqc7v
    Explore at:
    Available download formats: zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    Hospital for Sick Children
    Universidade de São Paulo
    University of Toronto
    Authors
    H. Welsh; C. M. P. F. Batalha; W. Li; K. L. Mpye; N. C. Souza-Pinto; M. S. Naslavsky; E. J. Parra
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Background The Infinium EPIC array measures the methylation status of > 850,000 CpG sites. The EPIC BeadChip uses a two-array design: Infinium Type I and Type II probes. These probe types exhibit different technical characteristics which may confound analyses. Numerous normalization and pre-processing methods have been developed to reduce probe type bias as well as other issues such as background and dye bias.
    Methods This study evaluates the performance of various normalization methods using 16 replicated samples and three metrics: absolute beta-value difference, overlap of non-replicated CpGs between replicate pairs, and effect on beta-value distributions. Additionally, we carried out Pearson’s correlation and intraclass correlation coefficient (ICC) analyses using both raw and SeSAMe 2 normalized data.
    Results The method we define as SeSAMe 2, which consists of applying the regular SeSAMe pipeline with an additional round of QC (pOOBAH masking), was found to be the best-performing normalization method, while quantile-based methods were found to be the worst performing. Whole-array Pearson’s correlations were found to be high. However, in agreement with previous studies, a substantial proportion of the probes on the EPIC array showed poor reproducibility (ICC < 0.50). The majority of poor-performing probes have beta values close to either 0 or 1, and relatively low standard deviations. These results suggest that poor probe reliability is largely the result of limited biological variation rather than technical measurement variation. Importantly, normalizing the data with SeSAMe 2 dramatically improved ICC estimates, with the proportion of probes with ICC values > 0.50 increasing from 45.18% (raw data) to 61.35% (SeSAMe 2).

    Methods

    Study Participants and Samples

    The whole blood samples were obtained from the Health, Well-being and Aging (Saúde, Bem-estar e Envelhecimento, SABE) study cohort. SABE is a cohort of census-drawn elderly residents of the city of São Paulo, Brazil, followed up every five years since the year 2000, with DNA first collected in 2010. Samples from 24 elderly adults were collected at two time points for a total of 48 samples. The first time point is the 2010 collection wave, performed from 2010 to 2012, and the second time point was set in 2020 as part of a COVID-19 monitoring project (9±0.71 years apart). The 24 individuals were 67.41±5.52 years of age (mean ± standard deviation) at time point one and 76.41±6.17 at time point two, and comprised 13 men and 11 women.

    All individuals enrolled in the SABE cohort provided written consent, and the ethics protocols were approved by local and national institutional review boards (COEP/FSP/USP OF.COEP/23/10, CONEP 2044/2014, CEP HIAE 1263-10, University of Toronto RIS 39685).

    Blood Collection and Processing

    Genomic DNA was extracted from whole peripheral blood samples collected in EDTA tubes. DNA extraction and purification followed the manufacturer’s recommended protocols, using the Qiagen AutoPure LS kit with Gentra automated extraction (first time point) or manual extraction (second time point) owing to discontinuation of the equipment, but using the same commercial reagents. DNA was quantified using a NanoDrop spectrophotometer and diluted to 50 ng/µL. To assess the reproducibility of the EPIC array, we also obtained technical replicates for 16 of the 48 samples, for a total of 64 samples submitted for further analyses. Whole Genome Sequencing data are also available for the samples described above.

    Characterization of DNA Methylation using the EPIC array

    Approximately 1,000ng of human genomic DNA was used for bisulphite conversion. Methylation status was evaluated using the MethylationEPIC array at The Centre for Applied Genomics (TCAG, Hospital for Sick Children, Toronto, Ontario, Canada), following protocols recommended by Illumina (San Diego, California, USA).

    Processing and Analysis of DNA Methylation Data

    The R/Bioconductor packages Meffil (version 1.1.0), RnBeads (version 2.6.0), minfi (version 1.34.0) and wateRmelon (version 1.32.0) were used to import, process and perform quality control (QC) analyses on the methylation data. Starting with the 64 samples, we first used Meffil to infer the sex of the 64 samples and compared the inferred sex to the reported sex. Utilizing the 59 SNP probes that are available as part of the EPIC array, we calculated concordance between the methylation intensities of the samples and the corresponding genotype calls extracted from their WGS data. We then performed comprehensive sample-level and probe-level QC using the RnBeads QC pipeline. Specifically, we (1) removed probes whose target sequences overlap with a SNP at any base, (2) removed known cross-reactive probes, (3) used the iterative Greedycut algorithm to filter out samples and probes, using a detection p-value threshold of 0.01, and (4) removed probes for which more than 5% of the samples had a missing value. Since RnBeads does not have a function to perform probe filtering based on bead number, we used the wateRmelon package to extract bead numbers from the IDAT files and calculated the proportion of samples with bead number < 3. Probes with more than 5% of samples having a low bead number (< 3) were removed. For the comparison of normalization methods, we also computed detection p-values using the empirical distribution of the out-of-band probes with the pOOBAH() function in the SeSAMe (version 1.14.2) R package, with a p-value threshold of 0.05 and the combine.neg parameter set to TRUE. In the scenario where pOOBAH filtering was carried out, it was done in parallel with the previously mentioned QC steps, and the sets of probes flagged by the two analyses were combined and removed from the data.
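
    The 5% thresholds described above amount to simple masking operations. The following Python sketch is an illustration only, with made-up arrays standing in for the IDAT-derived values; the actual pipeline uses the R packages named above (RnBeads, wateRmelon, SeSAMe), and the iterative Greedycut step is not reproduced by this simplified version.

      import numpy as np

      # Hypothetical probes x samples matrices standing in for IDAT-derived values.
      rng = np.random.default_rng(0)
      det_p = rng.uniform(0.0, 0.02, size=(1000, 64))   # detection p-values
      beads = rng.integers(1, 20, size=(1000, 64))      # bead counts

      # A probe is dropped if more than 5% of samples fail the detection p-value
      # threshold (0.01) or have a low bead count (< 3), mirroring the 5% rules above.
      frac_failed_detection = (det_p > 0.01).mean(axis=1)
      frac_low_beads = (beads < 3).mean(axis=1)
      keep = (frac_failed_detection <= 0.05) & (frac_low_beads <= 0.05)
      print(f"probes kept: {keep.sum()} of {keep.size}")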

    Normalization Methods Evaluated

    The normalization methods compared in this study were implemented using different R/Bioconductor packages and are summarized in Figure 1. All data were read into the R workspace as RG Channel Sets using minfi’s read.metharray.exp() function. One sample that was flagged during QC was removed, and further normalization steps were carried out on the remaining set of 63 samples. Prior to all normalizations with minfi, probes that did not pass QC were removed. Noob, SWAN, Quantile, Funnorm and Illumina normalizations were implemented using minfi. BMIQ normalization was implemented with ChAMP (version 2.26.0), using as input the raw data produced by minfi’s preprocessRaw() function. In the combination of Noob with BMIQ (Noob+BMIQ), BMIQ normalization was carried out using minfi’s Noob-normalized data as input. Noob normalization was also implemented with SeSAMe, using a nonlinear dye bias correction. For SeSAMe normalization, two scenarios were tested. For both, the inputs were unmasked SigDF Sets converted from minfi’s RG Channel Sets. In the first, which we call “SeSAMe 1”, SeSAMe’s pOOBAH masking was not executed, and the only probes filtered out of the dataset prior to normalization were the ones that did not pass QC in the previous analyses. In the second scenario, which we call “SeSAMe 2”, pOOBAH masking was carried out on the unfiltered dataset, and masked probes were removed. This removal was followed by further removal of probes that did not pass the previous QC and had not already been removed by pOOBAH; SeSAMe 2 therefore has two rounds of probe removal. Noob normalization with nonlinear dye bias correction was then carried out on the filtered dataset. Methods were then compared by subsetting the 16 replicated samples and evaluating the effect that the different normalization methods had on the absolute difference of beta values (|β|) between replicated samples.
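
    The replicate-based comparison at the end of the paragraph above can be expressed compactly. A minimal Python sketch, assuming a normalized beta-value matrix with probes in rows and samples in columns and placeholder replicate-pair indices (the real study uses 16 pairs):

      import numpy as np

      rng = np.random.default_rng(1)
      betas = rng.beta(0.5, 0.5, size=(5000, 63))   # placeholder normalized beta values
      replicate_pairs = [(0, 1), (2, 3), (4, 5)]    # placeholder column indices of replicate pairs

      # Per-probe absolute beta-value difference, averaged over replicate pairs;
      # lower values indicate better agreement between technical replicates.
      abs_diff = np.mean([np.abs(betas[:, i] - betas[:, j]) for i, j in replicate_pairs], axis=0)
      print("mean |delta beta| across probes:", round(abs_diff.mean(), 4))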

  2. MNIST Preprocessed

    • kaggle.com
    zip
    Updated Jul 24, 2019
    Cite
    Valentyn Sichkar (2019). MNIST Preprocessed [Dataset]. https://www.kaggle.com/valentynsichkar/mnist-preprocessed
    Explore at:
    Available download formats: zip (114752429 bytes)
    Dataset updated
    Jul 24, 2019
    Authors
    Valentyn Sichkar
    Description

    📰 Related Paper

    Sichkar V. N. Effect of various dimension convolutional layer filters on traffic sign classification accuracy. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2019, vol. 19, no. 3, pp. 546-552. DOI: 10.17586/2226-1494-2019-19-3-546-552 (full text available at ResearchGate.net/profile/Valentyn_Sichkar)

    Test it online with a custom handwritten digit here: https://valentynsichkar.name/mnist.html


    🎓 Related course for classification tasks

    Design, Train & Test deep CNN for Image Classification. Join the course & enjoy new opportunities to get deep learning skills: https://www.udemy.com/course/convolutional-neural-networks-for-image-classification/

    CNN Course slideshow: https://github.com/sichkar-valentyn/1-million-images-for-Traffic-Signs-Classification-tasks/blob/main/images/slideshow_classification.gif?raw=true


    🗺️ Concept Map of the Course

    Concept map: https://github.com/sichkar-valentyn/1-million-images-for-Traffic-Signs-Classification-tasks/blob/main/images/concept_map.png?raw=true


    👉 Join the Course

    https://www.udemy.com/course/convolutional-neural-networks-for-image-classification/


    Content

    This is ready-to-use preprocessed data saved into a pickle file.
    Preprocessing stages are as follows:
    - Normalizing the whole dataset by dividing by 255.0.
    - Splitting the data into three datasets: train, validation and test.
    - Normalizing the data by subtracting the mean image and dividing by the standard deviation.
    - Transposing every dataset so that channels come first.


    The mean image and standard deviation were calculated from the training dataset and applied to all three datasets.
    When using your own image for classification, it has to be preprocessed in the same way first: scaled to [0, 1], with the mean image subtracted and the result divided by the standard deviation.


    Data is written as a dictionary with the following keys:
    x_train: (59000, 1, 28, 28)
    y_train: (59000,)
    x_validation: (1000, 1, 28, 28)
    y_validation: (1000,)
    x_test: (1000, 1, 28, 28)
    y_test: (1000,)
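
    A minimal Python loading sketch based on the description above; the pickle filename is an assumption, and the preprocessing helper uses hypothetical mean-image and standard-deviation values (use the same statistics that were applied to the training data):

      import pickle
      import numpy as np

      # The dataset is described as a dictionary stored in a pickle file.
      with open("data.pickle", "rb") as f:          # filename is an assumption
          data = pickle.load(f)

      print(data["x_train"].shape, data["y_train"].shape)   # (59000, 1, 28, 28), (59000,)

      # A user-supplied image must be preprocessed the same way as the training data:
      # scale to [0, 1], subtract the mean image, divide by the standard deviation,
      # and keep the channels-first layout. mean_image and std here are hypothetical.
      def preprocess(image_28x28, mean_image, std):
          x = image_28x28.astype(np.float32).reshape(1, 1, 28, 28) / 255.0
          return (x - mean_image) / std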


    Contains pretrained weights model_params_ConvNet1.pickle for a model with the following architecture:
    Input --> Conv --> ReLU --> Pool --> Affine --> ReLU --> Affine --> Softmax


    Parameters:

    • Input is a 1-channel grayscale image.
    • The convolutional layer has 32 filters.
    • The pooling layer uses a 2×2 window (height = width = 2) with stride 2.
    • Number of hidden neurons is 500.
    • Number of output neurons is 10.


    The architecture can also be understood from this diagram:
    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3400968%2Fc23041248e82134b7d43ed94307b720e%2FModel_1_Architecture_MNIST.png?generation=1563654250901965&alt=media

    Acknowledgements

    The initial data is MNIST, which was collected by Yann LeCun, Corinna Cortes and Christopher J.C. Burges.

  3. Data from: Evaluation of normalization procedures for oligonucleotide array data based on spiked cRNA controls

    • catalog.data.gov
    • odgavaprod.ogopendata.com
    • +1 more
    Updated Sep 6, 2025
    Cite
    National Institutes of Health (2025). Evaluation of normalization procedures for oligonucleotide array data based on spiked cRNA controls [Dataset]. https://catalog.data.gov/dataset/evaluation-of-normalization-procedures-for-oligonucleotide-array-data-based-on-spiked-crna
    Explore at:
    Dataset updated
    Sep 6, 2025
    Dataset provided by
    National Institutes of Health
    Description

    Background Affymetrix oligonucleotide arrays simultaneously measure the abundances of thousands of mRNAs in biological samples. Comparability of array results is necessary for the creation of large-scale gene expression databases. The standard strategy for normalizing oligonucleotide array readouts has practical drawbacks. We describe alternative normalization procedures for oligonucleotide arrays based on a common pool of known biotin-labeled cRNAs spiked into each hybridization.

    Results We first explore the conditions for validity of the 'constant mean assumption', the key assumption underlying current normalization methods. We introduce 'frequency normalization', a 'spike-in'-based normalization method which estimates array sensitivity, reduces background noise and allows comparison between array designs. This approach does not rely on the constant mean assumption and so can be effective in conditions where standard procedures fail. We also define 'scaled frequency', a hybrid normalization method relying on both spiked transcripts and the constant mean assumption while maintaining all other advantages of frequency normalization. We compare these two procedures to a standard global normalization method using experimental data. We also use simulated data to estimate accuracy and investigate the effects of noise. We find that scaled frequency is as reproducible and accurate as global normalization while offering several practical advantages.

    Conclusions Scaled frequency quantitation is a convenient, reproducible technique that performs as well as global normalization on serial experiments with the same array design, while offering several additional features. Specifically, the scaled-frequency method enables the comparison of expression measurements across different array designs, yields estimates of absolute message abundance in cRNA and determines the sensitivity of individual arrays.

  4. CIFAR10 Preprocessed

    • kaggle.com
    zip
    Updated Jul 13, 2019
    Cite
    Valentyn Sichkar (2019). CIFAR10 Preprocessed [Dataset]. https://www.kaggle.com/datasets/valentynsichkar/cifar10-preprocessed/code
    Explore at:
    Available download formats: zip (1227571899 bytes)
    Dataset updated
    Jul 13, 2019
    Authors
    Valentyn Sichkar
    Description

    📰 Related Paper

    Sichkar V. N. Effect of various dimension convolutional layer filters on traffic sign classification accuracy. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2019, vol. 19, no. 3, pp. 546-552. DOI: 10.17586/2226-1494-2019-19-3-546-552 (full text available at ResearchGate.net/profile/Valentyn_Sichkar)

    Test it online with a custom image here: https://valentynsichkar.name/cifar10.html


    🎓 Related course for classification tasks

    Design, Train & Test deep CNN for Image Classification. Join the course & enjoy new opportunities to get deep learning skills: https://www.udemy.com/course/convolutional-neural-networks-for-image-classification/

    CNN Course slideshow: https://github.com/sichkar-valentyn/1-million-images-for-Traffic-Signs-Classification-tasks/blob/main/images/slideshow_classification.gif?raw=true


    🗺️ Concept Map of the Course

    Concept map: https://github.com/sichkar-valentyn/1-million-images-for-Traffic-Signs-Classification-tasks/blob/main/images/concept_map.png?raw=true


    👉 Join the Course

    https://www.udemy.com/course/convolutional-neural-networks-for-image-classification/


    Content

    This is ready-to-use preprocessed data saved into a pickle file.
    Preprocessing stages are as follows:
    - Normalizing the whole dataset by dividing by 255.0.
    - Splitting the data into three datasets: train, validation and test.
    - Normalizing the data by subtracting the mean image and dividing by the standard deviation.
    - Transposing every dataset so that channels come first.


    The mean image and standard deviation were calculated from the training dataset and applied to all three datasets.
    When using your own image for classification, it has to be preprocessed in the same way first: scaled to [0, 1], with the mean image subtracted and the result divided by the standard deviation.


    Data is written as a dictionary with the following keys:
    x_train: (49000, 3, 32, 32)
    y_train: (49000,)
    x_validation: (1000, 3, 32, 32)
    y_validation: (1000,)
    x_test: (1000, 3, 32, 32)
    y_test: (1000,)
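
    As with the MNIST package above, a minimal Python loading sketch (the pickle filename is an assumption):

      import pickle
      import numpy as np

      with open("data.pickle", "rb") as f:          # filename is an assumption
          data = pickle.load(f)

      x_train, y_train = data["x_train"], data["y_train"]
      print(x_train.shape, y_train.shape)           # (49000, 3, 32, 32), (49000,)

      # The arrays are channels-first; transpose a single example back to
      # (height, width, channels) if a plotting library expects channels-last.
      first_image_hwc = np.transpose(x_train[0], (1, 2, 0))
      print(first_image_hwc.shape)                  # (32, 32, 3)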


    Contains pretrained weights model_params_ConvNet1.pickle for a model with the following architecture:
    Input --> Conv --> ReLU --> Pool --> Affine --> ReLU --> Affine --> Softmax


    Parameters:

    • Input is a 3-channel RGB image.
    • The convolutional layer has 32 filters.
    • The pooling layer uses a 2×2 window (height = width = 2) with stride 2.
    • Number of hidden neurons is 500.
    • Number of output neurons is 10.


    The architecture can also be understood from this diagram:
    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3400968%2F5d50bf46a9494d60016759b4690e6662%2FModel_1_Architecture.png?generation=1563650302359604&alt=media

    Acknowledgements

    The initial data is CIFAR-10, which was collected by Alex Krizhevsky, Vinod Nair and Geoffrey Hinton.

  5. Table_2_Comparison of Normalization Methods for Analysis of TempO-Seq Targeted RNA Sequencing Data.xlsx

    • figshare.com
    • datasetcatalog.nlm.nih.gov
    • +1 more
    xlsx
    Updated Jun 2, 2023
    + more versions
    Cite
    Pierre R. Bushel; Stephen S. Ferguson; Sreenivasa C. Ramaiahgari; Richard S. Paules; Scott S. Auerbach (2023). Table_2_Comparison of Normalization Methods for Analysis of TempO-Seq Targeted RNA Sequencing Data.xlsx [Dataset]. http://doi.org/10.3389/fgene.2020.00594.s002
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    Frontiers
    Authors
    Pierre R. Bushel; Stephen S. Ferguson; Sreenivasa C. Ramaiahgari; Richard S. Paules; Scott S. Auerbach
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of bulk RNA sequencing (RNA-Seq) data is a valuable tool to understand transcription at the genome scale. Targeted sequencing of RNA has emerged as a practical means of assessing the majority of the transcriptomic space with less reliance on large resources for consumables and bioinformatics. TempO-Seq is a templated, multiplexed RNA-Seq platform that interrogates a panel of sentinel genes representative of genome-wide transcription. Nuances of the technology require proper preprocessing of the data. Various methods have been proposed and compared for normalizing bulk RNA-Seq data, but there has been little to no investigation of how the methods perform on TempO-Seq data. We simulated count data into two groups (treated vs. untreated) at seven fold-change (FC) levels (including no change) using control samples from human HepaRG cells run on TempO-Seq and normalized the data using seven normalization methods. Upper Quartile (UQ) performed the best with regard to maintaining FC levels as detected by a limma contrast between treated vs. untreated groups. For all FC levels, specificity of the UQ normalization was greater than 0.84 and sensitivity greater than 0.90, except for the no change and +1.5 levels. Furthermore, K-means clustering of the simulated genes normalized by UQ agreed the most with the FC assignments [adjusted Rand index (ARI) = 0.67]. Despite assuming that the majority of genes are unchanged, the DESeq2 scaling factors normalization method performed reasonably well, as did the simple normalization procedures counts per million (CPM) and total counts (TC). These results suggest that for two-class comparisons of TempO-Seq data, UQ, CPM, TC, or DESeq2 normalization should provide reasonably reliable results at absolute FC levels ≥2.0. These findings will help guide researchers to normalize TempO-Seq gene expression data for more reliable results.
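
    The abstract above does not include code, but upper-quartile (UQ) scaling has a standard formulation: divide each sample's counts by the 75th percentile of that sample's non-zero counts, then rescale to a common factor. A minimal numpy sketch of that formulation (not the authors' implementation; the simulated counts are placeholders):

      import numpy as np

      def upper_quartile_normalize(counts):
          # counts: genes x samples matrix of raw counts.
          counts = np.asarray(counts, dtype=float)
          uq = np.array([np.percentile(col[col > 0], 75) for col in counts.T])
          return counts / uq * uq.mean()   # scale each sample, then restore a common magnitude

      rng = np.random.default_rng(0)
      raw = rng.negative_binomial(5, 0.3, size=(2000, 12))   # placeholder count matrix
      normalized = upper_quartile_normalize(raw)
      print(normalized.shape)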

  6. Ambiguity in medical concept normalization: An analysis of types and coverage in electronic health record datasets

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Mar 22, 2021
    Cite
    Denis Newman-Griffis; Guy Divita; Bart Desmet; Ayah Zirikly; Carolyn Rosé; Eric Fosler-Lussier (2021). Ambiguity in medical concept normalization: An analysis of types and coverage in electronic health record datasets [Dataset]. http://doi.org/10.5061/dryad.r4xgxd29w
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 22, 2021
    Dataset provided by
    The Ohio State University
    National Institutes of Health Clinical Center
    Carnegie Mellon University
    University of Pittsburgh
    Authors
    Denis Newman-Griffis; Guy Divita; Bart Desmet; Ayah Zirikly; Carolyn Rosé; Eric Fosler-Lussier
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Objective: Normalizing mentions of medical concepts to standardized vocabularies is a fundamental component of clinical text analysis. Ambiguity—words or phrases that may refer to different concepts—has been extensively researched as part of information extraction from biomedical literature, but less is known about the types and frequency of ambiguity in clinical text. This study characterizes the distribution and distinct types of ambiguity exhibited by benchmark clinical concept normalization datasets, in order to identify directions for advancing medical concept normalization research.

    Materials and Methods: We identified ambiguous strings in datasets derived from the two available clinical corpora for concept normalization, and categorized the distinct types of ambiguity they exhibited. We then compared observed string ambiguity in the datasets to potential ambiguity in the Unified Medical Language System (UMLS), to assess how representative available datasets are of ambiguity in clinical language.

    Results: We observed twelve distinct types of ambiguity, distributed unequally across the available datasets. However, less than 15% of the strings were ambiguous within the datasets, while over 50% were ambiguous in the UMLS, indicating only partial coverage of clinical ambiguity.

    Discussion: Existing datasets are not sufficient to cover the diversity of clinical concept ambiguity, limiting both training and evaluation of normalization methods for clinical text. Additionally, the UMLS offers important semantic information for building and evaluating normalization methods.

    Conclusion: Our findings identify three opportunities for concept normalization research, including a need for ambiguity-specific clinical datasets and leveraging the rich semantics of the UMLS in new methods and evaluation measures for normalization.

    Methods These data are derived from benchmark datasets released for Medical Concept Normalization research focused on Electronic Health Record (EHR) narratives. Data included in this release are derived from:

    SemEval-2015 Task 14 (Publication DOI: 10.18653/v1/S15-2051, data accessed through release at https://physionet.org/content/shareclefehealth2014task2/1.0/)
    CUILESS2016 (Publication DOI: 10.1186/s13326-017-0173-6, data accessed through release at https://physionet.org/content/cuiless16/1.0.0/)
    

    These datasets consist of EHR narratives with annotations including: (1) the portion of a narrative referring to a medical concept, such as a problem, treatment, or test; and (2) one or more Concept Unique Identifiers (CUIs) derived from the Unified Medical Language System (UMLS), identifying the reification of the medical concept being mentioned.

    The data were processed using the following procedure:

    All medical concept mention strings were preprocessed with lowercasing and removing of determiners ("a", "an", "the").
    All medical concept mentions were analyzed to identify strings that met the following conditions: (1) string occurred more than once in the dataset, and (2) string was annotated with at least two different CUIs, when aggregating across dataset samples. Strings meeting these conditions were considered "ambiguous strings".
    Ambiguous strings were reviewed by article authors to determine (1) the category and subcategory of ambiguity exhibited (derived from an ambiguity typology described in the accompanying article); and (2) whether the semantic differences in CUI annotations were reflected by differences in textual meaning (strings not meeting this criterion were termed "arbitrary").
    

    For more details, please see the accompanying article (DOI: 10.1093/jamia/ocaa269).
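
    The "ambiguous string" criteria above (lowercasing, determiner removal, more than one occurrence, at least two distinct CUIs) are easy to express directly. A small Python sketch with made-up mentions and placeholder CUIs, purely to illustrate the filtering logic:

      from collections import defaultdict

      # Placeholder (mention, CUI) annotations; real annotations come from the
      # SemEval-2015 Task 14 and CUILESS2016 releases described above.
      annotations = [
          ("The cold", "C0000001"),
          ("cold", "C0000002"),
          ("cold", "C0000001"),
          ("a fracture", "C0000003"),
      ]

      def normalize_mention(s):
          # Lowercase and strip a leading determiner ("a", "an", "the").
          s = s.lower()
          for det in ("a ", "an ", "the "):
              if s.startswith(det):
                  s = s[len(det):]
                  break
          return s

      counts = defaultdict(int)
      cuis = defaultdict(set)
      for mention, cui in annotations:
          key = normalize_mention(mention)
          counts[key] += 1
          cuis[key].add(cui)

      # Ambiguous strings: occur more than once and carry at least two distinct CUIs.
      ambiguous = sorted(s for s in counts if counts[s] > 1 and len(cuis[s]) > 1)
      print(ambiguous)   # ['cold'] in this toy example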

  7. Normalized Water Quality Data

    • figshare.com
    txt
    Updated Feb 22, 2022
    Cite
    Dov Stekel (2022). Normalized Water Quality Data [Dataset]. http://doi.org/10.6084/m9.figshare.19213386.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Feb 22, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Dov Stekel
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Water quality data. Each series has been normalised to its mean over the time period, so that the normalised mean is 100.
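
    A one-line pandas sketch of that normalisation, with placeholder column names (each series is divided by its own mean and scaled so the normalised mean is 100):

      import pandas as pd

      df = pd.DataFrame({"nitrate": [2.1, 2.4, 1.9], "phosphate": [0.30, 0.28, 0.35]})   # placeholder series
      normalised = df / df.mean() * 100
      print(normalised.mean())   # each column now has mean 100.0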

  8. Part 2 of real-time testing data for: "Identifying data sources and physical strategies used by neural networks to predict TC rapid intensification"

    • zenodo.org
    application/gzip
    Updated Aug 8, 2024
    Cite
    Zenodo (2024). Part 2 of real-time testing data for: "Identifying data sources and physical strategies used by neural networks to predict TC rapid intensification" [Dataset]. http://doi.org/10.5281/zenodo.13272877
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Aug 8, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Each file in the dataset contains machine-learning-ready data for one unique tropical cyclone (TC) from the real-time testing dataset. "Machine-learning-ready" means that all data-processing methods described in the journal paper have already been applied. This includes cropping satellite images to make them TC-centered; rotating satellite images to align them with TC motion (TC motion is always towards the +x-direction, or in the direction of increasing column number); flipping satellite images in the southern hemisphere upside-down; and normalizing data via the two-step procedure.

    The file name gives you the unique identifier of the TC -- e.g., "learning_examples_2010AL01.nc.gz" contains data for storm 2010AL01, or the first North Atlantic storm of the 2010 season. Each file can be read with the method `example_io.read_file` in the ml4tc Python library (https://zenodo.org/doi/10.5281/zenodo.10268620). However, since `example_io.read_file` is a lightweight wrapper for `xarray.open_dataset`, you can equivalently just use `xarray.open_dataset`. Variables in the table are listed below (the same printout produced by `print(xarray_table)`):

    Dimensions: (
    satellite_valid_time_unix_sec: 289,
    satellite_grid_row: 380,
    satellite_grid_column: 540,
    satellite_predictor_name_gridded: 1,
    satellite_predictor_name_ungridded: 16,
    ships_valid_time_unix_sec: 19,
    ships_storm_object_index: 19,
    ships_forecast_hour: 23,
    ships_intensity_threshold_m_s01: 21,
    ships_lag_time_hours: 5,
    ships_predictor_name_lagged: 17,
    ships_predictor_name_forecast: 129)
    Coordinates:
    * satellite_grid_row (satellite_grid_row) int32 2kB ...
    * satellite_grid_column (satellite_grid_column) int32 2kB ...
    * satellite_valid_time_unix_sec (satellite_valid_time_unix_sec) int32 1kB ...
    * ships_lag_time_hours (ships_lag_time_hours) float64 40B ...
    * ships_intensity_threshold_m_s01 (ships_intensity_threshold_m_s01) float64 168B ...
    * ships_forecast_hour (ships_forecast_hour) int32 92B ...
    * satellite_predictor_name_gridded (satellite_predictor_name_gridded) object 8B ...
    * satellite_predictor_name_ungridded (satellite_predictor_name_ungridded) object 128B ...
    * ships_valid_time_unix_sec (ships_valid_time_unix_sec) int32 76B ...
    * ships_predictor_name_lagged (ships_predictor_name_lagged) object 136B ...
    * ships_predictor_name_forecast (ships_predictor_name_forecast) object 1kB ...
    Dimensions without coordinates: ships_storm_object_index
    Data variables:
    satellite_number (satellite_valid_time_unix_sec) int32 1kB ...
    satellite_band_number (satellite_valid_time_unix_sec) int32 1kB ...
    satellite_band_wavelength_micrometres (satellite_valid_time_unix_sec) float64 2kB ...
    satellite_longitude_deg_e (satellite_valid_time_unix_sec) float64 2kB ...
    satellite_cyclone_id_string (satellite_valid_time_unix_sec) |S8 2kB ...
    satellite_storm_type_string (satellite_valid_time_unix_sec) |S2 578B ...
    satellite_storm_name (satellite_valid_time_unix_sec) |S10 3kB ...
    satellite_storm_latitude_deg_n (satellite_valid_time_unix_sec) float64 2kB ...
    satellite_storm_longitude_deg_e (satellite_valid_time_unix_sec) float64 2kB ...
    satellite_storm_intensity_number (satellite_valid_time_unix_sec) float64 2kB ...
    satellite_storm_u_motion_m_s01 (satellite_valid_time_unix_sec) float64 2kB ...
    satellite_storm_v_motion_m_s01 (satellite_valid_time_unix_sec) float64 2kB ...
    satellite_predictors_gridded (satellite_valid_time_unix_sec, satellite_grid_row, satellite_grid_column, satellite_predictor_name_gridded) float64 474MB ...
    satellite_grid_latitude_deg_n (satellite_valid_time_unix_sec, satellite_grid_row, satellite_grid_column) float64 474MB ...
    satellite_grid_longitude_deg_e (satellite_valid_time_unix_sec, satellite_grid_row, satellite_grid_column) float64 474MB ...
    satellite_predictors_ungridded (satellite_valid_time_unix_sec, satellite_predictor_name_ungridded) float64 37kB ...
    ships_storm_intensity_m_s01 (ships_valid_time_unix_sec) float64 152B ...
    ships_storm_type_enum (ships_storm_object_index, ships_forecast_hour) int32 2kB ...
    ships_forecast_latitude_deg_n (ships_storm_object_index, ships_forecast_hour) float64 3kB ...
    ships_forecast_longitude_deg_e (ships_storm_object_index, ships_forecast_hour) float64 3kB ...
    ships_v_wind_200mb_0to500km_m_s01 (ships_storm_object_index, ships_forecast_hour) float64 3kB ...
    ships_vorticity_850mb_0to1000km_s01 (ships_storm_object_index, ships_forecast_hour) float64 3kB ...
    ships_vortex_latitude_deg_n (ships_storm_object_index, ships_forecast_hour) float64 3kB ...
    ships_vortex_longitude_deg_e (ships_storm_object_index, ships_forecast_hour) float64 3kB ...
    ships_mean_tangential_wind_850mb_0to600km_m_s01 (ships_storm_object_index, ships_forecast_hour) float64 3kB ...
    ships_max_tangential_wind_850mb_m_s01 (ships_storm_object_index, ships_forecast_hour) float64 3kB ...
    ships_mean_tangential_wind_1000mb_at500km_m_s01 (ships_storm_object_index, ships_forecast_hour) float64 3kB ...
    ships_mean_tangential_wind_850mb_at500km_m_s01 (ships_storm_object_index, ships_forecast_hour) float64 3kB ...
    ships_mean_tangential_wind_500mb_at500km_m_s01 (ships_storm_object_index, ships_forecast_hour) float64 3kB ...
    ships_mean_tangential_wind_300mb_at500km_m_s01 (ships_storm_object_index, ships_forecast_hour) float64 3kB ...
    ships_srh_1000to700mb_200to800km_j_kg01 (ships_storm_object_index, ships_forecast_hour) float64 3kB ...
    ships_srh_1000to500mb_200to800km_j_kg01 (ships_storm_object_index, ships_forecast_hour) float64 3kB ...
    ships_threshold_exceedance_num_6hour_periods (ships_storm_object_index, ships_intensity_threshold_m_s01) int32 2kB ...
    ships_v_motion_observed_m_s01 (ships_storm_object_index) float64 152B ...
    ships_v_motion_1000to100mb_flow_m_s01 (ships_storm_object_index) float64 152B ...
    ships_v_motion_optimal_flow_m_s01 (ships_storm_object_index) float64 152B ...
    ships_cyclone_id_string (ships_storm_object_index) object 152B ...
    ships_storm_latitude_deg_n (ships_storm_object_index) float64 152B ...
    ships_storm_longitude_deg_e (ships_storm_object_index) float64 152B ...
    ships_predictors_lagged (ships_valid_time_unix_sec, ships_lag_time_hours, ships_predictor_name_lagged) float64 13kB ...
    ships_predictors_forecast (ships_valid_time_unix_sec, ships_forecast_hour, ships_predictor_name_forecast) float64 451kB ...

    Variable names are meant to be as self-explanatory as possible. Potentially confusing ones are listed below.

    • The dimension ships_storm_object_index is redundant with the dimension ships_valid_time_unix_sec and can be ignored.
    • ships_forecast_hour ranges up to values that we do not actually use in the paper. Keep in mind that our max forecast hour used in machine learning is 24.
    • The dimension ships_intensity_threshold_m_s01 (and any variable including this dimension) can be ignored.
    • ships_lag_time_hours corresponds to lag times for the SHIPS satellite-based predictors. The only lag time we use in machine learning is "NaN", which is a stand-in for the best available of all lag times. See the discussion of the "priority list" in the paper for more details.
    • Most of the data variables can be ignored, unless you're doing a deep dive into storm properties. The important variables are satellite_predictors_gridded (full satellite images), ships_predictors_lagged (satellite-based SHIPS predictors), and ships_predictors_forecast (environmental and storm-history-based SHIPS predictors). These variables are all discussed in the paper.
    • Every variable name (including elements of the coordinate lists ships_predictor_name_lagged and ships_predictor_name_forecast) includes units at the end. For example, "m_s01" = metres per second; "deg_n" = degrees north; "deg_e" = degrees east; "j_kg01" = Joules per kilogram; ...; etc.
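
    Putting the notes above together, a minimal Python sketch for reading one storm file and pulling out the three predictor variables highlighted in the last bullet. Decompressing to a temporary .nc file before calling xarray.open_dataset is an assumption about the local setup; the ml4tc helper example_io.read_file wraps the same xarray call.

      import gzip
      import shutil
      import xarray as xr

      src = "learning_examples_2010AL01.nc.gz"
      dst = src[:-3]                                   # "learning_examples_2010AL01.nc"
      with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
          shutil.copyfileobj(f_in, f_out)

      ds = xr.open_dataset(dst)
      gridded = ds["satellite_predictors_gridded"]      # TC-centered, motion-aligned satellite images
      ships_lagged = ds["ships_predictors_lagged"]      # satellite-based SHIPS predictors
      ships_forecast = ds["ships_predictors_forecast"]  # environmental / storm-history SHIPS predictors
      print(gridded.shape, ships_lagged.shape, ships_forecast.shape)
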
  9. Naturalistic Neuroimaging Database

    • openneuro.org
    Updated Apr 20, 2021
    + more versions
    Cite
    Sarah Aliko; Jiawen Huang; Florin Gheorghiu; Stefanie Meliss; Jeremy I Skipper (2021). Naturalistic Neuroimaging Database [Dataset]. http://doi.org/10.18112/openneuro.ds002837.v1.1.3
    Explore at:
    Dataset updated
    Apr 20, 2021
    Dataset provided by
    OpenNeuro (https://openneuro.org/)
    Authors
    Sarah Aliko; Jiawen Huang; Florin Gheorghiu; Stefanie Meliss; Jeremy I Skipper
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Overview

    • The Naturalistic Neuroimaging Database (NNDb v2.0) contains datasets from 86 human participants doing the NIH Toolbox and then watching one of 10 full-length movies during functional magnetic resonance imaging (fMRI). The participants were all right-handed, native English speakers, with no history of neurological/psychiatric illnesses, no hearing impairments, unimpaired or corrected vision, and taking no medication. Each movie was stopped in 40-50 minute intervals or when participants asked for a break, resulting in 2-6 runs of BOLD-fMRI. A 10-minute high-resolution defaced T1-weighted anatomical MRI scan (MPRAGE) is also provided.
    • The NNDb V2.0 is now on Neuroscout, a platform for fast and flexible re-analysis of (naturalistic) fMRI studies. See: https://neuroscout.org/

    v2.0 Changes

    • Overview
      • We have replaced our own preprocessing pipeline with that implemented in AFNI’s afni_proc.py, thus changing only the derivative files. This introduces a fix for an issue with our normalization (i.e., scaling) step and modernizes and standardizes the preprocessing applied to the NNDb derivative files. We have done a bit of testing and have found that results in both pipelines are quite similar in terms of the resulting spatial patterns of activity but with the benefit that the afni_proc.py results are 'cleaner' and statistically more robust.
    • Normalization

      • Emily Finn and Clare Grall at Dartmouth and Rick Reynolds and Paul Taylor at AFNI, discovered and showed us that the normalization procedure we used for the derivative files was less than ideal for timeseries runs of varying lengths. Specifically, the 3dDetrend flag -normalize makes 'the sum-of-squares equal to 1'. We had not thought through that an implication of this is that the resulting normalized timeseries amplitudes will be affected by run length, increasing as run length decreases (and maybe this should go in 3dDetrend’s help text). To demonstrate this, I wrote a version of 3dDetrend’s -normalize for R so you can see for yourselves by running the following code:
      # Generate a resting state (rs) timeseries (ts)
      # Install / load package to make fake fMRI ts
      # install.packages("neuRosim")
      library(neuRosim)
      # Generate a ts
      ts.rs <- simTSrestingstate(nscan=2000, TR=1, SNR=1)
      # 3dDetrend -normalize
      # R command version for 3dDetrend -normalize -polort 0 which normalizes by making "the sum-of-squares equal to 1"
      # Do for the full timeseries
      ts.normalised.long <- (ts.rs-mean(ts.rs))/sqrt(sum((ts.rs-mean(ts.rs))^2));
      # Do this again for a shorter version of the same timeseries
      ts.shorter.length <- length(ts.normalised.long)/4
      ts.normalised.short <- (ts.rs[1:ts.shorter.length]- mean(ts.rs[1:ts.shorter.length]))/sqrt(sum((ts.rs[1:ts.shorter.length]- mean(ts.rs[1:ts.shorter.length]))^2));
      # By looking at the summaries, it can be seen that the median values become  larger
      summary(ts.normalised.long)
      summary(ts.normalised.short)
      # Plot results for the long and short ts
      # Truncate the longer ts for plotting only
      ts.normalised.long.made.shorter <- ts.normalised.long[1:ts.shorter.length]
      # Give the plot a title
      title <- "3dDetrend -normalize for long (blue) and short (red) timeseries";
      plot(x=0, y=0, main=title, xlab="", ylab="", xaxs='i', xlim=c(1,length(ts.normalised.short)), ylim=c(min(ts.normalised.short),max(ts.normalised.short)));
      # Add zero line
      lines(x=c(-1,ts.shorter.length), y=rep(0,2), col='grey');
      # 3dDetrend -normalize -polort 0 for long timeseries
      lines(ts.normalised.long.made.shorter, col='blue');
      # 3dDetrend -normalize -polort 0 for short timeseries
      lines(ts.normalised.short, col='red');
      
    • Standardization/modernization

      • The above individuals also encouraged us to implement the afni_proc.py script over our own pipeline. It introduces at least three additional improvements: First, we now use Bob’s @SSwarper to align our anatomical files with an MNI template (now MNI152_2009_template_SSW.nii.gz) and this, in turn, integrates nicely into the afni_proc.py pipeline. This seems to result in a generally better or more consistent alignment, though this is only a qualitative observation. Second, all the transformations / interpolations and detrending are now done in fewer steps compared to our pipeline. This is preferable because, e.g., there is less chance of inadvertently reintroducing noise back into the timeseries (see Lindquist, Geuter, Wager, & Caffo 2019). Finally, many groups are advocating using tools like fMRIPrep or afni_proc.py to increase standardization of analysis practices in our neuroimaging community. This presumably results in less error, less heterogeneity and more interpretability of results across studies. Along these lines, the quality control (‘QC’) html pages generated by afni_proc.py are a real help in assessing data quality and almost a joy to use.
    • New afni_proc.py command line

      • The following is the afni_proc.py command line that we used to generate blurred and censored timeseries files. The afni_proc.py tool comes with extensive help and examples. As such, you can quickly understand our preprocessing decisions by scrutinising the below. Specifically, the following command is most similar to Example 11 for ‘Resting state analysis’ in the help file (see https://afni.nimh.nih.gov/pub/dist/doc/program_help/afni_proc.py.html):

        afni_proc.py \
          -subj_id "$sub_id_name_1" \
          -blocks despike tshift align tlrc volreg mask blur scale regress \
          -radial_correlate_blocks tcat volreg \
          -copy_anat anatomical_warped/anatSS.1.nii.gz \
          -anat_has_skull no \
          -anat_follower anat_w_skull anat anatomical_warped/anatU.1.nii.gz \
          -anat_follower_ROI aaseg anat freesurfer/SUMA/aparc.a2009s+aseg.nii.gz \
          -anat_follower_ROI aeseg epi freesurfer/SUMA/aparc.a2009s+aseg.nii.gz \
          -anat_follower_ROI fsvent epi freesurfer/SUMA/fs_ap_latvent.nii.gz \
          -anat_follower_ROI fswm epi freesurfer/SUMA/fs_ap_wm.nii.gz \
          -anat_follower_ROI fsgm epi freesurfer/SUMA/fs_ap_gm.nii.gz \
          -anat_follower_erode fsvent fswm \
          -dsets media_?.nii.gz \
          -tcat_remove_first_trs 8 \
          -tshift_opts_ts -tpattern alt+z2 \
          -align_opts_aea -cost lpc+ZZ -giant_move -check_flip \
          -tlrc_base "$basedset" \
          -tlrc_NL_warp \
          -tlrc_NL_warped_dsets \
            anatomical_warped/anatQQ.1.nii.gz \
            anatomical_warped/anatQQ.1.aff12.1D \
            anatomical_warped/anatQQ.1_WARP.nii.gz \
          -volreg_align_to MIN_OUTLIER \
          -volreg_post_vr_allin yes \
          -volreg_pvra_base_index MIN_OUTLIER \
          -volreg_align_e2a \
          -volreg_tlrc_warp \
          -mask_opts_automask -clfrac 0.10 \
          -mask_epi_anat yes \
          -blur_to_fwhm -blur_size $blur \
          -regress_motion_per_run \
          -regress_ROI_PC fsvent 3 \
          -regress_ROI_PC_per_run fsvent \
          -regress_make_corr_vols aeseg fsvent \
          -regress_anaticor_fast \
          -regress_anaticor_label fswm \
          -regress_censor_motion 0.3 \
          -regress_censor_outliers 0.1 \
          -regress_apply_mot_types demean deriv \
          -regress_est_blur_epits \
          -regress_est_blur_errts \
          -regress_run_clustsim no \
          -regress_polort 2 \
          -regress_bandpass 0.01 1 \
          -html_review_style pythonic

        We used similar command lines to generate the ‘blurred and not censored’ and the ‘not blurred and not censored’ timeseries files (described more fully below). We will provide the code used to make all derivative files available on our github site (https://github.com/lab-lab/nndb).

      We made one choice above that is different enough from our original pipeline that it is worth mentioning here. Specifically, we have quite long runs, with the average being ~40 minutes, but this number can be variable (thus leading to the above issue with 3dDetrend's -normalize). A discussion on the AFNI message board with one of our team (starting here: https://afni.nimh.nih.gov/afni/community/board/read.php?1,165243,165256#msg-165256) led to the suggestion that '-regress_polort 2' with '-regress_bandpass 0.01 1' be used for long runs. We had previously used only a variable polort with the suggested 1 + int(D/150) approach. Our new polort 2 + bandpass approach has the added benefit of working well with afni_proc.py.

      Which timeseries file you use is up to you, but I have been encouraged by Rick and Paul to include a sort of PSA about this. In Paul’s own words:
      • Blurred data should not be used for ROI-based analyses (and potentially not for ICA? I am not certain about standard practice).
      • Unblurred data for ISC might be pretty noisy for voxelwise analyses, since blurring should effectively boost the SNR of active regions (and even good alignment won't be perfect everywhere).
      • For uncensored data, one should be concerned about motion effects being left in the data (e.g., spikes in the data).
      • For censored data:
        • Performing ISC requires the users to take the union of the censoring patterns during the correlation calculation.
        • If wanting to calculate power spectra or spectral parameters like ALFF/fALFF/RSFA etc. (which some people might still do for naturalistic tasks), then standard FT-based methods can't be used because sampling is no longer uniform. Instead, people could use something like 3dLombScargle+3dAmpToRSFC, which calculates power spectra (and RSFC params) based on a generalization of the FT that can handle non-uniform sampling, as long as the censoring pattern is mostly random and, say, only up to about 10-15% of the data.
      In sum, think very carefully about which files you use. If you find you need a file we have not provided, we can happily generate different versions of the timeseries upon request and can generally do so in a week or less.

    • Effect on results

      • From numerous tests on our own analyses, we have qualitatively found that results using our old vs the new afni_proc.py preprocessing pipeline do not change all that much in terms of general spatial patterns. There is, however, an
  10. DEMANDE Dataset

    • zenodo.org
    • researchdiscovery.drexel.edu
    zip
    Updated Apr 13, 2023
    Cite
    Joseph A. Gallego-Mejia; Joseph A. Gallego-Mejia; Fabio A Gonzalez; Fabio A Gonzalez (2023). DEMANDE Dataset [Dataset]. http://doi.org/10.5281/zenodo.7822851
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 13, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Joseph A. Gallego-Mejia; Joseph A. Gallego-Mejia; Fabio A Gonzalez; Fabio A Gonzalez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the features and probabilities of ten different functions. Each dataset is saved using numpy arrays.

    • The data set Arc corresponds to a two-dimensional random sample drawn from a random vector $$X=(X_1,X_2)$$ with probability density function given by $$f(x_1,x_2)=\mathcal{N}(x_2|0,4)\mathcal{N}(x_1|0.25x_2^2,1)$$, where $$\mathcal{N}(u|\mu,\sigma^2)$$ denotes the density function of a normal distribution with mean $$\mu$$ and variance $$\sigma^2$$. Papamakarios (2017) used this data set to evaluate his neural density estimation methods.
    • The data set Potential 1 corresponds to a two-dimensional random sample drawn from a random vector $$X=(X_1,X_2)$$ with probability density function given by $$f(x_1,x_2)=\frac{1}{2}\left(\frac{||x||-2}{0.4}\right)^2 - \ln{\left(\exp\left\{-\frac{1}{2}\left[\frac{x_1-2}{0.6}\right]^2\right\}+\exp\left\{-\frac{1}{2}\left[\frac{x_1+2}{0.6}\right]^2\right\}\right)}$$, with a normalizing constant of approximately 6.52 calculated by Monte Carlo integration.
    • The data set Potential 2 corresponds to a two-dimensional random sample drawn from a random vector $$X=(X_1,X_2)$$ with probability density function given by $$f(x_1,x_2)=\frac{1}{2}\left[\frac{x_2-w_1(x)}{0.4}\right]^2$$, where $$w_1(x)=\sin{(\frac{2\pi x_1}{4})}$$, with a normalizing constant of approximately 8 calculated by Monte Carlo integration.
    • The data set Potential 3 corresponds to a two-dimensional random sample drawn from a random vector $$X=(X_1,X_2)$$ with probability density function given by $$f(x_1,x_2)= - \ln{\left(\exp\left\{-\frac{1}{2}\left[\frac{x_2-w_1(x)}{0.35}\right]^2\right\}+\exp\left\{-\frac{1}{2}\left[\frac{x_2-w_1(x)+w_2(x)}{0.35}\right]^2\right\}\right)}$$, where $$w_1(x)=\sin{(\frac{2\pi x_1}{4})}$$ and $$w_2(x)=3 \exp \left\{-\frac{1}{2}\left[ \frac{x_1-1}{0.6}\right]^2\right\}$$, with a normalizing constant of approximately 13.9 calculated by Monte Carlo integration.
    • The data set Potential 4 corresponds to a two-dimensional random sample drawn from a random vector $$X=(X_1,X_2)$$ with probability density function given by $$f(x_1,x_2)= - \ln{\left(\exp\left\{-\frac{1}{2}\left[\frac{x_2-w_1(x)}{0.4}\right]^2\right\}+\exp\left\{-\frac{1}{2}\left[\frac{x_2-w_1(x)+w_3(x)}{0.35}\right]^2\right\}\right)}$$, where $$w_1(x)=\sin{(\frac{2\pi x_1}{4})}$$, $$w_3(x)=3 \sigma \left(\left[ \frac{x_1-1}{0.3}\right]^2\right)$$, and $$\sigma(x)= \frac{1}{1+\exp(x)}$$, with a normalizing constant of approximately 13.9 calculated by Monte Carlo integration.
    • The data set 2D mixture corresponds to a two-dimensional random sample drawn from the random vector $$X=(X_1, X_2)$$ with a probability density function given by $$f(x) = \frac{1}{2}\mathcal{N}(x|\mu_1,\Sigma_1) + \frac{1}{2}\mathcal{N}(x|\mu_2,\Sigma_2)$$, with means and covariance matrices $$\mu_1 = [1, -1]^T$$, $$\mu_2 = [-2, 2]^T$$, $$\Sigma_1=\left[\begin{array}{cc} 1 & 0 \\ 0 & 2 \end{array}\right]$$, and $$\Sigma_2=\left[\begin{array}{cc} 2 & 0 \\ 0 & 1 \end{array}\right]$$.
    • The data set 10D mixture corresponds to a 10-dimensional random sample drawn from the random vector $$X=(X_1,\cdots,X_{10})$$ with a mixture of four diagonal normal probability density functions $$\mathcal{N}(X_i|\mu_i, \sigma_i)$$, where each $$\mu_i$$ is drawn uniformly in the interval $$[-0.5,0.5]$$ and each $$\sigma_i$$ is drawn uniformly in the interval $$[-0.01, 0.5]$$. Each diagonal normal probability density has the same probability, $$1/4$$, of being drawn.
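
    As an illustration of the kind of density listed above, the following Python sketch samples from and evaluates the 2D mixture (equal weights, means [1, -1] and [-2, 2], diagonal covariances diag(1, 2) and diag(2, 1)). It is not the code used to generate the released numpy arrays.

      import numpy as np
      from scipy.stats import multivariate_normal

      rng = np.random.default_rng(0)
      means = np.array([[1.0, -1.0], [-2.0, 2.0]])
      covs = [np.diag([1.0, 2.0]), np.diag([2.0, 1.0])]

      def sample(n):
          comp = rng.integers(0, 2, size=n)            # equal mixture weights
          return np.stack([rng.multivariate_normal(means[c], covs[c]) for c in comp])

      def density(x):
          return 0.5 * multivariate_normal(means[0], covs[0]).pdf(x) + \
                 0.5 * multivariate_normal(means[1], covs[1]).pdf(x)

      points = sample(1000)
      print(points.shape, density(points).shape)       # (1000, 2) (1000,)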

  11. SILO Patched Point data for Narrabri (54120) and Gunnedah (55023) stations in the Namoi subregion

    • data.wu.ac.at
    • researchdata.edu.au
    • +1 more
    zip
    Updated Sep 29, 2017
    Cite
    Bioregional Assessment Programme (2017). SILO Patched Point data for Narrabri (54120) and Gunnedah (55023) stations in the Namoi subregion [Dataset]. https://data.wu.ac.at/odso/data_gov_au/YTBlODZlMzItYjA5ZC00NDNjLTk1NDUtNTAzNjE3YjhkNTMy
    Explore at:
    Available download formats: zip (15069668.0)
    Dataset updated
    Sep 29, 2017
    Dataset provided by
    Bioregional Assessment Programme
    Description

    Abstract

    This dataset and its metadata statement were supplied to the Bioregional Assessment Programme by a third party and are presented here as originally supplied.

    SILO is a Queensland Government database containing continuous daily climate data for Australia from 1889 to present. Gridded datasets are constructed by spatially interpolating the observed point data. Continuous point datasets are constructed by supplementing the available point data with interpolated estimates when observed data are missing.

    Purpose

    SILO provides climate datasets that are ready to use. Raw observational data typically contain missing data and are only available at the location of meteorological recording stations. SILO provides point datasets with no missing data and gridded datasets which cover mainland Australia and some islands.

    Dataset History

    Lineage statement:

    (A) Processing System Version History

    * Prior to 2001

    The interpolation system used the algorithm detailed in Jeffrey et al. [1].

    * 2001-2009

    The normalisation procedure was modified. Observational rainfall, when accumulated over a sufficient period and raised to an appropriate fractional power, is (to a reasonable approximation) normally distributed. In the original procedure the fractional power was fixed at 0.5 and a normal distribution was fitted to the transformed data using a maximum likelihood technique. A Kolmogorov-Smirnov test was used to test the goodness of fit, with a threshold value of 0.8. In 2001 the procedure was modified to allow the fractional power to vary between 0.4 and 0.6. The normalisation parameters (fractional power, mean and standard deviation) at each station were spatially interpolated using a thin plate smoothing spline.

    * 2009-2011

    The normalisation procedure was modified. The Kolmogorov-Smirnov test was removed, enabling normalisation parameters to be computed for all stations having sufficient data. Previously parameters were only computed for those stations having data that were adequately modelled by a normal distribution, as determined by the Kolmogorov-Smirnov test.

    * January 2012 - November 2012

    The normalisation procedure was modified:

    o The Kolmogorov-Smirnoff test was reintroduced, with a threshold value of 0.1.

    o Data from Bellenden Ker Top station were included in the computation of normalisation parameters. The station was previously omitted on the basis of having insufficient data. It was forcibly included to ensure the steep rainfall gradient in the region was reflected in the normalisation parameters.

    o The elevation data used when interpolating normalisation parameters were modified. Previously a mean elevation was assigned to each station, taken from the nearest grid cell in a 0.05° × 0.05° digital elevation model. The procedure was modified to use the actual station elevation instead of the mean. In mountainous regions the discrepancy was substantial and cross validation tests showed a significant improvement in error statistics.

    o The station data are normalised using: (i) a power parameter extracted from the nearest pixel in the gridded power surface. The surface was obtained by interpolating the power parameters fitted at station locations using a maximum likelihood algorithm; and (ii) mean and standard deviation parameters which had been fitted at station locations using a smoothing spline. Mean and standard deviation parameters were fitted at the subset of stations having at least 40 years of data, using a maximum likelihood algorithm. The fitted data were then spatially interpolated to construct: (a) gridded mean and standard deviation surfaces (for use in a subsequent de-normalisation procedure); and (b) interpolated estimates of the parameters at all station locations (not just the subset having long data records). The parameters fitted using maximum likelihood (at the subset of stations having long data records) may differ from those fitted by the interpolation algorithm, owing to the smoothing nature of the spline algorithm which was used. Previously, station data were normalised using mean and standard deviation parameters which were taken from the nearest pixel in the respective mean and standard deviation surfaces.

    * November 2012 - May 2013

    The algorithm used for selecting monthly rainfall data for interpolation was modified. Prior to November 2012, the system was as follows:

    o Accumulated monthly rainfall was computed by the Bureau of Meteorology;

    o Rainfall accumulations spanning the end of a month were assigned to the last month included in the accumulation period;

    o A monthly rainfall value was provided for all stations which submitted at least one daily report. Zero rainfall was assumed for all missing values; and

    o SILO imposed a complex set of ad-hoc rules which aimed to identify stations which had ceased reporting in real time. In such cases it would not be appropriate to assume zero rainfall for days when a report was not available. The rules were only applied when processing data for January 2001 and onwards.

    In November 2012 a modified algorithm was implemented:

    o SILO computed the accumulated monthly rainfall by summing the daily reports;

    o Rainfall accumulations spanning the end of a month were discarded;

    o A monthly rainfall value was not computed for a given station if any day throughout the month was not accounted for - either through a daily report or an accumulation; and

    o The SILO ad-hoc rules were not applied.

    * May 2013 - current

    The algorithm used for selecting monthly rainfall data for interpolation was modified. The modified algorithm is only applied to datasets for the period October 2001 - current and is as follows:

    o SILO computes the accumulated monthly rainfall by summing the daily reports;

    o Rainfall accumulations spanning the end of a month are pro-rata distributed onto the two months included in the accumulation period;

    o A monthly rainfall value is computed for all stations which have at least 21 days accounted for throughout the month. Zero rainfall is assumed for all missing values; and

    o The SILO ad-hoc rules are applied when processing data for January 2001 and onwards.

    Datasets for the period January 1889-September 2001 are prepared using the system that was in effect prior to November 2012.
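    For the current (post-May 2013) period, the selection rules can be sketched as follows for a single station-month. The data structures are hypothetical, but the rules (summing daily reports, pro-rata distribution of accumulations that span the end of a month, and the 21-day completeness requirement) follow the description above.

        import datetime as dt

        def monthly_total(daily_reports, accumulations, year, month, min_days=21):
            # daily_reports: {date: rainfall_mm} for single-day reports.
            # accumulations: [(start_date, end_date, total_mm)] for multi-day totals.
            first = dt.date(year, month, 1)
            nxt = dt.date(year + month // 12, month % 12 + 1, 1)
            accounted, total = set(), 0.0

            for day, value in daily_reports.items():
                if first <= day < nxt:
                    total += value
                    accounted.add(day)

            for start, end, value in accumulations:
                period = [start + dt.timedelta(d) for d in range((end - start).days + 1)]
                in_month = [d for d in period if first <= d < nxt]
                if in_month:
                    # pro-rata share of an accumulation spanning the end of a month
                    total += value * len(in_month) / len(period)
                    accounted.update(in_month)

            # at least 21 days must be accounted for; missing days are treated as zero
            return total if len(accounted) >= min_days else None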

    Lineage statement:

    (A) Processing System Version History

    No changes have been made to the processing system since SILO's inception.

    (B) Major Historical Data Updates

    * All observational data and station coordinates were updated in 2009.

    * Station coordinates were updated on 26 January 2012.

    Process step:

    The observed data are interpolated using a tri-variate thin plate smoothing spline, with latitude, longitude and elevation as independent variables [4]. A two-pass interpolation system is used. All available observational data are interpolated in the first pass and residuals computed for all data points. The residual is the difference between the observed and interpolated values. Data points with high residuals may be indicative of erroneous data and are excluded from a subsequent interpolation which generates the final gridded surface. The surface covers the region 112°E - 154°E, 10°S - 44°S on a regular 0.05° × 0.05° grid and is restricted to land areas on mainland Australia and some islands.
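    A minimal sketch of the two-pass scheme is given below. SciPy's radial basis function interpolator with a thin-plate-spline kernel stands in for the operational tri-variate smoothing spline, and the residual threshold is illustrative only.

        import numpy as np
        from scipy.interpolate import RBFInterpolator

        def two_pass_surface(lon, lat, elev, values, grid_points, residual_threshold=50.0):
            # Fit all stations, drop points with large residuals, refit, then
            # evaluate the surface at the grid points (lon, lat, elev triples).
            stations = np.column_stack([lon, lat, elev])

            first = RBFInterpolator(stations, values, kernel='thin_plate_spline', smoothing=1.0)
            residuals = values - first(stations)

            keep = np.abs(residuals) < residual_threshold   # possibly erroneous data excluded
            second = RBFInterpolator(stations[keep], values[keep],
                                     kernel='thin_plate_spline', smoothing=1.0)
            return second(grid_points)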

    Gridded datasets for the period 1957-current are obtained by interpolation of the raw data. Gridded datasets for the period 1889-1956 were constructed using an anomaly interpolation technique. The daily departure from the long term mean is interpolated, and the gridded dataset is constructed by adding the gridded anomaly to the gridded long term mean. The long term means were constructed using data from the period 1957-2001. The anomaly interpolation technique is described in Rayner et al. [6].
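    The anomaly approach for the early period reduces to a simple composition, sketched below under the assumption that a station-to-grid interpolator (such as the two-pass spline above, with station coordinates handled elsewhere) is available; the names are illustrative.

        import numpy as np

        def anomaly_interpolated_grid(station_values, station_ltm, interpolate_to_grid, ltm_grid):
            # Interpolate the departure from the station long-term mean, then add the
            # gridded anomaly to the gridded long-term mean (built from 1957-2001 data).
            anomalies = np.asarray(station_values) - np.asarray(station_ltm)
            return ltm_grid + interpolate_to_grid(anomalies)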

    The observed and interpolated datasets evolve as new data becomes available and the existing data are improved through quality control procedures. Modifications gradually decrease over time, with most datasets undergoing little change 12 months after the date of observation.

    Dataset Citation

    "Queensland Department of Science, Information Technology, Innovation and the Arts" (2013) SILO Patched Point data for Narrabri (54120) and Gunnedah (55023) stations in the Namoi subregion. Bioregional Assessment Source Dataset. Viewed 29 September 2017, http://data.bioregionalassessments.gov.au/dataset/0a018b43-58d3-4b9e-b339-4dae8fd54ce8.

  12. Data from: Codebook vectors and predicted rare earth potential from a...

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Oct 30, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Codebook vectors and predicted rare earth potential from a trained emergent self-organizing map displaying multivariate topology of geochemical and reservoir temperature data from produced and geothermal waters of the United States [Dataset]. https://catalog.data.gov/dataset/codebook-vectors-and-predicted-rare-earth-potential-from-a-trained-emergent-self-organizin
    Explore at:
    Dataset updated
    Oct 30, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Earth, United States
    Description

    This data release consists of three products relating to an 82 x 50 neuron Emergent Self-Organizing Map (ESOM), which describes the multivariate topology of reservoir temperature and geochemical data for 190 samples of produced and geothermal waters from across the United States. Variables included in the ESOM are coordinates derived from reservoir temperature and concentration of Sc, Nd, Pr, Tb, Lu, Gd, Tm, Ce, Yb, Sm, Ho, Er, Eu, Dy, F, alkalinity as bicarbonate, Si, B, Br, Li, Ba, Sr, sulfate, H (derived from pH), K, Mg, Ca, Cl, and Na converted to units of proportion. The concentration data were converted to isometric log-ratio coordinates (following Hron et al., 2010), where the first ratio is Sc serving as the denominator to the geometric mean of all of the remaining elements (Nd to Na), the second ratio is Nd serving as the denominator to the geometric mean of all of the remaining elements (Pr to Na), and so on, until the final ratio is Na to Cl. Both the temperature and log-ratio coordinates of the concentration data were normalized to a mean of zero and a sample standard deviation of one. The first table is the mean and standard deviation of all of the data in this dataset, which is used to standardize the data. The second table is the codebook vectors from the trained ESOM where all variables were standardized and compositional data converted to isometric log-ratios. The final table provides rare earth element potentials predicted for a subset of the U.S. Geological Survey Produced Waters Geochemical Database, Version 2.3 (Blondes et al., 2017) through the use of the ESOM. The original source data used to create the ESOM all come from the U.S. Department of Energy Resources Geothermal Data Repository and are detailed in Engle (2019).
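    For illustration, one common construction of isometric log-ratio (pivot) coordinates, followed by the standardisation described above, is sketched here; the orientation of each ratio and the exact convention used in the data release may differ, so this is an assumption rather than the release's own processing code.

        import numpy as np

        def pivot_ilr(parts):
            # Pivot (isometric log-ratio) coordinates for one composition, ordered as in
            # the release (Sc, Nd, ..., Na, Cl). Coordinate i contrasts part i with the
            # geometric mean of the parts that follow it.
            x = np.asarray(parts, dtype=float)
            D = x.size
            coords = []
            for i in range(D - 1):
                gmean = np.exp(np.mean(np.log(x[i + 1:])))
                scale = np.sqrt((D - i - 1) / (D - i))
                coords.append(scale * np.log(x[i] / gmean))
            return np.array(coords)

        def standardise(table):
            # Normalise each column to mean zero and sample standard deviation one,
            # as described for the temperature and log-ratio variables.
            table = np.asarray(table, dtype=float)
            return (table - table.mean(axis=0)) / table.std(axis=0, ddof=1)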

  13. Data from: CsEnVi Pairwise Parallel Corpora

    • live.european-language-grid.eu
    binary format
    Updated Nov 9, 2015
    Cite
    (2015). CsEnVi Pairwise Parallel Corpora [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/1106
    Explore at:
    binary formatAvailable download formats
    Dataset updated
    Nov 9, 2015
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The CsEnVi Pairwise Parallel Corpora consist of a Vietnamese-Czech parallel corpus and a Vietnamese-English parallel corpus. The corpora were assembled from the following sources:

    - OPUS, the open parallel corpus, is a growing multilingual corpus of translated open-source documents.

    The majority of Vi-En and Vi-Cs bitexts are subtitles from movies and television series.

    The bitexts are paraphrases of each other's meaning rather than direct translations.

    - TED talks, a collection of short talks on various topics, given primarily in English, transcribed and with transcripts translated to other languages. In our corpus, we use 1198 talks which had English and Vietnamese transcripts available and 784 talks which had Czech and Vietnamese transcripts available in January 2015.

    The size of the original corpora collected from OPUS and TED talks is as follows:

                     CS/VI               EN/VI
    Sentence         1337199/1337199     2035624/2035624
    Word             9128897/12073975    16638364/17565580
    Unique word      224416/68237        91905/78333

    We improve the quality of the corpora in two steps: normalizing and filtering.

    In the normalizing step, the corpora are cleaned based on the general format of subtitles and transcripts. For instance, sequences of dots indicate explicit continuation of subtitles across multiple time frames. The sequences of dots are distributed differently on the source and target sides. Removing the sequences of dots, along with a number of other normalization rules, improves the quality of the alignment significantly.

    In the filtering step, we adapt the CzEng filtering tool [1] to filter out bad sentence pairs.

    The size of cleaned corpora as published is as follows:

                     CS/VI               EN/VI
    Sentence         1091058/1091058     1113177/1091058
    Word             6718184/7646701     8518711/8140876
    Unique word      195446/59737        69513/58286

    The corpora are used as training data in [2].

    References:

    [1] Ondřej Bojar, Zdeněk Žabokrtský, et al. 2012. The Joy of Parallelism with CzEng 1.0. Proceedings of LREC2012. ELRA. Istanbul, Turkey.

    [2] Duc Tam Hoang and Ondřej Bojar, The Prague Bulletin of Mathematical Linguistics. Volume 104, Issue 1, Pages 75–86, ISSN 1804-0462. 9/2015

  14. Open Architecture Security Data Platform Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Oct 1, 2025
    Cite
    Dataintelo (2025). Open Architecture Security Data Platform Market Research Report 2033 [Dataset]. https://dataintelo.com/report/open-architecture-security-data-platform-market
    Explore at:
    pdf, csv, pptxAvailable download formats
    Dataset updated
    Oct 1, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Open Architecture Security Data Platform Market Outlook



    According to our latest research, the Open Architecture Security Data Platform market size reached USD 8.9 billion in 2024 at a robust growth rate, underpinned by the increasing demand for interoperable and scalable security solutions. The market is projected to expand at a CAGR of 14.2% from 2025 to 2033, reaching a forecasted value of USD 28.3 billion by 2033. This growth trajectory is largely driven by the rising sophistication of cyber threats and a global shift towards integrated, flexible security frameworks that can seamlessly accommodate evolving enterprise needs.




    The primary growth factors fueling the Open Architecture Security Data Platform market include the escalating complexity and frequency of cyberattacks across industries, compelling organizations to adopt advanced, open, and modular security infrastructures. Traditional monolithic security systems are increasingly being replaced by open architecture platforms that enable seamless integration with diverse security tools and data sources. This transition is critical for organizations aiming to achieve real-time threat detection, rapid incident response, and comprehensive compliance management. The proliferation of cloud-based applications, IoT devices, and remote workforces has further intensified the need for platforms capable of ingesting, normalizing, and analyzing heterogeneous security data at scale.




    Another significant driver is the global regulatory landscape, which is evolving rapidly to address the growing risks associated with digital transformation. Stringent data protection regulations such as GDPR, CCPA, and other regional mandates are pushing enterprises to adopt security data platforms that offer robust compliance management and auditable workflows. Open architecture solutions are particularly well-suited to this environment, as they provide the flexibility to integrate with compliance tools, automate reporting, and ensure data integrity across disparate environments. This regulatory pressure is especially pronounced in sectors like BFSI, healthcare, and government, where data sensitivity is paramount.




    The increasing adoption of advanced analytics, artificial intelligence, and machine learning within security operations is another catalyst for market expansion. Open architecture security data platforms are uniquely positioned to leverage these technologies, enabling organizations to derive actionable insights from vast volumes of security telemetry. By supporting interoperability with a wide array of analytics engines and data lakes, these platforms empower security teams to proactively identify threats, streamline incident response, and reduce mean time to detect and remediate breaches. As enterprises continue to prioritize security automation and intelligence-driven defense strategies, the demand for flexible, open, and extensible security data platforms will only intensify.




    From a regional perspective, North America currently dominates the Open Architecture Security Data Platform market due to its advanced cybersecurity ecosystem, high concentration of large enterprises, and early adoption of open security frameworks. However, Asia Pacific is emerging as the fastest-growing region, propelled by rapid digitalization, expanding regulatory requirements, and increasing investments in cybersecurity infrastructure. Europe also represents a significant market, driven by strict data protection laws and a strong focus on privacy and compliance. Latin America and the Middle East & Africa are witnessing steady growth as organizations in these regions ramp up their cybersecurity capabilities to counter rising threat levels and meet regulatory mandates.



    Component Analysis



    The Component segment of the Open Architecture Security Data Platform market is divided into software, hardware, and services, each playing a pivotal role in delivering comprehensive security solutions. Software remains the largest contributor, comprising advanced security analytics engines, data integration tools, and orchestration platforms that form the backbone of open architecture security frameworks. These software solutions are designed to provide modularity, interoperability, and scalability, allowing organizations to seamlessly integrate third-party tools and adapt to evolving security requirements. As cyber threats grow more sophisticated, the demand for feature-rich, customizable software platforms continues to rise.

  15. Left ventricular mass is underestimated in overweight children because of...

    • plos.figshare.com
    txt
    Updated May 31, 2023
    Cite
    Hubert Krysztofiak; Marcel Młyńczak; Łukasz A. Małek; Andrzej Folga; Wojciech Braksator (2023). Left ventricular mass is underestimated in overweight children because of incorrect body size variable chosen for normalization [Dataset]. http://doi.org/10.1371/journal.pone.0217637
    Explore at:
    txtAvailable download formats
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Hubert Krysztofiak; Marcel Młyńczak; Łukasz A. Małek; Andrzej Folga; Wojciech Braksator
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Left ventricular mass normalization for body size is recommended, but a question remains: what is the best body size variable for this normalization—body surface area, height or lean body mass computed based on a predictive equation? Since body surface area and computed lean body mass are derivatives of body mass, normalizing for them may result in underestimation of left ventricular mass in overweight children. The aim of this study is to indicate which of the body size variables normalize left ventricular mass without underestimating it in overweight children.

    Methods: Left ventricular mass assessed by echocardiography, height and body mass were collected for 464 healthy boys, 5–18 years old. Lean body mass and body surface area were calculated. Left ventricular mass z-scores computed based on reference data, developed for height, body surface area and lean body mass, were compared between overweight and non-overweight children. The next step was a comparison of paired samples of expected left ventricular mass, estimated for each normalizing variable based on two allometric equations—the first developed for overweight children, the second for children of normal body mass.

    Results: The mean of left ventricular mass z-scores is higher in overweight children compared to non-overweight children for normative data based on height (0.36 vs. 0.00) and lower for normative data based on body surface area (-0.64 vs. 0.00). Left ventricular mass estimated normalizing for height, based on the equation for overweight children, is higher in overweight children (128.12 vs. 118.40); however, masses estimated normalizing for body surface area and lean body mass, based on equations for overweight children, are lower in overweight children (109.71 vs. 122.08 and 118.46 vs. 120.56, respectively).

    Conclusion: Normalization for body surface area and for computed lean body mass, but not for height, underestimates left ventricular mass in overweight children.

  16. DDSP EMG dataset.xlsx

    • commons.datacite.org
    • figshare.com
    Updated Jul 14, 2019
    + more versions
    Cite
    Marta Cercone (2019). DDSP EMG dataset.xlsx [Dataset]. http://doi.org/10.6084/m9.figshare.8864411
    Explore at:
    Dataset updated
    Jul 14, 2019
    Dataset provided by
    DataCite (https://www.datacite.org/)
    Figshare (http://figshare.com/)
    Authors
    Marta Cercone
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This study was performed in accordance with the PHS Policy on Humane Care and Use of Laboratory Animals, federal and state regulations, and was approved by the Institutional Animal Care and Use Committees (IACUC) of Cornell University and the Ethics and Welfare Committee at the Royal Veterinary College.

    Study design: adult horses were recruited if in good health and following evaluation of the upper airways through endoscopic exam, at rest and during exercise, either overground or on a high-speed treadmill using a wireless videoendoscope. Horses were categorized as “DDSP” affected horses if they presented with exercise-induced intermittent dorsal displacement of the soft palate consistently during multiple (n=3) exercise tests, or “control” horses if they did not experience dorsal displacement of the soft palate during exercise and had no signs compatible with DDSP, such as palatal instability during exercise or soft palate or sub-epiglottic ulcerations. Horses were instrumented with intramuscular electrodes in one or both thyro-hyoid muscles for EMG recording, hard-wired to a wireless transmitter for remote recording implanted in the cervical area. EMG recordings were then made during an incremental exercise test based on the percentage of maximum heart rate (HRmax).

    Incremental exercise test: after surgical instrumentation, each horse performed a 4-step incremental test while recording TH electromyographic activity, heart rate, upper airway videoendoscopy, pharyngeal airway pressures, and gait frequency measurements. Horses were evaluated at exercise intensities corresponding to 50, 80, 90 and 100% of their maximum heart rate, with each speed maintained for 1 minute. Laryngeal function during the incremental test was recorded using a wireless videoendoscope (Optomed, Les Ulis, France), which was placed into the nasopharynx via the right ventral nasal meatus. Nasopharyngeal pressure was measured using a Teflon catheter (1.3 mm ID, Neoflon) inserted through the left ventral nasal meatus to the level of the left guttural pouch ostium. The catheter was attached to differential pressure transducers (Celesco LCVR, Celesco Transducers Products, Canoga Park, CA, USA) referenced to atmospheric pressure and calibrated from -70 to 70 mmHg. Occurrence of episodes of dorsal displacement of the soft palate was recorded, and the number of swallows during each exercise trial was counted for each speed interval.
    EMG recording: EMG data were recorded through a wireless transmitter device implanted subcutaneously. Two different transmitters were used: 1) TR70BB (Telemetry Research Ltd, Auckland, New Zealand) with 12bit A/D conversion resolution, AC coupled amplifier, -3dB point at 1.5Hz, 2KHz sampling frequency (n=5 horses); or 2) ELI (Center for Medical Physics and Biomedical Engineering, Medical University of Vienna, Vienna, Austria) [23], with 12bit A/D conversion resolution, AC coupled amplifier, amplifier gain 1450, 1KHz sampling frequency (n=4 horses). The EMG signal was transmitted through a receiver (TR70BB) or Bluetooth (ELI) to a data acquisition system (PowerLab 16/30 - ML880/P, ADInstruments, Bella Vista, Australia). The EMG signal was amplified with an octal bio-amplifier (Octal Bioamp, ML138, ADInstruments, Bella Vista, Australia) with a bandwidth frequency ranging from 20-1000 Hz (input impedance = 200 MV, common mode rejection ratio = 85 dB, gain = 1000), and transmitted to a personal computer. All EMG and pharyngeal pressure signals were collected at 2000 Hz rate with LabChart 6 software (ADInstruments, Bella Vista, Australia) that allows for real-time monitoring and storage for post-processing and analysis.
    EMG signal processing: Electromyographic signals from the TH muscles were processed using two methods: 1) a classical approach to myoelectrical activity and median frequency and 2) wavelet decomposition. For both methods, the beginning and end of recording segments including twenty consecutive breaths, at the end of each speed interval, were marked with comments in the acquisition software (LabChart). The relationship of EMG activity with phase of the respiratory cycle was determined by comparing pharyngeal pressure waveforms with the raw EMG and time-averaged EMG traces. For the classical approach, in a graphical user interface-based software (LabChart), a sixth-order Butterworth filter was applied (common mode rejection ratio, 90 dB; band pass, 20 to 1,000 Hz), the EMG signal was then amplified, full-wave rectified, and smoothed using a triangular Bartlett window (time constant: 150ms). The digitized area under the time-averaged full-wave rectified EMG signal was calculated to define the raw mean electrical activity (MEA) in mV.s. Median Power Frequency (MF) of the EMG power spectrum was calculated after a Fast Fourier Transformation (1024 points, Hann cosine window processing). For the wavelet decomposition, the whole dataset including comments and comment locations was exported as .mat files for processing in MATLAB R2018a with the Signal Processing Toolbox (The MathWorks Inc, Natick, MA, USA). A custom written automated script based on Hodson-Tole & Wakeling [24] was used to first cut the .mat file into the selected 20 breath segments and subsequently process each segment. A bank of 16 wavelets with time and frequency resolution optimized for EMG was used. The center frequencies of the bank ranged from 6.9 Hz to 804.2 Hz [25]. The intensity was summed (mV2) to a total, and the intensity contribution of each wavelet was calculated across all 20 breaths for each horse, with separate results for each trial date and exercise level (80, 90, 100% of HRmax as well as the period preceding episodes of DDSP). To determine the relevant bandwidths for the analysis, a Fast Fourier transform frequency analysis was performed on the horses unaffected by DDSP from 0 to 1000 Hz in increments of 50Hz and the contribution of each interval was calculated in percent of total spectrum as median and interquartile range. According to the Shannon-Nyquist sampling theorem, the relevant signal is below ½ the sample rate, and because we had instrumentation sampling at either 1000Hz or 2000Hz we chose to perform the frequency analysis up to 1000Hz. The 0-50Hz interval, mostly stride frequency and background noise, was excluded from further analysis. Of the remaining frequency spectrum, we included all intervals from 50-100Hz to 450-500Hz and excluded the remainder because they contributed less than 5% to the total amplitude.

    Data analysis: At the end of each exercise speed interval, twenty consecutive breaths were selected and analyzed as described above. To standardize MEA, MF and mV2 within and between horses and trials, and to control for different electrode sizes (i.e. different impedance and area of sampling), data were afterward normalized to the 80% of HRmax value (HRmax80), referred to as normalized MEA (nMEA), normalized MF (nMF) and normalized mV2 (nmV2). During the initial processing, it became clear that the TH muscle is inconsistently activated at 50% of HRmax and that speed level was therefore excluded from further analysis.
The endoscopy video was reviewed and episodes of palatal displacement were marked with comments. For both the classical approach and wavelet analysis, an EMG segment preceding and concurrent to the DDSP episode was analyzed. If multiple episodes were recorded during the same trial, only the period preceding the first palatal displacement was analyzed. In horses that had both TH muscles implanted, the average between the two sides was used for the analysis. Averaged data from multiple trials were considered for each horse. Descriptive data are expressed as means with standard deviation (SD). Normal distribution of data was assessed using the Kolmogorov-Smirnov test and quantile-quantile (Q-Q) plot. To determine the frequency clusters in the EMG signal, a hierarchical agglomerative dendrogram was applied using the packages Matplotlib, pandas, numpy and scipy in python (version 3.6.6) executed through Spyder (version 3.2.2) and Anaconda Navigator. Based on the frequency analysis, wavelets included in the cluster analysis were 92.4 Hz, 128.5 Hz, 170.4 Hz, 218.1 Hz, 271.5 Hz, 330.6 Hz, 395.4 Hz and 465.9 Hz. The number of frequency clusters was set to two based on maximum acceleration in a scree plot and maximum vertical distance in the dendrogram. For continuous outcome measures (number of swallows, MEA, MF, and mV2) a mixed effect model was fitted to the data to determine the relationship between the outcome variable and relevant fixed effects (breed, sex, age, weight, speed, group) using horse as a random effect. Tukey’s post hoc tests and linear contrasts used as appropriate. Statistical analysis was performed using JMP Pro13 (SAS Institute, Cary, NC, USA). Significance set at P < 0.05 throughout.
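    The classical processing chain described above can be approximated with standard signal-processing tools; the sketch below uses SciPy rather than LabChart, and the parameter handling (band edge kept below the Nyquist frequency, Welch periodogram with a 1024-point Hann window) is illustrative rather than a reproduction of the original analysis.

        import numpy as np
        from scipy import signal

        def classical_emg_metrics(emg, fs=2000.0, lowcut=20.0, highcut=1000.0, smooth_ms=150):
            # Sixth-order Butterworth band-pass, full-wave rectification, triangular
            # (Bartlett) smoothing over 150 ms, MEA as area under the smoothed trace,
            # and median frequency of the power spectrum.
            nyq = fs / 2.0
            high = min(highcut, 0.99 * nyq)       # keep the upper edge below Nyquist
            b, a = signal.butter(6, [lowcut / nyq, high / nyq], btype='band')
            filtered = signal.filtfilt(b, a, emg)

            rectified = np.abs(filtered)
            win = signal.windows.bartlett(int(fs * smooth_ms / 1000.0))
            smoothed = np.convolve(rectified, win / win.sum(), mode='same')
            mea = smoothed.sum() / fs             # area under the time-averaged trace (mV.s)

            freqs, psd = signal.welch(filtered, fs=fs, nperseg=1024)   # Hann window by default
            cumulative = np.cumsum(psd)
            mf = freqs[np.searchsorted(cumulative, cumulative[-1] / 2.0)]
            return mea, mf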

  17. Normalised Difference Vegetation Index Statistics (Long Term 1999-2019 /...

    • data.europa.eu
    netcdf
    Updated Apr 27, 2025
    Cite
    European Commission's Joint Research Centre (2025). Normalised Difference Vegetation Index Statistics (Long Term 1999-2019 / Short Term 2015-2019) (raster 1 km), global, 10-daily - version 3 [Dataset]. https://data.europa.eu/data/datasets/290e81fb-4c84-42ad-ae12-f663312b0eda?locale=sv
    Explore at:
    netcdfAvailable download formats
    Dataset updated
    Apr 27, 2025
    Dataset provided by
    Joint Research Centre (https://joint-research-centre.ec.europa.eu/index_en)
    European Commission (http://ec.europa.eu/)
    Authors
    European Commission's Joint Research Centre
    Description

    The Normalised Difference Vegetation Index (NDVI) is a widely used, dimensionless index that is indicative of vegetation density and is defined as NDVI = (NIR - Red) / (NIR + Red), where NIR corresponds to the reflectance in the near infrared bands, and Red to the reflectance in the red bands. The time series of 10-daily NDVI 1km version 3 observations is used to calculate Long Term Statistics (LTS), over the 20-year period 1999-2019, and Short Term Statistics (STS), over the 5-year period 2015-2019, for each of the 36 10-daily periods (dekads) of the year. The calculated statistics include the minimum, median, maximum, mean, standard deviation and the number of observations in the covered time series period. These statistics can be used as a reference for actual NDVI observations, which allows monitoring of anomalous vegetation conditions.
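    As a brief illustration, the index and its per-dekad statistics can be computed as sketched below; the array layout (one NDVI value per dekad of the reference period, with a dekad-of-year label) is assumed for the example.

        import numpy as np

        def ndvi(nir, red):
            # NDVI = (NIR - Red) / (NIR + Red)
            nir, red = np.asarray(nir, float), np.asarray(red, float)
            return (nir - red) / (nir + red)

        def dekad_statistics(series, dekad_of_year):
            # Long- or short-term statistics for one pixel: `series` holds the 10-daily
            # NDVI values over the reference period, `dekad_of_year` labels each value 0-35.
            stats = {}
            for d in range(36):
                values = series[dekad_of_year == d]
                values = values[~np.isnan(values)]
                if values.size == 0:
                    continue
                stats[d] = {'min': values.min(), 'median': np.median(values),
                            'max': values.max(), 'mean': values.mean(),
                            'std': values.std(), 'nobs': values.size}
            return stats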

  18. Benchmark dataset for agricultural KGML model development with PyKGML

    • zenodo.org
    bin
    Updated Jul 15, 2025
    Cite
    Yufeng Yang; Licheng Liu (2025). Benchmark dataset for agricultural KGML model development with PyKGML [Dataset]. http://doi.org/10.5281/zenodo.15883492
    Explore at:
    binAvailable download formats
    Dataset updated
    Jul 15, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Yufeng Yang; Licheng Liu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jul 9, 2025
    Description

    This benchmark dataset serves as demonstration data for testing PyKGML, a Python library for the efficient development of knowledge-guided machine learning (KGML) models.

    The dataset was developed using agroecosystem data from the following two KGML studies:

    1. "KGML-ag: A Modeling Framework of Knowledge-Guided Machine Learning to Simulate Agroecosystems: A Case Study of Estimating N2O Emission using Data from Mesocosm Experiments".
    Licheng Liu, Shaoming Xu, Zhenong Jin*, Jinyun Tang, Kaiyu Guan, Timothy J. Griffis, Matt D. Erickson, Alexander L. Frie, Xiaowei Jia, Taegon Kim, Lee T. Miller, Bin Peng, Shaowei Wu, Yufeng Yang, Wang Zhou, Vipin Kumar.

    2. "Knowledge-guided machine learning can improve carbon cycle quantification in agroecosystems".

    Licheng Liu, Wang Zhou, Kaiyu Guan, Bin Peng, Shaoming Xu, Jinyun Tang, Qing Zhu, Jessica Till, Xiaowei Jia, Chongya Jiang, Sheng Wang, Ziqi Qin, Hui Kong, Robert Grant, Symon Mezbahuddin, Vipin Kumar, Zhenong Jin.

    All the files belong to Dr. Licheng Liu, University of Minnesota. lichengl@umn.edu
    There are two parts in this dataset: the CO2 data from study 1 and the N2O data from study 2. Both contain a pre-training subset and a fine-tuning subset. Data descriptions are as follows:
    1. CO2 dataset:
    • Synthetic data of ecosys:
      • 100 simulations at random corn fields in the Midwest.
      • Daily sequences over 18 years (2000-2018).
    • Field observations:
      • Eddy-covariance observations from 11 flux towers in the Midwest.
      • A total of 102 site-years of daily sequences.
    • Input variables (19):
      • Meteorological (7): solar radiation (RADN), max air T (TMAX_AIR), (max-min) air T (TDIF_AIR), max air humidity (HMAX_AIR), (max-min) air humidity (HDIF_AIR), wind speed (WIND), precipitation (PRECN).
      • Soil properties (9): bulk density (TBKDS), sand content (TSAND), silt content (TSILT), field capacity (TFC), wilting point (TWP), saturated hydraulic conductivity (TKSat), soil organic carbon concentration (TSOC), pH (TPH), cation exchange capacity (TCEC).
      • Other (3): year (Year), crop type (Crop_Type), gross primary productivity (GPP).
    • Output variables (3):
      • Autotrophic respiration (Ra), heterotrophic respiration (Rh), net ecosystem exchange (NEE).
    2. N2O dataset:
    • Synthetic data of ecosys:
      • 1980 simulations at 99 counties x 20 N-fertilizer rates in the 3I states (Illinois, Iowa, Indiana).
      • Daily sequences over 18 years (2000-2018).
    • Field observations:
      • 6 chamber observations in a mesocosm environment facility at the University of Minnesota.
      • Daily sequences of 122 days x 3 years (2016-2018) x 1000 augmentations from hourly data at each chamber.
    • Input variables (16):
      • Meteorological (7): solar radiation (RADN), max air T (TMAX_AIR), min air T (TMIN_AIR), max air humidity (HMAX_AIR), min air humidity (HMIN_AIR), wind speed (WIND), precipitation (PRECN).
      • Soil properties (6): bulk density (TBKDS), sand content (TSAND), silt content (TSILT), pH (TPH), cation exchange capacity (TCEC), soil organic carbon concentration (TSOC).
      • Management (3): N-fertilizer rate (FERTZR_N), planting day of year (PDOY), crop type (PLANTT).
    • Output variables (3):
      • Soil N2O fluxes (N2O_FLUX), soil CO2 fluxes (CO2_FLUX), soil water content at 10 cm (WTR_3), soil ammonium concentration at 10 cm (NH4_3), soil nitrate concentration at 10 cm (NO3_3).
    Each file is a serialized Python dictionary containing the following keys and values:

    data={'X_train': X_train,
    'X_test': X_test,
    'Y_train': Y_train,
    'Y_test': Y_test,
    'x_scaler': x_scaler,
    'y_scaler': y_scaler,
    'input_features': input_features,
    'output_features': output_features}
    • X_train, X_test: Feature matrices for training and testing.

    • Y_train, Y_test: Target values for training and testing.

    • x_scaler: The scaler (mean, std) used for normalizing input features.
    • y_scaler: The scaler (mean, std) used for normalizing output features.

    • input_features: A list of input feature names.

    • output_features: A list of output feature names.
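    A minimal loading sketch is shown below; the file name is hypothetical, the serialization is assumed to be a Python pickle (the files are described only as serialized Python dictionaries), and the y_scaler is assumed to unpack as the (mean, std) pair described above.

        import pickle

        # Hypothetical file name for one of the subsets.
        with open("co2_finetune.pkl", "rb") as f:
            data = pickle.load(f)

        X_train, Y_train = data["X_train"], data["Y_train"]
        X_test, Y_test = data["X_test"], data["Y_test"]
        print(data["input_features"], data["output_features"])

        # Undo the (mean, std) normalization of the targets for reporting,
        # assuming y_scaler is a (mean, std) pair.
        y_mean, y_std = data["y_scaler"]
        Y_test_physical = Y_test * y_std + y_mean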

    Please download and use the latest version of this dataset, as it contains important updates.

    Contact: Dr. Licheng Liu (lichengl@umn.edu), Dr. Yufeng Yang (yang6956@umn.edu)

  19. DAQUAR Dataset (Processed) for VQA

    • kaggle.com
    zip
    Updated Jan 19, 2022
    Cite
    Tezan Sahu (2022). DAQUAR Dataset (Processed) for VQA [Dataset]. https://www.kaggle.com/datasets/tezansahu/processed-daquar-dataset/code
    Explore at:
    zip(430733804 bytes)Available download formats
    Dataset updated
    Jan 19, 2022
    Authors
    Tezan Sahu
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Context

    The first significant Visual Question Answering (VQA) dataset was the DAtaset for QUestion Answering on Real-world images (DAQUAR). It contains 6794 training and 5674 test question-answer pairs, based on images from the NYU-Depth V2 Dataset. That means about 9 pairs per image on average.

    This dataset is a processed version of the Full DAQUAR Dataset where the questions have been normalized (for easier consumption by tokenizers) & the image IDs, questions & answers are stored in a tabular (CSV) format, which can be loaded & used as-is for training VQA models.

    Content

    This dataset contains the processed DAQUAR Dataset (full), along with some of the raw files from the original dataset.

    Processed data:
    - data.csv: This is the processed dataset after normalizing all the questions & converting the {question, answer, image_id} data into a tabular format for easier consumption.
    - data_train.csv: This contains those records from data.csv which correspond to images present in train_images_list.txt.
    - data_eval.csv: This contains those records from data.csv which correspond to images present in test_images_list.txt.
    - answer_space.txt: This file contains a list of all possible answers extracted from all_qa_pairs.txt (this allows the VQA task to be modelled as a multi-class classification problem).

    Raw files:
    - all_qa_pairs.txt
    - train_images_list.txt
    - test_images_list.txt
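    A minimal loading sketch follows; the column names are assumed from the {question, answer, image_id} description, and the label mapping assumes every answer string appears verbatim in answer_space.txt.

        import pandas as pd

        train_df = pd.read_csv("data_train.csv")
        eval_df = pd.read_csv("data_eval.csv")

        # Answer space for treating VQA as multi-class classification.
        with open("answer_space.txt") as f:
            answer_space = [line.strip() for line in f if line.strip()]

        # Map each answer to a class index (assumes exact matches).
        train_df["label"] = train_df["answer"].apply(answer_space.index)
        print(len(answer_space), "classes;", len(train_df), "training pairs")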

    Acknowledgements

    Malinowski, Mateusz, and Mario Fritz. "A multi-world approach to question answering about real-world scenes based on uncertain input." Advances in neural information processing systems 27 (2014): 1682-1690.

  20. Data from: Advancing Fifth Percentile Hazard Concentration Estimation Using...

    • figshare.com
    • acs.figshare.com
    zip
    Updated Jun 2, 2023
    Cite
    Alexander K. Dhond; Mace G. Barron (2023). Advancing Fifth Percentile Hazard Concentration Estimation Using Toxicity-Normalized Species Sensitivity Distributions [Dataset]. http://doi.org/10.1021/acs.est.2c06857.s005
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    ACS Publications
    Authors
    Alexander K. Dhond; Mace G. Barron
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The species sensitivity distribution (SSD) is an internationally accepted approach to hazard estimation using the probability distribution of toxicity values that is representative of the sensitivity of a group of species to a chemical. Application of SSDs in ecological risk assessment has been limited by insufficient taxonomic diversity of species to estimate a statistically robust fifth percentile hazard concentration (HC5). We used the toxicity-normalized SSD (SSDn) approach (Lambert, F. N.; Raimondo, S.; Barron, M. G. Environ. Sci. Technol. 2022, 56, 8278–8289), modified to include all possible normalizing species, to estimate HC5 values for acute toxicity data for groups of carbamate and organophosphorus insecticides. We computed the mean and variance of single-chemical HC5 values for each chemical using leave-one-out (LOO) variance estimation and compared them to SSDn and conventionally estimated HC5 values. SSDn-estimated HC5 values showed low uncertainty and high accuracy compared to single-chemical SSDs when including all possible combinations of normalizing species within the chemical-taxa grouping (carbamate-all species, carbamate-fish, organophosphate-fish, and organophosphate-invertebrate). The SSDn approach is recommended for estimating HC5 values for compounds with insufficient species diversity for HC5 computation or high uncertainty in estimated single-chemical HC5 values. Furthermore, the LOO variance approach provides SSD practitioners with a simple computational method to estimate confidence intervals around an HC5 estimate that is nearly identical to the conventionally estimated HC5.
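    For orientation, the generic HC5 calculation and a leave-one-out resampling of it are sketched below under the common assumption of a log-normal SSD; this is not the SSDn normalization or the exact variance estimator used in the article.

        import numpy as np
        from scipy import stats

        def hc5_lognormal(toxicity_values):
            # Fifth percentile of a log-normal SSD fitted to species-level toxicity values.
            logs = np.log10(np.asarray(toxicity_values, dtype=float))
            mu, sigma = logs.mean(), logs.std(ddof=1)
            return 10 ** stats.norm.ppf(0.05, loc=mu, scale=sigma)

        def loo_hc5(toxicity_values):
            # Recompute HC5 with each species left out to gauge the spread of the estimate.
            values = np.asarray(toxicity_values, dtype=float)
            return np.array([hc5_lognormal(np.delete(values, i)) for i in range(values.size)])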

