68 datasets found
  1. A comparison of per sample global scaling and per gene normalization methods...

    • plos.figshare.com
    pdf
    Updated Jun 5, 2023
    Cite
    Xiaohong Li; Guy N. Brock; Eric C. Rouchka; Nigel G. F. Cooper; Dongfeng Wu; Timothy E. O’Toole; Ryan S. Gill; Abdallah M. Eteleeb; Liz O’Brien; Shesh N. Rai (2023). A comparison of per sample global scaling and per gene normalization methods for differential expression analysis of RNA-seq data [Dataset]. http://doi.org/10.1371/journal.pone.0176185
    Available download formats: pdf
    Dataset updated
    Jun 5, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Xiaohong Li; Guy N. Brock; Eric C. Rouchka; Nigel G. F. Cooper; Dongfeng Wu; Timothy E. O’Toole; Ryan S. Gill; Abdallah M. Eteleeb; Liz O’Brien; Shesh N. Rai
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Normalization is an essential step with considerable impact on high-throughput RNA sequencing (RNA-seq) data analysis. Although there are numerous methods for read count normalization, choosing an optimal method remains a challenge because multiple factors contribute to read count variability and affect the overall sensitivity and specificity. To properly determine the most appropriate normalization method, it is critical to compare the performance and shortcomings of a representative set of normalization routines across different dataset characteristics. Therefore, we set out to evaluate the performance of the commonly used methods (DESeq, TMM-edgeR, FPKM-CuffDiff, TC, Med, UQ and FQ) and two new methods we propose: Med-pgQ2 and UQ-pgQ2 (per-gene normalization after per-sample median or upper-quartile global scaling). Our per-gene normalization approach allows for comparisons between conditions based on similar count levels. Using the benchmark Microarray Quality Control Project (MAQC) and simulated datasets, we performed differential gene expression analysis to evaluate these methods. When evaluating MAQC2 with two replicates, we observed that Med-pgQ2 and UQ-pgQ2 achieved a slightly higher area under the Receiver Operating Characteristic curve (AUC), a specificity rate > 85%, a detection power > 92% and an actual false discovery rate (FDR) under 0.06 at the nominal FDR (≤0.05). Although the top commonly used methods (DESeq and TMM-edgeR) yield a higher power (>93%) for MAQC2 data, they trade off with a reduced specificity (
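
    The per-sample upper-quartile (UQ) global scaling that UQ-pgQ2 builds on can be sketched in a few lines; this is a minimal illustration of the general UQ idea only, not the authors' full per-gene pgQ2 procedure:

```python
import numpy as np

def upper_quartile_scale(counts):
    """Per-sample upper-quartile (UQ) global scaling: divide each sample's
    read counts by the 75th percentile of that sample's non-zero counts.
    Rows are genes, columns are samples. A sketch of the generic UQ step
    only, not the full UQ-pgQ2 method from the paper."""
    counts = np.asarray(counts, dtype=float)
    # 75th percentile of non-zero counts, computed per sample (column)
    uq = np.array([np.percentile(col[col > 0], 75) for col in counts.T])
    return counts / uq  # each column is now on a comparable scale
```

    After this step, samples whose library sizes differ only by a global factor become directly comparable.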

  2. The datasets used in this research.

    • plos.figshare.com
    xls
    Updated Dec 6, 2024
    Cite
    Chantha Wongoutong (2024). The datasets used in this research. [Dataset]. http://doi.org/10.1371/journal.pone.0310839.t001
    Available download formats: xls
    Dataset updated
    Dec 6, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Chantha Wongoutong
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Despite the popularity of k-means clustering, feature scaling before applying it is an essential yet often neglected step. In this study, feature scaling via five methods (Z-score standardization, Min-Max normalization, Percentile transformation, Maximum absolute scaling, or RobustScaler) was compared with using the raw (i.e., non-scaled) data when analyzing datasets whose features have different or the same units via k-means clustering. The results of an experimental study show that, for features with different units, scaling them before k-means clustering provided better accuracy, precision, recall, and F-score values than using the raw data, whereas when the features in the dataset had the same unit, scaling them beforehand provided results similar to using the raw data. Thus, scaling the features beforehand is a very important step for datasets with different units, as it improves the clustering results and accuracy. Of the five feature-scaling methods applied to the dataset with different units, Z-score standardization and Percentile transformation provided similar performances that were superior to the other methods or to using the raw data. While Maximum absolute scaling performed slightly better than the other scaling methods and the raw data when the dataset contained features with the same unit, the improvement was not significant.
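
    Two of the five scaling methods compared in this study, Z-score standardization and Min-Max normalization, reduce to one-line formulas. A minimal NumPy sketch, using hypothetical income/age columns as the "features with different units" case:

```python
import numpy as np

def z_score(X):
    # Z-score standardization: zero mean, unit variance per feature
    return (X - X.mean(axis=0)) / X.std(axis=0)

def min_max(X):
    # Min-Max normalization: rescale each feature to the [0, 1] range
    mins = X.min(axis=0)
    return (X - mins) / (X.max(axis=0) - mins)

# Features with very different units: income (dollars) vs. age (years).
# Without scaling, Euclidean distances in k-means are dominated by income.
X = np.array([[30000.0, 25.0], [32000.0, 60.0], [90000.0, 27.0]])
```

    After either transform, both columns contribute on comparable scales to the k-means distance computation.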

  3. File S1 - Normalization of RNA-Sequencing Data from Samples with Varying...

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    Updated Feb 25, 2014
    Cite
    Collas, Philippe; Rognes, Torbjørn; Aanes, Håvard; Winata, Cecilia; Moen, Lars F.; Aleström, Peter; Østrup, Olga; Mathavan, Sinnakaruppan (2014). File S1 - Normalization of RNA-Sequencing Data from Samples with Varying mRNA Levels [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001266682
    Dataset updated
    Feb 25, 2014
    Authors
    Collas, Philippe; Rognes, Torbjørn; Aanes, Håvard; Winata, Cecilia; Moen, Lars F.; Aleström, Peter; Østrup, Olga; Mathavan, Sinnakaruppan
    Description

    Table S1 and Figures S1–S6. Table S1. List of primers: forward and reverse primers used for qPCR. Figure S1. Changes in total and polyA+ RNA during development. a) Amount of total RNA per embryo at different developmental stages. b) Amount of polyA+ RNA per 100 embryos at different developmental stages. Vertical bars represent standard errors. Figure S2. The TMM scaling factor. a) The TMM scaling factor estimated using datasets 1 and 2; we observe very similar values. b) The TMM scaling factor obtained using the replicates in dataset 2; the TMM values are very reproducible. c) The TMM scaling factor when RNA-seq data based on total RNA was used. Figure S3. Comparison of scales. We either square-root transformed or used the scales directly, and compared the normalized fold-changes to RT-qPCR results. a) Transcripts with dynamic change pre-ZGA. b) Transcripts with decreased abundance post-ZGA. c) Transcripts with increased expression post-ZGA. Vertical bars represent standard deviations. Figure S4. Comparison of RT-qPCR results depending on RNA template (total or polyA+ RNA) and primers (random or oligo(dT) primers) for setd3 (a), gtf2e2 (b) and yy1a (c). The increase pre-ZGA is dependent on template (setd3 and gtf2e2) and not primer type. Figure S5. Efficiency-calibrated fold-changes for a subset of transcripts. Vertical bars represent standard deviations. Figure S6. Comparison of normalization methods using dataset 2 for transcripts with decreased expression post-ZGA (a) and increased expression post-ZGA (b). Vertical bars represent standard deviations. (PDF)

  4. Data and Code for: "Universal Adaptive Normalization Scale (AMIS):...

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Nov 12, 2025
    Cite
    Gennady Kravtsov (2025). Data and Code for: "Universal Adaptive Normalization Scale (AMIS): Integration of Heterogeneous Metrics into a Unified System" [Dataset]. http://doi.org/10.7910/DVN/BISM0N
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 12, 2025
    Dataset provided by
    Harvard Dataverse
    Authors
    Gennady Kravtsov
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Dataset Title: Data and Code for: "Universal Adaptive Normalization Scale (AMIS): Integration of Heterogeneous Metrics into a Unified System" Description: This dataset contains source data and processing results for validating the Adaptive Multi-Interval Scale (AMIS) normalization method. Includes educational performance data (student grades), economic statistics (World Bank GDP), and Python implementation of the AMIS algorithm with graphical interface. Contents: - Source data: educational grades and GDP statistics - AMIS normalization results (3, 5, 9, 17-point models) - Comparative analysis with linear normalization - Ready-to-use Python code for data processing Applications: - Educational data normalization and analysis - Economic indicators comparison - Development of unified metric systems - Methodology research in data scaling Technical info: Python code with pandas, numpy, scipy, matplotlib dependencies. Data in Excel format.
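
    The AMIS algorithm itself ships as the Python code in this dataset; for orientation, the linear normalization it is compared against can be sketched generically. This is a sketch of the comparison baseline only (the `points` parameter is a hypothetical name), not the AMIS method:

```python
def linear_normalize(values, points=5):
    """Map raw values onto a k-point scale by linear interpolation between
    the observed minimum and maximum. A generic linear-normalization
    baseline, not the AMIS algorithm."""
    lo, hi = min(values), max(values)
    return [1 + (points - 1) * (v - lo) / (hi - lo) for v in values]
```

    For example, grades 0, 5, and 10 map to 1, 3, and 5 on a 5-point model.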

  5. Binary classification using a confusion matrix.

    • plos.figshare.com
    xls
    Updated Dec 6, 2024
    Cite
    Chantha Wongoutong (2024). Binary classification using a confusion matrix. [Dataset]. http://doi.org/10.1371/journal.pone.0310839.t002
    Available download formats: xls
    Dataset updated
    Dec 6, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Chantha Wongoutong
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Despite the popularity of k-means clustering, feature scaling before applying it is an essential yet often neglected step. In this study, feature scaling via five methods (Z-score standardization, Min-Max normalization, Percentile transformation, Maximum absolute scaling, or RobustScaler) was compared with using the raw (i.e., non-scaled) data when analyzing datasets whose features have different or the same units via k-means clustering. The results of an experimental study show that, for features with different units, scaling them before k-means clustering provided better accuracy, precision, recall, and F-score values than using the raw data, whereas when the features in the dataset had the same unit, scaling them beforehand provided results similar to using the raw data. Thus, scaling the features beforehand is a very important step for datasets with different units, as it improves the clustering results and accuracy. Of the five feature-scaling methods applied to the dataset with different units, Z-score standardization and Percentile transformation provided similar performances that were superior to the other methods or to using the raw data. While Maximum absolute scaling performed slightly better than the other scaling methods and the raw data when the dataset contained features with the same unit, the improvement was not significant.

  6. Data from: Evaluation of normalization procedures for oligonucleotide array...

    • catalog.data.gov
    • odgavaprod.ogopendata.com
    • +1more
    Updated Sep 6, 2025
    Cite
    National Institutes of Health (2025). Evaluation of normalization procedures for oligonucleotide array data based on spiked cRNA controls [Dataset]. https://catalog.data.gov/dataset/evaluation-of-normalization-procedures-for-oligonucleotide-array-data-based-on-spiked-crna
    Dataset updated
    Sep 6, 2025
    Dataset provided by
    National Institutes of Health
    Description

    Background: Affymetrix oligonucleotide arrays simultaneously measure the abundances of thousands of mRNAs in biological samples. Comparability of array results is necessary for the creation of large-scale gene expression databases. The standard strategy for normalizing oligonucleotide array readouts has practical drawbacks. We describe alternative normalization procedures for oligonucleotide arrays based on a common pool of known biotin-labeled cRNAs spiked into each hybridization.

    Results: We first explore the conditions for validity of the 'constant mean assumption', the key assumption underlying current normalization methods. We introduce 'frequency normalization', a 'spike-in'-based normalization method which estimates array sensitivity, reduces background noise and allows comparison between array designs. This approach does not rely on the constant mean assumption and so can be effective in conditions where standard procedures fail. We also define 'scaled frequency', a hybrid normalization method relying on both spiked transcripts and the constant mean assumption while maintaining all other advantages of frequency normalization. We compare these two procedures to a standard global normalization method using experimental data. We also use simulated data to estimate accuracy and investigate the effects of noise. We find that scaled frequency is as reproducible and accurate as global normalization while offering several practical advantages.

    Conclusions: Scaled frequency quantitation is a convenient, reproducible technique that performs as well as global normalization on serial experiments with the same array design, while offering several additional features. Specifically, the scaled-frequency method enables the comparison of expression measurements across different array designs, yields estimates of absolute message abundance in cRNA and determines the sensitivity of individual arrays.
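
    The core spike-in idea, estimating each array's sensitivity from controls of known abundance and converting raw signals to abundance estimates, can be sketched as follows. This is an illustrative least-squares fit through the origin; the paper's frequency normalization additionally handles background noise:

```python
def spike_in_scale_factor(spike_signals, spike_amounts):
    """Estimate array sensitivity (signal per unit abundance) from spiked
    cRNA controls of known amounts, via a least-squares slope through the
    origin: signal ~ sensitivity * amount. A sketch of the spike-in idea,
    not the paper's full procedure."""
    num = sum(s * a for s, a in zip(spike_signals, spike_amounts))
    den = sum(a * a for a in spike_amounts)
    return num / den

def to_frequency(raw_signal, sensitivity):
    # Convert a raw probe signal into an estimated message abundance
    return raw_signal / sensitivity
```

    Because sensitivity is estimated per array from the same control pool, abundance estimates become comparable across arrays without relying on the constant mean assumption.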

  7. The performance results for k-means clustering and testing the hypothesis...

    • plos.figshare.com
    xls
    Updated Dec 6, 2024
    + more versions
    Cite
    Chantha Wongoutong (2024). The performance results for k-means clustering and testing the hypothesis for homogeneity between the true grouped data and feature scaling on datasets containing features with different units. [Dataset]. http://doi.org/10.1371/journal.pone.0310839.t003
    Available download formats: xls
    Dataset updated
    Dec 6, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Chantha Wongoutong
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The performance results for k-means clustering and testing the hypothesis for homogeneity between the true grouped data and feature scaling on datasets containing features with different units.

  8. Gender Recognition by Voice(processed)

    • kaggle.com
    Updated Jan 18, 2025
    Cite
    murtadha najim (2025). Gender Recognition by Voice(processed) [Dataset]. https://www.kaggle.com/datasets/murtadhanajim/vocal-gender-features
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 18, 2025
    Dataset provided by
    Kaggle
    Authors
    murtadha najim
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset is a cleaned and processed version of raw audio files for gender classification. The features were extracted from .wav audio recordings collected in a quiet room with no background noise. The data contains no null or duplicate values, ensuring a high-quality starting point for analysis and modeling.

    Features:

    The dataset includes the following extracted audio features:

    • mean_spectral_centroid: The average spectral centroid, representing the "center of mass" of the spectrum, indicating brightness.
    • std_spectral_centroid: The standard deviation of the spectral centroid, measuring variability in brightness.
    • mean_spectral_bandwidth: The average width of the spectrum, reflecting how spread out the frequencies are.
    • std_spectral_bandwidth: The standard deviation of spectral bandwidth, indicating variability in frequency spread.
    • mean_spectral_contrast: The average difference between peaks and valleys in the spectrum, indicating tonal contrast.
    • mean_spectral_flatness: The average flatness of the spectrum, measuring the noisiness of the signal.
    • mean_spectral_rolloff: The average frequency below which a specified percentage of the spectral energy resides, indicating sharpness.
    • zero_crossing_rate: The rate at which the signal crosses the zero amplitude axis, representing noisiness or percussiveness.
    • rms_energy: The root mean square energy of the signal, reflecting its loudness.
    • mean_pitch: The average pitch frequency of the audio.
    • min_pitch: The minimum pitch frequency.
    • max_pitch: The maximum pitch frequency.
    • std_pitch: The standard deviation of pitch frequency, measuring variability in pitch.
    • spectral_skew: The skewness of the spectral distribution, indicating asymmetry.
    • spectral_kurtosis: The kurtosis of the spectral distribution, indicating the peakiness of the spectrum.
    • energy_entropy: The entropy of the signal energy, representing its randomness.
    • log_energy: The logarithmic energy of the signal, a compressed representation of energy.
    • mfcc_1_mean to mfcc_13_mean: The means of the first 13 Mel Frequency Cepstral Coefficients (MFCCs), representing the timbral characteristics of the audio.
    • mfcc_1_std to mfcc_13_std: The standard deviations of the first 13 MFCCs, indicating variability in timbral features.
    • label: The target variable indicating the gender: male (1) or female (0).
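
    Two of the simpler features above can be computed from scratch with NumPy. A sketch of mean_spectral_centroid and zero_crossing_rate; the dataset's own extraction function lives in the author's notebook and may differ (for example, by computing framewise values before averaging):

```python
import numpy as np

def mean_spectral_centroid(y, sr):
    """Spectral centroid of the whole signal: the magnitude-weighted mean
    frequency, i.e. the 'center of mass' of the spectrum."""
    mag = np.abs(np.fft.rfft(y))
    freqs = np.fft.rfftfreq(len(y), d=1.0 / sr)
    return float(np.sum(freqs * mag) / np.sum(mag))

def zero_crossing_rate(y):
    # Fraction of consecutive sample pairs whose signs differ
    return float(np.mean(np.signbit(y[:-1]) != np.signbit(y[1:])))
```

    For a pure 440 Hz sine wave, the centroid sits at 440 Hz and the zero-crossing rate is about 2 × 440 / sr, since a sine crosses zero twice per cycle.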

    Key Information:

    • Clean Data: The dataset has been thoroughly cleaned and contains no null or duplicate values.
    • Unscaled: The features are not scaled, allowing users to apply their preferred scaling or normalization techniques.
    • Feature Extraction: The function used for feature extraction is available in the notebook in the Code section.
    • High Performance: The data achieved 95%+ accuracy using machine learning models such as Random Forest, Extra Trees, and K-Nearest Neighbors (KNN). It also performed exceptionally well with neural networks.

    Recommendations:

    Feature Selection: Avoid using all features in modeling to prevent overfitting. Instead, perform feature selection and choose the most impactful features based on your analysis.

    This processed dataset is a reliable and robust foundation for building high-performing models. If you need any help, you can visit my notebook in the Code section.

  9. Indeed Job Postings

    • kaggle.com
    zip
    Updated Jul 30, 2023
    Cite
    Spandana Kalakonda (2023). Indeed Job Postings [Dataset]. https://www.kaggle.com/datasets/spandanakalakonda/job-postings
    Available download formats: zip (6869322 bytes)
    Dataset updated
    Jul 30, 2023
    Authors
    Spandana Kalakonda
    Description

    Job Postings on Indeed Overview: This dataset contains job listings from various industries and locations.

    Columns and Features:

    1. title: The job title or position being offered.
    2. company: The company or organization offering the job.
    3. location: The address (city, region, zip) of where the job is located.
    4. type: The work arrangement, categorized as "onsite," "remote," or "hybrid."
    5. salary: The salary range associated with the job.
    6. experience: The expected or required experience level for the job.
    7. contract_type: The type of employment, categorized as "fulltime" or "contract."
    8. job_description: A summary of the job responsibilities and requirements.
    9. sub_industry: The specific sub-industry or niche the job belongs to.
    10. industry: The broader industry category to which the job belongs.
    11. exp_normalized: The normalized experience level for easier comparison, using min-max scaling.
    12. type norm: The normalized work arrangement for easier comparison.
    13. contract_type_norm: The normalized contract type for easier comparison, using Z-score normalization.
    14. salary_min: The minimum salary value in the salary range.
    15. salary_max: The maximum salary value in the salary range.
    16. salary_avg: The average salary value in the salary range.
    17. avg_annual_salary: The calculated average annual salary based on the salary range.
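
    The salary_min, salary_max, and salary_avg columns can be derived from the raw salary range with a small helper. A sketch assuming a "$50,000 - $70,000"-style range string (the actual raw format on Indeed varies):

```python
import re

def parse_salary_range(salary):
    """Split a salary-range string into (min, max, avg), mirroring the
    salary_min / salary_max / salary_avg columns. Assumes a string like
    '$50,000 - $70,000'; real Indeed formats vary."""
    nums = [float(n.replace(",", ""))
            for n in re.findall(r"[\d,]+(?:\.\d+)?", salary)]
    lo, hi = min(nums), max(nums)
    return lo, hi, (lo + hi) / 2.0
```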

    Acknowledgement: The dataset was sourced from Indeed, a popular job search board. The data was made available on Kaggle for educational and research purposes.

    https://www.indeed.com/

  10. NTU60 Processed Skeleton Dataset

    • kaggle.com
    zip
    Updated Aug 29, 2025
    Cite
    Oucherif Mohammed Ouail (2025). NTU60 Processed Skeleton Dataset [Dataset]. https://www.kaggle.com/datasets/oucherifouail/ntu60-processed-skeleton-dataset
    Available download formats: zip (3075187118 bytes)
    Dataset updated
    Aug 29, 2025
    Authors
    Oucherif Mohammed Ouail
    Description

    NTU RGB+D 60 – Preprocessed Skeleton Dataset

    This dataset provides preprocessed skeleton sequences from the NTU RGB+D 60 benchmark, widely used for skeleton-based human action recognition.

    The preprocessing module standardizes the raw NTU skeleton data to make it directly usable for training deep learning models.

    Preprocessing Steps

    Each skeleton sequence was processed by:

    • ✅ Removing NaN / invalid frames
    • ✅ Translating skeletons (centered spine base joint at origin)
    • ✅ Normalizing body scale using spine length
    • ✅ Aligning all sequences to 300 frames (padding or truncation)
    • ✅ Formatting sequences to include up to 2 persons per clip
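
    The frame-alignment step above (padding or truncating every sequence to 300 frames) can be sketched as follows; this is a sketch consistent with the stated preprocessing, not the dataset's exact code (`align_sequence` is a hypothetical name):

```python
import numpy as np

def align_sequence(seq, target_len=300):
    """Zero-pad or truncate a skeleton sequence of shape (frames, features)
    to a fixed length, as in the 'aligning all sequences to 300 frames'
    preprocessing step."""
    seq = np.asarray(seq, dtype=float)
    if len(seq) >= target_len:
        return seq[:target_len]          # truncate long sequences
    pad = np.zeros((target_len - len(seq), seq.shape[1]))
    return np.concatenate([seq, pad], axis=0)  # zero-pad short ones
```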

    Output Files

    Two .npz files are provided, following the standard evaluation protocols:

    1. NTU60_CS.npz → Cross-Subject split
    2. NTU60_CV.npz → Cross-View split

    Each file contains:

    • x_train → Training data, shape (N_train, 300, 150)
    • y_train → Training labels, shape (N_train, 60) (one-hot)
    • x_test → Testing data, shape (N_test, 300, 150)
    • y_test → Testing labels, shape (N_test, 60) (one-hot)

    Data Format

    • 300 = max frames per sequence (zero-padded)
    • 150 = 2 persons × 25 joints × 3 coordinates (x, y, z)
    • 60 = number of action classes

    If a sequence has only 1 person, the second person’s features are zero-filled.

    Skeleton Properties

    • Centered → Spine base joint (joint-2) at origin (0,0,0)
    • Normalized → Body size scaled consistently
    • Aligned → Fixed-length sequences (300 frames)
    • Two-person setting → Always represented with 150 features

    Evaluation Protocols

    • Cross-Subject (CS): Train and test sets split by different actors. The model is evaluated on unseen subjects to measure generalization across people.
    • Cross-View (CV): Train and test sets split by different camera views. The model is evaluated on unseen viewpoints to measure viewpoint invariance.

    Usage

    These .npz files can be directly loaded in PyTorch or NumPy-based pipelines. They are fully compatible with graph convolutional networks (GCNs), transformers, and other deep learning models for skeleton-based action recognition.

    Example:

    import numpy as np
    
    data = np.load("NTU60_CS.npz")
    x_train, y_train = data["x_train"], data["y_train"]
    
    print(x_train.shape) # (N_train, 300, 150)
    print(y_train.shape) # (N_train, 60)
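
    Since each 150-dimensional frame packs 2 persons × 25 joints × 3 coordinates, the loaded arrays can be reshaped for joint-level access. A follow-on sketch using a zero-filled stand-in array of the documented shape:

```python
import numpy as np

# Stand-in with the documented shape (N, 300, 150); in practice this would
# come from np.load("NTU60_CS.npz") as shown above
x_train = np.zeros((4, 300, 150))

# View each frame as 2 persons x 25 joints x 3 coordinates (x, y, z)
x_joints = x_train.reshape(x_train.shape[0], 300, 2, 25, 3)

# Check whether the second person is present (all-zero features if absent)
second_person_present = np.any(x_joints[:, :, 1] != 0, axis=(1, 2, 3))
```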
    
  11. Mission and Vision Statements (Normalized)

    • search.dataone.org
    • datasetcatalog.nlm.nih.gov
    Updated Oct 29, 2025
    Cite
    Anez, Diomar; Anez, Dimar (2025). Mission and Vision Statements (Normalized) [Dataset]. http://doi.org/10.7910/DVN/SFKSW0
    Dataset updated
    Oct 29, 2025
    Dataset provided by
    Harvard Dataverse
    Authors
    Anez, Diomar; Anez, Dimar
    Description

    This dataset provides processed and normalized/standardized indices for the management tool group focused on 'Mission and Vision Statements', including related concepts like Purpose Statements. Derived from five distinct raw data sources, these indices are specifically designed for comparative longitudinal analysis, enabling the examination of trends and relationships across different empirical domains (web search, literature, academic publishing, and executive adoption). The data presented here represent transformed versions of the original source data, aimed at achieving metric comparability. Users requiring the unprocessed source data should consult the corresponding Mission/Vision dataset in the Management Tool Source Data (Raw Extracts) Dataverse.

    Data Files and Processing Methodologies:

    • Google Trends File (Prefix: GT_): Normalized Relative Search Interest (RSI). Input Data: Native monthly RSI values from Google Trends (Jan 2004 - Jan 2025) for the query "mission statement" + "vision statement" + "mission and vision corporate". Processing: None; utilizes the original base-100 normalized Google Trends index. Output Metric: Monthly Normalized RSI (Base 100). Frequency: Monthly.

    • Google Books Ngram Viewer File (Prefix: GB_): Normalized Relative Frequency. Input Data: Annual relative frequency values from Google Books Ngram Viewer (1950-2022, English corpus, no smoothing) for the query Mission Statements + Vision Statements + Purpose Statements + Mission and Vision. Processing: Annual relative frequency series normalized (peak year = 100). Output Metric: Annual Normalized Relative Frequency Index (Base 100). Frequency: Annual.

    • Crossref.org File (Prefix: CR_): Normalized Relative Publication Share Index. Input Data: Absolute monthly publication counts matching Mission/Vision-related keywords [("mission statement" OR ...) AND (...) - see raw data for full query] in titles/abstracts (1950-2025), alongside total monthly Crossref publications; deduplicated via DOIs. Processing: Monthly relative share calculated (Mission/Vision Count / Total Count), then the monthly share series normalized (peak month's share = 100). Output Metric: Monthly Normalized Relative Publication Share Index (Base 100). Frequency: Monthly.

    • Bain & Co. Survey - Usability File (Prefix: BU_): Normalized Usability Index. Input Data: Original usability percentages (%) from Bain surveys for specific years: Mission/Vision (1993); Mission Statements (1996); Mission and Vision Statements (1999-2017); Purpose, Mission, and Vision Statements (2022). Processing: Semantic Grouping: data points across the different naming conventions were treated as a single conceptual series. Normalization: combined series normalized relative to its historical peak (Max % = 100). Output Metric: Biennial Estimated Normalized Usability Index (Base 100 relative to historical peak). Frequency: Biennial (approx.).

    • Bain & Co. Survey - Satisfaction File (Prefix: BS_): Standardized Satisfaction Index. Input Data: Original average satisfaction scores (1-5 scale) from Bain surveys for specific years (same names/years as Usability). Processing: Semantic Grouping: data points treated as a single conceptual series. Standardization (Z-scores): Z = (X - 3.0) / 0.891609. Index Scale Transformation: Index = 50 + (Z * 22). Output Metric: Biennial Standardized Satisfaction Index (center = 50, range ≈ [1, 100]). Frequency: Biennial (approx.).

    File Naming Convention: Files generally follow the pattern PREFIX_Tool_Processed.csv or similar, where the PREFIX indicates the data source (GT_, GB_, CR_, BU_, BS_). Consult the parent Dataverse description (Management Tool Comparative Indices) for general context and the methodological disclaimer. For original extraction details (specific keywords, URLs, etc.), refer to the corresponding Mission/Vision dataset in the Raw Extracts Dataverse. Comprehensive project documentation provides full details on all processing steps.
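
    The two transforms described above, base-100 peak normalization and the Bain satisfaction standardization, reduce to a few lines:

```python
def peak_normalize(series):
    # Base-100 normalization against the series' historical peak
    peak = max(series)
    return [100.0 * v / peak for v in series]

def satisfaction_index(score):
    # Per the description: Z = (X - 3.0) / 0.891609, Index = 50 + Z * 22
    z = (score - 3.0) / 0.891609
    return 50 + z * 22
```

    A mid-scale satisfaction score of 3.0 maps to exactly 50, and the maximum score of 5.0 maps to roughly 99.3.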

  12. Study comparing scaling with ranked subsampling (SRS) and rarefying for the...

    • search.dataone.org
    Updated Mar 21, 2025
    Cite
    BonaRes Repository (2025). Study comparing scaling with ranked subsampling (SRS) and rarefying for the normalization of species count data [Dataset]. https://search.dataone.org/view/sha256%3Aaebd3b305a7c3e99931a960ae7b540813075528f5d73dbbff839d8cf8476a98f
    Dataset updated
    Mar 21, 2025
    Dataset provided by
    BonaRes Repository
    Description

    Study comparing scaling with ranked subsampling (SRS) and rarefying for the normalization of species count data.
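
    Rarefying, one of the two normalization approaches compared here, can be sketched as random subsampling of a sample's species counts without replacement down to a fixed depth. A textbook illustration, not the study's own implementation:

```python
import random

def rarefy(counts, depth, seed=0):
    """Rarefying: randomly subsample species counts without replacement to
    a common library size. `counts[i]` is the read count of species i."""
    # Expand counts into a pool of individual reads tagged by species index
    pool = [sp for sp, c in enumerate(counts) for _ in range(c)]
    random.Random(seed).shuffle(pool)
    kept = pool[:depth]
    out = [0] * len(counts)
    for sp in kept:
        out[sp] += 1
    return out
```

    SRS (scaling with ranked subsampling), by contrast, scales counts deterministically before a ranked subsampling step, avoiding much of the random information loss of plain rarefying.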

  13. 30 m-scale Annual Global Normalized Difference Urban Index Datasets from...

    • scidb.cn
    Updated Mar 31, 2022
    Cite
    Di Liu; Yifang Wang; Xi Li; Qingling Zhang (2022). 30 m-scale Annual Global Normalized Difference Urban Index Datasets from 2000 to 2013 [Dataset]. http://doi.org/10.11922/sciencedb.01625
    Dataset updated
    Mar 31, 2022
    Dataset provided by
    ScienceDB
    Authors
    Di Liu; Yifang Wang; Xi Li; Qingling Zhang
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Urban areas play a very important role in global climate change, and there is increasing interest in comprehending global urban areas with adequate geographic detail for global climate change mitigation. Accurate and frequent urban area information is fundamental to comprehending urbanization processes and land use/cover change, as well as the impact of global climate and environmental change. Defense Meteorological Satellite Program/Operational Line Scan System (DMSP/OLS) night-light (NTL) imagery contributes powerfully to the spatial characterization of global cities; however, its application potential is seriously limited by its coarse resolution. In this paper, we generate an annual Normalized Difference Urban Index (NDUI) to characterize global urban areas at a 30 m resolution from 2000 to 2013 by combining Landsat-7 Normalized Difference Vegetation Index (NDVI) composites and DMSP/OLS NTL images on the Google Earth Engine (GEE) platform. With the capability to delineate urban boundaries and, at the same time, present sufficient spatial detail within urban areas, the NDUI datasets have the potential for urbanization studies at regional and global scales.
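
    The NDUI combines night-light brightness with vegetation greenness. A minimal sketch of the commonly used formulation, a normalized difference of rescaled NTL and NDVI (this exact form is an assumption here; consult the paper for the authoritative definition):

```python
import numpy as np

def ndui(ntl_dn, ndvi):
    """Normalized Difference Urban Index, sketched as
    (NTL - NDVI) / (NTL + NDVI), with the DMSP/OLS digital numbers first
    rescaled to [0, 1] (DMSP/OLS DN values span 0-63)."""
    ntl = np.asarray(ntl_dn, dtype=float) / 63.0
    ndvi = np.asarray(ndvi, dtype=float)
    return (ntl - ndvi) / (ntl + ndvi)
```

    Bright, vegetation-poor pixels (city cores) push the index toward +1, while dark, vegetated pixels push it toward -1.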

  14. Network Traffic Analysis: Data and Code

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    Updated Jun 12, 2024
    Cite
    Moran, Madeline; Honig, Joshua; Ferrell, Nathan; Soni, Shreena; Homan, Sophia; Chan-Tin, Eric (2024). Network Traffic Analysis: Data and Code [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_11479410
    Dataset updated
    Jun 12, 2024
    Dataset provided by
    Loyola University Chicago
    Authors
    Moran, Madeline; Honig, Joshua; Ferrell, Nathan; Soni, Shreena; Homan, Sophia; Chan-Tin, Eric
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Code:

    Packet_Features_Generator.py & Features.py

    To run this code:

    pkt_features.py [-h] -i TXTFILE [-x X] [-y Y] [-z Z] [-ml] [-s S] -j

    • -h, --help: show this help message and exit
    • -i TXTFILE: input text file
    • -x X: add the first X number of total packets as features
    • -y Y: add the first Y number of negative packets as features
    • -z Z: add the first Z number of positive packets as features
    • -ml: output to a text file all websites in the format websiteNumber1,feature1,feature2,...
    • -s S: generate samples using size S
    • -j

    Purpose:

    Turns a text file containing lists of incoming and outgoing network packet sizes into separate website objects with associated features.

    Uses Features.py to calculate the features.

    startMachineLearning.sh & machineLearning.py

    To run this code:

    bash startMachineLearning.sh

    This script runs machineLearning.py in a tmux session with the necessary file paths and flags.

    Options (to be edited within this file):

    --evaluate-only to test 5-fold cross-validation accuracy

    --test-scaling-normalization to test 6 different combinations of scalers and normalizers

    Note: once the best combination is determined, it should be added to the data_preprocessing function in machineLearning.py for future use

    --grid-search to search for the best grid-search hyperparameters. Note: the candidate hyperparameters must be added to train_model under 'if not evaluateOnly:'; once the best hyperparameters are determined, add them to train_model under 'if evaluateOnly:'

    Purpose:

    Using the .ml file generated by Packet_Features_Generator.py & Features.py, this program trains a random forest classifier on the provided data and reports results using cross-validation. These results include the best scaling and normalization options for each data set as well as the best grid-search hyperparameters from the provided ranges.
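The evaluate-only path described above can be sketched with scikit-learn (an illustrative reimplementation with synthetic features, not the authors' machineLearning.py):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))          # placeholder packet-size features
y = rng.integers(0, 2, size=100)       # placeholder website labels

# Scaler in front of the classifier, evaluated with 5-fold cross-validation
model = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=0))
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```

Swapping StandardScaler for other scalers/normalizers is how the --test-scaling-normalization comparison would be carried out.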

    Data

    Encrypted network traffic was collected on an isolated computer visiting different Wikipedia and New York Times articles, different Google search queries (collected in the form of their autocomplete results and their results page), and different actions taken on a virtual reality headset.

    Data for this experiment were stored and analyzed as one .txt file per experiment, which contains:

    The first number, a classification label denoting which website, query, or VR action is taking place.

    The remaining numbers in each line denote:

    The size of a packet,

    and the direction it is traveling.

    Negative numbers denote incoming packets;

    positive numbers denote outgoing packets.
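The line format described above can be parsed with a short Python sketch (illustrative, not part of the released code):

```python
def parse_line(line):
    """Parse one capture line: class label followed by signed packet sizes."""
    nums = [int(tok) for tok in line.split()]
    label, packets = nums[0], nums[1:]
    incoming = [p for p in packets if p < 0]   # negative = incoming
    outgoing = [p for p in packets if p > 0]   # positive = outgoing
    return label, incoming, outgoing

label, incoming, outgoing = parse_line("3 -1500 60 -40 1500")
# label 3, incoming [-1500, -40], outgoing [60, 1500]
```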

    Figure 4 Data

    This data uses specific lines from the Virtual Reality.txt file.

    The action 'LongText Search' refers to a user searching for "Saint Basils Cathedral" with text in the Wander app.

    The action 'ShortText Search' refers to a user searching for "Mexico" with text in the Wander app.

    The .xlsx and .csv files are identical.

    Each file includes (from right to left):

    The original packet data,

    each line of data sorted from smallest to largest packet size in order to calculate the mean and standard deviation of each packet capture,

    and the final Cumulative Distribution Function (CDF) calculation that generated the Figure 4 graph.
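The sorting, mean/standard-deviation, and CDF steps described above can be sketched in Python (an illustrative reimplementation of the spreadsheet calculation):

```python
import statistics

def empirical_cdf(packet_sizes):
    """Sort packet sizes and pair each with its cumulative probability."""
    ordered = sorted(packet_sizes)
    n = len(ordered)
    return [(size, (i + 1) / n) for i, size in enumerate(ordered)]

sizes = [-1500, 60, -40, 1500]
mu = statistics.mean(sizes)        # mean of the capture
sigma = statistics.stdev(sizes)    # sample standard deviation
cdf = empirical_cdf(sizes)         # points for the Figure 4-style CDF plot
```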

  15. WikiMed and PubMedDS: Two large-scale datasets for medical concept extraction and normalization research

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Dec 4, 2021
    Cite
    Shikhar Vashishth; Shikhar Vashishth; Denis Newman-Griffis; Denis Newman-Griffis; Rishabh Joshi; Ritam Dutt; Carolyn P Rosé; Rishabh Joshi; Ritam Dutt; Carolyn P Rosé (2021). WikiMed and PubMedDS: Two large-scale datasets for medical concept extraction and normalization research [Dataset]. http://doi.org/10.5281/zenodo.5753476
    Explore at:
    zipAvailable download formats
    Dataset updated
    Dec 4, 2021
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Shikhar Vashishth; Shikhar Vashishth; Denis Newman-Griffis; Denis Newman-Griffis; Rishabh Joshi; Ritam Dutt; Carolyn P Rosé; Rishabh Joshi; Ritam Dutt; Carolyn P Rosé
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Two large-scale, automatically-created datasets of medical concept mentions, linked to the Unified Medical Language System (UMLS).

    WikiMed

    Derived from Wikipedia data. Mappings of Wikipedia page identifiers to UMLS Concept Unique Identifiers (CUIs) were extracted by crosswalking Wikipedia, Wikidata, Freebase, and the NCBI Taxonomy to reach existing mappings to UMLS CUIs. This created a 1:1 mapping of approximately 60,500 Wikipedia pages to UMLS CUIs. Links to these pages were then extracted as mentions of the corresponding UMLS CUIs.

    WikiMed contains:

    • 393,618 Wikipedia page texts
    • 1,067,083 mentions of medical concepts
    • 57,739 unique UMLS CUIs

    Manual evaluation of 100 random samples of WikiMed found 91% accuracy in the automatic annotations at the level of UMLS CUIs, and 95% accuracy in terms of semantic type.

    PubMedDS

    Derived from biomedical literature abstracts from PubMed. Mentions were automatically identified using distant supervision based on Medical Subject Heading (MeSH) headers assigned to the papers in PubMed, and recognition of medical concept mentions using the high-performance scispaCy model. MeSH header codes are included as well as their mappings to UMLS CUIs.

    PubMedDS contains:

    • 13,197,430 abstract texts
    • 57,943,354 medical concept mentions
    • 44,881 unique UMLS CUIs

    Comparison with existing manually-annotated datasets (NCBI Disease Corpus, BioCDR, and MedMentions) found 75-90% precision in automatic annotations. Please note this dataset is not a comprehensive annotation of medical concept mentions in these abstracts (only mentions located through distant supervision from MeSH headers were included), but is intended as data for concept normalization research.

    Due to its size, PubMedDS is distributed as 30 individual files of approximately 1.5 million mentions each.

    Data format

    Both datasets use JSON format with one document per line. Each document has the following structure:

    {
      "_id": "A unique identifier for each document",
      "text": "Text of the document, over which mentions are annotated",
      "title": "Title of the Wikipedia/PubMed article",
      "split": "[Not in PubMedDS] Dataset split"
    }

  16. R script to reproduce "Improved normalization of species count data in ecology by scaling with ranked subsampling (SRS): application to microbial communities"

    • repository.soilwise-he.eu
    Updated Jul 1, 2020
    + more versions
    Cite
    (2020). R script to reproduce "Improved normalization of species count data in ecology by scaling with ranked subsampling (SRS): application to microbial communities". [Dataset]. https://repository.soilwise-he.eu/cat/collections/metadata:main/items/b7260968-33ab-4b37-8158-4c7f6a599a75
    Explore at:
    Dataset updated
    Jul 1, 2020
    Description

    The R script and data are available for download:
    https://metadata.bonares.de/smartEditor/rest/upload/ID_7050_2020_05_13_Beule_Karlovsky.zip

    R script and data for the reproduction of the paper entitled "Improved normalization of species count data in ecology by scaling with ranked subsampling (SRS): application to microbial communities" by Lukas Beule and Petr Karlovsky.

    Comparison of scaling with ranked subsampling (SRS) with rarefying for the normalization of species count data in ecology. The example provided is a library obtained from next generation sequencing of a soil bacterial community. Different alpha diversity measures, community composition, and relative abundance of taxonomic units are compared.
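The SRS idea can be sketched as follows (a simplified illustration: counts are scaled to the target library size, integer parts are kept, and the remainder is assigned by rank of the fractional parts, with ties broken here by original abundance; the published algorithm handles ties more carefully):

```python
def srs(counts, cmin):
    """Scaling with ranked subsampling: normalize `counts` to sum to `cmin`."""
    total = sum(counts)
    scaled = [c * cmin / total for c in counts]
    floors = [int(s) for s in scaled]          # integer parts
    remainder = cmin - sum(floors)             # reads still to distribute
    # Rank species by descending fractional part (ties: larger original count first)
    order = sorted(range(len(counts)),
                   key=lambda i: (scaled[i] - floors[i], counts[i]),
                   reverse=True)
    for i in order[:remainder]:
        floors[i] += 1
    return floors

print(srs([10, 5, 3, 2], cmin=10))  # normalized counts summing to 10
```

Unlike rarefying, this procedure is deterministic (up to tie-breaking), which is the property the paper exploits.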

  17. Data from: A harmonized Landsat Sentinel-2 (HLS) dataset for benchmarking time series reconstruction methods of vegetation indices

    • data.niaid.nih.gov
    • data.europa.eu
    Updated Jul 7, 2023
    Cite
    Consoli, Davide; Leal Parente, Leandro; Witjes, Martijn; Hengl, Tomislav (2023). A harmonized Landsat Sentinel-2 (HLS) dataset for benchmarking time series reconstruction methods of vegetation indices [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8119406
    Explore at:
    Dataset updated
    Jul 7, 2023
    Dataset provided by
    OpenGeoHub Foundation
    Authors
    Consoli, Davide; Leal Parente, Leandro; Witjes, Martijn; Hengl, Tomislav
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Satellite images can be used to derive time series of vegetation indices, such as the normalized difference vegetation index (NDVI) or enhanced vegetation index (EVI), at global scale. Unfortunately, recording artifacts, clouds, and other atmospheric contaminants affect a significant portion of the produced images, requiring ad-hoc techniques to reconstruct the time series in the affected regions. In the literature, several methods have been proposed to fill the gaps in the images, and some works have also presented performance comparisons between them (Roerink et al., 2000; Moreno-Martínez et al., 2020; Siabi et al., 2022). Because there is no ground truth for the reconstructed images, performance evaluation requires datasets in which artificial gaps are introduced into a reference image, so that metrics like the root mean square error (RMSE) can be computed by comparing the reconstructed images with the reference one. Different approaches have been used to create the reference images and the artificial gaps, but in most cases the artificial gaps are introduced using arbitrary patterns and/or the reference image is produced artificially rather than from real satellite images (e.g. Kandasamy et al., 2013; Liu et al., 2017; Julien & Sobrino, 2018). In addition, to the best of our knowledge, few of these datasets are openly available and directly accessible, allowing for fully reproducible research.

    We provide here a benchmark dataset for time series reconstruction methods based on the harmonized Landsat Sentinel-2 (HLS) collection, in which the artificial gaps are introduced with a realistic spatio-temporal distribution. In particular, we selected six tiles that we consider representative of most of the main climate classes (e.g. equatorial, arid, warm temperate, boreal and polar), as depicted in the preview.

    Specifically, following the relative tiling system shown above, we downloaded the Red, NIR and F-mask bands from both the HLSL30 and HLSS30 collections for the tiles 19FCV, 22LEH, 32QPK, 31UFS, 45WFV and 49MWM. From the Red and NIR bands we derived the NDVI as:

    NDVI = (NIR − Red) / (NIR + Red)

    only for clear-sky, on-land pixels (F-mask bits 1, 3, 4 and 5 equal to zero), setting the remaining pixels to not-a-number. The images are then aggregated on a 16-day basis, averaging the available values for each pixel in each temporal range. We consider the data obtained in this way the reference data for the benchmark, stored following the file naming convention
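The clear-sky masking and NDVI computation above can be sketched with NumPy (synthetic arrays; the bit mask 0b00111010 selects F-mask bits 1, 3, 4 and 5):

```python
import numpy as np

def ndvi_clear_sky(red, nir, fmask):
    """NDVI = (NIR - Red) / (NIR + Red), NaN where F-mask bits 1, 3, 4, 5 are set."""
    clear = (fmask & 0b00111010) == 0      # bits 1, 3, 4 and 5 must all be zero
    ndvi = (nir - red) / (nir + red)
    return np.where(clear, ndvi, np.nan)

red = np.array([0.1, 0.2])
nir = np.array([0.5, 0.4])
fmask = np.array([0b00000000, 0b00000010])   # second pixel flagged (bit 1 set)
result = ndvi_clear_sky(red, nir, fmask)     # first pixel kept, second set to NaN
```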

    HLS.T<TILE_NAME>.<YYYYDDD>.v2.0.NDVI.tif

    where TILE_NAME is one of the tiles specified above, YYYY is the corresponding year (spanning from 2015 to 2022) and DDD is the day of the year on which the corresponding 16-day range starts. Finally, for each tile we have a time series composed of 184 images (23 images per year for 8 years) that can be easily manipulated, for example using the Scikit-Map library in Python.

    Starting from those data, for each image we took the mask of the gaps already present, randomly rotated it by 90, 180 or 270 degrees, and added artificial gaps at the pixels of the rotated mask. In this way the spatio-temporal distribution of the artificial gaps remains realistic, providing a solid benchmark for gap-filling methods that work on time series, on spatial patterns, or on a combination of both.
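The rotated-mask procedure can be sketched with NumPy (synthetic arrays, illustrative only; square tiles assumed so the rotated mask keeps the image's shape):

```python
import numpy as np

def add_artificial_gaps(image, seed=0):
    """Rotate the image's own gap mask by 90/180/270 degrees and apply it as new gaps."""
    rng = np.random.default_rng(seed)
    gap_mask = np.isnan(image)          # pixels that are already missing
    k = rng.integers(1, 4)              # 1, 2 or 3 quarter-turns
    rotated = np.rot90(gap_mask, k=k)   # same shape for square tiles
    out = image.copy()
    out[rotated] = np.nan               # introduce the artificial gaps
    return out

img = np.array([[0.2, np.nan], [0.5, 0.6]])
gapped = add_artificial_gaps(img)
```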

    The data including the artificial gaps are stored with the naming structure

    HLS.T<TILE_NAME>.<YYYYDDD>.v2.0.NDVI_art_gaps.tif

    following the previously mentioned convention. Performance metrics, such as RMSE or normalized RMSE (NRMSE), can be computed by applying a reconstruction method to the images with artificial gaps and then comparing the reconstructed time series with the reference one, only at the locations of the artificially created gaps.
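Restricting the metric to the artificial gap locations can be sketched as:

```python
import numpy as np

def rmse_on_gaps(reference, reconstructed, with_gaps):
    """RMSE computed only where gaps were artificially introduced."""
    # Artificial gaps: missing in the gapped image but present in the reference
    artificial = np.isnan(with_gaps) & ~np.isnan(reference)
    diff = reconstructed[artificial] - reference[artificial]
    return float(np.sqrt(np.mean(diff ** 2)))

ref = np.array([1.0, 2.0, 3.0, 4.0])
gapped = np.array([1.0, np.nan, 3.0, np.nan])   # positions 1 and 3 are artificial gaps
recon = np.array([1.0, 2.5, 3.0, 3.5])
error = rmse_on_gaps(ref, recon, gapped)
```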

    This dataset was used to compare the performance of some gap-filling methods and we provide a Jupyter notebook that shows how to access and use the data. The files are provided in GeoTIFF format and projected in the coordinate reference system WGS 84 / UTM zone 19N (EPSG:32619).

    If you succeed in producing higher accuracy or develop a new gap-filling algorithm, please contact the authors or post on our GitHub repository. May the force be with you!

    References:

    Julien, Y., & Sobrino, J. A. (2018). TISSBERT: A benchmark for the validation and comparison of NDVI time series reconstruction methods. Revista de Teledetección, (51), 19-31. https://doi.org/10.4995/raet.2018.9749

    Kandasamy, S., Baret, F., Verger, A., Neveux, P., & Weiss, M. (2013). A comparison of methods for smoothing and gap filling time series of remote sensing observations–application to MODIS LAI products. Biogeosciences, 10(6), 4055-4071. https://doi.org/10.5194/bg-10-4055-2013

    Liu, R., Shang, R., Liu, Y., & Lu, X. (2017). Global evaluation of gap-filling approaches for seasonal NDVI with considering vegetation growth trajectory, protection of key point, noise resistance and curve stability. Remote Sensing of Environment, 189, 164-179. https://doi.org/10.1016/j.rse.2016.11.023

    Moreno-Martínez, Á., Izquierdo-Verdiguier, E., Maneta, M. P., Camps-Valls, G., Robinson, N., Muñoz-Marí, J., ... & Running, S. W. (2020). Multispectral high resolution sensor fusion for smoothing and gap-filling in the cloud. Remote Sensing of Environment, 247, 111901. https://doi.org/10.1016/j.rse.2020.111901

    Roerink, G. J., Menenti, M., & Verhoef, W. (2000). Reconstructing cloudfree NDVI composites using Fourier analysis of time series. International Journal of Remote Sensing, 21(9), 1911-1917. https://doi.org/10.1080/014311600209814

    Siabi, N., Sanaeinejad, S. H., & Ghahraman, B. (2022). Effective method for filling gaps in time series of environmental remote sensing data: An example on evapotranspiration and land surface temperature images. Computers and Electronics in Agriculture, 193, 106619. https://doi.org/10.1016/j.compag.2021.106619

  18. The FluxnetEO dataset (Landsat) for American stations located in United States (0)

    • meta.icos-cp.eu
    Updated Nov 19, 2021
    + more versions
    Cite
    Simon Besnard; Jacob A. Nelson; Sophia Walther; Ulrich Weber (2021). The FluxnetEO dataset (Landsat) for American stations located in United States (0) [Dataset]. https://meta.icos-cp.eu/objects/O2wzjvBSss9oj7aW3Icu4_jv
    Explore at:
    Dataset updated
    Nov 19, 2021
    Dataset provided by
    Carbon Portal
    ICOS data portal
    Authors
    Simon Besnard; Jacob A. Nelson; Sophia Walther; Ulrich Weber
    License

    http://meta.icos-cp.eu/ontologies/cpmeta/icosLicencehttp://meta.icos-cp.eu/ontologies/cpmeta/icosLicence

    Time period covered
    Jan 1, 1984 - Dec 31, 2017
    Area covered
    US-AR1, US-Aud, United States, United States, United States, United States, United States, United States, United States, United States
    Description

    Quality-checked and gap-filled monthly Landsat observations of surface reflectance at global eddy-covariance sites for the period 1984-2017. Two product versions: one features all Landsat pixels within a 2 km radius around a given site, and a second consists of an average time series representing the area within 1 km2 around a site. All data layers have a complementary layer with gap-fill information. Landsat data comprise all sites in the Fluxnet La Thuile, Berkeley and ICOS Drought 2018 data releases.

    Reflectance products: enhanced vegetation index (EVI), normalized difference vegetation index (NDVI), generalized NDVI (kNDVI), near infra-red reflectance of vegetation (NIRv), normalized difference water index (NDWI) with both shortwave infra-red bands as reference, the scaled wide dynamic range vegetation index (sWDRVI), and surface reflectance in individual Landsat bands. Based on the Landsat 4, 5, 7 and 8 collection 1 products with a pixel size of 30 m.

    Supplementary data to Walther, S., Besnard, S., Nelson, J.A., El-Madany, T. S., Migliavacca, M., Weber, U., Ermida, S. L., Brümmer, C., Schrader, F., Prokushkin, A., Panov, A., Jung, M., 2021. A view from space on global flux towers by MODIS and Landsat: The FluxnetEO dataset, in preparation for Biogeosciences Discussions.

    ZIP archive of netCDF files for the stations in the Americas: US-AR1, US-AR2, US-ARM, US-ARb, US-ARc, US-Atq, US-Aud, US-Bar, US-Bkg, US-Blo, US-Bn2, US-Bn3, US-Bo1, US-Bo2, US-Brw, US-CRT, US-CaV, US-Cop, US-Dk3, US-FPe

    Besnard, S., Nelson, J., Walther, S., Weber, U. (2021). The FluxnetEO dataset (Landsat) for American stations located in United States (0), 1984-01-01–2017-12-30, https://hdl.handle.net/11676/O2wzjvBSss9oj7aW3Icu4_jv

  19. EV_Dataset

    • kaggle.com
    zip
    Updated Jul 13, 2025
    Cite
    Srijan Upadhyay (2025). EV_Dataset [Dataset]. https://www.kaggle.com/datasets/srijan1upadhyay/ev-dataset
    Explore at:
    zip(10947947 bytes)Available download formats
    Dataset updated
    Jul 13, 2025
    Authors
    Srijan Upadhyay
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    🔋 Electric Vehicle (EV) Dataset for Machine Learning and EDA

    This is a **synthetic EV dataset** created for deep-level EDA, machine learning, and feature engineering exercises.

    📁 File: ev_dataset.csv
    📦 Size: 250,000 rows × 22 columns

    📘 About This Dataset

    This dataset simulates EV specifications, pricing, geography, and performance metrics that resemble scraped data from multiple auto platforms globally.

    It is crafted to:

    • Simulate raw but structured data
    • Allow exploration, cleaning, and transformation
    • Train machine learning models for classification, regression, and recommendation tasks

    🧠 Use Cases

    📊 Exploratory Data Analysis (EDA)

    • Analyze price, range, battery size distribution
    • Study EV availability across countries & cities
    • Compare manufacturers and efficiency

    🔧 Data Cleaning & Preprocessing

    • Missing values, outlier detection
    • Encoding: one-hot, label, ordinal
    • Normalization, scaling

    🧠 Machine Learning Tasks

    🔹 Classification

    • Predict target_high_efficiency based on features

    🔹 Regression

    • Predict range_km or price_usd using other specs

    🔹 Clustering

    • Group similar EVs by range, performance, cost

    🔹 Recommendation Systems

    • Suggest EVs by user constraints (budget, speed, efficiency)

    🗃️ Columns Overview

| Column | Type | Description |
|---|---|---|
| manufacturer | string | EV brand (Tesla, BYD, etc.) |
| model | string | Model name (Model S, Leaf, etc.) |
| type | string | Vehicle type (SUV, Sedan, etc.) |
| drive_type | string | Drivetrain: AWD, FWD, RWD |
| fuel_type | string | Electric or Hybrid |
| color | string | Exterior color |
| battery_kwh | float | Battery capacity in kWh |
| range_km | float | Estimated range in kilometers |
| charging_time_hr | float | Time to charge 0–100% in hours |
| fast_charging | boolean | Supports fast charging (True/False) |
| release_year | int | Model release year |
| country | string | Available country |
| city | string | City of availability |
| seats | int | Number of seats |
| price_usd | float | Price in USD |
| efficiency_score | float | Range per kWh efficiency score |
| acceleration_0_100_kmph | float | 0–100 km/h acceleration time (seconds) |
| top_speed_kmph | float | Top speed in km/h |
| warranty_years | int | Warranty period in years |
| cargo_space_liters | float | Cargo/trunk capacity |
| safety_rating | float | Safety rating (out of 5.0) |
| target_high_efficiency | binary | Target label: 1 if efficiency > 5.0, else 0 |
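The label rule in the last row (1 if efficiency_score > 5.0, else 0) can be reproduced with pandas on a tiny synthetic frame:

```python
import pandas as pd

# Minimal synthetic frame, not the real ev_dataset.csv
df = pd.DataFrame({
    "model": ["Model S", "Leaf", "Dolphin"],
    "efficiency_score": [6.2, 4.8, 5.1],
})

# Binary target: 1 when efficiency_score exceeds 5.0
df["target_high_efficiency"] = (df["efficiency_score"] > 5.0).astype(int)
print(df["target_high_efficiency"].tolist())  # [1, 0, 1]
```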

    ✅ Ideal For

    • Data science portfolio projects
    • Feature engineering challenges
    • End-to-end ML pipelines
    • Model interpretability and SHAP analysis
    • ML model benchmarking

    🚀 Next Steps (Pick One or All)

    • 🔍 Perform EDA using pandas-profiling or SweetViz
    • ⚙️ Build preprocessing pipeline (missing values, scaling, encoding)
    • 🧠 Train ML models to classify target_high_efficiency
    • 📈 Use plotly or seaborn for insightful visualizations
    • 💡 Try deploying a recommender app using streamlit
  20. Unmanned Aerial Systems (UAS) Multispectral & Normalized difference vegetation index (NDVI) Orthomosaics of Subset area within Circumpolar Active Layer Monitoring (CALM) grid Atqasuk, Alaska, USA, 2019

    • arcticdata.io
    • search-demo.dataone.org
    • +2more
    Updated Jun 15, 2020
    + more versions
    Cite
    Sergio A. Vargas Zesati; Stephen M. Escarzaga; Craig E. Tweedie; Jeremy May; Robert Hollister; Steven Oberbauer (2020). Unmanned Aerial Systems (UAS) Multispectral & Normalized difference vegetation index (NDVI) Orthomosaics of Subset area within Circumpolar Active Layer Monitoring (CALM) grid Atqasuk, Alaska, USA, 2019 [Dataset]. http://doi.org/10.18739/A2V698C71
    Explore at:
    Dataset updated
    Jun 15, 2020
    Dataset provided by
    Arctic Data Center
    Authors
    Sergio A. Vargas Zesati; Stephen M. Escarzaga; Craig E. Tweedie; Jeremy May; Robert Hollister; Steven Oberbauer
    Time period covered
    Jul 30, 2019
    Area covered
    Variables measured
    NDVI, Band 1, Band 2, Band 3, Band 4, Band 5, Band 6
    Description

    Beginning in the summer of 2018, high-resolution multispectral Unmanned Aerial Systems (UAS) datasets were collected across multiple sites in northern Alaska in an effort to characterize land cover, test remote sensing scaling methods, and validate satellite data products. Orthomosaic datasets were developed from multispectral images acquired with a Micasense RedEdge-M sensor using standardized Structure-from-Motion (SfM) photogrammetry workflows. UAS data products reflect landscape and vegetation composition and status at the time of flight. We produced two data products: (1) a 6-band multispectral orthomosaic of the Circumpolar Active Layer Monitoring Program (CALM) subset area and (2) a 1-band Normalized Difference Vegetation Index (NDVI) orthomosaic of the CALM subset area. All data have been calibrated using the proprietary Micasense grey-scale reflectance panel and/or the on-board drone sun sensor. The 6-band multispectral image consists of: Band 1: Blue (475 nm center, 20 nm bandwidth), Band 2: Green (560 nm center, 20 nm bandwidth), Band 3: Red (668 nm center, 10 nm bandwidth), Band 4: Red Edge (717 nm center, 10 nm bandwidth), Band 5: Near-IR (840 nm center, 40 nm bandwidth), Band 6: NDVI (not scaled). The 1-band NDVI product consists of: Band 1: Scaled NDVI map (calculated within ArcMap).
