67 datasets found
  1. Single non-normalized data of electron probe analyses of all glass shard samples from the Seward Peninsula and the Lipari obsidian reference standard

    • service.tib.eu
    • doi.pangaea.de
    • +2 more
    Updated Nov 30, 2024
    Cite
    (2024). Single non-normalized data of electron probe analyses of all glass shard samples from the Seward Peninsula and the Lipari obsidian reference standard [Dataset]. https://service.tib.eu/ldmservice/dataset/png-doi-10-1594-pangaea-859554
    Explore at:
    Dataset updated
    Nov 30, 2024
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Area covered
    Seward Peninsula
    Description

    Permafrost degradation influences the morphology, biogeochemical cycling and hydrology of Arctic landscapes over a range of time scales. To reconstruct temporal patterns of early to late Holocene permafrost and thermokarst dynamics, site-specific palaeo-records are needed. Here we present a multi-proxy study of a 350-cm-long permafrost core from a drained lake basin on the northern Seward Peninsula, Alaska, revealing Lateglacial to Holocene thermokarst lake dynamics in a central location of Beringia. Use of radiocarbon dating, micropalaeontology (ostracods and testaceans), sedimentology (grain-size analyses, magnetic susceptibility, tephra analyses), geochemistry (total nitrogen and carbon, total organic carbon, d13Corg) and stable water isotopes (d18O, dD, d excess) of ground ice allowed the reconstruction of several distinct thermokarst lake phases. These include a pre-lacustrine environment at the base of the core characterized by the Devil Mountain Maar tephra (22 800±280 cal. a BP, Unit A), which has vertically subsided in places due to subsequent development of a deep thermokarst lake that initiated around 11 800 cal. a BP (Unit B). At about 9000 cal. a BP this lake transitioned from a stable depositional environment to a very dynamic lake system (Unit C) characterized by fluctuating lake levels, potentially intermediate wetland development, and expansion and erosion of shore deposits. Complete drainage of this lake occurred at 1060 cal. a BP, including post-drainage sediment freezing from the top down to 154 cm and gradual accumulation of terrestrial peat (Unit D), as well as uniform upward talik refreezing. This core-based reconstruction of multiple thermokarst lake generations since 11 800 cal. a BP improves our understanding of the temporal scales of thermokarst lake development from initiation to drainage, demonstrates complex landscape evolution in the ice-rich permafrost regions of Central Beringia during the Lateglacial and Holocene, and enhances our understanding of biogeochemical cycles in thermokarst-affected regions of the Arctic.

  2. Clean Meta Kaggle

    • kaggle.com
    Updated Sep 8, 2023
    Cite
    Yoni Kremer (2023). Clean Meta Kaggle [Dataset]. https://www.kaggle.com/datasets/yonikremer/clean-meta-kaggle
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 8, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Yoni Kremer
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Cleaned Meta-Kaggle Dataset

    The Original Dataset - Meta-Kaggle

    Explore our public data on competitions, datasets, kernels (code / notebooks) and more. Meta Kaggle may not be the Rosetta Stone of data science, but we do think there's a lot to learn (and plenty of fun to be had) from this collection of rich data about Kaggle’s community and activity.

    Strategizing to become a Competitions Grandmaster? Wondering who, where, and what goes into a winning team? Choosing evaluation metrics for your next data science project? The kernels published using this data can help. We also hope they'll spark some lively Kaggler conversations and be a useful resource for the larger data science community.

    (Image: https://i.imgur.com/2Egeb8R.png)

    This dataset is made available as CSV files through Kaggle Kernels. It contains tables on public activity from Competitions, Datasets, Kernels, Discussions, and more. The tables are updated daily.

    Please note: This data is not a complete dump of our database. Rows, columns, and tables have been filtered out and transformed.

    August 2023 update

    In August 2023, we released Meta Kaggle for Code, a companion to Meta Kaggle containing public, Apache 2.0 licensed notebook data. View the dataset and instructions for how to join it with Meta Kaggle here

    We also updated the license on Meta Kaggle from CC-BY-NC-SA to Apache 2.0.

    The Problems with the Original Dataset

    • The original dataset consists of 32 CSV files with 268 columns and 7 GB of compressed data. Having so many tables and columns makes the data hard to understand.
    • The data is not normalized, so joining tables produces many errors.
    • Some values refer to non-existent rows in other tables. For example, the UserId column in the ForumMessages table has values that do not exist in the Users table.
    • There are missing values.
    • There are duplicate values.
    • There are invalid values, for example Ids that are not positive integers.
    • The date and time columns are not in the right format.
    • Some columns contain the same value for every row, so they are not useful.
    • The boolean columns store the string values True or False.
    • The Total columns contain incorrect values. For example, DatasetCount is not the total number of datasets with a given Tag according to the DatasetTags table.
    • Users upvote their own messages.

    The Solution

    • To handle so many tables and columns, I use a relational database. I use MySQL, but any relational database will work.
    • The steps to create the database are:
    • Creating the database tables with the right data types and constraints. I do that by running the db_abd_create_tables.sql script.
    • Downloading the CSV files from Kaggle using the Kaggle API.
    • Cleaning the data using pandas. I do that by running the clean_data.py script (a minimal sketch of these steps appears after this list). The script does the following for each table:
      • Drops the columns that are not needed.
      • Converts each column to the right data type.
      • Replaces foreign keys that do not exist with NULL.
      • Replaces some of the missing values with default values.
      • Removes rows with missing values in the primary key / not-null columns.
      • Removes duplicate rows.
    • Loading the data into the database using the LOAD DATA INFILE command.
    • Checking that the number of rows in each database table matches the number of rows in the corresponding CSV file.
    • Adding foreign key constraints to the database tables. I do that by running the add_foreign_keys.sql script.
    • Updating the Total columns in the database tables. I do that by running the update_totals.sql script.
    • Backing up the database.
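
    A minimal sketch of the per-table cleaning steps described above, assuming one hypothetical table (ForumMessages referencing Users) and placeholder column names beyond Id and UserId; the actual clean_data.py script covers all 32 tables:

      # Illustrative sketch only; table and most column names are placeholders.
      import pandas as pd

      messages = pd.read_csv("ForumMessages.csv")
      users = pd.read_csv("Users.csv")

      # Drop columns that are not needed (placeholder column name)
      messages = messages.drop(columns=["ExtraColumn"], errors="ignore")

      # Convert columns to the right data types
      messages["Id"] = pd.to_numeric(messages["Id"], errors="coerce").astype("Int64")
      messages["UserId"] = pd.to_numeric(messages["UserId"], errors="coerce").astype("Int64")
      messages["PostDate"] = pd.to_datetime(messages["PostDate"], errors="coerce")

      # Replace foreign keys that do not exist in the parent table with NULL
      messages.loc[~messages["UserId"].isin(users["Id"]), "UserId"] = pd.NA

      # Replace some missing values with defaults, then drop rows missing the primary key
      messages["Message"] = messages["Message"].fillna("")
      messages = messages.dropna(subset=["Id"])

      # Remove duplicate rows and write out a file ready for LOAD DATA INFILE
      messages = messages.drop_duplicates()
      messages.to_csv("ForumMessages_clean.csv", index=False)
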
  3. Data from: A systematic evaluation of normalization methods and probe replicability using infinium EPIC methylation data

    • data.niaid.nih.gov
    • dataone.org
    • +2 more
    zip
    Updated May 30, 2023
    Cite
    H. Welsh; C. M. P. F. Batalha; W. Li; K. L. Mpye; N. C. Souza-Pinto; M. S. Naslavsky; E. J. Parra (2023). A systematic evaluation of normalization methods and probe replicability using infinium EPIC methylation data [Dataset]. http://doi.org/10.5061/dryad.cnp5hqc7v
    Explore at:
    Available download formats: zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    Universidade de São Paulo
    University of Toronto
    Hospital for Sick Children
    Authors
    H. Welsh; C. M. P. F. Batalha; W. Li; K. L. Mpye; N. C. Souza-Pinto; M. S. Naslavsky; E. J. Parra
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Background The Infinium EPIC array measures the methylation status of > 850,000 CpG sites. The EPIC BeadChip uses a two-array design: Infinium Type I and Type II probes. These probe types exhibit different technical characteristics which may confound analyses. Numerous normalization and pre-processing methods have been developed to reduce probe type bias as well as other issues such as background and dye bias.
    Methods This study evaluates the performance of various normalization methods using 16 replicated samples and three metrics: absolute beta-value difference, overlap of non-replicated CpGs between replicate pairs, and effect on beta-value distributions. Additionally, we carried out Pearson’s correlation and intraclass correlation coefficient (ICC) analyses using both raw and SeSAMe 2 normalized data.
    Results The method we define as SeSAMe 2, which consists of the application of the regular SeSAMe pipeline with an additional round of QC, pOOBAH masking, was found to be the best-performing normalization method, while quantile-based methods were found to be the worst performing methods. Whole-array Pearson’s correlations were found to be high. However, in agreement with previous studies, a substantial proportion of the probes on the EPIC array showed poor reproducibility (ICC < 0.50). The majority of poor-performing probes have beta values close to either 0 or 1, and relatively low standard deviations. These results suggest that probe reliability is largely the result of limited biological variation rather than technical measurement variation. Importantly, normalizing the data with SeSAMe 2 dramatically improved ICC estimates, with the proportion of probes with ICC values > 0.50 increasing from 45.18% (raw data) to 61.35% (SeSAMe 2).

    Methods

    Study Participants and Samples

    The whole blood samples were obtained from the Health, Well-being and Aging (Saúde, Ben-estar e Envelhecimento, SABE) study cohort. SABE is a cohort of census-withdrawn elderly from the city of São Paulo, Brazil, followed up every five years since the year 2000, with DNA first collected in 2010. Samples from 24 elderly adults were collected at two time points for a total of 48 samples. The first time point is the 2010 collection wave, performed from 2010 to 2012, and the second time point was set in 2020 in a COVID-19 monitoring project (9±0.71 years apart). The 24 individuals were 67.41±5.52 years of age (mean ± standard deviation) at time point one; and 76.41±6.17 at time point two and comprised 13 men and 11 women.

    All individuals enrolled in the SABE cohort provided written consent, and the ethics protocols were approved by local and national institutional review boards COEP/FSP/USP OF.COEP/23/10, CONEP 2044/2014, CEP HIAE 1263-10, University of Toronto RIS 39685.

    Blood Collection and Processing

    Genomic DNA was extracted from whole peripheral blood samples collected in EDTA tubes. DNA extraction and purification followed manufacturer’s recommended protocols, using Qiagen AutoPure LS kit with Gentra automated extraction (first time point) or manual extraction (second time point), due to discontinuation of the equipment but using the same commercial reagents. DNA was quantified using Nanodrop spectrometer and diluted to 50ng/uL. To assess the reproducibility of the EPIC array, we also obtained technical replicates for 16 out of the 48 samples, for a total of 64 samples submitted for further analyses. Whole Genome Sequencing data is also available for the samples described above.

    Characterization of DNA Methylation using the EPIC array

    Approximately 1,000ng of human genomic DNA was used for bisulphite conversion. Methylation status was evaluated using the MethylationEPIC array at The Centre for Applied Genomics (TCAG, Hospital for Sick Children, Toronto, Ontario, Canada), following protocols recommended by Illumina (San Diego, California, USA).

    Processing and Analysis of DNA Methylation Data

    The R/Bioconductor packages Meffil (version 1.1.0), RnBeads (version 2.6.0), minfi (version 1.34.0) and wateRmelon (version 1.32.0) were used to import, process and perform quality control (QC) analyses on the methylation data. Starting with the 64 samples, we first used Meffil to infer the sex of the 64 samples and compared the inferred sex to reported sex. Utilizing the 59 SNP probes that are available as part of the EPIC array, we calculated concordance between the methylation intensities of the samples and the corresponding genotype calls extracted from their WGS data. We then performed comprehensive sample-level and probe-level QC using the RnBeads QC pipeline. Specifically, we (1) removed probes if their target sequences overlap with a SNP at any base, (2) removed known cross-reactive probes, (3) used the iterative Greedycut algorithm to filter out samples and probes, using a detection p-value threshold of 0.01, and (4) removed probes if more than 5% of the samples had a missing value. Since RnBeads does not have a function to perform probe filtering based on bead number, we used the wateRmelon package to extract bead numbers from the IDAT files and calculated the proportion of samples with bead number < 3. Probes with more than 5% of samples having low bead number (< 3) were removed. For the comparison of normalization methods, we also computed detection p-values using out-of-band probes empirical distribution with the pOOBAH() function in the SeSAMe (version 1.14.2) R package, with a p-value threshold of 0.05, and the combine.neg parameter set to TRUE. In the scenario where pOOBAH filtering was carried out, it was done in parallel with the previously mentioned QC steps, and the resulting probes flagged in both analyses were combined and removed from the data.

    Normalization Methods Evaluated

    The normalization methods compared in this study were implemented using different R/Bioconductor packages and are summarized in Figure 1. All data was read into R workspace as RG Channel Sets using minfi’s read.metharray.exp() function. One sample that was flagged during QC was removed, and further normalization steps were carried out in the remaining set of 63 samples. Prior to all normalizations with minfi, probes that did not pass QC were removed. Noob, SWAN, Quantile, Funnorm and Illumina normalizations were implemented using minfi. BMIQ normalization was implemented with ChAMP (version 2.26.0), using as input Raw data produced by minfi’s preprocessRaw() function. In the combination of Noob with BMIQ (Noob+BMIQ), BMIQ normalization was carried out using as input minfi’s Noob normalized data. Noob normalization was also implemented with SeSAMe, using a nonlinear dye bias correction. For SeSAMe normalization, two scenarios were tested. For both, the inputs were unmasked SigDF Sets converted from minfi’s RG Channel Sets. In the first, which we call “SeSAMe 1”, SeSAMe’s pOOBAH masking was not executed, and the only probes filtered out of the dataset prior to normalization were the ones that did not pass QC in the previous analyses. In the second scenario, which we call “SeSAMe 2”, pOOBAH masking was carried out in the unfiltered dataset, and masked probes were removed. This removal was followed by further removal of probes that did not pass previous QC, and that had not been removed by pOOBAH. Therefore, SeSAMe 2 has two rounds of probe removal. Noob normalization with nonlinear dye bias correction was then carried out in the filtered dataset. Methods were then compared by subsetting the 16 replicated samples and evaluating the effects that the different normalization methods had in the absolute difference of beta values (|β|) between replicated samples.
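
    As a simple illustration of the replicate-based comparison described above (the authors' actual pipeline is R/Bioconductor-based), the following sketch computes the per-probe mean absolute beta-value difference across replicate pairs; the exported file name and the replicate-pair sample naming are assumptions:

      # Illustrative only: per-probe mean |delta beta| between technical replicates.
      # File name and the "_rep1"/"_rep2" sample naming are placeholders.
      import pandas as pd

      betas = pd.read_csv("normalized_betas.csv", index_col=0)   # probes x samples matrix

      pair_ids = sorted({c.rsplit("_rep", 1)[0] for c in betas.columns})
      abs_diffs = []
      for pid in pair_ids:
          a, b = betas[pid + "_rep1"], betas[pid + "_rep2"]
          abs_diffs.append((a - b).abs())

      # A lower per-probe mean |delta beta| indicates that a normalization method
      # brings replicate measurements closer together.
      mean_abs_diff = pd.concat(abs_diffs, axis=1).mean(axis=1)
      print(mean_abs_diff.describe())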

  4. Data from: Filtration and Normalization of Sequencing Read Data in Whole-Metagenome Shotgun Samples

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Oct 20, 2016
    Cite
    Losada, Patricia Moran; Chouvarine, Philippe; DeLuca, David S.; Wiehlmann, Lutz; Tümmler, Burkhard (2016). Filtration and Normalization of Sequencing Read Data in Whole-Metagenome Shotgun Samples [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001539099
    Explore at:
    Dataset updated
    Oct 20, 2016
    Authors
    Losada, Patricia Moran; Chouvarine, Philippe; DeLuca, David S.; Wiehlmann, Lutz; Tümmler, Burkhard
    Description

    Ever-increasing affordability of next-generation sequencing makes whole-metagenome sequencing an attractive alternative to traditional 16S rDNA, RFLP, or culturing approaches for the analysis of microbiome samples. The advantage of whole-metagenome sequencing is that it allows direct inference of the metabolic capacity and physiological features of the studied metagenome without reliance on the knowledge of genotypes and phenotypes of the members of the bacterial community. It also makes it possible to overcome problems of 16S rDNA sequencing, such as unknown copy number of the 16S gene and lack of sufficient sequence similarity of the “universal” 16S primers to some of the target 16S genes. On the other hand, next-generation sequencing suffers from biases resulting in non-uniform coverage of the sequenced genomes. To overcome this difficulty, we present a model of GC-bias in sequencing metagenomic samples as well as filtration and normalization techniques necessary for accurate quantification of microbial organisms. While there has been substantial research in normalization and filtration of read-count data in such techniques as RNA-seq or Chip-seq, to our knowledge, this has not been the case for the field of whole-metagenome shotgun sequencing. The presented methods assume that complete genome references are available for most microorganisms of interest present in metagenomic samples. This is often a valid assumption in such fields as medical diagnostics of patient microbiota. Testing the model on two validation datasets showed four-fold reduction in root-mean-square error compared to non-normalized data in both cases. The presented methods can be applied to any pipeline for whole metagenome sequencing analysis relying on complete microbial genome references. We demonstrate that such pre-processing reduces the number of false positive hits and increases accuracy of abundance estimates.

  5. GC/MS Simulated Data Sets including batch effects and data truncation (not normalized)

    • dataverse.harvard.edu
    • dataone.org
    Updated Jan 25, 2017
    Cite
    Denise Scholtens (2017). GC/MS Simulated Data Sets including batch effects and data truncation (not normalized) [Dataset]. http://doi.org/10.7910/DVN/JDRJGY
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 25, 2017
    Dataset provided by
    Harvard Dataverse
    Authors
    Denise Scholtens
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    1000 simulated data sets stored in a list of R dataframes used in support of Reisetter et al. (submitted) 'Mixture model normalization for non-targeted gas chromatography / mass spectrometry metabolomics data'. These are simulated data sets that include batch effects and data truncation and are not yet normalized.

  6. Data from: X-ray fluorescence (XRF) scanning data (raw data, not normalized) from IODP Site U1387, 0-200 mcd

    • doi.pangaea.de
    html, tsv
    Updated Mar 26, 2014
    Cite
    Patrick Grunert; Ursula Röhl; Antje H L Voelker; Carlota Escutia; Carlos A Alvarez Zarikian; David A Hodell; André Bahr; Francisco Jose Jiménez-Espejo; Nada Kolasinac; Francisco Javier Hernandéz-Molina; Dorrik A V Stow (2014). X-ray fluorescence (XRF) scanning data (raw data, not normalized) from IODP Site U1387, 0-200 mcd [Dataset]. http://doi.org/10.1594/PANGAEA.831119
    Explore at:
    Available download formats: tsv, html
    Dataset updated
    Mar 26, 2014
    Dataset provided by
    PANGAEA
    Authors
    Patrick Grunert; Ursula Röhl; Antje H L Voelker; Carlota Escutia; Carlos A Alvarez Zarikian; David A Hodell; André Bahr; Francisco Jose Jiménez-Espejo; Nada Kolasinac; Francisco Javier Hernandéz-Molina; Dorrik A V Stow
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Area covered
    Variables measured
    AGE, Depth, composite, Sample code/label, DEPTH, sediment/rock, Tin, area, total counts, Iron, area, total counts, Lead, area, total counts, Zinc, area, total counts, Barium, area, total counts, Copper, area, total counts, and 24 more
    Description

    This dataset is about: X-ray fluorescence (XRF) scanning data (raw data, not normalized) from IODP Site U1387, 0-200 mcd. Please consult parent dataset @ https://doi.org/10.1594/PANGAEA.831133 for more information.

  7. Normalized Metabolomics Data

    • figshare.com
    xlsx
    Updated Jan 13, 2022
    Cite
    Tyler Hilsabeck (2022). Normalized Metabolomics Data [Dataset]. http://doi.org/10.6084/m9.figshare.17712401.v1
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jan 13, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Tyler Hilsabeck
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Normalized metabolomics data for 178 DGRP strains on 2 diets.

  8. SMMnet - Super Mario Maker

    • kaggle.com
    zip
    Updated May 13, 2019
    Cite
    Leonardo Moraes (2019). SMMnet - Super Mario Maker [Dataset]. https://www.kaggle.com/datasets/leomauro/smmnet/discussion
    Explore at:
    Available download formats: zip (106558315 bytes)
    Dataset updated
    May 13, 2019
    Authors
    Leonardo Moraes
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    Online games have become a popular form of entertainment, reaching millions of players. Games are dynamic environments in which players interact with the game and with other players all over the world. Games therefore produce rich data due to their digital nature, making them a promising environment for studying and applying Artificial Intelligence and Data Mining techniques.

    Context

    In this Kaggle Dataset, we provide over 115 thousand game maps created in Super Mario Maker and over 880 thousand players who performed over 7 million interactions on these maps. By interactions, we mean that a player can: (1) create a game map; (2) play a map created by other players; (3) "clear" a map by completing its challenge; (4) be the first to clear a map; (5) beat the time record of a map; and (6) "like" a game map at any time. Note that this dataset captures temporal changes for each game map over a period of three months.

    The data was extracted from supermariomakerbookmark.nintendo.net, the game website. It is now publicly available for everyone to play with, explore and research. This dataset serves as a good base for learning models, including, but not limited to, Player Modeling (e.g., player experience), Data Mining (e.g., prediction and pattern finding), and Social Network Analysis (e.g., community detection, link prediction, ranking).

    Dataset

    This dataset is split into seven files:
    - courses.csv: game maps data.
    - course-meta.csv: temporal changes on game maps.
    - players.csv: players' data.
    - plays.csv: plays over time.
    - clears.csv: clears over time.
    - likes.csv: likes over time.
    - records.csv: records over time.

    Data Description

    (Image: Schema for SMMnet, https://i.imgur.com/iY69dnT.png)

    The figure illustrates a schema with non-normalized tables for storing SMMnet in a Relational Database Management System (RDBMS). Basically, it is composed of seven tables, one for each CSV file, covering the maps, the players, and the changes over time. Note that there are two Primary Keys (PK) in these tables, courses.id and players.id, which the Foreign Keys (FK) of the other tables reference to associate with them.
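
    A sketch of linking the CSV files through the primary keys named above (courses.id and players.id); the foreign-key column names used here ("course" and "player") are assumptions for illustration, not taken from the schema figure:

      # Sketch only: joining the plays file to courses and players via the described keys.
      import pandas as pd

      courses = pd.read_csv("courses.csv")
      players = pd.read_csv("players.csv")
      plays = pd.read_csv("plays.csv")

      # Join each play to the map it was played on and to the player who played it
      plays_full = (plays
                    .merge(courses, left_on="course", right_on="id", suffixes=("", "_course"))
                    .merge(players, left_on="player", right_on="id", suffixes=("", "_player")))

      # Example aggregate: number of plays recorded per course over the three-month window
      plays_per_course = plays_full.groupby("course").size().sort_values(ascending=False)
      print(plays_per_course.head())
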

    Inspiration

    1. Detecting game influencers (e.g., twitch and YouTube users). Work.
    2. Predict the popularity of a game over time.
    3. Identify popular games characteristics.
    4. Player Modeling (e.g., player activity)
    5. Social Network Analysis (e.g., community detection, link prediction)
    6. Your creativity!

    Acknowledgements

    • Nintendo Inc., Kyoto, Japan, for creating this amazing game.
    • Photo by Cláudio Luiz Castro on Unsplash.

    Citation

    SMMNet is available for researchers and data scientists under the Creative Commons BY license. In case of publication and/or public use, as well as any dataset derived from it, one should acknowledge its creators by citing us. Bibtex.

  9. Benchmarking (Normalized)

    • datasetcatalog.nlm.nih.gov
    • search.dataone.org
    Updated May 6, 2025
    Cite
    Anez, Dimar; Anez, Diomar (2025). Benchmarking (Normalized) [Dataset]. http://doi.org/10.7910/DVN/VW7AAX
    Explore at:
    Dataset updated
    May 6, 2025
    Authors
    Anez, Dimar; Anez, Diomar
    Description

    This dataset provides processed and normalized/standardized indices for the management tool 'Benchmarking'. Derived from five distinct raw data sources, these indices are specifically designed for comparative longitudinal analysis, enabling the examination of trends and relationships across different empirical domains (web search, literature, academic publishing, and executive adoption). The data presented here represent transformed versions of the original source data, aimed at achieving metric comparability. Users requiring the unprocessed source data should consult the corresponding Benchmarking dataset in the Management Tool Source Data (Raw Extracts) Dataverse.

    Data Files and Processing Methodologies:

    • Google Trends File (Prefix: GT_): Normalized Relative Search Interest (RSI). Input Data: Native monthly RSI values from Google Trends (Jan 2004 - Jan 2025) for the query "benchmarking" + "benchmarking management". Processing: None. Utilizes the original base-100 normalized Google Trends index. Output Metric: Monthly Normalized RSI (Base 100). Frequency: Monthly.
    • Google Books Ngram Viewer File (Prefix: GB_): Normalized Relative Frequency. Input Data: Annual relative frequency values from Google Books Ngram Viewer (1950-2022, English corpus, no smoothing) for the query Benchmarking. Processing: Annual relative frequency series normalized (peak year = 100). Output Metric: Annual Normalized Relative Frequency Index (Base 100). Frequency: Annual.
    • Crossref.org File (Prefix: CR_): Normalized Relative Publication Share Index. Input Data: Absolute monthly publication counts matching Benchmarking-related keywords ["benchmarking" AND (...) - see raw data for full query] in titles/abstracts (1950-2025), alongside total monthly Crossref publications. Deduplicated via DOIs. Processing: Monthly relative share calculated (Benchmarking Count / Total Count); the monthly relative share series is then normalized (peak month's share = 100). Output Metric: Monthly Normalized Relative Publication Share Index (Base 100). Frequency: Monthly.
    • Bain & Co. Survey - Usability File (Prefix: BU_): Normalized Usability Index. Input Data: Original usability percentages (%) from Bain surveys for specific years: Benchmarking (1993, 1996, 1999, 2000, 2002, 2004, 2006, 2008, 2010, 2012, 2014, 2017). Note: Not reported in 2022 survey data. Processing: Original usability percentages normalized relative to their historical peak (Max % = 100). Output Metric: Biennial Estimated Normalized Usability Index (Base 100 relative to historical peak). Frequency: Biennial (approx.).
    • Bain & Co. Survey - Satisfaction File (Prefix: BS_): Standardized Satisfaction Index. Input Data: Original average satisfaction scores (1-5 scale) from Bain surveys for specific years: Benchmarking (1993-2017). Note: Not reported in 2022 survey data. Processing: Standardization (Z-scores) using Z = (X - 3.0) / 0.891609, followed by the index scale transformation Index = 50 + (Z * 22). Output Metric: Biennial Standardized Satisfaction Index (Center = 50, Range ≈ [1, 100]). Frequency: Biennial (approx.).

    File Naming Convention: Files generally follow the pattern PREFIX_Tool_Processed.csv or similar, where the PREFIX indicates the data source (GT_, GB_, CR_, BU_, BS_). Consult the parent Dataverse description (Management Tool Comparative Indices) for general context and the methodological disclaimer. For original extraction details (specific keywords, URLs, etc.), refer to the corresponding Benchmarking dataset in the Raw Extracts Dataverse. Comprehensive project documentation provides full details on all processing steps.
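
    A small computational sketch of the two transformations described above (peak normalization to a base-100 index, and the satisfaction Z-score index); the input numbers are made up for illustration and are not the actual survey or Crossref values:

      # Sketch of the processing steps; all input values below are hypothetical.
      import pandas as pd

      # Peak normalization (used for the GB_, CR_ and BU_ files): peak value = 100
      usability_pct = pd.Series({1993: 70.0, 1996: 82.0, 1999: 76.0, 2002: 65.0})
      usability_index = usability_pct / usability_pct.max() * 100

      # Standardized satisfaction index (BS_ files): Z = (X - 3.0) / 0.891609, Index = 50 + Z * 22
      satisfaction = pd.Series({1993: 3.8, 1996: 3.9, 1999: 3.7, 2002: 3.6})  # 1-5 scale
      z = (satisfaction - 3.0) / 0.891609
      satisfaction_index = 50 + z * 22   # centre 50, range approximately [1, 100]

      print(usability_index.round(1))
      print(satisfaction_index.round(1))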

  10. Danish Similarity Data Set

    • sprogteknologi.dk
    Updated Sep 5, 2024
    Cite
    Centre for Language Technology, NorS, University of Copenhagen (2024). Danish Similarity Data Set [Dataset]. https://sprogteknologi.dk/dataset/danish-similarity-data-set
    Explore at:
    Available download formats: CSV (http://publications.europa.eu/resource/authority/file-type/csv)
    Dataset updated
    Sep 5, 2024
    Dataset provided by
    Center for Sprogteknologi
    Authors
    Centre for Language Technology, NorS, University of Copenhagen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Denmark
    Description

    The Danish similarity dataset is a gold standard resource for evaluation of Danish word embedding models. The dataset consists of 99 word pairs rated by 38 human judges according to their semantic similarity, i.e. the extent to which the two words are similar in meaning, in a normalized 0-1 range. Note that this dataset provides a way of measuring similarity rather than relatedness/association. Description of files included in this material (in both files, rows correspond to items, i.e. word pairs, and columns to properties of each item):

    All_sims_da.csv: Contains the non-normalized mean similarity scores over all judges, along with the non-normalized scores given by each of the 38 judges on the scale 0-6, where 0 is given to the most dissimilar items and 6 to the most similar items.

    Gold_sims_da.csv: Contains the similarity gold standard for each item, which is the normalized mean similarity score for a given item over all judges. Scores are normalized to a 0-1 range, where 0 denotes the minimum degree of similarity and 1 denotes the maximum degree of similarity.
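
    One plausible way to derive gold scores from the per-judge ratings, assuming placeholder names for the 38 judge columns and a simple rescaling by the maximum rating of 6 (the published Gold_sims_da.csv remains the authoritative gold standard):

      # Sketch only: averaging the 0-6 judge ratings and rescaling to the 0-1 range.
      # Column names ("judge_1" ... "judge_38") are placeholders for the per-judge columns.
      import pandas as pd

      sims = pd.read_csv("All_sims_da.csv")
      judge_cols = [c for c in sims.columns if c.startswith("judge_")]

      mean_score = sims[judge_cols].mean(axis=1)   # non-normalized mean on the 0-6 scale
      gold_score = mean_score / 6.0                # one plausible mapping to the 0-1 range
      print(gold_score.head())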

  11. Data from: Raw (not normalized) X-ray fluorescence (XRF) scanning data of IODP Site U1389, 0-125 mcd

    • doi.pangaea.de
    html, tsv
    Updated Mar 25, 2014
    Cite
    Patrick Grunert; Ursula Röhl; Antje H L Voelker; Carlota Escutia; Carlos A Alvarez Zarikian; David A Hodell; André Bahr; Francisco Jose Jiménez-Espejo; Nada Kolasinac; Francisco Javier Hernandéz-Molina; Dorrik A V Stow (2014). Raw (not normalized) X-ray fluorescence (XRF) scanning data of IODP Site U1389, 0-125 mcd [Dataset]. http://doi.org/10.1594/PANGAEA.831047
    Explore at:
    Available download formats: html, tsv
    Dataset updated
    Mar 25, 2014
    Dataset provided by
    PANGAEA
    Authors
    Patrick Grunert; Ursula Röhl; Antje H L Voelker; Carlota Escutia; Carlos A Alvarez Zarikian; David A Hodell; André Bahr; Francisco Jose Jiménez-Espejo; Nada Kolasinac; Francisco Javier Hernandéz-Molina; Dorrik A V Stow
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Area covered
    Variables measured
    Sample ID, Depth, composite, Sample code/label, DEPTH, sediment/rock, Iron, area, total counts, Lead, area, total counts, Zinc, area, total counts, Barium, area, total counts, Copper, area, total counts, Nickel, area, total counts, and 19 more
    Description

    This dataset is about: Raw (not normalized) X-ray fluorescence (XRF) scanning data of IODP Site U1389, 0-125 mcd. Please consult parent dataset @ https://doi.org/10.1594/PANGAEA.831133 for more information.

  12. Leading Indicators OECD: Component Series: Production: Non-Metallic: Normalised for Indonesia

    • fred.stlouisfed.org
    json
    Updated Aug 10, 2023
    + more versions
    Cite
    (2023). Leading Indicators OECD: Component Series: Production: Non-Metallic: Normalised for Indonesia [Dataset]. https://fred.stlouisfed.org/series/IDNLOCOPMNOSTSAM
    Explore at:
    Available download formats: json
    Dataset updated
    Aug 10, 2023
    License

    https://fred.stlouisfed.org/legal/#copyright-citation-required

    Area covered
    Indonesia
    Description

    Graph and download economic data for Leading Indicators OECD: Component Series: Production: Non-Metallic: Normalised for Indonesia (IDNLOCOPMNOSTSAM) from Jan 1993 to Dec 2019 about leading indicator, Indonesia, and production.

  13. Data For Calculating 1880 to 1975 sea surface temperature

    • data.niaid.nih.gov
    Updated Jan 21, 2020
    Cite
    Williams, Bruce (2020). Data For Calculating 1880 to 1975 sea surface temperature [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_1419641
    Explore at:
    Dataset updated
    Jan 21, 2020
    Dataset provided by
    Retired
    Authors
    Williams, Bruce
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The data used in the programs 8075DegRise_CO2NMDPSIRR_V5.bas (DOI 10.5281/zenodo.1418561) and 8075DegRise_NoNMDP_V1.bas (DOI 10.5281/zenodo.1419629).

    All data is entered one item per line. The order of the entries is:

    First entry – The total number of sea surface temperature entries.

    Second entry – The number of entries for all other data.

    Third entry – The offset from the first entry of a set of data to the year 1880. This is the same for all sets of data except sea surface temperature which is based at 1880 and never changes.

    The nonzeroed non-normalized sea surface temperature anomalies, the number of which is specified by the First entry.

    The nonzeroed non-normalized North Magnetic Dip Pole kilometers moved from the previous year, the number of entries is specified in the Second entry.

    The nonzeroed non-normalized solar irradiation average for this year, the number of entries is specified in the Second entry.

    The nonzeroed non-normalized CO2 ppm average for this year, the number of entries is specified in the Second entry.
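
    A reading sketch for the layout described above, assuming the values are stored one per line in a plain text file (the file name here is a placeholder; the actual data file accompanies the BASIC programs):

      # Sketch of reading the flat layout described above (one value per line).
      def read_blocks(path):
          with open(path) as fh:
              values = [float(line.strip()) for line in fh if line.strip()]

          n_sst = int(values[0])     # first entry: number of sea surface temperature entries
          n_other = int(values[1])   # second entry: number of entries for all other data
          offset = int(values[2])    # third entry: offset from a set's first entry to the year 1880

          pos = 3
          sst = values[pos:pos + n_sst]; pos += n_sst               # SST anomalies (based at 1880)
          dip_pole_km = values[pos:pos + n_other]; pos += n_other   # North Magnetic Dip Pole km moved per year
          solar = values[pos:pos + n_other]; pos += n_other         # yearly solar irradiation averages
          co2 = values[pos:pos + n_other]; pos += n_other           # yearly CO2 ppm averages
          return offset, sst, dip_pole_km, solar, co2

      offset, sst, dip_pole_km, solar, co2 = read_blocks("input_data.txt")
      print(len(sst), len(co2), offset)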

  14. Knowledge Management (Normalized)

    • datasetcatalog.nlm.nih.gov
    • search.dataone.org
    Updated May 6, 2025
    Cite
    Anez, Diomar; Anez, Dimar (2025). Knowledge Management (Normalized) [Dataset]. http://doi.org/10.7910/DVN/BAPIEP
    Explore at:
    Dataset updated
    May 6, 2025
    Authors
    Anez, Diomar; Anez, Dimar
    Description

    This dataset provides processed and normalized/standardized indices for the management tool 'Knowledge Management' (KM), including related concepts like Intellectual Capital Management and Knowledge Transfer. Derived from five distinct raw data sources, these indices are specifically designed for comparative longitudinal analysis, enabling the examination of trends and relationships across different empirical domains (web search, literature, academic publishing, and executive adoption). The data presented here represent transformed versions of the original source data, aimed at achieving metric comparability. Users requiring the unprocessed source data should consult the corresponding KM dataset in the Management Tool Source Data (Raw Extracts) Dataverse.

    Data Files and Processing Methodologies:

    • Google Trends File (Prefix: GT_): Normalized Relative Search Interest (RSI). Input Data: Native monthly RSI values from Google Trends (Jan 2004 - Jan 2025) for the query "knowledge management" + "knowledge management organizational". Processing: None. Utilizes the original base-100 normalized Google Trends index. Output Metric: Monthly Normalized RSI (Base 100). Frequency: Monthly.
    • Google Books Ngram Viewer File (Prefix: GB_): Normalized Relative Frequency. Input Data: Annual relative frequency values from Google Books Ngram Viewer (1950-2022, English corpus, no smoothing) for the query Knowledge Management + Intellectual Capital Management + Knowledge Transfer. Processing: Annual relative frequency series normalized (peak year = 100). Output Metric: Annual Normalized Relative Frequency Index (Base 100). Frequency: Annual.
    • Crossref.org File (Prefix: CR_): Normalized Relative Publication Share Index. Input Data: Absolute monthly publication counts matching KM-related keywords [("knowledge management" OR ...) AND (...) - see raw data for full query] in titles/abstracts (1950-2025), alongside total monthly Crossref publications. Deduplicated via DOIs. Processing: Monthly relative share calculated (KM Count / Total Count); the monthly relative share series is then normalized (peak month's share = 100). Output Metric: Monthly Normalized Relative Publication Share Index (Base 100). Frequency: Monthly.
    • Bain & Co. Survey - Usability File (Prefix: BU_): Normalized Usability Index. Input Data: Original usability percentages (%) from Bain surveys for specific years: Knowledge Management (1999, 2000, 2002, 2004, 2006, 2008, 2010). Note: Not reported after 2010. Processing: Original usability percentages normalized relative to their historical peak (Max % = 100). Output Metric: Biennial Estimated Normalized Usability Index (Base 100 relative to historical peak). Frequency: Biennial (approx.).
    • Bain & Co. Survey - Satisfaction File (Prefix: BS_): Standardized Satisfaction Index. Input Data: Original average satisfaction scores (1-5 scale) from Bain surveys for specific years: Knowledge Management (1999-2010). Note: Not reported after 2010. Processing: Standardization (Z-scores) using Z = (X - 3.0) / 0.891609, followed by the index scale transformation Index = 50 + (Z * 22). Output Metric: Biennial Standardized Satisfaction Index (Center = 50, Range ≈ [1, 100]). Frequency: Biennial (approx.).

    File Naming Convention: Files generally follow the pattern PREFIX_Tool_Processed.csv or similar, where the PREFIX indicates the data source (GT_, GB_, CR_, BU_, BS_). Consult the parent Dataverse description (Management Tool Comparative Indices) for general context and the methodological disclaimer. For original extraction details (specific keywords, URLs, etc.), refer to the corresponding KM dataset in the Raw Extracts Dataverse. Comprehensive project documentation provides full details on all processing steps.

  15. Growth Strategies (Normalized)

    • datasetcatalog.nlm.nih.gov
    • search.dataone.org
    Updated May 6, 2025
    Cite
    Anez, Diomar; Anez, Dimar (2025). Growth Strategies (Normalized) [Dataset]. http://doi.org/10.7910/DVN/OW8GOW
    Explore at:
    Dataset updated
    May 6, 2025
    Authors
    Anez, Diomar; Anez, Dimar
    Description

    This dataset provides processed and normalized/standardized indices for the management tool group focused on 'Growth Strategies'. Derived from five distinct raw data sources, these indices are specifically designed for comparative longitudinal analysis, enabling the examination of trends and relationships across different empirical domains (web search, literature, academic publishing, and executive adoption). The data presented here represent transformed versions of the original source data, aimed at achieving metric comparability. Users requiring the unprocessed source data should consult the corresponding Growth Strategies dataset in the Management Tool Source Data (Raw Extracts) Dataverse.

    Data Files and Processing Methodologies:

    • Google Trends File (Prefix: GT_): Normalized Relative Search Interest (RSI). Input Data: Native monthly RSI values from Google Trends (Jan 2004 - Jan 2025) for the query "growth strategies" + "growth strategy" + "growth strategies business". Processing: None. Utilizes the original base-100 normalized Google Trends index. Output Metric: Monthly Normalized RSI (Base 100). Frequency: Monthly.
    • Google Books Ngram Viewer File (Prefix: GB_): Normalized Relative Frequency. Input Data: Annual relative frequency values from Google Books Ngram Viewer (1950-2022, English corpus, no smoothing) for the query Growth Strategies + Growth Strategy. Processing: Annual relative frequency series normalized (peak year = 100). Output Metric: Annual Normalized Relative Frequency Index (Base 100). Frequency: Annual.
    • Crossref.org File (Prefix: CR_): Normalized Relative Publication Share Index. Input Data: Absolute monthly publication counts matching Growth Strategies-related keywords [("growth strategies" OR ...) AND (...) - see raw data for full query] in titles/abstracts (1950-2025), alongside total monthly Crossref publications. Deduplicated via DOIs. Processing: Monthly relative share calculated (Growth Strat. Count / Total Count); the monthly relative share series is then normalized (peak month's share = 100). Output Metric: Monthly Normalized Relative Publication Share Index (Base 100). Frequency: Monthly.
    • Bain & Co. Survey - Usability File (Prefix: BU_): Normalized Usability Index. Input Data: Original usability percentages (%) from Bain surveys for specific years: Growth Strategies (1996, 1999, 2000, 2002, 2004); Growth Strategy Tools (2006, 2008). Note: Not reported after 2008. Processing: Semantic Grouping: data points for "Growth Strategies" and "Growth Strategy Tools" were treated as a single conceptual series. Normalization: combined series normalized relative to its historical peak (Max % = 100). Output Metric: Biennial Estimated Normalized Usability Index (Base 100 relative to historical peak). Frequency: Biennial (approx.).
    • Bain & Co. Survey - Satisfaction File (Prefix: BS_): Standardized Satisfaction Index. Input Data: Original average satisfaction scores (1-5 scale) from Bain surveys for specific years: Growth Strategies (1996-2004); Growth Strategy Tools (2006, 2008). Note: Not reported after 2008. Processing: Semantic Grouping: data points treated as a single conceptual series. Standardization (Z-scores) using Z = (X - 3.0) / 0.891609, followed by the index scale transformation Index = 50 + (Z * 22). Output Metric: Biennial Standardized Satisfaction Index (Center = 50, Range ≈ [1, 100]). Frequency: Biennial (approx.).

    File Naming Convention: Files generally follow the pattern PREFIX_Tool_Processed.csv or similar, where the PREFIX indicates the data source (GT_, GB_, CR_, BU_, BS_). Consult the parent Dataverse description (Management Tool Comparative Indices) for general context and the methodological disclaimer. For original extraction details (specific keywords, URLs, etc.), refer to the corresponding Growth Strategies dataset in the Raw Extracts Dataverse. Comprehensive project documentation provides full details on all processing steps.

  16. Aarhus University, Danish Centre for Environment and Energy

    • catalogue.arctic-sdi.org
    • pigma.org
    doi, ogc:wms +2
    Updated Feb 21, 2025
    + more versions
    Cite
    National Institute for Marine Research and Development "Grigore Antipa" (2025). Aarhus University,Danish Centre for Environment and Energy [Dataset]. https://catalogue.arctic-sdi.org/geonetwork/srv/api/records/ad373548-8425-47d2-b1bd-e9bee1797df3
    Explore at:
    Available download formats: www:link, www:download, ogc:wms, doi
    Dataset updated
    Feb 21, 2025
    Dataset provided by
    Institute of Oceanology (http://www.io-bas.bg/)
    United Nations Environment Programme (http://www.unep.org/)
    International Council for the Exploration of the Sea
    Hellenic Centre for Marine Research (https://www.hcmr.gr/en/)
    National Institute of Marine Geology and Geoecology
    Archipelagos Institute of Marine Conservation
    Plastic Change
    Black Sea NGO Network
    The North Sea Foundation
    The Environment Agency of Iceland
    Institute of Marine Biology (IMBK)
    Surfers Against Sewage
    Ukrainian scientific center of Ecology of Sea
    State Oceanographic Institute
    Flanders Marine Institute
    Aegean Greeners
    Isotech Ltd Environmental Research and Consultancy
    Portuguese Environment Agency
    Norwegian Environment Agency
    Hold Danmark Rent
    Department of Fisheries and Marine Research, Division of Marine Biology and Ecology
    Disciplinary Centre of Marine Research and Environmental
    ECAT-Environmental Center for Administration & Technology
    Ifremer, VIGIES (Information Valuation Service for Integrated Management and Monitoring)
    Venice Lagoon Plastic Free
    Iv.Javakhishvili Tbilisi State University, Centre of Relations with UNESCO Oceanological Research Centre and GeoDNA (UNESCO)
    IFREMER, SISMER, Scientific Information Systems for the SEA
    Treanbeg Marine
    National Institute of Oceanography and Applied Geophysics - OGS, Division of Oceanography
    Turkish Marine Research Foundation
    Directorate for Coast and Sea Sustainability. Ministry for Ecological Transition
    Portuguese Association for Marine Litter, APLM
    Estonian Green Movement
    Non-governmental environmental organization "Mare Nostrum"
    Centre for Documentation, Research and Experimentation on Accidental Water Pollution
    Institute for Water of the Republic of Slovenia
    University of Maribor
    Asociación Vertidos Cero
    Legambiente
    Mohamed I University
    Mediterranean Information Office for Environment, Culture and Sustainable Development
    National Institute for Marine Research and Development "Grigore Antipa"
    Time period covered
    Jan 1, 2001 - May 11, 2024
    Area covered
    Description

    This visualization product displays the abundance of cigarette-related items among marine macro-litter (> 2.5 cm) per beach per year from non-MSFD monitoring surveys, research & cleaning operations, excluding UNEP-MARLIN data.

    EMODnet Chemistry included the collection of marine litter in its 3rd phase. Since the beginning of 2018, data of beach litter have been gathered and processed in the EMODnet Chemistry Marine Litter Database (MLDB).

    The harmonization of all the data has been the most challenging task considering the heterogeneity of the data sources, sampling protocols and reference lists used on a European scale.

    Preliminary processings were necessary to harmonize all the data:

    • Exclusion of OSPAR 1000 protocol: in order to follow the approach of OSPAR that it is not including these data anymore in the monitoring;

    • Selection of surveys from non-MSFD monitoring, cleaning and research operations;

    • Exclusion of beaches without coordinates;

    • Selection of cigarette related items only. The list of selected items is attached to this metadata. This list was created using EU Marine Beach Litter Baselines, the European Threshold Value for Macro Litter on Coastlines and the Joint list of litter categories for marine macro-litter monitoring from JRC (these three documents are attached to this metadata);

    • Exclusion of surveys without associated length;

    • Exclusion of surveys referring to the UNEP-MARLIN list: the UNEP-MARLIN protocol differs from the other types of monitoring in that cigarette butts are surveyed in a 10m square. To avoid comparing abundances from very different protocols, the choice has been made to distinguish in two maps the cigarette related items results associated with the UNEP-MARLIN list from the others;

    • Normalization of survey lengths to 100m & 1 survey / year: in some case, the survey length was not 100m, so in order to be able to compare the abundance of litter from different beaches a normalization is applied using this formula:

    Number of cigarette related items of the survey (normalized by 100 m) = Number of cigarette related items of the survey x (100 / survey length)

    Then, this normalized number of cigarette related items is summed to obtain the total normalized number of cigarette related items for each survey. Finally, the median abundance of cigarette related items for each beach and year is calculated from these normalized abundances of cigarette related items per survey.
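
    A sketch of the normalization and aggregation steps just described, assuming a survey table with hypothetical column names (beach, year, survey id, cigarette-item count, and survey length in metres); the real processing is done within the EMODnet Marine Litter Database:

      # Sketch only: length normalization to 100 m and the per-beach/per-year median.
      import pandas as pd

      surveys = pd.read_csv("cigarette_items_surveys.csv")   # hypothetical extract

      # Normalize each survey's cigarette-item count to a 100 m reference length
      surveys["items_per_100m"] = surveys["item_count"] * (100.0 / surveys["survey_length_m"])

      # Total normalized count per survey, then the median abundance per beach and year
      per_survey = (surveys.groupby(["beach", "year", "survey_id"])["items_per_100m"]
                            .sum()
                            .reset_index())
      median_per_beach_year = per_survey.groupby(["beach", "year"])["items_per_100m"].median()
      print(median_per_beach_year.head())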

    Percentiles 50, 75, 95 & 99 have been calculated taking into account cigarette related items from other sources data (excluding UNEP-MARLIN protocol) for all years.

    More information is available in the attached documents.

    Warning: the absence of data on the map does not necessarily mean that they do not exist, but that no information has been entered in the Marine Litter Database for this area.

  17. Covariance matrix $p_{T}^{\ell, non-Z}$, parton, normalized

    • hepdata.net
    Updated May 16, 2024
    + more versions
    Cite
    (2024). Covariance matrix $p_{T}^{\ell, non-Z}$, parton, normalized [Dataset]. http://doi.org/10.17182/hepdata.146693.v1/t372
    Explore at:
    Dataset updated
    May 16, 2024
    Description

    Covariance matrix for normalized cross section as a function of $p_{\text{T}}^{\ell, non-Z}$ at parton-level.

  18. UFO Reports Dataset(80,000+)

    • kaggle.com
    zip
    Updated Aug 10, 2024
    Cite
    Shresth Agrawal (2024). UFO Reports Dataset(80,000+) [Dataset]. https://www.kaggle.com/datasets/shresthagrawal7/ufo-reports-dataset80000
    Explore at:
    Available download formats: zip (10712714 bytes)
    Dataset updated
    Aug 10, 2024
    Authors
    Shresth Agrawal
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    The Data Set

    File: ufo-complete-geocoded-time-normalized.csv

    The complete original data set, containing both resolved and unresolved locations and both converted and unconverted normalized times (in seconds). 88874 total records; 724 locations not found or blank (0.8146%); 7131 erroneous or blank times (8.0237%).

    File: ufo-scrubbed-geocode-time-normalized.csv

    The scrubbed data set, containing only non-zero resolved locations and normalized times > 0. 81185 total records; 0 locations not found; 0 erroneous or blank time records.

  19. Naturalistic Neuroimaging Database

    • openneuro.org
    Updated Apr 20, 2021
    + more versions
    Cite
    Sarah Aliko; Jiawen Huang; Florin Gheorghiu; Stefanie Meliss; Jeremy I Skipper (2021). Naturalistic Neuroimaging Database [Dataset]. http://doi.org/10.18112/openneuro.ds002837.v1.1.3
    Explore at:
    Dataset updated
    Apr 20, 2021
    Dataset provided by
    OpenNeuro (https://openneuro.org/)
    Authors
    Sarah Aliko; Jiawen Huang; Florin Gheorghiu; Stefanie Meliss; Jeremy I Skipper
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Overview

    • The Naturalistic Neuroimaging Database (NNDb v2.0) contains datasets from 86 human participants doing the NIH Toolbox and then watching one of 10 full-length movies during functional magnetic resonance imaging (fMRI). The participants were all right-handed, native English speakers, with no history of neurological/psychiatric illnesses, with no hearing impairments, unimpaired or corrected vision and taking no medication. Each movie was stopped in 40-50 minute intervals or when participants asked for a break, resulting in 2-6 runs of BOLD-fMRI. A 10-minute high-resolution defaced T1-weighted anatomical MRI scan (MPRAGE) is also provided.
    • The NNDb V2.0 is now on Neuroscout, a platform for fast and flexible re-analysis of (naturalistic) fMRI studies. See: https://neuroscout.org/

    v2.0 Changes

    • Overview
      • We have replaced our own preprocessing pipeline with that implemented in AFNI’s afni_proc.py, thus changing only the derivative files. This introduces a fix for an issue with our normalization (i.e., scaling) step and modernizes and standardizes the preprocessing applied to the NNDb derivative files. We have done a bit of testing and have found that results in both pipelines are quite similar in terms of the resulting spatial patterns of activity but with the benefit that the afni_proc.py results are 'cleaner' and statistically more robust.
    • Normalization

      • Emily Finn and Clare Grall at Dartmouth and Rick Reynolds and Paul Taylor at AFNI, discovered and showed us that the normalization procedure we used for the derivative files was less than ideal for timeseries runs of varying lengths. Specifically, the 3dDetrend flag -normalize makes 'the sum-of-squares equal to 1'. We had not thought through that an implication of this is that the resulting normalized timeseries amplitudes will be affected by run length, increasing as run length decreases (and maybe this should go in 3dDetrend’s help text). To demonstrate this, I wrote a version of 3dDetrend’s -normalize for R so you can see for yourselves by running the following code:
      # Generate a resting state (rs) timeseries (ts)
      # Install / load package to make fake fMRI ts
      # install.packages("neuRosim")
      library(neuRosim)
      # Generate a ts
      ts.rs <- simTSrestingstate(nscan=2000, TR=1, SNR=1)
      # 3dDetrend -normalize
      # R command version for 3dDetrend -normalize -polort 0 which normalizes by making "the sum-of-squares equal to 1"
      # Do for the full timeseries
      ts.normalised.long <- (ts.rs-mean(ts.rs))/sqrt(sum((ts.rs-mean(ts.rs))^2));
      # Do this again for a shorter version of the same timeseries
      ts.shorter.length <- length(ts.normalised.long)/4
      ts.normalised.short <- (ts.rs[1:ts.shorter.length]- mean(ts.rs[1:ts.shorter.length]))/sqrt(sum((ts.rs[1:ts.shorter.length]- mean(ts.rs[1:ts.shorter.length]))^2));
      # By looking at the summaries, it can be seen that the median values become  larger
      summary(ts.normalised.long)
      summary(ts.normalised.short)
      # Plot results for the long and short ts
      # Truncate the longer ts for plotting only
      ts.normalised.long.made.shorter <- ts.normalised.long[1:ts.shorter.length]
      # Give the plot a title
      title <- "3dDetrend -normalize for long (blue) and short (red) timeseries";
      plot(x=0, y=0, main=title, xlab="", ylab="", xaxs='i', xlim=c(1,length(ts.normalised.short)), ylim=c(min(ts.normalised.short),max(ts.normalised.short)));
      # Add zero line
      lines(x=c(-1,ts.shorter.length), y=rep(0,2), col='grey');
      # 3dDetrend -normalize -polort 0 for long timeseries
      lines(ts.normalised.long.made.shorter, col='blue');
      # 3dDetrend -normalize -polort 0 for short timeseries
      lines(ts.normalised.short, col='red');
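
      To quantify the effect, here is a short optional check (an illustrative addition to the demonstration above, assuming the objects created in the code just shown): because each normalized series has a sum-of-squares of 1, its standard deviation is roughly 1/sqrt(N), so the quarter-length series should show amplitudes about twice as large.
      # Optional check: each normalized series has sum-of-squares 1, so its
      # standard deviation is about 1/sqrt(N); quartering the run length
      # therefore roughly doubles the amplitude scale
      sd(ts.normalised.long)                             # ~ 1/sqrt(2000) ~ 0.022
      sd(ts.normalised.short)                            # ~ 1/sqrt(500)  ~ 0.045
      sd(ts.normalised.short) / sd(ts.normalised.long)   # ~ 2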
      
    • Standardization/modernization

      • The above individuals also encouraged us to implement the afni_proc.py script over our own pipeline. It introduces at least three additional improvements. First, we now use AFNI’s @SSwarper to align our anatomical files with an MNI template (now MNI152_2009_template_SSW.nii.gz), and this, in turn, integrates nicely into the afni_proc.py pipeline. This seems to result in generally better or more consistent alignment, though this is only a qualitative observation. Second, all the transformations / interpolations and detrending are now done in fewer steps compared to our pipeline. This is preferable because, e.g., there is less chance of inadvertently reintroducing noise back into the timeseries (see Lindquist, Geuter, Wager, & Caffo 2019). Finally, many groups are advocating using tools like fMRIPrep or afni_proc.py to increase standardization of analysis practices in our neuroimaging community. This presumably results in less error, less heterogeneity and more interpretability of results across studies. Along these lines, the quality control (‘QC’) html pages generated by afni_proc.py are a real help in assessing data quality and almost a joy to use.
    • New afni_proc.py command line

      • The following is the afni_proc.py command line that we used to generate the blurred and censored timeseries files. The afni_proc.py tool comes with extensive help and examples, so you can quickly understand our preprocessing decisions by scrutinising the command below. Specifically, it is most similar to Example 11 for ‘Resting state analysis’ in the help file (see https://afni.nimh.nih.gov/pub/dist/doc/program_help/afni_proc.py.html):

      afni_proc.py \
          -subj_id "$sub_id_name_1" \
          -blocks despike tshift align tlrc volreg mask blur scale regress \
          -radial_correlate_blocks tcat volreg \
          -copy_anat anatomical_warped/anatSS.1.nii.gz \
          -anat_has_skull no \
          -anat_follower anat_w_skull anat anatomical_warped/anatU.1.nii.gz \
          -anat_follower_ROI aaseg anat freesurfer/SUMA/aparc.a2009s+aseg.nii.gz \
          -anat_follower_ROI aeseg epi freesurfer/SUMA/aparc.a2009s+aseg.nii.gz \
          -anat_follower_ROI fsvent epi freesurfer/SUMA/fs_ap_latvent.nii.gz \
          -anat_follower_ROI fswm epi freesurfer/SUMA/fs_ap_wm.nii.gz \
          -anat_follower_ROI fsgm epi freesurfer/SUMA/fs_ap_gm.nii.gz \
          -anat_follower_erode fsvent fswm \
          -dsets media_?.nii.gz \
          -tcat_remove_first_trs 8 \
          -tshift_opts_ts -tpattern alt+z2 \
          -align_opts_aea -cost lpc+ZZ -giant_move -check_flip \
          -tlrc_base "$basedset" \
          -tlrc_NL_warp \
          -tlrc_NL_warped_dsets \
              anatomical_warped/anatQQ.1.nii.gz \
              anatomical_warped/anatQQ.1.aff12.1D \
              anatomical_warped/anatQQ.1_WARP.nii.gz \
          -volreg_align_to MIN_OUTLIER \
          -volreg_post_vr_allin yes \
          -volreg_pvra_base_index MIN_OUTLIER \
          -volreg_align_e2a \
          -volreg_tlrc_warp \
          -mask_opts_automask -clfrac 0.10 \
          -mask_epi_anat yes \
          -blur_to_fwhm -blur_size $blur \
          -regress_motion_per_run \
          -regress_ROI_PC fsvent 3 \
          -regress_ROI_PC_per_run fsvent \
          -regress_make_corr_vols aeseg fsvent \
          -regress_anaticor_fast \
          -regress_anaticor_label fswm \
          -regress_censor_motion 0.3 \
          -regress_censor_outliers 0.1 \
          -regress_apply_mot_types demean deriv \
          -regress_est_blur_epits \
          -regress_est_blur_errts \
          -regress_run_clustsim no \
          -regress_polort 2 \
          -regress_bandpass 0.01 1 \
          -html_review_style pythonic

      We used similar command lines to generate the ‘blurred and not censored’ and the ‘not blurred and not censored’ timeseries files (described more fully below). We will make the code used to generate all derivative files available on our github site (https://github.com/lab-lab/nndb).

      We made one choice above that is different enough from our original pipeline that it is worth mentioning here. Specifically, our runs are quite long (~40 minutes on average) but variable in length (which is what led to the above issue with 3dDetrend’s -normalize). A discussion on the AFNI message board with one of our team (starting here: https://afni.nimh.nih.gov/afni/community/board/read.php?1,165243,165256#msg-165256) led to the suggestion that '-regress_polort 2' with '-regress_bandpass 0.01 1' be used for long runs. We had previously used only a variable polort with the suggested 1 + int(D/150) approach; the difference between the two is sketched below. Our new polort 2 + bandpass approach has the added benefit of working well with afni_proc.py.
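
      To make that difference concrete, here is a minimal R sketch (illustrative only, not taken from our processing scripts; treating D as the run duration in seconds follows AFNI's usual convention) comparing how many polynomial detrending terms each approach implies for a 40-minute run:
      # Old approach: run-length-dependent polynomial order, 1 + int(D/150)
      run.duration.secs <- 40 * 60                       # a 40-minute run
      polort.old <- 1 + floor(run.duration.secs / 150)   # = 17 polynomial terms
      # New approach: fixed low-order polynomial plus a band-pass
      # ('-regress_polort 2 -regress_bandpass 0.01 1'); the slow drifts that the
      # high-order polynomial used to absorb are removed by the 0.01 Hz
      # high-pass edge instead
      polort.new <- 2
      c(old = polort.old, new = polort.new)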

      Which timeseries file you use is up to you, but I have been encouraged by Rick and Paul to include a sort of PSA about this. In Paul’s own words:
        • Blurred data should not be used for ROI-based analyses (and potentially not for ICA? I am not certain about standard practice).
        • Unblurred data for ISC might be pretty noisy for voxelwise analyses, since blurring should effectively boost the SNR of active regions (and even good alignment won't be perfect everywhere).
        • For uncensored data, one should be concerned about motion effects being left in the data (e.g., spikes in the data).
        • For censored data:
          • Performing ISC requires users to unionize the censoring patterns during the correlation calculation (see the sketch after this list).
          • If wanting to calculate power spectra or spectral parameters like ALFF/fALFF/RSFA etc. (which some people might still do for naturalistic tasks), then standard FT-based methods can't be used because sampling is no longer uniform. Instead, people could use something like 3dLombScargle+3dAmpToRSFC, which calculates power spectra (and RSFC params) based on a generalization of the FT that can handle non-uniform sampling, as long as the censoring pattern is mostly random and, say, only up to about 10-15% of the data.
      In sum, think very carefully about which files you use. If you find you need a file we have not provided, we can happily generate different versions of the timeseries upon request and can generally do so in a week or less.
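
      As a purely illustrative example of what unionizing the censoring patterns means in practice (this is not code we distribute; the toy vectors below are made up), the following R sketch correlates two subjects' timecourses from a single voxel or ROI using only the TRs that survive censoring in both subjects:
      # Toy censor vectors in AFNI's convention: 1 = TR kept, 0 = TR censored
      set.seed(1)
      n.tr <- 10
      ts.subj1 <- rnorm(n.tr)
      ts.subj2 <- rnorm(n.tr)
      censor.subj1 <- c(1, 1, 0, 1, 1, 1, 1, 0, 1, 1)
      censor.subj2 <- c(1, 1, 1, 1, 0, 1, 1, 1, 1, 1)
      # Union of the censoring: a TR is dropped if it is censored for either
      # subject, i.e. kept only when it is kept for both
      keep <- censor.subj1 == 1 & censor.subj2 == 1
      # Inter-subject correlation computed over the mutually surviving TRs only
      cor(ts.subj1[keep], ts.subj2[keep])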

    • Effect on results

      • From numerous tests on our own analyses, we have qualitatively found that results using our old vs the new afni_proc.py preprocessing pipeline do not change all that much in terms of general spatial patterns. There is, however, an
  20. H

    GC/MS Simulated Data Sets normalized using quantile normalization

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Jan 25, 2017
    + more versions
    Cite
    Denise Scholtens (2017). GC/MS Simulated Data Sets normalized using quantile normalization [Dataset]. http://doi.org/10.7910/DVN/R3P9SS
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 25, 2017
    Dataset provided by
    Harvard Dataverse
    Authors
    Denise Scholtens
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    1000 simulated data sets, stored as a list of R data frames, used in support of Reisetter et al. (submitted), 'Mixture model normalization for non-targeted gas chromatography / mass spectrometry metabolomics data'. These are the results after normalization with quantile normalization (Bolstad et al. 2003).
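
    For readers unfamiliar with the method, the following minimal R sketch (illustrative only; it uses a small made-up matrix rather than these simulated data sets) shows the core idea of quantile normalization as described by Bolstad et al. (2003): each sample's values are replaced by the mean of the values sharing the same rank across samples, so all samples end up with the same distribution.
    # Toy data: rows = features (e.g., metabolites), columns = samples
    set.seed(42)
    x <- matrix(rexp(20), nrow = 5, ncol = 4,
                dimnames = list(paste0("feature", 1:5), paste0("sample", 1:4)))

    quantile_normalize <- function(mat) {
      # 1. Sort each column (sample) independently
      sorted <- apply(mat, 2, sort)
      # 2. Average across samples at each rank to form the reference distribution
      ref <- rowMeans(sorted)
      # 3. Replace each value by the reference value at its within-sample rank
      #    (ties.method = "first" keeps the indices integer; full implementations
      #    such as Bolstad et al.'s average over ties)
      out <- apply(mat, 2, function(col) ref[rank(col, ties.method = "first")])
      dimnames(out) <- dimnames(mat)
      out
    }

    x.qn <- quantile_normalize(x)
    # After normalization every column contains the same set of values
    apply(x.qn, 2, sort)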
