94 datasets found
  1. Binary classification using a confusion matrix.

    • plos.figshare.com
    xls
    Updated Dec 6, 2024
    Cite
    Chantha Wongoutong (2024). Binary classification using a confusion matrix. [Dataset]. http://doi.org/10.1371/journal.pone.0310839.t002
    Explore at:
    xls (available download formats)
    Dataset updated
    Dec 6, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Chantha Wongoutong
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Despite the popularity of k-means clustering, feature scaling before applying it can be an essential yet often neglected step. In this study, feature scaling via five methods (Z-score standardization, Min-Max normalization, Percentile transformation, Maximum absolute scaling, and RobustScaler) was compared with using the raw (i.e., non-scaled) data when analyzing datasets whose features have different or the same units via k-means clustering. The results of an experimental study show that, for features with different units, scaling them before k-means clustering provided better accuracy, precision, recall, and F-score values than using the raw data. Meanwhile, when the features in the dataset had the same unit, scaling them beforehand provided results similar to using the raw data. Thus, scaling the features beforehand is a very important step for datasets with different units, as it improves the clustering results and accuracy. Of the five feature-scaling methods applied to the dataset with different units, Z-score standardization and Percentile transformation performed similarly and outperformed the other methods and the raw data. Although Maximum absolute scaling performed slightly better than the other scaling methods and the raw data when the dataset contained features with the same unit, the improvement was not significant.
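    A minimal scikit-learn sketch (not the paper's exact protocol) of the comparison this dataset supports: cluster the same two-feature data with and without scaling and compare agreement with the true labels. The synthetic data and scaler choices below are illustrative assumptions.

    # Minimal sketch: k-means on raw vs. scaled features when units differ widely.
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import adjusted_rand_score
    from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

    # Synthetic stand-in for a dataset whose features use different units.
    X, y_true = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)
    X[:, 1] *= 1000.0  # inflate one feature so it dominates Euclidean distance

    for name, scaler in [("raw", None), ("z-score", StandardScaler()),
                         ("min-max", MinMaxScaler()), ("robust", RobustScaler())]:
        Xs = X if scaler is None else scaler.fit_transform(X)
        labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Xs)
        print(f"{name:>8}: adjusted Rand index = {adjusted_rand_score(y_true, labels):.3f}")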

  2. GC/MS Simulated Data Sets normalized using median scaling

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    + more versions
    Cite
    Scholtens, Denise (2023). GC/MS Simulated Data Sets normalized using median scaling [Dataset]. http://doi.org/10.7910/DVN/OYOLXD
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Scholtens, Denise
    Description

    1000 simulated data sets stored in a list of R dataframes used in support of Reisetter et al. (submitted) 'Mixture model normalization for non-targeted gas chromatography / mass spectrometry metabolomics data'. These are results after normalization using median scaling as described in Reisetter et al.
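    The exact normalization is defined in Reisetter et al.; as a rough illustration only, one common form of median scaling rescales each sample so that all samples share a common median intensity:

    # Rough illustration of median scaling (assumed form; see Reisetter et al. for
    # the procedure actually used): rescale each sample (column) to a common median.
    import numpy as np

    def median_scale(X):
        """X: features x samples matrix of positive intensities."""
        sample_medians = np.median(X, axis=0)      # one median per sample
        target = np.median(sample_medians)         # common reference level
        return X * (target / sample_medians)       # rescale each column

    rng = np.random.default_rng(0)
    X = rng.lognormal(mean=5.0, sigma=1.0, size=(200, 10))
    print(np.median(median_scale(X), axis=0))      # all columns now share one median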

  3. The performance results for k-means clustering and testing the hypothesis...

    • plos.figshare.com
    xls
    Updated Dec 6, 2024
    + more versions
    Cite
    Chantha Wongoutong (2024). The performance results for k-means clustering and testing the hypothesis for homogeneity between the true grouped data and feature scaling on datasets containing features with different units. [Dataset]. http://doi.org/10.1371/journal.pone.0310839.t003
    Explore at:
    xls (available download formats)
    Dataset updated
    Dec 6, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Chantha Wongoutong
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The performance results for k-means clustering and testing the hypothesis for homogeneity between the true grouped data and feature scaling on datasets containing features with different units.

  4. MFCCs Feature Scaling Images for Multi-class Human Action Analysis : A...

    • researchdata.edu.au
    • data.mendeley.com
    Updated 2023
    Cite
    Naveed Akhtar; Syed Mohammed Shamsul Islam; Douglas Chai; Muhammad Bilal Shaikh; Computer Science and Software Engineering (2023). MFCCs Feature Scaling Images for Multi-class Human Action Analysis : A Benchmark Dataset [Dataset]. http://doi.org/10.17632/6D8V9JMVGM.1
    Explore at:
    Dataset updated
    2023
    Dataset provided by
    Mendeley Ltd.
    The University of Western Australia
    Authors
    Naveed Akhtar; Syed Mohammed Shamsul Islam; Douglas Chai; Muhammad Bilal Shaikh; Computer Science and Software Engineering
    Description

    This dataset comprises an array of Mel Frequency Cepstral Coefficients (MFCCs) that have undergone feature scaling, representing a variety of human actions. Feature scaling, or data normalization, is a preprocessing technique used to standardize the range of features in the dataset. For MFCCs, this process helps ensure all coefficients contribute equally to the learning process, preventing features with larger scales from overshadowing those with smaller scales.

    In this dataset, the audio signals correspond to diverse human actions such as walking, running, jumping, and dancing. The MFCCs are calculated via a series of signal processing stages, which capture key characteristics of the audio signal in a manner that closely aligns with human auditory perception. The coefficients are then standardized or scaled using methods such as MinMax Scaling or Standardization, thereby normalizing their range. Each normalized MFCC vector corresponds to a segment of the audio signal.

    The dataset is meticulously designed for tasks including human action recognition, classification, segmentation, and detection based on auditory cues. It serves as an essential resource for training and evaluating machine learning models focused on interpreting human actions from audio signals. This dataset proves particularly beneficial for researchers and practitioners in fields such as signal processing, computer vision, and machine learning, who aim to craft algorithms for human action analysis leveraging audio signals.
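    A short illustrative sketch (not the dataset's exact pipeline) of computing MFCCs for one clip and min-max scaling each coefficient; "clip.wav" is a hypothetical file name.

    # Illustrative only: compute MFCCs and min-max scale each coefficient to [0, 1].
    import librosa
    from sklearn.preprocessing import MinMaxScaler

    y, sr = librosa.load("clip.wav", sr=None)            # hypothetical audio file
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape: (13, n_frames)

    # Scale each coefficient (row) across time so all coefficients share one range.
    scaled = MinMaxScaler().fit_transform(mfcc.T).T      # still (13, n_frames)
    print(scaled.min(), scaled.max())                    # -> 0.0 1.0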

  5. WikiMed and PubMedDS: Two large-scale datasets for medical concept...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Dec 4, 2021
    Cite
    Shikhar Vashishth; Shikhar Vashishth; Denis Newman-Griffis; Denis Newman-Griffis; Rishabh Joshi; Ritam Dutt; Carolyn P Rosé; Rishabh Joshi; Ritam Dutt; Carolyn P Rosé (2021). WikiMed and PubMedDS: Two large-scale datasets for medical concept extraction and normalization research [Dataset]. http://doi.org/10.5281/zenodo.5753476
    Explore at:
    zip (available download formats)
    Dataset updated
    Dec 4, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Shikhar Vashishth; Shikhar Vashishth; Denis Newman-Griffis; Denis Newman-Griffis; Rishabh Joshi; Ritam Dutt; Carolyn P Rosé; Rishabh Joshi; Ritam Dutt; Carolyn P Rosé
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Two large-scale, automatically-created datasets of medical concept mentions, linked to the Unified Medical Language System (UMLS).

    WikiMed

    Derived from Wikipedia data. Mappings of Wikipedia page identifiers to UMLS Concept Unique Identifiers (CUIs) were extracted by crosswalking Wikipedia, Wikidata, Freebase, and the NCBI Taxonomy to reach existing mappings to UMLS CUIs. This created a 1:1 mapping of approximately 60,500 Wikipedia pages to UMLS CUIs. Links to these pages were then extracted as mentions of the corresponding UMLS CUIs.

    WikiMed contains:

    • 393,618 Wikipedia page texts
    • 1,067,083 mentions of medical concepts
    • 57,739 unique UMLS CUIs

    Manual evaluation of 100 random samples of WikiMed found 91% accuracy in the automatic annotations at the level of UMLS CUIs, and 95% accuracy in terms of semantic type.

    PubMedDS

    Derived from biomedical literature abstracts from PubMed. Mentions were automatically identified using distant supervision based on Medical Subject Heading (MeSH) headers assigned to the papers in PubMed, and recognition of medical concept mentions using the high-performance scispaCy model. MeSH header codes are included as well as their mappings to UMLS CUIs.

    PubMedDS contains:

    • 13,197,430 abstract texts
    • 57,943,354 medical concept mentions
    • 44,881 unique UMLS CUIs

    Comparison with existing manually-annotated datasets (NCBI Disease Corpus, BioCDR, and MedMentions) found 75-90% precision in automatic annotations. Please note this dataset is not a comprehensive annotation of medical concept mentions in these abstracts (only mentions located through distant supervision from MeSH headers were included), but is intended as data for concept normalization research.

    Due to its size, PubMedDS is distributed as 30 individual files of approximately 1.5 million mentions each.

    Data format

    Both datasets use JSON format with one document per line. Each document has the following structure:

    {
      "_id": "A unique identifier of each document",
      "text": "Contains text over which mentions are ",
      "title": "Title of Wikipedia/PubMed Article",
      "split": "[Not in PubMedDS] Dataset split: 

  6. Business Process Reengineering (Normalized)

    • dataverse.harvard.edu
    Updated May 6, 2025
    Cite
    Diomar Anez; Dimar Anez (2025). Business Process Reengineering (Normalized) [Dataset]. http://doi.org/10.7910/DVN/QBP0E9
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 6, 2025
    Dataset provided by
    Harvard Dataverse
    Authors
    Diomar Anez; Dimar Anez
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset provides processed and normalized/standardized indices for the management tool 'Business Process Reengineering' (BPR). Derived from five distinct raw data sources, these indices are specifically designed for comparative longitudinal analysis, enabling the examination of trends and relationships across different empirical domains (web search, literature, academic publishing, and executive adoption). The data presented here represent transformed versions of the original source data, aimed at achieving metric comparability. Users requiring the unprocessed source data should consult the corresponding BPR dataset in the Management Tool Source Data (Raw Extracts) Dataverse. Data Files and Processing Methodologies: Google Trends File (Prefix: GT_): Normalized Relative Search Interest (RSI) Input Data: Native monthly RSI values from Google Trends (Jan 2004 - Jan 2025) for the query "business process reengineering" + "process reengineering" + "reengineering management". Processing: None. The dataset utilizes the original Google Trends index, which is base-100 normalized against the peak search interest for the specified terms and period. Output Metric: Monthly Normalized RSI (Base 100). Frequency: Monthly. Google Books Ngram Viewer File (Prefix: GB_): Normalized Relative Frequency Input Data: Annual relative frequency values from Google Books Ngram Viewer (1950-2022, English corpus, no smoothing) for the query Reengineering + Business Process Reengineering + Process Reengineering. Processing: The annual relative frequency series was normalized by setting the year with the maximum value to 100 and scaling all other values (years) proportionally. Output Metric: Annual Normalized Relative Frequency Index (Base 100). Frequency: Annual. Crossref.org File (Prefix: CR_): Normalized Relative Publication Share Index Input Data: Absolute monthly publication counts matching BPR-related keywords [("business process reengineering" OR ...) AND ("management" OR ...) - see raw data for full query] in titles/abstracts (1950-2025), alongside total monthly publication counts in Crossref. Data deduplicated via DOIs. Processing: For each month, the relative share of BPR-related publications (BPR Count / Total Crossref Count for that month) was calculated. This monthly relative share series was then normalized by setting the month with the maximum relative share to 100 and scaling all other months proportionally. Output Metric: Monthly Normalized Relative Publication Share Index (Base 100). Frequency: Monthly. Bain & Co. Survey - Usability File (Prefix: BU_): Normalized Usability Index Input Data: Original usability percentages (%) from Bain surveys for specific years: Reengineering (1993, 1996, 2000, 2002); Business Process Reengineering (2004, 2006, 2008, 2010, 2012, 2014, 2017, 2022). Processing: Semantic Grouping: Data points for "Reengineering" and "Business Process Reengineering" were treated as a single conceptual series for BPR. Normalization: The combined series of original usability percentages was normalized relative to its own highest observed historical value across all included years (Max % = 100). Output Metric: Biennial Estimated Normalized Usability Index (Base 100 relative to historical peak). Frequency: Biennial (Approx.). Bain & Co. 
Survey - Satisfaction File (Prefix: BS_): Standardized Satisfaction Index Input Data: Original average satisfaction scores (1-5 scale) from Bain surveys for specific years: Reengineering (1993, 1996, 2000, 2002); Business Process Reengineering (2004, 2006, 2008, 2010, 2012, 2014, 2017, 2022). Processing: Semantic Grouping: Data points for "Reengineering" and "Business Process Reengineering" were treated as a single conceptual series for BPR. Standardization (Z-scores): Original scores (X) were standardized using Z = (X - μ) / σ, with a theoretically defined neutral mean μ = 3.0 and an estimated pooled population standard deviation σ ≈ 0.891609 (calculated across all tools/years relative to μ = 3.0). Index Scale Transformation: Z-scores were transformed to an intuitive index via: Index = 50 + (Z * 22). This scale centers theoretical neutrality (original score: 3.0) at 50 and maps the approximate range [1, 5] to [≈1, ≈100]. Output Metric: Biennial Standardized Satisfaction Index (Center=50, Range ≈ [1, 100]). Frequency: Biennial (Approx.). File Naming Convention: Files generally follow the pattern: PREFIX_Tool_Processed.csv or similar, where the PREFIX indicates the data source (GT_, GB_, CR_, BU_, BS_). Consult the parent Dataverse description (Management Tool Comparative Indices) for general context and the methodological disclaimer. For original extraction details (specific keywords, URLs, etc.), refer to the corresponding BPR dataset in the Raw Extracts Dataverse. Comprehensive project documentation provides full details on all processing steps.
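    A worked example of the satisfaction transformation described above (Z = (X - μ) / σ with μ = 3.0 and σ ≈ 0.891609, then Index = 50 + 22 * Z):

    # Worked example of the standardized satisfaction index described above.
    MU = 3.0           # theoretically neutral mean on the 1-5 scale
    SIGMA = 0.891609   # pooled standard deviation estimated across tools/years

    def satisfaction_index(score):
        z = (score - MU) / SIGMA
        return 50.0 + 22.0 * z

    for score in (1.0, 3.0, 3.8, 5.0):
        print(score, round(satisfaction_index(score), 1))
    # 1.0 -> ~0.7, 3.0 -> 50.0, 3.8 -> ~69.7, 5.0 -> ~99.3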

  7. Temperature Normalized Enhanced Vegetation Index for Dixie Valley, Churchill...

    • catalog.data.gov
    • data.usgs.gov
    • +2more
    Updated Nov 30, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Temperature Normalized Enhanced Vegetation Index for Dixie Valley, Churchill County, Nevada [Dataset]. https://catalog.data.gov/dataset/temperature-normalized-enhanced-vegetation-index-for-dixie-valley-churchill-county-nevada
    Explore at:
    Dataset updated
    Nov 30, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Churchill County, Nevada, Dixie Valley
    Description

    With increasing population growth and land-use change, urban communities in the desert southwest are progressively looking to remote basins to supplement existing water supplies. Recent applications for groundwater appropriations from Dixie Valley, Nevada, a primarily undeveloped basin neighboring the Carson Desert to the east, have prompted a reevaluation of the quantity of naturally discharging groundwater. The objective of this study was to develop a new, independent estimate of groundwater discharge by evapotranspiration (ET) from Dixie Valley using a combination of eddy-covariance evapotranspiration measurements and multispectral satellite imagery. Mean annual groundwater ET (ETg) was estimated during October 2009-2011 at four eddy covariance sites. Two sites were located in phreatophytic shrubland dominated by greasewood and two were located on a playa. Estimates were scaled to the basin level by combining remotely sensed imagery with field reconnaissance and site-scale ETg estimates. The Enhanced Vegetation Index (EVI) was calculated for 10 Landsat 5 Thematic Mapper scenes and combined with brightness temperature in an effort to reduce confounding (high) EVI values resulting from forbs and cheat grass in sparsely vegetated areas, and biological soil crusts from bare soil to densely vegetated areas. The resulting EVI/TB images represented by this dataset were used to calculate ET units and scale actual and potential ETg to the basin level.
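    For reference, the standard EVI formulation used with Landsat/MODIS surface reflectance (the study's EVI/TB combination itself is described above) is sketched below; the reflectance values are illustrative.

    # Standard Enhanced Vegetation Index from surface-reflectance bands (0-1 scale).
    def evi(nir, red, blue):
        return 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)

    print(round(evi(nir=0.45, red=0.08, blue=0.04), 3))  # denser vegetation -> higher EVI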

  8. Supply Chain Management (Normalized)

    • dataverse.harvard.edu
    Updated May 6, 2025
    Cite
    Diomar Anez; Dimar Anez (2025). Supply Chain Management (Normalized) [Dataset]. http://doi.org/10.7910/DVN/WNB7AY
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 6, 2025
    Dataset provided by
    Harvard Dataverse
    Authors
    Diomar Anez; Dimar Anez
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset provides processed and normalized/standardized indices for the management tool group 'Supply Chain Management' (SCM), including related concepts like Supply Chain Integration. Derived from five distinct raw data sources, these indices are specifically designed for comparative longitudinal analysis, enabling the examination of trends and relationships across different empirical domains (web search, literature, academic publishing, and executive adoption). The data presented here represent transformed versions of the original source data, aimed at achieving metric comparability. Users requiring the unprocessed source data should consult the corresponding SCM dataset in the Management Tool Source Data (Raw Extracts) Dataverse. Data Files and Processing Methodologies: Google Trends File (Prefix: GT_): Normalized Relative Search Interest (RSI) Input Data: Native monthly RSI values from Google Trends (Jan 2004 - Jan 2025) for the query "supply chain management" + "supply chain logistics" + "supply chain". Processing: None. The dataset utilizes the original Google Trends index, which is base-100 normalized against the peak search interest for the specified terms and period. Output Metric: Monthly Normalized RSI (Base 100). Frequency: Monthly. Google Books Ngram Viewer File (Prefix: GB_): Normalized Relative Frequency Input Data: Annual relative frequency values from Google Books Ngram Viewer (1950-2022, English corpus, no smoothing) for the query Supply Chain Management + Supply Chain Integration + Supply Chain. Processing: The annual relative frequency series was normalized by setting the year with the maximum value to 100 and scaling all other values (years) proportionally. Output Metric: Annual Normalized Relative Frequency Index (Base 100). Frequency: Annual. Crossref.org File (Prefix: CR_): Normalized Relative Publication Share Index Input Data: Absolute monthly publication counts matching SCM-related keywords [("supply chain management" OR ...) AND ("management" OR ...) - see raw data for full query] in titles/abstracts (1950-2025), alongside total monthly publication counts in Crossref. Data deduplicated via DOIs. Processing: For each month, the relative share of SCM-related publications (SCM Count / Total Crossref Count for that month) was calculated. This monthly relative share series was then normalized by setting the month with the maximum relative share to 100 and scaling all other months proportionally. Output Metric: Monthly Normalized Relative Publication Share Index (Base 100). Frequency: Monthly. Bain & Co. Survey - Usability File (Prefix: BU_): Normalized Usability Index Input Data: Original usability percentages (%) from Bain surveys for specific years: Supply Chain Integration (1999, 2000, 2002); Supply Chain Management (2004, 2006, 2008, 2010, 2012, 2014, 2017, 2022). Processing: Semantic Grouping: Data points for "Supply Chain Integration" and "Supply Chain Management" were treated as a single conceptual series for SCM. Normalization: The combined series of original usability percentages was normalized relative to its own highest observed historical value across all included years (Max % = 100). Output Metric: Biennial Estimated Normalized Usability Index (Base 100 relative to historical peak). Frequency: Biennial (Approx.). Bain & Co. 
Survey - Satisfaction File (Prefix: BS_): Standardized Satisfaction Index Input Data: Original average satisfaction scores (1-5 scale) from Bain surveys for specific years: Supply Chain Integration (1999, 2000, 2002); Supply Chain Management (2004, 2006, 2008, 2010, 2012, 2014, 2017, 2022). Processing: Semantic Grouping: Data points for "Supply Chain Integration" and "Supply Chain Management" were treated as a single conceptual series for SCM. Standardization (Z-scores): Original scores (X) were standardized using Z = (X - μ) / σ, with μ = 3.0 and σ ≈ 0.891609. Index Scale Transformation: Z-scores were transformed via Index = 50 + (Z * 22). Output Metric: Biennial Standardized Satisfaction Index (Center=50, Range ≈ [1, 100]). Frequency: Biennial (Approx.). File Naming Convention: Files generally follow the pattern: PREFIX_Tool_Processed.csv or similar, where the PREFIX indicates the data source (GT_, GB_, CR_, BU_, BS_). Consult the parent Dataverse description (Management Tool Comparative Indices) for general context and the methodological disclaimer. For original extraction details (specific keywords, URLs, etc.), refer to the corresponding SCM dataset in the Raw Extracts Dataverse. Comprehensive project documentation provides full details on all processing steps.

  9. Sample dataset for the models trained and tested in the paper 'Can AI be...

    • zenodo.org
    zip
    Updated Aug 1, 2024
    + more versions
    Cite
    Elena Tomasi; Elena Tomasi; Gabriele Franch; Gabriele Franch; Marco Cristoforetti; Marco Cristoforetti (2024). Sample dataset for the models trained and tested in the paper 'Can AI be enabled to dynamical downscaling? Training a Latent Diffusion Model to mimic km-scale COSMO-CLM downscaling of ERA5 over Italy' [Dataset]. http://doi.org/10.5281/zenodo.12934521
    Explore at:
    zip (available download formats)
    Dataset updated
    Aug 1, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Elena Tomasi; Elena Tomasi; Gabriele Franch; Gabriele Franch; Marco Cristoforetti; Marco Cristoforetti
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This repository contains a sample of the input data for the models of the preprint "Can AI be enabled to dynamical downscaling? Training a Latent Diffusion Model to mimic km-scale COSMO-CLM downscaling of ERA5 over Italy". It allows the user to test and train the models on a reduced dataset (45GB).

    This sample dataset comprises ~3 years of normalized hourly data for both low-resolution predictors and high-resolution target variables. Data has been randomly picked from the whole dataset, from 2000 to 2020, with 70% of data coming from the original training dataset, 15% from the original validation dataset, and 15% from the original test dataset. Low-resolution data are preprocessed ERA5 data while high-resolution data are preprocessed VHR-REA CMCC data. Details on the performed preprocessing are available in the paper.

    This sample dataset also includes files relative to metadata, static data, normalization, and plotting.

    To use the data, clone the corresponding repository and unzip this zip file in the data folder.

  10. Data from: Isobaric labeling update in MaxQuant

    • data.mendeley.com
    Updated Oct 1, 2024
    Cite
    Daniela Ferretti (2024). Isobaric labeling update in MaxQuant [Dataset]. http://doi.org/10.17632/s3gfmcbghm.1
    Explore at:
    Dataset updated
    Oct 1, 2024
    Authors
    Daniela Ferretti
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We present an update of the MaxQuant software for isobaric labeling data and evaluate its performance on benchmark datasets. Impurity correction factors can be applied to labels mixing C- and N-type reporter ions, such as TMT Pro. Application to a single-cell species mixture benchmark shows high accuracy of the impurity-corrected results. TMT data recorded with FAIMS separation can be analyzed directly in MaxQuant without splitting the raw data into separate files per FAIMS voltage. Weighted median normalization is applied to several datasets, including large-scale human body atlas data. In the benchmark datasets the weighted median normalization either removes or strongly reduces the batch effects between different TMT plexes and results in clustering by biology. In datasets including a reference channel, we find that weighted median normalization performs as well or better when the reference channel is ignored and only the sample channel intensities are used, suggesting that the measurement of a reference channel is unnecessary when using weighted median normalization in MaxQuant. We demonstrate that MaxQuant including the weighted median normalization performs well on multi-notch MS3 data, as well as on phosphorylation data.

    Data Summary: Each folder contains MaxQuant output tables used for data analysis with their respective mqpar files. Please use the MaxQuant version specified in each dataset to open mqpar files. Perseus sessions are provided when Perseus was used for downstream analyses. Please use Perseus version 2.1.2 to load the sessions.

  11. PSML: A Multi-scale Time-series Dataset for Machine Learning in Decarbonized...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Nov 10, 2021
    Cite
    Wu, Dongqi (2021). PSML: A Multi-scale Time-series Dataset for Machine Learning in Decarbonized Energy Grids (Dataset) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5130611
    Explore at:
    Dataset updated
    Nov 10, 2021
    Dataset provided by
    Wu, Dongqi
    Zheng, Xiangtian
    Trinh, Loc
    Huang, Tong
    Liu, Yan
    Sivaranjani, S
    Xu, Nan
    Xie, Le
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract

    The electric grid is a key enabling infrastructure for the ambitious transition towards carbon neutrality as we grapple with climate change. With deepening penetration of renewable energy resources and electrified transportation, the reliable and secure operation of the electric grid becomes increasingly challenging. In this paper, we present PSML, a first-of-its-kind open-access multi-scale time-series dataset, to aid in the development of data-driven machine learning (ML) based approaches towards reliable operation of future electric grids. The dataset is generated through a novel transmission + distribution (T+D) co-simulation designed to capture the increasingly important interactions and uncertainties of the grid dynamics, containing electric load, renewable generation, weather, voltage and current measurements at multiple spatio-temporal scales. Using PSML, we provide state-of-the-art ML baselines on three challenging use cases of critical importance to achieve: (i) early detection, accurate classification and localization of dynamic disturbance events; (ii) robust hierarchical forecasting of load and renewable energy with the presence of uncertainties and extreme events; and (iii) realistic synthetic generation of physical-law-constrained measurement time series. We envision that this dataset will enable advances for ML in dynamic systems, while simultaneously allowing ML researchers to contribute towards carbon-neutral electricity and mobility.

    Data Navigation

    Please download and unzip the archive, and keep it somewhere accessible for reproducing the benchmark results, loading the data, and evaluating the performance of the proposed methods.

    wget https://zenodo.org/record/5130612/files/PSML.zip?download=1
    7z x 'PSML.zip?download=1' -o./

    Minute-level Load and Renewable

    File Name

    ISO_zone_#.csv: CAISO_zone_1.csv contains minute-level load, renewable, and weather data from 2018 to 2020 in zone 1 of CAISO (a minimal loading sketch follows the field list below).

    • Field Description

    Field time: Time of minute resolution.

    Field load_power: Normalized load power.

    Field wind_power: Normalized wind turbine power.

    Field solar_power: Normalized solar PV power.

    Field DHI: Diffuse horizontal irradiance.

    Field DNI: Direct normal irradiance.

    Field GHI: Global horizontal irradiance.

    Field Dew Point: Dew point in degree Celsius.

    Field Solar Zenith Angle: The angle between the sun's rays and the vertical direction, in degrees.

    Field Wind Speed: Wind speed (m/s).

    Field Relative Humidity: Relative humidity (%).

    Field Temperature: Temperature in degree Celsius.
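    A hedged pandas sketch for loading one minute-level file using the field names listed above (the path is hypothetical; check column labels against the downloaded CSV):

    # Hedged sketch: load minute-level load/renewable/weather data for one zone.
    import pandas as pd

    df = pd.read_csv("PSML/minute-level/CAISO_zone_1.csv", parse_dates=["time"])
    df = df.set_index("time")
    print(df[["load_power", "wind_power", "solar_power", "GHI"]].describe())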

    Minute-level PMU Measurements

    File Name

    case #: The case 0 folder contains all data of scenario setting #0.

    pf_input_#.txt: Selected load, renewable and solar generation for the simulation.

    pf_result_#.csv: Voltage at nodes and power on branches in the transmission system via T+D simulation.

    Field Description

    Field time: Time of minute resolution.

    Field Vm_###: Voltage magnitude (p.u.) at the bus ### in the simulated model.

    Field Va_###: Voltage angle (rad) at the bus ### in the simulated model.

    Field P_#_#_#: P_3_4_1 means the active power transferring in the #1 branch from the bus 3 to 4.

    Field Q_#_#_#: Q_5_20_1 means the reactive power transferring in the #1 branch from the bus 5 to 20.

    Millisecond-level PMU Measurements

    File Name

    Forced Oscillation: The folder contains all forced oscillation cases.

    row_#: The folder contains all data of the disturbance scenario #.

    dist.csv: Three-phase voltage at nodes in the distribution system via T+D simulation.

    info.csv: This file contains the start time, end time, location and type of the disturbance.

    trans.csv: Voltage at nodes and power on branches in the transmission system via T+D simulation.

    Natural Oscillation: The folder contains all natural oscillation cases.

    row_#: The folder contains all data of the disturbance scenario #.

    dist.csv: Three-phase voltage at nodes in the distribution system via T+D simulation.

    info.csv: This file contains the start time, end time, location and type of the disturbance.

    trans.csv: Voltage at nodes and power on branches in the transmission system via T+D simulation.

    Field Description

    trans.csv

    • Field Time(s): Time of millisecond resolution.

    • Field VOLT ###: Voltage magnitude (p.u.) at the bus ### in the transmission model.

    • Field POWR ### TO ### CKT #: POWR 151 TO 152 CKT '1 ' means the active power transferring in the #1 branch from the bus 151 to 152.

    • Field VARS ### TO ### CKT #: VARS 151 TO 152 CKT '1 ' means the reactive power transferring in the #1 branch from the bus 151 to 152.

    dist.csv

    Field Time(s): Time of millisecond resolution.

    Field ####.###.#: 3005.633.1 means per-unit voltage magnitude of the phase A at the bus 633 of the distribution grid, the one connecting to the bus 3005 in the transmission system.

  12. Naturalistic Neuroimaging Database

    • openneuro.org
    Updated Apr 20, 2021
    + more versions
    Cite
    Sarah Aliko; Jiawen Huang; Florin Gheorghiu; Stefanie Meliss; Jeremy I Skipper (2021). Naturalistic Neuroimaging Database [Dataset]. http://doi.org/10.18112/openneuro.ds002837.v2.0.0
    Explore at:
    Dataset updated
    Apr 20, 2021
    Dataset provided by
    OpenNeuro (https://openneuro.org/)
    Authors
    Sarah Aliko; Jiawen Huang; Florin Gheorghiu; Stefanie Meliss; Jeremy I Skipper
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Overview

    • The Naturalistic Neuroimaging Database (NNDb v2.0) contains datasets from 86 human participants doing the NIH Toolbox and then watching one of 10 full-length movies during functional magnetic resonance imaging (fMRI). The participants were all right-handed, native English speakers, with no history of neurological/psychiatric illnesses, with no hearing impairments, unimpaired or corrected vision and taking no medication. Each movie was stopped in 40-50 minute intervals or when participants asked for a break, resulting in 2-6 runs of BOLD-fMRI. A 10 minute high-resolution defaced T1-weighted anatomical MRI scan (MPRAGE) is also provided.
    • The NNDb V2.0 is now on Neuroscout, a platform for fast and flexible re-analysis of (naturalistic) fMRI studies. See: https://neuroscout.org/

    v2.0 Changes

    • Overview
      • We have replaced our own preprocessing pipeline with that implemented in AFNI’s afni_proc.py, thus changing only the derivative files. This introduces a fix for an issue with our normalization (i.e., scaling) step and modernizes and standardizes the preprocessing applied to the NNDb derivative files. We have done a bit of testing and have found that results in both pipelines are quite similar in terms of the resulting spatial patterns of activity but with the benefit that the afni_proc.py results are 'cleaner' and statistically more robust.
    • Normalization

      • Emily Finn and Clare Grall at Dartmouth and Rick Reynolds and Paul Taylor at AFNI, discovered and showed us that the normalization procedure we used for the derivative files was less than ideal for timeseries runs of varying lengths. Specifically, the 3dDetrend flag -normalize makes 'the sum-of-squares equal to 1'. We had not thought through that an implication of this is that the resulting normalized timeseries amplitudes will be affected by run length, increasing as run length decreases (and maybe this should go in 3dDetrend’s help text). To demonstrate this, I wrote a version of 3dDetrend’s -normalize for R so you can see for yourselves by running the following code:
      # Generate a resting state (rs) timeseries (ts)
      # Install / load package to make fake fMRI ts
      # install.packages("neuRosim")
      library(neuRosim)
      # Generate a ts
      ts.rs <- simTSrestingstate(nscan=2000, TR=1, SNR=1)
      # 3dDetrend -normalize
      # R command version for 3dDetrend -normalize -polort 0 which normalizes by making "the sum-of-squares equal to 1"
      # Do for the full timeseries
      ts.normalised.long <- (ts.rs-mean(ts.rs))/sqrt(sum((ts.rs-mean(ts.rs))^2));
      # Do this again for a shorter version of the same timeseries
      ts.shorter.length <- length(ts.normalised.long)/4
      ts.normalised.short <- (ts.rs[1:ts.shorter.length]- mean(ts.rs[1:ts.shorter.length]))/sqrt(sum((ts.rs[1:ts.shorter.length]- mean(ts.rs[1:ts.shorter.length]))^2));
      # By looking at the summaries, it can be seen that the median values become  larger
      summary(ts.normalised.long)
      summary(ts.normalised.short)
      # Plot results for the long and short ts
      # Truncate the longer ts for plotting only
      ts.normalised.long.made.shorter <- ts.normalised.long[1:ts.shorter.length]
      # Give the plot a title
      title <- "3dDetrend -normalize for long (blue) and short (red) timeseries";
      plot(x=0, y=0, main=title, xlab="", ylab="", xaxs='i', xlim=c(1,length(ts.normalised.short)), ylim=c(min(ts.normalised.short),max(ts.normalised.short)));
      # Add zero line
      lines(x=c(-1,ts.shorter.length), y=rep(0,2), col='grey');
      # 3dDetrend -normalize -polort 0 for long timeseries
      lines(ts.normalised.long.made.shorter, col='blue');
      # 3dDetrend -normalize -polort 0 for short timeseries
      lines(ts.normalised.short, col='red');
      
    • Standardization/modernization

      • The above individuals also encouraged us to implement the afni_proc.py script over our own pipeline. It introduces at least three additional improvements: First, we now use Bob’s @SSwarper to align our anatomical files with an MNI template (now MNI152_2009_template_SSW.nii.gz) and this, in turn, integrates nicely into the afni_proc.py pipeline. This seems to result in a generally better or more consistent alignment, though this is only a qualitative observation. Second, all the transformations / interpolations and detrending are now done in fewer steps compared to our pipeline. This is preferable because, e.g., there is less chance of inadvertently reintroducing noise back into the timeseries (see Lindquist, Geuter, Wager, & Caffo 2019). Finally, many groups are advocating using tools like fMRIPrep or afni_proc.py to increase standardization of analysis practices in our neuroimaging community. This presumably results in less error, less heterogeneity and more interpretability of results across studies. Along these lines, the quality control (‘QC’) html pages generated by afni_proc.py are a real help in assessing data quality and almost a joy to use.
    • New afni_proc.py command line

      • The following is the afni_proc.py command line that we used to generate blurred and censored timeseries files. The afni_proc.py tool comes with extensive help and examples. As such, you can quickly understand our preprocessing decisions by scrutinising the below. Specifically, the following command is most similar to Example 11 for ‘Resting state analysis’ in the help file (see https://afni.nimh.nih.gov/pub/dist/doc/program_help/afni_proc.py.html): afni_proc.py \ -subj_id "$sub_id_name_1" \ -blocks despike tshift align tlrc volreg mask blur scale regress \ -radial_correlate_blocks tcat volreg \ -copy_anat anatomical_warped/anatSS.1.nii.gz \ -anat_has_skull no \ -anat_follower anat_w_skull anat anatomical_warped/anatU.1.nii.gz \ -anat_follower_ROI aaseg anat freesurfer/SUMA/aparc.a2009s+aseg.nii.gz \ -anat_follower_ROI aeseg epi freesurfer/SUMA/aparc.a2009s+aseg.nii.gz \ -anat_follower_ROI fsvent epi freesurfer/SUMA/fs_ap_latvent.nii.gz \ -anat_follower_ROI fswm epi freesurfer/SUMA/fs_ap_wm.nii.gz \ -anat_follower_ROI fsgm epi freesurfer/SUMA/fs_ap_gm.nii.gz \ -anat_follower_erode fsvent fswm \ -dsets media_?.nii.gz \ -tcat_remove_first_trs 8 \ -tshift_opts_ts -tpattern alt+z2 \ -align_opts_aea -cost lpc+ZZ -giant_move -check_flip \ -tlrc_base "$basedset" \ -tlrc_NL_warp \ -tlrc_NL_warped_dsets \ anatomical_warped/anatQQ.1.nii.gz \ anatomical_warped/anatQQ.1.aff12.1D \ anatomical_warped/anatQQ.1_WARP.nii.gz \ -volreg_align_to MIN_OUTLIER \ -volreg_post_vr_allin yes \ -volreg_pvra_base_index MIN_OUTLIER \ -volreg_align_e2a \ -volreg_tlrc_warp \ -mask_opts_automask -clfrac 0.10 \ -mask_epi_anat yes \ -blur_to_fwhm -blur_size $blur \ -regress_motion_per_run \ -regress_ROI_PC fsvent 3 \ -regress_ROI_PC_per_run fsvent \ -regress_make_corr_vols aeseg fsvent \ -regress_anaticor_fast \ -regress_anaticor_label fswm \ -regress_censor_motion 0.3 \ -regress_censor_outliers 0.1 \ -regress_apply_mot_types demean deriv \ -regress_est_blur_epits \ -regress_est_blur_errts \ -regress_run_clustsim no \ -regress_polort 2 \ -regress_bandpass 0.01 1 \ -html_review_style pythonic We used similar command lines to generate ‘blurred and not censored’ and the ‘not blurred and not censored’ timeseries files (described more fully below). We will provide the code used to make all derivative files available on our github site (https://github.com/lab-lab/nndb).

      We made one choice above that is different enough from our original pipeline that it is worth mentioning here. Specifically, we have quite long runs, with the average being ~40 minutes but this number can be variable (thus leading to the above issue with 3dDetrend’s -normalise). A discussion on the AFNI message board with one of our team (starting here, https://afni.nimh.nih.gov/afni/community/board/read.php?1,165243,165256#msg-165256), led to the suggestion that '-regress_polort 2' with '-regress_bandpass 0.01 1' be used for long runs. We had previously used only a variable polort with the suggested 1 + int(D/150) approach. Our new polort 2 + bandpass approach has the added benefit of working well with afni_proc.py.

      Which timeseries file you use is up to you but I have been encouraged by Rick and Paul to include a sort of PSA about this. In Paul’s own words: * Blurred data should not be used for ROI-based analyses (and potentially not for ICA? I am not certain about standard practice). * Unblurred data for ISC might be pretty noisy for voxelwise analyses, since blurring should effectively boost the SNR of active regions (and even good alignment won't be perfect everywhere). * For uncensored data, one should be concerned about motion effects being left in the data (e.g., spikes in the data). * For censored data: * Performing ISC requires the users to unionize the censoring patterns during the correlation calculation. * If wanting to calculate power spectra or spectral parameters like ALFF/fALFF/RSFA etc. (which some people might do for naturalistic tasks still), then standard FT-based methods can't be used because sampling is no longer uniform. Instead, people could use something like 3dLombScargle+3dAmpToRSFC, which calculates power spectra (and RSFC params) based on a generalization of the FT that can handle non-uniform sampling, as long as the censoring pattern is mostly random and, say, only up to about 10-15% of the data. In sum, think very carefully about which files you use. If you find you need a file we have not provided, we can happily generate different versions of the timeseries upon request and can generally do so in a week or less.

    • Effect on results

      • From numerous tests on our own analyses, we have qualitatively found that results using our old vs the new afni_proc.py preprocessing pipeline do not change all that much in terms of general spatial patterns. There is, however, an

  13. SPECMAP time scale developed by Imbrie et al., 1984 based on normalized...

    • b2find.eudat.eu
    Updated Oct 21, 2023
    + more versions
    Cite
    (2023). SPECMAP time scale developed by Imbrie et al., 1984 based on normalized planktonic records (normalized O-18 vs time, specmap.017) - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/159698fd-cbe7-51ca-80f2-a7a97f5b7219
    Explore at:
    Dataset updated
    Oct 21, 2023
    Description

    This chronology is the basic control for the age models developed in Imbrie et al., 1989 and McIntyre et al., 1989.

  14. KU-BdSL: Khulna University Bengali Sign Language dataset

    • data.mendeley.com
    Updated Jul 28, 2023
    + more versions
    Cite
    Abdullah Al Jaid Jim (2023). KU-BdSL: Khulna University Bengali Sign Language dataset [Dataset]. http://doi.org/10.17632/scpvm2nbkm.4
    Explore at:
    Dataset updated
    Jul 28, 2023
    Authors
    Abdullah Al Jaid Jim
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Khulna
    Description

    The KU-BdSL is a Bengali sign language dataset that includes three variants of the data: (i) the Uni-scale Sign Language Dataset (USLD), (ii) the Multi-scale Sign Language Dataset (MSLD), and (iii) the Annotated Multi-scale Sign Language Dataset (AMSLD). The dataset consists of images representing single-hand gestures for BdSL alphabets. Several smartphones were used to capture images from 39 participants (30 males and 9 females), none of whom received any financial benefit for contributing to the dataset. Each variant includes 30 classes that represent the 38 consonants ('shoroborno') of the Bengali alphabet. There is a total of 1,500 images in jpg format in each variant. The images were captured on flat surfaces at different times of the day to vary the brightness and contrast. Class names are Unicode values corresponding to the Bengali alphabets for USLD and MSLD.

    Folder Names: 2433 -> ‘Chandra Bindu’ 2434 -> ‘Anusshar’ 2435 -> ‘Bisharga’ 2453 -> ‘Ka’ 2454 -> ‘Kha’ 2455 -> ‘Ga’ 2456 -> ‘Gha’ 2457 -> ‘Uo’ 2458 -> ‘Ca’ 2459 -> ‘Cha’ 2460-2479 -> ‘Borgio Ja/Anta Ja’ 2461 -> ‘Jha’ 2462 -> ‘Yo’ 2463 -> ‘Ta’ 2464 -> ‘Tha’ 2465 -> ‘Da’ 2466 -> ‘Dha’ 2467-2472 -> ‘Murdha Na/Donto Na’ 2468-2510 -> ‘ta/Khanda ta’ 2469 -> ‘tha’ 2470 -> ‘da’ 2471 -> ‘dha’ 2474 -> ‘pa’ 2475 -> ‘fa’ 2476-2477 -> ‘Ba/Bha’ 2478 -> ‘Ma’ 2480-2524-2525 -> ‘Ba-y Ra/Da-y Ra/Dha-y Ra’ 2482 -> ‘La’ 2486-2488-2487 -> ‘Talobbo sha/Danta sa/Murdha Sha’ 2489 -> ‘Ha’

    USLD: USLD uses a single size for all images, 512*512 pixels. In the majority of cases, the hand is positioned near the middle of the image. MSLD: The raw images are stored in MSLD so that researchers can make changes to the dataset. The use of various smartphones yields a wide variety of image sizes. AMSLD: AMSLD contains multi-scale annotated data, which is suitable for tasks like localization and classification. Among the available annotation formats, the YOLO DarkNet annotation has been selected. Each image has an annotation text file containing five numbers separated by white space. The first number is an integer class ID corresponding to the label of that image; class IDs are mapped in a separate text file named 'obj.names'. The second and third values are the normalized coordinates of the bounding box's center, while the fourth and fifth define the bounding box's normalized width and height.
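    A minimal sketch for parsing one AMSLD annotation line in the YOLO DarkNet format described above (the example values are illustrative):

    # YOLO DarkNet line: "<class_id> <x_center> <y_center> <width> <height>",
    # with all values after the class ID normalized to [0, 1].
    def parse_yolo_line(line):
        class_id, x_c, y_c, w, h = line.split()
        return int(class_id), float(x_c), float(y_c), float(w), float(h)

    print(parse_yolo_line("4 0.512 0.430 0.300 0.410"))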

    This dataset is supported by the Research and Innovation Center, Khulna University, Khulna-9208, Bangladesh, and all the data in this dataset are free to download, modify, and use. The previous version (Version 1) of this dataset relied on the oral permission of the volunteers, while the later versions have the written consent of the participants. Therefore, we encourage researchers to use these later versions (Version 2, Version 3, or Version 4) for research purposes.

  15. 30 m-scale Annual Global Normalized Difference Urban Index Datasets from...

    • scidb.cn
    Updated Jan 13, 2023
    Cite
    Di Liu; Qingling Zhang (2023). 30 m-scale Annual Global Normalized Difference Urban Index Datasets from 2000 to 2021 [Dataset]. http://doi.org/10.57760/sciencedb.07081
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 13, 2023
    Dataset provided by
    Science Data Bank
    Authors
    Di Liu; Qingling Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Urban areas play a very important role in global climate change. There is an increasing interest in comprehending global urban areas with adequate geographic details for global climate change mitigation. Accurate and frequent urban area information is fundamental to comprehending urbanization processes and land use/cover change, as well as the impact of global climate and environmental change. Defense Meteorological Satellite Program/Operational Line Scan System (DMSP/OLS) night-light (NTL) imagery contributes powerfully to the spatial characterization of global cities, however, its application potential is seriously limited by its coarse resolution. In this paper, we generate annual Normalized Difference Urban Index (NDUI) to characterize global urban areas at a 30 m-resolution from 2000 to 2021 by combining Landsat-5,7,8 Normalized Difference Vegetation Index (NDVI) composites and DMSP/OLS NTL images on the Google Earth Engine (GEE) platform. With the capability to delineate urban boundaries and, at the same time, to present sufficient spatial details within urban areas, the NDUI datasets have the potential for urbanization studies at regional and global scales.

  16. Monthly normalized scaled surface wind friction velocity in the...

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Aug 3, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Monthly normalized scaled surface wind friction velocity in the Uinta-Piceance Basin (UT, CO) for March to October from 2001 to 2016 [Dataset]. https://catalog.data.gov/dataset/monthly-normalized-scaled-surface-wind-friction-velocity-in-the-uinta-piceance-basin-ut-co
    Explore at:
    Dataset updated
    Aug 3, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Description

    These data were compiled to conduct a study on the effect of oil and gas development on dust emission potential in the Upper Colorado River Basin. The objectives of the study were to 1) assess the effect of oil and gas development on surface roughness, and 2) model the resultant effect of oil and gas development on sediment mass flux for a range of 10-m wind speeds and threshold wind friction velocities. These data represent monthly means of normalized scaled surface wind friction velocity - a metric modelled from land surface albedo that approximates surface roughness - of March to October of the years 2001 to 2016. These data are representative of disturbed and undisturbed sites located in five climate-soil strata in the Uinta-Piceance Basin. These data were modelled from the MODIS MCD43A3 (v061) product (Schaaf and Wang 2015). Data processing and modeling was conducted in Google Earth Engine. These data can be used to understand how oil and gas development have changed surface conditions and dust emission risk in the Uintah-Piceance Basin over a 16-year period.

  17. Supplemental data for "Spectral Normalization and Voigt–Reuss net: A...

    • darus.uni-stuttgart.de
    Updated Jun 30, 2025
    Cite
    Sanath Keshav; Julius Herb; Felix Fritzen (2025). Supplemental data for "Spectral Normalization and Voigt–Reuss net: A universal approach to microstructure‐property forecasting with physical guarantees" [Dataset]. http://doi.org/10.18419/DARUS-5120
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 30, 2025
    Dataset provided by
    DaRUS
    Authors
    Sanath Keshav; Julius Herb; Felix Fritzen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Dataset funded by
    DFG
    Ministry of Science, Research, and the Arts (MWK) Baden-Württemberg
    Description

    This repository contains supplemental data for the article "Spectral Normalization and Voigt-Reuss net: A universal approach to microstructure‐property forecasting with physical guarantees", accepted for publication in GAMM-Mitteilungen by Sanath Keshav, Julius Herb, and Felix Fritzen [1]. The data contained in this DaRUS repository acts as an extension to the GitHub repository for the so-called Voigt-Reuss net. The data in this dataset is generated by solving thermal homogenization problems for an abundance of different microstructures. The microstructures are defined by periodic representative volume elements (RVE) and periodic boundary conditions are applied to the temperature fluctuations. We consider bi-phasic two-dimensional microstructures with a resolution of 400 × 400 pixels, as published in [2], and three-dimensional microstructures with a resolution of 192 × 192 × 192 voxels, as published in [3]. For both microstructure datasets, we provide the effective thermal conductivity tensor that is obtained by solving homogenization problems on the full microstructure for different material parameters in the two phases. For the simulation, we used our implementation of Fourier-Accelerated Nodal Solvers (FANS, [4]) that is based on a Finite Element Method (FEM) discretization. Further details are provided in the README.md file of this dataset, in our manuscript [1], and in the GitHub repository. [1] Keshav, S., Herb, J., and Fritzen, F. (2025). Spectral Normalization and Voigt–Reuss net: A universal approach to microstructure‐property forecasting with physical guarantees, GAMM‐Mitteilungen. (2025), e70005. https://doi.org/10.1002/gamm.70005 [2] Lißner, J. (2020). 2d microstructure data (Version V2) [dataset]. DaRUS. https://doi.org/doi:10.18419/DARUS-1151 [3] Prifling, B., Röding, M., Townsend, P., Neumann, M., and Schmidt, V. (2020). Large-scale statistical learning for mass transport prediction in porous materials using 90,000 artificially generated microstructures [dataset]. Zenodo. https://doi.org/10.5281/zenodo.4047774 [4] Leuschner, M., and Fritzen, F. (2018). Fourier-Accelerated Nodal Solvers (FANS) for homogenization problems. Computational Mechanics, 62(3), 359-392. https://doi.org/10.1007/s00466-017-1501-5
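    For orientation only (this is not the paper's spectral normalization or the Voigt-Reuss net itself): the classical Voigt (arithmetic) and Reuss (harmonic) averages that bound the effective conductivity of a bi-phasic microstructure can be sketched as follows.

    # Classical Voigt/Reuss bounds on effective conductivity of a two-phase material.
    def voigt_reuss_bounds(phi1, k1, k2):
        """phi1: volume fraction of phase 1; k1, k2: phase conductivities."""
        phi2 = 1.0 - phi1
        voigt = phi1 * k1 + phi2 * k2            # upper (arithmetic) bound
        reuss = 1.0 / (phi1 / k1 + phi2 / k2)    # lower (harmonic) bound
        return reuss, voigt

    print(voigt_reuss_bounds(phi1=0.3, k1=1.0, k2=10.0))  # effective value lies in between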

  18. Mergers and Acquisitions (M&A) (Normalized)

    • dataverse.harvard.edu
    Updated May 6, 2025
    Cite
    Diomar Anez; Dimar Anez (2025). Mergers and Acquisitions (M&A) (Normalized) [Dataset]. http://doi.org/10.7910/DVN/5PMQ3K
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 6, 2025
    Dataset provided by
    Harvard Dataverse
    Authors
    Diomar Anez; Dimar Anez
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset provides processed and normalized/standardized indices for the management activity 'Mergers and Acquisitions' (M&A). Derived from five distinct raw data sources, these indices are specifically designed for comparative longitudinal analysis, enabling the examination of trends and relationships across different empirical domains (web search, literature, academic publishing, and executive adoption). The data presented here represent transformed versions of the original source data, aimed at achieving metric comparability. Users requiring the unprocessed source data should consult the corresponding M&A dataset in the Management Tool Source Data (Raw Extracts) Dataverse. Data Files and Processing Methodologies: Google Trends File (Prefix: GT_): Normalized Relative Search Interest (RSI) Input Data: Native monthly RSI values from Google Trends (Jan 2004 - Jan 2025) for the query "mergers and acquisitions" + "mergers and acquisitions corporate". Processing: None. Utilizes the original base-100 normalized Google Trends index. Output Metric: Monthly Normalized RSI (Base 100). Frequency: Monthly. Google Books Ngram Viewer File (Prefix: GB_): Normalized Relative Frequency Input Data: Annual relative frequency values from Google Books Ngram Viewer (1950-2022, English corpus, no smoothing) for the query Mergers and Acquisitions + Mergers & Acquisitions. Processing: Annual relative frequency series normalized (peak year = 100). Output Metric: Annual Normalized Relative Frequency Index (Base 100). Frequency: Annual. Crossref.org File (Prefix: CR_): Normalized Relative Publication Share Index Input Data: Absolute monthly publication counts matching M&A-related keywords [("mergers and acquisitions" OR ...) AND (...) - see raw data for full query] in titles/abstracts (1950-2025), alongside total monthly Crossref publications. Deduplicated via DOIs. Processing: Monthly relative share calculated (M&A Count / Total Count). Monthly relative share series normalized (peak month's share = 100). Output Metric: Monthly Normalized Relative Publication Share Index (Base 100). Frequency: Monthly. Bain & Co. Survey - Usability File (Prefix: BU_): Normalized Usability Index Input Data: Original usability percentages (%) from Bain surveys for specific years: Mergers and Acquisitions (2006, 2008, 2010, 2012, 2014, 2017). Note: Not reported before 2006 or after 2017. Processing: Normalization: Original usability percentages normalized relative to its historical peak (Max % = 100). Output Metric: Biennial Estimated Normalized Usability Index (Base 100 relative to historical peak). Frequency: Biennial (Approx.). Bain & Co. Survey - Satisfaction File (Prefix: BS_): Standardized Satisfaction Index Input Data: Original average satisfaction scores (1-5 scale) from Bain surveys for specific years: Mergers and Acquisitions (2006-2017). Note: Not reported before 2006 or after 2017. Processing: Standardization (Z-scores): Using Z = (X - 3.0) / 0.891609. Index Scale Transformation: Index = 50 + (Z * 22). Output Metric: Biennial Standardized Satisfaction Index (Center=50, Range ≈ [1, 100]). Frequency: Biennial (Approx.). File Naming Convention: Files generally follow the pattern: PREFIX_Tool_Processed.csv or similar, where the PREFIX indicates the data source (GT_, GB_, CR_, BU_, BS_). Consult the parent Dataverse description (Management Tool Comparative Indices) for general context and the methodological disclaimer.
For original extraction details (specific keywords, URLs, etc.), refer to the corresponding M&A dataset in the Raw Extracts Dataverse. Comprehensive project documentation provides full details on all processing steps.
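The processing steps above reduce to three simple transformations: peak-based (base-100) normalization of a series, conversion of absolute counts to a normalized relative share, and Z-score standardization followed by a 50-centred index rescaling. The Python sketch below illustrates them under stated assumptions: the DataFrame and its column names are hypothetical, while the constants (centre 3.0, standard deviation 0.891609, scale factor 22) come from the description above. It is an illustration of the stated formulas, not the project's own processing code.

    import pandas as pd

    def normalize_to_peak(series: pd.Series) -> pd.Series:
        # Base-100 normalization: the series' peak value maps to 100.
        return 100.0 * series / series.max()

    def relative_share_index(counts: pd.Series, totals: pd.Series) -> pd.Series:
        # Relative share (tool count / total count), then peak period = 100.
        return normalize_to_peak(counts / totals)

    def satisfaction_index(scores: pd.Series, center: float = 3.0,
                           sd: float = 0.891609, scale: float = 22.0) -> pd.Series:
        # Z = (X - 3.0) / 0.891609, then Index = 50 + 22 * Z.
        return 50.0 + scale * (scores - center) / sd

    # Hypothetical example inputs of the kinds described above.
    df = pd.DataFrame({
        "ma_pub_count": [120, 180, 260],          # monthly M&A-matching Crossref counts
        "total_pub_count": [50000, 61000, 70000],  # total monthly Crossref publications
        "satisfaction": [3.8, 4.0, 3.9],           # Bain 1-5 satisfaction scores
    })
    df["CR_index"] = relative_share_index(df["ma_pub_count"], df["total_pub_count"])
    df["BS_index"] = satisfaction_index(df["satisfaction"])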

  19. S

    30 m-scale Annual Global Normalized Difference Urban Index Datasets from...

    • scidb.cn
    Updated Mar 25, 2022
    Cite
    Di Liu; Yifang Wang; Xi Li; Qingling Zhang (2022). 30 m-scale Annual Global Normalized Difference Urban Index Datasets from 2000 to 2013 [Dataset]. http://doi.org/10.11922/sciencedb.01625
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 25, 2022
    Dataset provided by
    Science Data Bank
    Authors
    Di Liu; Yifang Wang; Xi Li; Qingling Zhang
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Urban areas play a very important role in global climate change, and there is increasing interest in comprehending global urban areas with adequate geographic detail to support climate change mitigation. Accurate and frequent urban-area information is fundamental to understanding urbanization processes and land use/cover change, as well as the impact of global climate and environmental change. Defense Meteorological Satellite Program/Operational Linescan System (DMSP/OLS) night-time light (NTL) imagery contributes powerfully to the spatial characterization of global cities; however, its application potential is seriously limited by its coarse resolution. In this paper, we generate annual Normalized Difference Urban Index (NDUI) datasets that characterize global urban areas at 30 m resolution from 2000 to 2013 by combining Landsat-7 Normalized Difference Vegetation Index (NDVI) composites and DMSP/OLS NTL images on the Google Earth Engine (GEE) platform. With the capability to delineate urban boundaries and, at the same time, present sufficient spatial detail within urban areas, the NDUI datasets have the potential to support urbanization studies at regional and global scales.
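    The entry does not restate the NDUI formula itself; in the related literature the index is commonly defined as a normalized difference between rescaled night-time light and NDVI, NDUI = (NTL - NDVI) / (NTL + NDVI), with DMSP/OLS digital numbers rescaled to [0, 1]. The Python sketch below assumes that formulation and uses small NumPy arrays in place of co-registered rasters; it is an illustrative sketch, not the authors' Google Earth Engine implementation.

        import numpy as np

        def compute_ndui(ndvi: np.ndarray, ntl_dn: np.ndarray) -> np.ndarray:
            # ndvi:   annual NDVI composite, values in [-1, 1]
            # ntl_dn: DMSP/OLS stable-light digital numbers, values in [0, 63]
            ntl = ntl_dn / 63.0                     # rescale NTL to [0, 1]
            denom = ntl + ndvi
            ndui = np.zeros_like(ndvi, dtype=float)
            # Guard against near-zero denominators (bright pixel over negative NDVI).
            np.divide(ntl - ndvi, denom, out=ndui, where=np.abs(denom) > 1e-6)
            return np.clip(ndui, -1.0, 1.0)

        # Hypothetical 2x2 tile: bright, sparsely vegetated pixels score near +1,
        # dark, densely vegetated pixels near -1.
        ndvi = np.array([[0.05, 0.10], [0.60, 0.75]])
        ntl_dn = np.array([[60.0, 55.0], [5.0, 1.0]])
        print(compute_ndui(ndvi, ntl_dn))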

  20. h

    subsampled_low_res

    • huggingface.co
    Updated May 17, 2024
    Cite
    Learning the Earth with AI and Physics (2024). subsampled_low_res [Dataset]. https://huggingface.co/datasets/LEAP/subsampled_low_res
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 17, 2024
    Dataset authored and provided by
    Learning the Earth with AI and Physics
    Description

    Inputs and targets in this dataset are pre-normalized and scaled using the .nc normalization files found in the GitHub repo: https://github.com/leap-stc/ClimSim/tree/main/preprocessing/normalizations. Read more: https://arxiv.org/abs/2306.08754.
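    As a rough illustration of how such pre-computed normalization files might be applied, the xarray sketch below loads per-variable mean and standard-deviation datasets and standardizes a raw input Dataset. The file names (input_mean.nc, input_std.nc) and the plain z-score scaling are assumptions for illustration only; the actual files, variable names, and scaling conventions are defined in the ClimSim preprocessing directory linked above.

        import xarray as xr

        # Hypothetical file names; the real normalization files live in the
        # ClimSim preprocessing/normalizations directory linked above.
        mean = xr.open_dataset("input_mean.nc")
        std = xr.open_dataset("input_std.nc")

        def normalize_inputs(raw: xr.Dataset) -> xr.Dataset:
            # Standardize each variable that has matching mean/std entries.
            out = raw.copy()
            for name in raw.data_vars:
                if name in mean.data_vars and name in std.data_vars:
                    out[name] = (raw[name] - mean[name]) / std[name]
            return out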
