100+ datasets found
  1. Characteristics of the ecocentric vs. social-ecological clusters identified...

    • plos.figshare.com
    xls
    Updated Jun 16, 2023
    Cite
    Céline Fromont; Julien Blanco; Christian Culas; Emmanuel Pannier; Mireille Razafindrakoto; François Roubaud; Stéphanie M. Carrière (2023). Characteristics of the ecocentric vs. social-ecological clusters identified in respondents’ individual cognitive maps (ICMs). [Dataset]. http://doi.org/10.1371/journal.pone.0272223.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 16, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Céline Fromont; Julien Blanco; Christian Culas; Emmanuel Pannier; Mireille Razafindrakoto; François Roubaud; Stéphanie M. Carrière
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Characteristics of the ecocentric vs. social-ecological clusters identified in respondents’ individual cognitive maps (ICMs).

  2. Data from: Improvements of birefringence imaging techniques to observe...

    • tandf.figshare.com
    pdf
    Updated May 8, 2025
    Cite
    Kensei Toyoda; Hirotaka Manaka; Yoko Miura (2025). Improvements of birefringence imaging techniques to observe stress-induced ferroelectricity in SrTiO3 based on K-means clustering with circular statistics [Dataset]. http://doi.org/10.6084/m9.figshare.24512017.v1
    Explore at:
    Available download formats: pdf
    Dataset updated
    May 8, 2025
    Dataset provided by
    Taylor & Francis
    Authors
    Kensei Toyoda; Hirotaka Manaka; Yoko Miura
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Optical birefringence imaging techniques have enabled quantitative evaluation of macroscopic structures, e.g. domains and grain boundaries. With inhomogeneous samples, the selection of regions for analysis can significantly affect the conclusions; thus, arbitrary selection can lead to inaccurate findings. Thus, in this study, we present a method to cluster all birefringence imaging data using K-means multivariate clustering on a pixel-by-pixel basis to eliminate arbitrariness in the region selection process. Linear statistics cannot be applied to the polarization states of light described by angles and their periodicity; thus, circular statistics are used for clustering. By applying this approach to a 42,280-pixel image comprising 12 explanatory variables of stress-induced ferroelectricity in SrTiO3, we were able to select a region of locally developed spontaneous polarization. This region covers only 1.9% of the total area, where the stress and/or strain is concentrated, thereby resulting in a higher ferroelectric phase transition temperature and larger spontaneous polarization than in the other regions. The K-means multivariate clustering with circular statistics is shown to be a powerful tool to eliminate arbitrariness. The proposed method is a significant analysis technique that can be applied to images using the polarization of light, azimuthal angle of crystals, scattering angle. Multimodal birefringence imaging data represented by angles and their periodicities were clustered by K-means method using circular statistics. We successfully selected large spontaneous polarization area without arbitrariness in analysis process.
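The description above explains that angle-valued data cannot be averaged linearly and therefore need circular statistics inside K-means. The following is a minimal sketch of that idea for a single angular variable, not the authors' implementation: the study clusters 12 explanatory variables per pixel, whereas this example assumes one angle per observation, a period of 2π, and a deterministic farthest-point initialisation. All function names and the sample angles are hypothetical.

```python
import math


def circular_mean(angles):
    """Mean direction (radians) computed from the summed unit vectors."""
    s = sum(math.sin(a) for a in angles)
    c = sum(math.cos(a) for a in angles)
    return math.atan2(s, c)


def circular_distance(a, b):
    """Smallest absolute angular difference, in the range [0, pi]."""
    d = (a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)


def farthest_point_init(angles, k):
    """Deterministic start: first angle, then repeatedly the farthest angle."""
    centers = [angles[0]]
    while len(centers) < k:
        nxt = max(angles, key=lambda a: min(circular_distance(a, c) for c in centers))
        centers.append(nxt)
    return centers


def circular_kmeans(angles, k, iterations=50):
    """K-means on angles, using circular distance and circular means."""
    centers = farthest_point_init(angles, k)
    labels = [0] * len(angles)
    for _ in range(iterations):
        # Assignment step: nearest center by angular distance.
        labels = [
            min(range(k), key=lambda j: circular_distance(a, centers[j]))
            for a in angles
        ]
        # Update step: circular mean of each cluster (keep old center if empty).
        new_centers = []
        for j in range(k):
            members = [a for a, lab in zip(angles, labels) if lab == j]
            new_centers.append(circular_mean(members) if members else centers[j])
        if all(circular_distance(n, c) < 1e-12 for n, c in zip(new_centers, centers)):
            break
        centers = new_centers
    return labels, centers


# Hypothetical angles: one group straddles 0 / 2*pi, the other sits near pi.
angles = [6.2, 0.05, 0.1, 6.25, 3.0, 3.1, 3.2, 3.3]
labels, centers = circular_kmeans(angles, 2)
```

A linear mean of the first group (6.2, 0.05, 0.1, 6.25) would land near π, on the opposite side of the circle, which is the failure mode that circular statistics avoid.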

  3. Data from: "Size" and "shape" in the measurement of multivariate...

    • search.dataone.org
    • data.niaid.nih.gov
    • +1more
    Updated Jul 4, 2025
    Cite
    Michael Greenacre (2025). "Size" and "shape" in the measurement of multivariate proximity [Dataset]. http://doi.org/10.5061/dryad.6r5j8
    Explore at:
    Dataset updated
    Jul 4, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Michael Greenacre
    Time period covered
    Mar 16, 2018
    Description
    1. Ordination and clustering methods are widely applied to ecological data that are nonnegative, for example species abundances or biomasses. These methods rely on a measure of multivariate proximity that quantifies differences between the sampling units (e.g. individuals, stations, time points), leading to results such as: (i) ordinations of the units, where interpoint distances optimally display the measured differences; (ii) clustering the units into homogeneous clusters; or (iii) assessing differences between pre-specified groups of units (e.g., regions, periods, treatment-control groups). 2. These methods all conceal a fundamental question: To what extent are the differences between the sampling units, computed according to the chosen proximity function, capturing the "size" in the multivariate observations, or their "shape"? "Size" means the overall level of the measurements: for example, some samples contain higher total abundances or more biomass, others less. "Shape" mea...
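The distinction between "size" and "shape" can be illustrated with a small numerical sketch. It assumes that "shape" refers to the relative composition of a sample (the description is truncated at that point, so this reading is an assumption), and it uses plain Euclidean distance on raw and on sum-normalised values. The abundance counts are invented for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def closure(u):
    """Rescale a nonnegative vector so that it sums to 1 (relative values)."""
    total = sum(u)
    return [a / total for a in u]


# Hypothetical abundances of three species at three sampling units.
unit_a = [10, 20, 30]
unit_b = [100, 200, 300]  # same proportions as unit_a, ten times the total
unit_c = [30, 20, 10]     # same total as unit_a, different proportions

# Distances on raw values are dominated by the overall level ("size").
raw_ab = euclidean(unit_a, unit_b)
raw_ac = euclidean(unit_a, unit_c)

# Distances on relative values ignore the overall level.
rel_ab = euclidean(closure(unit_a), closure(unit_b))
rel_ac = euclidean(closure(unit_a), closure(unit_c))
```

On the raw values, unit_a is far from unit_b and close to unit_c; on the relative values the ordering reverses, because unit_a and unit_b have identical proportions.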
  4. Congruence (C) and spatial clustering results for fish network modules and...

    • plos.figshare.com
    xls
    Updated May 31, 2023
    Cite
    Daniel J. McGarvey; Joseph A. Veech (2023). Congruence (C) and spatial clustering results for fish network modules and partitioning around medoids (PAM) clusters. [Dataset]. http://doi.org/10.1371/journal.pone.0208720.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Daniel J. McGarvey; Joseph A. Veech
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Congruence (C) and spatial clustering results for fish network modules and partitioning around medoids (PAM) clusters.

  5. Data from: Identifying stationary phases in multivariate time series for...

    • data.niaid.nih.gov
    • search.dataone.org
    • +1more
    zip
    Updated Dec 2, 2019
    Cite
    Rémi Patin; Marie-Pierre Etienne; Emilie Lebarbier; Simon Benhamou; Simon Chamaillé‐Jammes (2019). Identifying stationary phases in multivariate time series for highlighting behavioural modes and home range settlements [Dataset]. http://doi.org/10.5061/dryad.2j63369
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 2, 2019
    Dataset provided by
    Centre d'Écologie Fonctionnelle et Évolutive
    Institut de recherche mathématique de Rennes
    Authors
    Rémi Patin; Marie-Pierre Etienne; Emilie Lebarbier; Simon Benhamou; Simon Chamaillé‐Jammes
    License

    https://spdx.org/licenses/CC0-1.0.html

    Area covered
    Hwange National Park
    Description
    1. Recent advances in bio-logging open promising perspectives in the study of animal movements at numerous scales. It is now possible to record time-series of animal locations and ancillary data (e.g. activity level derived from on-board accelerometers) over extended areas and long durations with a high spatial and temporal resolution. Such time-series are often piecewise stationary, as the animal may alternate between different stationary phases (i.e. characterised by a specific mean and variance of some key parameter for limited periods). Identifying when these phases start and end is a critical first step to understand the dynamics of the underlying movement processes.
    2. We introduce a new segmentation-clustering method we called segclust2d (available as a R package at cran.r-project.org/package=segclust2d). It can segment bi- (or more generally multi-) variate time-series and possibly cluster the various segments obtained, corresponding to different phases assumed to be stationary. This method is easy to use, as it only requires specifying a minimum segment length (to prevent over-segmentation), based on biological rather than statistical considerations.
    3. This method can be applied to bivariate piecewise time-series of any nature. We focus here on two types of time-series related to animal movement, corresponding to (i) at large scale, series of bivariate coordinates of relocations, to highlight temporary home ranges, and (ii) at smaller scale, bivariate series derived from relocations data, such as speed and turning angle, to highlight different behavioural modes such as transit, feeding and resting.
    4. Using computer simulations, we show that segclust2d can rival and even outperform previous, more complex methods, which were specifically developed to highlight changes in movement modes or home range shifts (based on Hidden Markov and Ornstein-Uhlenbeck modelling), which, contrary to our method, usually require the user to provide relevant initial guesses to be efficient. Furthermore we demonstrate it on actual examples involving a zebra's small scale movements and an elephant's large scale movements, to illustrate how various movement modes and home range shifts, respectively, can be identified. 15-Aug-2019
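The segmentation step described above, with its minimum segment length, can be sketched as a least-squares dynamic program over a bivariate series. This is not the segclust2d implementation (which is an R package and also clusters the segments); it is a simplified stand-in that assumes the number of segments is given and that each segment is summarised by its per-variable mean. The function names and the toy series are hypothetical.

```python
def segment_cost(prefix, prefix_sq, start, end):
    """Sum of squared deviations from the segment mean over [start, end),
    added up across all variables."""
    n = end - start
    cost = 0.0
    for d in range(len(prefix)):
        s = prefix[d][end] - prefix[d][start]
        sq = prefix_sq[d][end] - prefix_sq[d][start]
        cost += sq - s * s / n
    return cost


def segment(series, n_segments, min_length):
    """Split a multivariate series into n_segments contiguous pieces, each at
    least min_length long, minimising the total within-segment squared error.
    `series` is a list of variables, each a list of equal length."""
    n = len(series[0])
    if n < n_segments * min_length:
        raise ValueError("series too short for the requested segmentation")

    # Prefix sums allow each segment cost to be computed in constant time.
    prefix, prefix_sq = [], []
    for var in series:
        p, q = [0.0], [0.0]
        for x in var:
            p.append(p[-1] + x)
            q.append(q[-1] + x * x)
        prefix.append(p)
        prefix_sq.append(q)

    inf = float("inf")
    # best[k][t]: minimal cost of splitting the first t points into k segments.
    best = [[inf] * (n + 1) for _ in range(n_segments + 1)]
    back = [[0] * (n + 1) for _ in range(n_segments + 1)]
    best[0][0] = 0.0
    for k in range(1, n_segments + 1):
        for t in range(k * min_length, n + 1):
            for s in range((k - 1) * min_length, t - min_length + 1):
                if best[k - 1][s] == inf:
                    continue
                c = best[k - 1][s] + segment_cost(prefix, prefix_sq, s, t)
                if c < best[k][t]:
                    best[k][t] = c
                    back[k][t] = s

    # Walk the back-pointers to recover the (start, end) bounds.
    bounds = []
    t = n
    for k in range(n_segments, 0, -1):
        s = back[k][t]
        bounds.append((s, t))
        t = s
    return bounds[::-1]


# Toy bivariate series: x shifts at index 10, y shifts at index 20.
x = [0.0] * 10 + [5.0] * 20
y = [1.0] * 20 + [4.0] * 10
bounds = segment([x, y], n_segments=3, min_length=3)
```

The minimum length plays the role described in the abstract: it prevents over-segmentation and would, in practice, be chosen from biological rather than statistical considerations.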
  6. Pathologies in which cluster assignment had a significant influence in the...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Nov 27, 2024
    Cite
    Migliore, Daniela Paula; del Campo, Pablo Gómez; Sanz-Martín, Guillermo; del Castillo-Izquierdo, José; Domínguez, Juan Manuel (2024). Pathologies in which cluster assignment had a significant influence in the clinical outcome according to the multivariate analysis. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001324521
    Explore at:
    Dataset updated
    Nov 27, 2024
    Authors
    Migliore, Daniela Paula; del Campo, Pablo Gómez; Sanz-Martín, Guillermo; del Castillo-Izquierdo, José; Domínguez, Juan Manuel
    Description

    Pathologies in which cluster assignment had a significant influence in the clinical outcome according to the multivariate analysis.

  7. Cross-table with the number of elders (and row percentages) classified in...

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Ana Cristina Paulo; Adriana Sampaio; Nadine Correia Santos; Patrício Soares Costa; Pedro Cunha; Joseph Zihl; João Cerqueira; Joana Almeida Palha; Nuno Sousa (2023). Cross-table with the number of elders (and row percentages) classified in the four clusters originated by the multivariate regression tree (T-cluster) and by the K-means clustering (C-cluster) analysis. [Dataset]. http://doi.org/10.1371/journal.pone.0024553.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Ana Cristina Paulo; Adriana Sampaio; Nadine Correia Santos; Patrício Soares Costa; Pedro Cunha; Joseph Zihl; João Cerqueira; Joana Almeida Palha; Nuno Sousa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Cross-table with the number of elders (and row percentages) classified in the four clusters originated by the multivariate regression tree (T-cluster) and by the K-means clustering (C-cluster) analysis.

  8. Supplementary Data for Mahalanobis-Based Ratio Analysis and Clustering of...

    • zenodo.org
    zip
    Updated May 7, 2025
    Cite
    o; o (2025). Supplementary Data for Mahalanobis-Based Ratio Analysis and Clustering of U.S. Tech Firms [Dataset]. http://doi.org/10.5281/zenodo.15337959
    Explore at:
    Available download formats: zip
    Dataset updated
    May 7, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    o; o
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    May 4, 2025
    Description

    Note: All supplementary files are provided as a single compressed archive named dataset.zip. Users should extract this file to access the individual Excel and Python files listed below.

    This supplementary dataset supports the manuscript titled “Mahalanobis-Based Multivariate Financial Statement Analysis: Outlier Detection and Typological Clustering in U.S. Tech Firms.” It contains both data files and Python scripts used in the financial ratio analysis, Mahalanobis distance computation, and hierarchical clustering stages of the study. The files are organized as follows:

    • ESM_1.xlsx – Raw financial ratios of 18 U.S. technology firms (2020–2024)

    • ESM_2.py – Python script to calculate Z-scores from raw financial ratios

    • ESM_3.xlsx – Dataset containing Z-scores for the selected financial ratios

    • ESM_4.py – Python script for generating the correlation heatmap of the Z-scores

    • ESM_5.xlsx – Mahalanobis distance values for each firm

    • ESM_6.py – Python script to compute Mahalanobis distances

    • ESM_7.py – Python script to visualize Mahalanobis distances

    • ESM_8.xlsx – Mean Z-scores per firm (used for cluster analysis)

    • ESM_9.py – Python script to compute mean Z-scores

    • ESM_10.xlsx – Re-standardized Z-scores based on firm-level means

    • ESM_11.py – Python script to re-standardize mean Z-scores

    • ESM_12.py – Python script to generate the hierarchical clustering dendrogram

    All files are provided to ensure transparency and reproducibility of the computational procedures in the manuscript. Each script is commented and formatted for clarity. The dataset is intended for educational and academic reuse under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0).
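The file list above names two computational steps, Z-scores and Mahalanobis distances. The sketch below shows both for two variables using only the standard library. It is not taken from the ESM scripts, whose contents are not shown here: the sample standard deviation (n − 1 denominator), the restriction to two ratios, the function names, and the ratio values are all assumptions made for illustration.

```python
import math


def z_scores(values):
    """Standardise values to mean 0 and sample standard deviation 1
    (n - 1 denominator; an assumption, the source does not specify)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def mahalanobis_2d(x, y):
    """Mahalanobis distance of each point (x[i], y[i]) from the sample mean,
    using the 2x2 sample covariance matrix inverted in closed form."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x) / (n - 1)
    syy = sum((b - my) ** 2 for b in y) / (n - 1)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    det = sxx * syy - sxy * sxy
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    distances = []
    for a, b in zip(x, y):
        dx, dy = a - mx, b - my
        # (dx, dy) * inverse(covariance) * (dx, dy)^T for a 2x2 matrix
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(math.sqrt(d2))
    return distances


# Hypothetical values of two financial ratios for six firms.
current_ratio = [1.2, 1.5, 1.1, 1.4, 1.3, 3.0]
debt_to_equity = [0.8, 0.6, 1.1, 0.5, 0.9, 0.7]

z_current = z_scores(current_ratio)
z_debt = z_scores(debt_to_equity)
distances = mahalanobis_2d(z_current, z_debt)
```

Because the Mahalanobis distance accounts for the covariance between variables, it gives the same result on raw ratios and on their Z-scores; standardising first mainly helps when the ratios are also inspected individually.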

  9. Database for spatial clustering analysis.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Aug 28, 2023
    Cite
    Mandja, Bien-Aimé; Situakibanza, Hippolyte; Ozer, Pierre; Kayembe, Harry César; Linard, Catherine; Muwonga, Jérémie; Batumbo, Doudou; Bompangue, Didier; Moutschen, Michel; Matunga, Muriel (2023). Database for spatial clustering analysis. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000955653
    Explore at:
    Dataset updated
    Aug 28, 2023
    Authors
    Mandja, Bien-Aimé; Situakibanza, Hippolyte; Ozer, Pierre; Kayembe, Harry César; Linard, Catherine; Muwonga, Jérémie; Batumbo, Doudou; Bompangue, Didier; Moutschen, Michel; Matunga, Muriel
    Description

    Background: The dynamics of the spread of cholera epidemics in the Democratic Republic of the Congo (DRC), from east to west and within western DRC, have been extensively studied. However, the drivers of these spread processes remain unclear. We therefore sought to better understand the factors associated with these spread dynamics and their potential underlying mechanisms.

    Methods: In this eco-epidemiological study, we focused on the spread processes of cholera epidemics originating from the shores of Lake Kivu, involving the areas bordering Lake Kivu, the areas surrounding the lake areas, and the areas out of endemic eastern DRC (eastern and western non-endemic provinces). Over the period 2000–2018, we collected data on suspected cholera cases and a set of variables including types of conflicts, the number of internally displaced persons (IDPs), population density, transportation network density, and accessibility indicators. Using multivariate ordinal logistic regression models, we identified factors associated with the spread of cholera outside the endemic eastern DRC. We fitted multivariate vector autoregressive models to analyze potential underlying mechanisms involving the factors associated with these spread dynamics. Finally, we classified the affected health zones using hierarchical ascendant classification based on principal component analysis (PCA).

    Findings: The increase in the number of suspected cholera cases, the exacerbation of conflict events, and the number of IDPs in eastern endemic areas were associated with an increased risk of cholera spreading outside the endemic eastern provinces. We found that the increase in suspected cholera cases was influenced by the increase in battles at a lag of 4 weeks, which were in turn influenced by violence against civilians at a 1-week lag. The violent conflict events influenced the increase in the number of IDPs 4 to 6 weeks later. Other influences and uni- or bidirectional causal links were observed between violent and non-violent conflicts, and between conflicts and IDPs. Hierarchical clustering on PCA identified three categories of affected health zones: densely populated urban areas with few but large and longer epidemics; moderately populated and accessible areas with more but small epidemics; and less populated and less accessible areas with more and larger epidemics.

    Conclusion: Our findings argue for monitoring conflict dynamics to predict the risk of geographic expansion of cholera in the DRC. They also suggest areas where interventions should be appropriately focused to build their resilience to the disease.

  10. Multivariate analysis (Cox regression with same patients clustering) for...

    • datasetcatalog.nlm.nih.gov
    Updated Feb 21, 2023
    Cite
    Haviv, Yosef S.; Shafat, Tali; Novack, Victor; Barski, Leonid (2023). Multivariate analysis (Cox regression with same patients clustering) for all-cause mortality (n = 556,595 tests of 105,207 patients). [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000950333
    Explore at:
    Dataset updated
    Feb 21, 2023
    Authors
    Haviv, Yosef S.; Shafat, Tali; Novack, Victor; Barski, Leonid
    Description

    Multivariate analysis (Cox regression with same patients clustering) for all-cause mortality (n = 556,595 tests of 105,207 patients).

  11. Upscaling soil organic carbon measurements at the continental scale using...

    • soilwise-he.containers.wur.nl
    Updated Jun 22, 2023
    Cite
    (2023). Upscaling soil organic carbon measurements at the continental scale using multivariate clustering analysis and machine learning [Dataset]. http://doi.org/10.5281/zenodo.8057232
    Explore at:
    Dataset updated
    Jun 22, 2023
    Description

    Data Description: To improve SOC estimation in the United States, we upscaled site-based SOC measurements to the continental scale using a multivariate geographic clustering (MGC) approach coupled with machine learning models. First, we used the MGC approach to segment the United States at 30 arc second resolution into 20 SOC regions, based on principal component information from environmental covariates (gNATSGO soil properties, WorldClim bioclimatic variables, MODIS biological variables, and physiographic variables). We then trained separate random forest model ensembles for each of the SOC regions identified, using environmental covariates and soil profile measurements from the International Soil Carbon Network (ISCN) and an Alaska soil profile dataset. Estimated United States SOC for the 0-30 cm and 0-100 cm depths was 52.6 ± 3.2 and 108.3 ± 8.2 Pg C, respectively.

    Files in collection (32): Collection contains 22 soil properties geospatial rasters, 4 soil SOC geospatial rasters, 2 ISCN site SOC observations csv files, and 4 R scripts.

    gNATSGO TIF files:
    ├── available_water_storage_30arc_30cm_us.tif [30 cm depth soil available water storage]
    ├── available_water_storage_30arc_100cm_us.tif [100 cm depth soil available water storage]
    ├── caco3_30arc_30cm_us.tif [30 cm depth soil CaCO3 content]
    ├── caco3_30arc_100cm_us.tif [100 cm depth soil CaCO3 content]
    ├── cec_30arc_30cm_us.tif [30 cm depth soil cation exchange capacity]
    ├── cec_30arc_100cm_us.tif [100 cm depth soil cation exchange capacity]
    ├── clay_30arc_30cm_us.tif [30 cm depth soil clay content]
    ├── clay_30arc_100cm_us.tif [100 cm depth soil clay content]
    ├── depthWT_30arc_us.tif [depth to water table]
    ├── kfactor_30arc_30cm_us.tif [30 cm depth soil erosion factor]
    ├── kfactor_30arc_100cm_us.tif [100 cm depth soil erosion factor]
    ├── ph_30arc_30cm_us.tif [30 cm depth soil pH]
    ├── ph_30arc_100cm_us.tif [100 cm depth soil pH]
    ├── pondingFre_30arc_us.tif [ponding frequency]
    ├── sand_30arc_30cm_us.tif [30 cm depth soil sand content]
    ├── sand_30arc_100cm_us.tif [100 cm depth soil sand content]
    ├── silt_30arc_30cm_us.tif [30 cm depth soil silt content]
    ├── silt_30arc_100cm_us.tif [100 cm depth soil silt content]
    ├── water_content_30arc_30cm_us.tif [30 cm depth soil water content]
    └── water_content_30arc_100cm_us.tif [100 cm depth soil water content]

    SOC TIF files:
    ├──30cm SOC mean.tif [30 cm depth soil SOC]
    ├──100cm SOC mean.tif [100 cm depth soil SOC]
    ├──30cm SOC CV.tif [30 cm depth soil SOC coefficient of variation]
    └──100cm SOC CV.tif [100 cm depth soil SOC coefficient of variation]

    Site observations csv files:
    ISCN_rmNRCS_addNCSS_30cm.csv: 30cm ISCN sites SOC replaced NRCS sites with NCSS centroid removed data
    ISCN_rmNRCS_addNCSS_100cm.csv: 100cm ISCN sites SOC replaced NRCS sites with NCSS centroid removed data

    Data format: Geospatial files are provided in GeoTIFF format in Lat/Lon WGS84 (EPSG: 4326) projection at 30 arc second resolution.

    Geospatial projection:

    GEOGCS['GCS_WGS_1984', DATUM['D_WGS_1984', SPHEROID['WGS_1984',6378137,298.257223563]], PRIMEM['Greenwich',0], UNIT['Degree',0.017453292519943295]]

    (base) [jbk@theseus ltar_regionalization]$ g.proj -w
    GEOGCS['wgs84', DATUM['WGS_1984', SPHEROID['WGS_1984',6378137,298.257223563]], PRIMEM['Greenwich',0], UNIT['degree',0.0174532925199433]]

  12. Sea surface and the sea floor regionalization of the Southern Ocean by...

    • doi.pangaea.de
    html, tsv
    Updated 2016
    Cite
    Kerstin Jerosch; Hendrik Pehlke; Katharina Teschke; Lukas Weber; Teresa Heidemann; Frauke Katharina Scharf (2016). Sea surface and the sea floor regionalization of the Southern Ocean by multivariate cluster analysis, links to ArcGIS project files [Dataset]. http://doi.org/10.1594/PANGAEA.856105
    Explore at:
    Available download formats: html, tsv
    Dataset updated
    2016
    Dataset provided by
    PANGAEA
    Alfred Wegener Institute, Helmholtz Centre for Polar and Marine Research, Bremerhaven
    Authors
    Kerstin Jerosch; Hendrik Pehlke; Katharina Teschke; Lukas Weber; Teresa Heidemann; Frauke Katharina Scharf
    Area covered
    Variables measured
    File name, File size, File format, Uniform resource locator/link to file
    Description

    This study subdivides the Weddell Sea, Antarctica, into seafloor regions using multivariate statistical methods. These regions are categories used for comparing, contrasting and quantifying biogeochemical processes and biodiversity between ocean regions geographically but also regions under development within the scope of global change. The division obtained is characterized by the dominating components and interpreted in terms of ruling environmental conditions. The analysis uses 28 environmental variables for the sea surface, 25 variables for the seabed and 9 variables for the analysis between surface and bottom variables. The data were taken during the years 1983-2013. Some data were interpolated. The statistical errors of several interpolation methods (e.g. IDW, Indicator, Ordinary and Co-Kriging) with changing settings have been compared for the identification of the most reasonable method. The multivariate mathematical procedures used are regionalized classification via k means cluster analysis, canonical-correlation analysis and multidimensional scaling. Canonical-correlation analysis identifies the influencing factors in the different parts of the cove. Several methods for the identification of the optimum number of clusters have been tested. For the seabed 8 and 12 clusters were identified as reasonable numbers for clustering the Weddell Sea. For the sea surface the numbers 8 and 13 and for the top/bottom analysis 8 and 3 were identified, respectively. Additionally, the results of 20 clusters are presented for the three alternatives offering the first small scale environmental regionalization of the Weddell Sea. Especially the results of 12 clusters identify marine-influenced regions which can be clearly separated from those determined by the geological catchment area and the ones dominated by river discharge.

  13. Datasets and source code for a pipeline architecture for feature-based...

    • data.mendeley.com
    Updated Dec 14, 2022
    Cite
    Jonatan Enes (2022). Datasets and source code for a pipeline architecture for feature-based unsupervised clustering using multivariate time series from HPC jobs [Dataset]. http://doi.org/10.17632/hgkv9cpnmn.2
    Explore at:
    Dataset updated
    Dec 14, 2022
    Authors
    Jonatan Enes
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository consists of two compressed files, whose contents are described below.

    --- code.tar.gz --- The source code that implements the pipeline, as well as code and scripts needed to retrieve time series, create the plots or run the experiments. More specifically:

    + prepare.py and main.py ⇨ 
      The Python programs that implement the pipeline, both the auxiliary and the main pipeline 
      stages, respectively. 
    
    + 'anomaly' and 'config' folders ⇨ 
      Scripts and Python files containing the configuration and some basic functions that are 
      used to retrieve the information needed to process the data, like the actual resource 
      time series from OpenTSDB, or the job metadata from Slurm.
    
    + 'functions' folder ⇨ 
      Several folders with the Python programs that implement all the stages of the pipeline, 
      either for the Machine Learning processing (e.g., extractors, aggregators, models), or 
      the technical aspect of the pipeline (e.g., pipelines, transformer).
    
    + plotDF.py ⇨ 
      A Python program used to create the different plots presented, from the resource time 
      series to the evaluation plots.
    
    + several bash scripts ⇨ 
      Used to run the experiments using a specific configuration, whether regarding which 
      transformers are chosen and how they are parametrized, or more technical aspects 
      involving how the pipeline is executed.
    

    --- data.tar.gz --- The actual data and results, organized as follows:

    + jobs ⇨ 
      All the jobs' resource time series plots for all the experiments, with a folder used 
      for each experiment. Inside each folder all the jobs are separated according to their 
      id, containing the plots for the different system resources (e.g., User CPU, Cached memory).
    
    + plots ⇨ 
      All the predictions' plots for all the experiments in separated folders, mainly used for 
      evaluation purposes (e.g., scatter plot, heatmaps, Andrews curves, dendrograms). These 
      plots are available for all the predictors resulting from the pipeline execution. In 
      addition, for each predictor it is also possible to visualize the resource time series 
      grouped by clusters. Finally, the projections as generated by the dimension reduction 
      models, and the outliers detected, are also available for each experiment.
    
    + datasets ⇨ 
      The datasets used for the experiments, which include the lists of job IDs to be processed 
      (CSV files) and the results of each stage of the pipeline (e.g., features, predictions), 
      and the output text files as generated by several pipeline stages. Among these latter 
      files, the evaluation ones are worth noting, as they include all the prediction scores.
    
  14. Cluster analysis.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Jun 2, 2023
    Cite
    Almagro, Pere; Malik, Komal; Martinez-Urrea, Ana; Molina, Siena; Martínez-Camblor, Pablo; Libori, Ginebra; Monzon, Helena (2023). Cluster analysis. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001028104
    Explore at:
    Dataset updated
    Jun 2, 2023
    Authors
    Almagro, Pere; Malik, Komal; Martinez-Urrea, Ana; Molina, Siena; Martínez-Camblor, Pablo; Libori, Ginebra; Monzon, Helena
    Description

    Background: Data about long-term prognosis after hospitalisation of elderly multimorbid patients remain scarce.

    Objectives: Evaluate medium and long-term prognosis in hospitalised patients older than 75 years of age with multimorbidity. Explore the impact of gender, age, frailty, physical dependence, and chronic diseases on mortality over a seven-year period.

    Methods: We prospectively included all patients over 75 years of age with two or more chronic illnesses who were hospitalised for medical reasons in a specialised ward. Data on chronic diseases were collected using the Charlson comorbidity index and a questionnaire for disorders not included in this index. Demographic characteristics, Clinical Frailty Scale, Barthel index, and complications during hospitalisation were collected.

    Results: 514 patients (46% males) with a mean age of 85 (± 5) years were included. The median follow-up was 755 days (interquartile range 25–75%: 76–1,342). Mortality was 44%, 68%, 82% and 91% at one, three, five, and seven years, respectively. At inclusion, men were slightly younger and had lower levels of physical impairment. Nevertheless, in the multivariate analysis, men had higher mortality (p<0.001; HR: 1.43; 95% CI: 1.16–1.75). Age, Clinical Frailty Scale, Barthel, and Charlson indexes were significant predictors in the univariate and multivariate analysis (all p<0.001). Dementia and neoplastic diseases were statistically significant in the unadjusted but not the adjusted model. In a cluster analysis, three patterns of patients were identified, with significant, increasing mortality differences between them (p<0.001; HR: 1.67; 95% CI: 1.49–1.88).

    Conclusions: In our cohort, individual diseases had a limited predictive prognostic capacity, while the combination of chronic illness, frailty, and physical dependence were independent predictors of survival.

  15. Data from: Multivariate analysis of FcR-mediated NK cell functions...

    • data.niaid.nih.gov
    Updated Oct 12, 2023
    Cite
    Marina Tuyishime; Rachel L. Spreng (2023). Multivariate analysis of FcR-mediated NK cell functions identifies unique clustering among humans and rhesus macaques - dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8436836
    Explore at:
    Dataset updated
    Oct 12, 2023
    Dataset provided by
    Duke University Human Vaccine Institute
    Department of Surgery, Duke University, Durham, NC
    Authors
    Marina Tuyishime; Rachel L. Spreng
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset from Tuyishime M, Spreng RL, et al. Multivariate analysis of FcR-mediated NK cell functions identifies unique clustering among humans and rhesus macaques. Frontiers in Immunology 2023 doi: 10.3389/fimmu.2023.1260377

  16. Shopping Mall

    • kaggle.com
    zip
    Updated Dec 15, 2023
    Anshul Pachauri (2023). Shopping Mall [Dataset]. https://www.kaggle.com/datasets/anshulpachauri/shopping-mall
    Explore at:
    Available download formats: zip (22852 bytes)
    Dataset updated
    Dec 15, 2023
    Authors
    Anshul Pachauri
    Description

    Libraries Import:

    Importing necessary libraries such as pandas, seaborn, matplotlib, scikit-learn's KMeans, and warnings.

    Data Loading and Exploration:

    Reading a dataset named "Mall_Customers.csv" into a pandas DataFrame (df). Displaying the first few rows of the dataset using df.head(). Conducting univariate analysis by calculating descriptive statistics with df.describe().

    Univariate Analysis:

    Visualizing the distribution of the 'Annual Income (k$)' column using sns.distplot. Looping through selected columns ('Age', 'Annual Income (k$)', 'Spending Score (1-100)') and plotting individual distribution plots.

    Bivariate Analysis:

    Creating a scatter plot for 'Annual Income (k$)' vs 'Spending Score (1-100)' using sns.scatterplot. Generating a pair plot for selected columns with gender differentiation using sns.pairplot.

    Gender-Based Analysis:

    Grouping the data by 'Gender' and calculating the mean for selected columns. Computing the correlation matrix for the grouped data and visualizing it using a heatmap.

    Univariate Clustering:

    Applying KMeans clustering with 3 clusters based on 'Annual Income (k$)' and adding the 'Income Cluster' column to the DataFrame. Plotting the elbow method to determine the optimal number of clusters.

    Bivariate Clustering:

    Applying KMeans clustering with 5 clusters based on 'Annual Income (k$)' and 'Spending Score (1-100)' and adding the 'Spending and Income Cluster' column. Plotting the elbow method for bivariate clustering and visualizing the cluster centers on a scatter plot. Displaying a normalized cross-tabulation between 'Spending and Income Cluster' and 'Gender'.

    Multivariate Clustering:

    Performing multivariate clustering by creating dummy variables, scaling selected columns, and applying KMeans clustering. Plotting the elbow method for multivariate clustering.

    Result Saving:

    Saving the modified DataFrame with cluster information to a CSV file named "Result.csv". Saving the multivariate clustering plot as an image file ("Multivariate_figure.png").
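
    The bivariate clustering and elbow-method steps described above can be sketched roughly as follows. This is a minimal illustration, not the dataset author's notebook: it assumes scikit-learn and NumPy are installed, and it uses synthetic two-column data as a stand-in for 'Annual Income (k$)' and 'Spending Score (1-100)' because the CSV is not reproduced here. The helper name `elbow_inertias` and the blob centers are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the income / spending-score columns (assumption:
# five well-separated groups, chosen only so the elbow is easy to see).
rng = np.random.default_rng(0)
centers = np.array([[25, 20], [25, 80], [55, 50], [90, 20], [90, 80]], dtype=float)
X = np.vstack([c + rng.normal(scale=4.0, size=(40, 2)) for c in centers])


def elbow_inertias(X, k_values):
    """Fit KMeans once per k and return the within-cluster sum of squares."""
    inertias = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        inertias.append(model.inertia_)
    return inertias


k_values = list(range(1, 9))
inertias = elbow_inertias(X, k_values)

# Fit the final model with the number of clusters named in the description.
final_model = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
labels = final_model.labels_

for k, inertia in zip(k_values, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

    Plotting `inertias` against `k_values` gives the elbow curve; the k at which the curve flattens is the usual heuristic choice, though it remains a judgment call on real data.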

  17. Clustering Iris Data Set

    • kaggle.com
    zip
    Updated Sep 2, 2023
    Rifki Ilham (2023). Clustering Iris Data Set [Dataset]. https://www.kaggle.com/datasets/rifkiilham/clustering-iris-data-set/suggestions
    Explore at:
    Available download formats: zip (11024 bytes)
    Dataset updated
    Sep 2, 2023
    Authors
    Rifki Ilham
    Description

    The Iris flower data set, or Fisher's Iris data set, is a multivariate data set used and made famous by the British statistician and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems" as an example of linear discriminant analysis. Please use this data set to cluster the iris flowers. You can use the k-means clustering algorithm.

  18. Educational Attainment in North Carolina Public Schools: Use of statistical...

    • data.mendeley.com
    Updated Nov 14, 2018
    Scott Herford (2018). Educational Attainment in North Carolina Public Schools: Use of statistical modeling, data mining techniques, and machine learning algorithms to explore 2014-2017 North Carolina Public School datasets. [Dataset]. http://doi.org/10.17632/6cm9wyd5g5.1
    Explore at:
    Dataset updated
    Nov 14, 2018
    Authors
    Scott Herford
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The purpose of data mining analysis is to find patterns in the data using techniques such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset. Before doing any work on the data, the data has to be pre-processed, and this process normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. In our project, using clustering prior to classification did not improve performance much. One possible reason is that the features we selected for clustering are not well suited to it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics.

    From the dimensionality reduction perspective: Clustering is different from Principal Component Analysis, which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters to reduce the data dimension will lose a lot of information, since clustering techniques are based on a metric of 'distance'. At high dimensions, Euclidean distance loses pretty much all meaning. Therefore, "reducing" dimensionality by mapping data points to cluster numbers is not always good, since you may lose almost all the information.

    From the creating new features perspective: Clustering analysis creates labels based on the patterns of the data, which brings uncertainty into the data. When clustering is used prior to classification, the decision on the number of clusters will highly affect the performance of the clustering and, in turn, the performance of classification. If the subset of features we apply clustering to is well suited for it, it might increase the overall classification performance. For example, if the features we use k-means on are numerical and the dimension is small, the overall classification performance may be better. We did not lock in the clustering outputs using a random_state, in an effort to see whether they were stable. Our assumption was that if the results vary highly from run to run, which they definitely did, maybe the data just does not cluster well with the methods selected at all. Basically, the ramification we saw was that our results are not much better than random when applying clustering to the data preprocessing.

    Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the models' real-world effectiveness and also to continue to revise the models from time to time as things change.
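
    The run-to-run stability check mentioned above (varying the seed rather than fixing random_state) could be quantified along these lines. This is a hedged sketch, not the project's code: it assumes scikit-learn and NumPy, uses synthetic data instead of the North Carolina school datasets, and the helper name `clustering_stability` is hypothetical. Agreement between runs is measured with the adjusted Rand index, which is one reasonable choice among several.

```python
from itertools import combinations

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score


def clustering_stability(X, n_clusters, seeds):
    """Mean pairwise adjusted Rand index between k-means runs with different seeds.

    Values near 1 mean the runs agree on the partition; values near 0 mean
    the partitions agree no better than chance.
    """
    labelings = [
        KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
        for seed in seeds
    ]
    scores = [adjusted_rand_score(a, b) for a, b in combinations(labelings, 2)]
    return float(np.mean(scores))


rng = np.random.default_rng(0)

# Illustrative data only: three tight, well-separated groups ...
blobs = np.vstack(
    [np.array(c) + rng.normal(scale=0.3, size=(50, 2)) for c in ([0, 0], [5, 5], [0, 5])]
)
# ... and structureless uniform noise for comparison.
noise = rng.uniform(size=(150, 2))

stable_score = clustering_stability(blobs, n_clusters=3, seeds=range(5))
noise_score = clustering_stability(noise, n_clusters=3, seeds=range(5))

print(f"well-separated blobs: {stable_score:.3f}")
print(f"uniform noise:        {noise_score:.3f}")
```

    A low score on real data would be consistent with the authors' suspicion that the data does not cluster well with the chosen method, although a high score alone does not show that the clusters are meaningful.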

  19. Representativeness-based Sampling Network Design for the State of Alaska

    • dataone.org
    • search.dataone.org
    • +2 more
    Updated Jul 24, 2024
    Forrest Hoffman; Jitendra Kumar; Richard Mills; William Hargrove (2024). Representativeness-based Sampling Network Design for the State of Alaska [Dataset]. http://doi.org/10.5440/1108686
    Explore at:
    Dataset updated
    Jul 24, 2024
    Dataset provided by
    ESS-DIVE
    Authors
    Forrest Hoffman; Jitendra Kumar; Richard Mills; William Hargrove
    Time period covered
    Jun 1, 2013
    Description

    This data set collection consists of data products described in Hoffman et al., 2013. Resource and logistical constraints limit the frequency and extent of environmental observations, particularly in the Arctic, necessitating the development of a systematic sampling strategy to maximize coverage and objectively represent environmental variability at desired scales. A quantitative methodology for stratifying sampling domains, informing site selection, and determining the representativeness of measurement sites and networks is described here. Multivariate spatiotemporal clustering was applied to down-scaled general circulation model results and data for the State of Alaska at 4 km² resolution to define multiple sets of ecoregions across two decadal time periods. Maps of ecoregions for the present (2000-2009) and future (2090-2099) were produced, showing how combinations of 37 characteristics are distributed and how they may shift in the future. Representative sampling locations are identified on present and future ecoregion maps. A representativeness metric was developed, and representativeness maps for eight candidate sampling locations were produced. This metric was used to characterize the environmental similarity of each site. This analysis provides model-inspired insights into optimal sampling strategies, offers a framework for up-scaling measurements, and provides a down-scaling approach for integration of models and measurements. These techniques can be applied at different spatial and temporal scales to meet the needs of individual measurement campaigns. This dataset contains one zipped file, one .txt file, and one .sh file.

    The Next-Generation Ecosystem Experiments: Arctic (NGEE Arctic) was a research effort to reduce uncertainty in Earth System Models by developing a predictive understanding of carbon-rich Arctic ecosystems and feedbacks to climate. NGEE Arctic was supported by the Department of Energy's Office of Biological and Environmental Research. The NGEE Arctic project had two field research sites: 1) the Arctic polygonal tundra coastal region on the Barrow Environmental Observatory (BEO) and the North Slope near Utqiagvik (Barrow), Alaska, and 2) multiple areas in the discontinuous permafrost region of the Seward Peninsula north of Nome, Alaska. Through observations, experiments, and synthesis with existing datasets, NGEE Arctic provided an enhanced knowledge base for multi-scale modeling and contributed to improved process representation at global pan-Arctic scales within the Department of Energy's Earth System Model (the Energy Exascale Earth System Model, or E3SM), and specifically within the E3SM Land Model component (ELM).

  20. Factors associated with membership to behavioral clusters from multivariate...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Nov 12, 2020
    Doyle, Aoife M.; Newton, Charles R. J. C.; Williams, Thomas N.; Abubakar, Amina; Ssewanyana, Derrick; Nyutu, Gideon; Walumbe, David; Otiende, Mark; Nyaguara, Amek; Nyundo, Christopher; Amadi, David; Mochamah, George; Ross, David A.; Bauni, Evasius (2020). Factors associated with membership to behavioral clusters from multivariate stepwise ordinal logistic regression (n = 1,058). [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000499985
    Explore at:
    Dataset updated
    Nov 12, 2020
    Authors
    Doyle, Aoife M.; Newton, Charles R. J. C.; Williams, Thomas N.; Abubakar, Amina; Ssewanyana, Derrick; Nyutu, Gideon; Walumbe, David; Otiende, Mark; Nyaguara, Amek; Nyundo, Christopher; Amadi, David; Mochamah, George; Ross, David A.; Bauni, Evasius
    Description

    Factors associated with membership to behavioral clusters from multivariate stepwise ordinal logistic regression (n = 1,058).
