Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Characteristics of the ecocentric vs. social-ecological clusters identified in respondents’ individual cognitive maps (ICMs).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Optical birefringence imaging techniques have enabled quantitative evaluation of macroscopic structures, e.g., domains and grain boundaries. With inhomogeneous samples, the selection of regions for analysis can significantly affect the conclusions, and arbitrary selection can lead to inaccurate findings. Therefore, in this study, we present a method that clusters all birefringence imaging data using K-means multivariate clustering on a pixel-by-pixel basis to eliminate arbitrariness in the region selection process. Linear statistics cannot be applied to the polarization states of light, which are described by angles and their periodicity; thus, circular statistics are used for clustering. By applying this approach to a 42,280-pixel image comprising 12 explanatory variables of stress-induced ferroelectricity in SrTiO3, we were able to select a region of locally developed spontaneous polarization. This region covers only 1.9% of the total area, where the stress and/or strain is concentrated, resulting in a higher ferroelectric phase transition temperature and larger spontaneous polarization than in the other regions. K-means multivariate clustering with circular statistics is shown to be a powerful tool for eliminating arbitrariness. The proposed method is a significant analysis technique that can be applied to images based on the polarization of light, the azimuthal angle of crystals, or the scattering angle. Multimodal birefringence imaging data represented by angles and their periodicities were clustered by the K-means method using circular statistics, and we successfully selected the large spontaneous polarization area without arbitrariness in the analysis process.
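Below is a minimal sketch of pixel-wise K-means clustering that respects angular periodicity by embedding each angle on the unit circle before clustering; this is a common way to combine circular statistics with K-means, not necessarily the exact procedure used in the study, and the image size, periodicities, and number of clusters are placeholders.

# Hedged sketch: K-means on angular (circular) variables, pixel by pixel.
import numpy as np
from sklearn.cluster import KMeans

def embed_circular(angles_deg, periods_deg):
    # Map each angular variable onto the unit circle so Euclidean distance
    # respects its periodicity (e.g., 180 degrees for an optic-axis azimuth).
    phase = 2.0 * np.pi * angles_deg / periods_deg
    return np.concatenate([np.cos(phase), np.sin(phase)], axis=1)

rng = np.random.default_rng(0)
angles = rng.uniform(0.0, 180.0, size=(42280, 12))   # 12 angular explanatory variables per pixel
periods = np.full(12, 180.0)                         # assumed periodicity of each variable

features = embed_circular(angles, periods)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(features)
print(np.bincount(labels) / labels.size)             # area fraction covered by each cluster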
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Congruence (C) and spatial clustering results for fish network modules and partitioning around medoids (PAM) clusters.
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
Pathologies in which cluster assignment had a significant influence on the clinical outcome according to the multivariate analysis.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Cross-table with the number of elders (and row percentages) classified into the four clusters generated by the multivariate regression tree (T-cluster) and by the K-means clustering (C-cluster) analysis.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Note: All supplementary files are provided as a single compressed archive named dataset.zip. Users should extract this file to access the individual Excel and Python files listed below.
This supplementary dataset supports the manuscript titled “Mahalanobis-Based Multivariate Financial Statement Analysis: Outlier Detection and Typological Clustering in U.S. Tech Firms.” It contains both data files and Python scripts used in the financial ratio analysis, Mahalanobis distance computation, and hierarchical clustering stages of the study. The files are organized as follows:
ESM_1.xlsx – Raw financial ratios of 18 U.S. technology firms (2020–2024)
ESM_2.py – Python script to calculate Z-scores from raw financial ratios
ESM_3.xlsx – Dataset containing Z-scores for the selected financial ratios
ESM_4.py – Python script for generating the correlation heatmap of the Z-scores
ESM_5.xlsx – Mahalanobis distance values for each firm
ESM_6.py – Python script to compute Mahalanobis distances
ESM_7.py – Python script to visualize Mahalanobis distances
ESM_8.xlsx – Mean Z-scores per firm (used for cluster analysis)
ESM_9.py – Python script to compute mean Z-scores
ESM_10.xlsx – Re-standardized Z-scores based on firm-level means
ESM_11.py – Python script to re-standardize mean Z-scores
ESM_12.py – Python script to generate the hierarchical clustering dendrogram
All files are provided to ensure transparency and reproducibility of the computational procedures in the manuscript. Each script is commented and formatted for clarity. The dataset is intended for educational and academic reuse under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0).
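For illustration, a hedged sketch of the Mahalanobis distance stage is given below; the actual ESM_6.py may be organized differently, and the spreadsheet layout (one row per firm, Z-scored ratios in the columns) is an assumption based on the file descriptions above.

# Hedged sketch of the Mahalanobis distance computation (not the verbatim ESM_6.py).
import numpy as np
import pandas as pd

zscores = pd.read_excel("ESM_3.xlsx", index_col=0)   # assumed layout: firms in rows, Z-scored ratios in columns

X = zscores.to_numpy(dtype=float)
center = X.mean(axis=0)
cov_inv = np.linalg.pinv(np.cov(X, rowvar=False))    # pseudo-inverse guards against a singular covariance matrix

diff = X - center
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)   # squared Mahalanobis distance per firm
distances = pd.Series(np.sqrt(d2), index=zscores.index, name="mahalanobis_distance")

print(distances.sort_values(ascending=False).head()) # firms that look like multivariate outliers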
Background: The dynamics of the spread of cholera epidemics in the Democratic Republic of the Congo (DRC), from east to west and within western DRC, have been extensively studied. However, the drivers of these spread processes remain unclear. We therefore sought to better understand the factors associated with these spread dynamics and their potential underlying mechanisms.
Methods: In this eco-epidemiological study, we focused on the spread processes of cholera epidemics originating from the shores of Lake Kivu, involving the areas bordering Lake Kivu, the areas surrounding the lake, and the areas outside endemic eastern DRC (eastern and western non-endemic provinces). Over the period 2000–2018, we collected data on suspected cholera cases and a set of several variables, including types of conflicts, the number of internally displaced persons (IDPs), population density, transportation network density, and accessibility indicators. Using multivariate ordinal logistic regression models, we identified factors associated with the spread of cholera outside the endemic eastern DRC. We fitted multivariate vector autoregressive models to analyze potential underlying mechanisms involving the factors associated with these spread dynamics. Finally, we classified the affected health zones using hierarchical ascendant classification based on principal component analysis (PCA).
Findings: The increase in the number of suspected cholera cases, the exacerbation of conflict events, and the number of IDPs in eastern endemic areas were associated with an increased risk of cholera spreading outside the endemic eastern provinces. We found that the increase in suspected cholera cases was influenced by the increase in battles at a lag of 4 weeks, which in turn were influenced by violence against civilians at a 1-week lag. Violent conflict events influenced the increase in the number of IDPs 4 to 6 weeks later. Other influences and uni- or bidirectional causal links were observed between violent and non-violent conflicts, and between conflicts and IDPs. Hierarchical clustering on PCA identified three categories of affected health zones: densely populated urban areas with few but large and longer epidemics; moderately populated and accessible areas with more but smaller epidemics; and less populated, less accessible areas with more and larger epidemics.
Conclusion: Our findings argue for monitoring conflict dynamics to predict the risk of geographic expansion of cholera in the DRC. They also suggest areas where interventions should be appropriately focused to build resilience to the disease.
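As a hedged illustration of the final classification step (hierarchical clustering on principal components), the sketch below reproduces the general workflow with synthetic health-zone indicators; the variable set, number of components, and linkage settings are assumptions, not the study's exact configuration.

# Hedged sketch: hierarchical ascendant classification on PCA scores.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 6))          # health zones x indicators (e.g., cases, density, accessibility)

scores = PCA(n_components=3).fit_transform(StandardScaler().fit_transform(X))
Z = linkage(scores, method="ward")     # agglomerative (ascendant) hierarchical classification
labels = fcluster(Z, t=3, criterion="maxclust")
print(np.bincount(labels)[1:])         # sizes of the three categories of health zones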
Multivariate analysis (Cox regression with clustering by patient) for all-cause mortality (n = 556,595 tests from 105,207 patients).
Data Description: To improve SOC estimation in the United States, we upscaled site-based SOC measurements to the continental scale using a multivariate geographic clustering (MGC) approach coupled with machine learning models. First, we used the MGC approach to segment the United States at 30 arc-second resolution, based on principal component information from environmental covariates (gNATSGO soil properties, WorldClim bioclimatic variables, MODIS biological variables, and physiographic variables), into 20 SOC regions. We then trained separate random forest model ensembles for each of the SOC regions identified, using environmental covariates and soil profile measurements from the International Soil Carbon Network (ISCN) and an Alaska soil profile dataset. We estimated that United States SOC stocks for the 0-30 cm and 0-100 cm depths were 52.6 ± 3.2 and 108.3 ± 8.2 Pg C, respectively.
Files in collection (32): the collection contains 22 soil property geospatial rasters, 4 soil SOC geospatial rasters, 2 ISCN site SOC observation csv files, and 4 R scripts.
gNATSGO TIF files:
├── available_water_storage_30arc_30cm_us.tif [30 cm depth soil available water storage]
├── available_water_storage_30arc_100cm_us.tif [100 cm depth soil available water storage]
├── caco3_30arc_30cm_us.tif [30 cm depth soil CaCO3 content]
├── caco3_30arc_100cm_us.tif [100 cm depth soil CaCO3 content]
├── cec_30arc_30cm_us.tif [30 cm depth soil cation exchange capacity]
├── cec_30arc_100cm_us.tif [100 cm depth soil cation exchange capacity]
├── clay_30arc_30cm_us.tif [30 cm depth soil clay content]
├── clay_30arc_100cm_us.tif [100 cm depth soil clay content]
├── depthWT_30arc_us.tif [depth to water table]
├── kfactor_30arc_30cm_us.tif [30 cm depth soil erosion factor]
├── kfactor_30arc_100cm_us.tif [100 cm depth soil erosion factor]
├── ph_30arc_30cm_us.tif [30 cm depth soil pH]
├── ph_30arc_100cm_us.tif [100 cm depth soil pH]
├── pondingFre_30arc_us.tif [ponding frequency]
├── sand_30arc_30cm_us.tif [30 cm depth soil sand content]
├── sand_30arc_100cm_us.tif [100 cm depth soil sand content]
├── silt_30arc_30cm_us.tif [30 cm depth soil silt content]
├── silt_30arc_100cm_us.tif [100 cm depth soil silt content]
├── water_content_30arc_30cm_us.tif [30 cm depth soil water content]
└── water_content_30arc_100cm_us.tif [100 cm depth soil water content]
SOC TIF files:
├── 30cm SOC mean.tif [30 cm depth soil SOC]
├── 100cm SOC mean.tif [100 cm depth soil SOC]
├── 30cm SOC CV.tif [30 cm depth soil SOC coefficient of variation]
└── 100cm SOC CV.tif [100 cm depth soil SOC coefficient of variation]
Site observations csv files:
ISCN_rmNRCS_addNCSS_30cm.csv [30 cm ISCN sites SOC, NRCS sites replaced with NCSS centroid-removed data]
ISCN_rmNRCS_addNCSS_100cm.csv [100 cm ISCN sites SOC, NRCS sites replaced with NCSS centroid-removed data]
Data format: Geospatial files are provided in Geotiff format in Lat/Lon WGS84 EPSG: 4326 projection at 30 arc second resolution. Geospatial projection:
GEOGCS['GCS_WGS_1984', DATUM['D_WGS_1984', SPHEROID['WGS_1984',6378137,298.257223563]], PRIMEM['Greenwich',0], UNIT['Degree',0.017453292519943295]]
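A condensed, hedged sketch of the workflow described in this entry is given below (MGC segmentation on principal components of the environmental covariates, then one random forest per SOC region); it uses synthetic arrays in place of the rasters and ISCN profiles, and the number of components and forest settings are placeholders. Note that the released scripts are in R, whereas this illustration uses Python.

# Hedged sketch: multivariate geographic clustering + per-region random forests.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
covariates = rng.normal(size=(5000, 20))      # grid cells x environmental covariates (soil, climate, MODIS, terrain)
soc_obs = np.full(5000, np.nan)
site_idx = rng.choice(5000, size=800, replace=False)
soc_obs[site_idx] = rng.lognormal(1.0, 0.5, size=800)   # stand-in for ISCN profile SOC values

pcs = PCA(n_components=5).fit_transform(StandardScaler().fit_transform(covariates))
regions = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(pcs)   # 20 SOC regions

soc_pred = np.full(5000, np.nan)
for r in range(20):
    in_region = regions == r
    train = in_region & ~np.isnan(soc_obs)
    if train.sum() < 10:                      # skip regions with too few training profiles
        continue
    rf = RandomForestRegressor(n_estimators=200, random_state=0)
    rf.fit(covariates[train], soc_obs[train])
    soc_pred[in_region] = rf.predict(covariates[in_region])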
This study subdivides the Weddell Sea, Antarctica, into seafloor regions using multivariate statistical methods. These regions are categories used for comparing, contrasting, and quantifying biogeochemical processes and biodiversity between ocean regions, both geographically and for regions undergoing development within the scope of global change. The division obtained is characterized by the dominating components and interpreted in terms of the ruling environmental conditions. The analysis uses 28 environmental variables for the sea surface, 25 variables for the seabed, and 9 variables for the analysis between surface and bottom variables. The data were taken during the years 1983-2013; some data were interpolated. The statistical errors of several interpolation methods (e.g., IDW, Indicator, Ordinary, and Co-Kriging) with changing settings were compared to identify the most reasonable method. The multivariate mathematical procedures used are regionalized classification via k-means cluster analysis, canonical-correlation analysis, and multidimensional scaling. Canonical-correlation analysis identifies the influencing factors in the different parts of the study area. Several methods for identifying the optimum number of clusters were tested. For the seabed, 8 and 12 clusters were identified as reasonable numbers for clustering the Weddell Sea; for the sea surface, the numbers were 8 and 13, and for the top/bottom analysis, 8 and 3, respectively. Additionally, the results for 20 clusters are presented for the three alternatives, offering the first small-scale environmental regionalization of the Weddell Sea. In particular, the results for 12 clusters identify marine-influenced regions which can be clearly separated from those determined by the geological catchment area and those dominated by river discharge.
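One simple way to screen candidate cluster numbers, in the spirit of the tests mentioned above, is sketched below using a silhouette criterion; this specific criterion and the synthetic variable matrix are assumptions rather than the study's actual procedure.

# Hedged sketch: screening the number of k-means clusters with silhouette scores.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(7)
env = StandardScaler().fit_transform(rng.normal(size=(2000, 25)))   # grid cells x seabed variables

for k in range(2, 15):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(env)
    score = silhouette_score(env, labels, sample_size=1000, random_state=0)
    print(k, round(score, 3))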
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository is composed of 2 compressed files, with the contents described below.
--- code.tar.gz --- The source code that implements the pipeline, as well as code and scripts needed to retrieve time series, create the plots or run the experiments. More specifically:
+ prepare.py and main.py ⇨
The Python programs that implement the pipeline, both the auxiliary and the main pipeline
stages, respectively.
+ 'anomaly' and 'config' folders ⇨
Scripts and Python files containing the configuration and some basic functions that are
used to retrieve the information needed to process the data, like the actual resource
time series from OpenTSDB, or the job metadata from Slurm.
+ 'functions' folder ⇨
Several folders with the Python programs that implement all the stages of the pipeline,
either for the Machine Learning processing (e.g., extractors, aggregators, models) or for
the technical aspects of the pipeline (e.g., pipelines, transformers).
+ plotDF.py ⇨
A Python program used to create the different plots presented, from the resource time
series to the evaluation plots.
+ several bash scripts ⇨
Used to run the experiments using a specific configuration, whether regarding which
transformers are chosen and how they are parametrized, or more technical aspects
involving how the pipeline is executed.
--- data.tar.gz --- The actual data and results, organized as follows:
+ jobs ⇨
All the jobs' resource time series plots for all the experiments, with a folder used
for each experiment. Inside each folder all the jobs are separated according to their
id, containing the plots for the different system resources (e.g., User CPU, Cached memory).
+ plots ⇨
All the predictions' plots for all the experiments in separate folders, mainly used for
evaluation purposes (e.g., scatter plot, heatmaps, Andrews curves, dendrograms). These
plots are available for all the predictors resulting from the pipeline execution. In
addition, for each predictor it is also possible to visualize the resource time series
grouped by clusters. Finally, the projections as generated by the dimension reduction
models, and the outliers detected, are also available for each experiment.
+ datasets ⇨
The datasets used for the experiments, which include the lists of job IDs to be processed
(CSV files) and the results of each stage of the pipeline (e.g., features, predictions),
and the output text files generated by several pipeline stages. Among these latter
files, the evaluation ones are worth noting, as they include all the prediction scores.
Background: Data about long-term prognosis after hospitalisation of elderly multimorbid patients remain scarce.
Objectives: To evaluate the medium- and long-term prognosis of hospitalised patients older than 75 years of age with multimorbidity, and to explore the impact of gender, age, frailty, physical dependence, and chronic diseases on mortality over a seven-year period.
Methods: We prospectively included all patients over 75 years of age hospitalised for medical reasons with two or more chronic illnesses in a specialised ward. Data on chronic diseases were collected using the Charlson comorbidity index and a questionnaire for disorders not included in this index. Demographic characteristics, the Clinical Frailty Scale, the Barthel index, and complications during hospitalisation were collected.
Results: 514 patients (46% males) with a mean age of 85 (± 5) years were included. The median follow-up was 755 days (interquartile range 25–75%: 76–1,342). Mortality was 44%, 68%, 82%, and 91% at one, three, five, and seven years, respectively. At inclusion, men were slightly younger and had lower levels of physical impairment. Nevertheless, in the multivariate analysis, men had higher mortality (p<0.001; HR: 1.43; 95% CI: 1.16–1.75). Age, the Clinical Frailty Scale, and the Barthel and Charlson indexes were significant predictors in the univariate and multivariate analyses (all p<0.001). Dementia and neoplastic diseases were statistically significant in the unadjusted but not the adjusted model. In a cluster analysis, three patterns of patients were identified, with increasing significant mortality differences between them (p<0.001; HR: 1.67; 95% CI: 1.49–1.88).
Conclusions: In our cohort, individual diseases had a limited predictive prognostic capacity, while the combination of chronic illness, frailty, and physical dependence were independent predictors of survival.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset from Tuyishime M, Spreng RL, et al. Multivariate analysis of FcR-mediated NK cell functions identifies unique clustering among humans and rhesus macaques. Frontiers in Immunology 2023 doi: 10.3389/fimmu.2023.1260377
Libraries Import:
Importing necessary libraries such as pandas, seaborn, matplotlib, scikit-learn's KMeans, and warnings.
Data Loading and Exploration:
Reading a dataset named "Mall_Customers.csv" into a pandas DataFrame (df). Displaying the first few rows of the dataset using df.head(). Conducting univariate analysis by calculating descriptive statistics with df.describe().
Univariate Analysis:
Visualizing the distribution of the 'Annual Income (k$)' column using sns.distplot. Looping through selected columns ('Age', 'Annual Income (k$)', 'Spending Score (1-100)') and plotting individual distribution plots.
Bivariate Analysis:
Creating a scatter plot for 'Annual Income (k$)' vs 'Spending Score (1-100)' using sns.scatterplot. Generating a pair plot for selected columns with gender differentiation using sns.pairplot.
Gender-Based Analysis:
Grouping the data by 'Gender' and calculating the mean for selected columns. Computing the correlation matrix for the grouped data and visualizing it using a heatmap.
Univariate Clustering:
Applying KMeans clustering with 3 clusters based on 'Annual Income (k$)' and adding the 'Income Cluster' column to the DataFrame. Plotting the elbow method to determine the optimal number of clusters.
Bivariate Clustering:
Applying KMeans clustering with 5 clusters based on 'Annual Income (k$)' and 'Spending Score (1-100)' and adding the 'Spending and Income Cluster' column. Plotting the elbow method for bivariate clustering and visualizing the cluster centers on a scatter plot. Displaying a normalized cross-tabulation between 'Spending and Income Cluster' and 'Gender'.
Multivariate Clustering:
Performing multivariate clustering by creating dummy variables, scaling selected columns, and applying KMeans clustering. Plotting the elbow method for multivariate clustering.
Result Saving:
Saving the modified DataFrame with cluster information to a CSV file named "Result.csv". Saving the multivariate clustering plot as an image file ("Multivariate_figure.png").
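A condensed, hedged sketch of the steps summarized above is shown below; it assumes the standard "Mall_Customers.csv" column names quoted in the description and only covers the elbow check, the bivariate clustering, and the result export.

# Hedged sketch of the described workflow (abridged).
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

df = pd.read_csv("Mall_Customers.csv")

# Elbow method for the bivariate case (annual income vs. spending score).
X = df[["Annual Income (k$)", "Spending Score (1-100)"]]
inertia = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in range(1, 11)]
plt.plot(range(1, 11), inertia, marker="o")
plt.xlabel("number of clusters"); plt.ylabel("inertia")
plt.savefig("elbow_bivariate.png")

# Five clusters on income and spending score, as in the description above.
df["Spending and Income Cluster"] = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
print(pd.crosstab(df["Spending and Income Cluster"], df["Gender"], normalize="index"))
df.to_csv("Result.csv", index=False)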
The Iris flower data set, or Fisher's Iris data set, is a multivariate data set used and made famous by the British statistician and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems" as an example of linear discriminant analysis. Please use this data set to cluster the iris flower data. You can use the k-means clustering algorithm.
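A minimal example of the suggested analysis, using scikit-learn's bundled copy of the Iris data set, could look like this:

# K-means clustering of Fisher's Iris data with three clusters.
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

iris = load_iris()
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(iris.data)

# Compare the unsupervised clusters with the known species labels.
print("Adjusted Rand index:", round(adjusted_rand_score(iris.target, labels), 3))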
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The purpose of data mining analysis is always to find patterns in the data using certain kinds of techniques, such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset. Before doing any work on the data, the data have to be pre-processed, and this process normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. Based on our project, after using clustering prior to classification, the performance did not improve much. The reason it did not improve could be that the features we selected to perform clustering on are not well suited for it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics.
From the dimensionality reduction perspective: this approach differs from Principal Component Analysis, which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters as a technique for reducing the data dimension will lose a lot of information, since clustering techniques are based on a metric of 'distance', and at high dimensions Euclidean distance loses pretty much all meaning. Therefore, "reducing" dimensionality by mapping data points to cluster numbers is not always good, since you may lose almost all the information.
From the perspective of creating new features: clustering analysis creates labels based on the patterns of the data, which brings uncertainty into the data. When using clustering prior to classification, the decision on the number of clusters will strongly affect the performance of the clustering, and in turn the performance of the classification. If the subset of features we apply clustering to is well suited for it, it might increase the overall classification performance. For example, if the features we use k-means on are numerical and the dimension is small, the overall classification performance may be better.
We did not lock in the clustering outputs using a random_state, in an effort to see whether they were stable. Our assumption was that if the results vary highly from run to run, which they definitely did, maybe the data just do not cluster well with the methods selected at all. Basically, the ramification we saw was that our results are not much better than random when applying clustering during data preprocessing.
Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the models' real-world effectiveness and also to continue to revise the models from time to time as things change.
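The idea discussed above, appending a k-means cluster label as an extra feature before classification, can be sketched as follows; the synthetic data set, the classifier, and the fixed random_state are stand-ins chosen for illustration, not the project's own setup.

# Hedged sketch: cluster labels as an engineered feature prior to classification.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, n_features=20, n_informative=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X_tr)   # random_state fixed here for reproducibility
X_tr_aug = np.column_stack([X_tr, km.labels_])
X_te_aug = np.column_stack([X_te, km.predict(X_te)])

for name, (a, b) in {"raw features": (X_tr, X_te), "with cluster feature": (X_tr_aug, X_te_aug)}.items():
    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(a, y_tr)
    print(name, round(accuracy_score(y_te, clf.predict(b)), 3))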
This data set collection consists of data products described in Hoffman et al. (2013). Resource and logistical constraints limit the frequency and extent of environmental observations, particularly in the Arctic, necessitating the development of a systematic sampling strategy to maximize coverage and objectively represent environmental variability at desired scales. A quantitative methodology for stratifying sampling domains, informing site selection, and determining the representativeness of measurement sites and networks is described here. Multivariate spatiotemporal clustering was applied to down-scaled general circulation model results and data for the State of Alaska at 4 km2 resolution to define multiple sets of ecoregions across two decadal time periods. Maps of ecoregions for the present (2000-2009) and future (2090-2099) were produced, showing how combinations of 37 characteristics are distributed and how they may shift in the future. Representative sampling locations are identified on present and future ecoregion maps. A representativeness metric was developed, and representativeness maps for eight candidate sampling locations were produced. This metric was used to characterize the environmental similarity of each site. This analysis provides model-inspired insights into optimal sampling strategies, offers a framework for up-scaling measurements, and provides a down-scaling approach for the integration of models and measurements. These techniques can be applied at different spatial and temporal scales to meet the needs of individual measurement campaigns. This dataset contains one zipped file, one .txt file, and one .sh file. The Next-Generation Ecosystem Experiments: Arctic (NGEE Arctic) was a research effort to reduce uncertainty in Earth System Models by developing a predictive understanding of carbon-rich Arctic ecosystems and feedbacks to climate. NGEE Arctic was supported by the Department of Energy's Office of Biological and Environmental Research. The NGEE Arctic project had two field research sites: 1) within the Arctic polygonal tundra coastal region on the Barrow Environmental Observatory (BEO) and the North Slope near Utqiagvik (Barrow), Alaska, and 2) multiple areas in the discontinuous permafrost region of the Seward Peninsula north of Nome, Alaska. Through observations, experiments, and synthesis with existing datasets, NGEE Arctic provided an enhanced knowledge base for multi-scale modeling and contributed to improved process representation at global pan-Arctic scales within the Department of Energy's Earth system model (the Energy Exascale Earth System Model, or E3SM), specifically within the E3SM Land Model component (ELM).
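As a hedged illustration of what a representativeness calculation of this kind can look like, the sketch below measures the Euclidean dissimilarity, in standardized covariate space, between one candidate site and every grid cell; the published metric and the 37-variable state space may be defined differently.

# Hedged sketch: dissimilarity of every grid cell to one candidate sampling site.
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
grid = rng.normal(size=(10000, 37))                       # grid cells x 37 environmental characteristics
grid_std = StandardScaler().fit_transform(grid)

site = grid_std[1234]                                     # one candidate sampling location (a grid cell)
dissimilarity = np.linalg.norm(grid_std - site, axis=1)   # small values = well represented by this site
print(dissimilarity.min(), dissimilarity.mean(), dissimilarity.max())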
Factors associated with membership in behavioral clusters, from multivariate stepwise ordinal logistic regression (n = 1,058).