Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Characteristics of the ecocentric vs. social-ecological clusters identified in respondents’ individual cognitive maps (ICMs).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Optical birefringence imaging techniques have enabled quantitative evaluation of macroscopic structures, e.g., domains and grain boundaries. With inhomogeneous samples, the selection of regions for analysis can significantly affect the conclusions, and arbitrary selection can lead to inaccurate findings. Therefore, in this study, we present a method that clusters all birefringence imaging data using K-means multivariate clustering on a pixel-by-pixel basis to eliminate arbitrariness in the region selection process. Linear statistics cannot be applied to the polarization states of light, which are described by angles and their periodicity; thus, circular statistics are used for clustering. By applying this approach to a 42,280-pixel image comprising 12 explanatory variables of stress-induced ferroelectricity in SrTiO3, we were able to select a region of locally developed spontaneous polarization. This region covers only 1.9% of the total area, where the stress and/or strain is concentrated, resulting in a higher ferroelectric phase transition temperature and larger spontaneous polarization than in the other regions. K-means multivariate clustering with circular statistics is shown to be a powerful tool for eliminating arbitrariness. The proposed method is a significant analysis technique that can be applied to images based on the polarization of light, the azimuthal angle of crystals, or the scattering angle. Multimodal birefringence imaging data represented by angles and their periodicities were clustered by the K-means method using circular statistics, and we successfully selected the large spontaneous polarization area without arbitrariness in the analysis process.
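Below is a minimal sketch of pixel-wise K-means clustering that respects angular periodicity by embedding each angle on the unit circle before clustering; this is a common way to combine circular statistics with K-means, not necessarily the exact procedure used in the study, and the image size, periodicities, and number of clusters are placeholders.

# Hedged sketch: K-means on angular (circular) variables, pixel by pixel.
import numpy as np
from sklearn.cluster import KMeans

def embed_circular(angles_deg, periods_deg):
    # Map each angular variable onto the unit circle so Euclidean distance
    # respects its periodicity (e.g., 180 degrees for an optic-axis azimuth).
    phase = 2.0 * np.pi * angles_deg / periods_deg
    return np.concatenate([np.cos(phase), np.sin(phase)], axis=1)

rng = np.random.default_rng(0)
angles = rng.uniform(0.0, 180.0, size=(42280, 12))   # 12 angular explanatory variables per pixel
periods = np.full(12, 180.0)                         # assumed periodicity of each variable

features = embed_circular(angles, periods)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(features)
print(np.bincount(labels) / labels.size)             # area fraction covered by each cluster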
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Congruence (C) and spatial clustering results for fish network modules and partitioning around medoids (PAM) clusters.
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
Pathologies in which cluster assignment had a significant influence on the clinical outcome according to the multivariate analysis.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Cross-table with the number of elders (and row percentages) classified into the four clusters generated by the multivariate regression tree (T-cluster) and by the K-means clustering (C-cluster) analysis.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Note: All supplementary files are provided as a single compressed archive named dataset.zip. Users should extract this file to access the individual Excel and Python files listed below.
This supplementary dataset supports the manuscript titled “Mahalanobis-Based Multivariate Financial Statement Analysis: Outlier Detection and Typological Clustering in U.S. Tech Firms.” It contains both data files and Python scripts used in the financial ratio analysis, Mahalanobis distance computation, and hierarchical clustering stages of the study. The files are organized as follows:
ESM_1.xlsx – Raw financial ratios of 18 U.S. technology firms (2020–2024)
ESM_2.py – Python script to calculate Z-scores from raw financial ratios
ESM_3.xlsx – Dataset containing Z-scores for the selected financial ratios
ESM_4.py – Python script for generating the correlation heatmap of the Z-scores
ESM_5.xlsx – Mahalanobis distance values for each firm
ESM_6.py – Python script to compute Mahalanobis distances
ESM_7.py – Python script to visualize Mahalanobis distances
ESM_8.xlsx – Mean Z-scores per firm (used for cluster analysis)
ESM_9.py – Python script to compute mean Z-scores
ESM_10.xlsx – Re-standardized Z-scores based on firm-level means
ESM_11.py – Python script to re-standardize mean Z-scores
ESM_12.py – Python script to generate the hierarchical clustering dendrogram
All files are provided to ensure transparency and reproducibility of the computational procedures in the manuscript. Each script is commented and formatted for clarity. The dataset is intended for educational and academic reuse under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0).
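For illustration, a hedged sketch of the Mahalanobis distance stage is given below; the actual ESM_6.py may be organized differently, and the spreadsheet layout (one row per firm, Z-scored ratios in the columns) is an assumption based on the file descriptions above.

# Hedged sketch of the Mahalanobis distance computation (not the verbatim ESM_6.py).
import numpy as np
import pandas as pd

zscores = pd.read_excel("ESM_3.xlsx", index_col=0)   # assumed layout: firms in rows, Z-scored ratios in columns

X = zscores.to_numpy(dtype=float)
center = X.mean(axis=0)
cov_inv = np.linalg.pinv(np.cov(X, rowvar=False))    # pseudo-inverse guards against a singular covariance matrix

diff = X - center
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)   # squared Mahalanobis distance per firm
distances = pd.Series(np.sqrt(d2), index=zscores.index, name="mahalanobis_distance")

print(distances.sort_values(ascending=False).head()) # firms that look like multivariate outliers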
Background: The dynamics of the spread of cholera epidemics in the Democratic Republic of the Congo (DRC), from east to west and within western DRC, have been extensively studied. However, the drivers of these spread processes remain unclear. We therefore sought to better understand the factors associated with these spread dynamics and their potential underlying mechanisms.
Methods: In this eco-epidemiological study, we focused on the spread processes of cholera epidemics originating from the shores of Lake Kivu, involving the areas bordering Lake Kivu, the areas surrounding the lake, and the areas outside endemic eastern DRC (eastern and western non-endemic provinces). Over the period 2000–2018, we collected data on suspected cholera cases and a set of several variables, including types of conflicts, the number of internally displaced persons (IDPs), population density, transportation network density, and accessibility indicators. Using multivariate ordinal logistic regression models, we identified factors associated with the spread of cholera outside the endemic eastern DRC. We fitted multivariate vector autoregressive models to analyze potential underlying mechanisms involving the factors associated with these spread dynamics. Finally, we classified the affected health zones using hierarchical ascendant classification based on principal component analysis (PCA).
Findings: The increase in the number of suspected cholera cases, the exacerbation of conflict events, and the number of IDPs in eastern endemic areas were associated with an increased risk of cholera spreading outside the endemic eastern provinces. We found that the increase in suspected cholera cases was influenced by the increase in battles at a lag of 4 weeks, which in turn were influenced by violence against civilians at a 1-week lag. Violent conflict events influenced the increase in the number of IDPs 4 to 6 weeks later. Other influences and uni- or bidirectional causal links were observed between violent and non-violent conflicts, and between conflicts and IDPs. Hierarchical clustering on PCA identified three categories of affected health zones: densely populated urban areas with few but large and longer epidemics; moderately populated and accessible areas with more but smaller epidemics; and less populated, less accessible areas with more and larger epidemics.
Conclusion: Our findings argue for monitoring conflict dynamics to predict the risk of geographic expansion of cholera in the DRC. They also suggest areas where interventions should be appropriately focused to build resilience to the disease.
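As a hedged illustration of the final classification step (hierarchical clustering on principal components), the sketch below reproduces the general workflow with synthetic health-zone indicators; the variable set, number of components, and linkage settings are assumptions, not the study's exact configuration.

# Hedged sketch: hierarchical ascendant classification on PCA scores.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 6))          # health zones x indicators (e.g., cases, density, accessibility)

scores = PCA(n_components=3).fit_transform(StandardScaler().fit_transform(X))
Z = linkage(scores, method="ward")     # agglomerative (ascendant) hierarchical classification
labels = fcluster(Z, t=3, criterion="maxclust")
print(np.bincount(labels)[1:])         # sizes of the three categories of health zones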
Multivariate analysis (Cox regression with clustering by patient) for all-cause mortality (n = 556,595 tests from 105,207 patients).
Data Description: To improve SOC estimation in the United States, we upscaled site-based SOC measurements to the continental scale using a multivariate geographic clustering (MGC) approach coupled with machine learning models. First, we used the MGC approach to segment the United States at 30 arc-second resolution, based on principal component information from environmental covariates (gNATSGO soil properties, WorldClim bioclimatic variables, MODIS biological variables, and physiographic variables), into 20 SOC regions. We then trained separate random forest model ensembles for each of the SOC regions identified, using environmental covariates and soil profile measurements from the International Soil Carbon Network (ISCN) and an Alaska soil profile dataset. We estimated that United States SOC stocks for the 0-30 cm and 0-100 cm depths were 52.6 ± 3.2 and 108.3 ± 8.2 Pg C, respectively.
Files in collection (32): the collection contains 22 soil property geospatial rasters, 4 soil SOC geospatial rasters, 2 ISCN site SOC observation csv files, and 4 R scripts.
gNATSGO TIF files:
├── available_water_storage_30arc_30cm_us.tif [30 cm depth soil available water storage]
├── available_water_storage_30arc_100cm_us.tif [100 cm depth soil available water storage]
├── caco3_30arc_30cm_us.tif [30 cm depth soil CaCO3 content]
├── caco3_30arc_100cm_us.tif [100 cm depth soil CaCO3 content]
├── cec_30arc_30cm_us.tif [30 cm depth soil cation exchange capacity]
├── cec_30arc_100cm_us.tif [100 cm depth soil cation exchange capacity]
├── clay_30arc_30cm_us.tif [30 cm depth soil clay content]
├── clay_30arc_100cm_us.tif [100 cm depth soil clay content]
├── depthWT_30arc_us.tif [depth to water table]
├── kfactor_30arc_30cm_us.tif [30 cm depth soil erosion factor]
├── kfactor_30arc_100cm_us.tif [100 cm depth soil erosion factor]
├── ph_30arc_30cm_us.tif [30 cm depth soil pH]
├── ph_30arc_100cm_us.tif [100 cm depth soil pH]
├── pondingFre_30arc_us.tif [ponding frequency]
├── sand_30arc_30cm_us.tif [30 cm depth soil sand content]
├── sand_30arc_100cm_us.tif [100 cm depth soil sand content]
├── silt_30arc_30cm_us.tif [30 cm depth soil silt content]
├── silt_30arc_100cm_us.tif [100 cm depth soil silt content]
├── water_content_30arc_30cm_us.tif [30 cm depth soil water content]
└── water_content_30arc_100cm_us.tif [100 cm depth soil water content]
SOC TIF files:
├── 30cm SOC mean.tif [30 cm depth soil SOC]
├── 100cm SOC mean.tif [100 cm depth soil SOC]
├── 30cm SOC CV.tif [30 cm depth soil SOC coefficient of variation]
└── 100cm SOC CV.tif [100 cm depth soil SOC coefficient of variation]
Site observations csv files:
ISCN_rmNRCS_addNCSS_30cm.csv [30 cm ISCN sites SOC, NRCS sites replaced with NCSS centroid-removed data]
ISCN_rmNRCS_addNCSS_100cm.csv [100 cm ISCN sites SOC, NRCS sites replaced with NCSS centroid-removed data]
Data format: Geospatial files are provided in Geotiff format in Lat/Lon WGS84 EPSG: 4326 projection at 30 arc second resolution. Geospatial projection:
GEOGCS['GCS_WGS_1984', DATUM['D_WGS_1984', SPHEROID['WGS_1984',6378137,298.257223563]], PRIMEM['Greenwich',0], UNIT['Degree',0.017453292519943295]]
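A condensed, hedged sketch of the workflow described in this entry is given below (MGC segmentation on principal components of the environmental covariates, then one random forest per SOC region); it uses synthetic arrays in place of the rasters and ISCN profiles, and the number of components and forest settings are placeholders. Note that the released scripts are in R, whereas this illustration uses Python.

# Hedged sketch: multivariate geographic clustering + per-region random forests.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
covariates = rng.normal(size=(5000, 20))      # grid cells x environmental covariates (soil, climate, MODIS, terrain)
soc_obs = np.full(5000, np.nan)
site_idx = rng.choice(5000, size=800, replace=False)
soc_obs[site_idx] = rng.lognormal(1.0, 0.5, size=800)   # stand-in for ISCN profile SOC values

pcs = PCA(n_components=5).fit_transform(StandardScaler().fit_transform(covariates))
regions = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(pcs)   # 20 SOC regions

soc_pred = np.full(5000, np.nan)
for r in range(20):
    in_region = regions == r
    train = in_region & ~np.isnan(soc_obs)
    if train.sum() < 10:                      # skip regions with too few training profiles
        continue
    rf = RandomForestRegressor(n_estimators=200, random_state=0)
    rf.fit(covariates[train], soc_obs[train])
    soc_pred[in_region] = rf.predict(covariates[in_region])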
This study subdivides the Weddell Sea, Antarctica, into seafloor regions using multivariate statistical methods. These regions are categories used for comparing, contrasting, and quantifying biogeochemical processes and biodiversity between ocean regions, both geographically and for regions undergoing development within the scope of global change. The division obtained is characterized by the dominating components and interpreted in terms of the ruling environmental conditions. The analysis uses 28 environmental variables for the sea surface, 25 variables for the seabed, and 9 variables for the analysis between surface and bottom variables. The data were taken during the years 1983-2013; some data were interpolated. The statistical errors of several interpolation methods (e.g., IDW, Indicator, Ordinary, and Co-Kriging) with changing settings were compared to identify the most reasonable method. The multivariate mathematical procedures used are regionalized classification via k-means cluster analysis, canonical-correlation analysis, and multidimensional scaling. Canonical-correlation analysis identifies the influencing factors in the different parts of the study area. Several methods for identifying the optimum number of clusters were tested. For the seabed, 8 and 12 clusters were identified as reasonable numbers for clustering the Weddell Sea; for the sea surface, the numbers were 8 and 13, and for the top/bottom analysis, 8 and 3, respectively. Additionally, the results for 20 clusters are presented for the three alternatives, offering the first small-scale environmental regionalization of the Weddell Sea. In particular, the results for 12 clusters identify marine-influenced regions which can be clearly separated from those determined by the geological catchment area and those dominated by river discharge.
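One simple way to screen candidate cluster numbers, in the spirit of the tests mentioned above, is sketched below using a silhouette criterion; this specific criterion and the synthetic variable matrix are assumptions rather than the study's actual procedure.

# Hedged sketch: screening the number of k-means clusters with silhouette scores.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(7)
env = StandardScaler().fit_transform(rng.normal(size=(2000, 25)))   # grid cells x seabed variables

for k in range(2, 15):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(env)
    score = silhouette_score(env, labels, sample_size=1000, random_state=0)
    print(k, round(score, 3))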
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository is composed of 2 compressed files, with the contents described below.
--- code.tar.gz --- The source code that implements the pipeline, as well as code and scripts needed to retrieve time series, create the plots or run the experiments. More specifically:
+ prepare.py and main.py ⇨
The Python programs that implement the pipeline, both the auxiliary and the main pipeline
stages, respectively.
+ 'anomaly' and 'config' folders ⇨
Scripts and Python files containing the configuration and some basic functions that are
used to retrieve the information needed to process the data, like the actual resource
time series from OpenTSDB, or the job metadata from Slurm.
+ 'functions' folder ⇨
Several folders with the Python programs that implement all the stages of the pipeline,
either for the Machine Learning processing (e.g., extractors, aggregators, models) or for
the technical aspects of the pipeline (e.g., pipelines, transformers).
+ plotDF.py ⇨
A Python program used to create the different plots presented, from the resource time
series to the evaluation plots.
+ several bash scripts ⇨
Used to run the experiments using a specific configuration, whether regarding which
transformers are chosen and how they are parametrized, or more technical aspects
involving how the pipeline is executed.
--- data.tar.gz --- The actual data and results, organized as follows:
+ jobs ⇨
All the jobs' resource time series plots for all the experiments, with a folder used
for each experiment. Inside each folder all the jobs are separated according to their
id, containing the plots for the different system resources (e.g., User CPU, Cached memory).
+ plots ⇨
All the predictions' plots for all the experiments in separate folders, mainly used for
evaluation purposes (e.g., scatter plot, heatmaps, Andrews curves, dendrograms). These
plots are available for all the predictors resulting from the pipeline execution. In
addition, for each predictor it is also possible to visualize the resource time series
grouped by clusters. Finally, the projections as generated by the dimension reduction
models, and the outliers detected, are also available for each experiment.
+ datasets ⇨
The datasets used for the experiments, which include the lists of job IDs to be processed
(CSV files) and the results of each stage of the pipeline (e.g., features, predictions),
and the output text files generated by several pipeline stages. Among these latter
files, the evaluation ones are worth noting, as they include all the prediction scores.
Background: Data about long-term prognosis after hospitalisation of elderly multimorbid patients remain scarce.
Objectives: To evaluate the medium- and long-term prognosis of hospitalised patients older than 75 years of age with multimorbidity, and to explore the impact of gender, age, frailty, physical dependence, and chronic diseases on mortality over a seven-year period.
Methods: We prospectively included all patients over 75 years of age hospitalised for medical reasons with two or more chronic illnesses in a specialised ward. Data on chronic diseases were collected using the Charlson comorbidity index and a questionnaire for disorders not included in this index. Demographic characteristics, the Clinical Frailty Scale, the Barthel index, and complications during hospitalisation were collected.
Results: 514 patients (46% males) with a mean age of 85 (± 5) years were included. The median follow-up was 755 days (interquartile range 25–75%: 76–1,342). Mortality was 44%, 68%, 82%, and 91% at one, three, five, and seven years, respectively. At inclusion, men were slightly younger and had lower levels of physical impairment. Nevertheless, in the multivariate analysis, men had higher mortality (p<0.001; HR: 1.43; 95% CI: 1.16–1.75). Age, the Clinical Frailty Scale, and the Barthel and Charlson indexes were significant predictors in the univariate and multivariate analyses (all p<0.001). Dementia and neoplastic diseases were statistically significant in the unadjusted but not the adjusted model. In a cluster analysis, three patterns of patients were identified, with increasing significant mortality differences between them (p<0.001; HR: 1.67; 95% CI: 1.49–1.88).
Conclusions: In our cohort, individual diseases had a limited predictive prognostic capacity, while the combination of chronic illness, frailty, and physical dependence were independent predictors of survival.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset from Tuyishime M, Spreng RL, et al. Multivariate analysis of FcR-mediated NK cell functions identifies unique clustering among humans and rhesus macaques. Frontiers in Immunology 2023 doi: 10.3389/fimmu.2023.1260377
Libraries Import:
Importing necessary libraries such as pandas, seaborn, matplotlib, scikit-learn's KMeans, and warnings.
Data Loading and Exploration:
Reading a dataset named "Mall_Customers.csv" into a pandas DataFrame (df). Displaying the first few rows of the dataset using df.head(). Conducting univariate analysis by calculating descriptive statistics with df.describe().
Univariate Analysis:
Visualizing the distribution of the 'Annual Income (k$)' column using sns.distplot. Looping through selected columns ('Age', 'Annual Income (k$)', 'Spending Score (1-100)') and plotting individual distribution plots.
Bivariate Analysis:
Creating a scatter plot for 'Annual Income (k$)' vs 'Spending Score (1-100)' using sns.scatterplot. Generating a pair plot for selected columns with gender differentiation using sns.pairplot.
Gender-Based Analysis:
Grouping the data by 'Gender' and calculating the mean for selected columns. Computing the correlation matrix for the grouped data and visualizing it using a heatmap.
Univariate Clustering:
Applying KMeans clustering with 3 clusters based on 'Annual Income (k$)' and adding the 'Income Cluster' column to the DataFrame. Plotting the elbow method to determine the optimal number of clusters.
Bivariate Clustering:
Applying KMeans clustering with 5 clusters based on 'Annual Income (k$)' and 'Spending Score (1-100)' and adding the 'Spending and Income Cluster' column. Plotting the elbow method for bivariate clustering and visualizing the cluster centers on a scatter plot. Displaying a normalized cross-tabulation between 'Spending and Income Cluster' and 'Gender'.
Multivariate Clustering:
Performing multivariate clustering by creating dummy variables, scaling selected columns, and applying KMeans clustering. Plotting the elbow method for multivariate clustering.
Result Saving:
Saving the modified DataFrame with cluster information to a CSV file named "Result.csv". Saving the multivariate clustering plot as an image file ("Multivariate_figure.png").
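A condensed, hedged sketch of the steps summarized above is shown below; it assumes the standard "Mall_Customers.csv" column names quoted in the description and only covers the elbow check, the bivariate clustering, and the result export.

# Hedged sketch of the described workflow (abridged).
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

df = pd.read_csv("Mall_Customers.csv")

# Elbow method for the bivariate case (annual income vs. spending score).
X = df[["Annual Income (k$)", "Spending Score (1-100)"]]
inertia = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in range(1, 11)]
plt.plot(range(1, 11), inertia, marker="o")
plt.xlabel("number of clusters"); plt.ylabel("inertia")
plt.savefig("elbow_bivariate.png")

# Five clusters on income and spending score, as in the description above.
df["Spending and Income Cluster"] = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
print(pd.crosstab(df["Spending and Income Cluster"], df["Gender"], normalize="index"))
df.to_csv("Result.csv", index=False)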
The Iris flower data set, or Fisher's Iris data set, is a multivariate data set used and made famous by the British statistician and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems" as an example of linear discriminant analysis. Please use this data set to cluster the iris flower data. You can use the k-means clustering algorithm.
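A minimal example of the suggested analysis, using scikit-learn's bundled copy of the Iris data set, could look like this:

# K-means clustering of Fisher's Iris data with three clusters.
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

iris = load_iris()
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(iris.data)

# Compare the unsupervised clusters with the known species labels.
print("Adjusted Rand index:", round(adjusted_rand_score(iris.target, labels), 3))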
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The purpose of data mining analysis is always to find patterns in the data using certain kinds of techniques, such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset. Before doing any work on the data, the data have to be pre-processed, and this process normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. Based on our project, after using clustering prior to classification, the performance did not improve much. The reason it did not improve could be that the features we selected to perform clustering on are not well suited for it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics.
From the dimensionality reduction perspective: this approach differs from Principal Component Analysis, which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters as a technique for reducing the data dimension will lose a lot of information, since clustering techniques are based on a metric of 'distance', and at high dimensions Euclidean distance loses pretty much all meaning. Therefore, "reducing" dimensionality by mapping data points to cluster numbers is not always good, since you may lose almost all the information.
From the perspective of creating new features: clustering analysis creates labels based on the patterns of the data, which brings uncertainty into the data. When using clustering prior to classification, the decision on the number of clusters will strongly affect the performance of the clustering, and in turn the performance of the classification. If the subset of features we apply clustering to is well suited for it, it might increase the overall classification performance. For example, if the features we use k-means on are numerical and the dimension is small, the overall classification performance may be better.
We did not lock in the clustering outputs using a random_state, in an effort to see whether they were stable. Our assumption was that if the results vary highly from run to run, which they definitely did, maybe the data just do not cluster well with the methods selected at all. Basically, the ramification we saw was that our results are not much better than random when applying clustering during data preprocessing.
Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the models' real-world effectiveness and also to continue to revise the models from time to time as things change.
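The idea discussed above, appending a k-means cluster label as an extra feature before classification, can be sketched as follows; the synthetic data set, the classifier, and the fixed random_state are stand-ins chosen for illustration, not the project's own setup.

# Hedged sketch: cluster labels as an engineered feature prior to classification.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, n_features=20, n_informative=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X_tr)   # random_state fixed here for reproducibility
X_tr_aug = np.column_stack([X_tr, km.labels_])
X_te_aug = np.column_stack([X_te, km.predict(X_te)])

for name, (a, b) in {"raw features": (X_tr, X_te), "with cluster feature": (X_tr_aug, X_te_aug)}.items():
    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(a, y_tr)
    print(name, round(accuracy_score(y_te, clf.predict(b)), 3))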
This data set collection consists of data products described in Hoffman et al. (2013). Resource and logistical constraints limit the frequency and extent of environmental observations, particularly in the Arctic, necessitating the development of a systematic sampling strategy to maximize coverage and objectively represent environmental variability at desired scales. A quantitative methodology for stratifying sampling domains, informing site selection, and determining the representativeness of measurement sites and networks is described here. Multivariate spatiotemporal clustering was applied to down-scaled general circulation model results and data for the State of Alaska at 4 km2 resolution to define multiple sets of ecoregions across two decadal time periods. Maps of ecoregions for the present (2000-2009) and future (2090-2099) were produced, showing how combinations of 37 characteristics are distributed and how they may shift in the future. Representative sampling locations are identified on present and future ecoregion maps. A representativeness metric was developed, and representativeness maps for eight candidate sampling locations were produced. This metric was used to characterize the environmental similarity of each site. This analysis provides model-inspired insights into optimal sampling strategies, offers a framework for up-scaling measurements, and provides a down-scaling approach for the integration of models and measurements. These techniques can be applied at different spatial and temporal scales to meet the needs of individual measurement campaigns. This dataset contains one zipped file, one .txt file, and one .sh file. The Next-Generation Ecosystem Experiments: Arctic (NGEE Arctic) was a research effort to reduce uncertainty in Earth System Models by developing a predictive understanding of carbon-rich Arctic ecosystems and feedbacks to climate. NGEE Arctic was supported by the Department of Energy's Office of Biological and Environmental Research. The NGEE Arctic project had two field research sites: 1) within the Arctic polygonal tundra coastal region on the Barrow Environmental Observatory (BEO) and the North Slope near Utqiagvik (Barrow), Alaska, and 2) multiple areas in the discontinuous permafrost region of the Seward Peninsula north of Nome, Alaska. Through observations, experiments, and synthesis with existing datasets, NGEE Arctic provided an enhanced knowledge base for multi-scale modeling and contributed to improved process representation at global pan-Arctic scales within the Department of Energy's Earth system model (the Energy Exascale Earth System Model, or E3SM), specifically within the E3SM Land Model component (ELM).
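As a hedged illustration of what a representativeness calculation of this kind can look like, the sketch below measures the Euclidean dissimilarity, in standardized covariate space, between one candidate site and every grid cell; the published metric and the 37-variable state space may be defined differently.

# Hedged sketch: dissimilarity of every grid cell to one candidate sampling site.
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
grid = rng.normal(size=(10000, 37))                       # grid cells x 37 environmental characteristics
grid_std = StandardScaler().fit_transform(grid)

site = grid_std[1234]                                     # one candidate sampling location (a grid cell)
dissimilarity = np.linalg.norm(grid_std - site, axis=1)   # small values = well represented by this site
print(dissimilarity.min(), dissimilarity.mean(), dissimilarity.max())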
Factors associated with membership in behavioral clusters, from multivariate stepwise ordinal logistic regression (n = 1,058).