Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Pseudocode for the PCA-K-medoids clustering algorithm.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Simulation Data sets
This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 101025672. This work was supported by the Research Council of Norway through its Centres of Excellence scheme, project no. 262695. T.B.P acknowledges the support of the Centre for Advanced Study in Oslo, Norway, which funded and hosted the CAS research project Attosecond Quantum Dynamics Beyond the Born-Oppenheimer Approximation during the academic year 2021-2022. Partial support from the National Science Foundation (grant no. 1856702) is also acknowledged. Supplemental data for our article "A Quantum Definition of Molecular Structure". Version 1.1.0 contains data for additional k-medoids runs performed on different subsets of the complete sample.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Identification of groups in k-medoids clustering approach.
Sequencing of target-enriched libraries is an efficient and cost-effective method for obtaining DNA sequence data from hundreds of nuclear loci for phylogeny reconstruction. Much of the cost of developing targeted sequencing approaches is associated with the generation of preliminary data needed for the identification of orthologous loci for probe design. In plants, identifying orthologous loci has proven difficult due to a large number of whole-genome duplication events, especially in the angiosperms (flowering plants). We used multiple sequence alignments from over 600 angiosperms for 353 putatively single-copy protein-coding genes identified by the One Thousand Plant Transcriptomes Initiative to design a set of targeted sequencing probes for phylogenetic studies of any angiosperm group. To maximize the phylogenetic potential of the probes while minimizing the cost of production, we introduce a k-medoids clustering approach to identify the minimum number of sequences necessary to repr...
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
In this dataset, 17 indicators have been collected and/or calculated for 32 European cities. For certain characteristics, plots have been made and included in this dataset. Finally, the urban borders and the cluster assignments for each city are also included for reproducibility.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The actual numbers of T (mesostable) and F (thermostable) classes in the original datasets were 1544 and 513, respectively. The highest accuracy (100%) was observed when the EMC clustering method was applied to datasets generated by the Correlation and Uncertainty attribute weighting algorithms, which are highlighted in the table.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
For a given nC (first row) the Dunn index (DI), the Davies-Bouldin index (DBI) and the size of the various clusters are shown. The column showing the best nC value is typed in italics. Frame set #1 was used in the analysis.
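The two validity indices named above can be sketched in a few lines. This is a minimal illustration, not the computation used for the table: the record does not say which variant of the Dunn index was applied, so the sketch assumes the common one (smallest distance between points of different clusters, divided by the largest within-cluster diameter) and a centroid-based Davies-Bouldin index with Euclidean distances. The function names and the toy points are hypothetical.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a list of equally sized tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def dunn_index(clusters):
    """Assumed variant: minimum between-cluster point distance
    divided by maximum within-cluster diameter (higher is better)."""
    min_separation = min(
        math.dist(p, q)
        for a in range(len(clusters))
        for b in range(a + 1, len(clusters))
        for p in clusters[a]
        for q in clusters[b]
    )
    max_diameter = max(math.dist(p, q) for c in clusters for p in c for q in c)
    return min_separation / max_diameter

def davies_bouldin_index(clusters):
    """Mean over clusters of the worst (scatter_i + scatter_j) / centroid
    distance ratio (lower is better)."""
    cents = [centroid(c) for c in clusters]
    scatter = [
        sum(math.dist(p, m) for p in c) / len(c) for c, m in zip(clusters, cents)
    ]
    k = len(clusters)
    worst = [
        max(
            (scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
            for j in range(k)
            if j != i
        )
        for i in range(k)
    ]
    return sum(worst) / k

# Toy partition with nC = 2 (illustrative values, not taken from the dataset).
toy_clusters = [[(0.0, 0.0), (0.0, 1.0)], [(5.0, 0.0), (5.0, 1.0)]]
print(dunn_index(toy_clusters), davies_bouldin_index(toy_clusters))
```

Repeating the calculation for each candidate nC and comparing the two values is one way to arrive at a "best nC" column; the sketch fails with a division by zero if every cluster is a singleton.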
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Results of ANOVA and eta squared comparing proportion of variance in child height-for-age Z-score, women’s literacy score, and proportion of children who are deceased accounted for by Cameroonian DHS Wealth Index and EconomicClusters models.
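For readers unfamiliar with eta squared, the sketch below shows the one-way case: the between-group sum of squares divided by the total sum of squares, i.e. the share of outcome variance accounted for by the grouping. The function name and the toy values are hypothetical, and the record does not state whether the authors used this plain form or a partial or adjusted variant.

```python
def eta_squared(groups):
    """One-way eta squared: SS_between / SS_total.

    `groups` is a list of lists, one list of outcome values per group
    (for example, one list per wealth cluster)."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    return ss_between / ss_total

# Toy outcome values for two groups (illustrative only, not survey data).
print(eta_squared([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]))
```

A value near 1 means the grouping separates the outcome almost completely; a value near 0 means group membership explains little of the variance.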
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This paper presents a novel sound event detection (SED) system for rare events occurring in an open environment. Wavelet multiresolution analysis (MRA) is used to decompose the 30-second input audio clip into five levels. Wavelet denoising is then applied to the third and fifth levels of the MRA to filter out the background. Significant transitions, which may represent the onset of a rare event, are then estimated in these two levels by combining a peak-finding algorithm with the K-medoids clustering algorithm. Small portions of one-second duration, called ‘chunks’, are cropped from the input audio signal at the estimated locations of the significant transitions. Features are extracted from these chunks by a wavelet scattering network (WSN) and given as input to a support vector machine (SVM) classifier, which classifies them. The proposed SED framework produces an error rate comparable to that of SED systems based on convolutional neural network (CNN) architectures. The proposed algorithm is also computationally efficient and lightweight compared to deep learning models, as it has no learnable parameters. It requires only a single epoch of training, which is 5, 10, 200, and 600 times fewer than the models based on CNNs and deep neural networks (DNNs), CNN with long short-term memory (LSTM) network, convolutional recurrent neural network (CRNN), and CNN, respectively. The proposed model requires neither concatenation with previous frames for anomaly detection nor the additional training-data creation needed by the other deep learning models used for comparison. It needs to check almost 360 times fewer chunks for the presence of rare events than the other baseline systems used for comparison in this paper. All these characteristics make the proposed system suitable for real-time applications on resource-limited devices.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In Source-2_Data_Augmentation:
Exercice1_augmentation.ipynb Jupyter Notebook for data warping of defect images.
Exercice2_augmentation_multimodale.ipynb Jupyter Notebook for multimodal data augmentation (defect images and mechanical fields) via oversampling.
Exercice3_clustering.ipynb Data clustering using the k-medoids algorithm applied to the mechanical dissimilarity of the defects.
k_medoids.py is a Python implementation of a k-medoids algorithm.
In Data:
All_images.npy (NumPy file) contains the defect images.
All_Stresses.npy (NumPy file) contains the mechanical fields: All_Stresses[k,i,j,ic,it] is instance number k of component ic of the Cauchy stress tensor at time it. The mechanical problem is described in 〈10.5802/crmeca.51〉. 〈hal-03113503〉.
New_images_1.npy and New_Stresses_1.npy are augmented data for k=1.
New_images_87.npy and New_Stresses_87.npy are augmented data for k=87.
Dissimilarity_Stress.npy is the Frobenius norm of the distances between stress tensors (All_Stresses.npy).
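The clustering step described in this file list can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the contents of k_medoids.py or Exercice3_clustering.ipynb: each instance is reduced to a small 2-D toy field, the dissimilarity is taken as the Frobenius norm of the difference between two fields, and the clustering uses the simple alternating (assign, then update) k-medoids scheme rather than PAM swaps. All function names and toy values are hypothetical.

```python
import math
import random

def frobenius_distance(a, b):
    """Frobenius norm of the difference between two equally shaped 2-D arrays."""
    return math.sqrt(
        sum((x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))
    )

def dissimilarity_matrix(fields):
    """Symmetric matrix of pairwise Frobenius distances."""
    n = len(fields)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = frobenius_distance(fields[i], fields[j])
    return d

def assign(d, medoids):
    """Label each point with the index of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda c: d[i][medoids[c]])
        for i in range(len(d))
    ]

def k_medoids(d, k, max_iter=100, seed=0):
    """Alternating k-medoids on a precomputed dissimilarity matrix."""
    rng = random.Random(seed)
    medoids = sorted(rng.sample(range(len(d)), k))
    for _ in range(max_iter):
        labels = assign(d, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # keep the old medoid if a cluster empties out
                new_medoids.append(medoids[c])
                continue
            # The new medoid minimizes total dissimilarity to its cluster.
            new_medoids.append(
                min(members, key=lambda m: sum(d[m][i] for i in members))
            )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(d, medoids)

# Four toy "stress fields" (illustrative values only).
toy_fields = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[1.0, 0.0], [0.0, 0.0]],
    [[10.0, 10.0], [10.0, 10.0]],
    [[10.0, 10.0], [10.0, 11.0]],
]
medoids, labels = k_medoids(dissimilarity_matrix(toy_fields), k=2)
print(medoids, labels)
```

Because the algorithm only needs the dissimilarity matrix, a precomputed matrix such as the one stored in Dissimilarity_Stress.npy could in principle be passed to `k_medoids` directly once converted to a list of lists.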
We propose two probability-like measures of individual cluster-membership certainty which can be applied to a hard partition of the sample such as that obtained from the Partitioning Around Medoids (PAM) algorithm, hierarchical clustering or k-means clustering. One measure extends the individual silhouette widths and the other is obtained directly from the pairwise dissimilarities in the sample. Unlike the classic silhouette, however, the measures behave like probabilities and can be used to investigate an individual's tendency to belong to a cluster. We also suggest two possible ways to evaluate the hard partition using these measures. We evaluate the performance of both measures in individuals with ambiguous cluster membership, using simulated binary datasets that have been partitioned by the PAM algorithm or continuous datasets that have been partitioned by hierarchical clustering and k-means clustering. For comparison, we also present results from soft-clustering algorithms such as fuzzy analysis clustering (FANNY) and two model-based clustering methods. Our proposed measures perform comparably to the posterior-probability estimators from either FANNY or the model-based clustering methods. We also illustrate the proposed measures by applying them to Fisher's classic data set on irises.
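As background for the first proposed measure, the classic individual silhouette width that it extends can be computed from a dissimilarity matrix and a hard partition as sketched below. This covers only the standard silhouette; the probability-like measures proposed in the abstract are not reproduced here. The function name and the toy matrix are hypothetical, and the sketch assumes at least two clusters.

```python
def silhouette_widths(d, labels):
    """Classic silhouette width s(i) = (b - a) / max(a, b) for each point.

    d      : square list-of-lists of pairwise dissimilarities
    labels : list of cluster labels (hard partition, at least two clusters)
    """
    n = len(d)
    cluster_ids = sorted(set(labels))
    widths = []
    for i in range(n):
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not own:
            widths.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean dissimilarity to the other members of the point's own cluster.
        a = sum(d[i][j] for j in own) / len(own)
        # b: smallest mean dissimilarity to any other cluster.
        b = min(
            sum(d[i][j] for j in range(n) if labels[j] == c) / labels.count(c)
            for c in cluster_ids
            if c != labels[i]
        )
        widths.append((b - a) / max(a, b))
    return widths

# Toy dissimilarities for four points on a line at 0, 1, 10 and 11.
toy_d = [
    [0.0, 1.0, 10.0, 11.0],
    [1.0, 0.0, 9.0, 10.0],
    [10.0, 9.0, 0.0, 1.0],
    [11.0, 10.0, 1.0, 0.0],
]
print(silhouette_widths(toy_d, [0, 0, 1, 1]))
```

Widths close to 1 indicate points that sit firmly inside their cluster, while values near 0 or below flag the ambiguous memberships that the abstract is concerned with.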
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The USNM_COL_CAM dataset includes 912 high-resolution JPEG images (3030 × 2080 pixels) of specimen labels from diverse Coleoptera families, including Buprestidae, Carabidae, Cerambycidae, Chrysomelidae, Curculionidae, and Scarabaeidae. All images were digitized by the Smithsonian National Museum of Natural History and are annotated with multi-label information. The dataset represents specimens collected across South and Central America over various historical periods and supports research in coleopterology, biodiversity informatics, and computer vision.
The dataset is augmented with two derived data resources:
• OCR_USNM_COL_CAM.json: Transcribed label content generated using the Google Cloud Vision API, enabling automatic text extraction, content indexing, and structured metadata retrieval.
• Clustering_0.9_USNM_COL_CAM.csv: K-Medoids clustering output based on a 0.9 textual similarity threshold, useful for identifying duplicate records, grouping related specimens, and supporting scalable label processing workflows.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Floodplains can have a significant impact on the routing of flood waves across the landscape, yet their representation in broad-scale water resource and flood prediction models is limited. To identify hydraulically-relevant floodplains at scale, we develop a workflow that automates the extraction of reach-averaged morphologic features from high resolution topographic data hypothesized to define a zone within the floodplain that conveys floodwaters distinctly from the surrounding landscape. This zone is identified from departures in hydraulic geometry with stage. Working in the topographically diverse Lake Champlain Basin in Vermont, USA, we apply the workflow to 2,629 reaches and use the extracted features to cluster settings similar in their proposed ability to route floodwaters. In total we identified eight clusters of reach types: two that were pre-sorted and largely lack a floodplain, and six that reflect variability in floodplain features, which were parsed out by the K-medoids clustering analysis. Clusters of floodplain types had distinct impacts on the routing of synthetically-derived hydrographs, evaluated using the Muskingum-Cunge routing model. From these clusters we propose a Hydraulic Floodplain Classification, which is comparable to other geographically-defined systems but unique in its focus on the potential of the landscape to influence flood routing. The automated workflow may be repeated in other regions with high resolution topographic datasets, offering an improvement in the functionality of continental to global floodplain mapping efforts. Identification of hydraulically-effective zones has implications for improved watershed management to meet flood resiliency goals, and to improve flood predictions and warnings.
Gene IDs belonging to six different clusters from time-course expression profiling based on K-medoids. GO enrichment analyses performed using GO Term annotations TriTrypDB-36_TbruceiLister427_GO.gaf from TriTrypDB version 36 and Fisher’s exact test. GO, Gene Ontology. (XLSX)
Gene IDs belonging to four different clusters from time-course expression profiling based on K-medoids. GO enrichment analyses performed using GO Term annotations TriTrypDB-36_TbruceiLister427_GO.gaf from TriTrypDB version 36 and Fisher’s exact test. GO, Gene Ontology. (XLSX)
Distributional data for eight taxonomic groups (asteroids, bryozoans, benthic foraminiferans, octocorals, polychaetes, matrix-forming scleractinian corals, sponges, and benthic fish) have been used to train an environmental classification for those parts of New Zealand's 200 n. mile Exclusive Economic Zone (EEZ) with depths of 3000 m or less. A variety of environmental variables were used as input to this process, including estimates of depth, temperature, salinity, sea surface temperature gradient, surface water productivity, suspended sediments, tidal currents, and seafloor sediments and slope. These variables were transformed using results averaged across eight Generalised Dissimilarity Modelling analyses that indicate relationships between species turnover and environment for each species group. The matrix of transformed variables was then classified using k-medoids clustering to identify an initial set of 300 groups of cells based on their environmental similarities, with relationships between these groups then described using agglomerative hierarchical clustering. Groups at a fifteen group level of classification appropriate for use at a whole-of-EEZ scale are described; the classification can also be used at other levels of detail, for example when higher levels of classification detail are required to discriminate variation within study areas of more limited extent. Although not formally tested in this analysis, we expect the analytical process used here to increase the biological discrimination of the environmental classification. That is, the resulting environmental groups are more likely to have similar biological characteristics than when the input environmental variables are selected, weighted, and perhaps transformed using qualitative methods. As a consequence, they are more likely to be reliable when used as "habitat classes" for the management of biological values than groups defined using alternative approaches.
Item Page Created: 2018-11-14 00:08
Item Page Last Modified: 2025-04-05 16:28
Owner: NIWA_OpenData
BOMEC 15 Class
No data edit dates available
Fields: ID,GRIDCODE
Source URL: https://www.rit.edu/cos/colorscience/re_AsanoObserverFunctions.php
Source DOI: 10.1371/journal.pone.0145671
Categorical observers
Categorical observers are observer functions that represent color-normal populations. They are finite and discrete, as opposed to observer functions generated from the individual colorimetric observer model. Thus, they offer more convenient and practical approaches for personalized color imaging workflows and color matching analyses. Categorical observers were derived in two steps. In the first step, 10,000 observer functions were generated from the individual colorimetric observer model using Monte Carlo simulation. In the second step, cluster analysis with a modified k-medoids algorithm was applied to the 10,000 observers, minimizing the squared Euclidean distance in cone fundamentals space, and categorical observers were derived iteratively. Since the proposed categorical observers are defined by their physiological parameters and ages, their CMFs can be derived for any target field size.
Categorical observers were ordered by importance: the first categorical observer was the average observer, equivalent to CIEPO06 for a 38-year-old at a given field size, followed by the second most important categorical observer, the third, and so on.
The color matching analyses showed that ten categorical observers are sufficient for general use and convenient for representing color-normal populations. On average, the improvement in prediction error was small after adding the tenth categorical observer, and the prediction errors fell to one-third with ten observers. Nevertheless, readers should be aware that the number of required categorical observers varies depending on the application (a pair of spectra viewed by observers). For example, the simulation revealed that as many as 50 categorical observers would be required to predict individual observers’ matches satisfactorily when a laser projector is viewed.
Matlab code for the categorical observers and CMFs as well as model parameters for ten categorical observers are available for download below.
151 color-normal observers
CMFs of 151 color-normal observers were estimated by combining the individual colorimetric observer model and the color matching experiment proposed in Asano’s PhD dissertation. The experiment consisted of five color matches designed to highlight and detect inter-observer variability among color-normals. To obtain a set of CMFs for a given human observer, the observer first performed the five color matches with three repetitions. Then, his/her eight physiological parameters (used in the individual colorimetric observer model) were estimated from the color matching results by non-linear optimization. The eight physiological parameters were optimized such that the color differences between the human observer's results and the model predictions were minimized. Finally, the CMFs were reconstructed from the estimated physiological parameters and the observer's real age.
The estimated CMFs for 151 color-normal human observers, the corresponding model parameters, and other information such as gender, experience in color-related subjective experiments, ethnic origin, color deficiency in family, diabetes, and intra-observer variability (Mean Color Difference from the Mean using CIEDE2000) for each of the 151 observers are available for download.