U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
MULTI-LABEL ASRS DATASET CLASSIFICATION USING SEMI-SUPERVISED SUBSPACE CLUSTERING
MOHAMMAD SALIM AHMED, LATIFUR KHAN, NIKUNJ OZA, AND MANDAVA RAJESWARI
Abstract. There has been a great deal of research targeting text classification. Much of it focuses on a particular characteristic of text data: multi-labelity. This arises from the fact that a document may be associated with multiple classes at the same time. The consequence of this characteristic is the low performance of traditional binary or multi-class classification techniques on multi-label text data. In this paper, we propose a text classification technique that takes this characteristic into account and provides very good performance. Our multi-label text classification approach is an extension of our previously formulated [3] multi-class text classification approach, SISC (Semi-supervised Impurity based Subspace Clustering). We call this new classification model SISC-ML (SISC Multi-Label). Empirical evaluation on the real-world multi-label NASA ASRS (Aviation Safety Reporting System) data set reveals that our approach outperforms state-of-the-art text classification as well as subspace clustering algorithms.
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
We performed CODEX (co-detection by indexing) multiplexed imaging on four sections of the human colon (ascending, transverse, descending, and sigmoid) using a panel of 47 oligonucleotide-barcoded antibodies. Images then underwent standard CODEX image processing (tile stitching, drift compensation, cycle concatenation, background subtraction, deconvolution, and determination of the best focal plane), followed by single-cell segmentation. The output of this process was a dataframe of nearly 130,000 cells with fluorescence values quantified for each marker. We used this dataframe as input to each of five normalization approaches, comparing z, double-log(z), min/max, and arcsinh normalizations against the original unmodified data. We used these normalized dataframes as inputs for four unsupervised clustering algorithms: k-means, Leiden, X-shift Euclidean, and X-shift angular.
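A minimal sketch of the four value-transforming normalizations compared here (z, double-log(z), min/max, and arcsinh), assuming a cells-by-markers array; the manuscript's exact parameterizations (e.g., the arcsinh cofactor and the double-log construction) are not given in this description, so the choices below are illustrative:

```python
import numpy as np

def normalize(X, method="z", cofactor=5.0):
    """Apply one per-marker normalization to a (n_cells, n_markers) array.

    The double-log(z) and arcsinh parameterizations are plausible
    stand-ins, not necessarily those used in the manuscript.
    """
    X = np.asarray(X, dtype=float)
    if method == "z":
        # Center and scale each marker independently.
        return (X - X.mean(axis=0)) / X.std(axis=0)
    if method == "double_log_z":
        # Compress the tails of the z-scored values twice.
        Z = (X - X.mean(axis=0)) / X.std(axis=0)
        return np.sign(Z) * np.log1p(np.log1p(np.abs(Z)))
    if method == "minmax":
        # Rescale each marker to [0, 1].
        lo, hi = X.min(axis=0), X.max(axis=0)
        return (X - lo) / (hi - lo)
    if method == "arcsinh":
        # Common cytometry transform; cofactor controls the linear region.
        return np.arcsinh(X / cofactor)
    raise ValueError(f"unknown method: {method}")
```

Each variant operates column-wise, so markers with very different dynamic ranges are placed on comparable scales before clustering.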
From the clustering outputs, we then labeled the resulting clusters, producing 20 unique cell-type labels. We also labeled cell types by hierarchically hand-gating the data within CellEngine (cellengine.com). We created another gold standard for comparison by overclustering unnormalized data with X-shift angular clustering. Finally, we created one last label as the majority cell-type call for each cell across all 21 cell-type labels in the dataset.
Consequently, each row of the dataset is an individual segmented cell. There are columns for the X, Y position in pixels within the overall montage image, and columns indicating which of the four regions the data came from. The remaining columns are the labels generated by all of the clustering and normalization techniques compared in the manuscript; these are also the data used for the neighborhood analysis in the last figure of the manuscript. They are provided at all four levels of cell-type granularity (from 7 cell types to 35 cell types).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This dataset has been created for implementing a content-based recommender system in the context of the Open Research Knowledge Graph (ORKG). The recommender system accepts a research paper's title and abstract as input and recommends existing ORKG predicates semantically relevant to the given paper.
The paper instances in the dataset are grouped by ORKG comparisons; the data.json file is therefore more comprehensive than training_set.json and test_set.json.
data.json
The main JSON object consists of a list of comparisons. Each comparison object has an ID, a label, a list of papers, and a list of predicates. Each paper object has an ID, label, DOI, research field, research problems, and abstract. Each predicate object has an ID and a label. See an example instance below.
{
  "comparisons": [
    {
      "id": "R108331",
      "label": "Analysis of approaches based on required elements in way of modeling",
      "papers": [
        {
          "id": "R108312",
          "label": "Rapid knowledge work visualization for organizations",
          "doi": "10.1108/13673270710762747",
          "research_field": { "id": "R134", "label": "Computer and Systems Architecture" },
          "research_problems": [
            { "id": "R108294", "label": "Enterprise engineering" }
          ],
          "abstract": "Purpose – The purpose of this contribution is to motivate a new, rapid approach to modeling knowledge work in organizational settings and to introduce a software tool that demonstrates the viability of the envisioned concept. Design/methodology/approach – Based on existing modeling structures, the KnowFlow toolset that aids knowledge analysts in rapidly conducting interviews and in conducting multi-perspective analysis of organizational knowledge work is introduced. Findings – This article demonstrates how rapid knowledge work visualization can be conducted largely without human modelers by developing an interview structure that allows for self-service interviews. Two application scenarios illustrate the pressing need for and the potentials of rapid knowledge work visualizations in organizational settings. Research limitations/implications – The efforts necessary for traditional modeling approaches in the area of knowledge management are often prohibitive. This contribution argues that future research needs ..."
        },
        ....
      ],
      "predicates": [
        { "id": "P37126", "label": "activities, behaviours, means [for knowledge development and/or for knowledge conveyance and transformation" },
        { "id": "P36081", "label": "approach name" },
        ....
      ]
    },
    ....
  ]
}
training_set.json and test_set.json
The main JSON object consists of a list of training/test instances. Each instance has an instance_id in the format comparison_id x paper_id (the two IDs concatenated with a lowercase "x") and a text field. The text is a concatenation of the paper's label (title) and abstract. See an example instance below.
Note that test instances are not duplicated and do not occur in the training set. Training instances are also not duplicated, but a training paper can appear multiple times in combination with different comparisons.
{
  "instances": [
    {
      "instance_id": "R108331xR108301",
      "comparison_id": "R108331",
      "paper_id": "R108301",
      "text": "A notation for Knowledge-Intensive Processes Business process modeling has become essential for managing organizational knowledge artifacts. However, this is not an easy task, especially when it comes to the so-called Knowledge-Intensive Processes (KIPs). A KIP comprises activities based on acquisition, sharing, storage, and (re)use of knowledge, as well as collaboration among participants, so that the amount of value added to the organization depends on process agents' knowledge. The previously developed Knowledge Intensive Process Ontology (KIPO) structures all the concepts (and relationships among them) to make a KIP explicit. Nevertheless, KIPO does not include a graphical notation, which is crucial for KIP stakeholders to reach a common understanding about it. This paper proposes the Knowledge Intensive Process Notation (KIPN), a notation for building knowledge-intensive processes graphical models."
    },
    ...
  ]
}
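A small sketch of how the training/test files might be consumed, based on the structure shown above; splitting instance_id on the lowercase "x" separator follows the example format (function and variable names are illustrative):

```python
import json

def load_instances(path):
    """Load training/test instances and recover (comparison_id, paper_id)
    from the 'comparison_idxpaper_id' instance_id format."""
    with open(path) as f:
        instances = json.load(f)["instances"]
    for inst in instances:
        comparison_id, paper_id = inst["instance_id"].split("x")
        # Sanity-check against the explicit fields when they are present.
        assert inst.get("comparison_id", comparison_id) == comparison_id
        assert inst.get("paper_id", paper_id) == paper_id
    return instances
```

The split is safe here because ORKG resource IDs in the examples (R..., P...) contain no lowercase "x" themselves.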
Dataset Statistics:

| | Papers | Predicates | Research Fields | Research Problems |
|---|---|---|---|---|
| Min/Comparison | 2 | 2 | 1 | 0 |
| Max/Comparison | 202 | 112 | 5 | 23 |
| Avg./Comparison | 21.54 | 12.79 | 1.20 | 1.09 |
| Total | 4060 | 1816 | 46 | 178 |
Dataset Splits:

| | Papers | Comparisons |
|---|---|---|
| Training Set | 2857 | 214 |
| Test Set | 1203 | 180 |
555 metamodels mined from GitHub in April 2017 and labeled manually. Domains: (1) bibliography, (2) conference management, (3) bug/issue tracker, (4) build systems, (5) document/office products, (6) requirement/use case, (7) database/sql, (8) state machines, (9) petri nets. Procedure for constructing the dataset: fully manual, by searching for certain keywords and regexes (e.g. "state" and "transition" for state machines) in the metamodels and inspecting the results for inclusion. File name format: ABSINDEX_CLUSTER_ITEMINDEX_name_hash.ecore
With the advent and expansion of social networking, the amount of generated text data has seen a sharp increase. In order to handle such a huge volume of text data, new and improved text mining techniques are a necessity. One of the characteristics of text data that makes text mining difficult is multi-labelity. In order to build a robust and effective text classification method, which is an integral part of text mining research, we must consider this property more closely. The property is not unique to text data, as it can be found in non-text (e.g., numeric) data as well; however, it is most prevalent in text data. It also puts the text classification problem in the domain of multi-label classification (MLC), where each instance is associated with a subset of class labels instead of a single class, as in conventional classification. In this paper, we explore how the generation of pseudo labels (i.e., combinations of existing class labels) can help us perform better text classification, and under what circumstances. The high and sparse dimensionality of text data has also been considered during classification. Although we propose and evaluate a text classification technique here, our main focus is on handling the multi-labelity of text data while utilizing the correlation among the multiple labels existing in the data set. Our text classification technique is called pseudo-LSC (pseudo-Label Based Subspace Clustering). It is a subspace clustering algorithm that considers the high and sparse dimensionality as well as the correlation among different class labels during the classification process to provide better performance than existing approaches. Results on three real-world multi-label data sets provide insight into how multi-labelity is handled in our classification process and show the effectiveness of our approach.
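The pseudo-label idea, treating combinations of existing class labels as new classes, resembles the label-powerset transformation; a minimal sketch of that transformation (not necessarily the authors' exact construction):

```python
def to_pseudo_labels(label_sets):
    """Map each instance's set of labels to a single pseudo-label:
    every distinct label combination observed in the data becomes
    one new class (label-powerset style).

    label_sets: iterable of sets/iterables of class labels.
    Returns (pseudo_label_ids, mapping from frozenset -> id).
    """
    mapping = {}
    pseudo = []
    for labels in label_sets:
        key = frozenset(labels)
        if key not in mapping:
            # First time this combination is seen: assign a new class id.
            mapping[key] = len(mapping)
        pseudo.append(mapping[key])
    return pseudo, mapping
```

After this step, any single-label classifier can be trained on the pseudo-labels, implicitly capturing the correlation between labels that co-occur.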
MIT License: https://opensource.org/licenses/MIT
6000 English and 6000 French user reviews from three applications on Google Play (Garmin Connect, Huawei Health, Samsung Health) were labelled manually. We employed three labels: problem report, feature request, and irrelevant.
As the following table of label counts shows, each review may belong to one or more categories.
App | Language | Total | Feature request | Problem report | Irrelevant |
---|---|---|---|---|---|
Garmin Connect | en | 2000 | 223 | 579 | 1231 |
Garmin Connect | fr | 2000 | 217 | 772 | 1051 |
Huawei Health | en | 2000 | 415 | 876 | 764 |
Huawei Health | fr | 2000 | 387 | 842 | 817 |
Samsung Health | en | 2000 | 528 | 500 | 990 |
Samsung Health | fr | 2000 | 496 | 492 | 1047 |
1200 bilingual labeled user reviews for clustering evaluation. From each of the three applications and for each of the two languages present in the classification dataset, we randomly selected 100 problem reports and 100 feature requests. Subsequently, we conducted manual clustering on each collection of 200 bilingual reviews, all of which pertained to the same category.
| | Garmin Connect | Huawei Health | Samsung Health |
|---|---|---|---|
| #clusters in feature request | 89 | 74 | 69 |
| #clusters (size ≥ 5) in feature request | 7 | 9 | 11 |
| #clusters in problem report | 45 | 44 | 41 |
| #clusters (size ≥ 5) in problem report | 10 | 13 | 12 |
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
These files contain the 1-minute resolution dataset ("labeled_sunside_data.csv") and the 15-minute-or-longer region list ("
We ask that, if you use any part of the dataset, you cite Toy-Edens et al., "Classifying 8 years of MMS Dayside Plasma Regions via Unsupervised Machine Learning" (DOI: 10.1029/2024JA032431).
This work was funded by grant 2225463 from the NSF GEM program.
The following tables detail the contents of the described files:
labeled_sunside_data.csv description

| Column Name | Description |
|---|---|
| Epoch | Epoch in datetime |
| probe | MMS probe name |
| ratio_max_width | Ratio of the width of the most prominent ion spectra peak (in number of energy channels) to the max number of energy channels. See paper for more information |
| ratio_high_low | Ratio of the mean of the log intensity of high energies in the ion spectra to the mean of the log intensity of low energies in the ion spectra. See paper for more information |
| norm_Btot | Magnitude of the total magnetic field normalized to 50 nT. See paper for more information |
| small_energy_mean | The denominator in ratio_high_low |
| large_energy_mean | The numerator in ratio_high_low |
| temp_total | Total temperature from the DIS moments. See paper for more information |
| r_gse_x | x position of the spacecraft in GSE |
| r_gse_y | y position of the spacecraft in GSE |
| r_gse_z | z position of the spacecraft in GSE |
| r_gsm_x | x position of the spacecraft in GSM |
| r_gsm_y | y position of the spacecraft in GSM |
| r_gsm_z | z position of the spacecraft in GSM |
| mlat | magnetic latitude of spacecraft |
| mlt | magnetic local time of spacecraft |
| raw_named_label | Raw cluster-assigned plasma region label (allowed values: magnetosheath, magnetosphere, solar wind, ion foreshock) |
| modified_named_label | Cleansed cluster-assigned plasma region label (use these unless you have a specific reason to use the raw labels). See paper for more information |
| transition_name | Transition names (e.g. quasi-perpendicular bow shock, magnetopause). See paper for more information |
Region list description

| Column Name | Description |
|---|---|
| start | Starting Epoch in datetime |
| stop | Stopping Epoch in datetime |
| probe | MMS probe name |
| region | Cleansed cluster name associated with the 1-minute resolution "modified_named_label" |
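As a usage sketch (assuming pandas and the column names listed above), one might load the 1-minute file and tabulate the cleansed region labels, which the description recommends over the raw ones:

```python
import pandas as pd

def load_regions(path="labeled_sunside_data.csv"):
    """Load the 1-minute dataset and count minutes per probe and
    cleansed plasma-region label (modified_named_label)."""
    df = pd.read_csv(path, parse_dates=["Epoch"])
    counts = df.groupby(["probe", "modified_named_label"]).size()
    return df, counts
```

Only a subset of columns is touched here, so the call works even if the file carries the full column set documented above.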
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The incidence of newly diagnosed childhood cancer (140 per 1,000,000 children under 15 years) and of nephroblastoma (7 per 1,000,000) was simulated. Clusters of defined size (1 to 50) were randomly assembled at the district level in Germany. Each cluster was simulated with different relative risk levels (1 to 100), with 2000 iterations per combination. The simulated data were then analyzed with three local clustering tests: the Besag-Newell method, the spatial scan statistic, and the Bayesian Besag-York-Mollié approach with Integrated Nested Laplace Approximation. The operating characteristics of all three methods were systematically documented (sensitivity, specificity, positive/negative predictive values, exact and minimum power, correct classification, positive/negative diagnostic likelihood, and false positive/negative rates).
The performance of each of the various cluster detection methods and scenarios in this study is reported according to the quality criteria detailed below.
- Minimum Power (MP): proportion of simulations detecting at least one district of the true cluster.
- Exact Power (EP): proportion of simulations detecting the true cluster without false positives.
- Sensitivity (sens): proportion of correctly detected districts in the true cluster.
- Specificity (spec): percentage of normal-risk districts correctly classified as normal-risk districts.
- Positive predictive value (PPV): proportion of districts in the detected cluster belonging to the true cluster.
- Negative predictive value (NPV): proportion of districts not labeled as a risk cluster that are not part of the true cluster.
- Correct classification (CC): percentage of correctly classified districts among all districts.
- Correct proportion (CP): correctly labeled districts among all detected potential high-risk districts.
- Positive diagnostic likelihood (PDL): the probability of high-risk districts being detected divided by the probability of non-high-risk districts being detected (sensitivity / (1 - specificity)).
- Negative diagnostic likelihood (NDL): the probability of high-risk districts not being detected divided by the probability of non-high-risk districts not being detected ((1 - sensitivity) / specificity).
- False positive rate (FPR): incorrectly labeled high-risk districts among all detected high-risk districts.
- False negative rate (FNR): incorrectly labeled normal-risk districts among all detected normal-risk districts.
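Most of these criteria reduce to simple functions of the district-level confusion counts; a small sketch (variable names are illustrative):

```python
def cluster_metrics(tp, fp, fn, tn):
    """Quality criteria from district-level confusion counts:
    tp = high-risk districts correctly flagged, fp = normal-risk
    districts flagged, fn = high-risk districts missed, tn =
    normal-risk districts correctly left unflagged."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "CC": (tp + tn) / (tp + fp + fn + tn),
        "PDL": sens / (1 - spec),            # sensitivity / (1 - specificity)
        "NDL": (1 - sens) / spec,            # (1 - sensitivity) / specificity
        "FPR": fp / (tp + fp),               # wrong flags among all flagged
        "FNR": fn / (fn + tn),               # missed clusters among unflagged
    }
```

Note that FPR and FNR here follow the definitions in the list above (complements of PPV and NPV), not the more common 1 - specificity / 1 - sensitivity conventions.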
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
MIBI-TOF data for lymph node dataset reported in Liu et al., Robust phenotyping of highly multiplexed tissue imaging data using pixel-level clustering
1. mibi_single_channel_tifs.zip: Single-channel MIBI-TOF images
Folders are labeled according to the field-of-view (FOV) number. Each folder contains single-channel TIFFs for each marker in the panel. Images are 1024x1024 pixels, covering 500 µm. See paper for details.
2. segmentation.zip: Segmentation output of MIBI-TOF images
Cell segmentation was performed using Mesmer (Greenwald NF, Nature Biotechnology 2021). The Mesmer output delineating the single cells in each image is included.
3. source_data.zip: Source data files for figures
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Multiplexed imaging is a recently developed and powerful single-cell biology research tool. However, it presents new sources of technical noise that are distinct from other types of single-cell data, necessitating new practices for single-cell multiplexed imaging processing and analysis, particularly regarding cell-type identification. Here we created single-cell multiplexed imaging datasets by performing CODEX on four sections of the human colon (ascending, transverse, descending, and sigmoid) using a panel of 47 oligonucleotide-barcoded antibodies. After cell segmentation, we implemented five different normalization techniques crossed with four unsupervised clustering algorithms, resulting in 20 unique cell-type annotations for the same dataset. We generated two standard annotations: hand-gated cell types and cell types produced by over-clustering with spatial verification. We then compared these annotations at four levels of cell-type granularity. First, increasing cell-type granularity led to decreased labeling accuracy; therefore, subtle phenotype annotations should be avoided at the clustering step. Second, accuracy in cell-type identification varied more with normalization choice than with clustering algorithm. Third, unsupervised clustering better accounted for segmentation noise during cell-type annotation than hand-gating. Fourth, Z-score normalization was generally effective in mitigating the effects of noise from single-cell multiplexed imaging. Variation in cell-type identification will lead to significant differential spatial results such as cellular neighborhood analysis; consequently, we also make recommendations for accurately assigning cell-type labels to CODEX multiplexed imaging.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The code for the paper "Semi-supervised non-negative matrix factorization with structure preserving for image clustering". The paper constructs a new label matrix with weights and a label-constraint regularizer to both utilize the label information and maintain the intrinsic structure of NMF. Based on the label-constraint regularizer, the basis images of labeled data are extracted for monitoring and modifying the basis-image learning of all data through a basis regularizer. By incorporating the label-constraint regularizer and the basis regularizer into NMF, a new semi-supervised NMF method is introduced. The proposed method is applied to image clustering, and experimental results demonstrate its effectiveness in comparison with state-of-the-art unsupervised and semi-supervised algorithms.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This repository supports manuscript “Modular-topology optimization of structures and mechanisms with free material design and clustering” by M. Tyburec, M. Doškář, J. Zeman, and M. Kružík, first published as preprint 2111.10439 at arXiv.org.
This repository contains:
- ./mFMO/
- ./MTO/
- ./data/

1. Data flow

The test suite considered in the manuscript covers 4 problems:
- mbb
- inv
- grip
- invgrip

Each problem in the dataset is stored within a separate subfolder named according to the labels mentioned above. The final level of subdirectories, {X}color, comprises the results for problems with X denoting the number of edge codes considered for each edge direction during the clustering (0color stands for a non-modular design and 1color represents the design based on a Periodic Unit Cell).
Each of the folders contains outputs of the modular free material optimisation in the following form:
- {label}{X}.mat
- {label}{X}.til
- {label}{X}.tset
- {label}{X}guess.mat

Files *.til, *.tset, and *guess.mat are then converted into a JSON input file for the modular topology optimization code with generator scripts, which can be found in the ./MTO/scripts folder. Note that each of the problems in the test suite has its own generator script generate_modular_problem_{MBB,inverter,gripper,inverterAndGripper}.mat. The generator scripts make a directory named according to the key MTO_{n}_kernelSensitivity, where n denotes the resolution of each module (i.e. the number of nodes along one direction). The directory also contains the outputs of the modular topology optimisation in the form of the initial and the final state of the optimization in VTK files and a visualisation of the final state in SVG files. The log file log.txt stores the optimized objective and the progress of its value, along with stopping-criteria quantities, during iterations.
2. Running codes
2.1 Modular free material optimisation
MATLAB scripts and functions for (modular) Free Material Optimization (FMO) are contained in the mFMO folder. The codes have been tested with MATLAB R2019b. To run the codes, the user is required to install the PENNON optimizer; a free academic license is provided by its authors on request.

Input files for individual problems are defined in the mFMO/problems folder and are launched with runproblem(problemName, numClusters), where problemName refers to the file in the mFMO/problems folder without the file extension and numClusters denotes the maximum number of color codes in the Wang tiling formalism.

If successful, the optimization produces output files in mFMO/fmo_fig/{label}/{X}colors/{T}/:
- {label}{X}.mat (contains clustering and tiling information)
- {label}{X}_tmp.mat (contains results of non-modular FMO)
- {label}{X}.til (the assembly plan)
- {label}{X}.tset (Wang tile set)
- {label}{X}guess.mat (guess for TO)

where T is the optimization time stamp.
2.2 Modular topology optimisation
All results were obtained with version v0.1.0, which is also provided in the folder MTO, linked against the Intel® oneAPI Math Kernel Library with the incorporated PARDISO sparse solver. For recent development of the code, see the open git repository at https://gitlab.com/MartinDoskar/modular-topology-optimization. The repository also contains a detailed description of input parameters and code design.
The modular topology optimisation code uses CMake for cross-platform build automation. For instance, under Linux, the whole code can be compiled in the standard five steps:
cd ./MTO
mkdir build
cd ./build
cmake -DCMAKE_BUILD_TYPE=Release ..
make
All executables are automatically stored in the ./MTO/bin/ folder. Individual problems can be optimized by passing the JSON files obtained from the generator scripts as an argument to the MTO.Application binary, e.g.,
./MTO/bin/MTO.Application.exe path_to_data/mbb/2color/MTO_100_kernelSensitivity/input_modular_mbb_2colours_100.json
Acknowledgement
The related research and code development was supported by the Czech Science Foundation, project No. 19-26143X.
This study investigates the biogeographic patterns of Pacific white-sided dolphins (Lagenorhynchus obliquidens) in the Eastern North Pacific based on long-term passive acoustic records (2005-2021). We aim to elucidate the ecological and behavioral significance of distinct echolocation click types and their implications for population delineation, geographic distribution, environmental adaptation, and management. Over 50 cumulative years of Passive Acoustic Monitoring (PAM) data from 14 locations were analyzed using a deep neural network to classify two distinct Pacific white-sided dolphin echolocation click types. The study assessed spatial, diel, seasonal, and interannual patterns of the two click types, correlating them with major environmental drivers such as the El Niño Southern Oscillation and the North Pacific Gyre Oscillation, and modeling long-term spatial-seasonal patterns. Distinct spatial, seasonal, and diel patterns were observed for each click type. Significant biogeographi...

Raw acoustic data were passed through a click detector, which returned all acoustic signals within the expected frequency range and duration of odontocete echolocation clicks. An unsupervised clustering algorithm grouped the detections into 5-minute bin-level averages. Cluster bins were then labeled as one of six categories by a trained neural network. Clusters labeled as either of the two Pacific white-sided dolphin click types were extracted and manually verified. Verified Pacific white-sided dolphin detections were then binned into "click-positive minutes per hour", where a click-positive minute is a minute containing any number of clicks. The time series of click-positive minutes per hour, for each click type, at multiple long-term recording locations, is included here.

Pacific white-sided dolphin hourly binned echolocation clicks
https://doi.org/10.5061/dryad.95x69p8rj
Each CSV file contains the hourly acoustic presence of Pacific white-sided dolphin echolocation clicks. The files are formatted such that the click type and location are stored in the file header. For instance, "SCB_LoA.csv" represents the hourly presence of the LoA click type at recording station SCB_M.
The recording effort has been included here as a CSV file titled "PWD_effort.csv", so a user can cross-reference the recording location, recording effort, and time series name. If a given click type was not detected at a recording site, then that time series was not included.
Each of the datasets contains two columns:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Background: Acute myeloid leukemia (AML) is a clinically heterogeneous group of cancers. While some patients respond well to chemotherapy, we describe here a subgroup with distinct molecular features that has a very poor prognosis under chemotherapy. The classification of AML relies substantially on cytogenetics, but most cytogenetic abnormalities do not offer targets for the development of targeted therapeutics. Therefore, it is important to create a detailed molecular characterization of the subgroup most in need of new targeted therapeutics.

Methods: We used a multi-omics approach to identify the molecular subgroup with the worst response to chemotherapy and to identify promising drug targets specifically for this AML subgroup.

Results: Multi-omics clustering analysis resulted in three primary clusters among 166 AML adult cancer cases in TCGA data. One of these clusters, which we label the high-risk molecular subgroup (HRMS), consisted of cases that responded very poorly to standard chemotherapy, with only about 10% survival to 2 years. The gene TP53 was mutated in most, but not all, cases in this subgroup. The top six genes over-expressed in the HRMS subgroup were E2F4, CD34, CD109, MN1, MMLT3, and CD200. Multi-omics pathway analysis using RNA and CNA expression data identified over-activated pathways in the HRMS subgroup related to immune function, cell proliferation, and DNA damage.

Conclusion: A distinct subgroup of AML patients is not successfully treated with chemotherapy and urgently needs targeted therapeutics based on the molecular features of this subgroup. Potential drug targets include the over-expressed genes E2F4 and MN1, as well as mutations in TP53 and several over-activated molecular pathways.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Pre-processed and normalized Raman spectral data for three mouse placental tissue scans, and constructed image data at three different wavenumbers.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
We employed a data-driven, meta-analytic clustering approach to an extensive body of reward-processing neuroimaging results archived in the BrainMap database (www.brainmap.org) to characterize meta-analytic groupings (MAGs) of reward-processing experiments based on the spatial similarity of brain activation patterns. Using k-means clustering, we dissociated five MAGs of neuroimaging results (i.e., brain activation maps) from 749 experimental contrasts across 177 reward-processing studies involving 13,345 healthy participants. The five-MAG solution represented dissociable patterns of activation consistently occurring across reward-processing tasks (MAG-1: ventral-striatal; MAG-2: dorsal-striatal; MAG-3: limbic-parietal; MAG-4: frontal-parietal; MAG-5: medial frontal-posterior cingulate). The optimal clustering solution was selected by majority rule over four information-theoretic metrics, and convergent brain activity across each grouping of neuroimaging experiments was subsequently quantified via separate meta-analyses.
To compile a large corpus of neuroimaging results across reward processing paradigms, we extracted activation coordinates reported in published studies that were archived in the BrainMap Database as of April 22, 2016, under the meta-data labels Reward, Delay Discounting, and Gambling (www.brainmap.org) (Fox et al., 2005; Fox & Lancaster, 2002; Laird et al., 2011). The vast majority (94.9%) of identified studies were archived under the Reward label with most Delay Discounting and Gambling studies being additionally archived under Reward. The Reward label denotes that the reported activation coordinates were identified in a task where a stimulus served to reinforce a desired response (e.g., monetary reward after a correct response) (www.brainmap.org/taxonomy/paradigms). Almost all studies included in the corpus were also archived under a variety of other meta-data labels (e.g. Task Switching (6.4%), Go/No-Go (2.9%), Visuospatial Attention (2.9%), Reasoning/Problem Solving (1.3%), Wisconsin Card Sorting Test (2.6%)) which is unsurprising as reward processing is a multifaceted construct, connecting elements of sensation, perception, cognitive control, and other mental operations.
We considered only activation coordinates from published neuroimaging studies, among healthy participants, that were reported in standard Talairach (Talairach & Tournoux, 1988) or Montreal Neurological Institute (MNI) (Collins, 1994) space and derived from whole-brain statistical comparisons. Brain coordinates derived through behavioral correlations or a priori region of interest (ROI) analyses were excluded. As this meta-analysis aimed to investigate brain activation linked with typical reward processing, coordinates from groups of individuals with psychological or neuropsychiatric disorders (e.g., addictive disorders) were excluded from the corpus. Each included study provided at least one experimental contrast that statistically identified brain activity associated with a certain task-event defined by the original authors (e.g., a brain activity map). These experimental contrasts were summarized and curated in the BrainMap database as a set of brain activity foci linked either with phases of the original task (i.e., task response, anticipation of outcome, outcome delivery) or stimuli presented in the task (i.e., positive outcome, negative outcome, high reward, low reward). Foci from experimental contrasts can also reflect locations of brain activity linked with more abstract and computationally derived constructs of interest in the original study (e.g., learning rate, subjective value).
Homo sapiens
fMRI-BOLD
meta-analysis
None / Other
Z
This dataset contains a subset of LADC passive acoustic system EARS buoys data which was collected in 2015 (data inventory can be found in R4.x261.233:0005) and is used to identify three different species of beaked whales in the Gulf of Mexico. The species of beaked whales examined in this dataset are Cuvier’s beaked whale, Gervais’ beaked whale, and an unidentified species that we labeled "BWG", which stands for Beaked Whale of the Gulf. Recordings were processed using a click detection algorithm. Then unsupervised as well as supervised classification algorithms were evaluated for distinguishing species by echolocation features.
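The supervised side of such a comparison can be sketched as a nearest-centroid classifier over per-click features. The feature names (peak frequency in kHz, click duration in µs), values, and species assignments below are illustrative only, not drawn from the LADC recordings:

```python
import math

# Illustrative training clicks as (peak_freq_kHz, duration_us) pairs with species
# labels; all numbers are hypothetical, not measured from the EARS buoy data.
train = [
    ((38.0, 175.0), "Cuvier"),
    ((40.0, 170.0), "Cuvier"),
    ((44.0, 120.0), "Gervais"),
    ((46.0, 115.0), "Gervais"),
    ((50.0, 90.0), "BWG"),
    ((52.0, 95.0), "BWG"),
]

def centroid(points):
    """Component-wise mean of a list of equal-length feature tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

# Group training clicks by species and compute one centroid per species.
by_species = {}
for feats, label in train:
    by_species.setdefault(label, []).append(feats)
centroids = {label: centroid(pts) for label, pts in by_species.items()}

def classify(click):
    """Assign a click to the species with the nearest feature centroid."""
    return min(centroids, key=lambda s: math.dist(click, centroids[s]))

print(classify((39.0, 172.0)))  # -> Cuvier
```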
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Polyploidization plays a critical role in producing new gene functions and promoting species evolution. Effective identification of polyploid types can help in exploring evolutionary mechanisms. However, current methods for detecting polyploid types have major limitations: they are time-consuming and highly subjective. To recognize collinear fragments and polyploid types objectively, we developed PolyReco, a method that automatically labels collinear regions and recognizes polyploidy events based on the KS dotplot. Combined with whole-genome collinearity analysis, PolyReco uses the DBSCAN clustering method to cluster KS dots. Using the distances between clusters along the x-axis and y-axis, the clusters are merged according to a set of rules to obtain the collinear regions, which are then automatically recognized and labeled as collinear fragments. Based on the positions of the labeled collinear regions on the y-axis, the polyploidy recognition algorithm exhaustively enumerates combinations, computes a genetic collinearity evaluation index for each combination, and plots the resulting index curve. From the inflection point of this curve, polyploid types and the chromosomes carrying a polyploidy signal can be detected. Validation experiments showed that PolyReco's conclusions were consistent with previous studies, verifying the effectiveness of the method. We expect this approach to serve as a reference architecture for other polyploid-type classification methods.
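The DBSCAN step can be illustrated on a toy KS dotplot. PolyReco's actual implementation is not reproduced here; the following is a minimal pure-Python DBSCAN sketch (assumed parameters, toy coordinates) that separates two dense diagonal runs — candidate collinear blocks — from a stray noise dot:

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one cluster label per point (-1 = noise).
    A point is a core point if it has at least min_pts neighbours
    (itself included) within radius eps."""
    labels = [None] * len(points)
    cluster = -1

    def neighbors(i):
        return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)  # expand only from core points
    return labels

# Toy dotplot: two dense diagonal runs (collinear blocks) plus one outlier.
dots = [(i, 1.1 * i) for i in range(10)] + \
       [(i, 30 + i) for i in range(20, 30)] + [(100, 5)]
labels = dbscan(dots, eps=2.5, min_pts=3)
print(sorted(set(labels)))  # -> [-1, 0, 1]: two blocks found, outlier flagged as noise
```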
https://www.archivemarketresearch.com/privacy-policy
The global market for Cluster Analysis Software is experiencing robust growth, driven by the increasing adoption of big data analytics and the need for advanced data interpretation across diverse sectors. While precise market sizing data is unavailable, considering the growth observed in related fields like data analytics and AI, a reasonable estimate for the 2025 market size could be placed between $2.5 billion and $3 billion. This estimate assumes a moderate growth trajectory reflecting the maturation of the cluster analysis market and the ongoing integration of these tools into broader business intelligence platforms. Assuming a Compound Annual Growth Rate (CAGR) of 15% for the forecast period (2025-2033), the market is projected to reach a substantial size within the next decade. This growth is fueled by several key drivers, including the expanding availability of large datasets, the growing demand for data-driven decision-making across industries like BFSI (Banking, Financial Services, and Insurance), government, and commercial sectors, and the continuous development of more sophisticated algorithms and user-friendly interfaces for cluster analysis software. The cloud-based segment is expected to dominate, given its scalability and accessibility benefits, although web-based applications will continue to hold a significant market share. Geographic growth will be diverse, with North America and Europe maintaining strong positions due to advanced analytics adoption, but significant expansion is also expected in the Asia-Pacific region as technological advancement and data infrastructure improve. However, challenges like data privacy concerns, the need for skilled professionals, and the high cost of advanced software solutions could act as market restraints in certain regions. 
The competitive landscape is marked by a mix of established players such as IBM, Microsoft, and TIBCO Software, along with a growing number of specialized vendors and emerging technology companies. The market is characterized by ongoing innovation in areas like algorithm development, enhanced visualization capabilities, and the integration of cluster analysis with other advanced analytics tools. This continuous innovation will be a key driver in sustaining the market's high CAGR and ensuring its continued growth in the coming years. Increased focus on providing tailored solutions for specific industry verticals will likely be a strategic advantage for vendors seeking a competitive edge. The market's future hinges on its ability to effectively address the challenges of data complexity, security, and user-friendliness while continuing to deliver accurate and actionable insights.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A collection of 7 datasets with each set containing 3D shapes with varying topological complexity. The datasets can be used to compare different metrics of geometric dissimilarity. Two of the datasets have topologically complex shapes that resemble designs obtained from topology optimization, a widely used design optimization method for engineering structures.
We used this dataset for a related journal article with the following abstract: "In the early stages of engineering design, multitudes of feasible designs can be generated using structural optimization methods by varying the design requirements or user preferences for different performance objectives. Data mining such potentially large datasets is a challenging task. An unsupervised data-centric approach for exploring designs is to find clusters of similar designs and recommend only the cluster representatives for review. Design similarity can be defined not only on a purely functional level but also based on geometric properties, such as size, shape, and topology. While metrics such as chamfer distance measure the geometrical differences intuitively, it is more useful for design exploration to use metrics based on geometric features, which are extracted from high-dimensional 3D geometric data using dimensionality reduction techniques. If the Euclidean distance in the geometric features is meaningful, the features can be combined with performance attributes resulting in an aggregate feature vector that can potentially be useful in design exploration based on both geometry and performance. We propose a novel approach to evaluate such derived metrics by measuring their similarity with the metrics commonly used in 3D object classification. Furthermore, we measure clustering accuracy, which is a state-of-the-art unsupervised approach to evaluate metrics. For this purpose, we use a labeled, synthetic dataset with topologically complex designs. From our results, we conclude that Pointcloud Autoencoder is promising in encoding geometric features and developing a comprehensive design exploration method."
For each dataset, shapes/designs are saved as surface mesh files (extension: stl) and point cloud files (extension: ply) in the folders "stls" and "plys" respectively. A brief description of the 7 different datasets is in the following table. For each dataset, the designs are named using numbers starting from 0, e.g., “0.stl, 1.stl, …, 19.stl” in the folder for the surface mesh files. Some of the datasets are labeled, i.e., each design belongs to a class. In a labeled dataset, all classes have the same number of designs, and the designs are named in the order of their class. For example, a labeled dataset with 4 designs and 2 classes contains files whose names start with {0, 1, 2, 3} where the designs {0, 1} belong to class 1, and {2, 3} belong to class 2.
| Dataset name | Directory name | Number of designs | Number of classes |
|---|---|---|---|
| Beam-rotation | "rotate_beam" | 20 | None |
| Beam-elongation | "elongate_beam" | 20 | None |
| Beam-translation | "move_beam" | 20 | None |
| Three cube trusses | "three_cube_truss" | 150 | 6 |
| Single cube trusses | "single_cube_truss" | 275 | 11 |
| Random topologies | "three_cube_truss_random" | 1000 | 50 |
| Topologically optimized designs | "cube_opt_shapes" | 1500 | None |
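One of the dissimilarity metrics discussed in the abstract, the chamfer distance, can be sketched directly on small point clouds. The toy square clouds below are illustrative, not taken from these datasets, and the symmetric mean-of-nearest-neighbours formulation is one common variant:

```python
import math

def chamfer_distance(a, b):
    """Symmetric chamfer distance between two point clouds: the mean
    nearest-neighbour distance from a to b plus that from b to a."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(a, b) + one_way(b, a)

# Toy clouds: a unit square's corners vs the same square shifted by 1 in x.
square = [(0, 0), (1, 0), (0, 1), (1, 1)]
shifted = [(x + 1, y) for x, y in square]

print(chamfer_distance(square, square))   # -> 0.0
print(chamfer_distance(square, shifted))  # -> 1.0 (0.5 in each direction)
```

Note that two of the shifted corners coincide with corners of the original square, which is why each one-way term averages to 0.5 rather than 1.0; this illustrates how the chamfer distance rewards partial overlap.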