License: Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
This dataset collection comprises 15 diverse two-dimensional datasets specifically designed for clustering analysis. Each dataset contains three columns: x, y, and target, where x and y represent the coordinates of the data points, and target indicates the cluster label.
Visualisation of the data: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F20292402%2F3cc81328beabc815fe500973fee1f7ac%2Fdescription.png?generation=1737484616903723&alt=media
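A minimal usage sketch, assuming one of the 15 CSVs is saved locally (the file name below is a placeholder): load a dataset, run k-means with the true number of clusters, and score the result against the target column.

```python
# Minimal sketch: load one of the 2D datasets and compare k-means labels
# with the provided ground truth. "blobs.csv" is a placeholder name;
# substitute any of the 15 CSVs in the collection.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

df = pd.read_csv("blobs.csv")          # columns: x, y, target
X = df[["x", "y"]].to_numpy()

k = df["target"].nunique()             # use the true number of clusters
pred = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

print("ARI:", adjusted_rand_score(df["target"], pred))
```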
License: Open Data Commons Database Contents License (DbCL) v1.0 (http://opendatacommons.org/licenses/dbcl/1.0/)
Clustering benchmark datasets published by the School of Computing, University of Eastern Finland.
2D scatter points plus labels; the raw files need some format processing before use.
Find more at https://cs.joensuu.fi/sipu/datasets/
@misc{ClusteringDatasets,
  author = {Pasi Fr{\"a}nti et al.},
  title  = {Clustering datasets},
  year   = {2015},
  url    = {http://cs.uef.fi/sipu/datasets/}
}
With these standard, well-known benchmarks, various clustering algorithms can be run and compared through a number of kernels.
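As a starting point, a hedged loading sketch; the file names below are placeholders, and the exact layout varies per dataset, so inspect the raw files first.

```python
# Minimal sketch, assuming a SIPU-style layout: a whitespace-separated
# file of 2D points plus a companion ground-truth partition file with one
# integer label per point. File names are placeholders; some partition
# (.pa) files carry a short text header, so pass skiprows= if needed.
import numpy as np

points = np.loadtxt("data.txt")                 # shape (n, 2): x, y
labels = np.loadtxt("data-gt.pa", dtype=int)    # one label per point

assert len(points) == len(labels)
print(points.shape, np.unique(labels))
```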
MULTI-LABEL ASRS DATASET CLASSIFICATION USING SEMI-SUPERVISED SUBSPACE CLUSTERING
MOHAMMAD SALIM AHMED, LATIFUR KHAN, NIKUNJ OZA, AND MANDAVA RAJESWARI
Abstract. There has been a lot of research targeting text classification. Much of it focuses on a particular characteristic of text data: multi-labelity. This arises due to the fact that a document may be associated with multiple classes at the same time. The consequence of this characteristic is the low performance of traditional binary or multi-class classification techniques on multi-label text data. In this paper, we propose a text classification technique that considers this characteristic and provides very good performance. Our multi-label text classification approach is an extension of our previously formulated [3] multi-class text classification approach called SISC (Semi-supervised Impurity based Subspace Clustering). We call this new classification model SISC-ML (SISC Multi-Label). Empirical evaluation on the real-world multi-label NASA ASRS (Aviation Safety Reporting System) data set reveals that our approach outperforms state-of-the-art text classification as well as subspace clustering algorithms.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
IFCB Plankton Labeled (Cluster-Sorted)
This dataset contains labeled images of phytoplankton collected with the Planktivore Imaging System. Images were preprocessed with zero-padding and resized to the standard input size used by ViT_b_16. The dataset was originally constructed by clustering unlabeled ROI images using deep features from a ViT model. Clusters were then saved locally and manually curated into taxonomic labels and higher-order groups.
Dataset Summary… See the full description on the dataset page: https://huggingface.co/datasets/patcdaniel/synchro-April2025-cluster-labeled-highMag.
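A hedged sketch of the described preprocessing (zero-padding to a square, then resizing to the 224x224 input that torchvision's vit_b_16 expects); the authors' exact parameters may differ.

```python
# Hedged sketch of the described preprocessing: zero-pad each ROI image
# to a square, then resize to 224x224 (the standard vit_b_16 input size).
# The dataset authors' exact padding/resize settings are not given here.
from PIL import Image
from torchvision import transforms

def pad_to_square(img: Image.Image) -> Image.Image:
    w, h = img.size
    side = max(w, h)
    canvas = Image.new(img.mode, (side, side), 0)   # zero (black) padding
    canvas.paste(img, ((side - w) // 2, (side - h) // 2))
    return canvas

preprocess = transforms.Compose([
    transforms.Lambda(pad_to_square),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
```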
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset was collected from PriceRunner, a popular product comparison platform. It includes 35,311 product offers from 10 categories, provided by 306 different merchants. The dataset provides an ideal testbed for evaluating classification, clustering, and entity matching algorithms. Although it contains product-related data, it can be applied to any problem involving text/short-text mining.
| Variable Name | Role | Type | Description | Units | Missing Values |
|---|---|---|---|---|---|
| Product ID | Feature | Integer | Unique identifier for each product | | No |
| Product Title | Feature | Categorical | Title/name of the product | | No |
| Merchant ID | Feature | Integer | Unique identifier for each merchant | | No |
| Cluster ID | Feature | Integer | Identifier for product clusters | | No |
| Cluster Label | Feature | Categorical | Label for product clusters | | No |
| Category ID | Feature | Integer | Unique identifier for each category | | No |
| Category Label | Feature | Categorical | Label for product category | | No |
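A minimal loading sketch; the file name and exact column spellings are assumptions, so adjust them to the downloaded CSV.

```python
# Minimal sketch: load the PriceRunner offers and inspect cluster sizes,
# e.g. as a starting point for entity-matching evaluation. File name and
# column spellings are assumptions about the distributed CSV.
import pandas as pd

offers = pd.read_csv("pricerunner_aggregate.csv")
sizes = offers.groupby("Cluster ID")["Product ID"].count()
print(sizes.describe())                         # offers per product cluster
print(offers["Category Label"].value_counts())  # 10 categories expected
```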
With the advent and expansion of social networking, the amount of generated text data has seen a sharp increase. In order to handle such a huge volume of text data, new and improved text mining techniques are a necessity. One of the characteristics of text data that makes text mining difficult is multi-labelity. In order to build a robust and effective text classification method, which is an integral part of text mining research, we must consider this property more closely. This property is not unique to text data, as it can be found in non-text (e.g., numeric) data as well; however, it is most prevalent in text data. It also puts the text classification problem in the domain of multi-label classification (MLC), where each instance is associated with a subset of class labels instead of a single class, as in conventional classification. In this paper, we explore how the generation of pseudo labels (i.e., combinations of existing class labels) can help us perform better text classification, and under what circumstances. During classification, the high and sparse dimensionality of text data has also been considered. Although we propose and evaluate a text classification technique here, our main focus is on handling the multi-labelity of text data while utilizing the correlation among the multiple labels existing in the data set. Our text classification technique is called pseudo-LSC (pseudo-Label Based Subspace Clustering). It is a subspace clustering algorithm that considers the high and sparse dimensionality as well as the correlation among different class labels during the classification process to provide better performance than existing approaches. Results on three real-world multi-label data sets provide insight into how multi-labelity is handled in our classification process and show the effectiveness of our approach.
License: CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)
We performed CODEX (co-detection by indexing) multiplexed imaging on four sections of the human colon (ascending, transverse, descending, and sigmoid) using a panel of 47 oligonucleotide-barcoded antibodies. Subsequently, images underwent standard CODEX image processing (tile stitching, drift compensation, cycle concatenation, background subtraction, deconvolution, and determination of the best focal plane) and single-cell segmentation. The output of this process was a dataframe of nearly 130,000 cells with fluorescence values quantified for each marker. We used this dataframe as input to five normalization conditions, comparing z, double-log(z), min/max, and arcsinh normalizations to the original unmodified dataset. We used these normalized dataframes as inputs for four unsupervised clustering algorithms: k-means, Leiden, X-shift Euclidean, and X-shift angular.
From the clustering outputs, we then labeled the resulting clusters of cells observed in the data, producing 20 unique cell type labels. We also labeled cell types by hierarchically hand-gating the data within CellEngine (cellengine.com). We also created another gold standard for comparison by overclustering unnormalized data with X-shift angular clustering. Finally, we created one last label as the major cell type call for each cell from all 21 cell type labels in the dataset.
Consequently, the dataset has individual cells segmented out in each row. There are columns for the X, Y position in pixels in the overall montage image of the dataset, and columns indicating which region the data came from (4 total). The rest are labels generated by all the clustering and normalization techniques used in the manuscript, which were compared to each other. These were also the data used for the neighborhood analysis in the last figure of the manuscript. Labels are provided at all four levels of cell type granularity (from 7 cell types to 35 cell types).
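For illustration, hedged implementations of the four normalizations named above, applied per marker column; the exact formulas (the double-log(z) composition, the arcsinh cofactor) are assumptions rather than the authors' code.

```python
# Hedged sketch of the four normalizations named above, applied per marker
# column; treat these as illustrative, not the study's implementation.
import numpy as np

def z_norm(x):
    return (x - x.mean()) / x.std()

def double_log_z(x):
    # assumed composition: two log1p transforms of shifted values, then z-scored
    return z_norm(np.log1p(np.log1p(x - x.min())))

def min_max(x):
    return (x - x.min()) / (x.max() - x.min())

def arcsinh_norm(x, cofactor=150.0):   # cofactor choice is an assumption
    return np.arcsinh(x / cofactor)

# e.g., normalized = df[marker_cols].apply(arcsinh_norm)
```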
License: Unknown (https://choosealicense.com/licenses/unknown/)
StackExchangeClustering.v2: an MTEB (Massive Text Embedding Benchmark) dataset
Clustering of titles from 121 Stack Exchange sites: 25 sets, each with 10-50 classes, and each class with 100-1000 sentences.
Task category: t2c
Domains: Web, Written. Reference: https://arxiv.org/abs/2104.07081
How to evaluate on this task
You can evaluate an embedding model on this dataset using the following code: import mteb
task = … See the full description on the dataset page: https://huggingface.co/datasets/mteb/stackexchange-clustering.
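A hedged completion of the truncated snippet, following MTEB's documented usage pattern (API names can differ across mteb versions); check the dataset page for the authoritative version. The model name is a placeholder.

```python
# Hedged completion of the truncated snippet above, following MTEB's
# documented usage pattern; the model name is a placeholder.
import mteb

tasks = mteb.get_tasks(tasks=["StackExchangeClustering.v2"])
model = mteb.get_model("sentence-transformers/all-MiniLM-L6-v2")
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results")
```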
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Topic labelled online social network (OSN) data sets are useful to evaluate topic modelling and document clustering tasks. We provide three data sets with topic labels from two online social networks: Twitter and Reddit. To comply with Twitter's terms and conditions, we only publish the tweet identifiers along with the topic label. The Reddit data is supplied with the full text and the topic label. The first Twitter data set was collected from the Twitter API by filtering for the hashtag #Auspol, used to tag political discussion tweets in Australia. The second Twitter data set was originally used in the RepLab 2013 competition and contains expert annotated topics. The Reddit data set consists of 40,000 Reddit parent comments from May 2015 belonging to 5 subreddit pages, which are used as topic labels.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Background: A hierarchy, characterized by tree-like relationships, is a natural method of organizing data in various domains. When considering an unsupervised machine learning routine, such as clustering, a bottom-up hierarchical (BU, agglomerative) algorithm is used as a default and is often the only method applied.
Methodology/Principal Findings: We show that hierarchical clustering approaches that involve global considerations, such as top-down (TD, divisive) or glocal (global-local) algorithms, are better suited to reveal meaningful patterns in the data. This is demonstrated by testing the correspondence between the results of several algorithms (TD, glocal, and BU) and the correct annotations provided by experts. The correspondence was tested in multiple domains, including gene expression experiments, stock trade records, and functional protein families. The performance of each algorithm is evaluated by statistical criteria that are assigned to clusters (nodes of the hierarchy tree) based on expert-labeled data. Whereas TD algorithms perform better on global patterns, BU algorithms perform well and are advantageous when finer granularity of the data is sought. In addition, a novel TD algorithm based on the genuine density of the data points is presented and shown to outperform other divisive and agglomerative methods. Application of the algorithm to more than 500 protein sequences belonging to ion channels illustrates the potential of the method for inferring overlooked functional annotations. ClustTree, a graphical Matlab toolbox for applying various hierarchical clustering algorithms and testing their quality, is made available.
Conclusions: Although currently rarely used, global approaches, in particular TD or glocal algorithms, should be considered in the exploratory process of clustering. In general, applying unsupervised clustering methods can leverage the quality of manually created mappings of protein families. As demonstrated, it can also provide insights into erroneous and missed annotations.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Federated clustering is a distributed clustering algorithm that does not require the transmission of raw data and is widely used. However, it struggles to handle Non-IID data effectively, because it is difficult to obtain accurate global consistency measures under Non-Independent and Identically Distributed (Non-IID) conditions. To address this issue, we propose a federated k-means clustering algorithm based on a cluster backbone, called FKmeansCB. First, we add Laplace noise to all the local data and run k-means clustering on the client side to obtain cluster centers, which faithfully represent the cluster backbone (i.e., the data structures of the clusters). The cluster backbone represents the client's features and can approximately capture the features of data points with different labels in Non-IID situations. We then upload these cluster centers to the server. Subsequently, the server aggregates all cluster centers and runs the k-means clustering algorithm to obtain global cluster centers, which are then sent back to the clients. Finally, each client assigns all of its data points to the nearest global cluster center to produce the final clustering results. We have validated the performance of the proposed algorithm on six datasets, including the large-scale MNIST dataset. Compared with the leading non-federated and federated clustering algorithms, FKmeansCB offers significant advantages in both clustering accuracy and running time.
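The steps above translate directly into code. An illustrative sketch, not the paper's implementation; the k values and the noise scale are assumptions.

```python
# Illustrative sketch of the FKmeansCB steps described above: Laplace
# noise on local data, client-side k-means for cluster-backbone centers,
# server-side k-means over all uploaded centers, then final assignment.
# k values and noise scale are assumptions, not the paper's settings.
import numpy as np
from sklearn.cluster import KMeans

def fkmeans_cb(client_datasets, k_local=10, k_global=3, noise_scale=0.1, seed=0):
    rng = np.random.default_rng(seed)
    # Clients: perturb local data and extract cluster-backbone centers.
    uploaded = []
    for X in client_datasets:
        X_noisy = X + rng.laplace(0.0, noise_scale, size=X.shape)
        km = KMeans(n_clusters=k_local, n_init=10, random_state=seed).fit(X_noisy)
        uploaded.append(km.cluster_centers_)
    # Server: aggregate all centers and cluster them into global centers.
    server_km = KMeans(n_clusters=k_global, n_init=10, random_state=seed)
    server_km.fit(np.vstack(uploaded))
    centers = server_km.cluster_centers_
    # Clients: assign each raw point to its nearest global center.
    return [np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
            for X in client_datasets]
```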
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
This dataset contains performance, attendance, and participation metrics of 300 students, intended for clustering, exploratory data analysis (EDA), and educational analytics. It can be used to explore relationships between quizzes, exams, GPA, attendance, lab sessions, and other academic indicators.
This dataset is ideal for unsupervised learning exercises, clustering students based on performance patterns, or for demonstrating educational analytics workflows.
Note: This is a small dataset (300 rows) and is not suitable for training large-scale supervised models.
File Name: student_performance.csv
Format: CSV (Comma-Separated Values)
Rows: 300
Columns: 16 features + optional identifier columns
Column Details:
| Column Name | Type | Description |
| ----------------------- | ------- | -------------------------------------------------------- |
| student_id | int64 | Unique student identifier |
| name | object | Student name (should be anonymized before use) |
| age | int64 | Age of the student (years) |
| gender | object | Gender of the student |
| quiz1_marks | float64 | Marks obtained in Quiz 1 (0-10) |
| quiz2_marks | float64 | Marks obtained in Quiz 2 (0-10) |
| quiz3_marks | float64 | Marks obtained in Quiz 3 (0-10) |
| total_assignments | int64 | Total number of assignments assigned |
| assignments_submitted | float64 | Number of assignments submitted (NaN in current dataset) |
| midterm_marks | float64 | Marks obtained in midterm exam (0-30) |
| final_marks | float64 | Marks obtained in final exam (0-50) |
| previous_gpa | float64 | GPA from previous semester (0-4 scale) |
| total_lectures | int64 | Total number of lectures scheduled |
| lectures_attended | int64 | Number of lectures attended |
| total_lab_sessions | int64 | Total lab sessions assigned |
| labs_attended | int64 | Number of lab sessions attended |
Suggested Usage:
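As one suggested usage, a minimal clustering sketch over the numeric columns above; the feature choice and k=4 are illustrative only.

```python
# One possible usage: scale the numeric performance features and cluster
# students with k-means. Column selection and k are illustrative choices.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

df = pd.read_csv("student_performance.csv")
features = ["quiz1_marks", "quiz2_marks", "quiz3_marks",
            "midterm_marks", "final_marks", "previous_gpa",
            "lectures_attended", "labs_attended"]
X = StandardScaler().fit_transform(df[features].fillna(df[features].median()))
df["cluster"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print(df.groupby("cluster")[features].mean())
```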
License: CC BY 4.0. Free to use, share, and adapt with proper attribution.
Citation: Muhammad Khubaib Ahmad, "Student Performance and Clustering Dataset", 2025, Kaggle. DOI: https://doi.org/10.34740/kaggle/dsv/13489035
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Manually labeled 555 metamodels mined from GitHub in April 2017.
Domains: (1) bibliography, (2) conference management, (3) bug/issue tracker, (4) build systems, (5) document/office products, (6) requirement/use case, (7) database/sql, (8) state machines, (9) petri nets
Procedure for constructing the dataset: fully manual, by searching for certain keywords and regexes (e.g. "state" and "transition" for state machines) in the metamodels and inspecting the results for inclusion.
Format for the file names: ABSINDEX_CLUSTER_ITEMINDEX_name_hash.ecore
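A hedged parsing sketch for this naming scheme; the character classes for name and hash are assumptions, and the example file name is made up.

```python
# Hedged sketch parsing the stated file-name format
# ABSINDEX_CLUSTER_ITEMINDEX_name_hash.ecore; the character classes for
# "name" and "hash" are assumptions, and the example name is made up.
import re

PATTERN = re.compile(r"^(\d+)_(\d+)_(\d+)_(.+)_([0-9a-f]+)\.ecore$")

def parse_name(filename: str):
    m = PATTERN.match(filename)
    if m is None:
        return None
    abs_index, cluster, item_index, name, digest = m.groups()
    return {"abs_index": int(abs_index), "cluster": int(cluster),
            "item_index": int(item_index), "name": name, "hash": digest}

print(parse_name("12_8_7_statemachine_a1b2c3.ecore"))
```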
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset contains data related to the experiment conducted in the paper Towards the Systematic Testing of Virtual Reality Programs.
It contains an implementation of an approach for predicting defect proneness on unlabeled datasets: Average Clustering and Labeling (ACL).
ACL models achieve good prediction performance and are comparable to typical supervised learning models in terms of F-measure. ACL offers a viable choice for defect prediction on unlabeled datasets.
This dataset also contains analyses related to code smells in C# repositories. Please check the paper for further information.
License: GNU General Public License v2.0 (http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html)
The continuous growth of the e-commerce industry has rendered the problem of product retrieval particularly important. As more enterprises move their activities on the Web, the volume and diversity of the product-related information increase quickly. These factors make it difficult for the users to identify and compare the features of their desired products. Clustering, classification, and product matching are useful algorithms that can contribute to the organization of product-related information and consequently, enhance the retrieval effectiveness.
This repository is designed to provide multiple datasets which are suitable for such algorithms. Each dataset is accompanied by its corresponding ground truth file that can be used for evaluation purposes.
This repository includes 18 real-world datasets from different product categories, acquired from two online product comparison platforms: PriceRunner and Skroutz. In particular, we partially crawled these two platforms and we constructed 8 datasets out of each one. Each of these 16 datasets represents a specific product category. The categories were selected with two criteria, in order to: i) study the performance difference of the same methods on similar products that were provided by different vendors, and ii) examine the effectiveness of the algorithms on products from diverse categories. For this reason, we included products from both identical and different categories. Moreover, we created one aggregate dataset per platform that contains all the products from all 8 categories combined. These datasets enable the examination of the performance on heterogeneous datasets.
The datasets are provided in standard CSV and XML formats. Each CSV/XML entry includes the following pieces of information:
* Product ID
* Product Title as it appears in the respective product comparison platform (but in lower case and with punctuation removed)
* Vendor ID: the ID of the electronic store that provides (sells) the product. The Vendor ID can be used for refinement purposes, such as the verification algorithm that we developed in [1].
* Cluster ID: the ID of the cluster that the product belongs to. Useful for entity matching and clustering tasks.
* Cluster Label: the title of the aforementioned cluster.
* Category ID: the ID of the category that the product belongs to. It is meaningful mainly in the two aggregate datasets that contain products from multiple categories. Useful for classification and categorization tasks.
* Category Label: the title of the aforementioned category.
The datasets are licensed under the General Public License (GPL 2.0) and can be used by anybody. Nevertheless, in case they are used for research purposes, the researchers are kindly requested to include the following articles in the references list of their published paper(s):
[1] L. Akritidis, A. Fevgas, P. Bozanis, C. Makris, "A Self-Verifying Clustering Approach to Unsupervised Matching of Product Titles", Artificial Intelligence Review (Springer), pp. 1-44, 2020.
[2] L. Akritidis, P. Bozanis, "Effective Unsupervised Matching of Product Titles with k-Combinations and Permutations", In Proceedings of the 14th IEEE International Conference on Innovations in Intelligent Systems and Applications (INISTA), pp. 1-10, 2018.
[3] L. Akritidis, A. Fevgas, P. Bozanis, "Effective Product Categorization with Importance Scores and Morphological Analysis of the Titles", In Proceedings of the 30th IEEE International Conference on Tools with Artificial Intelligence (ICTAI), pp. 213-220, 2018.
This is a text document classification dataset containing 2225 text documents in five categories: politics, sport, tech, entertainment, and business. It can be used for document classification and document clustering.
About Dataset:
- The dataset contains two features: text and label.
- No. of rows: 2225
- No. of columns: 2
Text: contains the text of the documents across the different categories. Label: contains the category label as an integer (0, 1, 2, 3, or 4).
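A minimal sketch clustering the documents with TF-IDF features and comparing against the labels; the file and column names are assumptions about the distributed CSV.

```python
# Minimal sketch: cluster the 2225 documents with TF-IDF + k-means and
# compare against the five labels. File/column names are assumptions.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

df = pd.read_csv("df_file.csv")            # assumed columns: Text, Label
X = TfidfVectorizer(stop_words="english", max_features=20000).fit_transform(df["Text"])
pred = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
print("ARI vs. labels:", adjusted_rand_score(df["Label"], pred))
```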
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
The ACCESS-AM2 (Australian Community Climate and Earth-System Simulator - Atmospheric Model Version 2) data and k-means analysis used for the study described in Fiddes et al. 2022, 'Southern Ocean cloud and shortwave radiation biases in a nudged climate model simulation: does the model ever get it right?'.
Included files:
The code that performs the analysis/generates this data and has instructions for where to download MODIS data can be found here: https://github.com/sfiddes/code_for_publications_2022/tree/main/ACCESS_cloud_radiation_eval
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
Is it possible to cluster all the photos in your phone automatically without labeling?
This small dataset includes 80 photos: dogs (10), cats (10), family (20), alone (20), and food (20). There is no labeling info, but the groups are clearly visible.
All the photos are from Pixabay (https://pixabay.com/). They are free to use under some restrictions; please see the Pixabay license page (https://pixabay.com/ko/service/license/).
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Dataset Source: These datasets are from the following paper: A. Noroozi, S. M. Esha, and M. Ghari, 'Predictors of childhood vaccination uptake in England: an explainable machine learning analysis of regional data (2021-2024)', Vaccine, vol. 68, p. 127902, Dec. 2025, doi: 10.1016/j.vaccine.2025.127902. The original paper is available at: https://www.sciencedirect.com/science/article/pii/S0264410X25011995
Description: Vacstat2021-22, Vacstat2022-23, and Vacstat2023-24 contain the vaccination data of 150 Upper Tier Local Authorities (UTLA) in England for 14 types of diseases for children under 5 years old, for 2021-2022, 2022-2023, and 2023-2024, respectively. GDSC.csv contains the GDSC data mentioned in the paper: geographical, demographic, socioeconomic, and cultural (ethnic) data for the same 150 UTLAs (regions) in England.
Tasks: 1. You can use the vaccination data to cluster the vaccination rates of the UTLAs (regions) and assign a cluster label to each region; this label represents the level of vaccination coverage for that UTLA. 2. You can use the GDSC data to classify the vaccination cluster of the UTLAs (regions).
License: This dataset is released under a custom Dataset License Agreement (see LICENSE_DATA.txt). The Figshare setting 'CC BY 4.0' applies only insofar as it is consistent with the custom license. Commercial use, redistribution, or hosting of the dataset elsewhere is not permitted, even with attribution. Users should share the official Figshare DOI link instead. By downloading or using this dataset, you agree to the terms of the Dataset License Agreement.
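A hedged sketch of the two tasks; the file layouts, index alignment, and k are assumptions about the released files.

```python
# Hedged sketch of the two suggested tasks: (1) cluster UTLAs by their 14
# vaccination-rate columns, (2) classify those cluster labels from GDSC
# features. File layouts and index alignment are assumptions.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

vac = pd.read_csv("Vacstat2023-24.csv", index_col=0)   # assumed: 150 UTLAs x 14 rates
gdsc = (pd.read_csv("GDSC.csv", index_col=0)
          .loc[vac.index]                # assumes a shared UTLA identifier index
          .select_dtypes("number"))      # keep numeric GDSC features only

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(vac)
clf = RandomForestClassifier(random_state=0)
print("CV accuracy:", cross_val_score(clf, gdsc, labels, cv=5).mean())
```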