65 datasets found
  1. MULTI-LABEL ASRS DATASET CLASSIFICATION USING SEMI-SUPERVISED SUBSPACE...

    • data.nasa.gov
    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    • +1more
    application/rdfxml +5
    Updated Jun 26, 2018
    Cite
    (2018). MULTI-LABEL ASRS DATASET CLASSIFICATION USING SEMI-SUPERVISED SUBSPACE CLUSTERING [Dataset]. https://data.nasa.gov/dataset/MULTI-LABEL-ASRS-DATASET-CLASSIFICATION-USING-SEMI/m4h6-922m
    Explore at:
    Available download formats: csv, application/rssxml, tsv, xml, application/rdfxml, json
    Dataset updated
    Jun 26, 2018
    License

    U.S. Government Works (https://www.usa.gov/government-works)
    License information was derived automatically

    Description

    MULTI-LABEL ASRS DATASET CLASSIFICATION USING SEMI-SUPERVISED SUBSPACE CLUSTERING

    MOHAMMAD SALIM AHMED, LATIFUR KHAN, NIKUNJ OZA, AND MANDAVA RAJESWARI

    Abstract. There has been a lot of research targeting text classification. Much of it focuses on a particular characteristic of text data: multi-labelity. This arises because a document may be associated with multiple classes at the same time. As a consequence, traditional binary or multi-class classification techniques perform poorly on multi-label text data. In this paper, we propose a text classification technique that considers this characteristic and provides very good performance. Our multi-label text classification approach is an extension of our previously formulated [3] multi-class text classification approach called SISC (Semi-supervised Impurity based Subspace Clustering). We call this new classification model SISC-ML (SISC Multi-Label). Empirical evaluation on the real-world multi-label NASA ASRS (Aviation Safety Reporting System) data set reveals that our approach outperforms state-of-the-art text classification as well as subspace clustering algorithms.

  2. Cell type labels for all clustering and normalization combinations compared...

    • data.niaid.nih.gov
    • datadryad.org
    • +1more
    zip
    Updated Nov 17, 2022
    Cite
    John Hickey (2022). Cell type labels for all clustering and normalization combinations compared for CODEX multiplexed imaging [Dataset]. http://doi.org/10.5061/dryad.dfn2z352c
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 17, 2022
    Dataset provided by
    Stanford University
    Authors
    John Hickey
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    We performed CODEX (co-detection by indexing) multiplexed imaging on four sections of the human colon (ascending, transverse, descending, and sigmoid) using a panel of 47 oligonucleotide-barcoded antibodies. The images then underwent standard CODEX image processing (tile stitching, drift compensation, cycle concatenation, background subtraction, deconvolution, and determination of best focal plane) and single-cell segmentation. The output of this process was a dataframe of nearly 130,000 cells with fluorescence values quantified for each marker. We used this dataframe as input to each of the 5 normalization conditions we compared: z, double-log(z), min/max, and arcsinh normalization, plus the original unmodified dataset. We used these normalized dataframes as inputs for 4 unsupervised clustering algorithms: k-means, Leiden, X-shift Euclidean, and X-shift angular.
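
    As a rough illustration of three of the normalizations named above, the following sketch applies them to the values of a single marker. The function names, the use of the population standard deviation, and the `cofactor` parameter are assumptions made for this example; the dataset description does not give the exact formulas, and double-log(z) is left out because its definition is not stated here.

```python
import math
import statistics


def z_normalize(values):
    """Z-score: subtract the mean, divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        # A constant marker carries no signal; map every value to 0.
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]


def min_max_normalize(values):
    """Rescale values linearly so the smallest is 0 and the largest is 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def arcsinh_normalize(values, cofactor=1.0):
    """Inverse hyperbolic sine transform; `cofactor` is a hypothetical scale parameter."""
    return [math.asinh(v / cofactor) for v in values]


# Toy fluorescence values for one marker (not taken from the dataset).
marker = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(z_normalize(marker))
print(min_max_normalize(marker))
print(arcsinh_normalize(marker))
```

    In practice each marker column would be normalized separately before clustering.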

    From the clustering outputs, we then labeled the resulting clusters, producing 20 unique cell type labels (5 normalizations × 4 clustering algorithms). We also labeled cell types by hierarchical hand-gating of the data within CellEngine (cellengine.com). We created another gold standard for comparison by overclustering unnormalized data with X-shift angular clustering. Finally, we created one last label as the major cell type call for each cell across all 21 cell type labels in the dataset.

    Consequently, each row of the dataset is an individual segmented cell. There are columns for the X, Y position in pixels in the overall montage image of the dataset, and columns indicating which region the data came from (4 total). The remaining columns are the labels generated by all the clustering and normalization techniques used in the manuscript and compared with each other. These data were also used for the neighborhood analysis in the last figure of the manuscript. Labels are provided at all four levels of cell type granularity (from 7 cell types to 35 cell types).
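
    One way to read the "major cell type call" mentioned above is as a majority vote over a cell's label columns. The sketch below works under that assumption; the column names in `cell` are hypothetical and do not come from the dataset.

```python
from collections import Counter


def majority_label(labels):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


# One hypothetical row: position, region, and a few per-method labels.
cell = {
    "x": 1520,
    "y": 884,
    "region": "transverse",
    "kmeans_z": "B cell",
    "leiden_z": "B cell",
    "xshift_angular_arcsinh": "T cell",
}

# Only the label columns take part in the vote.
label_columns = ["kmeans_z", "leiden_z", "xshift_angular_arcsinh"]
print(majority_label([cell[c] for c in label_columns]))
```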

  3. Dataset - Clustering Semantic Predicates in the Open Research Knowledge...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 8, 2022
    Cite
    Arab Oghli, Omar (2022). Dataset - Clustering Semantic Predicates in the Open Research Knowledge Graph [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6513498
    Explore at:
    Dataset updated
    Aug 8, 2022
    Dataset authored and provided by
    Arab Oghli, Omar
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset was created for implementing a content-based recommender system in the context of the Open Research Knowledge Graph (ORKG). The recommender system accepts a research paper's title and abstract as input and recommends existing ORKG predicates that are semantically relevant to the given paper.

    The paper instances in the dataset are grouped by ORKG comparisons and therefore the data.json file is more comprehensive than training_set.json and test_set.json.

    data.json

    The main JSON object consists of a list of comparisons. Each comparison object has an ID, a label, a list of papers, and a list of predicates. Each paper object has an ID, label, DOI, research field, research problems, and abstract. Each predicate object has an ID and a label. See an example instance below.

    { "comparisons": [ { "id": "R108331", "label": "Analysis of approaches based on required elements in way of modeling", "papers": [ { "id": "R108312", "label": "Rapid knowledge work visualization for organizations", "doi": "10.1108/13673270710762747", "research_field": { "id": "R134", "label": "Computer and Systems Architecture" }, "research_problems": [ { "id": "R108294", "label": "Enterprise engineering" } ], "abstract": "Purpose \u2013 The purpose of this contribution is to motivate a new, rapid approach to modeling knowledge work in organizational settings and to introduce a software tool that demonstrates the viability of the envisioned concept.Design/methodology/approach \u2013 Based on existing modeling structures, the KnowFlow toolset that aids knowledge analysts in rapidly conducting interviews and in conducting multi\u2010perspective analysis of organizational knowledge work is introduced.Findings \u2013 This article demonstrates how rapid knowledge work visualization can be conducted largely without human modelers by developing an interview structure that allows for self\u2010service interviews. Two application scenarios illustrate the pressing need for and the potentials of rapid knowledge work visualizations in organizational settings.Research limitations/implications \u2013 The efforts necessary for traditional modeling approaches in the area of knowledge management are often prohibitive. This contribution argues that future research needs ..." }, .... ], "predicates": [ { "id": "P37126", "label": "activities, behaviours, means [for knowledge development and/or for knowledge conveyance and transformation" }, { "id": "P36081", "label": "approach name" }, .... ] }, .... ] }

    training_set.json and test_set.json

    The main JSON object consists of a list of training/test instances. Each instance has an instance_id with the format (comparison_id X paper_id) and a text. The text is a concatenation of the paper's label (title) and abstract. See an example instance below.

    Note that test instances are not duplicated and do not occur in the training set. Training instances are also not duplicated, BUT training papers can be duplicated in a concatenation with different comparisons.

    { "instances": [ { "instance_id": "R108331xR108301", "comparison_id": "R108331", "paper_id": "R108301", "text": "A notation for Knowledge-Intensive Processes Business process modeling has become essential for managing organizational knowledge artifacts. However, this is not an easy task, especially when it comes to the so-called Knowledge-Intensive Processes (KIPs). A KIP comprises activities based on acquisition, sharing, storage, and (re)use of knowledge, as well as collaboration among participants, so that the amount of value added to the organization depends on process agents' knowledge. The previously developed Knowledge Intensive Process Ontology (KIPO) structures all the concepts (and relationships among them) to make a KIP explicit. Nevertheless, KIPO does not include a graphical notation, which is crucial for KIP stakeholders to reach a common understanding about it. This paper proposes the Knowledge Intensive Process Notation (KIPN), a notation for building knowledge-intensive processes graphical models." }, ... ] }
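
    The instance layout described above can be sketched as follows. The helper names are hypothetical, and joining the title and abstract with a single space is an assumption based on the example instance; this is not the authors' preprocessing code.

```python
def make_instance(comparison_id, paper):
    """Build one training/test instance from a comparison ID and a paper object."""
    return {
        # instance_id has the form "<comparison_id>x<paper_id>".
        "instance_id": f"{comparison_id}x{paper['id']}",
        "comparison_id": comparison_id,
        "paper_id": paper["id"],
        # The text is the paper's label (title) followed by its abstract.
        "text": f"{paper['label']} {paper['abstract']}",
    }


def split_instance_id(instance_id):
    """Recover (comparison_id, paper_id) from an instance_id."""
    comparison_id, paper_id = instance_id.split("x", 1)
    return comparison_id, paper_id


paper = {
    "id": "R108301",
    "label": "A notation for Knowledge-Intensive Processes",
    "abstract": "Business process modeling has become essential.",
}
instance = make_instance("R108331", paper)
print(instance["instance_id"])
```

    Because the same paper can appear under several comparisons, the paper ID alone does not identify a training instance; the combined `instance_id` does.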

    Dataset Statistics:

                        Papers   Predicates   Research Fields   Research Problems
    Min/Comparison      2        2            1                 0
    Max/Comparison      202      112          5                 23
    Avg./Comparison     21.54    12.79        1.20              1.09
    Total               4060     1816         46                178
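
    Per-comparison figures like the minimum, maximum, and average above could be recomputed from data.json along these lines. This is a minimal sketch: the function name is hypothetical, the toy data is invented, and the "Total" row is not reproduced because the description does not say how items shared between comparisons are counted.

```python
def per_comparison_stats(data, key):
    """Min, max, and average number of `key` items (e.g. "papers") per comparison."""
    counts = [len(comparison[key]) for comparison in data["comparisons"]]
    return {
        "min": min(counts),
        "max": max(counts),
        "avg": sum(counts) / len(counts),
    }


# Toy object with the same shape as data.json (IDs are invented).
toy_data = {
    "comparisons": [
        {"id": "C1", "papers": [{"id": "P1"}, {"id": "P2"}], "predicates": [{"id": "X1"}]},
        {"id": "C2", "papers": [{"id": "P3"}, {"id": "P4"}, {"id": "P5"}, {"id": "P6"}],
         "predicates": [{"id": "X1"}, {"id": "X2"}, {"id": "X3"}]},
    ]
}
print(per_comparison_stats(toy_data, "papers"))
```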
    

    Dataset Splits:

                    Papers   Comparisons
    Training Set    2857     214
    Test Set        1203     180
    
  4. A labeled Ecore metamodel dataset for domain clustering

    • explore.openaire.eu
    • zenodo.org
    Updated Mar 6, 2019
    Cite
    Önder Babur (2019). A labeled Ecore metamodel dataset for domain clustering [Dataset]. http://doi.org/10.5281/zenodo.2585431
    Explore at:
    Dataset updated
    Mar 6, 2019
    Authors
    Önder Babur
    Description

    Manually labeled 555 metamodels mined from GitHub in April 2017.

    Domains: (1) bibliography, (2) conference management, (3) bug/issue tracker, (4) build systems, (5) document/office products, (6) requirement/use case, (7) database/sql, (8) state machines, (9) Petri nets.

    Procedure for constructing the dataset: fully manual, by searching for certain keywords and regexes (e.g. "state" and "transition" for state machines) in the metamodels and inspecting the results for inclusion.

    Format for the file names: ABSINDEX_CLUSTER_ITEMINDEX_name_hash.ecore
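
    The file name format above can be taken apart with a small helper. This sketch assumes that the hash contains no underscore while the name part may contain some; the function name and the example file name are hypothetical.

```python
def parse_metamodel_filename(filename):
    """Split ABSINDEX_CLUSTER_ITEMINDEX_name_hash.ecore into its fields."""
    suffix = ".ecore"
    if not filename.endswith(suffix):
        raise ValueError(f"not an .ecore file: {filename}")
    stem = filename[: -len(suffix)]
    # The first three fields are fixed; the remainder is "name_hash".
    abs_index, cluster, item_index, rest = stem.split("_", 3)
    # The name may itself contain underscores, so split the hash off the right.
    name, file_hash = rest.rsplit("_", 1)
    return {
        "abs_index": abs_index,
        "cluster": cluster,
        "item_index": item_index,
        "name": name,
        "hash": file_hash,
    }


print(parse_metamodel_filename("12_8_3_state_machine_ab12cd.ecore"))
```

    The `cluster` field would then give the domain label for clustering evaluation.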

  5. Data from: Pseudo-Label Generation for Multi-Label Text Classification

    • catalog.data.gov
    • datasets.ai
    • +2more
    Updated Dec 6, 2023
    Cite
    Dashlink (2023). Pseudo-Label Generation for Multi-Label Text Classification [Dataset]. https://catalog.data.gov/dataset/pseudo-label-generation-for-multi-label-text-classification
    Explore at:
    Dataset updated
    Dec 6, 2023
    Dataset provided by
    Dashlink
    Description

    With the advent and expansion of social networking, the amount of generated text data has seen a sharp increase. In order to handle such a huge volume of text data, new and improved text mining techniques are a necessity. One of the characteristics of text data that makes text mining difficult is multi-labelity. In order to build a robust and effective text classification method, which is an integral part of text mining research, we must consider this property more closely. This kind of property is not unique to text data, as it can be found in non-text (e.g., numeric) data as well. However, it is most prevalent in text data. This property also puts the text classification problem in the domain of multi-label classification (MLC), where each instance is associated with a subset of class labels instead of a single class, as in conventional classification. In this paper, we explore how the generation of pseudo labels (i.e., combinations of existing class labels) can help us perform better text classification, and under what kind of circumstances. During the classification, the high and sparse dimensionality of text data has also been considered. Although we are proposing and evaluating a text classification technique here, our main focus is on handling the multi-labelity of text data while utilizing the correlation among the multiple labels existing in the data set. Our text classification technique is called pseudo-LSC (pseudo-Label Based Subspace Clustering). It is a subspace clustering algorithm that considers the high and sparse dimensionality as well as the correlation among different class labels during the classification process to provide better performance than existing approaches. Results on three real-world multi-label data sets provide insight into how multi-labelity is handled in our classification process and show the effectiveness of our approach.
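
    To make the idea of pseudo labels as "combinations of existing class labels" concrete, the sketch below maps each distinct label set to one combined label. It only illustrates that general idea; it is not the pseudo-LSC algorithm, and the function name, the "+" separator, and the example labels are assumptions.

```python
def build_pseudo_labels(label_sets):
    """Map each distinct combination of class labels to a single pseudo label."""
    mapping = {}
    assigned = []
    for labels in label_sets:
        key = frozenset(labels)
        if key not in mapping:
            # Sorting makes the pseudo label independent of input order.
            mapping[key] = "+".join(sorted(key))
        assigned.append(mapping[key])
    return assigned, mapping


# Toy multi-label documents (labels are invented for the example).
documents = [{"weather", "engine"}, {"engine", "weather"}, {"runway"}]
assigned, mapping = build_pseudo_labels(documents)
print(assigned)
```

    Each document then carries exactly one pseudo label, so a single-label method can be applied while label co-occurrence is preserved.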

  6. Data from: Zero-shot Bilingual App Reviews Mining with Large Language Models...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated May 23, 2024
    Cite
    Jialiang Wei; Anne-Lise Courbis; Thomas Lambolais; Binbin Xu; Pierre Louis Bernard; Gérard Dray (2024). Zero-shot Bilingual App Reviews Mining with Large Language Models [Dataset]. http://doi.org/10.1109/ictai59109.2023.00135
    Explore at:
    Available download formats: zip
    Dataset updated
    May 23, 2024
    Dataset provided by
    IEEE
    Authors
    Jialiang Wei; Anne-Lise Courbis; Thomas Lambolais; Binbin Xu; Pierre Louis Bernard; Gérard Dray
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Classification

    6000 English and 6000 French user reviews from three applications on Google Play (Garmin Connect, Huawei Health, Samsung Health) are labelled manually. We employed three labels: problem report, feature request, and irrelevant.

    • Problem reports show the issues the users have experienced while using the app.
    • Feature requests reflect users' demands for new functions, new content, a new interface, etc.
    • Irrelevant reviews are user reviews that do not belong to either of the two aforementioned categories.

    As the following table shows, each review belongs to one or more categories.

    App              Language   Total   Feature request   Problem report   Irrelevant
    Garmin Connect   en         2000    223               579              1231
    Garmin Connect   fr         2000    217               772              1051
    Huawei Health    en         2000    415               876              764
    Huawei Health    fr         2000    387               842              817
    Samsung Health   en         2000    528               500              990
    Samsung Health   fr         2000    496               492              1047

    Clustering

    1200 bilingual labeled user reviews for clustering evaluation. From each of the three applications and for each of the two languages present in the classification dataset, we randomly selected 100 problem reports and 100 feature requests. Subsequently, we conducted manual clustering on each collection of 200 bilingual reviews, all of which pertained to the same category.

                                               Garmin Connect   Huawei Health   Samsung Health
    #clusters in feature request               89               74              69
    #clusters (size ≥ 5) in feature request    7                9               11
    #clusters in problem report                45               44              41
    #clusters (size ≥ 5) in problem report     10               13              12

  7. 8 years of dayside Magnetospheric Multiscale (MMS) unsupervised clustering...

    • zenodo.org
    csv
    Updated Jun 17, 2024
    Cite
    Vicki Toy-Edens; Wenli Mo; Savvas Raptis; Drew Turner (2024). 8 years of dayside Magnetospheric Multiscale (MMS) unsupervised clustering plasma regions classifications [Dataset]. http://doi.org/10.5281/zenodo.11032322
    Explore at:
    Available download formats: csv
    Dataset updated
    Jun 17, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Vicki Toy-Edens; Wenli Mo; Savvas Raptis; Drew Turner
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Time period covered
    Jan 11, 2024
    Description

    These files contain the 1-minute resolution dataset (“labeled_sunside_data.csv”) and 15 minute or longer region list (“

    We ask that if you use any parts of the dataset that you cite Toy-Edens et al.'s Classifying 8 years of MMS Dayside Plasma Regions via Unsupervised Machine Learning (DOI:10.1029/2024JA032431).

    This work was funded by grant 2225463 from the NSF GEM program.

    The following tables detail the contents of the described files:

    labeled_sunside_data.csv description

    Column Name            Description
    Epoch                  Epoch in datetime
    probe                  MMS probe name
    ratio_max_width        Ratio of the width of the most prominent ion spectra peak (in number of energy channels) to max number of energy channels. See paper for more information
    ratio_high_low         Ratio of the mean of the log intensity of high energies in the ion spectra to the mean of the log intensity of low energies in the ion spectra. See paper for more information
    norm_Btot              Magnitude of the total magnetic field normalized to 50 nT. See paper for more information
    small_energy_mean      The denominator in ratio_high_low
    large_energy_mean      The numerator in ratio_high_low
    temp_total             Total temperature from the DIS moments. See paper for more information
    r_gse_x                x position of the spacecraft in GSE
    r_gse_y                y position of the spacecraft in GSE
    r_gse_z                z position of the spacecraft in GSE
    r_gsm_x                x position of the spacecraft in GSM
    r_gsm_y                y position of the spacecraft in GSM
    r_gsm_z                z position of the spacecraft in GSM
    mlat                   Magnetic latitude of spacecraft
    mlt                    Magnetic local time of spacecraft
    raw_named_label        Raw cluster-assigned plasma region label (allowed values: magnetosheath, magnetosphere, solar wind, ion foreshock)
    modified_named_label   Cleansed cluster-assigned plasma region label (use these unless you have a specific reason to use the raw labels). See paper for more information
    transition_name        Transition names (e.g. quasi-perpendicular bow shock, magnetopause). See paper for more information

    Column Name   Description
    start         Starting Epoch in datetime
    stop          Stopping Epoch in datetime
    probe         MMS probe name
    region        Cleansed cluster name associated with 1-minute resolution “modified_named_label”
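
    The relationship between the two tables (1-minute labels and a list of regions lasting 15 minutes or longer) can be illustrated by collapsing consecutive rows with the same label into intervals. This is only a sketch under stated assumptions: it is not the authors' procedure, it ignores gaps in the time series, it expects rows sorted by probe and then Epoch, and the probe value "mms1" is a made-up example.

```python
from datetime import datetime, timedelta
from itertools import groupby


def region_list(rows, min_duration=timedelta(minutes=15)):
    """Collapse consecutive same-label rows into (start, stop, probe, region) records."""
    regions = []
    run_key = lambda row: (row["probe"], row["modified_named_label"])
    for (probe, label), run in groupby(rows, key=run_key):
        run = list(run)
        start, stop = run[0]["Epoch"], run[-1]["Epoch"]
        # Keep only runs that span at least the minimum duration.
        if stop - start >= min_duration:
            regions.append({"start": start, "stop": stop, "probe": probe, "region": label})
    return regions


# Toy series: 20 minutes of magnetosheath followed by 5 minutes of solar wind.
t0 = datetime(2020, 1, 1, 0, 0)
rows = [
    {"Epoch": t0 + timedelta(minutes=i), "probe": "mms1",
     "modified_named_label": "magnetosheath" if i < 20 else "solar wind"}
    for i in range(25)
]
for region in region_list(rows):
    print(region["start"], region["stop"], region["region"])
```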

  8. Childhood Cancer Cluster Simulation - Cancer in Brief Manuscript Schündeln...

    • data.mendeley.com
    • narcis.nl
    Updated Nov 2, 2020
    Cite
    Michael Schündeln (2020). Childhood Cancer Cluster Simulation - Cancer in Brief Manuscript Schündeln et al. 2020 [Dataset]. http://doi.org/10.17632/3hrg9tpsx9.2
    Explore at:
    Dataset updated
    Nov 2, 2020
    Authors
    Michael Schündeln
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Incidence of newly diagnosed childhood cancer (140/1,000,000 children under 15 years) and nephroblastoma (7/1,000,000) was simulated. Clusters of defined size (1 to 50) were randomly assembled on the district level in Germany. Each cluster was simulated with different relative risk levels (1 to 100). For each combination 2000 iterations were done. Simulated data was then analyzed by three local clustering tests: Besag-Newell method, spatial scan statistic and Bayesian Besag-York-Mollié with Integrated Nested Laplace Approximation approach. The operating characteristics of all three methods were systematically documented (sensitivity, specificity, positive/negative predictive values, exact and minimum power, correct classification, positive/negative diagnostic likelihood and false positive/negative rate).

    The performance of each of the various cluster detection methods and scenarios in this study is reported according to the quality criteria detailed below.

    • Minimum Power (MP): Proportion of simulations detecting at least one district of the true cluster.
    • Exact Power (EP): Proportion of simulations detecting the true cluster without false positives.
    • Sensitivity (sens): Proportion of correctly detected districts in the true cluster.
    • Specificity (spec): Percentage of normal-risk districts correctly classified as normal-risk districts.
    • Positive predictive value (PPV): Proportion of districts in the detected cluster belonging to the true cluster.
    • Negative predictive value (NPV): Proportion of districts not labeled as a risk cluster that are not part of the true cluster.
    • Correct classification (CC): Percentage of correctly classified districts of all districts.
    • Correct proportion (CP): Correctly labeled districts of all detected potential high-risk districts.
    • Positive diagnostic likelihood (PDL): The probability of high-risk districts being detected divided by the probability of non-high-risk districts being detected (sensitivity / (1 - specificity)).
    • Negative diagnostic likelihood (NDL): The probability of high-risk districts not being detected divided by the probability of non-high-risk districts not being detected ((1 - sensitivity) / specificity).
    • False positive rate (FPR): Incorrectly labeled high-risk districts of all detected high-risk districts.
    • False negative rate (FNR): Incorrectly labeled normal-risk districts of all detected normal-risk districts.
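
    Several of the criteria above can be computed for a single simulation run from three sets of district IDs. The following is a minimal sketch with a hypothetical function name and toy district IDs; the per-run flags `minimum_hit` and `exact_hit` would be averaged over all iterations to obtain Minimum Power and Exact Power.

```python
def cluster_metrics(all_districts, true_cluster, detected):
    """Per-run quality criteria for one detected cluster versus the true cluster."""
    all_districts, true_cluster, detected = set(all_districts), set(true_cluster), set(detected)
    tp = len(true_cluster & detected)                  # high-risk and detected
    fp = len(detected - true_cluster)                  # normal-risk but detected
    fn = len(true_cluster - detected)                  # high-risk but missed
    tn = len(all_districts - true_cluster - detected)  # normal-risk and not detected
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return {
        "sens": sens,
        "spec": spec,
        "ppv": tp / (tp + fp) if tp + fp else None,
        "npv": tn / (tn + fn) if tn + fn else None,
        "cc": (tp + tn) / len(all_districts),
        "pdl": sens / (1 - spec) if spec < 1 else float("inf"),
        "ndl": (1 - sens) / spec if spec > 0 else float("inf"),
        "minimum_hit": tp >= 1,               # counts toward Minimum Power
        "exact_hit": fp == 0 and fn == 0,     # counts toward Exact Power
    }


# Toy run: 10 districts, true cluster {0, 1, 2, 3}, detected cluster {2, 3, 4}.
metrics = cluster_metrics(range(10), {0, 1, 2, 3}, {2, 3, 4})
print(metrics)
```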

  9. Robust phenotyping of highly multiplexed tissue imaging data using...

    • zenodo.org
    zip
    Updated Jul 6, 2023
    Cite
    Candace C Liu; Michael Angelo (2023). Robust phenotyping of highly multiplexed tissue imaging data using pixel-level clustering (lymph node MIBI-TOF data) [Dataset]. http://doi.org/10.5281/zenodo.8096953
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 6, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Candace C Liu; Michael Angelo
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    MIBI-TOF data for lymph node dataset reported in Liu et al., Robust phenotyping of highly multiplexed tissue imaging data using pixel-level clustering

    1. mibi_single_channel_tifs.zip: Single-channel MIBI-TOF images

    Folders are labeled according to the field-of-view (FOV) number. Each folder contains single-channel TIFFs for each marker in the panel. Images are 1024x1024 pixels, 500 um. See paper for details.

    2. segmentation.zip: Segmentation output of MIBI-TOF images

    Cell segmentation was performed using Mesmer (Greenwald NF, Nature Biotechnology 2021). Output of Mesmer that delineates the single cells in each of the images is included.

    3. source_data.zip: Source data files for figures

    • pixel_ccs_allpreprocessing.csv: Cluster consistency score (CCS) for all pixels using all preprocessing steps, related to Fig. 2d-f, Supp. Fig. 4,5,9,10
    • pixel_ccs_nopixelnorm.csv: CCS for all pixels where pixel normalization was left out, related to Fig. 2f, Supp. Fig. 6
    • pixel_ccs_nochannelnorm.csv: CCS for all pixels where channel normalization was left out, related to Fig. 2f, Supp. Fig. 8
    • pixel_ccs_passes1.csv: CCS for all pixels where 1 pass was used for SOM training, related to Supp. Fig. 10
    • pixel_ccs_passes100.csv: CCS for all pixels where 100 passes were used for SOM training, related to Supp. Fig. 10
    • pixel_ccs_sigma0.csv: CCS for all pixels where a Gaussian blur sigma of 0 was used for preprocessing, related to Supp. Fig. 5
    • pixel_ccs_sigma1.csv: CCS for all pixels where a Gaussian blur sigma of 1 was used for preprocessing, related to Supp. Fig. 5
    • pixel_ccs_sigma3.csv: CCS for all pixels where a Gaussian blur sigma of 3 was used for preprocessing, related to Supp. Fig. 5
    • pixel_ccs_nodes15.csv: CCS for all pixels where 15 nodes were used for SOM training, related to Supp. Fig. 9
    • pixel_ccs_threshold80.csv: CCS for all pixels where a threshold of 80% was used for CCS calculation, related to Supp. Fig. 4b
    • pixel_ccs_threshold98.csv: CCS for all pixels where a threshold of 98% was used for CCS calculation, related to Supp. Fig. 4b
    • pixel_info_comparison_table.csv: Number of pixels that were assigned to a cluster outside of cell segmentation masks, related to Fig. 3d
    • single_cell_pixel_composition_table.csv: Pixel composition information for each single cell, related to Fig. 5, Supp. Fig 16
    • single_cell_integrated_expression_table.csv: Integrated expression per cell, output by Mesmer, related to Fig. 5, Supp. Fig. 16
    • cell_silhouette_scores.csv: Silhouette scores for comparing integrated expression and pixel composition, related to Fig. 5d
    • cell_ccs_pixel_composition.csv: CCS for all cells using pixel composition for clustering, related to Supp. Fig. 16e, 17c
    • cell_ccs_integrated_expression.csv: CCS for all cells using integrated expression for clustering, related to Supp. Fig 16e-f
    • cell_ccs_integrated_expression_preprocessed.csv: CCS for all cells using integrated expression for clustering where data was preprocessed before integrating, related to Supp. Fig 17
    • cytof_ccs.csv: CCS of the CyTOF dataset used as a benchmark, related to Supp. Fig. 4c,d
    • scrnaseq_ccs.csv: CCS of the scRNA-seq dataset used as a benchmark, related to Supp. Fig. 4c,e
    • pixel_phenotype_maps: TIFFs where pixel value corresponds to pixel cluster number as reported in the paper
    • cell_phenotype_maps: TIFFs where pixel value corresponds to cell cluster number as reported in the paper
  10. Table_1_Strategies for Accurate Cell Type Identification in CODEX...

    • frontiersin.figshare.com
    xlsx
    Updated May 31, 2023
    Cite
    John W. Hickey; Yuqi Tan; Garry P. Nolan; Yury Goltsev (2023). Table_1_Strategies for Accurate Cell Type Identification in CODEX Multiplexed Imaging Data.xlsx [Dataset]. http://doi.org/10.3389/fimmu.2021.727626.s002
    Explore at:
    Available download formats: xlsx
    Dataset updated
    May 31, 2023
    Dataset provided by
    Frontiers
    Authors
    John W. Hickey; Yuqi Tan; Garry P. Nolan; Yury Goltsev
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Multiplexed imaging is a recently developed and powerful single-cell biology research tool. However, it presents new sources of technical noise that are distinct from other types of single-cell data, necessitating new practices for single-cell multiplexed imaging processing and analysis, particularly regarding cell-type identification. Here we created single-cell multiplexed imaging datasets by performing CODEX on four sections of the human colon (ascending, transverse, descending, and sigmoid) using a panel of 47 oligonucleotide-barcoded antibodies. After cell segmentation, we implemented five different normalization techniques crossed with four unsupervised clustering algorithms, resulting in 20 unique cell-type annotations for the same dataset. We generated two standard annotations: hand-gated cell types and cell types produced by over-clustering with spatial verification. We then compared these annotations at four levels of cell-type granularity. First, increasing cell-type granularity led to decreased labeling accuracy; therefore, subtle phenotype annotations should be avoided at the clustering step. Second, accuracy in cell-type identification varied more with normalization choice than with clustering algorithm. Third, unsupervised clustering better accounted for segmentation noise during cell-type annotation than hand-gating. Fourth, Z-score normalization was generally effective in mitigating the effects of noise from single-cell multiplexed imaging. Variation in cell-type identification will lead to significant differential spatial results such as cellular neighborhood analysis; consequently, we also make recommendations for accurately assigning cell-type labels to CODEX multiplexed imaging.

  11. Data from: Semi-supervised non-negative matrix factorization with structure...

    • data.mendeley.com
    Updated Dec 9, 2024
    Cite
    Wenjing Jing (2024). Semi-supervised non-negative matrix factorization with structure preserving for image clustering [Dataset]. http://doi.org/10.17632/gf67wvrhbs.1
    Explore at:
    Dataset updated
    Dec 9, 2024
    Authors
    Wenjing Jing
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The code for the paper "Semi-supervised non-negative matrix factorization with structure preserving for image clustering". The paper constructs a new weighted label matrix and, from it, a label constraint regularizer that both utilizes the label information and maintains the intrinsic structure of NMF. Based on the label constraint regularizer, the basis images of the labeled data are extracted and used, through a basis regularizer, to monitor and modify the learning of the basis images of all data. By incorporating the label constraint regularizer and the basis regularizer into NMF, a new semi-supervised NMF method is introduced. The proposed method is applied to image clustering, and experimental results demonstrate its effectiveness in comparison with state-of-the-art unsupervised and semi-supervised algorithms.

  12. Supplementary codes and datasets for "Modular-topology optimization of...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated May 12, 2023
    Cite
    Marek Tyburec; Martin Doškář; Jan Zeman; Martin Kružík (2023). Supplementary codes and datasets for "Modular-topology optimization of structures and mechanisms with free material design and clustering" [Dataset]. http://doi.org/10.5281/zenodo.5714298
    Explore at:
    Available download formats: zip
    Dataset updated
    May 12, 2023
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Marek Tyburec; Marek Tyburec; Martin Doškář; Martin Doškář; Jan Zeman; Jan Zeman; Martin Kružík; Martin Kružík
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository supports manuscript “Modular-topology optimization of structures and mechanisms with free material design and clustering” by M. Tyburec, M. Doškář, J. Zeman, and M. Kružík, first published as preprint 2111.10439 at arXiv.org.

    This repository contains:

    1. MATLAB source codes for (modular) free material optimisation and hierarchical stiffness clustering (folder ./mFMO/)
    2. C++ source codes for modular topology optimization (folder ./MTO/)
    3. Input/output data of the test suite (folder ./data/)

    1. Data flow

    The test suite considered in the manuscript covers 4 problems:

    1. Messerschmitt-Bölkow-Blohm beam (labelled as mbb)
    2. Inverter compliant mechanism (labelled as inv)
    3. Gripper compliant mechanism (labelled as grip)
    4. Reusable design of both compliant mechanisms (labelled as invgrip)

    Each problem in the dataset is stored in a separate subfolder named according to the labels above. The final level of subdirectories, {X}color, contains the results for problems where X denotes the number of edge codes considered for each edge direction during the clustering (0color stands for a non-modular design and 1color represents the design based on a Periodic Unit Cell).

    Each of the folders contains outputs of the modular free material optimisation in the following form:

    • {label}{X}.mat
    • {label}{X}.til
    • {label}{X}.tset
    • {label}{X}guess.mat

    Files *.til, *.tset, and *guess.mat are then converted into a JSON input file for the modular topology optimization code with generator scripts which can be found in ./MTO/scripts folder. Note that each of the problems in the test suite has its own generator script generate_modular_problem_{MBB,inverter,gripper,inverterAndGripper}.mat. The generator scripts make a directory named according to the key MTO_{n}_kernelSensitivity, where n denotes the resolution of each module (i.e. the number of nodes along one direction). The directory also contains the outputs of the modular topology optimisation in the form of the initial and the final state of the optimization in VTK files and visualisation of the final state in SVG files. The log file log.txt stores the optimized objective and progress of the value along with stopping criteria quantities during iterations.

    2. Running codes

    2.1 Modular free material optimisation

    MATLAB scripts and functions for (modular) Free Material Optimization (FMO) are contained in the mFMO data folder. The codes have been tested with MATLAB R2019b. To run the codes the user is required to install the PENNON optimizer. A free academic license is provided by its authors on request.

    Input files for individual problems are defined in the mFMO/problems folder and are launched with the function runproblem(problemName, numClusters), where problemName refers to the file in the mFMO/problems folder without the file extension and numClusters denotes the maximum number of color codes in the Wang tiling formalism.

    If successful, the optimization produces output files in mFMO/fmo_fig/{label}/{X}colors/{T}/:

    • {label}{X}.mat (contains clustering and tiling information)
    • {label}{X}_tmp.mat (contains results of non-modular FMO)
    • {label}{X}.til (the assembly plan)
    • {label}{X}.tset (Wang tile set)
    • {label}{X}guess.mat (guess for TO)

    where T is the optimization time stamp.
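
    The output layout listed above can be sketched as a small path builder, which is handy for checking that a run produced everything expected. This is a minimal sketch, not part of the repository: the function name expected_fmo_outputs and the variable names are hypothetical, and the format of the time stamp T is not specified in the description, so any string is accepted.

```python
from __future__ import annotations

from pathlib import PurePosixPath

# Suffixes of the five output files listed above for a given {label}{X}.
OUTPUT_SUFFIXES = (".mat", "_tmp.mat", ".til", ".tset", "guess.mat")


def expected_fmo_outputs(label: str, num_colors: int, timestamp: str) -> list[PurePosixPath]:
    """Return the output paths described above for one optimization run.

    `timestamp` stands in for {T}; its exact format is an assumption.
    """
    out_dir = PurePosixPath("mFMO/fmo_fig") / label / f"{num_colors}colors" / timestamp
    stem = f"{label}{num_colors}"
    return [out_dir / f"{stem}{suffix}" for suffix in OUTPUT_SUFFIXES]


# Example: the mbb problem with 2 colours and a placeholder time stamp.
for path in expected_fmo_outputs("mbb", 2, "T0"):
    print(path)
```

    Comparing this list with the files actually present in the folder is one way to confirm that the optimization finished successfully.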

    2.2 Modular topology optimisation

    All results were obtained with version v0.1.0, which is also provided in the folder MTO, linked against the Intel® oneAPI Math Kernel Library and its incorporated PARDISO sparse solver. For recent development of the code, see the open git repository at https://gitlab.com/MartinDoskar/modular-topology-optimization. The repository also contains a detailed description of input parameters and code design.

    Modular topology optimisation code uses CMake for the cross-platform build automation. For instance, under Linux, the whole code can be compiled in the standard five steps:

    cd ./MTO
    mkdir build
    cd ./build
    cmake -DCMAKE_BUILD_TYPE=Release ..
    make
    

    All executables are automatically stored in the ./MTO/bin/ folder. Individual problems can be optimized by passing the JSON files obtained from the generator scripts as an argument to the MTO.Application binary, e.g.,

    ./MTO/bin/MTO.Application.exe path_to_data/mbb/2color/MTO_100_kernelSensitivity/input_modular_mbb_2colours_100.json
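
    When several problems are run in a batch, the command above can be assembled programmatically. The following is a minimal sketch under stated assumptions: the helper mto_command is hypothetical, and the JSON file-name pattern is generalized from the single mbb example above, so it should be checked against the names the generator scripts actually write.

```python
from pathlib import PurePosixPath


def mto_command(data_root: str, label: str, num_colors: int, resolution: int) -> list[str]:
    """Build the argument list for one MTO.Application run.

    The file-name pattern is an assumption inferred from one example path.
    """
    run_dir = (
        PurePosixPath(data_root)
        / label
        / f"{num_colors}color"
        / f"MTO_{resolution}_kernelSensitivity"
    )
    json_name = f"input_modular_{label}_{num_colors}colours_{resolution}.json"
    return ["./MTO/bin/MTO.Application.exe", str(run_dir / json_name)]


# Reproduces the example invocation shown above.
cmd = mto_command("path_to_data", "mbb", 2, 100)
print(" ".join(cmd))
```

    The returned list can be handed to subprocess.run(cmd, check=True) once the binary has been built.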
    

    Acknowledgement

    The related research and code development was supported by the Czech Science Foundation, project No. 19-26143X.

  13. Pacific white-sided dolphin hourly binned echolocation clicks

    • search.dataone.org
    • dataone.org
    • +2more
    Updated Jul 3, 2024
    Cite
    Michaela N. Alksne (2024). Pacific white-sided dolphin hourly binned echolocation clicks [Dataset]. https://search.dataone.org/view/sha256%3Ae073ff5e24b680e3f441e7d8178f242d25acafed4089485cd7da564d66c8d51b
    Explore at:
    Dataset updated
    Jul 3, 2024
    Dataset provided by
    Dryad Digital Repository
    Authors
    Michaela N. Alksne
    Time period covered
    Jan 1, 2024
    Description

    This study investigates the biogeographic patterns of Pacific white-sided dolphins (Lagenorhynchus obliquidens) in the Eastern North Pacific based on long-term passive acoustic records (2005-2021). We aim to elucidate the ecological and behavioral significance of distinct echolocation click types and their implications for population delineation, geographic distribution, environmental adaptation, and management. Over 50 cumulative years of Passive Acoustic Monitoring (PAM) data from 14 locations were analyzed using a deep neural network to classify two distinct Pacific white-sided dolphin echolocation click types. The study assessed spatial, diel, seasonal, and interannual patterns of the two click types, correlating them with major environmental drivers such as the El Niño Southern Oscillation and the North Pacific Gyre Oscillation, and modeling long-term spatial-seasonal patterns. Distinct spatial, seasonal, and diel patterns were observed for each click type. Significant biogeographi...

    Raw acoustic data was passed through a click detector which returned all acoustic signals within an expected frequency range and duration of odontocete echolocation clicks. An unsupervised clustering algorithm was run on the detections to group them into 5-minute bin-level averages. Cluster bins were then labeled as one of six categories by a trained neural network. Clusters labeled as either one of two Pacific white-sided dolphin click types were extracted and manually verified. Verified Pacific white-sided dolphin detections were then binned into 'click-positive minutes per hour', where a click-positive minute was a minute that contained any number of clicks. The time series of click-positive minutes per hour, for each click type, at multiple long-term recording locations, is included here.

    Pacific white-sided dolphin hourly binned echolocation clicks

    https://doi.org/10.5061/dryad.95x69p8rj

    Each CSV file contains the hourly acoustic presence of Pacific white-sided dolphin echolocation clicks. The files are formatted such that the click type and location are stored in the file header. For instance, "SCB_LoA.csv" represents the hourly presence of the LoA click type at recording station SCB_M.

    The recording effort has been included here as a CSV file titled "PWD_effort.csv". Therefore, a user can cross-reference the recording location, recording effort, and time series name. If one click type was not detected at a given recording site, then that time series was not included.

    Each of the datasets contains two columns:

    • The first column is the time bin in hours (deploymentHour)
    • The second is the number of click-positive minutes in that hour (Click_pos_min_per_hour), with a maximum of 60 positive minutes. If any number o...
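
    The 'click-positive minutes per hour' binning described above can be sketched as follows. This is an illustration only, not the authors' processing code: the function name and the sample detection times are hypothetical, and the actual encoding of deploymentHour in the CSV files is not stated in the description.

```python
from collections import defaultdict
from datetime import datetime


def click_positive_minutes_per_hour(click_times):
    """Count, for each hour, the minutes that contain at least one click."""
    minutes_by_hour = defaultdict(set)
    for t in click_times:
        hour = t.replace(minute=0, second=0, microsecond=0)
        minutes_by_hour[hour].add(t.minute)
    # A minute counts once however many clicks it holds, so each value is at most 60.
    return {hour: len(minutes) for hour, minutes in sorted(minutes_by_hour.items())}


# Hypothetical detection times, for illustration only.
clicks = [
    datetime(2015, 1, 1, 3, 5, 10),
    datetime(2015, 1, 1, 3, 5, 40),  # same minute as the previous click
    datetime(2015, 1, 1, 3, 17, 2),
    datetime(2015, 1, 1, 4, 0, 0),
]
for hour, count in click_positive_minutes_per_hour(clicks).items():
    print(hour.isoformat(), count)
```

    Hours without any detection are simply absent from this sketch's output; a full time series would also need the recording effort to distinguish true absence from periods with no recording.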
  14. DataSheet3_Molecular Characterization of the Highest Risk Adult Patients...

    • frontiersin.figshare.com
    txt
    Updated Jun 6, 2023
    + more versions
    Cite
    Trinh Nguyen; John W Pepper; Cu Nguyen; Yu Fan; Ying Hu; Qingrong Chen; Chunhua Yan; Daoud Meerzaman (2023). DataSheet3_Molecular Characterization of the Highest Risk Adult Patients With Acute Myeloid Leukemia (AML) Through Multi-Omics Clustering.CSV [Dataset]. http://doi.org/10.3389/fgene.2021.777094.s003
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 6, 2023
    Dataset provided by
    Frontiers
    Authors
    Trinh Nguyen; John W Pepper; Cu Nguyen; Yu Fan; Ying Hu; Qingrong Chen; Chunhua Yan; Daoud Meerzaman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Acute myeloid leukemia (AML) is a clinically heterogeneous group of cancers. While some patients respond well to chemotherapy, we describe here a subgroup with distinct molecular features that has very poor prognosis under chemotherapy. The classification of AML relies substantially on cytogenetics, but most cytogenetic abnormalities do not offer targets for development of targeted therapeutics. Therefore, it is important to create a detailed molecular characterization of the subgroup most in need of new targeted therapeutics.

    Methods: We used a multi-omics approach to identify a molecular subgroup with the worst response to chemotherapy, and to identify promising drug targets specifically for this AML subgroup.

    Results: Multi-omics clustering analysis resulted in three primary clusters among 166 AML adult cancer cases in TCGA data. One of these clusters, which we label as the high-risk molecular subgroup (HRMS), consisted of cases that responded very poorly to standard chemotherapy, with only about 10% survival to 2 years. The gene TP53 was mutated in most cases in this subgroup but not in all of them. The top six genes over-expressed in the HRMS subgroup included E2F4, CD34, CD109, MN1, MMLT3, and CD200. Multi-omics pathway analysis using RNA and CNA expression data identified in the HRMS subgroup over-activated pathways related to immune function, cell proliferation, and DNA damage.

    Conclusion: A distinct subgroup of AML patients are not successfully treated with chemotherapy, and urgently need targeted therapeutics based on the molecular features of this subgroup. Potential drug targets include over-expressed genes E2F4, and MN1, as well as mutations in TP53, and several over-activated molecular pathways.

  15. Raman spectral data for mature mouse placenta scans

    • zenodo.org
    • data.niaid.nih.gov
    Updated Jul 4, 2023
    Cite
    Arda Inanc (2023). Raman spectral data for mature mouse placenta scans [Dataset]. http://doi.org/10.5281/zenodo.8076483
    Explore at:
    Dataset updated
    Jul 4, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Arda Inanc
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Pre-processed and normalized Raman spectral data for three mouse placental tissue scans, and constructed image data at three different wavenumbers.

  16. Meta-analytic clustering dissociates brain activity and behavior profiles...

    • neurovault.org
    nifti
    Updated Oct 20, 2019
    + more versions
    Cite
    (2019). Meta-analytic clustering dissociates brain activity and behavior profiles across reward processing paradigms: k=5_MAG-1 [Dataset]. http://identifiers.org/neurovault.image:124625
    Explore at:
    Available download formats: nifti
    Dataset updated
    Oct 20, 2019
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    glassbrain

    Collection description

    We employed a data-driven, meta-analytic clustering approach to an extensive body of reward processing neuroimaging results archived in the BrainMap database (www.brainmap.org) to characterize meta-analytic groupings (MAGs) of reward processing experiments based on the spatial similarity of brain activation patterns. Using a data-driven, meta-analytic, k-means clustering approach, we dissociated five meta-analytic groupings (MAGs) of neuroimaging results (i.e., brain activation maps) from 749 experimental contrasts across 177 reward processing studies involving 13,345 healthy participants. We objectively identified a five-MAG solution which represented dissociated patterns of activation consistently occurring across reward processing tasks (MAG-1: ventral-striatal; MAG-2: dorsal-striatal; MAG-3: limbic-parietal; MAG-4: frontal-parietal; MAG-5: medial frontal-posterior cingulate). The optimal clustering-solution was selected based on majority rule of four information-theoretic metrics and, subsequently, convergent brain activity across each grouping of neuroimaging experiments was quantified via separate meta-analyses.

    To compile a large corpus of neuroimaging results across reward processing paradigms, we extracted activation coordinates reported in published studies that were archived in the BrainMap Database as of April 22, 2016, under the meta-data labels Reward, Delay Discounting, and Gambling (www.brainmap.org) (Fox et al., 2005; Fox & Lancaster, 2002; Laird et al., 2011). The vast majority (94.9%) of identified studies were archived under the Reward label with most Delay Discounting and Gambling studies being additionally archived under Reward. The Reward label denotes that the reported activation coordinates were identified in a task where a stimulus served to reinforce a desired response (e.g., monetary reward after a correct response) (www.brainmap.org/taxonomy/paradigms). Almost all studies included in the corpus were also archived under a variety of other meta-data labels (e.g. Task Switching (6.4%), Go/No-Go (2.9%), Visuospatial Attention (2.9%), Reasoning/Problem Solving (1.3%), Wisconsin Card Sorting Test (2.6%)) which is unsurprising as reward processing is a multifaceted construct, connecting elements of sensation, perception, cognitive control, and other mental operations.
    We considered only activation coordinates from published neuroimaging studies, among healthy participants, that were reported in standard Talairach (Talairach & Tournoux, 1988) or Montreal Neurological Institute (MNI) (Collins, 1994) space and derived from whole-brain statistical comparisons. Brain coordinates derived through behavioral correlations or a priori region of interest (ROI) analyses were excluded. As this meta-analysis aimed to investigate brain activation linked with typical reward processing, coordinates from groups of individuals with psychological or neuropsychiatric disorders (e.g., addictive disorders) were excluded from the corpus. Each included study provided at least one experimental contrast that statistically identified brain activity associated with a certain task-event defined by the original authors (e.g., a brain activity map). These experimental contrasts were summarized and curated in the BrainMap database as a set of brain activity foci linked either with phases of the original task (i.e., task response, anticipation of outcome, outcome delivery) or stimuli presented in the task (i.e., positive outcome, negative outcome, high reward, low reward). Foci from experimental contrasts can also reflect locations of brain activity linked with more abstract and computationally derived constructs of interest in the original study (e.g., learning rate, subjective value).
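
    The selection of the clustering solution by majority rule across metrics, mentioned in the collection description above, can be sketched as a simple vote. This is a minimal sketch with loud assumptions: the metric names and their recommended values are placeholders rather than the study's actual metrics or results, the vote is implemented as a plurality, and the description does not say how ties were resolved, so the sketch returns None in that case.

```python
from collections import Counter


def majority_rule(recommended_k):
    """Return the k recommended by the most metrics, or None on a tie."""
    counts = Counter(recommended_k.values()).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # no single winner; tie handling is not described in the source
    return counts[0][0]


# Hypothetical per-metric recommendations; the metric names are placeholders.
votes = {"metric_a": 5, "metric_b": 5, "metric_c": 4, "metric_d": 5}
print(majority_rule(votes))
```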

    Subject species

    homo sapiens

    Modality

    fMRI-BOLD

    Analysis level

    meta-analysis

    Cognitive paradigm (task)

    None / Other

    Map type

    Z

  17. Detection and classification of beaked whale echolocation clicks recorded on...

    • search.dataone.org
    • data.griidc.org
    Updated Jul 9, 2019
    Cite
    GRIIDC (2019). Detection and classification of beaked whale echolocation clicks recorded on bottom-moored EARS buoys in the northern Gulf of Mexico from July-October 2015 [Dataset]. https://search.dataone.org/view/R4-x261-000-0014-0003
    Explore at:
    Dataset updated
    Jul 9, 2019
    Dataset provided by
    GRIIDC
    Time period covered
    Jul 4, 2015 - Oct 11, 2015
    Area covered
    Description

    This dataset contains a subset of LADC passive acoustic system EARS buoys data which was collected in 2015 (data inventory can be found in R4.x261.233:0005) and is used to identify three different species of beaked whales in the Gulf of Mexico. The species of beaked whales examined in this dataset are Cuvier’s beaked whale, Gervais’ beaked whale, and an unidentified species that we labeled "BWG", which stands for Beaked Whale of the Gulf. Recordings were processed using a click detection algorithm. Then unsupervised as well as supervised classification algorithms were evaluated for distinguishing species by echolocation features.

  18. Table2_PolyReco: A Method to Automatically Label Collinear Regions and...

    • frontiersin.figshare.com
    docx
    Updated Jun 3, 2023
    + more versions
    Cite
    Fushun Wang; Kang Zhang; Ruolan Zhang; Hongquan Liu; Weijin Zhang; Zhanxiao Jia; Chunyang Wang (2023). Table2_PolyReco: A Method to Automatically Label Collinear Regions and Recognize Polyploidy Events Based on the KS Dotplot.DOCX [Dataset]. http://doi.org/10.3389/fgene.2022.842387.s002
    Explore at:
    Available download formats: docx
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    Frontiers
    Authors
    Fushun Wang; Kang Zhang; Ruolan Zhang; Hongquan Liu; Weijin Zhang; Zhanxiao Jia; Chunyang Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Polyploidization plays a critical role in producing new gene functions and promoting species evolution. Effective identification of polyploid types can be helpful in exploring the evolutionary mechanism. However, current methods for detecting polyploid types have major limitations, such as being time-consuming and highly subjective. To recognize collinearity fragments and polyploid types objectively, we developed the PolyReco method, which automatically labels collinear regions and recognizes polyploidy events based on the KS dotplot. Combined with whole-genome collinearity analysis, PolyReco uses the DBSCAN clustering method to cluster KS dots. According to the distance information in the x-axis and y-axis directions between the categories, the clustering results are merged based on certain rules to obtain the collinear regions, so that collinear fragments are recognized and labeled automatically. According to the information of the labeled collinear regions on the y-axis, the polyploidization recognition algorithm exhaustively enumerates combinations, obtains the genetic collinearity evaluation index of each combination, and then draws the genetic collinearity evaluation index graph. Based on the inflection point on the graph, polyploid types and related chromosomes with polyploidy signal can be detected. Validation experiments showed that the conclusions of PolyReco were consistent with the previous study, which verified the effectiveness of this method. This approach is expected to serve as a reference architecture for other polyploid type classification methods.
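
    The DBSCAN step named above can be illustrated with a generic, brute-force version of the algorithm. This is a textbook sketch, not PolyReco's implementation: the point coordinates, eps, and min_pts values are hypothetical, and the rule-based merging of clusters that follows in PolyReco is not reproduced.

```python
from math import dist


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # Brute-force region query; the result includes the point itself.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point, so keep expanding
    return labels


# Hypothetical dot coordinates: two dense groups and one isolated dot.
dots = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
        (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
        (10.0, 0.0)]
print(dbscan(dots, eps=0.5, min_pts=3))
```

    Because the region query is brute force, the sketch runs in quadratic time; a spatial index would be the usual choice for dotplots with many dots.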

  19. Cluster Analysis Software Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Mar 15, 2025
    + more versions
    Cite
    Archive Market Research (2025). Cluster Analysis Software Report [Dataset]. https://www.archivemarketresearch.com/reports/cluster-analysis-software-59553
    Explore at:
    Available download formats: pdf, doc, ppt
    Dataset updated
    Mar 15, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global market for Cluster Analysis Software is experiencing robust growth, driven by the increasing adoption of big data analytics and the need for advanced data interpretation across diverse sectors. While precise market sizing data is unavailable, considering the growth observed in related fields like data analytics and AI, a reasonable estimate for the 2025 market size could be placed between $2.5 billion and $3 billion. This estimate assumes a moderate growth trajectory reflecting the maturation of the cluster analysis market and the ongoing integration of these tools into broader business intelligence platforms. Assuming a Compound Annual Growth Rate (CAGR) of 15% for the forecast period (2025-2033), the market is projected to reach a substantial size within the next decade. This growth is fueled by several key drivers, including the expanding availability of large datasets, the growing demand for data-driven decision-making across industries like BFSI (Banking, Financial Services, and Insurance), government, and commercial sectors, and the continuous development of more sophisticated algorithms and user-friendly interfaces for cluster analysis software. The cloud-based segment is expected to dominate, given its scalability and accessibility benefits, although web-based applications will continue to hold a significant market share. Geographic growth will be diverse, with North America and Europe maintaining strong positions due to advanced analytics adoption, but significant expansion is also expected in the Asia-Pacific region as technological advancement and data infrastructure improve. However, challenges like data privacy concerns, the need for skilled professionals, and the high cost of advanced software solutions could act as market restraints in certain regions. 

    The competitive landscape is marked by a mix of established players such as IBM, Microsoft, and TIBCO Software, along with a growing number of specialized vendors and emerging technology companies. The market is characterized by ongoing innovation in areas like algorithm development, enhanced visualization capabilities, and the integration of cluster analysis with other advanced analytics tools. This continuous innovation will be a key driver in sustaining the market's high CAGR and ensuring its continued growth in the coming years. Increased focus on providing tailored solutions for specific industry verticals will likely be a strategic advantage for vendors seeking a competitive edge. The market's future hinges on its ability to effectively address the challenges of data complexity, security, and user-friendliness while continuing to deliver accurate and actionable insights.
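
    The description quotes a 2025 estimate of $2.5 to $3 billion and an assumed CAGR of 15% but does not state the resulting 2033 figure. A rough worked calculation, assuming eight annual compounding periods between 2025 and 2033 (an assumption, since the report does not say how the forecast window is counted), gives about $7.6 to $9.2 billion:

```python
def project(value, cagr, years):
    """Compound `value` at annual rate `cagr` for `years` years."""
    return value * (1.0 + cagr) ** years


YEARS = 2033 - 2025  # assumes 8 compounding periods over the forecast window
for size_2025 in (2.5, 3.0):  # USD billions, the estimate range quoted above
    print(f"{size_2025:.1f} -> {project(size_2025, 0.15, YEARS):.2f}")
```

    These values follow only from the stated assumptions and inherit all the uncertainty of the underlying estimate.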

  20. Diverse Topologies for Evaluation of Geometric Similarity Metrics

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Mar 16, 2022
    Cite
    Nivesh Dommaraju; Mariusz Bujny; Stefan Menzel; Markus Olhofer; Fabian Duddeck (2022). Diverse Topologies for Evaluation of Geometric Similarity Metrics [Dataset]. http://doi.org/10.5281/zenodo.6323251
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 16, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nivesh Dommaraju; Mariusz Bujny; Stefan Menzel; Markus Olhofer; Fabian Duddeck
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A collection of 7 datasets with each set containing 3D shapes with varying topological complexity. The datasets can be used to compare different metrics of geometric dissimilarity. Two of the datasets have topologically complex shapes that resemble designs obtained from topology optimization, a widely used design optimization method for engineering structures.

    We used this dataset for a related journal article with the following abstract: "In the early stages of engineering design, multitudes of feasible designs can be generated using structural optimization methods by varying the design requirements or user preferences for different performance objectives. Data mining such potentially large datasets is a challenging task. An unsupervised data-centric approach for exploring designs is to find clusters of similar designs and recommend only the cluster representatives for review. Design similarity can be defined not only on a purely functional level but also based on geometric properties, such as size, shape, and topology. While metrics such as chamfer distance measure the geometrical differences intuitively, it is more useful for design exploration to use metrics based on geometric features, which are extracted from high-dimensional 3D geometric data using dimensionality reduction techniques. If the Euclidean distance in the geometric features is meaningful, the features can be combined with performance attributes resulting in an aggregate feature vector that can potentially be useful in design exploration based on both geometry and performance. We propose a novel approach to evaluate such derived metrics by measuring their similarity with the metrics commonly used in 3D object classification. Furthermore, we measure clustering accuracy, which is a state-of-the-art unsupervised approach to evaluate metrics. For this purpose, we use a labeled, synthetic dataset with topologically complex designs. From our results, we conclude that Pointcloud Autoencoder is promising in encoding geometric features and developing a comprehensive design exploration method."

    For each dataset, shapes/designs are saved as surface mesh files (extension: stl) and point cloud files (extension: ply) in the folders "stls" and "plys" respectively. A brief description of the 7 different datasets is in the following table. For each dataset, the designs are named using numbers starting from 0, e.g., “0.stl, 1.stl, …, 19.stl” in the folder for the surface mesh files. Some of the datasets are labeled, i.e., each design belongs to a class. In a labeled dataset, all classes have the same number of designs, and the designs are named in the order of their class. For example, a labeled dataset with 4 designs and 2 classes contains files whose names start with {0, 1, 2, 3} where the designs {0, 1} belong to class 1, and {2, 3} belong to class 2.

    Dataset name                      Directory name               Number of designs   Number of classes
    Beam-rotation                     "rotate_beam"                20                  None
    Beam-elongation                   "elongate_beam"              20                  None
    Beam-translation                  "move_beam"                  20                  None
    Three cube trusses                "three_cube_truss"           150                 6
    Single cube trusses               "single_cube_truss"          275                 11
    Random topologies                 "three_cube_truss_random"    1000                50
    Topologically optimized designs   "cube_opt_shapes"            1500                None
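
    The naming convention for labeled datasets described above (equal-sized classes, designs numbered in class order) implies that the class of a design can be recovered from its file number. A minimal sketch, with a hypothetical function name:

```python
def design_class(index, num_designs, num_classes):
    """Return the 1-based class of design `index` under the naming scheme above."""
    if num_designs % num_classes != 0:
        raise ValueError("classes are described as equally sized")
    if not 0 <= index < num_designs:
        raise IndexError("design index out of range")
    per_class = num_designs // num_classes
    return index // per_class + 1


# The 4-design, 2-class example from the description: {0, 1} -> 1, {2, 3} -> 2.
print([design_class(i, 4, 2) for i in range(4)])
```

    For a file such as "7.stl", the index is the integer before the extension.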