56 datasets found
  1. Additional file 3 of Mugen-UMAP: UMAP visualization and clustering of...

    • figshare.com
    • springernature.figshare.com
    csv
    Updated Sep 28, 2024
    Cite
    Teng Li; Yiran Zou; Xianghan Li; Thomas K. F. Wong; Allen G. Rodrigo (2024). Additional file 3 of Mugen-UMAP: UMAP visualization and clustering of mutated genes in single-cell DNA sequencing data [Dataset]. http://doi.org/10.6084/m9.figshare.27123950.v1
    Available download formats: csv
    Dataset updated
    Sep 28, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Teng Li; Yiran Zou; Xianghan Li; Thomas K. F. Wong; Allen G. Rodrigo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Supplementary file 3. AnnData format of the 12 NSCLC patients dataset.

  2. CalCENv1 co-expression network UMAP clusters

    • data.niaid.nih.gov
    Updated Dec 6, 2020
    Cite
    Teresa O'Meara; Matthew O'Meara (2020). CalCENv1 co-expression network UMAP clusters [Dataset]. https://data.niaid.nih.gov/resources?id=ds_4a1633821f
    Dataset updated
    Dec 6, 2020
    Dataset provided by
    University of Michigan, Department of Computational Medicine and Bioinformatics, Ann Arbor, USA
    Department of Microbiology and Immunology, University of Michigan Medical School, Ann Arbor, USA
    Authors
    Teresa O'Meara; Matthew O'Meara
    Description

    The CalCENv1 co-expression network was projected to two dimensions using UMAP, and 18 clusters were identified and annotated through gene set enrichment analysis.

  3. Clustering files Iteration1

    • figshare.com
    txt
    Updated Apr 30, 2023
    Cite
    Lucie Kulhankova; Eric Bindels; Daniel Kling; Manfred Kayser; Eskeatnaf Mulugeta (2023). Clustering files Iteration1 [Dataset]. http://doi.org/10.6084/m9.figshare.21790058.v2
    Available download formats: txt
    Dataset updated
    Apr 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Lucie Kulhankova; Eric Bindels; Daniel Kling; Manfred Kayser; Eskeatnaf Mulugeta
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Clustering files for figures including iteration 1 of each mixture.

  4. Data from: Reference transcriptomics of porcine peripheral immune cells...

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    • +2 more
    Updated Jun 5, 2025
    Cite
    Agricultural Research Service (2025). Data from: Reference transcriptomics of porcine peripheral immune cells created through bulk and single-cell RNA sequencing [Dataset]. https://catalog.data.gov/dataset/data-from-reference-transcriptomics-of-porcine-peripheral-immune-cells-created-through-bul-e667c
    Dataset updated
    Jun 5, 2025
    Dataset provided by
    Agricultural Research Service
    Description

    This dataset contains files reconstructing single-cell data presented in 'Reference transcriptomics of porcine peripheral immune cells created through bulk and single-cell RNA sequencing' by Herrera-Uribe & Wiarda et al. 2021. Samples of peripheral blood mononuclear cells (PBMCs) were collected from seven pigs and processed for single-cell RNA sequencing (scRNA-seq) in order to provide a reference annotation of porcine immune cell transcriptomics at enhanced, single-cell resolution. Analysis of single-cell data allowed identification of 36 cell clusters that were further classified into 13 cell types, including monocytes, dendritic cells, B cells, antibody-secreting cells, numerous populations of T cells, NK cells, and erythrocytes. Files may be used to reconstruct the data as presented in the manuscript, allowing for individual query by other users. Scripts for original data analysis are available at https://github.com/USDA-FSEPRU/PorcinePBMCs_bulkRNAseq_scRNAseq. Raw data are available at https://www.ebi.ac.uk/ena/browser/view/PRJEB43826. Funding for this dataset was also provided by NRSP8: National Animal Genome Research Program (https://www.nimss.org/projects/view/mrp/outline/18464).

    Resources in this dataset:

    • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells 10X Format. File Name: PBMC7_AllCells.zip. Resource Description: Zipped folder containing PBMC counts matrix, gene names, and cell IDs. Files are as follows: matrix of gene counts* (matrix.mtx.gz); gene names (features.tsv.gz); cell IDs (barcodes.tsv.gz). *The ‘raw’ count matrix is actually gene counts obtained following ambient RNA removal. During ambient RNA removal, we specified to calculate non-integer count estimations, so most gene counts are actually non-integer values in this matrix but should still be treated as raw/unnormalized data that requires further normalization/transformation. Data can be read into R using the function Read10X().
    • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells Metadata. File Name: PBMC7_AllCells_meta.csv. Resource Description: .csv file containing metadata for cells included in the final dataset. Metadata columns include: nCount_RNA = the number of transcripts detected in a cell; nFeature_RNA = the number of genes detected in a cell; Loupe = cell barcodes, which correspond to the cell IDs found in the .h5Seurat and 10X formatted objects for all cells; prcntMito = percent mitochondrial reads in a cell; Scrublet = doublet probability score assigned to a cell; seurat_clusters = cluster ID assigned to a cell; PaperIDs = sample ID for a cell; celltypes = cell type ID assigned to a cell.
    • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells PCA Coordinates. File Name: PBMC7_AllCells_PCAcoord.csv. Resource Description: .csv file containing first 100 PCA coordinates for cells.
    • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells t-SNE Coordinates. File Name: PBMC7_AllCells_tSNEcoord.csv. Resource Description: .csv file containing t-SNE coordinates for all cells.
    • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells UMAP Coordinates. File Name: PBMC7_AllCells_UMAPcoord.csv. Resource Description: .csv file containing UMAP coordinates for all cells.
    • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - CD4 T Cells t-SNE Coordinates. File Name: PBMC7_CD4only_tSNEcoord.csv. Resource Description: .csv file containing t-SNE coordinates for only CD4 T cells (clusters 0, 3, 4, 28). A dataset of only CD4 T cells can be re-created from the PBMC7_AllCells.h5Seurat, and t-SNE coordinates used in publication can be re-assigned using this .csv file.
    • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - CD4 T Cells UMAP Coordinates. File Name: PBMC7_CD4only_UMAPcoord.csv. Resource Description: .csv file containing UMAP coordinates for only CD4 T cells (clusters 0, 3, 4, 28). A dataset of only CD4 T cells can be re-created from the PBMC7_AllCells.h5Seurat, and UMAP coordinates used in publication can be re-assigned using this .csv file.
    • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - Gamma Delta T Cells UMAP Coordinates. File Name: PBMC7_GDonly_UMAPcoord.csv. Resource Description: .csv file containing UMAP coordinates for only gamma delta T cells (clusters 6, 21, 24, 31). A dataset of only gamma delta T cells can be re-created from the PBMC7_AllCells.h5Seurat, and UMAP coordinates used in publication can be re-assigned using this .csv file.
    • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - Gamma Delta T Cells t-SNE Coordinates. File Name: PBMC7_GDonly_tSNEcoord.csv. Resource Description: .csv file containing t-SNE coordinates for only gamma delta T cells (clusters 6, 21, 24, 31). A dataset of only gamma delta T cells can be re-created from the PBMC7_AllCells.h5Seurat, and t-SNE coordinates used in publication can be re-assigned using this .csv file.
    • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - Gene Annotation Information. File Name: UnfilteredGeneInfo.txt. Resource Description: .txt file containing gene nomenclature information used to assign gene names in the dataset. 'Name' column corresponds to the name assigned to a feature in the dataset.
    • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells H5Seurat. File Name: PBMC7.tar. Resource Description: .h5Seurat object of all cells in PBMC dataset. File needs to be untarred, then read into R using function LoadH5Seurat().
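
    The description above says that subset coordinates "can be re-assigned" from the .csv files. A minimal sketch of that join, written in Python rather than the R workflow the dataset documents, might look as follows. The `Loupe`, `seurat_clusters`, and `celltypes` columns come from the metadata description; the coordinate column names (`barcode`, `UMAP_1`, `UMAP_2`), the barcodes, and the cell type labels are hypothetical stand-ins, since the listing does not show the real file layout.

```python
import csv
import io


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def attach_coordinates(meta_rows, coord_rows, meta_key="Loupe", coord_key="barcode"):
    """Inner-join coordinate rows onto metadata rows by cell barcode.

    `coord_key` and the coordinate column names are assumptions; check the
    header of the real coordinate file before using this.
    """
    coords = {row[coord_key]: row for row in coord_rows}
    joined = []
    for row in meta_rows:
        match = coords.get(row[meta_key])
        if match is None:
            continue  # cell is not part of the subset (e.g. not a CD4 T cell)
        merged = dict(row)
        merged.update({k: float(v) for k, v in match.items() if k != coord_key})
        joined.append(merged)
    return joined


# Tiny in-memory stand-ins for PBMC7_AllCells_meta.csv and
# PBMC7_CD4only_UMAPcoord.csv (all values are made up for illustration).
META = (
    "Loupe,seurat_clusters,celltypes\n"
    "AAAC-1,0,CD4 T\n"
    "AAAG-1,6,GD T\n"
    "AACT-1,3,CD4 T\n"
)
COORDS = "barcode,UMAP_1,UMAP_2\nAAAC-1,1.5,-2.0\nAACT-1,0.25,3.0\n"

cd4 = attach_coordinates(read_rows(META), read_rows(COORDS))
for cell in cd4:
    print(cell["Loupe"], cell["seurat_clusters"], cell["UMAP_1"], cell["UMAP_2"])
```

    With the real files, the two strings would be replaced by the contents of the metadata and coordinate .csv files; cells missing from the coordinate file are dropped, which is what restricts the result to the subset.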

  5. Library 2 UMAP clusters.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Jul 19, 2023
    + more versions
    Cite
    Lagor, William R.; Beddow, Thomas; Huang, Qin; Sorensen, Hikari; Chan, Ken Y.; Chen, Albert T.; Azari, Bahar; Zheng, Qingxia; Walkey, Christopher J.; Heaney, Jason D.; Tobey, Isabelle G.; Zhao, Binhui; Moncada-Reid, Cynthia; Eid, Fatma-Elzahraa; Chan, Yujia A.; Deverman, Benjamin E.; Ljungberg, M. Cecilia; Barry, Andrew J. (2023). Library 2 UMAP clusters. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001038030
    Dataset updated
    Jul 19, 2023
    Authors
    Lagor, William R.; Beddow, Thomas; Huang, Qin; Sorensen, Hikari; Chan, Ken Y.; Chen, Albert T.; Azari, Bahar; Zheng, Qingxia; Walkey, Christopher J.; Heaney, Jason D.; Tobey, Isabelle G.; Zhao, Binhui; Moncada-Reid, Cynthia; Eid, Fatma-Elzahraa; Chan, Yujia A.; Deverman, Benjamin E.; Ljungberg, M. Cecilia; Barry, Andrew J.
    Description

    Viruses have evolved the ability to bind and enter cells through interactions with a wide variety of cell macromolecules. We engineered peptide-modified adeno-associated virus (AAV) capsids that transduce the brain through the introduction of de novo interactions with 2 proteins expressed on the mouse blood–brain barrier (BBB), LY6A or LY6C1. The in vivo tropisms of these capsids are predictable as they are dependent on the cell- and strain-specific expression of their target protein. This approach generated hundreds of capsids with dramatically enhanced central nervous system (CNS) tropisms within a single round of screening in vitro and secondary validation in vivo thereby reducing the use of animals in comparison to conventional multi-round in vivo selections. The reproducible and quantitative data derived via this method enabled both saturation mutagenesis and machine learning (ML)-guided exploration of the capsid sequence space. Notably, during our validation process, we determined that nearly all published AAV capsids that were selected for their ability to cross the BBB in mice leverage either the LY6A or LY6C1 protein, which are not present in primates. This work demonstrates that AAV capsids can be directly targeted to specific proteins to generate potent gene delivery vectors with known mechanisms of action and predictable tropisms.

  6. Clustering files iteration 2

    • figshare.com
    txt
    Updated Apr 30, 2023
    Cite
    Lucie Kulhankova; Diego Montiel Gonzáles; Eric Bindels; Daniel Kling; Manfred Kayser; Eskeatnaf Mulugeta (2023). Clustering files iteration 2 [Dataset]. http://doi.org/10.6084/m9.figshare.21790061.v2
    Available download formats: txt
    Dataset updated
    Apr 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Lucie Kulhankova; Diego Montiel Gonzáles; Eric Bindels; Daniel Kling; Manfred Kayser; Eskeatnaf Mulugeta
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Clustering files for iteration 2 of each mixture.

  7. Additional file 1 of Single-cell characterisation of tissue homing CD4 + and...

    • springernature.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated Aug 18, 2024
    Cite
    Dipabarna Bhattacharya; Jason Theodoropoulos; Katariina Nurmi; Timo Juutilainen; Kari K. Eklund; Riitta Koivuniemi; Tiina Kelkka; Satu Mustjoki; Tapio Lönnberg (2024). Additional file 1 of Single-cell characterisation of tissue homing CD4 + and CD8 + T cell clones in immune-mediated refractory arthritis [Dataset]. http://doi.org/10.6084/m9.figshare.25572505.v1
    Available download formats: xls
    Dataset updated
    Aug 18, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Dipabarna Bhattacharya; Jason Theodoropoulos; Katariina Nurmi; Timo Juutilainen; Kari K. Eklund; Riitta Koivuniemi; Tiina Kelkka; Satu Mustjoki; Tapio Lönnberg
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 1:

    Figure S1. Phenotypic characterization of CD45 + cells in the cohort. a The patient cohort and the experimental design. Tissue types, cell sorting approaches, and sample processing batches have been highlighted. b Quality control of cells: analysis excluded cells that were considered low-quality based on the following criteria: > 15% reads from mitochondrially-encoded transcripts,  50% ribosomal transcripts,  4,500 expressed genes, or  20,000 UMI counts. c UMAP representation of all cells from the IMA samples (n = 3) profiled with scRNA + TCRαβ-seq without any batch correction. d UMAP representation of the cells representing different tissue of origin in each cluster as identified in UMAP presented in 1c.

    Figure S2. Phenotypic characterization of CD45 + cells in the cohort (cont.). a UMAP representation of all CD45 + cells from IMA samples (n = 3) profiled with scRNA + TCRαβ-seq. Clusters were defined based on the expression of canonical markers and known cell surface expression of different CD45 + cell types. b The expression of canonical markers of different cell types in each cluster as identified in UMAP in 1b. c Proportion of cells from individual patients in each cluster. d Absolute number of cells from individual patients in each cluster. e UMAP representation of the IMA dataset split by individual patients.

    Figure S3. Phenotypic characterization of CD45 + cells in the cohort (cont.). a Feature plot of inflammation (LMNA, CREM) and activation (CXCL13, CCL5, CCL4 and RGCC) associated genes identified in the UMAP 1b. b Activation, inflammation, exhaustion, and inhibitory module scores as compared between tissues. c Activation, inflammation, exhaustion, and inhibitory module scores as compared between cells belonging to expanded versus not expanded clones. d Clonality index (Gini, higher Gini denotes more clonal) between ST and PB. e Clonality index (Gini) between ST and PB in individual samples.

    Figure S4. Phenotypic characterization of CD4 + T cells in the cohort. a Proportion of the cells from ST and PB in each of the CD4 + T cell clusters shown in Fig. 2b. b Relative proportion of the cells from ST and PB in each cluster as identified in the UMAP in Fig. 2b, each cluster represents 100% of the cell population. c Left: Proportion of the cells from individual patients as identified in the UMAP in Fig. 2b. Right: Proportion of the cells from individual samples as identified in the UMAP in Fig. 2b. d Expression of phenotypic markers in CD4 + T cell clusters. e Expression of phenotypic markers as identified in the UMAP 2b. f Clonal overlap (as measured by TCR similarity by Morisita index) between clusters as identified in the UMAP 2b. g Proportion of cells in different phases of cell cycle in each cluster as identified in the UMAP 2b in all CD4 + T cells in ST and PB. h The expression of proliferation-associated transcripts G0S2, FABP5, and MKI67 in ST compared to PB. (Bonferroni corrected two-sided t-test).

    Figure S5. Phenotypic characterization of CD4 + cells in the cohort (cont.). a Proliferation score (based on the expression of proliferation associated genes) of Expanded versus Non-expanded cells. b Samples were merged with scVI with tissue of origin as the batch key, to reduce batch effect. Left: UMAP representation and unsupervised clustering of the cells. Right: UMAP representation of the cells, coloured according to tissue origin. c UMAP representation of all cells from the Wu et al. cohort (Wu et al. 2021), coloured according to unsupervised clustering (left) and tissue origin (right). d A proliferation score (based on the expression of proliferation associated genes) of synovial membrane versus peripheral blood in the Wu et al. cohort (Wu et al. 2021).

    Figure S6. Phenotypic characterization of CD8 + cells in the cohort. a Proportion of the cells from ST and PB in each of the clusters shown in Fig. 2f. b Relative proportion of the cells from ST and PB in each cluster as identified in the UMAP in Fig. 2f, each cluster represents 100% of the cell population. c Left: Proportion of the cells from individual patients as identified in the UMAP in Fig. 2f. Right: Proportion of the cells from individual samples as identified in the UMAP in Fig. 2f. d Proportion of cells in different phases of cell cycle. e The expression of phenotypic markers in the clusters shown in Fig. 2f.

    Figure S7. Clonal trafficking of CD4 + and CD8 + T cells between PB and ST. a UMAP representation of all cells with intersecting clones between ST and PB as identified in the UMAP 2b. b UMAP representation of all cells with intersecting clones split by original patient between ST and PB as identified in the UMAP 2b. c Proportion of intersecting clones between ST and PB in different clusters as identified in the UMAP 2b. d Proportion of intersecting clones between ST and PB in different tissue of origin as identified in the UMAP 2b. e Antigen-specificities of the TCR repertoire from CD4 + cells matched against VDJdb. The most common target species have been highlighted. f Proportion of cells in different clusters identified with a unifying motif as predicted by GLIPH2. g Proportion of intersecting clones between ST and PB in different clusters as identified in the UMAP 2f. h Proportion of intersecting clones between ST and PB in different tissue of origin as identified in the UMAP 2f.

  8. CellTracksColab - T cell dataset (full)

    • zenodo.org
    zip
    Updated Jan 20, 2024
    Cite
    Guillaume Jacquemet (2024). CellTracksColab - T cell dataset (full) [Dataset]. http://doi.org/10.5281/zenodo.10539720
    Available download formats: zip
    Dataset updated
    Jan 20, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Guillaume Jacquemet
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset used in the manuscript "CellTracksColab—A platform for compiling, analyzing, and exploring tracking data"

    This Zenodo archive contains:

    • The raw video (Tracks.zip)
    • The tracking files as XML and CSV files (Tracks.zip)
    • The CellTracksColab dataframes storing the dataset (CellTracksColab_results.zip)
    • The CellTracksColab outputs used to make the figures in the paper (CellTracksColab_results.zip)

    In brief:

    Lab-Tek 8 chamber slides (ThermoFisher) were coated overnight at 4°C with either 2 μg/mL ICAM-1 or VCAM-1. Activated primary mouse CD4+ T cells were then washed and suspended in L-15 media supplemented with 2 mg/mL D-glucose. These T cells were placed into the chamber slides and incubated for 20 minutes, after which a gentle wash removed all unattached cells. Imaging was performed at 37°C with a 10x phase contrast objective on a Zeiss Axiovert 200M microscope equipped with an automated X-Y stage and a Roper EMCCD camera. Time-lapse images were acquired at 1-minute intervals over 10 minutes using SlideBook 6 software from Intelligent Imaging Innovations.

    Cells were automatically tracked using StarDist, directly implemented within TrackMate. The StarDist model was trained using ZeroCostDL4Mic and is publicly available on Zenodo. This model generated excellent segmentation results on our test dataset (F1 score > 0.99). In TrackMate, the StarDist detector custom model (score threshold = 0.41 and overlap threshold = 0.5) and the Simple LAP tracker (linking max distance = 30 µm; gap closing max distance = 15 µm, gap closing max frame gap = 2 frames) were used.

    In CellTracksColab, we conducted a dimensionality reduction analysis employing Uniform Manifold Approximation and Projection (UMAP). The UMAP settings were as follows: number of neighbors (n_neighbors) set to 20, minimum distance (min_dist) to 0, and number of dimensions (n_dimension) to 2. This analysis utilized an array of track metrics, including:

    NUMBER_SPOTS, NUMBER_GAPS, NUMBER_SPLITS, NUMBER_MERGES, NUMBER_COMPLEX, LONGEST_GAP, TRACK_DURATION, TRACK_DISPLACEMENT, TRACK_MEAN_SPEED, TRACK_MAX_SPEED, TRACK_MIN_SPEED, TRACK_MEDIAN_SPEED, TRACK_STD_SPEED, TRACK_MEAN_QUALITY, TOTAL_DISTANCE_TRAVELED, MAX_DISTANCE_TRAVELED, CONFINEMENT_RATIO, MEAN_STRAIGHT_LINE_SPEED, LINEARITY_OF_FORWARD_PROGRESSION, MEAN_DIRECTIONAL_CHANGE_RATE, Directionality, Tortuosity, Total_Turning_Angle, Spatial_Coverage, MEAN_CIRCULARITY, MEAN_SOLIDITY, MEAN_SHAPE_INDEX, MEDIAN_CIRCULARITY, MEDIAN_SOLIDITY, MEDIAN_SHAPE_INDEX, STD_CIRCULARITY, STD_SOLIDITY, STD_SHAPE_INDEX, MIN_CIRCULARITY, MIN_SOLIDITY, MIN_SHAPE_INDEX, MAX_CIRCULARITY, MAX_SOLIDITY, MAX_SHAPE_INDEX

    Subsequently, clustering analysis was performed using Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN). The parameters included clustering_data_source set to UMAP, min_samples at 20, min_cluster_size at 200, and the metric employed was Euclidean.
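
    The UMAP and HDBSCAN settings reported above can be written down as a short Python sketch. This is not the CellTracksColab code itself; it assumes the `umap-learn` and `hdbscan` packages, and it assumes that the description's `n_dimension` corresponds to umap-learn's `n_components` argument. Any preprocessing of the track metrics (for example scaling) is not described in the listing and is left out here.

```python
# Parameter values are taken from the dataset description; the mapping of
# "n_dimension" onto umap-learn's `n_components` is an assumption.
UMAP_PARAMS = {"n_neighbors": 20, "min_dist": 0.0, "n_components": 2}
HDBSCAN_PARAMS = {"min_samples": 20, "min_cluster_size": 200, "metric": "euclidean"}


def embed_and_cluster(track_metrics):
    """Embed a (n_tracks, n_metrics) array with UMAP, then cluster the embedding.

    Returns the 2-D embedding and one integer label per track; HDBSCAN
    assigns the label -1 to tracks it treats as noise.
    """
    # Imported lazily so the parameter dictionaries above can be inspected
    # even when the third-party packages are not installed.
    import umap      # provided by the umap-learn package
    import hdbscan   # provided by the hdbscan package

    embedding = umap.UMAP(**UMAP_PARAMS).fit_transform(track_metrics)
    # clustering_data_source = UMAP: cluster the embedding, not the raw metrics.
    labels = hdbscan.HDBSCAN(**HDBSCAN_PARAMS).fit_predict(embedding)
    return embedding, labels
```

    Because `min_cluster_size` is 200, the function only yields non-noise clusters when it is given at least several hundred tracks.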

  9. Reddit topics dataset

    • kaggle.com
    zip
    Updated Oct 2, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Stefano viel (2024). Reddit topics dataset [Dataset]. https://www.kaggle.com/stefano1283/reddit-topic-dataset
    Available download formats: zip (26120858380 bytes)
    Dataset updated
    Oct 2, 2024
    Authors
    Stefano viel
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Interactive plots: https://www.redditopics.xyz

    Reddit Topics Dataset Overview

    This dataset provides a structured analysis of Reddit posts, originally sourced from the Pushshift dataset. It contains 43 million posts, reduced from an initial 1.7 billion through a robust pipeline focused on efficient data handling. The goal of this dataset is to categorize unstructured text posts into meaningful topics, making it easier to analyze the massive volume of Reddit data.

    The analysis was performed using the BERTopic model, which combines natural language processing (NLP) techniques with advanced clustering methods to identify and label distinct topics. The process involved embedding posts with BERT, reducing dimensionality using UMAP, clustering with HDBSCAN, and representing topics using c-TF-IDF. Finally, ChatGPT was used to assign human-readable names to each topic.
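
    Of the steps listed above, the c-TF-IDF topic representation is the easiest to illustrate without heavy dependencies. The sketch below is a simplified, standard-library version of the class-based weighting as described in the BERTopic documentation (term frequency within a topic, normalized per topic, multiplied by log(1 + A / f), where A is the average number of words per topic and f is the term's frequency across all topics). It is not the dataset author's pipeline, and the two toy topics and their posts are invented for illustration.

```python
import math
from collections import Counter


def c_tf_idf(docs_by_topic):
    """Return {topic: {term: weight}} using a simplified c-TF-IDF."""
    # Treat all documents of a topic as one bag of words.
    tf = {
        topic: Counter(" ".join(docs).lower().split())
        for topic, docs in docs_by_topic.items()
    }
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    avg_words = sum(total.values()) / len(tf)  # A: average words per topic

    weights = {}
    for topic, counts in tf.items():
        n_words = sum(counts.values())
        weights[topic] = {
            term: (n / n_words) * math.log(1 + avg_words / total[term])
            for term, n in counts.items()
        }
    return weights


def top_terms(term_weights, k=2):
    """Highest-weighted terms, ties broken alphabetically."""
    ranked = sorted(term_weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical posts, already grouped by the cluster they were assigned to.
posts = {
    "gaming": ["new console game release", "game review console"],
    "cooking": ["easy pasta recipe", "pasta sauce recipe review"],
}
w = c_tf_idf(posts)
print(top_terms(w["gaming"]))   # ['console', 'game']
print(top_terms(w["cooking"]))  # ['pasta', 'recipe']
```

    The word "review" appears in both topics, so its weight is lower than that of a word such as "new" that occurs only once but in a single topic; that discounting of shared vocabulary is what makes the top terms usable as topic labels.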

    This dataset is useful for anyone interested in social media analysis, natural language processing, or large-scale text analysis. It includes both the original posts and the topics assigned to each post.

  10. Additional file 4 of GECO: gene expression clustering optimization app for...

    • springernature.figshare.com
    txt
    Updated May 31, 2023
    Cite
    A. N. Habowski; T. J. Habowski; M. L. Waterman (2023). Additional file 4 of GECO: gene expression clustering optimization app for non-linear data visualization of patterns [Dataset]. http://doi.org/10.6084/m9.figshare.13642379.v1
    Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    A. N. Habowski; T. J. Habowski; M. L. Waterman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 4: CSV file of colon crypt bulk RNA-seq data used for GECO UMAP generation.

  11. Data from: Robust clustering and interpretation of scRNA-seq data using...

    • data.niaid.nih.gov
    Updated May 30, 2021
    Cite
    Florian Schmidt; Bobby Ranjan (2021). Robust clustering and interpretation of scRNA-seq data using reference component analysis [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4021966
    Dataset updated
    May 30, 2021
    Dataset provided by
    Genome Institute of Singapore (GIS), A*Star
    Authors
    Florian Schmidt; Bobby Ranjan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Datasets and Code accompanying the new release of RCA, RCA2. The R-package for RCA2 is available at GitHub: https://github.com/prabhakarlab/RCAv2/

    The datasets included here are:

    Datasets required for a characterization of batch effects:

    merged_rna_seurat.rds

    de_list.rds

    mergedRCAObj.rds

    merged_rna_integrated.rds

    10X_PBMCs.RDS: Processed 10X PBMC data RCA2 object (10X PBMC example data sets)

    NBM_RDS_Files.zip: Several RDS files containing RCA2 object of Normal Bone Marrow (NBM) data, umap coordinates, doublet finder results and metadata information (Normal Bone Marrow use case)

    Dataset used for the Covid19 example:

    blish_covid.seu.rds

    rownames_of_glocal_projection_immune_cells.txt

    Blish_RCA_no_QC_filtering_project_to_multiple_panels.rds

    Data sets used to outline the ability of supervised clustering to detect disease states:

    809653.seurat.rds

    blish_covid.seu.rds

    Performance benchmarking results:

    Memory_consumption.txt

    rca_time_list.rds

    ScanPY input files:

    input_data.zip

    The R script provides R code to regenerate the main paper Figures 2 to 7 modulo some visual modifications performed in Inkscape.

    Provided R scripts are:

    ComputePairWiseDE_v2.R (Required code for pairwise DE computation)

    RCA_Figure_Reproduction.R

    Provided Python code for Scanpy analysis:

    RA_Scanpy.ipynb

    CITESeq_Scanpy.ipynb

  12. Data from: Exploratory mass cytometry analysis reveals immunophenotypes of...

    • data.niaid.nih.gov
    • dataone.org
    • +1 more
    zip
    Updated Apr 18, 2024
    Cite
    Toyoshi Yanagihara (2024). Exploratory mass cytometry analysis reveals immunophenotypes of cancer treatment-related pneumonitis [Dataset]. http://doi.org/10.5061/dryad.2ngf1vhxd
    Available download formats: zip
    Dataset updated
    Apr 18, 2024
    Dataset provided by
    Kyushu University
    Authors
    Toyoshi Yanagihara
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Anti-cancer treatments can result in various adverse effects, including infections due to immune suppression/dysregulation and drug-induced toxicity in the lung. One of the major opportunistic infections is Pneumocystis jirovecii pneumonia (PCP), which can cause severe respiratory complications and high mortality rates. Cytotoxic drugs and immune-checkpoint inhibitors (ICIs) can induce interstitial lung diseases (ILDs). Nonetheless, the differentiation of these diseases can be difficult, and the pathogenic mechanisms of such diseases are not yet fully understood. To better comprehend the immunophenotypes, we conducted an exploratory mass cytometry analysis of immune cell subsets in bronchoalveolar lavage fluid from patients with PCP, cytotoxic drug-induced ILD (DI-ILD), and ICI-associated ILD (ICI-ILD) using two panels containing 64 markers. In PCP, we observed an expansion of the CD16+ T cell population, with the highest CD16+ T proportion in a fatal case. In ICI-ILD, we found an increase in CD57+ CD8+ T cells expressing immune checkpoints (TIGIT+ LAG3+ TIM-3+ PD-1+), FCRL5+ B cells, and CCR2+ CCR5+ CD14+ monocytes. These findings uncover the diverse immunophenotypes and possible pathomechanisms of cancer treatment-related pneumonitis.

    Methods

    Cisplatin-positive cells and doublets were excluded to select live cells, and CD45+ cells were further analyzed.

    For T cells, CD2+ CD3+ cells were gated and subjected to UMAP and Citrus algorithms. The UMAP analysis included clustering channels for CD4, CD8a, CD27, CD28, CD45RA, CD45RO, Fas, and used the parameters: numbers of neighbors = 15, minimum distance = 0.01. The Citrus algorithm for T cells included clustering channels for CD4, CD5, CD7, CD8a, CD11a, CD16, CD27, CD28, CD44, CD45RA, CD45RO, CD49d, CD57, CD69, CD226, Fas, IL-2R, PD-L1, PD-L2, PD-1, OX40, TIGIT, TIM3, CTLA-4, CD223 (LAG-3), BTLA, ICOS, ST2, CCR7, CXCR3, HLA-DR, and used the parameters: association models = nearest shrunken centroid (PAMR), cluster characterization = abundance, minimum cluster size = 5%, cross validation folds = 5, false discovery rate = 1%. The Citrus analysis excluded nivolumab-induced ILD cases due to the apparent loss of PD-1 detection caused by competitive inhibition by nivolumab (Supplementary Figure 1B) (Yanagihara et al., 2020). The viSNE analysis for T cells included clustering channels for CD4, CD5, CD7, CD8a, CD11a, CD16, CD27, CD28, CD44, CD45RA, CD45RO, CD49d, CD57, CD69, CD226, Fas, IL-2R, PD-L1, PD-L2, PD-1, OX40, TIGIT, TIM3, CTLA-4, CD223 (LAG-3), BTLA, ICOS, ST2, CCR7, CXCR3, and HLA-DR, and used the parameters: iterations = 1000, perplexity = 30, theta = 0.5.

    For myeloid cells, CD3– CD11b+ CD11c+ cells were gated, and UMAP and Citrus algorithms were used. The UMAP analysis included clustering channels for CD11b, CD11c, CD64, CD14, CD16, CD206, HLA-DR, and CCR2, and used the parameters: numbers of neighbors = 10, minimum distance = 0.01. The Citrus algorithm included clustering channels for CD11b, CD11c, CD64, CD14, CD16, CD32, CD36, CD38, CD84, CD86, CD163, CD206, CD209, CD223, HLA-DR, CCR2, CCR5, and ST2, and used the parameters: association models = nearest shrunken centroid (PAMR), cluster characterization = abundance, minimum cluster size = 5%, cross validation folds = 5, false discovery rate = 1%. One case of PCP was excluded from the Citrus analysis due to low cell numbers.

    B cells and Plasma cells were identified using gating of CD3–CD64– and CD19+ or CD138+ cells. viSNE analysis was performed to cluster B cells using the following markers: CD19, CD38, CD11c, IgA, IgG, CD138, CD21, ST2, CXCR5, CD24, CD27, TIM-1, IgM, HLA-DR, IgD, and FCRL5. The analysis was performed on both individual and concatenated files using the parameters of 1000 iterations, perplexity of 30, and theta of 0.5.

    The selection of the dimensionality reduction technique, UMAP or viSNE, was made based on their ability to retain the relationships between global structures and the distances between cell clusters (where UMAP outperformed viSNE) and their ability to present a distinct and non-overlapping portrayal of cell subpopulations, facilitating the identification of inter-group variations (where viSNE performed better than UMAP).

  13. Twitter Airline Sentiment Dataset

    • kaggle.com
    zip
    Updated Nov 14, 2025
    Cite
    Chandana Ramakrishna (2025). Twitter Airline Sentiment Dataset [Dataset]. https://www.kaggle.com/datasets/chandana890/twitter-airline-sentiment-dataset
    Available download formats: zip (1134990 bytes)
    Dataset updated
    Nov 14, 2025
    Authors
    Chandana Ramakrishna
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Overview

    This dataset contains tweets related to major US airlines and is widely used for NLP and sentiment analysis tasks. Each record includes the tweet text, timestamp, airline name, and sentiment label (positive, negative, neutral). This uploaded version is prepared to support advanced text processing, machine learning, and anomaly detection experiments.

    What's Included

    • Tweets.csv – Full collection of airline-related tweets
    • Text content suitable for NLP tasks
    • Timestamp information (useful for time-based analysis)
    • Sentiment labels for classification and evaluation
    • Cleaned text field for direct use in ML pipelines

    Purpose of This Dataset

    This dataset is used in a machine learning workflow focused on:
    - sentiment analysis
    - embedding generation (transformers)
    - dimensionality reduction (PCA, UMAP)
    - clustering and visualization
    - unsupervised anomaly detection using Isolation Forest

    It is especially suited for exploring changes in public sentiment, event detection, and contextual analysis in social media data.
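As a first step toward exploring changes in sentiment over time, labels can be tallied per calendar day. This is a sketch under stated assumptions: the column names `tweet_created` and `airline_sentiment` and the `YYYY-MM-DD ...` timestamp layout are hypothetical and should be checked against the header of `Tweets.csv`; the in-memory sample rows are invented for illustration.

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical column names; verify them against the real file header.
TIME_COL = "tweet_created"
LABEL_COL = "airline_sentiment"


def daily_sentiment_counts(rows):
    """Tally sentiment labels per day, using the first 10 timestamp characters."""
    counts = defaultdict(Counter)
    for row in rows:
        day = row[TIME_COL][:10]  # assumes timestamps start with YYYY-MM-DD
        counts[day][row[LABEL_COL]] += 1
    return counts


# Invented sample standing in for Tweets.csv.
sample = io.StringIO(
    "tweet_created,airline_sentiment,text\n"
    "2015-02-24 11:35:52,negative,late again\n"
    "2015-02-24 12:01:10,positive,great crew\n"
    "2015-02-25 08:15:00,negative,lost bag\n"
)
counts = daily_sentiment_counts(csv.DictReader(sample))
```

Days whose label mix departs sharply from the rest are natural candidates for the event and anomaly detection steps listed above.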

    Key Use Cases

    • Building and testing NLP models
    • Semantic similarity and embedding-based analysis
    • Sentiment classification
    • Detecting anomalous posts or time periods
    • Visualizing tweet clusters using UMAP
    • Studying customer feedback patterns in the airline industry

    Source

    Originally derived from the Twitter US Airline Sentiment dataset on Kaggle.
    This uploaded version is intended for educational, analytical, and research purposes.

    Notes

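    If you're using this dataset in a notebook, ensure you update your file path accordingly:

    ```python
    import pandas as pd

    # Path as mounted inside a Kaggle notebook; adjust it for other environments.
    df = pd.read_csv("/kaggle/input/twitter-airline-sentiment-dataset/Tweets.csv")
    ```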

  14. Data for "Surfacing Semantic Orthogonality Across Model Safety Benchmarks:...

    • search.dataone.org
    Updated Oct 29, 2025
    Cite
    Bennion, Jonathan (2025). Data for "Surfacing Semantic Orthogonality Across Model Safety Benchmarks: A Multi-Dimensional Analysis" [Dataset]. http://doi.org/10.7910/DVN/JXU6DC
    Dataset updated
    Oct 29, 2025
    Dataset provided by
    Harvard Dataverse
    Authors
    Bennion, Jonathan
    Description

    Various AI safety datasets have been developed to measure LLMs against evolving interpretations of harm. Our evaluation of five recently published open-source safety benchmarks reveals distinct semantic clusters using UMAP dimensionality reduction and k-means clustering (silhouette score: 0.470). We identify six primary harm categories with varying benchmark representation. GretelAI, for example, focuses heavily on privacy concerns, while WildGuardMix emphasizes self-harm scenarios. Significant differences in prompt length distribution suggest confounds in data collection and in interpretations of harm, and may also offer possible context. Our analysis quantifies orthogonality among AI benchmarks, making coverage gaps transparent despite topical similarities. Our quantitative framework for analyzing semantic orthogonality across safety benchmarks enables more targeted development of datasets that comprehensively address the evolving landscape of harms in AI use, however that is defined in the future.
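The silhouette score quoted above summarizes how well separated a clustering is: for each point, it compares the mean distance to its own cluster (a) with the mean distance to the nearest other cluster (b), as (b - a) / max(a, b). Below is a minimal pure-Python sketch of that standard definition, assuming Euclidean distance; it is not the authors' code, and the toy points are invented.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(i, members):
        return sum(math.dist(points[i], points[j]) for j in members) / len(members)

    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            continue  # a singleton cluster contributes a silhouette of 0
        a = mean_dist(i, own)
        b = min(
            mean_dist(i, members)
            for other, members in clusters.items()
            if other != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Two tight, well-separated groups give a score close to 1.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])
bad = silhouette_score(points, [0, 1, 0, 1])
```

Scores range from -1 to 1, so a value of 0.470 indicates clusters that are distinguishable but far from perfectly separated.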

  15. Data from: Computational analyses of dynamic visual courtship display reveal...

    • dataone.org
    • data.niaid.nih.gov
    • +2more
    Updated Jul 21, 2025
    Cite
    Noori Choi; Eileen Hebets; Dustin Wilgers (2025). Computational analyses of dynamic visual courtship display reveal diet-dependent and plastic male signaling in Rabidosa rabida wolf spiders [Dataset]. http://doi.org/10.5061/dryad.sbcc2frb6
    Dataset updated
    Jul 21, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Noori Choi; Eileen Hebets; Dustin Wilgers
    Time period covered
    Jan 1, 2023
    Description

    It has long been a challenge to quantify the variation in dynamic motions to understand how those displays function in animal communication. The traditional approach is dependent on labor-intensive manual identification/annotation by experts. However, the recent progress in computational techniques provides researchers with toolsets for rapid, objective, and reproducible quantification of dynamic visual displays. In the present study, we investigated the effects of diet manipulation on dynamic visual components of male courtship displays of Rabidosa rabida wolf spiders using machine learning algorithms. Our results suggest that (i) the computational approach can provide an insight into the variation in the dynamic visual display between high- and low-diet males which is not clearly shown with the traditional approach and (ii) males may plastically alter their courtship display according to the body size of females they encounter. Through the present study, we add an example of the utili...

    Raw data - We recorded male courtship with a Photron Fastcam 1024 PCI 100k high-speed camera (Photron USA, San Diego, CA, USA) and a Sony DCR-HC65 NTSC Handycam (Sony Electronics Inc., USA). Then, we analyzed the movement of the foreleg and pedipalps during the selected courtship bouts using ProAnalyst Lite software (Xcitex Inc., Woburn, Massachusetts, USA). We first set the x-axis and y-axis by where the pedipalp tip was in contact with the substrate (y-position 0) and most posterior point of the abdomen (x-position 0) at the beginning of the courtship bout. When the foreleg or pedipalps did not move during the courtship bout, the location of the joint was recorded by the location of the parts at the cocked position. In the case of the image being blurred, the location of blurred points was guessed based on the previous or subsequent frames or other parts in the current frame.

    # Computational analyses of the courtship dance of male wolf spiders

    • 4 Python scripts, 1 R script, and 4 CSV files are included.
    1. 0_raw_data_process.py
    • fills non-observed values with the initial position of each feature
    • creates GIF and PNG figures to describe the visual display
    • requires the following packages
      • numpy, pandas, seaborn, matplotlib, math
    2. 1_rabidosa_pose_cluster.py
    • clusters foreleg postures from each frame
    • uses UMAP and HDBSCAN
    • requires the following packages
      • umap, hdbscan, pickle, pandas, numpy, tensorflow, seaborn, matplotlib, scipy, sklearn
    3. 2_rabidosa_LSTM.py
    • trains and saves an LSTM model of the dynamic visual display of male R. rabida
    • clusters visual displays using UMAP and HDBSCAN
    • requires the following packages
      • umap, hdbscan, pickle, pandas, numpy, tensorflow, seaborn, matplotlib, tsaug, sklearn
    4. 3_trad_clustering.py
    • clusters visual displays using traditional features with UMAP and HDBSCAN
    • require the...
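
The fill step listed for 0_raw_data_process.py (replacing non-observed values with each feature's initial position) could look roughly like the sketch below. It is an illustration only, not the script's actual code: the `fill_with_initial` helper, the list-of-frames layout, and the use of `None` for missing values are assumptions, as is treating the first observed value as the initial position.

```python
def fill_with_initial(frames):
    """Replace None in each feature column with that column's first observed value."""
    if not frames:
        return []
    n_features = len(frames[0])
    filled = [list(frame) for frame in frames]  # copy so the input stays untouched
    for col in range(n_features):
        # First non-missing value in this column, or None if never observed.
        initial = next((f[col] for f in frames if f[col] is not None), None)
        for frame in filled:
            if frame[col] is None:
                frame[col] = initial
    return filled


# Toy data: one row per video frame, one column per tracked coordinate.
frames = [
    [0.0, 5.0],
    [None, 5.5],
    [0.4, None],
]
filled = fill_with_initial(frames)
```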

  16. Datasets used in the benchmarking exercise by SOMOC and iRAPCA

    • ri.conicet.gov.ar
    • datosdeinvestigacion.conicet.gov.ar
    • +1more
    Updated Sep 9, 2024
    Cite
    Alberca, Lucas Nicolás; Bellera, Carolina Leticia; Prada Gori, Denis Nihuel; Llanos, Manuel; Talevi, Alan (2024). Datasets used in the benchmarking exercise by SOMOC and iRAPCA [Dataset]. https://ri.conicet.gov.ar/handle/11336/243803
    Dataset updated
    Sep 9, 2024
    Authors
    Alberca, Lucas Nicolás; Bellera, Carolina Leticia; Prada Gori, Denis Nihuel; Llanos, Manuel; Talevi, Alan
    License

    Attribution-NonCommercial-ShareAlike 2.5 (CC BY-NC-SA 2.5), https://creativecommons.org/licenses/by-nc-sa/2.5/
    License information was derived automatically

    Dataset funded by
    Ministerio de Ciencia, Tecnología e Innovación Productiva. Agencia Nacional de Promoción Científica y Tecnológica. Fondo para la Investigación Científica y Tecnológica
    Description

    Two open-source in-house methodologies for clustering of small molecules are presented: iterative Random subspace Principal Component Analysis clustering (iRaPCA), an iterative approach based on feature bagging, dimensionality reduction, and K-means optimization; and Silhouette Optimized Molecular Clustering (SOMoC), which combines molecular fingerprints with the Uniform Manifold Approximation and Projection (UMAP) and Gaussian Mixture Model algorithm (GMM). In a benchmarking exercise, the performance of both clustering methods has been examined across 29 datasets containing between 100 and 5000 small molecules, comparing these results with those given by two other well-known clustering methods, Ward and Butina. iRaPCA and SOMoC consistently showed the best performance across these 29 datasets, both in terms of within-cluster and between-cluster distances.
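The feature bagging step that iRaPCA is described as building on can be illustrated by drawing random subsets of descriptor columns (random subspaces). This sketch covers only that sampling step, under assumptions: the helper names, the subspace size, and the number of subspaces are hypothetical, and the toy descriptor values are invented. The PCA and K-means stages named in the description would then run on each view.

```python
import random


def random_subspaces(n_features, subspace_size, n_subspaces, seed=0):
    """Draw random column-index subsets, each sampled without replacement."""
    rng = random.Random(seed)
    return [
        sorted(rng.sample(range(n_features), subspace_size))
        for _ in range(n_subspaces)
    ]


def project(rows, columns):
    """Keep only the selected descriptor columns of each molecule's feature vector."""
    return [[row[c] for c in columns] for row in rows]


# Invented descriptor matrix: one row per molecule, one column per descriptor.
descriptors = [
    [0.1, 1.2, 3.4, 0.0, 2.2],
    [0.3, 1.0, 3.1, 0.5, 2.0],
]
subspaces = random_subspaces(n_features=5, subspace_size=3, n_subspaces=4)
views = [project(descriptors, cols) for cols in subspaces]
```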

  17. Additional file 5 of GECO: gene expression clustering optimization app for...

    • springernature.figshare.com
    txt
    Updated May 31, 2023
    Cite
    A. N. Habowski; T. J. Habowski; M. L. Waterman (2023). Additional file 5 of GECO: gene expression clustering optimization app for non-linear data visualization of patterns [Dataset]. http://doi.org/10.6084/m9.figshare.13642382.v1
    Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    A. N. Habowski; T. J. Habowski; M. L. Waterman
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 5: CSV file of bulk RNA-seq data of F. nucleatum infection time course used for GECO UMAP generation.

  18. CellTracksColab - breast cancer cell dataset

    • zenodo.org
    zip
    Updated Jan 20, 2024
    Cite
    Guillaume Jacquemet (2024). CellTracksColab - breast cancer cell dataset [Dataset]. http://doi.org/10.5281/zenodo.10539020
    Available download formats: zip
    Dataset updated
    Jan 20, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Guillaume Jacquemet
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset used in the manuscript "CellTracksColab—A platform for compiling, analyzing, and exploring tracking data"

    This Zenodo archive contains:

    • The raw video (Raw zip files)
    • The tracking files as XML and CSV files (Tracks.zip)
    • The masks used to identify the edges of the monolayer (Monolayer_edges.zip)
    • The CellTracksColab dataframes storing the dataset (CellTracksColab_results.zip)
    • The CellTracksColab outputs used to make the figures in the paper (CellTracksColab_results.zip)

    In brief:

    In this experiment, approximately 50,000 shCTRL or shMYO10 lifeact-RFP DCIS.COM cells were seeded into one well of an ibidi culture-insert 2 well pre-placed in a µ-Slide 8 well. The cells were cultured for 24 hours, after which the culture insert was removed to create a wound-healing assay setup. When appropriate, a fibrillar collagen gel (PureCol EZ Gel) was applied over the cells and allowed to polymerize for 30 minutes at 37°C. Standard culture media was added to all wells, and the cells were left to migrate/invade for two days. Before live cell imaging, the cells were treated with 0.5 µM SiR-DNA (SiR-Hoechst, Tetu-bio) for two hours. Imaging was performed over 14 hours using a Marianas spinning-disk confocal microscope system. This system included a Yokogawa CSU-W1 scanning unit mounted on an inverted Zeiss Axio Observer Z1 microscope (Intelligent Imaging Innovations, Inc.). Imaging was conducted using a 20x (NA 0.8) air Plan Apochromat objective (Zeiss), and images were captured at 10-minute intervals.

    Cell tracking was conducted using Fiji and TrackMate. The Stardist detector was employed to detect nuclei using the Stardist versatile model. Tracks were created using the Kalman tracker (a maximum frame gap of 1, a Kalman search radius of 20 µm, and a linking maximum distance of 15 µm). Post-tracking, tracks were filtered so that each track had to contain more than six spots, ensuring a significant amount of data per track, and the total distance traveled by cells had to be greater than 89 µm.
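The post-tracking filter described above (more than six spots per track and a total distance traveled greater than 89 µm) can be expressed as a simple predicate. This is a sketch of the stated criteria only, not the TrackMate implementation: the track representation as lists of (x, y) positions in µm and the helper names are assumptions, and the toy tracks are invented.

```python
import math

MIN_SPOTS_EXCLUSIVE = 6       # tracks must contain more than six spots
MIN_TOTAL_DISTANCE_UM = 89.0  # total distance traveled must exceed 89 µm


def total_distance(spots):
    """Sum of step lengths along a track given as (x, y) positions in µm."""
    return sum(math.dist(a, b) for a, b in zip(spots, spots[1:]))


def keep_track(spots):
    """Apply both filters from the description."""
    return (
        len(spots) > MIN_SPOTS_EXCLUSIVE
        and total_distance(spots) > MIN_TOTAL_DISTANCE_UM
    )


# Invented tracks moving along the x-axis.
tracks = {
    "long": [(15.0 * i, 0.0) for i in range(8)],   # 8 spots, 105 µm traveled
    "short": [(15.0 * i, 0.0) for i in range(5)],  # only 5 spots
    "slow": [(1.0 * i, 0.0) for i in range(10)],   # 10 spots, 9 µm traveled
}
kept = [name for name, spots in tracks.items() if keep_track(spots)]
```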

    In CellTracksColab, we conducted a dimensionality reduction analysis employing Uniform Manifold Approximation and Projection (UMAP). The UMAP settings were as follows: number of neighbors (n_neighbors) set to 10, minimum distance (min_dist) to 0.5, and number of dimensions (n_dimension) to 2. This analysis utilized an array of track metrics, including:

    NUMBER_SPOTS, NUMBER_GAPS, NUMBER_SPLITS, NUMBER_MERGES, NUMBER_COMPLEX, LONGEST_GAP, TRACK_DURATION, TRACK_DISPLACEMENT, TRACK_MEAN_SPEED, TRACK_MAX_SPEED, TRACK_MIN_SPEED, TRACK_MEDIAN_SPEED, TRACK_STD_SPEED, TRACK_MEAN_QUALITY, TOTAL_DISTANCE_TRAVELED, MAX_DISTANCE_TRAVELED, CONFINEMENT_RATIO, MEAN_STRAIGHT_LINE_SPEED, LINEARITY_OF_FORWARD_PROGRESSION, MEAN_DIRECTIONAL_CHANGE_RATE, Directionality, Tortuosity, Total_Turning_Angle, Spatial_Coverage, MEAN_RADIUS, MEAN_CIRCULARITY, MEAN_SOLIDITY, MEAN_SHAPE_INDEX, MEDIAN_RADIUS, MEDIAN_CIRCULARITY, MEDIAN_SOLIDITY, MEDIAN_SHAPE_INDEX, STD_RADIUS, STD_CIRCULARITY, STD_SOLIDITY, STD_SHAPE_INDEX, MIN_RADIUS, MIN_CIRCULARITY, MIN_SOLIDITY, MIN_SHAPE_INDEX, MAX_RADIUS, MAX_CIRCULARITY, MAX_SOLIDITY, MAX_SHAPE_INDEX, MaxDistance_edge, MinDistance_edge, StartDistance_edge, EndDistance_edge, MedianDistance_edge, StdDevDistance_edge, DirectionMovement_edge, AvgRateChange_edge, PercentageChange_edge, TrendSlope_edge.

    Subsequently, clustering analysis was performed using Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN). The parameters included clustering_data_source set to UMAP, min_samples at 20, min_cluster_size at 200, and the metric employed was Canberra.
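For reference, the Canberra metric named above is the sum over coordinates of |u_i - v_i| / (|u_i| + |v_i|), with terms where both coordinates are zero conventionally counted as zero. A minimal pure-Python sketch of that definition follows; it illustrates the metric only and is not taken from the CellTracksColab code.

```python
def canberra(u, v):
    """Canberra distance between two equal-length vectors, skipping 0/0 terms."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    total = 0.0
    for a, b in zip(u, v):
        denom = abs(a) + abs(b)
        if denom:  # both coordinates zero: the term is defined as 0
            total += abs(a - b) / denom
    return total


# Example on two 2D embedding coordinates.
d = canberra([1.0, 2.0], [3.0, 2.0])
```

Because each term is normalized by the coordinate magnitudes, the metric weights differences near the origin more heavily than the same differences far from it.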

  19. Data from: iRaPCA and SOMoC: Development and Validation of Web Applications...

    • datasetcatalog.nlm.nih.gov
    • acs.figshare.com
    Updated Jun 10, 2022
    Cite
    Talevi, Alan; Llanos, Manuel A.; Gori, Denis N. Prada; Alberca, Lucas N.; Bellera, Carolina L. (2022). iRaPCA and SOMoC: Development and Validation of Web Applications for New Approaches for the Clustering of Small Molecules [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000275884
    Dataset updated
    Jun 10, 2022
    Authors
    Talevi, Alan; Llanos, Manuel A.; Gori, Denis N. Prada; Alberca, Lucas N.; Bellera, Carolina L.
    Description

    The clustering of small molecules implies the organization of a group of chemical structures into smaller subgroups with similar features. Clustering has important applications to sample chemical datasets or libraries in a representative manner (e.g., to choose, from a virtual screening hit list, a chemically diverse subset of compounds to be submitted to experimental confirmation, or to split datasets into representative training and validation sets when implementing machine learning models). Most strategies for clustering molecules are based on molecular fingerprints and hierarchical clustering algorithms. Here, two open-source in-house methodologies for clustering of small molecules are presented: iterative Random subspace Principal Component Analysis clustering (iRaPCA), an iterative approach based on feature bagging, dimensionality reduction, and K-means optimization; and Silhouette Optimized Molecular Clustering (SOMoC), which combines molecular fingerprints with the Uniform Manifold Approximation and Projection (UMAP) and Gaussian Mixture Model algorithm (GMM). In a benchmarking exercise, the performance of both clustering methods has been examined across 29 datasets containing between 100 and 5000 small molecules, comparing these results with those given by two other well-known clustering methods, Ward and Butina. iRaPCA and SOMoC consistently showed the best performance across these 29 datasets, both in terms of within-cluster and between-cluster distances. Both iRaPCA and SOMoC have been implemented as free Web Apps and standalone applications, to allow their use to a wide audience within the scientific community.
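The two evaluation quantities mentioned above, within-cluster and between-cluster distances, can be computed from pairwise distances once cluster labels are assigned. The sketch below assumes Euclidean distance on low-dimensional coordinates and averages over all point pairs; the benchmark's exact definitions are not given here, so treat this as one plausible reading with invented toy data.

```python
import math
from itertools import combinations


def _mean(values):
    return sum(values) / len(values) if values else float("nan")


def within_between(points, labels):
    """Return (mean within-cluster distance, mean between-cluster distance)."""
    within, between = [], []
    for i, j in combinations(range(len(points)), 2):
        d = math.dist(points[i], points[j])
        if labels[i] == labels[j]:
            within.append(d)
        else:
            between.append(d)
    return _mean(within), _mean(between)


# Two compact groups that lie far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
w, b = within_between(points, [0, 0, 1, 1])
```

A good clustering shows a small within-cluster value relative to the between-cluster value.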

  20. Statistical separation of UMAP clusters between real datasets.

    • plos.figshare.com
    xls
    Updated Jun 14, 2023
    Cite
    Sam T. M. Ball; Numan Celik; Elaheh Sayari; Lina Abdul Kadir; Fiona O’Brien; Richard Barrett-Jolley (2023). Statistical separation of UMAP clusters between real datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0267452.t002
    Available download formats: xls
    Dataset updated
    Jun 14, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Sam T. M. Ball; Numan Celik; Elaheh Sayari; Lina Abdul Kadir; Fiona O’Brien; Richard Barrett-Jolley
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Statistical separation of UMAP clusters between real datasets.
