Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Sorting spikes from extracellular recording into clusters associated with distinct single units (putative neurons) is a fundamental step in analyzing neuronal populations. Such spike sorting is intrinsically unsupervised, as the number of neurons are not known a priori. Therefor, any spike sorting is an unsupervised learning problem that requires either of the two approaches: specification of a fixed value k for the number of clusters to seek, or generation of candidate partitions for several possible values of c, followed by selection of a best candidate based on various post-clustering validation criteria. In this paper, we investigate the first approach and evaluate the utility of several methods for providing lower dimensional visualization of the cluster structure and on subsequent spike clustering. We also introduce a visualization technique called improved visual assessment of cluster tendency (iVAT) to estimate possible cluster structures in data without the need for dimensionality reduction. Experimental results are conducted on two datasets with ground truth labels. In data with a relatively small number of clusters, iVAT is beneficial in estimating the number of clusters to inform the initialization of clustering algorithms. With larger numbers of clusters, iVAT gives a useful estimate of the coarse cluster structure but sometimes fails to indicate the presumptive number of clusters. We show that noise associated with recording extracellular neuronal potentials can disrupt computational clustering schemes, highlighting the benefit of probabilistic clustering models. Our results show that t-Distributed Stochastic Neighbor Embedding (t-SNE) provides representations of the data that yield more accurate visualization of potential cluster structure to inform the clustering stage. Moreover, The clusters obtained using t-SNE features were more reliable than the clusters obtained using the other methods, which indicates that t-SNE can potentially be used for both visualization and to extract features to be used by any clustering algorithm.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
DrCyZ: Techniques for analyzing and extracting useful information from CyZ.
Samples from NASA Perseverance and set of GAN generated synthetic images from Neural Mars.
Repository: https://github.com/decurtoidiaz/drcyz
Subset of samples from (includes tools to visualize and analyse the dataset):
CyZ: MARS Space Exploration Dataset. [https://doi.org/10.5281/zenodo.5655473]
Images from NASA missions of the celestial body.
Repository: https://github.com/decurtoidiaz/cyz
Authors:
J. de Curtò c@decurto.be
I. de Zarzà z@dezarza.be
• Subset of samples from Perseverance (drcyz/c).
∙ png (drcyz/c/png).
PNG files (5025) selected from NASA Perseverance (CyZ-1.1) after t-SNE and K-means Clustering.
∙ csv (drcyz/c/csv).
CSV file.
• Resized samples from Perseverance (drcyz/c+).
∙ png 64x64; 128x128; 256x256; 512x512; 1024x1024 (drcyz/c+/drcyz_64-1024).
PNG files resized at the corresponding size.
∙ TFRecords 64x64; 128x128; 256x256; 512x512; 1024x1024 (drcyz/c+/tfr_drcyz_64-1024).
TFRecord resized at the corresponding size to import on Tensorflow.
• Synthetic images from Neural Mars generated using Stylegan2-ada (drcyz/drcyz+).
∙ png 100; 1000; 10000 (drcyz/drcyz+/drcyz_256_100-10000)
PNG files subset of 100, 1000 and 10000 at size 256x256.
• Network Checkpoint from Stylegan2-ada trained at size 256x256 (drcyz/model_drcyz).
∙ network-snapshot-000798-drcyz.pkl
• Notebooks in python to analyse the original dataset and reproduce the experiments; K-means Clustering, t-SNE, PCA, synthetic generation using Stylegan2-ada and instance segmentation using Deeplab (https://github.com/decurtoidiaz/drcyz/tree/main/dr_cyz+).
∙ clustering_curiosity_de_curto_and_de_zarza.ipynb
K-means Clustering and PCA(2) with images from Curiosity.
∙ clustering_perseverance_de_curto_and_de_zarza.ipynb
K-means Clustering and PCA(2) with images from Perseverance.
∙ tsne_curiosity_de_curto_and_de_zarza.ipynb
t-SNE and PCA (components selected to explain 99% of variance) with images from Curiosity.
∙ tsne_perseverance_de_curto_and_de_zarza.ipynb
t-SNE and PCA (components selected to explain 99% of variance) with images from Perseverance.
∙ Stylegan2-ada_de_curto_and_de_zarza.ipynb
Stylegan2-ada trained on a subset of images from NASA Perseverance (DrCyZ).
∙ statistics_perseverance_de_curto_and_de_zarza.ipynb
Compute statistics from synthetic samples generated by Stylegan2-ada (DrCyZ) and images from NASA Perseverance (CyZ).
∙ DeepLab_TFLite_ADE20k_de_curto_and_de_zarza.ipynb
Example of instance segmentation using Deeplab with a sample from NASA Perseverance (DrCyZ).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Supplementary Table 2: All Genes TSNE-clusters RNA-seq data
Dataset and labels for the article of Keratoconus severity identification using unsupervised machine learning by Siamak Yousefi
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 5: CSV file of bulk RNA-seq data of F. nucleatum infection time course used for GECO UMAP generation.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Scripts used for analysis of V1 and V2 Datasets.seurat_v1.R - initialize seurat object from 10X Genomics cellranger outputs. Includes filtering, normalization, regression, variable gene identification, PCA analysis, clustering, tSNE visualization. Used for v1 datasets. merge_seurat.R - merge two or more seurat objects into one seurat object. Perform linear regression to remove batch effects from separate objects. Used for v1 datasets. subcluster_seurat_v1.R - subcluster clusters of interest from Seurat object. Determine variable genes, perform regression and PCA. Used for v1 datasets.seurat_v2.R - initialize seurat object from 10X Genomics cellranger outputs. Includes filtering, normalization, regression, variable gene identification, and PCA analysis. Used for v2 datasets. clustering_markers_v2.R - clustering and tSNE visualization for v2 datasets. subcluster_seurat_v2.R - subcluster clusters of interest from Seurat object. Determine variable genes, perform regression and PCA analysis. Used for v2 datasets.seurat_object_analysis_v1_and_v2.R - downstream analysis and plotting functions for seurat object created by seurat_v1.R or seurat_v2.R. merge_clusters.R - merge clusters that do not meet gene threshold. Used for both v1 and v2 datasets. prepare_for_monocle_v1.R - subcluster cells of interest and perform linear regression, but not scaling in order to input normalized, regressed values into monocle with monocle_seurat_input_v1.R monocle_seurat_input_v1.R - monocle script using seurat batch corrected values as input for v1 merged timecourse datasets. monocle_lineage_trace.R - monocle script using nUMI as input for v2 lineage traced dataset. monocle_object_analysis.R - downstream analysis for monocle object - BEAM and plotting. CCA_merging_v2.R - script for merging v2 endocrine datasets with canonical correlation analysis and determining the number of CCs to include in downstream analysis. CCA_alignment_v2.R - script for downstream alignment, clustering, tSNE visualization, and differential gene expression analysis.
This dataset contains files reconstructing single-cell data presented in 'Reference transcriptomics of porcine peripheral immune cells created through bulk and single-cell RNA sequencing' by Herrera-Uribe & Wiarda et al. 2021. Samples of peripheral blood mononuclear cells (PBMCs) were collected from seven pigs and processed for single-cell RNA sequencing (scRNA-seq) in order to provide a reference annotation of porcine immune cell transcriptomics at enhanced, single-cell resolution. Analysis of single-cell data allowed identification of 36 cell clusters that were further classified into 13 cell types, including monocytes, dendritic cells, B cells, antibody-secreting cells, numerous populations of T cells, NK cells, and erythrocytes. Files may be used to reconstruct the data as presented in the manuscript, allowing for individual query by other users. Scripts for original data analysis are available at https://github.com/USDA-FSEPRU/PorcinePBMCs_bulkRNAseq_scRNAseq. Raw data are available at https://www.ebi.ac.uk/ena/browser/view/PRJEB43826. Funding for this dataset was also provided by NRSP8: National Animal Genome Research Program (https://www.nimss.org/projects/view/mrp/outline/18464). Resources in this dataset:Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells 10X Format. File Name: PBMC7_AllCells.zipResource Description: Zipped folder containing PBMC counts matrix, gene names, and cell IDs. Files are as follows: matrix of gene counts* (matrix.mtx.gx) gene names (features.tsv.gz) cell IDs (barcodes.tsv.gz) *The ‘raw’ count matrix is actually gene counts obtained following ambient RNA removal. During ambient RNA removal, we specified to calculate non-integer count estimations, so most gene counts are actually non-integer values in this matrix but should still be treated as raw/unnormalized data that requires further normalization/transformation. Data can be read into R using the function Read10X().Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells Metadata. File Name: PBMC7_AllCells_meta.csvResource Description: .csv file containing metadata for cells included in the final dataset. Metadata columns include: nCount_RNA = the number of transcripts detected in a cell nFeature_RNA = the number of genes detected in a cell Loupe = cell barcodes; correspond to the cell IDs found in the .h5Seurat and 10X formatted objects for all cells prcntMito = percent mitochondrial reads in a cell Scrublet = doublet probability score assigned to a cell seurat_clusters = cluster ID assigned to a cell PaperIDs = sample ID for a cell celltypes = cell type ID assigned to a cellResource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells PCA Coordinates. File Name: PBMC7_AllCells_PCAcoord.csvResource Description: .csv file containing first 100 PCA coordinates for cells. Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells t-SNE Coordinates. File Name: PBMC7_AllCells_tSNEcoord.csvResource Description: .csv file containing t-SNE coordinates for all cells.Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells UMAP Coordinates. File Name: PBMC7_AllCells_UMAPcoord.csvResource Description: .csv file containing UMAP coordinates for all cells.Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - CD4 T Cells t-SNE Coordinates. File Name: PBMC7_CD4only_tSNEcoord.csvResource Description: .csv file containing t-SNE coordinates for only CD4 T cells (clusters 0, 3, 4, 28). A dataset of only CD4 T cells can be re-created from the PBMC7_AllCells.h5Seurat, and t-SNE coordinates used in publication can be re-assigned using this .csv file.Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - CD4 T Cells UMAP Coordinates. File Name: PBMC7_CD4only_UMAPcoord.csvResource Description: .csv file containing UMAP coordinates for only CD4 T cells (clusters 0, 3, 4, 28). A dataset of only CD4 T cells can be re-created from the PBMC7_AllCells.h5Seurat, and UMAP coordinates used in publication can be re-assigned using this .csv file.Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - Gamma Delta T Cells UMAP Coordinates. File Name: PBMC7_GDonly_UMAPcoord.csvResource Description: .csv file containing UMAP coordinates for only gamma delta T cells (clusters 6, 21, 24, 31). A dataset of only gamma delta T cells can be re-created from the PBMC7_AllCells.h5Seurat, and UMAP coordinates used in publication can be re-assigned using this .csv file.Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - Gamma Delta T Cells t-SNE Coordinates. File Name: PBMC7_GDonly_tSNEcoord.csvResource Description: .csv file containing t-SNE coordinates for only gamma delta T cells (clusters 6, 21, 24, 31). A dataset of only gamma delta T cells can be re-created from the PBMC7_AllCells.h5Seurat, and t-SNE coordinates used in publication can be re-assigned using this .csv file.Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - Gene Annotation Information. File Name: UnfilteredGeneInfo.txtResource Description: .txt file containing gene nomenclature information used to assign gene names in the dataset. 'Name' column corresponds to the name assigned to a feature in the dataset.Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells H5Seurat. File Name: PBMC7.tarResource Description: .h5Seurat object of all cells in PBMC dataset. File needs to be untarred, then read into R using function LoadH5Seurat().
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 4: CSV file of colon crypt bulk RNA-seq data used for GECO UMAP generation.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Plants are often attacked by various pathogens during their growth, which may cause environmental pollution, food shortages, or economic losses in a certain area. Integration of high throughput phenomics data and computer vision (CV) provides a great opportunity to realize plant disease diagnosis in the early stage and uncover the subtype or stage patterns in the disease progression. In this study, we proposed a novel computational framework for plant disease identification and subtype discovery through a deep-embedding image-clustering strategy, Weighted Distance Metric and the t-stochastic neighbor embedding algorithm (WDM-tSNE). To verify the effectiveness, we applied our method on four public datasets of images. The results demonstrated that the newly developed tool is capable of identifying the plant disease and further uncover the underlying subtypes associated with pathogenic resistance. In summary, the current framework provides great clustering performance for the root or leave images of diseased plants with pronounced disease spots or symptoms.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets linked to publication "Revealing viral and cellular dynamics of HIV-1 at the single-cell level during early treatment periods", Otte et al 2023 published in Cell Reports Methods pre-ART (antiretroviral therapy) cryo-conserved and and whole blood specimen were sampled for HIV-1 virus reservoir determination in HIV-1 positive individuals from the Swiss HIV Study Cohort. Patients were monitored for proviral (DNA), poly-A transcripts (RNA), late protein translation (Gag and Envelope reactivation co-detection assay, GERDA) and intact viruses (golden standard: viral outgrowth assay, VOA). In this dataset we deposited the pipeline for the multidimensional data analysis of our newly established GERDA method, using DBScan and tSNE. For further comprehension NGS and Sanger sequencing data were attached as processed and raw data (GenBank).
Resubmitted to Cell Reports Methods (Jan-2023), accepted in principal (Mar-2023)
GERDA is a new detection method to decipher the HIV-1 cellular reservoir in blood (tissue or any other specimen). It integrates HIV-1 Gag and Env co-detection along with cellular surface markers to reveal 1) what cells still contain HIV-1 translation competent virus and 2) which marker the respective infected cells express. The phenotypic marker repertoire of the cells allow to make predictions on potential homing and to assess the HIV-1 (tissue) reservoir. All FACS data were acquired on a LSRFortessa BD FACS machine (markers: CCR7, CD45RA, CD28, CD4, CD25, PD1, IntegrinB7, CLA, HIV-1 Env, HIV-1 Gag) Raw FACS data (pre-gated CD4CD3+ T-cells) were arcsin transformed and dimensionally reduced using optsne. Data was further clustered using DBSCAN and either individual clusters were further analyzed for individual marker expression or expression profiles of all relevant clusters were analyzed by heatmaps. Sequences before/after therapy initiation and during viral outgrowth cultures were monitored for individuals P01-46 and P04-56 by Next-generation sequencing (NGS of HIV-1 Envelope V3 loop only) and by Sanger (single genome amplification, SGA)
data normalization code (by Julian Spagnuolo) FACS normalized data as CSV (XXX_arcsin.csv) OMIQ conText file (_OMIQ-context_XXX) arcsin normalized FACS data after optsne dimension reduction with OMIQ.ai as CSV file (XXXarcsin.csv.csv) R pipeline with codes (XXX_commented.R) P01_46-NGS and Sanger sequences P04_56-NGS and Sanger sequences
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
There are 3 files in seurat_results.zip: one containing the principal component values used for dimensionality reduction and clustering of all MSN, one containing the computed tSNE values, and one containing the louvain clusters. The metadata_final.csv file contains the annotated major cell types and subtypes.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset name, reference, dimensions and cell type composition.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Tables containing additional information on genes and cell obtained from single-cell RNA-Seq analysis of mouse testis data. SuppTableCells.xlsx contains the 10X Barcode as identifier, the replicate ID, position in the t-SNE plot, UMI and gene count per cell, the proportion of mitochondrial transcripts, cluster ID obtained by k-Means clustering (k=9), inferred cell type, and pseudotime information obtained using monocle and Scrat.SuppTableGenes contains the average expression value, fold-change compared to other cell types, and p-value relative to other cell types for each gene that was expressed in the dataset. Throughout the data, the following abbreviations for cell types are used: Spg=spermatogonia, SC=spermatocytes, RS=round spermatids, ES=elongating spermatids, CS=condensed/condensing spermatids. In cases where several clusters were identified per cell type, the earlier cluster was designated as 1.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Top 10 GO biological processes for BRCA1 coexpressed genes from an EnrichR analysis from top 100 genes coexpressed at the tissue-level or at the system-level.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
dge_raw_data.zip: contains separate raw count matrices for 17 adult Xenopus tissues (bladder, bone marrow, brain, eye, heart, intestine, kidney, liver, lung, muscle, ovary, oviduct, pancreas, skin, spleen, stomach and testis) and 4 larva stages (St48, St54, St59 and St66) dge_cell_info.zip: contains the cell annotations corresponding to dge_raw_data.zip, including tSNE coordinates, clusters, stages and cell-type annotations.Xenopus_Figure1.h5ad: contains both raw count and normalized datasets (contains only highly variable genes) for 501,358 cells in XCL, which can be processed into the python environment with SCANPY directly.Xenopus_Figure1_cell_info.csv: contains the cell annotations for 501,358 cells in XCL, including tSNE coordinates, clusters, tissue origins, stages and cell-type annotations.Larva_Figure4.h5ad: contains both raw count and normalized datasets (contains only highly variable genes) for 188,020 larva cells in XCL, which can be processed into the python environment with SCANPY directly.Larva_Figure4_cell_info.csv: contains the cell annotations for 188,020 larva cells in XCL, including tSNE coordinates, clusters, stages and cell-type annotations.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Perimeter Intrusion Detection Systems (PIDS) are crucial for protecting any physical locations by detecting and responding to intrusions around its perimeter. Despite the availability of several PIDS, challenges remain in detection accuracy and precise activity classification. To address these challenges, a new machine learning model is developed. This model utilizes the pre-trained InceptionV3 for feature extraction on PID intrusion image dataset, followed by t-SNE for dimensionality reduction and subsequent clustering. When handling high-dimensional data, the existing Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm faces efficiency issues due to its complexity and varying densities. To overcome these limitations, this research enhances the traditional DBSCAN algorithm. In the enhanced DBSCAN, distances between minimal points are determined using an estimation for the epsilon values with the Manhattan distance formula. The effectiveness of the proposed model is evaluated by comparing it to state-of-the-art techniques found in the literature. The analysis reveals that the proposed model achieved a silhouette score of 0.86, while comparative techniques failed to produce similar results. This research contributes to societal security by improving location perimeter protection, and future researchers can utilize the developed model for human activity recognition from image datasets.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
t-SNE embedding illustrates the distribution of cell populations.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Parameters for the t-distributed Stochastic Neighbor Embedding (t-SNE).
Not seeing a result you expected?
Learn how you can add new datasets to our index.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Sorting spikes from extracellular recording into clusters associated with distinct single units (putative neurons) is a fundamental step in analyzing neuronal populations. Such spike sorting is intrinsically unsupervised, as the number of neurons are not known a priori. Therefor, any spike sorting is an unsupervised learning problem that requires either of the two approaches: specification of a fixed value k for the number of clusters to seek, or generation of candidate partitions for several possible values of c, followed by selection of a best candidate based on various post-clustering validation criteria. In this paper, we investigate the first approach and evaluate the utility of several methods for providing lower dimensional visualization of the cluster structure and on subsequent spike clustering. We also introduce a visualization technique called improved visual assessment of cluster tendency (iVAT) to estimate possible cluster structures in data without the need for dimensionality reduction. Experimental results are conducted on two datasets with ground truth labels. In data with a relatively small number of clusters, iVAT is beneficial in estimating the number of clusters to inform the initialization of clustering algorithms. With larger numbers of clusters, iVAT gives a useful estimate of the coarse cluster structure but sometimes fails to indicate the presumptive number of clusters. We show that noise associated with recording extracellular neuronal potentials can disrupt computational clustering schemes, highlighting the benefit of probabilistic clustering models. Our results show that t-Distributed Stochastic Neighbor Embedding (t-SNE) provides representations of the data that yield more accurate visualization of potential cluster structure to inform the clustering stage. Moreover, The clusters obtained using t-SNE features were more reliable than the clusters obtained using the other methods, which indicates that t-SNE can potentially be used for both visualization and to extract features to be used by any clustering algorithm.