CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
File “UMAP plots split by dataset and sample” supplied the comparison of UMAP plots at dataset or sample level colored by major cell types.
https://darus.uni-stuttgart.de/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.18419/DARUS-4576
This entry contains the data used for the bachelor thesis, which investigated how embeddings can be used to analyze supersecondary structures. Abstract of the thesis: This thesis analyzes the behavior of supersecondary structures in the context of embeddings. For this purpose, data from the Protein Topology Graph Library was provided with embeddings. This resulted in a structured graph database, which will be used for future work and analyses. In addition, different projections were made into the two-dimensional space to analyze how the embeddings behave there. The Jupyter Notebook 1_data_retrival.ipynb contains the download process for the graph files from the Protein Topology Graph Library (https://ptgl.uni-frankfurt.de). The downloaded .gml files can also be found in graph_files.zip. They form graphs that represent the relationships of supersecondary structures in the proteins and are the data basis for further analyses. These graph files are then processed in the Jupyter Notebook 2_data_storage_and_embeddings.ipynb and entered into a graph database. The sequences of the supersecondary and secondary structures from the PTGL can be found in fastas.zip. The embeddings were calculated using the ESM model of the Facebook Research Group (huggingface.co/facebook/esm2_t12_35M_UR50D) and can be found in three .h5 files; they are subsequently added to the database. The whole process in this notebook serves to build up the database, which can then be searched using Cypher queries. In the Jupyter Notebook 3_data_science.ipynb, different visualizations and analyses are carried out with the help of UMAP. To install all dependencies, it is recommended to create a Conda environment and install all packages there. To use the project, PyEED should be installed using the snapshot of the original repository (source repository: https://github.com/PyEED/pyeed). The best way to install PyEED is to execute the pip install -e . command in the pyeed_BT folder. The dependencies can also be installed by using poetry and the .toml file. In addition, seaborn, h5py and umap-learn are required. These can be installed using the following commands:
pip install h5py==3.12.1
pip install seaborn==0.13.2 umap-learn==0.5.7
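After installation, it can be useful to confirm that the environment matches the pinned versions listed above. The following is a minimal sketch using only the Python standard library; the helper name check_pins and the PINNED dictionary are illustrative assumptions and are not part of the dataset or of PyEED.

```python
from importlib import metadata

# Version pins copied from the install commands in the description.
PINNED = {"h5py": "3.12.1", "seaborn": "0.13.2", "umap-learn": "0.5.7"}


def check_pins(pins, get_version=metadata.version):
    """Return {package: (expected, found)} for packages that are missing or off-pin.

    `get_version` is injectable so the check can be exercised without
    the packages being installed (hypothetical helper, not from the thesis).
    """
    problems = {}
    for name, expected in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed in this environment
        if found != expected:
            problems[name] = (expected, found)
    return problems


# Report any mismatch in the currently active environment.
for name, (expected, found) in check_pins(PINNED).items():
    print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty report means all three pins are satisfied in the active Conda environment.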
Attribution 2.0 (CC BY 2.0) https://creativecommons.org/licenses/by/2.0/
License information was derived automatically
Support data for our paper:
USING UMAP TO INSPECT AUDIO DATA FOR UNSUPERVISED ANOMALY DETECTION UNDER DOMAIN-SHIFT CONDITIONS
ArXiv preprint can be found here. Code for the experiment software pipeline described in the paper can be found here. The pipeline requires and generates different forms of data. Here we provide the following:
AudioSet_wav_fragments.zip: This is a custom selection of 39437 wav files (32kHz, mono, 10 seconds) randomly extracted from AudioSet (originally released under CC-BY). In addition to this custom subset, the paper also uses the following ones, which can be downloaded at their respective websites:
DCASE2021 Task 2 Development Dataset
DCASE2021 Task 2 Additional Training Dataset
Fraunhofer's IDMT-ISA-ELECTRIC-ENGINE Dataset
dcase2021_uads_umaps.zip: To compute the UMAPs, first the log-STFT, log-mel and L3 representations must be extracted, and then the UMAPs must be computed. This can take a substantial amount of time and resources. For convenience, we provide here the 72 UMAPs discussed in the paper.
dcase2021_uads_umap_plots.zip: Also for convenience, we provide here the 198 high-resolution scatter plots rendered from the UMAPs.
For a comprehensive visual inspection of the computed representations, it is sufficient to download the plots only. Users interested in exploring the plots interactively will need to download all the audio datasets and compute the log-STFT, log-mel and L3 representations as well as the UMAPs themselves (code provided in the GitHub repository). UMAPs for further representations can also be computed and plotted.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We developed a single-cell transcriptomics pipeline for high-throughput pharmacotranscriptomic screening. We explored the transcriptional landscape of three HGSOC models (JHOS2, a representative cell line; PDC2 and PDC3, two patient-derived samples) after treating their cells for 24 hours with 45 drugs representing 13 distinct classes of mechanism of action. Our work establishes a new precision oncology framework for the study of molecular mechanisms activated by a broad array of drug responses in cancer.
.
├── 3D UMAPs/ → Interactive 3D UMAPs of cells treated with the 45 drugs used for multiplexed scRNA-seq. Related to Figure 4. Coordinates: x = UMAP 1; y = UMAP 2; z = UMAP 3. Legend: green = PDC1; blue = PDC2; red = JHOS2.
│   ├── DMSO_3D_UMAP_Dini.et.al.html → 3D UMAP of untreated cells.
│   └── drug_3D_UMAP_Dini.et.al.html → 3D UMAP of cells treated with (drug).
├── QC_plots/ → Diagnostic plots. Related to Figures 2–4.
│   ├── model_QC_violin_plot_2023.pdf → Violin plots of the QC metrics used to filter the data.
│   ├── model_col_HTO or model_row_HTO before and after filt → Heatmaps of the row or column HTO expression in each cell.
│   └── model_counts_histogram_2023.pdf → Histogram of the distribution of the total counts per cell after filtering for high-quality cells.
├── scRNAseq/ → scRNA-seq data. Related to Figures 2–4.
│   ├── AllData_subsampled_DGE_edgeR.csv.gz → Differential gene expression analyses results between treated and untreated cells via pseudobulk of aggregate subsamples, for each of the three models. Related to Figure 3.
│   └── All_vs_all_RNAclusters_DEG_signif.txt → Differential gene expression analysis results (p.adj < 0.05) of FindAllMarkers for the Leiden/RNA clusters.
├── PDCs.transcript.counts.tsv → Bulk RNA-seq count data for PDCs 1–3 processed by Kallisto. Related to Figure S6.
└── PDCs.transcript.TPM.tsv → Bulk RNA-seq TPM data for PDCs 1–3 processed by Kallisto. Related to Figure S6.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Whole-brain functional connectivity (FC) measured with functional MRI (fMRI) evolves over time in meaningful ways at temporal scales going from years (e.g., development) to seconds [e.g., within-scan time-varying FC (tvFC)]. Yet, our ability to explore tvFC is severely constrained by its large dimensionality (several thousands). To overcome this difficulty, researchers often seek to generate low dimensional representations (e.g., 2D and 3D scatter plots) hoping those will retain important aspects of the data (e.g., relationships to behavior and disease progression). Limited prior empirical work suggests that manifold learning techniques (MLTs)—namely those seeking to infer a low dimensional non-linear surface (i.e., the manifold) where most of the data lies—are good candidates for accomplishing this task. Here we explore this possibility in detail. First, we discuss why one should expect tvFC data to lie on a low dimensional manifold. Second, we estimate what is the intrinsic dimension (ID; i.e., minimum number of latent dimensions) of tvFC data manifolds. Third, we describe the inner workings of three state-of-the-art MLTs: Laplacian Eigenmaps (LEs), T-distributed Stochastic Neighbor Embedding (T-SNE), and Uniform Manifold Approximation and Projection (UMAP). For each method, we empirically evaluate its ability to generate neuro-biologically meaningful representations of tvFC data, as well as their robustness against hyper-parameter selection. Our results show that tvFC data has an ID that ranges between 4 and 26, and that ID varies significantly between rest and task states. We also show how all three methods can effectively capture subject identity and task being performed: UMAP and T-SNE can capture these two levels of detail concurrently, but LE could only capture one at a time. We observed substantial variability in embedding quality across MLTs, and within-MLT as a function of hyper-parameter selection. 
To help alleviate this issue, we provide heuristics that can inform future studies. Finally, we also demonstrate the importance of feature normalization when combining data across subjects and the role that temporal autocorrelation plays in the application of MLTs to tvFC data. Overall, we conclude that while MLTs can be useful to generate summary views of labeled tvFC data, their application to unlabeled data such as resting-state remains challenging.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 6: Fig.S1 Gossypetin does not affect expression of β-, and γ-secretases and activity of β-secretase. (A to G) Time dependent β-secretase activity of mouse hippocampal lysate was measured with Relative Fluorescence Unit (RFU). Fluorescence excitation and emission wavelength was 335 nm and 495 nm respectively (A). Bar graph of RFU at each time point of 10 min (B), 20 min (C), 30 min (D), 40 min (E), 50 min (F), 60 min (G). (n = 10~12 mice per group) (H to L) Representative images of Western blot analysis for β-, γ-secretase subunits, and GAPDH (H). Bar graphs represent relative protein expression levels of BACE1 (I), Nicastrin (J), APH-1 (K), and PEN2 (L). (n = 12~15 mice per group) (M to P) Bar graphs represent relative mRNA expression level of β-, and γ-secretase subunits bace1 (M), ncstn (N), aph1 (O), pen2 (P). (n = 9~10 mice per group) Error bars represent the mean ± SD, p < 0.05, ns = not significant, two-way ANOVA followed by Tukey’s multiple comparisons test. Fig. S2 Cell type classification of brain samples. (A) UMAP plot showing all cells from the brain samples, colored by their cell types. (B) Heatmap illustrating the Z-scores of average normalized expressions of cell type markers. (C) Violin plots displaying the log-scaled number of detected genes (top), Unique Molecular Identifiers (UMIs) (middle), and the percentage of mitochondrial gene expressions (bottom) per cell for each cell type. (D) UMAP plots showing all cells from the brain samples, colored by their sampled region (left), mouse strain (middle), or drug administration (right) condition. Fig. S3 Detailed subtyping of the microglial population. (A) UMAP plots showing all microglial cells from cortex region. The cells are colored by their celltypes (left). Heatmap showing the Z-scores of average normalized expressions of representative DEGs for each cell type from cortex region (right). 
(B) UMAP plots showing microglial cells from cortex (left) or hippocampus (right), colored by combination of mouse strain and drug administration condition. (C) UMAP plots illustrating microglial cells from cortex (left) or hippocampus (right), colored by their inferred cell cycle. (D) Bar plots for the fraction of cortex (left) or hippocampus (right) microglial cells by sample conditions, which are the combination of mouse strain and drug administration, for each microglial subtype. Fig. S4 Differential gene expressions between vehicle- and gossypetin-treated microglia. (A) Scatter plot showing GOBP terms that are upregulated or downregulated by 5xFAD construction or gossypetin administration for each microglial subtype from cortex. Significant (Fisher’s exact test, P < 0.01) terms associated with antigen presentation are colored by their biological keywords. (B) GSEA plots showing significant (P < 0.05) GOBP terms for gossypetin administration condition against vehicle treatment within 5xFAD homeostatic microglia from hippocampus region. Related to Fig. 3D. (C) Volcano plot illustrating the DEGs selected by the comparison between wild type and 5xFAD (left), or vehicle and gossypetin treated 5xFAD (right) from homeostatic microglial population of cortex region. Fig. S5 Transcriptomic transition in cortex microglia and measurement of DAM signature score. (A) Volcano plot showing significant (p < 0.05) DEGs selected by the comparison between cortex homeostatic microglia in vehicle treated wild type and 5xFAD (top left), or vehicle and gossypetin treated 5xFAD (top right). Volcano plots illustrating comparison between gossypetin administration condition against vehicle treatment within 5xFAD stage 1 DAM (bottom left) or stage 2 DAM (bottom right) from cortex are also presented. (B) Violin plot illustrating module scores for the DAM-related genes from previous studies. Cells are grouped by the combination of their mouse strain and treatment condition. (P < 0.001) Fig. 
S6 Gossypetin ameliorates gliosis in microglia and astrocytes. (A to D) Representative images of hippocampus (A) and cortex (C) stained with Hoechst and Iba-1. Scale bar corresponds to 200μm. Bar graph represents quantification of Iba-1 positive area in dentate gyrus of hippocampus (n = 9~12 mice per group, 3~6 slices per brain) (B) and cortex (n = 9~12 mice per group, 3~6 slices per brain) (D). (E to H) Representative images of hippocampus (E) and cortex (G) stained with Hoechst and GFAP. Scale bar corresponds to 200μm. Bar graph represents quantification of GFAP positive area in dentate gyrus of hippocampus (n = 9~12 mice per group, 3~6 slices per brain) (F) and cortex (n = 9~12 mice per group, 3~5 slices per brain) (H). The error bars represent the mean ± SEM.**p
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the set of data shown in the paper "Relevant, hidden, and frustrated information in high-dimensional analyses of complex dynamical systems with internal noise", published on arXiv (DOI: 10.48550/arXiv.2412.09412).
The scripts contained herein are:
To reproduce the data of this work, start from SOAP-Component-Analysis.py to calculate the SOAP descriptor and select the components of interest. Then calculate the PCA with PCA-Analysis.py and apply the clustering that suits your needs (OnionClustering-1d.py, OnionClustering-2d.py, Hierarchical-Clustering.py). Further modifications of the Onion plot can be made with the script OnionClustering-plot.py. UMAP can be calculated with UMAP.py.
Additional data contained herein are:
The data related to the Quincke rollers can be found here: https://zenodo.org/records/10638736
Monitoring Progression of Scleroderma
Project Description
This is a website for visualising datasets to study protein expression in Scleroderma patients. The website is able to generate the following plots:
Correlation Plot
Boxplot
UMAP plot
Volcano plot
Violin plot
Introduction
Scleroderma is an autoimmune disease that can cause thickened areas of skin and connective tissues. To gain a deeper understanding of this condition, analysing the expression of… See the full description on the dataset page: https://huggingface.co/datasets/nfc22/sclerobase_data.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Large-scale comprehensive single-cell experiments are often resource-intensive and require the involvement of many laboratories and/or taking measurements at various times. This inevitably leads to batch effects, and systematic variations in the data that might occur due to different technology platforms, reagent lots, or handling personnel. Such technical differences confound biological variations of interest and need to be corrected during the data integration process. Data integration is a challenging task due to the overlapping of biological and technical factors, which makes it difficult to distinguish their individual contribution to the overall observed effect. Moreover, the choice of integration method may impact the downstream analyses, including searching for differentially expressed genes. From the existing data integration methods, we selected only those that return the full expression matrix. We evaluated six methods in terms of their influence on the performance of differential gene expression analysis in two single-cell datasets with the same biological study design that differ only in the way the measurement was done: one dataset manifests strong batch effects due to the measurements of each sample at a different time. Integrated data were visualized using the UMAP method. The evaluation was done both on individual gene level using parametric and non-parametric approaches for finding differentially expressed genes and on gene set level using gene set enrichment analysis. As an evaluation metric, we used two correlation coefficients, Pearson and Spearman, of the obtained test statistics between reference, test, and corrected studies. Visual comparison of UMAP plots highlighted ComBat-seq, limma, and MNN, which reduced batch effects and preserved differences between biological conditions. Most of the tested methods changed the data distribution after integration, which negatively impacts the use of parametric methods for the analysis. 
Two algorithms, MNN and Scanorama, gave very poor results in terms of differential analysis on gene and gene set levels. Finally, we highlight ComBat-seq as it led to the highest correlation of test statistics between reference and corrected dataset among others. Moreover, it does not distort the original distribution of gene expression data, so it can be used in all types of downstream analyses.
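The evaluation metric described above, Pearson and Spearman correlation between per-gene test statistics of two studies, can be sketched as follows. This is a stdlib-only illustration under stated assumptions: the function names and the example numbers are hypothetical and are not taken from the study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


# Made-up per-gene test statistics for a reference and a corrected study.
reference_stats = [2.1, -0.5, 3.3, 0.2, -1.8]
corrected_stats = [1.9, -0.2, 8.0, 0.4, -2.5]
print(pearson(reference_stats, corrected_stats))
print(spearman(reference_stats, corrected_stats))
```

In this toy example the gene ordering is identical in both studies, so the rank-based coefficient is maximal while the Pearson coefficient is lowered by the one inflated statistic; this is the kind of distributional change the two coefficients are meant to separate.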
https://spdx.org/licenses/CC0-1.0.html
Passive Acoustic Monitoring (PAM) is emerging as a solution for monitoring species and environmental change over large spatial and temporal scales. However, drawing rigorous conclusions based on acoustic recordings is challenging, as there is no consensus over which approaches and indices are best suited for characterizing marine and terrestrial acoustic environments.
Here, we describe the application of multiple machine-learning techniques to the analysis of a large PAM dataset. We combine pre-trained acoustic classification models (VGGish, NOAA & Google Humpback Whale Detector), dimensionality reduction (UMAP), and balanced random forest algorithms to demonstrate how machine-learned acoustic features capture different aspects of the marine environment.
The UMAP dimensions derived from VGGish acoustic features exhibited good performance in separating marine mammal vocalizations according to species and locations. RF models trained on the acoustic features performed well for labelled sounds in the 8 kHz range; however, low- and high-frequency sounds could not be classified using this approach.
The workflow presented here shows how acoustic feature extraction, visualization, and analysis allow for establishing a link between ecologically relevant information and PAM recordings at multiple scales.
The datasets and scripts provided in this repository allow replicating the results presented in the publication.
Methods
Data acquisition and preparation
We collected all records available in the Watkins Marine Mammal Database (WMD) website listed under the “all cuts” page. For each audio file in the WMD, the associated metadata included a label for the sound sources present in the recording (biological, anthropogenic, and environmental), as well as information related to the location and date of recording. To minimize the presence of unwanted sounds in the samples, we only retained audio files with a single source listed in the metadata. We then labelled the selected audio clips according to taxonomic group (Odontocetae, Mysticetae), and species.
We limited the analysis to 12 marine mammal species by discarding data when a species: had less than 60 s of audio available, had a vocal repertoire extending beyond the resolution of the acoustic classification model (VGGish), or was recorded in a single country. To determine if a species was suited for analysis using VGGish, we inspected the Mel-spectrograms of 3-s audio samples and only retained species with vocalizations that could be captured in the Mel-spectrogram (Appendix S1). The vocalizations of species that produce very low-frequency or very high-frequency sounds were not captured by the Mel-spectrogram, thus we removed them from the analysis. To ensure that records included the vocalizations of multiple individuals for each species, we only considered species with records from two or more different countries. Lastly, to avoid overrepresentation of sperm whale vocalizations, we excluded 30,000 sperm whale recordings collected in the Dominican Republic. The resulting dataset consisted of 19,682 audio clips with a duration of 960 milliseconds each (0.96 s) (Table 1).
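The three exclusion criteria above can be expressed as a simple filter. The sketch below is an illustration only: the SpeciesSummary record, its field names, and the example species are hypothetical, and the Mel-spectrogram check is represented by a flag that would be set after the manual inspection described in the text.

```python
from dataclasses import dataclass


@dataclass
class SpeciesSummary:
    """Hypothetical per-species summary built from the database metadata."""
    name: str
    audio_seconds: float   # total audio available for the species
    countries: frozenset   # countries in which the species was recorded
    visible_in_mel: bool   # result of the manual Mel-spectrogram inspection


def keep_species(s, min_seconds=60.0, min_countries=2):
    """Apply the three criteria: enough audio, vocalizations captured, multiple countries."""
    return (
        s.audio_seconds >= min_seconds
        and s.visible_in_mel
        and len(s.countries) >= min_countries
    )


# Made-up examples, one per exclusion reason plus one retained species.
candidates = [
    SpeciesSummary("species_a", 420.0, frozenset({"CA", "US"}), True),
    SpeciesSummary("species_b", 45.0, frozenset({"CA", "US"}), True),    # too little audio
    SpeciesSummary("species_c", 900.0, frozenset({"NO"}), True),         # single country
    SpeciesSummary("species_d", 300.0, frozenset({"CA", "IS"}), False),  # outside Mel range
]
retained = [s.name for s in candidates if keep_species(s)]
print(retained)
```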
The Placentia Bay Database (PBD) includes recordings collected by Fisheries and Oceans Canada in Placentia Bay (Newfoundland, Canada), in 2019. The dataset consisted of two months of continuous recordings (1230 hours), starting on July 1st, 2019, and ending on August 31st, 2019. The data was collected using an AMAR G4 hydrophone (sensitivity: -165.02 dB re 1V/µPa at 250 Hz) deployed at 64 m of depth. The hydrophone was set to operate following 15 min cycles, with the first 60 s sampled at 512 kHz, and the remaining 14 min sampled at 64 kHz. For the purpose of this study, we limited the analysis to the 64 kHz recordings.
Acoustic feature extraction
The audio files from the WMD and PBD databases were used as input for VGGish (Abu-El-Haija et al., 2016; Chung et al., 2018), a CNN developed and trained to perform general acoustic classification. VGGish was trained on the Youtube8M dataset, containing more than two million user-labelled audio-video files. Rather than focusing on the final output of the model (i.e., the assigned labels), here the model was used as a feature extractor (Sethi et al., 2020). VGGish converts audio input into a semantically meaningful vector consisting of 128 features. The model returns features at multiple resolutions: ~1 s (960 ms); ~5 s (4,800 ms); ~1 min (59,520 ms); ~5 min (299,520 ms). All of the visualizations and results pertaining to the WMD were prepared using the finest feature resolution of ~1 s. The visualizations and results pertaining to the PBD were prepared using the ~5 s features for the humpback whale detection example, and were then averaged to an interval of 30 min in order to match the temporal resolution of the environmental measures available for the area.
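The listed resolutions are all whole multiples of the 960 ms base frame, and the 30 min interval is a whole multiple of the ~5 s window. The sketch below checks that arithmetic and shows one plausible way to average consecutive feature vectors. It assumes a plain arithmetic mean over non-overlapping blocks; the function names are illustrative and this is not the authors' code.

```python
FRAME_MS = 960  # duration of one base VGGish feature frame


def frames_per_window(window_ms, frame_ms=FRAME_MS):
    """Number of frames that fit exactly into one window."""
    if window_ms % frame_ms:
        raise ValueError("window is not a whole number of frames")
    return window_ms // frame_ms


def average_blocks(features, block):
    """Average consecutive, non-overlapping groups of `block` feature vectors.

    A trailing incomplete group is dropped.
    """
    averaged = []
    for start in range(0, len(features) - block + 1, block):
        chunk = features[start:start + block]
        # zip(*chunk) iterates over feature dimensions across the block.
        averaged.append([sum(column) / block for column in zip(*chunk)])
    return averaged


print(frames_per_window(4_800))      # frames in a ~5 s window
print(frames_per_window(59_520))     # frames in a ~1 min window
print(frames_per_window(299_520))    # frames in a ~5 min window
print(frames_per_window(30 * 60 * 1000, frame_ms=4_800))  # ~5 s windows in 30 min
```

With 128-dimensional vectors as input, average_blocks(features, 375) would reduce a sequence of ~5 s features to one vector per 30 min interval under these assumptions.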
UMAP ordination and visualization
UMAP is a non-linear dimensionality reduction algorithm based on the concept of topological data analysis which, unlike other dimensionality reduction techniques (e.g., tSNE), preserves both the local and global structure of multivariate datasets (McInnes et al., 2018). To allow for data visualization and to reduce the 128 features to two dimensions for further analysis, we applied Uniform Manifold Approximation and Projection (UMAP) to both datasets and inspected the resulting plots.
The UMAP algorithm generates a low-dimensional representation of a multivariate dataset while maintaining the relationships between points in the global dataset structure (i.e., the 128 features extracted from VGGish). Each point in a UMAP plot in this paper represents an audio sample with duration of ~1 second (WMD dataset), ~5 seconds (PBD dataset, humpback whale detections), or 30 minutes (PBD dataset, environmental variables). Each point in the two-dimensional UMAP space also represents a vector of 128 VGGish features. The nearer two points are in the plot space, the nearer the two points are in the 128-dimensional space, and thus the distance between two points in UMAP reflects the degree of similarity between two audio samples in our datasets. Areas with a high density of samples in UMAP space should, therefore, contain sounds with similar characteristics, and such similarity should decrease with increasing point distance. Previous studies illustrated how VGGish and UMAP can be applied to the analysis of terrestrial acoustic datasets (Heath et al., 2021; Sethi et al., 2020). The visualizations and classification trials presented here illustrate how the two techniques (VGGish and UMAP) can be used together for marine ecoacoustics analysis. UMAP visualizations were prepared using the umap-learn package for the Python programming language (version 3.10). All UMAP visualizations presented in this study were generated using the algorithm’s default parameters.
Labelling sound sources
The labels for the WMD records (i.e., taxonomic group, species, location) were obtained from the database metadata.
For the PBD recordings, we obtained measures of wind speed, surface temperature, and current speed from an oceanographic buoy located in proximity of the recorder (Fig 1). We chose these three variables for their different contributions to background noise in marine environments. Wind speed contributes to underwater background noise at multiple frequencies, ranging from 500 Hz to 20 kHz (Hildebrand et al., 2021). Sea surface temperature contributes to background noise at frequencies between 63 Hz and 125 Hz (Ainslie et al., 2021), while ocean currents contribute to ambient noise at frequencies below 50 Hz (Han et al., 2021). Prior to analysis, we categorized the environmental variables and assigned the categories as labels to the acoustic features (Table 2). Humpback whale vocalizations in the PBD recordings were processed using the humpback whale acoustic detector created by NOAA and Google (Allen et al., 2021), providing a model score for every ~5 s sample. This model was trained on a large dataset (14 years and 13 locations) using humpback whale recordings annotated by experts (Allen et al., 2021). The model returns scores ranging from 0 to 1 indicating the confidence in the predicted humpback whale presence. We used the results of this detection model to label the PBD samples according to presence of humpback whale vocalizations. To verify the model results, we inspected all audio files that contained a 5 s sample with a model score higher than 0.9 for the month of July. If the presence of a humpback whale was confirmed, we labelled the segment as a model detection. We labelled any additional humpback whale vocalization present in the inspected audio files as a visual detection, while we labelled other sources and background noise samples as absences. In total, we labelled 4.6 hours of recordings. We reserved the recordings collected in August to test the precision of the final predictive model.
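The file-selection step for manual verification (any ~5 s sample with a score above 0.9) amounts to a simple threshold filter. The sketch below is illustrative: the function name, the dictionary layout, and the file names are assumptions and do not reflect the actual output format of the detector.

```python
def files_to_inspect(scores_by_file, threshold=0.9):
    """Return the files that contain at least one sample scoring strictly above the threshold."""
    return sorted(
        name
        for name, scores in scores_by_file.items()
        if any(score > threshold for score in scores)
    )


# Made-up detector scores, one list of ~5 s sample scores per audio file.
example_scores = {
    "july_file_001.wav": [0.02, 0.10, 0.95, 0.40],
    "july_file_002.wav": [0.05, 0.90, 0.30],  # exactly 0.9 is not "higher than 0.9"
    "july_file_003.wav": [0.99],
}
print(files_to_inspect(example_scores))
```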
Label prediction performance
We used Balanced Random Forest models (BRF) provided in the imbalanced-learn python package (Lemaître et al., 2017) to predict humpback whale presence and environmental conditions from the acoustic features generated by VGGish. We chose BRF as the algorithm as it is suited for datasets characterized by class imbalance. The BRF algorithm performs undersampling of the majority class prior to prediction, which helps to overcome class imbalance (Lemaître et al., 2017). For each model run, the PBD dataset was split into training (80%) and testing (20%) sets.
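The two ideas in this paragraph, an 80/20 split and undersampling of the majority class, can be illustrated without any external package. The sketch below is a simplified stand-in and not the imbalanced-learn implementation: it performs one random undersampling pass, whereas a balanced random forest resamples for each tree. The split is assumed to be a plain random split, and all names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle and split items into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def undersample(samples, labels, seed=0):
    """Randomly reduce every class to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    n_min = min(len(members) for members in by_class.values())
    balanced_samples, balanced_labels = [], []
    for label, members in by_class.items():
        for sample in rng.sample(members, n_min):
            balanced_samples.append(sample)
            balanced_labels.append(label)
    return balanced_samples, balanced_labels


# Made-up imbalanced data: 90 absences and 10 presences.
samples = list(range(100))
labels = ["absence"] * 90 + ["presence"] * 10

train, test = train_test_split(samples)
balanced_samples, balanced_labels = undersample(samples, labels)
print(len(train), len(test), Counter(balanced_labels))
```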
The training datasets were used to fine-tune the models through a nested k-fold cross-validation approach with ten folds in the outer loop and five folds in the inner loop. We selected nested cross-validation as it allows optimizing model hyperparameters and performing model evaluation in a single step. We used the default parameters of the BRF algorithm, except for the ‘n_estimators’ hyperparameter, for which we tested
https://www.proteinatlas.org/about/licence
This section contains Single Cell Type information based on single cell RNA sequencing (scRNAseq) data from 25 human tissues and peripheral blood mononuclear cells (PBMCs), together with in-house generated immunohistochemically stained tissue sections visualizing the corresponding spatial protein expression patterns. The scRNAseq analysis was based on publicly available genome-wide expression data and comprises all protein-coding genes in 444 individual cell type clusters corresponding to 15 different cell type groups. A specificity and distribution classification was performed to determine the number of genes elevated in these single cell types, and the number of genes detected in one, several or all cell types, respectively. The genes expressed in each of the cell types can be explored in interactive UMAP plots and bar charts, with links to corresponding immunohistochemical stainings in human tissues.
More information about the specific content and the generation and analysis of the data in the section can be found on the Methods Summary.
Learn about:
mRNA and protein expression in single cell types
if a gene is enriched in a particular cell type (specificity)
which genes have a similar expression profile across cell types (expression cluster)
Using CITE-seq, we measured expression of 132 proteins on the cell surfaces of single human bone marrow aspirate cells. Expression of each protein was normalized with an isotype-specific control on a single-cell basis. Principal component analysis of normalized proteins was used to produce UMAP plots that clustered like cell types. After identifying pro-B cells on UMAP plots and further refining these populations by filtering on CD19 expression and absent CD20/IgM expression, we identified differentially expressed genes between patient and control pro-B cells. Two CITE-seq data sets were analyzed.
This archive contains data of scRNAseq and CyTOF in form of Seurat objects, txt and csv files as well as R scripts for data analysis and Figure generation.
A summary of the content is provided in the following.
R scripts
Script to run Machine learning models predicting group specific marker genes: CML_Find_Markers_Zenodo.R
Script to reproduce the majority of Main and Supplementary Figures shown in the manuscript: CML_Paper_Figures_Zenodo.R
Script to run inferCNV analysis: inferCNV_Zenodo.R
Script to plot NATMI analysis results: NATMI_CvsA_FC0.32_Updown_Column_plot_Zenodo.R
Script to conduct sub-clustering and filtering of NK cells: NK_Marker_Detection_Zenodo.R
Helper scripts for plotting and DEG calculation: ComputePairWiseDE_v2.R, Seurat_DE_Heatmap_RCA_Style.R
RDS files
General scRNA-seq Seurat objects:
scRNA-seq seurat object after QC, and cell type annotation used for most analysis in the manuscript: DUKE_DataSet_Doublets_Removed_Relabeled.RDS
scRNA-seq including findings e.g. from NK analysis used in the shiny app: DUKE_final_for_Shiny_App.rds
Neighborhood enrichment score computed for group A across all HSPCs: Enrichment_score_global_groupA.RDS
UMAP coordinates used in the article: Layout_2D_nNeighbours_25_Metric_cosine_TCU_removed.RDS
SCENIC files:
Regulon set used in SCENIC: 2.6_regulons_asGeneSet.Rds
AUC values computed for regulons: 3.4_regulonAUC.Rds
MetaData used in SCENIC: cellInfo.Rds
Group specific regulons for LSC: groupSpecificRegulonsBCRAblP.RDS
Patient specific regulons for LSC: patientSpecificRegulonsBCRAblP.RDS
Patient specificity score for LSC: PatientSpecificRegulonSpecificityScoreBCRAblP.RDS
Regulon specificity score for LSC: RegulonSpecificityScoreBCRAblP.RDS
BCR-ABL1 inference:
HSC with inferred BCR-ABL1 label: HSCs_CML_with_BCR-Abl_label.RDS
UMAP for HSC with inferred BCR-ABL1 label: HSCs_CML_with_BCR-Abl_label_UMAP.RDS
HSPCs with BCR-ABL1 module scores: HSPC_metacluster_74K_with_modscore_27thmay.RDS
NK sub-clustering and filtering:
NK object with module scores: NK_8617cells_with_modscore_1stjune.RDS
Feature genes for NK cells computed with DubStepR: NK_Cells_DubStepR
NK cells Seurat object excluding contaminating T and B cells: NK_cells_T_B_17_removed.RDS
NK Seurat object including neighbourhood enrichment score calculations: NK_seurat_object_with_enrichment_labels_V2.RDS
txt and csv files:
Proportions per cluster calculated from CyTOF: CyTOF_Proportions.txt
Correlation between scRNAseq and CyTOF cell type abundance: scRNAseq_Cor_Cytof.txt
Correlation between manual gating and FlowSOM clustering: Manual_vs_FlowSOM.txt
GSEA results:
HSPC, HSC and LSC results: FINAL_GSEA_DATA_For_GGPLOT.txt
NK: NK_For_Plotting.txt
TFRC and HLA expression: TFRC_and_HLA_Values.txt
NATMI result files:
UP-regulated_mean.csv
DOWN-regulated_mean.csv
Gene position file used in inferCNV: inferCNV_gene_positions_hg38.txt
Module scores for NK subclusters per cell: NK_Supplementary_Module_Scores.csv
Compressed folders:
All CyTOF raw data files: CyTOF_Data_raw.zip
Results of the patient-based classifier: PatientwiseClassifier.zip
Results of the single-cell based classifier: SingleCellClassifierResults.zip
For new data analysis approaches, we recommend that readers start from the Seurat object stored in DUKE_final_for_Shiny_App.rds, or use the Shiny app (http://scdbm.ddnetbio.com/) and perform further analysis from there.
Raw data are available from EGA upon request under Study ID EGAS00001005509.
Revision
The for_CML_manuscript_revision.tar.gz folder contains scripts and data for the paper revision including 1) Detection of the BCR-ABL fusion with long read sequencing; 2) Identification of BCR-ABL junction reads with scRNAseq; 3) Detection of expressed mutations using scRNAseq.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 1:
Figure S1. Information on the MSC atlas taxonomy. (A) UMAP of all MSCs with cluster annotations. (B) UMAP of MSCs color-labelled by tissue. (C) Cell counts of MSCs from different tissues in each cluster. (D) Cell counts of MSCs from different samples in each cluster.
Figure S2. Differentiation scoring of MSCs along five differentiation directions. (A) Scoring of osteogenesis, chondrogenesis, adipogenesis, myogenesis and neurogenesis. (B) Scoring of representative gene expression for MSC differentiation.
Figure S3. Home page of MSCsDB, which includes a website introduction, functionality overview, gene cloud, and website update news.
Figure S4. The Dataset module and its link to the Explore module. Users can view the metadata of each sample dataset, such as the original article, data repository and sequencing technology. Users can also click on the “Explore” button to view the sample’s clustering annotation, gene expression level analysis, pathway enrichment analysis, copy number variation analysis, and pseudotime analysis results.
Figure S5. Functionality of the Atlas module. (A) UMAP of MSCs with cluster annotations. Users can select specific clusters to view their distribution. The MSC atlas can also be classified by tissue or batch and shown separately. (B) Gene signature of MSCs. Users can analyze the cell percentage of all genes and click on the “View” button to view the gene expression levels in cells and clusters. The GeneCards database is also linked for users to view gene information. Users can also enter a specific gene in the search box to retrieve relevant information.
Figure S6. An example of functionality in the Atlas module. (A) Pathway enrichment analysis of MSCs from different databases. Users can switch between databases and select specific clusters and pathways to view their enrichment status. (B) Copy number variation analysis of MSCs using the CopyKAT and InferCNVpy packages. CopyKAT predicts whether cells are normal (diploid) or tumor (aneuploid) cells. InferCNVpy gives prediction values, so we provide chromosome heatmaps based on CNV clustering for users to distinguish between normal cells and tumor cells. (C) Pseudotime analysis of MSCs using the PAGA method. We show the cell trajectory inference plot and cluster UMAP plot for a single sample. (D) Transcription factor network analysis of MSCs using the pySCENIC package. We provide the transcription factor network analysis result table and heatmap for a single sample’s cluster. Users can click on the “View” button in the table to view the target genes regulated by that transcription factor.
Figure S7. De novo analysis for clustering, pathway enrichment, and quality evaluation. (A) UMAP plot of MSC clustering and annotation using the Scanpy package for a sample dataset. (B) Pathway enrichment analysis using the clusterProfiler package for a sample dataset. (C) Copy number variation analysis using the CopyKAT and InferCNVpy packages for a sample dataset.
Figure S8. De novo analysis for pseudotime and gene regulatory network analysis. (A) Pseudotime analysis using the PAGA method for a sample dataset. (B) Gene regulatory network analysis using the pySCENIC package for a sample dataset.
Table S1. Marker genes used for potency score analysis.
Table S2. Scoring for each cluster using the gene sets.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Embeddings of single-cell RNA-Seq data from three adult vertebrate brain datasets into Orthogroup feature space or Structural cluster feature space. Orthogroups were generated using OrthoFinder v5.5.0; Structural clusters were assigned by using FoldSeek to cluster AlphaFold-v4 structural predictions.
The three datasets used as the basis for these embeddings were:
For each dataset, we also generated a standardized cell type annotation file based on the cell type annotations originally provided by the authors. The first column is the cell barcode for that species and the second column is the original study's cell type annotation for that cell.
For the Xenopus brain data, we removed approximately 18k cells that were not annotated in the original data to simplify data analyses; the resulting files carry the "subsampled" suffix. Subsampled versions of the data are also available for the joint embedding space (prefixed with "DrerMmusXlae").
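As a minimal sketch of how such a two-column annotation file could be read in Python: the tab delimiter, the absence of a header row, and the barcodes and cell type labels below are all assumptions made for illustration, not details taken from the dataset.

```python
import csv
import io
from collections import Counter


def read_annotations(handle, delimiter="\t"):
    """Return {cell_barcode: cell_type} from a two-column annotation file.

    Assumes no header row and a tab delimiter (both are assumptions).
    """
    annotations = {}
    for row in csv.reader(handle, delimiter=delimiter):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        annotations[row[0]] = row[1]
    return annotations


# Made-up content standing in for a real annotation file.
example = io.StringIO(
    "AAACCTGAGAAACCAT-1\tneuron\n"
    "AAACCTGAGAAGGTGA-1\tastrocyte\n"
    "AAACCTGCAGTCGATT-1\tneuron\n"
)
annotations = read_annotations(example)

# Number of cells per annotated cell type.
cells_per_type = Counter(annotations.values())
```

To read a real file, replace the `io.StringIO` object with `open(path, newline="")`.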
For the final datasets used in our analyses, we also provide features x cell matrices as .h5ad files for smaller file sizes and faster loading using Scanpy.
For visualizing the UMAP plots of our top200 embedding space, we provide ".tsv" files containing a variety of metrics and the x and y positions of each cell in the UMAP. See "DrerMmusXlae_adultbrain_FoldSeek_plotlydata.tsv" and "DrerMmusXlae_adultbrain_OrthoFinder_plotlydata.tsv".
These data are part of the Arcadia Science Pub titled "Comparing gene expression across species based on protein structure instead of sequence".
https://doi.org/10.5061/dryad.tdz08kq8t
Dataset Overview
A detailed description of the general framework and specific methodology can be found in the relevant publication (https://doi.org/10.7554/eLife.102151.4).
For each dataset, the barcode, feature, and matrix files from the CellRanger output are provided. These files serve as inputs for preparing the Seurat objects used in this study. Barcode files contain a list of cell barcodes. Feature files contain gene names from the reference used for CellRanger and include 3 columns: ENSEMBL number, gene name, and the type of assay run ("GENE EXPRESSION"). Matrix files contain the sparse matrix of UMI counts for each library.
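The following Python sketch illustrates how the three file types fit together. It assumes the matrix file uses the Matrix Market coordinate layout (1-based indices, rows as features, columns as barcodes) and that the feature file is tab-delimited; the barcodes, gene identifiers, and counts are made up, and the function names are hypothetical.

```python
import io


def read_lines(handle):
    """Return the non-empty lines of a text file without trailing newlines."""
    return [line.rstrip("\n") for line in handle if line.strip()]


def read_features(handle):
    """Parse a feature file into (ensembl_id, gene_name, assay_type) tuples."""
    return [tuple(line.split("\t")) for line in read_lines(handle)]


def read_matrix(handle):
    """Parse a Matrix Market coordinate file (assumed layout).

    Returns (n_features, n_barcodes, {(feature_idx, barcode_idx): count})
    with 0-based indices.
    """
    lines = [line for line in read_lines(handle) if not line.startswith("%")]
    n_features, n_barcodes, n_entries = (int(x) for x in lines[0].split())
    counts = {}
    for line in lines[1:]:
        row, col, value = line.split()
        counts[(int(row) - 1, int(col) - 1)] = int(value)
    if len(counts) != n_entries:
        raise ValueError("entry count does not match the size line")
    return n_features, n_barcodes, counts


# Made-up miniature inputs standing in for the real files.
barcodes = read_lines(io.StringIO("AAAC-1\nAAAG-1\n"))
features = read_features(io.StringIO(
    "ENS_EXAMPLE_1\tGeneA\tGene Expression\n"
    "ENS_EXAMPLE_2\tGeneB\tGene Expression\n"
    "ENS_EXAMPLE_3\tGeneC\tGene Expression\n"
))
n_features, n_barcodes, counts = read_matrix(io.StringIO(
    "%%MatrixMarket matrix coordinate integer general\n"
    "3 2 4\n"
    "1 1 5\n"
    "3 1 2\n"
    "2 2 7\n"
    "3 2 1\n"
))

# Total UMI count per cell barcode.
umi_per_barcode = {barcode: 0 for barcode in barcodes}
for (_, barcode_idx), value in counts.items():
    umi_per_barcode[barcodes[barcode_idx]] += value
```

For full-size libraries a sparse-matrix library would be the practical choice; the pure-Python parser here only serves to show how row and column indices map onto the feature and barcode lists.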
Dissociated cells were loaded onto the 10X Chromium Cell Controller ...
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Interactive UMAP plot of the Australia recordings.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Interactive UMAP plot of the French Polynesia recordings.
https://www.proteinatlas.org/about/licencehttps://www.proteinatlas.org/about/licence
The Single Cell Type Atlas contains single cell RNA sequencing (scRNAseq) data from 13 different human tissues, together with in-house generated immunohistochemically stained tissue sections visualizing the corresponding spatial protein expression patterns. The scRNAseq analysis was based on publicly available genome-wide expression data and comprises all protein-coding genes in 192 individual cell type clusters corresponding to 12 different cell type groups. A specificity and distribution classification was performed to determine the number of genes elevated in these single cell types, and the number of genes detected in one, several or all cell types, respectively. The genes expressed in each of the cell types can be explored in interactive UMAP plots and bar charts, with links to corresponding immunohistochemical stainings in human tissues.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Mass cytometry and single-cell RNA-sequencing data as well as R Markdown reports to reproduce the figures of our publication.
Raw MC data were saved post de-convolution, spillover-compensation, and removal of calibration bead events. Gates for singlets and non-dead cells (low_Pt) are included as logical columns and should be applied prior to usage.
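Applying the gates before use could look like the following Python sketch. It assumes the events are available as a delimited table in which the gate columns hold TRUE/FALSE values; the column name "singlets", the marker column "CD45", and all values are hypothetical, and only "low_Pt" is named in the description above.

```python
import csv
import io

TRUE_VALUES = {"true", "t", "1"}


def apply_gates(rows, gate_columns):
    """Keep only the events for which every gate column is true."""
    return [
        row
        for row in rows
        if all(row[col].strip().lower() in TRUE_VALUES for col in gate_columns)
    ]


# Made-up events; column names other than "low_Pt" are assumptions.
example = io.StringIO(
    "event_id,singlets,low_Pt,CD45\n"
    "1,TRUE,TRUE,3.2\n"
    "2,TRUE,FALSE,1.1\n"
    "3,FALSE,TRUE,0.4\n"
    "4,TRUE,TRUE,2.7\n"
)
rows = list(csv.DictReader(example))

# Retain singlet, non-dead events only.
gated = apply_gates(rows, ["singlets", "low_Pt"])
gated_ids = [row["event_id"] for row in gated]
```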
As we performed random sampling to equalise cell numbers across conditions, batch normalisation, and used non-linear dimensionality reduction techniques (UMAP and Diffusion Maps), resulting plots may differ slightly from the published figures, yet still support the drawn conclusions. Already normalised and/or sampled data as well as pre-computed UMAP and Diffusion Map coordinates are included in this data set to reproduce the manuscript figures exactly, as shown in the included report “figures_only”. For all details on the batch normalisation and data analysis steps performed, please consult the report “data_analysis” instead.
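The effect of random sampling on reproducibility can be illustrated with a short Python sketch. It assumes that equalising means downsampling every condition to the size of the smallest one, which is an assumption rather than the documented procedure; the function name, cell identifiers, and condition labels are hypothetical.

```python
import random
from collections import defaultdict


def equalise_by_condition(cells, seed=0):
    """Randomly downsample each condition to the size of the smallest one.

    `cells` is an iterable of (cell_id, condition) pairs. A fixed seed makes
    the selection repeatable; a different seed selects different cells.
    """
    by_condition = defaultdict(list)
    for cell_id, condition in cells:
        by_condition[condition].append(cell_id)
    target = min(len(ids) for ids in by_condition.values())
    rng = random.Random(seed)
    return {
        condition: sorted(rng.sample(ids, target))
        for condition, ids in by_condition.items()
    }


# Made-up cells: 10 in condition "ctrl", 4 in condition "treated".
cells = [(f"ctrl_{i}", "ctrl") for i in range(10)]
cells += [(f"treated_{i}", "treated") for i in range(4)]

sampled = equalise_by_condition(cells, seed=1)
sampled_again = equalise_by_condition(cells, seed=1)
```

With the same seed both calls return the same cells, whereas an unseeded run would generally pick a different subset, which is why pre-sampled data are needed to reproduce figures exactly.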