21 datasets found
  1. umap-learn

    • kaggle.com
    zip
    Updated Oct 19, 2025
    Cite
    HyeongChan Kim (2025). umap-learn [Dataset]. https://www.kaggle.com/kozistr/umaplearn
    Explore at:
    Available download formats: zip (46934808 bytes)
    Dataset updated
    Oct 19, 2025
    Authors
    HyeongChan Kim
    Description

    UMAP

    Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can be used for visualization similarly to t-SNE, but also for general non-linear dimension reduction. The algorithm is founded on three assumptions about the data:

    1. The data is uniformly distributed on a Riemannian manifold;
    2. The Riemannian metric is locally constant (or can be approximated as such);
    3. The manifold is locally connected.

    From these assumptions, it is possible to model the manifold with a fuzzy topological structure. The embedding is found by searching for a low dimensional projection of the data that has the closest possible equivalent fuzzy topological structure.

    The details for the underlying mathematics can be found in our paper on ArXiv:

    McInnes, L, Healy, J, UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, ArXiv e-prints 1802.03426, 2018
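As a rough illustration of the "fuzzy topological structure" mentioned above, the sketch below computes fuzzy nearest-neighbour membership weights in pure Python. It is a simplified reading of the graph-construction step described in the UMAP paper, not the umap-learn implementation; the function names `euclidean` and `fuzzy_knn_weights` and the default `k=3` are assumptions made for this example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fuzzy_knn_weights(points, k=3, n_iter=64):
    """Return {(i, j): weight} for a symmetrized fuzzy k-NN graph.

    Hypothetical helper for illustration only; it assumes 2 <= k < len(points).
    """
    if k < 2 or k > len(points) - 1:
        raise ValueError("need 2 <= k <= len(points) - 1")
    target = math.log2(k)
    directed = {}
    for i, p in enumerate(points):
        # The k nearest neighbours of point i, excluding i itself.
        neighbours = sorted(
            (euclidean(p, q), j) for j, q in enumerate(points) if j != i
        )[:k]
        rho = neighbours[0][0]  # distance to the nearest neighbour
        # Binary search for the scale sigma at which the memberships
        # sum to log2(k).
        lo, hi, sigma = 0.0, math.inf, 1.0
        for _ in range(n_iter):
            total = sum(
                math.exp(-max(0.0, d - rho) / sigma) for d, _j in neighbours
            )
            if total > target:
                hi = sigma
            else:
                lo = sigma
            sigma = lo * 2 if math.isinf(hi) else (lo + hi) / 2
        for d, j in neighbours:
            directed[(i, j)] = math.exp(-max(0.0, d - rho) / sigma)
    # Symmetrize with the probabilistic union a + b - a*b.
    symmetric = {}
    for (i, j), a in directed.items():
        b = directed.get((j, i), 0.0)
        symmetric[(i, j)] = symmetric[(j, i)] = a + b - a * b
    return symmetric


weights = fuzzy_knn_weights([(0.0,), (1.0,), (3.0,), (7.0,), (15.0,)], k=3)
```

Each point's nearest neighbour always receives a membership of 1, which is one way of encoding the "locally connected" assumption.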

  2. Data from: Data related to Panzer: A Machine Learning Based Approach to...

    • darus.uni-stuttgart.de
    Updated Nov 27, 2024
    Cite
    Tim Panzer (2024). Data related to Panzer: A Machine Learning Based Approach to Analyze Supersecondary Structures of Proteins [Dataset]. http://doi.org/10.18419/DARUS-4576
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 27, 2024
    Dataset provided by
    DaRUS
    Authors
    Tim Panzer
    License

    https://darus.uni-stuttgart.de/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.18419/DARUS-4576

    Time period covered
    Nov 1, 1976 - Feb 29, 2024
    Dataset funded by
    DFG
    Description

    This entry contains the data used to implement the bachelor thesis, which investigated how embeddings can be used to analyze supersecondary structures. Abstract of the thesis: This thesis analyzes the behavior of supersecondary structures in the context of embeddings. For this purpose, data from the Protein Topology Graph Library was provided with embeddings. This resulted in a structured graph database, which will be used for future work and analyses. In addition, different projections were made into two-dimensional space to analyze how the embeddings behave there.

    The Jupyter Notebook 1_data_retrival.ipynb contains the download process of the graph files from the Protein Topology Graph Library (https://ptgl.uni-frankfurt.de). The downloaded .gml files can also be found in graph_files.zip. They form graphs that represent the relationships of supersecondary structures in the proteins and are the data basis for further analyses. These graph files are then processed in the Jupyter Notebook 2_data_storage_and_embeddings.ipynb and entered into a graph database. The sequences of the supersecondary and secondary structures from the PTGL can be found in fastas.zip. The embeddings were calculated using the ESM model of the Facebook Research Group (huggingface.co/facebook/esm2_t12_35M_UR50D) and can be found in three .h5 files; they are subsequently added to the database. The whole process in this notebook serves to build up the database, which can then be searched using Cypher queries. In the Jupyter Notebook 3_data_science.ipynb, different visualizations and analyses are carried out with the help of UMAP.

    For the installation of all dependencies, it is recommended to create a Conda environment and install all packages there. To use the project, PyEED should be installed using the snapshot of the original repository (source repository: https://github.com/PyEED/pyeed). The best way to install PyEED is to execute the pip install -e . command in the pyeed_BT folder. The dependencies can also be installed by using poetry and the .toml file. In addition, seaborn, h5py and umap-learn are required. These can be installed using the following commands:
    pip install h5py==3.12.1
    pip install seaborn==0.13.2 umap-learn==0.5.7
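After installing, it can be useful to confirm that the environment matches the pinned versions listed above. The following is a minimal sketch using only the Python standard library; the `PINS` dictionary simply restates the three pins from the description, and `check_pins` is a hypothetical helper name, not part of the dataset's code.

```python
from importlib import metadata

# Version pins quoted in the dataset description.
PINS = {"h5py": "3.12.1", "seaborn": "0.13.2", "umap-learn": "0.5.7"}


def check_pins(pins):
    """Compare installed distribution versions against the expected pins."""
    report = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if installed == wanted:
            report[name] = "ok"
        else:
            report[name] = f"found {installed}, expected {wanted}"
    return report


for package, status in check_pins(PINS).items():
    print(f"{package}: {status}")
```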

  3. UMAP-Based Split

    • figshare.com
    csv
    Updated Apr 30, 2025
    Cite
    Amitesh Badkul (2025). UMAP-Based Split [Dataset]. http://doi.org/10.6084/m9.figshare.28908209.v1
    Explore at:
    Available download formats: csv
    Dataset updated
    Apr 30, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Amitesh Badkul
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    UMAP-Based split

  4. Acoustic features as a tool to visualize and explore marine soundscapes:...

    • data.niaid.nih.gov
    • search.dataone.org
    • +1 more
    zip
    Updated Feb 15, 2024
    Cite
    Simone Cominelli; Nicolo' Bellin; Carissa D. Brown; Jack Lawson (2024). Acoustic features as a tool to visualize and explore marine soundscapes: Applications illustrated using marine mammal Passive Acoustic Monitoring datasets [Dataset]. http://doi.org/10.5061/dryad.3bk3j9kn8
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 15, 2024
    Dataset provided by
    Memorial University of Newfoundland
    University of Parma
    Fisheries and Oceans Canada
    Authors
    Simone Cominelli; Nicolo' Bellin; Carissa D. Brown; Jack Lawson
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Passive Acoustic Monitoring (PAM) is emerging as a solution for monitoring species and environmental change over large spatial and temporal scales. However, drawing rigorous conclusions based on acoustic recordings is challenging, as there is no consensus over which approaches and indices are best suited for characterizing marine and terrestrial acoustic environments. Here, we describe the application of multiple machine-learning techniques to the analysis of a large PAM dataset. We combine pre-trained acoustic classification models (VGGish, NOAA & Google Humpback Whale Detector), dimensionality reduction (UMAP), and balanced random forest algorithms to demonstrate how machine-learned acoustic features capture different aspects of the marine environment. The UMAP dimensions derived from VGGish acoustic features exhibited good performance in separating marine mammal vocalizations according to species and locations. RF models trained on the acoustic features performed well for labelled sounds in the 8 kHz range; however, low- and high-frequency sounds could not be classified using this approach. The workflow presented here shows how acoustic feature extraction, visualization, and analysis allow for establishing a link between ecologically relevant information and PAM recordings at multiple scales. The datasets and scripts provided in this repository allow replicating the results presented in the publication.

    Methods

    Data acquisition and preparation

    We collected all records available in the Watkins Marine Mammal Database (WMD) website listed under the "all cuts" page. For each audio file in the WMD the associated metadata included a label for the sound sources present in the recording (biological, anthropogenic, and environmental), as well as information related to the location and date of recording. To minimize the presence of unwanted sounds in the samples, we only retained audio files with a single source listed in the metadata.
We then labelled the selected audio clips according to taxonomic group (Odontocetae, Mysticetae), and species. We limited the analysis to 12 marine mammal species by discarding data when a species: had less than 60 s of audio available, had a vocal repertoire extending beyond the resolution of the acoustic classification model (VGGish), or was recorded in a single country. To determine if a species was suited for analysis using VGGish, we inspected the Mel-spectrograms of 3-s audio samples and only retained species with vocalizations that could be captured in the Mel-spectrogram (Appendix S1). The vocalizations of species that produce very low-frequency or very high-frequency sounds were not captured by the Mel-spectrogram, thus we removed them from the analysis. To ensure that records included the vocalizations of multiple individuals for each species, we only considered species with records from two or more different countries. Lastly, to avoid overrepresentation of sperm whale vocalizations, we excluded 30,000 sperm whale recordings collected in the Dominican Republic. The resulting dataset consisted of 19,682 audio clips with a duration of 960 milliseconds each (0.96 s) (Table 1). The Placentia Bay Database (PBD) includes recordings collected by Fisheries and Oceans Canada in Placentia Bay (Newfoundland, Canada), in 2019. The dataset consisted of two months of continuous recordings (1230 hours), starting on July 1st, 2019, and ending on August 31st, 2019. The data was collected using an AMAR G4 hydrophone (sensitivity: -165.02 dB re 1V/µPa at 250 Hz) deployed at 64 m of depth. The hydrophone was set to operate following 15 min cycles, with the first 60 s sampled at 512 kHz, and the remaining 14 min sampled at 64 kHz. For the purpose of this study, we limited the analysis to the 64 kHz recordings.
Acoustic feature extraction

The audio files from the WMD and PBD databases were used as input for VGGish (Abu-El-Haija et al., 2016; Chung et al., 2018), a CNN developed and trained to perform general acoustic classification. VGGish was trained on the YouTube-8M dataset, containing more than two million user-labelled audio-video files. Rather than focusing on the final output of the model (i.e., the assigned labels), here the model was used as a feature extractor (Sethi et al., 2020). VGGish converts audio input into a semantically meaningful vector consisting of 128 features. The model returns features at multiple resolutions: ~1 s (960 ms); ~5 s (4800 ms); ~1 min (59’520 ms); ~5 min (299’520 ms). All of the visualizations and results pertaining to the WMD were prepared using the finest feature resolution of ~1 s. The visualizations and results pertaining to the PBD were prepared using the ~5 s features for the humpback whale detection example, and were then averaged to an interval of 30 min in order to match the temporal resolution of the environmental measures available for the area.

UMAP ordination and visualization

UMAP is a non-linear dimensionality reduction algorithm based on the concept of topological data analysis which, unlike other dimensionality reduction techniques (e.g., tSNE), preserves both the local and global structure of multivariate datasets (McInnes et al., 2018). To allow for data visualization and to reduce the 128 features to two dimensions for further analysis, we applied Uniform Manifold Approximation and Projection (UMAP) to both datasets and inspected the resulting plots. The UMAP algorithm generates a low-dimensional representation of a multivariate dataset while maintaining the relationships between points in the global dataset structure (i.e., the 128 features extracted from VGGish).
Each point in a UMAP plot in this paper represents an audio sample with duration of ~1 second (WMD dataset), ~5 seconds (PBD dataset, humpback whale detections), or 30 minutes (PBD dataset, environmental variables). Each point in the two-dimensional UMAP space also represents a vector of 128 VGGish features. The nearer two points are in the plot space, the nearer the two points are in the 128-dimensional space, and thus the distance between two points in UMAP reflects the degree of similarity between two audio samples in our datasets. Areas with a high density of samples in UMAP space should, therefore, contain sounds with similar characteristics, and such similarity should decrease with increasing point distance. Previous studies illustrated how VGGish and UMAP can be applied to the analysis of terrestrial acoustic datasets (Heath et al., 2021; Sethi et al., 2020). The visualizations and classification trials presented here illustrate how the two techniques (VGGish and UMAP) can be used together for marine ecoacoustics analysis. UMAP visualizations were prepared using the umap-learn package for the Python programming language (version 3.10). All UMAP visualizations presented in this study were generated using the algorithm’s default parameters.
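The 30-minute averaging of the ~5 s feature vectors described above can be sketched as follows. This is an illustrative pure-Python version, not the authors' script; the function name `average_into_bins`, the timestamp convention (seconds from the start of the deployment) and the toy two-element vectors are assumptions.

```python
def average_into_bins(timestamps_s, features, bin_seconds=1800):
    """Average feature vectors whose timestamps fall into the same time bin.

    Returns {bin_index: mean_vector}, where bin_index = timestamp // bin_seconds.
    """
    sums, counts = {}, {}
    for t, vec in zip(timestamps_s, features):
        b = int(t // bin_seconds)
        if b not in sums:
            sums[b] = [0.0] * len(vec)
            counts[b] = 0
        sums[b] = [s + v for s, v in zip(sums[b], vec)]
        counts[b] += 1
    return {b: [s / counts[b] for s in sums[b]] for b in sorted(sums)}


# Toy input: two samples in the first 30 min, two in the second.
binned = average_into_bins(
    [0.0, 4.8, 1800.0, 1804.8],
    [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0], [20.0, 30.0]],
)
```

With real VGGish output each vector would have 128 entries rather than two.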
    Labelling sound sources

    The labels for the WMD records (i.e., taxonomic group, species, location) were obtained from the database metadata. For the PBD recordings, we obtained measures of wind speed, surface temperature, and current speed from an oceanographic buoy (Fig 1) located in proximity of the recorder. We chose these three variables for their different contributions to background noise in marine environments. Wind speed contributes to underwater background noise at multiple frequencies, ranging from 500 Hz to 20 kHz (Hildebrand et al., 2021). Sea surface temperature contributes to background noise at frequencies between 63 Hz and 125 Hz (Ainslie et al., 2021), while ocean currents contribute to ambient noise at frequencies below 50 Hz (Han et al., 2021). Prior to analysis, we categorized the environmental variables and assigned the categories as labels to the acoustic features (Table 2). Humpback whale vocalizations in the PBD recordings were processed using the humpback whale acoustic detector created by NOAA and Google (Allen et al., 2021), providing a model score for every ~5 s sample. This model was trained on a large dataset (14 years and 13 locations) using humpback whale recordings annotated by experts (Allen et al., 2021). The model returns scores ranging from 0 to 1 indicating the confidence in the predicted humpback whale presence. We used the results of this detection model to label the PBD samples according to presence of humpback whale vocalizations. To verify the model results, we inspected all audio files that contained a 5 s sample with a model score higher than 0.9 for the month of July. If the presence of a humpback whale was confirmed, we labelled the segment as a model detection. We labelled any additional humpback whale vocalization present in the inspected audio files as a visual detection, while we labelled other sources and background noise samples as absences. In total, we labelled 4.6 hours of recordings.
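The file-selection rule described above (inspect every audio file containing at least one ~5 s sample with a detector score above 0.9) can be expressed in a few lines. The sketch below is illustrative only; `files_to_inspect`, the file names and the dictionary layout are assumptions rather than part of the published scripts.

```python
def files_to_inspect(scores_by_file, threshold=0.9):
    """Return, sorted, the files with at least one sample score above threshold."""
    return sorted(
        name
        for name, scores in scores_by_file.items()
        if any(score > threshold for score in scores)
    )


# Hypothetical detector output: one score per ~5 s sample in each file.
example_scores = {
    "file_a.wav": [0.10, 0.95, 0.40],
    "file_b.wav": [0.20, 0.90],
    "file_c.wav": [],
}
selected = files_to_inspect(example_scores)
```

The comparison is strict, so a score of exactly 0.9 does not trigger inspection, matching the "higher than 0.9" wording.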
We reserved the recordings collected in August to test the precision of the final predictive model.

Label prediction performance

We used Balanced Random Forest models (BRF) provided in the imbalanced-learn python package (Lemaître et al., 2017) to predict humpback whale presence and environmental conditions from the acoustic features generated by VGGish. We chose BRF as the algorithm as it is suited for datasets characterized by class imbalance. The BRF algorithm performs undersampling of the majority class prior to prediction, which helps overcome class imbalance (Lemaître et al., 2017). For each model run, the PBD dataset was split into training (80%) and testing (20%) sets. The training datasets were used to fine-tune the models through a nested k-fold cross validation approach with ten folds in the outer loop, and five folds in the inner loop. We selected nested cross validation as it allows optimizing model hyperparameters and performing model evaluation in a single step. We used the default parameters of the BRF algorithm, except for the ‘n_estimators’ hyperparameter, for which we tested
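The nested cross-validation layout mentioned above (ten outer folds, five inner folds) can be illustrated with index bookkeeping alone. The sketch below only generates the splits; it does not train any model, and the helper names `k_fold_indices` and `nested_cv`, the shuffling strategy and the seeds are assumptions rather than the authors' code.

```python
import random


def k_fold_indices(indices, k, seed=0):
    """Yield (train, held_out) index lists for a shuffled k-fold split."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def nested_cv(n_samples, outer_k=10, inner_k=5, seed=0):
    """Yield (outer_train, outer_test, inner_splits) for nested cross validation.

    The inner splits partition only the outer training indices, so data used
    for hyperparameter selection never overlaps the outer test fold.
    """
    for outer_train, outer_test in k_fold_indices(range(n_samples), outer_k, seed):
        inner_splits = list(k_fold_indices(outer_train, inner_k, seed + 1))
        yield outer_train, outer_test, inner_splits


splits = list(nested_cv(100))
```

In practice each inner split would be used to score candidate hyperparameter values, and each outer test fold to evaluate the selected setting.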

  5. AI-MATH-LLM-Package

    • kaggle.com
    zip
    Updated Jun 20, 2024
    Cite
    Johnson chong (2024). AI-MATH-LLM-Package [Dataset]. https://www.kaggle.com/datasets/johnsonhk88/ai-math-llm-package
    Explore at:
    Available download formats: zip (3330554065 bytes)
    Dataset updated
    Jun 20, 2024
    Authors
    Johnson chong
    Description

    This is an install package of essential libraries for LLM RAG and fine-tuning (Hugging Face Hub, transformers, langchain, evaluate, sentence-transformers, etc.), suitable for the offline requirement of Kaggle competitions. The packages were downloaded from the Kaggle development environment.

    Supported packages: transformers, datasets, accelerate, bitsandbytes, langchain, langchain-community, sentence-transformers, chromadb, faiss-cpu, huggingface_hub, langchain-text-splitters, peft, trl, umap-learn, evaluate, deepeval, weave

    Suggested install commands in Kaggle:
    !pip install transformers --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/tranformers
    !pip install -U datasets --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/datasets
    !pip install -U accelerate --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/accelerate
    !pip install build --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/build-1.2.1-py3-none-any.whl
    !pip install -U bitsandbytes --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/bitsandbytes-0.43.1-py3-none-manylinux_2_24_x86_64.whl
    !pip install langchain --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/langchain-0.2.5-py3-none-any.whl
    !pip install langchain-core --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/langchain_core-0.2.9-py3-none-any.whl
    !pip install langsmith --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/langsmith-0.1.81-py3-none-any.whl
    !pip install langchain-community --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/langchain_community-0.2.5-py3-none-any.whl
    !pip install sentence-transformers --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/sentence_transformers-3.0.1-py3-none-any.whl
    !pip install chromadb --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/chromadb-0.5.3-py3-none-any.whl
    !pip install faiss-cpu --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/faiss_cpu-1.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
    !pip install -U huggingface_hub --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/huggingface_hub
    !pip install -qU langchain-text-splitters --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/langchain_text_splitters-0.2.1-py3-none-any.whl
    !pip install -U peft --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/peft-0.11.1-py3-none-any.whl
    !pip install -U trl --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/trl-0.9.4-py3-none-any.whl
    !pip install umap-learn --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/umap-learn
    !pip install evaluate --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/evaluate-0.4.2-py3-none-any.whl
    !pip install deepeval --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/deepeval-0.21.59-py3-none-any.whl
    !pip install weave --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/weave-0.50.2-py3-none-any.whl
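All of the commands above share one pattern: a package name, the `--no-index` and `--no-deps` flags, and a `--find-links` location under the dataset's download-package folder. As a small sketch, the pattern could be generated from a mapping instead of being typed by hand; the helper `offline_install_command` and the constant `BASE` are hypothetical names introduced for this example.

```python
# Base location quoted in the install commands of the dataset description.
BASE = "file:///kaggle/input/ai-math-llm-package/download-package"


def offline_install_command(package, target, upgrade=False):
    """Build one offline pip command; `target` is a wheel file or folder name."""
    parts = ["pip", "install"]
    if upgrade:
        parts.append("-U")
    parts += [package, "--no-index", "--no-deps", f"--find-links={BASE}/{target}"]
    return " ".join(parts)


print(offline_install_command("peft", "peft-0.11.1-py3-none-any.whl", upgrade=True))
```

In a Kaggle notebook cell the resulting string would be prefixed with `!`, as in the listing above.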

  6. Data_Sheet_1_Manifold learning for fMRI time-varying functional...

    • frontiersin.figshare.com
    docx
    Updated Jul 11, 2023
    Cite
    Javier Gonzalez-Castillo; Isabel S. Fernandez; Ka Chun Lam; Daniel A. Handwerker; Francisco Pereira; Peter A. Bandettini (2023). Data_Sheet_1_Manifold learning for fMRI time-varying functional connectivity.docx [Dataset]. http://doi.org/10.3389/fnhum.2023.1134012.s001
    Explore at:
    Available download formats: docx
    Dataset updated
    Jul 11, 2023
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Javier Gonzalez-Castillo; Isabel S. Fernandez; Ka Chun Lam; Daniel A. Handwerker; Francisco Pereira; Peter A. Bandettini
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Whole-brain functional connectivity (FC) measured with functional MRI (fMRI) evolves over time in meaningful ways at temporal scales going from years (e.g., development) to seconds [e.g., within-scan time-varying FC (tvFC)]. Yet, our ability to explore tvFC is severely constrained by its large dimensionality (several thousands). To overcome this difficulty, researchers often seek to generate low dimensional representations (e.g., 2D and 3D scatter plots) hoping those will retain important aspects of the data (e.g., relationships to behavior and disease progression). Limited prior empirical work suggests that manifold learning techniques (MLTs)—namely those seeking to infer a low dimensional non-linear surface (i.e., the manifold) where most of the data lies—are good candidates for accomplishing this task. Here we explore this possibility in detail. First, we discuss why one should expect tvFC data to lie on a low dimensional manifold. Second, we estimate what is the intrinsic dimension (ID; i.e., minimum number of latent dimensions) of tvFC data manifolds. Third, we describe the inner workings of three state-of-the-art MLTs: Laplacian Eigenmaps (LEs), T-distributed Stochastic Neighbor Embedding (T-SNE), and Uniform Manifold Approximation and Projection (UMAP). For each method, we empirically evaluate its ability to generate neuro-biologically meaningful representations of tvFC data, as well as their robustness against hyper-parameter selection. Our results show that tvFC data has an ID that ranges between 4 and 26, and that ID varies significantly between rest and task states. We also show how all three methods can effectively capture subject identity and task being performed: UMAP and T-SNE can capture these two levels of detail concurrently, but LE could only capture one at a time. We observed substantial variability in embedding quality across MLTs, and within-MLT as a function of hyper-parameter selection. 
To help alleviate this issue, we provide heuristics that can inform future studies. Finally, we also demonstrate the importance of feature normalization when combining data across subjects and the role that temporal autocorrelation plays in the application of MLTs to tvFC data. Overall, we conclude that while MLTs can be useful to generate summary views of labeled tvFC data, their application to unlabeled data such as resting-state remains challenging.
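The abstract stresses feature normalization when combining data across subjects but does not say which scheme was used. One common choice, shown here purely as an assumption, is to z-score each feature within each subject before pooling; the function names and the toy data are hypothetical.

```python
import math


def zscore_columns(rows):
    """Z-score each column of a list of equal-length rows (population std)."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / n)
        for col, m in zip(columns, means)
    ]
    # Constant columns (std == 0) are mapped to 0.0 to avoid division by zero.
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]


def normalize_per_subject(data_by_subject):
    """Apply column-wise z-scoring separately to each subject's samples."""
    return {subj: zscore_columns(rows) for subj, rows in data_by_subject.items()}


pooled_ready = normalize_per_subject({
    "subject_1": [[1.0, 10.0], [3.0, 10.0]],
    "subject_2": [[100.0, 0.0], [300.0, 2.0]],
})
```

After this step both subjects' features share a common scale, so differences in overall signal amplitude between subjects no longer dominate the pooled data.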

  7. Data from: Computational analyses of dynamic visual courtship display reveal...

    • dataone.org
    • data.niaid.nih.gov
    • +2 more
    Updated Jul 21, 2025
    Cite
    Noori Choi; Eileen Hebets; Dustin Wilgers (2025). Computational analyses of dynamic visual courtship display reveal diet-dependent and plastic male signaling in Rabidosa rabida wolf spiders [Dataset]. http://doi.org/10.5061/dryad.sbcc2frb6
    Explore at:
    Dataset updated
    Jul 21, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Noori Choi; Eileen Hebets; Dustin Wilgers
    Time period covered
    Jan 1, 2023
    Description

    It has long been a challenge to quantify the variation in dynamic motions to understand how those displays function in animal communication. The traditional approach is dependent on labor-intensive manual identification/annotation by experts. However, the recent progress in computational techniques provides researchers with toolsets for rapid, objective, and reproducible quantification of dynamic visual displays. In the present study, we investigated the effects of diet manipulation on dynamic visual components of male courtship displays of Rabidosa rabida wolf spiders using machine learning algorithms. Our results suggest that (i) the computational approach can provide an insight into the variation in the dynamic visual display between high- and low-diet males which is not clearly shown with the traditional approach and (ii) males may plastically alter their courtship display according to the body size of females they encounter. Through the present study, we add an example of the utili...

    Raw data - We recorded male courtship with a Photron Fastcam 1024 PCI 100k high-speed camera (Photron USA, San Diego, CA, USA) and a Sony DCR-HC65 NTSC Handycam (Sony Electronics Inc., USA). Then, we analyzed the movement of the foreleg and pedipalps during the selected courtship bouts using ProAnalyst Lite software (Xcitex Inc., Woburn, Massachusetts, USA). We first set the x-axis and y-axis by where the pedipalp tip was in contact with the substrate (y-position 0) and most posterior point of the abdomen (x-position 0) at the beginning of the courtship bout. When the foreleg or pedipalps did not move during the courtship bout, the location of the joint was recorded by the location of the parts at the cocked position. In the case of the image being blurred, the location of blurred points was guessed based on the previous or subsequent frames or other parts in the current frame.

    Computational analyses of the courtship dance of male wolf spiders

    • 4 Python scripts, 1 R script and 4 CSV files are included.
    1. 0_raw_data_process.py
    • fills the non-observed values with the initial position of each feature
    • creates gif and png figures to describe the visual display
    • requires the following packages
      • numpy, pandas, seaborn, matplotlib, math
    2. 1_rabidosa_pose_cluster.py
    • clusters the foreleg postures from each frame
    • uses UMAP and HDBSCAN
    • requires the following packages
      • umap, hdbscan, pickle, pandas, numpy, tensorflow, seaborn, matplotlib, scipy, sklearn
    3. 2_rabidosa_LSTM.py
    • trains and saves an LSTM model of the dynamic visual display of male R. rabida
    • clusters visual displays using umap and hdbscan
    • requires the following packages
      • umap, hdbscan, pickle, pandas, numpy, tensorflow, seaborn, matplotlib, tsaug, sklearn
    4. 3_trad_clustering.py
    • clusters visual displays using traditional features with umap and hdbscan
    • require the...
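The first preprocessing step listed above, filling non-observed values with the initial position of each feature, might look like the sketch below. It is not taken from 0_raw_data_process.py; it assumes missing values are stored as `None` and that "initial position" means the first observed value of a track, and `fill_with_initial` is a hypothetical name.

```python
def fill_with_initial(track):
    """Replace None entries with the first observed value in the track."""
    initial = next((value for value in track if value is not None), None)
    return [initial if value is None else value for value in track]


# Hypothetical per-frame coordinate tracks for two tracked features.
tracks = {
    "foreleg_x": [None, 2.0, None, 5.0],
    "pedipalp_y": [0.0, None, None, 1.5],
}
filled = {name: fill_with_initial(values) for name, values in tracks.items()}
```

A track with no observations at all is returned unchanged, since there is no initial position to copy.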

  8. Table_1_MorphoGlia, an interactive method to identify and map microglia...

    • figshare.com
    • frontiersin.figshare.com
    docx
    Updated Dec 3, 2024
    Cite
    Juan Pablo Maya-Arteaga; Humberto Martínez-Orozco; Sofía Diaz-Cintra (2024). Table_1_MorphoGlia, an interactive method to identify and map microglia morphologies, demonstrates differences in hippocampal subregions of an Alzheimer’s disease mouse model.DOCX [Dataset]. http://doi.org/10.3389/fncel.2024.1505048.s001
    Explore at:
    Available download formats: docx
    Dataset updated
    Dec 3, 2024
    Dataset provided by
    Frontiers
    Authors
    Juan Pablo Maya-Arteaga; Humberto Martínez-Orozco; Sofía Diaz-Cintra
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Microglia are dynamic central nervous system cells crucial for maintaining homeostasis and responding to neuroinflammation, as evidenced by their varied morphologies. Existing morphology analysis often fails to detect subtle variations within the full spectrum of microglial morphologies due to their reliance on predefined categories. Here, we present MorphoGlia, an interactive, user-friendly pipeline that objectively characterizes microglial morphologies. MorphoGlia employs a machine learning ensemble to select relevant morphological features of microglia cells, perform dimensionality reduction, cluster these features, and subsequently map the clustered cells back onto the tissue, providing a spatial context for the identified microglial morphologies. We applied this pipeline to compare the responses between saline solution (SS) and scopolamine (SCOP) groups in a SCOP-induced mouse model of Alzheimer’s disease, with a specific focus on the hippocampal subregions CA1 and Hilus. Next, we assessed microglial morphologies across four groups: SS-CA1, SCOP-CA1, SS-Hilus, and SCOP-Hilus. The results demonstrated that MorphoGlia effectively differentiated between SS and SCOP-treated groups, identifying distinct clusters of microglial morphologies commonly associated with pro-inflammatory states in the SCOP groups. Additionally, MorphoGlia enabled spatial mapping of these clusters, identifying the most affected hippocampal layers. This study highlights MorphoGlia’s capability to provide unbiased analysis and clustering of microglial morphological states, making it a valuable tool for exploring microglial heterogeneity and its implications for central nervous system pathologies.

  9. GAMMA: Galactic Attributes of Mass, Metallicity, and Age Dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Nov 3, 2023
    Cite
    Ufuk Çakır (2023). GAMMA: Galactic Attributes of Mass, Metallicity, and Age Dataset [Dataset]. http://doi.org/10.5281/zenodo.8375344
    Explore at:
    Available download formats: bin
    Dataset updated
    Nov 3, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ufuk Çakır
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We introduce the GAMMA (Galactic Attributes of Mass, Metallicity, and Age) dataset, a comprehensive collection of galaxy data tailored for Machine Learning applications. This dataset offers detailed 2D maps and 3D cubes of 11 727 galaxies, capturing essential attributes: stellar age, metallicity, and mass.

    Together with the dataset we publish our code to extract any other stellar or gaseous property from the raw simulation suite to extend the dataset beyond these initial properties, ensuring versatility for various computational tasks. Ideal for feature extraction, clustering, and regression tasks, GAMMA offers a unique lens for exploring galactic structures through computational methods and is a bridge between astrophysical simulations and the field of scientific machine learning (ML).

    As a first benchmark, we apply Principal Component Analysis (PCA) on this dataset. We find that PCA effectively captures the key morphological features of galaxies with a small number of components. We achieve a dimensionality reduction by a factor of ∼200 (∼3650) for 2D images (3D cubes) with a reconstruction accuracy below 5%.

    We compute a UMAP (Uniform Manifold Approximation and Projection for Dimension Reduction) embedding of the lower-dimensional PCA scores of the 2D images to visualize the image space. An interactive version of this plot can be accessed using an online Dashboard (hover over a point to see the galaxy image and the IllustrisTNG Subhalo ID).

    All the code to generate this dataset and load the data structure is publicly available on GitHub, with an additional documentation page hosted on ReadTheDocs.

  10. Material for manifold learning techniques comparison on benchmark dataset

    • springernature.figshare.com
    application/x-gzip
    Updated Jul 5, 2024
    Cite
    Elodie Laine; Valentin Lombard; Sergei Grudinin (2024). Material for manifold learning techniques comparison on benchmark dataset [Dataset]. http://doi.org/10.6084/m9.figshare.25112459.v1
    Explore at:
    Available download formats: application/x-gzip
    Dataset updated
    Jul 5, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Elodie Laine; Valentin Lombard; Sergei Grudinin
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This archive contains the restricted 10-ensemble benchmark and the scripts used in the manifold learning techniques assessment. Files related to an ensemble are prefixed with ID1_ID2_, where ID1 is the first member in alphabetical order, and ID2 is the reference for the structural alignment.

    The archive includes the following for each member of the benchmark:

    • A _mm.pdb file containing the ensemble's conformations.
    • A _aln.fa file, which is the multiple sequence alignment of the ensemble.
    • A _rmsd.txt file with all pairwise root mean squared deviations (RMSD) of the ensemble.
    • A _raw_coords_ca.bin file with the raw coordinates in binary format.
    • A _raw_coords_ca_mask.bin file with the gap coordinates in binary format.
    • A _features_pca.csv file detailing the positions of each sample in the ensemble's principal component space.
    • A _dist_to_hull.csv file with the ID of each ensemble member, their label in the clustering in the PC space, and the squared distance of this sample to the convex hull formed by members of the other clusters.
    • A _pca_errors.csv file containing the same information as the _dist_to_hull.csv file, but with the addition of the PCA reconstruction error, measured as the RMSD between the predicted and ground truth structures. The prediction of a sample is done by fitting the PCA to all clusters except the one being evaluated.
    • Three _XXX_kcpa_errors.json files with the kPCA reconstruction errors for each ensemble member, measured as the RMSD between the predicted and ground truth structures, using kPCA at different sigma and alpha parameters from the grid search. The XXX indicates the kernel used. The prediction of a sample is done by fitting the kPCA to all clusters except the one being evaluated.
    • A _umap_errors.json file with the UMAP reconstruction errors for each ensemble member, measured as the RMSD between the predicted and ground truth structures, using UMAP at different n_neigh and min_dist parameters from the grid search. The prediction of a sample is done by fitting the UMAP to all clusters except the one being evaluated. UMAP could be run only on a subset of the ensembles.
    • A _rbf_kpca_default_sigma.json file containing the kPCA reconstruction errors for each ensemble member, measured as the RMSD between the predicted and ground truth structures, using kPCA with RBF kernel at the default alpha and sigma parameters. The prediction of a sample is done by fitting the kPCA to all clusters except the one being evaluated.
    • A _rbf_kpca_errors_real.json file with the kPCA reconstruction errors for each ensemble member, measured as the RMSD between the predicted and ground truth structures, using kPCA with RBF kernel with a predicted optimal sigma parameter and alpha parameters of 1.0, 1e-5, and 1e-6. The prediction of a sample is done by fitting the kPCA to all clusters except the one being evaluated.

    The scripts used to generate the convex hull and for the PCA-kPCA comparison are as follows:

    • dist_to_hull.py computes the coordinates in the PC space of each member, divides the members into clusters, and computes the distance of each member to the convex hull formed by members of the other clusters in the PC space. This script uses polytope_app.cpp with a Python binding to compute the squared distance of each member to the convex hull.
    • polytope_module.so is the compiled C++ module called by the Python script.
    • interpol_apase.py computes the interpolation in the ATPase latent space, and outputs the .pdb files of the trajectories.
    • pca_kpca.py calculates the reconstruction error for PCA, kPCA, and UMAP for each ensemble member by fitting the PCA, kPCA, or UMAP to all members of other clusters, excluding the cluster of the member currently being evaluated.

    The archive also includes a procheck folder containing summary tables of the procheck analysis on original and reconstructed structures. The stats.csv file contains descriptive information about the benchmark. Please consult the related documentation to understand the meaning of each column in this file.
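
    The leave-one-cluster-out evaluation described above can be sketched for the PCA case as follows. This is not the archive's pca_kpca.py: the function names, the synthetic coordinates, and the cluster labels are hypothetical, and the conformations are assumed to be already superimposed so that RMSD can be computed without a further alignment step.

```python
import numpy as np

def fit_pca(flat, n_components):
    """Mean and leading principal axes of flattened coordinates."""
    mean = flat.mean(axis=0)
    _, _, vt = np.linalg.svd(flat - mean, full_matrices=False)
    return mean, vt[:n_components]

def leave_cluster_out_rmsd(coords, labels, n_components):
    """Per-conformation RMSD between each structure and its PCA reconstruction.

    coords has shape (n_conformations, n_atoms, 3). For every cluster, PCA is
    fitted on the conformations of all *other* clusters, and the held-out
    conformations are projected onto that space and mapped back.
    """
    n_conf, n_atoms, _ = coords.shape
    flat = coords.reshape(n_conf, n_atoms * 3)
    errors = np.empty(n_conf)
    for cluster in np.unique(labels):
        held_out = labels == cluster
        mean, axes = fit_pca(flat[~held_out], n_components)
        scores = (flat[held_out] - mean) @ axes.T
        recon = scores @ axes + mean
        diff = (recon - flat[held_out]).reshape(-1, n_atoms, 3)
        # RMSD: root of the mean squared per-atom displacement.
        errors[held_out] = np.sqrt((diff ** 2).sum(axis=2).mean(axis=1))
    return errors

rng = np.random.default_rng(0)
n_atoms = 10
t = np.linspace(0.0, 1.0, 30)          # position along a synthetic motion
labels = np.repeat([0, 1, 2], 10)      # three hypothetical clusters
base = rng.normal(size=(n_atoms, 3))
direction = rng.normal(size=(n_atoms, 3))
bend = rng.normal(size=(n_atoms, 3))

# A purely linear motion, and one with an added quadratic (curved) component.
linear = base + t[:, None, None] * direction
curved = linear + (t ** 2)[:, None, None] * bend

linear_errors = leave_cluster_out_rmsd(linear, labels, n_components=1)
curved_errors = leave_cluster_out_rmsd(curved, labels, n_components=1)
print(linear_errors.max(), curved_errors.max())
```

    In this toy setting a single linear component recovers held-out conformations of the linear motion almost exactly, while the curved motion leaves a residual error, which is the kind of gap that motivates comparing PCA with non-linear methods.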

  11. Identifying galaxies, quasars and stars with machine learning: a new...

    • zenodo.org
    bin
    Updated Oct 1, 2020
    Cite
    Alex Clarke (2020). Identifying galaxies, quasars and stars with machine learning: a new catalogue of classifications for 111 million SDSS sources without spectra - parquet format [Dataset]. http://doi.org/10.5281/zenodo.4060257
    Explore at:
    Available download formats: bin
    Dataset updated
    Oct 1, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alex Clarke
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the same as the published data available under 10.5281/zenodo.3768398, but in the format of parquet files. This means you can access it using Dask for convenience when using cloud compute facilities.

    Abstract: We used 3.1 million spectroscopically labelled sources from the Sloan Digital Sky Survey (SDSS) to train an optimised random forest classifier using photometry from the SDSS and the Widefield Infrared Survey Explorer (WISE). We applied this machine learning model to 111 million previously unlabelled sources from the SDSS photometric catalogue which did not have existing spectroscopic observations. Our new catalogue contains 50.4 million galaxies, 2.1 million quasars, and 58.8 million stars. We provide individual classification probabilities for each source, with 6.7 million galaxies (13%), 0.33 million quasars (15%), and 41.3 million stars (70%) having classification probabilities greater than 0.99; and 35.1 million galaxies (70%), 0.72 million quasars (34%), and 54.7 million stars (93%) having classification probabilities greater than 0.9. Precision, Recall, and F1 score were determined as a function of selected features and magnitude error. We investigate the effect of class imbalance on our machine learning model and discuss the implications of transfer learning for populations of sources at fainter magnitudes than the training set. We used a non-linear dimension reduction technique (Uniform Manifold Approximation and Projection: UMAP) in unsupervised, semi-supervised, and fully-supervised schemes to visualise the separation of galaxies, quasars, and stars in a two-dimensional space. When applying this algorithm to the 111 million sources without spectra, it is in strong agreement with the class labels applied by our random forest model.

    When using this dataset, please reference our paper via the journal (https://arxiv.org/abs/1909.10963) and this DOI (10.5281/zenodo.4060257). If you make use of our scripts, please reference our GitHub repository DOI (10.5281/zenodo.3855160).

    File descriptions:

    All of these files are pandas DataFrames, saved as uncompressed parquet files for ease of access with tools such as Dask on cloud compute facilities. df_spec_classprobs.parquet contains the spectroscopically observed sources used for training and testing. This has been cleaned, and has the results of the random forest classifier added as additional columns (sources used for training have NaNs in the class_pred column). SDSS-ML-all.parquet contains the 111 million photometrically observed sources, with our class labels and probabilities added.
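
    The probability cuts quoted in the abstract (greater than 0.99 and greater than 0.9) amount to counting the sources whose class probability exceeds a threshold. A minimal sketch with made-up values follows; the function name and the example probabilities are hypothetical and are not taken from the catalogue.

```python
def fraction_above(probabilities, threshold):
    """Fraction of sources whose class probability exceeds the threshold."""
    if not probabilities:
        return 0.0
    return sum(p > threshold for p in probabilities) / len(probabilities)

# Hypothetical classification probabilities for sources labelled as galaxies.
galaxy_probs = [0.995, 0.97, 0.91, 0.85, 0.999, 0.60, 0.93, 0.992]

frac_099 = fraction_above(galaxy_probs, 0.99)
frac_090 = fraction_above(galaxy_probs, 0.90)
print(f"p > 0.99: {frac_099:.1%}, p > 0.9: {frac_090:.1%}")
```

    Applied to the real catalogue, the same cut would be evaluated per class on the probability columns of the parquet files, whose exact column names should be checked in the files themselves.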

  12. Interactive UMAP plot of the Australia recordings.

    • plos.figshare.com
    html
    Updated May 9, 2025
    Cite
    Ben Williams; Santiago M. Balvanera; Sarab S. Sethi; Timothy A.C. Lamont; Jamaluddin Jompa; Mochyudho Prasetya; Laura Richardson; Lucille Chapuis; Emma Weschke; Andrew Hoey; Ricardo Beldade; Suzanne C. Mills; Anne Haguenauer; Frederic Zuberer; Stephen D. Simpson; David Curnick; Kate E. Jones (2025). Interactive UMAP plot of the Australia recordings. [Dataset]. http://doi.org/10.1371/journal.pcbi.1013029.s005
    Explore at:
    Available download formats: html
    Dataset updated
    May 9, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Ben Williams; Santiago M. Balvanera; Sarab S. Sethi; Timothy A.C. Lamont; Jamaluddin Jompa; Mochyudho Prasetya; Laura Richardson; Lucille Chapuis; Emma Weschke; Andrew Hoey; Ricardo Beldade; Suzanne C. Mills; Anne Haguenauer; Frederic Zuberer; Stephen D. Simpson; David Curnick; Kate E. Jones
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Australia
    Description

    Interactive UMAP plot of the Australia recordings.

  13. Interactive UMAP plot of the French Polynesia recordings.

    • figshare.com
    html
    Updated May 9, 2025
    Cite
    Ben Williams; Santiago M. Balvanera; Sarab S. Sethi; Timothy A.C. Lamont; Jamaluddin Jompa; Mochyudho Prasetya; Laura Richardson; Lucille Chapuis; Emma Weschke; Andrew Hoey; Ricardo Beldade; Suzanne C. Mills; Anne Haguenauer; Frederic Zuberer; Stephen D. Simpson; David Curnick; Kate E. Jones (2025). Interactive UMAP plot of the French Polynesia recordings. [Dataset]. http://doi.org/10.1371/journal.pcbi.1013029.s006
    Explore at:
    Available download formats: html
    Dataset updated
    May 9, 2025
    Dataset provided by
    PLOS Computational Biology
    Authors
    Ben Williams; Santiago M. Balvanera; Sarab S. Sethi; Timothy A.C. Lamont; Jamaluddin Jompa; Mochyudho Prasetya; Laura Richardson; Lucille Chapuis; Emma Weschke; Andrew Hoey; Ricardo Beldade; Suzanne C. Mills; Anne Haguenauer; Frederic Zuberer; Stephen D. Simpson; David Curnick; Kate E. Jones
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    French Polynesia
    Description

    Interactive UMAP plot of the French Polynesia recordings.

  14. Comparison of machine-learning methods by different measurements for CyTOF...

    • plos.figshare.com
    xls
    Updated Jun 5, 2023
    Cite
    Lijun Cheng; Pratik Karkhanis; Birkan Gokbag; Yueze Liu; Lang Li (2023). Comparison of machine-learning methods by different measurements for CyTOF Dataset 1 (13 biomarkers, 24 labeled cell types). [Dataset]. http://doi.org/10.1371/journal.pcbi.1008885.t004
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 5, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Lijun Cheng; Pratik Karkhanis; Birkan Gokbag; Yueze Liu; Lang Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Comparison of machine-learning methods by different measurements for CyTOF Dataset 1 (13 biomarkers, 24 labeled cell types).

  15. Two CyTOF benchmark data sets for analysis.

    • plos.figshare.com
    xls
    Updated Jun 15, 2023
    Cite
    Lijun Cheng; Pratik Karkhanis; Birkan Gokbag; Yueze Liu; Lang Li (2023). Two CyTOF benchmark data sets for analysis. [Dataset]. http://doi.org/10.1371/journal.pcbi.1008885.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 15, 2023
    Dataset provided by
    PLOS Computational Biology
    Authors
    Lijun Cheng; Pratik Karkhanis; Birkan Gokbag; Yueze Liu; Lang Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Two CyTOF benchmark data sets for analysis.

  16. Comparison of methods for averaging performance in the identification of...

    • plos.figshare.com
    xls
    Updated Jun 5, 2023
    Cite
    Lijun Cheng; Pratik Karkhanis; Birkan Gokbag; Yueze Liu; Lang Li (2023). Comparison of methods for averaging performance in the identification of known cell types in training and testing data by different measurements for CyTOF1 and CyTOF2 datasets. [Dataset]. http://doi.org/10.1371/journal.pcbi.1008885.t003
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 5, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Lijun Cheng; Pratik Karkhanis; Birkan Gokbag; Yueze Liu; Lang Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Comparison of methods for averaging performance in the identification of known cell types in training and testing data by different measurements for CyTOF1 and CyTOF2 datasets.

  17. Data from: Multitarget Natural Compounds for Ischemic Stroke Treatment:...

    • acs.figshare.com
    • datasetcatalog.nlm.nih.gov
    xlsx
    Updated Mar 14, 2025
    Cite
    Junyu Zhou; Chen Li; Yu Yue; Yong Kwan Kim; Sunmin Park (2025). Multitarget Natural Compounds for Ischemic Stroke Treatment: Integration of Deep Learning Prediction and Experimental Validation [Dataset]. http://doi.org/10.1021/acs.jcim.5c00135.s002
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Mar 14, 2025
    Dataset provided by
    ACS Publications
    Authors
    Junyu Zhou; Chen Li; Yu Yue; Yong Kwan Kim; Sunmin Park
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Ischemic stroke’s complex pathophysiology demands therapeutic approaches targeting multiple pathways simultaneously, yet current treatments remain limited. We developed an innovative drug discovery pipeline combining a deep learning approach with experimental validation to identify natural compounds with comprehensive neuroprotective properties. Our computational framework integrated SELFormer, a transformer-based deep learning model, and multiple deep learning algorithms to predict natural compound (NC) bioactivity against seven crucial stroke-related targets (ACE, GLA, MMP9, NPFFR2, PDE4D, and eNOS). The pipeline encompassed IC50 predictions, clustering analysis, quantitative structure–activity relationship (QSAR) modeling, and uniform manifold approximation and projection (UMAP)-based bioactivity profiling followed by molecular docking studies and experimental validation. Analysis revealed six distinct NC clusters with unique molecular signatures. UMAP projection identified 11 medium-activity (6 < pIC50 ≤ 7) and 57 high-activity (pIC50 > 7) compounds, with molecular docking confirming strong correlations between binding energies and predicted pIC50 values. In vitro studies using NGF-differentiated PC12 cells under oxygen-glucose deprivation demonstrated significant neuroprotective effects of four high-activity compounds: feruloyl glucose, l-hydroxy-l-tryptophan, mulberrin, and ellagic acid. These compounds enhanced cell viability, reduced acetylcholinesterase activity and lipid peroxidation, suppressed TNF-α expression, and upregulated BDNF mRNA levels. Notably, mulberrin and ellagic acid showed superior efficacy in modulating oxidative stress, inflammation, and neurotrophic signaling. This study establishes a robust deep learning-driven framework for identifying multitarget natural therapeutics for ischemic stroke. The validated compounds, particularly mulberrin and ellagic acid, are promising for stroke treatment development. Our findings demonstrate the effectiveness of integrating computational prediction with experimental validation in accelerating drug discovery for complex neurological disorders.
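
    The activity bands quoted in the abstract can be made concrete with a short sketch. It uses the standard definition pIC50 = -log10(IC50 in mol/L) and the two thresholds stated above; the function names, the "low" label for values at or below 6, and the example IC50 values are assumptions for illustration and do not come from the study.

```python
import math

def pic50_from_ic50_nm(ic50_nm):
    """Convert an IC50 given in nanomolar to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_nm * 1e-9)

def activity_class(pic50):
    """Bin a pIC50 value using the thresholds quoted in the abstract."""
    if pic50 > 7:
        return "high"      # pIC50 > 7
    if pic50 > 6:
        return "medium"    # 6 < pIC50 <= 7
    return "low"           # hypothetical label for everything else

# Hypothetical IC50 values in nM, chosen away from the band boundaries.
for ic50_nm in (50, 500, 5000):
    pic50 = pic50_from_ic50_nm(ic50_nm)
    print(f"IC50 = {ic50_nm} nM -> pIC50 = {pic50:.2f} ({activity_class(pic50)})")
```

    Because pIC50 is logarithmic, each band of width one corresponds to a tenfold change in IC50: under these thresholds, high-activity compounds have a predicted IC50 below 100 nM.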

  18. Calibration of cell types utilizing calibration feedback for CyTOF1 and...

    • plos.figshare.com
    xls
    Updated May 31, 2023
    Cite
    Lijun Cheng; Pratik Karkhanis; Birkan Gokbag; Yueze Liu; Lang Li (2023). Calibration of cell types utilizing calibration feedback for CyTOF1 and CyTOF2 data. [Dataset]. http://doi.org/10.1371/journal.pcbi.1008885.t006
    Explore at:
    Available download formats: xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Lijun Cheng; Pratik Karkhanis; Birkan Gokbag; Yueze Liu; Lang Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Calibration of cell types utilizing calibration feedback for CyTOF1 and CyTOF2 data.

  19. Enriched pathways of the top 500 genes’ mRNA expression level associated...

    • plos.figshare.com
    xls
    Updated Jun 2, 2023
    Cite
    Takuma Shibahara; Chisa Wada; Yasuho Yamashita; Kazuhiro Fujita; Masamichi Sato; Junichi Kuwata; Atsushi Okamoto; Yoshimasa Ono (2023). Enriched pathways of the top 500 genes’ mRNA expression level associated with the 1D embeddings of RV η, RF ρ, and the inner vectors of SNNs and DeepCC in UMAP. [Dataset]. http://doi.org/10.1371/journal.pone.0286072.t005
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Takuma Shibahara; Chisa Wada; Yasuho Yamashita; Kazuhiro Fujita; Masamichi Sato; Junichi Kuwata; Atsushi Okamoto; Yoshimasa Ono
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Enriched pathways of the top 500 genes’ mRNA expression level associated with the 1D embeddings of RV η, RF ρ, and the inner vectors of SNNs and DeepCC in UMAP.

  20. Library 1 LY6A UMAP cluster sequences.

    • plos.figshare.com
    txt
    Updated Jul 19, 2023
    + more versions
    Cite
    Qin Huang; Albert T. Chen; Ken Y. Chan; Hikari Sorensen; Andrew J. Barry; Bahar Azari; Qingxia Zheng; Thomas Beddow; Binhui Zhao; Isabelle G. Tobey; Cynthia Moncada-Reid; Fatma-Elzahraa Eid; Christopher J. Walkey; M. Cecilia Ljungberg; William R. Lagor; Jason D. Heaney; Yujia A. Chan; Benjamin E. Deverman (2023). Library 1 LY6A UMAP cluster sequences. [Dataset]. http://doi.org/10.1371/journal.pbio.3002112.s017
    Explore at:
    Available download formats: txt
    Dataset updated
    Jul 19, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Qin Huang; Albert T. Chen; Ken Y. Chan; Hikari Sorensen; Andrew J. Barry; Bahar Azari; Qingxia Zheng; Thomas Beddow; Binhui Zhao; Isabelle G. Tobey; Cynthia Moncada-Reid; Fatma-Elzahraa Eid; Christopher J. Walkey; M. Cecilia Ljungberg; William R. Lagor; Jason D. Heaney; Yujia A. Chan; Benjamin E. Deverman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Viruses have evolved the ability to bind and enter cells through interactions with a wide variety of cell macromolecules. We engineered peptide-modified adeno-associated virus (AAV) capsids that transduce the brain through the introduction of de novo interactions with 2 proteins expressed on the mouse blood–brain barrier (BBB), LY6A or LY6C1. The in vivo tropisms of these capsids are predictable as they are dependent on the cell- and strain-specific expression of their target protein. This approach generated hundreds of capsids with dramatically enhanced central nervous system (CNS) tropisms within a single round of screening in vitro and secondary validation in vivo thereby reducing the use of animals in comparison to conventional multi-round in vivo selections. The reproducible and quantitative data derived via this method enabled both saturation mutagenesis and machine learning (ML)-guided exploration of the capsid sequence space. Notably, during our validation process, we determined that nearly all published AAV capsids that were selected for their ability to cross the BBB in mice leverage either the LY6A or LY6C1 protein, which are not present in primates. This work demonstrates that AAV capsids can be directly targeted to specific proteins to generate potent gene delivery vectors with known mechanisms of action and predictable tropisms.
