45 datasets found
  1. Umap Dataset

    • universe.roboflow.com
    zip
    Updated May 22, 2025
    Cite
    eunahtest (2025). Umap Dataset [Dataset]. https://universe.roboflow.com/eunahtest/umap/dataset/1
    Available download formats: zip
    Dataset updated
    May 22, 2025
    Dataset authored and provided by
    eunahtest
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Mon2 Bounding Boxes
    Description

    Umap

    ## Overview
    
    Umap is a dataset for object detection tasks - it contains Mon2 annotations for 217 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  2. laion-aesthetics-12m-umap

    • huggingface.co
    Updated Apr 7, 2023
    Cite
    David McClure (2023). laion-aesthetics-12m-umap [Dataset]. https://huggingface.co/datasets/dclure/laion-aesthetics-12m-umap
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 7, 2023
    Authors
    David McClure
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    LAION-Aesthetics :: CLIP → UMAP

    This dataset is a CLIP (text) → UMAP embedding of the LAION-Aesthetics dataset - specifically the improved_aesthetics_6plus version, which filters the full dataset to images with scores of > 6 under the "aesthetic" filtering model. Thanks LAION for this amazing corpus!

    The dataset here includes coordinates for three separate UMAP fits using different values for the n_neighbors parameter (10, 30, and 60), which are broken out as separate columns with… See the full description on the dataset page: https://huggingface.co/datasets/dclure/laion-aesthetics-12m-umap.

  3. Data from: Reference transcriptomics of porcine peripheral immune cells...

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    • +2 more
    Updated Jun 5, 2025
    Cite
    Agricultural Research Service (2025). Data from: Reference transcriptomics of porcine peripheral immune cells created through bulk and single-cell RNA sequencing [Dataset]. https://catalog.data.gov/dataset/data-from-reference-transcriptomics-of-porcine-peripheral-immune-cells-created-through-bul-e667c
    Dataset updated
    Jun 5, 2025
    Dataset provided by
    Agricultural Research Service
    Description

    This dataset contains files reconstructing single-cell data presented in 'Reference transcriptomics of porcine peripheral immune cells created through bulk and single-cell RNA sequencing' by Herrera-Uribe & Wiarda et al. 2021. Samples of peripheral blood mononuclear cells (PBMCs) were collected from seven pigs and processed for single-cell RNA sequencing (scRNA-seq) in order to provide a reference annotation of porcine immune cell transcriptomics at enhanced, single-cell resolution. Analysis of single-cell data allowed identification of 36 cell clusters that were further classified into 13 cell types, including monocytes, dendritic cells, B cells, antibody-secreting cells, numerous populations of T cells, NK cells, and erythrocytes. Files may be used to reconstruct the data as presented in the manuscript, allowing for individual query by other users. Scripts for original data analysis are available at https://github.com/USDA-FSEPRU/PorcinePBMCs_bulkRNAseq_scRNAseq. Raw data are available at https://www.ebi.ac.uk/ena/browser/view/PRJEB43826. Funding for this dataset was also provided by NRSP8: National Animal Genome Research Program (https://www.nimss.org/projects/view/mrp/outline/18464).

    Resources in this dataset:

    Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells 10X Format.
    File Name: PBMC7_AllCells.zip
    Resource Description: Zipped folder containing PBMC counts matrix, gene names, and cell IDs. Files are as follows:
    • matrix of gene counts* (matrix.mtx.gz)
    • gene names (features.tsv.gz)
    • cell IDs (barcodes.tsv.gz)
    *The ‘raw’ count matrix is actually gene counts obtained following ambient RNA removal. During ambient RNA removal, we specified to calculate non-integer count estimations, so most gene counts are actually non-integer values in this matrix but should still be treated as raw/unnormalized data that requires further normalization/transformation. Data can be read into R using the function Read10X().

    Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells Metadata.
    File Name: PBMC7_AllCells_meta.csv
    Resource Description: .csv file containing metadata for cells included in the final dataset. Metadata columns include:
    • nCount_RNA = the number of transcripts detected in a cell
    • nFeature_RNA = the number of genes detected in a cell
    • Loupe = cell barcodes; correspond to the cell IDs found in the .h5Seurat and 10X formatted objects for all cells
    • prcntMito = percent mitochondrial reads in a cell
    • Scrublet = doublet probability score assigned to a cell
    • seurat_clusters = cluster ID assigned to a cell
    • PaperIDs = sample ID for a cell
    • celltypes = cell type ID assigned to a cell

    Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells PCA Coordinates.
    File Name: PBMC7_AllCells_PCAcoord.csv
    Resource Description: .csv file containing first 100 PCA coordinates for cells.

    Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells t-SNE Coordinates.
    File Name: PBMC7_AllCells_tSNEcoord.csv
    Resource Description: .csv file containing t-SNE coordinates for all cells.

    Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells UMAP Coordinates.
    File Name: PBMC7_AllCells_UMAPcoord.csv
    Resource Description: .csv file containing UMAP coordinates for all cells.

    Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - CD4 T Cells t-SNE Coordinates.
    File Name: PBMC7_CD4only_tSNEcoord.csv
    Resource Description: .csv file containing t-SNE coordinates for only CD4 T cells (clusters 0, 3, 4, 28). A dataset of only CD4 T cells can be re-created from the PBMC7_AllCells.h5Seurat, and t-SNE coordinates used in publication can be re-assigned using this .csv file.

    Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - CD4 T Cells UMAP Coordinates.
    File Name: PBMC7_CD4only_UMAPcoord.csv
    Resource Description: .csv file containing UMAP coordinates for only CD4 T cells (clusters 0, 3, 4, 28). A dataset of only CD4 T cells can be re-created from the PBMC7_AllCells.h5Seurat, and UMAP coordinates used in publication can be re-assigned using this .csv file.

    Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - Gamma Delta T Cells UMAP Coordinates.
    File Name: PBMC7_GDonly_UMAPcoord.csv
    Resource Description: .csv file containing UMAP coordinates for only gamma delta T cells (clusters 6, 21, 24, 31). A dataset of only gamma delta T cells can be re-created from the PBMC7_AllCells.h5Seurat, and UMAP coordinates used in publication can be re-assigned using this .csv file.

    Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - Gamma Delta T Cells t-SNE Coordinates.
    File Name: PBMC7_GDonly_tSNEcoord.csv
    Resource Description: .csv file containing t-SNE coordinates for only gamma delta T cells (clusters 6, 21, 24, 31). A dataset of only gamma delta T cells can be re-created from the PBMC7_AllCells.h5Seurat, and t-SNE coordinates used in publication can be re-assigned using this .csv file.

    Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - Gene Annotation Information.
    File Name: UnfilteredGeneInfo.txt
    Resource Description: .txt file containing gene nomenclature information used to assign gene names in the dataset. 'Name' column corresponds to the name assigned to a feature in the dataset.

    Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells H5Seurat.
    File Name: PBMC7.tar
    Resource Description: .h5Seurat object of all cells in PBMC dataset. File needs to be untarred, then read into R using function LoadH5Seurat().
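The coordinate and metadata .csv files listed above can also be combined outside R. The following is a minimal Python sketch, not the authors' workflow. The `Loupe` and `celltypes` column names come from the description; the coordinate-file column names (`barcode`, `UMAP_1`, `UMAP_2`) and all sample rows are assumptions made for illustration, and the inline strings stand in for PBMC7_AllCells_meta.csv and PBMC7_AllCells_UMAPcoord.csv.

```python
import csv
import io

# Stand-ins for PBMC7_AllCells_meta.csv and PBMC7_AllCells_UMAPcoord.csv.
# "Loupe" and "celltypes" are named in the description; the coordinate-file
# column names and every row below are assumptions for illustration only.
META_CSV = """Loupe,celltypes,seurat_clusters
AAAC-1,B cells,2
AAAG-1,Monocytes,13
AAAT-1,NK cells,1
"""
UMAP_CSV = """barcode,UMAP_1,UMAP_2
AAAC-1,1.5,-0.25
AAAG-1,-3.0,4.0
"""


def join_coords(meta_file, coord_file, meta_key="Loupe", coord_key="barcode"):
    """Attach UMAP coordinates to metadata rows that share a barcode."""
    coords = {row[coord_key]: row for row in csv.DictReader(coord_file)}
    joined = []
    for row in csv.DictReader(meta_file):
        hit = coords.get(row[meta_key])
        if hit is None:
            continue  # this cell has no coordinates in the file; skip it
        joined.append({
            "barcode": row[meta_key],
            "celltype": row["celltypes"],
            "umap": (float(hit["UMAP_1"]), float(hit["UMAP_2"])),
        })
    return joined


cells = join_coords(io.StringIO(META_CSV), io.StringIO(UMAP_CSV))
for cell in cells:
    print(cell)
```

With the real files, the `io.StringIO` objects would be replaced by opened file handles, and the key and coordinate column names should be checked against the actual headers first.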

  4. ProjecTILs murine reference atlas of tumor-infiltrating T cells, version 1

    • figshare.com
    application/gzip
    Updated Jun 29, 2023
    Cite
    Massimo Andreatta; Santiago Carmona (2023). ProjecTILs murine reference atlas of tumor-infiltrating T cells, version 1 [Dataset]. http://doi.org/10.6084/m9.figshare.12478571.v2
    Available download formats: application/gzip
    Dataset updated
    Jun 29, 2023
    Dataset provided by
    figshare
    Authors
    Massimo Andreatta; Santiago Carmona
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We have developed ProjecTILs, a computational approach to project new data sets into a reference map of T cells, enabling their direct comparison in a stable, annotated system of coordinates. Because new cells are embedded in the same space of the reference, ProjecTILs enables the classification of query cells into annotated, discrete states, but also over a continuous space of intermediate states. By comparing multiple samples over the same map, and across alternative embeddings, the method allows exploring the effect of cellular perturbations (e.g. as the result of therapy or genetic engineering) and identifying genetic programs significantly altered in the query compared to a control set or to the reference map. We illustrate the projection of several data sets from recent publications over two cross-study murine T cell reference atlases: the first describing tumor-infiltrating T lymphocytes (TILs), the second characterizing acute and chronic viral infection.

    To construct the reference TIL atlas, we obtained single-cell gene expression matrices from the following GEO entries: GSE124691, GSE116390, GSE121478, GSE86028; and entry E-MTAB-7919 from Array-Express. Data from GSE124691 contained samples from tumor and from tumor-draining lymph nodes, and were therefore treated as two separate datasets. For the TIL projection examples (OVA Tet+, miR-155 KO and Regnase-KO), we obtained the gene expression counts from entries GSE122713, GSE121478 and GSE137015, respectively.

    Prior to dataset integration, single-cell data from individual studies were filtered using TILPRED-1.0 (https://github.com/carmonalab/TILPRED), which removes cells not enriched in T cell markers (e.g. Cd2, Cd3d, Cd3e, Cd3g, Cd4, Cd8a, Cd8b1) and cells enriched in non T cell genes (e.g. Spi1, Fcer1g, Csf1r, Cd19). Dataset integration was performed using STACAS (https://github.com/carmonalab/STACAS), a batch-correction algorithm based on Seurat 3. For the TIL reference map, we specified 600 variable genes per dataset, excluding cell cycling genes, mitochondrial, ribosomal and non-coding genes, as well as genes expressed in less than 0.1% or more than 90% of the cells of a given dataset. For integration, a total of 800 variable genes were derived as the intersection of the 600 variable genes of individual datasets, prioritizing genes found in multiple datasets and, in case of ties, those derived from the largest datasets. We determined pairwise dataset anchors using STACAS with default parameters, and filtered anchors using an anchor score threshold of 0.8. Integration was performed using the IntegrateData function in Seurat 3, providing the anchor set determined by STACAS, and a custom integration tree to initiate alignment from the largest and most heterogeneous datasets.

    Next, we performed unsupervised clustering of the integrated cell embeddings using the Shared Nearest Neighbor (SNN) clustering method implemented in Seurat 3 with parameters {resolution=0.6, reduction="umap", k.param=20}. We then manually annotated individual clusters (merging clusters when necessary) based on several criteria: i) average expression of key marker genes in individual clusters; ii) gradients of gene expression over the UMAP representation of the reference map; iii) gene-set enrichment analysis to determine over- and under-expressed genes per cluster using MAST. In order to have access to predictive methods for UMAP, we recomputed PCA and UMAP embeddings independently of Seurat 3 using, respectively, the prcomp function from the base R package "stats" and the "umap" R package (https://github.com/tkonopka/umap).
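The expression-frequency criterion described above (dropping genes expressed in less than 0.1% or more than 90% of the cells of a dataset) can be illustrated with a short Python sketch. This is not the authors' code, which is based on Seurat and STACAS in R; the function names, the gene names, and the toy counts are all hypothetical.

```python
def expressed_fraction(counts_per_cell):
    """Fraction of cells in which a gene has a non-zero count."""
    return sum(1 for c in counts_per_cell if c > 0) / len(counts_per_cell)


def filter_genes(counts_by_gene, low=0.001, high=0.90):
    """Keep genes expressed in at least `low` and at most `high` of the cells."""
    return [
        gene
        for gene, counts in counts_by_gene.items()
        if low <= expressed_fraction(counts) <= high
    ]


# Toy dataset of 2000 cells; gene names and counts are made up.
n_cells = 2000
toy_counts = {
    "GeneA": [1] * n_cells,                   # expressed everywhere (> 90%)
    "GeneB": [1] + [0] * (n_cells - 1),       # expressed in 0.05% of cells
    "GeneC": [3] * 100 + [0] * (n_cells - 100),  # expressed in 5% of cells
}

kept = filter_genes(toy_counts)
print(kept)
```

Only the third toy gene falls inside the allowed range. In practice this filter would be one of several applied before selecting variable genes, alongside the exclusion of cell-cycle, mitochondrial, ribosomal and non-coding genes mentioned in the description.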

  5. Categorizing WDs with Gaia XP spectra and UMAP - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Nov 6, 2024
    Cite
    (2024). Categorizing WDs with Gaia XP spectra and UMAP - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/a450485e-90c8-5860-a8ba-48b5217c0ec9
    Dataset updated
    Nov 6, 2024
    Description

    White dwarfs (WDs) polluted by exoplanetary material provide the unprecedented opportunity to directly observe the interiors of exoplanets. However, spectroscopic surveys are often limited by brightness constraints, and WDs tend to be very faint, making detections of large populations of polluted WDs difficult. In this paper, we aim to increase considerably the number of WDs with multiple metals in their atmospheres. Using 96134 WDs with Gaia DR3 BP/RP (XP) spectra, we constructed a 2D map using an unsupervised machine-learning technique called Uniform Manifold Approximation and Projection (UMAP) to organize the WDs into identifiable spectral regions. The polluted WDs are among the distinct spectral groups identified in our map. We have shown that this selection method could potentially increase the number of known WDs with five or more metal species in their atmospheres by an order of magnitude. Such systems are essential for characterizing exoplanet diversity and geology.

  6. Dataset for Mistic: an open-source multiplexed image t-SNE viewer

    • zenodo.org
    • data.niaid.nih.gov
    tiff
    Updated Jul 17, 2024
    Cite
    Sandhya Prabhakaran; Chandler Gatenbee; Mark Robertson-Tessi; Jeffrey West; Amer A. Beg; Jhanelle Gray; Scott Antonia; Robert A. Gatenby; Alexander R.A. Anderson (2024). Dataset for Mistic: an open-source multiplexed image t-SNE viewer [Dataset]. http://doi.org/10.5281/zenodo.6131933
    Available download formats: tiff
    Dataset updated
    Jul 17, 2024
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Sandhya Prabhakaran; Chandler Gatenbee; Mark Robertson-Tessi; Jeffrey West; Amer A. Beg; Jhanelle Gray; Scott Antonia; Robert A. Gatenby; Alexander R.A. Anderson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset consists of 10 anonymized non-small cell lung cancer (NSCLC) fields of view (FoVs) for testing Mistic.

    Mistic

    Understanding the complex ecology of a tumor tissue and the spatio-temporal relationships between its cellular and microenvironment components is becoming a key component of translational research, especially in immuno-oncology. The generation and analysis of multiplexed images from patient samples is of paramount importance to facilitate this understanding. In this work, we present Mistic, an open-source multiplexed image t-SNE viewer that enables the simultaneous viewing of multiple 2D images rendered using multiple layout options to provide an overall visual preview of the entire dataset. In particular, the positions of the images can be taken from t-SNE or UMAP coordinates. This grouped view of all the images further aids an exploratory understanding of the specific expression pattern of a given biomarker or collection of biomarkers across all images, and helps to identify images expressing a particular phenotype or to select images for subsequent downstream analysis. Currently there is no freely available tool to generate such image t-SNEs.

  7. Data from: Data related to Panzer: A Machine Learning Based Approach to...

    • darus.uni-stuttgart.de
    Updated Nov 27, 2024
    + more versions
    Cite
    Tim Panzer (2024). Data related to Panzer: A Machine Learning Based Approach to Analyze Supersecondary Structures of Proteins [Dataset]. http://doi.org/10.18419/DARUS-4576
    Explore at:
    Croissant
    Dataset updated
    Nov 27, 2024
    Dataset provided by
    DaRUS
    Authors
    Tim Panzer
    License

    https://darus.uni-stuttgart.de/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.18419/DARUS-4576

    Time period covered
    Nov 1, 1976 - Feb 29, 2024
    Dataset funded by
    DFG
    Description

    This entry contains the data used for the bachelor thesis, which investigated how embeddings can be used to analyze supersecondary structures.

    Abstract of the thesis: This thesis analyzes the behavior of supersecondary structures in the context of embeddings. For this purpose, data from the Protein Topology Graph Library was provided with embeddings. This resulted in a structured graph database, which will be used for future work and analyses. In addition, different projections were made into two-dimensional space to analyze how the embeddings behave there.

    The Jupyter Notebook 1_data_retrival.ipynb contains the download process for the graph files from the Protein Topology Graph Library (https://ptgl.uni-frankfurt.de). The downloaded .gml files can also be found in graph_files.zip. They form graphs that represent the relationships of supersecondary structures in the proteins and are the data basis for further analyses.

    These graph files are then processed in the Jupyter Notebook 2_data_storage_and_embeddings.ipynb and entered into a graph database. The sequences of the supersecondary and secondary structures from the PTGL can be found in fastas.zip. The embeddings were calculated using the ESM model of the Facebook Research Group (huggingface.co/facebook/esm2_t12_35M_UR50D) and are stored in three .h5 files; they are subsequently added to the database. The whole process in this notebook serves to build up the database, which can then be searched using Cypher queries.

    In the Jupyter Notebook 3_data_science.ipynb, different visualizations and analyses are carried out with the help of UMAP.

    To install all dependencies, it is recommended to create a Conda environment and install all packages there. To use the project, PyEED should be installed using the snapshot of the original repository (source repository: https://github.com/PyEED/pyeed). The best way to install PyEED is to execute the pip install -e . command in the pyeed_BT folder. The dependencies can also be installed by using poetry and the .toml file. In addition, seaborn, h5py and umap-learn are required. These can be installed using the following commands:

    pip install h5py==3.12.1
    pip install seaborn==0.13.2 umap-learn==0.5.7

  8. Expression of 97 surface markers and RNA (transcriptome wide) in 13165 cells...

    • figshare.com
    application/gzip
    Updated Jun 2, 2023
    Cite
    Lars Velten; Sergio Triana; Simon Haas; Lea Jopp-Saile; Dominik Vonficht; Malte Paulsen (2023). Expression of 97 surface markers and RNA (transcriptome wide) in 13165 cells from a healthy young bone marrow donor [Dataset]. http://doi.org/10.6084/m9.figshare.13397987.v4
    Available download formats: application/gzip
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    Figshare: http://figshare.com/
    Authors
    Lars Velten; Sergio Triana; Simon Haas; Lea Jopp-Saile; Dominik Vonficht; Malte Paulsen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Seurat v3 object

    ASSAYS:
    • AB: Antibody expression data
    • RNA: mRNA expression data
    • BOTH: Concatenated mRNA and antibody expression matrices

    DIMENSIONALITY REDUCTION:
    • MOFA: Multi-OMICS factor analysis to integrate AB and RNA data. MOFA served as input for clustering and further dimensionality reduction.
    • MOFAUMAP: UMAP performed on MOFA dimensions. Display used in the manuscript.
    • MOFATSNE: UMAP performed on MOFA dimensions.
    • Projected: Data was projected on the reference dataset MOFAUMAP coordinates.

    METADATA:
    • ct: Projected cell type (cell type labels from the reference dataset are used). Idents(object) uses an unsupervised clustering performed on this dataset.

    For the reference dataset, see https://doi.org/10.6084/m9.figshare.13397651.v2

    Changelog
    • v3: Compared to the previous version of the file, projected UMAP coordinates and projected cell type labels were added. Also, neighborhood graphs and normalized data are now contained in the object.
    • v4: Objects were slimmed to correspond to the information described in our study. Data now only contains relevant dimensionality reductions and metadata columns; unused RNA and antibody targets were excluded from the objects.

  9. DCASE2021 UAD-S UMAP Data

    • explore.openaire.eu
    • data.niaid.nih.gov
    • +1 more
    Updated Jul 22, 2021
    Cite
    Andres Fernandez Rodriguez; Mark D. Plumbley (2021). DCASE2021 UAD-S UMAP Data [Dataset]. http://doi.org/10.5281/zenodo.5123024
    Dataset updated
    Jul 22, 2021
    Authors
    Andres Fernandez Rodriguez; Mark D. Plumbley
    Description

    Support data for our paper: USING UMAP TO INSPECT AUDIO DATA FOR UNSUPERVISED ANOMALY DETECTION UNDER DOMAIN-SHIFT CONDITIONS. ArXiv preprint can be found here. Code for the experiment software pipeline described in the paper can be found here.

    The pipeline requires and generates different forms of data. Here we provide the following:

    • AudioSet_wav_fragments.zip: This is a custom selection of 39437 wav files (32kHz, mono, 10 seconds) randomly extracted from AudioSet (originally released under CC-BY). In addition to this custom subset, the paper also uses the following ones, which can be downloaded at their respective websites:
        • DCASE2021 Task 2 Development Dataset
        • DCASE2021 Task 2 Additional Training Dataset
        • Fraunhofer's IDMT-ISA-ELECTRIC-ENGINE Dataset
    • dcase2021_uads_umaps.zip: To compute the UMAPs, first the log-STFT, log-mel and L3 representations must be extracted, and then the UMAPs must be computed. This can take a substantial amount of time and resources. For convenience, we provide here the 72 UMAPs discussed in the paper.
    • dcase2021_uads_umap_plots.zip: Also for convenience, we provide here the 198 high-resolution scatter plots rendered from the UMAPs.

    For a comprehensive visual inspection of the computed representations, it is sufficient to download the plots only. Users interested in exploring the plots interactively will need to download all the audio datasets and compute the log-STFT, log-mel and L3 representations as well as the UMAPs themselves (code provided in the GitHub repository). UMAPs for further representations can also be computed and plotted.

  10. Reddit topics dataset

    • kaggle.com
    Updated Oct 2, 2024
    Cite
    Stefano viel (2024). Reddit topics dataset [Dataset]. https://www.kaggle.com/stefano1283/reddit-topic-dataset/discussion
    Explore at:
    Croissant
    Dataset updated
    Oct 2, 2024
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Stefano viel
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Interactive plots: https://www.redditopics.xyz

    Reddit Topics Dataset Overview

    This dataset provides a structured analysis of Reddit posts, originally sourced from the Pushshift dataset. It contains 43 million posts, reduced from an initial 1.7 billion through a robust pipeline focused on efficient data handling. The goal of this dataset is to categorize unstructured text posts into meaningful topics, making it easier to analyze the massive volume of Reddit data.

    The analysis was performed using the BERTopic model, which combines natural language processing (NLP) techniques with advanced clustering methods to identify and label distinct topics. The process involved embedding posts with BERT, reducing dimensionality using UMAP, clustering with HDBSCAN, and representing topics using c-TF-IDF. Finally, ChatGPT was used to assign human-readable names to each topic.
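The last modelling step, representing topics with c-TF-IDF, can be sketched in plain Python. The block below assumes the commonly documented form of class-based TF-IDF, in which a term's frequency within a cluster is multiplied by log(1 + A / f), where A is the average number of words per cluster and f is the term's frequency across all clusters. It is an illustration only, not BERTopic's implementation, and the toy documents and cluster names are invented.

```python
import math
from collections import Counter


def c_tf_idf(docs_by_cluster):
    """Score each term per cluster: in-cluster frequency times log(1 + A / f)."""
    # Treat all documents of a cluster as one long document.
    counts = {
        cluster: Counter(" ".join(docs).lower().split())
        for cluster, docs in docs_by_cluster.items()
    }
    total = Counter()
    for counter in counts.values():
        total.update(counter)
    avg_words = sum(total.values()) / len(counts)  # A: mean words per cluster

    scores = {}
    for cluster, counter in counts.items():
        n_words = sum(counter.values())
        scores[cluster] = {
            term: (n / n_words) * math.log(1 + avg_words / total[term])
            for term, n in counter.items()
        }
    return scores


# Invented toy clusters standing in for the HDBSCAN output.
clusters = {
    "cluster_0": ["the cat sat", "the cat purred"],
    "cluster_1": ["the stock market fell", "the market rallied"],
}
scores = c_tf_idf(clusters)
top_terms = {c: max(s, key=s.get) for c, s in scores.items()}
print(top_terms)
```

Words shared by every cluster, such as "the" here, are down-weighted by the logarithmic factor, so the highest-scoring term of each cluster is the one that is frequent inside it and rare elsewhere.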

    This dataset is useful for anyone interested in social media analysis, natural language processing, or large-scale text analysis. It includes both the original posts and the topics assigned to each post.

  11. H{alpha}-excess S-PLUS Catalogs - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Mar 23, 2025
    Cite
    (2025). H{alpha}-excess S-PLUS Catalogs - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/3d6b2cc2-b49c-5acd-9cda-fc213d3e07cf
    Dataset updated
    Mar 23, 2025
    Description

    We use the Southern Photometric Local Universe Survey (S-PLUS) Fourth Data Release (DR4) to identify and classify H{alpha}-excess point source candidates in the Southern Sky. This approach combines photometric data from 12 S-PLUS filters with machine learning techniques to improve source classification and advance our understanding of H{alpha}-related phenomena.

    Our goal is to enhance the classification of H{alpha}-excess point sources by distinguishing between Galactic and extragalactic objects, particularly those with redshifted emission lines, and to identify sources where the H{alpha} excess is associated with variability phenomena, such as short-period RR Lyrae stars.

    We selected H{alpha}-excess candidates using the (r-J0660) versus (r-i) colour-colour diagram from the S-PLUS main survey (MS) and Galactic Disk Survey (GDS). Dimensionality reduction was achieved using UMAP, followed by HDBSCAN clustering. We refined this by incorporating infrared data, improving the separation of source types. A Random Forest model was then trained on the clustering results to identify key colour features for the classification of H{alpha}-excess sources. New, effective colour-colour diagrams were constructed by combining data from S-PLUS MS and infrared data. These diagrams, alongside tentative colour criteria, offer a preliminary classification of H{alpha}-excess sources without the need for complex algorithms.

    Combining multiwavelength photometric data with machine learning techniques significantly improved the classification of H{alpha}-excess sources. We identified 6956 sources with excess in the J0660 filter, and cross-matching with SIMBAD allowed us to explore the types of objects present in our catalog, including emission-line stars, young stellar objects, nebulae, stellar binaries, cataclysmic variables, variable stars, and extragalactic sources such as QSOs, AGNs, and galaxies. The cross-match also revealed X-ray sources, transients, and other peculiar objects. Using S-PLUS colours and machine learning, we successfully separated RR Lyrae stars from both other Galactic stars and extragalactic objects. Additionally, we achieved a clear separation between Galactic and extragalactic sources. However, distinguishing cataclysmic variables from QSOs at specific redshifts remained challenging. Incorporating infrared data refined the classification, enabling us to separate Galactic from extragalactic sources and to distinguish cataclysmic variables from QSOs.

    The Random Forest model, trained on HDBSCAN results, highlighted key colour features that distinguish the different classes of H{alpha}-excess sources, providing a robust framework for future studies such as follow-up spectroscopy.

    Cone search capability for table J/A+A/695/A104/hasms (Main survey H{alpha}-excess sources with UMAP/WISE)
    Cone search capability for table J/A+A/695/A104/hasgds (Galactic Disk Survey H{alpha}-excess sources)

  12. Descriptive Statistics and Town level Geospatial Distribution of...

    • data.niaid.nih.gov
    • zenodo.org
    Updated May 6, 2023
    + more versions
    Cite
    ERASLAN, Doğu Kaan (2023). Descriptive Statistics and Town level Geospatial Distribution of Archaeological Settlements of Turkey in Iron Age (1200 –330 BCE) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4904041
    Dataset updated
    May 6, 2023
    Dataset authored and provided by
    ERASLAN, Doğu Kaan
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Context

    This dataset is a byproduct of my PhD thesis. It combines the Archaeological Settlements of Turkey (TAY) Project data with geospatial data obtained from OpenStreetMap.

    Content

    For each archaeological settlement, the data contains:

    • active dates
    • geospatial data pointing to the town that contains the settlement
    • information on site type and its research status/methodology

    These are all contained in the file taydata.json.

    The notebook associated with this dataset shows how each file is produced.

    We give several important statistics with respect to regions and cities of Turkey for the Iron Age.

    To visualize the data on a map, you can use the 1200_330_bce_sites_of_turkey.umap file: download it and open it in uMap or Framacarte.

    Acknowledgements

    Without the immense effort of TAY Project and its researchers, this dataset would not be possible.

  13. Acoustic features as a tool to visualize and explore marine soundscapes:...

    • search.dataone.org
    • data.niaid.nih.gov
    • +1 more
    Updated Jul 27, 2025
    Cite
    Simone Cominelli; Nicolo' Bellin; Carissa D. Brown; Jack Lawson (2025). Acoustic features as a tool to visualize and explore marine soundscapes: Applications illustrated using marine mammal Passive Acoustic Monitoring datasets [Dataset]. http://doi.org/10.5061/dryad.3bk3j9kn8
    Explore at:
    Dataset updated
    Jul 27, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Simone Cominelli; Nicolo' Bellin; Carissa D. Brown; Jack Lawson
    Time period covered
    Jan 1, 2023
    Description

    Passive Acoustic Monitoring (PAM) is emerging as a solution for monitoring species and environmental change over large spatial and temporal scales. However, drawing rigorous conclusions based on acoustic recordings is challenging, as there is no consensus over which approaches and indices are best suited for characterizing marine and terrestrial acoustic environments. Here, we describe the application of multiple machine-learning techniques to the analysis of a large PAM dataset. We combine pre-trained acoustic classification models (VGGish, NOAA & Google Humpback Whale Detector), dimensionality reduction (UMAP), and balanced random forest algorithms to demonstrate how machine-learned acoustic features capture different aspects of the marine environment. The UMAP dimensions derived from VGGish acoustic features exhibited good performance in separating marine mammal vocalizations according to species and locations. RF models trained on the acoustic features performed well for labell...

    Data acquisition and preparation

    We collected all records available in the Watkins Marine Mammal Database website listed under the "all cuts" page. For each audio file in the WMD the associated metadata included a label for the sound sources present in the recording (biological, anthropogenic, and environmental), as well as information related to the location and date of recording. To minimize the presence of unwanted sounds in the samples, we only retained audio files with a single source listed in the metadata. We then labelled the selected audio clips according to taxonomic group (Odontocetae, Mysticetae), and species. We limited the analysis to 12 marine mammal species by discarding data when a species: had less than 60 s of audio available, had a vocal repertoire extending beyond the resolution of the acoustic classification model (VGGish), or was recorded in a single country.

    To determine if a species was suited for analysis using VGGish, we inspected the Mel-spectrograms of 3...

    # Data for: Acoustic features as a tool to visualize and explore marine soundscapes: applications illustrated using marine mammal Passive Acoustic Monitoring datasets.

    The data and scripts provided here allow the results presented in the following publication to be replicated: "Acoustic features as a tool to visualize and explore marine soundscapes: applications illustrated using marine mammal Passive Acoustic Monitoring datasets."

    List of tables:

    SM_1_WMD_Features_and_Labels.csv -> table containing VGGish features extracted from audio files downloaded from the Watkins Marine Mammals Sounds Database.

    Missing values in this dataset are marked as nan.

    Fields description:

    ID_row : progressive ID number for each row in the dataset

    0 - 127: labels for the 128 VGGish features.

    ID: reference to the Watkins Marine Mammal Sounds Database. Each ID corresponds to an audio file stored in the database.

    SPECIES: Species associated with each recording from the Watkins Marine Mam...
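    As a rough illustration of the field layout described above, the sketch below reads a table of this shape with Python's standard `csv` module. It assumes the feature columns are literally named `0` to `127` and that missing values appear as the string `nan`; the in-memory sample is synthetic (three feature columns instead of 128, made-up values) and only stands in for the real file.

```python
import csv
import io
import math

# Synthetic stand-in for SM_1_WMD_Features_and_Labels.csv: only 3 of the
# 128 feature columns are shown, and every value here is made up.
SAMPLE = """ID_row,0,1,2,ID,SPECIES
0,0.12,nan,0.40,rec_a,species_a
1,0.08,0.33,0.51,rec_b,species_b
"""

def load_features(handle, n_features):
    """Return (feature_vectors, species_labels) from a CSV file handle."""
    feature_cols = [str(i) for i in range(n_features)]
    vectors, labels = [], []
    for row in csv.DictReader(handle):
        # float() parses the literal "nan" used to mark missing values.
        vectors.append([float(row[c]) for c in feature_cols])
        labels.append(row["SPECIES"])
    return vectors, labels

def has_missing(vector):
    """True if any feature in the vector is NaN."""
    return any(math.isnan(v) for v in vector)

vectors, labels = load_features(io.StringIO(SAMPLE), n_features=3)
# Keep only rows whose feature vector is fully populated.
complete = [(v, s) for v, s in zip(vectors, labels) if not has_missing(v)]
```

    For the real table, one would presumably pass an open file handle and `n_features=128`; whether rows with missing values should be dropped or imputed is left open here.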

  14. h

    HumAIDSum100

    • huggingface.co
    Updated Aug 29, 2025
    + more versions
    Cite
    Blu (2025). HumAIDSum100 [Dataset]. https://huggingface.co/datasets/bluparsons/HumAIDSum100
    Explore at:
    Dataset updated
    Aug 29, 2025
    Authors
    Blu
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Dataset card for HumAIDSum1000

    This dataset contains tweet summaries generated using GPT-4. The tweets were obtained from the HumAID Twitter dataset created by Gliwa et al. (2019), which comprises several thousand tweets collected during 19 major natural disasters that occurred between 2016 and 2019. The tweets were selected using stratified sampling, which should increase the precision and representativeness of the sample. Strata used for each file:

    Clustered using UMAP and… See the full description on the dataset page: https://huggingface.co/datasets/bluparsons/HumAIDSum100.
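    The stratified sampling mentioned above can be sketched in a few lines of standard-library Python. This is a generic illustration, not the authors' procedure: the stratum key (a cluster id), the sampling fraction, and the toy tweets are all hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(items, stratum_of, fraction, seed=0):
    """Draw the same fraction from every stratum (at least one item each)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)
    sample = []
    for key in sorted(strata):
        members = strata[key]
        k = max(1, round(len(members) * fraction))
        # Sample without replacement inside each stratum.
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical tweets tagged with a made-up cluster id as the stratum.
tweets = [{"id": i, "cluster": i % 3} for i in range(30)]
picked = stratified_sample(tweets, lambda t: t["cluster"], fraction=0.2)
```

    Sampling the same fraction from each stratum keeps small clusters represented, which is the usual motivation for stratifying instead of sampling uniformly.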

  15. h

    s1K-1.1-850

    • huggingface.co
    Updated Mar 7, 2025
    Cite
    InfiX.ai (2025). s1K-1.1-850 [Dataset]. https://huggingface.co/datasets/InfiX-ai/s1K-1.1-850
    Explore at:
    Dataset updated
    Mar 7, 2025
    Dataset provided by
    InfiX.ai
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This data is derived from simplescaling/s1K-1.1. Compared with the original simplescaling/s1K-1.1 data, our filtered data uses less data and achieves better results.

    What we did

    Text Embedding Generation: We use all-MiniLM-L6-v2 (from the SentenceTransformers library) to generate "input" embeddings.

    Dimensionality reduction: We use the UMAP approach, which preserves both local and global data structure.

    n_components=2, n_neighbors=15, min_dist=0.1

    Data Sparsification (Dense Points… See the full description on the dataset page: https://huggingface.co/datasets/InfiX-ai/s1K-1.1-850.
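    The first two steps above might look roughly like the sketch below, which assumes the `sentence-transformers` and `umap-learn` packages. Only the model name and the three UMAP parameters come from the description; the function name and `random_state` are illustrative additions, and the truncated sparsification step is not reproduced.

```python
# Parameters as listed in the description above.
UMAP_PARAMS = {"n_components": 2, "n_neighbors": 15, "min_dist": 0.1}
EMBEDDING_MODEL = "all-MiniLM-L6-v2"

def embed_and_project(texts):
    """Embed texts with SentenceTransformers, then project to 2D with UMAP."""
    # Imported lazily so this module loads without the optional dependencies
    # (pip install sentence-transformers umap-learn).
    from sentence_transformers import SentenceTransformer
    import umap

    embeddings = SentenceTransformer(EMBEDDING_MODEL).encode(texts)
    # random_state is an addition for reproducibility, not from the source.
    reducer = umap.UMAP(random_state=42, **UMAP_PARAMS)
    return reducer.fit_transform(embeddings)
```

    Calling `embed_and_project(list_of_input_strings)` should return one 2D coordinate per input; with `n_neighbors=15`, the input needs to contain comfortably more than 15 texts for the neighbor graph to be meaningful.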

  16. f

    sc-SynO use case 1: cardiac glial cell identification in the validation...

    • fairdomhub.org
    pdf
    Updated Jan 20, 2021
    Cite
    Markus Wolfien; Saptarshi Bej (2021). sc-SynO use case 1: cardiac glial cell identification in the validation datasets [Dataset]. https://fairdomhub.org/data_files/3894
    Explore at:
    Available download formats: pdf (180 KB)
    Dataset updated
    Jan 20, 2021
    Authors
    Markus Wolfien; Saptarshi Bej
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Validation of the sc-SynO model for the first use case of cardiac glial cell annotation. UMAP representation of the manually clustered Bl6 dataset of Wolfien et al. (2020). Predicted cells of sc-SynO are highlighted in blue; cells not chosen are grey. UMAP representation of the manually clustered dataset of Vidal (2019). Predicted cells of sc-SynO are highlighted in blue; cells not chosen are grey. Average expression of the respective top five cardiac glial cell marker genes for both validation sets, including the predicted clusters and those in close proximity.

  17. GAMMA: Galactic Attributes of Mass, Metallicity, and Age Dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Nov 3, 2023
    Cite
    Ufuk Çakır (2023). GAMMA: Galactic Attributes of Mass, Metallicity, and Age Dataset [Dataset]. http://doi.org/10.5281/zenodo.8375344
    Explore at:
    Available download formats: bin
    Dataset updated
    Nov 3, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ufuk Çakır
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We introduce the GAMMA (Galactic Attributes of Mass, Metallicity, and Age) dataset, a comprehensive collection of galaxy data tailored for Machine Learning applications. This dataset offers detailed 2D maps and 3D cubes of 11 727 galaxies, capturing essential attributes: stellar age, metallicity, and mass.

    Together with the dataset we publish our code to extract any other stellar or gaseous property from the raw simulation suite to extend the dataset beyond these initial properties, ensuring versatility for various computational tasks. Ideal for feature extraction, clustering, and regression tasks, GAMMA offers a unique lens for exploring galactic structures through computational methods and is a bridge between astrophysical simulations and the field of scientific machine learning (ML).

    As a first benchmark, we apply Principal Component Analysis (PCA) to this dataset. We find that PCA effectively captures the key morphological features of galaxies with a small number of components. We achieve a dimensionality reduction by a factor of ∼200 (∼3650) for 2D images (3D cubes) with a reconstruction error below 5%.

    We calculate UMAP (Uniform Manifold Approximation and Projection for Dimension Reduction) on the lower dimensional PCA scores of the 2D images to visualize the image space. An interactive version of this plot can be accessed using an online Dashboard (hover over a point to see the galaxy image and the IllustrisTNG Subhalo ID).

    All the code to generate this dataset and load the data structure is publicly available on GitHub, with an additional documentation page hosted on ReadTheDocs.

  18. f

    Library 2 LY6A UMAP cluster sequences.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Jul 19, 2023
    + more versions
    Cite
    Zhao, Binhui; Walkey, Christopher J.; Chan, Yujia A.; Lagor, William R.; Azari, Bahar; Ljungberg, M. Cecilia; Sorensen, Hikari; Heaney, Jason D.; Beddow, Thomas; Eid, Fatma-Elzahraa; Huang, Qin; Tobey, Isabelle G.; Barry, Andrew J.; Deverman, Benjamin E.; Chen, Albert T.; Chan, Ken Y.; Moncada-Reid, Cynthia; Zheng, Qingxia (2023). Library 2 LY6A UMAP cluster sequences. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001038046
    Explore at:
    Dataset updated
    Jul 19, 2023
    Authors
    Zhao, Binhui; Walkey, Christopher J.; Chan, Yujia A.; Lagor, William R.; Azari, Bahar; Ljungberg, M. Cecilia; Sorensen, Hikari; Heaney, Jason D.; Beddow, Thomas; Eid, Fatma-Elzahraa; Huang, Qin; Tobey, Isabelle G.; Barry, Andrew J.; Deverman, Benjamin E.; Chen, Albert T.; Chan, Ken Y.; Moncada-Reid, Cynthia; Zheng, Qingxia
    Description

    Viruses have evolved the ability to bind and enter cells through interactions with a wide variety of cell macromolecules. We engineered peptide-modified adeno-associated virus (AAV) capsids that transduce the brain through the introduction of de novo interactions with 2 proteins expressed on the mouse blood–brain barrier (BBB), LY6A or LY6C1. The in vivo tropisms of these capsids are predictable as they are dependent on the cell- and strain-specific expression of their target protein. This approach generated hundreds of capsids with dramatically enhanced central nervous system (CNS) tropisms within a single round of screening in vitro and secondary validation in vivo thereby reducing the use of animals in comparison to conventional multi-round in vivo selections. The reproducible and quantitative data derived via this method enabled both saturation mutagenesis and machine learning (ML)-guided exploration of the capsid sequence space. Notably, during our validation process, we determined that nearly all published AAV capsids that were selected for their ability to cross the BBB in mice leverage either the LY6A or LY6C1 protein, which are not present in primates. This work demonstrates that AAV capsids can be directly targeted to specific proteins to generate potent gene delivery vectors with known mechanisms of action and predictable tropisms.

  19. CellTracksColab - T cell dataset (full)

    • zenodo.org
    • explore.openaire.eu
    zip
    Updated May 24, 2024
    Cite
    Guillaume Jacquemet; Estibaliz Gómez-de-Mariscal; Hanna Grobe; Joanna Pylvänäinen; Laura Xénard; Ricardo Henriques; Jean-Yves Tinevez (2024). CellTracksColab - T cell dataset (full) [Dataset]. http://doi.org/10.5281/zenodo.11286110
    Explore at:
    Available download formats: zip
    Dataset updated
    May 24, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Guillaume Jacquemet; Estibaliz Gómez-de-Mariscal; Hanna Grobe; Joanna Pylvänäinen; Laura Xénard; Ricardo Henriques; Jean-Yves Tinevez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset used in the manuscript "CellTracksColab—A platform for compiling, analyzing, and exploring tracking data"

    This Zenodo archive contains:

    • The raw video (Tracks.zip)
    • The tracking files as XML and CSV files (Tracks.zip)
    • The CellTracksColab dataframes storing the dataset (CellTracksColab_results.zip)
    • The CellTracksColab outputs used to make the figures in the paper (CellTracksColab_results.zip)

    In brief:

    Lab-Tek 8 chamber slides (ThermoFisher) were prepared by overnight coating with either 2 μg/mL ICAM-1 or VCAM-1 at 4°C. Subsequently, activated primary mouse CD4+ T cells were washed and suspended in L-15 media enriched with 2 mg/mL D-glucose. These T cells were then placed into the chamber slides and incubated for 20 minutes. After incubation, a gentle wash was performed to remove all unattached cells. Imaging was conducted using a 10x phase contrast objective at 37°C on a Zeiss Axiovert 200M microscope equipped with an automated X-Y stage and a Roper EMCCD camera. Time-lapse images were acquired at 1-minute intervals over 10 minutes using SlideBook 6 software from Intelligent Imaging Innovations.

    Cells were automatically tracked using StarDist, directly implemented within TrackMate. The StarDist model was trained using ZeroCostDL4Mic and is publicly available on Zenodo. This model generated excellent segmentation results on our test dataset (F1 score > 0.99). In TrackMate, the StarDist detector custom model (score threshold = 0.41 and overlap threshold = 0.5) and the Simple LAP tracker (linking max distance = 30 µm; gap closing max distance = 15 µm, gap closing max frame gap = 2 frames) were used.

    In CellTracksColab, we conducted a dimensionality reduction analysis employing Uniform Manifold Approximation and Projection (UMAP). The UMAP settings were as follows: number of neighbors (n_neighbors) set to 20, minimum distance (min_dist) to 0, and number of dimensions (n_dimension) to 2. This analysis utilized an array of track metrics, including:

    NUMBER_SPOTS, NUMBER_GAPS, NUMBER_SPLITS, NUMBER_MERGES, NUMBER_COMPLEX, LONGEST_GAP, TRACK_DISPLACEMENT, TRACK_MEAN_QUALITY, MAX_DISTANCE_TRAVELED, CONFINEMENT_RATIO, MEAN_STRAIGHT_LINE_SPEED, LINEARITY_OF_FORWARD_PROGRESSION, MEAN_DIRECTIONAL_CHANGE_RATE, Track Duration, Mean Speed, Median Speed, Max Speed, Min Speed, Speed Standard Deviation, Total Distance Traveled, Directionality, Tortuosity, MEAN_CIRCULARITY, MEAN_SOLIDITY, MEAN_SHAPE_INDEX, MEDIAN_CIRCULARITY, MEDIAN_SOLIDITY, MEDIAN_SHAPE_INDEX, STD_CIRCULARITY, STD_SOLIDITY, STD_SHAPE_INDEX, MIN_CIRCULARITY, MIN_SOLIDITY, MIN_SHAPE_INDEX, MAX_CIRCULARITY, MAX_SOLIDITY, MAX_SHAPE_INDEX
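    To make a few of the metrics above concrete, the sketch below computes them for a single track in plain Python. It assumes the commonly used definitions (confinement ratio as net displacement over total distance traveled, mean straight-line speed as net displacement over track duration); the exact formulas used by the tracking software may differ, and the example track is made up.

```python
import math

def track_metrics(points, frame_interval):
    """points: ordered (x, y) positions, one per frame; returns a dict."""
    # Length of each step between consecutive positions.
    steps = [math.dist(a, b) for a, b in zip(points, points[1:])]
    total = sum(steps)
    # Straight-line distance between the first and last position.
    net = math.dist(points[0], points[-1])
    duration = frame_interval * (len(points) - 1)
    return {
        "TRACK_DISPLACEMENT": net,
        "Total Distance Traveled": total,
        "CONFINEMENT_RATIO": net / total if total else 0.0,
        "MEAN_STRAIGHT_LINE_SPEED": net / duration if duration else 0.0,
    }

# Made-up track: right 3 µm, then up 4 µm, imaged once per minute.
m = track_metrics([(0.0, 0.0), (3.0, 0.0), (3.0, 4.0)], frame_interval=1.0)
```

    A confinement ratio near 1 indicates an almost straight path, while values near 0 indicate a track that wanders but ends close to where it started.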

    Subsequently, clustering analysis was performed using Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN). The parameters included clustering_data_source set to UMAP, min_samples at 20, min_cluster_size at 200, and the metric employed was Euclidean.
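    Outside CellTracksColab, a comparable UMAP and HDBSCAN pipeline could be sketched as follows, assuming the `umap-learn`, `hdbscan`, and `scikit-learn` packages. The parameter values are those quoted above, with `n_dimension` mapped to umap-learn's `n_components`; the standardization step and the function name are assumptions, not something the description confirms.

```python
# Settings quoted in the description; n_dimension maps to n_components.
UMAP_SETTINGS = {"n_neighbors": 20, "min_dist": 0.0, "n_components": 2}
HDBSCAN_SETTINGS = {
    "min_samples": 20,
    "min_cluster_size": 200,
    "metric": "euclidean",
}

def embed_and_cluster(metric_table):
    """metric_table: 2D array, one row per track, one column per metric."""
    # Lazy imports: pip install umap-learn hdbscan scikit-learn
    import umap
    import hdbscan
    from sklearn.preprocessing import StandardScaler

    # Assumption: metrics are standardized first so that no single metric
    # dominates the distance computation.
    scaled = StandardScaler().fit_transform(metric_table)
    embedding = umap.UMAP(**UMAP_SETTINGS).fit_transform(scaled)
    # Clustering runs on the UMAP embedding; label -1 marks noise points.
    labels = hdbscan.HDBSCAN(**HDBSCAN_SETTINGS).fit_predict(embedding)
    return embedding, labels
```

    With `min_cluster_size` at 200, the input needs at least several hundred tracks before any cluster can form; smaller inputs would be labeled entirely as noise.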

  20. H

    Data for "Surfacing Semantic Orthogonality Across Model Safety Benchmarks: A...

    • dataverse.harvard.edu
    Updated May 23, 2025
    Cite
    Jonathan Bennion (2025). Data for "Surfacing Semantic Orthogonality Across Model Safety Benchmarks: A Multi-Dimensional Analysis" [Dataset]. http://doi.org/10.7910/DVN/JXU6DC
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 23, 2025
    Dataset provided by
    Harvard Dataverse
    Authors
    Jonathan Bennion
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Various AI safety datasets have been developed to measure LLMs against evolving interpretations of harm. Our evaluation of five recently published open-source safety benchmarks reveals distinct semantic clusters using UMAP dimensionality reduction and k-means clustering (silhouette score: 0.470). We identify six primary harm categories with varying benchmark representation. GretelAI, for example, focuses heavily on privacy concerns, while WildGuardMix emphasizes self-harm scenarios. Significant differences in prompt length distribution suggest confounds in data collection and interpretations of harm, and offer possible context. Our analysis quantifies benchmark orthogonality among AI benchmarks, allowing for transparency in coverage gaps despite topical similarities. Our quantitative framework for analyzing semantic orthogonality across safety benchmarks enables more targeted development of datasets that comprehensively address the evolving landscape of harms in AI use, however that is defined in the future.
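    For readers unfamiliar with the silhouette score quoted above, the sketch below computes it from scratch for a small labeled point set. It is a generic illustration with made-up 2D points standing in for a UMAP embedding, not the authors' code; it assumes at least two clusters and no cluster consisting solely of coincident points.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point itself contributes a distance of zero to the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Toy 2D points standing in for a UMAP embedding; values are made up.
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
score = silhouette_score(pts, [0, 0, 1, 1])
```

    Scores range from -1 to 1: values near 1 mean tight, well-separated clusters, while a value such as 0.470 indicates clusters that are distinguishable but partly overlapping.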
