39 datasets found
  1. BRAQUE: Bayesian Reduction for Amplified Quantization in UMAP Embedding....

    • board.unimib.it
    Updated Jan 23, 2023
    Cite
    Lorenzo Dall'Olio (2023). BRAQUE: Bayesian Reduction for Amplified Quantization in UMAP Embedding. Supplementary data. [Dataset]. http://doi.org/10.17632/j8xbwb93x9.1
    Dataset updated
    Jan 23, 2023
    Authors
    Lorenzo Dall'Olio
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We propose Bayesian Reduction for Amplified Quantization in UMAP Embedding (BRAQUE), a novel integrative approach that spans data preprocessing through phenotype classification. BRAQUE starts with an innovative preprocessing step, named Lognormal Shrinkage, which enhances input fragmentation by fitting a lognormal mixture model and shrinking each component towards its median, helping the subsequent clustering step find clearer, better-separated clusters. The BRAQUE pipeline then consists of a dimensionality reduction step performed using UMAP and a clustering step performed using HDBSCAN on the UMAP embedding. These supplemental data contain the CSV image data files for seven lymphoid tissues, the antibody list, and an MTA agreement letter for the CyBorgh software.
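
The shrinking idea in the description above can be illustrated with a minimal sketch. This is not BRAQUE's actual implementation: it assumes the component labels have already been obtained from a fitted lognormal mixture (the fitting step is not shown), and the function name `shrink_towards_median` and the `factor` parameter are hypothetical.

```python
import math
from statistics import median


def shrink_towards_median(values, labels, factor=0.5):
    """Shrink each value towards the median of its component, in log space.

    values: positive measurements (e.g. marker intensities)
    labels: component index per value (assumed to come from a fitted
            lognormal mixture; fitting is outside this sketch)
    factor: hypothetical shrink strength in [0, 1]; 0 leaves the data
            unchanged, 1 collapses each component onto its median
    """
    logs = [math.log(v) for v in values]

    # Collect the log-values belonging to each component.
    by_component = {}
    for log_value, label in zip(logs, labels):
        by_component.setdefault(label, []).append(log_value)
    medians = {label: median(group) for label, group in by_component.items()}

    # Move each log-value part of the way towards its component median.
    shrunk = [lv + factor * (medians[lab] - lv) for lv, lab in zip(logs, labels)]
    return [math.exp(s) for s in shrunk]
```

With `factor=0.0` the input is returned essentially unchanged, while larger values pull each component's points closer together, which is the effect the description credits with producing more separated clusters.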

  2. glove.6B.50d.umap.2d

    • huggingface.co
    Updated Jan 31, 2024
    Cite
    Mario Tormo Romero (2024). glove.6B.50d.umap.2d [Dataset]. https://huggingface.co/datasets/mt0rm0/glove.6B.50d.umap.2d
    Dataset updated
    Jan 31, 2024
    Authors
    Mario Tormo Romero
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    Dataset Card

    This dataset is a UMAP 2D-projection of the glove.6B.50d embeddings from Stanford. It is intended as a fast reference for visualizing embeddings in a workshop from the AI Service Center Berlin-Brandenburg at the Hasso Plattner Institute.

      Dataset Details

      Dataset Description
    

    The embeddings cover a vocabulary of 400k tokens, with 2 dimensions per token. Curated by: Mario Tormo Romero. License: cc0-1.0

      Dataset Sources
    

    This Dataset has been… See the full description on the dataset page: https://huggingface.co/datasets/mt0rm0/glove.6B.50d.umap.2d.

  3. ProjecTILs murine reference atlas of tumor-infiltrating T cells, version 1

    • figshare.com
    application/gzip
    Updated Jun 29, 2023
    Cite
    Massimo Andreatta; Santiago Carmona (2023). ProjecTILs murine reference atlas of tumor-infiltrating T cells, version 1 [Dataset]. http://doi.org/10.6084/m9.figshare.12478571.v2
    Available download formats: application/gzip
    Dataset updated
    Jun 29, 2023
    Dataset provided by
    figshare
    Authors
    Massimo Andreatta; Santiago Carmona
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We have developed ProjecTILs, a computational approach to project new data sets into a reference map of T cells, enabling their direct comparison in a stable, annotated system of coordinates. Because new cells are embedded in the same space as the reference, ProjecTILs enables the classification of query cells into annotated, discrete states, but also over a continuous space of intermediate states. By comparing multiple samples over the same map, and across alternative embeddings, the method allows exploring the effect of cellular perturbations (e.g. as the result of therapy or genetic engineering) and identifying genetic programs significantly altered in the query compared to a control set or to the reference map. We illustrate the projection of several data sets from recent publications over two cross-study murine T cell reference atlases: the first describing tumor-infiltrating T lymphocytes (TILs), the second characterizing acute and chronic viral infection.

    To construct the reference TIL atlas, we obtained single-cell gene expression matrices from the following GEO entries: GSE124691, GSE116390, GSE121478, GSE86028; and entry E-MTAB-7919 from Array-Express. Data from GSE124691 contained samples from tumor and from tumor-draining lymph nodes, and were therefore treated as two separate datasets. For the TIL projection examples (OVA Tet+, miR-155 KO and Regnase-KO), we obtained the gene expression counts from entries GSE122713, GSE121478 and GSE137015, respectively.

    Prior to dataset integration, single-cell data from individual studies were filtered using TILPRED-1.0 (https://github.com/carmonalab/TILPRED), which removes cells not enriched in T cell markers (e.g. Cd2, Cd3d, Cd3e, Cd3g, Cd4, Cd8a, Cd8b1) and cells enriched in non-T cell genes (e.g. Spi1, Fcer1g, Csf1r, Cd19). Dataset integration was performed using STACAS (https://github.com/carmonalab/STACAS), a batch-correction algorithm based on Seurat 3.

    For the TIL reference map, we specified 600 variable genes per dataset, excluding cell cycling genes, mitochondrial, ribosomal and non-coding genes, as well as genes expressed in less than 0.1% or more than 90% of the cells of a given dataset. For integration, a total of 800 variable genes were derived as the intersection of the 600 variable genes of individual datasets, prioritizing genes found in multiple datasets and, in case of ties, those derived from the largest datasets. We determined pairwise dataset anchors using STACAS with default parameters, and filtered anchors using an anchor score threshold of 0.8. Integration was performed using the IntegrateData function in Seurat 3, providing the anchor set determined by STACAS, and a custom integration tree to initiate alignment from the largest and most heterogeneous datasets.

    Next, we performed unsupervised clustering of the integrated cell embeddings using the Shared Nearest Neighbor (SNN) clustering method implemented in Seurat 3 with parameters {resolution=0.6, reduction="umap", k.param=20}. We then manually annotated individual clusters (merging clusters when necessary) based on several criteria: i) average expression of key marker genes in individual clusters; ii) gradients of gene expression over the UMAP representation of the reference map; iii) gene-set enrichment analysis to determine over- and under-expressed genes per cluster using MAST. In order to have access to predictive methods for UMAP, we recomputed PCA and UMAP embeddings independently of Seurat 3 using respectively the prcomp function from basic R package “stats”, and the “umap” R package (https://github.com/tkonopka/umap).
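
The expression-frequency filter mentioned in the description (genes expressed in less than 0.1% or more than 90% of the cells of a dataset are excluded) can be sketched as follows. This is only an illustration in plain Python, not the authors' R/Seurat code; the function name, the dict-of-lists input layout, and the rule that a count above zero means "expressed" are assumptions.

```python
def filter_genes_by_frequency(counts, min_frac=0.001, max_frac=0.90):
    """Return genes expressed in a fraction of cells within [min_frac, max_frac].

    counts: dict mapping gene name -> list of per-cell counts
            (hypothetical layout; every list has one entry per cell)
    A gene counts as "expressed" in a cell when its count is above zero,
    which is an assumption of this sketch.
    """
    kept = []
    for gene, per_cell in counts.items():
        n_cells = len(per_cell)
        if n_cells == 0:
            continue  # no cells, nothing to evaluate
        frac = sum(1 for c in per_cell if c > 0) / n_cells
        if min_frac <= frac <= max_frac:
            kept.append(gene)
    return kept
```

Genes that are almost never detected carry little signal, and genes detected in nearly every cell do not help distinguish cell states, which is the usual motivation for a two-sided filter of this kind.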

  4. laion-aesthetics-12m-umap

    • huggingface.co
    Updated Apr 7, 2023
    Cite
    David McClure (2023). laion-aesthetics-12m-umap [Dataset]. https://huggingface.co/datasets/dclure/laion-aesthetics-12m-umap
    Dataset updated
    Apr 7, 2023
    Authors
    David McClure
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    LAION-Aesthetics :: CLIP → UMAP

    This dataset is a CLIP (text) → UMAP embedding of the LAION-Aesthetics dataset - specifically the improved_aesthetics_6plus version, which filters the full dataset to images with scores of > 6 under the "aesthetic" filtering model. Thanks LAION for this amazing corpus!

    The dataset here includes coordinates for 3x separate UMAP fits using different values for the n_neighbors parameter - 10, 30, and 60 - which are broken out as separate columns with… See the full description on the dataset page: https://huggingface.co/datasets/dclure/laion-aesthetics-12m-umap.

  5. Dataset and trained models belonging to the article 'Distant reading...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 28, 2021
    Cite
    Ros, Ruben (2021). Dataset and trained models belonging to the article 'Distant reading patterns of iconicity in 940.000 online circulations of 26 iconic photographs' [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4244000
    Dataset updated
    Sep 28, 2021
    Dataset provided by
    Smits, Thomas
    Ros, Ruben
    Description

    Quantifying Iconicity - Zenodo

    The Dataset

    This dataset contains the material collected for the article "Distant reading 940,000 online circulations of 26 iconic photographs" (to be) published in New Media & Society (DOI: 10.1177/14614448211049459). We identified 26 iconic photographs based on earlier work (Van der Hoeven, 2019). The Google Cloud Vision (GCV) API was subsequently used to identify webpages that host a reproduction of the iconic image. The GCV API uses computer vision methods and the Google index to retrieve these reproductions. The code for calling the API and parsing the data can be found on GitHub: https://github.com/rubenros1795/ReACT_GCV.

    The core dataset consists of .tsv files with the URLs that refer to the webpages. The files also contain other metadata provided by the GCV API, as well as manually generated metadata. This includes:
    - the URL that refers specifically to the image (this can be a URL that refers to a full match or a partial match)
    - the title of the page
    - the iteration number. Because the GCV API puts a limit on its output, we had to reupload the identified images to the API to extend our search. We continued these iterations until no more new unique URLs were found
    - the language found by the langid Python module, along with the normalized score
    - the labels associated with the image by Google
    - the scrape date

    Alongside the .tsv-files, there are several other elements in the following folder structure:

    ├── data
    │  ├── embeddings
    │        └── doc2vec
    │        └── input-text
    │        └── metadata
    │        └── umap
    │  └── evaluation
    │  └── results
    │        └── diachronic-plots
    │        └── top-words
    │  └── tsv
    
    1. The /embeddings folder contains the doc2vec models, the training input for the models, the metadata (id, URL, date) and the UMAP embeddings used in the GMM clustering. Please note that the date parser was not able to find dates for all webpages and for this reason not all training texts have associated metadata.
    2. The /evaluation folder contains the AIC and BIC scores for GMM clustering with different numbers of clusters.
    3. The /results folder contains the top words associated with the clusters and the diachronic cluster prominence plots.
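
As a rough illustration of how AIC and BIC scores such as those in the /evaluation folder are typically used, the sketch below picks the number of clusters with the lowest score. The selection rule actually applied by the authors is not stated here, and the function name, the dict layout, and the example numbers are hypothetical.

```python
def best_n_clusters(scores, criterion="bic"):
    """Pick the cluster count with the lowest AIC or BIC (lower is better).

    scores: dict mapping number of clusters -> (aic, bic)
            (hypothetical layout for this sketch)
    """
    index = {"aic": 0, "bic": 1}[criterion]
    return min(scores, key=lambda n: scores[n][index])


# Made-up scores for illustration only; not taken from the dataset.
example_scores = {
    2: (1200.0, 1250.0),
    3: (1100.0, 1170.0),
    4: (1090.0, 1185.0),
}
```

BIC penalizes additional parameters more strongly than AIC, so the two criteria can disagree, as they do for the made-up numbers above.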

    Data Cleaning and Curation

    Our pipeline contained several interventions to prevent noise in the data. First, in between the iterations we manually checked the scraped photos for relevance. We did so because reuploading an iconic image that is paired with another, irrelevant, one results in reproductions of the irrelevant one in the next iteration. Because we did not catch all noise, we used Scale Invariant Feature Transform (SIFT), a basic computer vision algorithm, to remove images that did not meet a threshold of ten keypoints. By doing so we removed completely unrelated photographs, but left room for variations of the original (such as painted versions of Che Guevara, or cropped versions of the Napalm Girl image). Another issue was the parsing of webpage texts. After experimenting with different webpage parsers that aim to extract 'relevant' text it proved too difficult to use one solution for all our webpages. Therefore we simply parsed all the text contained in commonly used html-tags, such as , etc.

  6. ProjecTILs murine reference atlas of virus-specific CD8 T cells, version 2

    • figshare.com
    application/gzip
    Updated Jul 26, 2023
    + more versions
    Cite
    Massimo Andreatta; Santiago Carmona (2023). ProjecTILs murine reference atlas of virus-specific CD8 T cells, version 2 [Dataset]. http://doi.org/10.6084/m9.figshare.23764572.v1
    Available download formats: application/gzip
    Dataset updated
    Jul 26, 2023
    Dataset provided by
    figshare
    Authors
    Massimo Andreatta; Santiago Carmona
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We have developed ProjecTILs, a computational approach to project new data sets into a reference map of T cells, enabling their direct comparison in a stable, annotated system of coordinates. Because new cells are embedded in the same space as the reference, ProjecTILs enables the classification of query cells into annotated, discrete states, but also over a continuous space of intermediate states. By comparing multiple samples over the same map, and across alternative embeddings, the method allows exploring the effect of cellular perturbations (e.g. as the result of therapy or genetic engineering) and identifying genetic programs significantly altered in the query compared to a control set or to the reference map. We illustrate the projection of several data sets from recent publications over two cross-study murine T cell reference atlases: the first describing tumor-infiltrating T lymphocytes (TILs), the second characterizing acute and chronic viral infection.

    Single-cell data to build the virus-specific CD8 T cell reference map were downloaded from GEO under the following entries: GSE131535, GSE134139 and GSE119943, selecting only samples in wild type conditions. Data for the Ptpn2-KO, Tox-KO and CD4-depletion projections were obtained from entries GSE134139, GSE119943, and GSE137007 and were not included in the construction of the reference map.

    To construct the LCMV reference map, we split the dataset into five batches that displayed strong batch effects, and applied STACAS (https://github.com/carmonalab/STACAS) to mitigate their confounding effects. We computed 800 variable genes per batch, excluding cell cycling genes, ribosomal and mitochondrial genes, and computed pairwise anchors using 200 integration genes, and otherwise default STACAS parameters. Anchors were filtered at the default threshold 0.8 percentile, and integration was performed with the IntegrateData Seurat 3 function with the guide tree suggested by STACAS.

    Next, we performed unsupervised clustering of the integrated cell embeddings using the Shared Nearest Neighbor (SNN) clustering method implemented in Seurat 3 with parameters {resolution=0.4, reduction="pca", k.param=20}. We then manually annotated individual clusters (merging clusters when necessary) based on several criteria: i) average expression of key marker genes in individual clusters; ii) gradients of gene expression over the UMAP representation of the reference map; iii) gene-set enrichment analysis to determine over- and under-expressed genes per cluster using MAST. In order to have access to predictive methods for UMAP, we recomputed PCA and UMAP embeddings independently of Seurat 3 using respectively the prcomp function from basic R package “stats”, and the “umap” R package (https://github.com/tkonopka/umap).

  7. Data_Sheet_1_Manifold learning for fMRI time-varying functional...

    • frontiersin.figshare.com
    docx
    Updated Jul 11, 2023
    Cite
    Javier Gonzalez-Castillo; Isabel S. Fernandez; Ka Chun Lam; Daniel A. Handwerker; Francisco Pereira; Peter A. Bandettini (2023). Data_Sheet_1_Manifold learning for fMRI time-varying functional connectivity.docx [Dataset]. http://doi.org/10.3389/fnhum.2023.1134012.s001
    Available download formats: docx
    Dataset updated
    Jul 11, 2023
    Dataset provided by
    Frontiers
    Authors
    Javier Gonzalez-Castillo; Isabel S. Fernandez; Ka Chun Lam; Daniel A. Handwerker; Francisco Pereira; Peter A. Bandettini
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Whole-brain functional connectivity (FC) measured with functional MRI (fMRI) evolves over time in meaningful ways at temporal scales going from years (e.g., development) to seconds [e.g., within-scan time-varying FC (tvFC)]. Yet, our ability to explore tvFC is severely constrained by its large dimensionality (several thousands). To overcome this difficulty, researchers often seek to generate low dimensional representations (e.g., 2D and 3D scatter plots) hoping those will retain important aspects of the data (e.g., relationships to behavior and disease progression). Limited prior empirical work suggests that manifold learning techniques (MLTs)—namely those seeking to infer a low dimensional non-linear surface (i.e., the manifold) where most of the data lies—are good candidates for accomplishing this task. Here we explore this possibility in detail. First, we discuss why one should expect tvFC data to lie on a low dimensional manifold. Second, we estimate what is the intrinsic dimension (ID; i.e., minimum number of latent dimensions) of tvFC data manifolds. Third, we describe the inner workings of three state-of-the-art MLTs: Laplacian Eigenmaps (LEs), T-distributed Stochastic Neighbor Embedding (T-SNE), and Uniform Manifold Approximation and Projection (UMAP). For each method, we empirically evaluate its ability to generate neuro-biologically meaningful representations of tvFC data, as well as their robustness against hyper-parameter selection. Our results show that tvFC data has an ID that ranges between 4 and 26, and that ID varies significantly between rest and task states. We also show how all three methods can effectively capture subject identity and task being performed: UMAP and T-SNE can capture these two levels of detail concurrently, but LE could only capture one at a time. We observed substantial variability in embedding quality across MLTs, and within-MLT as a function of hyper-parameter selection. 
To help alleviate this issue, we provide heuristics that can inform future studies. Finally, we also demonstrate the importance of feature normalization when combining data across subjects and the role that temporal autocorrelation plays in the application of MLTs to tvFC data. Overall, we conclude that while MLTs can be useful to generate summary views of labeled tvFC data, their application to unlabeled data such as resting-state remains challenging.

  8. Joint embedding of vertebrate brain single-cell RNA-Seq using sequence or...

    • data.niaid.nih.gov
    Updated Aug 18, 2023
    Cite
    Sun, Dennis (2023). Joint embedding of vertebrate brain single-cell RNA-Seq using sequence or structure [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7838975
    Dataset updated
    Aug 18, 2023
    Dataset authored and provided by
    Sun, Dennis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Embeddings of single-cell RNA-Seq data from three adult vertebrate brain datasets into Orthogroup feature space or Structural cluster feature space. Orthogroups were generated using OrthoFinder v5.5.0; Structural clusters were assigned by using FoldSeek to cluster AlphaFold-v4 structural predictions.

    The three datasets used as the basis for these embeddings were:

    sample "Brain8" from the Jiang et al. 2021 zebrafish cell atlas (files beginning with GSM3768152)

    sample "Brain1" from the Han et al. 2018 mouse cell atlas (files beginning with GSM2906405)

    sample "Xenopus_brain_COL65" from the Liao et al. 2022 Xenopus laevis adult cell atlas (files beginning with GSM6214268)

    For each dataset, we also generated a standardized cell type annotation file based on the author's originally provided cell type annotation data. The first column is the cell barcode for that species and the second column is the original study's cell type annotation for that cell.
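
A two-column annotation file of the kind described above (cell barcode, then cell type) could be read along these lines. This is a sketch under assumptions: the file is taken to be tab-separated with no header row, and the barcodes and cell types in the sample are invented for illustration.

```python
import csv
import io
from collections import Counter


def read_annotations(handle, delimiter="\t"):
    """Map cell barcode -> cell type from a two-column annotation file.

    The tab delimiter and the absence of a header row are assumptions;
    adjust them to match the actual files.
    """
    annotations = {}
    for row in csv.reader(handle, delimiter=delimiter):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        barcode, cell_type = row[0], row[1]
        annotations[barcode] = cell_type
    return annotations


# Invented sample content standing in for a real annotation file.
sample = io.StringIO("AAAC-1\tneuron\nAAAG-1\tastrocyte\nAAAT-1\tneuron\n")
annotations = read_annotations(sample)
cell_type_counts = Counter(annotations.values())
```

Counting the values gives a quick check of how many cells carry each annotation before joining the labels onto an embedding.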

    For the Xenopus brain data, we removed approximately 18k cells that were not annotated in the original data to simplify data analyses. These removals are reflected in the files with the "subsampled" suffix. Subsampled versions of the data are also available for the joint embedding space (prefixed with "DrerMmusXlae").

    For the final datasets used in our analyses, we also provide features x cell matrices as .h5ad files for smaller file sizes and faster loading using Scanpy.

    For visualizing our UMAP plots of our top200 embedding space, we provide ".tsv" files with a variety of metrics and the x and y positions of each cell in the UMAP. See "DrerMmusXlae_adultbrain_FoldSeek_plotlydata.tsv" and "DrerMmusXlae_adultbrain_OrthoFinder_plotlydata.tsv"

    These data are part of the Arcadia Science Pub titled "Comparing gene expression across species based on protein structure instead of sequence".

  9. s1K-1.1-850

    • huggingface.co
    Updated Mar 7, 2025
    Cite
    InfiX.ai (2025). s1K-1.1-850 [Dataset]. https://huggingface.co/datasets/InfiX-ai/s1K-1.1-850
    Dataset updated
    Mar 7, 2025
    Dataset provided by
    InfiX.ai
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This data is derived from simplescaling/s1K-1.1. Compared with the original simplescaling/s1K-1.1 data, our filtered set is smaller and achieves better results.

      What we did
    

    Text Embedding Generation: We use all-MiniLM-L6-v2 (from SentenceTransformers library) to generate "input" embeddings.

    Dimensionality reduction: We use UMAP, which aims to preserve both local and global data structure.

    n_components=2, n_neighbors=15, min_dist=0.1

    Data Sparsification (Dense Points… See the full description on the dataset page: https://huggingface.co/datasets/InfiX-ai/s1K-1.1-850.

  10. Results of the proposed methods compared to other methods from the...

    • figshare.com
    xls
    Updated Apr 3, 2024
    Cite
    Mohammed Shalaby; Mohamed Farouk; Hatem A. Khater (2024). Results of the proposed methods compared to other methods from the literature on the USPS dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0300641.t007
    Available download formats: xls
    Dataset updated
    Apr 3, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Mohammed Shalaby; Mohamed Farouk; Hatem A. Khater
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Results of the proposed methods compared to other methods from the literature on the USPS dataset.

  11. Data from: Data related to Panzer: A Machine Learning Based Approach to...

    • darus.uni-stuttgart.de
    Updated Nov 27, 2024
    + more versions
    Cite
    Tim Panzer (2024). Data related to Panzer: A Machine Learning Based Approach to Analyze Supersecondary Structures of Proteins [Dataset]. http://doi.org/10.18419/DARUS-4576
    Dataset updated
    Nov 27, 2024
    Dataset provided by
    DaRUS
    Authors
    Tim Panzer
    License

    https://darus.uni-stuttgart.de/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.18419/DARUS-4576

    Time period covered
    Nov 1, 1976 - Feb 29, 2024
    Dataset funded by
    DFG
    Description

    This entry contains the data used for the bachelor thesis, which investigated how embeddings can be used to analyze supersecondary structures. Abstract of the thesis: This thesis analyzes the behavior of supersecondary structures in the context of embeddings. For this purpose, data from the Protein Topology Graph Library was provided with embeddings. This resulted in a structured graph database, which will be used for future work and analyses. In addition, different projections were made into two-dimensional space to analyze how the embeddings behave there.

    The Jupyter Notebook 1_data_retrival.ipynb contains the download process of the graph files from the Protein Topology Graph Library (https://ptgl.uni-frankfurt.de). The downloaded .gml files can also be found in graph_files.zip. These graphs represent the relationships of supersecondary structures in the proteins and form the data basis for further analyses. The graph files are then processed in the Jupyter Notebook 2_data_storage_and_embeddings.ipynb and entered into a graph database. The sequences of the supersecondary and secondary structures from the PTGL can be found in fastas.zip. The embeddings were calculated using the ESM model of the Facebook Research Group (huggingface.co/facebook/esm2_t12_35M_UR50D) and can be found in three .h5 files; these are subsequently added to the database. The whole process in this notebook serves to build up the database, which can then be searched using Cypher queries. In the Jupyter Notebook 3_data_science.ipynb, different visualizations and analyses are carried out with the help of UMAP.

    For the installation of all dependencies, it is recommended to create a Conda environment and install all packages there. To use the project, PyEED should be installed using the snapshot of the original repository (source repository: https://github.com/PyEED/pyeed). The best way to install PyEED is to execute the pip install -e . command in the pyeed_BT folder. The dependencies can also be installed by using poetry and the .toml file. In addition, seaborn, h5py and umap-learn are required. These can be installed using the following commands:

    pip install h5py==3.12.1
    pip install seaborn==0.13.2 umap-learn==0.5.7
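
After installing, one way to confirm that the environment matches the pinned versions listed above is a small check with the standard-library `importlib.metadata` module. This is a sketch, not part of the thesis repository; the `check_pins` helper and the report wording are hypothetical.

```python
from importlib import metadata

# Versions pinned in the installation instructions above.
PINNED = {"h5py": "3.12.1", "seaborn": "0.13.2", "umap-learn": "0.5.7"}


def check_pins(pins):
    """Compare installed package versions against the expected pins."""
    report = {}
    for name, wanted in pins.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if found == wanted else f"mismatch (found {found})"
    return report
```

Calling `check_pins(PINNED)` inside the Conda environment returns one entry per package, which makes a version drift visible before running the notebooks.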

  12. Test dataset for omero-vitessce

    • zenodo.org
    • data.niaid.nih.gov
    csv, json, png
    Updated Sep 24, 2024
    Cite
    Michele Bortolomeazzi (2024). Test dataset for omero-vitessce [Dataset]. http://doi.org/10.5281/zenodo.13832665
    Available download formats: png, json, csv
    Dataset updated
    Sep 24, 2024
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Michele Bortolomeazzi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Test datasets for omero-vitessce

    Dataset designed for testing the omero-vitessce (https://github.com/NFDI4BIOIMAGE/omero-vitessce) plugin for OMERO (https://www.openmicroscopy.org/omero/). The omero-vitessce repository contains a cropped version of this dataset for automated testing (https://github.com/NFDI4BIOIMAGE/omero-vitessce/tree/main/test/data/MB266).

    Files

    • MAX_MBEN_ff_Xenium_0018446_MB-266_DAPI_2024-01-23_12.47.34_Fused_405nm_corr_cropped.png = PNG image with the DAPI channel.
    • MAX_MBEN_ff_Xenium_0018446_MB-266_DAPI_2024-01-23_12.47.34_Fused_405nm_corr_cropped_cp_masks.png = Cell segmentation mask (pixel values correspond to cell identities, 0 = background).
    • cells.csv =
    • embeddings.csv = UMAP embeddings for drawing an interactive scatterplot.
    • feature_matrix.csv = Transcript counts in each cell.
    • transcripts.csv = Gene name and coordinates (pixel) of each transcript.
    • VitessceConfig.json = Example configuration file generated by the omero-vitessce plugin for Vitessce; an equivalent file can be generated by using the form provided by the plugin in OMERO.web.

    See the repository README file for more details on the formats of these files: https://github.com/NFDI4BIOIMAGE/omero-vitessce?tab=readme-ov-file#config-files

    Usage

    1. Add the omero-web-zarr and omero-vitessce plugins to your OMERO.web installation.
    2. Import the images into OMERO in the same dataset.
    3. Attach all the .csv data files.
    4. Use the form in the "Vitessce" tab of the right-panel to generate a configuration file and open the Vitessce viewer.

    See the repository README file for more details on usage (https://github.com/NFDI4BIOIMAGE/omero-vitessce?tab=readme-ov-file#usage) and installation (https://github.com/NFDI4BIOIMAGE/omero-vitessce?tab=readme-ov-file#installation)

    Data Sources

    Adapted from the full original data at: https://www.ebi.ac.uk/biostudies/bioimages/studies/S-BIAD1093 (10.6019/S-BIAD1093).

    The original data were produced and analysed in the course of this study:

    https://www.biorxiv.org/content/10.1101/2024.04.03.586404v1

  13. Replication Data for: Measuring the impact of campaign finance on...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 8, 2023
    Cite
    Lalisse, Matthias (2023). Replication Data for: Measuring the impact of campaign finance on congressional voting: A machine learning approach [Dataset]. http://doi.org/10.7910/DVN/DHQQHX
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Lalisse, Matthias
    Description

    Replication data for the paper: "Measuring the impact of campaign finance on congressional voting: A machine learning approach". Includes:
    * metadata for legislators and bills
    * text embeddings for legislative summaries (sourced from ProPublica Congress Database), including 768d LongFormer embeddings and 2d embeddings for visualization (UMAP and Isomap)
    * legislator embeddings: 100d PCA on legislators' financial disclosures, as well as 2d visualization embeddings (UMAP and Isomap)
    * scripts for running the classification and RSA analyses

    Up to 100d embeddings are provided from the output of PCA for both bills and legislators. See README.ipynb for a tour of the datasets as well as starter code.

  14. GeoWaVe Cytometry Benchmark Data

    • data.niaid.nih.gov
    • zenodo.org
    Updated Oct 2, 2022
    Cite
    Matthew Morgan (2022). GeoWaVe Cytometry Benchmark Data [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7134722
    Dataset updated
    Oct 2, 2022
    Dataset provided by
    Simone Cuff
    Matthew Morgan
    Ross Jake Burton
    Matthias Eberl
    Andreas Artemiou
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Contained within this folder are six benchmark datasets (Levine13, Levine32, Samusik, Sepsis, and PD) used for the evaluation of the GeoWaVe ensemble clustering algorithm, part of the cytocluster (https://github.com/burtonrj/CytoCluster) package.

    The data are compensated and arc-sine transformed, with debris and dead cells removed. See manuscript for details: https://doi.org/10.1101/2022.06.30.496829

    Each dataset is available as a CSV file and includes two additional columns: UMAP1 and UMAP2. The UMAP columns contain embeddings generated using UMAP (2 components and n_neighbours=30) and were used for visualisation purposes. The column 'population' contains the original population labels generated using manual gating.
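
    As a rough illustration of how one of these CSV files might be consumed, the sketch below uses Python's standard csv module to group the UMAP1/UMAP2 coordinates by the 'population' column and compute a centroid per population. Only the column names UMAP1, UMAP2 and population come from the description above; the CD3 marker column, the inline sample rows and the function name are made up for illustration. A real file would be opened with open(path, newline="") instead of the in-memory sample.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows standing in for one benchmark CSV; only the column
# names UMAP1, UMAP2 and population are taken from the dataset description.
SAMPLE_CSV = """CD3,UMAP1,UMAP2,population
1.2,0.0,2.0,T cells
0.9,2.0,4.0,T cells
0.1,-3.0,1.0,B cells
"""


def population_centroids(handle):
    """Return {population: (mean UMAP1, mean UMAP2)} for a CSV file handle."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # x total, y total, row count
    for row in csv.DictReader(handle):
        acc = sums[row["population"]]
        acc[0] += float(row["UMAP1"])
        acc[1] += float(row["UMAP2"])
        acc[2] += 1
    return {pop: (x / n, y / n) for pop, (x, y, n) in sums.items()}


centroids = population_centroids(io.StringIO(SAMPLE_CSV))
for pop, (x, y) in sorted(centroids.items()):
    print(f"{pop}: UMAP1={x:.2f}, UMAP2={y:.2f}")
```

    Because the UMAP columns were generated for visualisation, centroids like these are best treated as plotting aids (for example, label positions) rather than as quantitative summaries.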

  15. Data from: A highly resolved integrated single-cell atlas of HPV-negative...

    • zenodo.org
    Updated Dec 30, 2024
    Lina Kroehling; Stefano Monti (2024). A highly resolved integrated single-cell atlas of HPV-negative Head and Neck Cancer [Dataset]. http://doi.org/10.5281/zenodo.14579515
    Dataset updated
    Dec 30, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Lina Kroehling; Stefano Monti
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A single cell transcriptomic atlas of HPV-negative Head and Neck Squamous Carcinomas.

    The atlas is a Seurat (v 4.1.0) object stored as a .rds file, which can be loaded in R.

    The data can be loaded into R as follows:

    atlas <- readRDS('FullHNSCCAtlas.rds')

    The immune and nonimmune compartments can be separated as follows:

    immune <- subset(atlas, cells = rownames(atlas@meta.data[atlas@meta.data$Compartment == "Immune",]))
    immune[["umap"]]@cell.embeddings <- as.matrix(immune@meta.data[,c("CompartmentUMAP1_Coordinates", "CompartmentUMAP2_Coordinates")])
    colnames(x = immune[["umap"]]@cell.embeddings) <- paste0("UMAP_", 1:2)


    nonimmune <- subset(atlas, cells = rownames(atlas@meta.data[atlas@meta.data$Compartment == "nonImmune",]))
    nonimmune[["umap"]]@cell.embeddings <- as.matrix(nonimmune@meta.data[,c("CompartmentUMAP1_Coordinates", "CompartmentUMAP2_Coordinates")])
    colnames(x = nonimmune[["umap"]]@cell.embeddings) <- paste0("UMAP_", 1:2)

  16. Results of the proposed methods compared to other methods from the...

    • figshare.com
    xls
    Updated Apr 3, 2024
    Mohammed Shalaby; Mohamed Farouk; Hatem A. Khater (2024). Results of the proposed methods compared to other methods from the literature on the Banana dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0300641.t003
    Available download formats: xls
    Dataset updated
    Apr 3, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Mohammed Shalaby; Mohamed Farouk; Hatem A. Khater
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Results of the proposed methods compared to other methods from the literature on the Banana dataset.

  17. LungMAP Azimuth Reference - Human Adult Lung scRNA-Seq

    • zenodo.org
    • explore.openaire.eu
    bin
    Updated Nov 6, 2021
    LungMAP (2021). LungMAP Azimuth Reference - Human Adult Lung scRNA-Seq [Dataset]. http://doi.org/10.5281/zenodo.5649206
    Available download formats: bin
    Dataset updated
    Nov 6, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    LungMAP
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the LungMAP scRNA-Seq reference associated with the Lung CellCards resource. The initial reference integrated 259k cells from 72 donors from five published (PMIDs: 32726565, 32427931, 30554520, 32832599, 32832598) and one unpublished single cell RNA-seq cohorts. The samples are single-cell 10x Genomics captures (3’ v2 and v3) of healthy, non-diseased adult and pediatric lung. Cells from different donors were integrated using Batchlor. Preliminary cell types were called based on Leiden clustering analysis and expression patterns of LungMAP cell card markers. UMAP embeddings were generated using monocle3. This reference, along with a corresponding single-nucleus specific version of this atlas, is under active construction. We expect to release the beta version of the reference in November 2021. The reference conforms to the Azimuth reference data structure described at https://github.com/satijalab/azimuth/wiki/Azimuth-Reference-Format.

  18. Pareto set of different methods on the USPS dataset.

    • plos.figshare.com
    xls
    Updated Apr 3, 2024
    Mohammed Shalaby; Mohamed Farouk; Hatem A. Khater (2024). Pareto set of different methods on the USPS dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0300641.t008
    Available download formats: xls
    Dataset updated
    Apr 3, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Mohammed Shalaby; Mohamed Farouk; Hatem A. Khater
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Pareto set of different methods on the USPS dataset.

  19. Dataset on Bibliographic, Textual, and Embedding Data for General Relativity...

    • zenodo.org
    bin
    Updated Dec 31, 2024
    Raphael Schlattmann (2024). Dataset on Bibliographic, Textual, and Embedding Data for General Relativity and Gravitation Publications (1911–2000) [Dataset]. http://doi.org/10.5281/zenodo.14581503
    Available download formats: bin
    Dataset updated
    Dec 31, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Raphael Schlattmann
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Overview

    This dataset supplements the paper “Trajectories of Change: Approaches for Tracking Knowledge Evolution,” currently under review. It includes bibliographic, textual, and embedding data for 180,785 publications in General Relativity and Gravitation (GRG), spanning 1911 to 2000, and is based on NASA/ADS. The file is in Parquet format with 33 columns.

    Usage

    The dataset is directly compatible with the UnigramKLD and EmbeddingDensities classes of the semanticlayertools Python package.

    Data Structure

    Column | Format | Description | Example
    Bibcode | string | Unique publication identifier. | "1995PASP..107..803U"
    Author | string | Authors listed as comma-separated names. | "Urry CM, Padovani P"
    Title | string | Title of the publication. | "Unified Schemes for Radio-Loud Active Galactic Nuclei"
    Title_en | string | Title translated into English. | "Unified Schemes for Radio-Loud Active Galactic Nuclei"
    Year | integer | Year of publication. | 1995
    Journal | string | Journal name. | "Publications of the Astronomical Society of the Pacific"
    Journal Abbreviation | string | Abbreviated journal name. | "PASP"
    Volume | string | Volume number (if applicable). | "107"
    Issue | string | Issue number (if applicable). | "19"
    First Page | string | Starting page. | "803"
    Last Page | string | Ending page. | "25"
    Abstract | string | Abstract text. | "The appearance of active galactic nuclei (AGN) depends strongly on orientation, dominating classification..."
    Abstract_en | string | Abstract translated into English. | "The appearance of active galactic nuclei (AGN) depends strongly on orientation, dominating classification..."
    Keywords | string | Comma-separated keywords. | "galaxies: active, galaxies: fundamental parameters, astrophysics"
    DOI | string | Digital Object Identifier. | "10.1086/133630"
    Affiliation | string | Author affiliations. | "AA(University of XYZ), AB(-)"
    Category | string | Publication type (e.g., article, book). | "article"
    Citation Count | float | Number of citations. | 4380.0
    References | array of strings | List of cited Bibcodes. | ["1966Natur.209..751H", "1966Natur.211..468R", "1968ApJ...151..393S"]
    PDF_URL | string | Link to the publication PDF. | "https://ui.adsabs.harvard.edu/link_gateway/1995PASP..107..803U/ADS_PDF"
    Title_lang | string | Language of the title. | "en"
    Abstract_lang | string | Language of the abstract. | "en"
    full_text | string | Full text of the publication (where available). | "Unified Schemes for Radio-Loud Active Galactic Nuclei. The appearance of AGN depends so strongly on..."
    tokens | array of strings | Tokenized text of the title and abstract for computational analysis. | ["unify", "schemes", "radio", "loud", "active", "galactic", "nuclei"]
    UMAP-1 | float32 | UMAP embedding coordinate 1. | 10.423940
    UMAP-2 | float32 | UMAP embedding coordinate 2. | 7.890975
    Cluster | integer | Cluster label for topic modeling or grouping. | 15
    Name | string | Descriptive cluster name. | "15_radio_quasars_sources_galaxies"
    KeyBERT | string | Key phrases extracted via KeyBERT. | "radio galaxies, high redshift, radio sources, optical imaging"
    OpenAI | string | Embedding-based descriptive phrases. | "Cosmological Evolution of Radio-Loud Quasars"
    MMR | string | Extracted key phrases using Maximal Marginal Relevance (MMR). | "quasars, radio sources, redshift, luminosity, star formation"
    POS | string | Key terms extracted via part-of-speech tagging. | "radio, quasars, sources, galaxies, redshift, optical"
    full_embeddings | array of floats | Text embeddings generated using OpenAI's text-embedding-3-large model. | "[ 0.01164897 -0.00343577 -0.03168862 ... 0.00237622]"
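
    The Cluster and Name columns appear to be related: in the example row, the Name value "15_radio_quasars_sources_galaxies" begins with the Cluster label 15, followed by underscore-separated terms. Assuming that pattern holds for every row (the description does not state it), a small Python helper could split a name into its label and terms. The function name is hypothetical and not part of the dataset or of semanticlayertools.

```python
def split_cluster_name(name):
    """Split a cluster name into (label, terms).

    Assumes the '<label>_<term>_<term>...' pattern seen in the example row;
    that pattern is inferred from a single example, not documented.
    """
    label, _, rest = name.partition("_")
    terms = rest.split("_") if rest else []
    return int(label), terms


label, terms = split_cluster_name("15_radio_quasars_sources_galaxies")
print(label, terms)
```

    A check such as comparing the parsed label against the Cluster column would be a cheap way to confirm the assumption before relying on it.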

  20. Transcriptomics of Alveolar Immune Cells Reveals Insight into Mechanisms of...

    • zenodo.org
    Updated May 5, 2025
    Peter Allen (2025). Transcriptomics of Alveolar Immune Cells Reveals Insight into Mechanisms of Human Pulmonary Fibrosis [Dataset]. http://doi.org/10.5281/zenodo.15339001
    Dataset updated
    May 5, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Peter Allen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains the preprocessed single-cell RNA data from the manuscript Transcriptomics of Alveolar Immune Cells Reveals Insight into Mechanisms of Human Pulmonary Fibrosis (under submission).

    Data Descriptions:

    BAL_FINAL.rds | Seurat Object with the entire dataset
    BAL_FINAL.h5ad | Scanpy Object with the entire dataset
    BAL_FINAL_metadata.txt | Metadata for each cell
    mlm_umap_embeddings.csv | UMAP embeddings for the monocyte-derived clusters
    ipf_allen2022.full_score.gz | scDRS results for each cell using the Richard Allen et al. 2022 IPF GWAS summary statistics