Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We propose Bayesian Reduction for Amplified Quantization in Umap Embedding (BRAQUE), a novel integrative approach that spans data preprocessing to phenotype classification. BRAQUE starts with a new preprocessing step, named Lognormal Shrinkage, which enhances input fragmentation by fitting a lognormal mixture model and shrinking each component towards its median, helping the subsequent clustering step find clearer, better-separated clusters. The BRAQUE pipeline then consists of a dimensionality reduction step performed with UMAP and a clustering step performed with HDBSCAN on the UMAP embedding. These supplemental data contain the CSV image data files for seven lymphoid tissues, the antibody list and an MTA agreement letter for the CyBorgh software.
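As a rough illustration of the shrinkage idea described above, the following sketch assumes that a lognormal mixture has already been fitted and that every value carries a component label. The function name `shrink_towards_median`, the `strength` parameter and the choice to shrink in log space are assumptions made for illustration; this is not the BRAQUE implementation.

```python
import math
from collections import defaultdict
from statistics import median


def shrink_towards_median(values, labels, strength=0.5):
    """Pull each positive value towards the median of its mixture component.

    Hypothetical sketch: `labels` stands in for the component assignment that
    a fitted lognormal mixture would provide; `strength` in [0, 1] is an
    assumed knob (0 = unchanged, 1 = collapse onto the component median).
    """
    # Work in log space, where a lognormal component becomes a normal one.
    logs_by_component = defaultdict(list)
    for value, label in zip(values, labels):
        logs_by_component[label].append(math.log(value))
    medians = {lab: median(logs) for lab, logs in logs_by_component.items()}

    shrunk = []
    for value, label in zip(values, labels):
        log_value = math.log(value)
        # Move a fraction `strength` of the way to the component median.
        shrunk.append(math.exp(log_value + strength * (medians[label] - log_value)))
    return shrunk
```

Shrinking within each component widens the relative gaps between components, which is the effect the description attributes to the preprocessing step.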
https://choosealicense.com/licenses/cc0-1.0/
Dataset Card
This dataset is a UMAP 2D-projection of the glove.6B.50d embeddings from Stanford. It is intended as a fast reference for visualizing embeddings in a workshop from the AI Service Center Berlin-Brandenburg at the Hasso Plattner Institute.
Dataset Details
Dataset Description
The embeddings cover a vocabulary of 400k tokens, with 2 dimensions per token.
Curated by: Mario Tormo Romero
License: cc0-1.0
Dataset Sources
This Dataset has been… See the full description on the dataset page: https://huggingface.co/datasets/mt0rm0/glove.6B.50d.umap.2d.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We have developed ProjecTILs, a computational approach to project new data sets into a reference map of T cells, enabling their direct comparison in a stable, annotated system of coordinates. Because new cells are embedded in the same space of the reference, ProjecTILs enables the classification of query cells into annotated, discrete states, but also over a continuous space of intermediate states. By comparing multiple samples over the same map, and across alternative embeddings, the method allows exploring the effect of cellular perturbations (e.g. as the result of therapy or genetic engineering) and identifying genetic programs significantly altered in the query compared to a control set or to the reference map. We illustrate the projection of several data sets from recent publications over two cross-study murine T cell reference atlases: the first describing tumor-infiltrating T lymphocytes (TILs), the second characterizing acute and chronic viral infection.
To construct the reference TIL atlas, we obtained single-cell gene expression matrices from the following GEO entries: GSE124691, GSE116390, GSE121478, GSE86028; and entry E-MTAB-7919 from Array-Express. Data from GSE124691 contained samples from tumor and from tumor-draining lymph nodes, and were therefore treated as two separate datasets. For the TIL projection examples (OVA Tet+, miR-155 KO and Regnase-KO), we obtained the gene expression counts from entries GSE122713, GSE121478 and GSE137015, respectively.
Prior to dataset integration, single-cell data from individual studies were filtered using TILPRED-1.0 (https://github.com/carmonalab/TILPRED), which removes cells not enriched in T cell markers (e.g. Cd2, Cd3d, Cd3e, Cd3g, Cd4, Cd8a, Cd8b1) and cells enriched in non T cell genes (e.g. Spi1, Fcer1g, Csf1r, Cd19). Dataset integration was performed using STACAS (https://github.com/carmonalab/STACAS), a batch-correction algorithm based on Seurat 3.
For the TIL reference map, we specified 600 variable genes per dataset, excluding cell cycling genes, mitochondrial, ribosomal and non-coding genes, as well as genes expressed in less than 0.1% or more than 90% of the cells of a given dataset. For integration, a total of 800 variable genes were derived as the intersection of the 600 variable genes of individual datasets, prioritizing genes found in multiple datasets and, in case of ties, those derived from the largest datasets. We determined pairwise dataset anchors using STACAS with default parameters, and filtered anchors using an anchor score threshold of 0.8. Integration was performed using the IntegrateData function in Seurat 3, providing the anchor set determined by STACAS, and a custom integration tree to initiate alignment from the largest and most heterogeneous datasets.
Next, we performed unsupervised clustering of the integrated cell embeddings using the Shared Nearest Neighbor (SNN) clustering method implemented in Seurat 3 with parameters {resolution=0.6, reduction="umap", k.param=20}. We then manually annotated individual clusters (merging clusters when necessary) based on several criteria: i) average expression of key marker genes in individual clusters; ii) gradients of gene expression over the UMAP representation of the reference map; iii) gene-set enrichment analysis to determine over- and under-expressed genes per cluster using MAST. In order to have access to predictive methods for UMAP, we recomputed PCA and UMAP embeddings independently of Seurat 3 using, respectively, the prcomp function from the base R package "stats" and the "umap" R package (https://github.com/tkonopka/umap).
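The gene prioritization rule described above (prefer genes that are variable in many datasets, then genes coming from the largest datasets) can be read in more than one way. The sketch below shows one possible reading; the function name `select_integration_genes`, its arguments and the alphabetical final tie-break are assumptions for illustration and do not reproduce the STACAS code.

```python
from collections import Counter


def select_integration_genes(variable_genes, dataset_sizes, n_genes=800):
    """Rank genes by how many datasets list them as variable.

    variable_genes: dict mapping dataset name -> list of variable genes.
    dataset_sizes:  dict mapping dataset name -> number of cells.
    Ties are broken by the size of the largest dataset that lists the gene
    (an assumed reading of the rule), then alphabetically for determinism.
    """
    counts = Counter()
    largest = {}
    for name, genes in variable_genes.items():
        for gene in set(genes):
            counts[gene] += 1
            largest[gene] = max(largest.get(gene, 0), dataset_sizes[name])
    ranked = sorted(counts, key=lambda g: (-counts[g], -largest[g], g))
    return ranked[:n_genes]
```

With three toy datasets in which one gene is shared by all of them, that gene is ranked first, and genes found in a single dataset are ordered by the cell count of that dataset.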
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
LAION-Aesthetics :: CLIP → UMAP
This dataset is a CLIP (text) → UMAP embedding of the LAION-Aesthetics dataset - specifically the improved_aesthetics_6plus version, which filters the full dataset to images with scores of > 6 under the "aesthetic" filtering model. Thanks LAION for this amazing corpus!
The dataset here includes coordinates for 3x separate UMAP fits using different values for the n_neighbors parameter - 10, 30, and 60 - which are broken out as separate columns with… See the full description on the dataset page: https://huggingface.co/datasets/dclure/laion-aesthetics-12m-umap.
Quantifying Iconicity - Zenodo
This dataset contains the material collected for the article "Distant reading 940,000 online circulations of 26 iconic photographs" (to be) published in New Media & Society (DOI: 10.1177/14614448211049459). We identified 26 iconic photographs based on earlier work (Van der Hoeven, 2019). The Google Cloud Vision (GCV) API was subsequently used to identify webpages that host a reproduction of the iconic image. The GCV API uses computer vision methods and the Google index to retrieve these reproductions. The code for calling the API and parsing the data can be found on GitHub: https://github.com/rubenros1795/ReACT_GCV.
The core dataset consists of .tsv files with the URLs that refer to the webpages. The files also contain other metadata provided by the GCV API, along with manually generated metadata. This includes:
- the URL that refers specifically to the image. This can be a URL that refers to a full match or a partial match
- the title of the page
- the iteration number. Because the GCV API puts a limit on its output, we had to reupload the identified images to the API to extend our search. We continued these iterations until no more new unique URLs were found
- the language found by the langid Python module, along with the normalized score
- the labels associated with the image by Google
- the scrape date
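The iteration scheme mentioned in the list above (re-submitting identified images until no new unique URLs appear) can be sketched as a simple loop. Here `find_matches` is a hypothetical stand-in for the reverse image search call and not a function of the GCV API; the real pipeline also included manual relevance checks between iterations, which this sketch omits.

```python
def collect_urls(seed_images, find_matches):
    """Repeat the search until an iteration yields no new unique page URLs.

    `find_matches` is a hypothetical callable: it takes one image reference
    and returns (page_url, image_url) pairs for pages hosting a reproduction.
    Returns a dict mapping each page URL to the iteration that first found it.
    """
    seen = {}
    queue = list(seed_images)
    iteration = 0
    while queue:
        iteration += 1
        next_queue = []
        for image in queue:
            for page_url, image_url in find_matches(image):
                if page_url not in seen:
                    seen[page_url] = iteration
                    # Newly found images are re-submitted in the next round.
                    next_queue.append(image_url)
        queue = next_queue
    return seen
```

The loop stops on its own because only images attached to previously unseen pages are queued again.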
Alongside the .tsv-files, there are several other elements in the following folder structure:
├── data
│   ├── embeddings
│   │   ├── doc2vec
│   │   ├── input-text
│   │   ├── metadata
│   │   └── umap
│   ├── evaluation
│   ├── results
│   │   ├── diachronic-plots
│   │   └── top-words
│   └── tsv
The /embeddings folder contains the doc2vec models, the training input for the models, the metadata (id, URL, date) and the UMAP embeddings used in the GMM clustering. Please note that the date parser was not able to find dates for all webpages, and for this reason not all training texts have associated metadata.
The /evaluation folder contains the AIC and BIC scores for GMM clustering with different numbers of clusters.
The /results folder contains the top words associated with the clusters and the diachronic cluster prominence plots.
Our pipeline contained several interventions to prevent noise in the data. First, in between the iterations we manually checked the scraped photos for relevance. We did so because reuploading an iconic image that is paired with another, irrelevant, one results in reproductions of the irrelevant one in the next iteration. Because we did not catch all noise, we used Scale Invariant Feature Transform (SIFT), a basic computer vision algorithm, to remove images that did not meet a threshold of ten keypoints. By doing so we removed completely unrelated photographs, but left room for variations of the original (such as painted versions of Che Guevara, or cropped versions of the Napalm Girl image). Another issue was the parsing of webpage texts. After experimenting with different webpage parsers that aim to extract 'relevant' text, it proved too difficult to use one solution for all our webpages. Therefore we simply parsed all the text contained in commonly used HTML tags.
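For readers unfamiliar with the AIC and BIC scores stored in the evaluation folder, the sketch below applies the standard definitions to a Gaussian mixture. The parameter count assumes full covariance matrices, and the log-likelihood values are made up for illustration; neither is taken from the dataset.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion: k ln n - 2 ln L."""
    return n_params * math.log(n_samples) - 2 * log_likelihood


def gmm_n_params(n_components, n_features):
    """Free parameters of a GMM, assuming full covariance matrices."""
    means = n_components * n_features
    covariances = n_components * n_features * (n_features + 1) // 2
    weights = n_components - 1
    return means + covariances + weights


# Made-up total log-likelihoods for 2, 3 and 4 clusters on 2D embeddings.
log_likelihoods = {2: -5200.0, 3: -5050.0, 4: -5040.0}
n_samples = 1000
bic_scores = {
    k: bic(ll, gmm_n_params(k, 2), n_samples) for k, ll in log_likelihoods.items()
}
best_k = min(bic_scores, key=bic_scores.get)  # lower BIC is better
```

Both criteria reward a higher likelihood and penalize extra parameters; BIC penalizes them more strongly as the number of samples grows.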
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We have developed ProjecTILs, a computational approach to project new data sets into a reference map of T cells, enabling their direct comparison in a stable, annotated system of coordinates. Because new cells are embedded in the same space of the reference, ProjecTILs enables the classification of query cells into annotated, discrete states, but also over a continuous space of intermediate states. By comparing multiple samples over the same map, and across alternative embeddings, the method allows exploring the effect of cellular perturbations (e.g. as the result of therapy or genetic engineering) and identifying genetic programs significantly altered in the query compared to a control set or to the reference map. We illustrate the projection of several data sets from recent publications over two cross-study murine T cell reference atlases: the first describing tumor-infiltrating T lymphocytes (TILs), the second characterizing acute and chronic viral infection. Single-cell data to build the virus-specific CD8 T cell reference map were downloaded from GEO under the following entries: GSE131535, GSE134139 and GSE119943, selecting only samples in wild type conditions. Data for the Ptpn2-KO, Tox-KO and CD4-depletion projections were obtained from entries GSE134139, GSE119943, and GSE137007 and were not included in the construction of the reference map. To construct the LCMV reference map, we split the dataset into five batches that displayed strong batch effects, and applied STACAS (https://github.com/carmonalab/STACAS) to mitigate its confounding effects. We computed 800 variable genes per batch, excluding cell cycling genes, ribosomal and mitochondrial genes, and computed pairwise anchors using 200 integration genes, and otherwise default STACAS parameters. Anchors were filtered at the default threshold 0.8 percentile, and integration was performed with the IntegrateData Seurat3 function with the guide tree suggested by STACAS. 
Next, we performed unsupervised clustering of the integrated cell embeddings using the Shared Nearest Neighbor (SNN) clustering method implemented in Seurat 3 with parameters {resolution=0.4, reduction="pca", k.param=20}. We then manually annotated individual clusters (merging clusters when necessary) based on several criteria: i) average expression of key marker genes in individual clusters; ii) gradients of gene expression over the UMAP representation of the reference map; iii) gene-set enrichment analysis to determine over- and under-expressed genes per cluster using MAST. In order to have access to predictive methods for UMAP, we recomputed PCA and UMAP embeddings independently of Seurat 3 using, respectively, the prcomp function from the base R package "stats" and the "umap" R package (https://github.com/tkonopka/umap).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Whole-brain functional connectivity (FC) measured with functional MRI (fMRI) evolves over time in meaningful ways at temporal scales going from years (e.g., development) to seconds [e.g., within-scan time-varying FC (tvFC)]. Yet, our ability to explore tvFC is severely constrained by its large dimensionality (several thousands). To overcome this difficulty, researchers often seek to generate low dimensional representations (e.g., 2D and 3D scatter plots) hoping those will retain important aspects of the data (e.g., relationships to behavior and disease progression). Limited prior empirical work suggests that manifold learning techniques (MLTs)—namely those seeking to infer a low dimensional non-linear surface (i.e., the manifold) where most of the data lies—are good candidates for accomplishing this task. Here we explore this possibility in detail. First, we discuss why one should expect tvFC data to lie on a low dimensional manifold. Second, we estimate what is the intrinsic dimension (ID; i.e., minimum number of latent dimensions) of tvFC data manifolds. Third, we describe the inner workings of three state-of-the-art MLTs: Laplacian Eigenmaps (LEs), T-distributed Stochastic Neighbor Embedding (T-SNE), and Uniform Manifold Approximation and Projection (UMAP). For each method, we empirically evaluate its ability to generate neuro-biologically meaningful representations of tvFC data, as well as their robustness against hyper-parameter selection. Our results show that tvFC data has an ID that ranges between 4 and 26, and that ID varies significantly between rest and task states. We also show how all three methods can effectively capture subject identity and task being performed: UMAP and T-SNE can capture these two levels of detail concurrently, but LE could only capture one at a time. We observed substantial variability in embedding quality across MLTs, and within-MLT as a function of hyper-parameter selection. 
To help alleviate this issue, we provide heuristics that can inform future studies. Finally, we also demonstrate the importance of feature normalization when combining data across subjects and the role that temporal autocorrelation plays in the application of MLTs to tvFC data. Overall, we conclude that while MLTs can be useful to generate summary views of labeled tvFC data, their application to unlabeled data such as resting-state remains challenging.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Embeddings of single-cell RNA-Seq data from three adult vertebrate brain datasets into Orthogroup feature space or Structural cluster feature space. Orthogroups were generated using OrthoFinder v5.5.0; Structural clusters were assigned by using FoldSeek to cluster AlphaFold-v4 structural predictions.
The three datasets used as the basis for these embeddings were:
sample "Brain8" from the Jiang et al. 2021 zebrafish cell atlas (files beginning with GSM3768152)
sample "Brain1" from the Han et al. 2018 mouse cell atlas (files beginning with GSM2906405)
sample "Xenopus_brain_COL65" from the Liao et al. 2022 Xenopus laevis adult cell atlas (files beginning with GSM6214268)
For each dataset, we also generated a standardized cell type annotation file based on the cell type annotations originally provided by the authors. The first column is the cell barcode for that species and the second column is the original study's cell type annotation for that cell.
For the Xenopus brain data, we removed approximately 18k cells that were not annotated in the original data to simplify data analyses; these are reflected in the files with the "subsampled" suffix. Subsampled versions of the data are also available for the joint embedding space (prefixed with "DrerMmusXlae").
For the final datasets used in our analyses, we also provide features x cell matrices as .h5ad files for smaller file sizes and faster loading using Scanpy.
For visualizing our UMAP plots of our top200 embedding space, we provide ".tsv" files with a variety of metrics and the x and y positions of each cell in the UMAP. See "DrerMmusXlae_adultbrain_FoldSeek_plotlydata.tsv" and "DrerMmusXlae_adultbrain_OrthoFinder_plotlydata.tsv"
These data are part of the Arcadia Science Pub titled "Comparing gene expression across species based on protein structure instead of sequence".
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This data is derived from simplescaling/s1K-1.1. Compared with the original simplescaling/s1K-1.1 data, our filtered version uses less data and achieves better results.
What we did
Text Embedding Generation: We use all-MiniLM-L6-v2 (from SentenceTransformers library) to generate "input" embeddings.
Dimensionality reduction: We use UMAP, which preserves both local and global data structure.
n_components=2, n_neighbors=15, min_dist=0.1
Data Sparsification (Dense Points… See the full description on the dataset page: https://huggingface.co/datasets/InfiX-ai/s1K-1.1-850.
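The first two steps listed above could look roughly like the following sketch. The UMAP parameters and the model name are taken from the description; the function name `embed_texts_2d` and the overall structure are assumptions, and the sparsification step is not shown because its description is truncated here. The sketch requires the sentence-transformers and umap-learn packages.

```python
# Parameters quoted in the dataset description.
UMAP_PARAMS = {"n_components": 2, "n_neighbors": 15, "min_dist": 0.1}


def embed_texts_2d(texts, model_name="all-MiniLM-L6-v2"):
    """Encode texts with SentenceTransformers and project them to 2D with UMAP.

    Hypothetical helper, not the authors' code. The heavy dependencies are
    imported lazily so that this module loads even when they are missing.
    """
    from sentence_transformers import SentenceTransformer
    import umap

    embeddings = SentenceTransformer(model_name).encode(texts)
    return umap.UMAP(**UMAP_PARAMS).fit_transform(embeddings)
```

Because n_neighbors is 15, the input should contain clearly more than 15 texts for the neighborhood graph to be meaningful.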
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Results of the proposed methods compared to other methods from the literature on the USPS dataset.
https://darus.uni-stuttgart.de/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.18419/DARUS-4576
This entry contains the data used in the bachelor thesis, which investigated how embeddings can be used to analyze supersecondary structures.
Abstract of the thesis: This thesis analyzes the behavior of supersecondary structures in the context of embeddings. For this purpose, data from the Protein Topology Graph Library was provided with embeddings. This resulted in a structured graph database, which will be used for future work and analyses. In addition, different projections were made into two-dimensional space to analyze how the embeddings behave there.
The Jupyter Notebook 1_data_retrival.ipynb contains the download process for the graph files from the Protein Topology Graph Library (https://ptgl.uni-frankfurt.de). The downloaded .gml files can also be found in graph_files.zip. They form graphs that represent the relationships of supersecondary structures in the proteins, and they are the data basis for further analyses.
These graph files are then processed in the Jupyter Notebook 2_data_storage_and_embeddings.ipynb and entered into a graph database. The sequences of the supersecondary and secondary structures from the PTGL can be found in fastas.zip. The embeddings were calculated using the ESM model of the Facebook Research Group (huggingface.co/facebook/esm2_t12_35M_UR50D) and can be found in three .h5 files; they are subsequently added to the database. The whole process in this notebook serves to build up the database, which can then be searched using Cypher queries.
In the Jupyter Notebook 3_data_science.ipynb, different visualizations and analyses are then carried out with the help of UMAP.
For the installation of all dependencies, it is recommended to create a Conda environment and install all packages there. To use the project, PyEED should be installed using the snapshot of the original repository (source repository: https://github.com/PyEED/pyeed). The best way to install PyEED is to execute the pip install -e . command in the pyeed_BT folder. The dependencies can also be installed by using poetry and the .toml file. In addition, seaborn, h5py and umap-learn are required. These can be installed using the following commands:
pip install h5py==3.12.1
pip install seaborn==0.13.2
pip install umap-learn==0.5.7
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset designed for testing the omero-vitessce (https://github.com/NFDI4BIOIMAGE/omero-vitessce) plugin for OMERO (https://www.openmicroscopy.org/omero/). The omero-vitessce repository contains a cropped version of this dataset for automated testing (https://github.com/NFDI4BIOIMAGE/omero-vitessce/tree/main/test/data/MB266).
MAX_MBEN_ff_Xenium_0018446_MB-266_DAPI_2024-01-23_12.47.34_Fused_405nm_corr_cropped.png = PNG image with the DAPI channel.
MAX_MBEN_ff_Xenium_0018446_MB-266_DAPI_2024-01-23_12.47.34_Fused_405nm_corr_cropped_cp_masks.png = Cell segmentation mask (pixel values correspond to cell identities, 0 = background).
cells.csv =
embeddings.csv = UMAP embeddings for drawing an interactive scatterplot.
feature_matrix.csv = Transcript counts in each cell.
transcripts.csv = Gene name and coordinates (pixel) of each transcript.
VitessceConfig.json = Example configuration file generated by the omero-vitessce plugin for Vitessce; an equivalent file can be generated by using the form provided by the plugin in OMERO.web.
See the repository README file for more details on the formats of these files: https://github.com/NFDI4BIOIMAGE/omero-vitessce?tab=readme-ov-file#config-files
See the repository README file for more details on usage (https://github.com/NFDI4BIOIMAGE/omero-vitessce?tab=readme-ov-file#usage) and installation (https://github.com/NFDI4BIOIMAGE/omero-vitessce?tab=readme-ov-file#installation)
Adapted from the full original data at: https://www.ebi.ac.uk/biostudies/bioimages/studies/S-BIAD1093 (10.6019/S-BIAD1093).
The original data were produced and analysed in the course of this study:
Replication data for the paper: "Measuring the impact of campaign finance on congressional voting: A machine learning approach". Includes:
* metadata for legislators and bills,
* text embeddings for legislative summaries (sourced from ProPublica Congress Database). Includes 768d LongFormer embeddings and 2d embeddings for visualization (UMAP and Isomap),
* legislator embeddings: 100d PCA on legislators' financial disclosures, as well as 2d visualization embeddings (UMAP and Isomap),
* scripts for running the classification and RSA analyses.
Up to 100d embeddings are provided from the output of PCA for both bills and legislators. See README.ipynb for a tour of the datasets as well as starter code.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Contained within this folder are six benchmark datasets (Levine13, Levine32, Samusik, Sepsis, and PD) used for the evaluation of the GeoWaVe ensemble clustering algorithm, part of the cytocluster (https://github.com/burtonrj/CytoCluster) package.
The data are compensated, arc-sine transformed, and debris and dead cells removed. See manuscript for details: https://doi.org/10.1101/2022.06.30.496829
Each dataset is available as a CSV file and includes two additional columns: UMAP1 and UMAP2. The UMAP columns contain embeddings generated using UMAP (2 components and n_neighbours=30) and were used for visualisation purposes. The column 'population' contains the original population labels generated using manual gating.
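As a small usage sketch for these files, the code below reads the UMAP1, UMAP2 and population columns named above and computes the mean embedding position of each population. The example rows and the `CD3` marker column are made up for illustration; only the three column names come from the description.

```python
import csv
import io
from collections import defaultdict


def umap_centroids(csv_file):
    """Return the mean (UMAP1, UMAP2) position of each gated population."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for row in csv.DictReader(csv_file):
        acc = sums[row["population"]]
        acc[0] += float(row["UMAP1"])
        acc[1] += float(row["UMAP2"])
        acc[2] += 1
    return {pop: (sx / n, sy / n) for pop, (sx, sy, n) in sums.items()}


# Made-up rows for illustration only; real files contain many marker columns.
example = io.StringIO(
    "CD3,UMAP1,UMAP2,population\n"
    "1.2,0.0,1.0,T cells\n"
    "1.4,2.0,3.0,T cells\n"
    "0.1,-4.0,0.5,B cells\n"
)
centroids = umap_centroids(example)
```

For a real file, pass an open file handle instead of the in-memory example, for instance `open(path, newline="")`.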
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A single cell transcriptomic atlas of HPV-negative Head and Neck Squamous Carcinomas.
The atlas is a Seurat (v 4.1.0) object stored as a .rds file, which can be loaded in R.
The data can be loaded into R as follows:
atlas <- readRDS('FullHNSCCAtlas.rds')
The immune and nonimmune compartments can be separated as follows:
immune <- subset(atlas, cells = rownames(atlas@meta.data[atlas@meta.data$Compartment == "Immune",]))
immune[["umap"]]@cell.embeddings <- as.matrix(immune@meta.data[,c("CompartmentUMAP1_Coordinates", "CompartmentUMAP2_Coordinates")])
colnames(x = immune[["umap"]]@cell.embeddings) <- paste0("UMAP_", 1:2)
nonimmune <- subset(atlas, cells = rownames(atlas@meta.data[atlas@meta.data$Compartment == "nonImmune",]))
nonimmune[["umap"]]@cell.embeddings <- as.matrix(nonimmune@meta.data[,c("CompartmentUMAP1_Coordinates", "CompartmentUMAP2_Coordinates")])
colnames(x = nonimmune[["umap"]]@cell.embeddings) <- paste0("UMAP_", 1:2)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Results of the proposed methods compared to other methods from the literature on the Banana dataset.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The LungMAP scRNA-Seq reference associated with the Lung CellCards resource. The initial reference integrated 259k cells from 72 donors from five published (PMIDs: 32726565, 32427931, 30554520, 32832599, 32832598) and one unpublished single cell RNA-seq cohort. The data are single-cell 10x Genomics captures (3' v2 and v3) of non-diseased adult and pediatric lungs. Cells from different donors were integrated using batchelor. Preliminary cell types were called based on Leiden clustering analysis and expression patterns of LungMAP cell card markers. UMAP embeddings were generated using monocle3. This reference, along with a corresponding single-nucleus specific version of this atlas, is under active construction. We expect to release the beta version of the reference in November 2021. Conforms to Azimuth reference data structure described at https://github.com/satijalab/azimuth/wiki/Azimuth-Reference-Format.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Pareto set of different methods on the USPS dataset.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset supplements the paper "Trajectories of Change: Approaches for Tracking Knowledge Evolution," currently under review. It includes bibliographic, textual, and embedding data for 180,785 publications in General Relativity and Gravitation (GRG), spanning 1911 to 2000, and is based on NASA/ADS. The file is in Parquet format with 33 columns.
The dataset is directly compatible with the UnigramKLD and EmbeddingDensities classes of the semanticlayertools Python package.
Column | Format | Description | Example |
---|---|---|---|
Bibcode | string | Unique publication identifier. | "1995PASP..107..803U" |
Author | string | Authors listed as comma-separated names. | "Urry CM, Padovani P" |
Title | string | Title of the publication. | "Unified Schemes for Radio-Loud Active Galactic Nuclei" |
Title_en | string | Title translated into English. | "Unified Schemes for Radio-Loud Active Galactic Nuclei" |
Year | integer | Year of publication. | 1995 |
Journal | string | Journal name. | "Publications of the Astronomical Society of the Pacific" |
Journal Abbreviation | string | Abbreviated journal name. | "PASP" |
Volume | string | Volume number (if applicable). | "107" |
Issue | string | Issue number (if applicable). | "19" |
First Page | string | Starting page. | "803" |
Last Page | string | Ending page. | "25" |
Abstract | string | Abstract text. | "The appearance of active galactic nuclei (AGN) depends strongly on orientation, dominating classification..." |
Abstract_en | string | Abstract translated into English. | "The appearance of active galactic nuclei (AGN) depends strongly on orientation, dominating classification..." |
Keywords | string | Comma-separated keywords. | "galaxies: active, galaxies: fundamental parameters, astrophysics" |
DOI | string | Digital Object Identifier. | "10.1086/133630" |
Affiliation | string | Author affiliations. | "AA(University of XYZ), AB(-)" |
Category | string | Publication type (e.g., article, book). | "article" |
Citation Count | float | Number of citations. | 4380.0 |
References | array of strings | List of cited Bibcodes. | ["1966Natur.209..751H", "1966Natur.211..468R", "1968ApJ...151..393S"] |
PDF_URL | string | Link to the publication PDF. | "https://ui.adsabs.harvard.edu/link_gateway/1995PASP..107..803U/ADS_PDF" |
Title_lang | string | Language of the title. | "en" |
Abstract_lang | string | Language of the abstract. | "en" |
full_text | string | Full text of the publication (where available). | "Unified Schemes for Radio-Loud Active Galactic Nuclei. The appearance of AGN depends so strongly on..." |
tokens | array of strings | Tokenized text of the title and abstract for computational analysis. | ["unify", "schemes", "radio", "loud", "active", "galactic", "nuclei"] |
UMAP-1 | float32 | UMAP embedding coordinate 1. | 10.423940 |
UMAP-2 | float32 | UMAP embedding coordinate 2. | 7.890975 |
Cluster | integer | Cluster label for topic modeling or grouping. | 15 |
Name | string | Descriptive cluster name. | "15_radio_quasars_sources_galaxies" |
KeyBERT | string | Key phrases extracted via KeyBERT. | "radio galaxies, high redshift, radio sources, optical imaging" |
OpenAI | string | Embedding-based descriptive phrases. | "Cosmological Evolution of Radio-Loud Quasars" |
MMR | string | Extracted key phrases using Maximal Marginal Relevance (MMR). | "quasars, radio sources, redshift, luminosity, star formation" |
POS | string | Key terms extracted via part-of-speech tagging. | "radio, quasars, sources, galaxies, redshift, optical" |
full_embeddings | array of floats | Text embeddings generated using OpenAI's text-embedding-3-large model. | "[ 0.01164897 -0.00343577 -0.03168862 ... 0.00237622]" |
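Judging only from the example values in the table above, the Name column appears to combine the Cluster label with underscore-separated keywords. The helper below splits such a name under that assumption; the function name `parse_cluster_name` is hypothetical and the format is inferred from a single example row, so it should be checked against the actual file.

```python
def parse_cluster_name(name):
    """Split a name such as "15_radio_quasars_sources_galaxies".

    Assumed format (inferred from the example row, not documented):
    "<cluster id>_<keyword>_<keyword>_...".
    Returns the integer cluster id and the list of keywords.
    """
    cluster_id, _, keywords = name.partition("_")
    keyword_list = keywords.split("_") if keywords else []
    return int(cluster_id), keyword_list
```

Applied to the example row, this yields cluster 15 with the keywords radio, quasars, sources and galaxies, which matches the value shown in the Cluster column.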
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the preprocessed single-cell RNA data from the manuscript Transcriptomics of Alveolar Immune Cells Reveals Insight into Mechanisms of Human Pulmonary Fibrosis (under submission).
Data Descriptions:
BAL_FINAL.rds | Seurat Object with the entire dataset |
BAL_FINAL.h5ad | Scanpy Object with the entire dataset |
BAL_FINAL_metadata.txt | Metadata for each cell |
mlm_umap_embeddings.csv | UMAP embeddings for the monocyte-derived clusters |
ipf_allen2022.full_score.gz | scDRS results for each cell using the Richard Allen et al. 2022 IPF GWAS summary statistics |