Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset consists of 10 anonymized non-small cell lung cancer (NSCLC) fields of view (FoVs) for testing Mistic.
Mistic
Understanding the complex ecology of a tumor tissue and the spatio-temporal relationships between its cellular and microenvironment components is becoming a key component of translational research, especially in immune-oncology. The generation and analysis of multiplexed images from patient samples is of paramount importance to facilitate this understanding. In this work, we present Mistic, an open-source multiplexed image t-SNE viewer that enables the simultaneous viewing of multiple 2D images rendered using multiple layout options to provide an overall visual preview of the entire dataset. In particular, the positions of the images can be taken from t-SNE or UMAP coordinates. This grouped view of all the images further aids an exploratory understanding of the specific expression pattern of a given biomarker or collection of biomarkers across all images, helps to identify images expressing a particular phenotype or to select images for subsequent downstream analysis. Currently there is no freely available tool to generate such image t-SNEs.
Links
Mistic code
Mistic documentation
Paper
This dataset contains files reconstructing single-cell data presented in 'Reference transcriptomics of porcine peripheral immune cells created through bulk and single-cell RNA sequencing' by Herrera-Uribe & Wiarda et al. 2021. Samples of peripheral blood mononuclear cells (PBMCs) were collected from seven pigs and processed for single-cell RNA sequencing (scRNA-seq) in order to provide a reference annotation of porcine immune cell transcriptomics at enhanced, single-cell resolution. Analysis of single-cell data allowed identification of 36 cell clusters that were further classified into 13 cell types, including monocytes, dendritic cells, B cells, antibody-secreting cells, numerous populations of T cells, NK cells, and erythrocytes. Files may be used to reconstruct the data as presented in the manuscript, allowing for individual query by other users. Scripts for original data analysis are available at https://github.com/USDA-FSEPRU/PorcinePBMCs_bulkRNAseq_scRNAseq. Raw data are available at https://www.ebi.ac.uk/ena/browser/view/PRJEB43826. Funding for this dataset was also provided by NRSP8: National Animal Genome Research Program (https://www.nimss.org/projects/view/mrp/outline/18464).

Resources in this dataset:

Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells 10X Format.
File Name: PBMC7_AllCells.zip
Resource Description: Zipped folder containing PBMC counts matrix, gene names, and cell IDs. Files are as follows:
- matrix of gene counts* (matrix.mtx.gz)
- gene names (features.tsv.gz)
- cell IDs (barcodes.tsv.gz)
*The 'raw' count matrix is actually gene counts obtained following ambient RNA removal. During ambient RNA removal, we specified to calculate non-integer count estimations, so most gene counts are actually non-integer values in this matrix but should still be treated as raw/unnormalized data that requires further normalization/transformation. Data can be read into R using the function Read10X().

Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells Metadata.
File Name: PBMC7_AllCells_meta.csv
Resource Description: .csv file containing metadata for cells included in the final dataset. Metadata columns include:
- nCount_RNA = the number of transcripts detected in a cell
- nFeature_RNA = the number of genes detected in a cell
- Loupe = cell barcodes; correspond to the cell IDs found in the .h5Seurat and 10X formatted objects for all cells
- prcntMito = percent mitochondrial reads in a cell
- Scrublet = doublet probability score assigned to a cell
- seurat_clusters = cluster ID assigned to a cell
- PaperIDs = sample ID for a cell
- celltypes = cell type ID assigned to a cell

Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells PCA Coordinates.
File Name: PBMC7_AllCells_PCAcoord.csv
Resource Description: .csv file containing first 100 PCA coordinates for cells.

Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells t-SNE Coordinates.
File Name: PBMC7_AllCells_tSNEcoord.csv
Resource Description: .csv file containing t-SNE coordinates for all cells.

Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells UMAP Coordinates.
File Name: PBMC7_AllCells_UMAPcoord.csv
Resource Description: .csv file containing UMAP coordinates for all cells.

Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - CD4 T Cells t-SNE Coordinates.
File Name: PBMC7_CD4only_tSNEcoord.csv
Resource Description: .csv file containing t-SNE coordinates for only CD4 T cells (clusters 0, 3, 4, 28). A dataset of only CD4 T cells can be re-created from the PBMC7_AllCells.h5Seurat, and t-SNE coordinates used in publication can be re-assigned using this .csv file.

Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - CD4 T Cells UMAP Coordinates.
File Name: PBMC7_CD4only_UMAPcoord.csv
Resource Description: .csv file containing UMAP coordinates for only CD4 T cells (clusters 0, 3, 4, 28). A dataset of only CD4 T cells can be re-created from the PBMC7_AllCells.h5Seurat, and UMAP coordinates used in publication can be re-assigned using this .csv file.

Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - Gamma Delta T Cells UMAP Coordinates.
File Name: PBMC7_GDonly_UMAPcoord.csv
Resource Description: .csv file containing UMAP coordinates for only gamma delta T cells (clusters 6, 21, 24, 31). A dataset of only gamma delta T cells can be re-created from the PBMC7_AllCells.h5Seurat, and UMAP coordinates used in publication can be re-assigned using this .csv file.

Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - Gamma Delta T Cells t-SNE Coordinates.
File Name: PBMC7_GDonly_tSNEcoord.csv
Resource Description: .csv file containing t-SNE coordinates for only gamma delta T cells (clusters 6, 21, 24, 31). A dataset of only gamma delta T cells can be re-created from the PBMC7_AllCells.h5Seurat, and t-SNE coordinates used in publication can be re-assigned using this .csv file.

Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - Gene Annotation Information.
File Name: UnfilteredGeneInfo.txt
Resource Description: .txt file containing gene nomenclature information used to assign gene names in the dataset. 'Name' column corresponds to the name assigned to a feature in the dataset.

Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells H5Seurat.
File Name: PBMC7.tar
Resource Description: .h5Seurat object of all cells in PBMC dataset. File needs to be untarred, then read into R using function LoadH5Seurat().
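The re-assignment of published coordinates to a cell subset can be sketched outside R as a join on cell barcode. The following is only an illustration using tiny in-memory stand-ins for PBMC7_AllCells_meta.csv and PBMC7_CD4only_tSNEcoord.csv. The `Loupe` and `seurat_clusters` columns come from the metadata description above; the coordinate column names (`barcode`, `tSNE_1`, `tSNE_2`) and the example barcodes are assumptions, so check the actual file headers before adapting it.

```python
import csv
import io

# Tiny stand-ins for the real files. Only "Loupe" and "seurat_clusters"
# are documented column names; the coordinate header is an assumption.
META_CSV = """Loupe,seurat_clusters,celltypes
AAAC-1,0,CD4 T cells
AAAG-1,6,GD T cells
AAAT-1,28,CD4 T cells
"""
COORD_CSV = """barcode,tSNE_1,tSNE_2
AAAC-1,1.5,-2.0
AAAT-1,0.25,3.0
"""

# Clusters listed in the resource description for CD4 T cells.
CD4_CLUSTERS = {"0", "3", "4", "28"}


def attach_coordinates(meta_text, coord_text, clusters):
    """Return {barcode: (x, y)} for cells in the given clusters."""
    coords = {
        row["barcode"]: (float(row["tSNE_1"]), float(row["tSNE_2"]))
        for row in csv.DictReader(io.StringIO(coord_text))
    }
    selected = {}
    for row in csv.DictReader(io.StringIO(meta_text)):
        barcode = row["Loupe"]
        if row["seurat_clusters"] in clusters and barcode in coords:
            selected[barcode] = coords[barcode]
    return selected


cd4 = attach_coordinates(META_CSV, COORD_CSV, CD4_CLUSTERS)
print(cd4)
```

With real files, the two text constants would be replaced by the contents of the downloaded .csv files.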
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains genotype likelihood estimations derived from open-access whole-genome re-sequencing datasets of the scimitar-horned oryx (SO). The dataset was downsampled to exhibit varying coverage levels, including 6x, 2x, and 0.5x. Genotype likelihoods were estimated, followed by the calculation of principal components and subsequent application of UMAP and t-SNE with varying parameter settings, as detailed in Uzel et al. (2025). All intermediate and input files generated from these datasets are available here. Genotype likelihood estimations are provided in the formats '.beagle.gz' and '.mafs.gz'. Additionally, the repository contains the input covariance matrix ('.cov') for each dataset and the population information file for each group, which were employed in the non-linear dimensionality reduction steps described in Uzel et al. (2025).
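As a rough illustration of the principal-component step, the components can be obtained from a covariance matrix by eigendecomposition. The sketch below uses a small made-up symmetric matrix; it assumes (not confirmed by the description above) that a '.cov' file is a plain-text square matrix that could be loaded with `numpy.loadtxt`, and it is not the authors' pipeline.

```python
import numpy as np


def principal_components(cov, n_components=2):
    """Top eigenvalues and eigenvectors of a symmetric covariance matrix."""
    cov = np.asarray(cov, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvals[order], eigvecs[:, order]


# Hypothetical 3x3 covariance matrix standing in for a real '.cov' file,
# e.g. cov = np.loadtxt("dataset.cov") under the assumption stated above.
cov = np.array([
    [2.0, 1.0, 0.0],
    [1.0, 2.0, 0.0],
    [0.0, 0.0, 0.5],
])
vals, vecs = principal_components(cov, n_components=2)
print(vals)        # variance captured by each retained component
print(vecs.shape)  # one row per sample, one column per component
```

The columns of `vecs` could then serve as input to UMAP or t-SNE, with the number of retained components being one of the parameters to vary.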
All raw sequencing data we used in this study were downloaded from public databases, and no new data were generated.
The scimitar-horned oryx data were acquired from NCBI BioProject PRJEB37295 (Humble et al. 2023)
All bioinformatic codes used for generating the results and guidelines presented in Çilingir et al. (2024) are available at https://github.com/fgcilingir/lcUMAPtSNE.
Humble, E., Stoffel, M. A., Dicks, K., Ball, A. D., Gooley, R. M., Chuven, J., Pusey, R., Remeithi, M. A., Koepfli, K.-P., Pukazhenthi, B., Senn, H., & Ogden, R. (2023). Conservation management strategy impacts inbreeding and mutation load in scimitar-horned oryx. Proceedings of the National Academy of Sciences of the United States of America, 120(18), e2210756120.
Uzel, K., Grossen, C., Çilingir, F.G. (2025) lcUMAPtSNE: Use of non-linear dimensionality reduction techniques with genotype likelihoods. bioRxiv, https://doi.org/10.1101/2024.04.01.587545.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Two exemplary datasets (RNA-5c and RNA-5c-lowH1975) are available. Both include the human lung adenocarcinoma cell lines A549, H1975, H2228, H838, and HCC827, and both are derived from GEO repository ID GSM3618014. Both count tables are annotated with cell names and cell lines, e.g. Lib90_00000.HCC827, Lib90_00002.H838. RNA-5c includes 1242 A549 cells, 436 H1975 cells, 749 H2228 cells, 879 H838 cells, and 598 HCC827 cells. RNA-5c-lowH1975 was created to provide a dataset in which one cell population is under-represented: it includes all the RNA-5c cells of the other four lines but only 50 H1975 cells.
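The construction of an under-represented variant can be sketched as follows. This is a minimal illustration on synthetic cell names that mimic the `Lib90_00000.HCC827` pattern; the random selection and the seed are assumptions, since the description does not say how the 50 H1975 cells were chosen.

```python
import random

# Cell counts per line as stated in the description of RNA-5c.
COUNTS = {"A549": 1242, "H1975": 436, "H2228": 749, "H838": 879, "HCC827": 598}


def cell_line(name):
    """Cell line is the suffix after the last dot, e.g. 'Lib90_00000.HCC827'."""
    return name.rsplit(".", 1)[1]


def downsample_cell_line(cell_names, line, keep, seed=0):
    """Keep every cell except those of `line`, of which only `keep` are retained."""
    rng = random.Random(seed)  # assumption: random choice with a fixed seed
    target = [c for c in cell_names if cell_line(c) == line]
    kept = set(rng.sample(target, keep))
    return [c for c in cell_names if cell_line(c) != line or c in kept]


# Synthetic names only; the real names come from the count table columns.
cells = []
for line, n in COUNTS.items():
    for _ in range(n):
        cells.append(f"Lib90_{len(cells):05d}.{line}")

low = downsample_cell_line(cells, "H1975", keep=50)
print(len(cells), len(low))
```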
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
The 21 datasets used to generate the results in the paper.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Single-cell RNA sequencing (scRNA-seq) is a high-throughput sequencing technology performed at the level of an individual cell, which has the potential to reveal cellular heterogeneity. However, scRNA-seq data are high-dimensional, noisy, and sparse. Dimension reduction is therefore an important step in downstream analysis of scRNA-seq, and several dimension reduction methods have been developed. We developed a strategy to evaluate the stability, accuracy, and computing cost of 10 dimensionality reduction methods using 30 simulation datasets and five real datasets. Additionally, we investigated the sensitivity of all the methods to hyperparameter tuning and gave users appropriate suggestions. We found that t-distributed stochastic neighbor embedding (t-SNE) yielded the best overall performance with the highest accuracy and computing cost. Meanwhile, uniform manifold approximation and projection (UMAP) exhibited the highest stability, as well as moderate accuracy and the second highest computing cost. UMAP preserves the original cohesion and separation of cell populations well. In addition, it is worth noting that users need to set the hyperparameters according to the specific situation before using dimensionality reduction methods based on non-linear models and neural networks.
https://spdx.org/licenses/CC0-1.0.html
The data set presented here contains the %MAU data for the selected hyena-made and leopard-made faunal assemblages with which the Misiam assemblage is compared. Misiam is a recently discovered modern faunal accumulation found at Olduvai Gorge (Tanzania), interpreted as a palimpsest resulting from the action of leopards (main transporting agents) and hyenas (secondary scavengers). It is the first reported open-air leopard-made faunal accumulation. Defining the anatomical and taphonomic characteristics of such an assemblage is important for the interpretation of prehistoric faunal assemblages created by carnivores. It is also relevant for modern ecological studies. In this particular case, the bulk of the assemblage is composed of wildebeests. This is usually not the target of leopards; however, their seasonal abundance during the wildebeest migration on the plains adjacent to Olduvai Gorge prompts this rather exceptional, highly specialized behavior by usually eclectic leopards. In the present work, a thorough taphonomic analysis is carried out and the main taxonomic, anatomical and taphonomic characteristics of this felid- and hyenid-modified assemblage are described. The analytical approach adopted uses the data presented here.
Methods
The Misiam data were collected in the field. The bone assemblage lay on the surface of a densely vegetated ravine. Bones were simply collected, and in one particular area an excavation was made to retrieve sub-surface bones. In order to compare skeletal profiles in felid and hyenid assemblages, we will use some of the most representative assemblages in the literature. For spotted hyena dens, we will use data from the Koobi Fora Hyena Den 1 (KFHD1), the Amboseli den, the Maasai Mara den, and the Syokimau den, all of them in Kenya, and the Eyasi (Kisima Ngeda) Hyena Den 2 (KND2) (Tanzania). We also used these assemblages because they are either dominated by size 3 carcasses or these make up a significant part of the assemblage.
When comparing long bone shaft breakage patterns, we also used additional hyena-made assemblages: Dumali, Heraide, Yangula Ari, Oboley (spotted hyenas), Datagabou (striped hyena, Djibouti), and Uniab (brown hyena, Namibia). These assemblages are almost completely dominated by very small fauna (Capra hircus), and several of them constitute significantly smaller sample sizes than the hyena dens mentioned above.
The leopard lairs used for comparison are: Portsmut and Hakos River (Namibia), and WU/BA-001 (South Africa). Portsmut and Hakos River show a low density of remains, probably also modified by porcupines or other agents. The remains belonging to larger animals show an interesting contrast with those documented in hyena dens: the presence of axial and compact bones is high. These latter bones are also well represented in smaller carcasses. This characteristic is more marked in WU/BA-001; the least altered leopard lair documented to date. This lair was monitored for 7 years.
All the comparative assemblages were transformed into %MAU to account for differential inter-assemblage quantitative representation. First, they were analyzed using Generalized Low Rank Models (GLRM) as an exploratory method. Then, we used Uniform Manifold Approximation and Projection (UMAP) to classify leopards' and hyenas' bone assemblages, especially according to each feature. Lastly, we used a cluster analysis with a variance-dependent phylogenetic tree to show the actual distances among all the assemblages compared.
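For readers unfamiliar with the %MAU transformation, it is conventionally computed by scaling each element's minimal animal units (MAU) by the highest MAU in the same assemblage. A minimal sketch with made-up element values (not taken from the data set) is:

```python
def percent_mau(mau):
    """Scale MAU values so the best-represented element equals 100."""
    top = max(mau.values())
    if top <= 0:
        raise ValueError("at least one element must have a positive MAU")
    return {element: 100.0 * value / top for element, value in mau.items()}


# Hypothetical MAU values for one assemblage, for illustration only.
example = {"femur": 8.0, "tibia": 6.0, "rib": 2.0}
pct = percent_mau(example)
print(pct)
```

Because every assemblage ends up on the same 0 to 100 scale, assemblages of very different sizes can be compared element by element.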
GLRM are a series of methods for dimensionality reduction that use several loss function types and can implement regularization functions. Whereas principal component analysis (PCA) is based on orthogonal projections of linear relationships, in cases where relationships are non-linear, the PCA underperforms compared to other more flexible methods. GLRM decomposes a table into two distinctive matrices X and Y. X contains the same number of rows as the original table, but all variables are condensed into k factors. Y has k rows and the same number of columns as features (i.e., variables) in the original table. Each of the rows is an archetypal feature derived from the columns (i.e., variables) of the original table. Each row of X corresponds to a row of the original table projected into this reduced dimension feature space. Data are compressed by the low-rank representation derived from k feature reduction. An advantage of GLRM over PCA is that it can handle mixed datasets containing numeric, categorical and Boolean data. GLRM admits several types of loss functions: Huber, Poisson, quadratic, periodic or hinge. It also allows the use of regularization functions, including: Lasso, Ridge, OneSparse, Simplex, UnitOneSparse, and quadratic. Loss functions are used to select the optimal archetypal values. Regularization is used to limit X and Y archetypal values. This impacts the effect of negative data, multicollinearity and overfitting. In the present analysis, GLRM was performed with the “h2o” R library (www.r-project.org).
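The X and Y decomposition described above can be illustrated in its simplest special case: quadratic loss with no regularization, where the best rank-k factors are given by a truncated singular value decomposition. This sketch is not the "h2o" implementation and does not cover the other losses or regularizers; the table values are invented.

```python
import numpy as np


def low_rank_quadratic(table, k):
    """Factor `table` (n x m) into X (n x k) and Y (k x m) under quadratic loss."""
    a = np.asarray(table, dtype=float)
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    x = u[:, :k] * s[:k]  # each row: an original row projected onto k factors
    y = vt[:k, :]         # each row: one archetypal feature over the m columns
    return x, y


# Hypothetical %MAU-like table: 4 assemblages (rows) by 3 elements (columns).
table = np.array([
    [100.0, 50.0, 25.0],
    [80.0, 40.0, 20.0],
    [10.0, 60.0, 90.0],
    [5.0, 30.0, 45.0],
])
x, y = low_rank_quadratic(table, k=2)
print(x.shape, y.shape)
print(np.round(x @ y, 6))  # reconstruction of the table from the two factors
```

Here the table has rank 2 by construction, so two factors reproduce it exactly; with real data, the reconstruction error indicates how much is lost by the chosen k.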
UMAP is a non-linear dimension-reduction method based on finding inter-case distances in a low-dimensional feature space. The key advantage of UMAP over other non-linear dimension-reduction methods, like t-distributed stochastic neighbor embedding (t-SNE), is that distances are generated along a “manifold”. A manifold is an n-dimensional geometric shape constituted of the path(s) among the points. Every point is referenced according to a small two-dimensional neighborhood around it. The UMAP algorithm searches for a multi-dimensional space delimited by the location of points. UMAP uses a nearest-neighbor approach, eventually connecting all the points along its search regions. This forces a uniform distribution of points. The distances of points along this manifold are then derived through Euclidean distances. Several optimization methods can be used to reproduce inter-point distances. For the latter process, the UMAP approach that we will use is based on a cross-entropy loss function. For the UMAP analysis, we have used the “umap” R library (www.r-project.org). We have also used a search grid combining ranges of values for number of neighbors, minimal distance between neighbors, distance metric, and number of epochs (i.e., iterations of the optimization process).
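A search grid of this kind can be sketched as the Cartesian product of the candidate values for each hyperparameter. The ranges below are hypothetical (the description does not list the values actually searched), and the key names are chosen for illustration; each resulting configuration would then be passed to a UMAP implementation and scored.

```python
from itertools import product

# Hypothetical candidate values; the actual ranges are not given in the text.
GRID = {
    "n_neighbors": [5, 10, 15],
    "min_dist": [0.1, 0.5],
    "metric": ["euclidean", "manhattan"],
    "n_epochs": [200, 500],
}


def expand_grid(grid):
    """Return one dict per combination of hyperparameter values."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


configs = expand_grid(GRID)
print(len(configs))
print(configs[0])
```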
Finally, a hierarchical cluster analysis, using a Euclidean distance matrix on the %MAU dataset, was carried out. The method used was “average” linkage, in which the distance between two clusters is the average distance between their points. The combination of the three methods was used to study agent-specific variability in inter-assemblage element representation.
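To make the "average" linkage rule concrete, the sketch below builds a Euclidean distance matrix and repeatedly merges the two clusters with the smallest mean pairwise distance. It is a small, unoptimized illustration on invented %MAU-like rows, not the R routine used in the analysis.

```python
import numpy as np


def average_linkage(points):
    """Agglomerative clustering; returns merges as (members_a, members_b, distance)."""
    pts = np.asarray(points, dtype=float)
    # Pairwise Euclidean distances between all rows.
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    clusters = [[i] for i in range(len(pts))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average linkage: mean distance over all cross-cluster pairs.
                d = dist[np.ix_(clusters[i], clusters[j])].mean()
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), float(d)))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical rows: 4 assemblages described by 2 element frequencies.
rows = [[100, 20], [90, 30], [10, 100], [25, 90]]
merges = average_linkage(rows)
for a, b, d in merges:
    print(a, b, round(d, 3))
```

The sequence of merges and their distances is what a dendrogram displays.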
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset name, reference, dimensions and cell type composition.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset contains 32-dimensional embedding vectors generated from the titles and abstracts of scientific articles retrieved from the Garuda portal (https://garuda.kemdikbud.go.id) using the Qwen3-Embedding-0.6B model. The original 1024-dimensional vectors were truncated to 32 dimensions using Matryoshka Representation Learning (MRL) to optimize storage while retaining semantic information. All vectors have been L2-normalized to ensure consistency.
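For embeddings from a model trained with MRL, shortening a vector usually means keeping its leading dimensions and re-normalizing. The sketch below shows that operation on random stand-in vectors; it is an assumption that the dataset was produced exactly this way, and the function name is hypothetical.

```python
import numpy as np


def truncate_and_normalize(vectors, dim=32):
    """Keep the first `dim` dimensions of each row, then L2-normalize."""
    v = np.asarray(vectors, dtype=float)[:, :dim]
    norms = np.linalg.norm(v, axis=1, keepdims=True)
    return v / np.where(norms == 0, 1.0, norms)  # avoid dividing by zero


# Random stand-ins for four 1024-dimensional embeddings.
rng = np.random.default_rng(0)
full = rng.normal(size=(4, 1024))
small = truncate_and_normalize(full, dim=32)
print(small.shape)
print(np.linalg.norm(small, axis=1))
```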
Notes:
- This dataset contains only vectors. The text data can be obtained from the Artikel Jurnal Garuda dataset by matching the id attribute.
- 512 variant: Garuda Journal Embedding (512)
- 1024 variant: Garuda Journal Embedding
This dataset can be used for various natural language processing (NLP) applications, such as:
- Clustering articles by topic.
- Finding similar articles based on semantic similarity.
- Analyzing research trends in Indonesia.
import pickle

# Load the list of records from the pickle file.
with open('dataset.pkl', 'rb') as f:
    data = pickle.load(f)

print(data[0])  # Example: {'id': 'uuid123', 'vectors': [0.1, 0.2, ..., 0.3]}
For further examples, see the Kaggle notebook included with this dataset.
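As a hedged illustration of the similar-article use case: because the vectors are L2-normalized, the dot product of two vectors equals their cosine similarity. The sketch below assumes records shaped like the example above (`id` and `vectors` keys) and uses three tiny two-dimensional records in place of the real data; `most_similar` is a hypothetical helper.

```python
import numpy as np


def most_similar(data, query_id, top_k=3):
    """Rank other records by dot product with the query record's vector."""
    ids = [record["id"] for record in data]
    matrix = np.array([record["vectors"] for record in data], dtype=float)
    query = matrix[ids.index(query_id)]
    scores = matrix @ query  # cosine similarity for unit-length vectors
    order = np.argsort(-scores)
    ranked = [(ids[i], float(scores[i])) for i in order if ids[i] != query_id]
    return ranked[:top_k]


# Tiny stand-in records; real records hold 32-dimensional vectors.
data = [
    {"id": "a", "vectors": [1.0, 0.0]},
    {"id": "b", "vectors": [0.8, 0.6]},
    {"id": "c", "vectors": [0.0, 1.0]},
]
result = most_similar(data, "a", top_k=2)
print(result)
```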
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data accompanying Velten, Story, Hernandez et al., 2021. Zip file containing this README and MutaSeq.final.seurat.RDS, a single Seurat (v3.1.4) object containing the following ASSAYS:
* RNA: Raw count data (slot: counts), data normalized individually for each patient using default seurat routines (slot: data)
* integrated: Scanorama corrected data.
* FACS: FACS index values for surface antigens. Logicle transformation was applied where appropriate
* counts.mutant.P1: For all mutations analyzed for patient P1, counts of the MUTANT allele. Columns starting with X correspond to mitochondrial sites.
* counts.reference.P1: For all mutations analyzed for patient P1, counts of the REFERENCE allele.
* PhiSICS.Likelihood.P1: For each cell, likelihood to attach to a given node in the phylogenetic tree (see manuscript figure 2).
* PhiSCS.Summarised.P1: Likelihood summarised by main clones (i.e. leukemic.KLF7, leukemic.CEBPA, preleukemic, T.cell.clone, non-leukemic)
* counts.mutant.P2: For all mutations analyzed for patient P2, counts of the MUTANT allele. Columns starting with X correspond to mitochondrial sites.
* counts.reference.P2: For all mutations analyzed for patient P2, counts of the REFERENCE allele.
* PhiSICS.Likelihood.P2: For each cell, likelihood to attach to a given node in the phylogenetic tree (see manuscript figure 2).
* PhiSCS.Summarised.P2: Likelihood summarised by main clones (i.e. leukemic, preleukemic, non-leukemic)
and the following REDUCTIONS:
* scanorama: Scanorama result.
* tsne: Computed from scanorama.
* umap: Computed from scanorama.
and the following METADATA columns:
* nCount_RNA: Number of reads mapping to exons.
* nFeature_RNA: Number of genes observed.
* Plate: Processing plate
* patient: Patient
* mainClone: Maximum likelihood estimate of the clone the cell belongs to. Also includes estimated clone for P3 and P4, see manuscript figure 4.
* Cancer: Boolean variable to distinguish cells that are likely leukemic/preleukemic (TRUE) from other cells (FALSE). Cells with no observations from P3 and P4 are NA.