40 datasets found
  1. Cluster tendency assessment in neuronal spike data

    • plos.figshare.com
    pdf
    Updated Jun 5, 2023
    Cite
    Sara Mahallati; James C. Bezdek; Milos R. Popovic; Taufik A. Valiante (2023). Cluster tendency assessment in neuronal spike data [Dataset]. http://doi.org/10.1371/journal.pone.0224547
    Explore at:
Available download formats: pdf
    Dataset updated
    Jun 5, 2023
    Dataset provided by
PLOS (http://plos.org/)
    Authors
    Sara Mahallati; James C. Bezdek; Milos R. Popovic; Taufik A. Valiante
    License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

Sorting spikes from extracellular recordings into clusters associated with distinct single units (putative neurons) is a fundamental step in analyzing neuronal populations. Such spike sorting is intrinsically unsupervised, as the number of neurons is not known a priori. Therefore, spike sorting is an unsupervised learning problem that requires one of two approaches: specification of a fixed value c for the number of clusters to seek, or generation of candidate partitions for several possible values of c, followed by selection of a best candidate based on various post-clustering validation criteria. In this paper, we investigate the first approach and evaluate the utility of several methods for providing lower-dimensional visualization of the cluster structure and their effect on subsequent spike clustering. We also introduce a visualization technique called improved visual assessment of cluster tendency (iVAT) to estimate possible cluster structures in data without the need for dimensionality reduction. Experiments are conducted on two datasets with ground-truth labels. In data with a relatively small number of clusters, iVAT is beneficial in estimating the number of clusters to inform the initialization of clustering algorithms. With larger numbers of clusters, iVAT gives a useful estimate of the coarse cluster structure but sometimes fails to indicate the presumptive number of clusters. We show that noise associated with recording extracellular neuronal potentials can disrupt computational clustering schemes, highlighting the benefit of probabilistic clustering models. Our results show that t-Distributed Stochastic Neighbor Embedding (t-SNE) provides representations of the data that yield more accurate visualization of potential cluster structure to inform the clustering stage. Moreover, the clusters obtained using t-SNE features were more reliable than the clusters obtained using the other methods, which indicates that t-SNE can potentially be used both for visualization and to extract features for any clustering algorithm.
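
As a rough illustration of the workflow this abstract describes (a t-SNE embedding used both for visualization and as input features for a clustering stage), the Python sketch below uses synthetic stand-in waveforms and an assumed cluster count c; it is not the authors' code.

# Illustrative sketch only: t-SNE features feeding a clustering stage.
# `waveforms` is a placeholder for aligned spike waveforms; c is assumed.
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
waveforms = rng.normal(size=(500, 48))          # stand-in for n_spikes x n_samples waveforms

# 2-D t-SNE embedding used for visualization and as clustering features
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(waveforms)

c = 3                                           # presumptive number of clusters (e.g. suggested by iVAT)
labels = KMeans(n_clusters=c, n_init=10, random_state=0).fit_predict(embedding)
print(np.bincount(labels))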

  2. DrCyZ: Techniques for analyzing and extracting useful information from CyZ.

    • data.niaid.nih.gov
    Updated Jan 19, 2022
    Cite
    de Curtò, J.; de Zarzà, I. (2022). DrCyZ: Techniques for analyzing and extracting useful information from CyZ. [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5816857
    Explore at:
    Dataset updated
    Jan 19, 2022
    Dataset provided by
    Universitat Oberta de Catalunya
    Authors
    de Curtò, J.; de Zarzà, I.
    License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    DrCyZ: Techniques for analyzing and extracting useful information from CyZ.

    Samples from NASA Perseverance and set of GAN generated synthetic images from Neural Mars.

    Repository: https://github.com/decurtoidiaz/drcyz

Subset of samples from the following dataset (which includes tools to visualize and analyse it):

    CyZ: MARS Space Exploration Dataset. [https://doi.org/10.5281/zenodo.5655473]

    Images from NASA missions of the celestial body.

    Repository: https://github.com/decurtoidiaz/cyz

    Authors:

    J. de Curtò c@decurto.be

    I. de Zarzà z@dezarza.be

    File Information from DrCyZ-1.1

    • Subset of samples from Perseverance (drcyz/c).
      ∙ png (drcyz/c/png).
        PNG files (5025) selected from NASA Perseverance (CyZ-1.1) after t-SNE and K-means Clustering. 
      ∙ csv (drcyz/c/csv).
        CSV file.
    
    
    • Resized samples from Perseverance (drcyz/c+).
      ∙ png 64x64; 128x128; 256x256; 512x512; 1024x1024 (drcyz/c+/drcyz_64-1024).
        PNG files resized at the corresponding size. 
      ∙ TFRecords 64x64; 128x128; 256x256; 512x512; 1024x1024 (drcyz/c+/tfr_drcyz_64-1024).
        TFRecord resized at the corresponding size to import on Tensorflow.
    
    
    • Synthetic images from Neural Mars generated using Stylegan2-ada (drcyz/drcyz+).
      ∙ png 100; 1000; 10000 (drcyz/drcyz+/drcyz_256_100-10000)
        PNG files subset of 100, 1000 and 10000 at size 256x256.
    
    
    • Network Checkpoint from Stylegan2-ada trained at size 256x256 (drcyz/model_drcyz).
      ∙ network-snapshot-000798-drcyz.pkl
    
    
    • Notebooks in python to analyse the original dataset and reproduce the experiments; K-means Clustering, t-SNE, PCA, synthetic generation using Stylegan2-ada and instance segmentation using Deeplab (https://github.com/decurtoidiaz/drcyz/tree/main/dr_cyz+).
      ∙ clustering_curiosity_de_curto_and_de_zarza.ipynb
        K-means Clustering and PCA(2) with images from Curiosity.
      ∙ clustering_perseverance_de_curto_and_de_zarza.ipynb
        K-means Clustering and PCA(2) with images from Perseverance.
      ∙ tsne_curiosity_de_curto_and_de_zarza.ipynb
        t-SNE and PCA (components selected to explain 99% of variance) with images from Curiosity.
      ∙ tsne_perseverance_de_curto_and_de_zarza.ipynb
        t-SNE and PCA (components selected to explain 99% of variance) with images from Perseverance.
      ∙ Stylegan2-ada_de_curto_and_de_zarza.ipynb
        Stylegan2-ada trained on a subset of images from NASA Perseverance (DrCyZ).
      ∙ statistics_perseverance_de_curto_and_de_zarza.ipynb
        Compute statistics from synthetic samples generated by Stylegan2-ada (DrCyZ) and images from NASA Perseverance (CyZ).
      ∙ DeepLab_TFLite_ADE20k_de_curto_and_de_zarza.ipynb
        Example of instance segmentation using Deeplab with a sample from NASA Perseverance (DrCyZ).
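
A hedged sketch of the analysis the notebooks above describe (PCA retaining 99% of the variance, then t-SNE and K-means) applied to the resized PNGs; the local path, image size, and cluster count are assumptions, not the repository's code.

# Hedged sketch of PCA (99% variance) + t-SNE + K-means on the resized samples.
# The path, grayscale conversion, and k=8 are assumptions.
import glob
import numpy as np
from PIL import Image
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

paths = sorted(glob.glob("drcyz/c+/drcyz_64/*.png"))      # hypothetical local path
X = np.stack([np.asarray(Image.open(p).convert("L"), dtype=np.float32).ravel()
              for p in paths]) / 255.0

X_pca = PCA(n_components=0.99).fit_transform(X)           # keep 99% of explained variance
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X_pca)
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X_pca)
print(X_pca.shape, np.bincount(labels))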
    
  3. Additional file 4 of GECO: gene expression clustering optimization app for...

    • springernature.figshare.com
    txt
    Updated May 31, 2023
    Cite
    A. N. Habowski; T. J. Habowski; M. L. Waterman (2023). Additional file 4 of GECO: gene expression clustering optimization app for non-linear data visualization of patterns [Dataset]. http://doi.org/10.6084/m9.figshare.13642379.v1
    Explore at:
Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
figshare (http://figshare.com/)
    Authors
    A. N. Habowski; T. J. Habowski; M. L. Waterman
    License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 4: CSV file of colon crypt bulk RNA-seq data used for GECO UMAP generation.

  4. Additional file 5 of GECO: gene expression clustering optimization app for...

    • springernature.figshare.com
    txt
    Updated May 31, 2023
    Cite
    A. N. Habowski; T. J. Habowski; M. L. Waterman (2023). Additional file 5 of GECO: gene expression clustering optimization app for non-linear data visualization of patterns [Dataset]. http://doi.org/10.6084/m9.figshare.13642382.v1
    Explore at:
Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
figshare (http://figshare.com/)
    Authors
    A. N. Habowski; T. J. Habowski; M. L. Waterman
    License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 5: CSV file of bulk RNA-seq data of F. nucleatum infection time course used for GECO UMAP generation.

  5. 🌆 City Lifestyle Segmentation Dataset

    • kaggle.com
    zip
    Updated Nov 15, 2025
    Cite
    UmutUygurr (2025). 🌆 City Lifestyle Segmentation Dataset [Dataset]. https://www.kaggle.com/datasets/umuttuygurr/city-lifestyle-segmentation-dataset
    Explore at:
Available download formats: zip (11274 bytes)
    Dataset updated
    Nov 15, 2025
    Authors
    UmutUygurr
    License

CC0 1.0 (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/

    Description


    🌆 About This Dataset

    This synthetic dataset simulates 300 global cities across 6 major geographic regions, designed specifically for unsupervised machine learning and clustering analysis. It explores how economic status, environmental quality, infrastructure, and digital access shape urban lifestyles worldwide.

    🎯 Perfect For:

    • 📊 K-Means, DBSCAN, Agglomerative Clustering
    • 🔬 PCA & t-SNE Dimensionality Reduction
    • 🗺️ Geospatial Visualization (Plotly, Folium)
    • 📈 Correlation Analysis & Feature Engineering
    • 🎓 Educational Projects (Beginner to Intermediate)

    📦 What's Inside?

| Feature | Description | Range |
| --- | --- | --- |
| 10 Features | Economic, environmental & social indicators | Realistically scaled |
| 300 Cities | Europe, Asia, Americas, Africa, Oceania | Diverse distributions |
| Strong Correlations | Income ↔ Rent (+0.8), Density ↔ Pollution (+0.6) | ML-ready |
| No Missing Values | Clean, preprocessed data | Ready for analysis |
| 4-5 Natural Clusters | Metropolitan hubs, eco-towns, developing centers | Pre-validated |

    🔥 Key Features

    Realistic Correlations: Income strongly predicts rent (+0.8), internet access (+0.7), and happiness (+0.6)
    Regional Diversity: Each region has distinct economic and environmental characteristics
    Clustering-Ready: Naturally separable into 4-5 lifestyle archetypes
    Beginner-Friendly: No data cleaning required, includes example code
    Documented: Comprehensive README with methodology and use cases

    🚀 Quick Start Example

    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler
    
    # Load and prepare
    df = pd.read_csv('city_lifestyle_dataset.csv')
    X = df.drop(['city_name', 'country'], axis=1)
    X_scaled = StandardScaler().fit_transform(X)
    
    # Cluster
    kmeans = KMeans(n_clusters=5, random_state=42)
    df['cluster'] = kmeans.fit_predict(X_scaled)
    
    # Analyze
print(df.groupby('cluster').mean(numeric_only=True))  # numeric columns only; string columns would raise an error
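
As a follow-on to the quick-start block above, here is a short hedged sketch of the PCA visualization step the dataset advertises; it assumes X_scaled and df['cluster'] from that block.

# Follow-on sketch: 2-D PCA view of the K-Means clusters from the quick start.
# Assumes X_scaled and df['cluster'] already exist as computed above.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

coords = PCA(n_components=2).fit_transform(X_scaled)
plt.scatter(coords[:, 0], coords[:, 1], c=df['cluster'], cmap='tab10', s=20)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('City clusters in PCA space')
plt.show()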
    

    🎓 Learning Outcomes

After working with this dataset, you will be able to:

1. Apply K-Means, DBSCAN, and Hierarchical Clustering
2. Use PCA for dimensionality reduction and visualization
3. Interpret correlation matrices and feature relationships
4. Create geographic visualizations with cluster assignments
5. Profile and name discovered clusters based on characteristics

    📚 Ideal For These Projects

    • 🏆 Kaggle Competitions: Practice clustering techniques
    • 📝 Academic Projects: Urban planning, sociology, environmental science
    • 💼 Portfolio Work: Showcase ML skills to employers
    • 🎓 Learning: Hands-on practice with unsupervised learning
    • 🔬 Research: Urban lifestyle segmentation studies

    🌍 Expected Clusters

| Cluster | Characteristics | Example Cities |
| --- | --- | --- |
| Metropolitan Tech Hubs | High income, density, rent | Silicon Valley, Singapore |
| Eco-Friendly Towns | Low density, clean air, high happiness | Nordic cities |
| Developing Centers | Mid income, high density, poor air | Emerging markets |
| Low-Income Suburban | Low infrastructure, income | Rural areas |
| Industrial Mega-Cities | Very high density, pollution | Manufacturing hubs |

    🛠️ Technical Details

    • Format: CSV (UTF-8)
    • Size: ~300 rows × 10 columns
    • Missing Values: 0%
    • Data Types: 2 categorical, 8 numerical
    • Target Variable: None (unsupervised)
    • Correlation Strength: Pre-validated (r: 0.4 to 0.8)

    📖 What Makes This Dataset Special?

Unlike random synthetic data, this dataset was carefully engineered with:

• ✨ Realistic correlation structures based on urban research
• 🌍 Regional characteristics matching real-world patterns
• 🎯 Optimal cluster separability (validated via silhouette scores)
• 📚 Comprehensive documentation and starter code

    🏅 Use This Dataset If You Want To:

    ✓ Learn clustering without data cleaning hassles
    ✓ Practice PCA and dimensionality reduction
    ✓ Create beautiful geographic visualizations
    ✓ Understand feature correlation in real-world contexts
    ✓ Build a portfolio project with clear business insights

    📊 Acknowledgments

    This dataset was designed for educational purposes in machine learning and data science. While synthetic, it reflects real patterns observed in global urban development research.

    Happy Clustering! 🎉

  6. Replication Data for the "Keratoconus severity identification using...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 22, 2023
    Cite
Yousefi, Siamak (2023). Replication Data for the "Keratoconus severity identification using unsupervised machine learning", Siamak Yousefi, Ebrahim Yousefi, Hidenori Takahashi, Takahiko Hayashi, Hironobu Tampo, Satoru Inoda, Yusuke Arai, and Penny Asbell, PLOS One 2018 [Dataset]. http://doi.org/10.7910/DVN/G2CRMO
    Explore at:
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Yousefi, Siamak
    Description

Dataset and labels for the article "Keratoconus severity identification using unsupervised machine learning" by Siamak Yousefi.

  7. Initial results in dimensionality reduction of taxi DropOut-PickUp regions

    • data.niaid.nih.gov
    Updated Jun 28, 2023
    Cite
    Martin Gregurić (2023). Initial results in dimensionality reduction of taxi DropOut-PickUp regions [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8089805
    Explore at:
    Dataset updated
    Jun 28, 2023
    Dataset provided by
    University of Zagreb Faculty of Transport and Traffic Sciences
    Authors
    Martin Gregurić
    License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

Initial results on dimensionality reduction of taxi PickUp-DropOut regions from New York City (Manhattan, YellowCab company, first 7 months of 2018). Dimensionality reduction is performed separately for working days and weekends using t-SNE, SVD, and a simple deep autoencoder. Clustering quality in the resulting two-dimensional space is assessed using the Silhouette, Calinski-Harabasz, and Davies-Bouldin indices. Taxi data are aggregated into 15-minute intervals.
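
For readers who want to reproduce this style of assessment, the sketch below computes the three quality indices with scikit-learn on a placeholder 2-D embedding; the embedding and the choice of K-Means with four clusters are assumptions, not the author's data or settings.

# Hedged sketch: Silhouette, Calinski-Harabasz, and Davies-Bouldin scores on a
# 2-D embedding. `embedding_2d` stands in for a t-SNE/SVD/autoencoder output.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

rng = np.random.default_rng(0)
embedding_2d = rng.normal(size=(1000, 2))        # placeholder embedding
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(embedding_2d)

print("Silhouette:        ", silhouette_score(embedding_2d, labels))
print("Calinski-Harabasz: ", calinski_harabasz_score(embedding_2d, labels))
print("Davies-Bouldin:    ", davies_bouldin_score(embedding_2d, labels))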

  8. Table_2_A Novel Computational Framework for Precision Diagnosis and Subtype...

    • frontiersin.figshare.com
    xlsx
    Updated Jun 15, 2023
    + more versions
    Cite
    Fei Xia; Xiaojun Xie; Zongqin Wang; Shichao Jin; Ke Yan; Zhiwei Ji (2023). Table_2_A Novel Computational Framework for Precision Diagnosis and Subtype Discovery of Plant With Lesion.XLSX [Dataset]. http://doi.org/10.3389/fpls.2021.789630.s003
    Explore at:
Available download formats: xlsx
    Dataset updated
    Jun 15, 2023
    Dataset provided by
Frontiers Media (http://www.frontiersin.org/)
    Authors
    Fei Xia; Xiaojun Xie; Zongqin Wang; Shichao Jin; Ke Yan; Zhiwei Ji
    License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

Plants are often attacked by various pathogens during their growth, which may cause environmental pollution, food shortages, or economic losses in a certain area. Integration of high-throughput phenomics data and computer vision (CV) provides a great opportunity to realize plant disease diagnosis at an early stage and to uncover subtype or stage patterns in disease progression. In this study, we propose a novel computational framework for plant disease identification and subtype discovery through a deep-embedding image-clustering strategy that combines a Weighted Distance Metric with the t-distributed Stochastic Neighbor Embedding algorithm (WDM-tSNE). To verify its effectiveness, we applied the method to four public image datasets. The results demonstrate that the newly developed tool is capable of identifying plant disease and further uncovering the underlying subtypes associated with pathogenic resistance. In summary, the current framework provides strong clustering performance for root or leaf images of diseased plants with pronounced disease spots or symptoms.
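
The WDM-tSNE implementation itself is not part of this record. Purely as a generic, hedged sketch, scikit-learn's t-SNE can be driven by a precomputed, feature-weighted distance matrix as follows; the feature weights and embeddings are illustrative placeholders, not the authors' method.

# Generic sketch (not the authors' WDM-tSNE): t-SNE on a precomputed,
# feature-weighted distance matrix. `X` and the weights `w` are placeholders.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 64))                   # stand-in for deep image embeddings
w = rng.uniform(0.5, 1.5, size=X.shape[1])       # illustrative per-feature weights

D = squareform(pdist(X * np.sqrt(w), metric="euclidean"))   # weighted Euclidean distances
emb = TSNE(n_components=2, metric="precomputed", init="random",
           random_state=0).fit_transform(D)
print(emb.shape)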

  9. Additional file 1 of Integrative single-cell RNA-seq and ATAC-seq analysis...

    • datasetcatalog.nlm.nih.gov
    Updated Apr 13, 2023
    Cite
    Wang, Xiaoyu; Lin, Zhuhu; Chen, Meilin; Tong, Xian; Li, Jianhao; Zhu, Qi; Duo, Tianqi; Li, Enru; Cai, Shufang; Liu, Tongni; Liu, Xiaohong; Xu, Rong; Mo, Delin; Chen, Yaosheng; Hu, Bin; Liang, Ziyun (2023). Additional file 1 of Integrative single-cell RNA-seq and ATAC-seq analysis of myogenic differentiation in pig [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001023879
    Explore at:
    Dataset updated
    Apr 13, 2023
    Authors
    Wang, Xiaoyu; Lin, Zhuhu; Chen, Meilin; Tong, Xian; Li, Jianhao; Zhu, Qi; Duo, Tianqi; Li, Enru; Cai, Shufang; Liu, Tongni; Liu, Xiaohong; Xu, Rong; Mo, Delin; Chen, Yaosheng; Hu, Bin; Liang, Ziyun
    Description

    Additional file 1: Figure S1. Quality control and batch effect correction in scRNA-Seq, related to Figure 1 A. Violin plots showing the number of expressed genes, the number of reads uniquely mapped against the reference genome, and the fraction of mitochondrial genes compared to all genes per cell in scRNA-Seq data. B. Box plot showing the number of genes (left) and the number of uniquely mapped reads (right) per cell in each identified cell type in scRNA-Seq data. C. tSNE plot visualization of the sample source for all 70,201 cells. Each dot is a cell. Different colors represent different samples. D. tSNE plot visualization of unsupervised clustering analysis for all 70,201 cells based on scRNA-Seq data after quality control, which gave rise to 31 distinct clusters. Figure S2. Gene Ontology (GO) analysis of the DEGs for each cell type was performed and the representative enriched GO terms are presented, related to Figure 1. Figure S3. Expression of selected marker genes along the differentiation trajectory, related to Figure 2 A. tSNE plot demonstrating cell cycle regression (left). Visualization of myogenic differentiation trajectory by cell cycle phases (G1, S, and G2/M) (right). B. Donut plots showing the percentages of cells in G1, S, and G2M phase at different cell states. C. Expression levels of cell cycle-related genes in the myogenic cells organized into the Monocle trajectory. D. Expression levels of muscle related genes in the myogenic cells organized into the Monocle trajectory. Figure S4. Unsupervised clustering analysis for all cells in scATAC-Seq data and myogenic-specific scATAC-seq peaks, related to Figure 4 A-C. tSNE plot visualization of the sample source for all 48514 cells in scATAC-Seq. Each dot is a cell. Different colors represent different pigs (A), different embryonic stages (B), or different samples (C). D. tSNE plot visualization of unsupervised clustering analysis for all 48514 cells after quality control in scATAC-Seq data, which gave rise to 15 distinct clusters. E. tSNE plot visualization of myogenic cells and other cells. Clusters 4 and 8 in Figure S4D were annotated as myogenic cells due to their high levels of accessibility of marker genes associated with myogenic lineage. F. Genome browser view of myogenic-specific peaks at the TSS of MyoG and Myf5 for myogenic cells and other cells in the scATAC-seq dataset. Figure S5. Percentage distribution of open chromatin elements in scATAC-Seq data, related to Figure 4 A. Distribution of open chromatin elements in each snATAC-seq sample. B. Distribution of open chromatin elements in snATAC-seq of myogenic cell types. C. Percentage distribution of open chromatin elements among DAPs in myogenic cell types. Figure S6. Integrative analysis of transcription factors and target genes, related to Figure 5 A. tSNE depiction of regulon activity (“on-blue”, “off-gray”), TF gene expression (red scale), and expression of predicted target genes (purple scale) of MyoG, FOSB, and TCF12. B. Corresponding chromatin accessibility in scATAC data for TFs and predicted target genes are depicted. Figure S7. Pseudotime-dependent chromatin accessibility and gene expression changes, related to Figure 7. The first column shows the dynamics of the 10× Genomics TF enrichment score. The second column shows the dynamics of TF gene expression values, and the third and fourth columns represent the dynamics of the SCENIC-reported target gene expression values of corresponding TFs, respectively. Figure S8. 
Myogenesis related gene expression in DMD (Duchenne muscular dystrophy) mice. Comparison of RNA-seq data of flexor digitorum short (FDB), extensor digitorum long (EDL), and soleus (SOL) in DMD and wild-type mice including 2- month and 5-month age. A. The expression levels of myogenesis related genes (Myod1, Myog, Myf5, Pax7). B. The expression levels of related genes that were upregulated during porcine embryonic myogenesis (EGR1, RHOB, KLF4, SOX8, NGFR, MAX, RBFOX2, ANXA6, HES6, RASSF4, PLS3, SPG21). C. The expression levels of related genes that were downregulated during porcine embryonic myogenesis COX5A, HOMER2, BNIP3, CNCS). Data were obtained from the GEO database (GSE162455; WT, n = 4; DMD, n = 7). Figure S9. Genome browser view of differentially accessible peaks at the TSS of EGR1 and RHOB between myogenic cells in the scATAC-seq dataset, related to Figure 8. Figure S10. Functional analysis of EGR1 in myogenesis, related to Figure 8 A-B. EdU assays for the proliferation of pig primary myogenic cells (A) and C2C12 myoblasts following EGR1 overexpression. C. qPCR analysis of the mRNA levels of cell cycle regulators in C2C12 cells following EGR1 overexpression. D. Immunofluorescence staining for MyHC in C2C12 cells following EGR1 overexpression and differentiation for 3 d. Then, the fusion index was calculated. Figure S11. Functional analysis of RHOB in myogenesis, related to Figure 8 A-B. EdU assays for proliferation of pig primary myogenic cells (A) and C2C12 myoblasts following RHOB overexpression. C. qPCR analysis of the mRNA levels of cell-cycle regulators in C2C12 cells following RHOB overexpression. D. Immunofluorescence staining for MyHC in C2C12 cells following RHOB overexpression and differentiation for 3 d. Then, the fusion index was calculated.

  10. Data from: Reference transcriptomics of porcine peripheral immune cells...

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    • +2more
    Updated Jun 5, 2025
    Cite
    Agricultural Research Service (2025). Data from: Reference transcriptomics of porcine peripheral immune cells created through bulk and single-cell RNA sequencing [Dataset]. https://catalog.data.gov/dataset/data-from-reference-transcriptomics-of-porcine-peripheral-immune-cells-created-through-bul-e667c
    Explore at:
    Dataset updated
    Jun 5, 2025
    Dataset provided by
    Agricultural Research Service
    Description

    This dataset contains files reconstructing single-cell data presented in 'Reference transcriptomics of porcine peripheral immune cells created through bulk and single-cell RNA sequencing' by Herrera-Uribe & Wiarda et al. 2021. Samples of peripheral blood mononuclear cells (PBMCs) were collected from seven pigs and processed for single-cell RNA sequencing (scRNA-seq) in order to provide a reference annotation of porcine immune cell transcriptomics at enhanced, single-cell resolution. Analysis of single-cell data allowed identification of 36 cell clusters that were further classified into 13 cell types, including monocytes, dendritic cells, B cells, antibody-secreting cells, numerous populations of T cells, NK cells, and erythrocytes. Files may be used to reconstruct the data as presented in the manuscript, allowing for individual query by other users. Scripts for original data analysis are available at https://github.com/USDA-FSEPRU/PorcinePBMCs_bulkRNAseq_scRNAseq. Raw data are available at https://www.ebi.ac.uk/ena/browser/view/PRJEB43826. Funding for this dataset was also provided by NRSP8: National Animal Genome Research Program (https://www.nimss.org/projects/view/mrp/outline/18464). Resources in this dataset:Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells 10X Format. File Name: PBMC7_AllCells.zipResource Description: Zipped folder containing PBMC counts matrix, gene names, and cell IDs. Files are as follows: matrix of gene counts* (matrix.mtx.gx) gene names (features.tsv.gz) cell IDs (barcodes.tsv.gz) *The ‘raw’ count matrix is actually gene counts obtained following ambient RNA removal. During ambient RNA removal, we specified to calculate non-integer count estimations, so most gene counts are actually non-integer values in this matrix but should still be treated as raw/unnormalized data that requires further normalization/transformation. Data can be read into R using the function Read10X().Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells Metadata. File Name: PBMC7_AllCells_meta.csvResource Description: .csv file containing metadata for cells included in the final dataset. Metadata columns include: nCount_RNA = the number of transcripts detected in a cell nFeature_RNA = the number of genes detected in a cell Loupe = cell barcodes; correspond to the cell IDs found in the .h5Seurat and 10X formatted objects for all cells prcntMito = percent mitochondrial reads in a cell Scrublet = doublet probability score assigned to a cell seurat_clusters = cluster ID assigned to a cell PaperIDs = sample ID for a cell celltypes = cell type ID assigned to a cellResource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells PCA Coordinates. File Name: PBMC7_AllCells_PCAcoord.csvResource Description: .csv file containing first 100 PCA coordinates for cells. Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells t-SNE Coordinates. File Name: PBMC7_AllCells_tSNEcoord.csvResource Description: .csv file containing t-SNE coordinates for all cells.Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells UMAP Coordinates. File Name: PBMC7_AllCells_UMAPcoord.csvResource Description: .csv file containing UMAP coordinates for all cells.Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - CD4 T Cells t-SNE Coordinates. File Name: PBMC7_CD4only_tSNEcoord.csvResource Description: .csv file containing t-SNE coordinates for only CD4 T cells (clusters 0, 3, 4, 28). 
A dataset of only CD4 T cells can be re-created from the PBMC7_AllCells.h5Seurat, and t-SNE coordinates used in publication can be re-assigned using this .csv file.Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - CD4 T Cells UMAP Coordinates. File Name: PBMC7_CD4only_UMAPcoord.csvResource Description: .csv file containing UMAP coordinates for only CD4 T cells (clusters 0, 3, 4, 28). A dataset of only CD4 T cells can be re-created from the PBMC7_AllCells.h5Seurat, and UMAP coordinates used in publication can be re-assigned using this .csv file.Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - Gamma Delta T Cells UMAP Coordinates. File Name: PBMC7_GDonly_UMAPcoord.csvResource Description: .csv file containing UMAP coordinates for only gamma delta T cells (clusters 6, 21, 24, 31). A dataset of only gamma delta T cells can be re-created from the PBMC7_AllCells.h5Seurat, and UMAP coordinates used in publication can be re-assigned using this .csv file.Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - Gamma Delta T Cells t-SNE Coordinates. File Name: PBMC7_GDonly_tSNEcoord.csvResource Description: .csv file containing t-SNE coordinates for only gamma delta T cells (clusters 6, 21, 24, 31). A dataset of only gamma delta T cells can be re-created from the PBMC7_AllCells.h5Seurat, and t-SNE coordinates used in publication can be re-assigned using this .csv file.Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - Gene Annotation Information. File Name: UnfilteredGeneInfo.txtResource Description: .txt file containing gene nomenclature information used to assign gene names in the dataset. 'Name' column corresponds to the name assigned to a feature in the dataset.Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells H5Seurat. File Name: PBMC7.tarResource Description: .h5Seurat object of all cells in PBMC dataset. File needs to be untarred, then read into R using function LoadH5Seurat().
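
The resource descriptions above point to R functions (Read10X(), LoadH5Seurat()). As a hedged Python alternative, the 10X-format trio and the coordinate CSVs can be loaded roughly as follows; the paths, and the assumption that the coordinate CSV is indexed by the same cell barcodes as the metadata, are mine, not the dataset's documentation.

# Hedged Python alternative to the R workflow described above.
# Paths and CSV index alignment are assumptions.
import pandas as pd
import scanpy as sc

adata = sc.read_10x_mtx("PBMC7_AllCells/")                  # expects matrix/features/barcodes files
meta = pd.read_csv("PBMC7_AllCells_meta.csv", index_col="Loupe")
adata.obs = adata.obs.join(meta)                            # attach cell-level metadata by barcode

tsne = pd.read_csv("PBMC7_AllCells_tSNEcoord.csv", index_col=0)  # assumed barcode-indexed
adata.obsm["X_tsne"] = tsne.loc[adata.obs_names].to_numpy()      # reuse published t-SNE coordinates
sc.pl.tsne(adata, color="celltypes")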

  11. Additional file 2 of Effects of caloric restriction on the gut microbiome...

    • datasetcatalog.nlm.nih.gov
    • springernature.figshare.com
    Updated Apr 5, 2022
    Cite
    Brachs, Sebastian; Glauben, Rainer; Bisanz, Jordan E.; Volk, Hans-Dieter; von Schwartzenberg, Reiner Jumpertz; Sandforth, Arvid; Drechsel, Oliver; Radonić, Aleksandar; Turnbaugh, Peter J.; Kunkel, Désirée; Sbierski-Kind, Julia; Grenkowitz, Sophia; Spranger, Joachim; Friedrich, Marie; Schlickeiser, Stephan; Thürmer, Andrea; Mai, Knut (2022). Additional file 2 of Effects of caloric restriction on the gut microbiome are linked with immune senescence [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000418773
    Explore at:
    Dataset updated
    Apr 5, 2022
    Authors
    Brachs, Sebastian; Glauben, Rainer; Bisanz, Jordan E.; Volk, Hans-Dieter; von Schwartzenberg, Reiner Jumpertz; Sandforth, Arvid; Drechsel, Oliver; Radonić, Aleksandar; Turnbaugh, Peter J.; Kunkel, Désirée; Sbierski-Kind, Julia; Grenkowitz, Sophia; Spranger, Joachim; Friedrich, Marie; Schlickeiser, Stephan; Thürmer, Andrea; Mai, Knut
    Description

    Additional file 1: Supplementary Figure 1. Related to Fig. 1. MMSP50 is a representative donor of the weight loss cohort. A Examination of baseline alpha diversity demonstrates that MMSP50 is at the 54th ranked percentile for baseline diversity after VLCD. B Their baseline microbiota composition (principal coordinates analysis of Bray-Curtis Dissimilarity) is well within the 95% confidence interval of baseline composition for the cohort (dotted line) and C their change in community structure is the 19th percentile for change in composition. Supplementary Figure 2. Related to Fig. 2. No significant changes in energy loss or fecal content after microbial colonization. Metabolic analysis of germ-free (GF) mice and mice inoculated with the AdLib and CalRes human gut microbiota. A-D Energy loss (A), fecal energy content (B), food consumption (C), and energy absorption (D) were measured using bomb calorimetry in GF and colonized mice. E Body weights in g. ** P < 0.01, *** P < 0.001 as determined using 2-way ANOVA with Bonferonni’s post-test correction for multiple comparisons. error bars = SEM. Supplementary Figure 3. Related to Fig. 3. Differential expression of surface markers in different colonic immune cell clusters of germ-free and colonized mice. A The heatmap shows differentially distributed colonic immune cell phenotypes quantified by PhenoGraph clustering. The distribution of each cell cluster (rows) is shown for each murine sample (columns). B The heatmap shows the distribution of colonic immune lineages based on the expression of canonical lineage markers by t-SNE on all colonic viable CD45+ leukocytes. The differential expression of each selected surface marker (rows) is shown for each immune cell cluster (columns). The significance levels of the comparison between the groups for each immune cell cluster are depicted by semi-supervised hierarchical clustering. The top bubbles denote clusters with significantly different abundances between the groups. Bubble colors indicate the one of the two groups being compared with higher average cellular frequencies; bubble size indicates the -log2 FDR-adjusted p-values. Visualization of all colonic viable CD45+ leukocytes by t-SNE. Overlayed colors represent Phenograph clusters as defined in heatmap. C-G Absolute numbers of colonic leukocytes (C), CD4+ T cells (D), CD8+ T cells (E), B cells (F), and NK cells (G) defined by manual gating of mass cytometry data, from germ-free (GF) mice and mice colonized with the AdLib and CalRes human gut microbiota from the top weight loser of an 8-week weight loss intervention study (n=9 or more mice per group). * P<0.05, ANOVA with Bonferonni’s post-test correction for multiple comparison. Supplementary Figure 4. Related to Fig. 4. Differential expression of surface markers in different splenic immune cell clusters of germ-free and colonized mice. A The heatmap shows differentially distributed splenic immune cell phenotypes quantified by PhenoGraph clustering. The distribution of each cell cluster (rows) is shown for each murine sample (columns). B The heatmap shows the distribution of splenic immune lineages based on the expression of canonical lineage markers by t-SNE on all colonic viable CD45+ leukocytes. The differential expression of each selected surface marker (rows) is shown for each immune cell cluster (columns). The significance levels of the comparison between the groups for each immune cell cluster are depicted by semi-supervised hierarchical clustering. 
The top bubbles denote clusters with significantly different abundances between the groups. Bubble colors indicate the one of the two groups being compared with higher average cellular frequencies; bubble size indicates the -log2 FDR-adjusted p-values. Visualization of all splenic viable CD45+ leukocytes by t-SNE. Overlayed colors represent Phenograph clusters as defined in heatmap. C-G Absolute numbers of splenic leukocytes (C), CD4+ T cells (D), CD8+ T cells (E), B cells (F), and NK cells (G) defined by manual gating of mass cytometry data, from germ-free (GF) mice and mice colonized with the AdLib and CalRes human gut microbiota from the top weight loser of an 8-week weight loss intervention study (n=9 or more mice per group). * P<0.05, ANOVA with Bonferonni’s post-test correction for multiple comparison. Supplementary Figure 5. Related to Fig. 5. Differential expression of surface markers in different hepatic immune cell clusters of germ-free and colonized mice. A The heatmap shows differentially distributed hepatic immune cell phenotypes quantified by PhenoGraph clustering. The distribution of each cell cluster (rows) is shown for each murine sample (columns). B The heatmap shows the distribution of hepatic immune lineages based on the expression of canonical lineage markers by t-SNE on all colonic viable CD45+ leukocytes. The differential expression of each selected surface marker (rows) is shown for each immune cell cluster (columns). The significance levels of the comparison between the groups for each immune cell cluster are depicted by semi-supervised hierarchical clustering. The top bubbles denote clusters with significantly different abundances between the groups. Bubble colors indicate the one of the two groups being compared with higher average cellular frequencies; bubble size indicates the -log2 FDR-adjusted p-values. Visualization of all hepatic viable CD45+ leukocytes by t-SNE. Overlayed colors represent Phenograph clusters as defined in heatmap. C-G Absolute numbers of hepatic leukocytes (C), CD4+ T cells (D), CD8+ T cells (E), B cells (F), and NK cells (G) defined by manual gating of mass cytometry data, from germ-free (GF) mice and mice colonized with the AdLib and CalRes human gut microbiota from the top weight loser of an 8-week weight loss intervention study (n=9 or more mice per group). ANOVA with Bonferonni’s post-test correction for multiple comparison. Supplementary Figure 6. Related to Fig. 6. Gut microbial community structure slightly affects composition and activation of liver immune cells. The heatmap shows latent correlation matrix between abundances of amplicon sequence variants (ASVs) detected in stool samples and all immune parameters analyzed in liver of mice 21 days after inoculation with AdLib and CalRes human gut microbiota. Immune parameters are expressed as frequencies, i.e., percent of parent, except those labeled # which were quantified as absolute cell counts. Heatmap was ordered according to rows and columns first principal components to highlight the cross-correlation structure. Asterisks indicate variables that were selected in L1-penalized sparse canonical correlation analysis (CCA). Circular chord plots display latent correlation between frequencies of manually defined immune subsets and L1-selected ASVs including the top ten taxa that either positively or negatively associate with the immunological dataset. Blue to red colour scale in heatmap and chords indicates negative and positive correlation values. 
Color of row-legend bar and species labels denotes the phylum level. Colors of column legend bars indicate parental lineage and differentiation level (antigen-experience) of lymphocyte subsets, respectively. The boxplot inset shows how experimental groups as a latent variable are not well-explained by the sparse canonical covariate.

  12. Data from: A Benchmark Dataset for Multilingual Tokenization Energy and...

    • observatorio-cientifico.ua.es
    • zenodo.org
    Updated 2025
    Cite
    Quesada Granja, Carlos; Quesada Granja, Carlos (2025). A Benchmark Dataset for Multilingual Tokenization Energy and Efficiency Across 23 Models and 325 Languages [Dataset]. https://observatorio-cientifico.ua.es/documentos/688b604617bb6239d2d4a92a
    Explore at:
    Dataset updated
    2025
    Authors
    Quesada Granja, Carlos; Quesada Granja, Carlos
    Description

    This repository contains a benchmark dataset and processing scripts for analyzing the energy consumption and processing efficiency of 23 Hugging Face tokenizers applied to 8,212 standardized text chunks across 325 languages.

    The dataset includes raw and normalized energy measurements, tokenization times, token counts, and rich structural metadata for each chunk, including script composition, entropy, compression ratios, and character-level features. All experiments were conducted in a controlled environment using PyRAPL on a Linux workstation, with baseline CPU consumption subtracted via linear interpolation.
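
A minimal sketch of a per-chunk measurement loop in the spirit of the setup described above (pyRAPL wrapped around a Hugging Face tokenizer call); the model name, text chunks, and loop structure are illustrative assumptions, not the repository's multimodel_tokenization_energy.py.

# Hedged sketch: measure package energy and duration around a tokenizer call.
# Model name and chunks are assumptions; pyRAPL reports energy in microjoules.
import pyRAPL
from transformers import AutoTokenizer

pyRAPL.setup()
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
chunks = ["Example chunk of standardized text.", "Otro fragmento de texto."]

for chunk in chunks:
    meter = pyRAPL.Measurement("tokenize")
    meter.begin()
    tokens = tokenizer(chunk)
    meter.end()
    print(len(tokens["input_ids"]), meter.result.pkg, meter.result.duration)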

    The resulting data enables fine-grained comparison of tokenizers across linguistic and script diversity. It also supports downstream tasks such as energy-aware NLP, script-sensitive modeling, and benchmarking of future tokenization methods.

    Accompanying R scripts are provided to reproduce data processing, regression models, and clustering analyses. Visualization outputs and cluster-level summaries are also included.

    All contents are organized into clearly structured folders to support easy access, interpretability, and reuse by the research community.

    01_processing_scripts/

    R scripts to transform raw data, subtract baseline energy, and produce clean metrics.

    multimodel_tokenization_energy.py⤷ Python script used to tokenize all chunks with 23 models while logging energy and time.

    adapting_original_dataset.R⤷ Reads raw logs and metadata, computes net energy, and outputs cleaned files.

    energy_patterns.R⤷ Performs clustering, regression, t-SNE, and generates all visualizations.

    02_raw_data/

    Raw output from the tokenization experiment and baseline profiler.

    all_models_tokenization.csv⤷ Full log of 42M+ tokenization runs (23 models × 225 reps × 8,212 chunks).

    baseline.csv⤷ Background CPU energy samples, one per 50 chunks. Used for normalization.

    03_clean_data/

    Cleaned, enriched, and reshaped datasets ready for analysis.

    net_energy.csv⤷ Raw tokenization results after baseline energy subtraction (per run).

    tokenization_long.csv⤷ One row per chunk × tokenizer, with medians + token counts.

    tokenization_wide.csv⤷ Wide-format matrix: one row per chunk, one column per tokenizer × metric.

    complete.csv⤷ Fully enriched dataset joining all metrics, metadata, and script distributions.

    metadata.csv⤷ Structural features and script-based character stats per chunk.

    04_cluster_outputs/

    Outputs from clustering and dimensionality reduction over tokenizer energy profiles.

    tokenizer_dendrogram.pdf⤷ Hierarchical clustering of 23 tokenizers based on energy profiles.

    tokenizer_tsne.pdf⤷ t-SNE projection of tokenizers grouped by energy usage.

    mean_energy_per_cluster.csv⤷ Mean energy consumption (mJ) per language × tokenizer cluster.

    sd_energy_per_cluster.csv⤷ Standard deviation of energy consumption (mJ) per language × cluster.

    grid.pdf⤷ Heatmap of script-wise energy deltas (relative to Latin) for all tokenizers.

  13. American Anxieties: Dear Abby's Questions

    • kaggle.com
    zip
    Updated Dec 4, 2023
    Cite
    The Devastator (2023). American Anxieties: Dear Abby's Questions [Dataset]. https://www.kaggle.com/thedevastator/american-anxieties-dear-abby-s-questions
    Explore at:
Available download formats: zip (6547914 bytes)
    Dataset updated
    Dec 4, 2023
    Authors
    The Devastator
    Description

    American Anxieties: Dear Abby's Questions

    20,000 Questions to Dear Abby: Insights on American Anxieties

    By Kelly Garrett [source]

    About this dataset

    This dataset is a collection of 20,000 questions addressed to the advice columnist Dear Abby, providing valuable insights into American anxieties and concerns from the mid-1980s to 2017. It was used in The Pudding essay titled 30 Years of American Anxieties: What 20,000 letters to an advice columnist tell us about what—and who—concerns us most, published in November 2018.

    The dataset includes information such as the URL and title of the articles or publications where the questions were published. It also contains the text of the questions asked by readers. These questions were publicly available on websites as well as obtained from digital copies of newspapers that included Dear Abby sections.

    It is important to note that this dataset does not include any updates.

    The writers of these questions are predominantly female (approximately two-thirds) based on demographics mentioned by Pauline Phillips, which were collected through a survey she conducted in 1987. However, there is limited information available about their origins or other demographic data. Additionally, it should be acknowledged that only a fraction of all written-in questions were published because advice columnists selectively choose which ones to feature.

    Despite these limitations, this dataset offers a glimpse into important societal concerns over time. For instance, it reflects issues like the HIV/AIDS crisis during the 1980s. With over 20,000 questions spanning several decades, it provides a directional understanding of broader trends.

    The essay based on this dataset highlighted three main themes: sex, LGBTQ issues, and religion. To analyze these topics further, relevant keywords were used for each issue to create broad groupings and then narrow down into specific categories.

In addition to these themes, questions related to parents, children, friends, and bosses were also explored using a visual clustering technique called t-SNE (t-distributed Stochastic Neighbor Embedding). Manual categorization was also employed by tagging relevant entries within those groupings.

    It's important to note that the dataset does not include any information about the dates of publication or data collection.

    Overall, this dataset provides a comprehensive view of American anxieties and concerns over several decades, offering insights into cultural shifts and societal issues

    How to use the dataset

    • Understanding the Columns: The dataset consists of several columns that provide valuable information about each question.

      • question_only: This column contains the text of the question asked by the reader.
      • title: This column contains the title or headline of the article or publication where the question was published.
      • url: This column contains the URL of the article or publication where the question was published.
    • Analyzing Specific Topics: The dataset covers various topics that were concerning Americans during this time period. You can use specific keywords related to these topics in combination with text analysis techniques to gain insights into public concerns and attitudes.

      Common themes covered in this dataset include sex, LGBTQ issues, religion, parents, children, friends, and bosses.

    • Keyword Analysis: To analyze specific topics or themes within this dataset effectively, it is recommended to create a list of relevant keywords related to your research interests. These keywords can be used for filtering and searching within columns like question_only or title.

• Text Analysis Techniques: You can apply various text analysis techniques to this dataset, such as sentiment analysis, topic modeling (using methods like Latent Dirichlet Allocation), and word frequency analysis, depending on your research goals.

• Visualizations and Clustering: If you are interested in exploring patterns or relationships across questions (such as topic clusters), you can use visualization techniques like t-SNE to create visual representations, as sketched after this list.

  You can also apply other clustering algorithms or network analysis techniques to gain additional insights from the dataset.
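
A minimal sketch of the t-SNE visualization workflow mentioned in point 5, using scikit-learn; the file name dear_abby.csv and the TF-IDF/truncated-SVD preprocessing are assumptions, not The Pudding's original pipeline.

# Hedged sketch: TF-IDF -> truncated SVD -> t-SNE over the question_only column.
# The file name and pipeline choices are assumptions.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

df = pd.read_csv("dear_abby.csv")                       # hypothetical file name
tfidf = TfidfVectorizer(max_features=20000, stop_words="english").fit_transform(
    df["question_only"].fillna(""))
svd = TruncatedSVD(n_components=50, random_state=0).fit_transform(tfidf)
coords = TSNE(n_components=2, random_state=0).fit_transform(svd)
df["tsne_x"], df["tsne_y"] = coords[:, 0], coords[:, 1]  # ready for plotting or clustering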

    Remember to acknowledge the source when using this dataset: The Pudding essay 30 Years of American Anxieties: What 20,000 letters to an advice columnist tell us about what—and who—concerns us most. published in November 2018.

    Plea...

  14. Details of overall clustering results.

    • plos.figshare.com
    xls
    Updated Dec 19, 2024
    Cite
    Shahneela Pitafi; Toni Anwar; I Dewa Made Widia; Zubair Sharif; Boonsit Yimwadsana (2024). Details of overall clustering results. [Dataset]. http://doi.org/10.1371/journal.pone.0313890.t005
    Explore at:
Available download formats: xls
    Dataset updated
    Dec 19, 2024
    Dataset provided by
PLOS (http://plos.org/)
    Authors
    Shahneela Pitafi; Toni Anwar; I Dewa Made Widia; Zubair Sharif; Boonsit Yimwadsana
    License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

Perimeter Intrusion Detection Systems (PIDS) are crucial for protecting physical locations by detecting and responding to intrusions around their perimeters. Despite the availability of several PIDS, challenges remain in detection accuracy and precise activity classification. To address these challenges, a new machine learning model is developed. This model uses a pre-trained InceptionV3 network for feature extraction on a PID intrusion image dataset, followed by t-SNE for dimensionality reduction and subsequent clustering. When handling high-dimensional data, the existing Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm faces efficiency issues due to its complexity and varying densities. To overcome these limitations, this research enhances the traditional DBSCAN algorithm: distances between minimal points are determined using an estimate of the epsilon value based on the Manhattan distance formula. The effectiveness of the proposed model is evaluated by comparing it to state-of-the-art techniques from the literature. The analysis reveals that the proposed model achieved a silhouette score of 0.86, while comparative techniques failed to produce similar results. This research contributes to societal security by improving perimeter protection, and future researchers can use the developed model for human activity recognition from image datasets.
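
A hedged sketch of the pipeline this description outlines (deep features, t-SNE, density-based clustering with a Manhattan metric, silhouette scoring), using standard DBSCAN as a stand-in for the paper's enhanced variant; the feature matrix and eps/min_samples values are illustrative only.

# Hedged sketch: features -> t-SNE -> DBSCAN (Manhattan metric) -> silhouette.
# `features` stands in for InceptionV3 activations; eps/min_samples are assumed.
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
features = rng.normal(size=(600, 2048))          # placeholder for extracted image features
emb = TSNE(n_components=2, random_state=0).fit_transform(features)

labels = DBSCAN(eps=3.0, min_samples=10, metric="manhattan").fit_predict(emb)
mask = labels != -1                              # exclude noise points from the score
if mask.any() and len(set(labels[mask])) > 1:
    print("silhouette:", silhouette_score(emb[mask], labels[mask]))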

  15. Machine Learning Basics for Beginners🤖🧠

    • kaggle.com
    zip
    Updated Jun 22, 2023
    Cite
    Bhanupratap Biswas (2023). Machine Learning Basics for Beginners🤖🧠 [Dataset]. https://www.kaggle.com/datasets/bhanupratapbiswas/machine-learning-basics-for-beginners
    Explore at:
Available download formats: zip (492015 bytes)
    Dataset updated
    Jun 22, 2023
    Authors
    Bhanupratap Biswas
    License

ODC Public Domain Dedication and Licence (PDDL) v1.0, http://www.opendatacommons.org/licenses/pddl/1.0/
    License information was derived automatically

    Description

Machine learning is a subfield of artificial intelligence (AI) that focuses on enabling computers to learn and make predictions or decisions without being explicitly programmed. Here are some key concepts and terms to help you get started:

    1. Supervised Learning: In supervised learning, the machine learning algorithm learns from labeled training data. The training data consists of input examples and their corresponding correct output or target values. The algorithm learns to generalize from this data and make predictions or classify new, unseen examples.

    2. Unsupervised Learning: Unsupervised learning involves learning patterns and relationships from unlabeled data. Unlike supervised learning, there are no target values provided. Instead, the algorithm aims to discover inherent structures or clusters in the data.

    3. Training Data and Test Data: Machine learning models require a dataset to learn from. The dataset is typically split into two parts: the training data and the test data. The model learns from the training data, and the test data is used to evaluate its performance and generalization ability.

    4. Features and Labels: In supervised learning, the input examples are often represented by features or attributes. For example, in a spam email classification task, features might include the presence of certain keywords or the length of the email. The corresponding output or target values are called labels, indicating the class or category to which the example belongs (e.g., spam or not spam).

5. Model Evaluation Metrics: To assess the performance of a machine learning model, various evaluation metrics are used. Common metrics include accuracy (the proportion of correctly predicted examples), precision (the proportion of true positives among all positive predictions), recall (the proportion of actual positives that are correctly predicted), and F1 score (a combination of precision and recall).

    6. Overfitting and Underfitting: Overfitting occurs when a model becomes too complex and learns to memorize the training data instead of generalizing well to unseen examples. On the other hand, underfitting happens when a model is too simple and fails to capture the underlying patterns in the data. Balancing the complexity of the model is crucial to achieve good generalization.

    7. Feature Engineering: Feature engineering involves selecting or creating relevant features that can help improve the performance of a machine learning model. It often requires domain knowledge and creativity to transform raw data into a suitable representation that captures the important information.

    8. Bias and Variance Trade-off: The bias-variance trade-off is a fundamental concept in machine learning. Bias refers to the errors introduced by the model's assumptions and simplifications, while variance refers to the model's sensitivity to small fluctuations in the training data. Reducing bias may increase variance and vice versa. Finding the right balance is important for building a well-performing model.

    9. Supervised Learning Algorithms: There are various supervised learning algorithms, including linear regression, logistic regression, decision trees, random forests, support vector machines (SVM), and neural networks. Each algorithm has its own strengths, weaknesses, and specific use cases.

    10. Unsupervised Learning Algorithms: Unsupervised learning algorithms include clustering algorithms like k-means clustering and hierarchical clustering, dimensionality reduction techniques like principal component analysis (PCA) and t-SNE, and anomaly detection algorithms, among others.
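
To make a few of the concepts above concrete, here is a small scikit-learn example on the built-in iris data (a stand-in dataset): a train/test split, a supervised classifier scored with accuracy and F1, and an unsupervised K-Means fit on the same features.

# Small illustration of the concepts above: supervised vs. unsupervised learning,
# a train/test split, and two evaluation metrics. Iris is a stand-in dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)    # supervised learning
pred = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, pred))
print("macro F1:", f1_score(y_test, pred, average="macro"))

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)  # unsupervised learning
print("cluster sizes:", [int((kmeans.labels_ == k).sum()) for k in range(3)])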

    These concepts provide a starting point for understanding the basics of machine learning. As you delve deeper, you can explore more advanced topics such as deep learning, reinforcement learning, and natural language processing. Remember to practice hands-on with real-world datasets to gain practical experience and further refine your skills.

  16. GERDA datasets including NGS and SGA data

    • data.mendeley.com
    Updated Apr 26, 2023
    + more versions
    Cite
    Fabian Otte (2023). GERDA datasets including NGS and SGA data [Dataset]. http://doi.org/10.17632/8c4zbxfvwk.3
    Explore at:
    Dataset updated
    Apr 26, 2023
    Authors
    Fabian Otte
    License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

Datasets linked to the publication "Revealing viral and cellular dynamics of HIV-1 at the single-cell level during early treatment periods", Otte et al. 2023, published in Cell Reports Methods. Pre-ART (antiretroviral therapy) cryo-conserved and whole blood specimens were sampled for HIV-1 virus reservoir determination in HIV-1 positive individuals from the Swiss HIV Cohort Study. Patients were monitored for proviral DNA, poly-A transcripts (RNA), late protein translation (Gag and Envelope reactivation co-detection assay, GERDA), and intact viruses (gold standard: viral outgrowth assay, VOA). In this dataset we deposit the pipeline for the multidimensional data analysis of our newly established GERDA method, using DBSCAN and t-SNE. For further comprehension, NGS and Sanger sequencing data are attached as processed and raw data (GenBank).

Resubmitted to Cell Reports Methods (Jan-2023), accepted in principle (Mar-2023)

GERDA is a new detection method to decipher the HIV-1 cellular reservoir in blood (tissue or any other specimen). It integrates HIV-1 Gag and Env co-detection along with cellular surface markers to reveal 1) which cells still contain HIV-1 translation-competent virus and 2) which markers the respective infected cells express. The phenotypic marker repertoire of the cells allows predictions about potential homing and assessment of the HIV-1 (tissue) reservoir. All FACS data were acquired on a BD LSRFortessa machine (markers: CCR7, CD45RA, CD28, CD4, CD25, PD1, IntegrinB7, CLA, HIV-1 Env, HIV-1 Gag). Raw FACS data (pre-gated CD4CD3+ T-cells) were arcsin transformed and dimensionally reduced using opt-SNE. Data were further clustered using DBSCAN, and either individual clusters were analyzed for individual marker expression or expression profiles of all relevant clusters were analyzed by heatmaps. Sequences before/after therapy initiation and during viral outgrowth cultures were monitored for individuals P01-46 and P04-56 by next-generation sequencing (NGS of the HIV-1 Envelope V3 loop only) and by Sanger sequencing (single genome amplification, SGA).
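
The deposited pipeline itself is in R. Purely as a hedged Python analogue of the steps described above (an arcsinh-style transform of marker intensities, a 2-D embedding, and DBSCAN clustering), one could write something like the following; the file name, marker subset, cofactor, use of t-SNE in place of opt-SNE, and DBSCAN parameters are all assumptions.

# Hedged Python analogue of the described workflow, not the deposited R pipeline.
# File name, cofactor, and clustering parameters are assumptions.
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN

raw = pd.read_csv("facs_export.csv")             # hypothetical per-cell marker intensities
markers = ["CCR7", "CD45RA", "CD28", "CD4", "CD25", "PD1"]
X = np.arcsinh(raw[markers].to_numpy() / 150.0)  # arcsinh with an assumed cofactor

emb = TSNE(n_components=2, random_state=0).fit_transform(X)   # t-SNE standing in for opt-SNE
raw["cluster"] = DBSCAN(eps=2.0, min_samples=25).fit_predict(emb)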

    Deposited files:
    • data normalization code (by Julian Spagnuolo)
    • FACS normalized data as CSV (XXX_arcsin.csv)
    • OMIQ conText file (_OMIQ-context_XXX)
    • arcsin-normalized FACS data after opt-SNE dimension reduction with OMIQ.ai, as CSV file (XXXarcsin.csv.csv)
    • R pipeline with code (XXX_commented.R)
    • P01_46-NGS and Sanger sequences
    • P04_56-NGS and Sanger sequences
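
    As a rough illustration of the analysis workflow described above (transform, nonlinear embedding, DBSCAN clustering, per-cluster marker profiles), the Python sketch below approximates the steps. It is not the deposited R/OMIQ pipeline: scikit-learn's t-SNE stands in for opt-SNE, an arcsinh transform stands in for the stated arcsin step, and the input file name, cofactor, and clustering parameters are assumptions.

    ```python
    # Approximate Python sketch of the GERDA FACS analysis steps; the deposited
    # pipeline is in R and uses OMIQ's opt-SNE. The file name, cofactor and all
    # clustering parameters below are assumptions for illustration.
    import numpy as np
    import pandas as pd
    from sklearn.manifold import TSNE
    from sklearn.cluster import DBSCAN

    markers = ["CCR7", "CD45RA", "CD28", "CD4", "CD25", "PD1",
               "IntegrinB7", "CLA", "HIV-1 Env", "HIV-1 Gag"]

    # Hypothetical CSV export of pre-gated FACS events, one column per marker.
    events = pd.read_csv("facs_events.csv", usecols=markers)

    # Arcsinh transform with an assumed flow-cytometry cofactor.
    transformed = np.arcsinh(events / 150.0)

    # Two-dimensional embedding (t-SNE here, standing in for opt-SNE).
    embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(transformed)

    # Density-based clustering of the embedding; eps/min_samples are illustrative.
    clusters = DBSCAN(eps=2.0, min_samples=20).fit_predict(embedding)

    # Per-cluster mean marker expression, e.g. as input for a heatmap.
    profiles = transformed.assign(cluster=clusters).groupby("cluster").mean()
    print(profiles.round(2))
    ```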

  17. List of augmentations selected.

    • plos.figshare.com
    xls
    Updated Dec 19, 2024
    Cite
    Shahneela Pitafi; Toni Anwar; I Dewa Made Widia; Zubair Sharif; Boonsit Yimwadsana (2024). List of augmentations selected. [Dataset]. http://doi.org/10.1371/journal.pone.0313890.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Dec 19, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Shahneela Pitafi; Toni Anwar; I Dewa Made Widia; Zubair Sharif; Boonsit Yimwadsana
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Perimeter Intrusion Detection Systems (PIDS) are crucial for protecting physical locations by detecting and responding to intrusions around their perimeters. Despite the availability of several PIDS, challenges remain in detection accuracy and precise activity classification. To address these challenges, a new machine learning model is developed. This model uses the pre-trained InceptionV3 network for feature extraction on a PID intrusion image dataset, followed by t-SNE for dimensionality reduction and subsequent clustering. When handling high-dimensional data, the existing Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm faces efficiency issues due to its complexity and varying densities. To overcome these limitations, this research enhances the traditional DBSCAN algorithm: distances between minimal points are determined using an estimate of the epsilon value based on the Manhattan distance formula. The effectiveness of the proposed model is evaluated by comparing it to state-of-the-art techniques from the literature. The analysis reveals that the proposed model achieved a silhouette score of 0.86, while comparative techniques failed to produce similar results. This research contributes to societal security by improving perimeter protection, and future researchers can use the developed model for human activity recognition from image datasets.
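
    The pipeline described above can be approximated in three steps: extract features with a pre-trained InceptionV3, embed them with t-SNE, and cluster the embedding with a density-based method. The sketch below uses plain DBSCAN with a Manhattan metric as a stand-in for the paper's enhanced DBSCAN; the image directory and every parameter value are assumptions.

    ```python
    # Sketch of the described pipeline: InceptionV3 features -> t-SNE -> DBSCAN.
    # Plain DBSCAN with a Manhattan metric stands in for the enhanced DBSCAN;
    # the image folder and all parameter values are illustrative assumptions.
    import numpy as np
    import tensorflow as tf
    from sklearn.manifold import TSNE
    from sklearn.cluster import DBSCAN
    from sklearn.metrics import silhouette_score

    # Pre-trained InceptionV3 as a frozen feature extractor (2048-d pooled output).
    backbone = tf.keras.applications.InceptionV3(
        include_top=False, weights="imagenet", pooling="avg")

    # Hypothetical unlabeled image folder; InceptionV3 expects 299x299 inputs.
    dataset = tf.keras.utils.image_dataset_from_directory(
        "pid_images/", labels=None, image_size=(299, 299), batch_size=32, shuffle=False)
    images = np.concatenate([batch.numpy() for batch in dataset])
    features = backbone.predict(
        tf.keras.applications.inception_v3.preprocess_input(images))

    # Reduce to 2-D with t-SNE, then run density-based clustering.
    embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
    labels = DBSCAN(eps=3.0, min_samples=10, metric="manhattan").fit_predict(embedding)

    # Silhouette score over clustered (non-noise) points.
    mask = labels != -1
    if len(set(labels[mask])) > 1:
        print("silhouette score:", silhouette_score(embedding[mask], labels[mask]))
    ```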

  18. Details of PID image dataset.

    • plos.figshare.com
    xls
    Updated Dec 19, 2024
    Cite
    Shahneela Pitafi; Toni Anwar; I Dewa Made Widia; Zubair Sharif; Boonsit Yimwadsana (2024). Details of PID image dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0313890.t003
    Explore at:
    Available download formats: xls
    Dataset updated
    Dec 19, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Shahneela Pitafi; Toni Anwar; I Dewa Made Widia; Zubair Sharif; Boonsit Yimwadsana
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Perimeter Intrusion Detection Systems (PIDS) are crucial for protecting physical locations by detecting and responding to intrusions around their perimeters. Despite the availability of several PIDS, challenges remain in detection accuracy and precise activity classification. To address these challenges, a new machine learning model is developed. This model uses the pre-trained InceptionV3 network for feature extraction on a PID intrusion image dataset, followed by t-SNE for dimensionality reduction and subsequent clustering. When handling high-dimensional data, the existing Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm faces efficiency issues due to its complexity and varying densities. To overcome these limitations, this research enhances the traditional DBSCAN algorithm: distances between minimal points are determined using an estimate of the epsilon value based on the Manhattan distance formula. The effectiveness of the proposed model is evaluated by comparing it to state-of-the-art techniques from the literature. The analysis reveals that the proposed model achieved a silhouette score of 0.86, while comparative techniques failed to produce similar results. This research contributes to societal security by improving perimeter protection, and future researchers can use the developed model for human activity recognition from image datasets.

  19. clustering and annotation metadata

    • figshare.com
    zip
    Updated Apr 7, 2020
    Cite
    Geoff Stanley (2020). clustering and annotation metadata [Dataset]. http://doi.org/10.6084/m9.figshare.12093471.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 7, 2020
    Dataset provided by
    figshare
    Authors
    Geoff Stanley
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    There are three files in seurat_results.zip: one containing the principal component values used for dimensionality reduction and clustering of all MSNs, one containing the computed t-SNE values, and one containing the Louvain clusters. The metadata_final.csv file contains the annotated major cell types and subtypes.
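
    A hypothetical sketch of how these files could be combined in Python is shown below; the CSV names inside seurat_results.zip and the use of the first column as the cell index are assumptions, since only metadata_final.csv is named explicitly.

    ```python
    # Hypothetical: merge the Seurat outputs with the annotation metadata.
    # The CSV names inside seurat_results.zip and the cell-ID index column are
    # assumptions; only metadata_final.csv is documented by name.
    import zipfile
    import pandas as pd

    with zipfile.ZipFile("seurat_results.zip") as archive:
        archive.extractall("seurat_results")

    pca = pd.read_csv("seurat_results/pca.csv", index_col=0)            # assumed file name
    tsne = pd.read_csv("seurat_results/tsne.csv", index_col=0)          # assumed file name
    clusters = pd.read_csv("seurat_results/clusters.csv", index_col=0)  # assumed file name
    metadata = pd.read_csv("metadata_final.csv", index_col=0)

    # One row per cell: embeddings, Louvain cluster, and annotated cell type.
    combined = pd.concat([pca, tsne, clusters, metadata], axis=1, join="inner")
    print(combined.head())
    ```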

  20. Dataset name, reference, dimensions and cell type composition.

    • plos.figshare.com
    xls
    Updated Dec 13, 2024
    Cite
    Yuta Hozumi; Guo-Wei Wei (2024). Dataset name, reference, dimensions and cell type composition. [Dataset]. http://doi.org/10.1371/journal.pone.0311791.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Dec 13, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Yuta Hozumi; Guo-Wei Wei
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset name, reference, dimensions and cell type composition.
