Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Sorting spikes from extracellular recordings into clusters associated with distinct single units (putative neurons) is a fundamental step in analyzing neuronal populations. Such spike sorting is intrinsically unsupervised, as the number of neurons is not known a priori. Therefore, spike sorting is an unsupervised learning problem that requires one of two approaches: specification of a fixed number of clusters c to seek, or generation of candidate partitions for several possible values of c, followed by selection of the best candidate based on post-clustering validation criteria. In this paper, we investigate the first approach and evaluate the utility of several methods both for providing lower-dimensional visualizations of the cluster structure and for subsequent spike clustering. We also introduce a visualization technique called improved visual assessment of cluster tendency (iVAT) to estimate possible cluster structures in data without the need for dimensionality reduction. Experiments are conducted on two datasets with ground-truth labels. In data with a relatively small number of clusters, iVAT is beneficial for estimating the number of clusters to inform the initialization of clustering algorithms. With larger numbers of clusters, iVAT gives a useful estimate of the coarse cluster structure but sometimes fails to indicate the presumptive number of clusters. We show that noise associated with recording extracellular neuronal potentials can disrupt computational clustering schemes, highlighting the benefit of probabilistic clustering models. Our results show that t-Distributed Stochastic Neighbor Embedding (t-SNE) provides representations of the data that yield more accurate visualization of potential cluster structure to inform the clustering stage. Moreover, the clusters obtained using t-SNE features were more reliable than those obtained using the other methods, which indicates that t-SNE can potentially be used both for visualization and to extract features for any clustering algorithm.
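A generic sketch of the second stage described above (t-SNE features feeding a clustering algorithm) might look as follows in Python; the waveform matrix, perplexity, and the choice of a Gaussian mixture model are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.mixture import GaussianMixture

def sort_spikes(waveforms: np.ndarray, n_units: int = 3):
    """waveforms: (n_spikes, n_samples_per_waveform) array of detected spike snippets."""
    # Low-dimensional t-SNE features used for both visualization and clustering
    features = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(waveforms)

    # Probabilistic clustering with a fixed number of units c = n_units
    # (the abstract argues such models cope better with recording noise)
    gmm = GaussianMixture(n_components=n_units, covariance_type="full", random_state=0)
    labels = gmm.fit_predict(features)
    return features, labels
```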
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
DrCyZ: Techniques for analyzing and extracting useful information from CyZ.
Samples from NASA Perseverance and a set of GAN-generated synthetic images from Neural Mars.
Repository: https://github.com/decurtoidiaz/drcyz
Subset of samples from the following dataset, which includes tools to visualize and analyse it:
CyZ: MARS Space Exploration Dataset. [https://doi.org/10.5281/zenodo.5655473]
Images of Mars from NASA missions.
Repository: https://github.com/decurtoidiaz/cyz
Authors:
J. de Curtò c@decurto.be
I. de Zarzà z@dezarza.be
• Subset of samples from Perseverance (drcyz/c).
∙ png (drcyz/c/png).
PNG files (5025) selected from NASA Perseverance (CyZ-1.1) after t-SNE and K-means Clustering.
∙ csv (drcyz/c/csv).
CSV file.
• Resized samples from Perseverance (drcyz/c+).
∙ png 64x64; 128x128; 256x256; 512x512; 1024x1024 (drcyz/c+/drcyz_64-1024).
PNG files resized to the corresponding sizes.
∙ TFRecords 64x64; 128x128; 256x256; 512x512; 1024x1024 (drcyz/c+/tfr_drcyz_64-1024).
TFRecords at the corresponding sizes, for import into TensorFlow.
• Synthetic images from Neural Mars generated using Stylegan2-ada (drcyz/drcyz+).
∙ png 100; 1000; 10000 (drcyz/drcyz+/drcyz_256_100-10000)
Subsets of 100, 1,000, and 10,000 PNG files at size 256x256.
• Network Checkpoint from Stylegan2-ada trained at size 256x256 (drcyz/model_drcyz).
∙ network-snapshot-000798-drcyz.pkl
• Notebooks in Python to analyse the original dataset and reproduce the experiments: K-means Clustering, t-SNE, PCA, synthetic generation using Stylegan2-ada, and instance segmentation using Deeplab (https://github.com/decurtoidiaz/drcyz/tree/main/dr_cyz+). A minimal sketch of the clustering pipeline follows this list.
∙ clustering_curiosity_de_curto_and_de_zarza.ipynb
K-means Clustering and PCA(2) with images from Curiosity.
∙ clustering_perseverance_de_curto_and_de_zarza.ipynb
K-means Clustering and PCA(2) with images from Perseverance.
∙ tsne_curiosity_de_curto_and_de_zarza.ipynb
t-SNE and PCA (components selected to explain 99% of variance) with images from Curiosity.
∙ tsne_perseverance_de_curto_and_de_zarza.ipynb
t-SNE and PCA (components selected to explain 99% of variance) with images from Perseverance.
∙ Stylegan2-ada_de_curto_and_de_zarza.ipynb
Stylegan2-ada trained on a subset of images from NASA Perseverance (DrCyZ).
∙ statistics_perseverance_de_curto_and_de_zarza.ipynb
Compute statistics from synthetic samples generated by Stylegan2-ada (DrCyZ) and images from NASA Perseverance (CyZ).
∙ DeepLab_TFLite_ADE20k_de_curto_and_de_zarza.ipynb
Example of instance segmentation using Deeplab with a sample from NASA Perseverance (DrCyZ).
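As referenced above, a minimal sketch of the PCA + K-means step used in the clustering notebooks could look like this; the image directory, resolution, and k=10 are assumptions rather than the notebooks' exact settings:

```python
import glob
import numpy as np
from PIL import Image
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Load resized Perseverance samples (path and resolution are assumed)
paths = sorted(glob.glob("drcyz/c+/drcyz_64/*.png"))
X = np.stack([np.asarray(Image.open(p).convert("RGB"), dtype=np.float32).ravel() / 255.0
              for p in paths])

# PCA(2), as in the clustering notebooks, to obtain a 2D representation
X_2d = PCA(n_components=2).fit_transform(X)

# K-means on the reduced features; k=10 is an arbitrary illustrative choice
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X_2d)
print(dict(zip(*np.unique(labels, return_counts=True))))
```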
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 4: CSV file of colon crypt bulk RNA-seq data used for GECO UMAP generation.
Public Domain Dedication (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/
This synthetic dataset simulates 300 global cities across 6 major geographic regions, designed specifically for unsupervised machine learning and clustering analysis. It explores how economic status, environmental quality, infrastructure, and digital access shape urban lifestyles worldwide.
| Highlight | Description | Notes |
|---|---|---|
| 10 Features | Economic, environmental & social indicators | Realistically scaled |
| 300 Cities | Europe, Asia, Americas, Africa, Oceania | Diverse distributions |
| Strong Correlations | Income ↔ Rent (+0.8), Density ↔ Pollution (+0.6) | ML-ready |
| No Missing Values | Clean, preprocessed data | Ready for analysis |
| 4-5 Natural Clusters | Metropolitan hubs, eco-towns, developing centers | Pre-validated |
✅ Realistic Correlations: Income strongly predicts rent (+0.8), internet access (+0.7), and happiness (+0.6)
✅ Regional Diversity: Each region has distinct economic and environmental characteristics
✅ Clustering-Ready: Naturally separable into 4-5 lifestyle archetypes
✅ Beginner-Friendly: No data cleaning required, includes example code
✅ Documented: Comprehensive README with methodology and use cases
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Load and prepare: drop identifier columns and keep only numeric features
df = pd.read_csv('city_lifestyle_dataset.csv')
X = df.drop(['city_name', 'country'], axis=1).select_dtypes('number')
X_scaled = StandardScaler().fit_transform(X)

# Cluster into the five expected lifestyle archetypes
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
df['cluster'] = kmeans.fit_predict(X_scaled)

# Analyze: numeric_only avoids errors from the remaining text columns
print(df.groupby('cluster').mean(numeric_only=True))
After working with this dataset, you will be able to:
1. Apply K-Means, DBSCAN, and Hierarchical Clustering
2. Use PCA for dimensionality reduction and visualization
3. Interpret correlation matrices and feature relationships
4. Create geographic visualizations with cluster assignments
5. Profile and name discovered clusters based on characteristics
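For outcomes 1 and 2 above, a small self-contained sketch (building on the quick-start snippet; the 4-5 cluster range comes from the dataset description, the rest is illustrative):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

df = pd.read_csv('city_lifestyle_dataset.csv')
X_scaled = StandardScaler().fit_transform(
    df.drop(['city_name', 'country'], axis=1).select_dtypes('number'))

# 2D projection for plotting (outcome 2)
X_2d = PCA(n_components=2).fit_transform(X_scaled)

# Compare candidate cluster counts around the documented 4-5 archetypes (outcome 1)
for k in range(3, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
    print(k, round(silhouette_score(X_scaled, labels), 3))
```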
| Cluster | Characteristics | Example Cities |
|---|---|---|
| Metropolitan Tech Hubs | High income, density, rent | Silicon Valley, Singapore |
| Eco-Friendly Towns | Low density, clean air, high happiness | Nordic cities |
| Developing Centers | Mid income, high density, poor air | Emerging markets |
| Low-Income Suburban | Low infrastructure, income | Rural areas |
| Industrial Mega-Cities | Very high density, pollution | Manufacturing hubs |
Unlike random synthetic data, this dataset was carefully engineered with:
- ✨ Realistic correlation structures based on urban research
- 🌍 Regional characteristics matching real-world patterns
- 🎯 Optimal cluster separability (validated via silhouette scores)
- 📚 Comprehensive documentation and starter code
✓ Learn clustering without data cleaning hassles
✓ Practice PCA and dimensionality reduction
✓ Create beautiful geographic visualizations
✓ Understand feature correlation in real-world contexts
✓ Build a portfolio project with clear business insights
This dataset was designed for educational purposes in machine learning and data science. While synthetic, it reflects real patterns observed in global urban development research.
Happy Clustering! 🎉
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 5: CSV file of bulk RNA-seq data of F. nucleatum infection time course used for GECO UMAP generation.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Plants are often attacked by various pathogens during their growth, which may cause environmental pollution, food shortages, or economic losses in a given area. Integration of high-throughput phenomics data and computer vision (CV) provides a great opportunity to diagnose plant disease at an early stage and to uncover subtype or stage patterns in disease progression. In this study, we propose a novel computational framework for plant disease identification and subtype discovery through a deep-embedding image-clustering strategy that combines a Weighted Distance Metric with the t-distributed Stochastic Neighbor Embedding algorithm (WDM-tSNE). To verify its effectiveness, we applied the method to four public image datasets. The results demonstrate that the newly developed tool is capable of identifying plant disease and further uncovering underlying subtypes associated with pathogen resistance. In summary, the current framework provides strong clustering performance for root or leaf images of diseased plants with pronounced disease spots or symptoms.
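The core idea, a weighted distance matrix fed to t-SNE with a precomputed metric and then clustered, can be sketched roughly as below; the weighting scheme and the number of clusters are placeholders, not the paper's WDM definition:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

def wdm_tsne_cluster(features: np.ndarray, weights: np.ndarray, n_clusters: int = 4):
    """features: (n_images, n_features) deep embeddings; weights: per-feature importances (placeholder)."""
    # Weighted Euclidean distances, turned into a square matrix for t-SNE's precomputed metric
    dist = squareform(pdist(features * np.sqrt(weights), metric="euclidean"))
    emb = TSNE(n_components=2, metric="precomputed", init="random",
               random_state=0).fit_transform(dist)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(emb)
    return emb, labels
```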
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Initial results on dimensionality reduction of taxi pick-up/drop-off regions in the Manhattan region of New York City, for the YellowCab company (first 7 months of 2018). Dimensionality reduction is performed separately for working days and weekends using t-SNE, SVD, and a simple deep autoencoder. Clustering quality in the resulting two-dimensional space is assessed using the Silhouette, Calinski-Harabasz, and Davies-Bouldin metrics. The taxi data are aggregated into 15-minute intervals.
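A generic sketch of this kind of 2D-embedding quality check (not the authors' code; the input matrix and number of clusters are placeholders):

```python
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

def score_embedding(X, n_clusters=5):
    """X: (n_regions, n_features) matrix of aggregated 15-minute pick-up/drop-off counts."""
    emb = TSNE(n_components=2, random_state=0).fit_transform(X)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(emb)
    return {
        "silhouette": silhouette_score(emb, labels),
        "calinski_harabasz": calinski_harabasz_score(emb, labels),
        "davies_bouldin": davies_bouldin_score(emb, labels),
    }
```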
This dataset contains files reconstructing single-cell data presented in 'Reference transcriptomics of porcine peripheral immune cells created through bulk and single-cell RNA sequencing' by Herrera-Uribe & Wiarda et al. 2021. Samples of peripheral blood mononuclear cells (PBMCs) were collected from seven pigs and processed for single-cell RNA sequencing (scRNA-seq) in order to provide a reference annotation of porcine immune cell transcriptomics at enhanced, single-cell resolution. Analysis of single-cell data allowed identification of 36 cell clusters that were further classified into 13 cell types, including monocytes, dendritic cells, B cells, antibody-secreting cells, numerous populations of T cells, NK cells, and erythrocytes. Files may be used to reconstruct the data as presented in the manuscript, allowing for individual query by other users. Scripts for original data analysis are available at https://github.com/USDA-FSEPRU/PorcinePBMCs_bulkRNAseq_scRNAseq. Raw data are available at https://www.ebi.ac.uk/ena/browser/view/PRJEB43826. Funding for this dataset was also provided by NRSP8: National Animal Genome Research Program (https://www.nimss.org/projects/view/mrp/outline/18464).
Resources in this dataset:
Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells 10X Format. File Name: PBMC7_AllCells.zip. Resource Description: Zipped folder containing the PBMC counts matrix, gene names, and cell IDs. Files are as follows: matrix of gene counts* (matrix.mtx.gz), gene names (features.tsv.gz), cell IDs (barcodes.tsv.gz). *The ‘raw’ count matrix is actually gene counts obtained following ambient RNA removal. During ambient RNA removal, we specified to calculate non-integer count estimations, so most gene counts are non-integer values in this matrix but should still be treated as raw/unnormalized data that requires further normalization/transformation. Data can be read into R using the function Read10X().
Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells Metadata. File Name: PBMC7_AllCells_meta.csv. Resource Description: .csv file containing metadata for cells included in the final dataset. Metadata columns include: nCount_RNA = the number of transcripts detected in a cell; nFeature_RNA = the number of genes detected in a cell; Loupe = cell barcodes, corresponding to the cell IDs found in the .h5Seurat and 10X-formatted objects for all cells; prcntMito = percent mitochondrial reads in a cell; Scrublet = doublet probability score assigned to a cell; seurat_clusters = cluster ID assigned to a cell; PaperIDs = sample ID for a cell; celltypes = cell type ID assigned to a cell.
Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells PCA Coordinates. File Name: PBMC7_AllCells_PCAcoord.csv. Resource Description: .csv file containing the first 100 PCA coordinates for all cells.
Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells t-SNE Coordinates. File Name: PBMC7_AllCells_tSNEcoord.csv. Resource Description: .csv file containing t-SNE coordinates for all cells.
Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells UMAP Coordinates. File Name: PBMC7_AllCells_UMAPcoord.csv. Resource Description: .csv file containing UMAP coordinates for all cells.
Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - CD4 T Cells t-SNE Coordinates. File Name: PBMC7_CD4only_tSNEcoord.csv. Resource Description: .csv file containing t-SNE coordinates for only CD4 T cells (clusters 0, 3, 4, 28). A dataset of only CD4 T cells can be re-created from PBMC7_AllCells.h5Seurat, and the t-SNE coordinates used in the publication can be re-assigned using this .csv file.
Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - CD4 T Cells UMAP Coordinates. File Name: PBMC7_CD4only_UMAPcoord.csv. Resource Description: .csv file containing UMAP coordinates for only CD4 T cells (clusters 0, 3, 4, 28). A dataset of only CD4 T cells can be re-created from PBMC7_AllCells.h5Seurat, and the UMAP coordinates used in the publication can be re-assigned using this .csv file.
Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - Gamma Delta T Cells UMAP Coordinates. File Name: PBMC7_GDonly_UMAPcoord.csv. Resource Description: .csv file containing UMAP coordinates for only gamma delta T cells (clusters 6, 21, 24, 31). A dataset of only gamma delta T cells can be re-created from PBMC7_AllCells.h5Seurat, and the UMAP coordinates used in the publication can be re-assigned using this .csv file.
Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - Gamma Delta T Cells t-SNE Coordinates. File Name: PBMC7_GDonly_tSNEcoord.csv. Resource Description: .csv file containing t-SNE coordinates for only gamma delta T cells (clusters 6, 21, 24, 31). A dataset of only gamma delta T cells can be re-created from PBMC7_AllCells.h5Seurat, and the t-SNE coordinates used in the publication can be re-assigned using this .csv file.
Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - Gene Annotation Information. File Name: UnfilteredGeneInfo.txt. Resource Description: .txt file containing gene nomenclature information used to assign gene names in the dataset. The 'Name' column corresponds to the name assigned to a feature in the dataset.
Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells H5Seurat. File Name: PBMC7.tar. Resource Description: .h5Seurat object of all cells in the PBMC dataset. The file needs to be untarred, then read into R using the function LoadH5Seurat().
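For users working outside R, a minimal Python sketch can join the per-cell metadata with the deposited t-SNE coordinates; it assumes the coordinate file's first column holds the same cell barcodes as the Loupe column (an assumption, since the coordinate column names are not specified above):

```python
import pandas as pd

# Per-cell metadata and deposited t-SNE coordinates (file names from the listing above)
meta = pd.read_csv("PBMC7_AllCells_meta.csv")
tsne = pd.read_csv("PBMC7_AllCells_tSNEcoord.csv")

# Assumed: first column of the coordinate file = cell barcodes matching meta['Loupe']
tsne = tsne.rename(columns={tsne.columns[0]: "Loupe"})
merged = meta.merge(tsne, on="Loupe", how="inner")

# Number of cells per annotated cell type, as a quick sanity check
print(merged["celltypes"].value_counts())
```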
Dataset and labels for the article 'Keratoconus severity identification using unsupervised machine learning' by Siamak Yousefi.
Additional file 1: Supplementary Figure 1. Related to Fig. 1. MMSP50 is a representative donor of the weight loss cohort. A Examination of baseline alpha diversity demonstrates that MMSP50 is at the 54th ranked percentile for baseline diversity after VLCD. B Their baseline microbiota composition (principal coordinates analysis of Bray-Curtis Dissimilarity) is well within the 95% confidence interval of baseline composition for the cohort (dotted line) and C their change in community structure is at the 19th percentile for change in composition.
Supplementary Figure 2. Related to Fig. 2. No significant changes in energy loss or fecal content after microbial colonization. Metabolic analysis of germ-free (GF) mice and mice inoculated with the AdLib and CalRes human gut microbiota. A-D Energy loss (A), fecal energy content (B), food consumption (C), and energy absorption (D) were measured using bomb calorimetry in GF and colonized mice. E Body weights in g. ** P < 0.01, *** P < 0.001 as determined using 2-way ANOVA with Bonferroni's post-test correction for multiple comparisons. Error bars = SEM.
Supplementary Figure 3. Related to Fig. 3. Differential expression of surface markers in different colonic immune cell clusters of germ-free and colonized mice. A The heatmap shows differentially distributed colonic immune cell phenotypes quantified by PhenoGraph clustering. The distribution of each cell cluster (rows) is shown for each murine sample (columns). B The heatmap shows the distribution of colonic immune lineages based on the expression of canonical lineage markers by t-SNE on all colonic viable CD45+ leukocytes. The differential expression of each selected surface marker (rows) is shown for each immune cell cluster (columns). The significance levels of the comparison between the groups for each immune cell cluster are depicted by semi-supervised hierarchical clustering. The top bubbles denote clusters with significantly different abundances between the groups. Bubble colors indicate which of the two groups being compared has higher average cellular frequencies; bubble size indicates the -log2 FDR-adjusted p-values. Visualization of all colonic viable CD45+ leukocytes by t-SNE. Overlayed colors represent PhenoGraph clusters as defined in the heatmap. C-G Absolute numbers of colonic leukocytes (C), CD4+ T cells (D), CD8+ T cells (E), B cells (F), and NK cells (G) defined by manual gating of mass cytometry data, from germ-free (GF) mice and mice colonized with the AdLib and CalRes human gut microbiota from the top weight loser of an 8-week weight loss intervention study (n=9 or more mice per group). * P<0.05, ANOVA with Bonferroni's post-test correction for multiple comparisons.
Supplementary Figure 4. Related to Fig. 4. Differential expression of surface markers in different splenic immune cell clusters of germ-free and colonized mice. A The heatmap shows differentially distributed splenic immune cell phenotypes quantified by PhenoGraph clustering. The distribution of each cell cluster (rows) is shown for each murine sample (columns). B The heatmap shows the distribution of splenic immune lineages based on the expression of canonical lineage markers by t-SNE on all splenic viable CD45+ leukocytes. The differential expression of each selected surface marker (rows) is shown for each immune cell cluster (columns). The significance levels of the comparison between the groups for each immune cell cluster are depicted by semi-supervised hierarchical clustering. The top bubbles denote clusters with significantly different abundances between the groups. Bubble colors indicate which of the two groups being compared has higher average cellular frequencies; bubble size indicates the -log2 FDR-adjusted p-values. Visualization of all splenic viable CD45+ leukocytes by t-SNE. Overlayed colors represent PhenoGraph clusters as defined in the heatmap. C-G Absolute numbers of splenic leukocytes (C), CD4+ T cells (D), CD8+ T cells (E), B cells (F), and NK cells (G) defined by manual gating of mass cytometry data, from germ-free (GF) mice and mice colonized with the AdLib and CalRes human gut microbiota from the top weight loser of an 8-week weight loss intervention study (n=9 or more mice per group). * P<0.05, ANOVA with Bonferroni's post-test correction for multiple comparisons.
Supplementary Figure 5. Related to Fig. 5. Differential expression of surface markers in different hepatic immune cell clusters of germ-free and colonized mice. A The heatmap shows differentially distributed hepatic immune cell phenotypes quantified by PhenoGraph clustering. The distribution of each cell cluster (rows) is shown for each murine sample (columns). B The heatmap shows the distribution of hepatic immune lineages based on the expression of canonical lineage markers by t-SNE on all hepatic viable CD45+ leukocytes. The differential expression of each selected surface marker (rows) is shown for each immune cell cluster (columns). The significance levels of the comparison between the groups for each immune cell cluster are depicted by semi-supervised hierarchical clustering. The top bubbles denote clusters with significantly different abundances between the groups. Bubble colors indicate which of the two groups being compared has higher average cellular frequencies; bubble size indicates the -log2 FDR-adjusted p-values. Visualization of all hepatic viable CD45+ leukocytes by t-SNE. Overlayed colors represent PhenoGraph clusters as defined in the heatmap. C-G Absolute numbers of hepatic leukocytes (C), CD4+ T cells (D), CD8+ T cells (E), B cells (F), and NK cells (G) defined by manual gating of mass cytometry data, from germ-free (GF) mice and mice colonized with the AdLib and CalRes human gut microbiota from the top weight loser of an 8-week weight loss intervention study (n=9 or more mice per group). ANOVA with Bonferroni's post-test correction for multiple comparisons.
Supplementary Figure 6. Related to Fig. 6. Gut microbial community structure slightly affects composition and activation of liver immune cells. The heatmap shows the latent correlation matrix between abundances of amplicon sequence variants (ASVs) detected in stool samples and all immune parameters analyzed in the liver of mice 21 days after inoculation with the AdLib and CalRes human gut microbiota. Immune parameters are expressed as frequencies, i.e., percent of parent, except those labeled #, which were quantified as absolute cell counts. The heatmap was ordered according to the first principal components of the rows and columns to highlight the cross-correlation structure. Asterisks indicate variables that were selected in L1-penalized sparse canonical correlation analysis (CCA). Circular chord plots display latent correlation between frequencies of manually defined immune subsets and L1-selected ASVs, including the top ten taxa that either positively or negatively associate with the immunological dataset. The blue-to-red colour scale in the heatmap and chords indicates negative and positive correlation values. The color of the row-legend bar and species labels denotes the phylum level. Colors of the column legend bars indicate parental lineage and differentiation level (antigen experience) of lymphocyte subsets, respectively. The boxplot inset shows how experimental groups as a latent variable are not well explained by the sparse canonical covariate.
This repository contains a benchmark dataset and processing scripts for analyzing the energy consumption and processing efficiency of 23 Hugging Face tokenizers applied to 8,212 standardized text chunks across 325 languages.
The dataset includes raw and normalized energy measurements, tokenization times, token counts, and rich structural metadata for each chunk, including script composition, entropy, compression ratios, and character-level features. All experiments were conducted in a controlled environment using PyRAPL on a Linux workstation, with baseline CPU consumption subtracted via linear interpolation.
The resulting data enables fine-grained comparison of tokenizers across linguistic and script diversity. It also supports downstream tasks such as energy-aware NLP, script-sensitive modeling, and benchmarking of future tokenization methods.
Accompanying R scripts are provided to reproduce data processing, regression models, and clustering analyses. Visualization outputs and cluster-level summaries are also included.
All contents are organized into clearly structured folders to support easy access, interpretability, and reuse by the research community.
01_processing_scripts/
R scripts to transform raw data, subtract baseline energy, and produce clean metrics.
multimodel_tokenization_energy.py
⤷ Python script used to tokenize all chunks with 23 models while logging energy and time.
adapting_original_dataset.R
⤷ Reads raw logs and metadata, computes net energy, and outputs cleaned files.
energy_patterns.R
⤷ Performs clustering, regression, t-SNE, and generates all visualizations.
02_raw_data/
Raw output from the tokenization experiment and baseline profiler.
all_models_tokenization.csv
⤷ Full log of 42M+ tokenization runs (23 models × 225 reps × 8,212 chunks).
baseline.csv
⤷ Background CPU energy samples, one per 50 chunks. Used for normalization.
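As a rough illustration of the baseline-subtraction step described above (the released pipeline does this in R via adapting_original_dataset.R; this Python sketch is only an approximation and the column names are assumptions):

```python
import numpy as np
import pandas as pd

# Column names below (chunk_index, energy_mj) are assumed; the released CSVs may differ.
runs = pd.read_csv("02_raw_data/all_models_tokenization.csv")
base = pd.read_csv("02_raw_data/baseline.csv")

# Linearly interpolate the background CPU energy (sampled once per 50 chunks)
# onto every run's chunk index, then subtract it to obtain net energy.
baseline_at_chunk = np.interp(runs["chunk_index"], base["chunk_index"], base["energy_mj"])
runs["net_energy_mj"] = runs["energy_mj"] - baseline_at_chunk

runs.to_csv("03_clean_data/net_energy.csv", index=False)
```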
03_clean_data/
Cleaned, enriched, and reshaped datasets ready for analysis.
net_energy.csv
⤷ Raw tokenization results after baseline energy subtraction (per run).
tokenization_long.csv
⤷ One row per chunk × tokenizer, with medians + token counts.
tokenization_wide.csv
⤷ Wide-format matrix: one row per chunk, one column per tokenizer × metric.
complete.csv
⤷ Fully enriched dataset joining all metrics, metadata, and script distributions.
metadata.csv
⤷ Structural features and script-based character stats per chunk.
04_cluster_outputs/
Outputs from clustering and dimensionality reduction over tokenizer energy profiles.
tokenizer_dendrogram.pdf
⤷ Hierarchical clustering of 23 tokenizers based on energy profiles.
tokenizer_tsne.pdf
⤷ t-SNE projection of tokenizers grouped by energy usage.
mean_energy_per_cluster.csv
⤷ Mean energy consumption (mJ) per language × tokenizer cluster.
sd_energy_per_cluster.csv
⤷ Standard deviation of energy consumption (mJ) per language × cluster.
grid.pdf
⤷ Heatmap of script-wise energy deltas (relative to Latin) for all tokenizers.
By Kelly Garrett [source]
This dataset is a collection of 20,000 questions addressed to the advice columnist Dear Abby, providing valuable insights into American anxieties and concerns from the mid-1980s to 2017. It was used in The Pudding essay titled 30 Years of American Anxieties: What 20,000 letters to an advice columnist tell us about what—and who—concerns us most, published in November 2018.
The dataset includes information such as the URL and title of the articles or publications where the questions were published. It also contains the text of the questions asked by readers. These questions were publicly available on websites as well as obtained from digital copies of newspapers that included Dear Abby sections.
It is important to note that this dataset does not include any updates.
The writers of these questions are predominantly female (approximately two-thirds) based on demographics mentioned by Pauline Phillips, which were collected through a survey she conducted in 1987. However, there is limited information available about their origins or other demographic data. Additionally, it should be acknowledged that only a fraction of all written-in questions were published because advice columnists selectively choose which ones to feature.
Despite these limitations, this dataset offers a glimpse into important societal concerns over time. For instance, it reflects issues like the HIV/AIDS crisis during the 1980s. With over 20,000 questions spanning several decades, it provides a directional understanding of broader trends.
The essay based on this dataset highlighted three main themes: sex, LGBTQ issues, and religion. To analyze these topics further, relevant keywords were used for each issue to create broad groupings and then narrow down into specific categories.
In addition to these themes, questions related to parents, children, friends, and bosses were also explored using a visual clustering technique called t-SNE (t-distributed Stochastic Neighbor Embedding). Manual categorization was also employed by tagging relevant entries within those groupings.
It's important to note that the dataset does not include any information about the dates of publication or data collection.
Overall, this dataset provides a comprehensive view of American anxieties and concerns over several decades, offering insights into cultural shifts and societal issues.
Understanding the Columns: The dataset consists of several columns that provide valuable information about each question.
question_only: This column contains the text of the question asked by the reader.
title: This column contains the title or headline of the article or publication where the question was published.
url: This column contains the URL of the article or publication where the question was published.
Analyzing Specific Topics: The dataset covers various topics that were concerning Americans during this time period. You can use specific keywords related to these topics in combination with text analysis techniques to gain insights into public concerns and attitudes.
Common themes covered in this dataset include sex, LGBTQ issues, religion, parents, children, friends, and bosses.
Keyword Analysis: To analyze specific topics or themes within this dataset effectively, it is recommended to create a list of relevant keywords related to your research interests. These keywords can be used for filtering and searching within columns like question_only or title.
Text Analysis Techniques: You can apply various text analysis techniques to this dataset, such as sentiment analysis, topic modeling (using methods like Latent Dirichlet Allocation), and word frequency analysis, based on your research goals.
Visualizations and Clustering: If you are interested in exploring patterns or trends in the relationships between variables present in each question (such as topic clusters), you can use visualization techniques like t-SNE to create visual representations.
You can also apply other clustering algorithms or network analysis techniques to gain additional insights from the dataset.
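A minimal, hypothetical sketch of such a t-SNE exploration is shown below; the question_only column comes from the dataset, while the file name, the TF-IDF/SVD setup, and the perplexity are assumptions:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

df = pd.read_csv("dear_abby_questions.csv")  # hypothetical file name

# TF-IDF bag of words, reduced with SVD before t-SNE to keep the embedding tractable
tfidf = TfidfVectorizer(max_features=20000, stop_words="english") \
            .fit_transform(df["question_only"].fillna(""))
svd = TruncatedSVD(n_components=50, random_state=0).fit_transform(tfidf)
xy = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(svd)

df[["tsne_x", "tsne_y"]] = xy
df.to_csv("questions_with_tsne.csv", index=False)
```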
Remember to acknowledge the source when using this dataset: The Pudding essay "30 Years of American Anxieties: What 20,000 letters to an advice columnist tell us about what—and who—concerns us most", published in November 2018.
Plea...
Additional file 1: Figure S1. Quality control and batch effect correction in scRNA-Seq, related to Figure 1. A. Violin plots showing the number of expressed genes, the number of reads uniquely mapped against the reference genome, and the fraction of mitochondrial genes compared to all genes per cell in scRNA-Seq data. B. Box plot showing the number of genes (left) and the number of uniquely mapped reads (right) per cell in each identified cell type in scRNA-Seq data. C. tSNE plot visualization of the sample source for all 70,201 cells. Each dot is a cell. Different colors represent different samples. D. tSNE plot visualization of unsupervised clustering analysis for all 70,201 cells based on scRNA-Seq data after quality control, which gave rise to 31 distinct clusters.
Figure S2. Gene Ontology (GO) analysis of the DEGs for each cell type was performed and the representative enriched GO terms are presented, related to Figure 1.
Figure S3. Expression of selected marker genes along the differentiation trajectory, related to Figure 2. A. tSNE plot demonstrating cell cycle regression (left). Visualization of the myogenic differentiation trajectory by cell cycle phases (G1, S, and G2/M) (right). B. Donut plots showing the percentages of cells in G1, S, and G2/M phase at different cell states. C. Expression levels of cell cycle-related genes in the myogenic cells organized into the Monocle trajectory. D. Expression levels of muscle-related genes in the myogenic cells organized into the Monocle trajectory.
Figure S4. Unsupervised clustering analysis for all cells in scATAC-Seq data and myogenic-specific scATAC-seq peaks, related to Figure 4. A-C. tSNE plot visualization of the sample source for all 48,514 cells in scATAC-Seq. Each dot is a cell. Different colors represent different pigs (A), different embryonic stages (B), or different samples (C). D. tSNE plot visualization of unsupervised clustering analysis for all 48,514 cells after quality control in scATAC-Seq data, which gave rise to 15 distinct clusters. E. tSNE plot visualization of myogenic cells and other cells. Clusters 4 and 8 in Figure S4D were annotated as myogenic cells due to their high levels of accessibility of marker genes associated with the myogenic lineage. F. Genome browser view of myogenic-specific peaks at the TSS of MyoG and Myf5 for myogenic cells and other cells in the scATAC-seq dataset.
Figure S5. Percentage distribution of open chromatin elements in scATAC-Seq data, related to Figure 4. A. Distribution of open chromatin elements in each snATAC-seq sample. B. Distribution of open chromatin elements in snATAC-seq of myogenic cell types. C. Percentage distribution of open chromatin elements among DAPs in myogenic cell types.
Figure S6. Integrative analysis of transcription factors and target genes, related to Figure 5. A. tSNE depiction of regulon activity ("on" = blue, "off" = gray), TF gene expression (red scale), and expression of predicted target genes (purple scale) of MyoG, FOSB, and TCF12. B. Corresponding chromatin accessibility in scATAC data for TFs and predicted target genes is depicted.
Figure S7. Pseudotime-dependent chromatin accessibility and gene expression changes, related to Figure 7. The first column shows the dynamics of the 10× Genomics TF enrichment score. The second column shows the dynamics of TF gene expression values, and the third and fourth columns represent the dynamics of the SCENIC-reported target gene expression values of the corresponding TFs, respectively.
Figure S8. Myogenesis-related gene expression in DMD (Duchenne muscular dystrophy) mice. Comparison of RNA-seq data of flexor digitorum brevis (FDB), extensor digitorum longus (EDL), and soleus (SOL) muscles in DMD and wild-type mice at 2 months and 5 months of age. A. The expression levels of myogenesis-related genes (Myod1, Myog, Myf5, Pax7). B. The expression levels of related genes that were upregulated during porcine embryonic myogenesis (EGR1, RHOB, KLF4, SOX8, NGFR, MAX, RBFOX2, ANXA6, HES6, RASSF4, PLS3, SPG21). C. The expression levels of related genes that were downregulated during porcine embryonic myogenesis (COX5A, HOMER2, BNIP3, CNCS). Data were obtained from the GEO database (GSE162455; WT, n = 4; DMD, n = 7).
Figure S9. Genome browser view of differentially accessible peaks at the TSS of EGR1 and RHOB between myogenic cells in the scATAC-seq dataset, related to Figure 8.
Figure S10. Functional analysis of EGR1 in myogenesis, related to Figure 8. A-B. EdU assays for the proliferation of pig primary myogenic cells (A) and C2C12 myoblasts (B) following EGR1 overexpression. C. qPCR analysis of the mRNA levels of cell cycle regulators in C2C12 cells following EGR1 overexpression. D. Immunofluorescence staining for MyHC in C2C12 cells following EGR1 overexpression and differentiation for 3 d; the fusion index was then calculated.
Figure S11. Functional analysis of RHOB in myogenesis, related to Figure 8. A-B. EdU assays for the proliferation of pig primary myogenic cells (A) and C2C12 myoblasts (B) following RHOB overexpression. C. qPCR analysis of the mRNA levels of cell-cycle regulators in C2C12 cells following RHOB overexpression. D. Immunofluorescence staining for MyHC in C2C12 cells following RHOB overexpression and differentiation for 3 d; the fusion index was then calculated.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Perimeter Intrusion Detection Systems (PIDS) are crucial for protecting physical locations by detecting and responding to intrusions around their perimeters. Despite the availability of several PIDS, challenges remain in detection accuracy and precise activity classification. To address these challenges, a new machine learning model is developed. This model uses a pre-trained InceptionV3 network for feature extraction on a PID intrusion image dataset, followed by t-SNE for dimensionality reduction and subsequent clustering. When handling high-dimensional data, the existing Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm faces efficiency issues due to its complexity and varying densities. To overcome these limitations, this research enhances the traditional DBSCAN algorithm: distances between minimum points are computed with the Manhattan distance formula and used to estimate the epsilon value. The effectiveness of the proposed model is evaluated by comparing it to state-of-the-art techniques from the literature. The analysis reveals that the proposed model achieved a silhouette score of 0.86, while comparative techniques failed to produce similar results. This research contributes to societal security by improving perimeter protection, and future researchers can use the developed model for human activity recognition from image datasets.
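A minimal sketch of this type of pipeline (pre-trained InceptionV3 features, t-SNE, then DBSCAN scored with the silhouette coefficient) is shown below; eps, min_samples, and the perplexity are assumed values, not the authors' estimates:

```python
import numpy as np
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.applications.inception_v3 import preprocess_input
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

def cluster_intrusion_images(images: np.ndarray):
    """images: (n_samples, 299, 299, 3) float array with pixel values in 0-255."""
    # 1) Deep features from pre-trained InceptionV3 (global average pooling output)
    backbone = InceptionV3(weights="imagenet", include_top=False, pooling="avg")
    feats = backbone.predict(preprocess_input(images.copy()), verbose=0)

    # 2) Non-linear dimensionality reduction with t-SNE
    emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(feats)

    # 3) DBSCAN with Manhattan distance; eps here is an assumed value, not the paper's estimate
    labels = DBSCAN(eps=3.0, min_samples=10, metric="manhattan").fit_predict(emb)

    # Silhouette score over non-noise points (undefined if fewer than two clusters remain)
    mask = labels != -1
    score = (silhouette_score(emb[mask], labels[mask])
             if len(set(labels[mask])) > 1 else float("nan"))
    return labels, score
```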
ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
Machine learning is a subfield of artificial intelligence (AI) that focuses on enabling computers to learn and make predictions or decisions without being explicitly programmed. Here are some key concepts and terms to help you get started:
Supervised Learning: In supervised learning, the machine learning algorithm learns from labeled training data. The training data consists of input examples and their corresponding correct output or target values. The algorithm learns to generalize from this data and make predictions or classify new, unseen examples.
Unsupervised Learning: Unsupervised learning involves learning patterns and relationships from unlabeled data. Unlike supervised learning, there are no target values provided. Instead, the algorithm aims to discover inherent structures or clusters in the data.
Training Data and Test Data: Machine learning models require a dataset to learn from. The dataset is typically split into two parts: the training data and the test data. The model learns from the training data, and the test data is used to evaluate its performance and generalization ability.
Features and Labels: In supervised learning, the input examples are often represented by features or attributes. For example, in a spam email classification task, features might include the presence of certain keywords or the length of the email. The corresponding output or target values are called labels, indicating the class or category to which the example belongs (e.g., spam or not spam).
Model Evaluation Metrics: To assess the performance of a machine learning model, various evaluation metrics are used. Common metrics include accuracy (the proportion of correctly predicted examples), precision (the proportion of true positives among all positive predictions), recall (the proportion of actual positives that are correctly predicted), and the F1 score (the harmonic mean of precision and recall).
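These metrics can be computed with scikit-learn; the following is a small, self-contained toy example (not tied to any dataset described above):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy binary classification problem
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))
```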
Overfitting and Underfitting: Overfitting occurs when a model becomes too complex and learns to memorize the training data instead of generalizing well to unseen examples. On the other hand, underfitting happens when a model is too simple and fails to capture the underlying patterns in the data. Balancing the complexity of the model is crucial to achieve good generalization.
Feature Engineering: Feature engineering involves selecting or creating relevant features that can help improve the performance of a machine learning model. It often requires domain knowledge and creativity to transform raw data into a suitable representation that captures the important information.
Bias and Variance Trade-off: The bias-variance trade-off is a fundamental concept in machine learning. Bias refers to the errors introduced by the model's assumptions and simplifications, while variance refers to the model's sensitivity to small fluctuations in the training data. Reducing bias may increase variance and vice versa. Finding the right balance is important for building a well-performing model.
Supervised Learning Algorithms: There are various supervised learning algorithms, including linear regression, logistic regression, decision trees, random forests, support vector machines (SVM), and neural networks. Each algorithm has its own strengths, weaknesses, and specific use cases.
Unsupervised Learning Algorithms: Unsupervised learning algorithms include clustering algorithms like k-means clustering and hierarchical clustering, dimensionality reduction techniques like principal component analysis (PCA) and t-SNE, and anomaly detection algorithms, among others.
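A comparable toy sketch for the unsupervised side, combining PCA for dimensionality reduction with k-means clustering (again purely illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Standardize features, reduce to 2 principal components, then cluster
X = StandardScaler().fit_transform(load_iris().data)
X_2d = PCA(n_components=2).fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_2d)
print(labels[:10])
```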
These concepts provide a starting point for understanding the basics of machine learning. As you delve deeper, you can explore more advanced topics such as deep learning, reinforcement learning, and natural language processing. Remember to practice hands-on with real-world datasets to gain practical experience and further refine your skills.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets linked to the publication "Revealing viral and cellular dynamics of HIV-1 at the single-cell level during early treatment periods", Otte et al. 2023, published in Cell Reports Methods. Pre-ART (antiretroviral therapy) cryo-conserved and whole blood specimens were sampled for HIV-1 virus reservoir determination in HIV-1-positive individuals from the Swiss HIV Cohort Study. Patients were monitored for proviral DNA, poly-A transcripts (RNA), late protein translation (Gag and Envelope reactivation co-detection assay, GERDA), and intact viruses (gold standard: viral outgrowth assay, VOA). In this dataset we deposit the pipeline for the multidimensional data analysis of our newly established GERDA method, using DBSCAN and tSNE. For further comprehension, NGS and Sanger sequencing data are attached as processed and raw data (GenBank).
Resubmitted to Cell Reports Methods (Jan 2023), accepted in principle (Mar 2023).
GERDA is a new detection method to decipher the HIV-1 cellular reservoir in blood (tissue or any other specimen). It integrates HIV-1 Gag and Env co-detection along with cellular surface markers to reveal 1) which cells still contain translation-competent HIV-1 and 2) which markers the respective infected cells express. The phenotypic marker repertoire of the cells allows predictions about potential homing and assessment of the HIV-1 (tissue) reservoir. All FACS data were acquired on a BD LSRFortessa machine (markers: CCR7, CD45RA, CD28, CD4, CD25, PD1, IntegrinB7, CLA, HIV-1 Env, HIV-1 Gag). Raw FACS data (pre-gated CD4+CD3+ T cells) were arcsin transformed and dimensionally reduced using opt-SNE. Data were further clustered using DBSCAN, and either individual clusters were analyzed for individual marker expression or expression profiles of all relevant clusters were analyzed by heatmaps. Sequences before/after therapy initiation and during viral outgrowth cultures were monitored for individuals P01-46 and P04-56 by next-generation sequencing (NGS of the HIV-1 Envelope V3 loop only) and by Sanger sequencing (single genome amplification, SGA).
Contents:
- Data normalization code (by Julian Spagnuolo)
- FACS normalized data as CSV (XXX_arcsin.csv)
- OMIQ context file (_OMIQ-context_XXX)
- Arcsin-normalized FACS data after opt-SNE dimension reduction with OMIQ.ai, as CSV (XXXarcsin.csv.csv)
- R pipeline with code (XXX_commented.R)
- P01_46: NGS and Sanger sequences
- P04_56: NGS and Sanger sequences
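A rough Python sketch of the DBSCAN step on already dimension-reduced coordinates is given below; the released pipeline is in R, and the file and column names here are placeholders, not the deposited names:

```python
import pandas as pd
from sklearn.cluster import DBSCAN

# Placeholder file name; the deposited files follow the XXXarcsin.csv.csv pattern above.
coords = pd.read_csv("sample_arcsin.csv.csv")

# Assumed column names for the opt-SNE coordinates.
xy = coords[["optsne_1", "optsne_2"]].to_numpy()

# Density-based clustering of the 2D embedding; eps and min_samples are illustrative values.
coords["cluster"] = DBSCAN(eps=0.5, min_samples=25).fit_predict(xy)
print(coords["cluster"].value_counts())
```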
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
t-SNE silhouette scores based on clusters formed from the t-SNE embeddings of each model, labelled using the ground-truth labels from each dataset. The t-SNE embeddings were calculated from each model's activations at the final convolutional layer. The combined score is the silhouette score obtained when the synthetic-image and photograph testing datasets were combined. The highest silhouette score for each dataset is underlined and italicised.
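For reference, a silhouette score on a t-SNE embedding labelled by ground truth can be computed along these lines (a generic sketch; inputs are placeholders):

```python
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score

def tsne_silhouette(activations, labels, perplexity=30):
    """activations: (n_samples, n_features) final-conv-layer activations; labels: ground-truth classes."""
    embedding = TSNE(n_components=2, perplexity=perplexity, random_state=0).fit_transform(activations)
    return silhouette_score(embedding, labels)
```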
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Quantitative cluster evaluation metrics on t-SNE feature embeddings for different models. Higher Silhouette and Calinski–Harabasz scores indicate better class separation, while lower Davies–Bouldin values denote tighter intra-class clustering.