Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Categorical scatterplots with R for biologists: a step-by-step guide
Benjamin Petre1, Aurore Coince2, Sophien Kamoun1
1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK
Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.
Protocol
• Step 1: format the data set as a .csv file. Store the data in a three-column excel file as shown in Powerpoint slide. The first column ‘Replicate’ indicates the biological replicates. In the example, the month and year during which the replicate was performed is indicated. The second column ‘Condition’ indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column ‘Value’ contains continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import in R.
• Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in Powerpoint slide and paste it in the R console. Execute the script. In the dialog box, select the input .csv file from step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.
• Step 3: save the graph as a .pdf file. Shape the window at your convenience and save the graph as a .pdf file (File -> Save as). See Powerpoint slide for an example.
Notes
• Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.
• Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.
replicates
graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()
References
Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.
Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035
Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comparison experiments by using IF.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Performance of DynGPE.
This dataset represents the CHANGE in the number of jobs per industry category and sub-category from the previous month, not the raw counts of actual jobs. The data behind these monthly change values is from the Bureau of Labor Statistics (BLS) Current Employment Statistics (CES) program. CES data represents businesses and government agencies, providing detailed industry data on employment on nonfarm payrolls.
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F1430847%2F29f7950c3b7daf11175aab404725542c%2FGettyImages-1187621904-600x360.jpg?generation=1601115151722854&alt=media" alt="">
Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data.
In the world of Big Data, data visualization tools and technologies are essential to analyze massive amounts of information and make data-driven decisions
32 cheat sheets: This includes A-Z about the techniques and tricks that can be used for visualization, Python and R visualization cheat sheets, Types of charts, and their significance, Storytelling with data, etc..
32 Charts: The corpus also consists of a significant amount of data visualization charts information along with their python code, d3.js codes, and presentations relation to the respective charts explaining in a clear manner!
Some recommended books for data visualization every data scientist's should read:
In case, if you find any books, cheat sheets, or charts missing and if you would like to suggest some new documents please let me know in the discussion sections!
A kind request to kaggle users to create notebooks on different visualization charts as per their interest by choosing a dataset of their own as many beginners and other experts could find it useful!
To create interactive EDA using animation with a combination of data visualization charts to give an idea about how to tackle data and extract the insights from the data
Feel free to use the discussion platform of this data set to ask questions or any queries related to the data visualization corpus and data visualization techniques
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In modern Industry 4.0 applications, a huge amount of data is acquired during manufacturing processes and is often contaminated with outliers, which can seriously reduce the performance of control charting procedures, especially in complex and high-dimensional settings. In the context of profile monitoring, we propose a new framework that is referred to as robust multivariate functional control chart (RoMFCC) to monitor a multivariate functional quality characteristic while being robust to both functional casewise and componentwise outliers. In the former case, observations of the quality characteristic are contaminated in all functional variables or components, while, in the latter, the contamination affects one or more components independently. The RoMFCC relies on (I) a functional filter to identify componentwise outliers to be replaced by missing components; (II) a robust multivariate functional data imputation method; (III) a casewise robust dimensionality reduction; (IV) a monitoring strategy for the quality characteristic. Through a Monte Carlo simulation study, the RoMFCC is compared with competing schemes that have already appeared in the literature. A case study is finally presented where the proposed framework is used to monitor a resistance spot welding process in the automotive industry. RoMFCC is implemented in the R package funcharts, available online on CRAN.
Analyzing sales data is essential for any business looking to make informed decisions and optimize its operations. In this project, we will utilize Microsoft Excel and Power Query to conduct a comprehensive analysis of Superstore sales data. Our primary objectives will be to establish meaningful connections between various data sheets, ensure data quality, and calculate critical metrics such as the Cost of Goods Sold (COGS) and discount values. Below are the key steps and elements of this analysis:
1- Data Import and Transformation:
2- Data Quality Assessment:
3- Calculating COGS:
4- Discount Analysis:
5- Sales Metrics:
6- Visualization:
7- Report Generation:
Throughout this analysis, the goal is to provide a clear and comprehensive understanding of the Superstore's sales performance. By using Excel and Power Query, we can efficiently manage and analyze the data, ensuring that the insights gained contribute to the store's growth and success.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Associated paper: Spatio-Temporal Graph Neural Network for Urban Spaces: Interpolating Citywide Traffic Volume
Authors: Silke K. Kaiser, Filipe Rodrigues, Carlos Lima Azevedo, Lynn H. Kaack
If you use this dataset, please cite our paper:
Kaiser, S. K., Rodrigues, F., Azevedo Lima, C., & Kaack, L.H. (2025). Spatio-Temporal Graph Neural Network for Urban Spaces: Interpolating Citywide Traffic Volume. [published on arXiv].
This dataset includes street-level traffic volume data for two major urban areas:
Berlin (Strava Cycling Data): Daily bicycle traffic volumes from 2019–2023, aggregated from publicly shared Strava user data.
New York City (Taxi Data): Hourly motorized traffic volumes from Manhattan for January–February 2016, derived from GPS trajectories of yellow taxis.
Both datasets are provided at the street-segment level and come with rich auxiliary features capturing spatial, temporal, infrastructure, and contextual information.
Each city includes:
:
Full feature table for each street segment, including traffic volume and auxiliary features.
:
Geometry for each street segment.
:
Binary adjacency matrix.
:
Adjacency matrix weighted by node feature similarity.
:
Adjacency matrix based on Euclidean (bird’s-eye) distance.
:
Adjacency matrix based on real-world road network distance.
:
Adjacency matrix weighted by estimated travel time over the road network.
Volume Estimation: Strava volumes are rounded aggregates of bike trips; NYC volumes are computed from reconstructed taxi trajectories.
Filtering: Extreme outliers (e.g., from special events) are filtered per segment to focus on typical traffic conditions.
Auxiliary Features:
Built environment (e.g., speed limits, road types, lane counts)
Points of Interest (e.g., shops, schools, transit stops)
Network connectivity metrics (degree, betweenness, etc.)
Temporal indicators (weekday, holidays, hour, month)
Weather data (sunshine, precipitation, temperature)
Socioeconomic indicators (Berlin only)
Proxy motorized traffic metrics (Berlin only)
See the paper for a complete list of features and detailed methodology.
-------------------
We are grateful the European Union’s Horizon Europe research and innovation program funded this project under Grant Agreement No 101057131, Climate Action To Advance HeaLthY Societies in Europe (CATALYSE).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Values in parentheses are 95% credible intervals. Results are listed for a range of prior weights on the null model.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Traditional subspace feature selection methods typically rely on a fixed distance to compute residuals between the original and feature reconstruction spaces. However, this approach struggles to adapt to diverse datasets and often fails to handle noise and outliers effectively. In this paper, we propose an unsupervised feature selection method named unsupervised feature selection algorithm based on -norm feature reconstruction (NFRFS). Employing a flexible norm to represent both the original space and the spatial distance of feature reconstruction, enhances adaptability and broadens its applicability by adjusting p. Additionally, adaptive graph learning is integrated into the feature selection process to preserve the local geometric structure of the data. Features exhibiting sparsity and low redundancy are selected through the regularization constraint of the inner product in the feature selection matrix. To demonstrate the effectiveness of the method, numerical studies were conducted on 14 benchmark datasets. Our results indicate that the method outperforms 10 unsupervised feature selection algorithms in terms of clustering performance.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Natural plant populations are often adapted to their local climate and environmental conditions, and populations of forest trees offer some of the best examples of this pattern. However, little empirical work has focused on the relative contribution of single-locus versus multilocus effects to the genetic architecture of local adaptation in plants/forest trees. Here, we employ eastern white pine (Pinus strobus) to test the hypothesis that it is the inter-genic effects that primarily drive climate-induced local adaptation. The genetic structure of 29 range-wide natural populations of eastern white pine was determined in relation to local climatic factors using both a reference set of SSR markers, and SNPs located in candidate genes putatively involved in adaptive response to climate. Comparisons were made between marker sets using standard single-locus outlier analysis, single-locus and multilocus environment association analyses and a novel implementation of Population Graphs. Magnitudes of population structure were similar between the two marker sets. Outlier loci consistent with diversifying selection were rare for both SNPs and SSRs. However, genetic distances based on the multilocus among population covariances (cGD) were significantly more correlated to climate, even after correcting for spatial effects, for SNPs as compared to SSRs. Coalescent simulations confirmed that the differences in mutation rates between SSRs and SNPs did not affect the topologies of the Population Graphs, and hence values of cGD and their correlations with associated climate variables. We conclude that the multilocus covariances among populations primarily reflect adaptation to local climate and environment in eastern white pine. This result highlights the complexity of the genetic architecture of adaptive traits, as well as the need to consider multilocus effects in studies of local adaptation.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The crcc T2 Revised statistics.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Addition-point OLS matrix, B.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Augmented Projector.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Bolded values are those with P-values < 0.05.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Differences in R2 values between marker types are significant using a permutation approach.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Pulp-fibre Dataset.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Similarity Matrix sij=(bi−bj)′(Σ^(B))−1(bi−bj).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The details on these candidate genes, including EST loci, GenBank and TreeGenes database ID, and references for the identification of biological functions are provided in S2 Table.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The crcc, RMVE, RMCD, and classical Hotelling’s T2 statistics.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Categorical scatterplots with R for biologists: a step-by-step guide
Benjamin Petre1, Aurore Coince2, Sophien Kamoun1
1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK
Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.
Protocol
• Step 1: format the data set as a .csv file. Store the data in a three-column excel file as shown in Powerpoint slide. The first column ‘Replicate’ indicates the biological replicates. In the example, the month and year during which the replicate was performed is indicated. The second column ‘Condition’ indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column ‘Value’ contains continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import in R.
• Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in Powerpoint slide and paste it in the R console. Execute the script. In the dialog box, select the input .csv file from step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.
• Step 3: save the graph as a .pdf file. Shape the window at your convenience and save the graph as a .pdf file (File -> Save as). See Powerpoint slide for an example.
Notes
• Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.
• Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.
replicates
graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()
References
Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.
Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035
Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128