MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The R scripts contain statistical analyses of streamflow and sediment data, including Flow Duration Curves, Double Mass Analysis, Nonlinear Regression Analysis for Suspended Sediment Rating Curves, and Stationarity Tests, and produce several plots.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Categorical scatterplots with R for biologists: a step-by-step guide
Benjamin Petre1, Aurore Coince2, Sophien Kamoun1
1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK
Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.
Protocol
• Step 1: format the data set as a .csv file. Store the data in a three-column Excel file as shown in the PowerPoint slide. The first column, ‘Replicate’, identifies the biological replicates; in the example, it gives the month and year in which each replicate was performed. The second column, ‘Condition’, gives the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column, ‘Value’, contains the continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import into R.
• Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in the PowerPoint slide, paste it into the R console, and execute it. In the dialog box, select the input .csv file from Step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.
• Step 3: save the graph as a .pdf file. Resize the window as needed and save the graph as a .pdf file (File -> Save as). See the PowerPoint slide for an example.
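The input format from Step 1 can be checked with a short sketch before plotting. The example below is written in Python rather than R and is not part of the authors' script; only the three column names come from the protocol, while the sample rows and the function name are made up for illustration.

```python
import csv
import io
from collections import defaultdict
from statistics import median

# Hypothetical example rows; only the three column names come from the protocol.
EXAMPLE_CSV = """Replicate,Condition,Value
Jan-2016,WT,10.0
Jan-2016,Mutant A,4.0
Jan-2016,Mutant B,7.5
Feb-2016,WT,12.0
Feb-2016,Mutant A,5.0
Feb-2016,Mutant B,8.5
"""


def group_values_by_condition(csv_text):
    """Return {condition: [values]} from a Replicate/Condition/Value table."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row["Condition"]].append(float(row["Value"]))
    return dict(groups)


groups = group_values_by_condition(EXAMPLE_CSV)
for condition, values in groups.items():
    print(condition, "n =", len(values), "median =", median(values))
```

A table that fails to parse here (for example, a misspelled header) would likely also fail when imported into R.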
Notes
• Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, go to Packages & Data -> Package Installer, enter ‘ggplot2’ in the Package Search field, and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.
• Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.
replicates
graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()
References
Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.
Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035.
Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128.
This dataset contains 15 minute average mobile unit box plot imagery of CO, NO2, O3, PM10, and SO2 collected during the MILAGRO field project.
Additional file 2: Supplemental Figure 1. Flowcharts of the analytic samples for FOCUS 1.0 and FOCUS 2.0 survey waves. Supplemental Figure 2A. Box and whisker plots comparing FOCUS safety climate scores by size variables for FOCUSv.1.0 departments. Supplemental Figure 2B. Box and whisker plots comparing FOCUS safety climate scores by size variables for FOCUSv.2.0 departments.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
S4 Table. Box plot and the statistical analysis for the diameters measured for the NCLPs obtained by AFM.
This data release contains time series and plots summarizing mean monthly temperature (TAVE), total monthly precipitation (PPT), and runoff (RO) from the U.S. Geological Survey Monthly Water Balance Model at 115 National Wildlife Refuges within the U.S. Fish and Wildlife Service Mountain-Prairie Region (CO, KS, MT, NE, ND, SD, UT, and WY). These three variables are derived from two sets of statistically downscaled general circulation models from 1951 through 2099. The three variables (TAVE, PPT, and RO for refuge areas) were summarized for comparison across four 19-year periods: historic (1951-1969), baseline (1981-1999), 2050 (2041-2059), and 2080 (2071-2089). For each refuge, mean monthly plots, seasonal box plots, and annual envelope plots were produced for each of the four periods.
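As a rough illustration of how the four 19-year periods partition the record, the sketch below assigns a year to its period and averages values within each period. The period boundaries come from the description above; the function names and the idea of averaging a simple year-to-value mapping are assumptions and do not reflect the actual processing behind the data release.

```python
# Period boundaries as listed in the data release description.
PERIODS = {
    "historic": (1951, 1969),
    "baseline": (1981, 1999),
    "2050": (2041, 2059),
    "2080": (2071, 2089),
}


def period_for_year(year):
    """Return the period name containing `year`, or None if it falls in a gap."""
    for name, (start, end) in PERIODS.items():
        if start <= year <= end:
            return name
    return None


def mean_by_period(yearly_values):
    """Average a {year: value} mapping within each summary period."""
    sums, counts = {}, {}
    for year, value in yearly_values.items():
        name = period_for_year(year)
        if name is None:
            continue  # years between periods are not summarized
        sums[name] = sums.get(name, 0.0) + value
        counts[name] = counts.get(name, 0) + 1
    return {name: sums[name] / counts[name] for name in sums}
```

Years such as 1975 or 2030 fall between the periods and are skipped by this sketch.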
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Figures in scientific publications are critically important because they often show the data supporting key findings. Our systematic review of research articles published in top physiology journals (n = 703) suggests that, as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies. Papers rarely included scatterplots, box plots, and histograms that allow readers to critically evaluate continuous data. Most papers presented continuous data in bar and line graphs. This is problematic, as many different data distributions can lead to the same bar or line graph. The full data may suggest different conclusions from the summary statistics. We recommend training investigators in data presentation, encouraging a more complete presentation of data, and changing journal editorial policies. Investigators can quickly make univariate scatterplots for small sample size studies using our Excel templates.
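The point that different distributions can produce the same bar graph can be illustrated with a small sketch. The two groups below are made-up numbers, not data from the review: they share the same mean, so a bar graph showing only the mean would draw identical bars, while a scatterplot would reveal one tight cluster and one two-cluster pattern.

```python
from statistics import mean, stdev

# Made-up values chosen so that both groups share the same mean.
clustered = [4.0, 5.0, 5.0, 5.0, 5.0, 6.0]
bimodal = [1.0, 1.0, 1.0, 9.0, 9.0, 9.0]

for name, values in (("clustered", clustered), ("bimodal", bimodal)):
    print(
        name,
        "mean =", mean(values),
        "sd =", round(stdev(values), 2),
        "min =", min(values),
        "max =", max(values),
    )
```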
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
RSV box-and-whisker diagram data for the search terms "malnutrition," "frailty," "sarcopenia," and "cachexia" from January 1, 2018 to January 1, 2022. The data are split into the periods before and after the declaration of the COVID-19 pandemic.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This database examines performance inconsistency in biomass HHV models based on ultimate analysis. The null hypothesis is that the rank of a biomass HHV model is consistent. Fifteen biomass models are trained and tested on four datasets. Invariance of the ranks of these 15 models across the datasets indicates performance consistency.
The database includes the datasets and source code used to analyze the performance consistency of the biomass HHV models. The datasets are stored in tabular form in an Excel workbook. The source code implements the biomass HHV machine learning models using MATLAB object-oriented programming (OOP). These machine learning models consist of eight regressions, four supervised learning models, and three neural networks.
An Excel workbook, "BiomassDataSetUltimate.xlsx," collects the research datasets in six worksheets. The first worksheet, "Ultimate," contains 908 HHV data points from 20 literature sources. The column names of the worksheet indicate the elements of the ultimate analysis on a % dry basis. The HHV column gives the higher heating value in MJ/kg. The next worksheet, "Full Residuals," stores the residuals from model testing based on 20-fold cross-validation. The article (Kijkarncharoensin & Innet, 2021) verifies the performance consistency through these residuals. The other worksheets present the literature datasets used to train and test model performance in many published studies.
A file named "SourceCodeUltimate.rar" collects the MATLAB machine learning models implemented in the article. The folders in this file reflect the class structure of the machine learning models. These classes extend MATLAB's Statistics and Machine Learning Toolbox to support, e.g., k-fold cross-validation. The MATLAB script named "runStudyUltimate.m" is the article's main program for analyzing the performance consistency of the biomass HHV models through the ultimate analysis. The script loads the datasets from the Excel workbook and automatically fits the biomass models through the OOP classes.
The first section of the MATLAB script generates the most accurate model by optimizing the model's hyperparameters. On the first run, training the machine learning models by trial and error takes a few hours. The trained models can be saved in a MATLAB .mat file and loaded back into the MATLAB workspace. The rest of the script, separated by script section breaks, performs the residual analysis to inspect the performance consistency. It also produces a 3D scatter plot of the biomass data and box plots of the prediction residuals. The interpretation of these results is discussed in the authors' article.
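The idea of rank consistency can be sketched in a few lines. The example below is written in Python rather than MATLAB and is not taken from the authors' source code: it ranks placeholder models by a single error value per dataset and checks whether the ordering is identical everywhere. The model names, dataset names, and error values are hypothetical, and the article's residual-based analysis under 20-fold cross-validation is more involved than this comparison.

```python
def rank_models(errors):
    """Return model names ordered from lowest to highest error."""
    return sorted(errors, key=errors.get)


def ranks_are_consistent(errors_by_dataset):
    """True if every dataset yields the same model ordering."""
    rankings = [rank_models(errors) for errors in errors_by_dataset.values()]
    return all(ranking == rankings[0] for ranking in rankings)


# Hypothetical error values for three placeholder models on two datasets.
errors_by_dataset = {
    "dataset_1": {"model_a": 1.2, "model_b": 0.9, "model_c": 1.5},
    "dataset_2": {"model_a": 1.0, "model_b": 1.1, "model_c": 1.6},
}
print(ranks_are_consistent(errors_by_dataset))
```

Here the two best models swap places between the datasets, so the ordering is not consistent.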
Reference: Kijkarncharoensin, A., & Innet, S. (2022). Performance inconsistency of the Biomass Higher Heating Value (HHV) Models derived from Ultimate Analysis [Manuscript in preparation]. University of the Thai Chamber of Commerce.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Graphs are powerful and versatile data structures that can be used to represent a wide range of different types of information. In this paper, we introduce a method to analyze and then visualize an important class of data described over a graph: ensembles of paths. Analysis of such path ensembles is useful in a variety of applications, in diverse fields such as transportation, computer networks, and molecular dynamics. The proposed method generalizes the concept of band depth to an ensemble of paths on a graph, which provides a center-outward ordering on the paths. This ordering is, in turn, used to construct a generalization of the conventional boxplot or whisker plot, called a path boxplot, which applies to paths on a graph. The utility of the path boxplot is demonstrated for several examples of path ensembles, including paths defined over computer networks and roads.
https://creativecommons.org/publicdomain/zero/1.0/
Data Visualization
a. Scatter plot
i. The webapp should allow the user to select genes from datasets and plot 2D scatter plots between 2 variables (expression/copy_number/chronos) for any pair of genes.
ii. The user should be able to filter and color data points using metadata information available in the file “metadata.csv”.
iii. The visualization could be interactive - It would be great if the user can hover over the data points on the plot and get the relevant information (hint - visit https://plotly.com/r/, https://plotly.com/python)
iv. Here is a quick reference for you. The scatter plot is between chronos score for TTBK2 gene and expression for MORC2 gene with coloring defined by Gender/Sex column from the metadata file.
b. Boxplot/violin plot
i. User should be able to select a gene and a variable (expression / chronos / copy_number) and generate a boxplot to display its distribution across multiple categories as defined by user selected variable (a column from the metadata file)
ii. Here is an example for your reference where violin plot for CHRONOS score for gene CCL22 is plotted and grouped by ‘Lineage’
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
###############################################################################
### Source code and problem instances:
###############################################################################
- The source code of METAFOR will be made available on github once the paper has been accepted for publication. In the meantime, the code is provided in METAFOR.zip
- The list of problem instances (single objective continuous functions) is provided in the attached file "instances.zip"
###############################################################################
###############################################################################
### Experiment folders:
###############################################################################
The folders "default", "leave25OUT", "leave25OUTCEC14", "leaveLDO" and "leaveLDOCEC14" (compressed in zip format to save space) contain all the data collected during the experiment. Each of them contains the following folders/files:
- "candidates/candidates.txt" -- a text file with the algorithms specified as command lines,
- "candidates/OUTPUT" -- a folder containing a text file with the best solutions found by each algorithm for each problem instance. The names of the files in the folder are composed of the test suite "c_<0,1,2,3>", the number of the function in the test suite "f_<0,...,n>", and the number of dimensions "d_<50,100,500,750,...,d>".
- "DataAndPlots" -- a folder created automatically by the "plot_bxp_rtd_wlx.sh" processing script (see below).
Inside this folder are the following subfolders:
- "DataAndPlots/Bxp" -- stores the box plots created based on the data in "DataAndPlots/Data";
- "DataAndPlots/Cvg" -- (if any) stores the convergence plots generated based on the data in "OUTPUT_processed";
- "DataAndPlots/Data" -- stores the processed data and statistical information (median, median error, statistical test, etc.) of the raw data stored in "candidates/OUTPUT";
- "DataAndPlots/Time" -- stores the average time taken by the algorithms on each problem instance.
- "DataAndPlots/Table" -- (if any) stores, in plain text and pseudo-LaTeX format, the tables of results reported in the paper, i.e., median, median error, median absolute deviation, rankings, statistical tests, and number of wins.
- "OUTPUT" -- stores the convergence data of the algorithms (i.e. function evaluations vs. solution quality).
In the latest version of METAFOR, each convergence file consists of 100 points; however, in the version we used for the experiments reported in the paper, each file consists of thousands of points per algorithm, making this folder particularly heavy.
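The naming scheme of the files in "candidates/OUTPUT" (test suite "c_", function number "f_", dimensions "d_") lends itself to automatic parsing. The following Python sketch is not part of the METAFOR scripts; the example file name, the function name, and the assumption that each token is followed directly by an integer are hypothetical, since the exact separators are not specified above.

```python
import re

# Each token is matched independently, so the sketch does not depend on
# the exact separators or token order used in the real file names.
TOKEN_PATTERNS = {
    "suite": re.compile(r"c_(\d+)"),
    "function": re.compile(r"f_(\d+)"),
    "dimensions": re.compile(r"d_(\d+)"),
}


def parse_output_name(filename):
    """Extract test suite, function number, and dimensions from a file name."""
    parsed = {}
    for key, pattern in TOKEN_PATTERNS.items():
        match = pattern.search(filename)
        if match is None:
            raise ValueError(f"missing {key} token in {filename!r}")
        parsed[key] = int(match.group(1))
    return parsed


# Hypothetical file name; the real layout may differ.
print(parse_output_name("c_1_f_12_d_500.txt"))
```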
The data in the folders "candidates/OUTPUT" and "OUTPUT" are gathered by the script "runMe.sh", given an experiment folder and an instances file, namely:
- in the case of "default", we solved instances "test_MIXTURE_max200.txt" and "test_MIXTURE_onlyLargeScale.txt";
- in the case of "leave25OUT", we solved instances "test_MIXTURE_max200.txt", "test_MIXTURE_onlyLargeScale.txt" and "test_MIXTURE_onlyLargeScaleDND.txt";
- in the case of "leave25OUTCEC14", we solved instances "test_CEC14.txt";
- in the case of "leaveLDO", we solved instances "test_MIXTURE_max200.txt", "test_MIXTURE_onlyLargeScale.txt" and "test_MIXTURE_onlyLargeScaleDND.txt";
- in the case of "leaveLDOCEC14", we solved instances "test_CEC14.txt".
###############################################################################
###############################################################################
### Processing scripts:
###############################################################################
The scripts folder (scripts.zip) contains the main processing script "plot_bxp_rtd_wlx.sh" and several auxiliary R and shell scripts: "boxplot.R", "cvg_log.R", "wilcoxon.R", "ranksPerClass.R", "filter_repeating.sh", "full_outer_join.sh", and "replace_na.sh". All the auxiliary scripts are called automatically by the main script, depending on the options specified by the user, and can also be used standalone. The auxiliary R scripts are used to generate the box plots ("boxplot.R") and convergence plots ("cvg_log.R"), to perform the statistical test ("wilcoxon.R"), and to compute the rankings ("ranksPerClass.R"). The auxiliary shell scripts are used to clean the raw data stored in "OUTPUT" and create a file called data-*-mean.txt for each data file, which can be passed to "cvg_log.R" to generate the convergence plots.
###############################################################################
###############################################################################
### Folders with experiments:
###############################################################################
Since the paper reports different sets of algorithms solving different sets of problems, we created a folder (compressed in zip format for space reasons) for each of them and put in it only the specific data that we want to analyze and plot. The data inside these folders are simply copied from the main experiment folders and are organized as follows:
- "METAFOR/exp1_dftVStuned" contains the results discussed in section 5.3.1 of the paper.
- "METAFOR/exp2_mtfVSHyb" contains the results discussed in section 5.3.2 of the paper.
- "METAFOR/exp3_CEC14" and "METAFOR/exp4_LS" contain the results discussed in section 5.3.3 of the paper.
###############################################################################
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
BOX reported $1.67B in Assets for its fiscal quarter ending in January of 2025. Data for BOX - Assets, including historical data, tables, and charts, were last updated by Trading Economics in July of 2025.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
BOX reported $194.01M in Net Income for its fiscal quarter ending in January of 2025. Data for BOX - Net Income, including historical data, tables, and charts, were last updated by Trading Economics in July of 2025.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Estimate of Median Household Income for Box Butte County, NE was $66,613 in January of 2023, according to the United States Federal Reserve. Historically, Estimate of Median Household Income for Box Butte County, NE reached a record high of $66,762 in January of 2022 and a record low of $31,555 in January of 1989. Trading Economics provides the current actual value, a historical data chart, and related indicators for Estimate of Median Household Income for Box Butte County, NE - last updated from the United States Federal Reserve in June of 2025.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset provides supporting data for the figures presented in our study on electric vehicle (EV) usage and charging behavior across major Chinese cities. The detailed analysis and raw data are thoroughly described in Zhan et al (2025). The study examines 1.69 million EVs, representing 42% of China's total EV fleet, from November 2020 to October 2021. The study provides insights into operational demands, infrastructure requirements, and energy consumption patterns by analyzing diverse vehicle types—including private cars, taxis, buses, and special purpose vehicles (SPVs).
The purpose of this dataset is to enable researchers who do not have access to the same raw data to replicate, calibrate, or extend our findings using the processed data that underpins each figure. This resource is valuable for further research on EV infrastructure planning, energy consumption, and vehicle performance. This dataset is made available to help the research community leverage our findings and facilitate advancements in electric vehicle research and infrastructure planning. Please refer to Zhan et al (2025) for full details on the methodology and analysis.
This dataset includes the processed data underlying each figure in Zhan et al (2025), covering various aspects of EV usage, battery capacity, and charging behavior across seven major Chinese cities: Beijing, Shanghai, Guangzhou, Shenzhen, Nanjing, Chengdu, and Chongqing. The dataset is organized to correspond directly with the figures in the paper, facilitating its use for further analysis and model calibration. Each dataset is aligned with specific figures, providing essential data to help researchers without access to the original raw data.
Fig1a.Distribution of EV types across selected Chinese cities
File: Fig1a.Distribution of EV types across selected Chinese cities.csv
Description: Distribution of EV types across seven cities, detailing the share of different vehicle types.
Column | Description | Data type | Unit
Beijing | Distribution of EV types in Beijing | Float | %
Shenzhen | Distribution of EV types in Shenzhen | Float | %
Shanghai | Distribution of EV types in Shanghai | Float | %
Guangzhou | Distribution of EV types in Guangzhou | Float | %
Chengdu | Distribution of EV types in Chengdu | Float | %
Chongqing | Distribution of EV types in Chongqing | Float | %
Nanjing | Distribution of EV types in Nanjing | Float | %
Fig1b.Distribution of battery energy by vehicle types
File: Fig1b.Distribution of battery energy by vehicle types.csv
Description: Distribution of battery energy across different vehicle types, represented as box plot statistics.
Column | Description | Data type | Unit
type_2 | Vehicle types | String | -
Lower Whisker | The battery energy corresponding to the Lower Whisker of the box plot. | Float | kWh
Q1 (25%) | The 25th percentile value of battery energy. | Float | kWh
Median (50%) | The median value of battery energy. | Float | kWh
Q3 (75%) | The 75th percentile value of battery energy. | Float | kWh
Upper Whisker | The battery energy corresponding to the Upper Whisker of the box plot. | Float | kWh
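The five columns of box plot statistics can be related to raw values with a short sketch. The source does not state how the whiskers were defined, so the example below assumes the common Tukey convention (whiskers at the most extreme observations within 1.5 times the interquartile range of the quartiles) and an inclusive quartile method; the function name and the sample values are made up and are not taken from the dataset.

```python
from statistics import quantiles


def box_plot_stats(values, whisker_factor=1.5):
    """Five box plot statistics using the Tukey whisker convention (an assumption)."""
    data = sorted(values)
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker_factor * iqr
    high_fence = q3 + whisker_factor * iqr
    # Whiskers end at the most extreme observations inside the fences.
    lower_whisker = min(v for v in data if v >= low_fence)
    upper_whisker = max(v for v in data if v <= high_fence)
    return {
        "Lower Whisker": lower_whisker,
        "Q1 (25%)": q1,
        "Median (50%)": med,
        "Q3 (75%)": q3,
        "Upper Whisker": upper_whisker,
    }


# Made-up battery energies in kWh, with one high outlier.
print(box_plot_stats([20, 30, 40, 50, 60, 70, 80, 90, 300]))
```

With these sample values the outlier at 300 kWh lies beyond the upper fence, so the upper whisker stops at 90 kWh.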
Fig1c.Variations of battery energy of buses
File: Fig1c.Variations of battery energy of buses across studied cities.csv
Description: Battery energy variations for buses across the studied cities.
Column | Description | Data type | Unit
city_En | English name of the city (one of the 7 Chinese cities) | String | -
Lower Whisker | The battery energy of buses corresponding to the Lower Whisker of the box plot. | Float | kWh
Q1 (25%) | The 25th percentile value of battery energy of buses. | Float | kWh
Median (50%) | The median value of battery energy of buses. | Float | kWh
Q3 (75%) | The 75th percentile value of battery energy of buses. | Float | kWh
Upper Whisker | The battery energy of buses corresponding to the Upper Whisker of the box plot. | Float | kWh
Fig1d.Variations of battery energy of SPVs
File: Fig1c.Variations of battery energy of SPVs across studied cities.csv
Description: Battery energy variations for special purpose vehicles (SPVs) across cities.
Column | Description | Data type | Unit
city_En | English name of the city (one of the 7 Chinese cities) | String | -
Lower Whisker | The battery energy of SPVs corresponding to the Lower Whisker of the box plot. | Float | kWh
Q1 (25%) | The 25th
Script graphs box plots of DBI scores for all metro areas, grouping by year and metropolitan area population size (larger or smaller than 250,000 people). Additional scripts create different graphs. Data are provided in both "long" and "tall" formats.
PlotQA is a VQA dataset with 28.9 million question-answer pairs grounded over 224,377 plots on data from real-world sources and questions based on crowd-sourced question templates. Existing synthetic datasets (FigureQA, DVQA) for reasoning over plots do not contain variability in data labels, real-valued data, or complex reasoning questions. Consequently, proposed models for these datasets do not fully address the challenge of reasoning over plots. In particular, they assume that the answer comes either from a small fixed size vocabulary or from a bounding box within the image. However, in practice this is an unrealistic assumption because many questions require reasoning and thus have real valued answers which appear neither in a small fixed size vocabulary nor in the image. In this work, we aim to bridge this gap between existing datasets and real world plots by introducing PlotQA. Further, 80.76% of the out-of-vocabulary (OOV) questions in PlotQA have answers that are not in a fixed vocabulary.
https://fred.stlouisfed.org/legal/#copyright-public-domain
Graph and download economic data for Producer Price Index by Industry: Corrugated and Solid Fiber Box Manufacturing: Corrugated and Solid Fiber Boxes, Including Pallets (PCU3222113222110) from Mar 1980 to May 2025 about fiber, paper, manufacturing, PPI, industry, inflation, price index, indexes, price, and USA.
https://fred.stlouisfed.org/legal/#copyright-public-domain
Graph and download economic data for Producer Price Index by Industry: Corrugated and Solid Fiber Box Manufacturing: Corrugated Shipping Containers for All Other End Uses (PCU32221132221104) from Mar 1980 to May 2025 about end use, fiber, paper, manufacturing, PPI, industry, inflation, price index, indexes, price, and USA.