Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Categorical scatterplots with R for biologists: a step-by-step guide
Benjamin Petre1, Aurore Coince2, Sophien Kamoun1
1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK
Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.
Protocol
• Step 1: format the data set as a .csv file. Store the data in a three-column Excel file as shown on the PowerPoint slide. The first column ‘Replicate’ indicates the biological replicates. In the example, the month and year during which the replicate was performed are indicated. The second column ‘Condition’ indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column ‘Value’ contains continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import into R.
• Step 2: execute the R script (see Notes 1 and 2). Copy the script shown on the PowerPoint slide and paste it into the R console. Execute the script. In the dialog box, select the input .csv file from Step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.
• Step 3: save the graph as a .pdf file. Resize the window as you like and save the graph as a .pdf file (File -> Save as). See the PowerPoint slide for an example.
Notes
• Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, go to Packages & Data -> Package Installer, enter ‘ggplot2’ in the Package Search field and click ‘Get List’. Select ‘ggplot2’ in the Package column and click ‘Install Selected’. Install all dependencies as well.
• Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.
graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()
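The script itself is distributed on the accompanying PowerPoint slide and is not reproduced here. Purely as an illustration, a minimal R sketch that follows the three-column layout from Step 1 (columns ‘Replicate’, ‘Condition’, ‘Value’) might look like this:

# Minimal sketch of a categorical scatterplot (boxplot with jittered dots),
# assuming the three-column .csv layout described in Step 1.
library(ggplot2)
data <- read.csv(file.choose())               # dialog box to select the input .csv
data$Replicate <- as.factor(data$Replicate)
graph <- ggplot(data, aes(x = Condition, y = Value))
graph + geom_boxplot(outlier.colour = 'black', colour = 'black') +
  geom_jitter(aes(col = Replicate)) +
  theme_bw()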
References
Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.
Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035
Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128
The U.S. Geological Survey, in cooperation with the U.S. Environmental Protection Agency's Long Island Sound Study (https://longislandsoundstudy.net), characterized nitrogen export from forested watersheds and whether nitrogen loading has been increasing or decreasing to help inform Long Island Sound management strategies. The Weighted Regressions on Time, Discharge, and Season (WRTDS; Hirsch and others, 2010) method was used to estimate annual concentrations and fluxes of nitrogen species using long-term records (14 to 37 years in length) of stream total nitrogen, dissolved organic nitrogen, nitrate, and ammonium concentrations and daily discharge data from 17 watersheds located in the Long Island Sound basin or in nearby areas of Massachusetts, New Hampshire, or New York. This data release contains the input water-quality and discharge data, annual outputs (including concentrations, fluxes, yields, and confidence intervals about these estimates), statistical tests for trends between the periods of water years 1999-2000 and 2016-2018, and model diagnostic statistics. These datasets are organized into one zip file (WRTDSeLists.zip) and six comma-separated values (csv) data files (StationInformation.csv, AnnualResults.csv, TrendResults.csv, ModelStatistics.csv, InputWaterQuality.csv, and InputStreamflow.csv). The csv file (StationInformation.csv) contains information about the stations and input datasets. Finally, a short R script (SampleScript.R) is included to facilitate viewing the input and output data and to re-run the model. Reference: Hirsch, R.M., Moyer, D.L., and Archfield, S.A., 2010, Weighted Regressions on Time, Discharge, and Season (WRTDS), with an application to Chesapeake Bay River inputs: Journal of the American Water Resources Association, v. 46, no. 5, p. 857–880.
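As a hedged illustration of how the tabular outputs might be inspected in R (the join key 'StationID' is an assumed column name; the included SampleScript.R is the authoritative way to view the data and re-run the model):

# Sketch: load station information and annual WRTDS outputs and join them.
# 'StationID' is an assumed column name; check the file documentation.
stations <- read.csv("StationInformation.csv", stringsAsFactors = FALSE)
annual   <- read.csv("AnnualResults.csv", stringsAsFactors = FALSE)
results  <- merge(annual, stations, by = "StationID")
head(results)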
The Global Landslide Catalog (GLC) was developed with the goal of identifying rainfall-triggered landslide events around the world, regardless of size, impacts or location. The GLC considers all types of mass movements triggered by rainfall, which have been reported in the media, disaster databases, scientific reports, or other sources. The GLC has been compiled since 2007 at NASA Goddard Space Flight Center. This is a unique data set with the ID tag “GLC” in the landslide editor. This dataset on data.nasa.gov was a one-time export from the Global Landslide Catalog maintained separately. It is current as of March 7, 2016. The original catalog is available here: http://www.arcgis.com/home/webmap/viewer.html?url=https%3A%2F%2Fmaps.nccs.nasa.gov%2Fserver%2Frest%2Fservices%2Fglobal_landslide_catalog%2Fglc_viewer_service%2FFeatureServer&source=sd To export GLC data, you must agree to the “Terms and Conditions”. We request that anyone using the GLC cite the two sources of this database: Kirschbaum, D. B., Adler, R., Hong, Y., Hill, S., & Lerner-Lam, A. (2010). A global landslide catalog for hazard applications: method, results, and limitations. Natural Hazards, 52(3), 561–575. doi:10.1007/s11069-009-9401-4. [1] Kirschbaum, D.B., T. Stanley, Y. Zhou (In press, 2015). Spatial and Temporal Analysis of a Global Landslide Catalog. Geomorphology. doi:10.1016/j.geomorph.2015.03.016. [2]
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data supporting the Master thesis "Monitoring von Open Data Praktiken - Herausforderungen beim Auffinden von Datenpublikationen am Beispiel der Publikationen von Forschenden der TU Dresden" (Monitoring open data practices - challenges in finding data publications using the example of publications by researchers at TU Dresden) - Katharina Zinke, Institut für Bibliotheks- und Informationswissenschaften, Humboldt-Universität Berlin, 2023
This ZIP file contains the data on which the thesis is based, interim exports of the results, and the R script with all pre-processing, data merging and analyses carried out. The documentation of the additional, explorative analysis is also available. The actual PDFs and text files of the scientific papers used are not included, as they are published open access.
The folder structure is shown below with the file names and a brief description of the contents of each file. For details concerning the analysis approach, please refer to the master's thesis (publication following soon).
## Data sources
Folder 01_SourceData/
- PLOS-Dataset_v2_Mar23.csv (PLOS-OSI dataset)
- ScopusSearch_ExportResults.csv (export of Scopus search results from Scopus)
- ScopusSearch_ExportResults.ris (export of Scopus search results from Scopus)
- Zotero_Export_ScopusSearch.csv (export of the file names and DOIs of the Scopus search results from Zotero)
## Automatic classification
Folder 02_AutomaticClassification/
- (NOT INCLUDED) PDFs folder (Folder for PDFs of all publications identified by the Scopus search, named AuthorLastName_Year_PublicationTitle_Title)
- (NOT INCLUDED) PDFs_to_text folder (Folder for all texts extracted from the PDFs by ODDPub, named AuthorLastName_Year_PublicationTitle_Title)
- PLOS_ScopusSearch_matched.csv (merge of the Scopus search results with the PLOS_OSI dataset for the files contained in both)
- oddpub_results_wDOIs.csv (results file of the ODDPub classification)
- PLOS_ODDPub.csv (merge of the results file of the ODDPub classification with the PLOS-OSI dataset for the publications contained in both)
## Manual coding
Folder 03_ManualCheck/
- CodeSheet_ManualCheck.txt (Code sheet with descriptions of the variables for manual coding)
- ManualCheck_2023-06-08.csv (Manual coding results file)
- PLOS_ODDPub_Manual.csv (Merge of the results file of the ODDPub and PLOS-OSI classification with the results file of the manual coding)
## Explorative analysis for the discoverability of open data
Folder 04_FurtherAnalyses/
- Proof_of_of_Concept_Open_Data_Monitoring.pdf (Description of the explorative analysis of the discoverability of open data publications using the example of a researcher) - in German
## R-Script
- Analyses_MA_OpenDataMonitoring.R (R script for preparing, merging and analyzing the data and for running the ODDPub algorithm)
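For orientation only, a hedged sketch of the DOI-based merge described for PLOS_ODDPub.csv; the column name 'doi' is an assumption, and the authoritative code is Analyses_MA_OpenDataMonitoring.R:

# Sketch: merge ODDPub classification results with the PLOS-OSI dataset by DOI.
oddpub <- read.csv("02_AutomaticClassification/oddpub_results_wDOIs.csv")
plos   <- read.csv("01_SourceData/PLOS-Dataset_v2_Mar23.csv")
plos_oddpub <- merge(oddpub, plos, by = "doi")   # assumed shared DOI column
write.csv(plos_oddpub, "02_AutomaticClassification/PLOS_ODDPub.csv", row.names = FALSE)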
Time Series Data Handling and Quality Assurance Review
Most instruments had internal logging and special software to download data from the field instruments as binary or ASCII/CSV files. The instruments whose files download as binary provide software to view the data or export them to CSV files.
One-minute resolution time-series data files were created for each house using an R script that pulled data from the CSV files, aligned data by time, executed unit conversions, and translated data from instruments with longer or different data intervals (e.g., 30-minute formaldehyde data and 1.5-minute anemometer data). Visual review was conducted on the compiled files (and primary CSV or binary files were consulted as needed) to check for translation or writing errors (especially from the terminal emulator), indications of instrument malfunction, mislabeled units or unit conversion errors, mislabeled locations, and time-stamp errors.
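A hedged sketch of the alignment step described above; the file names, column names, and carry-forward rule below are placeholders for illustration, not the project's actual processing code:

# Sketch: place two instrument streams onto a common one-minute time grid.
formaldehyde <- read.csv("formaldehyde_30min.csv")   # placeholder file name
anemometer   <- read.csv("anemometer_1p5min.csv")    # placeholder file name
formaldehyde$time <- as.POSIXct(formaldehyde$time)
anemometer$time   <- as.POSIXct(anemometer$time)

# one-minute master grid spanning the monitoring period
grid <- data.frame(time = seq(min(anemometer$time), max(anemometer$time), by = "1 min"))

# carry each instrument's most recent reading forward onto the grid
idx <- findInterval(grid$time, formaldehyde$time)
idx[idx == 0] <- NA
grid$hcho <- formaldehyde$hcho[idx]                  # placeholder column name
idx <- findInterval(grid$time, anemometer$time)
idx[idx == 0] <- NA
grid$wind_speed <- anemometer$wind_speed[idx]        # placeholder column name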
The draft final set of time-series data ...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Replication Package
This repository contains data and source files needed to replicate our work described in the paper "Unboxing Default Argument Breaking Changes in Scikit Learn".
Requirements
We recommend the following requirements to replicate our study:
Package Structure
We relied on Docker containers to provide a working environment that is easier to replicate. Specifically, we configure the following containers:
- `data-analysis`, an R-based container we used to run our data analysis.
- `data-collection`, a Python container we used to collect Scikit's default arguments and detect them in client applications.
- `database`, a Postgres container we used to store clients' data, obtained from Grotov et al.
- `storage`, a directory used to store the data processed in `data-analysis` and `data-collection`. This directory is shared by both containers.
- `docker-compose.yml`, the Docker file that configures all containers used in the package.
In the remainder of this document, we describe how to set up each container properly.
Using VSCode to Setup the Package
We selected VSCode as the IDE of choice because its extensions allow us to implement our scripts directly inside the containers. In this package, we provide configuration parameters for both the `data-analysis` and `data-collection` containers. This way, you can directly access and run each container without any specific configuration.
You first need to set up the containers:
$ cd /replication/package/folder
$ docker-compose build
$ docker-compose up
# Wait for Docker to create and run all containers
Then, you can open them in Visual Studio Code:
If you want/need a more customized organization, the remainder of this file describes it in detail.
Longest Road: Manual Package Setup
Database Setup
The database container will automatically restore the dump in `dump_matroskin.tar` on its first launch. To set up and run the container, you should:
Build an image:
$ cd ./database
$ docker build --tag 'dabc-database' .
$ docker image ls
REPOSITORY TAG IMAGE ID CREATED SIZE
dabc-database latest b6f8af99c90d 50 minutes ago 18.5GB
Create and enter the container:
$ docker run -it --name dabc-database-1 dabc-database
$ docker exec -it dabc-database-1 /bin/bash
root# psql -U postgres -h localhost -d jupyter-notebooks
jupyter-notebooks=# \dt
List of relations
Schema | Name | Type | Owner
--------+-------------------+-------+-------
public | Cell | table | root
public | Code_cell | table | root
public | Md_cell | table | root
public | Notebook | table | root
public | Notebook_features | table | root
public | Notebook_metadata | table | root
public | repository | table | root
If you got the list of tables above, your database is properly set up.
It is important to mention that this database is extended from the one provided by Grotov et al. Basically, we added three columns to the table `Notebook_features` (`API_functions_calls`, `defined_functions_calls`, and `other_functions_calls`) containing the function calls performed by each client in the database.
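Since the analysis container is R-based, one hedged way to inspect these added columns is through DBI; the connection details below (host, password) are placeholders that depend on your docker-compose setup, not values taken from the package:

# Sketch: query the extended Notebook_features table from R.
library(DBI)
con <- dbConnect(RPostgres::Postgres(),
                 dbname   = "jupyter-notebooks",
                 host     = "database",        # assumed compose service name
                 user     = "postgres",
                 password = "postgres")        # placeholder
dbGetQuery(con, '
  SELECT "API_functions_calls", "defined_functions_calls", "other_functions_calls"
  FROM "Notebook_features"
  LIMIT 5;')
dbDisconnect(con)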
Data Collection Setup
This container is responsible for collecting the data to answer our research questions. It has the following structure:
- `dabcs.py`, extracts DABCs from the Scikit-Learn source code and exports them to a CSV file.
- `dabcs-clients.py`, extracts function calls from clients and exports them to a CSV file. We rely on a modified version of Matroskin to leverage the function calls. You can find the tool's source code in the `matroskin` directory.
- `Makefile`, commands to set up and run both `dabcs.py` and `dabcs-clients.py`.
- `matroskin`, the directory containing the modified version of the Matroskin tool. We extended the library to collect the function calls performed on the client notebooks of Grotov's dataset.
- `storage`, a Docker volume where the data collection should save the exported data. This data will be used later in Data Analysis.
- `requirements.txt`, Python dependencies adopted in this module.
Note that the container will automatically configure this module for you, e.g., install dependencies, configure Matroskin, download the Scikit-Learn source code, etc. For this, you must run the following commands:
$ cd ./data-collection
$ docker build --tag "data-collection" .
$ docker run -it -d --name data-collection-1 -v $(pwd)/:/data-collection -v $(pwd)/../storage/:/data-collection/storage/ data-collection
$ docker exec -it data-collection-1 /bin/bash
$ ls
Dockerfile Makefile config.yml dabcs-clients.py dabcs.py matroskin storage requirements.txt utils.py
If you see the project files, the container is configured correctly.
Data Analysis Setup
We use this container to conduct the analysis over the data produced by the Data Collection container. It has the following structure:
- `dependencies.R`, an R script containing the dependencies used in our data analysis.
- `data-analysis.Rmd`, the R notebook we used to perform our data analysis.
- `datasets`, a Docker volume pointing to the `storage` directory.
Execute the following commands to run this container:
$ cd ./data-analysis
$ docker build --tag "data-analysis" .
$ docker run -it -d --name data-analysis-1 -v $(pwd)/:/data-analysis -v $(pwd)/../storage/:/data-collection/datasets/ data-analysis
$ docker exec -it data-analysis-1 /bin/bash
$ ls
data-analysis.Rmd datasets dependencies.R Dockerfile figures Makefile
If you see the project files, the container is configured correctly.
A note on the `storage` shared folder
As mentioned, the `storage` folder is mounted as a volume and shared between the `data-collection` and `data-analysis` containers. We compressed the content of this folder due to space constraints. Therefore, before starting work on Data Collection or Data Analysis, make sure you have extracted the compressed files. You can do this by running the `Makefile` inside the `storage` folder.
$ make unzip # extract files
$ ls
clients-dabcs.csv clients-validation.csv dabcs.csv Makefile scikit-learn-versions.csv versions.csv
$ make zip # compress files
$ ls
csv-files.tar.gz Makefile
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Electoral Commission of Queensland is responsible for the Electronic Disclosure System (EDS), which provides real-time reporting of political donations. It aims to streamline the disclosure process while increasing transparency surrounding gifts.
All entities conducting or supporting political activity in Queensland are required to submit a disclosure return to the Electoral Commission of Queensland. These include reporting of gifts and loans, as well as periodic reporting of other dealings such as advertising and expenditure. EDS makes these returns readily available to the public, providing faster and easier access to political financial disclosure information.
The EDS is an outcome of the Electoral Commission of Queensland's ongoing commitment to the people of Queensland, to drive improvements to election services and meet changing community needs.
To export the data from the EDS as a CSV file, consult this page: https://helpcentre.disclosures.ecq.qld.gov.au/hc/en-us/articles/115003351428-Can-I-export-the-data-I-can-see-in-the-map-
For a detailed glossary of terms used by the EDS, please consult this page: https://helpcentre.disclosures.ecq.qld.gov.au/hc/en-us/articles/115002784587-Glossary-of-Terms-in-EDS
For other information about how to use the EDS, please consult the FAQ page here: https://helpcentre.disclosures.ecq.qld.gov.au/hc/en-us/categories/115000599068-FAQs
Data and source code for reproducing the analysis conducted in "High Throughput FTIR Analysis of Macro and Microplastics with Plate Readers"
All materials are licensed for noncommercial purposes https://creativecommons.org/licenses/by-nc/4.0/
Explanatory_Videos.zip has videos showing data collection methods.
HIDA_Publication.R has source code for doing data cleanup and analysis on data in database.zip.
databasedata.zip holds all raw and analyzed data.
- ATR, Reflectance, and Transmission folders have all data used in the manuscript, in raw (.0) and combined (export.csv) format for each of the plates analyzed (folder numbers).
- Plots folder has images of each spectrum.
- cell_information.csv has the raw ids and comments made at the time the particles were assessed.
- classes_reference_2.csv has the transformations used to standardize Open Specy's terms to polymer classes.
- CleanedSpectra_raw.csv has the total cleaned up database of all spectral intensities in long format.
- joined_cell_metadata.csv has the metadata for each plate well analyzed.
- library_metadata.csv has metadata for each spectrum in raw form for each particle id.
- Lisa_Plate_6.csv has the metadata from Lisa Roscher used in this study.
- Metadata_raw.csv has the conformed metadata that can be paired with the CleanedSpectra_raw.csv file.
- OpenSpecy_Classification_Baseline.csv has the particle metadata combined with Open Specy's classes identified after baseline correcting and smoothing the spectra with the standard Open Specy routine.
- OpenSpecy_Classification_Raw.csv has the particle metadata combined with Open Specy's identified classes if using the raw spectra.
- particle_spectrum_match.csv converts particle ids to their reference in the Polymer_Material_Database_AWI_V2_Win.xlsx file.
- Polymer_Material_Database_AWI_V2_Win.xlsx metadata on materials from Primpke's database.
- polymer_metadata_2.csv can be used to crosswalk polymer categories to more or less specific terminology.
- spread_os.csv is the reference database used in CleanedSpectra_raw.csv that has been spread to wide format.
- Top Correlation Data20221201-125621.csv is a download of results from Open Specy's beta tool that provides the top ids from the reference database.
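As a hedged sketch of the long-to-wide relationship between CleanedSpectra_raw.csv and spread_os.csv, using hypothetical column names ('particle_id', 'wavenumber', 'intensity') that are not taken from the files:

# Sketch: reshape a long-format spectra table to wide format with tidyr.
library(readr)
library(tidyr)
spectra_long <- read_csv("CleanedSpectra_raw.csv")
spectra_wide <- pivot_wider(spectra_long,
                            id_cols     = particle_id,    # hypothetical columns
                            names_from  = wavenumber,
                            values_from = intensity)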
A routine was developed in R ('bathy_plots.R') to plot bathymetry data over time during individual CEAMARC events. This is so we can analyse benthic data in relation to habitat, i.e. did we trawl over a slope or was the sea floor relatively flat. Note that the depth range in the plots is autoscaled to the data, so a small range in depths appears as a scattering of points. As long as you look at the depth scale, interpretation will be fine.
The R files need a bathymetry data file, '200708V3_one_minute.csv', which contains a data export from the underway PostgreSQL ship database, and 'events.csv', which is a stripped-down version of the events export from the shipboard events database. If you wish to run the code again, you may need to change the pathnames in the R script to relevant locations. If you have opened the csv files in Excel at any stage and the R script gets an error, you may need to format the date/time columns as yyyy-mm-dd hh:mm:ss, save and close the file as csv without opening it again, and then run the R script.
However, all output files are provided here for every CEAMARC event. Filenames contain a reference to the CEAMARC event id. Files are in EPS format and can be viewed using Ghostview, which is available as a free download on the internet.
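For orientation, a hedged sketch of the kind of per-event depth-versus-time plot produced by 'bathy_plots.R'; the column names ('date_time', 'depth', 'event_id', 'start_time', 'end_time') are placeholders, not necessarily those in the underway or events exports:

# Sketch: plot bathymetry over time for one CEAMARC event and save it as EPS.
bathy  <- read.csv("200708V3_one_minute.csv")
events <- read.csv("events.csv")
bathy$date_time <- as.POSIXct(bathy$date_time, format = "%Y-%m-%d %H:%M:%S")

ev  <- events[1, ]                                        # a single event
sel <- bathy$date_time >= as.POSIXct(ev$start_time) &
       bathy$date_time <= as.POSIXct(ev$end_time)

postscript(paste0("bathy_event_", ev$event_id, ".eps"))
plot(bathy$date_time[sel], bathy$depth[sel], type = "l",
     xlab = "Time", ylab = "Depth (m)")                   # depth axis autoscales to the data
dev.off()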
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Name: Data used to rate the relevance of each dimension necessary for a Holistic Environmental Policy Assessment.
Summary: This dataset contains answers from a panel of experts and the public rating the relevance of each dimension on a scale of 0 (Not relevant at all) to 100 (Extremely relevant).
License: CC-BY-SA
Acknowledgement: These data have been collected in the framework of the DECIPHER project. This project has received funding from the European Union's Horizon Europe programme under grant agreement No. 101056898.
Disclaimer: Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union. Neither the European Union nor the granting authority can be held responsible for them.
Collection Date: 2024-01 / 2024-04
Publication Date: 22/04/2025
DOI: 10.5281/zenodo.13909413
Other repositories: -
Author: University of Deusto
Objective of collection: These data were originally collected to prioritise the dimensions to be further used for Environmental Policy Assessment and the enlarged scope of IAMs.
Description:
Data files (CSV)
- decipher-public.csv: Public participants' general survey results in the framework of the DECIPHER project, including socio-demographic characteristics and overall perception of each dimension necessary for a Holistic Environmental Policy Assessment.
- decipher-risk.csv: Individual survey responses regarding prioritisation of dimensions in risk situations. Includes demographic and opinion data from a targeted sample.
- decipher-experts.csv: Experts' opinions on risk topics collected through surveys in the framework of the DECIPHER project, targeting professionals in relevant fields.
- decipher-modelers.csv: Answers given by the developers of models about the characteristics of the models and the dimensions covered by them.
- prolific_export_risk.csv: Exported survey data from Prolific, focusing specifically on ratings in risk situations. Includes response times, demographic details, and survey metadata.
- prolific_export_public_{1,2}.csv: Public survey exports from Prolific, gathering prioritisation of dimensions necessary for environmental policy assessment.
- curated.csv: Final cleaned and harmonized dataset combining multiple survey sources. Designed for direct statistical analysis with standardized variable names.
Script files (R)
- decipher-modelers.R: Script to assess the answers given by modelers about the characteristics of the models.
- joint.R: Script to clean and join the raw answers from the different surveys to retrieve the overall perception of each dimension necessary for a Holistic Environmental Policy Assessment.
Report files
- decipher-modelers.pdf: Diagram with the result of the ...
- full-Country.html: Full interactive report showing dimension prioritisation broken down by participant country.
- full-Gender.html: Visualization report displaying differences in dimension prioritisation by gender.
- full-Education.html: Detailed breakdown of dimension prioritisation results based on education level.
- full-Work.html: Report focusing on participant occupational categories and associated dimension prioritisation.
- full-Income.html: Analysis report showing how income level correlates with dimension prioritisation.
- full-PS.html: Report analyzing Political Sensitivity scores across all participants.
- full-type.html: Visualization report comparing participant dimension prioritisation (public vs experts) in normal and risk situations.
- full-joint-Country.html: Joint analysis report integrating multiple dimensions of country-based dimension prioritisation in normal and risk situations. Combines demographic and response patterns.
- full-joint-Gender.html: Combined gender-based analysis across datasets, exploring intersections of demographic factors and dimension prioritisation in normal and risk situations.
- full-joint-Education.html: Education-focused report merging various datasets to show consistent or divergent patterns of dimension prioritisation in normal and risk awareness.
- full-joint-Work.html: Cross-dataset analysis of occupational groups and their dimension prioritisation in normal and risk situations.
- full-joint-Income.html: Income-stratified joint analysis, merging public and expert datasets to find common trends and significant differences in dimension prioritisation in normal and risk situations.
- full-joint-PS.html: Comprehensive Political Sensitivity score report from merged datasets, highlighting general patterns and subgroup variations in normal and risk situations.
5 star: ⭐⭐⭐
Preprocessing steps: The data have been re-coded and cleaned using the scripts provided.
Reuse: NA
Update policy: No more updates are planned.
Ethics and legal aspects: Names of the persons involved have been removed.
Technical aspects:
Other:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This resource is a backup of the Water Quality Portal (WQP), a database of water quality samples from the U.S. Geological Survey and the U.S. Environmental Protection Agency. This resource includes:
This dataset reflects reported incidents of crime (with the exception of murders where data exists for each victim) that occurred in the City of Chicago from 2001 to present, minus the most recent seven days. Data is extracted from the Chicago Police Department's CLEAR (Citizen Law Enforcement Analysis and Reporting) system. In order to protect the privacy of crime victims, addresses are shown at the block level only and specific locations are not identified. Should you have questions about this dataset, you may contact the Research & Development Division of the Chicago Police Department at 312.745.6071 or RandD@chicagopolice.org. Disclaimer: These crimes may be based upon preliminary information supplied to the Police Department by the reporting parties that have not been verified. The preliminary crime classifications may be changed at a later date based upon additional investigation and there is always the possibility of mechanical or human error. Therefore, the Chicago Police Department does not guarantee (either expressed or implied) the accuracy, completeness, timeliness, or correct sequencing of the information and the information should not be used for comparison purposes over time. The Chicago Police Department will not be responsible for any error or omission, or for the use of, or the results obtained from the use of this information. All data visualizations on maps should be considered approximate and attempts to derive specific addresses are strictly prohibited. The Chicago Police Department is not responsible for the content of any off-site pages that are referenced by or that reference this web page other than an official City of Chicago or Chicago Police Department web page. The user specifically acknowledges that the Chicago Police Department is not responsible for any defamatory, offensive, misleading, or illegal conduct of other users, links, or third parties and that the risk of injury from the foregoing rests entirely with the user. The unauthorized use of the words "Chicago Police Department," "Chicago Police," or any colorable imitation of these words or the unauthorized use of the Chicago Police Department logo is unlawful. This web page does not, in any way, authorize such use. Data is updated daily Tuesday through Sunday. The dataset contains more than 65,000 records/rows of data and cannot be viewed in full in Microsoft Excel. Therefore, when downloading the file, select CSV from the Export menu. Open the file in an ASCII text editor, such as Wordpad, to view and search. To access a list of Chicago Police Department - Illinois Uniform Crime Reporting (IUCR) codes, go to http://data.cityofchicago.org/Public-Safety/Chicago-Police-Department-Illinois-Uniform-Crime-R/c7ck-438e
State Harvest Data (csv): Commercial snapping turtle harvest data (in individuals) for eleven states from 1998 - 2013. States reporting are Arkansas, Delaware, Iowa, Maryland, Massachusetts, Michigan, Minnesota, New Jersey, North Carolina, Pennsylvania, and Virginia. (StateHarvestData.csv)
Input and execution code for Colteaux_Johnson_2016: Attached R file includes the code described in the listed publication. The companion JAGS (just another Gibbs sampler) code is also stored in this repository under separate cover. (ColteauxJohnsonNatureConservation.R)
JAGS model code for Colteaux_Johnson_2016: Attached R file includes the JAGS (just another Gibbs sampler) code described in the listed publication. The companion input and execution code is also stored in this repository under separate cover. (ColteauxJohnsonNatureConservationJAGS.R)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
# GBIF Specimen Data Analysis and Forecasting
## Version 2 - modified date ranges for figures 1 and 2 in response to reviewer comments
This repository contains the code and data for analysing and forecasting trends in Global Biodiversity Information Facility (GBIF) specimen records across three major taxonomic groups: Chordata, Arthropoda, and Plantae.
The analysis pipeline includes data cleaning, anomaly detection, primary analyses, and forecasting based on historical database snapshots.
These scripts and data correspond to analyses in the following manuscript:
Global Sampling Decline Erodes Science Potential of Natural History Collections
Authors:
Owen Forbes
Andrew G. Young
Peter H. Thrall
## Repository Structure
The repository consists of three main Quarto (.qmd) scripts and associated data files:
1. `1_DataCleaning_Forbes-et-al_2025.qmd`: Data cleaning and anomaly detection
2. `2_PrimaryAnalyses_Forbes-et-al_2025.qmd`: Primary analyses and visualisation
3. `3_SnapshotsForecasting_Forbes-et-al_2025.qmd`: Historical snapshot analysis and forecasting
## Requirements
- R (version 4.3.2 or later)
- Required R packages:
- tidyverse (v2.0.0) - for data manipulation and visualization
- readr (v2.1.5) - for reading CSV/TSV files
- ggplot2 (v3.4.0 or v3.5.0) - for creating visualizations
- rnaturalearth (v1.0.1) - for accessing natural earth map data
- dplyr (v1.1.0 or v1.1.4) - for data manipulation
- countrycode (v1.6.0) - for converting country names and codes
- spdep (v1.3-3) - for spatial dependence modeling
- sp (v1.6-0 or v2.1-3) - for spatial data manipulation
- sf (v1.0-15 or v1.0-16) - for simple features access
- data.table (v1.14.8) - for fast aggregation of large data
- lubridate (v1.9.2) - for date-time manipulation
- viridis (v0.6.3) - for color palettes
- gridExtra (v2.3) - for arranging multiple plots
- ggpubr (v0.6.0) - for creating publication-ready plots
- zoo (v1.8-12) - for time series, including moving averages
- scales (v1.3.0) - for graphical scales
- forecast (v8.22.0) - for ARIMA forecast models
- purrr (v1.0.2) - for mapping custom forecast function onto each dataset
- arrow - for working with parquet files
Install these packages before running the scripts.
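For convenience, one way to install the CRAN packages listed above (versions are not pinned in this sketch):

# Install the required CRAN packages before running the scripts.
install.packages(c("tidyverse", "readr", "ggplot2", "rnaturalearth", "dplyr",
                   "countrycode", "spdep", "sp", "sf", "data.table", "lubridate",
                   "viridis", "gridExtra", "ggpubr", "zoo", "scales", "forecast",
                   "purrr", "arrow"))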
## How to Use
1. Download this repository to your local machine.
2. Set your working directory to the location of the scripts.
3. Download raw datasets from GBIF (as required)
4. Ensure all required R packages are installed.
5. Run the scripts in RStudio or your preferred R environment.
### Data Cleaning (`1_DataCleaning_Forbes-et-al_2025.qmd`)
This script cleans the raw GBIF data and identifies anomalies. It produces files containing indexes of dataset records to be removed, which are used in subsequent analyses.
**Note**: The raw GBIF exported datasets for contemporary records are not included in this repository due to file size constraints. Download them from the GBIF links provided in the script and place them in the `data/` directory.
### Primary Analyses (`2_PrimaryAnalyses_Forbes-et-al_2025.qmd`)
This script performs the main analyses and generates visualisations. It uses the outputs from the data cleaning script to filter anomalous records.
To reproduce all analysis stages from the original raw .csv files:
- Start at the chunks labelled "DATA LOAD AND FILTERING".
- Run the pipeline for non-spatial analyses before spatial analyses.
- Due to memory constraints, it's recommended to run analyses for one taxonomic group and one analysis stream at a time.
To skip to plot generation:
- Navigate to sections tagged as "@! SKIP TO PLOTTING !@".
- Ensure all required analysis output files are in the `data/` directory.
### Forecasting (`3_SnapshotsForecasting_Forbes-et-al_2025.qmd`)
This script analyses historical GBIF database snapshots and forecasts future growth. It uses the cleaned snapshot data produced by the data cleaning script.
## Data Files
### GBIF Exports - Raw Data (not included on Zenodo due to file size, please download directly from GBIF)
- `0016915-240425142415019.csv` for Chordata - https://www.gbif.org/occurrence/download/0016915-240425142415019
- `0016914-240425142415019.csv` for Plantae - https://www.gbif.org/occurrence/download/0016914-240425142415019
- `0016913-240425142415019.csv` for Arthropoda - https://www.gbif.org/occurrence/download/0016913-240425142415019
### Included Data Files
#### Raw Data
- `GBIF_snapshots.parquet` # Historical snapshots RAW dataset (arrow/parquet format)
- `GBIF_integer_to_datasetKey.tsv` # Mapping old dataset IDs onto new datasetKey field
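A hedged sketch of loading these two files with the arrow and readr packages, assuming they are placed in the `data/` directory:

# Sketch: read the historical snapshots (parquet) and the dataset-key mapping (tsv).
library(arrow)
library(readr)
snapshots <- read_parquet("data/GBIF_snapshots.parquet")
key_map   <- read_tsv("data/GBIF_integer_to_datasetKey.tsv")
head(snapshots)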
#### Contemporary Datasets - data cleaning outputs
- `chordata_counts_to_highlight_030724` # List of anomalous Chordata dataset + year indexes to filter
- `arthropoda_counts_to_highlight_OG_030724` # List of anomalous Arthropoda dataset + year indexes to filter
- `plantae_counts_to_highlight_030724` # List of anomalous Plantae dataset + year indexes to filter
#### Cleaned Snapshots
- `plantae_snapshots_filter_threshold_IN_040924` # Cleaned Plantae snapshots
- `arthropoda_snapshots_filter_threshold_IN_040924` # Cleaned Arthropoda snapshots
- `chordata_snapshots_filter_threshold_IN_040924` # Cleaned Chordata snapshots
- `gbif_dates_df_anomaly_filtered_090724` # Anomaly-filtered snapshots (combined dataset)
- `gbif_dates_df_anomalies_highlighted_090724` # Anomalies highlighted snapshots (combined dataset)
#### Analysis Outputs - for skipping straight to plot/figure generation
- `arthropoda_specimens_per_year_080724` # Arthropoda specimen counts per year
- `arthropoda_unique_species_per_year_080724` # Arthropoda unique species counts per year
- `arthropoda_grid_counts_080724` # Arthropoda grid counts
- `chordata_specimens_per_year_080724` # Chordata specimen counts per year
- `chordata_unique_species_per_year_080724` # Chordata unique species counts per year
- `chordata_grid_counts_080724` # Chordata grid counts
- `plantae_specimens_per_year_080724` # Plantae specimen counts per year
- `plantae_unique_species_per_year_080724` # Plantae unique species counts per year
- `plantae_grid_counts_080724` # Plantae grid counts
- `chordata_continent_count_080724` # Chordata continent-specific counts
- `arthropoda_continent_count_080724` # Arthropoda continent-specific counts
- `plantae_continent_count_080724` # Plantae continent-specific counts
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
S1 File. SI_C01_SPD_KDE_models. R script for analysing radiocarbon dates. The code performs the computation of over-regional and regional SPD and KDE models, as well as their export to CSV files (Rmd).
S2 File. SI_C02_aoristic_dating. R script for exporting aoristic time series derived from typochronologically dated archaeological material as CSV files (Rmd).
S3 File. SI_C03_vegetation_openness_score_example. R script performing the computation of a vegetation openness score from pollen records and the export of the generated time series as a CSV file (Rmd).
S4 File. SI_C04_data_preparation. Jupyter Notebook performing the import and transformation of relevant data to visualize the plots exhibited in the paper (ipynb).
S5 File. SI_C05_figures_extra. Jupyter Notebook visualizing the plots exhibited in the paper (ipynb).
S1 Data. SI_D01_reg_data_no_dups. Spreadsheet holding radiocarbon dates, with the information of laboratory identification, site name, geographical coordinates, site type, material, source and regional affiliation (csv).
S2 Data. SI_D02_reg_axe_dagger_graves. Spreadsheet holding entries of axes and daggers, with the information of context, site, parish, artefact identification, type, subtype, absolute dating, typochronological dating, references, geographical coordinates and regional affiliations (csv).
S3 Data. SI_D03_pollen_example. Spreadsheet holding sample entries of the pollen records from Krageholm (Neotoma Site ID 3204) and Bjäresjöholmsjön (Neotoma Site ID 3017) for an example run of S3 File. Records can be accessed via the Neotoma Explorer (https://apps.neotomadb.org/explorer/) with their given IDs. Each entry holds the information of the record's type, regional affiliation, absolute BP and BCE dating, as well as the counts of given plant taxa (csv).
S4 Data. SI_D04_PAP_303600_TOC_LOI. Table holding sample entries of TOC content, LOI and SST reconstruction of sediment core PAP_303600 for correlations of population development with Baltic Sea surface temperature. Available via 10.1594/PANGAEA.883292 (tab).
S5 Data. SI_D05_vos_[…]. Spreadsheets holding the vegetation openness score time series of lake Belau, Vinge, Northern Jutland and Zealand (csv). (ZIP)
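Purely as orientation for readers unfamiliar with SPD models, a hedged sketch using the rcarbon package (which may differ from the tooling actually used in S1 File); the dates below are placeholders:

# Sketch: calibrate a few radiocarbon dates and compute a summed probability
# distribution (SPD) with rcarbon.
library(rcarbon)
c14_age <- c(4520, 4380, 4105)     # placeholder uncalibrated 14C ages (BP)
c14_err <- c(35, 40, 30)           # placeholder measurement errors
caldates <- calibrate(x = c14_age, errors = c14_err, calCurves = "intcal20")
spd_model <- spd(caldates, timeRange = c(5500, 4500))
plot(spd_model)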
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
To replicate the analysis, the results, and the figures of the paper:
Download input data from this Zenodo repository and code from GitHub: https://github.com/giacfalk/urban_green_space_mapping_and_tracking
Optional data extraction steps (processed output data are already available in the Zenodo repository):
Adjust your working directory
Run [lines 4-11] of workflow/sourcer.R
Run the Javascript scripts written by the string_generator_training.R and string_generator_prediction.R files in Google Earth Engine (https://code.earthengine.google.com) and complete the export to Drive tasks to generate the output .csv files
Run workflow/sourcer.R [lines 15-46] to train the ML model and make predictions (including figures and tables replication)
Age, sex and length data provide population dynamics information that can indicate how population trends occur and may be changing. These data can help researchers estimate population growth rates, age-class distribution and population demographics. Knowing population demographics, growth rates and trends is particularly valuable to fisheries managers who must perform population assessments to inform management decisions. These data are therefore particularly important in valuable fisheries like the salmon fisheries of Alaska. This dataset includes age, sex and length data compiled from annual sampling of commercial and subsistence salmon harvests and research projects in westward and southeast Kodiak. It includes data on five salmon species: chinook, chum, coho, pink and sockeye. Age estimates were made by examining scales or bony structures (e.g. otoliths, or ear bones). Scales were removed from the side of the fish, usually the left side above the lateral line. Scales or bony structures were then mounted on gummed cards and pressed on acetate to make an impression. The number of freshwater and saltwater annuli (i.e. rings) was counted to estimate age in years. Age is recorded in European notation, which is a method of recording both fresh and saltwater annuli. For example, for a fish that spent one year in freshwater and 3 years in saltwater, its age is recorded as 1.3. The total fish age is the sum of the first and second numbers, plus one to account for the time between deposition and emergence. Therefore the fish in this example is 5 years old. Fish sex was determined by examining either external morphology (e.g. head and belly shape) or the internal sex organs. Length was measured in millimeters, generally from mid-eye to the fork of the tail. This data package includes the original data file (ASL DATA EXPORT.csv), a reformatting script that reformats the original data file into a consistent format (ASL_Formatting_SoutheastKodiak.R), and the reformatted dataset as a .csv file (ASL_formatted_SoutheastKodiak.csv).
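As a worked example of the European age notation described above, a small hedged R helper (hypothetical, not part of the included ASL_Formatting script) that converts a code such as "1.3" to total age in years:

# Convert European-notation salmon ages to total age in years:
# freshwater annuli + saltwater annuli + 1 (time between deposition and emergence).
european_to_total_age <- function(age_code) {
  parts <- strsplit(age_code, ".", fixed = TRUE)
  vapply(parts, function(p) sum(as.numeric(p)) + 1, numeric(1))
}
european_to_total_age(c("1.3", "0.2"))   # returns 5 and 3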
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
For updated crime statistics please refer to the Queensland Police Online Crime Maps website - http://www.police.qld.gov.au/online/crimemap/ - which allows users to search on a range of variables and export data in CSV format under a Creative Commons Attribution Licence.
The datasets published on this page have been provided by the Queensland Police Service under a Creative Commons Attribution 2.5 Australia Licence. To attribute this material, cite the Queensland Police Service.