CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
These files are intended for use with the Data Carpentry Genomics curriculum (https://datacarpentry.org/genomics-workshop/). Files will be useful for instructors teaching this curriculum in a workshop setting, as well as individuals working through these materials on their own.
This curriculum is normally taught using Amazon Web Services (AWS). Data Carpentry maintains an AWS image that includes all of the data files needed to use these lesson materials. For information on how to set up an AWS instance from that image, see https://datacarpentry.org/genomics-workshop/setup.html. Learners and instructors who would prefer to teach on a different remote computing system can access all required files from this FigShare dataset.
This curriculum uses data from a long-term evolution experiment published in 2016: Tempo and mode of genome evolution in a 50,000-generation experiment (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4988878/) by Tenaillon O, Barrick JE, Ribeck N, Deatherage DE, Blanchard JL, Dasgupta A, Wu GC, Wielgoss S, Cruveiller S, Médigue C, Schneider D, and Lenski RE. (doi: 10.1038/nature18959). All sequencing data sets are available in the NCBI BioProject database under accession number PRJNA294072 (https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA294072).
backup.tar.gz: contains original fastq files, reference genome, and subsampled fastq files. Directions for obtaining these files from public databases are given during the lesson (https://datacarpentry.org/wrangling-genomics/02-quality-control/index.html). On the AWS image, these files are stored in the ~/.backup directory. 1.3Gb in size.
Ecoli_metadata.xlsx: an example Excel file to be loaded during the R lesson.
shell_data.tar.gz: contains the files used as input to the Introduction to the Command Line for Genomics lesson (https://datacarpentry.org/shell-genomics/).
sub.tar.gz: contains subsampled fastq files that are used as input to the Data Wrangling and Processing for Genomics lesson (https://datacarpentry.org/wrangling-genomics/). 109Mb in size.
solutions: contains the output files of the Shell Genomics and Wrangling Genomics lessons, including fastqc output, sam, bam, bcf, and vcf files.
vcf_clean_script.R: converts vcf output in .solutions/wrangling_solutions/variant_calling_auto to a single tidy data frame.
combined_tidy_vcf.csv: output of vcf_clean_script.R
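For anyone working outside the AWS image, the `.tar.gz` archives listed above need to be unpacked after download. The following is a minimal sketch of one way to do that in Python; it is not part of the curriculum, and the function name `extract_archive` and any destination folder are hypothetical choices made for illustration.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Unpack a .tar.gz archive into dest_dir and return its member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        # Use the safer "data" extraction filter when this Python version has it.
        if hasattr(tarfile, "data_filter"):
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
    return names
```

Calling `extract_archive("sub.tar.gz", "some_folder")` would unpack the subsampled fastq archive into `some_folder`, assuming the archive has already been downloaded to the working directory.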
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
# General overview
This repository contains the data and code used in the analysis of the
manuscript entitled **"The hidden biodiversity knowledge split in biological collections"**.
# Context
Ecological and evolutionary processes generate biodiversity, yet how biodiversity data are organized and shared globally can shape our understanding of these processes. We show that name-bearing type specimens—the primary reference for species identity—of all freshwater and brackish fish species are predominantly housed in Global North museums, disconnected from their countries of origin. This geographical divide creates a ‘knowledge split’ with consequences for biodiversity science, particularly in the Global South, where researchers face barriers in studying native species’ name bearers housed abroad. Meanwhile, Global North collections remain flooded with non-native name bearers. We relate this imbalance to historical and socioeconomic factors, which ultimately restricts access to critical taxonomic reference materials and hinders global species documentation. To address this disparity, we call for international initiatives to promote fairer access to biological knowledge, including specimen repatriation, improved accessibility protocols for researchers in countries where specimens originated, and inclusive research partnerships.
# Repository structure
## data
This folder stores the raw and processed data used to perform all
analyses presented in this study.
### raw
- `flow_period_region_country.csv` a data frame in long format
  containing the flow of name-bearing types (NBT) between regions per time
  period (50-year time frame). Variables:
  - `period` numeric. 50-year time intervals
  - `region_type` character. Name of the World Bank region
    of the country where the NBT was sourced
  - `country_type` character. A three-letter code (ISO 3166-1 alpha-3) representing
    the country where the NBT was sourced
  - `region_museum` character. Name of the World Bank region of the country
    where the NBT is housed
  - `country_museum` character. A three-letter code (ISO 3166-1 alpha-3) representing
    the country of the museum where the NBT is housed
  - `n` numeric. The number of NBT flowing from one country to another
- `spp_native_distribution.csv` data frame in long format
  containing the native species composition at the country level. Variables:
  - `valid_name` character. The name of a species in the format genus_epithet
    according to the Catalog of Fishes
  - `country_distribution` character. Three-letter code (ISO 3166-1 alpha-3)
    of the country where a species is native
  - `region_distribution` character. The name of the World Bank region
    where a species is native
- `spp_type_distribution.csv` data frame in long format containing
  the composition of NBT by country. Variables:
  - `valid_name` character. The name of a species in the format genus_epithet
    according to the Catalog of Fishes
  - `country_distribution` character. Three-letter code (ISO 3166-1 alpha-3)
    of the country where the NBT of a species is housed
  - `region_distribution` character. The name of the World Bank region
    where the NBT of a species is housed
- `bio-dem_data.csv` data frame with data downloaded from
  [Bio-Dem](https://bio-dem.surge.sh/#awards) containing biological
  and social information at the country level. Variables:
  - `country` character. A three-letter code (ISO 3166-1 alpha-3) representing
    a country
  - `records` numeric. Total number of species occurrence records from the Global
    Biodiversity Information Facility (GBIF)
  - `records_per_area` numeric. Records per area from GBIF
  - `yearsSinceIndependence` numeric. Years since independence for each country
  - `e_migdppc` numeric. GDP per capita
- `museum_data.csv` data frame with museums' acronyms and the world
  region of each. Variables:
  - `code_museum` character. The acronym (three-letter code) of the museum
  - `country_museum` character. A three-letter code (ISO 3166-1 alpha-3) representing
    a country
  - `region_museum` character. The name of the World Bank region
### processed
- `flow_region.csv` a data frame containing the flow of name bearers among world
regions and the total number of name bearers derived from the source region
- `flow_period_region.csv` a data frame with the number of name bearers between
the world regions per 50-year time frame and the total number of name bearers
in each time frame for each world region
- `flow_period_region_prop.csv` a data frame with the number of name bearers,
the Domestic Contribution and Domestic Retention between the world
regions in a 50-year time frame (no longer used in downstream analyses)
- `flow_region_prop.csv` data frame with the total number of species flowing
between world regions, Domestic Contribution and Domestic Retention (no longer used in downstream analyses)
- `flow_country.csv` data frame with the flow of name bearers among
countries
- `df_country_native.csv` data frame with the number of native species
at the country level
- `df_country_type.csv` data frame with the number of name bearers at the
country level
- `df_all_beta.csv` data frame with values of endemic deficit and non-endemic
representation at the country level
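As an illustration of how the long-format raw table relates to the region-level processed tables, the sketch below sums `n` over countries for each combination of period, source region and museum region. It is not the authors' code (their processing is in the R scripts listed below). The column names follow those documented for `flow_period_region_country.csv`, while the rows, their values and the function name `flows_by_region` are made up for the example.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows using the documented column names; values are invented.
SAMPLE_CSV = """period,region_type,country_type,region_museum,country_museum,n
1800,Latin America & Caribbean,BRA,Europe & Central Asia,FRA,3
1800,Latin America & Caribbean,BRA,Europe & Central Asia,DEU,2
1800,Latin America & Caribbean,BRA,Latin America & Caribbean,BRA,1
1850,Sub-Saharan Africa,COD,Europe & Central Asia,BEL,4
"""


def flows_by_region(rows):
    """Sum country-level counts into (period, source region, museum region) totals."""
    totals = defaultdict(int)
    for row in rows:
        key = (int(row["period"]), row["region_type"], row["region_museum"])
        totals[key] += int(row["n"])
    return dict(totals)


region_flows = flows_by_region(csv.DictReader(io.StringIO(SAMPLE_CSV)))
```

To run the same aggregation on the real file, the `io.StringIO(SAMPLE_CSV)` argument could be replaced with an open file handle for `flow_period_region_country.csv`, assuming its `period` and `n` columns are stored as whole numbers.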
## R
The letters `D`, `A` and `V` represent scripts for, respectively, data
processing (D), data analysis (A) and results visualization (V). The
script sequence to reproduce the workflow is indicated by the numbers at
the beginning of the name of each script file.
- [`01_D_data_preparation.qmd`](R/01_D_data_preparation.qmd) initial data preparation
- [`02_A_beta-endemics-countries.qmd`](R/02_A_beta-endemics-countries.qmd) analysis of endemic deficit and non-endemic representation. This script is used to calculate `native/endemic deficit` and `non-native/non-endemic representation`
- [`03_D_data_preparation_models.qmd`](R/03_D_data_preparation_models.qmd) script used to build data frames that will be used in statistical models ([`04_A_model_NBTs.qmd`](R/04_A_model_NBTs.qmd))
- [`04_A_model_NBTs.qmd`](R/04_A_model_NBTs.qmd) statistical models for the total number of name bearers, endemic deficit and non-endemic representation
- [`05_V_chord_diagram_Fig1.qmd`](R/05_V_chord_diagram_Fig1.qmd) code used to produce the circular flow diagram. This is Figure 1 of the study
- [`06_V_world_map_Fig1.qmd`](R/06_V_world_map_Fig1.qmd) code used to produce the world map in Figure 1 of the main text
- [`08_V_beta_endemics_Fig3.qmd`](R/08_V_beta_endemics_Fig3.qmd) code used to build Figure 2 of the main text
- [`09_V_model_Fig4.qmd`](R/09_V_model_Fig4.qmd) code used to build Figure 3 of the main text. This is the representation of the results of the models presented in the script [`04_A_model_NBTs.qmd`](R/04_A_model_NBTs.qmd)
- [`0010_Supplementary_analysis.qmd`](R/0010_Supplementary_analysis.qmd) code to produce all the tables and figures presented in the Supplementary material of this study
## output
### Figures
In this folder you will find all figures used in the main text and supplementary material of this study.
- `Fig1_flow_circle_plot.png` Figure with circular plots showing the flux of name bearers among regions of the world in a 50-year time window
- `Fig3_turnover_metrics_endemics.png` Cartogram with 3 maps showing the level of endemic deficit,
non-endemic representation and the combination of both metrics in a combined map
- `Fig4_models.png` Figure showing the predictions of the number of name bearers,
endemic deficit and non-endemic representation for different predictors.
This is derived from the statistical models
#### Supp-material
This folder contains the figures in the Supplementary material
- `FigS1_native_richness.png` World map with countries coloured by native species richness according to the Catalog of Fishes
- `FigS3_turnover_metrics.png` Cartogram with 3 maps showing the level of
native deficit, non-native representation and the combination of both metrics in a combined map