4 datasets found

Data and code for the manuscript - The hidden biodiversity knowledge split...
zenodo.org
Updated Apr 19, 2025
Cite
Anonymous Anonymous; Anonymous Anonymous (2025). Data and code for the manuscript - The hidden biodiversity knowledge split in biological collections [Dataset]. http://doi.org/10.5281/zenodo.15248066
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.15248066
Dataset updated
Apr 19, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Anonymous Anonymous; Anonymous Anonymous
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Apr 19, 2025
Description
# General overview

This repository contains the data and code used in the analysis of the
manuscript entitled **"The hidden biodiversity knowledge split in biological collections"**.

# Context

Ecological and evolutionary processes generate biodiversity, yet how biodiversity data are organized and shared globally can shape our understanding of these processes. We show that name-bearing type specimens—the primary reference for species identity—of all freshwater and brackish fish species are predominantly housed in Global North museums, disconnected from their countries of origin. This geographical divide creates a ‘knowledge split’ with consequences for biodiversity science, particularly in the Global South, where researchers face barriers in studying native species’ name bearers housed abroad. Meanwhile, Global North collections remain flooded with non-native name bearers. We relate this imbalance to historical and socioeconomic factors, which ultimately restricts access to critical taxonomic reference materials and hinders global species documentation. To address this disparity, we call for international initiatives to promote fairer access to biological knowledge, including specimen repatriation, improved accessibility protocols for researchers in countries where specimens originated, and inclusive research partnerships.

# Repository structure

## data

This folder stores the raw and processed data used to perform all the
analyses presented in this study.

### raw

- `flow_period_region_country.csv` a data frame in long format
containing the flow of name-bearing type specimens (NBT) between regions
and countries per 50-year time period. Variables:

  - `period` numeric. 50-year time interval

  - `region_type` character. Name of the World Bank region
  of the country where the NBT was sourced

  - `country_type` character. A three-letter code (ISO 3166 alpha-3) representing
  the country where the NBT was sourced

  - `region_museum` character. Name of the World Bank region of the country
  where the NBT is housed

  - `country_museum` character. A three-letter code (ISO 3166 alpha-3) representing
  the country of the museum where the NBT is housed

  - `n` numeric. The number of NBT flowing from one country to another

- `spp_native_distribution.csv` a data frame in long format
containing the native species composition at the country level. Variables:

  - `valid_name` character. The name of a species in the format genus_epithet
  according to the Catalog of Fishes

  - `country_distribution` character. Three-letter code (ISO 3166 alpha-3)
  of the country to which a species is native

  - `region_distribution` character. The name of the World Bank region
  to which a species is native

- `spp_type_distribution.csv` a data frame in long format containing
the composition of NBT by country. Variables:

  - `valid_name` character. The name of a species in the format genus_epithet
  according to the Catalog of Fishes

  - `country_distribution` character. Three-letter code (ISO 3166 alpha-3)
  of the country where the NBT of a species is housed

  - `region_distribution` character. The name of the World Bank region
  where the NBT of a species is housed

- `bio-dem_data.csv` a data frame with data downloaded from
[Bio-Dem](https://bio-dem.surge.sh/#awards) containing
biological and social information at the country level. Variables:

  - `country` character. A three-letter code (ISO 3166 alpha-3) representing
  a country

  - `records` numeric. Total number of species occurrence records from the Global
  Biodiversity Information Facility (GBIF)

  - `records_per_area` numeric. Records per area from GBIF

  - `yearsSinceIndependence` numeric. Years since independence for each country

  - `e_migdppc` numeric. GDP per capita

- `museum_data.csv` a data frame with museum acronyms and the world
region of each museum. Variables:

  - `code_museum` character. The acronym (three-letter code) of the museum

  - `country_museum` character. A three-letter code (ISO 3166 alpha-3) representing
  a country

  - `region_museum` character. The name of the World Bank region
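
As a rough illustration of how the long-format table `flow_period_region_country.csv` can be used, the sketch below sums `n` for each pair of source region and housing region. The column names are taken from the variable list above. The sample rows (periods, regions, countries and counts) are invented for illustration and are not data from the repository, and the aggregation is only an assumption about how a region-level summary might be derived, not the authors' code.

```python
import csv
import io
from collections import defaultdict

# Invented sample rows using the documented column names
# (the values are NOT taken from the repository).
SAMPLE = """period,region_type,country_type,region_museum,country_museum,n
1850,Latin America & Caribbean,BRA,Europe & Central Asia,GBR,12
1850,Latin America & Caribbean,BRA,Latin America & Caribbean,BRA,3
1900,Latin America & Caribbean,PER,Europe & Central Asia,DEU,5
1900,Europe & Central Asia,FRA,Europe & Central Asia,FRA,7
"""


def flows_by_region(rows):
    """Sum n over periods and countries for each (source region, housing region) pair."""
    totals = defaultdict(int)
    for row in rows:
        totals[(row["region_type"], row["region_museum"])] += int(row["n"])
    return dict(totals)


region_flows = flows_by_region(csv.DictReader(io.StringIO(SAMPLE)))
for (source, housed), n in sorted(region_flows.items()):
    print(f"{source} -> {housed}: {n}")
```

To run this against the real file, replace `io.StringIO(SAMPLE)` with an open file handle for the CSV.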

### processed

- `flow_region.csv` a data frame containing the flow of name bearers among world
regions and the total number of name bearers derived from the source region

- `flow_period_region.csv` a data frame with the number of name bearers flowing between
the world regions per 50-year time frame and the total number of name bearers
in each time frame for each world region

- `flow_period_region_prop.csv` a data frame with the number of name bearers,
the Domestic Contribution and the Domestic Retention between the world
regions per 50-year time frame - this is no longer used in downstream analyses

- `flow_region_prop.csv` a data frame with the total number of species flowing
between world regions, Domestic Contribution and Domestic Retention - this is no longer used in downstream analyses

- `flow_country.csv` a data frame with the flow of name bearers among
countries

- `df_country_native.csv` data frame with the number of native species
at the country level

- `df_country_type.csv` data frame with the number of name bearers at the
country level

- `df_all_beta.csv` data frame with values of endemic deficit and non-endemic
representation at the country level

## R

The letters `D`, `A` and `V` represent scripts for, respectively, data
processing (D), data analysis (A) and results visualization (V). The
script sequence to reproduce the workflow is indicated by the numbers at
the beginning of each script file name.

- [`01_D_data_preparation.qmd`](R/01_D_data_preparation.qmd) initial data preparation

- [`02_A_beta-endemics-countries.qmd`](R/02_A_beta-endemics-countries.qmd) analysis of endemic deficit and non-endemic representation. This script is used to calculate `native/endemic deficit` and `non-native/non-endemic representation`

- [`03_D_data_preparation_models.qmd`](R/03_D_data_preparation_models.qmd) script used to build data frames that will be used in statistical models ([`04_A_model_NBTs.qmd`](R/04_A_model_NBTs.qmd))

- [`04_A_model_NBTs.qmd`](R/04_A_model_NBTs.qmd) statistical models for the total number of name bearers, endemic deficit and non-endemic representation

- [`05_V_chord_diagram_Fig1.qmd`](R/05_V_chord_diagram_Fig1.qmd) code used to produce the circular flow diagram. This is Figure 1 of the study

- [`06_V_world_map_Fig1.qmd`](R/06_V_world_map_Fig1.qmd) code used to produce the world map in Figure 1 of the main text

- [`08_V_beta_endemics_Fig3.qmd`](R/08_V_beta_endemics_Fig3.qmd) code used to build Figure 2 of the main text

- [`09_V_model_Fig4.qmd`](R/09_V_model_Fig4.qmd) code used to build Figure 3 of the main text. This is the representation of the results of the models in the script [`04_A_model_NBTs.qmd`](R/04_A_model_NBTs.qmd)

- [`0010_Supplementary_analysis.qmd`](R/0010_Supplementary_analysis.qmd) code to produce all the tables and figures presented in the Supplementary material of this study
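
The naming convention above can be read mechanically. The following is a minimal sketch, assuming only the file names listed in this README, that sorts the scripts by their numeric prefix and maps the `D`/`A`/`V` letter to its role. The helper `parse_script` and the `"unlabelled"` fallback are hypothetical names introduced for illustration; they are not part of the repository.

```python
import re

# Script names as listed in the README, deliberately shuffled.
SCRIPTS = [
    "09_V_model_Fig4.qmd",
    "0010_Supplementary_analysis.qmd",
    "03_D_data_preparation_models.qmd",
    "01_D_data_preparation.qmd",
    "06_V_world_map_Fig1.qmd",
    "02_A_beta-endemics-countries.qmd",
    "08_V_beta_endemics_Fig3.qmd",
    "04_A_model_NBTs.qmd",
    "05_V_chord_diagram_Fig1.qmd",
]

ROLES = {"D": "data processing", "A": "data analysis", "V": "results visualization"}


def parse_script(name):
    """Return (order, role letter or None) parsed from a script file name."""
    match = re.match(r"^(\d+)_(?:([DAV])_)?", name)
    if match is None:
        raise ValueError(f"no numeric prefix in {name!r}")
    return int(match.group(1)), match.group(2)


# Sort by the numeric prefix so that "0010" runs after "09".
ordered = sorted(SCRIPTS, key=lambda name: parse_script(name)[0])
for name in ordered:
    order, letter = parse_script(name)
    print(order, ROLES.get(letter, "unlabelled"), name)
```

Sorting on the parsed integer rather than on the raw string matters here, because a plain string sort would place `0010_Supplementary_analysis.qmd` before `01_D_data_preparation.qmd`.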

## output

### Figures

In this folder you will find all figures used in the main text and supplementary material of this study.

- `Fig1_flow_circle_plot.png` Figure with circular plots showing the flow of name bearers among regions of the world in a 50-year time window

- `Fig3_turnover_metrics_endemics.png` Cartogram with 3 maps showing the level of endemic deficit,
non-endemic representation and the combination of both metrics in a combined map

- `Fig4_models.png` Figure showing the predictions of the number of name bearers,
endemic deficit and non-endemic representation for different predictors.
This is derived from the statistical models

#### Supp-material

This folder contains the figures in the Supplementary material

- `FigS1_native_richness.png` World map with countries coloured by native species richness according to the Catalog of Fishes

- `FigS3_turnover_metrics.png` Cartogram with 3 maps showing the level of
native deficit, non-native representation and the combination of both metrics in a combined map
Market Basket Analysis
kaggle.com
Updated Dec 9, 2021
Cite
Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Dec 9, 2021
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Aslan Ahmedov
Description
Market Basket Analysis

Market basket analysis with Apriori algorithm

The retailer wants to target customers with suggestions on the itemsets they are most likely to purchase. The dataset contains a retailer's transaction data, covering all the transactions that happened over a period of time. The retailer will use the results to grow the business and to suggest itemsets to customers, which should increase customer engagement, improve customer experience and help identify customer behavior. I will solve this problem using association rules, an unsupervised learning technique that checks for the dependency of one data item on another data item.

Introduction

Association rule mining is most often used to find associations between different objects in a set, for example frequent patterns in a transaction database. It can tell you which items customers frequently buy together, and it allows the retailer to identify relationships between items.

An Example of Association Rules

Assume there are 100 customers: 10 of them bought a computer mouse, 9 bought a mouse mat and 8 bought both. For the rule "bought computer mouse => bought mouse mat":

- support = P(mouse & mat) = 8/100 = 0.08
- confidence = support/P(computer mouse) = 0.08/0.10 = 0.80
- lift = confidence/P(mouse mat) = 0.80/0.09 ≈ 8.9

This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
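
The arithmetic can be checked with a few lines of code. This is a minimal sketch using the standard definitions for a rule A => B: support = P(A & B), confidence = support / P(A), and lift = confidence / P(B). The function name `rule_metrics` and its parameters are hypothetical and chosen only for this illustration.

```python
def rule_metrics(n_total, n_antecedent, n_consequent, n_both):
    """Return (support, confidence, lift) for the rule antecedent => consequent."""
    support = n_both / n_total                     # P(A & B)
    confidence = n_both / n_antecedent             # P(B | A) = support / P(A)
    lift = confidence / (n_consequent / n_total)   # confidence / P(B)
    return support, confidence, lift


# 100 customers: 10 bought a mouse, 9 bought a mat, 8 bought both.
support, confidence, lift = rule_metrics(100, 10, 9, 8)
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift well above 1 indicates that the two items are bought together far more often than would be expected if the purchases were independent.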

Strategy

Data Import

Data Understanding and Exploration

Transformation of the data – so that is ready to be consumed by the association rules algorithm

Running association rules

Exploring the rules generated

Filtering the generated rules

Visualization of rules

Dataset Description

File name: Assignment-1_Data

List name: retaildata

File format: .xlsx

Number of rows: 522065

Number of Attributes: 7

BillNo: 6-digit number assigned to each transaction. Nominal.

Itemname: Product name. Nominal.

Quantity: The quantities of each product per transaction. Numeric.

Date: The day and time when each transaction was generated. Numeric.

Price: Product price. Numeric.

CustomerID: 5-digit number assigned to each customer. Nominal.

Country: Name of the country where each customer resides. Nominal.

https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png

Libraries in R

First, we need to load the required libraries. Each one is briefly described below.

arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).

arulesViz - Extends package 'arules' with various visualization techniques for association rules and itemsets. The package also includes several interactive visualizations for rule exploration.

tidyverse - The tidyverse is an opinionated collection of R packages designed for data science.

readxl - Read Excel Files in R.

plyr - Tools for Splitting, Applying and Combining Data.

ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.

knitr - Dynamic Report generation in R.

magrittr - Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.

dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory.

tidyverse - This package is designed to make it easy to install and load multiple 'tidyverse' packages in a single step.

https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png

Data Pre-processing

Next, we need to upload Assignment-1_Data.xlsx to R to read the dataset. Now we can see our data in R.

https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png
https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png

Next, we clean the data frame by removing missing values.

https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png

To apply association rule mining, we need to convert the data frame into transaction data, so that all items that are bought together in one invoice will be in ...
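
The conversion step can be sketched as follows: rows are grouped by `BillNo`, and each invoice becomes one basket of item names. The original analysis uses R and the `arules` package, so this Python version is only an illustration of the idea, not the author's code. The column names come from the dataset description above; the sample rows and the helper `to_transactions` are invented for this example.

```python
from collections import defaultdict

# Invented sample rows; only the BillNo and Itemname columns are needed here.
ROWS = [
    {"BillNo": "100001", "Itemname": "computer mouse"},
    {"BillNo": "100001", "Itemname": "mouse mat"},
    {"BillNo": "100002", "Itemname": "computer mouse"},
    {"BillNo": "100002", "Itemname": None},          # missing item name
    {"BillNo": "100003", "Itemname": "mouse mat "},  # stray whitespace
    {"BillNo": "100001", "Itemname": "mouse mat"},   # repeated line item
]


def to_transactions(rows):
    """Group item names by invoice number, skipping rows with missing values."""
    baskets = defaultdict(set)
    for row in rows:
        bill, item = row.get("BillNo"), row.get("Itemname")
        if not bill or not item:
            continue  # drop incomplete rows, mirroring the cleaning step
        baskets[bill].add(item.strip())
    # Sorted lists make the baskets deterministic and easy to compare.
    return {bill: sorted(items) for bill, items in baskets.items()}


transactions = to_transactions(ROWS)
for bill, items in transactions.items():
    print(bill, items)
```

Using a set per invoice means a product listed twice on the same bill counts once, which matches the usual presence/absence view of a basket in association rule mining.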
Need for speed: Short lifespan selects for increased learning ability - Data...
zenodo.org
search.dataone.org
+1 more
bin
Updated Jun 2, 2022
Cite
Jannis Liedtke; Lutz Fromhage (2022). Need for speed: Short lifespan selects for increased learning ability - Data [Dataset]. http://doi.org/10.5061/dryad.k0p2ngf43
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.5061/dryad.k0p2ngf43
Dataset updated
Jun 2, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Jannis Liedtke; Lutz Fromhage
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
The first dataset provides the R code for the simulation.

The other datasets give trait values of "learning speed" ("L") for each individual in each generation (1-200) for different lifespans (season length, sl), from season length 1 to 800.

One data frame ("Metapop") provides the results of all 10 runs and gives the mean trait value ("L") for a given sl (1-800) for each run.

One data frame ("Meta118") provides the results of all 10 runs and gives the individual trait values ("L", "Picky"), the number of resource items collected ("Colsum") and the summed value of all collected resources ("sumS") for sl = 118 for each run.

The data frames can be loaded into R with, e.g.:

df<-read.csv(".../dfL_1")
Myocardial motion dataset (processed data)
figshare.com
txt
Updated Jun 25, 2018
Cite
Magnus Krogh (2018). Myocardial motion dataset (processed data) [Dataset]. http://doi.org/10.6084/m9.figshare.6631400.v1
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.6631400.v1
Dataset updated
Jun 25, 2018
Dataset provided by
Figshare (http://figshare.com/)
Authors
Magnus Krogh
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
processed_data.pkl: Processed myocardial motion recordings organized into Python data structures. Pkl files can be loaded in Python using the pickle package.

analysed_data.pkl: Measures extracted from the processed data, organised into a pandas dataframe and saved in pickle format.

analysed_data_R.csv: Same data as analysed_data.pkl but exported as .csv for statistical analysis in R.
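
Loading a pickle file takes only a few lines. The sketch below is a minimal example, assuming the file has been downloaded to the working directory; the helper name `load_pickle` is hypothetical, and the structure of the loaded object is not documented here, so the example only prints its type.

```python
import pickle
from pathlib import Path


def load_pickle(path):
    """Load one pickled Python object from path.

    Unpickling can execute arbitrary code, so only open files you trust.
    """
    with Path(path).open("rb") as handle:
        return pickle.load(handle)


# Assumed location: the file sits in the current working directory.
path = Path("processed_data.pkl")
if path.exists():
    processed = load_pickle(path)
    print(type(processed))
```

Because `analysed_data.pkl` holds a pandas dataframe, pandas needs to be installed to unpickle it; `pandas.read_pickle` is an alternative way to load that file.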