U.S. Government Works https://www.usa.gov/government-works
License information was derived automatically
This dataset contains all documents, the text and the pdf files, as well as the code that was used to carry out the term analysis of agriculturally relevant organisms in GBIF. The Global Biodiversity Information Facility (GBIF) is an international network and research infrastructure funded by the world's governments and aimed at providing anyone, anywhere, open access to data about all types of life on Earth. The National Agricultural Library Thesaurus (NALT) has online vocabulary tools of agricultural terms. My task was to use the agricultural terms from the NALT and analyze the agriculturally relevant organisms in GBIF. Some of the goals were:
- Get descriptive statistics about Agrobiodiversity Data (AgData) in GBIF.
- Create visualizations of occurrence trends in the GBIF corpus and in AgData in GBIF, to identify gaps or biases.
- Provide examples of, and code for, how agricultural researchers can work with GBIF data.
Details about the process and the methodologies used to carry out this analysis
I started by trying to extract names from the Agricultural Thesaurus, but ran into problems extracting them from the Thesaurus's RDF format. An employee at the Library later provided me with the names in the Thesaurus as a text file. I then extracted the scientific names from that text file and ran them through the GBIF API. Because there were so many names, the API would throw a connection error: it can handle only a limited number of requests in a given interval of time. To handle this, I used exception handling in Python. Every time the API threw an error, the script waited 5 seconds and then resumed sending requests. Although this took a lot of time, it allowed me to retrieve data such as year of occurrence and coordinate values for the agriculturally relevant organisms from the API.
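The wait-and-retry approach described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook code. It uses the standard-library urllib instead of requests so that it runs without extra installs. The function names (`http_get_json`, `match_name`) and the `max_attempts` cap are assumptions introduced here; only the 5-second pause comes from the description. The URL is GBIF's public species name-matching endpoint.

```python
import json
import time
import urllib.parse
import urllib.request

# GBIF's public name-matching endpoint.
GBIF_MATCH_URL = "https://api.gbif.org/v1/species/match"


def http_get_json(url):
    """Fetch a URL and decode the JSON body (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def match_name(name, fetch=http_get_json, wait_seconds=5,
               max_attempts=10, sleep=time.sleep):
    """Look up one scientific name, pausing and retrying on connection errors.

    `max_attempts` is an assumption; the original description only says the
    script waited 5 seconds and then resumed.
    """
    url = GBIF_MATCH_URL + "?" + urllib.parse.urlencode({"name": name})
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except OSError:
            # URLError, ConnectionError and timeouts all derive from OSError.
            if attempt == max_attempts:
                raise
            sleep(wait_seconds)
```

Passing `fetch` and `sleep` as parameters is a design choice made here so the retry logic can be exercised without network access; calling `match_name("Apis mellifera")` with the defaults would contact the live API.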
Technology
I used Python because it supports both web scraping and data analysis, both of which were needed for this project. I used Jupyter notebooks, run through Anaconda. Project Jupyter is a non-profit, open-source project that supports interactive data analysis and scientific computing. It lets users write code directly in the browser, removes the need to install a separate integrated development environment, and makes it convenient to share code. The main packages used in this project are pandas for data manipulation, requests and json for interacting with the GBIF API, and NumPy for array and matrix operations. Tableau and matplotlib were used to create visualizations after the analysis was performed in Python.
Resources in this dataset:
Resource Title: Code. File Name: Code.zip. Resource Description: This zip file contains multiple Jupyter notebooks that contain the code for all the analysis. Resource Software Recommended: Jupyter notebook, url: http://jupyter.org/
Resource Title: Visualizations. File Name: Visualizations.zip. Resource Description: This zip file contains Tableau workbooks for the visualizations. Resource Software Recommended: Tableau, url: https://www.tableau.com/
Resource Title: Corpus. File Name: Corpus.zip. Resource Description: This zip file contains the two datasets of the families Apidae and Reduviidae.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
R script to clean raw GBIF records, perform Getis-Ord Gi* analysis, and create maps. The vector shapefile giving the total number of clean GBIF records per one-degree-square grid cell is also included here.
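For readers unfamiliar with the statistic, the sketch below computes Getis-Ord Gi* z-scores for record counts per grid cell using the standard formulation. It is an illustration in Python, not the R script from this dataset. The function name, the binary (0/1) weights, and the convention that each cell's neighbour list includes the cell itself are all assumptions, since the description does not say how the neighbourhood was defined.

```python
import math


def getis_ord_gi_star(values, neighbors):
    """Return one Gi* z-score per cell.

    values    -- list of counts, one per grid cell
    neighbors -- neighbors[i] lists the indices in cell i's neighbourhood,
                 including i itself (binary weights; an assumption here)
    """
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - mean * mean)
    if s == 0:
        raise ValueError("all values are identical; Gi* is undefined")
    scores = []
    for i in range(n):
        idx = neighbors[i]
        w_sum = len(idx)     # sum of binary weights w_ij
        w_sq_sum = len(idx)  # sum of w_ij squared (same for 0/1 weights)
        numerator = sum(values[j] for j in idx) - mean * w_sum
        denominator = s * math.sqrt((n * w_sq_sum - w_sum ** 2) / (n - 1))
        scores.append(numerator / denominator)
    return scores


# Made-up counts along a strip of ten cells, with a cluster in the middle.
counts = [1, 1, 1, 1, 9, 9, 1, 1, 1, 1]
strip_neighbors = [
    [j for j in (i - 1, i, i + 1) if 0 <= j < len(counts)]
    for i in range(len(counts))
]
z_scores = getis_ord_gi_star(counts, strip_neighbors)
```

Large positive scores mark cells surrounded by high counts (hot spots) and large negative scores mark cold spots; in this made-up strip the two middle cells receive the highest scores.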
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
gbif-plants-raw
A large-scale dataset of 96.1 million research-grade plant observations sourced from iNaturalist Open Data and aligned with GBIF taxonomy. Each row contains species metadata, taxonomic identifiers, geolocation, event timing, dataset source info, and a direct image URL. This dataset is designed for large-scale image classification, biodiversity modelling, and pretraining work.
Dataset Summary
This dataset aggregates all research-grade Plantae… See the full description on the dataset page: https://huggingface.co/datasets/juppy44/gbif-plants-raw.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Species occurrence records for native and non-native bees, wasps and other insects collected mainly by pan, Malaise, and vane trapping and by insect netting in Canada, Mexico, the non-contiguous United States, U.S. Territories (specifically U.S. Virgin Islands), U.S. Minor Outlying Islands and other global locations, with the bulk of the specimens coming from the Eastern United States, often from Federal lands such as those of USFWS, NPS, DOD, and USFS. Some records also contain notes regarding plants or substrates from which insects were collected, or that were present and/or in flower at the time the insects were collected. Unless otherwise noted, taxonomic determinations (identifications) were completed by Sam Droege (USGS Eastern Ecological Science Center, EESC, Native Bee Laboratory) and Clare Maffei (USFWS, Inventory and Monitoring Branch).
The EESC Native Bee Lab currently keeps only a small synoptic collection; rare and voucher specimens are deposited in the Smithsonian National Collection (NMNH) and widely distributed to other institutions for DNA, revisions, and augmentation of existing collections. Surplus specimens are also made available to students learning identification. Corrections to any of our determinations are always welcome. Common species that are not in demand as surplus are usually destroyed and the pins recycled. Recent revisions to Lasioglossum, Ceratina, and to a much lesser extent Triepeolus, Epeolus, and other small groups have rendered determinations made before those revisions out of date for species involved in name changes, and users should account for that during analyses. Current data (including information on specimen codes without identifications) are always available without charge directly from Sam Droege.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
GBIF Data Backbone File -- Smithsonian Gap Analysis Tool; Data download of the GBIF database (https://www.gbif.org/) formatted for use in the Smithsonian Gap Analysis tool
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Negative control lane for ancient DNA taken from core MV1012 46.9
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A dataset containing 32117 species occurrences available in GBIF matching the query: Depth: 80m TaxonKey: Pimelodella australis Eigenmann, 1917. The dataset includes 32117 records from 561 constituent datasets: 16 records from Field Museum of Natural History (Zoology) Invertebrate Collection. 63 records from macnin. 111 records from Mollusca of Costa Rica (INBio). 581 records from The molluscs collection (IM) of the Muséum national d'Histoire naturelle (MNHN - Paris). 2 records from (Appendix 1) Coral analysis and isostatic rebound effects from different Holes of IODP Expedition 310. 1 records from ZUEC-OPH - Coleção de Ophiuroidea do Museu de Zoologia da UNICAMP. 46 records from (Table 5) Distribution of Miocene planktonic foraminifers in sediments of ODP Hole 138-848B. 71 records from Mollusca collection of National Museum of Nature and Science. 2 records from Zoology (Museum of Evolution - Uppsala). 53 records from TCWC Marine Invertebrates. 2 records from (Table 2) Oligocene to Pliocene nannofossil range chart for ODP Hole 134-828A. 128 records from CSIRO Ichthyology provider for OZCAM. 729 records from Fishbase. 41 records from Occurrence records of southern African aquatic biodiversity. 22 records from Biological Reference Collections ICM CSIC. 5 records from ZUEC-GAS - Coleção de Gastropoda do Museu de Zoologia da UNICAMP. 1 records from (Table 5) Oligocene to Pleistocene nannofossil range chart for ODP Hole 134-829A. 204 records from Abundance of megabenthic species in trawl catches per station in addition to table 2 during POLARSTERN cruise ARK-VIII/2 (EPOS). 4 records from (Table 7) Abundance of silicoflagellates and ebridians in selected samples from ODP Hole 138-850B. 3 records from UAM Fish Collection (Arctos). 2 records from Antarctic Porifera database from the Spanish benthic expeditions: Bentart, Gebrap and Ciemar. 6 records from Colección de Crustáceos Decápodos y Estomatópodos del Centro Oceanográfico de Cádiz: CCDE-IEOCD. 
3 records from (Appendix B) Nannofossil abundance in ODP Hole 165-998A sediments. 526 records from Museum Victoria provider for OZCAM. 10 records from (Table 4) Relative abundances of stratigraphically useful planktonic foraminifers from ODP Site 167-1013 sediments. 12 records from Flora of tanzania. 2 records from CSIRO, Benthic Plant Invertebrate and Fish Biodiversity, Great Barrier Reef, Northeast Australia, 2003-2006. 143 records from BMSM Bailey-Matthews National Shell Museum. 652 records from CSIRO, Soviet Fishery Data, Australia, 1965-1978. 142 records from CAS Invertebrate Zoology (IZ). 3 records from Microplankton abundance measured on water bottle samples during AEGAEO cruise LIA-8. 1 records from Freshwater plants of Cameroon. 36 records from Colección Nacional de Foraminíferos - Museo Argentino de Ciencias Naturales 'Bernardino Rivadavia'. 9 records from Colección de Invertebrados Cenpat (CNP-INV). 203 records from Norwegian Biodiversity Information Centre - Other datasets. 2 records from ZUEC-BIV - Coleção de Bivalvia do Museu de Zoologia da UNICAMP. 658 records from Museum of Comparative Zoology, Harvard University. 55 records from (Table 4) Distribution of benthic foraminifers from ODP Hole 134-829A. 32 records from (Appendix G) Benthic foraminifera extinction group species in ODP Hole 167-1012B. 6 records from (Supplement 1) Most common taxa or groups of coccoliths from ODP Site 1233 (past dataset) covering the last 70 kyr. 2 records from Arthropoda Collection of the Seto Marine Biological Laboratory, Kyoto University. 2 records from Cnidaria Collection of the Seto Marine Biological Laboratory, Kyoto University. 1 records from Coleção de Polychaeta do Museu Nacional. 381 records from Invertebrates Collection of the Swedish Museum of Natural History. 9 records from KUBI Ichthyology Tissue Collection. 1 records from University of California Museum of Paleontology. 3571 records from NMNH occurrence DwC-A. 
62 records from Colección Nacional de Invertebrados - Museo Argentino de Ciencias Naturales 'Bernardino Rivadavia'. 22 records from Counting of planktic foraminifera of ODP Hole 182-1129C. 6 records from (Table AT1) Grain size distribution and stable isotope record of benthic foraminifera of ODP Hole 175-1085A. 614 records from CSIRO, Marine Data Warehouse Biology Records Pre-1998, Australia, 1978-1997. 2 records from (Table AT3) Planktonic foraminiferal stratigraphy of ODP Hole 182-1128B. 10 records from Stable oxygen isotope ratios of foraminifera from late middle Eocene sediments of ODP Site 171-1052 from Blake Plateau, West Atlantic Ocean (Appendix). 11 records from Occurrence of planktic foraminifera in Pliocene to Holocene sediments of DSDP Hole 81-552A in the North Atlantic (Appendix 2). 54 records from Total foraminifera counts of multinet M6/7_MSN100. 2 records from Large protozoa abundance measured on concentrated water bottle samples at station PS58/009-6. 1 records from Large protozoa abundance measured on concentrated water bottle samples at station PS58/011-3. 2 records from Large protozoa abundance measured on concentrated water bottle samples at station PS58/012-4. 2 records from Large protozoa abundance measured on concentrated water bottle samples at station PS58/014-6. 2 records from Large protozoa abundance measured on concentrated water bottle samples at station PS58/041-5. 1 records from Large protozoa abundance measured on concentrated water bottle samples at station PS58/042-5. 2 records from Large protozoa abundance measured on concentrated water bottle samples at station PS58/045-9. 2 records from Large protozoa abundance measured on concentrated water bottle samples at station PS58/046-5. 1 records from Large protozoa abundance measured on concentrated water bottle samples at station PS58/048-5. 1 records from Large protozoa abundance measured on concentrated water bottle samples at station PS58/049-5. 
2 records from Large protozoa abundance measured on concentrated water bottle samples at station PS58/061-3. 1 records from Large protozoa abundance measured on concentrated water bottle samples at station PS58/088-7. 2 records from Large protozoa abundance measured on concentrated water bottle samples at station PS58/090-2. 2 records from Large protozoa abundance measured on concentrated water bottle samples at station PS58/091-3. 2 records from Large protozoa abundance measured on concentrated water bottle samples at station PS58/092-6. 3 records from Large protozoa abundance measured on concentrated water bottle samples at station PS58/107-6. 2 records from Large protozoa abundance measured on concentrated water bottle samples at station PS58/108-3. 6 records from Microzooplankton (larger protists and small copepods) abundance measured on concentrated water bottle samples at station PS65/424-22. 6 records from Microzooplankton (larger protists and small copepods) abundance measured on concentrated water bottle samples at station PS65/427-6. 7 records from Microzooplankton (larger protists and small copepods) abundance measured on concentrated water bottle samples at station PS65/508-22. 6 records from Microzooplankton (larger protists and small copepods) abundance measured on concentrated water bottle samples at station PS65/509-16. 10 records from Microzooplankton (larger protists and small copepods) abundance measured on concentrated water bottle samples at station PS65/511-12. 5 records from Microzooplankton (larger protists and small copepods) abundance measured on concentrated water bottle samples at station PS65/514-18. 6 records from Microzooplankton (larger protists and small copepods) abundance measured on concentrated water bottle samples at station PS65/543-5. 7 records from Microzooplankton (larger protists and small copepods) abundance measured on concentrated water bottle samples at station PS65/544-6. 
9 records from Microzooplankton (larger protists and small copepods) abundance measured on concentrated water bottle samples at station PS65/546-19. 8 records from Microzooplankton (larger protists and small copepods) abundance measured on concentrated water bottle samples at station PS65/553-11. 6 records from Microzooplankton (larger protists and small copepods) abundance measured on concentrated water bottle samples at station PS65/570-14. 7 records from Microzooplankton (larger protists and small copepods) abundance measured on concentrated water bottle samples at station PS65/580-12. 6 records from Microzooplankton (larger protists and small copepods) abundance measured on concentrated water bottle samples at station PS65/587-14. 7 records from Microzooplankton (larger protists and small copepods) abundance measured on concentrated water bottle samples at station PS65/593-9. 6 records from Large protozoan and small metazoan abundance measured on concentrated water bottle samples at station PS65/590-1. 4 records from Colección Ictiológica del CENPAT-CONICET. 34 records from Spores and dinoflagellates of Site 175-1075. 13 records from Abundance of microzooplankton measured on water bottle samples during POLARSTERN cruise ANT-X/6. 2 records from Biological data measured on water bottle sampels at station AT_II-119/5_35-4. 1 records from Microzooplankton abundance and biomass at station TT050_13-14. 1 records from Microzooplankton abundance and biomass at station TT050_17-4. 1 records from Microzooplankton abundance and biomass at station TT050_21-13. 1 records from Microzooplankton abundance and biomass at station TT054_13-18. 1 records from Abundance, biovolume and biomass of heterotrophic dinoflagellates at station TT007_1-CTD8. 1 records from Abundance, biovolume and biomass of heterotrophic dinoflagellates at station TT007_10-CTD124. 1 records from Abundance, biovolume and biomass of heterotrophic
The purpose of this dataset is to evaluate the impact of fires on reptile and amphibian biodiversity in California's southwest desert. Species data were downloaded from the Global Biodiversity Information Facility (GBIF). GBIF.org (28 July 2021) GBIF Occurrence Download https://doi.org/10.15468/dl.6kvrr7
The purpose of this dataset is to evaluate the impact of fires on endangered species biodiversity in California's southwest desert. Species data were downloaded from the Global Biodiversity Information Facility (GBIF). Wildland fire data were downloaded from the National Interagency Fire Network.
Traffic analytics, rankings, and competitive metrics for gbif.org as of November 2025
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Southern Ocean is a remote, hostile environment where conducting marine biology is challenging, so we know relatively little about this important region, which is a critical breeding and foraging habitat for many marine endotherms. Scientists from around the world have been tracking seals, penguins, petrels, whales and albatrosses for more than two decades to learn how they spend their time at sea. The Retrospective Analysis of Antarctic Tracking Data (RAATD) was initiated by the SCAR Expert Group on Birds and Marine Mammals (EG-BAMM) in 2010. This team has assembled tracking data shared by 38 biologists from 11 different countries to build the largest animal tracking database in the world, covering 15 species, over 3,400 individual animals and almost 2.5 million at-sea locations. Analysing a dataset of this size brings its own challenges, and the team is developing new statistical approaches to integrate these complex data. When complete, RAATD will provide a greater understanding of fundamental ecosystem processes in the Southern Ocean, help predict the future distribution of top predators, and support spatial management planning.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
# GBIF Specimen Data Analysis and Forecasting
This repository contains the code and data for analysing and forecasting trends in Global Biodiversity Information Facility (GBIF) specimen records across three major taxonomic groups: Chordata, Arthropoda, and Plantae.
The analysis pipeline includes data cleaning, anomaly detection, primary analyses, and forecasting based on historical database snapshots.
These scripts and data correspond to analyses in the following manuscript:
Global Sampling Decline Erodes Science Potential of Natural History Collections
Authors:
Owen Forbes
Andrew G. Young
Peter H. Thrall
## Repository Structure
The repository consists of three main Quarto (.qmd) scripts and associated data files:
1. `1_DataCleaning_Forbes-et-al_2024.qmd`: Data cleaning and anomaly detection
2. `2_PrimaryAnalyses_Forbes-et-al_2024.qmd`: Primary analyses and visualisation
3. `3_SnapshotsForecasting_Forbes-et-al_2024.qmd`: Historical snapshot analysis and forecasting
## Requirements
- R (version 4.3.2 or later)
- Required R packages:
  - tidyverse (v2.0.0) - for data manipulation and visualization
  - readr (v2.1.5) - for reading CSV/TSV files
  - ggplot2 (v3.4.0 or v3.5.0) - for creating visualizations
  - rnaturalearth (v1.0.1) - for accessing natural earth map data
  - dplyr (v1.1.0 or v1.1.4) - for data manipulation
  - countrycode (v1.6.0) - for converting country names and codes
  - spdep (v1.3-3) - for spatial dependence modeling
  - sp (v1.6-0 or v2.1-3) - for spatial data manipulation
  - sf (v1.0-15 or v1.0-16) - for simple features access
  - data.table (v1.14.8) - for fast aggregation of large data
  - lubridate (v1.9.2) - for date-time manipulation
  - viridis (v0.6.3) - for color palettes
  - gridExtra (v2.3) - for arranging multiple plots
  - ggpubr (v0.6.0) - for creating publication-ready plots
  - zoo (v1.8-12) - for time series, including moving averages
  - scales (v1.3.0) - for graphical scales
  - forecast (v8.22.0) - for ARIMA forecast models
  - purrr (v1.0.2) - for mapping custom forecast function onto each dataset
  - arrow - for working with parquet files
Install these packages before running the scripts.
## How to Use
1. Download this repository to your local machine.
2. Set your working directory to the location of the scripts.
3. Download the raw datasets from GBIF (as required).
4. Ensure all required R packages are installed.
5. Run the scripts in RStudio or your preferred R environment.
### Data Cleaning (`1_DataCleaning_Forbes-et-al_2024.qmd`)
This script cleans the raw GBIF data and identifies anomalies. It produces files containing indexes of dataset records to be removed, which are used in subsequent analyses.
**Note**: The raw GBIF exported datasets for contemporary records are not included in this repository due to file size constraints. Download them from the GBIF links provided in the script and place them in the `data/` directory.
### Primary Analyses (`2_PrimaryAnalyses_Forbes-et-al_2024.qmd`)
This script performs the main analyses and generates visualisations. It uses the outputs from the data cleaning script to filter anomalous records.
To reproduce all analysis stages from the original raw .csv files:
- Start at the chunks labelled "DATA LOAD AND FILTERING".
- Run the pipeline for non-spatial analyses before spatial analyses.
- Due to memory constraints, it's recommended to run analyses for one taxonomic group and one analysis stream at a time.
To skip to plot generation:
- Navigate to sections tagged as "@! SKIP TO PLOTTING !@".
- Ensure all required analysis output files are in the `data/` directory.
### Forecasting (`3_SnapshotsForecasting_Forbes-et-al_2024.qmd`)
This script analyses historical GBIF database snapshots and forecasts future growth. It uses the cleaned snapshot data produced by the data cleaning script.
## Data Files
### GBIF Exports - Raw Data (not included on Zenodo due to file size, please download directly from GBIF)
- `0016915-240425142415019.csv` for Chordata - https://www.gbif.org/occurrence/download/0016915-240425142415019
- `0016914-240425142415019.csv` for Plantae - https://www.gbif.org/occurrence/download/0016914-240425142415019
- `0016913-240425142415019.csv` for Arthropoda - https://www.gbif.org/occurrence/download/0016913-240425142415019
### Included Data Files
#### Raw Data
- `GBIF_snapshots.parquet` # Historical snapshots RAW dataset (arrow/parquet format)
- `GBIF_integer_to_datasetKey.tsv` # Mapping old dataset IDs onto new datasetKey field
#### Contemporary Datasets - data cleaning outputs
- `chordata_counts_to_highlight_030724` # List of anomalous Chordata dataset + year indexes to filter
- `arthropoda_counts_to_highlight_OG_030724` # List of anomalous Arthropoda dataset + year indexes to filter
- `plantae_counts_to_highlight_030724` # List of anomalous Plantae dataset + year indexes to filter
#### Cleaned Snapshots
- `plantae_snapshots_filter_threshold_IN_040924` # Cleaned Plantae snapshots
- `arthropoda_snapshots_filter_threshold_IN_040924` # Cleaned Arthropoda snapshots
- `chordata_snapshots_filter_threshold_IN_040924` # Cleaned Chordata snapshots
- `gbif_dates_df_anomaly_filtered_090724` # Anomaly-filtered snapshots (combined dataset)
- `gbif_dates_df_anomalies_highlighted_090724` # Anomalies highlighted snapshots (combined dataset)
#### Analysis Outputs - for skipping straight to plot/figure generation
- `arthropoda_specimens_per_year_080724` # Arthropoda specimen counts per year
- `arthropoda_unique_species_per_year_080724` # Arthropoda unique species counts per year
- `arthropoda_grid_counts_080724` # Arthropoda grid counts
- `chordata_specimens_per_year_080724` # Chordata specimen counts per year
- `chordata_unique_species_per_year_080724` # Chordata unique species counts per year
- `chordata_grid_counts_080724` # Chordata grid counts
- `plantae_specimens_per_year_080724` # Plantae specimen counts per year
- `plantae_unique_species_per_year_080724` # Plantae unique species counts per year
- `plantae_grid_counts_080724` # Plantae grid counts
- `chordata_continent_count_080724` # Chordata continent-specific counts
- `arthropoda_continent_count_080724` # Arthropoda continent-specific counts
- `plantae_continent_count_080724` # Plantae continent-specific counts
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A vigorous debate among ecologists concerns two contrasting theories of species distribution and diversity, the niche theory and the neutral theory. The 'continuum hypothesis', supported by modelling results, maintains that rather than being mutually exclusive, these theories represent two ends of a continuum. Here we develop the first empirical test capable of distinguishing between these three theories using continental-scale occurrence data from GBIF and a novel simulation framework of corresponding virtual species; application of this test to a set of 84 Australian mammals supported the continuum hypothesis over the two competing theories.
Repository contains:
- Manuscript supplementary information (Sp.Dis_F1000-Supplementary.pdf)
- All analysis data and code (analysis_data_and_code.zip)
- GBIF raw data in a DwC-A format (0054618-160910150852091.zip). Data is also publicly available via GBIF, with the following DOI: https://doi.org/10.15468/dl.3poqxs
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
CSV that contains 1,000 GBIF observations of animals that have been involved in wildlife–vehicle collisions on interurban roads in Spain, together with a buffer of each species' distribution calculated from these data. If you are interested in the data for the whole country, please do not hesitate to contact me and I will forward it to you.
Each record describes the observation by the following fields:
gbifid: the unique identifier for an occurrence record in GBIF.
datasetkey: the local dataset id within the GBIF network.
occurrenceid: a unique identifier for the occurrence, allowing the same occurrence to be recognized across dataset versions as well as through data downloads and use.
kingdom: the full scientific name specifying the kingdom that the occurrence's scientific name is classified under.
phylum: the full scientific name of the phylum or division in which the taxon is classified.
class: the full scientific name of the class in which the taxon is classified.
order: the full scientific name of the order in which the taxon is classified.
family: the full scientific name of the family in which the taxon is classified.
genus: the full scientific name of the genus in which the taxon is classified.
species: the scientific name of the species in which the taxon is classified.
infraspecificepithet: the name of the lowest or terminal infraspecific epithet of the scientificName, excluding any rank designation.
taxonrank: the taxonomic rank of the supplied scientific name.
scientificname: the full scientific name of the organism, to the lowest level taxonomic rank that is possible to supply, and including authorship and year of the name where applicable.
verbatimscientificname: the scientific name as it appears in the original record.
verbatimscientificnameauthorship: not described.
countrycode: a two-letter standard abbreviation for the country of the occurrence locality.
locality: the specific description of the place.
stateprovince: the name of the next smaller administrative region than country (state, province, canton, department, region, etc.) in which the Location occurs.
occurrencestatus: a statement about the presence or absence of a Taxon at a Location.
individualcount: the quantity of a species occurrence, e.g. the number of individuals, percentage of vegetation coverage, or biomass.
publishingorgkey: the publishing organization key (a uuid).
decimallatitude: the geographic latitude in decimal degrees.
decimallongitude: the geographic longitude in decimal degrees.
coordinateuncertaintyinmeters: the horizontal distance from the given decimalLatitude and decimalLongitude in meters, describing the smallest circle containing the whole of the Location.
coordinateprecision: a decimal representation of the precision of the coordinates given in the decimalLatitude and decimalLongitude.
elevation: elevation (altitude) in meters above sea level. Supports range queries.
elevationaccuracy: not described.
depth: depth in meters relative to altitude. For example 10 meters below a lake surface with given altitude. Supports range queries.
depthaccuracy: not described.
eventdate: the date or date interval during which the occurrence record was collected, following ISO 8601 date-time standard.
day: the integer day of the month on which the Event occurred.
month: the integer month in which the Event occurred.
year: the four-digit year in which the Event occurred, according to the Common Era Calendar.
taxonkey: a taxon key from the GBIF backbone.
specieskey: species classification key.
basisofrecord: the type of the individual record, e.g. observation, physical specimen, fossil, living ex-situ, culture collection specimen.
institutioncode: the name (or acronym) in use by the institution having custody of the object(s) or information referred to in the record.
collectioncode: the name, acronym, coden, or initialism identifying the collection or data set from which the record was derived.
catalognumber: an identifier (preferably unique) for the record within the data set or collection.
recordnumber: an identifier given to the Occurrence at the time it was recorded. Often serves as a link between field notes and an Occurrence record, such as a specimen collector's number.
identifiedby: a list (concatenated and separated) of names of people, groups, or organizations who assigned the Taxon to the subject.
dateidentified: the date on which the subject was determined as representing the Taxon.
license: a machine-readable statement of the rights assigned to the published dataset.
rightsholder: a person or organization owning or managing rights over the resource.
recordedby: a list (concatenated and separated) of names of people, groups, or organizations responsible for recording the original occurrence.
typestatus: a list (concatenated and separated) of nomenclatural types (type status, typified scientific name, publication) applied to the subject.
establishmentmeans: the process by which the biological individual(s) represented in the Occurrence became established at the location.
lastinterpreted: the date the record was last modified in GBIF, in ISO 8601 format: yyyy, yyyy-MM, yyyy-MM-dd, or MM-dd.
mediatype: the kind of multimedia associated with an occurrence as defined in GBIF MediaType enum.
issue: a specific interpretation issue as defined in GBIF OccurrenceIssue enum.
geom (geometry): geometry from latitude and longitude position. Developed for this project.
buff (geometry): buffer around 'geom' taking into account 'coordinateuncertaintyinmeters' and 'coordinateprecision'. Developed for this project.
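As a rough illustration of how these fields might be read, the sketch below loads records with Python's standard csv module, keeps only rows that have coordinates, and converts the numeric fields. The comma delimiter, the function name `load_located_records`, and the three sample rows are assumptions made up for this example; they are not taken from the actual file.

```python
import csv
import io


def load_located_records(csv_text, delimiter=","):
    """Parse occurrence rows and keep those with both coordinates present."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    records = []
    for row in reader:
        lat = row.get("decimallatitude")
        lon = row.get("decimallongitude")
        if not lat or not lon:
            continue  # skip rows without a usable position
        uncertainty = row.get("coordinateuncertaintyinmeters")
        records.append({
            "gbifid": row["gbifid"],
            "species": row.get("species", ""),
            "lat": float(lat),
            "lon": float(lon),
            # None when the publisher gave no uncertainty value
            "uncertainty_m": float(uncertainty) if uncertainty else None,
        })
    return records


# Made-up rows for illustration only; not real records from the dataset.
SAMPLE = """gbifid,species,decimallatitude,decimallongitude,coordinateuncertaintyinmeters
1,Sus scrofa,41.5,2.1,250
2,Vulpes vulpes,,,
3,Capreolus capreolus,42.0,-3.7,
"""

located = load_located_records(SAMPLE)
```

Keeping a missing uncertainty as `None` rather than 0 avoids treating an unknown value as a perfectly precise coordinate when buffers are built later.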
The context is the Final Master's Degree Project 'Analysis and Predictive Modelling of Wildlife–Vehicle Collision on Interurban Roads in Spain' (Data Science Master’s Degree of Universitat Oberta de Catalunya - UOC).
This dataset is the output of the animal analysis and the code repository is available on GitHub.
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
A biodiversity dataset graph: GBIF, iDigBio, BioCASe hash://sha256/450deb8ed9092ac9b2f0f31d3dcf4e2b9be003c460df63dd6463d252bff37b55 hash://md5/898a9c02bedccaea5434ee4c6d64b7a2
The intended use of this archive is to facilitate meta-analysis of the Global Biodiversity Information Facility, Integrated Digitized Biocollections, Biological Collection Access Service (GBIF, iDigBio, BioCASe). GBIF, iDigBio and BioCASe help provide access to biological data collections.
This dataset provides versioned provenance logs of snapshots of the GBIF, iDigBio, BioCASe network as tracked by Preston [2] between 2018-09-03 and 2023-02-02 using "preston update -u https://gbif.org,https://idigbio.org,http://biocase.org".
This publication contains two types of files: index files and provenance logs. Associated data files are hosted elsewhere for pragmatic reasons. Index files provide a way to link provenance files in time to establish a versioning mechanism. Provenance logs describe how, when, what and where the GBIF, iDigBio, BioCASe content was retrieved. For more information, please visit https://preston.guoda.bio or https://doi.org/10.5281/zenodo.1410543 .
To retrieve and verify the downloaded GBIF, iDigBio, BioCASe biodiversity dataset graph, use the preston[2] command-line tool to "clone" this dataset using:
$ java -jar preston.jar ls --remote https://zenodo.org/record/7651831/files > /dev/null
Optionally, you can retrieve all associated data (>500GB) files using:
$ java -jar preston.jar clone https://zenodo.org/record/7651831/files --remote https://zenodo.org/record/7651831/files,https://linker.bio,https://archive.org/download/biodiversity-dataset-archives/data.zip/data/
Please note that https://archive.org/download/biodiversity-dataset-archives/data.zip/data/ and https://linker.bio are Preston remotes that provided access to GBIF, iDigBio, BioCASe data files at time of writing (17 Feb 2023). These remotes can be replaced with any other Preston remote(s) if needed. Retrieving the data may take a while depending on network speed and hardware constraints. See also https://archive.org/details/biodiversity-dataset-archives .
After that, verify the index of the archive by reproducing the following provenance log history:
$ java -jar preston.jar history
hash://sha256/450deb8ed9092ac9b2f0f31d3dcf4e2b9be003c460df63dd6463d252bff37b55 http://www.w3.org/ns/prov#wasDerivedFrom hash://sha256/1621a777e4a7442e9864424820c5f825d9cf1c65599cbfbbda039384f1b74ada .
hash://sha256/1621a777e4a7442e9864424820c5f825d9cf1c65599cbfbbda039384f1b74ada http://www.w3.org/ns/prov#wasDerivedFrom hash://sha256/b1c11d9231768def925b9d076c1c4b711a727326ad99e62982aa4ede288e5aa2 .
hash://sha256/b1c11d9231768def925b9d076c1c4b711a727326ad99e62982aa4ede288e5aa2 http://www.w3.org/ns/prov#wasDerivedFrom hash://sha256/125e67d7c5077af8fa958569644d61e44a39bbbbdaaf16af0430dcf441e05cec .
hash://sha256/125e67d7c5077af8fa958569644d61e44a39bbbbdaaf16af0430dcf441e05cec http://www.w3.org/ns/prov#wasDerivedFrom hash://sha256/63fae82e8a3aacd11e4a06b5736242aabe40802c6259a38de066de14848e3718 .
hash://sha256/63fae82e8a3aacd11e4a06b5736242aabe40802c6259a38de066de14848e3718 http://www.w3.org/ns/prov#wasDerivedFrom hash://sha256/7cd305e9d275763c96e7685847460fcc381b5c97c1460c00441f663c1788800f .
hash://sha256/7cd305e9d275763c96e7685847460fcc381b5c97c1460c00441f663c1788800f http://www.w3.org/ns/prov#wasDerivedFrom hash://sha256/38e8e17f6742d39379b96cec2d4e70a5a63a85a28aee49727031c9061f4b1e03 .
hash://sha256/38e8e17f6742d39379b96cec2d4e70a5a63a85a28aee49727031c9061f4b1e03 http://www.w3.org/ns/prov#wasDerivedFrom hash://sha256/da7450941e7179c973a2fe1127718541bca6ccafe0e4e2bfb7f7ca9dbb7adb86 .
hash://sha256/da7450941e7179c973a2fe1127718541bca6ccafe0e4e2bfb7f7ca9dbb7adb86 http://www.w3.org/ns/prov#wasDerivedFrom hash://sha256/aab08c5c87ce6a8f400972e2b09b7fa3421947b59407a8feb98388d7e42b49e8 .
hash://sha256/aab08c5c87ce6a8f400972e2b09b7fa3421947b59407a8feb98388d7e42b49e8 http://www.w3.org/ns/prov#wasDerivedFrom hash://sha256/f449a34dd80d4e33248a1a7cb0d0fa2b8dac49865a0a32ed5bbaacb22addb0d1 .
hash://sha256/f449a34dd80d4e33248a1a7cb0d0fa2b8dac49865a0a32ed5bbaacb22addb0d1 http://www.w3.org/ns/prov#wasDerivedFrom hash://sha256/b771ed09aea78055e39d5c955997e5d9b42dd9edc6b094d9b8a27df16bdc6b6c .
hash://sha256/b771ed09aea78055e39d5c955997e5d9b42dd9edc6b094d9b8a27df16bdc6b6c http://www.w3.org/ns/prov#wasDerivedFrom hash://sha256/15edbac974fb77347e07cda76358f7f662dd800bfc5b3e476fc66ecdc6203d03 .
hash://sha256/15edbac974fb77347e07cda76358f7f662dd800bfc5b3e476fc66ecdc6203d03 http://www.w3.org/ns/prov#wasDerivedFrom hash://sha256/4000d2a1af6da5b46f374038d884f91768782a1905d4a75fff3c8c3bb6629913 .
hash://sha256/4000d2a1af6da5b46f374038d884f91768782a1905d4a75fff3c8c3bb6629913 http://www.w3.org/ns/prov#wasDerivedFrom hash://sha256/f6d133620a665569a13a3fb7ca31b163bf849864812d447238994226d35e3253 .
hash://sha256/f6d133620a665569a13a3fb7ca31b163bf849864812d447238994226d35e3253 http://www.w3.org/ns/prov#wasDerivedFrom hash://sha256/e10c234ac54f02fd63da87b418f36428b876d91a30a42a4657e1726ba862b900 .
hash://sha256/e10c234ac54f02fd63da87b418f36428b876d91a30a42a4657e1726ba862b900 http://www.w3.org/ns/prov#wasDerivedFrom hash://sha256/83b4553edfc58e6389d427a08de533236e6a7eeb39b61239d225b0d4188d8c84 .
hash://sha256/83b4553edfc58e6389d427a08de533236e6a7eeb39b61239d225b0d4188d8c84 http://www.w3.org/ns/prov#wasDerivedFrom hash://sha256/49e079a0bac47ca17c0b14fa711b7742b9332ac64e1866adf13d294692720f9f .
hash://sha256/49e079a0bac47ca17c0b14fa711b7742b9332ac64e1866adf13d294692720f9f http://www.w3.org/ns/prov#wasDerivedFrom hash://sha256/000cd23a8494a8a18f8b552e7f113af418eb2ae85e9908f61f44c720ce70608b .
hash://sha256/000cd23a8494a8a18f8b552e7f113af418eb2ae85e9908f61f44c720ce70608b http://www.w3.org/ns/prov#wasDerivedFrom hash://sha256/1cce63489aa8618dcaf19ce2cd6166a7ba801798b235a25a725397d38c2fe957 .
hash://sha256/1cce63489aa8618dcaf19ce2cd6166a7ba801798b235a25a725397d38c2fe957 http://www.w3.org/ns/prov#wasDerivedFrom hash://sha256/251fa349c051bbda370decb7e5e58960d702add59f6e131ebf7c960d0f93b417 .
hash://sha256/251fa349c051bbda370decb7e5e58960d702add59f6e131ebf7c960d0f93b417 http://www.w3.org/ns/prov#wasDerivedFrom hash://sha256/810b22c16e1a3911c6eecfca348758d3ffd5b29fc36990015cda6427bdde2233 .
hash://sha256/810b22c16e1a3911c6eecfca348758d3ffd5b29fc36990015cda6427bdde2233 http://www.w3.org/ns/prov#wasDerivedFrom hash://sha256/dd16f4bae9a02ce71bc3ba4da2809cc5035743a4e23f61f5631f69b08d0e40f5 .
hash://sha256/dd16f4bae9a02ce71bc3ba4da2809cc5035743a4e23f61f5631f69b08d0e40f5 http://www.w3.org/ns/prov#wasDerivedFrom hash://sha256/bf676e2ce4164f8148a793188650f07c464dc52b2bfc07e92c9f16041baba8d5 .
hash://sha256/bf676e2ce4164f8148a793188650f07c464dc52b2bfc07e92c9f16041baba8d5 http://www.w3.org/ns/prov#wasDerivedFrom hash://sha256/1fd3e156c6ba1632a27b2bebaea36f76afeac8dfecf530d772988832821304ea .
hash://sha256/1fd3e156c6ba1632a27b2bebaea36f76afeac8dfecf530d772988832821304ea http://www.w3.org/ns/prov#wasDerivedFrom hash://sha256/4dbee404b74775cac279e0e7fbc1aa72dddfc70df02b07b9a2f82023dccd4732 .
hash://sha256/4dbee404b74775cac279e0e7fbc1aa72dddfc70df02b07b9a2f82023dccd4732 http://www.w3.org/ns/prov#wasDerivedFrom hash://sha256/1f25e9a78ad0630ead9676807269185761f0d23544a4492a0337c2d306b10686 .
hash://sha256/1f25e9a78ad0630ead9676807269185761f0d23544a4492a0337c2d306b10686 http://www.w3.org/ns/prov#wasDerivedFrom hash://sha256/20b07c79ff0d48c818e2882816948ed192d5c86bdff2118881d7446b15e63bf1 .
hash://sha256/20b07c79ff0d48c818e2882816948ed192d5c86bdff2118881d7446b15e63bf1 http://www.w3.org/ns/prov#wasDerivedFrom hash://sha256/6b7bc4dc5901a459663f47628768b53622eda36bd0fa092390c6d1c0323abf6d .
hash://sha256/6b7bc4dc5901a459663f47628768b53622eda36bd0fa092390c6d1c0323abf6d http://www.w3.org/ns/prov#wasDerivedFrom hash://sha256/d98e3bd2bc717bc11a3338cd43fc488bde1d96cb42d8cbe8301f0d9f9753007f .
hash://sha256/d98e3bd2bc717bc11a3338cd43fc488bde1d96cb42d8cbe8301f0d9f9753007f http://www.w3.org/ns/prov#wasDerivedFrom hash://sha256/1d184ff657913a77d50b9f33b5bd1f483220fd83f26dbf02c020f98c778aafae .
hash://sha256/1d184ff657913a77d50b9f33b5bd1f483220fd83f26dbf02c020f98c778aafae http://www.w3.org/ns/prov#wasDerivedFrom hash://sha256/a35ec845ec71a2951652d70e574e6280c843879efc3b1639e9ccdb4fbfd45e69 .
hash://sha256/a35ec845ec71a2951652d70e574e6280c843879efc3b1639e9ccdb4fbfd45e69 http://www.w3.org/ns/prov#wasDerivedFrom hash://sha256/8ff46ae6a30bf9647df0294b92434a83784626b3f8c37163db3edefb049daead .
hash://sha256/8ff46ae6a30bf9647df0294b92434a83784626b3f8c37163db3edefb049daead http://www.w3.org/ns/prov#wasDerivedFrom hash://sha256/afc2a4caa7f07503ccda9154d34dea1852c8283dee5cb4c5df7ddb3ce238ab7d .
hash://sha256/afc2a4caa7f07503ccda9154d34dea1852c8283dee5cb4c5df7ddb3ce238ab7d http://www.w3.org/ns/prov#wasDerivedFrom hash://sha256/5cf1a9e491f218a94af5439f90beb905ae923f94cdd85c542d85c74c241f9e6e .
hash://sha256/5cf1a9e491f218a94af5439f90beb905ae923f94cdd85c542d85c74c241f9e6e http://www.w3.org/ns/prov#wasDerivedFrom hash://sha256/8aacce08462b87a345d271081783bdd999663ef90099212c8831db399fc0831b .
hash://sha256/8aacce08462b87a345d271081783bdd999663ef90099212c8831db399fc0831b http://www.w3.org/ns/prov#wasDerivedFrom hash://sha256/f13b15a20e4fe70b4a111e67ac20ef676404b8456dfc39694f2cb3a4c62a2b2d .
hash://sha256/f13b15a20e4fe70b4a111e67ac20ef676404b8456dfc39694f2cb3a4c62a2b2d http://www.w3.org/ns/prov#wasDerivedFrom hash://sha256/3b39831bcc286c1db44787e21b736378f5847a16b7c39bdac3dd2011e9189dc1
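The history printed above is a chain: the object of each statement should reappear as the subject of the next one. As a minimal sketch of how that could be checked by machine, the Python below parses lines in the format shown and tests that the chain is unbroken. The function names are hypothetical and are not part of Preston; the sample text is the first two statements of the log above.

```python
PROV_DERIVED = "http://www.w3.org/ns/prov#wasDerivedFrom"


def parse_history(text):
    """Return (newer, older) hash pairs from `preston history`-style output."""
    pairs = []
    for line in text.splitlines():
        parts = line.split()
        # Expected layout: <newer hash> <predicate> <older hash> [.]
        if len(parts) >= 3 and parts[1] == PROV_DERIVED:
            pairs.append((parts[0], parts[2]))
    return pairs


def is_unbroken_chain(pairs):
    """True if every statement's object is the subject of the next statement."""
    return all(
        older == next_newer
        for (_, older), (next_newer, _) in zip(pairs, pairs[1:])
    )


# First two statements of the provenance log shown above.
sample = (
    "hash://sha256/450deb8ed9092ac9b2f0f31d3dcf4e2b9be003c460df63dd6463d252bff37b55 "
    "http://www.w3.org/ns/prov#wasDerivedFrom "
    "hash://sha256/1621a777e4a7442e9864424820c5f825d9cf1c65599cbfbbda039384f1b74ada .\n"
    "hash://sha256/1621a777e4a7442e9864424820c5f825d9cf1c65599cbfbbda039384f1b74ada "
    "http://www.w3.org/ns/prov#wasDerivedFrom "
    "hash://sha256/b1c11d9231768def925b9d076c1c4b711a727326ad99e62982aa4ede288e5aa2 .\n"
)

pairs = parse_history(sample)
print(len(pairs), is_unbroken_chain(pairs))
```

To compare a local clone against the published log, the saved output of `java -jar preston.jar history` could be passed to `parse_history` in place of `sample`.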
License: https://semrush.ebundletools.com/company/legal/terms-of-service/
gbif.org is ranked #7412 in MX with 614.09K Traffic. Categories: Science.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
*Source: Index Herbariorum http://sciweb.nybg.org/science2/IndexHerbariorum.asp Accessed 10/2005. # Source: http://www.gbif.org Accessed 10/2005. Note: K now has c. 140,000 records, GH has c. 220,000 records, and US has c. 766,000 records on GBIF (09/2007); some other institutions have also increased their online records substantially during the past 24 months.
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Context
Invasive alien species have been pointed out as an important driver of biodiversity loss. Many policy responses are being developed to address this threat. Protected areas often represent and preserve hotspots of biological diversity and ensure the maintenance of ecosystem services crucial to human livelihoods. The impact of biological invasions can be particularly severe in protected areas and their occurrence and impact in such areas is an important element of the risk they pose. To address this, there is a need for data on the occurrence and extent of alien species invasions in protected areas.
Description
This dataset contains species occurrence and occupancy in protected areas of the Natura2000 network in Belgium (Special Conservation Areas sensu Habitat Directive and Special Protection Areas sensu Bird Directive). The dataset was generated using the Belgian occurrence cube at species level and the Belgian occurrence cube for non-native taxa (both containing GBIF data aggregated using Oldoni et al. 2020), the 1x1km EEA reference grid and the Natura2000 protected areas shapefiles from the European Environment Agency.
Data are grouped by protected area (SITECODE), year (year) and (infra)species (taxonKey, speciesKey). For each group, it provides the number of occurrences found in GBIF (n), the area of occupancy (aoo: number of 1 km2 squares), the coverage (coverage: % of 1 km2 squares), the minimum coordinateUncertaintyInMeters (min_coord_uncertainty), and the alien status (is_alien) based on the Global Register of Introduced and Invasive Species - Belgium. For infraspecific taxa in the latter, the alien status of the species is looked up and included.
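As a rough illustration of the grouping described above, the sketch below computes n, aoo and coverage per (SITECODE, year, taxonKey) from occurrence rows that already carry a 1 km grid cell code. Several things here are assumptions for illustration only: the input row layout, the `cell` key, the cell codes, the site code and taxon key in the sample, and the reading of coverage as occupied cells divided by the number of 1 km cells belonging to the site. It is not the workflow used to build the dataset.

```python
from collections import defaultdict


def occupancy_by_group(occurrences, cells_per_site):
    """Aggregate occurrence rows per (SITECODE, year, taxonKey).

    occurrences: iterable of dicts with the assumed keys
        'SITECODE', 'year', 'taxonKey', 'cell' (1 km grid cell code), 'n'.
    cells_per_site: dict mapping SITECODE to the number of 1 km cells
        that belong to that protected area (assumed to be known).
    """
    counts = defaultdict(int)  # total occurrences per group
    cells = defaultdict(set)   # distinct occupied cells per group
    for row in occurrences:
        key = (row["SITECODE"], row["year"], row["taxonKey"])
        counts[key] += row["n"]
        cells[key].add(row["cell"])

    result = {}
    for key, n in counts.items():
        aoo = len(cells[key])                            # occupied 1 km squares
        coverage = 100.0 * aoo / cells_per_site[key[0]]  # % of the site's squares
        result[key] = {"n": n, "aoo": aoo, "coverage": coverage}
    return result


# Made-up rows for a made-up site with four 1 km cells.
sample_rows = [
    {"SITECODE": "BE0000001", "year": 2019, "taxonKey": 1, "cell": "1kmE3900N3100", "n": 3},
    {"SITECODE": "BE0000001", "year": 2019, "taxonKey": 1, "cell": "1kmE3901N3100", "n": 2},
    {"SITECODE": "BE0000001", "year": 2020, "taxonKey": 1, "cell": "1kmE3900N3100", "n": 1},
]
summary = occupancy_by_group(sample_rows, {"BE0000001": 4})
for key, stats in sorted(summary.items()):
    print(key, stats)
```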
The dataset is built on open science principles and intended to be completely reproducible:
Files
n), area of occupancy (aoo) and coverage of taxa (taxonKey) in Natura2000 areas of Belgium (SITECODE). Other columns included: speciesKey (for species, speciesKey = taxonKey), SITETYPE containing the site type of the Natura2000 area (one of A, B or C), min_coord_uncertainty with the lowest coordinate uncertainty in meters, is_alien containing the alien status (TRUE or FALSE) and remarks containing, if present, the infraspecific alien taxa whose occurrences contribute to the calculated aoo (only for species).
protected_areas_species_occurrence.csv as retrieved from GBIF Backbone Taxonomy. Columns: taxonKey, speciesKey, scientificName, kingdom, phylum, order, class, genus, family, species, rank and includes. The latter contains the infraspecific taxa and synonyms whose occurrences contribute to the number of occurrences at species level.
protected_areas_species_occurrence.csv. Columns: SITECODE as in protected_areas_species_occurrence.csv (BE*******), SITENAME containing the name of the protected area, SITETYPE as in protected_areas_species_occurrence.csv, flanders, wallonia and brussels containing whether the area is situated respectively in Flanders, Wallonia or Brussels-Capital Region (TRUE or FALSE). Field codes are in line with EEA element definitions for Natura 2000 sites.
Potential use of the dataset
Currently, there is no comprehensive reporting system for invasive alien species in Natura 2000 sites. This dataset provides a baseline as to which species occur in which protected area. We envisage this dataset can be an interesting starting point for various types of analyses on alien species in protected areas in Belgium, but that it can also be used in complement to other data on alien species in protected areas to study more general patterns. Some examples of research questions:
This work has been funded under the Belgian Science Policies Brain program (BelSPO BR/165/A1/TrIAS) and the European Union's LIFE program (LIFE19 NAT/BE/000953 - LIFE RIPARIAS).
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
We have made available two databases of the Brazilian flora. The “raw database” contains data on terrestrial plant species after excluding records with invalid or missing taxonomic and georeferenced information, records outside Brazil, or from uncertain sources (i.e., the pre-filter step of the workflow). The results of each test used to flag data quality are appended in separate fields in this database and retrieved as TRUE or FALSE, in which the former indicates correct records and the latter potentially problematic or suspect records. It is worth noting that the “raw” database contains records with names not found in the Flora do Brasil and with taxonomic, spatial, and temporal issues.
The “fitness-for-use” database is a filter of the “raw” database and only contains valid records that passed all data quality tests. Consequently, the result of each cleaning test is not shown. This database includes verified and standardized data on species taxonomy, geolocation, and date of collection. The databases contain data on conservation status, distribution, and establishment retrieved directly from the Brazilian Flora 2020 and accessed through the flora R package (Carvalho, 2017). Importantly, records lacking information on collecting date were not removed because they are fit-for-use for some biodiversity applications even when date information is missing.
We have made available two databases of the Brazilian flora. First, a “raw” database (n = 12,762,595 records) containing the results of data quality tests appended in separate fields. This database includes records of algae and fungi species, records of species with non-accepted names, and records with taxonomic, spatial, and temporal issues. Second, a “fit-for-use” or “cleaned” database, containing 4,070,313 records of 38,207 species from 432 families. This database includes data on land plants occurring in Brazil (angiosperm, gymnosperm, ferns and lycophytes, and bryophyte), except algae and fungi species and records lacking information on collecting data.
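The relationship between the two databases (keep only records whose quality flags are all TRUE, then drop the flag fields) can be sketched in a few lines of Python. The flag field names and the sample records below are hypothetical placeholders, not the names used in the published files, and this is an illustration of the filtering idea rather than the authors' workflow.

```python
def fit_for_use(records, flag_fields):
    """Keep records that passed every quality test and hide the test results.

    records: list of dicts, one per occurrence record.
    flag_fields: names of the TRUE/FALSE quality-flag fields (hypothetical).
    """
    kept = []
    for record in records:
        if all(record[field] is True for field in flag_fields):
            # Mirror the cleaned database: the flag fields are not shown.
            kept.append({k: v for k, v in record.items() if k not in flag_fields})
    return kept


# Made-up records with made-up flag names.
FLAGS = ["valid_name", "valid_coordinates"]
raw = [
    {"id": 1, "species": "Araucaria angustifolia", "valid_name": True, "valid_coordinates": True},
    {"id": 2, "species": "Araucaria angustifolia", "valid_name": True, "valid_coordinates": False},
]
cleaned = fit_for_use(raw, FLAGS)
print(cleaned)
```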
The Global Biodiversity Information Facility (GBIF) was established by governments in 2001 to encourage free and open access to biodiversity data, via the Internet. Through a global network of countries and organizations, GBIF promotes and facilitates the mobilization, access, discovery and use of information about the occurrence of organisms over time and across the planet. GBIF provides three core services and products:
# An information infrastructure: an Internet-based index of a globally distributed network of interoperable databases that contain primary biodiversity data (information on museum specimens, field observations of plants and animals in nature, and results from experiments) so that data holders across the world can access and share them
# Community-developed tools, standards and protocols: the tools data providers need to format and share their data
# Capacity-building: the training, access to international experts and mentoring programs that national and regional institutions need to become part of a decentralized network of biodiversity information facilities.
GBIF and its many partners work to mobilize the data, and to improve search mechanisms, data and metadata standards, web services, and the other components of an Internet-based information infrastructure for biodiversity. GBIF makes available data that are shared by hundreds of data publishers from around the world. These data are shared according to the GBIF Data Use Agreement, which includes the provision that users of any data accessed through or retrieved via the GBIF Portal will always give credit to the original data publishers.
* Explore Species: Find data for a species or other group of organisms. Information on species and other groups of plants, animals, fungi and micro-organisms, including species occurrence records, as well as classifications and scientific and common names.
* Explore Countries: Find data on the species recorded in a particular country, territory or island. Information on the species recorded in each country, including records shared by publishers from throughout the GBIF network.
* Explore Datasets: Find data from a data publisher, dataset or data network. Information on the data publishers, datasets and data networks that share data through GBIF, including summary information on 10028 datasets from 419 data publishers.
License: U.S. Government Works (https://www.usa.gov/government-works)
License information was derived automatically
This dataset contains all documents, the text and the pdf files, as well as the code that was used to carry out the term analysis of agriculturally relevant organisms in GBIF. The Global Biodiversity Information Facility (GBIF) is an international network and research infrastructure funded by the world's governments and aimed at providing anyone, anywhere, open access to data about all types of life on Earth. The National Agricultural Library Thesaurus (NALT) has online vocabulary tools of agricultural terms. My task was to use the agricultural terms from the NALT and analyze the agriculturally relevant organisms in GBIF. Some of the goals were:
* Get descriptive statistics about Agrobiodiversity Data (AgData) in GBIF.
* Create visualizations to view occurrence trends of the GBIF corpus and AgData in GBIF to determine gaps or biases.
* Provide examples of and code for how agricultural researchers can work with GBIF data.
Details about the process and the methodologies used to carry out this analysis
I started by trying to extract names from the Agricultural Thesaurus, but I encountered problems extracting them from the Thesaurus's RDF format. An employee at the Library later provided me with the names in the Thesaurus as a text file. I then extracted the scientific names from that text file and ran them through the GBIF API. Because there were so many names, the API would throw a connection error: it can handle only a limited number of requests in a given interval of time. To handle this, I used exception handling in Python. Every time the API threw an error, the script waited for 5 seconds and then resumed sending requests. Although this took a lot of time, it allowed me to get data such as year of occurrence and coordinate values for the agriculturally relevant organisms from the API.
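The wait-and-resume approach described above might look roughly like the sketch below. It is not the code from Code.zip. It makes several assumptions: it uses the standard-library urllib instead of requests so that it is self-contained, it calls the GBIF species-match endpoint (the endpoints actually queried in the project are not stated here), and the function names and the `max_attempts` cap are hypothetical additions. Only the 5-second wait comes from the description.

```python
import json
import time
import urllib.error
import urllib.parse
import urllib.request

# Assumed endpoint: GBIF name matching. The project may have used others.
GBIF_MATCH_URL = "https://api.gbif.org/v1/species/match"


def match_name(name):
    """Ask GBIF to match one scientific name and return the parsed JSON."""
    url = GBIF_MATCH_URL + "?" + urllib.parse.urlencode({"name": name})
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def fetch_all(names, fetch=match_name, wait_seconds=5, max_attempts=10,
              sleep=time.sleep):
    """Fetch every name, pausing and retrying whenever a request fails."""
    results = {}
    for name in names:
        for attempt in range(1, max_attempts + 1):
            try:
                results[name] = fetch(name)
                break  # success: move on to the next name
            except (urllib.error.URLError, ConnectionError, TimeoutError):
                if attempt == max_attempts:
                    raise  # give up instead of looping forever
                sleep(wait_seconds)  # wait, then resume sending requests
    return results
```

A call such as `fetch_all(["Apis mellifera"])` would return a dictionary keyed by name. The `fetch` and `sleep` parameters exist only so that the retry logic can be exercised without network access.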
Technology
I used Python because it has support for both web scraping and data analysis, both of which were needed for this project. I used Jupyter notebooks, run through Anaconda. Project Jupyter is a non-profit, open-source project that supports interactive data analysis and scientific computing. It allows users to code right in their browser, eliminates the need to install any other Integrated Development Environment, and makes it very convenient to share code. The main packages used in this project are pandas for data manipulation, requests and json to interact with the GBIF API, and NumPy, which adds support for array and matrix operations, among others. Tableau and matplotlib were used to create visualizations after performing the analysis in Python.
Resources in this dataset:
Resource Title: Code. File Name: Code.zip. Resource Description: This zip file contains multiple Jupyter notebooks that contain the code for all the analysis. Resource Software Recommended: Jupyter notebook, url: http://jupyter.org/
Resource Title: Visualizations. File Name: Visualizations.zip. Resource Description: This zip file contains Tableau workbooks for the visualizations. Resource Software Recommended: Tableau, url: https://www.tableau.com/
Resource Title: Corpus. File Name: Corpus.zip. Resource Description: This zip file contains the two datasets of family Apidae and Reduviidae.