Fuzzy string matching remains a key issue when political scientists combine data from different sources. Existing matching methods invariably rely on string distances, such as Levenshtein distance and cosine similarity. As such, they are inherently incapable of matching strings that refer to the same entity with different names such as "JP Morgan" and "Chase Bank", "DPRK" and "North Korea", "Chuck Fleischmann (R)" and "Charles Fleischmann (R)". In this letter, we propose to use large language models to entirely sidestep this problem in an easy and intuitive manner. Extensive experiments show that our proposed methods can improve the state of the art by as much as 39% in terms of average precision while being substantially easier and more intuitive to use by political scientists. Moreover, our results are robust against various temperatures. We further note that enhanced prompting can lead to additional performance improvements.
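To illustrate the limitation described above, the following minimal sketch computes a normalized Levenshtein similarity for a few name pairs. It is not the method proposed in the letter; the `similarity` helper, the lowercasing step and the typo pair are assumptions made for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a.lower(), b.lower()) / longest


pairs = [
    ("North Korea", "North Korae"),  # hypothetical typo, same entity
    ("DPRK", "North Korea"),         # same entity, different name
    ("JP Morgan", "Chase Bank"),     # same entity, different name
]
for left, right in pairs:
    print(f"{left!r} vs {right!r}: {similarity(left, right):.2f}")
```

In this sketch the typo pair scores high while the two alias pairs score low, even though each pair refers to a single entity, so no similarity threshold can accept the aliases without also accepting unrelated strings.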
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Trees of India (ToI, Version-I) includes data on 3708 tree species distributed across 35 states/union territories of India. The database is based on a systematic review of 313 literature sources published from 1872 to 2022. This compendium is available via Figshare and was described by Mugal et al. 2023:
Khuroo, Anzar Ahmad; Mugal, Muzamil Ahmad; Wani, Sajad Ahmad (2023). ToI, Ver.-I : Trees of India, Version-I. figshare. Dataset. https://doi.org/10.6084/m9.figshare.23226281.v1
Mugal, M.A., Wani, S.A., Dar, F.A. et al. Bridging global knowledge gaps in biodiversity databases: a comprehensive data synthesis on tree diversity of India. Biodivers Conserv 32, 3089–3107 (2023). https://doi.org/10.1007/s10531-023-02659-y
Here I provide direct and fuzzy matches for taxa listed with accepted plant names in World Flora Online (version 2023.03; Borsch et al. 2020) and the World Checklist of Vascular Plants (WCVP version 10; Govaerts et al. 2021). Matching was done in R through the WorldFlora package (Kindt 2020). The taxonomic standardization process was similar to the one completed during the preparation of the third major release of the Agroforestry Species Switchboard and when preparing the GlobalUsefulNativeTrees database (GlobUNT; https://worldagroforestry.org/output/globalusefulnativetrees).
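The two-stage idea of direct and fuzzy matching can be sketched as follows. This is an illustrative Python sketch using the standard-library `difflib` module, not the WorldFlora implementation (which is an R package); the `BACKBONE` list, the `match_name` helper and the `cutoff` value are assumptions.

```python
from difflib import get_close_matches

# Hypothetical mini-backbone of accepted names (illustrative only).
BACKBONE = ["Tectona grandis", "Shorea robusta", "Dalbergia sissoo"]


def match_name(name, backbone, cutoff=0.85):
    """Return (matched_name, match_type); match_type is 'direct', 'fuzzy' or 'none'."""
    cleaned = " ".join(name.split())  # collapse stray whitespace
    if cleaned in backbone:
        return cleaned, "direct"
    # Fall back to the single closest backbone name above the cutoff.
    candidates = get_close_matches(cleaned, backbone, n=1, cutoff=cutoff)
    if candidates:
        return candidates[0], "fuzzy"
    return None, "none"


for query in ["Tectona grandis", "Tectona grandiss", "Quercus alba"]:
    print(query, "->", match_name(query, BACKBONE))
```

Recording the match type alongside the matched name keeps fuzzy matches easy to review by hand, since they are more likely to be wrong than direct matches.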
After matching species with the WCVP, information was compiled on the native distribution documented in the WCVP for level-3 units of the World Geographical Scheme for Recording Plant Distributions that correspond to India, including India (IND), Assam (ASS), West Himalaya (WHM), East Himalaya (EHM), Laccadive Is. (LDV), Andaman Is. (AND) and Nicobar Is. (NCB). Also included after matching with the WCVP is information on the geographic area, lifeform and main biome. Similar information is available when searching for species from Plants of the World Online.
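As a rough sketch of the compilation step described above, native distribution records can be filtered to the level-3 units listed for India. The record layout (tuples of area code and native flag) and the `ZZZ` placeholder code are assumptions and do not reflect the actual WCVP file format.

```python
# Level-3 codes listed in the text for units that correspond to India.
INDIA_LEVEL3 = {"IND", "ASS", "WHM", "EHM", "LDV", "AND", "NCB"}


def native_units_in_india(distribution):
    """Return the sorted India level-3 codes where a species is recorded as native.

    `distribution` is a list of (area_code, is_native) tuples; this layout is
    a hypothetical simplification, not the actual WCVP file format.
    """
    return sorted(code for code, is_native in distribution
                  if is_native and code in INDIA_LEVEL3)


# Hypothetical records for one species; "ZZZ" stands for any unit outside India.
example = [("IND", True), ("ASS", True), ("ZZZ", True), ("WHM", False)]
print(native_units_in_india(example))
```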
Where a matching species was found in GlobalTreeSearch (Beech et al. 2017; https://tools.bgci.org/global_tree_search.php; accessed on 28th June 2023) filtered for India, the species name in GlobalTreeSearch is shown. Note that GlobalTreeSearch documents the native country distribution of tree species.
Where a matching species was found in the GlobalUsefulNativeTrees database (GlobUNT, version 2023.11) filtered for India, the species name in the GlobUNT database is shown. GlobUNT has been described in the following publication: Kindt et al. (2023) GlobalUsefulNativeTrees, a database of 14,014 tree species, supports synergies between biodiversity recovery and local livelihoods in restoration. Sci Rep 13, 12640. https://doi.org/10.1038/s41598-023-39552-1.
See the metadata for information on versions.
Borsch, T., Berendsohn, W., Dalcin, E., Delmas, M., Demissew, S., Elliott, A., Fritsch, P., Fuchs, A., Geltman, D., Güner, A., Haevermans, T., Knapp, S., le Roux, M.M., Loizeau, P.-A., Miller, C., Miller, J., Miller, J.T., Palese, R., Paton, A., Parnell, J., Pendry, C., Qin, H.-N., Sosa, V., Sosef, M., von Raab-Straube, E., Ranwashe, F., Raz, L., Salimov, R., Smets, E., Thiers, B., Thomas, W., Tulig, M., Ulate, W., Ung, V., Watson, M., Jackson, P.W. and Zamora, N. (2020), World Flora Online: Placing taxonomists at the heart of a definitive and comprehensive global resource on the world's plants. TAXON, 69: 1311-1341. https://doi.org/10.1002/tax.12373
Govaerts, R., Nic Lughadha, E., Black, N. et al. The World Checklist of Vascular Plants, a continuously updated resource for exploring global plant diversity. Sci Data 8, 215 (2021). https://doi.org/10.1038/s41597-021-00997-6
Beech, E., Rivers, M., Oldfield, S. & Smith, P.P. (2017). GlobalTreeSearch: The first complete global database of tree species and country distributions. Journal of Sustainable Forestry, 36:5, 454-489. https://doi.org/10.1080/10549811.2017.1310049
Kindt, R. 2020. WorldFlora: An R package for exact and fuzzy matching of plant names against the World Flora Online taxonomic backbone data. Applications in Plant Sciences 8(9): e11388. https://doi.org/10.1002/aps3.11388
The development of this dataset and of GlobUNT was supported by the Darwin Initiative through project DAREX001, Developing a Global Biodiversity Standard certification for tree-planting and restoration.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
The African Wood Density Database provides air-dry wood density data for over 750 tree species grown in Africa.
This archive provides taxonomic matches with recent versions of World Flora Online (WFO; version 2023.12 downloaded from Zenodo; Borsch et al. 2020) and the World Checklist of Vascular Plants (WCVP; version 11 downloaded from the Kew data depository; Govaerts et al. 2021). Matching was done via the WorldFlora package (version 1.14-3; Kindt 2020), using scripts similar to those documented in this Rpub: https://rpubs.com/Roeland-KINDT/1134151.
Original funding for the database was provided by the Carbon Benefits Project (CBP) supported by The Global Environment Facility (GEF). Development of the 2024 version was supported by the Darwin Initiative to project DAREX001 of Developing a Global Biodiversity Standard certification for tree-planting and restoration, by Norway’s International Climate and Forest Initiative through the Royal Norwegian Embassy in Ethiopia to the Provision of Adequate Tree Seed Portfolio project in Ethiopia, by the Green Climate Fund through the IUCN-led Transforming the Eastern Province of Rwanda through Adaptation project and through the Readiness proposal on Climate Appropriate Portfolios of Tree Diversity for Burkina Faso, by the Bezos Earth Fund to the Bezos Quality Tree Seed for Africa in Kenya and Rwanda project and by the German International Climate Initiative (IKI) to the regional tree seed programme on The Right Tree for the Right Place for the Right Purpose in Africa. When using the African Wood Density Database in your work, cite the 2012 version (Carsan et al. 2012) as well as this repository using the DOI.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
The global spectrum of plant form and function dataset (Diaz et al. 2022; Diaz et al. 2016; TRY 2022, accessed 15-May-2025) provides mean trait values for (i) plant height; (ii) stem specific density; (iii) leaf area; (iv) leaf mass per area; (v) leaf nitrogen content per dry mass; and (vi) diaspore (seed or spore) mass for 46,047 taxa.
Here I provide a dataset where the taxa covered by that database were standardized to World Flora Online (Borsch et al. 2020; taxonomic backbone version 2023.12) by matching names with those in the Agroforestry Species Switchboard (Kindt et al. 2025; version 4). Taxa for which no matches could be found were standardized with the WorldFlora package (Kindt 2020), using similar R scripts and the same taxonomic backbone data as those used to standardize species names for the Switchboard. Where still no matches could be found, taxa were matched with those matched previously with a harmonized data set for TRY 6.0 (Kindt 2024).
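The cascade described above (Switchboard first, then the WorldFlora package, then the earlier TRY 6.0 harmonization) can be sketched as an ordered fallback over lookup tables. This is only an illustration of the ordering, not the scripts actually used; the dictionaries, the placeholder taxon names and the `standardize` helper are all hypothetical.

```python
# Hypothetical lookup tables; names and contents are placeholders only.
switchboard = {"Genus alpha": "Genus alpha"}
worldflora = {"Genus betta": "Genus beta"}
try6_harmonized = {"Genus gama": "Genus gamma"}

# Order matters: earlier sources take precedence over later ones.
SOURCES = [
    ("Switchboard", switchboard),
    ("WorldFlora", worldflora),
    ("TRY 6.0 harmonized", try6_harmonized),
]


def standardize(name, sources):
    """Try each (label, lookup) pair in order and return the first hit."""
    for label, lookup in sources:
        matched = lookup.get(name)
        if matched is not None:
            return matched, label
    return None, "unmatched"


for taxon in ["Genus alpha", "Genus betta", "Genus gama", "Genus delta"]:
    print(taxon, "->", standardize(taxon, SOURCES))
```

Returning the source label with each match records which step of the cascade resolved a taxon, which helps when auditing the standardized names.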
References
Funding
The development of this dataset was supported by the German International Climate Initiative (IKI) to the regional tree seed programme on The Right Tree for the Right Place for the Right Purpose in Africa, by Norway’s International Climate and Forest Initiative through the Royal Norwegian Embassy in Ethiopia to the Provision of Adequate Tree Seed Portfolio project in Ethiopia, and by the Bezos Earth Fund to the Quality Tree Seed for Africa in Kenya and Rwanda project.