4 datasets found
  1. Replication Data for: Leveraging Large Language Models for Fuzzy String...

    • search.dataone.org
    Updated Sep 24, 2024
    Cite
    Wang, Yu (2024). Replication Data for: Leveraging Large Language Models for Fuzzy String Matching in Political Science [Dataset]. http://doi.org/10.7910/DVN/A8MKLO
    Dataset updated
    Sep 24, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Wang, Yu
    Description

    Fuzzy string matching remains a key issue when political scientists combine data from different sources. Existing matching methods invariably rely on string distances, such as Levenshtein distance and cosine similarity. As such, they are inherently incapable of matching strings that refer to the same entity with different names such as "JP Morgan" and "Chase Bank", "DPRK" and "North Korea", "Chuck Fleischmann (R)" and "Charles Fleischmann (R)". In this letter, we propose to use large language models to entirely sidestep this problem in an easy and intuitive manner. Extensive experiments show that our proposed methods can improve the state of the art by as much as 39% in terms of average precision while being substantially easier and more intuitive to use by political scientists. Moreover, our results are robust against various temperatures. We further note that enhanced prompting can lead to additional performance improvements.
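
    The limitation of string distances described above can be illustrated with a minimal sketch. It is not taken from the replication data: the function names `levenshtein` and `normalized_similarity` and the length-based normalization are assumptions chosen for illustration. The example pairs are the ones quoted in the description.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Map edit distance to a 0..1 score (1.0 means identical strings).

    Dividing by the longer length is one common normalization, assumed here.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


# Pairs quoted in the description: same entity, different surface strings.
pairs = [
    ("Chuck Fleischmann (R)", "Charles Fleischmann (R)"),
    ("DPRK", "North Korea"),
    ("JP Morgan", "Chase Bank"),
]
for left, right in pairs:
    print(f"{left!r} vs {right!r}: {normalized_similarity(left, right):.2f}")
```

    Because "DPRK" has 4 characters and "North Korea" has 11, at least 7 edits are needed, so the score cannot exceed roughly 0.36 however the characters line up. A purely distance-based matcher therefore has no way to recognize the pair as the same entity.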

  2. Trees of India Version 1: Standardization to Records in World Flora Online...

    • data.niaid.nih.gov
    • zenodo.org
    Updated May 18, 2024
    Cite
    Kindt, Roeland (2024). Trees of India Version 1: Standardization to Records in World Flora Online and the World Checklist of Vascular Plants, with matches in GlobalTreeSearch and GlobalUsefulNativeTrees [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10245225
    Dataset updated
    May 18, 2024
    Dataset authored and provided by
    Kindt, Roeland
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    India
    Description

    The Trees of India (ToI, Version-I) includes data on 3708 tree species distributed across 35 states/union territories of India. The database is based on a systematic review of 313 literature sources published from 1872 to 2022. This compendium is available via Figshare and was described by Mugal et al. 2023:

    Khuroo, Anzar Ahmad; Mugal, Muzamil Ahmad; Wani, Sajad Ahmad (2023). ToI, Ver.-I : Trees of India, Version-I. figshare. Dataset. https://doi.org/10.6084/m9.figshare.23226281.v1

    Mugal, M.A., Wani, S.A., Dar, F.A. et al. Bridging global knowledge gaps in biodiversity databases: a comprehensive data synthesis on tree diversity of India. Biodivers Conserv 32, 3089–3107 (2023). https://doi.org/10.1007/s10531-023-02659-y

    Here I provide direct and fuzzy matches for taxa listed with accepted plant names in World Flora Online (version 2023.03; Borsch et al. 2020) and the World Checklist of Vascular Plants (WCVP version 10; Govaerts et al. 2021). Matching was done in R through the WorldFlora package (Kindt 2020). The taxonomic standardization process was similar to the one completed during the preparation of the third major release of the Agroforestry Species Switchboard and when preparing the GlobalUsefulNativeTrees database (GlobUNT; https://worldagroforestry.org/output/globalusefulnativetrees).
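
    The idea of trying a direct match first and falling back to a fuzzy match can be sketched as follows. This is not the WorldFlora package or its API (the dataset was matched in R); it is a Python standard-library illustration in which `BACKBONE`, `match_name` and the 0.85 cutoff are hypothetical choices.

```python
import difflib

# Hypothetical stand-in for a taxonomic backbone: a handful of accepted names.
BACKBONE = [
    "Azadirachta indica",
    "Tectona grandis",
    "Shorea robusta",
    "Dalbergia sissoo",
]


def match_name(name, backbone, cutoff=0.85):
    """Return (matched_name, match_type), trying an exact match before a fuzzy one."""
    cleaned = " ".join(name.split())  # collapse stray whitespace
    if cleaned in backbone:
        return cleaned, "direct"
    # difflib ranks candidates by SequenceMatcher ratio; keep the best one only.
    candidates = difflib.get_close_matches(cleaned, backbone, n=1, cutoff=cutoff)
    if candidates:
        return candidates[0], "fuzzy"
    return None, "unmatched"


for query in ["Tectona grandis", "Tectona  grandiss", "Quercus robur"]:
    print(query, "->", match_name(query, BACKBONE))
```

    Recording the match type next to each result keeps fuzzy matches easy to review by hand, which matters because a high similarity score does not guarantee that two names refer to the same taxon.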

    After matching species with the WCVP, information was compiled on the native distribution documented in the WCVP for level-3 units of the World Geographical Scheme for Recording Plant Distributions that correspond to India, including India (IND), Assam (ASS), West Himalaya (WHM), East Himalaya (EHM), Laccadive Is. (LDV), Andaman Is. (AND) and Nicobar Is. (NCB). Also included after matching with the WCVP is information on the geographic area, lifeform and main biome. Similar information is available when searching for species from Plants of the World Online.
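
    A compilation step of this kind can be sketched as a filter over distribution records. The level-3 codes below are the ones listed in the description; the record layout, the function name `native_units_in_india`, the placeholder species and the code "XXX" are hypothetical and do not reflect the actual WCVP file format.

```python
# Level-3 unit codes named in the description, with their labels.
INDIA_LEVEL3 = {
    "IND": "India",
    "ASS": "Assam",
    "WHM": "West Himalaya",
    "EHM": "East Himalaya",
    "LDV": "Laccadive Is.",
    "AND": "Andaman Is.",
    "NCB": "Nicobar Is.",
}


def native_units_in_india(records):
    """Collect, per species, the India-related level-3 codes where it is native.

    Each record is assumed to be a (species, level3_code, is_native) tuple.
    """
    result = {}
    for species, code, is_native in records:
        if is_native and code in INDIA_LEVEL3:
            result.setdefault(species, set()).add(code)
    return result


# Made-up rows for illustration only; "XXX" stands for any unit outside India.
records = [
    ("Species A", "IND", True),
    ("Species A", "ASS", True),
    ("Species A", "XXX", True),
    ("Species B", "WHM", False),  # present but not native, so excluded
]
print(native_units_in_india(records))
```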

    Where a matching species was found in GlobalTreeSearch (Beech et al. 2017; https://tools.bgci.org/global_tree_search.php; accessed on 28th June 2023) filtered for India, the species name in GlobalTreeSearch is shown. Note that GlobalTreeSearch documents the native country distribution of tree species.

    Where a matching species was found in the GlobalUsefulNativeTrees database (GlobUNT, version 2023.11) filtered for India, the species name in the GlobUNT database is shown. GlobUNT has been described in the following publication: Kindt et al. (2023) GlobalUsefulNativeTrees, a database of 14,014 tree species, supports synergies between biodiversity recovery and local livelihoods in restoration. Sci Rep 13, 12640. https://doi.org/10.1038/s41598-023-39552-1.

    See the metadata for information on versions.

    Borsch, T., Berendsohn, W., Dalcin, E., Delmas, M., Demissew, S., Elliott, A., Fritsch, P., Fuchs, A., Geltman, D., Güner, A., Haevermans, T., Knapp, S., le Roux, M.M., Loizeau, P.-A., Miller, C., Miller, J., Miller, J.T., Palese, R., Paton, A., Parnell, J., Pendry, C., Qin, H.-N., Sosa, V., Sosef, M., von Raab-Straube, E., Ranwashe, F., Raz, L., Salimov, R., Smets, E., Thiers, B., Thomas, W., Tulig, M., Ulate, W., Ung, V., Watson, M., Jackson, P.W. and Zamora, N. (2020), World Flora Online: Placing taxonomists at the heart of a definitive and comprehensive global resource on the world's plants. TAXON, 69: 1311-1341. https://doi.org/10.1002/tax.12373

    Govaerts, R., Nic Lughadha, E., Black, N. et al. The World Checklist of Vascular Plants, a continuously updated resource for exploring global plant diversity. Sci Data 8, 215 (2021). https://doi.org/10.1038/s41597-021-00997-6

    E. Beech, M. Rivers, S. Oldfield & P. P. Smith (2017) GlobalTreeSearch: The first complete global database of tree species and country distributions, Journal of Sustainable Forestry, 36:5, 454-489, DOI: 10.1080/10549811.2017.1310049

    Kindt, R. 2020. WorldFlora: An R package for exact and fuzzy matching of plant names against the World Flora Online taxonomic backbone data. Applications in Plant Sciences 8(9): e11388. https://doi.org/10.1002/aps3.11388

    The development of this dataset and GlobUNT was supported by the Darwin Initiative to project DAREX001 of Developing a Global Biodiversity Standard certification for tree-planting and restoration.

  3. African wood density database with matches to the taxonomic backbone data...

    • zenodo.org
    Updated Jun 10, 2024
    Cite
    Sammy Carsan; Roeland Kindt (2024). African wood density database with matches to the taxonomic backbone data sets of World Flora Online (version 2023.12) and the World Checklist of Vascular Plants (version 11) [Dataset]. http://doi.org/10.5281/zenodo.11543911
    Dataset updated
    Jun 10, 2024
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Sammy Carsan; Roeland Kindt
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Africa
    Description

    The African Wood Density Database provides air-dry wood density data for over 750 tree species grown in Africa.

    This archive provides taxonomic matches with recent versions of World Flora Online (WFO; version 2023.12 downloaded from Zenodo; Borsch et al. 2020) and the World Checklist of Vascular Plants (WCVP; version 11 downloaded from the Kew data depository; Govaerts et al. 2021). Matching was done via the WorldFlora package (version 1.14-3; Kindt 2020), using scripts similar to those documented in this Rpub: https://rpubs.com/Roeland-KINDT/1134151.

    • Carsan, S., Orwa, C., Harwood, C., Kindt, R., Stroebel, A., Neufeldt, H. and Jamnadass, R. 2012. African Wood Density Database. World Agroforestry Centre, Nairobi. https://apps.worldagroforestry.org/treesnmarkets/wood/#
    • Borsch, T., Berendsohn, W., Dalcin, E., Delmas, M., Demissew, S., Elliott, A., Fritsch, P., Fuchs, A., Geltman, D., Güner, A., Haevermans, T., Knapp, S., le Roux, M.M., Loizeau, P.-A., Miller, C., Miller, J., Miller, J.T., Palese, R., Paton, A., Parnell, J., Pendry, C., Qin, H.-N., Sosa, V., Sosef, M., von Raab-Straube, E., Ranwashe, F., Raz, L., Salimov, R., Smets, E., Thiers, B., Thomas, W., Tulig, M., Ulate, W., Ung, V., Watson, M., Jackson, P.W. and Zamora, N. (2020), World Flora Online: Placing taxonomists at the heart of a definitive and comprehensive global resource on the world's plants. TAXON, 69: 1311-1341. https://doi.org/10.1002/tax.12373
    • Govaerts, R., Nic Lughadha, E., Black, N. et al. The World Checklist of Vascular Plants, a continuously updated resource for exploring global plant diversity. Sci Data 8, 215 (2021). https://doi.org/10.1038/s41597-021-00997-6
    • Kindt, R. 2020. WorldFlora: An R package for exact and fuzzy matching of plant names against the World Flora Online taxonomic backbone data. Applications in Plant Sciences 8(9): e11388. https://doi.org/10.1002/aps3.11388

    Original funding for the database was provided by the Carbon Benefits Project (CBP) supported by The Global Environment Facility (GEF). Development of the 2024 version was supported by the Darwin Initiative to project DAREX001 of Developing a Global Biodiversity Standard certification for tree-planting and restoration, by Norway’s International Climate and Forest Initiative through the Royal Norwegian Embassy in Ethiopia to the Provision of Adequate Tree Seed Portfolio project in Ethiopia, by the Green Climate Fund through the IUCN-led Transforming the Eastern Province of Rwanda through Adaptation project and through the Readiness proposal on Climate Appropriate Portfolios of Tree Diversity for Burkina Faso, by the Bezos Earth Fund to the Bezos Quality Tree Seed for Africa in Kenya and Rwanda project and by the German International Climate Initiative (IKI) to the regional tree seed programme on The Right Tree for the Right Place for the Right Purpose in Africa. When using the African Wood Density Database in your work, cite the 2012 version (Carsan et al. 2012) as well as this repository using the DOI.

  4. The global spectrum of plant form and function dataset: taxonomic...

    • zenodo.org
    Updated May 31, 2025
    Cite
    Roeland Kindt (2025). The global spectrum of plant form and function dataset: taxonomic standardization of 45,955 taxa to World Flora Online version 2023.12 [Dataset]. http://doi.org/10.5281/zenodo.15563432
    Dataset updated
    May 31, 2025
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Roeland Kindt
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The global spectrum of plant form and function dataset (Díaz et al. 2022; Díaz et al. 2016; TRY 2022, accessed 15-May-2025) provides mean trait values for (i) plant height; (ii) stem specific density; (iii) leaf area; (iv) leaf mass per area; (v) leaf nitrogen content per dry mass; and (vi) diaspore (seed or spore) mass for 46,047 taxa.

    Here I provide a dataset where the taxa covered by that database were standardized to World Flora Online (Borsch et al. 2020; taxonomic backbone version 2023.12) by matching names with those in the Agroforestry Species Switchboard (Kindt et al. 2025; version 4). Taxa for which no matches could be found were standardized with the WorldFlora package (Kindt 2020), using similar R scripts and the same taxonomic backbone data as those used to standardize species names for the Switchboard. Where still no matches could be found, taxa were matched with those matched previously with a harmonized data set for TRY 6.0 (Kindt 2024).
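
    The three-stage fallback described above can be sketched as a simple cascade. This is an illustration of the ordering only, not the author's R scripts: the function `standardize`, the dictionary-based lookups, the callable `worldflora_match` and all taxon names are hypothetical placeholders.

```python
def standardize(taxon, switchboard, worldflora_match, try6_matches):
    """Resolve a taxon name by trying three sources in the order described.

    switchboard and try6_matches are assumed to be dicts mapping an input
    name to a standardized name; worldflora_match is assumed to be a callable
    returning a standardized name or None.
    """
    if taxon in switchboard:
        return switchboard[taxon], "switchboard"
    matched = worldflora_match(taxon)
    if matched is not None:
        return matched, "worldflora"
    if taxon in try6_matches:
        return try6_matches[taxon], "try6"
    return None, "unmatched"


# Placeholder lookups for illustration only.
switchboard = {"Taxon one": "Accepted one"}
try6_matches = {"Taxon three": "Accepted three"}


def worldflora_match(taxon):
    """Stand-in for a name-matching call; only knows a single name."""
    return {"Taxon two": "Accepted two"}.get(taxon)


for name in ["Taxon one", "Taxon two", "Taxon three", "Taxon four"]:
    print(name, "->", standardize(name, switchboard, worldflora_match, try6_matches))
```

    Returning the source alongside the standardized name makes it possible to report how many taxa were resolved at each stage.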

    References

    • Díaz, S., Kattge, J., Cornelissen, J.H.C. et al. The global spectrum of plant form and function: enhanced species-level trait dataset. Sci Data 9, 755 (2022). https://doi.org/10.1038/s41597-022-01774-9
    • Díaz, S., Kattge, J., Cornelissen, J. et al. The global spectrum of plant form and function. Nature 529, 167–171 (2016). https://doi.org/10.1038
    • TRY. 2022. The global spectrum of plant form and function dataset. https://www.try-db.org/TryWeb/Data.php#81
    • Borsch, T., Berendsohn, W., Dalcin, E., Delmas, M., Demissew, S., Elliott, A., Fritsch, P., Fuchs, A., Geltman, D., Güner, A., Haevermans, T., Knapp, S., le Roux, M.M., Loizeau, P.-A., Miller, C., Miller, J., Miller, J.T., Palese, R., Paton, A., Parnell, J., Pendry, C., Qin, H.-N., Sosa, V., Sosef, M., von Raab-Straube, E., Ranwashe, F., Raz, L., Salimov, R., Smets, E., Thiers, B., Thomas, W., Tulig, M., Ulate, W., Ung, V., Watson, M., Jackson, P.W. and Zamora, N. (2020), World Flora Online: Placing taxonomists at the heart of a definitive and comprehensive global resource on the world's plants. TAXON, 69: 1311-1341. https://doi.org/10.1002/tax.12373
    • Roeland Kindt, Ilyas Siddique, Ian Dawson, Innocent John, Fabio Pedercini, Jens-Peter B. Lillesø, Lars Graudal. 2025. The Agroforestry Species Switchboard, a global resource to explore information for 107,269 plant species. bioRxiv 2025.03.09.642182; doi: https://doi.org/10.1101/2025.03.09.642182
    • Kindt, R. 2020. WorldFlora: An R package for exact and fuzzy matching of plant names against the World Flora Online taxonomic backbone data. Applications in Plant Sciences 8(9): e11388. https://doi.org/10.1002/aps3.11388
    • Kindt, R. (2024). TRY 6.0 - Species List from Taxonomic Harmonization – Matches with World Flora Online version 2023.12 (2024.10b) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.13906338

    Funding

    The development of this dataset was supported by the German International Climate Initiative (IKI) to the regional tree seed programme on The Right Tree for the Right Place for the Right Purpose in Africa, by Norway’s International Climate and Forest Initiative through the Royal Norwegian Embassy in Ethiopia to the Provision of Adequate Tree Seed Portfolio project in Ethiopia, and by the Bezos Earth Fund to the Quality Tree Seed for Africa in Kenya and Rwanda project.

