Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
First names and last names by country according to affiliations in journal articles 2001-2021 as recorded in Scopus. For 200 countries, there is a complete list of all first names and all last names of at least one researcher with a national affiliation in that country. Each file also records: the number of researchers with that name in the country, the proportion of researchers with that name in the country compared to the world, and the number of researchers with that name in the world.
For example, for the USA:
Name      Authors in USA   Proportion in USA   Total
Sadrach   3                1.000               3
Rangsan   1                0.083               12
Parry     6                0.273               22
Howard    2008             0.733               2739
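The proportion column in the example is consistent with the country count divided by the world total. A minimal Python sketch of that arithmetic, using only the four example rows; the function and variable names are illustrative and not part of the dataset:

```python
def proportion_in_country(authors_in_country: int, total_authors: int) -> float:
    """Share of all researchers with a given name who are affiliated with the country."""
    if total_authors <= 0:
        raise ValueError("total_authors must be positive")
    return round(authors_in_country / total_authors, 3)


# (name, authors in USA, world total) copied from the example rows
rows = [
    ("Sadrach", 3, 3),
    ("Rangsan", 1, 12),
    ("Parry", 6, 22),
    ("Howard", 2008, 2739),
]

proportions = {name: proportion_in_country(usa, total) for name, usa, total in rows}
```

A proportion of 1.000, as for Sadrach, means every recorded researcher with that name has an affiliation in the country.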
Only the first parts of double last names are included. For example, Rodriquez Gonzalez, Maria would have only Rodriquez recorded.
This is from the paper: "Can national researcher mobility be tracked by first or last name uniqueness"
List of countries Afghanistan; Albania; Algeria; Angola; Argentina; Armenia; Australia; Austria; Azerbaijan; Bahamas; Bahrain; Bangladesh; Barbados; Belarus; Belgium; Belize; Benin; Bermuda; Bhutan; Bolivia; Bosnia and Herzegovina; Botswana; Brazil; Brunei Darussalam; Bulgaria; Burkina Faso; Burundi; Cambodia; Cameroon; Canada; Cape Verde; Cayman Islands; Central African Republic; Chad; Chile; China; Colombia; Congo; Costa Rica; Cote d'Ivoire; Croatia; Cuba; Cyprus; Czech Republic; Democratic Republic Congo; Denmark; Djibouti; Dominican Republic; Ecuador; Egypt; El Salvador; Eritrea; Estonia; Ethiopia; Falkland Islands (Malvinas); Faroe Islands; Federated States of Micronesia; Fiji; Finland; France; French Guiana; French Polynesia; Gabon; Gambia; Georgia; Germany; Ghana; Greece; Greenland; Grenada; Guadeloupe; Guam; Guatemala; Guinea; Guinea-Bissau; Guyana; Haiti; Honduras; Hong Kong; Hungary; Iceland; India; Indonesia; Iran; Iraq; Ireland; Israel; Italy; Jamaica; Japan; Jordan; Kazakhstan; Kenya; Kuwait; Kyrgyzstan; Laos; Latvia; Lebanon; Lesotho; Liberia; Libyan Arab Jamahiriya; Liechtenstein; Lithuania; Luxembourg; Macao; Macedonia; Madagascar; Malawi; Malaysia; Maldives; Mali; Malta; Martinique; Mauritania; Mauritius; Mexico; Moldova; Monaco; Mongolia; Montenegro; Morocco; Mozambique; Myanmar; Namibia; Nepal; Netherlands; New Caledonia; New Zealand; Nicaragua; Niger; Nigeria; North Korea; North Macedonia; Norway; Oman; Pakistan; Palau; Palestine; Panama; Papua New Guinea; Paraguay; Peru; Philippines; Poland; Portugal; Puerto Rico; Qatar; Reunion; Romania; Russia; Russian Federation; Rwanda; Saint Kitts and Nevis; Samoa; San Marino; Saudi Arabia; Senegal; Serbia; Seychelles; Sierra Leone; Singapore; Slovakia; Slovenia; Solomon Islands; Somalia; South Africa; South Korea; South Sudan; Spain; Sri Lanka; Sudan; Suriname; Swaziland; Sweden; Switzerland; Syrian Arab Republic; Taiwan; Tajikistan; Tanzania; Thailand; Timor-Leste; Togo; Trinidad and Tobago; Tunisia; 
Turkey; Uganda; Ukraine; United Arab Emirates; United Kingdom; United States; Uruguay; Uzbekistan; Vanuatu; Venezuela; Viet Nam; Virgin Islands (U.S.); Yemen; Yugoslavia; Zambia; Zimbabwe
The first names file contains data on the first names given to children born in France since 1900. These data are available at the level of France and by department. The files available for download list births, not the people living in a given year. They are available in two formats (DBASE and CSV). To use these large files, a database manager or statistical software is recommended. The national file can be opened in some spreadsheet programs. The departmental file is, however, too large (3.8 million lines) to be opened in a spreadsheet, so a lighter version covering only births since 2000 is also provided. The data can be accessed in:
- a national data file containing the first names attributed to children born in France between 1900 and 2022 (data before 2012 relate only to France outside Mayotte) and the numbers by sex associated with each first name;
- a departmental data file containing the same information at the department of birth level;
- a lighter data file that contains information at the department of birth level since the year 2000.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
NOTE: No specific individual information is given.
The Census Bureau receives numerous requests to supply information on name frequency. In an effort to comply with those requests, the Census Bureau has embarked on a names list project involving a tabulation of names from the 1990 Census. These files contain only the frequency of a given name, no specific individual information.
[ed. note: all links point to the original URL; all files are available in this repository]
Name List: Documentation and Methodology <1.0MB
Frequently Occurring Surnames from Census 1990 – Names Files
[ed. note: this content was originally on a separate webpage, at https://www.census.gov/topics/population/genealogy/data/1990_census/1990_census_namefiles.html]
Files
- dist.all.last [<1.0MB]
- dist.female.first [<1.0MB]
- dist.male.first [<1.0MB]
Each of the three files (dist.all.last, dist.male.first, and dist.female.first) contains four items of data. The four items are:
- A "Name"
- Frequency in percent
- Cumulative Frequency in percent
- Rank
In the file (dist.all.last) one entry appears as:
MOORE 0.312 5.312 9
In our search area sample, MOORE ranks 9th in terms of frequency. 5.312 percent of the sample population is covered by MOORE and the 8 names occurring more frequently than MOORE. The surname, MOORE, is possessed by 0.312 percent of our population sample.
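The four fields of an entry can be read with a few lines of Python. This is a sketch that assumes the fields are separated by whitespace, as in the MOORE example; the class and function names are illustrative only:

```python
from typing import NamedTuple


class NameEntry(NamedTuple):
    name: str
    freq_pct: float       # frequency in percent
    cum_freq_pct: float   # cumulative frequency in percent
    rank: int


def parse_entry(line: str) -> NameEntry:
    """Split one whitespace-separated record into its four items."""
    name, freq, cum, rank = line.split()
    return NameEntry(name, float(freq), float(cum), int(rank))


entry = parse_entry("MOORE 0.312 5.312 9")

# Percentage of the sample covered by the names ranked above this one
covered_by_higher_ranked = round(entry.cum_freq_pct - entry.freq_pct, 3)
```

Subtracting a name's own frequency from its cumulative frequency gives the share covered by the more frequent names, here the 8 names ranked above MOORE.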
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Count of popularity of adult first names (forenames, given names) in Peru, from an approximately 7% sample of the adult population.
In Peru, many people are registered as supporters of political parties, and their names are published by the Registro de Organizaciones Políticas. The lists include a DNI (national identity number) for each person to avoid duplicates. The 1,572,002 people on these lists (excluding the regional movements) represent around 7% of the adult population of Peru.
The first and middle names have been sorted and counted (there are an average of 1.6 first names for each person).
These 2,538,011 first (and middle) names represent 76,720 different names, most of which are infrequent. The file has been limited to names that occur ten or more times in the sample, which is 7,250 unique names (2,417,750 names, more than 95% of the total).
Each row in the file contains the rank, a percentage of that name in the entire set of 2,538,011 names, a count of the times the name occurs in the sample, and the name.
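The figures quoted in this description can be cross-checked with simple arithmetic. The sketch below uses only the totals stated above; the constant and function names are illustrative, and the percentage formula is an assumption about how the file's percentage column was derived:

```python
TOTAL_PEOPLE = 1_572_002   # people on the party lists
TOTAL_NAMES = 2_538_011    # first and middle names counted
KEPT_NAMES = 2_417_750     # names belonging to the 7,250 retained unique names

# Average number of first (and middle) names per person, about 1.6
names_per_person = TOTAL_NAMES / TOTAL_PEOPLE

# Share of all names still covered after dropping names seen fewer than ten times
kept_share_pct = 100 * KEPT_NAMES / TOTAL_NAMES


def name_percentage(count: int, total: int = TOTAL_NAMES) -> float:
    """Percentage of the full set of names accounted for by one name (assumed formula)."""
    return 100 * count / total
```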
Open Government Licence 3.0http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
Rank and count of the top names for baby boys, changes in rank since the previous year and breakdown by country, region, mother's age and month of birth.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains data files for a research paper, "Ethnicity-Based Name Partitioning for Author Name Disambiguation Using Supervised Machine Learning," published in the Journal of the Association for Information Science and Technology.
Four zipped files are uploaded. Each zipped file contains five data files: signatures_train.txt, signatures_test.txt, records.txt, clusters_train.txt, and clusters_test.txt.
1. 'Signatures' files contain lists of name instances. Each name instance (a row) is associated with information as follows.
 - 1st column: instance id (numeric): unique id assigned to a name instance
 - 2nd column: paper id (numeric): unique id assigned to a paper in which the name instance appears as an author name
 - 3rd column: byline position (numeric): integer indicating the position of the name instance in the authorship byline of the paper
 - 4th column: author name (string): name string formatted as surname, comma, and forename(s)
 - 5th column: ethnic name group (string): name ethnicity assigned by Ethnea to the name instance
 - 6th column: affiliation (string): affiliation associated with the name instance, if available in the original data
 - 7th column: block (string): simplified name string of the name instance to indicate its block membership (surname and first forename initial)
 - 8th column: author id (string): unique author id (i.e., author label) assigned by the creators of the original data
2. 'Records' files contain lists of papers. Each paper is associated with information as follows.
 - 1st column: paper id (numeric): unique paper id; this is the unique paper id (2nd column) in Signatures files
 - 2nd column: year (numeric): year of publication
   * Some papers may have wrong publication years due to incorrect indexing or delayed updates in original data
 - 3rd column: venue (string): name of journal or conference in which the paper is published
   * Venue names can be in full string or in a shortened format according to the formats in original data
 - 4th column: authors (string; separated by vertical bar): list of author names that appear in the paper's byline
   * Author names are formatted into surname, comma, and forename(s)
 - 5th column: title words (string; separated by space): words in a title of the paper.
   * Note that common words are stop-listed and each remaining word is stemmed using Porter's stemmer.
3. 'Clusters' files contain lists of clusters. Each cluster is associated with information as follows.
 - 1st column: cluster id (numeric): unique id of a cluster
 - 2nd column: list of name instance ids (Signatures - 1st column) that belong to the same unique author id (Signatures - 8th column).
Signatures and Clusters files consist of two subsets - train and test files - of original labeled data which are randomly split into 50%-50% by the authors of this study.
Original labeled data for AMiner.zip, KISTI.zip, and GESIS.zip came from the studies cited below. If you use one of the uploaded data files, please cite them accordingly.
[AMiner.zip]
Tang, J., Fong, A. C. M., Wang, B., & Zhang, J. (2012). A Unified Probabilistic Framework for Name Disambiguation in Digital Library. IEEE Transactions on Knowledge and Data Engineering, 24(6), 975-987. doi:10.1109/Tkde.2011.13
Wang, X., Tang, J., Cheng, H., & Yu, P. S. (2011). ADANA: Active Name Disambiguation. Paper presented at the 2011 IEEE 11th International Conference on Data Mining.
[KISTI.zip]
Kang, I. S., Kim, P., Lee, S., Jung, H., & You, B. J. (2011). Construction of a Large-Scale Test Set for Author Disambiguation. Information Processing & Management, 47(3), 452-465. doi:10.1016/j.ipm.2010.10.001
Note that the original KISTI data contain errors and duplicates. This study reuses the revised version of KISTI reported in the study below.
Kim, J. (2018). Evaluating author name disambiguation for digital libraries: A case of DBLP. Scientometrics, 116(3), 1867-1886. doi:10.1007/s11192-018-2824-5
[GESIS.zip]
Momeni, F., & Mayr, P. (2016). Evaluating Co-authorship Networks in Author Name Disambiguation for Common Names. Paper presented at the 20th International Conference on Theory and Practice of Digital Libraries (TPDL 2016), Hannover, Germany.
Note that this study reuses the 'Evaluation Set' from the original GESIS data, to which titles were added by the study below.
Kim, J., & Kim, J. (2020). Effect of forename string on author name disambiguation. Journal of the Association for Information Science and Technology, 71(7), 839-855. doi:10.1002/asi.24298
[UM-IRIS.zip]
This labeled dataset was created for this study. For a description of the labeling method, please see 'Method' in the paper below.
Kim, J., Kim, J., & Owen-Smith, J. (In print). Ethnicity-based name partitioning for author name disambiguation using supervised machine learning. Journal of the Association for Information Science and Technology. doi:10.1002/asi.24459
For details on the labeling method and limitations, see the paper below.
Kim, J., & Owen-Smith, J. (2021). ORCID-linked labeled data for evaluating author name disambiguation at scale. Scientometrics. doi:10.1007/s11192-020-03826-6
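The relation between Signatures and Clusters files described above (a cluster lists the instance ids in column 1 that share the author id in column 8) can be sketched in Python. The description does not state the field delimiter, so the tab separator here is an assumption, and the sample rows are made up for illustration:

```python
from collections import defaultdict


def build_clusters(signature_rows, delimiter="\t"):
    """Group instance ids (1st column) by author id (8th column)."""
    clusters = defaultdict(list)
    for row in signature_rows:
        fields = row.rstrip("\n").split(delimiter)
        if len(fields) != 8:
            raise ValueError(f"expected 8 columns, got {len(fields)}")
        instance_id, author_id = fields[0], fields[7]
        clusters[author_id].append(instance_id)
    return dict(clusters)


# Hypothetical rows, not taken from the dataset; delimiter assumed to be a tab
sample_rows = [
    "1\t100\t1\tDoe, Jane\tGROUP_X\tUniversity A\tdoe j\tA1",
    "2\t101\t2\tDoe, John\tGROUP_X\t\tdoe j\tA2",
    "3\t102\t1\tDoe, J.\tGROUP_X\tUniversity A\tdoe j\tA1",
]

clusters = build_clusters(sample_rows)
```

All three sample instances fall in the same block ("doe j"), yet they resolve to two authors, which is the ambiguity the labeled data is meant to capture.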
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A Gensim LDA topic model of English Wikipedia biographical articles, with a list of all 1.8M articles and some associated Wikidata information.
The model has 150 Topics.
This model was developed in the process of isolating a set of visual arts biographical articles, as described in "Clowns in the Visual Artists: Topic Modeling Wikipedia and Wikidata" in the Spring 2022 issue of Art Documentation - https://doi.org/10.1086/719999
Because names, nationalities, and birthdays are so prominent in biographies, the stopwords list removed 170,000 names, surnames, city names, place names, countries, days, months and other time related words (https://github.com/mandiberg/Names-Surnames-and-Countries-for-Stopwords). We also directly removed each article subject’s given and surname, which were almost always the most frequently occurring words in any given article. Otherwise, the model just produced topics based on nationality, and common names and surnames.
Files:
all_enwiki_bios_from_wikidata.csv The list of all Wikidata items for humans with an enwiki page (e.g., a biographical article) was extracted from the Wikidata JSON dump; the list includes gender, occupation, and nationality. This was joined with the converted plaintext from an English Wikipedia dump. This data was downloaded in March 2021.
Wikipedia Biographies LDA Topic Model human readable summary.csv A human readable file with the 150 topics ranked by count of articles per topic from the 1.8M corpus. The most popular topics have categorical descriptions of the occupations of each cluster. Some are marked as not an occupation cluster.
BoW_corpus.mm* model_lda_full_Sep2_150Tv2* These six files comprise the topic model. The code to load them is present in the python files.
dict_full_Aug-28-2021 processed_docs_full_Aug-28-2021.txt processed_docs_1000_Aug-18-2021.txt These are the dictionary and processed corpuses required to build and implement the model using this code. The corpus with the first 1000 items is meant to be used for testing, as the full one is quite large and takes a long time to complete.
topic-model-wikipedia-sept2021.zip The code and settings used for creating and implementing this model are included in this zip and are also available here: https://github.com/mandiberg/topic-model-wikipedia
All-Wikipedia-Biographies-with-topic1.csv All-Wikipedia-Biographies-with-topic1and2.csv These are the lists of 1.8M biographies matched to topics. The "topic1" file includes only the first topic and is the slightly larger list. The "topic1and2" file is slightly smaller because about 2% of articles do not match a second topic.
Analysis-for-Clowns-Visual-Arts.zip These are the raw data and final data produced for "Clowns in the Visual Artists." Please see the article for context.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
[ed. note: from https://www.census.gov/topics/population/genealogy/data/2000_surnames.html as of May 29, 2017. Has also been referenced as http://www.census.gov/genealogy/www/data/2000surnames/index.html]
NOTE: This presentation of data focuses on summarized aggregates of counts of surnames, and does not in any way identify specific individuals.
Tabulations of all surnames occurring 100 or more times in the Census 2000 returns are provided in the files listed below. The first link explains the methodology used for identifying and editing names data. The second link provides an Excel file of the top 1000 surnames. The third link provides zipped Excel and CSV (comma separated) files of the complete list of 151,671 names.
Related Files
[Ed. note: the links point to the original location; all files are available in this archive as well]
- Technical Documentation: Demographic Aspects of Surnames - Census 2000 <1.0MB
- File A: Top 1000 Names <1.0MB
- File B: Surnames Occurring 100 or more times <1.0MB
We provide four dictionaries that provide the racial distributions associated with names in the United States. These dictionaries are used by the latest iteration of the "WRU" package (Khanna et al., 2022) to make probabilistic predictions about the race of individuals, given their names and geolocations. The probabilities cover five racial categories: White, Black, Hispanic, Asian, and Other. We provide two surname dictionaries. The first provides entries P(race | surname) for about 160K names, derived from the 2010 Census surname list, aggregated with the Census Spanish surname list. The second provides analogous probabilities for 1.48MM surnames. This dictionary is created by starting with the Census-based dictionary and supplementing it with race distributions estimated from the voter files of six Southern states -- Alabama, Florida, Georgia, Louisiana, North Carolina, and South Carolina -- that collect race data. We also provide dictionaries estimating P(race | first name) and P(race | middle name). These dictionaries -- which contain 1.04MM and 1.16MM names respectively -- are sourced exclusively from the voter files of the six Southern states. References Kabir Khanna, Brandon Bertelsen, Santiago Olivella, Evan Rosenman and Kosuke Imai (2022). wru: Who are You? Bayesian Prediction of Racial Category Using Surname, First Name, Middle Name, and Geolocation. R package version 1.0.0. https://CRAN.R-project.org/package=wru
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This page contains four datasets released for the paper entitled "ORCID-Linked Labeled Data for Evaluating Author Name Disambiguation at Scale" to be published in Scientometrics (In print).
1. AUT_ORC.zip: this contains a list of 3M author name instances in MEDLINE linked to Author-ity2009.
2. AUT_NIH.zip: this contains a list of 313K author name instances in MEDLINE linked to NIH PI ID.
3. AUT_SCT_pairs.zip: this contains a list of 6.2M paper pairs and author byline positions in self-citation relation.
4. AUT-SCT_info.zip: this contains a list of 4.7M author name instances in self-citation relation as recorded in AUT_SCT_pairs. Information about an author name instance in AUT-SCT_pairs can be connected to AUT-SCT_info using the combination of PMID and Byline Position as a key.
Please see the paper for details on how the datasets were created.
Kim, J., & Owen-Smith, J. (In print). ORCID-linked labeled data for evaluating author name disambiguation at scale. Scientometrics. doi:10.1007/s11192-020-03826-6
The uploaded datasets were created by combining several data sources below.
1. ORCID data were downloaded from the link below for the 2018 version. Please refer to the policies on the use of ORCID data.
https://info.orcid.org/public-data-file-use-policy/
2. MEDLINE baseline data were downloaded from the link below for the 2016 version. Please refer to the policies on the use of MEDLINE data.
https://www.nlm.nih.gov/databases/download/pubmed_medline.html
3. Author-ity2009, Ethnea, and Genni datasets were downloaded from the link below. Please refer to the policies on the use of those datasets.
https://databank.illinois.edu/datasets/IDB-9087546
Please cite the three papers below to properly give credit to the creators of the original datasets.
Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. Acm Transactions on Knowledge Discovery from Data, 3(3). doi:10.1145/1552303.1552304
Torvik VI, Agarwal S. Ethnea -- an instance-based ethnicity classifier based on geocoded author names in a large-scale bibliographic database. International Symposium on Science of Science March 22-23, 2016 - Library of Congress, Washington, DC, USA. http://hdl.handle.net/2142/88927
Smith, B., Singh, M., & Torvik, V. (2013). A search engine approach to estimating temporal changes in gender orientation of first names. Proceedings Of The ACM/IEEE Joint Conference On Digital Libraries, (JCDL 2013 - Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries), 199-208. doi:10.1145/2467696.2467720
4. The dataset of NIH ID linked to Author-ity2009 was downloaded from the link below.
https://figshare.com/articles/dataset/PLoS_2016_csv/3407461/1
Please cite the paper below to properly give credit to the creators of the original dataset.
Lerchenmueller, M. J., & Sorenson, O. (2016). Author Disambiguation in PubMed: Evidence on the Precision and Recall of Author-ity among NIH-Funded Scientists. PLOS ONE, 11(7), e0158731. doi:10.1371/journal.pone.0158731
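The link between the self-citation pairs and the author information, keyed on the combination of PMID and byline position, can be sketched as a dictionary lookup in Python. The field names and the layout of a pair (citing and cited sides) are assumptions made for illustration; the actual column names are not given in this description, and the sample values are made up:

```python
def build_info_index(info_rows):
    """Index author-instance records by the (PMID, byline position) key."""
    return {(row["pmid"], row["byline_position"]): row for row in info_rows}


def attach_info(pair, info_index):
    """Return the info records for both sides of a self-citation pair (None if absent)."""
    citing = info_index.get((pair["citing_pmid"], pair["citing_position"]))
    cited = info_index.get((pair["cited_pmid"], pair["cited_position"]))
    return citing, cited


# Hypothetical records; field names and values are illustrative only
info_rows = [
    {"pmid": 111, "byline_position": 1, "name": "Doe, J"},
    {"pmid": 222, "byline_position": 3, "name": "Doe, Jane"},
]
pair = {"citing_pmid": 222, "citing_position": 3, "cited_pmid": 111, "cited_position": 1}

info_index = build_info_index(info_rows)
citing_info, cited_info = attach_info(pair, info_index)
```

Using a tuple key keeps the two parts of the identifier separate, so the same PMID at a different byline position maps to a different record.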
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The SubInfo.mat file contains information about the file names for MEG data, eye movement data, and Psychtoolbox data. In total, we have 39 participants for both datasets. The field name for the first dataset is 'Targ60'; for the second one it is 'JEP60' (adapted from Degno et al. 2019, Journal of Experimental Psychology). The SubInfo.mat file also contains the MRI codes. For participants who do not have MRI (T1) images, the code is nan and a template T1 image will be used in the source modeling. We have T1 images for 36 participants.
The ARS Water Data Base is a collection of precipitation and streamflow data from small agricultural watersheds in the United States. This national archive of variable time-series readings for precipitation and runoff contains sufficient detail to reconstruct storm hydrographs and hyetographs. There are currently about 14,000 station years of data stored in the data base. Watersheds used as study areas range from 0.2 hectare (0.5 acres) to 12,400 square kilometers (4,786 square miles). Raingage networks range from one station per watershed to over 200 stations. The period of record for individual watersheds varies from 1 to 50 years. Some watersheds have been in continuous operation since the mid 1930's.
Resources in this dataset:
Resource Title: FORMAT INFORMATION FOR VARIOUS RECORD TYPES. File Name: format.txt
Resource Description: Format information identifying fields and their length will be included in this file for all files except those ending with the extension .txt
TYPES OF FILES
As indicated in the previous section, data has been stored by location number in the form LXX, where XX is the location number. In each subdirectory, there will be various files using the following naming conventions:
- Runoff data: WSXXX.zip where XXX is the watershed number assigned by the WDC. This number may or may not correspond to a naming convention used in common literature.
- Rainfall data: RGXXXXXX.zip where XXXXXX is the rain gage station identification.
- Maximum-minimum daily air temperature: MMTXXXXX.zip where XXXXX is the watershed number assigned by the WDC.
- Ancillary text files: NOTXXXXX.txt where XXXXX is the watershed number assigned by the WDC. These files will contain textual information including latitude-longitude, name commonly used in literature, acreage, most commonly-associated rain gage(s) (if known by the WDC), a list of all rain gages on or near the watershed, and land use, topography, and soils as known by the WDC.
- Topographic maps of the watersheds: MAPXXXXX.zip where XXXXX is the location/watershed number assigned by the WDC. Map files are binary TIF files.
NOT ALL FILE TYPES MAY BE AVAILABLE FOR SPECIFIC WATERSHEDS. Data files are still being compiled and translated into a form viable for this archive. Please bear with us while we grow.
Resource Title: Data Inventory - watersheds. File Name: inventor.txt
Resource Description: Watersheds at which records of runoff were being collected by the Agricultural Research Service. Variables: Study Location & Number of Rain Gages1; Name; Lat.; Long; Number; Pub. Code; Record Began; Land Use2; Area (Acres); Types of Data3
Resource Title: Information about the ARS Water Database. File Name: README.txt
Resource Title: INDEX TO INFORMATION ON EXPERIMENTAL AGRICULTURAL WATERSHEDS. File Name: INDEX.TXT
Resource Description: This report includes identification information on all watersheds operated by the ARS. Only some of these are included in the ARS Water Data Base. They are so indicated in the column titled ARS Water Data Base. Other watersheds will not have data available here or through the Water Data Center. This index is particularly important since it relates watershed names with the indexing system used by the Water Data Center. Each location has been assigned a number. The data for that location will be stored in a sub-directory coded as LXX where XX is the location number. The index also indicates the watershed number used by the WDC. Data for a particular watershed will be stored in a compressed file named WSXXXXX.zip where XXXXX is the watershed number assigned by the WDC. Although not included in the index, rain gage information will be stored in compressed files named RGXXXXXX.zip where XXXXXX is a 6-character identification of the rain gage station. The Index also provides information such as latitude-longitude for each of the watersheds, acreage, and the period-of-record for each acreage.
Multiple entries for a particular watershed will either indicate that the acreage designated for the watershed changed or there was a break in operations of the watershed.
Resource Title: ARS Water Database files. File Name: ars_water.zip
Resource Description: USING THIS SYSTEM
Before downloading huge amounts of data from the ARS Water Data Base, you should first review the text files included in this directory. They include:
INDEX OF ARS EXPERIMENTAL WATERSHEDS: index.txt
This report includes identification information on all watersheds operated by the ARS. Only some of these are included in the ARS Water Data Base. They are so indicated in the column titled ARS Water Data Base. Other watersheds will not have data available here or through the Water Data Center. This index is particularly important since it relates watershed names with the indexing system used by the Water Data Center. Each location has been assigned a number. The data for that location will be stored in a sub-directory coded as LXX where XX is the location number. The index also indicates the watershed number used by the WDC. Data for a particular watershed will be stored in a compressed file named WSXXXXX.zip where XXXXX is the watershed number assigned by the WDC. Although not included in the index, rain gage information will be stored in compressed files named RGXXXXXX.zip where XXXXXX is a 6-character identification of the rain gage station. The Index also provides information such as latitude-longitude for each of the watersheds, acreage, and the period-of-record for each acreage. Multiple entries for a particular watershed will either indicate that the acreage designated for the watershed changed or there was a break in operations of the watershed.
STATION TABLE FOR THE ARS WATER DATA BASE: station.txt
This report indicates the period of record for each recording station represented in the ARS Water Data Base. The data for a particular station will be stored in a single compressed file.
FORMAT INFORMATION FOR VARIOUS RECORD TYPES: format.txt
Format information identifying fields and their length will be included in this file for all files except those ending with the extension .txt
TYPES OF FILES
As indicated in the previous section, data has been stored by location number in the form LXX, where XX is the location number. In each subdirectory, there will be various files using the following naming conventions:
- Runoff data: WSXXX.zip where XXX is the watershed number assigned by the WDC. This number may or may not correspond to a naming convention used in common literature.
- Rainfall data: RGXXXXXX.zip where XXXXXX is the rain gage station identification.
- Maximum-minimum daily air temperature: MMTXXXXX.zip where XXXXX is the watershed number assigned by the WDC.
- Ancillary text files: NOTXXXXX.txt where XXXXX is the watershed number assigned by the WDC. These files will contain textual information including latitude-longitude, name commonly used in literature, acreage, most commonly-associated rain gage(s) (if known by the WDC), a list of all rain gages on or near the watershed, and land use, topography, and soils as known by the WDC.
- Topographic maps of the watersheds: MAPXXXXX.zip where XXXXX is the location/watershed number assigned by the WDC. Map files are binary TIF files.
NOT ALL FILE TYPES MAY BE AVAILABLE FOR SPECIFIC WATERSHEDS. Data files are still being compiled and translated into a form viable for this archive. Please bear with us while we grow.
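The naming conventions above lend themselves to a small classifier. The Python sketch below is an illustration only: it accepts 3 to 5 characters after WS because the description gives both WSXXX.zip and WSXXXXX.zip, it assumes identifiers are alphanumeric, and the example file names are made up:

```python
import re

# One pattern per file type named in the description (identifier widths as stated there)
PATTERNS = {
    "runoff": re.compile(r"^WS([A-Za-z0-9]{3,5})\.zip$", re.IGNORECASE),
    "rainfall": re.compile(r"^RG([A-Za-z0-9]{6})\.zip$", re.IGNORECASE),
    "max_min_temperature": re.compile(r"^MMT([A-Za-z0-9]{5})\.zip$", re.IGNORECASE),
    "ancillary_text": re.compile(r"^NOT([A-Za-z0-9]{5})\.txt$", re.IGNORECASE),
    "topographic_map": re.compile(r"^MAP([A-Za-z0-9]{5})\.zip$", re.IGNORECASE),
}


def classify(filename):
    """Return (file type, identifier) for a recognised name, else None."""
    for kind, pattern in PATTERNS.items():
        match = pattern.match(filename)
        if match:
            return kind, match.group(1)
    return None


def location_dir(location_number):
    """Sub-directory name in the LXX form for a location number."""
    return f"L{location_number:02d}"


# Hypothetical file names for illustration
examples = ["WS123.zip", "RG000123.zip", "NOT26010.txt", "README.txt"]
classified = [classify(name) for name in examples]
```

Files that match none of the conventions, such as README.txt, come back as None, so the index and format documents are not mistaken for data files.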
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the scripts and input data files for LaMEM to produce data for the surrogate model construction described in the paper “Challenges of Quantifying Geodynamic Effects – Insights from the Alpine Region”
Contents:
Abstract
Biodiversity research has advanced by testing expectations of ecological and evolutionary hypotheses through the linking of large-scale genetic, distributional, and trait datasets. The rise of molecular systematics over the past 30 years has resulted in a wealth of DNA sequences from around the globe. Yet, advances in molecular systematics also have created taxonomic instability, as new estimates of evolutionary relationships and interpretations of species limits have required widespread scientific name changes. Taxonomic instability, colloquially “splits, lumps, and shuffles,” presents logistical challenges to large-scale biodiversity research because (1) the same species or sets of populations may be listed under different names in different data sources, or (2) the same name may apply to different sets of populations representing different taxonomic concepts. Consequently, distributional and trait data are often difficult to link directly to primary DNA sequen...
Taxonomic reconciliation
We downloaded all names from the NCBI Taxonomy database (Schoch et al., 2020) that descended from “Aves” (TaxID: 8782) on 3 May 2020 (Data Repository D2). From this list, we extracted all species and subspecies names as well as their NCBI Taxonomy ID (TaxID) numbers. We then ran a custom Perl script (Data Repository D3) to exactly match binomial (genus, species) and trinomial (genus, species, subspecies) names from NCBI Taxonomy to the names recognized by eBird/Clements v2019 Integrated Checklist (August 2019; Data Repository D4). For each mismatch with the NCBI Taxonomy name, we then identified the corresponding equivalent eBird/Clements species or subspecies. We first searched for names in Avibase (Lepage et al., 2014). However, Avibase's search function currently facilitates only exact matches to taxonomies it implements.
For names that were not an exact match to an Avibase taxonomic concept, we implemented web searches (Google) which often identified minor sp...
D1:
"PetersVsClements2Final.txt" - This file tells which species from the Peters taxonomy match the 2019 Clements/eBird taxonomy. The first column has a species name from the Peters taxonomy. In the second column, "Clements" indicates that the species name matches the Clements/eBird taxonomy, "No" means it does not match, and "Close" means that the names match when you disregard the last two letters.
"SibleyMonroeVsClements_Final.txt" - This file tells which species from the Sibley Monroe taxonomy match the 2019 Clements/eBird taxonomy. The first column has a species ID number from the Sibley Monroe taxonomy. The second column has the species scientific name from the Sibley Monroe taxonomy. The third column has the common name from the Sibley Monroe taxonomy. In the fourth column, "Clements" indicates that the species name matches the Clements/eBird taxonomy, "No" means it does not match, and "Close" means that the names match when you disregard the last two letters.
D2:
"taxonomy_result.unix.xml" ...
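The three match labels used in the D1 files ("Clements", "Close", "No") can be illustrated with a short Python sketch. This is one reading of "match when you disregard the last two letters" (both names are truncated by two characters before comparison), not the authors' script, and the placeholder names are not real taxa:

```python
from typing import Iterable


def match_status(name: str, clements_names: Iterable[str]) -> str:
    """Label a name as an exact match, a match ignoring the last two letters, or no match."""
    reference = set(clements_names)
    if name in reference:
        return "Clements"
    stem = name[:-2]
    # Assumed interpretation: drop the final two letters of both names, then compare
    if any(candidate[:-2] == stem for candidate in reference):
        return "Close"
    return "No"


# Placeholder names for illustration only
reference_names = ["Aus albus", "Bus niger"]

exact = match_status("Aus albus", reference_names)
close = match_status("Aus album", reference_names)
none = match_status("Cus ruber", reference_names)
```

A "Close" label typically flags names that differ only in a Latin ending, so those rows are candidates for manual review rather than automatic acceptance.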
Personal names are a universal feature of human language, yet few analogs exist in other species. While dolphins and parrots address conspecifics by imitating the calls of the addressee 1,2, human names are not imitations of the sounds typically made by the named individual 3. Labeling objects or individuals without relying on imitation of the sounds made by the referent radically expands the expressive power of language. Thus, if non-imitative name analogs were found in other species, this could have important implications for our understanding of language evolution. Here, we present evidence that wild African elephants address one another with individually specific calls, likely without relying on imitation of the receiver. We used machine learning to demonstrate that the receiver of a call could be predicted from the call’s acoustic structure, regardless of how similar the call was to the receiver’s vocalizations. Moreover, elephants differentially responded to playbacks of calls or..., Excel or similar spreadsheet program to open CSV files. R or RStudio to open and run code, and to open RDS file. Observational dataset (.csv file): Behavioral context information for vocalizations recorded from wild African elephants in Samburu and Buffalo Springs National Reserves, Kenya between November 2019 and April 2022. Each row represents a single call and information such as identity of the caller, identity of the addressee, and behavioral context, is provided. Information is also included linking each call to the original sound file and Raven Pro selection table from which it came (but the sound files and Raven Pro selection tables are not included in this archive). Acoustic measurements dataset (.rds file): R list object containing vectors of acoustic measurements for each call. The list is two-tiered; each outer slot corresponds to a single call, and each inner slot corresponds to an acoustic measurement, represented as a vector of numbers.
R code (.R file): Code for aligning...
# African elephants address one another with individually specific calls
We investigated the hypothesis that elephants address individual members of their family group with name-like calls. We recorded contact, greeting, and caregiving rumbles from wild African elephants in Samburu & Buffalo Springs National Reserves, northern Kenya and Amboseli National Park, southern Kenya, noting when possible the identity of the caller and the identity of the receiver. We measured a suite of acoustic features on each call and found that calls were specific to individual receivers and receiver identity could be predicted from call structure at better than chance levels. We also played back calls to the elephants and found that elephants responded more quickly and vocalized more in response to calls that were originally addressed to them compared to calls from the same caller that were originally addressed to someone else. These results provide the first evidence that elephants address one ano...
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In 2022, Lotte Burkhardt published a free PDF entitled Encyclopedia of Eponymic Plant names. It consists of two volumes: one lists all plant, algae, lichen, fossil plant, and fungal genera with the person each was named after; the other takes the list of people honored and lists the genera named after them. It can be found online here.
This dataset was created by Carmen Ulloa Ulloa by scraping the PDF of the A-Z names of people honored and converting it into a Google Sheet. The data were normalized so that each row represents a person, with the eponymic genera and the associated families split into multiple columns to make analysis easier. The data were then cleaned, as the conversion from PDF was not 100% accurate: some names were split onto multiple lines, characters were misread, etc. The gender of the authors was annotated by the Women Plant Genera working group as part of our follow-up work to a previous paper.
We have split the resulting table into three files. The first contains the entire list of people honoured and the genera named for them. The other two are subsets of the first: one contains only the flowering plant genera, and the other excludes plant genera.
Most of the women in the plants-only tab have been marked up from this project. More information could be added to the women for whom non-plant genera were named. We highly encourage anyone who is interested in an analysis of their own based on this data to do so, and to get in touch with us with any questions. We anticipate that work on additional groups will deepen our understanding of the impact of the contributions women have made to botany. Our hope is that by making this dataset publicly available, others will explore the world of genera and eponymy, looking at interesting stories of people for whom genera were named.
The team would be grateful for any updates or corrections to this data, and we plan to publish updated versions of this dataset accordingly.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This explanation pertains to the data prepared for Non sola scriptura: Essays on the Qur’an and Islam in Honour of William A. Graham (Routledge), Chapter by Sarah Bowen Savant, “People versus Books.”
We are releasing the data that were used to create Graphs 1 and 2 and Tables 1-3 for the chapter.
Note: All the data files (except the text in number 3) are in TSV (Tab Separated Values) format, which any text editor or tabular data editor, such as Excel, can open.
“IsnadFractions_PeopleversusBooks”. This file represents a filtered version of an output from Ryan Muther’s isnād classifier algorithm. Muther ran the algorithm in July 2020, based on the Version 2020.1.2 release of the corpus, available at: http://doi.org/10.5281/zenodo.3891466. The data file includes:
author: the name of the author.
died: death date of author. NB: Especially the early dates cannot be relied on.
title: the title of the author’s book, from the OpenITI Corpus.
length: length of the book, measured in word-tokens.
isnad_fraction: the percentage of the book’s word-tokens that are made up of isnāds.
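As a rough illustration of how these columns fit together, the sketch below reads a TSV in this layout and estimates the number of isnād word-tokens per book as length × isnad_fraction. The sample rows are invented placeholders, not values from the file, and whether isnad_fraction is stored on a 0-100 or a 0-1 scale is an assumption to check against the real data (hence the `percent` switch).

```python
import csv
import io

# Hypothetical two-row sample in the column layout described above.
SAMPLE_TSV = (
    "author\tdied\ttitle\tlength\tisnad_fraction\n"
    "AuthorA\t300\tBookA\t1000\t25.0\n"
    "AuthorB\t500\tBookB\t2000\t5.0\n"
)


def isnad_tokens(tsv_text: str, percent: bool = True) -> dict[str, float]:
    """Map each title to its estimated number of isnad word-tokens.

    percent=True assumes isnad_fraction is on a 0-100 scale;
    percent=False assumes a 0-1 scale.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    scale = 100.0 if percent else 1.0
    return {
        row["title"]: int(row["length"]) * float(row["isnad_fraction"]) / scale
        for row in reader
    }


estimates = isnad_tokens(SAMPLE_TSV)
print(estimates)
```

For the real file, the text of the TSV could be read from disk and passed to `isnad_tokens` in place of `SAMPLE_TSV`.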
“GALTags_PeopleversusBooks”. Books in the OpenITI were mapped by Walid A. Akef in 2018 to:
Brockelmann, Carl, History of the Arabic Written Traditions, trans. Joep Lameer, 2 vols and 3 supplements, Leiden: Brill, 2016-2018.
The file includes the following columns:
id: book id, from the OpenITI Corpus.
gal_tags: the GAL tags, also used in the OpenITI Corpus
“0571IbnCasakir.TarikhDimashq.JK000916-ara1.mARkdown”. The Ibn ʿAsākir text file, from the Version 2020.1.2 release of the OpenITI Corpus.
“NamedEntities_PeopleversusBooks”. This is a very first effort at working on named entities in Ibn ʿAsākir’s Taʾrīkh Madīnat Dimashq and represents only a tiny fraction of the surface forms of names. Most of the names pertain to persons who transmitted from Ibn Saʿd. There may be some duplicate surface forms (which does not affect the method). We use this list to replace the surface forms with transliterated values. The columns are described below:
name: the normalized name.
ar_name: the Arabic name, i.e. the surface form.
“SplittingTerms_PeopleversusBooks”. We started with a list of transmissive terms that R. Kevin Jaques originated and then added more terms, which include the various normalized forms of the same term. We used this list to split isnāds into names.
“IbnSadIsnads_PeopleversusBooks”. This file includes the pieces of text that the algorithm tags as isnāds in the text. We extracted the tagged pieces and made a list of isnāds. Almost all of the isnāds start with a transmissive term. We use this file to extract the names and clean some rows to generate a data table that we can use for clustering. Below is a brief description of the columns:
text_ID: this contains the book id from the OpenITI Corpus. This column can be ignored as we are using it for one text in this project. However, it is required in the collection of isnāds from multiple texts.
id: a unique identifier assigned to each isnād. The isnād classifier algorithm assigns this id, which can be used to identify each isnād in the text when required.
isnad_text: the isnād that we extract from the text.
length: length of the extracted isnād in tokens
name_at_position_X: the rest of the columns in this table include the pieces of the isnād that we get after splitting the isnāds with a list of terms. Each column contains a name or any string that appears between two transmissive terms. Some cells are empty, probably because we missed some transmissive terms.
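The splitting step behind the name_at_position_X columns can be sketched as follows. This is a simplified illustration, not the project's code: it assumes whitespace tokenization and single-token transmissive terms, the function name `split_isnad` is hypothetical, and the transliterated terms and names in the example are invented stand-ins for the Arabic text.

```python
def split_isnad(isnad_text: str, terms: list[str]) -> list[str]:
    """Split an isnad into the strings that appear between transmissive terms.

    Tokens that are transmissive terms act as separators; runs of other
    tokens are joined back into one name string. Empty segments are dropped
    here, whereas the data table described above keeps empty cells.
    """
    term_set = set(terms)
    names: list[str] = []
    current: list[str] = []
    for token in isnad_text.split():
        if token in term_set:
            if current:
                names.append(" ".join(current))
                current = []
        else:
            current.append(token)
    if current:
        names.append(" ".join(current))
    return names


# Invented example; real input would be Arabic text and the released term list.
terms = ["haddathana", "akhbarana", "an"]
names = split_isnad("haddathana Ahmad ibn Ali akhbarana Zayd an Umar", terms)
print(names)
```

If a transmissive term is missing from the list, two names end up merged into one string, which is one way the gaps and overlong cells mentioned above can arise.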
“IbnSadClusters_PeopleversusBooks”. This file includes clusters of isnāds of length six (i.e. isnāds that include six names). We have used the affinity propagation (AP) clustering algorithm based on the Levenshtein similarity score of the names. Below is the column description:
frequency: the frequency of the isnād in the data
cluster_id: the id of the cluster to which the isnād belongs
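The similarity score that feeds the clustering can be sketched in plain Python. The source names Levenshtein similarity but does not say how the distance is normalized, so dividing by the longer string's length is an assumption here; the function names are hypothetical, and the affinity propagation step itself is not shown.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; normalizing by the longer length is an assumption."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"), similarity("abc", "abc"))
```

A matrix of such pairwise scores could then be handed to an affinity propagation implementation that accepts precomputed similarities, for example scikit-learn's `AffinityPropagation` with `affinity="precomputed"`.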
“PassimCol-Definition_PeopleversusBooks”. Description of the columns in passim outputs.