https://www.etalab.gouv.fr/licence-ouverte-open-licence
In order to facilitate the anonymisation of data, this list of first names and surnames was extracted from the SIRENE database of INSEE.
For each first name and surname, the number of appearances is indicated.
ATTENTION: No content check is done, and these lists may contain anomalies present in the original database!
The data (name, year of birth, sex, and number) are from a 100 percent sample of Social Security card applications for 1880 on.
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
A unique resource that has been developed in cooperation with a team of native-speaker experts in Persian phonology. The data includes a confidence rank to indicate the relative likelihood that a variant will be encountered in the real world.
Search for a business by name. You can obtain business information and then proceed to purchase a certificate of good standing or other documents. The purpose of this search is simply to determine whether a company/entity exists and to provide basic information on the company/entity.
https://creativecommons.org/publicdomain/zero/1.0/
Dataset of US baby names from 1910 to 2021. Includes State, Sex, Year, Name, and Count as features.
Mainly used for a tutorial but can be used for classification/other visualizations.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Summary data on changes to data in the Plant Names Database in the following classes: the addition of new names for formal deprecation of duplicate names; changes to the status of a name as preferred name or synonym for a taxon; updating the origin or occurrence of a taxon within New Zealand; applying changes to the classification of a taxon; and updating the scientific article that is applied to the taxa to determine whether the name is a synonym or preferred name.
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
A resource of Arab personal names and variants in the original Arabic script. This database covers several hundred thousand Arabic script variants, along with common spelling mistakes. Every Arabic name is normalized and vocalized.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
List of first names in Brazil, collected by IBGE.
This dataset could be used in many applications, including helping de-identify text.
The files contain one name per line, including variations of the same names.
http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
This resource covers four million Japanese names and their romanized variants, and includes gender codes, classification codes, and frequency rankings.
Street Name Master List: contains all the reserved and active street names.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This is the HANA database of handwritten personal names as introduced in the paper HANA: A HAndwritten NAme Database for Offline Handwritten Text Recognition (official code available here). The minipics are from police register sheets from Copenhagen which cover all adults (above the age of 10) residing in the capital of Denmark, Copenhagen, in the period from 1890 to 1923.
The labels in the .csv files refer to the main character on the original register sheets. Each row contains a reference to the corresponding image as the first element and the name as the second element. The HANA database consists of 1,105,904 images with corresponding labels. The last name is always a single word: if multiple last names were transcribed, the last of these was chosen as the last name, and the remaining ones were moved to the end of the first names. The first names can consist of up to nine individual words.
All names are written in lower case letters and contain only characters that are used in Danish words, that is, 29 alphabetic characters, including the letters æ, ø, and å.
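The label conventions described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample rows and image file names are invented, the absence of a header row is assumed, and the words of a name are assumed to be separated by single spaces. The helper names `split_label` and `uses_danish_alphabet` are hypothetical and are not part of the official HANA code.

```python
import csv
import io

# The 29 lower-case letters of the Danish alphabet mentioned in the description.
DANISH_LETTERS = set("abcdefghijklmnopqrstuvwxyzæøå")


def split_label(name):
    """Split a label into (first names, last name).

    Assumes the last whitespace-separated word is the last name,
    as the dataset description states.
    """
    words = name.split()
    if not words:
        return "", ""
    return " ".join(words[:-1]), words[-1]


def uses_danish_alphabet(name):
    """Return True if the label contains only Danish letters and spaces."""
    return all(ch == " " or ch in DANISH_LETTERS for ch in name)


# Hypothetical rows in the documented layout: image reference, then name.
sample = "img_0001.jpg,anna marie jensen\nimg_0002.jpg,søren ågård\n"

parsed = []
for image_ref, name in csv.reader(io.StringIO(sample)):
    first_names, last_name = split_label(name)
    parsed.append((image_ref, first_names, last_name, uses_danish_alphabet(name)))
```

For real files, replace the in-memory sample with an opened .csv file and check whether the first row is a header before parsing.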
If anything is missing or if you are interested in the original documents from Copenhagen Archives to improve, e.g., the segmentation, feel free to reach out at sfw@sam.sdu.dk.
We wish you the best of luck.
We provide datasets that estimate the racial distributions associated with first, middle, and last names in the United States. The datasets cover five racial categories: White, Black, Hispanic, Asian, and Other. The provided data are computed from the voter files of six Southern states (Alabama, Florida, Georgia, Louisiana, North Carolina, and South Carolina) that collect race and ethnicity data upon registration. We include seven voter files per state, sourced between 2018 and 2021 from L2, Inc. Together, these states have approximately 36 million individuals who provide self-reported race and ethnicity. The last name dataset includes 338K surnames, the middle name dataset contains 126K middle names, and the first name dataset includes 136K first names. For each type of name, we provide a dataset of P(race | name) probabilities and P(name | race) probabilities. We include only names that appear at least 25 times across the 42 (= 7 voter files * 6 states) voter files in our dataset. These data are closely related to the dataset "Name Dictionaries for "wru" R Package", https://doi.org/10.7910/DVN/7TRYAC. These are the probabilities used in the latest iteration of the "WRU" package (Khanna et al., 2022) to make probabilistic predictions about the race of individuals, given their names and geolocations.
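To make the two kinds of tables concrete, here is a minimal sketch of how P(race | name) and P(name | race) could be derived from raw counts. It is not the authors' pipeline: the function name `name_race_probabilities`, the toy counts, and the choice to drop rare names before computing any totals are all assumptions made for illustration. Only the threshold of 25 occurrences comes from the description above.

```python
from collections import defaultdict


def name_race_probabilities(counts, min_count=25):
    """Build P(race | name) and P(name | race) from (name, race) -> count.

    Assumption: names seen fewer than `min_count` times in total are
    removed before any probability is computed.
    """
    # Total occurrences of each name across all races.
    name_totals = defaultdict(int)
    for (name, _race), n in counts.items():
        name_totals[name] += n

    kept = {key: n for key, n in counts.items() if name_totals[key[0]] >= min_count}

    # Total occurrences of each race across the names that were kept.
    race_totals = defaultdict(int)
    for (_name, race), n in kept.items():
        race_totals[race] += n

    p_race_given_name = {key: n / name_totals[key[0]] for key, n in kept.items()}
    p_name_given_race = {key: n / race_totals[key[1]] for key, n in kept.items()}
    return p_race_given_name, p_name_given_race


# Invented toy counts, not taken from the voter files.
toy_counts = {
    ("ana", "Hispanic"): 30,
    ("ana", "White"): 10,
    ("lee", "Asian"): 40,
    ("lee", "White"): 20,
    ("bo", "Other"): 3,  # below the threshold, so it is dropped
}

p_race_given_name, p_name_given_race = name_race_probabilities(toy_counts)
```

In this sketch the P(race | name) values for a given name sum to 1, and the P(name | race) values for a given race sum to 1 over the names that were kept.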
https://creativecommons.org/publicdomain/zero/1.0/
In Portuguese (the official language of Brazil), people's names are usually related to their natural gender (male or female only). That being said, the name of a person alone contains patterns that can reveal their natural gender.
The dataset contains only two columns: name and gender.
The main goal of this dataset is to make it possible to automatically classify the natural gender of a person based only on their name.
Database of Irish Place Names
https://creativecommons.org/publicdomain/zero/1.0/
This dataset consists of several files containing over 1,000 names and their respective gender information. It can be used to predict gender from the way a name sounds.
Searching for a person's name in a database is a unique challenge. Depending on the source and age of the data, you may not be able to count on the spelling of the name being correct, or even the same name being spelled the same way when it appears more than once. Discrepancies between stored data and search terms may be introduced due to personal choice or cultural differences in spellings, homophones, transcription errors, illiteracy, or simply lack of standardized spellings during some time periods. These sorts of problems are especially prevalent in transcriptions of handwritten historical records used by historians, genealogists, and other researchers.
A common way to solve the string-search problem is to look for values that are "close" to the same as the search target. Using a traditional fuzzy match algorithm to compute the closeness of two arbitrary strings is expensive, though, and it isn't appropriate for searching large data sets. A better solution is to compute hash values for entries in the database in advance, and several special hash algorithms have been created for this purpose. These phonetic hash algorithms allow you to compare two words or names based on how they sound, rather than the precise spelling.
Early Efforts: Soundex
One such algorithm is Soundex, developed by Margaret K. Odell and Robert C. Russell in the early 1900s. The Soundex algorithm appears frequently in genealogical contexts because it's associated with the U.S. Census and is specifically designed to encode names. A Soundex hash value is calculated by using the first letter of the name and converting the consonants in the rest of the name to digits by using a simple lookup table. Vowels and duplicate encoded values are dropped, and the result is padded up to, or truncated down to, four characters.
The Fuzzy library includes a Soundex implementation for Python programs.
This dataset can be used to explore the power of Fuzzy.
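The Soundex encoding described above can be sketched in plain Python without any third-party library. This follows the commonly cited American Soundex rules, in which H and W do not separate two consonants with the same code. That detail, and the handling of empty or non-alphabetic input, are assumptions of this sketch. It is not the Fuzzy library's implementation and may differ from it in edge cases.

```python
# Lookup table: consonant groups map to the digits 1 to 6.
CODES = {}
for digit, group in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for letter in group:
        CODES[letter] = str(digit)


def soundex(name):
    """Return the four-character Soundex code of a name ("" if no letters)."""
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not break a run of identical codes
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code after it is kept
    # Pad with zeros, then truncate to one letter plus three digits.
    return (first + "".join(digits) + "000")[:4]


print(soundex("Robert"), soundex("Rupert"))  # both names map to the same code
```

Because the hash can be computed once per record and stored, a lookup becomes an exact match on the code rather than a pairwise fuzzy comparison against every entry.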
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
This database is part of the ArabLEX set of data which consists of the Database of Arabic General Vocabulary (DAG), Database of Arabic Place Names (DAP), Database of Foreign Names in Arabic (DAF) and Database of Arab Names (DAN) available from ELRA under references, respectively, ELRA-L0131, ELRA-M0105, ELRA-M0106 and ELRA-M0107.
This full-form Arabic-English place name database of over 21,000 lemmas and nearly 6.5 million forms provides worldwide coverage of common place names, given in standard MSA orthography, and includes all inflected and cliticized forms for each place name. In addition, precise phonemic transcriptions and full vowel diacritics are designed to enhance Arabic speech technology. Orthographic variants are also extensively covered.
This database is provided with three options: 1) proclitics, 2) phonetic information (CARS) and 3) orthographic variants. Subsets excluding some of the three proposed options may be provided upon demand. CARS is an accurate phonemic transcription. Optionally, phonetic transcriptions, IPA and/or SAMPA, can be provided, fine tuned to a customer's specifications.
Quantity and size: 6,455,201 lines / 812 MB
File format: flat TSV text files
Samples and a specifications document available upon request.
https://www.ontario.ca/page/copyright-information
This dataset contains a listing of individuals who have had their name formally changed in Ontario.
This data is made publicly available through the Ontario Gazette.
https://whoisdatacenter.com/terms-of-use/
Name Find Source LLC Whois Database: discover comprehensive ownership details, registration dates, and more for Name Find Source LLC with Whois Data Center.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
The GeoNames geographical database contains over 10 million geographical names and consists of over 9 million unique features with 2.8 million populated places and 5.5 million alternate names. All features are categorized into one out of nine feature classes and further subcategorized into one out of 645 feature codes.
The main 'geoname' table has the following fields :
AdminCodes:
Most adm1 are FIPS codes. ISO codes are used for US, CH, BE and ME. UK and Greece are using an additional level between country and fips code. The code '00' stands for general features where no specific adm1 code is defined. The corresponding admin feature is found with the same countrycode and adminX codes and the respective feature code ADMx.
feature classes:
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
This database covers non-Arabic names, their Arabic equivalents, and Arabic script variants for each name (with the most important variant given first).