100+ datasets found

Race and ethnicity data for first, middle, and last names
search.dataone.org
Updated Nov 8, 2023
Cite
Rosenman, Evan; Olivella, Santiago; Imai, Kosuke (2023). Race and ethnicity data for first, middle, and last names [Dataset]. http://doi.org/10.7910/DVN/SGKW0K
Unique identifier
https://doi.org/10.7910/DVN/SGKW0K
Dataset updated
Nov 8, 2023
Dataset provided by
Harvard Dataverse
Authors
Rosenman, Evan; Olivella, Santiago; Imai, Kosuke
Description
We provide datasets that estimate the racial distributions associated with first, middle, and last names in the United States. The datasets cover five racial categories: White, Black, Hispanic, Asian, and Other. The provided data are computed from the voter files of six Southern states -- Alabama, Florida, Georgia, Louisiana, North Carolina, and South Carolina -- that collect race and ethnicity data upon registration. We include seven voter files per state, sourced between 2018 and 2021 from L2, Inc. Together, these states have approximately 36 million individuals who provide self-reported race and ethnicity. The last name dataset includes 338K surnames, while the middle name dictionary contains 126K middle names and the first name dataset includes 136K first names. For each type of name, we provide a dataset of P(race | name) probabilities and P(name | race) probabilities. We include only names that appear at least 25 times across the 42 (= 7 voter files * 6 states) voter files in our dataset. These data are closely related to the dataset "Name Dictionaries for "wru" R Package", https://doi.org/10.7910/DVN/7TRYAC. These are the probabilities used in the latest iteration of the "WRU" package (Khanna et al., 2022) to make probabilistic predictions about the race of individuals, given their names and geolocations.
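To make the two probability tables concrete, here is a minimal Python sketch of how P(race | name) and P(name | race) could be derived from raw (name, race) records with the 25-occurrence threshold. The function name, the toy records, and the choice of denominator for P(name | race) are assumptions for illustration; the description does not say how the authors handle these details.

```python
from collections import Counter, defaultdict

MIN_COUNT = 25  # inclusion threshold stated in the description


def name_race_tables(records, min_count=MIN_COUNT):
    """Build P(race | name) and P(name | race) from (name, race) pairs."""
    joint = Counter(records)  # occurrences of each (name, race) pair
    name_totals = Counter()
    race_totals = Counter()
    for (name, race), n in joint.items():
        name_totals[name] += n
        race_totals[race] += n

    p_race_given_name = defaultdict(dict)
    p_name_given_race = defaultdict(dict)
    for (name, race), n in joint.items():
        if name_totals[name] < min_count:
            continue  # drop names seen fewer than min_count times
        p_race_given_name[name][race] = n / name_totals[name]
        # Assumption: the denominator counts all records of that race,
        # including records whose names were dropped above.
        p_name_given_race[race][name] = n / race_totals[race]
    return dict(p_race_given_name), dict(p_name_given_race)


# Toy records (made up): NAME_X appears 40 times, NAME_Y only 3 times.
toy = (
    [("NAME_X", "White")] * 30
    + [("NAME_X", "Black")] * 10
    + [("NAME_Y", "Black")] * 3
)
p_race_given_name, p_name_given_race = name_race_tables(toy)
print(p_race_given_name["NAME_X"])
print(p_name_given_race["Black"])
```

In this toy input NAME_Y falls below the threshold, so it appears in neither table, while its records still count toward the per-race totals.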
Dataset: Ethnicity-Based Name Partitioning for Author Name Disambiguation...
figshare.com
zip
Updated May 30, 2023
Cite
Jinseok Kim; Jenna Kim; Jason Owen-Smith (2023). Dataset: Ethnicity-Based Name Partitioning for Author Name Disambiguation Using Supervised Machine Learning [Dataset]. http://doi.org/10.6084/m9.figshare.14043791.v1
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.14043791.v1
Dataset updated
May 30, 2023
Dataset provided by
figshare (http://figshare.com/)
Authors
Jinseok Kim; Jenna Kim; Jason Owen-Smith
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains data files for a research paper, "Ethnicity-Based Name Partitioning for Author Name Disambiguation Using Supervised Machine Learning," published in the Journal of the Association for Information Science and Technology.

Four zipped files are uploaded. Each zipped file contains five data files: signatures_train.txt, signatures_test.txt, records.txt, clusters_train.txt, and clusters_test.txt.

1. 'Signatures' files contain lists of name instances. Each name instance (a row) is associated with information as follows.
 - 1st column: instance id (numeric): unique id assigned to a name instance
 - 2nd column: paper id (numeric): unique id assigned to a paper in which the name instance appears as an author name
 - 3rd column: byline position (numeric): integer indicating the position of the name instance in the authorship byline of the paper
 - 4th column: author name (string): name string formatted as surname, comma, and forename(s)
 - 5th column: ethnic name group (string): name ethnicity assigned by Ethnea to the name instance
 - 6th column: affiliation (string): affiliation associated with the name instance, if available in the original data
 - 7th column: block (string): simplified name string of the name instance to indicate its block membership (surname and first forename initial)
 - 8th column: author id (string): unique author id (i.e., author label) assigned by the creators of the original data

2. 'Records' files contain lists of papers. Each paper is associated with information as follows.
 - 1st column: paper id (numeric): unique paper id; this is the unique paper id (2nd column) in Signatures files
 - 2nd column: year (numeric): year of publication
   * Some papers may have wrong publication years due to incorrect indexing or delayed updates in original data
 - 3rd column: venue (string): name of journal or conference in which the paper is published
   * Venue names can be in full string or in a shortened format according to the formats in original data
 - 4th column: authors (string; separated by vertical bar): list of author names that appear in the paper's byline
   * Author names are formatted into surname, comma, and forename(s)
 - 5th column: title words (string; separated by space): words in a title of the paper
   * Note that common words are stop-listed and each remaining word is stemmed using Porter's stemmer.

3. 'Clusters' files contain lists of clusters. Each cluster is associated with information as follows.
 - 1st column: cluster id (numeric): unique id of a cluster
 - 2nd column: list of name instance ids (Signatures - 1st column) that belong to the same unique author id (Signatures - 8th column)

Signatures and Clusters files consist of two subsets - train and test files - of original labeled data which are randomly split into 50%-50% by the authors of this study.

Original labeled data for AMiner.zip, KISTI.zip, and GESIS.zip came from the studies cited below. If you use one of the uploaded data files, please cite them accordingly.

[AMiner.zip]
Tang, J., Fong, A. C. M., Wang, B., & Zhang, J. (2012). A Unified Probabilistic Framework for Name Disambiguation in Digital Library. IEEE Transactions on Knowledge and Data Engineering, 24(6), 975-987. doi:10.1109/Tkde.2011.13
Wang, X., Tang, J., Cheng, H., & Yu, P. S. (2011). ADANA: Active Name Disambiguation. Paper presented at the 2011 IEEE 11th International Conference on Data Mining.

[KISTI.zip]
Kang, I. S., Kim, P., Lee, S., Jung, H., & You, B. J. (2011). Construction of a Large-Scale Test Set for Author Disambiguation. Information Processing & Management, 47(3), 452-465. doi:10.1016/j.ipm.2010.10.001
Note that the original KISTI data contain errors and duplicates. This study reuses the revised version of KISTI reported in the study below.
Kim, J. (2018). Evaluating author name disambiguation for digital libraries: A case of DBLP. Scientometrics, 116(3), 1867-1886. doi:10.1007/s11192-018-2824-5

[GESIS.zip]
Momeni, F., & Mayr, P. (2016). Evaluating Co-authorship Networks in Author Name Disambiguation for Common Names. Paper presented at the 20th International Conference on Theory and Practice of Digital Libraries (TPDL 2016), Hannover, Germany.
Note that this study reuses the 'Evaluation Set' from the original GESIS data, to which titles were added by the study below.
Kim, J., & Kim, J. (2020). Effect of forename string on author name disambiguation. Journal of the Association for Information Science and Technology, 71(7), 839-855. doi:10.1002/asi.24298

[UM-IRIS.zip]
This labeled dataset was created for this study. For a description of the labeling method, please see 'Method' in the paper below.
Kim, J., Kim, J., & Owen-Smith, J. (In print). Ethnicity-based name partitioning for author name disambiguation using supervised machine learning. Journal of the Association for Information Science and Technology. doi:10.1002/asi.24459
For details on the labeling method and limitations, see the paper below.
Kim, J., & Owen-Smith, J. (2021). ORCID-linked labeled data for evaluating author name disambiguation at scale. Scientometrics. doi:10.1007/s11192-020-03826-6
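The column layout of the Signatures files and their relation to the Clusters files can be sketched in Python as follows. This is only an illustration: the description does not state the field delimiter, so a tab is assumed here, and the class name, function names, and sample rows are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Signature:
    instance_id: int      # 1st column
    paper_id: int         # 2nd column
    byline_position: int  # 3rd column
    author_name: str      # 4th column: "surname, forename(s)"
    ethnic_group: str     # 5th column: label assigned by Ethnea
    affiliation: str      # 6th column: may be empty
    block: str            # 7th column: surname and first forename initial
    author_id: str        # 8th column: author label


def parse_signature(line, sep="\t"):
    """Parse one Signatures row; the tab separator is an assumption."""
    fields = line.rstrip("\n").split(sep)
    if len(fields) != 8:
        raise ValueError(f"expected 8 columns, got {len(fields)}")
    return Signature(
        int(fields[0]), int(fields[1]), int(fields[2]),
        fields[3], fields[4], fields[5], fields[6], fields[7],
    )


def clusters_from_signatures(signatures):
    """Group instance ids by author id, mirroring what a Clusters row lists."""
    clusters = {}
    for sig in signatures:
        clusters.setdefault(sig.author_id, []).append(sig.instance_id)
    return clusters


# Made-up rows in the assumed layout (not taken from the dataset).
sample_lines = [
    "1\t100\t2\tDoe, Jane\tGROUP_A\tExample University\tdoe j\tA1",
    "2\t101\t1\tDoe, John\tGROUP_A\t\tdoe j\tA2",
    "3\t102\t3\tDoe, Jane\tGROUP_A\tExample University\tdoe j\tA1",
]
signatures = [parse_signature(line) for line in sample_lines]
clusters = clusters_from_signatures(signatures)
print(clusters)
```

All three sample rows share the block "doe j", which is the situation disambiguation has to resolve: one block, two distinct author ids.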
Data from: Validated Names for Experimental Studies on Race and Ethnicity
search.dataone.org
dataverse.harvard.edu
Updated Nov 8, 2023
Cite
Crabtree, Charles; Kim, Jae Yeon (2023). Validated Names for Experimental Studies on Race and Ethnicity [Dataset]. http://doi.org/10.7910/DVN/LP4EAR
Unique identifier
https://doi.org/10.7910/DVN/LP4EAR
Dataset updated
Nov 8, 2023
Dataset provided by
Harvard Dataverse
Authors
Crabtree, Charles; Kim, Jae Yeon
Description
A large and fast-growing number of studies across the social sciences use experiments to better understand the role of race in human interactions, particularly in the American context. Researchers often use names to signal the race of individuals portrayed in these experiments. However, those names might also signal other attributes, such as socioeconomic status (e.g., education and income) and citizenship. If they do, researchers need pre-tested names with data on perceptions of these attributes. Such data would permit researchers to draw correct inferences about the causal effect of race in their experiments. In this paper, we provide the largest dataset of validated name perceptions based on three different surveys conducted in the United States. In total, our data include over 44,170 name evaluations from 4,026 respondents for 600 names. In addition to respondent perceptions of race, income, education, and citizenship from names, our data also include respondent characteristics. Our data will be broadly helpful for researchers conducting experiments on the manifold ways in which race shapes American life.
Popular Baby Names
catalog.data.gov
data.cityofnewyork.us
Updated Jul 12, 2025
Cite
data.cityofnewyork.us (2025). Popular Baby Names [Dataset]. https://catalog.data.gov/dataset/popular-baby-names
Dataset updated
Jul 12, 2025
Dataset provided by
data.cityofnewyork.us
Description
Popular Baby Names by Sex and Ethnic Group. Data were collected through civil birth registration. Each record represents the ranking of a baby name in order of frequency. The data can be used to represent the popularity of a name. Caution should be used when assessing the rank of a baby name if the frequency count is close to 10; the ranking may vary year to year.
Data for: Demographic aspects of first names
dataverse.harvard.edu
Updated Mar 12, 2018
Cite
Konstantinos Tzioumis (2018). Data for: Demographic aspects of first names [Dataset]. http://doi.org/10.7910/DVN/TYJKEZ
Unique identifier
https://doi.org/10.7910/DVN/TYJKEZ
Dataset updated
Mar 12, 2018
Dataset provided by
Harvard Dataverse
Authors
Konstantinos Tzioumis
License
CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
The list includes 4,250 first names and information on their respective count and proportions across six mutually exclusive racial and Hispanic origin groups. These six categories are consistent with the categories used in the Census Bureau's surname list.
MOST POPULAR NAMES NYC
nycopendata.socrata.com
data.cityofnewyork.us
Updated Jun 8, 2025
Cite
Department of Health and Mental Hygiene (DOHMH) (2025). MOST POPULAR NAMES NYC [Dataset]. https://nycopendata.socrata.com/widgets/7v44-25wq
Available download formats: csv, application/rdfxml, xml, application/rssxml, tsv, json
Dataset updated
Jun 8, 2025
Authors
Department of Health and Mental Hygiene (DOHMH)
Area covered
New York
Description
The most popular baby names by sex and mother's ethnicity in New York City.
Replication data for: What's in a name? A Method for Extracting Information...
dataverse.harvard.edu
Updated Dec 15, 2014
Cite
Andrew Harris (2014). Replication data for: What's in a name? A Method for Extracting Information about Ethnicity from Names [Dataset]. http://doi.org/10.7910/DVN/27691
Unique identifier
https://doi.org/10.7910/DVN/27691
Dataset updated
Dec 15, 2014
Dataset provided by
Harvard Dataverse
Authors
Andrew Harris
License
CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Time period covered
2014
Area covered
Kenya, US
Description
Contains replication materials for the article.
The Census-based surname lookup table for Ethnicity Estimator.
plos.figshare.com
xls
Updated May 30, 2023
Cite
Jens Kandt; Paul A. Longley (2023). The Census-based surname lookup table for Ethnicity Estimator. [Dataset]. http://doi.org/10.1371/journal.pone.0201774.t004
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0201774.t004
Dataset updated
May 30, 2023
Dataset provided by
PLOS ONE
Authors
Jens Kandt; Paul A. Longley
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Census-based surname lookup table for Ethnicity Estimator.
Replication Data for: Signaling Race, Ethnicity, and Gender with Names:...
search.dataone.org
Updated Nov 8, 2023
Cite
Hayes, Matthew; Elder, Elizabeth Mitchell (2023). Replication Data for: Signaling Race, Ethnicity, and Gender with Names: Challenges and Recommendations [Dataset]. http://doi.org/10.7910/DVN/0LCYN5
Unique identifier
https://doi.org/10.7910/DVN/0LCYN5
Dataset updated
Nov 8, 2023
Dataset provided by
Harvard Dataverse
Authors
Hayes, Matthew; Elder, Elizabeth Mitchell
Description
A growing body of research uses names to cue experimental subjects about race, ethnicity, and gender. However, researchers have not explored the myriad of characteristics that might be signaled by these names. In this paper, we introduce a large, publicly available database of the attributes associated with common American first and last names. For 1,000 first names and 21 last names, we provide ratings of perceived race; for 336 first names, we provide ratings on 26 social and personal characteristics. We show that the traits associated with first names vary widely, even among names associated with the same race and gender. Researchers using names to signal group memberships are thus likely cuing a number of other attributes as well. We demonstrate the importance of name selection by replicating DeSante (2013). We conclude by outlining two approaches researchers can use to choose names that successfully cue race (and gender) while minimizing potential confounds.
Demographics Data Package
johnsnowlabs.com
csv
Updated Jan 20, 2021
Cite
John Snow Labs (2021). Demographics Data Package [Dataset]. https://www.johnsnowlabs.com/marketplace/demographics-data-package/
Available download formats: csv
Dataset updated
Jan 20, 2021
Dataset authored and provided by
John Snow Labs
Description
This data package consists of 26 datasets, all containing statistical data on the population and particular groups within it for different countries, mostly the United States.
Most Popular Baby Names by Gender and Mother Ethnic Group
johnsnowlabs.com
csv
Updated Jan 20, 2021
Cite
John Snow Labs (2021). Most Popular Baby Names by Gender and Mother Ethnic Group [Dataset]. https://www.johnsnowlabs.com/marketplace/most-popular-baby-names-by-gender-and-mother-ethnic-group/
Available download formats: csv
Dataset updated
Jan 20, 2021
Dataset authored and provided by
John Snow Labs
Time period covered
2011 - 2019
Area covered
United States
Description
This dataset comes from BabyCenter, which released its top 100 baby names of 2016, showing which names proved to be the most popular that year. Baby names are often a controversial subject, since seemingly everyone has an opinion on which name sounds best and which are too unusual to use. But a select few names seem to be universally beloved.
‘NYC Most Popular Baby Names Over the Years’ analyzed by Analyst-2
analyst-2.ai
Updated Feb 13, 2022
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘NYC Most Popular Baby Names Over the Years’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-nyc-most-popular-baby-names-over-the-years-94c5/3fb35e8b/?iid=003-998&v=presentation
Dataset updated
Feb 13, 2022
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
New York
Description
Analysis of ‘NYC Most Popular Baby Names Over the Years’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/most-popular-baby-names-in-nyce on 13 February 2022.

--- Dataset description provided by original source is as follows ---

About this dataset

Popular Baby Name Data In NYC from 2011-2014

Rows: 13962; Columns: 6

The data include items, such as:

BRTH_YR: birth year of the baby

GNDR: gender

ETHCTY: mother's ethnicity

NM: baby's name

CNT: count of the name

RNK: ranking of the name

Source: NYC Open Data

https://data.cityofnewyork.us/Health/Most-Popular-Baby-Names-by-Sex-and-Mother-s-Ethnic/25th-nujf

This dataset was created by Data Society and contains around 10000 samples, with features such as Nm, Rnk, Gndr, and Ethcty, along with technical information.

How to use this dataset

Analyze Brth Yr in relation to Cnt

Study the influence of Nm on Rnk

Acknowledgements

If you use this dataset in your research, please credit Data Society

--- Original source retains full ownership of the source dataset ---
Data from: Using First Name Information to Improve Race and Ethnicity...
tandf.figshare.com
docx
Updated May 31, 2023
Cite
Ioan Voicu (2023). Using First Name Information to Improve Race and Ethnicity Classification [Dataset]. http://doi.org/10.6084/m9.figshare.5813859.v2
Available download formats: docx
Unique identifier
https://doi.org/10.6084/m9.figshare.5813859.v2
Dataset updated
May 31, 2023
Dataset provided by
Taylor & Francis
Authors
Ioan Voicu
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This article uses a recent first name list to develop an improvement to an existing Bayesian classifier, namely the Bayesian Improved Surname Geocoding (BISG) method, which combines surname and geography information to impute missing race/ethnicity. The new Bayesian Improved First Name Surname Geocoding (BIFSG) method is validated using a large sample of mortgage applicants who self-report their race/ethnicity. BIFSG outperforms BISG, in terms of accuracy and coverage, for all major racial/ethnic categories. Although the overall magnitude of improvement is somewhat small, the largest improvements occur for non-Hispanic Blacks, a group for which the BISG performance is weakest. When estimating the race/ethnicity effects on mortgage pricing and underwriting decisions with regression models, estimation biases from both BIFSG and BISG are very small, with BIFSG generally having smaller biases, and the maximum a posteriori classifier resulting in smaller biases than through use of estimated probabilities. Robustness checks using voter registration data confirm BIFSG's improved performance vis-a-vis BISG and illustrate BIFSG's applicability to areas other than mortgage lending. Finally, I demonstrate an application of the BIFSG to the imputation of missing race/ethnicity in the Home Mortgage Disclosure Act data, and in the process, offer novel evidence that the incidence of missing race/ethnicity information is correlated with race/ethnicity.
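The Bayesian update behind BIFSG can be sketched as follows. This is a minimal illustration that assumes first name and geography are independent of surname given race/ethnicity, so that the posterior is proportional to P(race | surname) x P(first name | race) x P(geography | race). The function names, group labels, and all numbers are made up for the example and are not taken from the article.

```python
def bifsg_posterior(p_race_given_surname, p_first_given_race, p_geo_given_race):
    """Combine the three inputs into P(race | surname, first name, geography)."""
    unnormalized = {
        race: p_race_given_surname[race]
        * p_first_given_race[race]
        * p_geo_given_race[race]
        for race in p_race_given_surname
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("all categories have zero probability")
    return {race: value / total for race, value in unnormalized.items()}


def map_class(posterior):
    """Maximum a posteriori classifier: pick the most probable category."""
    return max(posterior, key=posterior.get)


# Illustrative inputs for two hypothetical categories.
posterior = bifsg_posterior(
    p_race_given_surname={"group_a": 0.6, "group_b": 0.4},
    p_first_given_race={"group_a": 0.01, "group_b": 0.03},
    p_geo_given_race={"group_a": 0.002, "group_b": 0.002},
)
print(posterior)
print(map_class(posterior))
```

Dropping the first-name term reduces this to a surname-plus-geography update in the style of BISG. The abstract contrasts two uses of the result: classifying with the maximum a posteriori category, as `map_class` does, or carrying the estimated probabilities forward.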
Nyc popular baby names
kaggle.com
Updated Jun 20, 2022
Cite
Rahul Sarkar (2022). Nyc popular baby names [Dataset]. https://www.kaggle.com/datasets/rahulsarkar221/nyc-popular-baby-names
Dataset updated
Jun 20, 2022
Dataset provided by
Kaggle
Authors
Rahul Sarkar
License
https://creativecommons.org/publicdomain/zero/1.0/
Area covered
New York
Description
This dataset contains popular baby names in New York.

Dataset: 1 file (popular-baby-names.csv)

Columns:
 - Year of Birth: year of the baby's birth.
 - Gender: gender of the baby.
 - Ethnicity: ethnic group the baby belongs to.
 - Child's First Name: the first name of the child.
 - Count: how many babies were given that name.
 - Ranking: ranking of that name.
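As a rough illustration of how the Count and Ranking columns relate, the Python sketch below recomputes a rank from counts within each year, gender, and ethnicity group. The column names come from the description above; the sample rows are made up, and dense ranking (tied counts share a rank, with no gaps) is an assumption, since the description does not say how ties are handled.

```python
import csv
import io
from collections import defaultdict

# Made-up rows using the column names listed in the description.
SAMPLE_CSV = """Year of Birth,Gender,Ethnicity,Child's First Name,Count,Ranking
2011,FEMALE,GROUP_A,NAME_A,40,1
2011,FEMALE,GROUP_A,NAME_B,25,2
2011,FEMALE,GROUP_A,NAME_C,25,2
2011,MALE,GROUP_A,NAME_D,30,1
"""


def dense_ranks(rows):
    """Rank names by Count within each (year, gender, ethnicity) group."""
    groups = defaultdict(list)
    for row in rows:
        key = (row["Year of Birth"], row["Gender"], row["Ethnicity"])
        groups[key].append((row["Child's First Name"], int(row["Count"])))

    ranks = {}
    for key, items in groups.items():
        # Distinct counts, largest first; position in this list is the rank.
        distinct = sorted({count for _, count in items}, reverse=True)
        rank_of = {count: i + 1 for i, count in enumerate(distinct)}
        for name, count in items:
            ranks[key + (name,)] = rank_of[count]
    return ranks


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
ranks = dense_ranks(rows)
for row in rows:
    key = (row["Year of Birth"], row["Gender"], row["Ethnicity"],
           row["Child's First Name"])
    print(key, "published:", row["Ranking"], "recomputed:", ranks[key])
```

Comparing the recomputed rank against the published Ranking column is one quick consistency check to run on the real file.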
Data from: Signaling Race, Ethnicity, and Gender with Names: Challenges and...
search.dataone.org
Updated Nov 8, 2023
Cite
Elder, Elizabeth (2023). Signaling Race, Ethnicity, and Gender with Names: Challenges and Recommendations. [Dataset]. http://doi.org/10.7910/DVN/47CZDX
Unique identifier
https://doi.org/10.7910/DVN/47CZDX
Dataset updated
Nov 8, 2023
Dataset provided by
Harvard Dataverse
Authors
Elder, Elizabeth
Description
Data on perceived characteristics of first and last names. Forthcoming at the Journal of Politics; this Dataverse will be deleted when the official JOP replication archive is made available.
Ethnicity estimation using family naming practices
plos.figshare.com
ai
Updated Jun 1, 2023
Cite
Jens Kandt; Paul A. Longley (2023). Ethnicity estimation using family naming practices [Dataset]. http://doi.org/10.1371/journal.pone.0201774
Available download formats: ai
Unique identifier
https://doi.org/10.1371/journal.pone.0201774
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS ONE
Authors
Jens Kandt; Paul A. Longley
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This paper examines the association between given and family names and self-ascribed ethnicity as classified by the 2011 Census of Population for England and Wales. Using Census data in an innovative way under the new Office for National Statistics (ONS) Secure Research Service (SRS; previously the ONS Virtual Microdata Laboratory, VML), we investigate how bearers of a full range of given and family names assigned themselves to 2011 Census categories, using a names classification tool previously described in this journal. Based on these results, we develop a follow-up ethnicity estimation tool and describe how the tool may be used to observe changing relations between naming practices and ethnic identities as a facet of social integration and cosmopolitanism in an increasingly diverse society.
Opinion on Cleveland Indians changing their name in the U.S. 2021, by...
statista.com
Updated Jul 28, 2021
Cite
Statista (2021). Opinion on Cleveland Indians changing their name in the U.S. 2021, by ethnicity [Dataset]. https://www.statista.com/statistics/1253757/cleveland-indians-changing-name-ethnicity/
Dataset updated
Jul 28, 2021
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
Jul 23, 2021 - Jul 26, 2021
Area covered
United States
Description
Ahead of the 2022 Major League Baseball (MLB) season, the Cleveland Indians announced that the team would henceforth be known as the Cleveland Guardians. This move came as a result of increasing pressure from fans and sponsors to replace the original Native American-based name. During a July 2021 survey in the United States, 56 percent of Black respondents approved of the franchise's decision to change its name following the current season, while this figure fell to 33 percent among white respondents.
Race and Ethnicity 2018-2022 - STATES
mce-data-uscensus.hub.arcgis.com
covid19-uscensus.hub.arcgis.com
Updated Feb 5, 2024
Cite
US Census Bureau (2024). Race and Ethnicity 2018-2022 - STATES [Dataset]. https://mce-data-uscensus.hub.arcgis.com/maps/973245d9cd914f58a8fe87baacea1f4a
Dataset updated
Feb 5, 2024
Dataset provided by
United States Census Bureau (http://census.gov/)
Authors
US Census Bureau
Description
This layer shows Race and Ethnicity. This is shown by state and county boundaries. This service contains the 2018-2022 release of data from the American Community Survey (ACS) 5-year data, and contains estimates and margins of error. There are also additional calculated attributes related to this topic, which can be mapped or used within analysis. This layer is symbolized to show the percentage of population that are Hispanic or Latino (of any race). To see the full list of attributes available in this service, go to the "Data" tab, and choose "Fields" at the top right.

Current Vintage: 2018-2022
ACS Table(s): B02001, B03001, DP05
Data downloaded from: Census Bureau's API for American Community Survey
Date of API call: January 18, 2024
National Figures: data.census.gov

The United States Census Bureau's American Community Survey (ACS): About the Survey; Geography & ACS; Technical Documentation; News & Updates

This ready-to-use layer can be used within ArcGIS Pro, ArcGIS Online, its configurable apps, dashboards, Story Maps, custom apps, and mobile apps. Data can also be exported for offline workflows. Please cite the Census and ACS when using this data.

Data Note from the Census:
Data are based on a sample and are subject to sampling variability. The degree of uncertainty for an estimate arising from sampling variability is represented through the use of a margin of error. The value shown here is the 90 percent margin of error. The margin of error can be interpreted as providing a 90 percent probability that the interval defined by the estimate minus the margin of error and the estimate plus the margin of error (the lower and upper confidence bounds) contains the true value. In addition to sampling variability, the ACS estimates are subject to nonsampling error (for a discussion of nonsampling variability, see Accuracy of the Data). The effect of nonsampling error is not represented in these tables.

Data Processing Notes:
 - Boundaries come from the Cartographic Boundaries via US Census TIGER geodatabases. Boundaries are updated at the same time as the data updates, and the boundary vintage appropriately matches the data vintage as specified by the Census. These are Census boundaries with water and/or coastlines clipped for cartographic purposes. For state and county boundaries, the water and coastlines are derived from the coastlines of the 500k TIGER Cartographic Boundary Shapefiles. The original AWATER and ALAND fields are still available as attributes within the data table (units are square meters).
 - The States layer contains 52 records - all US states, Washington D.C., and Puerto Rico. The Counties (and equivalent) layer contains 3221 records - all counties and equivalent, Washington D.C., and Puerto Rico municipios. See Areas Published.
 - Percentages and derived counts, and associated margins of error, are calculated values (that can be identified by the "_calc_" stub in the field name), and abide by the specifications defined by the American Community Survey.
 - Field alias names were created based on the Table Shells.
 - Margin of error (MOE) values of -555555555 in the API (or "*****" (five asterisks) on data.census.gov) are displayed as 0 in this dataset. The estimates associated with these MOEs have been controlled to independent counts in the ACS weighting and have zero sampling error. So, the MOEs are effectively zeroes, and are treated as zeroes in MOE calculations.
 - Other negative values on the API, such as -222222222, -666666666, -888888888, and -999999999, all represent estimates or MOEs that can't be calculated or can't be published, usually due to small sample sizes. All of these are rendered in this dataset as null (blank) values.
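The sentinel handling and the confidence bounds described above can be sketched in Python as follows. The function and constant names are hypothetical, and the set of null sentinels contains only the four values named in the note, which presents them as examples rather than a complete list.

```python
ZERO_MOE_SENTINEL = -555555555  # estimate controlled to an independent count
# Only the values named in the note; the API may use other codes as well.
NULL_SENTINELS = {-222222222, -666666666, -888888888, -999999999}


def clean_value(value):
    """Map API sentinel codes to the values this dataset displays."""
    if value == ZERO_MOE_SENTINEL:
        return 0      # MOE is effectively zero
    if value in NULL_SENTINELS:
        return None   # cannot be calculated or published
    return value


def confidence_bounds(estimate, moe):
    """Return (lower, upper) 90 percent bounds, or None if either is null."""
    estimate = clean_value(estimate)
    moe = clean_value(moe)
    if estimate is None or moe is None:
        return None
    return (estimate - moe, estimate + moe)


print(confidence_bounds(1000, 50))
print(confidence_bounds(1000, -555555555))
print(confidence_bounds(1000, -888888888))
```

Returning None rather than a number for unpublished values keeps them from silently entering later sums or percentages.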
Names By Nationality
kaggle.com
Updated Oct 18, 2020
Cite
Hemendra Singh Rajawat (2020). Names By Nationality [Dataset]. https://www.kaggle.com/datasets/hemendrasr/name-by-nationality
Dataset updated
Oct 18, 2020
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Hemendra Singh Rajawat
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Dataset

This dataset was created by Hemendra Singh Rajawat

Released under CC0: Public Domain

Distribution of first name and last name frequencies by country
figshare.com
xlsx
Updated Feb 2, 2023
Cite
Mike Thelwall (2023). Distribution of first name and last name frequencies by country [Dataset]. http://doi.org/10.6084/m9.figshare.21956795.v2
Available download formats: xlsx
Unique identifier
https://doi.org/10.6084/m9.figshare.21956795.v2
Dataset updated
Feb 2, 2023
Dataset provided by
figshare
Authors
Mike Thelwall
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Distribution of first and last name frequencies of academic authors by country.

Spreadsheet 1 contains 50 countries, with names based on affiliations in Scopus journal articles 2001-2021.

Spreadsheet 2 contains 200 countries, with names based on affiliations in Scopus journal articles 2001-2021, using a marginally updated last name extraction algorithm that is almost the same except for Dutch/Flemish names.

From the paper: Can national researcher mobility be tracked by first or last name uniqueness?

For example, the distribution for the UK shows a single peak for international names, with no national names; Belgium has a national peak and an international peak; and China has mainly a national peak. The 50 countries are:

No  Code  Country
1   SB    Serbia
2   IE    Ireland
3   HU    Hungary
4   CL    Chile
5   CO    Colombia
6   NG    Nigeria
7   HK    Hong Kong
8   AR    Argentina
9   SG    Singapore
10  NZ    New Zealand
11  PK    Pakistan
12  TH    Thailand
13  UA    Ukraine
14  SA    Saudi Arabia
15  RO    Romania
16  ID    Indonesia
17  IL    Israel
18  MY    Malaysia
19  DK    Denmark
20  CZ    Czech Republic
21  ZA    South Africa
22  AT    Austria
23  FI    Finland
24  PT    Portugal
25  GR    Greece
26  NO    Norway
27  EG    Egypt
28  MX    Mexico
29  BE    Belgium
30  CH    Switzerland
31  SW    Sweden
32  PL    Poland
33  TW    Taiwan
34  NL    Netherlands
35  TK    Turkey
36  IR    Iran
37  RU    Russia
38  AU    Australia
39  BR    Brazil
40  KR    South Korea
41  ES    Spain
42  CA    Canada
43  IT    Italy
44  FR    France
45  IN    India
46  DE    Germany
47  US    USA
48  UK    UK
49  JP    Japan
50  CN    China

21 scholarly articles cite the dataset "Race and ethnicity data for first, middle, and last names" (Rosenman, Olivella, and Imai, 2023).
