100+ datasets found

American Names by Multi-Ethnic/National Origin
kaggle.com
zip
Updated Aug 22, 2023
Cite
Louis Teitelbaum (2023). American Names by Multi-Ethnic/National Origin [Dataset]. https://www.kaggle.com/datasets/louisteitelbaum/american-names-by-multi-ethnic-national-origin
Explore at:
Available download formats: zip (778154 bytes)
Dataset updated
Aug 22, 2023
Authors
Louis Teitelbaum
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Area covered
United States
Description
This dataset includes all personal names listed in the Wikipedia category “American people by ethnic or national origin” and all subcategories fitting the pattern “American People of [ ] descent”, in total more than 25,000 individuals. Each individual is represented by a row, with columns indicating binary membership (0/1) in each ethnic/national category.

Ethnicity inference is an essential tool for identifying disparities in public health and social sciences. Existing datasets linking personal names to ethnic or national origin often neglect to recognize multi-ethnic or multi-national identities. Furthermore, existing datasets use coarse classification schemes (e.g. classifying both Indian and Japanese people as “Asian”) that may not be suitable for many research questions. This dataset remedies these problems by including both very fine-grained ethnic/national categories (e.g. Afghan-Jewish) and broader ones (e.g. European). Users can choose the categories that are relevant to their research. Since many Americans on Wikipedia are associated with multiple overlapping or distinct ethnicities/nationalities, these multi-ethnic associations are also reflected in the data.
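A minimal sketch of picking out relevant categories, assuming the file is a CSV with a `name` column and one 0/1 column per category. The column names and rows below are invented for illustration and should be checked against the actual header.

```python
import csv
import io

# Invented sample in the layout the description implies:
# one row per person, one 0/1 column per ethnic/national category.
SAMPLE = """name,European,Italian,Cuban,Jewish
Person A,1,1,0,0
Person B,0,0,1,1
Person C,1,0,0,1
"""

def select_categories(rows, categories):
    """Keep only the chosen category columns and drop people in none of them."""
    selected = []
    for row in rows:
        flags = {c: int(row[c]) for c in categories}
        if any(flags.values()):
            selected.append({"name": row["name"], **flags})
    return selected

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
subset = select_categories(rows, ["Cuban", "Jewish"])

# People who belong to more than one of the chosen categories.
multi = [r["name"] for r in subset
         if sum(v for k, v in r.items() if k != "name") > 1]
```

Because memberships are independent 0/1 flags rather than a single label, a person can appear under several categories at once, which is what the last line picks out.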

Data were obtained from the Wikipedia API and reviewed manually to remove stage names, pen names, mononyms, first initials (when full names are available on Wikipedia), nicknames, honorific titles, and pages that correspond to a group or event rather than an individual.

This dataset was designed for use in training classification algorithms, but may also be independently interesting inasmuch as it is a representative sample of Americans who are famous enough to have their own Wikipedia page, along with detailed information on their ethnic/national origins.

DISCLAIMER: Due to the incomplete nature of Wikipedia, data may not properly reflect all ethnic/national associations for any given individual. For example, there is no guarantee that a given Cuban Jewish person will be listed in both the “American People of Cuban descent” and the “American People of Jewish descent” categories.
Popular Baby Names
catalog.data.gov
data.cityofnewyork.us
+5more
Updated Jul 12, 2025
Cite
data.cityofnewyork.us (2025). Popular Baby Names [Dataset]. https://catalog.data.gov/dataset/popular-baby-names
Explore at:
Dataset updated
Jul 12, 2025
Dataset provided by
data.cityofnewyork.us
Description
Popular Baby Names by Sex and Ethnic Group. Data were collected through civil birth registration. Each record represents the ranking of a baby name in the order of frequency. Data can be used to represent the popularity of a name. Caution should be used when assessing the rank of a baby name if the frequency count is close to 10; the ranking may vary year to year.
Baby Names from Social Security Card Applications - National Data
catalog.data.gov
Updated Jul 4, 2025
Cite
Social Security Administration (2025). Baby Names from Social Security Card Applications - National Data [Dataset]. https://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-national-data
Explore at:
Dataset updated
Jul 4, 2025
Dataset provided by
Social Security Administration http://ssa.gov/
Description
The data (name, year of birth, sex, and number) are from a 100 percent sample of Social Security card applications for 1880 on.
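As a rough illustration of working with the four fields listed above (name, year of birth, sex, number), the sketch below totals counts per name. The records and their values are invented, and the tuple layout is an assumption rather than the published file format.

```python
from collections import Counter

# Invented records: (name, year of birth, sex, number).
RECORDS = [
    ("Mary", 1880, "F", 100),
    ("John", 1880, "M", 120),
    ("Mary", 1881, "F", 90),
]

def total_by_name(records, sex=None):
    """Sum the 'number' field per name, optionally restricted to one sex."""
    totals = Counter()
    for name, _year, rec_sex, number in records:
        if sex is None or rec_sex == sex:
            totals[name] += number
    return totals

totals = total_by_name(RECORDS)
male_totals = total_by_name(RECORDS, sex="M")
```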
Namesakes
figshare.com
json
Updated Nov 20, 2021
Cite
Oleg Vasilyev; Aysu Altun; Nidhi Vyas; Vedant Dharnidharka; Erika Lampert; John Bohannon (2021). Namesakes [Dataset]. http://doi.org/10.6084/m9.figshare.17009105.v1
Explore at:
Available download formats: json
Unique identifier
https://doi.org/10.6084/m9.figshare.17009105.v1
Dataset updated
Nov 20, 2021
Dataset provided by
Figshare http://figshare.com/
Authors
Oleg Vasilyev; Aysu Altun; Nidhi Vyas; Vedant Dharnidharka; Erika Lampert; John Bohannon
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Abstract

Motivation: creating a challenging dataset for testing Named-Entity Linking. The Namesakes dataset consists of three closely related datasets: Entities, News and Backlinks. Entities were collected as Wikipedia text chunks corresponding to highly ambiguous entity names. The News were collected as random news text chunks, containing mentions that either belong to the Entities dataset or can be easily confused with them. Backlinks were obtained from Wikipedia dump data with the intention of having mentions linked to the entities of the Entity dataset. The Entities and News are human-labeled, resolving the mentions of the entities.

Methods

Entities were collected as Wikipedia text chunks corresponding to highly ambiguous entity names: the most popular people names, the most popular locations, and organizations with name ambiguity. In each Entities text chunk, the named entities with a name similar to the chunk's Wikipedia page name are labeled. For labeling, these entities were suggested to human annotators (odetta.ai) to tag as "Same" (same as the page entity) or "Other". The labeling was done by 6 experienced annotators who had passed a preliminary trial task. The only accepted tags are those assigned in agreement by at least 5 annotators and then passed through reconciliation with an experienced reconciliator.

The News were collected as random news text chunks, containing mentions which either belong to the Entities dataset or can be easily confused with them. In each News text chunk one mention was selected for labeling, and 3-10 Wikipedia pages from Entities were suggested as the labels for an annotator to choose from. The labeling was done by 3 experienced annotators (odetta.ai), after the annotators passed a preliminary trial task. The results were reconciled by an experienced reconciliator. All the labeling was done using Lighttag (lighttag.io).

Backlinks were obtained from Wikipedia dump data (dumps.wikimedia.org/enwiki/20210701) with intention to have mentions linked to the entities of the Entity dataset. The backlinks were filtered to leave only mentions in a good quality text; each text was cut 1000 characters after the last mention.

Usage Notes

Entities:

File: Namesakes_entities.jsonl

The Entities dataset consists of 4148 Wikipedia text chunks containing human-tagged mentions of entities. Each mention is tagged either as "Same" (meaning that the mention is of this Wikipedia page entity), or "Other" (meaning that the mention is of some other entity, just having the same or similar name). The Entities dataset is a jsonl list, each item is a dictionary with the following keys and values:

Key ‘pagename’: page name of the Wikipedia page.
Key ‘pageid’: page id of the Wikipedia page.
Key ‘title’: title of the Wikipedia page.
Key ‘url’: URL of the Wikipedia page.
Key ‘text’: The text chunk from the Wikipedia page.
Key ‘entities’: list of the mentions in the page text, each entity is represented by a dictionary with the keys:
    Key 'text': the mention as a string from the page text.
    Key ‘start’: start character position of the entity in the text.
    Key ‘end’: end (one-past-last) character position of the entity in the text.
    Key ‘tag’: annotation tag given as a string - either ‘Same’ or ‘Other’.
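The key layout above can be exercised with a short sketch. The record below is invented (it is not taken from Namesakes_entities.jsonl); only the key names and the one-past-last convention for ‘end’ come from the description.

```python
import json

# Invented single JSONL line using the documented keys.
LINE = json.dumps({
    "pagename": "John_Muir",
    "pageid": 1,
    "title": "John Muir",
    "url": "https://en.wikipedia.org/wiki/John_Muir",
    "text": "John Muir was a naturalist. Another Muir was a poet.",
    "entities": [
        {"text": "John Muir", "start": 0, "end": 9, "tag": "Same"},
        {"text": "Muir", "start": 36, "end": 40, "tag": "Other"},
    ],
})

def mentions_by_tag(record):
    """Group mention strings by tag, checking offsets against the text."""
    grouped = {"Same": [], "Other": []}
    for ent in record["entities"]:
        # 'end' is one-past-last, so a plain Python slice recovers the mention.
        span = record["text"][ent["start"]:ent["end"]]
        if span != ent["text"]:
            raise ValueError(f"offset mismatch for {ent['text']!r}")
        grouped[ent["tag"]].append(span)
    return grouped

grouped = mentions_by_tag(json.loads(LINE))
```

With the real file, the same function would be applied to `json.loads(line)` for each line read from the jsonl.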

News:

File: Namesakes_news.jsonl

The News dataset consists of 1000 news text chunks, each one with a single annotated entity mention. The annotation either points to the corresponding entity from the Entities dataset (if the mention is of that entity), or indicates that the mentioned entity does not belong to the Entities dataset. The News dataset is a jsonl list, each item is a dictionary with the following keys and values:

Key ‘id_text’: Id of the sample.
Key ‘text’: The text chunk.
Key ‘urls’: List of URLs of wikipedia entities suggested to labelers for identification of the entity mentioned in the text.
Key ‘entity’: a dictionary describing the annotated entity mention in the text:
    Key 'text': the mention as a string found by an NER model in the text.
    Key ‘start’: start character position of the mention in the text.
    Key ‘end’: end (one-past-last) character position of the mention in the text.
    Key 'tag': This key exists only if the mentioned entity is annotated as belonging to the Entities dataset - if so, the value is a dictionary identifying the Wikipedia page assigned by annotators to the mentioned entity:
        Key ‘pageid’: Wikipedia page id.
        Key ‘pagetitle’: page title.
        Key 'url': page URL.
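Since the 'tag' key is present only when the mention was linked to an entity from the Entities dataset, code reading the News file needs to treat it as optional. A minimal sketch with two invented items (the URLs and page ids are placeholders):

```python
import json

def resolve_news_item(item):
    """Return (mention, pageid); pageid is None when no 'tag' key is present."""
    entity = item["entity"]
    mention = item["text"][entity["start"]:entity["end"]]
    tag = entity.get("tag")  # absent if the entity is outside the Entities dataset
    return mention, (tag["pageid"] if tag else None)

# Invented JSONL lines following the documented keys.
LINES = [
    json.dumps({
        "id_text": "n1",
        "text": "Muir spoke today.",
        "urls": ["https://example.org/Muir_A"],
        "entity": {
            "text": "Muir", "start": 0, "end": 4,
            "tag": {"pageid": 1, "pagetitle": "Muir A",
                    "url": "https://example.org/Muir_A"},
        },
    }),
    json.dumps({
        "id_text": "n2",
        "text": "A different Muir won.",
        "urls": ["https://example.org/Muir_A"],
        "entity": {"text": "Muir", "start": 12, "end": 16},
    }),
]

resolved = [resolve_news_item(json.loads(line)) for line in LINES]
```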

Backlinks dataset: The Backlinks dataset consists of two parts: dictionary Entity-to-Backlinks and Backlinks documents. The dictionary points to backlinks for each entity of the Entity dataset (if any backlinks exist for the entity). The Backlinks documents are the backlinks Wikipedia text chunks with identified mentions of the entities from the Entities dataset.

Each mention is identified by surrounding double square brackets, e.g. "Muir built a small cabin along [[Yosemite Creek]].". However, if the mention differs from the exact entity name, the double square brackets wrap both the exact name and, separated by '|', the mention string to the right, for example: "Muir also spent time with photographer [[Carleton E. Watkins | Carleton Watkins]] and studied his photographs of Yosemite.".
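The bracket convention can be parsed with a regular expression. The sketch below is one possible approach, not the authors' tooling; it assumes entity names and mention strings never contain '[', ']' or (for entity names) '|'.

```python
import re

# Matches [[Entity name]] or [[Entity name | mention string]].
MENTION_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")

def extract_mentions(text):
    """Return (entity_name, mention_string) pairs found in a Backlinks text."""
    pairs = []
    for match in MENTION_RE.finditer(text):
        entity = match.group(1).strip()
        # Without a '|' part, the mention is the entity name itself.
        mention = (match.group(2) or match.group(1)).strip()
        pairs.append((entity, mention))
    return pairs

plain = extract_mentions("Muir built a small cabin along [[Yosemite Creek]].")
piped = extract_mentions(
    "Muir also spent time with photographer "
    "[[Carleton E. Watkins | Carleton Watkins]] and studied his photographs of Yosemite."
)
```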

The Entity-to-Backlinks is a jsonl with 1527 items. File: Namesakes_backlinks_entities.jsonl Each item is a tuple: Entity name. Entity Wikipedia page id. Backlinks ids: a list of pageids of backlink documents.

The Backlinks documents is a jsonl with 26903 items. File: Namesakes_backlinks_texts.jsonl Each item is a dictionary: Key ‘pageid’: Id of the Wikipedia page. Key ‘title’: Title of the Wikipedia page. Key 'content': Text chunk from the Wikipedia page, with all mentions in the double brackets; the text is cut 1000 characters after the last mention, the cut is denoted as '...[CUT]'. Key 'mentions': List of the mentions from the text, for convenience. Each mention is a tuple: Entity name. Entity Wikipedia page id. Sorted list of all character indexes at which the mention occurrences start in the text.
USA Names
console.cloud.google.com
Updated Jul 15, 2023
Cite
https://console.cloud.google.com/marketplace/browse?filter=partner:U.S.%20Social%20Security%20Administration&hl=de (2023). USA Names [Dataset]. https://console.cloud.google.com/marketplace/product/social-security-administration/us-names?hl=de
Explore at:
Dataset updated
Jul 15, 2023
Dataset provided by
Google http://google.com/
Area covered
United States
Description
This public dataset was created by the Social Security Administration and contains all names from Social Security card applications for births that occurred in the United States after 1879. Note that many people born before 1937 never applied for a Social Security card, so their names are not included in this data. For others who did apply, records may not show the place of birth, and again their names are not included in the data. All data are from a 100% sample of records on Social Security card applications as of the end of February 2015. To safeguard privacy, the Social Security Administration restricts names to those with at least 5 occurrences. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset.
Baby Names DataSet
kaggle.com
zip
Updated Mar 21, 2019
Cite
Samrat Rai (2019). Baby Names DataSet [Dataset]. https://www.kaggle.com/samrat77/baby-names-dataset
Explore at:
Available download formats: zip (87271 bytes)
Dataset updated
Mar 21, 2019
Authors
Samrat Rai
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

There's a story behind every dataset and here's your opportunity to share yours.

Content

What's inside is more than just rows and columns. Make it easy for others to get started by describing how you acquired the data and what time period it represents, too.

Acknowledgements

We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research.

Inspiration

Your data will be in front of the world's largest data science community. What questions do you want to see answered?
Indian Names Dataset
kaggle.com
zip
Updated Jul 19, 2023
Cite
Harsh Malhotra (2023). Indian Names Dataset [Dataset]. https://www.kaggle.com/datasets/harshm27/indian-names-dataset
Explore at:
Available download formats: zip (1128 bytes)
Dataset updated
Jul 19, 2023
Authors
Harsh Malhotra
Description
A dataset full of first names can be incredibly helpful in various applications and analyses. Firstly, it can be used in demographic studies to analyze naming trends and patterns over time, providing insights into cultural and societal changes. Additionally, such a dataset can be utilized in market research and targeted advertising, allowing businesses to personalize their marketing strategies based on customers' names. It can also be employed in language processing tasks, such as named entity recognition, sentiment analysis, or gender prediction. Moreover, the dataset can serve as a valuable resource for generating test data, creating fictional characters, or enhancing natural language generation models. Overall, a comprehensive first name dataset has diverse applications across multiple domains.

This dataset contains common Indian names. The data can be used for NLP problems such as name generation. An upvote would be appreciated :)
Labelled FHYA Dataset
zivahub.uct.ac.za
txt
Updated Feb 2, 2022
Cite
Jarryd Dunn (2022). Labelled FHYA Dataset [Dataset]. http://doi.org/10.25375/uct.19029692.v1
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.25375/uct.19029692.v1
Dataset updated
Feb 2, 2022
Dataset provided by
University of Cape Town
Authors
Jarryd Dunn
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This collection contains the datasets created as part of a master's thesis. The collection consists of two datasets in two forms as well as the corresponding entity descriptions for each of the datasets.

The experiment_doc_labels_clean documents contain the data used for the experiments. The JSON file consists of a list of JSON objects. The JSON objects contain the following fields:

id: Document id
ner_tags: List of IOB tags indicating mention boundaries based on the majority label assigned using crowdsourcing.
el_tags: List of entity ids based on the majority label assigned using crowdsourcing.
all_ner_tags: List of lists of IOB tags assigned by each of the users.
all_el_tags: List of lists of entity IDs assigned by each of the users annotating the data.
tokens: List of tokens from the text.

The experiment_doc_labels_clean-U.tsv contains the dataset used for the experiments but in a format similar to the CoNLL-U format. The first line for each document contains the document ID. The documents are separated by a blank line. Each word in a document is on its own line consisting of the word, the IOB tag and the entity id, separated by tabs.

While the experiments were being completed the annotation system was left open until all the documents had been annotated by three users. This resulted in the all_docs_complete_labels_clean.json and all_docs_complete_labels_clean-U.tsv datasets. The all_docs_complete_labels_clean.json and all_docs_complete_labels_clean-U.tsv documents take the same form as the experiment_doc_labels_clean.json and experiment_doc_labels_clean-U.tsv.

Each of the documents described above contains an entity id. The IDs match the entities stored in the entity_descriptions CSV files. Each row in these files corresponds to a mention for an entity and takes the form:

{ID}${Mention}${Context}[N]

Three sets of entity descriptions are available:

1. entity_descriptions_experiments.csv: This file contains all the mentions from the subset of the data used for the experiments as described above. However, the data has not been cleaned so there are multiple entity IDs which actually refer to the same entity.
2. entity_descriptions_experiments_clean.csv: These entities also cover the data used for the experiments, however, duplicate entities have been merged. These entities correspond to the labels for the documents in the experiment_doc_labels_clean files.
3. entity_descriptions_all.csv: The entities in this file correspond to the data in the all_docs_complete_labels_clean. Please note that the entities have not been cleaned so there may be duplicate or incorrect entities.
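To illustrate how a majority label such as ner_tags could be derived from all_ner_tags, here is a minimal sketch. It is an assumption about the general technique, not the thesis code; in particular, how ties were broken in the actual dataset is not stated, and this sketch simply keeps the label seen first.

```python
from collections import Counter

def majority_tags(all_tags):
    """Per-token majority label across annotators.

    all_tags is a list of tag lists, one per annotator, all the same length.
    Ties fall to the label that appears first among the annotators.
    """
    result = []
    for token_labels in zip(*all_tags):
        label, _count = Counter(token_labels).most_common(1)[0]
        result.append(label)
    return result

# Invented annotations from three users for a three-token document.
all_ner_tags = [
    ["B", "I", "O"],
    ["B", "O", "O"],
    ["O", "I", "O"],
]
ner_tags = majority_tags(all_ner_tags)
```

The same function works unchanged for all_el_tags, since it only compares labels for equality.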
USA Name Data
kaggle.com
zip
Updated Feb 12, 2019
Cite
Data.gov (2019). USA Name Data [Dataset]. https://www.kaggle.com/datagov/usa-names
Explore at:
Available download formats: zip (0 bytes)
Dataset updated
Feb 12, 2019
Dataset provided by
Data.gov https://data.gov/
License
https://creativecommons.org/publicdomain/zero/1.0/
Area covered
United States
Description
Context

Cultural diversity in the U.S. has led to great variations in names and naming traditions and names have been used to express creativity, personality, cultural identity, and values. Source: https://en.wikipedia.org/wiki/Naming_in_the_United_States

Content

This public dataset was created by the Social Security Administration and contains all names from Social Security card applications for births that occurred in the United States after 1879. Note that many people born before 1937 never applied for a Social Security card, so their names are not included in this data. For others who did apply, records may not show the place of birth, and again their names are not included in the data.

All data are from a 100% sample of records on Social Security card applications as of the end of February 2015. To safeguard privacy, the Social Security Administration restricts names to those with at least 5 occurrences.

Fork this kernel to get started with this dataset.

Acknowledgements

https://bigquery.cloud.google.com/dataset/bigquery-public-data:usa_names

https://cloud.google.com/bigquery/public-data/usa-names

Dataset Source: Data.gov. This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source — http://www.data.gov/privacy-policy#data_policy — and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

Banner Photo by @dcp from Unsplash.

Inspiration

What are the most common names?

What are the most common female names?

Are there more female or male names?

Female names by a wide margin?
Popular Baby Names - Dataset - data.sa.gov.au
data.sa.gov.au
Updated Mar 1, 2025
Cite
(2025). Popular Baby Names - Dataset - data.sa.gov.au [Dataset]. https://data.sa.gov.au/data/dataset/popular-baby-names
Explore at:
Dataset updated
Mar 1, 2025
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
South Australia
Description
List of male and female baby names in South Australia from 1944 to 2024. The annual data for baby names is published January/February each year.
Top 100 baby names in England and Wales: historical data
ons.gov.uk
cy.ons.gov.uk
xlsx
Updated Jul 31, 2025
Cite
Office for National Statistics (2025). Top 100 baby names in England and Wales: historical data [Dataset]. https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/livebirths/datasets/babynamesenglandandwalestop100babynameshistoricaldata
Explore at:
Available download formats: xlsx
Dataset updated
Jul 31, 2025
Dataset provided by
Office for National Statistics http://www.ons.gov.uk/
License
Open Government Licence 3.0 http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
Description
Historic lists of top 100 names for baby boys and girls for 1904 to 2024 at 10-yearly intervals.
Distribution of first name and last name frequencies by country
figshare.com
xlsx
Updated Feb 2, 2023
Cite
Mike Thelwall (2023). Distribution of first name and last name frequencies by country [Dataset]. http://doi.org/10.6084/m9.figshare.21956795.v2
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.6084/m9.figshare.21956795.v2
Dataset updated
Feb 2, 2023
Dataset provided by
Figshare http://figshare.com/
Authors
Mike Thelwall
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Distribution of first and last name frequencies of academic authors by country.

Spreadsheet 1 contains 50 countries, with names based on affiliations in Scopus journal articles 2001-2021.

Spreadsheet 2 contains 200 countries, with names based on affiliations in Scopus journal articles 2001-2021, using a marginally updated last name extraction algorithm that is almost the same except for Dutch/Flemish names.

From the paper: Can national researcher mobility be tracked by first or last name uniqueness?

For example the distribution for the UK shows a single peak for international names, with no national names, Belgium has a national peak and an international peak, and China has mainly a national peak. The 50 countries are:

No  Code  Country
1   SB    Serbia
2   IE    Ireland
3   HU    Hungary
4   CL    Chile
5   CO    Colombia
6   NG    Nigeria
7   HK    Hong Kong
8   AR    Argentina
9   SG    Singapore
10  NZ    New Zealand
11  PK    Pakistan
12  TH    Thailand
13  UA    Ukraine
14  SA    Saudi Arabia
15  RO    Romania
16  ID    Indonesia
17  IL    Israel
18  MY    Malaysia
19  DK    Denmark
20  CZ    Czech Republic
21  ZA    South Africa
22  AT    Austria
23  FI    Finland
24  PT    Portugal
25  GR    Greece
26  NO    Norway
27  EG    Egypt
28  MX    Mexico
29  BE    Belgium
30  CH    Switzerland
31  SW    Sweden
32  PL    Poland
33  TW    Taiwan
34  NL    Netherlands
35  TK    Turkey
36  IR    Iran
37  RU    Russia
38  AU    Australia
39  BR    Brazil
40  KR    South Korea
41  ES    Spain
42  CA    Canada
43  IT    Italy
44  FR    France
45  IN    India
46  DE    Germany
47  US    USA
48  UK    UK
49  JP    Japan
50  CN    China
Popular 5000 Last Names in the US
johnsnowlabs.com
csv
Updated Jan 20, 2021
Cite
John Snow Labs (2021). Popular 5000 Last Names in the US [Dataset]. https://www.johnsnowlabs.com/marketplace/popular-5000-last-names-in-the-us/
Explore at:
Available download formats: csv
Dataset updated
Jan 20, 2021
Dataset authored and provided by
John Snow Labs
Area covered
United States
Description
This dataset represents the 5000 popular last names in the United States. The data is split by race to show the percentages against each option.
Top 100 Baby Names
data.qld.gov.au
researchdata.edu.au
+1more
csv
Updated Feb 13, 2025
Cite
Justice (2025). Top 100 Baby Names [Dataset]. https://www.data.qld.gov.au/dataset/top-100-baby-names
Explore at:
Available download formats: csv (2 KiB), csv, csv (200 KiB)
Dataset updated
Feb 13, 2025
Dataset authored and provided by
Justice
License
Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Description
Queensland Top 100 Baby Names
Baby names for girls in England and Wales
ons.gov.uk
cy.ons.gov.uk
xlsx
Updated Jul 31, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Office for National Statistics (2025). Baby names for girls in England and Wales [Dataset]. https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/livebirths/datasets/babynamesenglandandwalesbabynamesstatisticsgirls
Explore at:
Available download formats: xlsx
Dataset updated
Jul 31, 2025
Dataset provided by
Office for National Statistics http://www.ons.gov.uk/
License
Open Government Licence 3.0 http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
Description
Rank and count of the top names for baby girls, changes in rank since the previous year and breakdown by country, region, mother's age and month of birth.
Facebook Profiles Datasets
brightdata.com
.json, .csv, .xlsx
Updated Jun 19, 2024
Cite
Bright Data (2024). Facebook Profiles Datasets [Dataset]. https://brightdata.com/products/datasets/facebook/profiles
Explore at:
Available download formats: .json, .csv, .xlsx
Dataset updated
Jun 19, 2024
Dataset authored and provided by
Bright Data https://brightdata.com/
License
https://brightdata.com/license
Area covered
Worldwide
Description
Use our Facebook Profiles dataset to explore public profile details such as names, profile and cover photos, work history, education, and photo galleries. Common use cases include people and company research, influencer discovery, and academic studies of career and education signals on Facebook.

Over 31M records available
Price starts at $250/100K records
Data formats are available in JSON, NDJSON, CSV, XLSX and Parquet
100% ethical and compliant data collection

Included datapoints:

Profile URL
Profile Name
Facebook Profile ID
Profile Photo
Cover Photo
Work History (Title, Company, Company ID, Company URL, Start/End Dates)
College Education (Name, ID, URL)
High School Education (Name, ID, URL)
Photo Galleries
And much more
Popular White Last Names in the US
johnsnowlabs.com
csv
Updated Jan 20, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
John Snow Labs (2021). Popular White Last Names in the US [Dataset]. https://www.johnsnowlabs.com/marketplace/popular-white-last-names-in-the-us/
Explore at:
Available download formats: csv
Dataset updated
Jan 20, 2021
Dataset authored and provided by
John Snow Labs
Area covered
United States
Description
This dataset represents the popular last names in the United States for White.
Baby names for boys in England and Wales
ons.gov.uk
cy.ons.gov.uk
xlsx
Updated Jul 31, 2025
Cite
Office for National Statistics (2025). Baby names for boys in England and Wales [Dataset]. https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/livebirths/datasets/babynamesenglandandwalesbabynamesstatisticsboys
Explore at:
Available download formats: xlsx
Dataset updated
Jul 31, 2025
Dataset provided by
Office for National Statistics http://www.ons.gov.uk/
License
Open Government Licence 3.0 http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
Description
Rank and count of the top names for baby boys, changes in rank since the previous year and breakdown by country, region, mother's age and month of birth.
glenglat: Global englacial temperature database
data-staging.niaid.nih.gov
data.niaid.nih.gov
Updated Mar 13, 2025
Cite
Welty, Ethan; Jacquemart, Mylène; Carcanade, Guillem; Gastaldello, Marcus; Van Tricht, Lander; Flowers, Gwenn E.; Sugiyama, Shin; Gurung, Tika Ram; Prinz, Rainer; Abermann, Jakob; Steiner, Jakob Friedrich; Barandun, Martina; Gagliardini, Olivier; Vincent, Christian; Thompson, Lonnie G.; Zhang, Tong; Hoelzle, Martin (2025). glenglat: Global englacial temperature database [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_11516611
Explore at:
Dataset updated
Mar 13, 2025
Dataset provided by
State Key Laboratory of Earth Surface Processes and Resource Ecology (ESPRE), Beijing Normal University (BNU), China
Institute of Low Temperature Science (ILTS), Hokkaido University, Japan
Department of Atmospheric and Cryospheric Sciences (ACINN), University of Innsbruck (UIBK), Austria
Byrd Polar and Climate Research Center
Department of Earth Sciences, Simon Fraser University (SFU), Canada
Department of Geosciences, University of Fribourg (UniFr), Switzerland
World Glacier Monitoring Service
Laboratory of Hydraulics, Hydrology and Glaciology (VAW), ETH Zurich, Switzerland
Institut des Géosciences de l'Environnement
Department of Water and Climate, Vrije Universiteit Brussel (VUB), Belgium
Department of Geography and Regional Science, University of Graz, Austria
International Centre for Integrated Mountain Development
Authors
Welty, Ethan; Jacquemart, Mylène; Carcanade, Guillem; Gastaldello, Marcus; Van Tricht, Lander; Flowers, Gwenn E.; Sugiyama, Shin; Gurung, Tika Ram; Prinz, Rainer; Abermann, Jakob; Steiner, Jakob Friedrich; Barandun, Martina; Gagliardini, Olivier; Vincent, Christian; Thompson, Lonnie G.; Zhang, Tong; Hoelzle, Martin
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Open-access database of englacial temperature measurements compiled from data submissions and published literature. It is developed on GitHub and published to Zenodo. The dataset is described in the following publication:

Mylène Jacquemart, Ethan Welty, Marcus Gastaldello, and Guillem Carcanade (2025). glenglat: A database of global englacial temperatures. Earth System Science Data Discussions. https://doi.org/10.5194/essd-2024-249

Dataset structure

The dataset adheres to the Frictionless Data Tabular Data Package specification. The metadata in datapackage.json describes, in detail, the contents of the tabular data files in the data folder:

source.csv: Description of each data source (either a personal communication or the reference to a published study).

borehole.csv: Description of each borehole (location, elevation, etc), linked to source.csv via source_id and less formally via source identifiers in notes.

profile.csv: Description of each profile (date, etc), linked to borehole.csv via borehole_id and to source.csv via source_id and less formally via source identifiers in notes.

measurement.csv: Description of each measurement (depth and temperature), linked to profile.csv via borehole_id and profile_id.

For boreholes with many profiles (e.g. from automated loggers), pairs of profile.csv and measurement.csv are stored separately in subfolders of data named {source.id}-{glacier}, where glacier is a simplified and kebab-cased version of the glacier name (e.g. flowers2022-little-kluane).
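The links between the tables can be followed with a simple keyed join. The sketch below uses invented rows and assumes that profile.csv identifies a profile by the pair (borehole_id, id); only borehole_id, profile_id, depth and temperature are named in the description, so the `id` and `date` column names are assumptions to check against datapackage.json.

```python
import csv
import io

# Invented rows for illustration only.
PROFILE_CSV = """borehole_id,id,date
1,1,2020-07-01
1,2,2021-07-01
"""

MEASUREMENT_CSV = """borehole_id,profile_id,depth,temperature
1,1,0.5,-2.1
1,1,10.0,-4.3
1,2,0.5,-1.8
"""

def read_rows(text):
    """Parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def measurements_by_profile(profiles, measurements):
    """Attach (depth, temperature) pairs to the profile they belong to."""
    index = {(p["borehole_id"], p["id"]): {**p, "measurements": []}
             for p in profiles}
    for m in measurements:
        key = (m["borehole_id"], m["profile_id"])
        index[key]["measurements"].append(
            (float(m["depth"]), float(m["temperature"]))
        )
    return index

joined = measurements_by_profile(read_rows(PROFILE_CSV),
                                 read_rows(MEASUREMENT_CSV))
```

For sources stored in subfolders, the same join would be applied to each profile.csv and measurement.csv pair separately.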

Supporting information

The folder sources, available on GitHub but omitted from dataset releases on Zenodo, contains subfolders (with names matching column source.id) with files that document how and from where the data was extracted.

Tables

Jump to: source · borehole · profile · measurement

source

Sources of information considered in the compilation of this database. Column names and categorical values closely follow the Citation Style Language (CSL) 1.0.2 specification. Names of people in non-Latin scripts are followed by a latinization in square brackets (e.g. В. С. Загороднов [V. S. Zagorodnov]) and non-English titles are followed by a translation in square brackets. The family name of a Latin-script name is wrapped in curly braces when it is not the last word of the name (e.g. Emmanuel {Le Meur}, {Duan} Keqin) or when the name ends in two or more unabbreviated words (e.g. Jon Ove {Hagen}). The family name of a Chinese name (and of its latinization) is wrapped in curly braces when it is not the first character.

name type description

id (required) string Unique identifier constructed from the first author's lowercase, latinized, family name and the publication year, followed as needed by a lowercase letter to ensure uniqueness (e.g. Загороднов 1981 → zagorodnov1981a).

author string Author names (optionally followed by their ORCID or contact email in parentheses) as a pipe-delimited list.

year (required) year Year of publication.

type (required) string Item type.
- article-journal: Journal article
- book: Book (if the entire book is relevant)
- chapter: Book section
- document: Document not fitting into any other category
- dataset: Collection of data
- map: Geographic map
- paper-conference: Paper published in conference proceedings
- personal-communication: Personal communication between individuals
- speech: Presentation (talk, poster) at a conference
- report: Report distributed by an institution
- thesis-phd: Doctor of Philosophy (PhD) thesis
- thesis-msc: Master of Science (MSc) thesis
- webpage: Website or page on a website

title (required) string Item title.

url string URL (DOI if available).

language (required) string Language as ISO 639-1 two-letter language code.
- da: Danish
- de: German
- en: English
- es: Spanish
- fr: French
- ja: Japanese
- ko: Korean
- ru: Russian
- sv: Swedish
- zh: Chinese

container_title string Title of the container (e.g. journal, book).

volume integer Volume number of the item or container.

issue string Issue number (e.g. 1) or range (e.g. 1-2) of the item or container, with an optional letter prefix (e.g. F1) or part number (e.g. 75pt2).

page string Page number (e.g. 1) or range (e.g. 1-2) of the item in the container, with an optional letter prefix (e.g. S1).

version string Version number (e.g. 1.0) of the item.

editor string Editor names (e.g. of the containing book) as a pipe-delimited list.

collection_title string Title of the collection (e.g. book series).

collection_number string Number (e.g. 1) or range (e.g. 1-2) in the collection (e.g. book series volume).

publisher string Publisher name.
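The author conventions described for the source table (pipe-delimited entries, an optional ORCID or email in parentheses, curly braces around a family name that is not the last word) can be sketched as a small parser. This is a minimal sketch, not the database's own tooling: `Author` and `parse_authors` are hypothetical names, the contact address in the example is made up, and the sketch deliberately ignores bracketed latinizations and the first-character rule for Chinese names.

```python
import re
from typing import NamedTuple, Optional


class Author(NamedTuple):
    name: str                # full name with curly braces removed
    family: str              # family name
    contact: Optional[str]   # ORCID or email, if one was given


def parse_authors(field: str) -> list[Author]:
    """Parse a pipe-delimited author field (Latin-script names only)."""
    authors = []
    for entry in field.split("|"):
        entry = entry.strip()
        if not entry:
            continue
        # Optional trailing "(orcid-or-email)".
        contact = None
        match = re.search(r"\s*\(([^()]*)\)\s*$", entry)
        if match:
            contact = match.group(1).strip()
            entry = entry[:match.start()]
        # Family name: the braced text if present, otherwise the last word.
        braced = re.search(r"\{([^{}]*)\}", entry)
        family = braced.group(1) if braced else entry.split()[-1]
        name = entry.replace("{", "").replace("}", "")
        authors.append(Author(name, family, contact))
    return authors


authors = parse_authors(
    "Emmanuel {Le Meur} | {Duan} Keqin | Jon Ove {Hagen} | Jane Doe (jane@example.org)"
)
```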

borehole

Metadata about each borehole.

name type description

id (required) integer Unique identifier.

source_id (required) string Identifier of the source of the earliest temperature measurements. This is also the source of the borehole attributes unless otherwise stated in notes.

glacier_name (required) string Glacier or ice cap name (as reported).

glims_id string Global Land Ice Measurements from Space (GLIMS) glacier identifier.

location_origin (required) string Origin of location (latitude, longitude).
- submitted: Provided in data submission
- published: Reported as coordinates in original publication
- digitized: Digitized from published map with complete axes
- estimated: Estimated from published plot by comparing to a map (e.g. Google Maps, CalTopo)
- guessed: Estimated with difficulty, for example by comparing elevation to a map (e.g. Google Maps, CalTopo)

latitude (required) number [degree] Latitude (EPSG 4326).

longitude (required) number [degree] Longitude (EPSG 4326).

elevation_origin (required) string Origin of elevation (elevation).
- submitted: Provided in data submission
- published: Reported as number in original publication
- digitized: Digitized from published plot with complete axes
- estimated: Estimated from elevation contours in published map
- guessed: Estimated with difficulty, for example by comparing location (latitude, longitude) to a map of contemporary elevations (e.g. CalTopo, Google Maps)

elevation (required) number [m] Elevation above sea level.

mass_balance_area string Mass balance area.
- ablation: Ablation area
- equilibrium: Near the equilibrium line
- accumulation: Accumulation area

label string Borehole name (e.g. as labeled on a plot).

date_min date (%Y-%m-%d) Begin date of drilling, or if not known precisely, the first possible date (e.g. 2019 → 2019-01-01).

date_max date (%Y-%m-%d) End date of drilling, or if not known precisely, the last possible date (e.g. 2019 → 2019-12-31).

drill_method string Drilling method.
- mechanical: Push, percussion, rotary
- thermal: Hot point, electrothermal, steam
- combined: Mechanical and thermal

ice_depth number [m] Starting depth of continuous ice. Infinity (INF) indicates that only snow, firn, or intermittent ice was reached.

depth number [m] Total borehole depth (not including drilling in the underlying bed).

to_bed boolean Whether the borehole reached the glacier bed.

temperature_uncertainty number [°C] Estimated temperature uncertainty (as reported).

notes string Additional remarks about the study site, the borehole, or the measurements therein as a pipe-delimited list. Sources are referenced by source.id. Quality concerns are prefixed with '[flag]'.

curator string Names of people who added the data to the database, as a pipe-delimited list.

investigators string Names of people and/or agencies who performed the work, as a pipe-delimited list. Each entry is in the format 'person (agency; ...) {notes}', where only person or one agency is required. Person and agency may contain a latinized form in square brackets.

funding string Funding sources as a pipe-delimited list. Each entry is in the format 'funder [rorid] > award [number] url', where only funder is required and rorid is the funder's ROR (https://ror.org) ID (e.g. 01jtrvx49).
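Since the notes column is a pipe-delimited list in which quality concerns carry a '[flag]' prefix, separating the two kinds of remark takes only a few lines. The sketch below assumes nothing beyond that description; `split_notes` is a hypothetical helper and the example note text is invented.

```python
def split_notes(notes: str) -> tuple[list[str], list[str]]:
    """Split a pipe-delimited notes field into (remarks, flagged concerns)."""
    remarks, flags = [], []
    for item in notes.split("|"):
        item = item.strip()
        if not item:
            continue
        if item.startswith("[flag]"):
            # Drop the prefix and keep the text of the quality concern.
            flags.append(item[len("[flag]"):].strip())
        else:
            remarks.append(item)
    return remarks, flags


remarks, flags = split_notes(
    "Drilled near the terminus | [flag] Thermistor calibration not reported"
)
```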

profile

Date and time of each measurement profile.

name type description

borehole_id (required) integer Borehole identifier.

id (required) integer Borehole profile identifier (starting from 1 for each borehole).

source_id (required) string Source identifier.

measurement_origin (required) string Origin of measurements (measurement.depth, measurement.temperature).
- submitted: Provided as numbers in data submission
- published: Numbers read from original publication
- digitized-discrete: Digitized with Plot Digitizer from discrete points of depth versus temperature
- digitized-continuous: Digitized with Plot Digitizer from a continuous data source (e.g. line plot of depth versus temperature)

date_min date (%Y-%m-%d) Measurement date, or if not known precisely, the first possible date (e.g. 2019 → 2019-01-01).

date_max (required) date (%Y-%m-%d) Measurement date, or if not known precisely, the last possible date (e.g. 2019 → 2019-12-31).

time time (%H:%M:%S) Measurement time.

utc_offset number [h] Time offset relative to Coordinated Universal Time (UTC).

equilibrium string Whether and how reported temperatures equilibrated following drilling.
- true: Equilibrium was measured
- estimated: Equilibrium was estimated (typically by extrapolation)
- false: Equilibrium was not reached

notes string Additional remarks about the profile or the measurements therein as a pipe-delimited list. Sources are referenced by source.id. Quality concerns are prefixed with '[flag]'.
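The date_min and date_max columns bracket a date that is not known precisely (e.g. 2019 becomes 2019-01-01 and 2019-12-31). A minimal sketch of that expansion follows. Only the year-only case is documented above; extending the same idea to a year-month value is an assumption made here, and `date_bounds` is a hypothetical helper name.

```python
import calendar
from datetime import date


def date_bounds(text: str) -> tuple[date, date]:
    """Return (date_min, date_max) for 'YYYY', 'YYYY-MM', or 'YYYY-MM-DD'."""
    parts = [int(part) for part in text.split("-")]
    if len(parts) == 1:
        # Year only: first and last day of that year.
        (year,) = parts
        return date(year, 1, 1), date(year, 12, 31)
    if len(parts) == 2:
        # Year and month (assumed extension): first and last day of the month.
        year, month = parts
        last_day = calendar.monthrange(year, month)[1]
        return date(year, month, 1), date(year, month, last_day)
    # Full date: both bounds are the same day.
    exact = date(*parts)
    return exact, exact


year_bounds = date_bounds("2019")
```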

measurement

Temperature measurements with depth.

name type description

borehole_id (required) integer Borehole identifier.

profile_id (required) integer Borehole profile identifier.

depth (required) number [m] Depth below the glacier surface.

temperature (required) number [°C] Temperature.
Data on Alaskan Population demographics ranging from 1940 to 2015
dataone.org
search.dataone.org
Updated Feb 7, 2019
United States Census Bureau; Juliet Bachtel; John Randazzo (2019). Data on Alaskan Population demographics ranging from 1940 to 2015 [Dataset]. http://doi.org/10.5063/F1CV4FZX
Explore at:
Unique identifier
https://doi.org/10.5063/F1CV4FZX
Dataset updated
Feb 7, 2019
Dataset provided by
Knowledge Network for Biocomplexity
Authors
United States Census Bureau; Juliet Bachtel; John Randazzo
Time period covered
Jan 1, 1940 - Dec 31, 2015
Area covered

Variables measured
lat, lng, Year, city, ANVSA, Negro, Other, Place, White, Aleut., and 138 more
Description
These data comprise Census records relating to the Alaskan people's population demographics for the State of Alaskan Salmon and People (SASAP) Project. Decennial census data were originally extracted from the IPUMS National Historical Geographic Information System website: https://data2.nhgis.org/main (Citation: Steven Manson, Jonathan Schroeder, David Van Riper, and Steven Ruggles. IPUMS National Historical Geographic Information System: Version 12.0 [Database]. Minneapolis: University of Minnesota. 2017. http://doi.org/10.18128/D050.V12.0). A number of relevant tables of basic demographics on age and race, household income and poverty levels, and labor force participation were extracted.

These particular variables were selected as part of an effort to understand and potentially quantify various dimensions of well-being in Alaskan communities. The file "censusdata_master.csv" is a consolidation of all 21 other data files in the package. For detailed information on how the datasets vary over different years, view the file "readme.docx" available in this data package. The included .Rmd file is a script which combines the 21 files by year into a single file (censusdata_master.csv). It also cleans up place names (including typographical errors) and uses the USGS place names dataset and the SASAP regions dataset to assign latitude and longitude values and region values to each place in the dataset. Note that some places were not assigned a region or location because they do not fit well into the regional framework. Considerable heterogeneity exists between census surveys each year. While we have attempted to combine these datasets in a way that makes sense, there may be some discrepancies or unexpected values. Please send a description of any unusual values to the dataset contact.
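The consolidation step described above is done by the package's own .Rmd script, which is not reproduced here. As a loose Python analogue only, the sketch below stacks per-year tables whose columns differ, leaving missing values where a variable was not recorded in a given census. The helper `combine_years`, the choice of columns, and all values are assumptions for illustration.

```python
import pandas as pd


def combine_years(tables: dict[int, pd.DataFrame]) -> pd.DataFrame:
    """Stack per-year tables; columns absent in a given year become NaN."""
    frames = []
    for year, table in sorted(tables.items()):
        # Record which census year each row came from.
        frames.append(table.assign(Year=year))
    return pd.concat(frames, ignore_index=True, sort=False)


# Invented example tables with different variables in different years.
tables = {
    1940: pd.DataFrame({"Place": ["Place A"], "White": [100]}),
    2010: pd.DataFrame({"Place": ["Place A"], "White": [150], "Other": [20]}),
}
combined = combine_years(tables)
```

Rows from years that lack a variable carry NaN in that column, which is one way the heterogeneity between census years can surface as unexpected values in a master file.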

Louis Teitelbaum (2023). American Names by Multi-Ethnic/National Origin [Dataset]. https://www.kaggle.com/datasets/louisteitelbaum/american-names-by-multi-ethnic-national-origin

American Names by Multi-Ethnic/National Origin

25,540 Americans, 491 Overlapping Ethnic/National Categories

Explore at:

zip(778154 bytes)Available download formats

Dataset updated

Aug 22, 2023

Authors

Louis Teitelbaum

License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically

Area covered

United States

Description

This dataset includes all personal names listed in the Wikipedia category “American people by ethnic or national origin” and all subcategories fitting the pattern “American People of [ ] descent”, in total more than 25,000 individuals. Each individual is represented by a row, with columns indicating binary membership (0/1) in each ethnic/national category.

Ethnicity inference is an essential tool for identifying disparities in public health and social sciences. Existing datasets linking personal names to ethnic or national origin often neglect to recognize multi-ethnic or multi-national identities. Furthermore, existing datasets use coarse classification schemes (e.g. classifying both Indian and Japanese people as “Asian”) that may not be suitable for many research questions. This dataset remedies these problems by including both very fine-grained ethnic/national categories (e.g. Afghan-Jewish) and broader ones (e.g. European). Users can choose the categories that are relevant to their research. Since many Americans on Wikipedia are associated with multiple overlapping or distinct ethnicities/nationalities, these multi-ethnic associations are also reflected in the data.

Data were obtained from the Wikipedia API and reviewed manually to remove stage names, pen names, mononyms, first initials (when full names are available on Wikipedia), nicknames, honorific titles, and pages that correspond to a group or event rather than an individual.

This dataset was designed for use in training classification algorithms, but may also be independently interesting inasmuch as it is a representative sample of Americans who are famous enough to have their own Wikipedia page, along with detailed information on their ethnic/national origins.

DISCLAIMER: Due to the incomplete nature of Wikipedia, data may not properly reflect all ethnic/national associations for any given individual. For example, there is no guarantee that a given Cuban Jewish person will be listed in both the “American People of Cuban descent” and the “American People of Jewish descent” categories.
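To make the row-per-individual, 0/1-column layout concrete, here is a small pandas sketch of choosing a subset of categories and finding individuals with more than one association. The column labels, the `name` column, the helper `select_categories`, and the three example rows are all hypothetical; the actual labels in the released file may differ.

```python
import pandas as pd

# Invented stand-in for the dataset: one row per person, one 0/1 column per category.
names = pd.DataFrame(
    {
        "name": ["Person A", "Person B", "Person C"],
        "Cuban": [1, 0, 0],
        "Jewish": [1, 1, 0],
        "European": [0, 1, 1],
    }
)


def select_categories(df: pd.DataFrame, categories: list[str],
                      name_col: str = "name") -> pd.DataFrame:
    """Keep the chosen category columns and the rows belonging to any of them."""
    subset = df[[name_col, *categories]]
    return subset[subset[categories].any(axis=1)]


cuban_only = select_categories(names, ["Cuban"])

# Individuals associated with more than one category.
category_cols = ["Cuban", "Jewish", "European"]
multi = names.loc[names[category_cols].sum(axis=1) > 1, "name"].tolist()
```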

American Names by Multi-Ethnic/National Origin

Popular Baby Names

Baby Names from Social Security Card Applications - National Data

Namesakes

USA Names

Baby Names DataSet

Indian Names Dataset

Labelled FHYA Dataset

USA Name Data

Popular Baby Names - Dataset - data.sa.gov.au

Top 100 baby names in England and Wales: historical data

Distribution of first name and last name frequencies by country

Popular 5000 Last Names in the US

Top 100 Baby Names

Baby names for girls in England and Wales

Facebook Profiles Datasets

Popular White Last Names in the US

Baby names for boys in England and Wales

glenglat: Global englacial temperature database

Data on Alaskan Population demographics ranging from 1940 to 2015
