100+ datasets found
  1. USA Name Data

    • kaggle.com
    zip
    Updated Feb 12, 2019
    Cite
    Data.gov (2019). USA Name Data [Dataset]. https://www.kaggle.com/datasets/datagov/usa-names
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Feb 12, 2019
    Dataset provided by
    Data.gov (https://data.gov/)
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    United States
    Description

    Context

    Cultural diversity in the U.S. has led to great variations in names and naming traditions and names have been used to express creativity, personality, cultural identity, and values. Source: https://en.wikipedia.org/wiki/Naming_in_the_United_States

    Content

    This public dataset was created by the Social Security Administration and contains all names from Social Security card applications for births that occurred in the United States after 1879. Note that many people born before 1937 never applied for a Social Security card, so their names are not included in this data. For others who did apply, records may not show the place of birth, and again their names are not included in the data.

    All data are from a 100% sample of records on Social Security card applications as of the end of February 2015. To safeguard privacy, the Social Security Administration restricts names to those with at least 5 occurrences.

    Fork this kernel to get started with this dataset.

    Acknowledgements

    https://bigquery.cloud.google.com/dataset/bigquery-public-data:usa_names

    https://cloud.google.com/bigquery/public-data/usa-names

    Dataset Source: Data.gov. This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source — http://www.data.gov/privacy-policy#data_policy — and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

    Banner Photo by @dcp from Unsplash.

    Inspiration

    What are the most common names?

    What are the most common female names?

    Are there more female or male names?

    Female names by a wide margin?

  2. Race and ethnicity data for first, middle, and last names

    • search.dataone.org
    Updated Nov 8, 2023
    Cite
    Rosenman, Evan; Olivella, Santiago; Imai, Kosuke (2023). Race and ethnicity data for first, middle, and last names [Dataset]. http://doi.org/10.7910/DVN/SGKW0K
    Explore at:
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Rosenman, Evan; Olivella, Santiago; Imai, Kosuke
    Description

    We provide datasets that estimate the racial distributions associated with first, middle, and last names in the United States. The datasets cover five racial categories: White, Black, Hispanic, Asian, and Other. The provided data are computed from the voter files of six Southern states (Alabama, Florida, Georgia, Louisiana, North Carolina, and South Carolina) that collect race and ethnicity data upon registration. We include seven voter files per state, sourced between 2018 and 2021 from L2, Inc. Together, these states have approximately 36 million individuals who provide self-reported race and ethnicity. The last name dataset includes 338K surnames, the middle name dataset contains 126K middle names, and the first name dataset includes 136K first names. For each type of name, we provide a dataset of P(race | name) probabilities and a dataset of P(name | race) probabilities. We include only names that appear at least 25 times across the 42 (= 7 voter files * 6 states) voter files in our dataset. These data are closely related to the dataset "Name Dictionaries for "wru" R Package", https://doi.org/10.7910/DVN/7TRYAC. These are the probabilities used in the latest iteration of the "WRU" package (Khanna et al., 2022) to make probabilistic predictions about the race of individuals, given their names and geolocations.
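    The two kinds of tables described above, P(race | name) and P(name | race), can both be derived from the same joint counts by normalizing over a different axis. The following is a minimal sketch of that idea, not the authors' procedure (which may involve smoothing or other steps not described here). The counts, the `COUNTS` name, and the `conditional_tables` helper are hypothetical; only the minimum-count threshold of 25 comes from the description.

```python
from collections import Counter

# Hypothetical (name, race) counts for illustration only; not values from the dataset.
COUNTS = {
    ("ALEX", "White"): 60,
    ("ALEX", "Black"): 20,
    ("ALEX", "Hispanic"): 20,
    ("MARIA", "White"): 10,
    ("MARIA", "Hispanic"): 90,
    ("ZED", "Other"): 3,  # total below the threshold, so this name is dropped
}


def conditional_tables(counts, min_count=25):
    """Return (P(race | name), P(name | race)) as dicts keyed by (name, race)."""
    name_totals = Counter()
    for (name, _race), n in counts.items():
        name_totals[name] += n

    # Keep only names seen at least `min_count` times overall.
    kept = {k: n for k, n in counts.items() if name_totals[k[0]] >= min_count}

    race_totals = Counter()
    for (_name, race), n in kept.items():
        race_totals[race] += n

    # Normalize the same counts by name totals or by race totals.
    p_race_given_name = {k: n / name_totals[k[0]] for k, n in kept.items()}
    p_name_given_race = {k: n / race_totals[k[1]] for k, n in kept.items()}
    return p_race_given_name, p_name_given_race


p_race_given_name, p_name_given_race = conditional_tables(COUNTS)
```

    With these made-up counts, P(White | ALEX) is 60 / 100 and P(MARIA | Hispanic) is 90 / 110.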

  3. U.S. First Names: Popularity and Counts

    • kaggle.com
    Updated Jun 9, 2025
    Cite
    Daniel Fedorov (2025). U.S. First Names: Popularity and Counts [Dataset]. https://www.kaggle.com/datasets/downshift/u-s-first-names-popularity-and-counts/suggestions
    Explore at:
    Dataset updated
    Jun 9, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Daniel Fedorov
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This dataset contains counts and rankings of the most common first names in the United States, sourced from comprehensive name census data. It is ideal for analyzing naming trends, demographic patterns, and cultural preferences, as well as for building statistical models to explore name popularity over time.

    Dataset structure

    male_first_names.csv: Male first name frequencies and rankings in the U.S.

    female_first_names.csv: Female first name frequencies and rankings in the U.S.

  4. Facebook Names Dataset

    • academictorrents.com
    bittorrent
    Updated Nov 11, 2015
    Cite
    Ron Bowes (Skull Security) (2015). Facebook Names Dataset [Dataset]. https://academictorrents.com/details/e54c73099d291605e7579b90838c2cd86a8e9575
    Explore at:
    Available download formats: bittorrent (2991052604)
    Dataset updated
    Nov 11, 2015
    Dataset authored and provided by
    Ron Bowes (Skull Security)
    License

    https://academictorrents.com/nolicensespecified

    Description

    171 million names (100 million unique)

    This torrent contains:

    • The URL of every searchable Facebook user's profile
    • The name of every searchable Facebook user, both unique and by count (perfect for post-processing, datamining, etc)
    • Processed lists, including first names with count, last names with count, potential usernames with count, etc
    • The programs I used to generate everything

    So, there you have it: lots of awesome data from Facebook. Now, I just have to find one more problem with Facebook so I can write "Revenge of the Facebook Snatchers" and complete the trilogy. Any suggestions? >:-)

    Limitations

    So far, I have only indexed the searchable users, not their friends. Getting their friends will be significantly more data to process, and I don't have those capabilities right now. I'd like to tackle that in the future, though, so if anybody has any bandwidth they'd like to donate, all I need is an ssh account and Nmap installed. An additional limitation is that these are on

  5. Baby Names from Social Security Card Applications - National Data

    • catalog.data.gov
    • res1catalogd-o-tdatad-o-tgov.vcapture.xyz
    Updated Jul 4, 2025
    + more versions
    Cite
    Social Security Administration (2025). Baby Names from Social Security Card Applications - National Data [Dataset]. https://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-national-data
    Explore at:
    Dataset updated
    Jul 4, 2025
    Dataset provided by
    Social Security Administration (http://ssa.gov/)
    Description

    The data (name, year of birth, sex, and number) are from a 100 percent sample of Social Security card applications for 1880 on.
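    Given the four fields named above, a common first step is to total the counts per name across years. The sketch below assumes a single CSV with a header row `name,year,sex,number`; that layout, the `SAMPLE` rows (made-up numbers, not SSA figures), and the `top_names` helper are all hypothetical, so the column names may need adjusting to match the actual download.

```python
import csv
import io
from collections import Counter

# Made-up rows in an assumed layout (name, year, sex, number); not real SSA figures.
SAMPLE = """name,year,sex,number
Mary,1880,F,100
John,1880,M,120
Anna,1880,F,40
Mary,1881,F,90
John,1881,M,100
"""


def top_names(fh, sex=None, k=3):
    """Sum counts per name across all years, optionally for one sex only."""
    totals = Counter()
    for row in csv.DictReader(fh):
        if sex is None or row["sex"] == sex:
            totals[row["name"]] += int(row["number"])
    return totals.most_common(k)


overall = top_names(io.StringIO(SAMPLE))
female = top_names(io.StringIO(SAMPLE), sex="F")
```

    For a real file, `io.StringIO(SAMPLE)` would be replaced by an open file handle.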

  6. Customer Names Dataset

    • kaggle.com
    Updated Sep 3, 2020
    + more versions
    Cite
    Susham Nandi (2020). Customer Names Dataset [Dataset]. https://www.kaggle.com/sushamnandi/customer-names-dataset/code
    Explore at:
    Dataset updated
    Sep 3, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Susham Nandi
    Description

    Dataset

    This dataset was created by Susham Nandi

    Contents

  7. Distribution of first name and last name frequencies by country

    • figshare.com
    xlsx
    Updated Feb 2, 2023
    Cite
    Mike Thelwall (2023). Distribution of first name and last name frequencies by country [Dataset]. http://doi.org/10.6084/m9.figshare.21956795.v2
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Feb 2, 2023
    Dataset provided by
    figshare
    Authors
    Mike Thelwall
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Distribution of first and last name frequencies of academic authors by country.

    Spreadsheet 1 contains 50 countries, with names based on affiliations in Scopus journal articles 2001-2021.

    Spreadsheet 2 contains 200 countries, with names based on affiliations in Scopus journal articles 2001-2021, using a marginally updated last name extraction algorithm that is almost the same except for Dutch/Flemish names.

    From the paper: Can national researcher mobility be tracked by first or last name uniqueness?

    For example the distribution for the UK shows a single peak for international names, with no national names, Belgium has a national peak and an international peak, and China has mainly a national peak. The 50 countries are:

    No  Code  Country
    1   SB    Serbia
    2   IE    Ireland
    3   HU    Hungary
    4   CL    Chile
    5   CO    Colombia
    6   NG    Nigeria
    7   HK    Hong Kong
    8   AR    Argentina
    9   SG    Singapore
    10  NZ    New Zealand
    11  PK    Pakistan
    12  TH    Thailand
    13  UA    Ukraine
    14  SA    Saudi Arabia
    15  RO    Romania
    16  ID    Indonesia
    17  IL    Israel
    18  MY    Malaysia
    19  DK    Denmark
    20  CZ    Czech Republic
    21  ZA    South Africa
    22  AT    Austria
    23  FI    Finland
    24  PT    Portugal
    25  GR    Greece
    26  NO    Norway
    27  EG    Egypt
    28  MX    Mexico
    29  BE    Belgium
    30  CH    Switzerland
    31  SW    Sweden
    32  PL    Poland
    33  TW    Taiwan
    34  NL    Netherlands
    35  TK    Turkey
    36  IR    Iran
    37  RU    Russia
    38  AU    Australia
    39  BR    Brazil
    40  KR    South Korea
    41  ES    Spain
    42  CA    Canada
    43  IT    Italy
    44  FR    France
    45  IN    India
    46  DE    Germany
    47  US    USA
    48  UK    UK
    49  JP    Japan
    50  CN    China

  8. Statistics on Swedish names by birth country 2020

    • researchdata.se
    • gimi9.com
    • +1 more
    Updated Nov 2, 2021
    Cite
    Peter M. Dahlgren (2021). Statistics on Swedish names by birth country 2020 [Dataset]. http://doi.org/10.5878/s91g-y391
    Explore at:
    Available download formats: (248253), (232422238)
    Dataset updated
    Nov 2, 2021
    Dataset provided by
    University of Gothenburg
    Authors
    Peter M. Dahlgren
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Dec 31, 2020
    Area covered
    Sweden
    Description

    This dataset contains statistics on names (first names of women, first names of men, and last names) by country of birth. In total, there are 231,505 names by 202 countries. The data comes from Statistics Sweden's population statistics (name register) and refers to persons registered in Sweden on December 31st, 2020. However, some names are excluded due to confidentiality, such as names with fewer than five carriers. The data is licensed with Creative Commons Attribution 4.0 International (CC BY 4.0) and may be used as long as Statistics Sweden is stated as the source. In this dataset, you will also find (in addition to the original data from Statistics Sweden) tidied data where the ISO code for each country has been added, as well as data in so-called wide format and long format to facilitate easier data processing.

    Please see the Swedish version of the post and the README file for more information about the data.

  9. Nyc popular baby names

    • kaggle.com
    Updated Jun 20, 2022
    Cite
    Rahul Sarkar (2022). Nyc popular baby names [Dataset]. https://www.kaggle.com/datasets/rahulsarkar221/nyc-popular-baby-names
    Explore at:
    Dataset updated
    Jun 20, 2022
    Dataset provided by
    Kaggle
    Authors
    Rahul Sarkar
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    New York
    Description

    This dataset contains popular baby names in New York.

    Dataset: 1 file (popular-baby-names.csv)

    Columns:

    • Year of Birth: Year of the baby's birth.
    • Gender: Gender of the baby.
    • Ethnicity: Ethnicity of the baby.
    • Child's First Name: The first name of the child.
    • Count: How many babies were given that name.
    • Ranking: Ranking of that name.

  10. First name file since 1900

    • gimi9.com
    • data.europa.eu
    Cite
    First name file since 1900 [Dataset]. https://gimi9.com/dataset/eu_5bf42c958b4c4144b0110ce8
    Explore at:
    Description

    The first names file contains data on the first names attributed to children born in France since 1900. These data are available at the level of France and by department. The files available for download list births and not living people in a given year. They are available in two formats (DBASE and CSV). To use these large files, it is recommended to use a database manager or statistical software. The file at the national level can be opened from some spreadsheets. The file at the departmental level is however too large (3.8 million lines) to be consulted with a spreadsheet, so it is proposed in a lighter version with births since 2000 only.

    The data can be accessed in:

    • a national data file containing the first names attributed to children born in France between 1900 and 2022 (data before 2012 relate only to France outside Mayotte) and the numbers by sex associated with each first name;
    • a departmental data file containing the same information at the department of birth level;
    • a lighter data file that contains information at the department level of birth since the year 2000.

  11. First names and last names associating with countries

    • figshare.com
    zip
    Updated Jan 25, 2023
    Cite
    Mike Thelwall (2023). First names and last names associating with countries [Dataset]. http://doi.org/10.6084/m9.figshare.21954467.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 25, 2023
    Dataset provided by
    figshare
    Authors
    Mike Thelwall
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    First names and last names by country according to affiliations in journal articles 2001-2021 as recorded in Scopus. For 200 countries, there is a complete list of all first names and all last names of at least one researcher with a national affiliation in that country. Each file also records:

    • the number of researchers with that name in the country,
    • the proportion of researchers with that name in the country compared to the world,
    • the number of researchers with that name in the world.

    For example, for the USA:

    Name     Authors in USA  Proportion in USA  Total
    Sadrach  3               1.000              3
    Rangsan  1               0.083              12
    Parry    6               0.273              22
    Howard   2008            0.733              2739
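    In the USA example, the "Proportion in USA" values are consistent with dividing the number of authors in the USA by the worldwide total and rounding to three decimals. The short check below recomputes them from the four rows shown; the `ROWS` and `proportion` names are introduced here for illustration, and the rounding rule is an assumption inferred from the printed values.

```python
# (name, authors in USA, worldwide total), copied from the USA example above.
ROWS = [
    ("Sadrach", 3, 3),
    ("Rangsan", 1, 12),
    ("Parry", 6, 22),
    ("Howard", 2008, 2739),
]


def proportion(in_country, worldwide):
    """Share of a name's researchers affiliated with the country, to 3 decimals."""
    return round(in_country / worldwide, 3)


proportions = {name: proportion(a, t) for name, a, t in ROWS}
```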

    Only the first parts of double last names are included. For example, Rodriquez Gonzalez, Maria would have only Rodriquez recorded.

    This is from the paper: "Can national researcher mobility be tracked by first or last name uniqueness"

    List of countries Afghanistan; Albania; Algeria; Angola; Argentina; Armenia; Australia; Austria; Azerbaijan; Bahamas; Bahrain; Bangladesh; Barbados; Belarus; Belgium; Belize; Benin; Bermuda; Bhutan; Bolivia; Bosnia and Herzegovina; Botswana; Brazil; Brunei Darussalam; Bulgaria; Burkina Faso; Burundi; Cambodia; Cameroon; Canada; Cape Verde; Cayman Islands; Central African Republic; Chad; Chile; China; Colombia; Congo; Costa Rica; Cote d'Ivoire; Croatia; Cuba; Cyprus; Czech Republic; Democratic Republic Congo; Denmark; Djibouti; Dominican Republic; Ecuador; Egypt; El Salvador; Eritrea; Estonia; Ethiopia; Falkland Islands (Malvinas); Faroe Islands; Federated States of Micronesia; Fiji; Finland; France; French Guiana; French Polynesia; Gabon; Gambia; Georgia; Germany; Ghana; Greece; Greenland; Grenada; Guadeloupe; Guam; Guatemala; Guinea; Guinea-Bissau; Guyana; Haiti; Honduras; Hong Kong; Hungary; Iceland; India; Indonesia; Iran; Iraq; Ireland; Israel; Italy; Jamaica; Japan; Jordan; Kazakhstan; Kenya; Kuwait; Kyrgyzstan; Laos; Latvia; Lebanon; Lesotho; Liberia; Libyan Arab Jamahiriya; Liechtenstein; Lithuania; Luxembourg; Macao; Macedonia; Madagascar; Malawi; Malaysia; Maldives; Mali; Malta; Martinique; Mauritania; Mauritius; Mexico; Moldova; Monaco; Mongolia; Montenegro; Morocco; Mozambique; Myanmar; Namibia; Nepal; Netherlands; New Caledonia; New Zealand; Nicaragua; Niger; Nigeria; North Korea; North Macedonia; Norway; Oman; Pakistan; Palau; Palestine; Panama; Papua New Guinea; Paraguay; Peru; Philippines; Poland; Portugal; Puerto Rico; Qatar; Reunion; Romania; Russia; Russian Federation; Rwanda; Saint Kitts and Nevis; Samoa; San Marino; Saudi Arabia; Senegal; Serbia; Seychelles; Sierra Leone; Singapore; Slovakia; Slovenia; Solomon Islands; Somalia; South Africa; South Korea; South Sudan; Spain; Sri Lanka; Sudan; Suriname; Swaziland; Sweden; Switzerland; Syrian Arab Republic; Taiwan; Tajikistan; Tanzania; Thailand; Timor-Leste; Togo; Trinidad and Tobago; Tunisia; 
    Turkey; Uganda; Ukraine; United Arab Emirates; United Kingdom; United States; Uruguay; Uzbekistan; Vanuatu; Venezuela; Viet Nam; Virgin Islands (U.S.); Yemen; Yugoslavia; Zambia; Zimbabwe

  12. Gender by Name (Time-series)

    • kaggle.com
    Updated Dec 5, 2022
    Cite
    The Devastator (2022). Gender by Name (Time-series) [Dataset]. https://www.kaggle.com/datasets/thedevastator/automated-gender-identification-using-name-proba/data
    Explore at:
    Dataset updated
    Dec 5, 2022
    Dataset provided by
    Kaggle
    Authors
    The Devastator
    Description

    Automated Gender Identification Using Name Probabilities

    2019 US Social Security Administration Data

    By Derek Howard [source]

    About this dataset

    This dataset provides an essential tool for generating gender-specific datasets from names alone. It contains information on the probability of a person's name belonging to a certain gender, based off of US Social Security records from the last century. This makes it easy to assign genders to datasets that do not natively include this data. All probability values were culled from records with 5 or more people associated with each name - so those individuals with less common monikers can still have their genders correctly predicted! With this resource, users can generate gender-aware data in no time, making gender identification in data sets more accurate and easier than ever

    How to use the dataset

    This dataset provides a helpful resource when you need to accurately identify gender from names. With this dataset, you’ll be able to quickly and accurately assign genders to datasets that contain names but no other information about the person.

    To get started, you will need a csv file with two columns: name and probability. The name column should contain the first names of the people in your dataset. The probability column should contain numbers between 0 and 1 indicating the likelihood that each name is associated with one specific gender (0 for male, 1 for female).

    In addition to simply assigning genders from these probabilities alone, users of this dataset also have more control over their classifications - they can use it as either a baseline or as an absolute measure of accuracy depending on their exact needs/preferences. Experimentation is highly encouraged here!
    Good luck!

    Research Ideas

    • Create gender-specific applications - tailor different apps to different genders based on the probability of a particular name belonging to a certain gender.

    • Generate gender neutral names - use this data to generate random names with no gender bias.

    • Automate record lookup - quickly and accurately assign genders based on the probability associated with their name

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    Data Source

    License

    Unknown License - Please check the dataset description for more information.

    Columns

    File: name_gender.csv

    | Column name | Description |
    |:------------|:------------|
    | name | The name of the person. (String) |
    | gender | The gender of the person. (String) |
    | probability | The probability of the gender being assigned to the person. (Float) |
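    One way to use the three columns above is to accept a name's listed gender only when its probability clears a chosen threshold. The sketch below is a hedged illustration: the sample rows, the "M"/"F" encoding, the 0.9 threshold, and the `load_lookup` and `assign_gender` helpers are assumptions, not part of the dataset. It also assumes one row per name; if the real file lists a name under more than one gender, the lookup would need to keep the higher-probability row.

```python
import csv
import io

# Made-up rows using the documented column names; values are illustrative only.
SAMPLE = """name,gender,probability
Alex,M,0.62
Mary,F,0.99
John,M,0.99
"""


def load_lookup(fh):
    """Map lowercase name -> (gender, probability)."""
    return {
        row["name"].lower(): (row["gender"], float(row["probability"]))
        for row in csv.DictReader(fh)
    }


def assign_gender(name, lookup, threshold=0.9):
    """Return the listed gender if confident enough, otherwise 'unknown'."""
    entry = lookup.get(name.lower())
    if entry is None:
        return "unknown"
    gender, probability = entry
    return gender if probability >= threshold else "unknown"


lookup = load_lookup(io.StringIO(SAMPLE))
```

    Lowering the threshold trades fewer "unknown" results for more misassignments on ambiguous names.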

    Acknowledgements

    If you use this dataset in your research, please credit Derek Howard.

  13. Namesakes

    • figshare.com
    json
    Updated Nov 20, 2021
    Cite
    Oleg Vasilyev; Aysu Altun; Nidhi Vyas; Vedant Dharnidharka; Erika Lampert; John Bohannon (2021). Namesakes [Dataset]. http://doi.org/10.6084/m9.figshare.17009105.v1
    Explore at:
    Available download formats: json
    Dataset updated
    Nov 20, 2021
    Dataset provided by
    figshare
    Authors
    Oleg Vasilyev; Aysu Altun; Nidhi Vyas; Vedant Dharnidharka; Erika Lampert; John Bohannon
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract

    Motivation: creating challenging dataset for testing Named-Entity Linking. The Namesakes dataset consists of three closely related datasets: Entities, News and Backlinks. Entities were collected as Wikipedia text chunks corresponding to highly ambiguous entity names. The News were collected as random news text chunks, containing mentions that either belong to the Entities dataset or can be easily confused with them. Backlinks were obtained from Wikipedia dump data with intention to have mentions linked to the entities of the Entity dataset. The Entities and News are human-labeled, resolving the mentions of the entities.

    Methods

    Entities were collected as Wikipedia text chunks corresponding to highly ambiguous entity names: the most popular people names, the most popular locations, and organizations with name ambiguity. In each Entities text chunk, the named entities with the name similar to the chunk Wikipedia page name are labeled. For labeling, these entities were suggested to human annotators (odetta.ai) to tag as "Same" (same as the page entity) or "Other". The labeling was done by 6 experienced annotators that passed through a preliminary trial task. The only accepted tags are the tags assigned in agreement by not less than 5 annotators, and then passed through reconciliation with an experienced reconciliator.

    The News were collected as random news text chunks, containing mentions which either belong to the Entities dataset or can be easily confused with them. In each News text chunk one mention was selected for labeling, and 3-10 Wikipedia pages from Entities were suggested as the labels for an annotator to choose from. The labeling was done by 3 experienced annotators (odetta.ai), after the annotators passed a preliminary trial task. The results were reconciled by an experienced reconciliator. All the labeling was done using Lighttag (lighttag.io).

    Backlinks were obtained from Wikipedia dump data (dumps.wikimedia.org/enwiki/20210701) with intention to have mentions linked to the entities of the Entity dataset. The backlinks were filtered to leave only mentions in a good quality text; each text was cut 1000 characters after the last mention.

    Usage Notes

    Entities:

    File: Namesakes_entities.jsonl

    The Entities dataset consists of 4148 Wikipedia text chunks containing human-tagged mentions of entities. Each mention is tagged either as "Same" (meaning that the mention is of this Wikipedia page entity), or "Other" (meaning that the mention is of some other entity, just having the same or similar name). The Entities dataset is a jsonl list, each item is a dictionary with the following keys and values:

    • Key 'pagename': page name of the Wikipedia page.
    • Key 'pageid': page id of the Wikipedia page.
    • Key 'title': title of the Wikipedia page.
    • Key 'url': URL of the Wikipedia page.
    • Key 'text': The text chunk from the Wikipedia page.
    • Key 'entities': list of the mentions in the page text, each entity is represented by a dictionary with the keys:
        • Key 'text': the mention as a string from the page text.
        • Key 'start': start character position of the entity in the text.
        • Key 'end': end (one-past-last) character position of the entity in the text.
        • Key 'tag': annotation tag given as a string - either 'Same' or 'Other'.
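    Because 'end' is described as one-past-last, a mention should be recoverable with an ordinary Python slice, text[start:end]. The sketch below builds one hypothetical record with the documented keys (all values, including the page name, id, and URL, are invented for illustration) and checks each mention against its offsets; `check_mentions` is a helper introduced here, not part of the dataset.

```python
import json

# One invented record using the documented keys; values are illustrative only.
text = "John Smith was a sailor. Another John Smith was a poet."
first = text.index("John Smith")
second = text.index("John Smith", first + 1)
record_line = json.dumps({
    "pagename": "John_Smith_(sailor)",
    "pageid": 1,
    "title": "John Smith (sailor)",
    "url": "https://example.org/John_Smith_(sailor)",
    "text": text,
    "entities": [
        {"text": "John Smith", "start": first, "end": first + 10, "tag": "Same"},
        {"text": "John Smith", "start": second, "end": second + 10, "tag": "Other"},
    ],
})


def check_mentions(jsonl_line):
    """Return (tag, sliced mention, offsets match?) for every entity in one item."""
    item = json.loads(jsonl_line)
    results = []
    for ent in item["entities"]:
        span = item["text"][ent["start"]:ent["end"]]  # 'end' is exclusive
        results.append((ent["tag"], span, span == ent["text"]))
    return results


mentions = check_mentions(record_line)
```

    For the real file, the same function could be applied to each line read from Namesakes_entities.jsonl.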

    News:

    File: Namesakes_news.jsonl

    The News dataset consists of 1000 news text chunks, each one with a single annotated entity mention. The annotation either points to the corresponding entity from the Entities dataset (if the mention is of that entity), or indicates that the mentioned entity does not belong to the Entities dataset. The News dataset is a jsonl list, each item is a dictionary with the following keys and values:

    • Key 'id_text': Id of the sample.
    • Key 'text': The text chunk.
    • Key 'urls': List of URLs of wikipedia entities suggested to labelers for identification of the entity mentioned in the text.
    • Key 'entity': a dictionary describing the annotated entity mention in the text:
        • Key 'text': the mention as a string found by an NER model in the text.
        • Key 'start': start character position of the mention in the text.
        • Key 'end': end (one-past-last) character position of the mention in the text.
        • Key 'tag': This key exists only if the mentioned entity is annotated as belonging to the Entities dataset - if so, the value is a dictionary identifying the Wikipedia page assigned by annotators to the mentioned entity:
            • Key 'pageid': Wikipedia page id.
            • Key 'pagetitle': page title.
            • Key 'url': page URL.

    Backlinks dataset: The Backlinks dataset consists of two parts: dictionary Entity-to-Backlinks and Backlinks documents. The dictionary points to backlinks for each entity of the Entity dataset (if any backlinks exist for the entity). The Backlinks documents are the backlinks Wikipedia text chunks with identified mentions of the entities from the Entities dataset.

    Each mention is identified by surrounding double square brackets, e.g. "Muir built a small cabin along [[Yosemite Creek]].". However, if the mention differs from the exact entity name, the double square brackets wrap both the exact name and, separated by '|', the mention string to the right, for example: "Muir also spent time with photographer [[Carleton E. Watkins | Carleton Watkins]] and studied his photographs of Yosemite.".
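    The bracket convention described above can be parsed with a single regular expression. This is a minimal sketch that assumes brackets are never nested and that '|' appears at most once per mention; the `MENTION` pattern and `extract_mentions` helper are names introduced here for illustration. The two test sentences are the examples quoted in the description.

```python
import re

# [[Entity name]] or [[Entity name | mention string]]; assumes no nested brackets.
MENTION = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def extract_mentions(content):
    """Return (entity name, mention string) pairs in order of appearance."""
    pairs = []
    for match in MENTION.finditer(content):
        entity = match.group(1).strip()
        # When no '|' part is present, the mention equals the entity name.
        surface = (match.group(2) or match.group(1)).strip()
        pairs.append((entity, surface))
    return pairs


plain = extract_mentions("Muir built a small cabin along [[Yosemite Creek]].")
piped = extract_mentions(
    "Muir also spent time with photographer "
    "[[Carleton E. Watkins | Carleton Watkins]] and studied his photographs of Yosemite."
)
```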

    The Entity-to-Backlinks is a jsonl with 1527 items.

    File: Namesakes_backlinks_entities.jsonl

    Each item is a tuple:

    • Entity name.
    • Entity Wikipedia page id.
    • Backlinks ids: a list of pageids of backlink documents.

    The Backlinks documents is a jsonl with 26903 items.

    File: Namesakes_backlinks_texts.jsonl

    Each item is a dictionary:

    • Key 'pageid': Id of the Wikipedia page.
    • Key 'title': Title of the Wikipedia page.
    • Key 'content': Text chunk from the Wikipedia page, with all mentions in the double brackets; the text is cut 1000 characters after the last mention, the cut is denoted as '...[CUT]'.
    • Key 'mentions': List of the mentions from the text, for convenience. Each mention is a tuple:
        • Entity name.
        • Entity Wikipedia page id.
        • Sorted list of all character indexes at which the mention occurrences start in the text.

  14. Popular Baby Names - Dataset - data.sa.gov.au

    • data.sa.gov.au
    Updated Mar 1, 2025
    + more versions
    Cite
    (2025). Popular Baby Names - Dataset - data.sa.gov.au [Dataset]. https://data.sa.gov.au/data/dataset/popular-baby-names
    Explore at:
    Dataset updated
    Mar 1, 2025
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    South Australia
    Description

    List of male and female baby names in South Australia from 1944 to 2024. The annual data for baby names is published January/February each year.

  15. Dataset of country full name and individuals using the Internet of countries...

    • workwithdata.com
    Updated Apr 9, 2025
    + more versions
    Cite
    Work With Data (2025). Dataset of country full name and individuals using the Internet of countries per year in Sri Lanka (Historical) [Dataset]. https://www.workwithdata.com/datasets/countries-yearly?col=country%2Ccountry_long%2Cdate%2Cinternet_pct&f=1&fcol0=country&fop0=%3D&fval0=Sri+Lanka
    Explore at:
    Dataset updated
    Apr 9, 2025
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Sri Lanka
    Description

    This dataset is about countries per year in Sri Lanka. It has 64 rows. It features 4 columns: country, country full name, date, and individuals using the Internet.

  16. fun-club-name-generator-dataset

    • huggingface.co
    Updated Apr 5, 2025
    Cite
    Mitchell (2025). fun-club-name-generator-dataset [Dataset]. https://huggingface.co/datasets/Laurenfromhere/fun-club-name-generator-dataset
    Explore at:
    Dataset updated
    Apr 5, 2025
    Authors
    Mitchell
    Description

    Fun Club Name Generator Dataset

    This is a small, handcrafted dataset of random and fun club name ideas. The goal is to help people who are stuck naming something, whether it's a book club, a gaming group, a project, or just a Discord server between friends.

    Why this?

    A few friends and I spent hours trying to name a casual group — everything felt cringey, too serious, or already taken. We started writing down names that made us laugh, and eventually collected enough to… See the full description on the dataset page: https://huggingface.co/datasets/Laurenfromhere/fun-club-name-generator-dataset.

  17. Dataset of country full name and individuals using the Internet of countries...

    • workwithdata.com
    Updated Apr 9, 2025
    Work With Data (2025). Dataset of country full name and individuals using the Internet of countries per year in Poland and in 2023 (Historical) [Dataset]. https://www.workwithdata.com/datasets/countries-yearly?col=country%2Ccountry_long%2Cdate%2Cinternet_pct&f=2&fcol0=country&fcol1=date&fop0=%3D&fop1=%3D&fval0=Poland&fval1=2023
    Dataset updated
    Apr 9, 2025
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Poland
    Description

    This dataset is about countries per year in Poland. It has 1 row and is filtered where the date is 2023. It features 4 columns: country, country full name, date, and individuals using the Internet.

  18. A corpus of names drawn from the local birth registers of England and Wales,...

    • dtechtive.com
    • find.data.gov.scot
    txt, xlsx, zip
    Updated Jan 25, 2018
    University of Edinburgh (2018). A corpus of names drawn from the local birth registers of England and Wales, 1838-2014 [Dataset]. http://doi.org/10.7488/ds/2294
    Available download formats
    xlsx (30.21 MB), zip (5.395 MB), txt (0.0166 MB)
    Dataset updated
    Jan 25, 2018
    Dataset provided by
    University of Edinburgh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    England, UNITED KINGDOM
    Description

    This dataset comprises a corpus of names, in both the first and middle position, for approximately 22 million individuals born in England and Wales between 1838 and 2014. This data is obtained from birth records made available by a set of volunteer-run genealogical resources - collectively, the 'UK local BMD project' (http://www.ukbmd.org.uk/local) - and has been re-purposed here to demonstrate the applicability of network analysis methods to an onomastic dataset. The ownership and licensing of the intellectual property constituting the original birth records is detailed at https://www.ukbmd.org.uk/TermsAndConditions. Under section 29A of the UK Copyright, Designs and Patents Act 1988, a copyright exception permits copies to be made of lawfully accessible material in order to conduct text and data mining for non-commercial research. The data included in this dataset represents the outcome of such a text-mining analysis. No birth records are included in this dataset, and nor is it possible for records to be reconstructed from the data presented herein. The data comprises an archive of tables, presenting this corpus in various forms: as a rank order of names (in both the first and middle position) by number of registered births per year, and by the total number of births across all years sampled. An overview of the data is also provided, with summary statistics such as the number of usable records registered per year, most popular names per year, and measures of forename diversity and the surname-to-forename usage ratio (an indicator of which forenames are more likely to be transferred uses of surnames). These tables are extensive but not exhaustive, and do not exclude the possibility that errors are present in the corpus. 
Data are also presented both as '.expression' files (an input format readable by the network analysis tool Graphia Professional) and as '.layout' files, a text file format output by Graphia Professional that describes the characteristics of the network so that it may be replicated. Characteristics of the original birth records that allow the identification of individuals - for instance, full name or location of birth - have been removed.

  19. Gender and Ethnicity Predictions for California City Council Members and...

    • openicpsr.org
    • dataverse.harvard.edu
    delimited
    Updated Oct 24, 2024
    Rohan M. Dalal (2024). Gender and Ethnicity Predictions for California City Council Members and School Board Members, 2010-2023 [Dataset]. http://doi.org/10.3886/E209861V1
    Available download formats
    delimited
    Dataset updated
    Oct 24, 2024
    Dataset provided by
    Crystal Springs Uplands School
    Authors
    Rohan M. Dalal
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2010 - 2023
    Area covered
    California City, California
    Description

    To conduct this study, I sourced demographic data from 2010 to 2023 from the California Elections Data Archive (CEDA) for city council members and school board members. The CEDA data provide a full list of candidate names and the number of votes a given candidate received for every city council and school board election. I assigned the gender to each candidate based on the lists of popular male and female names provided by the Social Security Administration. Since the average age of city council members is 46 years old according to the Bureau of Labor Statistics, I compiled a list of popular male and female given names for babies born in the 1960s, 1970s, and 1980s. Then, I automated the gender classification as follows: for example, as “Lisa” is identified as a popular female given name by the Social Security Administration, every candidate whose first name is “Lisa” was assigned “female” in our dataset. For a gender-neutral name that appeared on the lists for both male and female given names, which included “Alex” and “Casey,” I used the following keywords “[first name] [last name] [office type (either “city council” or “school board”)] [name of the city or the school district]” to search for more information about the official’s gender online. My search returned either a picture to help clearly identify the official’s gender and/or an article that refers to the official with gendered pronouns. To identify the ethnicity of each elected official, I used the 2010 Census data and the 23AndMe Surname Discovery Tool. The 2010 Census lists surnames occurring at least 100 times, and it includes self-reported ethnicity data for individuals with a given surname. Similarly, the 23AndMe Surname Discovery Tool gives the percentage of individuals with the given surname who identify as each of four different ethnicity groups: Hispanic, White, Asian/Pacific Islander, and Black based on the 2010 US Census data. 
For surnames that did not appear in either the 2010 Census data or the 23AndMe Surname Discovery Tool, I used Python’s Ethnicolr library, which predicts ethnicity from either the first and last name together or the last name alone, based on US census data (2000 and 2010), Florida voting registration data, and Wikipedia data.
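
The name-list lookup described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the author's code: the two name sets are tiny hypothetical stand-ins for the Social Security Administration lists for the 1960s to 1980s, the function name `classify_gender` is invented here, and the `"unknown"` result for names on neither list is an assumption, since the description does not say how such names were handled.

```python
# Hypothetical stand-ins for the SSA lists of popular given names
# (1960s-1980s); the real lists are far longer.
FEMALE_NAMES = {"lisa", "alex", "casey"}
MALE_NAMES = {"michael", "alex", "casey"}


def classify_gender(first_name: str) -> str:
    """Assign a gender label from a candidate's first name.

    Names on both lists are flagged for the manual web search that
    the description mentions; names on neither list are labeled
    "unknown" (an assumption made for this sketch).
    """
    name = first_name.strip().lower()
    in_female = name in FEMALE_NAMES
    in_male = name in MALE_NAMES
    if in_female and in_male:
        return "manual_review"
    if in_female:
        return "female"
    if in_male:
        return "male"
    return "unknown"


if __name__ == "__main__":
    for candidate in ["Lisa", "Casey", "Michael", "Zed"]:
        print(candidate, classify_gender(candidate))
```

Flagging ambiguous names rather than guessing mirrors the described workflow, in which names such as “Alex” and “Casey” were resolved by searching for the official online.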

  20. Geonames - All Cities with a population > 1000

    • public.opendatasoft.com
    • data.smartidf.services
    • +2more
    csv, excel, geojson +1
    Updated Mar 10, 2024
    + more versions
    (2024). Geonames - All Cities with a population > 1000 [Dataset]. https://public.opendatasoft.com/explore/dataset/geonames-all-cities-with-a-population-1000/
    Available download formats
    csv, json, geojson, excel
    Dataset updated
    Mar 10, 2024
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    All cities with a population > 1000 or seats of adm div (ca 80.000)

    Sources and Contributions

    Sources: GeoNames aggregates over a hundred different data sources.
    Ambassadors: GeoNames Ambassadors help in many countries.
    Wiki: A wiki allows users to view the data, quickly fix errors, and add missing places.
    Donations and Sponsoring: Costs for running GeoNames are covered by donations and sponsoring.

    Enrichment: add country name

Data.gov (2019). USA Name Data [Dataset]. https://www.kaggle.com/datasets/datagov/usa-names

USA Name Data

USA Name Data (BigQuery Dataset)

2 scholarly articles cite this dataset (View in Google Scholar)
Available download formats
zip (0 bytes)
Dataset updated
Feb 12, 2019
Dataset provided by
Data.gov (https://data.gov/)
License

https://creativecommons.org/publicdomain/zero/1.0/

Area covered
United States
Description

Context

Cultural diversity in the U.S. has led to great variations in names and naming traditions and names have been used to express creativity, personality, cultural identity, and values. Source: https://en.wikipedia.org/wiki/Naming_in_the_United_States

Content

This public dataset was created by the Social Security Administration and contains all names from Social Security card applications for births that occurred in the United States after 1879. Note that many people born before 1937 never applied for a Social Security card, so their names are not included in this data. For others who did apply, records may not show the place of birth, and again their names are not included in the data.

All data are from a 100% sample of records on Social Security card applications as of the end of February 2015. To safeguard privacy, the Social Security Administration restricts names to those with at least 5 occurrences.

Fork this kernel to get started with this dataset.

Acknowledgements

https://bigquery.cloud.google.com/dataset/bigquery-public-data:usa_names

https://cloud.google.com/bigquery/public-data/usa-names

Dataset Source: Data.gov. This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source — http://www.data.gov/privacy-policy#data_policy — and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

Banner Photo by @dcp from Unsplash.

Inspiration

What are the most common names?

What are the most common female names?

Are there more female or male names?

Female names by a wide margin?
