The data (name, year of birth, sex, and number) are from a 100 percent sample of Social Security card applications for 1880 onward.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set lists the sex and number of birth registrations for each first name, from 1900 onward. Years are grouped by the date of the birth registration, not by the date of birth. Some birth registrations are not included, such as registrations with a sex other than Male or Female (i.e. indeterminate or not recorded), or where the birth registration date is not recorded. These excluded records are so few that their exclusion is unlikely to have any significant impact on the data. Where a name has fewer than 10 instances in a particular year, the name is not included in the data for that year. As a result, total volumes will be less than the total birth registrations in that year. As first and middle names are recorded together in our system, the first name has been split off from the middle names. Due to the size of the data set, this was done with an automated system, generally by looking for the first space in the name. This means some names may not have been split correctly. Also, certain symbols in names may not carry through to the data correctly. Please let us know using the contact email address if you find any errors in the data.
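The splitting and suppression rules described above can be sketched roughly as follows. This is an illustration only, not the registry's actual system: the record layout `(year, sex, given_names)` and the helper names `first_name` and `tally` are assumptions made for the example.

```python
from collections import Counter

MIN_COUNT = 10  # names with fewer instances in a year are left out


def first_name(given_names: str) -> str:
    """Return the text before the first space (the rule described above)."""
    return given_names.strip().split(" ", 1)[0]


def tally(records, min_count=MIN_COUNT):
    """Count first names per (year, sex) and drop rare ones.

    `records` is an iterable of (year, sex, given_names) tuples,
    a hypothetical layout chosen for this sketch.
    """
    counts = Counter()
    for year, sex, given in records:
        # Skip registrations without a usable sex or registration year.
        if sex not in ("Male", "Female") or year is None:
            continue
        name = first_name(given)
        if name:
            counts[(year, sex, name)] += 1
    return {key: n for key, n in counts.items() if n >= min_count}


records = (
    [(1990, "Female", "Mary Jane")] * 12
    + [(1990, "Female", "Anne")] * 9       # below the threshold, suppressed
    + [(1990, "Male", "Jean Luc Pierre")] * 3
)
result = tally(records)
print(result)
```

The sketch also shows the limitation the description mentions: a two-word first name such as "Jean Luc" is cut to "Jean", because only the first space is considered.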
We provide datasets that estimate the racial distributions associated with first, middle, and last names in the United States. The datasets cover five racial categories: White, Black, Hispanic, Asian, and Other. The provided data are computed from the voter files of six Southern states (Alabama, Florida, Georgia, Louisiana, North Carolina, and South Carolina) that collect race and ethnicity data upon registration. We include seven voter files per state, sourced between 2018 and 2021 from L2, Inc. Together, these states have approximately 36 million individuals who provide self-reported race and ethnicity. The last name dataset includes 338K surnames, the middle name dataset contains 126K middle names, and the first name dataset includes 136K first names. For each type of name, we provide a dataset of P(race | name) probabilities and a dataset of P(name | race) probabilities. We include only names that appear at least 25 times across the 42 (= 7 voter files * 6 states) voter files in our dataset. These data are closely related to the dataset "Name Dictionaries for "wru" R Package", https://doi.org/10.7910/DVN/7TRYAC. These are the probabilities used in the latest iteration of the "WRU" package (Khanna et al., 2022) to make probabilistic predictions about the race of individuals, given their names and geolocations.
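To make the two kinds of tables concrete, here is a minimal sketch of how P(race | name) and P(name | race) relate to raw counts. The counts are invented toy numbers and the function name `conditional_tables` is hypothetical; this is plain normalization of a count table, not necessarily the estimation procedure the authors used.

```python
from collections import defaultdict


def conditional_tables(counts):
    """Turn {(name, race): count} into two conditional probability tables."""
    name_totals = defaultdict(int)
    race_totals = defaultdict(int)
    for (name, race), n in counts.items():
        name_totals[name] += n
        race_totals[race] += n

    # P(race | name): normalize each count by the total for that name.
    p_race_given_name = {
        (name, race): n / name_totals[name] for (name, race), n in counts.items()
    }
    # P(name | race): normalize each count by the total for that race.
    p_name_given_race = {
        (name, race): n / race_totals[race] for (name, race), n in counts.items()
    }
    return p_race_given_name, p_name_given_race


# Toy counts, invented for illustration only.
toy_counts = {
    ("SMITH", "White"): 70,
    ("SMITH", "Black"): 30,
    ("GARCIA", "Hispanic"): 90,
    ("GARCIA", "White"): 10,
}
p_rn, p_nr = conditional_tables(toy_counts)
print(p_rn[("SMITH", "White")])  # share of SMITHs who are White in the toy data
print(p_nr[("SMITH", "White")])  # share of Whites named SMITH in the toy data
```

The two tables answer different questions, which is why both are provided: the first sums to 1 over races for a fixed name, the second sums to 1 over names for a fixed race.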
http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
In 2018 we were doing a research project, and we needed to know if a name was male or female. After Googling for hours for 'baby name lists', 'name databases' and 'name datasets', we discovered that there wasn't a complete name database for all countries with first names and gender. Most name databases we found had a different layout per country, were incomplete, or contained non-existent names. That is why we created Name Census, the most comprehensive name database in the world! The Name Census top 100 database is a free database containing the top 100 first names and top 100 surnames for each country.
Our name database is created using first names and surnames obtained from governments. We cross-referenced those names against millions of publicly available social media profiles and counted each name per country. This way we were sure that the names in our name database are actually used, and we could create our popularity metric. We now offer the complete name database and the name parsing service as separate services.
The Name Census top 100 is a name database that consists of two files: the first names top 100 per country and the surnames top 100 per country. Each file is a CSV file encoded in UTF-8.
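Because the files are UTF-8 CSV, it helps to pass the encoding explicitly when reading them so that accented names survive. A small sketch follows; the file name and the column names (`country`, `rank`, `name`) are assumptions for the example, since the actual layout is not given here.

```python
import csv
import tempfile
from pathlib import Path


def read_top100(path):
    """Read a UTF-8 CSV file into a list of dicts keyed by column name."""
    with open(path, encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle))


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical sample file standing in for the real download.
    sample = Path(tmp, "first_names_top100.csv")
    sample.write_text(
        "country,rank,name\nES,1,José\nDE,1,Jürgen\n", encoding="utf-8"
    )
    rows = read_top100(sample)

print(rows[0]["name"])
```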
This public dataset was created by the Social Security Administration and contains all names from Social Security card applications for births that occurred in the United States after 1879. Note that many people born before 1937 never applied for a Social Security card, so their names are not included in this data. For others who did apply, records may not show the place of birth, and again their names are not included in the data. All data are from a 100% sample of records on Social Security card applications as of the end of February 2015. To safeguard privacy, the Social Security Administration restricts names to those with at least 5 occurrences. This public dataset is hosted in Google BigQuery and is included in BigQuery's free tier of 1TB of processing per month. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
List of male and female baby names in South Australia from 1944 to 2024. The annual data for baby names is published January/February each year.
Attribution 2.0 (CC BY 2.0) https://creativecommons.org/licenses/by/2.0/
License information was derived automatically
A large and fast-growing number of studies across the social sciences use experiments to better understand the role of race in human interactions, particularly in the American context. Researchers often use names to signal the race of individuals portrayed in these experiments. However, those names might also signal other attributes, such as socioeconomic status (e.g., education and income) and citizenship. If they do, researchers need pre-tested names with data on perceptions of these attributes. Such data would permit researchers to draw correct inferences about the causal effect of race in their experiments. In this paper, we provide the largest dataset of validated name perceptions based on three different surveys conducted in the United States. In total, our data include over 44,170 name evaluations from 4,026 respondents for 600 names. In addition to respondent perceptions of race, income, education, and citizenship from names, our data also include respondent characteristics. Our data will be broadly helpful for researchers conducting experiments on the manifold ways in which race shapes American life.
License: CC-By Attribution 4.0 International
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Excel spreadsheet of the top 100 male and female first names for each year from 1954 to the most recent year, based on births registered in New Zealand during each year.
https://creativecommons.org/publicdomain/zero/1.0/
I started these datasets to learn how to manipulate files in different formats with Python. You can see the GitHub repo here: https://github.com/rokelina/names-analysis
This dataset comprises names of people in 18 different languages. It contains 18 text files, one per language, each listing names of people in that language.
This dataset belongs to PyTorch.
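With one text file per language, a common first step is to load the files into a dictionary keyed by language. The sketch below assumes one name per line and that each file is named after its language (for example `Italian.txt`); both are assumptions, and the demo builds two tiny stand-in files rather than reading the real data.

```python
import tempfile
from pathlib import Path


def load_names(directory):
    """Map each language (taken from the file stem) to its list of names."""
    names_by_language = {}
    for path in sorted(Path(directory).glob("*.txt")):
        lines = path.read_text(encoding="utf-8").splitlines()
        # Drop blank lines and surrounding whitespace.
        names_by_language[path.stem] = [ln.strip() for ln in lines if ln.strip()]
    return names_by_language


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in files; the real dataset has 18 of these.
    Path(tmp, "Italian.txt").write_text("Rossi\nBianchi\n", encoding="utf-8")
    Path(tmp, "Irish.txt").write_text("O'Neill\n\nMurphy\n", encoding="utf-8")
    demo = load_names(tmp)

print(demo)
```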
The dataset contains statistical information on the number of persons with a specific personal name or combination of personal names (multiple names) included in the Register of Natural Persons (until 06.28.2021, the Population Register). It should be noted that the Register of Natural Persons also includes personal names of foreigners in Latin-alphabet transliteration according to the travel document issued by the foreign state (for example, Nicola, Alex), which do not comply with the norms of the Latvian literary language.
As of 2023.10.01, the dataset contains information on the gender (male, female) of the personal names and combinations of personal names of persons registered in the Register of Natural Persons.
Fun Club Name Generator Dataset
This is a small, handcrafted dataset of random and fun club name ideas. The goal is to help people who are stuck naming something, whether it's a book club, a gaming group, a project, or just a Discord server between friends.
Why this?
A few friends and I spent hours trying to name a casual group — everything felt cringey, too serious, or already taken. We started writing down names that made us laugh, and eventually collected enough to… See the full description on the dataset page: https://huggingface.co/datasets/Laurenfromhere/fun-club-name-generator-dataset.
This dataset contains ranks and counts for the top 25 baby names by sex for live births that occurred in California (by occurrence) based on information entered on birth certificates.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Indian Names Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/ananysharma/indian-names-dataset on 28 January 2022.
--- Dataset description provided by original source is as follows ---
This dataset was useful to me for a project I was working on. The problem was to extract names from unstructured text, and I am still working on it. I decided to share it because some people might find it useful for Named Entity Recognition and other NLP tasks. If you want, you can work on how to extract names from unstructured text without any context, e.g. when names must be extracted from a document where no context is present. You can share your work and we can work together to improve it.
The dataset contains a male and a female dataset along with a Python preprocessing file for merging the two datasets. You can use either dataset on its own, or see how the two can be merged.
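Merging the two lists into one labelled table could look roughly like this. It is only a sketch, not the preprocessing file shipped with the dataset; the function name `merge_name_lists`, the lowercase normalization, and the sample names are assumptions.

```python
def merge_name_lists(male_names, female_names):
    """Combine two name lists into (name, gender) rows without duplicates."""
    rows = [(n.strip().lower(), "male") for n in male_names if n.strip()]
    rows += [(n.strip().lower(), "female") for n in female_names if n.strip()]

    seen = set()
    merged = []
    for row in rows:
        if row not in seen:  # keep the first occurrence of each pair
            seen.add(row)
            merged.append(row)
    return merged


# Sample names invented for illustration.
merged = merge_name_lists(["Arjun", "Kiran", "arjun "], ["Priya", "Kiran"])
print(merged)
```

A name that appears in both lists (here "kiran") is kept once per gender, so downstream code can decide how to treat ambiguous names.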
I learned of this dataset from a GitHub repository, which can be visited here
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
All cities with a population > 1000 or seats of administrative divisions (ca. 80,000).
Sources and Contributions
Sources: GeoNames aggregates over a hundred different data sources.
Ambassadors: GeoNames Ambassadors help in many countries.
Wiki: A wiki allows users to view the data, quickly fix errors, and add missing places.
Donations and Sponsoring: Costs for running GeoNames are covered by donations and sponsoring.
Enrichment: add country name
34808 Unique Names: This dataset consists of names of people of Indian race with their corresponding gender. It has three columns: Name, Gender, and Race.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This collection contains the datasets created as part of a master's thesis. The collection consists of two datasets in two forms, as well as the corresponding entity descriptions for each of the datasets.
The experiment_doc_labels_clean documents contain the data used for the experiments. The JSON file consists of a list of JSON objects. The JSON objects contain the following fields:
id: Document ID.
ner_tags: List of IOB tags indicating mention boundaries based on the majority label assigned using crowdsourcing.
el_tags: List of entity IDs based on the majority label assigned using crowdsourcing.
all_ner_tags: List of lists of IOB tags assigned by each of the users.
all_el_tags: List of lists of entity IDs assigned by each of the users annotating the data.
tokens: List of tokens from the text.
The experiment_doc_labels_clean-U.tsv contains the dataset used for the experiments, but in a format similar to the CoNLL-U format. The first line for each document contains the document ID. The documents are separated by a blank line. Each word in a document is on its own line, consisting of the word, the IOB tag, and the entity ID separated by tabs.
While the experiments were being completed, the annotation system was left open until all the documents had been annotated by three users. This resulted in the all_docs_complete_labels_clean.json and all_docs_complete_labels_clean-U.tsv datasets, which take the same form as experiment_doc_labels_clean.json and experiment_doc_labels_clean-U.tsv.
Each of the documents described above contains entity IDs. The IDs match the entities stored in the entity_descriptions CSV files. Each row in these files corresponds to a mention of an entity and takes the form: {ID}${Mention}${Context}[N]
Three sets of entity descriptions are available:
1. entity_descriptions_experiments.csv: This file contains all the mentions from the subset of the data used for the experiments as described above. However, the data has not been cleaned, so there are multiple entity IDs which actually refer to the same entity.
2. entity_descriptions_experiments_clean.csv: These entities also cover the data used for the experiments; however, duplicate entities have been merged. These entities correspond to the labels for the documents in the experiment_doc_labels_clean files.
3. entity_descriptions_all.csv: The entities in this file correspond to the data in the all_docs_complete_labels_clean files. Please note that the entities have not been cleaned, so there may be duplicate or incorrect entities.
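As a rough illustration of the two document formats described above, the sketch below parses the CoNLL-U-like layout (document ID line, then one word per line with IOB tag and entity ID, documents separated by a blank line) and derives per-token majority labels from per-user tag lists. It assumes tab-separated fields, allows a missing entity ID, and does not define tie-breaking; the sample text, tags, and entity IDs are invented, and the function names are hypothetical.

```python
import re
from collections import Counter


def parse_documents(text):
    """Parse the TSV-style layout into dicts shaped like the JSON objects."""
    docs = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [ln for ln in block.splitlines() if ln.strip()]
        if not lines:
            continue
        doc = {"id": lines[0].strip(), "tokens": [], "ner_tags": [], "el_tags": []}
        for ln in lines[1:]:
            fields = ln.split("\t")
            fields += [""] * (3 - len(fields))  # tolerate a missing entity ID
            token, ner, el = fields[:3]
            doc["tokens"].append(token)
            doc["ner_tags"].append(ner)
            doc["el_tags"].append(el)
        docs.append(doc)
    return docs


def majority_labels(per_user_tags):
    """Pick the most common tag per token across annotators (ties not handled)."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*per_user_tags)]


# Invented sample: two documents, made-up entity ID "E1".
sample = "doc-1\nParis\tB\tE1\nis\tO\t\nnice\tO\t\n\ndoc-2\nHello\tO\t\n"
docs = parse_documents(sample)
print(docs[0])

votes = [["B", "I", "O"], ["B", "O", "O"], ["B", "I", "O"]]
print(majority_labels(votes))
```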
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here are a few use cases for this project:
Language Learning Assistance: With the "Names" model, users can more easily learn to identify and differentiate between various word classes in the given character set, improving their reading and pronunciation skills in the languages that use these characters.
Optical Character Recognition (OCR): This model can be applied to develop an OCR system for accurately detecting text and word classes in images or scanned documents, aiding transcription, data extraction, and digitization of printed materials using these characters.
Speech-to-Text Conversion: The "Names" model can be integrated into speech-to-text systems that handle multiple languages using the given character set to help accurately transcribe spoken words and phrases, taking into account the identified word classes.
Document Analysis and Information Retrieval: Implement the model for analyzing and categorizing documents based on the identified word classes, helping to improve search results, content organization, and knowledge extraction from documents containing these characters.
Assistive Technologies: Utilize the "Names" model to develop tools for people with visual impairments, reading difficulties or learning disabilities, enabling them to understand and process text in languages that use the given character set more effectively.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about countries per year in Southern Africa. It has 320 rows. It features 4 columns: country, country full name, and individuals using the Internet.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about book series. It has 1 row and is filtered where the book is "America-naming the country and its people". It features 2 columns: book series and publication dates. The preview is ordered by number of books (descending).