https://creativecommons.org/publicdomain/zero/1.0/
Cultural diversity in the U.S. has led to great variations in names and naming traditions and names have been used to express creativity, personality, cultural identity, and values. Source: https://en.wikipedia.org/wiki/Naming_in_the_United_States
This public dataset was created by the Social Security Administration and contains all names from Social Security card applications for births that occurred in the United States after 1879. Note that many people born before 1937 never applied for a Social Security card, so their names are not included in this data. For others who did apply, records may not show the place of birth, and again their names are not included in the data.
All data are from a 100% sample of records on Social Security card applications as of the end of February 2015. To safeguard privacy, the Social Security Administration restricts names to those with at least 5 occurrences.
https://bigquery.cloud.google.com/dataset/bigquery-public-data:usa_names
https://cloud.google.com/bigquery/public-data/usa-names
Dataset Source: Data.gov. This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source — http://www.data.gov/privacy-policy#data_policy — and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.
Banner Photo by @dcp from Unsplash.
What are the most common names?
What are the most common female names?
Are there more female or male names?
Female names by a wide margin?
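The questions above come down to simple aggregations over (name, gender, number) records. The following is a minimal sketch, assuming rows of that shape; the sample rows and the helper names `most_common_names` and `distinct_names_by_gender` are invented for illustration and are not part of the dataset.

```python
from collections import Counter

# Hypothetical sample rows: (name, gender, number of applications).
rows = [
    ("Mary", "F", 120),
    ("James", "M", 150),
    ("Mary", "F", 80),
    ("Linda", "F", 90),
    ("James", "M", 30),
]


def most_common_names(rows, gender=None, top=3):
    """Sum the counts per name, optionally restricted to one gender."""
    totals = Counter()
    for name, row_gender, number in rows:
        if gender is None or row_gender == gender:
            totals[name] += number
    return totals.most_common(top)


def distinct_names_by_gender(rows):
    """Count how many distinct names occur for each gender."""
    names = {}
    for name, row_gender, _ in rows:
        names.setdefault(row_gender, set()).add(name)
    return {gender: len(found) for gender, found in names.items()}
```

Calling `most_common_names(rows, gender="F")` addresses the second question, and comparing the values returned by `distinct_names_by_gender(rows)` addresses the third.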
This statistic shows the most frequent combinations of first name and last name in the United States, as of 2013. According to this ranking, the name "James Smith" occurs most often and is most popular in the United States.
This dataset consists of template sentences associating first names ([NAME]) with third-person singular pronouns ([PRONOUN]), e.g., [NAME] asked , not sounding as if [PRONOUN] cared about the answer . after all , [NAME] was the same as [PRONOUN] 'd always been . there were moments when [NAME] was soft , when [PRONOUN] seemed more like the person [PRONOUN] had been .
Usage
from datasets import load_dataset
genter = load_dataset('aieng-lab/genter', trust_remote_code=True, split=split)
split can be either train, val, test, or all.
Dataset Details
Dataset Description
This dataset is a filtered version of BookCorpus that includes only sentences where a first name appears alongside its correct third-person singular pronoun (he/she).
From these sentences, template-based sentences (masked) are created with two template keys: [NAME] and [PRONOUN]. This design allows the dataset to generate diverse sentences by varying the names (e.g., using names from aieng-lab/namexact) and inserting the appropriate pronoun for each name.
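Filling a template is plain string substitution. A minimal sketch follows; the helper name `fill_template` is hypothetical, and the masked sentence is taken from the examples in this card.

```python
def fill_template(masked, name, pronoun):
    """Replace the [NAME] and [PRONOUN] keys in a masked sentence."""
    return masked.replace("[NAME]", name).replace("[PRONOUN]", pronoun)


masked = "[NAME] asked , not sounding as if [PRONOUN] cared about the answer ."

# Factual pairing: the name with its matching pronoun.
factual = fill_template(masked, "jessica", "she")

# Counterfactual pairing: the same name with the opposite pronoun.
counterfactual = fill_template(masked, "jessica", "he")
```

Because every occurrence of a key is replaced, templates with several [PRONOUN] slots (pronoun_count above 1) are handled by the same call.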
Dataset Sources
Repository: github.com/aieng-lab/gradiend
Original Data: BookCorpus
NOTE: This dataset is derived from BookCorpus, for which we do not have publication rights. Therefore, this repository only provides indices, names and pronouns referring to GENTER entries within the BookCorpus dataset on Hugging Face. By using load_dataset('aieng-lab/genter', trust_remote_code=True, split='all'), both the indices and the full BookCorpus dataset are downloaded locally. The indices are then used to construct the GENTER dataset. The initial dataset generation takes a few minutes, but subsequent loads are cached for faster access.
Dataset Structure
text: the original entry of BookCorpus
masked: the masked version of text, i.e., with template masks for the name ([NAME]) and the pronoun ([PRONOUN])
label: the gender of the original used name (F for female and M for male)
name: the original name in text that is masked in masked as [NAME]
pronoun: the original pronoun in text that is masked in masked as [PRONOUN]
pronoun_count: the number of occurrences of pronouns (typically 1, at most 4)
index: the index of text in BookCorpus
Examples:
index | text | masked | label | name | pronoun | pronoun_count
------|------|--------|-------|------|---------|--------------
71130173 | jessica asked , not sounding as if she cared about the answer . | [NAME] asked , not sounding as if [PRONOUN] cared about the answer . | M | jessica | she | 1
17316262 | jeremy looked around and there were many people at the campsite ; then he looked down at the small keg . | [NAME] looked around and there were many people at the campsite ; then [PRONOUN] looked down at the small keg . | F | jeremy | he | 1
41606581 | tabitha did n't seem to notice as she swayed to the loud , thrashing music . | [NAME] did n't seem to notice as [PRONOUN] swayed to the loud , thrashing music . | M | tabitha | she | 1
52926749 | gerald could come in now , have a look if he wanted . | [NAME] could come in now , have a look if [PRONOUN] wanted . | F | gerald | he | 1
47875293 | chapter six as time went by , matthew found that he was no longer certain that he cared for journalism . | chapter six as time went by , [NAME] found that [PRONOUN] was no longer certain that [PRONOUN] cared for journalism . | F | matthew | he | 2
73605732 | liam tried to keep a straight face , but he could n't hold back a smile . | [NAME] tried to keep a straight face , but [PRONOUN] could n't hold back a smile . | F | liam | he | 1
31376791 | after all , ella was the same as she 'd always been . | after all , [NAME] was the same as [PRONOUN] 'd always been . | M | ella | she | 1
61942082 | seth shrugs as he hops off the bed and lands on the floor with a thud . | [NAME] shrugs as [PRONOUN] hops off the bed and lands on the floor with a thud . | F | seth | he | 1
68696573 | graham 's eyes meet mine , but i 'm sure there 's no way he remembers what he promised me several hours ago until he stands , stretching . | [NAME] 's eyes meet mine , but i 'm sure there 's no way [PRONOUN] remembers what [PRONOUN] promised me several hours ago until [PRONOUN] stands , stretching . | F | graham | he | 3
28923447 | grief tore through me-the kind i had n't known would be possible to feel again , because i had felt this when i 'd held caleb as he died . | grief tore through me-the kind i had n't known would be possible to feel again , because i had felt this when i 'd held [NAME] as [PRONOUN] died . | F | caleb | he | 1
Dataset Creation
Curation Rationale
Training a gender bias GRADIEND model requires a diverse dataset that associates first names with both their factual and counterfactual pronouns, so that gender-related gradient information can be assessed.
Source Data
The dataset is derived from BookCorpus by filtering it and extracting the template structure.
We selected BookCorpus as the foundational dataset because of its focus on fictional narratives, where characters are often referred to by their first names. In contrast, the English Wikipedia, which is also commonly used for training transformer models, was less suitable for our purposes. For instance, a sentence like "[NAME] Jackson was a musician, [PRONOUN] was a great singer" may be biased towards the name Michael.
Data Collection and Processing
We filter the entries of BookCorpus and include only sentences that meet the following criteria:
Each sentence contains at least 50 characters.
Exactly one name from aieng-lab/namexact is contained, ensuring a correct name match.
No other names from a larger name dataset (aieng-lab/namextend) are included, ensuring that only a single name appears in the sentence.
The correct name's gender-specific third-person pronoun (he or she) is included at least once.
All occurrences of the pronoun appear after the name in the sentence.
The counterfactual pronoun does not appear in the sentence.
The sentence excludes gender-specific reflexive pronouns (himself, herself) and possessive pronouns (his, her, him, hers).
Gendered nouns (e.g., actor, actress, ...) are excluded, based on a gendered-word dataset with 2421 entries.
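The criteria above can be sketched as a single predicate. This is an illustrative sketch only, not the authors' implementation: the tiny name and word sets are hypothetical stand-ins for aieng-lab/namexact, aieng-lab/namextend and the gendered-word list, whitespace tokenization is assumed because BookCorpus text is lower-case and pre-tokenized, and "exactly one name" is interpreted here as exactly one name token.

```python
# Hypothetical stand-ins for the real resources, which are far larger.
NAMEXACT = {"jessica": "F", "jeremy": "M"}
NAMEXTEND = {"jessica", "jeremy", "alex", "sam"}
GENDERED_WORDS = {"actor", "actress", "king", "queen"}
EXCLUDED_PRONOUNS = {"himself", "herself", "his", "her", "him", "hers"}
PRONOUN = {"F": "she", "M": "he"}


def keep_sentence(sentence):
    """Return True if the sentence passes every criterion listed above."""
    if len(sentence) < 50:
        return False
    tokens = sentence.split()

    # Exactly one token must be a name from the exact-name list.
    exact_hits = [i for i, tok in enumerate(tokens) if tok in NAMEXACT]
    if len(exact_hits) != 1:
        return False
    name_pos = exact_hits[0]
    name = tokens[name_pos]

    # No other name from the larger name list may appear.
    if any(tok in NAMEXTEND and tok != name for tok in tokens):
        return False

    gender = NAMEXACT[name]
    correct = PRONOUN[gender]
    counterfactual = PRONOUN["M" if gender == "F" else "F"]

    # The correct pronoun must occur at least once, and only after the name.
    pronoun_positions = [i for i, tok in enumerate(tokens) if tok == correct]
    if not pronoun_positions or min(pronoun_positions) < name_pos:
        return False

    # The counterfactual pronoun must not occur at all.
    if counterfactual in tokens:
        return False

    # Reflexive and possessive pronouns and gendered nouns are excluded.
    if any(tok in EXCLUDED_PRONOUNS or tok in GENDERED_WORDS for tok in tokens):
        return False
    return True
```

Checking the cheap conditions first (length, name count) keeps the predicate fast when it is applied to a corpus the size of BookCorpus.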
This approach generated a total of 83772 sentences. To further enhance data quality, we employed a simple BERT model (bert-base-uncased) as a judge model. This model must predict the correct pronoun for selected names with high certainty; otherwise, the sentence may contain noise or ambiguous terms not caught by the initial filtering. Specifically, we used 50 female and 50 male names from the aieng-lab/namextend train split, and a correct prediction means the correct pronoun token is predicted as the token with the highest probability in the induced Masked Language Modeling (MLM) task. Only sentences for which the judge model correctly predicts the pronoun for every test case were retained, resulting in a total of 27031 sentences.
The data is split into training (87.5%), validation (2.5%) and test (10%) subsets.
Bias, Risks, and Limitations
Due to BookCorpus, only lower-case sentences are contained.
The data (name, year of birth, sex, and number) are from a 100 percent sample of Social Security card applications for 1880 onward.
As the popularity of social media has been growing in the past few years, the source wanted to know if French girls and boys aged between 11 and 14 and between 15 and 18 years old used their real name or identity on social media. Thus, most boys aged 11 to 14 did not expose their real name or identity on social media, compared to almost 81 percent of girls aged 15 to 18.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Count of popularity of adult first names (forenames, given names) in Peru, from an approximately 7% sample of the adult population.
In Peru, many people are registered as supporters of political parties, and their names are published by the Registro de Organizaciones Políticas. The lists include a DNI (national identity number) for each person to avoid duplicates. The 1,572,002 people on these lists (excluding the regional movements) represent around 7% of the adult population of Peru.
The first and middle names have been sorted and counted (there are an average of 1.6 first names for each person).
These 2,538,011 first (and middle) names represent 76,720 different names, most of which are infrequent. The file has been limited to names that occur ten or more times in the sample, which is 7,250 unique names (2,417,750 names, more than 95% of the total).
Each row in the file contains the rank, a percentage of that name in the entire set of 2,538,011 names, a count of the times the name occurs in the sample, and the name.
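The row layout described above (rank, percentage, count, name) can be reproduced from a raw list of names. A minimal sketch, assuming percentages are taken against the full set of names before the ten-occurrence cut-off is applied, as the description states; the helper name `build_rows` and the sample names are hypothetical, and tie-breaking between equal counts is not specified by the source.

```python
from collections import Counter


def build_rows(names, min_count=10):
    """Return (rank, percentage, count, name) rows for frequent names."""
    counts = Counter(names)
    total = sum(counts.values())  # full set, including infrequent names
    kept = [(name, n) for name, n in counts.most_common() if n >= min_count]
    return [
        (rank, round(100 * n / total, 3), n, name)
        for rank, (name, n) in enumerate(kept, start=1)
    ]


# Hypothetical sample: 100 names in total, one of them below the cut-off.
sample = ["MARIA"] * 50 + ["JOSE"] * 30 + ["LUIS"] * 15 + ["ANA"] * 5
rows = build_rows(sample)
```

Here ANA is dropped from the output but still counts towards the denominator, which is why the listed percentages sum to less than 100.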
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Count of family names (surnames, last names) in Peru, from an approximately 7% sample of the adult population.
In Peru, many people are registered as supporters of political parties, and their names are published by the Registro de Organizaciones Políticas. The lists include a DNI (national identity number) for each person to avoid duplicates. The 1,572,002 people on these lists (excluding the regional movements) represent around 7% of the adult population of Peru.
Their maternal and paternal family names have been sorted and counted. Nearly all of the names have entries for both paternal and maternal names.
These 3,142,561 family names represent 85,395 different names, most of which are infrequent. The file has been limited to names that occur ten or more times in the sample, which is 12,139 unique names (3,021,655 names, more than 96% of the total).
Each row in the file contains the rank, a percentage of that name in the entire set of 3,142,561 names, a count of the times the name occurs in the sample, and the name.
There are some names (around 800) in this file that contain a space. In most cases, these are names like "GARCIA DE RUIZ", where RUIZ is the name of the woman's husband. There are also cases where the name is like "DE LA CRUZ", which is a complete family name. No attempt has been made to remove the parts of names that refer to the husband's name; this could be considered for a later version.
As of January 2024, Nielsen was the most common surname in Denmark. That year, 229,000 people in the country bore the name, around 3,000 more than the second most popular surname, Jensen. Historically, most surnames in Denmark were created using the patronymic tradition until hereditary surnames became mandatory in the 1820s. This was also a common tradition in some of the other Nordic countries. For Danish surnames, this meant adding the suffix -sen (son) or -datter (daughter) to the father’s name.
Female names
The number of women in Denmark amounted to approximately 2.98 million in 2023. Among these, the most common first name was Anne, with around 44,100 women having the name that year. The name originally derived from the name Hannah or Anna. Other popular female names in Denmark were Kirsten, Mette, and Hanne.
Male names
Among the 2.95 million men living in Denmark as of 2023, Peter was the most frequent name. As of January 2024, around 46,500 men bore the name, which is also found in the variants Petar, Peder, and Petter. The names Jens, Michael, and Lars were also very common among Danish men.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Distribution of first and last name frequencies of academic authors by country.
Spreadsheet 1 contains 50 countries, with names based on affiliations in Scopus journal articles 2001-2021.
Spreadsheet 2 contains 200 countries, with names based on affiliations in Scopus journal articles 2001-2021, using a marginally updated last name extraction algorithm that is almost the same except for Dutch/Flemish names.
From the paper: Can national researcher mobility be tracked by first or last name uniqueness?
For example, the distribution for the UK shows a single peak for international names, with no national names; Belgium has a national peak and an international peak; and China has mainly a national peak. The 50 countries are:
No | Code | Country
---|------|--------
1 | SB | Serbia
2 | IE | Ireland
3 | HU | Hungary
4 | CL | Chile
5 | CO | Colombia
6 | NG | Nigeria
7 | HK | Hong Kong
8 | AR | Argentina
9 | SG | Singapore
10 | NZ | New Zealand
11 | PK | Pakistan
12 | TH | Thailand
13 | UA | Ukraine
14 | SA | Saudi Arabia
15 | RO | Romania
16 | ID | Indonesia
17 | IL | Israel
18 | MY | Malaysia
19 | DK | Denmark
20 | CZ | Czech Republic
21 | ZA | South Africa
22 | AT | Austria
23 | FI | Finland
24 | PT | Portugal
25 | GR | Greece
26 | NO | Norway
27 | EG | Egypt
28 | MX | Mexico
29 | BE | Belgium
30 | CH | Switzerland
31 | SW | Sweden
32 | PL | Poland
33 | TW | Taiwan
34 | NL | Netherlands
35 | TK | Turkey
36 | IR | Iran
37 | RU | Russia
38 | AU | Australia
39 | BR | Brazil
40 | KR | South Korea
41 | ES | Spain
42 | CA | Canada
43 | IT | Italy
44 | FR | France
45 | IN | India
46 | DE | Germany
47 | US | USA
48 | UK | UK
49 | JP | Japan
50 | CN | China
http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
This corpus comprises person names uttered by 100 speakers of different dialects, ages and educational levels. Speech samples are stored as a sequence of 16-bit 8kHz WAV files for a total of 4.57 hours of speech. The total size of the data is 328 Mb. Each speaker read 40 items. Text files are stored in Unicode format. All data have been proofread manually. The corpus is intended for testing telephone-based natural speech recognition systems.
34808 Unique Names: This dataset consists of names of Indian people with their corresponding gender. It has three columns: Name, Gender, Race.
On January 26, 2022, the French National Assembly adopted, in first reading and with modifications, a law proposal aimed at facilitating the steps of people who wish to bear the name of the parent which was not given to them at birth, whether it is the name of use (name of daily life) or the family name (the one registered on the civil status certificate). The text provides that from now on, people over 18 years old would be able to change their family name once in their life by deciding to take either their mother's name, their father's name, or both names, in the order they wish. According to a survey conducted among French people at the beginning of January 2022, more than one out of five French people would want to change their name if it were made possible by the public authorities. This proportion varies significantly according to age: nearly half of 18-24 year olds would want to change their name, compared to 14 percent of those over 50. In France, the majority of married women still bear their husband's name.
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
This corpus comprises 2,000 Japanese person names uttered by 200 speakers of different dialects, ages and educational levels, recorded over 4 channels. Speech samples are stored as a sequence of 16-bit 48kHz WAV files for 12.45 hours of speech per channel. The total size of the data is 5.67 Gb. Each speaker read 10 items. Text files are stored in Unicode format. All data have been proofread manually. The corpus is intended for testing telephone-based natural speech recognition systems. This corpus is partly included in ELRA-S0228-54.
Millennials were the largest generation group in the United States in 2024, with an estimated population of ***** million. Born between 1981 and 1996, Millennials recently surpassed Baby Boomers as the biggest group, and they will continue to be a major part of the population for many years.
The rise of Generation Alpha
Generation Alpha is the most recent to have been named, and many group members will not be able to remember a time before smartphones and social media. As of 2024, the oldest Generation Alpha members were still only aging into adolescents. However, the group already makes up around ***** percent of the U.S. population, and they are said to be the most racially and ethnically diverse of all the generation groups.
Boomers vs. Millennials
The number of Baby Boomers, whose generation was defined by the boom in births following the Second World War, has fallen by around ***** million since 2010. However, they remain the second-largest generation group, and aging Boomers are contributing to steady increases in the median age of the population. Meanwhile, the Millennial generation continues to grow, and one reason for this is the increasing number of young immigrants arriving in the United States.
Francisco was the most popular first name for boys registered in Portugal in 2024, with 1,270 registrations. Lourenço followed, with 1,040 newborn baby boys given this name, while Vicente and Tomás completed the podium, with 1,036 registrations each. In 2024, the names for baby girls in Portugal were dominated by Maria, which was registered 4,295 times. Alice and Benedita followed at a distance, with an average of 980 registrations each.
Sinking birth rates and rising life expectancy in Portugal and throughout Europe
Europe’s crude birth rate was 9.2 in 2022, having slumped compared to previous decades. The low birth rates on the continent coincided with increasing life expectancy, which underlines the aging of the European population. Also in 2022, Portugal had one of the continent’s lowest birth rates, namely 7.8, and the average age of women when giving birth to their first child has risen continuously over the last decade. However, since 2021 there has been a decrease.
Decreasing population in Portugal, but growing numbers of elderly people
The Portuguese population is expected to decrease during the upcoming decade. By 2035, Portugal’s population is predicted to number fewer than 10 million, almost 2.9 million of whom will be 65 years of age and older. This figure represents an increase of almost 700,000 senior citizens compared to the figures recorded in 2015.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This repository contains a gold standard dataset for named entity recognition and span categorization annotations from Diderot & d’Alembert’s Encyclopédie entries.
The dataset is available in the following formats:
JSONL format provided by Prodigy
binary spaCy format (ready to use with the spaCy train pipeline)
The Gold Standard dataset is composed of 2,200 paragraphs randomly selected from 2,001 Encyclopédie entries. All paragraphs were written in 18th-century French.
The spans/entities were labeled by the project team, with pre-labelling by early machine learning models used to speed up the process. A train/val/test split was used. Validation and test sets are composed of 200 paragraphs each: 100 classified under 'Géographie' and 100 from another knowledge domain. The datasets have the following breakdown of tokens and spans/entities.
Tagset
NC-Spatial: a common noun that identifies a spatial entity (nominal spatial entity) including natural features, e.g. ville, la rivière, royaume.
NP-Spatial: a proper noun identifying the name of a place (spatial named entities), e.g. France, Paris, la Chine.
ENE-Spatial: nested spatial entity , e.g. ville de France , royaume de Naples, la mer Baltique.
Relation: spatial relation, e.g. dans, sur, à 10 lieues de.
Latlong: geographic coordinates, e.g. Long. 19. 49. lat. 43. 55. 44.
NC-Person: a common noun that identifies a person (nominal person entity), e.g. roi, l'empereur, les auteurs.
NP-Person: a proper noun identifying the name of a person (person named entities), e.g. Louis XIV, Pline.
ENE-Person: nested people entity, e.g. le czar Pierre, roi de Macédoine.
NP-Misc: a proper noun identifying entities not classified as spatial or person, e.g. l'Eglise, 1702, Pélasgique
ENE-Misc: nested named entity not classified as spatial or person, e.g. l'ordre de S. Jacques, la déclaration du 21 Mars 1671.
Head: entry name
Domain-Mark: words indicating the knowledge domain (usually after the head and in parentheses), e.g. Géographie, Geog., en Anatomie.
HuggingFace
The GeoEDdA dataset is available on the HuggingFace Hub: https://huggingface.co/datasets/GEODE/GeoEDdA
spaCy Custom Spancat trained on Diderot & d’Alembert’s Encyclopédie entries
This dataset was used to train and evaluate a custom spancat model for French using spaCy. The model is available on HuggingFace's model hub: https://huggingface.co/GEODE/fr_spacy_custom_spancat_edda.
Acknowledgement
The authors are grateful to the ASLAN project (ANR-10-LABX-0081) of the Université de Lyon, for its financial support within the French program "Investments for the Future" operated by the National Research Agency (ANR). Data courtesy the ARTFL Encyclopédie Project, University of Chicago.
Among the 2.98 million female inhabitants of Denmark in 2023, the most common name was Anne. As of January 2024, around 43,700 Danish women bore the name. There are several variations of it, such as Anna, which is also very popular in Denmark and, with approximately 34,000 bearers, ranked fifth.
Danish surnames
Most Danish last names are based on patronymics. Until the 1820s, it was common for ordinary people to use the Christian name of a person’s father, followed by “sen” (=son) or “datter” (=daughter), to create the surname. Nielsen, for example, is a patronymic surname meaning “son of Niels” and ranked first among all common surnames in Denmark as of January 2024.
Female names in other Nordic countries
In Iceland, Norway and Sweden, the first name Anne or its variation Anna is very common as well. In 2023, Anna was the most common female name in Iceland. In Sweden, Anna also ranked first in 2022, and in Norway, 70,000 women bore the name Anne.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
First names and last names by country according to affiliations in journal articles 2001-2021 as recorded in Scopus. For 200 countries, there is a complete list of all first names and all last names of at least one researcher with a national affiliation in that country. Each file also records: the number of researchers with that name in the country, the proportion of researchers with that name in the country compared to the world, and the number of researchers with that name in the world.
For example, for the USA:
Name | Authors in USA | Proportion in USA | Total
-----|----------------|-------------------|------
Sadrach | 3 | 1.000 | 3
Rangsan | 1 | 0.083 | 12
Parry | 6 | 0.273 | 22
Howard | 2008 | 0.733 | 2739
Only the first parts of double last names are included. For example, Rodriquez Gonzalez, Maria would have only Rodriquez recorded.
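The three recorded values and the double-last-name rule can be sketched as follows. This is an illustration under stated assumptions, not the paper's extraction algorithm: author strings are assumed to have the form "Last, First", each researcher is assumed to have a single country, and the helper names `last_name_key` and `name_table` and the sample authors are hypothetical.

```python
from collections import Counter


def last_name_key(author):
    """Return the first part of the last name from a 'Last, First' string."""
    last = author.split(",", 1)[0].strip()
    return last.split()[0]


def name_table(authors_by_country, country):
    """Map each last name to (authors in country, proportion, world total)."""
    world = Counter()
    national = Counter()
    for current, authors in authors_by_country.items():
        for author in authors:
            key = last_name_key(author)
            world[key] += 1
            if current == country:
                national[key] += 1
    return {
        name: (count, round(count / world[name], 3), world[name])
        for name, count in national.items()
    }


# Hypothetical sample data.
authors = {
    "USA": ["Parry, Ann", "Howard, Tom", "Rodriquez Gonzalez, Maria"],
    "UK": ["Parry, Jo", "Parry, Sue", "Howard, Lee"],
}
usa = name_table(authors, "USA")
```

The proportion is the national count divided by the world count, which matches the example rows above (for instance, 6 of 22 gives 0.273 for Parry).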
This is from the paper: "Can national researcher mobility be tracked by first or last name uniqueness"
List of countries Afghanistan; Albania; Algeria; Angola; Argentina; Armenia; Australia; Austria; Azerbaijan; Bahamas; Bahrain; Bangladesh; Barbados; Belarus; Belgium; Belize; Benin; Bermuda; Bhutan; Bolivia; Bosnia and Herzegovina; Botswana; Brazil; Brunei Darussalam; Bulgaria; Burkina Faso; Burundi; Cambodia; Cameroon; Canada; Cape Verde; Cayman Islands; Central African Republic; Chad; Chile; China; Colombia; Congo; Costa Rica; Cote d'Ivoire; Croatia; Cuba; Cyprus; Czech Republic; Democratic Republic Congo; Denmark; Djibouti; Dominican Republic; Ecuador; Egypt; El Salvador; Eritrea; Estonia; Ethiopia; Falkland Islands (Malvinas); Faroe Islands; Federated States of Micronesia; Fiji; Finland; France; French Guiana; French Polynesia; Gabon; Gambia; Georgia; Germany; Ghana; Greece; Greenland; Grenada; Guadeloupe; Guam; Guatemala; Guinea; Guinea-Bissau; Guyana; Haiti; Honduras; Hong Kong; Hungary; Iceland; India; Indonesia; Iran; Iraq; Ireland; Israel; Italy; Jamaica; Japan; Jordan; Kazakhstan; Kenya; Kuwait; Kyrgyzstan; Laos; Latvia; Lebanon; Lesotho; Liberia; Libyan Arab Jamahiriya; Liechtenstein; Lithuania; Luxembourg; Macao; Macedonia; Madagascar; Malawi; Malaysia; Maldives; Mali; Malta; Martinique; Mauritania; Mauritius; Mexico; Moldova; Monaco; Mongolia; Montenegro; Morocco; Mozambique; Myanmar; Namibia; Nepal; Netherlands; New Caledonia; New Zealand; Nicaragua; Niger; Nigeria; North Korea; North Macedonia; Norway; Oman; Pakistan; Palau; Palestine; Panama; Papua New Guinea; Paraguay; Peru; Philippines; Poland; Portugal; Puerto Rico; Qatar; Reunion; Romania; Russia; Russian Federation; Rwanda; Saint Kitts and Nevis; Samoa; San Marino; Saudi Arabia; Senegal; Serbia; Seychelles; Sierra Leone; Singapore; Slovakia; Slovenia; Solomon Islands; Somalia; South Africa; South Korea; South Sudan; Spain; Sri Lanka; Sudan; Suriname; Swaziland; Sweden; Switzerland; Syrian Arab Republic; Taiwan; Tajikistan; Tanzania; Thailand; Timor-Leste; Togo; Trinidad and Tobago; Tunisia; 
Turkey; Uganda; Ukraine; United Arab Emirates; United Kingdom; United States; Uruguay; Uzbekistan; Vanuatu; Venezuela; Viet Nam; Virgin Islands (U.S.); Yemen; Yugoslavia; Zambia; Zimbabwe
The Bureau of the Census has released Census 2000 Summary File 1 (SF1) 100-Percent data. The file includes the following population items: sex, age, race, Hispanic or Latino origin, household relationship, and household and family characteristics. Housing items include occupancy status and tenure (whether the unit is owner or renter occupied). SF1 does not include information on incomes, poverty status, overcrowded housing or age of housing. These topics will be covered in Summary File 3. Data are available for states, counties, county subdivisions, places, census tracts, block groups, and, where applicable, American Indian and Alaskan Native Areas and Hawaiian Home Lands. The SF1 data are available on the Bureau's web site and may be retrieved from American FactFinder as tables, lists, or maps. Users may also download a set of compressed ASCII files for each state via the Bureau's FTP server. There are over 8000 data items available for each geographic area. The full listing of these data items is available here as a downloadable compressed database file named TABLES.ZIP. The uncompressed file is in FoxPro database file (dbf) format and may be imported to ACCESS, EXCEL, and other software formats. While all of this information is useful, the Office of Community Planning and Development has downloaded selected information for all states and areas and is making this information available on the CPD web pages. The tables and data items selected are those items used in the CDBG and HOME allocation formulas plus topics most pertinent to the Comprehensive Housing Affordability Strategy (CHAS), the Consolidated Plan, and similar overall economic and community development plans. The information is contained in five compressed (zipped) dbf tables for each state. When uncompressed, the tables are ready for use with FoxPro and they can be imported into ACCESS, EXCEL, and other spreadsheet, GIS and database software. The data are at the block group summary level.
The first two characters of the file name are the state abbreviation. The next two letters are BG for block group. Each record is labeled with the code and name of the city and county in which it is located so that the data can be summarized to higher-level geography. The last part of the file name describes the contents. The GEO file contains standard Census Bureau geographic identifiers for each block group, such as the metropolitan area code and congressional district code. The only data included in this table are total population and total housing units. POP1 and POP2 contain selected population variables, and selected housing items are in the HU file. The MA05 table data is only for use by State CDBG grantees for the reporting of the racial composition of beneficiaries of Area Benefit activities. The complete package for a state consists of the dictionary file named TABLES and the five data files for the state. The logical record number (LOGRECNO) links the records across tables.
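The naming convention and the LOGRECNO link can be sketched as below. This is a minimal illustration: the stem "ALBGGEO" is a hypothetical file name built from the stated convention, and every field name other than LOGRECNO is invented for the example.

```python
def parse_file_name(stem):
    """Split a file-name stem into state abbreviation, level and contents."""
    state, level, contents = stem[:2], stem[2:4], stem[4:]
    if level != "BG":
        raise ValueError(f"expected block-group marker 'BG', got {level!r}")
    return {"state": state, "level": level, "contents": contents}


def join_on_logrecno(*tables):
    """Merge records from several tables that share a LOGRECNO value."""
    merged = {}
    for table in tables:
        for record in table:
            merged.setdefault(record["LOGRECNO"], {}).update(record)
    return merged


# Hypothetical records standing in for rows of the GEO and HU tables.
geo = [{"LOGRECNO": 1, "TOTPOP": 1200}, {"LOGRECNO": 2, "TOTPOP": 800}]
hu = [{"LOGRECNO": 1, "OWNER_OCC": 300}]

info = parse_file_name("ALBGGEO")
combined = join_on_logrecno(geo, hu)
```

The merge keeps records that appear in only one table, so a block group missing from the HU table still carries its GEO fields.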
This dataset lists popular last names among the White population of the United States.