https://creativecommons.org/publicdomain/zero/1.0/
Cultural diversity in the U.S. has led to great variation in names and naming traditions; names have been used to express creativity, personality, cultural identity, and values. Source: https://en.wikipedia.org/wiki/Naming_in_the_United_States
This public dataset was created by the Social Security Administration and contains all names from Social Security card applications for births that occurred in the United States after 1879. Note that many people born before 1937 never applied for a Social Security card, so their names are not included in this data. For others who did apply, records may not show the place of birth, and again their names are not included in the data.
All data are from a 100% sample of records on Social Security card applications as of the end of February 2015. To safeguard privacy, the Social Security Administration restricts names to those with at least 5 occurrences.
Fork this kernel to get started with this dataset.
https://bigquery.cloud.google.com/dataset/bigquery-public-data:usa_names
https://cloud.google.com/bigquery/public-data/usa-names
Dataset Source: Data.gov. This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source — http://www.data.gov/privacy-policy#data_policy — and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.
Banner Photo by @dcp from Unsplash.
What are the most common names?
What are the most common female names?
Are there more female or male names?
Do female names outnumber male names by a wide margin? A starter query for questions like these is sketched below.
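These questions can be explored directly in BigQuery. Below is a minimal sketch using the google-cloud-bigquery Python client; the table name usa_1910_current is an assumption based on the public dataset listing, so adjust it to whatever table your project exposes.

```python
# Minimal sketch: most common names overall in the USA names public dataset.
# Assumes the table bigquery-public-data.usa_names.usa_1910_current exists
# and has columns name and number; requires a GCP project with BigQuery enabled.
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_current`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 10
"""

for row in client.query(query).result():
    print(f"{row.name}: {row.total}")
```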
We provide datasets that estimate the racial distributions associated with first, middle, and last names in the United States. The datasets cover five racial categories: White, Black, Hispanic, Asian, and Other. The provided data are computed from the voter files of six Southern states -- Alabama, Florida, Georgia, Louisiana, North Carolina, and South Carolina -- that collect race and ethnicity data upon registration. We include seven voter files per state, sourced between 2018 and 2021 from L2, Inc. Together, these states cover approximately 36 million individuals who provide self-reported race and ethnicity. The last name dataset includes 338K surnames, the middle name dataset includes 126K middle names, and the first name dataset includes 136K first names. For each type of name, we provide a dataset of P(race | name) probabilities and a dataset of P(name | race) probabilities. We include only names that appear at least 25 times across the 42 (= 7 voter files * 6 states) voter files in our dataset. These data are closely related to the dataset "Name Dictionaries for the 'wru' R Package", https://doi.org/10.7910/DVN/7TRYAC. These are the probabilities used in the latest iteration of the "wru" package (Khanna et al., 2022) to make probabilistic predictions about the race of individuals, given their names and geolocations.
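To illustrate how the two kinds of tables relate, the sketch below applies Bayes' rule to recover P(race | name) from P(name | race) and a set of race priors. The numbers and the dictionary layout are purely illustrative and are not the schema of the published files.

```python
# Sketch: recovering P(race | name) from P(name | race) and race priors via
# Bayes' rule. All values below are made up for illustration.
RACES = ["white", "black", "hispanic", "asian", "other"]

# P(name | race) for one hypothetical surname.
p_name_given_race = {"white": 0.0004, "black": 0.0019,
                     "hispanic": 0.0001, "asian": 0.00005, "other": 0.0003}

# Prior P(race), e.g. population shares from the voter files or the census.
p_race = {"white": 0.62, "black": 0.21, "hispanic": 0.10,
          "asian": 0.03, "other": 0.04}

# Bayes' rule: P(race | name) is proportional to P(name | race) * P(race).
joint = {r: p_name_given_race[r] * p_race[r] for r in RACES}
total = sum(joint.values())
p_race_given_name = {r: joint[r] / total for r in RACES}

for r in RACES:
    print(f"P({r} | name) = {p_race_given_name[r]:.3f}")
```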
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract
Motivation: to create a challenging dataset for testing Named-Entity Linking. The Namesakes dataset consists of three closely related datasets: Entities, News, and Backlinks. Entities were collected as Wikipedia text chunks corresponding to highly ambiguous entity names. News were collected as random news text chunks containing mentions that either belong to the Entities dataset or can be easily confused with them. Backlinks were obtained from Wikipedia dump data with the intention of having mentions linked to the entities of the Entities dataset. The Entities and News are human-labeled, resolving the mentions of the entities.
Methods
Entities were collected as Wikipedia text chunks corresponding to highly ambiguous entity names: the most popular people names, the most popular locations, and organizations with name ambiguity. In each Entities text chunk, the named entities whose name is similar to the chunk's Wikipedia page name are labeled. For labeling, these entities were presented to human annotators (odetta.ai) to tag as "Same" (same as the page entity) or "Other". The labeling was done by 6 experienced annotators who passed a preliminary trial task. The only accepted tags are those assigned in agreement by at least 5 annotators and then passed through reconciliation with an experienced reconciliator.
The News were collected as random news text chunks containing mentions that either belong to the Entities dataset or can be easily confused with them. In each News text chunk, one mention was selected for labeling, and 3-10 Wikipedia pages from Entities were suggested as the labels for an annotator to choose from. The labeling was done by 3 experienced annotators (odetta.ai) after they passed a preliminary trial task. The results were reconciled by an experienced reconciliator. All the labeling was done using Lighttag (lighttag.io).
Backlinks were obtained from Wikipedia dump data (dumps.wikimedia.org/enwiki/20210701) with the intention of having mentions linked to the entities of the Entities dataset. The backlinks were filtered to keep only mentions in good-quality text; each text was cut 1000 characters after the last mention.
Usage Notes
Entities:
File: Namesakes_entities.jsonl
The Entities dataset consists of 4148 Wikipedia text chunks containing human-tagged mentions of entities. Each mention is tagged either as "Same" (meaning that the mention is of this Wikipedia page entity) or "Other" (meaning that the mention is of some other entity, just having the same or a similar name). The Entities dataset is a jsonl list; each item is a dictionary with the following keys and values:
Key 'pagename': page name of the Wikipedia page.
Key 'pageid': page id of the Wikipedia page.
Key 'title': title of the Wikipedia page.
Key 'url': URL of the Wikipedia page.
Key 'text': the text chunk from the Wikipedia page.
Key 'entities': list of the mentions in the page text; each entity is represented by a dictionary with the keys:
  Key 'text': the mention as a string from the page text.
  Key 'start': start character position of the entity in the text.
  Key 'end': end (one-past-last) character position of the entity in the text.
  Key 'tag': annotation tag given as a string, either 'Same' or 'Other'.
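A minimal sketch of reading the Entities file, using only the keys listed above (the field names are taken from the description; nothing else is assumed):

```python
# Read Namesakes_entities.jsonl and print the human-tagged mentions of the
# first text chunk, recovering each mention span from its character offsets.
import json

with open("Namesakes_entities.jsonl", encoding="utf-8") as f:
    for line in f:
        item = json.loads(line)
        text = item["text"]
        for ent in item["entities"]:
            mention = text[ent["start"]:ent["end"]]
            # 'tag' is either 'Same' (the page entity) or 'Other'
            print(item["pagename"], repr(mention), ent["tag"])
        break  # inspect just the first chunk
```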
News:
File: Namesakes_news.jsonl
The News dataset consists of 1000 news text chunks, each with a single annotated entity mention. The annotation either points to the corresponding entity from the Entities dataset (if the mention is of that entity) or indicates that the mentioned entity does not belong to the Entities dataset. The News dataset is a jsonl list; each item is a dictionary with the following keys and values:
Key 'id_text': id of the sample.
Key 'text': the text chunk.
Key 'urls': list of URLs of Wikipedia entities suggested to labelers for identification of the entity mentioned in the text.
Key 'entity': a dictionary describing the annotated entity mention in the text:
  Key 'text': the mention as a string found by an NER model in the text.
  Key 'start': start character position of the mention in the text.
  Key 'end': end (one-past-last) character position of the mention in the text.
  Key 'tag': this key exists only if the mentioned entity is annotated as belonging to the Entities dataset; if so, the value is a dictionary identifying the Wikipedia page assigned by annotators to the mentioned entity:
    Key 'pageid': Wikipedia page id.
    Key 'pagetitle': page title.
    Key 'url': page URL.
Backlinks dataset: The Backlinks dataset consists of two parts: dictionary Entity-to-Backlinks and Backlinks documents. The dictionary points to backlinks for each entity of the Entity dataset (if any backlinks exist for the entity). The Backlinks documents are the backlinks Wikipedia text chunks with identified mentions of the entities from the Entities dataset.
Each mention is identified by surrounding double square brackets, e.g. "Muir built a small cabin along [[Yosemite Creek]].". However, if the mention differs from the exact entity name, the double square brackets wrap both the exact name and, separated by '|', the mention string, for example: "Muir also spent time with photographer [[Carleton E. Watkins | Carleton Watkins]] and studied his photographs of Yosemite.". A parsing sketch follows the file descriptions below.
The Entity-to-Backlinks dictionary is a jsonl with 1527 items (file: Namesakes_backlinks_entities.jsonl). Each item is a tuple: entity name; entity Wikipedia page id; backlink ids (a list of pageids of backlink documents).
The Backlinks documents are a jsonl with 26903 items (file: Namesakes_backlinks_texts.jsonl). Each item is a dictionary:
Key 'pageid': id of the Wikipedia page.
Key 'title': title of the Wikipedia page.
Key 'content': text chunk from the Wikipedia page, with all mentions in double brackets; the text is cut 1000 characters after the last mention, and the cut is denoted as '...[CUT]'.
Key 'mentions': list of the mentions from the text, for convenience. Each mention is a tuple: entity name; entity Wikipedia page id; sorted list of all character indexes at which the mention occurrences start in the text.
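Based on the bracket convention described above, the following sketch extracts (entity name, mention string) pairs from a backlink text chunk. The regular expression is my own approximation of that markup, not code shipped with the dataset.

```python
# Extract mentions of the form [[Entity]] or [[Entity | mention string]]
# from a Backlinks 'content' field.
import re

MENTION_RE = re.compile(r"\[\[([^\[\]|]+?)(?:\s*\|\s*([^\[\]]+?))?\]\]")

def extract_mentions(content: str):
    """Return (entity_name, mention_text) pairs from a backlink chunk."""
    pairs = []
    for match in MENTION_RE.finditer(content):
        entity = match.group(1).strip()
        mention = (match.group(2) or match.group(1)).strip()
        pairs.append((entity, mention))
    return pairs

sample = ("Muir also spent time with photographer "
          "[[Carleton E. Watkins | Carleton Watkins]] and studied his "
          "photographs of [[Yosemite Creek]].")
print(extract_mentions(sample))
# [('Carleton E. Watkins', 'Carleton Watkins'), ('Yosemite Creek', 'Yosemite Creek')]
```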
http://opendatacommons.org/licenses/dbcl/1.0/
So I just recently finished my Google Data Analytics Certification and I want to stay fresh! I went through the public datasets and found USA Names. I thought it might be fun to pull some information on just my name and analyze it.
You will find the State, Year, and Count.
Big Query Public Data Set USA Names
Need to sharpen my skills
Popular Baby Names by Sex and Ethnic Group. Data were collected through civil birth registration. Each record represents the ranking of a baby name in order of frequency. Data can be used to represent the popularity of a name. Caution should be used when assessing the rank of a baby name if the frequency count is close to 10; the ranking may vary year to year.
The data (name, year of birth, sex, and number) are from a 100 percent sample of Social Security card applications from 1880 on.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This collection contains the datasets created as part of a master's thesis. The collection consists of two datasets, each in two forms, as well as the corresponding entity descriptions for each dataset.
The experiment_doc_labels_clean documents contain the data used for the experiments. The JSON file consists of a list of JSON objects with the following fields:
id: document id.
ner_tags: list of IOB tags indicating mention boundaries, based on the majority label assigned using crowdsourcing.
el_tags: list of entity ids, based on the majority label assigned using crowdsourcing.
all_ner_tags: list of lists of IOB tags assigned by each of the users.
all_el_tags: list of lists of entity ids assigned by each of the users annotating the data.
tokens: list of tokens from the text.
The experiment_doc_labels_clean-U.tsv file contains the dataset used for the experiments, but in a format similar to the CoNLL-U format. The first line for each document contains the document id. Documents are separated by a blank line. Each word in a document is on its own line, consisting of the word, the IOB tag, and the entity id, separated by tabs.
While the experiments were being completed, the annotation system was left open until all the documents had been annotated by three users. This resulted in the all_docs_complete_labels_clean.json and all_docs_complete_labels_clean-U.tsv datasets, which take the same form as experiment_doc_labels_clean.json and experiment_doc_labels_clean-U.tsv.
Each of the documents described above contains entity ids. The ids match the entities stored in the entity_descriptions CSV files. Each row in these files corresponds to a mention of an entity and takes the form: {ID}${Mention}${Context}[N]
Three sets of entity descriptions are available:
1. entity_descriptions_experiments.csv: This file contains all the mentions from the subset of the data used for the experiments, as described above. However, the data has not been cleaned, so there are multiple entity ids that actually refer to the same entity.
2. entity_descriptions_experiments_clean.csv: These entities also cover the data used for the experiments, but duplicate entities have been merged. These entities correspond to the labels for the documents in the experiment_doc_labels_clean files.
3. entity_descriptions_all.csv: The entities in this file correspond to the data in all_docs_complete_labels_clean. Please note that the entities have not been cleaned, so there may be duplicate or incorrect entities.
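A short sketch of loading both file types, based only on the field layout described above; the exact number of '$'-separated fields per row is an assumption, so the context is re-joined defensively.

```python
# Sketch: read the experiment documents (JSON) and the '$'-separated entity
# descriptions, using the file names and fields named in the description.
import csv
import json

# Each document: id, tokens, ner_tags (IOB), el_tags (entity ids), plus the
# per-annotator lists all_ner_tags / all_el_tags.
with open("experiment_doc_labels_clean.json", encoding="utf-8") as f:
    documents = json.load(f)

doc = documents[0]
for token, iob, entity_id in zip(doc["tokens"], doc["ner_tags"], doc["el_tags"]):
    print(token, iob, entity_id)

# Entity descriptions: one mention per row, ID $ Mention $ Context.
with open("entity_descriptions_experiments_clean.csv", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter="$")
    mentions_by_id = {}
    for row in reader:
        # Re-join any extra fields in case the context itself contains '$'.
        entity_id, mention, context = row[0], row[1], "$".join(row[2:])
        mentions_by_id.setdefault(entity_id, []).append((mention, context))
```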
https://choosealicense.com/licenses/agpl-3.0/
This is a modified version of KaraKaraWitch/PIPPA-ShareGPT-formatted. I added randomized names for each conversation, moved the description of the character into a system message, and did some other cleanup. The randomized names might be causing some problems, like the bot using incorrect pronouns.
Original dataset card:
KaraKaraWitch/PIPPA-IHaveNeverFeltNeedToSend
I've never felt the need to send a photo of my
The following is the… See the full description on the dataset page: https://huggingface.co/datasets/mpasila/PIPPA-ShareGPT-formatted-named.
The first names file contains data on the first names given to children born in France since 1900. These data are available for France as a whole and by department.
The files available for download list births, not living people, in a given year. They are available in two formats (DBASE and CSV). To use these large files, it is recommended to use a database manager or statistical software (a loading sketch follows the file list below). The file at the national level can be opened in some spreadsheet programs. The file at the departmental level, however, is too large (3.8 million lines) to be opened in a spreadsheet, so a lighter version covering births since 2000 only is also provided.
The data can be accessed in:
- a national data file containing the first names given to children born in France between 1900 and 2022 (data before 2012 relate only to France excluding Mayotte) and the counts by sex associated with each first name;
- a departmental data file containing the same information at the level of the department of birth;
- a lighter data file that contains information at the department-of-birth level since the year 2000.
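As a starting point, the sketch below loads the national file with pandas. The file name, separator, and column names (sexe, preusuel, annais, nombre) are assumptions about INSEE's published layout and should be checked against the downloaded file.

```python
# Sketch: explore the national first-names file with pandas.
# File name and column names are assumptions; adjust to the actual download.
import pandas as pd

names = pd.read_csv("nat2022.csv", sep=";", dtype={"annais": str})

# Total births recorded for one first name across all years.
marie = names[names["preusuel"] == "MARIE"]
print(marie["nombre"].sum())

# Top 10 first names by total count over the whole period.
print(names.groupby("preusuel")["nombre"].sum().nlargest(10))
```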
The writers are Europeans who often write Spanish. The capture device is a scanner, and the collection angle is eye level. The dataset content includes addresses, company names, and personal names. The dataset can be used for Spanish OCR models and handwritten text recognition systems.
The Places dataset was published on September 22, 2025 by the U.S. Department of Commerce, U.S. Census Bureau, Geography Division and is part of the U.S. Department of Transportation (USDOT)/Bureau of Transportation Statistics (BTS) National Transportation Atlas Database (NTAD). This resource is a member of a series. The TIGER/Line shapefiles and related database files (.dbf) are an extract of selected geographic and cartographic information from the U.S. Census Bureau's Master Address File / Topologically Integrated Geographic Encoding and Referencing (MAF/TIGER) System (MTS). The MTS represents a seamless national file with no overlaps or gaps between parts; however, each TIGER/Line shapefile is designed to stand alone as an independent data set, or they can be combined to cover the entire nation. The TIGER/Line shapefiles include both incorporated places (legal entities) and census designated places or CDPs (statistical entities). An incorporated place is established to provide governmental functions for a concentration of people as opposed to a minor civil division (MCD), which generally is created to provide services or administer an area without regard, necessarily, to population. Places always nest within a state but may extend across county and county subdivision boundaries. An incorporated place is usually a city, town, village, or borough, but can have other legal descriptions. CDPs are delineated for the decennial census as the statistical counterparts of incorporated places. CDPs are delineated to provide data for settled concentrations of population that are identifiable by name but are not legally incorporated under the laws of the state in which they are located. The boundaries for CDPs are often defined in partnership with state, local, and/or tribal officials and usually coincide with visible features or the boundary of an adjacent incorporated place or another legal entity. CDP boundaries often change from one decennial census to the next with changes in the settlement pattern and development; a CDP with the same name as in an earlier census does not necessarily have the same boundary. The only population/housing size requirement for CDPs is that they must contain some housing and population. The boundaries of most incorporated places in this shapefile are as of January 1, 2024, as reported through the Census Bureau's Boundary and Annexation Survey (BAS). The boundaries of all CDPs were delineated as part of the Census Bureau's Participant Statistical Areas Program (PSAP) for the 2020 Census, but some CDPs were added or updated through the 2024 BAS as well. A data dictionary, or other source of attribute information, is accessible at https://doi.org/10.21949/1529072
Demographic data on first names, in particular race.
Notice that there is a row for each of many first names, plus a row for "all other first names" that can throw off your analysis.
I got this data from the Harvard Dataverse: https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/TYJKEZ/MPMHFE&version=1.3, and I did absolutely nothing and all credit is theirs. I'm surprised I couldn't already find this data on Kaggle, so if there is a reason it shouldn't be here, I will happily take this down.
https://creativecommons.org/publicdomain/zero/1.0/
The "Person Name Population" dataset provides information on the incidence and frequency of different names within a population. It consists of three columns: Name, Incidence, and Frequency.
The "Name" column represents the individual names of people, while the "Incidence" column denotes the number of individuals in the population who bear that particular name. The "Frequency" column indicates the relative occurrence or proportion of each name within the population.
This dataset can be valuable for various purposes, including:
1. Sociodemographic Analysis: Researchers or analysts can utilize this dataset to study the distribution and prevalence of different names within a specific population. They can uncover patterns, trends, and cultural preferences related to naming practices.
2. Name Popularity Studies: The dataset enables the exploration of popular or commonly used names. It allows researchers to identify the most prevalent names and track their frequency over time. This information can be useful for understanding naming trends and societal changes.
3. Market Research: Companies and marketers can leverage this dataset to gain insights into consumer preferences and behaviors. They can analyze the popularity of names to inform targeted marketing strategies, such as personalized messaging or product customization.
4. Social Studies: Sociologists and anthropologists can use the dataset to investigate the cultural significance of names within a specific population. They can explore naming conventions, naming traditions across different regions or ethnicities, and the impact of cultural factors on name selection.
5. Historical Research: Historians may find this dataset valuable for studying name trends and patterns across different time periods. It can provide insights into naming practices in the past, allowing researchers to analyze societal changes, migration patterns, or cultural influences.
A dataset full of first names can be incredibly helpful in various applications and analyses. Firstly, it can be used in demographic studies to analyze naming trends and patterns over time, providing insights into cultural and societal changes. Additionally, such a dataset can be utilized in market research and targeted advertising, allowing businesses to personalize their marketing strategies based on customers' names. It can also be employed in language processing tasks such as named entity recognition, sentiment analysis, or gender prediction. Moreover, the dataset can serve as a valuable resource for generating test data, creating fictional characters, or enhancing natural language generation models. Overall, a comprehensive first name dataset has diverse applications across multiple domains.
This dataset contains common Indian names; the data can be used for NLP problems such as name generation. An upvote would be appreciated :)
https://www.gnu.org/licenses/gpl-3.0.html
This dataset contains two tables of baby name data for 8 anglosphere regions: Australia, Canada, England and Wales (grouped together), Ireland, Northern Ireland, New Zealand, Scotland, and the USA. It can be used to compare the popularity of names in the anglosphere over time, and I have used it to determine the "country-ness" of each particular name, i.e. how much more popular it is in its most popular country compared to all the other countries. A sketch of one such calculation is shown below.
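One way to make that "country-ness" idea concrete is sketched below; the long-format columns (name, region, count) are assumptions rather than the dataset's actual schema, so treat it as a template.

```python
# Sketch: score how concentrated a name's popularity is in one region.
import pandas as pd

# Hypothetical long-format table: one row per (name, region) with a count.
df = pd.read_csv("baby_names_by_region.csv")

# Popularity as each name's share of births within its region.
df["share"] = df["count"] / df.groupby("region")["count"].transform("sum")

# One row per name, one column per region.
shares = df.pivot_table(index="name", columns="region", values="share", fill_value=0)

# Country-ness: top regional share divided by the mean share in the other regions.
top = shares.max(axis=1)
rest = (shares.sum(axis=1) - top) / (shares.shape[1] - 1)
country_ness = top / rest.where(rest > 0)

print(country_ness.nlargest(10))
```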
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains US baby names from the Social Security Administration dating back to 1879. With more than 140 years of data, this is one of the most comprehensive datasets on baby names in the US. The data include the name, year of birth, sex, and number of babies given that name each year. This dataset is a great resource for anyone interested in studying baby naming trends over time.
This dataset is a compilation of over 140 years of data from the Social Security Administration. It includes data on baby names, year of birth, and sex. There are also columns for the number of babies with that name born in that year.
This dataset can be used to track changes in baby naming trends over time, or to study how the popularity of particular names has changed. It can also be used to study how naming trends differ between sexes or between different years.
This dataset could be used for a number of things (a sample trend analysis is sketched below), including:
1. Determining baby name trends over time
2. Finding out what the most popular baby names are in the US
3. Analyzing how baby name popularity has changed over the years
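The sketch below shows one way to pull a single name's trend and the overall most popular names with pandas. The file name and exact column labels (name, year, sex, number) are assumptions; rename them to match the actual CSV.

```python
# Sketch: basic trend analysis over SSA-style baby name records.
import pandas as pd

# Assumed columns: name, year, sex, number.
babies = pd.read_csv("us_baby_names.csv")

# Trend over time for one name, split by sex.
trend = (babies[babies["name"] == "Emma"]
         .pivot_table(index="year", columns="sex", values="number", aggfunc="sum"))
print(trend.tail())

# Most popular names overall.
print(babies.groupby("name")["number"].sum().nlargest(10))
```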
If you use this dataset in your research, please credit @nickgott, @rflprr and the Social Security Administration via Data.gov
https://creativecommons.org/publicdomain/zero/1.0/
I started these datasets to learn how to manipulate files in different formats with Python. You can see the GitHub repo here: https://github.com/rokelina/names-analysis
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
By Amber Thomas [source]
This dataset contains data on foundation products from Sephora and Ulta, including brand, product, shade name, and color information. It was used in a visual essay on The Pudding entitled The Naked Truth to better understand how beauty brands name their foundation products.
Data were collected from the US versions of Sephora's and Ulta's websites using Microsoft Playwright. Shades that were transparent or untinted were removed, resulting in 6,816 swatches from 107 brands and 328 products. Shade names were determined by scraping the alt text of the swatch images, using RegEx to extract the name from the listed description; about 10% of names were manually extracted from the alt text. Hex values and lightness for each shade on each product's page were also determined, using packages in the R language.
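For readers who want to reproduce the lightness step outside R, here is a small Python equivalent; it is my own sketch of a generic hex-to-lightness conversion, not the authors' code.

```python
# Sketch: derive HSL lightness from a hex swatch color.
import colorsys

def hex_to_lightness(hex_code: str) -> float:
    """Return HSL lightness in [0, 1] for a color like '#F3D3BD'."""
    hex_code = hex_code.lstrip("#")
    r, g, b = (int(hex_code[i:i + 2], 16) / 255 for i in (0, 2, 4))
    _, lightness, _ = colorsys.rgb_to_hls(r, g, b)  # note: HLS order is h, l, s
    return lightness

print(round(hex_to_lightness("#F3D3BD"), 3))  # a light shade -> roughly 0.85
```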
Categories for shade names were assigned manually based on interpretation and may differ based on context clues provided by entire product lines. For example, Estée Lauder has a shade called dawn and another called dusk, presumably referring to times of day, while EXA has a similarly named shade that presumably refers to a person's name. Categories include Animal, Color, Compliment, Descriptor, Drink, Food, Gem, Location, Metal, Misc., Name, Plant, Rock, Skin, Textile, and Wood.
Raw files may differ from clean files, as additional manual extraction steps were taken into account when compiling allShades.csv. Get ready to uncover The Naked Truth with this data set!
This dataset contains information about brands, products, URLs, descriptions, image sources, image alt text, and shade names for different foundation products. With this data you can gain insight into the variety of color-naming conventions used by beauty brands. The dataset is organized into raw files (Sephora & Ulta) and a clean file (allShades).
Accessing the Dataset
This dataset is available as multiple .csv files on kaggle.com under "Brands' Foundation Color Names". All data were collected from the US versions of Sephora's and Ulta's websites on January 11th and January 18th, respectively, using Microsoft Playwright. There may be some variation in the name column between these datasets, because additional manual extraction was incorporated into the allShades file for names that couldn't be programmatically extracted, in order to make them consistent across all platforms. Furthermore, there are categories associated with each label, manually assigned based on common words or phrases that appear in many product shade names. These include animals, colors, compliments, descriptors, etc., which you can use to analyze trend patterns in how brands name their makeup products, from natural elements to non-traditional vocabulary!
## Columns Overview
Column names in this dataset include Brand Name, URL, Description, Image Source, Image Alt Text, Name (shade name), and specific shade information (i.e. colorspace, hex value, hue, saturation, lightness), as well as a Category describing each shade label based on its content (animal, color, compliment, descriptor, etc.). Roughly 10% of names per brand had to be manually extracted from the image alt text due to labels such as "001". By creating visualizations such as lightness distributions, or bar graphs looking at different categories like type or country with filter functions, you can closely look at trends between shades of foundations separated by gender norms over time! There are lots of possibilities here, so get creative!
## Analyzing Trends
By leveraging R packages such as imager and magick, alongside structured query language (SQL) queries, it's possible to analyze trends in things like lightness values and hex codes across multiple brands, while plots like histograms and bar charts let you compare items that have been categorized by topic (plant versus animal, etc.). Finally, depending on what type of visual you are aiming for, consider:
- Analyzing the prevalence of certain categories (e.g. colors, foods, animals) in foundation shade names across different brands to better understand potential target markets and marketing strategies
- Using trend analyses on the dataset to track changing trends in foundation shade names over time
- Creating visualizations to compare lightness, hue and saturation measurements for each swatch...
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Popular Baby Names by Gender and Ethnic Group. Data were collected through civil birth registration. Each record represents the ranking of a baby name in order of frequency. This data can be used to represent the popularity of a name.
This dataset is useful for data science projects focusing on trends analysis over time. Data scientists can use this dataset for sociological studies, such as understanding cultural integration or the impact of media on naming trends.
Here are a few analyses that I believe could be performed:
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains biographical information derived from articles on English Wikipedia as it stood in early June 2024. It was created as part of the Structured Contents initiative at Wikimedia Enterprise and is intended for evaluation and research use.
The beta sample dataset is a subset of the Structured Contents Snapshot focusing on people with infoboxes in EN Wikipedia, output as JSON files (compressed in tar.gz).
We warmly welcome any feedback you have. Please share your thoughts, suggestions, and any issues you encounter on the discussion page for this dataset here on Kaggle.
Noteworthy Included Fields:
- name - title of the article.
- identifier - ID of the article.
- image - main image representing the article's subject.
- description - one-sentence description of the article for quick reference.
- abstract - lead section, summarizing what the article is about.
- infoboxes - parsed information from the side panel (infobox) on the Wikipedia article.
- sections - parsed sections of the article, including links. Note: excludes other media/images, lists, tables and references or similar non-prose sections.
The Wikimedia Enterprise Data Dictionary explains all of the fields in this dataset.
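A small sketch of peeking at the archive contents is shown below. The archive file name and the one-JSON-object-per-line layout of the member files are assumptions, so adjust them to the files you actually download.

```python
# Sketch: stream articles out of the tar.gz without fully unpacking it,
# reading only the fields named in the dataset description.
import json
import tarfile

with tarfile.open("enwiki_people_infoboxes.tar.gz", "r:gz") as archive:
    for member in archive:
        if not member.isfile():
            continue
        fileobj = archive.extractfile(member)
        for line in fileobj:
            article = json.loads(line)
            print(article.get("name"), article.get("identifier"))
            print((article.get("description") or "")[:80])
            break  # just peek at the first article per file
        break
```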
Infoboxes
- Compressed: 2 GB
- Uncompressed: 11 GB
Infoboxes + sections + short description
- Compressed: 4.12 GB
- Uncompressed: 21.28 GB
Article analysis and filtering breakdown:
- total # of articles analyzed: 6,940,949
- # people found with QID: 1,778,226
- # people found with Category: 158,996
- # people found with Biography Project: 76,150
- Total # of people articles found: 2,013,372
- Total # of people articles with infoboxes: 1,559,985
End stats:
- Total number of people articles in this dataset: 1,559,985
- that have a short description: 1,416,701
- that have an infobox: 1,559,985
- that have article sections: 1,559,921
This dataset includes 235,146 people articles that exist on Wikipedia but aren't yet tagged on Wikidata as instance of:human.
This dataset was originally extracted from the Wikimedia Enterprise APIs on June 5, 2024. The information in this dataset may therefore be out of date. This dataset isn't being actively updated or maintained, and has been shared for community use and feedback. If you'd like to retrieve up-to-date Wikipedia articles or data from other Wikiprojects, get started with Wikimedia Enterprise's APIs.
The dataset is built from the Wikimedia Enterprise HTML “snapshots”: https://enterprise.wikimedia.com/docs/snapshot/ and focuses on the Wikipedia article namespace (namespace 0 (main)).
Wikipedia is a human generated corpus of free knowledge, written, edited, and curated by a global community of editors since 2001. It is the largest and most accessed educational resource in history, accessed over 20 billion times by half a billion people each month. Wikipedia represents almost 25 years of work by its community; the creation, curation, and maintenance of millions of articles on distinct topics. This dataset includes the biographical contents of English Wikipedia language editions: English https://en.wikipedia.org/, written by the community.
Wikimedia Enterprise provides this dataset under the assumption that downstream users will adhere to the relevant free culture licenses when the data is reused. In situations where attribution is required, reusers should identify the Wikimedia project from which the content was retrieved as the source of the content. Any attribution should adhere to Wikimedia’s trademark policy (available at https://foundation.wikimedia.org/wiki/Trademark_policy) and visual identity guidelines (ava...