Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The popularity of baby names is a fascinating reflection of our society's cultural trends and values over time. The dataset on the most popular baby names from 1880 until now provides a comprehensive look at the evolution of naming practices in the United States over the last 140 years. The dataset includes information on the top 1000 baby names for each year, as well as the number of babies given each name, broken down by gender.
By analyzing this dataset, researchers can identify trends and patterns in baby naming, such as the rise and fall of certain names, the influence of popular culture on naming trends, and the impact of immigration on naming practices. This dataset is a valuable resource for researchers, parents, and anyone interested in exploring the social and cultural history of the United States.
Popular Baby Names by Sex and Ethnic Group
Data were collected through civil birth registration. Each record represents the ranking of a baby name in the order of frequency. Data can be used to represent the popularity of a name. Caution should be used when assessing the rank of a baby name if the frequency count is close to 10; the ranking may vary year to year.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains US baby names from the Social Security Administration dating back to 1880. With over 140 years of data, it is one of the most comprehensive datasets on baby names in the US. The data include the name, year of birth, sex, and number of babies with that name for each year. It is a useful resource for anyone interested in studying baby naming trends over time.
The dataset is compiled from Social Security Administration records. Its columns give the baby name, the year of birth, the sex, and the number of babies given that name in that year.
It can be used to track how the popularity of individual names has changed over time, and to compare naming trends between sexes or between years.
This dataset could be used for a number of things, including:
1. Determining baby name trends over time
2. Finding out what the most popular baby names are in the US
3. Analyzing how baby name popularity has changed over the years
If you use this dataset in your research, please credit @nickgott, @rflprr and the Social Security Administration via Data.gov
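As a minimal sketch of the first use case, the snippet below follows one name's count from year to year. The column names (name, year, sex, count) and the sample rows are assumptions made up for illustration; the real file's header and values may differ.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows for illustration only; not taken from the real dataset.
SAMPLE_CSV = """name,year,sex,count
Mary,1880,F,100
Mary,1881,F,90
John,1880,M,120
Mary,1881,M,5
"""


def yearly_counts(rows, name, sex):
    """Return {year: total count} for one name and sex, sorted by year."""
    totals = defaultdict(int)
    for row in rows:
        if row["name"] == name and row["sex"] == sex:
            totals[int(row["year"])] += int(row["count"])
    return dict(sorted(totals.items()))


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
mary_trend = yearly_counts(rows, "Mary", "F")
print(mary_trend)
```

Filtering on sex as well as name matters because the same name can appear under both sexes in a given year.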
The most popular baby names by sex and mother's ethnicity in New York City from 2011-2014.
Popular Baby Name Data In NYC from 2011-2014
Rows: 13962; Columns: 6
The data include the following columns:
BRTH_YR: birth year of the baby
GNDR: gender
ETHCTY: mother's ethnicity
NM: baby's name
CNT: count of the name
RNK: ranking of the name
Source: NYC Open Data
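The six columns above are enough to find the most frequent name within each year, gender and ethnicity group. The sketch below is one possible way to do that with the standard library; the sample rows, including the placeholder names and category spellings, are invented for illustration and do not come from the NYC file.

```python
import csv
import io

# Invented rows using the column names listed above; values are placeholders.
SAMPLE_CSV = """BRTH_YR,GNDR,ETHCTY,NM,CNT,RNK
2011,FEMALE,HISPANIC,NAME_A,50,1
2011,FEMALE,HISPANIC,NAME_B,30,2
2011,MALE,HISPANIC,NAME_C,40,1
2012,FEMALE,HISPANIC,NAME_B,45,1
"""


def top_name_per_group(rows):
    """Map (year, gender, ethnicity) to the (name, count) with the highest CNT."""
    best = {}
    for row in rows:
        key = (int(row["BRTH_YR"]), row["GNDR"], row["ETHCTY"])
        count = int(row["CNT"])
        if key not in best or count > best[key][1]:
            best[key] = (row["NM"], count)
    return best


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
top = top_name_per_group(rows)
for key, (name, count) in sorted(top.items()):
    print(key, name, count)
```

Recomputing the leader from CNT rather than trusting RNK alone makes it easy to cross-check the two columns against each other.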
This dataset contains ranks and counts for the top 25 baby names by sex for live births that occurred in California (by occurrence) based on information entered on birth certificates.
The data (name, year of birth, sex, and number) are from a 100 percent sample of Social Security card applications for 1880 on.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset contains counts and rankings of the most common first names in the United States, sourced from comprehensive name census data. It is ideal for analyzing naming trends, demographic patterns, and cultural preferences, as well as for building statistical models to explore name popularity over time.
male_first_names.csv: Male first name frequencies and rankings in the U.S.
female_first_names.csv: Female first name frequencies and rankings in the U.S.
This dataset tracks the updates made on the dataset "Most Popular Baby Names" as a repository for previous versions of the data and metadata.
https://creativecommons.org/publicdomain/zero/1.0/
Popular Baby Names by Sex and Ethnic Group Data were collected through civil birth registration. Each record represents the ranking of a baby name in the order of frequency. Data can be used to represent the popularity of a name. Caution should be used when assessing the rank of a baby name if the frequency count is close to 10; the ranking may vary year to year.
This is a dataset hosted by the City of New York. The city has an open data platform found here, and they update their information according to the amount of data that is brought in. Explore New York City using Kaggle and all of the data sources available through the City of New York organization page!
This dataset is maintained using Socrata's API and Kaggle's API. Socrata has assisted countless organizations with hosting their open data and has been an integral part of the process of bringing more data to the public.
Cover photo by freestocks.org on Unsplash
Unsplash Images are distributed under a unique Unsplash License.
Open Government Licence 3.0 http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
Historic lists of top 100 names for baby boys and girls for 1904 to 2024 at 10-yearly intervals.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract
Motivation: creating a challenging dataset for testing Named-Entity Linking. The Namesakes dataset consists of three closely related datasets: Entities, News and Backlinks. Entities were collected as Wikipedia text chunks corresponding to highly ambiguous entity names. The News were collected as random news text chunks, containing mentions that either belong to the Entities dataset or can be easily confused with them. Backlinks were obtained from Wikipedia dump data with the intention of having mentions linked to the entities of the Entities dataset. The Entities and News are human-labeled, resolving the mentions of the entities.
Methods
Entities were collected as Wikipedia text chunks corresponding to highly ambiguous entity names: the most popular people names, the most popular locations, and organizations with name ambiguity. In each Entities text chunk, the named entities with a name similar to the chunk's Wikipedia page name are labeled. For labeling, these entities were suggested to human annotators (odetta.ai) to tag as "Same" (same as the page entity) or "Other". The labeling was done by 6 experienced annotators who had passed a preliminary trial task. The only accepted tags are those assigned in agreement by no fewer than 5 annotators and then passed through reconciliation with an experienced reconciliator.
The News were collected as random news text chunks, containing mentions which either belong to the Entities dataset or can be easily confused with them. In each News text chunk one mention was selected for labeling, and 3-10 Wikipedia pages from Entities were suggested as the labels for an annotator to choose from. The labeling was done by 3 experienced annotators (odetta.ai), after the annotators passed a preliminary trial task. The results were reconciled by an experienced reconciliator. All the labeling was done using Lighttag (lighttag.io).
Backlinks were obtained from Wikipedia dump data (dumps.wikimedia.org/enwiki/20210701) with the intention of having mentions linked to the entities of the Entities dataset. The backlinks were filtered to leave only mentions in good-quality text; each text was cut 1000 characters after the last mention.
Usage Notes
Entities:
File: Namesakes_entities.jsonl
The Entities dataset consists of 4148 Wikipedia text chunks containing human-tagged mentions of entities. Each mention is tagged either as "Same" (meaning that the mention is of this Wikipedia page entity), or "Other" (meaning that the mention is of some other entity, just having the same or similar name). The Entities dataset is a jsonl list; each item is a dictionary with the following keys and values:
Key 'pagename': page name of the Wikipedia page.
Key 'pageid': page id of the Wikipedia page.
Key 'title': title of the Wikipedia page.
Key 'url': URL of the Wikipedia page.
Key 'text': the text chunk from the Wikipedia page.
Key 'entities': list of the mentions in the page text; each entity is represented by a dictionary with the keys:
  Key 'text': the mention as a string from the page text.
  Key 'start': start character position of the entity in the text.
  Key 'end': end (one-past-last) character position of the entity in the text.
  Key 'tag': annotation tag given as a string, either 'Same' or 'Other'.
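Because 'end' is described as one-past-last, the offsets should line up with ordinary Python slicing. The following is a rough sketch of a consistency check on one record; the sample record is invented to match the key layout described above and is not a real line from Namesakes_entities.jsonl.

```python
import json
from collections import Counter

# Invented record that follows the documented keys; not real dataset content.
SAMPLE_LINE = json.dumps({
    "pagename": "Example_Person",
    "pageid": 1,
    "title": "Example Person",
    "url": "https://en.wikipedia.org/wiki/Example_Person",
    "text": "Example Person was a painter. Another Example Person was a poet.",
    "entities": [
        {"text": "Example Person", "start": 0, "end": 14, "tag": "Same"},
        {"text": "Example Person", "start": 38, "end": 52, "tag": "Other"},
    ],
})


def check_record(record):
    """Return the mentions whose offsets do not reproduce their 'text'."""
    mismatches = []
    for entity in record["entities"]:
        # 'end' is one-past-last, so a plain slice recovers the mention.
        if record["text"][entity["start"]:entity["end"]] != entity["text"]:
            mismatches.append(entity)
    return mismatches


record = json.loads(SAMPLE_LINE)
mismatches = check_record(record)
tag_counts = Counter(entity["tag"] for entity in record["entities"])
print(mismatches, dict(tag_counts))
```

For the real file, the same check could be run once per line after reading each line with json.loads.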
News:
File: Namesakes_news.jsonl
The News dataset consists of 1000 news text chunks, each one with a single annotated entity mention. The annotation either points to the corresponding entity from the Entities dataset (if the mention is of that entity), or indicates that the mentioned entity does not belong to the Entities dataset. The News dataset is a jsonl list; each item is a dictionary with the following keys and values:
Key 'id_text': Id of the sample.
Key 'text': the text chunk.
Key 'urls': list of URLs of Wikipedia entities suggested to labelers for identification of the entity mentioned in the text.
Key 'entity': a dictionary describing the annotated entity mention in the text:
  Key 'text': the mention as a string found by an NER model in the text.
  Key 'start': start character position of the mention in the text.
  Key 'end': end (one-past-last) character position of the mention in the text.
  Key 'tag': this key exists only if the mentioned entity is annotated as belonging to the Entities dataset; if so, the value is a dictionary identifying the Wikipedia page assigned by annotators to the mentioned entity:
    Key 'pageid': Wikipedia page id.
    Key 'pagetitle': page title.
    Key 'url': page URL.
Backlinks dataset: The Backlinks dataset consists of two parts: dictionary Entity-to-Backlinks and Backlinks documents. The dictionary points to backlinks for each entity of the Entity dataset (if any backlinks exist for the entity). The Backlinks documents are the backlinks Wikipedia text chunks with identified mentions of the entities from the Entities dataset.
Each mention is marked by surrounding double square brackets, e.g. "Muir built a small cabin along [[Yosemite Creek]].". However, if the mention differs from the exact entity name, the double square brackets wrap both the exact name and, separated by '|', the mention string to the right, for example: "Muir also spent time with photographer [[Carleton E. Watkins | Carleton Watkins]] and studied his photographs of Yosemite.".
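The bracket convention can be parsed with a regular expression. The sketch below is one possible approach, not the dataset authors' own tooling; the pattern and the function name extract_mentions are illustrative assumptions, and the two test sentences are the examples quoted above.

```python
import re

# [[Entity name]] or [[Entity name | mention string]]; whitespace around '|'
# is tolerated. This pattern is an assumption based on the two examples.
MENTION_RE = re.compile(r"\[\[([^\[\]|]+?)(?:\s*\|\s*([^\[\]]+?))?\s*\]\]")


def extract_mentions(content):
    """Return (entity_name, mention_string) pairs found in a backlink text."""
    mentions = []
    for match in MENTION_RE.finditer(content):
        entity = match.group(1).strip()
        # Without a '|' part, the mention string is the entity name itself.
        surface = (match.group(2) or entity).strip()
        mentions.append((entity, surface))
    return mentions


plain = "Muir built a small cabin along [[Yosemite Creek]]."
piped = ("Muir also spent time with photographer "
         "[[Carleton E. Watkins | Carleton Watkins]] and studied his "
         "photographs of Yosemite.")
print(extract_mentions(plain))
print(extract_mentions(piped))
```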
The Entity-to-Backlinks dictionary is a jsonl with 1527 items.
File: Namesakes_backlinks_entities.jsonl
Each item is a tuple:
  Entity name.
  Entity Wikipedia page id.
  Backlinks ids: a list of pageids of backlink documents.
The Backlinks documents file is a jsonl with 26903 items.
File: Namesakes_backlinks_texts.jsonl
Each item is a dictionary:
Key 'pageid': Id of the Wikipedia page.
Key 'title': title of the Wikipedia page.
Key 'content': text chunk from the Wikipedia page, with all mentions in double brackets; the text is cut 1000 characters after the last mention, and the cut is denoted as '...[CUT]'.
Key 'mentions': list of the mentions from the text, for convenience. Each mention is a tuple:
  Entity name.
  Entity Wikipedia page id.
  Sorted list of all character indexes at which the mention occurrences start in the text.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
List of male and female baby names in South Australia from 1944 to 2024. The annual data for baby names is published January/February each year.
This dataset contains the common names of the national forests and grasslands and their respective FS WWW URL information, which is used both for display of the national forest and national grassland boundaries on any map product and for dynamic interactivity of the map. This dataset exhibits the following characteristics:
1. Granularity of the polygon features - The spatial extent of the national forests and grasslands matches the way the agency would like to communicate with the public.
2. Preferred/Common Name of the National Forest Units - The common names of the national forests and grasslands match the preferred name column in the common names decision table maintained by the FS Office of Communication.
3. Hyperlinks to FS WWW Home page - This column contains the FS WWW URL for each national forest. The URL can be used in any interactive map application to link users directly to a forest's home page.
Data Source - This dataset is derived from the following FS ALP (Automated Lands Program) Land Status Records System authoritative data sources:
1. Administrative Forest Boundaries
2. Proclaimed Forest Boundaries
3. Ranger District Boundaries
4. National Grassland Areas
The common names decision table maintained by the FS Office of Communication contains the common name and its respective Land Status Records System authoritative data source to be used for building the spatial polygon. The spatial polygons for every feature in this dataset come from one or more of the authoritative data sources listed above. The common names dataset is created by reusing the existing ALP names from those data sources.
Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Queensland Top 100 Baby Names
Open Database License (ODbL) v1.0 https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
The Most Popular Names in the Philippines dataset provides insights into the popularity of different names in the Philippines.
The dataset includes the following fields:
rank: The position of the name when graded by incidence with all other names in the place.
forename: The personal name given to an individual at or shortly after birth, also known as a first name.
incidence: Number of people who bear the name.
frequency: Ratio and percentage of people who bear the name.
gender: The gender of the specific name based on the percentage.
gender_percentage: The percentage of bearers who are male or female.
This dataset can be used for various purposes, such as:
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset comprises a corpus of names, in both the first and middle position, for approximately 22 million individuals born in England and Wales between 1838 and 2014. This data is obtained from birth records made available by a set of volunteer-run genealogical resources - collectively, the 'UK local BMD project' (http://www.ukbmd.org.uk/local) - and has been re-purposed here to demonstrate the applicability of network analysis methods to an onomastic dataset. The ownership and licensing of the intellectual property constituting the original birth records is detailed at https://www.ukbmd.org.uk/TermsAndConditions. Under section 29A of the UK Copyright, Designs and Patents Act 1988, a copyright exception permits copies to be made of lawfully accessible material in order to conduct text and data mining for non-commercial research. The data included in this dataset represents the outcome of such a text-mining analysis. No birth records are included in this dataset, and nor is it possible for records to be reconstructed from the data presented herein. The data comprises an archive of tables, presenting this corpus in various forms: as a rank order of names (in both the first and middle position) by number of registered births per year, and by the total number of births across all years sampled. An overview of the data is also provided, with summary statistics such as the number of usable records registered per year, most popular names per year, and measures of forename diversity and the surname-to-forename usage ratio (an indicator of which forenames are more likely to be transferred uses of surnames). These tables are extensive but not exhaustive, and do not exclude the possibility that errors are present in the corpus. 
Data are also presented both as '.expression' files (an input format readable by the network analysis tool Graphia Professional) and as '.layout' files, a text file format output by Graphia Professional that describes the characteristics of the network so that it may be replicated. Characteristics of the original birth records that allow the identification of individuals - for instance, full name or location of birth - have been removed.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Reddit is a social news, content rating and discussion website. It's one of the most popular sites on the internet. Reddit has 52 million daily active users and approximately 430 million users who use it once a month. Reddit is organized into subreddits; this dataset uses the r/AskScience subreddit.
The dataset is extracted from the subreddit /r/AskScience on Reddit. The data was collected between 01-01-2016 and 20-05-2022. It contains 612,668 data points and 25 columns. The dataset contains information about the questions asked on the subreddit, including the description of the submission, the flair of the question, NSFW or SFW status, the year of the submission, and more. The data was extracted using Python and Pushshift's API, with light cleaning done using NumPy and pandas (see the descriptions of individual columns below).
The dataset contains the following columns and descriptions:
author - Redditor name.
author_fullname - Redditor full name.
contest_mode - Contest mode [implement obscured scores and randomized sorting].
created_utc - Time the submission was created, represented in Unix time.
domain - Domain of submission.
edited - Whether the post is edited or not.
full_link - Link of the post on the subreddit.
id - ID of the submission.
is_self - Whether or not the submission is a self post (text-only).
link_flair_css_class - CSS class used to identify the flair.
link_flair_text - Flair on the post, or the link flair's text content.
locked - Whether or not the submission has been locked.
num_comments - The number of comments on the submission.
over_18 - Whether or not the submission has been marked as NSFW.
permalink - A permalink for the submission.
retrieved_on - Time ingested.
score - The number of upvotes for the submission.
description - Description of the submission.
spoiler - Whether or not the submission has been marked as a spoiler.
stickied - Whether or not the submission is stickied.
thumbnail - Thumbnail of the submission.
question - Question asked in the submission.
url - The URL the submission links to, or the permalink if a self post.
year - Year of the submission.
banned - Whether banned by the moderator or not.
This dataset can be used for flair prediction, NSFW classification, and other text mining and NLP tasks. Exploratory data analysis can also reveal trends and patterns over the years.
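For trend analysis it can help to turn created_utc, which is given in Unix time, into a calendar year. The sketch below shows one way to do this with the standard library; the helper name submission_year and the sample row are hypothetical, and whether the dataset's own year column was derived in UTC is an assumption.

```python
from datetime import datetime, timezone


def submission_year(created_utc):
    """Convert a Unix timestamp (seconds) to a calendar year in UTC."""
    return datetime.fromtimestamp(int(created_utc), tz=timezone.utc).year


# Hypothetical row: 1451606400 is 2016-01-01 00:00:00 UTC.
sample_row = {"created_utc": 1451606400, "year": 2016}
derived_year = submission_year(sample_row["created_utc"])
print(derived_year, derived_year == sample_row["year"])
```

Passing an explicit UTC timezone avoids results that shift with the local timezone of the machine running the analysis.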
Open Government Licence 3.0 http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
Rank and count of the top names for baby boys, changes in rank since the previous year and breakdown by country, region, mother's age and month of birth.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset is now updated annually here.
Baby names for children recently born in New York City. This dataset is notable because it includes a breakdown by the ethnicity of the mother of the baby: a source of ethnic information that is missing from many other similar datasets published on state and national levels.
This dataset includes columns for the name, year of birth, sex, and mother's ethnicity of the baby. It also includes a rank column (that name's popularity relative to the rest of the names on the list).
This data is published as-is by the City of New York.
The dataset contains the first names of newborns of the city of Pré Saint-Gervais for the period 2018-2020. The file contains 688 first names. The data are structured by the name of the municipality where the children were born (in most cases the children were born outside Pré Saint-Gervais, because the commune has no maternity ward, but at least one of the parents comes from the commune), the INSEE number, the sex, the child's first name, the number of occurrences and the year of birth. These data are useful for analysing trends in the choice of first names and thus for understanding the history of the city. The data are collected by the General Affairs Department of the commune of Pré Saint-Gervais from birth declarations. The file is available in CSV format. To get in touch with the manager for this dataset, you can write to Benjamin Mittet-Brême, Director of General Administration, Civil State and Cemetery.
Data-visualisation proposals:
- Gender distribution of first names by year: https://prenomspsg.trial.opendatasoft.com/chart/embed/repartition_des_sexes_des_prenoms_par_annee1/
- Gender distribution of first names over the period 2018-2020: https://prenomspsg.trial.opendatasoft.com/chart/embed/repartition_des_sexes_des_prenoms_sur_la_periode_2018-20201/
- Most used male given names per year (2018-2020): https://prenomspsg.trial.opendatasoft.com/chart/embed/prenoms_de_sexe_masculin_les_plus_utilises_par_annee_2018-2020/
- Most used female given names per year (2018-2020): https://prenomspsg.trial.opendatasoft.com/chart/embed/prenoms_de_sexe_feminin_les_plus_utilises_par_annee_2018-2020/
- The 10 most given names over the period 2018-2020: https://app.workbenchdata.com/workflows/132629/report
Dataset published during the Challenge Data week organised by Sciences Po Saint-Germain-en-Laye from February 15 to 19, 2021.