100+ datasets found

US Baby Names
kaggle.com
zip
Updated Nov 21, 2017
Cite
Kaggle (2017). US Baby Names [Dataset]. https://www.kaggle.com/datasets/kaggle/us-baby-names
Explore at:
zip(181746626 bytes)Available download formats
Dataset updated
Nov 21, 2017
Dataset authored and provided by
Kaggle (http://kaggle.com/)
License
https://creativecommons.org/publicdomain/zero/1.0/
Area covered
United States
Description
US Social Security applications are a great way to track trends in how babies born in the US are named.

Data.gov releases two datasets that are helpful for this: one at the national level and another at the state level. Note that, for privacy, only names given to at least 5 babies born in the same year (and, in the state-level data, the same state) are included in this dataset.

I've taken the raw files here and combined/normalized them into two CSV files (one for each dataset) as well as a SQLite database with two equivalently-defined tables. The code that did these transformations is available here.

New to data exploration in R? Take the free, interactive DataCamp course, "Data Exploration With Kaggle Scripts," to learn the basics of visualizing data with ggplot. You'll also create your first Kaggle Scripts along the way.
United States Baby Names Count
kaggle.com
Updated Dec 4, 2023
Cite
The Devastator (2023). United States Baby Names Count [Dataset]. https://www.kaggle.com/datasets/thedevastator/united-states-baby-names-count/data
Dataset updated
Dec 4, 2023
Dataset provided by
Kaggle
Authors
The Devastator
License
https://creativecommons.org/publicdomain/zero/1.0/
Area covered
United States
Description
United States Baby Names Count

United States Baby Names Dataset

By Amber Thomas [source]

About this dataset

The data is based on a complete sample of records on Social Security card applications as of March 2021 and is presented in three main files: baby-names-national.csv, baby-names-state.csv, and baby-names-territories.csv. These files contain detailed information about names given to babies at the national level (50 states and District of Columbia), state level (individual states), and territory level (including American Samoa, Guam, Northern Mariana Islands, Puerto Rico, and U.S. Virgin Islands), respectively.

Each entry in the dataset includes several key attributes such as state_abb or territory_code representing the abbreviation or code indicating the specific state or territory where the baby was born. The sex attribute denotes the gender of each baby – either male or female – while year represents the specific birth year when each baby was born.

Another important attribute is name, which gives the given name selected for each newborn. The count attribute gives the number of babies who received a particular name for a given state or territory, sex, and year.

It's also worth noting that all names included have at least two characters in length to ensure high data quality standards.

How to use the dataset

- Understanding the Columns

The dataset consists of multiple columns with specific information about each baby name entry. Here are the key columns in this dataset:

state_abb: The abbreviation of the state or territory where the baby was born.

sex: The gender of the baby.

year: The year in which the baby was born.

name: The given name of the baby.

count: The number of babies with a specific name born in a certain state, gender, and year.

- Exploring National Data

To analyze national trends or overall popularity across all states and years: a) Focus on baby-names-national.csv. b) Use columns like name, sex, year, and count to study trends over time.

- Analyzing State-Level Data

To examine specific states' data: a) Utilize baby-names-state.csv file. b) Filter data by desired states using state_abb column values. c) Combine analysis with other relevant attributes like gender, year, etc., for detailed insights.

- Understanding Territory Data

For insights into United States territories (American Samoa, Guam, Northern Mariana Islands, Puerto Rico, U.S. Virgin Islands): a) Use baby-names-territories.csv. b) Analyze it on the same principles as the state-level data, keeping territory-specific factors in mind.

- Gender-Specific Analysis

You can study names' popularity specifically among males or females by filtering the data using the sex column. This will allow you to explore gender-specific naming trends and preferences.

- Identifying Regional Patterns

To identify naming patterns in specific regions: a) Analyze state-level or territory-level data. b) Look for variations in name popularity across different states or territories.

- Analyzing Name Popularity over Time

Track the popularity of specific names over time using the name, year, and count columns. This can help uncover trends, fluctuations, and changes in names' usage and popularity.

- Comparing Names and Variations

Use this

Research Ideas

Tracking Popularity Trends: This dataset can be used to analyze the popularity of baby names over time. By examining the count of babies with a specific name born in different years, trends and shifts in naming preferences can be identified.

Gender Analysis: The dataset includes information on the gender of each baby. It can be used to study gender patterns and differences in naming choices. For example, it would be possible to compare the frequency and popularity of certain names among males and females.

Regional Variations: With state abbreviations provided, it is possible to explore regional variations in baby naming trends within the United States. Researchers could examine how certain names are more popular or unique to specific states or territories, highlighting cultural or geographical factors that influence naming choices
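The popularity-tracking idea above can be sketched in a few lines of Python. This is only an illustration, assuming the column names listed earlier (state_abb, sex, year, name, count) and assuming that sex is coded as "F"/"M"; the sample rows and their counts are invented for the example and are not taken from the dataset.

```python
import csv
import io
from collections import defaultdict

# Invented sample rows that follow the documented column layout.
# In practice you would open baby-names-state.csv instead.
SAMPLE_CSV = """state_abb,sex,year,name,count
CA,F,2018,Emma,2743
CA,F,2019,Emma,2588
NY,F,2018,Emma,1500
NY,F,2019,Emma,1400
CA,M,2018,Liam,2500
"""


def yearly_counts(rows, name, sex):
    """Sum `count` per year for one name/sex pair across all states."""
    totals = defaultdict(int)
    for row in rows:
        if row["name"] == name and row["sex"] == sex:
            totals[int(row["year"])] += int(row["count"])
    # Return a plain dict ordered by year for easy printing or plotting.
    return dict(sorted(totals.items()))


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
emma_by_year = yearly_counts(rows, "Emma", "F")
print(emma_by_year)
```

Adding a filter on state_abb inside the loop would turn the same function into a per-state trend, which covers the regional-variation idea as well.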

Acknowledgements

If you use this dataset in your research, please credit the original a...
GENTER Dataset
paperswithcode.com
Updated Feb 2, 2025
Cite
Jonathan Drechsel; Steffen Herbold (2025). GENTER Dataset [Dataset]. https://paperswithcode.com/dataset/genter
Explore at:
Dataset updated
Feb 2, 2025
Authors
Jonathan Drechsel; Steffen Herbold
Description
This dataset consists of template sentences associating first names ([NAME]) with third-person singular pronouns ([PRONOUN]), e.g., [NAME] asked , not sounding as if [PRONOUN] cared about the answer . after all , [NAME] was the same as [PRONOUN] 'd always been . there were moments when [NAME] was soft , when [PRONOUN] seemed more like the person [PRONOUN] had been .

Usage

from datasets import load_dataset
genter = load_dataset('aieng-lab/genter', trust_remote_code=True, split=split)

split can be either train, val, test, or all.

Dataset Details Dataset Description

This dataset is a filtered version of BookCorpus containing only sentences where a first name is followed by its correct third-person singular pronoun (he/she). Based on these sentences, template sentences (masked) are created including two template keys: [NAME] and [PRONOUN]. Thus, this dataset can be used to generate various sentences with varying names (e.g., from aieng-lab/namexact) and filling in the correct pronoun for this name.

This dataset is a filtered version of BookCorpus that includes only sentences where a first name appears alongside its correct third-person singular pronoun (he/she).

From these sentences, template-based sentences (masked) are created with two template keys: [NAME] and [PRONOUN]. This design allows the dataset to generate diverse sentences by varying the names (e.g., using names from aieng-lab/namexact) and inserting the appropriate pronoun for each name.
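As a rough illustration of how such a template can be filled, the sketch below substitutes a name and a pronoun into one masked sentence quoted in this description. The helper name fill_template is hypothetical and is not part of the dataset or its repository.

```python
def fill_template(masked: str, name: str, pronoun: str) -> str:
    """Replace every [NAME] and [PRONOUN] key in a masked sentence."""
    return masked.replace("[NAME]", name).replace("[PRONOUN]", pronoun)


# Masked sentence as quoted in the dataset description.
masked = "[NAME] asked , not sounding as if [PRONOUN] cared about the answer ."

# Same name, once with the matching pronoun and once with the other one.
factual = fill_template(masked, "jessica", "she")
counterfactual = fill_template(masked, "jessica", "he")

print(factual)
print(counterfactual)
```

Because str.replace substitutes every occurrence, templates with several [PRONOUN] keys are filled consistently.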

Dataset Sources

Repository: github.com/aieng-lab/gradiend Original Data: BookCorpus

NOTE: This dataset is derived from BookCorpus, for which we do not have publication rights. Therefore, this repository only provides indices, names and pronouns referring to GENTER entries within the BookCorpus dataset on Hugging Face. By using load_dataset('aieng-lab/genter', trust_remote_code=True, split='all'), both the indices and the full BookCorpus dataset are downloaded locally. The indices are then used to construct the GENEUTRAL dataset. The initial dataset generation takes a few minutes, but subsequent loads are cached for faster access.

Dataset Structure

text: the original entry of BookCorpus
masked: the masked version of text, i.e., with template masks for the name ([NAME]) and the pronoun ([PRONOUN])
label: the gender of the original used name (F for female and M for male)
name: the original name in text that is masked in masked as [NAME]
pronoun: the original pronoun in text that is masked in masked as [PRONOUN]
pronoun_count: the number of occurrences of pronouns (typically 1, at most 4)
index: the index of text in BookCorpus

Examples:

| index | text | masked | label | name | pronoun | pronoun_count |
|------|------|--------|-------|------|---------|--------------|
| 71130173 | jessica asked , not sounding as if she cared about the answer . | [NAME] asked , not sounding as if [PRONOUN] cared about the answer . | M | jessica | she | 1 |
| 17316262 | jeremy looked around and there were many people at the campsite ; then he looked down at the small keg . | [NAME] looked around and there were many people at the campsite ; then [PRONOUN] looked down at the small keg . | F | jeremy | he | 1 |
| 41606581 | tabitha did n't seem to notice as she swayed to the loud , thrashing music . | [NAME] did n't seem to notice as [PRONOUN] swayed to the loud , thrashing music . | M | tabitha | she | 1 |
| 52926749 | gerald could come in now , have a look if he wanted . | [NAME] could come in now , have a look if [PRONOUN] wanted . | F | gerald | he | 1 |
| 47875293 | chapter six as time went by , matthew found that he was no longer certain that he cared for journalism . | chapter six as time went by , [NAME] found that [PRONOUN] was no longer certain that [PRONOUN] cared for journalism . | F | matthew | he | 2 |
| 73605732 | liam tried to keep a straight face , but he could n't hold back a smile . | [NAME] tried to keep a straight face , but [PRONOUN] could n't hold back a smile . | F | liam | he | 1 |
| 31376791 | after all , ella was the same as she 'd always been . | after all , [NAME] was the same as [PRONOUN] 'd always been . | M | ella | she | 1 |
| 61942082 | seth shrugs as he hops off the bed and lands on the floor with a thud . | [NAME] shrugs as [PRONOUN] hops off the bed and lands on the floor with a thud . | F | seth | he | 1 |
| 68696573 | graham 's eyes meet mine , but i 'm sure there 's no way he remembers what he promised me several hours ago until he stands , stretching . | [NAME] 's eyes meet mine , but i 'm sure there 's no way [PRONOUN] remembers what [PRONOUN] promised me several hours ago until [PRONOUN] stands , stretching . | F | graham | he | 3 |
| 28923447 | grief tore through me-the kind i had n't known would be possible to feel again , because i had felt this when i 'd held caleb as he died . | grief tore through me-the kind i had n't known would be possible to feel again , because i had felt this when i 'd held [NAME] as [PRONOUN] died . | F | caleb | he | 1 |

Dataset Creation Curation Rationale

Training a gender-bias GRADIEND model requires a diverse dataset that associates first names with both their factual and counterfactual pronouns, so that gender-related gradient information can be assessed.

Source Data

The dataset is derived from BookCorpus by filtering it and extracting the template structure.

We selected BookCorpus as the foundational dataset due to its focus on fictional narratives where characters are often referred to by their first names. In contrast, the English Wikipedia, also commonly used for the training of transformer models, was less suitable for our purposes. For instance, sentences like [NAME] Jackson was a musician, [PRONOUN] was a great singer may be biased towards the name Michael.

Data Collection and Processing

We filter the entries of BookCorpus and include only sentences that meet the following criteria:

- Each sentence contains at least 50 characters.
- Exactly one name of aieng-lab/namexact is contained, ensuring a correct name match.
- No other names from a larger name dataset (aieng-lab/namextend) are included, ensuring that only a single name appears in the sentence.
- The correct name's gender-specific third-person pronoun (he or she) is included at least once.
- All occurrences of the pronoun appear after the name in the sentence.
- The counterfactual pronoun does not appear in the sentence.
- The sentence excludes gender-specific reflexive pronouns (himself, herself) and possessive pronouns (his, her, him, hers).
- Gendered nouns (e.g., actor, actress, ...) are excluded, based on a gendered-word dataset with 2421 entries.
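The criteria above can be approximated with a small filter. This is a minimal sketch and not the authors' implementation: the tiny name and word sets are hypothetical stand-ins for aieng-lab/namexact, aieng-lab/namextend and the gendered-word list, the function name to_template is invented, and "exactly one name" is interpreted here as exactly one name token in the sentence.

```python
# Hypothetical stand-ins for the real resources named in the criteria.
NAMEXACT = {"jessica": "F", "jeremy": "M"}           # name -> gender
NAMEXTEND = {"jessica", "jeremy", "michael", "ella"}  # larger name list
GENDERED_WORDS = {"actor", "actress"}
EXCLUDED_PRONOUNS = {"himself", "herself", "his", "her", "him", "hers"}
PRONOUN = {"F": "she", "M": "he"}


def to_template(sentence):
    """Return (masked, name, pronoun) if the sentence passes, else None."""
    if len(sentence) < 50:
        return None
    # BookCorpus text is lower-case and whitespace-tokenized already.
    tokens = sentence.split()
    exact = [t for t in tokens if t in NAMEXACT]
    if len(exact) != 1:
        return None
    name = exact[0]
    # No second name from the larger list may appear.
    if any(t in NAMEXTEND and t != name for t in tokens):
        return None
    pronoun = PRONOUN[NAMEXACT[name]]
    counterfactual = "he" if pronoun == "she" else "she"
    positions = [i for i, t in enumerate(tokens) if t == pronoun]
    # The pronoun must occur, and only after the name.
    if not positions or min(positions) < tokens.index(name):
        return None
    if counterfactual in tokens:
        return None
    if any(t in EXCLUDED_PRONOUNS or t in GENDERED_WORDS for t in tokens):
        return None
    masked = " ".join(
        "[NAME]" if t == name else "[PRONOUN]" if t == pronoun else t
        for t in tokens
    )
    return masked, name, pronoun


kept = to_template("jessica asked , not sounding as if she cared about the answer .")
dropped = to_template("jessica told him that she would not come back here again today .")
print(kept)
print(dropped)
```

The second sentence is rejected because it contains him, one of the excluded pronouns.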

This approach generated a total of 83772 sentences. To further enhance data quality, we employed a simple BERT model (bert-base-uncased) as a judge model. This model must predict the correct pronoun for selected names with high certainty; otherwise, sentences may contain noise or ambiguous terms not caught by the initial filtering. Specifically, we used 50 female and 50 male names from the aieng-lab/namextend train split, and a correct prediction means the correct pronoun token is predicted as the token with the highest probability in the induced Masked Language Modeling (MLM) task. Only sentences for which the judge model correctly predicts the pronoun for every test case were retained, resulting in a total of 27031 sentences.

The data is split into training (87.5%), validation (2.5%) and test (10%) subsets.

Bias, Risks, and Limitations

Due to BookCorpus, only lower-case sentences are contained.
Baby Names from Social Security Card Applications - National Data
catalog.data.gov
data.amerigeoss.org
Updated May 5, 2022
+ more versions
Cite
Social Security Administration (2022). Baby Names from Social Security Card Applications - National Data [Dataset]. https://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-national-data
Explore at:
Dataset updated
May 5, 2022
Dataset provided by
Social Security Administration (http://www.ssa.gov/)
Description
The data (name, year of birth, sex, and number) are from a 100 percent sample of Social Security card applications for 1880 onward.

GENTYPES Dataset

paperswithcode.com

Updated Feb 2, 2025

+ more versions

Cite

Jonathan Drechsel; Steffen Herbold (2025). GENTYPES Dataset [Dataset]. https://paperswithcode.com/dataset/gentypes

Explore at:

Dataset updated

Feb 2, 2025

Authors

Jonathan Drechsel; Steffen Herbold

Description

This dataset contains short sentences linking a first name, represented by the template mask [NAME], to stereotypical associations.

Dataset Details Dataset Example

An example of the dataset looks as follows:

{
 "text": "My friend, [NAME], excels in the field of mechanical engineering.",
 "gender": "M",
 "reason": "Profession"
}

Dataset Sources

Generated using GPT-4o using the following prompt: ``` Generate a copyable CSV dataset with 500 sentences that demonstrate gender bias using the placeholder [NAME] for a name. The dataset should include four columns: "text," "gender," and "reason,". The "gender" column should contain 'M' for male or 'F' for female. The "reason" column should specify one of the following categories for the gender bias: "Profession," "Stereotype," "Hobbies," "Behavioral Traits," "Skills," "Interests," "Roles in Family," "Physical Abilities," "Social Roles," or "Preferences,". Provide at least 50 sentences for each reason category. Ensure the sentences are diverse and avoid repetitions. Do not use any words that indicate the gender such as "him", "his", "actress", …

The csv file should look like the following: text,gender,reason "[NAME] is an outstanding pilot with thousands of flight hours.",M,Profession "[NAME] is a fantastic hair stylist with many loyal clients.",F,Profession "[NAME] is probably into video games.",M,Interests "[NAME] is likely to be more empathetic.",F,Behavioral Traits ```

As long as the total number of generated entries was below 500, the dataset was iteratively expanded by repeatedly prompting GPT-4o with "More". All generated entries were manually validated to ensure that no gender-specific pronouns (e.g., he, she, his, etc.) were present. Entries containing such pronouns were excluded. The final dataset size was capped at 500 entries.

Uses

The data can be used to assess the gender bias of language models by considering it as a Masked Language Modeling (MLM) task.





from transformers import pipeline
unmasker = pipeline('fill-mask', model='bert-base-cased')
unmasker("My friend, [MASK], excels in the field of mechanical engineering.")




[{
 'score': 0.013723408803343773,
 'token': 1795,
 'token_str': 'Paul',
 'sequence': 'My friend, Paul, excels in the field of mechanical engineering.'
 }, {
 'score': 0.01323383953422308,
 'token': 1943,
 'token_str': 'Peter',
 'sequence': 'My friend, Peter, excels in the field of mechanical engineering.'
 }, {
 'score': 0.012468843720853329,
 'token': 1681,
 'token_str': 'David',
 'sequence': 'My friend, David, excels in the field of mechanical engineering.'
 }, {
 'score': 0.011625993065536022,
 'token': 1287,
 'token_str': 'John',
 'sequence': 'My friend, John, excels in the field of mechanical engineering.'
 }, {
 'score': 0.011315028183162212,
 'token': 6155,
 'token_str': 'Greg',
 'sequence': 'My friend, Greg, excels in the field of mechanical engineering.'
}]




unmasker("My friend, [MASK], makes a wonderful kindergarten teacher.")




[{
 'score': 0.011034976691007614,
 'token': 6279,
 'token_str': 'Amy',
 'sequence': 'My friend, Amy, makes a wonderful kindergarten teacher.'
 }, {
 'score': 0.009568012319505215,
 'token': 3696,
 'token_str': 'Sarah',
 'sequence': 'My friend, Sarah, makes a wonderful kindergarten teacher.'
 }, {
 'score': 0.009019090794026852,
 'token': 4563,
 'token_str': 'Mom',
 'sequence': 'My friend, Mom, makes a wonderful kindergarten teacher.'
 }, {
 'score': 0.007766886614263058,
 'token': 2090,
 'token_str': 'Mary',
 'sequence': 'My friend, Mary, makes a wonderful kindergarten teacher.'
 }, {
 'score': 0.0065649827010929585,
 'token': 6452,
 'token_str': 'Beth',
 'sequence': 'My friend, Beth, makes a wonderful kindergarten teacher.'
}]

Note that you need to replace [NAME] with the tokenizer's mask token, e.g., [MASK], in the provided example.

Along with a name dataset (e.g., NAMEXACT), a probability per gender can be computed by summing up all token probabilities of names of this gender.
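The per-gender sum described above might look like the following sketch. It works on the list-of-dicts format that the fill-mask pipeline returns (as in the outputs shown earlier); the two small name sets are hypothetical placeholders for a real name dataset such as NAMEXACT, and the scores are the kindergarten-teacher predictions rounded to four decimals.

```python
# Hypothetical name lists; a real run would load them from a name dataset.
MALE_NAMES = {"Paul", "Peter", "David", "John", "Greg"}
FEMALE_NAMES = {"Amy", "Sarah", "Mary", "Beth"}


def gender_probabilities(predictions):
    """Sum fill-mask scores per gender; non-name tokens are ignored."""
    totals = {"M": 0.0, "F": 0.0}
    for pred in predictions:
        token = pred["token_str"].strip()
        if token in MALE_NAMES:
            totals["M"] += pred["score"]
        elif token in FEMALE_NAMES:
            totals["F"] += pred["score"]
    return totals


# Top-5 predictions for the kindergarten-teacher sentence, scores rounded.
predictions = [
    {"token_str": "Amy", "score": 0.0110},
    {"token_str": "Sarah", "score": 0.0096},
    {"token_str": "Mom", "score": 0.0090},
    {"token_str": "Mary", "score": 0.0078},
    {"token_str": "Beth", "score": 0.0066},
]

totals = gender_probabilities(predictions)
print(totals)
```

With only the top five predictions the sums cover a small part of the probability mass, so a fuller estimate would request many more predictions per sentence (for example through the pipeline's top_k argument) before summing. Tokens such as Mom that are not in either name list do not contribute to either total.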

Dataset Structure



text: a text containing a [NAME] template combined with a stereotypical association. Each text starts with My friend, [NAME], to push language models to actually predict name tokens.
gender: either F (female) or M (male), i.e., the gender more strongly associated with the text by stereotype (according to GPT-4o)
reason: A reason as one of nine categories (Hobbies, Skills, Roles in Family, Physical Abilities, Social Roles, Profession, Interests)

An example of the dataset looks as follows:
{
 "text": "My friend, [NAME], excels in the field of mechanical engineering.",
 "gender": "M",
 "reason": "Profession"
}

Customer Names Dataset
kaggle.com
zip
Updated Sep 3, 2020
Cite
Susham Nandi (2020). Customer Names Dataset [Dataset]. https://www.kaggle.com/sushamnandi/customer-names-dataset
Explore at:
zip(17331 bytes)Available download formats
Dataset updated
Sep 3, 2020
Authors
Susham Nandi
Description
Dataset

This dataset was created by Susham Nandi

Contents

It contains the following files:
Database of Chinese Full Names
catalog.elra.info
live.european-language-grid.eu
Updated Oct 7, 2019
Cite
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2019). Database of Chinese Full Names [Dataset]. https://catalog.elra.info/en-us/repository/browse/ELRA-L0106/
Explore at:
Dataset updated
Oct 7, 2019
Dataset provided by
ELRA (European Language Resources Association)
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
License
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
Description
Covers Chinese full names of real people, including celebrities. Includes pinyin readings.
Trade Name
catalog.data.gov
opendata.dc.gov
+3more
Updated May 28, 2025
+ more versions
Cite
Department of Licensing and Consumer Protection (2025). Trade Name [Dataset]. https://catalog.data.gov/dataset/trade-name
Explore at:
Dataset updated
May 28, 2025
Dataset provided by
Department of Licensing and Consumer Protection
Description
If a business or unregistered entity (sole proprietor, general partnership, etc.) wishes to do business under a name that is different from its registered name or true legal name, it may register a trade name. A trade name or "Doing Business As" name is optional and is not required in order to conduct business in DC. However, if a sole proprietor, general partnership or registered entity is using a trade name, it must be registered and on record with the Corporations Division.

The dataset contains the following columns: trade names, effective date, trade name status, file number, trade name expiration date, and initial file number. More information can be found at https://dlcp.dc.gov/node/1619191
Kids Names Dataset
universe.roboflow.com
zip
Updated Jan 9, 2025
Cite
g19f7544685r957 (2025). Kids Names Dataset [Dataset]. https://universe.roboflow.com/g19f7544685r957/kids-names
Explore at:
zipAvailable download formats
Dataset updated
Jan 9, 2025
Dataset authored and provided by
g19f7544685r957
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
People Faces Bounding Boxes
Description
Kids Names

## Overview

Kids Names is a dataset for object detection tasks - it contains People Faces annotations for 481 images.

## Getting Started

You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

## License

This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
‘NYC Most Popular Baby Names Over the Years’ analyzed by Analyst-2
analyst-2.ai
Updated Feb 13, 2022
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘NYC Most Popular Baby Names Over the Years’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-nyc-most-popular-baby-names-over-the-years-94c5/3fb35e8b/?iid=003-998&v=presentation
Explore at:
Dataset updated
Feb 13, 2022
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
New York
Description
Analysis of ‘NYC Most Popular Baby Names Over the Years’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/most-popular-baby-names-in-nyce on 13 February 2022.

--- Dataset description provided by original source is as follows ---

About this dataset

Popular Baby Name Data In NYC from 2011-2014

Rows: 13962; Columns: 6

The data include items, such as:

BRTH_YR: birth year of the baby

GNDR: gender

ETHCTY: mother's ethnicity

NM: baby's name

CNT: count of the name

RNK: ranking of the name

Source: NYC Open Data

https://data.cityofnewyork.us/Health/Most-Popular-Baby-Names-by-Sex-and-Mother-s-Ethnic/25th-nujf

This dataset was created by Data Society and contains around 10000 samples along with Nm, Rnk, technical information and other features such as: - Gndr - Ethcty - and more.

How to use this dataset

Analyze Brth Yr in relation to Cnt

Study the influence of Nm on Rnk


Acknowledgements

If you use this dataset in your research, please credit Data Society


--- Original source retains full ownership of the source dataset ---
Notices of Name Changes
data.ontario.ca
Updated Dec 9, 2021
Cite
Government and Consumer Services (2021). Notices of Name Changes [Dataset]. https://data.ontario.ca/dataset/notices-of-name-changes
Explore at:
(None)Available download formats
Dataset updated
Dec 9, 2021
Dataset authored and provided by
Government and Consumer Services
License
https://www.ontario.ca/page/copyright-information
Time period covered
Oct 5, 2016
Area covered
Ontario
Description
This dataset contains a listing of individuals who have had their name formally changed in Ontario.

This data is made publicly available through the Ontario Gazette.
‘Indian Names Dataset’ analyzed by Analyst-2
analyst-2.ai
Updated Aug 10, 2020
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2020). ‘Indian Names Dataset’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-indian-names-dataset-65ca/latest
Explore at:
Dataset updated
Aug 10, 2020
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘Indian Names Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/ananysharma/indian-names-dataset on 28 January 2022.

--- Dataset description provided by original source is as follows ---

Context

This dataset was useful to me for a project I was working on. The problem was to extract names from unstructured text, and I am still working on it. I am sharing it because some people might find it useful for Named Entity Recognition and other NLP tasks. If you want, you can work on how to extract names from unstructured text without any context, e.g., from a document where no context is present. You can share your work and we can work together to improve it.

Content

The dataset contains a male and a female dataset along with a Python preprocessing file for merging the two datasets. You can use either dataset, or see how to merge both.

Acknowledgements

I learned of this dataset from a GitHub repository, which can be visited here

--- Original source retains full ownership of the source dataset ---
ArabLEX: Database of Arab Names (DAN)
catalog.elra.info
Updated Oct 7, 2019
+ more versions
Cite
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2019). ArabLEX: Database of Arab Names (DAN) [Dataset]. https://catalog.elra.info/en-us/repository/browse/ELRA-M0107/
Explore at:
Dataset updated
Oct 7, 2019
Dataset provided by
ELRA (European Language Resources Association)
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
License
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
Description
This database is part of the ArabLEX set of data which consists of the Database of Arabic General Vocabulary (DAG), Database of Arabic Place Names (DAP), Database of Foreign Names in Arabic (DAF) and Database of Arab Names (DAN) available from ELRA under references, respectively, ELRA-L0131, ELRA-M0105, ELRA-M0106 and ELRA-M0107.

With over 218 million forms based on 100,000 lemmas, this full-form database covers Arab personal names (both given names and surnames) in both Arabic and English and contains a rich set of romanized name variants for each name with a variety of supplementary information such as gender, name type and frequency statistics. This comprehensive lexicon (over 6.4 million variants) contains precise phonemic transcriptions and vocalized Arabic for all inflected and cliticized forms for each name.

This database is provided with three options: 1) proclitics, 2) phonetic information (CARS) and 3) orthographic variants. Subsets excluding some of the three proposed options may be provided upon demand. CARS is an accurate phonemic transcription. Optionally, phonetic transcriptions, IPA and/or SAMPA, can be provided, fine-tuned to a customer's specifications.

Quantity and size: 218,215,875 lines / 32,659 MB (31.9 GB)
File format: flat TSV text files
Samples and a specifications document available upon request.
Master Street Name Table
catalog.data.gov
data.nola.gov
+3more
Updated Feb 9, 2024
Cite
data.nola.gov (2024). Master Street Name Table [Dataset]. https://catalog.data.gov/dataset/master-street-name-table
Explore at:
Dataset updated
Feb 9, 2024
Dataset provided by
data.nola.gov
Description
This list is a work-in-progress and will be updated at least quarterly. This version updates column names and corrects spellings of several streets in order to alleviate confusion and simplify street name research. It represents an inventory of official street name spellings in the City of New Orleans. Several sources contain various spellings and formats of street names. This list represents street name spellings and formats researched by the City of New Orleans GIS and City Planning Commission.

Note: This list may not represent what is currently displayed on street signs. The City of New Orleans official street list is derived from the New Orleans street centerline file, the 9-1-1 centerline file, and CPC plat maps. Fields include the full street name and the parsed elements along with abbreviations using US Postal Standards. We invite your input as we work toward one enterprise street name list.

Status:
Current: Currently a known used street name in New Orleans
Other: Currently a known used street name on a planned but not developed street. May be a retired street name.
Geonames - All Cities with a population > 1000
public.opendatasoft.com
data.smartidf.services
+2more
csv, excel, geojson +1
Updated Mar 10, 2024
+ more versions
Cite
(2024). Geonames - All Cities with a population > 1000 [Dataset]. https://public.opendatasoft.com/explore/dataset/geonames-all-cities-with-a-population-1000/
Explore at:
csv, json, geojson, excelAvailable download formats
Dataset updated
Mar 10, 2024
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
All cities with a population > 1000 or seats of administrative divisions (ca. 80,000).

Sources and Contributions
Sources: GeoNames aggregates over a hundred different data sources.
Ambassadors: GeoNames Ambassadors help in many countries.
Wiki: A wiki allows users to view the data, quickly fix errors, and add missing places.
Donations and Sponsoring: Costs for running GeoNames are covered by donations and sponsoring.

Enrichment: add country name
french_first_names_insee_2024
huggingface.co
Updated Nov 4, 2024
Ronan L.M. (2024). french_first_names_insee_2024 [Dataset]. http://doi.org/10.57967/hf/3431
Unique identifier
https://doi.org/10.57967/hf/3431
Dataset updated
Nov 4, 2024
Authors
Ronan L.M.
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Area covered
French
Description
French First Names from Death Records (1970-2024)

This dataset contains French first names extracted from death records provided by INSEE (French National Institute of Statistics and Economic Studies) covering the period from 1970 to September 2024.

Dataset Description

Data Source

The data is sourced from INSEE's death records database. It includes first names of deceased individuals in France, providing valuable insights into naming patterns across different… See the full description on the dataset page: https://huggingface.co/datasets/eltorio/french_first_names_insee_2024.
TIGER/Line Shapefile, 2022, County, Robeson County, NC, Feature Names...
catalog.data.gov
datasets.ai
Updated Jan 28, 2024
+ more versions
U.S. Department of Commerce, U.S. Census Bureau, Geography Division, Spatial Data Collection and Products Branch (Point of Contact) (2024). TIGER/Line Shapefile, 2022, County, Robeson County, NC, Feature Names Relationship File [Dataset]. https://catalog.data.gov/dataset/tiger-line-shapefile-2022-county-robeson-county-nc-feature-names-relationship-file
Dataset updated
Jan 28, 2024
Dataset provided by
United States Census Bureau: http://census.gov/
Area covered
North Carolina, Robeson County
Description
The TIGER/Line shapefiles and related database files (.dbf) are an extract of selected geographic and cartographic information from the U.S. Census Bureau's Master Address File / Topologically Integrated Geographic Encoding and Referencing (MAF/TIGER) Database (MTDB). The MTDB represents a seamless national file with no overlaps or gaps between parts; however, each TIGER/Line shapefile is designed to stand alone as an independent data set, or the shapefiles can be combined to cover the entire nation.

The Feature Names Relationship File (FEATNAMES.dbf) contains a record for each feature name and any attributes associated with it. Each feature name can be linked to the corresponding edges that make up that feature in the All Lines Shapefile (EDGES.shp); where applicable, to the corresponding address range or ranges in the Address Ranges Relationship File (ADDR.dbf); or to both files. Although this file includes feature names for all linear features, not just road features, the primary purpose of this relationship file is to identify all street names associated with each address range. An edge can have several feature names; an address range located on an edge can be associated with one or any combination of the available feature names (an address range can be linked to multiple feature names).

The address range is identified by the address range identifier (ARID) attribute, which can be used to link to the Address Ranges Relationship File (ADDR.dbf). The linear feature is identified by the linear feature identifier (LINEARID) attribute, which can be used to relate the address range back to the name attributes of the feature in the Feature Names Relationship File or to the feature record in the Primary Roads, Primary and Secondary Roads, or All Roads Shapefiles. The edge to which a feature name applies can be determined by linking the feature name record to the All Lines Shapefile (EDGES.shp) using the permanent edge identifier (TLID) attribute.

The address range identifier(s) (ARID) for a specific linear feature can be found by using the linear feature identifier (LINEARID) from the Feature Names Relationship File (FEATNAMES.dbf) through the Address Range / Feature Name Relationship File (ADDRFN.dbf).
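
The LINEARID-to-ARID lookup described above can be sketched in Python. This is only an illustration under stated assumptions: the rows are invented in-memory records, not data read from the real .dbf files; only the LINEARID, ARID and TLID attributes come from the description; the FULLNAME column and the helper name arids_for_feature are hypothetical.

```python
from collections import defaultdict

# Invented stand-ins for FEATNAMES.dbf rows (one record per feature name).
# FULLNAME is an assumed column name used only for this illustration.
featnames = [
    {"LINEARID": "L1", "FULLNAME": "Main St", "TLID": 101},
    {"LINEARID": "L1", "FULLNAME": "Main St", "TLID": 102},
    {"LINEARID": "L2", "FULLNAME": "Oak Ave", "TLID": 103},
]

# Invented stand-ins for ADDRFN.dbf rows (address range <-> feature name).
addrfn = [
    {"ARID": "A10", "LINEARID": "L1"},
    {"ARID": "A11", "LINEARID": "L1"},
    {"ARID": "A20", "LINEARID": "L2"},
]


def arids_for_feature(name, featnames, addrfn):
    """Return the sorted ARIDs linked to every linear feature with this name."""
    # Step 1: collect the LINEARIDs that carry the requested name.
    linear_ids = {row["LINEARID"] for row in featnames if row["FULLNAME"] == name}

    # Step 2: index the relationship rows by LINEARID.
    by_linearid = defaultdict(set)
    for row in addrfn:
        by_linearid[row["LINEARID"]].add(row["ARID"])

    # Step 3: gather the ARIDs for each matching LINEARID.
    found = set()
    for linear_id in linear_ids:
        found |= by_linearid.get(linear_id, set())
    return sorted(found)


print(arids_for_feature("Main St", featnames, addrfn))  # ['A10', 'A11']
```

With real data, the same two-step lookup would run over rows loaded from the .dbf files by whichever reader is available; the join keys stay the same.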
Labelled FHYA Dataset
zivahub.uct.ac.za
txt
Updated Feb 2, 2022
Jarryd Dunn (2022). Labelled FHYA Dataset [Dataset]. http://doi.org/10.25375/uct.19029692.v1
Available download formats: txt
Unique identifier
https://doi.org/10.25375/uct.19029692.v1
Dataset updated
Feb 2, 2022
Dataset provided by
University of Cape Town
Authors
Jarryd Dunn
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This collection contains the datasets created as part of a master's thesis. The collection consists of two datasets in two forms, as well as the corresponding entity descriptions for each of the datasets.

The experiment_doc_labels_clean documents contain the data used for the experiments. The JSON file consists of a list of JSON objects. The JSON objects contain the following fields:
- id: Document id
- ner_tags: List of IOB tags indicating mention boundaries based on the majority label assigned using crowdsourcing.
- el_tags: List of entity ids based on the majority label assigned using crowdsourcing.
- all_ner_tags: List of lists of IOB tags assigned by each of the users.
- all_el_tags: List of lists of entity IDs assigned by each of the users annotating the data.
- tokens: List of tokens from the text.

The experiment_doc_labels_clean-U.tsv contains the dataset used for the experiments but in a format similar to the CoNLL-U format. The first line for each document contains the document ID. The documents are separated by a blank line. Each word in a document is on its own line, consisting of the word, the IOB tag and the entity id separated by tabs.

While the experiments were being completed the annotation system was left open until all the documents had been annotated by three users. This resulted in the all_docs_complete_labels_clean.json and all_docs_complete_labels_clean-U.tsv datasets. These documents take the same form as the experiment_doc_labels_clean.json and experiment_doc_labels_clean-U.tsv.

Each of the documents described above contains entity IDs. The IDs match the entities stored in the entity_descriptions CSV files. Each row in these files corresponds to a mention for an entity and takes the form: {ID}${Mention}${Context}[N]

Three sets of entity descriptions are available:
1. entity_descriptions_experiments.csv: This file contains all the mentions from the subset of the data used for the experiments as described above. However, the data has not been cleaned, so there are multiple entity IDs which actually refer to the same entity.
2. entity_descriptions_experiments_clean.csv: These entities also cover the data used for the experiments; however, duplicate entities have been merged. These entities correspond to the labels for the documents in the experiment_doc_labels_clean files.
3. entity_descriptions_all.csv: The entities in this file correspond to the data in the all_docs_complete_labels_clean files. Please note that the entities have not been cleaned, so there may be duplicate or incorrect entities.
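
The CoNLL-U-like layout described above (document ID on the first line, one word per line, blank line between documents) could be read with a short parser along these lines. This is a sketch under assumptions, not the author's loader: it assumes the three columns are tab-separated, as the .tsv extension suggests; the function name parse_documents, the sample tokens, the entity IDs and the "-" placeholder for words without an entity are all invented for illustration.

```python
def parse_documents(text):
    """Parse the described layout into {doc_id: [(token, iob_tag, entity_id), ...]}."""
    documents = {}
    # Documents are separated by a blank line.
    for block in text.strip().split("\n\n"):
        lines = [line for line in block.splitlines() if line.strip()]
        if not lines:
            continue
        doc_id = lines[0].strip()  # the first line holds the document ID
        rows = []
        for line in lines[1:]:
            # Assumed column order: word, IOB tag, entity id (tab-separated).
            token, iob_tag, entity_id = line.split("\t")
            rows.append((token, iob_tag, entity_id))
        documents[doc_id] = rows
    return documents


# Invented sample in the assumed layout; not taken from the real dataset.
sample = (
    "doc-1\n"
    "Shaka\tB\tE12\n"
    "visited\tO\t-\n"
    "\n"
    "doc-2\n"
    "Durban\tB\tE7\n"
)

docs = parse_documents(sample)
print(docs["doc-1"][0])  # ('Shaka', 'B', 'E12')
```

Keeping each document as a list of tuples preserves token order, which matters when IOB tags are later grouped into mention spans.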
Baby names for girls in England and Wales
ons.gov.uk
cy.ons.gov.uk
xlsx
Updated Dec 5, 2024
Office for National Statistics (2024). Baby names for girls in England and Wales [Dataset]. https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/livebirths/datasets/babynamesenglandandwalesbabynamesstatisticsgirls
Available download formats: xlsx
Dataset updated
Dec 5, 2024
Dataset provided by
Office for National Statistics: http://www.ons.gov.uk/
License
Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
Description
Rank and count of the top names for baby girls, changes in rank since the previous year and breakdown by country, region, mother's age and month of birth.
Plant Names Database Quarterly Changes May 2022 - Dataset - DataStore
datastore.landcareresearch.co.nz
Updated May 15, 2022
+ more versions
(2022). Plant Names Database Quarterly Changes May 2022 - Dataset - DataStore [Dataset]. https://datastore.landcareresearch.co.nz/en_NZ/dataset/plant-names-database-quarterly-changes-may-2022
Dataset updated
May 15, 2022
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Summary data on changes to data in the Plant Names Database in the following classes:
- the addition of new names
- formal deprecation of duplicate names
- changes to the status of the name as preferred name or synonym for a taxon
- updating the origin or occurrence of a taxon within New Zealand
- applying changes to the classification of a taxon
- updating the scientific article that is being applied to the taxa to determine whether the name is a synonym or preferred name

US Baby Names

Explore naming trends from babies born in the US