The data (name, year of birth, sex, state, and number) are from a 100 percent sample of Social Security card applications starting with 1910. National data is in another dataset.
The data (name, year of birth, sex, and number) are from a 100 percent sample of Social Security card applications for 1880 onward.
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
This dataset compiles the first version of the worldwide gender-name dictionary (WGND) including 6.2 million names for 182 different countries to disambiguate the gender.
The first names file contains data on the first names given to children born in France since 1900. These data are available for France as a whole and by department. The downloadable files list births, not the people living in a given year. They are available in two formats (DBASE and CSV). Because the files are large, using a database manager or statistical software is recommended. The national file can be opened in some spreadsheet programs. The departmental file, however, is too large (3.8 million lines) to open in a spreadsheet, so a lighter version covering only births since 2000 is also provided. The data can be accessed in:
- a national data file containing the first names given to children born in France between 1900 and 2022 (data before 2012 cover only France excluding Mayotte), with the counts by sex for each first name;
- a departmental data file containing the same information at the level of the department of birth;
- a lighter data file containing the department-of-birth-level information for births since 2000.
Official Street Names in the City of Los Angeles created and maintained by the Bureau of Engineering.
Popular Baby Names by Sex and Ethnic Group. Data were collected through civil birth registration. Each record represents the ranking of a baby name in the order of frequency. Data can be used to represent the popularity of a name. Caution should be used when assessing the rank of a baby name if the frequency count is close to 10; the ranking may vary year to year.
https://creativecommons.org/publicdomain/zero/1.0/
Dataset of US baby names from 1910 to 2021. Includes State, Sex, Year, Name, and Count as features.
Mainly used for a tutorial but can be used for classification/other visualizations.
This dataset tracks the updates made on the dataset "Most Popular Baby Names" as a repository for previous versions of the data and metadata.
This dataset consists of template sentences associating first names ([NAME]) with third-person singular pronouns ([PRONOUN]), e.g.,
[NAME] asked , not sounding as if [PRONOUN] cared about the answer .
after all , [NAME] was the same as [PRONOUN] 'd always been .
there were moments when [NAME] was soft , when [PRONOUN] seemed more like the person [PRONOUN] had been .
Usage
genter = load_dataset('aieng-lab/genter', trust_remote_code=True, split=split)
Here, split can be either train, val, test, or all.
Dataset Details
Dataset Description
This dataset is a filtered version of BookCorpus containing only sentences in which a first name is followed by its correct third-person singular pronoun (he/she).
From these sentences, masked template sentences are created with two template keys: [NAME] and [PRONOUN]. The dataset can thus be used to generate diverse sentences by varying the name (e.g., using names from aieng-lab/namexact) and inserting the correct pronoun for each name.
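As a minimal sketch of how such a template might be filled, the snippet below substitutes a name and a pronoun into the two template keys. The function name `fill_template` and the small name-to-pronoun mapping are assumptions made for illustration; in practice the names would come from a resource such as aieng-lab/namexact.

```python
def fill_template(masked: str, name: str, pronoun: str) -> str:
    """Replace every [NAME] and [PRONOUN] key in a masked template sentence."""
    return masked.replace("[NAME]", name).replace("[PRONOUN]", pronoun)


# Hypothetical mapping for illustration only; it stands in for a real name list.
NAME_TO_PRONOUN = {"jessica": "she", "jeremy": "he"}

template = "[NAME] asked , not sounding as if [PRONOUN] cared about the answer ."

# One factual sentence per name in the mapping.
sentences = [fill_template(template, name, pronoun)
             for name, pronoun in NAME_TO_PRONOUN.items()]

# A counterfactual variant pairs a name with the opposite pronoun.
counterfactual = fill_template(template, "jessica", "he")
```

Using plain string replacement keeps templates with several [PRONOUN] keys consistent, since every occurrence receives the same pronoun.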
Dataset Sources
Repository: github.com/aieng-lab/gradiend
Original Data: BookCorpus
NOTE: This dataset is derived from BookCorpus, for which we do not have publication rights. Therefore, this repository only provides indices, names and pronouns referring to GENTER entries within the BookCorpus dataset on Hugging Face. By using load_dataset('aieng-lab/genter', trust_remote_code=True, split='all'), both the indices and the full BookCorpus dataset are downloaded locally. The indices are then used to construct the GENTER dataset. The initial dataset generation takes a few minutes, but subsequent loads are cached for faster access.
Dataset Structure
- text: the original entry of BookCorpus
- masked: the masked version of text, i.e., with template masks for the name ([NAME]) and the pronoun ([PRONOUN])
- label: the gender of the original used name (F for female and M for male)
- name: the original name in text that is masked in masked as [NAME]
- pronoun: the original pronoun in text that is masked in masked as [PRONOUN]
- pronoun_count: the number of occurrences of pronouns (typically 1, at most 4)
- index: the index of text in BookCorpus
Examples:

| index | text | masked | label | name | pronoun | pronoun_count |
|-------|------|--------|-------|------|---------|---------------|
| 71130173 | jessica asked , not sounding as if she cared about the answer . | [NAME] asked , not sounding as if [PRONOUN] cared about the answer . | M | jessica | she | 1 |
| 17316262 | jeremy looked around and there were many people at the campsite ; then he looked down at the small keg . | [NAME] looked around and there were many people at the campsite ; then [PRONOUN] looked down at the small keg . | F | jeremy | he | 1 |
| 41606581 | tabitha did n't seem to notice as she swayed to the loud , thrashing music . | [NAME] did n't seem to notice as [PRONOUN] swayed to the loud , thrashing music . | M | tabitha | she | 1 |
| 52926749 | gerald could come in now , have a look if he wanted . | [NAME] could come in now , have a look if [PRONOUN] wanted . | F | gerald | he | 1 |
| 47875293 | chapter six as time went by , matthew found that he was no longer certain that he cared for journalism . | chapter six as time went by , [NAME] found that [PRONOUN] was no longer certain that [PRONOUN] cared for journalism . | F | matthew | he | 2 |
| 73605732 | liam tried to keep a straight face , but he could n't hold back a smile . | [NAME] tried to keep a straight face , but [PRONOUN] could n't hold back a smile . | F | liam | he | 1 |
| 31376791 | after all , ella was the same as she 'd always been . | after all , [NAME] was the same as [PRONOUN] 'd always been . | M | ella | she | 1 |
| 61942082 | seth shrugs as he hops off the bed and lands on the floor with a thud . | [NAME] shrugs as [PRONOUN] hops off the bed and lands on the floor with a thud . | F | seth | he | 1 |
| 68696573 | graham 's eyes meet mine , but i 'm sure there 's no way he remembers what he promised me several hours ago until he stands , stretching . | [NAME] 's eyes meet mine , but i 'm sure there 's no way [PRONOUN] remembers what [PRONOUN] promised me several hours ago until [PRONOUN] stands , stretching . | F | graham | he | 3 |
| 28923447 | grief tore through me-the kind i had n't known would be possible to feel again , because i had felt this when i 'd held caleb as he died . | grief tore through me-the kind i had n't known would be possible to feel again , because i had felt this when i 'd held [NAME] as [PRONOUN] died . | F | caleb | he | 1 |
Dataset Creation
Curation Rationale
Training a gender-bias GRADIEND model requires a diverse dataset that associates first names with both their factual and their counterfactual pronouns, so that gender-related gradient information can be assessed.
Source Data
The dataset is derived from BookCorpus by filtering it and extracting the template structure.
We selected BookCorpus as the foundational dataset because of its focus on fictional narratives, in which characters are often referred to by their first names. In contrast, the English Wikipedia, which is also commonly used to train transformer models, was less suitable for our purposes. For instance, a sentence like "[NAME] Jackson was a musician, [PRONOUN] was a great singer" may be biased towards the name Michael.
Data Collection and Processing
We filter the entries of BookCorpus and include only sentences that meet the following criteria:
- Each sentence contains at least 50 characters.
- Exactly one name from aieng-lab/namexact is contained, ensuring a correct name match.
- No other names from a larger name dataset (aieng-lab/namextend) are included, ensuring that only a single name appears in the sentence.
- The correct name's gender-specific third-person pronoun (he or she) is included at least once.
- All occurrences of the pronoun appear after the name in the sentence.
- The counterfactual pronoun does not appear in the sentence.
- The sentence excludes gender-specific reflexive pronouns (himself, herself) and possessive pronouns (his, her, him, hers).
- Gendered nouns (e.g., actor, actress, ...) are excluded, based on a gendered-word dataset with 2421 entries.
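The criteria above can be sketched as a single predicate over a whitespace-tokenized sentence. This is an illustrative reading of the rules, not the authors' implementation: the function name `passes_filter` and the tiny word sets are assumptions that stand in for aieng-lab/namexact, aieng-lab/namextend, and the gendered-word list, and "exactly one name" is interpreted here as exactly one name token.

```python
# Hypothetical stand-ins for the real resources (illustration only).
NAMES = {"jessica": "she", "jeremy": "he"}        # stands in for aieng-lab/namexact
EXTENDED_NAMES = {"jessica", "jeremy", "liam"}    # stands in for aieng-lab/namextend
GENDERED_NOUNS = {"actor", "actress"}             # stands in for the gendered-word list
EXCLUDED_PRONOUNS = {"himself", "herself", "his", "her", "him", "hers"}


def passes_filter(sentence: str) -> bool:
    """Return True if the sentence satisfies all of the listed criteria."""
    if len(sentence) < 50:                        # at least 50 characters
        return False
    tokens = sentence.lower().split()
    names_found = [t for t in tokens if t in NAMES]
    if len(names_found) != 1:                     # exactly one known name token
        return False
    name = names_found[0]
    if any(t in EXTENDED_NAMES and t != name for t in tokens):
        return False                              # no other name from the larger list
    pronoun = NAMES[name]
    counterfactual = "she" if pronoun == "he" else "he"
    if pronoun not in tokens or counterfactual in tokens:
        return False                              # factual pronoun present, counterfactual absent
    name_pos = tokens.index(name)
    if any(t == pronoun and i < name_pos for i, t in enumerate(tokens)):
        return False                              # every pronoun occurrence follows the name
    if any(t in EXCLUDED_PRONOUNS or t in GENDERED_NOUNS for t in tokens):
        return False                              # no reflexive/possessive pronouns or gendered nouns
    return True
```

Whitespace tokenization is sufficient in this sketch because BookCorpus sentences are already lower-cased and space-separated, as the examples in the dataset show.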
This approach generated a total of 83772 sentences. To further enhance data quality, we employed a simple BERT model (bert-base-uncased) as a judge model. This model must predict the correct pronoun for selected names with high certainty; otherwise, a sentence may contain noise or ambiguous terms not caught by the initial filtering. Specifically, we used 50 female and 50 male names from the aieng-lab/namextend train split, and a correct prediction means that the correct pronoun token is predicted as the token with the highest probability in the induced Masked Language Modeling (MLM) task. Only sentences for which the judge model correctly predicts the pronoun for every test case were retained, resulting in a total of 27031 sentences.
The data is split into training (87.5%), validation (2.5%) and test (10%) subsets.
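For illustration, a split with these proportions could be produced by a seeded shuffle of the sentence indices. How the actual split was generated is not stated, so the function name `split_indices`, the seed, and the rounding rule are assumptions of this sketch.

```python
import random


def split_indices(n: int, seed: int = 0) -> dict:
    """Shuffle range(n) and cut it into train/val/test at 87.5% / 2.5% / 10%."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)   # seeded, so the split is reproducible
    n_val = round(n * 0.025)
    n_test = round(n * 0.10)
    n_train = n - n_val - n_test           # the remainder goes to training
    return {
        "train": indices[:n_train],
        "val": indices[n_train:n_train + n_val],
        "test": indices[n_train + n_val:],
    }


# 27031 is the number of sentences retained after the judge-model filter.
splits = split_indices(27031)
sizes = {name: len(idx) for name, idx in splits.items()}
```

Assigning the remainder to the training set guarantees that the three subsets are disjoint and together cover every sentence, whatever the rounding.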
Bias, Risks, and Limitations
Because BookCorpus is lower-cased, the dataset contains only lower-case sentences.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
License information was derived automatically
Have you ever wondered where a place got its name from? This page is the beginning of a much larger project that will attempt to geospatially document as much as we are able to regarding Tennessee's history of place names. The data as it currently exists represents many years of work by my friend, the geographer and author, Allen Coggins. He provided me with an Access database with 11,720 records of places with information including: name origin, place description, notes, and the starting and ending dates of any associated post offices. This is the work that he managed to digitize from a card catalog (now in my possession). I estimate that the records in the database represent about 1/3rd of the card catalog's records. The data I present here relates to 2,349 records which I was able to easily (enough) match to the GNIS dataset (which has 77,746 Tennessee records in the version that I accessed 2/28/2025). The sheer volume of this project means that it will take years to develop. Index cards will need to be scanned and digitized. Records will need to be matched and georeferenced. All this will only be to catch up to where Allen got us. Surely we will learn more about origins of place names as the project progresses. We will need to accurately document our work along the way.
The Place Name database is maintained by the Survey and Mapping Office of LandsD. The Database is based on "A Gazetteer of Place Names", first published in 1960, which contains place-name features mapped at 1:25,000. The Database stores and maintains names of settlements (e.g. area, town, village), hydrographic features (e.g. river, channel), and topographic features (e.g. relief).
Database of Irish Place Names
Open Government Licence 3.0 (http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/)
License information was derived automatically
Top 100 most popular boys' and girls' names.
Source agency: Northern Ireland Statistics and Research Agency
Designation: Official Statistics not designated as National Statistics
Language: English
Alternative title: Babies First Names Bulletin (Northern Ireland)
This table of street names is based on the street directory maintained by the Department of Public Works & Parks (DPW&P) of the City of Worcester, MA. For labeling purposes, the unique name identifier, Street Name ID (NEW_NM_ID), corresponds with appropriate road centerline segments in the separate Street Centerlines dataset. To view the Street Directory visit the City of Worcester Street Directory. Informing Worcester is the City of Worcester's open data portal where interested parties can obtain public information at no cost.
The most popular baby names by sex and mother's ethnicity in New York City.
https://whoisdatacenter.com/terms-of-use/
Search for a business by name. You can obtain business information and then proceed to purchase a certificate of good standing or other documents. The purpose of this search is simply to determine whether a company/entity exists and to provide basic information on the company/entity.
Open Government Licence 3.0 (http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/)
License information was derived automatically
The Place Names gazetteer is a geographical index of 336 towns and villages across Northern Ireland. The data was derived from OSNI's 1:250,000 Ireland North mapping. The locations represent the label position on the mapping rather than the precise real-world position. Published here for Open Data. By downloading or using this dataset you agree to abide by the LPS Open Government Data Licence. Please note for Open Data NI users: the Esri REST API is not broken; it will not open on its own in a web browser, but it can be copied and used in desktop and web maps.
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This dataset was created by Tiny Home
Released under MIT
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Name lists used for data augmentation when testing biases (in terms of error disparities) of Named Entity Recognition in Danish NLP pipelines.
The following lists are from Statistics Denmark:
The following lists are from Eva Villarsen Meldgaard. 2005. Muslimske fornavne i danmark. Publisher: Københavns Universitet
The list majority_unisex_names.csv is retrieved from The Agency of Family Law in Denmark, and the numbers are retrieved from the above lists from Statistics Denmark.
The list minority_last_names.csv is retrieved from FamilyEducation.
The list overlapping_names.csv contains first names that occur on both the list of majority names and the list of minority names.