81 datasets found
  1. Distribution of first name and last name frequencies by country

    • figshare.com
    xlsx
    Updated Feb 2, 2023
    Mike Thelwall (2023). Distribution of first name and last name frequencies by country [Dataset]. http://doi.org/10.6084/m9.figshare.21956795.v2
    Available download formats: xlsx
    Dataset updated
    Feb 2, 2023
    Dataset provided by
    figshare
    Authors
    Mike Thelwall
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Distribution of first and last name frequencies of academic authors by country.

    Spreadsheet 1 contains 50 countries, with names based on affiliations in Scopus journal articles 2001-2021.

    Spreadsheet 2 contains 200 countries, with names based on affiliations in Scopus journal articles 2001-2021, using a marginally updated last name extraction algorithm that is almost the same except for Dutch/Flemish names.

    From the paper: Can national researcher mobility be tracked by first or last name uniqueness?

    For example, the distribution for the UK shows a single peak for international names and no national names, Belgium has both a national peak and an international peak, and China has mainly a national peak. The 50 countries are:

    No  Code  Country
    1   SB    Serbia
    2   IE    Ireland
    3   HU    Hungary
    4   CL    Chile
    5   CO    Colombia
    6   NG    Nigeria
    7   HK    Hong Kong
    8   AR    Argentina
    9   SG    Singapore
    10  NZ    New Zealand
    11  PK    Pakistan
    12  TH    Thailand
    13  UA    Ukraine
    14  SA    Saudi Arabia
    15  RO    Romania
    16  ID    Indonesia
    17  IL    Israel
    18  MY    Malaysia
    19  DK    Denmark
    20  CZ    Czech Republic
    21  ZA    South Africa
    22  AT    Austria
    23  FI    Finland
    24  PT    Portugal
    25  GR    Greece
    26  NO    Norway
    27  EG    Egypt
    28  MX    Mexico
    29  BE    Belgium
    30  CH    Switzerland
    31  SW    Sweden
    32  PL    Poland
    33  TW    Taiwan
    34  NL    Netherlands
    35  TK    Turkey
    36  IR    Iran
    37  RU    Russia
    38  AU    Australia
    39  BR    Brazil
    40  KR    South Korea
    41  ES    Spain
    42  CA    Canada
    43  IT    Italy
    44  FR    France
    45  IN    India
    46  DE    Germany
    47  US    USA
    48  UK    UK
    49  JP    Japan
    50  CN    China

  2. Popular 5000 Last Names in the US

    • johnsnowlabs.com
    csv
    Updated Jan 20, 2021
    John Snow Labs (2021). Popular 5000 Last Names in the US [Dataset]. https://www.johnsnowlabs.com/marketplace/popular-5000-last-names-in-the-us/
    Available download formats: csv
    Dataset updated
    Jan 20, 2021
    Dataset authored and provided by
    John Snow Labs
    Area covered
    United States
    Description

    This dataset lists 5000 popular last names in the United States. For each name, the data is broken down by race as percentages.

  3. Customer Names Dataset

    • kaggle.com
    zip
    Updated Sep 3, 2020
    Susham Nandi (2020). Customer Names Dataset [Dataset]. https://www.kaggle.com/sushamnandi/customer-names-dataset
    Available download formats: zip (17331 bytes)
    Dataset updated
    Sep 3, 2020
    Authors
    Susham Nandi
    Description

    Dataset

    This dataset was created by Susham Nandi

    Contents

    It contains the following files:

  4. Popular White Last Names in the US

    • johnsnowlabs.com
    csv
    Updated Jan 20, 2021
    Popular White Last Names in the US [Dataset]. https://www.johnsnowlabs.com/marketplace/popular-white-last-names-in-the-us/
    Available download formats: csv
    Dataset updated
    Jan 20, 2021
    Dataset authored and provided by
    John Snow Labs
    Area covered
    United States
    Description

    This dataset lists popular last names among the White population in the United States.

  5. Namesakes

    • figshare.com
    json
    Updated Nov 20, 2021
    Oleg Vasilyev; Aysu Altun; Nidhi Vyas; Vedant Dharnidharka; Erika Lampert; John Bohannon (2021). Namesakes [Dataset]. http://doi.org/10.6084/m9.figshare.17009105.v1
    Available download formats: json
    Dataset updated
    Nov 20, 2021
    Dataset provided by
    figshare
    Authors
    Oleg Vasilyev; Aysu Altun; Nidhi Vyas; Vedant Dharnidharka; Erika Lampert; John Bohannon
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract

    Motivation: creating a challenging dataset for testing Named-Entity Linking. The Namesakes dataset consists of three closely related datasets: Entities, News and Backlinks. Entities were collected as Wikipedia text chunks corresponding to highly ambiguous entity names. The News were collected as random news text chunks, containing mentions that either belong to the Entities dataset or can be easily confused with them. Backlinks were obtained from Wikipedia dump data with the intention of having mentions linked to the entities of the Entities dataset. The Entities and News are human-labeled, resolving the mentions of the entities.

    Methods

    Entities were collected as Wikipedia text chunks corresponding to highly ambiguous entity names: the most popular people names, the most popular locations, and organizations with name ambiguity. In each Entities text chunk, the named entities with a name similar to the chunk's Wikipedia page name are labeled. For labeling, these entities were suggested to human annotators (odetta.ai) to tag as "Same" (same as the page entity) or "Other". The labeling was done by 6 experienced annotators who had passed a preliminary trial task. Only tags assigned in agreement by at least 5 annotators were accepted, and these were then reconciled by an experienced reconciliator.

    The News were collected as random news text chunks, containing mentions which either belong to the Entities dataset or can be easily confused with them. In each News text chunk one mention was selected for labeling, and 3-10 Wikipedia pages from Entities were suggested as the labels for an annotator to choose from. The labeling was done by 3 experienced annotators (odetta.ai), after the annotators passed a preliminary trial task. The results were reconciled by an experienced reconciliator. All the labeling was done using Lighttag (lighttag.io).

    Backlinks were obtained from Wikipedia dump data (dumps.wikimedia.org/enwiki/20210701) with intention to have mentions linked to the entities of the Entity dataset. The backlinks were filtered to leave only mentions in a good quality text; each text was cut 1000 characters after the last mention.

    Usage Notes

    Entities:

    File: Namesakes_entities.jsonl

    The Entities dataset consists of 4148 Wikipedia text chunks containing human-tagged mentions of entities. Each mention is tagged either as "Same" (meaning that the mention is of this Wikipedia page entity), or "Other" (meaning that the mention is of some other entity, just having the same or similar name). The Entities dataset is a jsonl list; each item is a dictionary with the following keys and values:

    • 'pagename': page name of the Wikipedia page.
    • 'pageid': page id of the Wikipedia page.
    • 'title': title of the Wikipedia page.
    • 'url': URL of the Wikipedia page.
    • 'text': the text chunk from the Wikipedia page.
    • 'entities': list of the mentions in the page text; each entity is represented by a dictionary with the keys:
      • 'text': the mention as a string from the page text.
      • 'start': start character position of the entity in the text.
      • 'end': end (one-past-last) character position of the entity in the text.
      • 'tag': annotation tag given as a string, either 'Same' or 'Other'.
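
    The record layout described above can be read with a few lines of Python. The following is only a minimal sketch: the sample record is invented for illustration (it is not taken from the dataset), and the helper name iter_mentions is hypothetical. Only the key names come from the description.

```python
import json

# Invented one-line sample in the described layout (NOT real dataset content).
SAMPLE_JSONL = json.dumps({
    "pagename": "Example_Page",
    "pageid": 1,
    "title": "Example Page",
    "url": "https://en.wikipedia.org/wiki/Example_Page",
    "text": "Example Page is a town. Another Example Page is a river.",
    "entities": [
        {"text": "Example Page", "start": 0, "end": 12, "tag": "Same"},
        {"text": "Example Page", "start": 32, "end": 44, "tag": "Other"},
    ],
})


def iter_mentions(jsonl_text):
    """Yield (page title, mention string, tag) for every labeled mention."""
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        for entity in record["entities"]:
            # 'end' is one-past-last, so a plain slice recovers the mention.
            span = record["text"][entity["start"]:entity["end"]]
            yield record["title"], span, entity["tag"]


mentions = list(iter_mentions(SAMPLE_JSONL))
same_count = sum(1 for _, _, tag in mentions if tag == "Same")
print(mentions, same_count)
```

    For the real file, the same function could be fed the contents of Namesakes_entities.jsonl instead of the sample string.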

    News:

    File: Namesakes_news.jsonl

    The News dataset consists of 1000 news text chunks, each one with a single annotated entity mention. The annotation either points to the corresponding entity from the Entities dataset (if the mention is of that entity), or indicates that the mentioned entity does not belong to the Entities dataset. The News dataset is a jsonl list; each item is a dictionary with the following keys and values:

    • 'id_text': id of the sample.
    • 'text': the text chunk.
    • 'urls': list of URLs of Wikipedia entities suggested to labelers for identification of the entity mentioned in the text.
    • 'entity': a dictionary describing the annotated entity mention in the text:
      • 'text': the mention as a string found by an NER model in the text.
      • 'start': start character position of the mention in the text.
      • 'end': end (one-past-last) character position of the mention in the text.
      • 'tag': this key exists only if the mentioned entity is annotated as belonging to the Entities dataset. If so, the value is a dictionary identifying the Wikipedia page assigned by annotators to the mentioned entity:
        • 'pageid': Wikipedia page id.
        • 'pagetitle': page title.
        • 'url': page URL.
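
    Because the 'tag' key is optional in News items, code that consumes them has to check for it. A small sketch under the assumption that items have already been parsed into dictionaries; both sample items are invented and the helper linked_pageid is hypothetical:

```python
# Two invented items in the described News layout (NOT real dataset content).
news_items = [
    {
        "id_text": "n1",
        "text": "Example Page opened a new bridge.",
        "urls": ["https://en.wikipedia.org/wiki/Example_Page"],
        "entity": {
            "text": "Example Page", "start": 0, "end": 12,
            "tag": {
                "pageid": 1,
                "pagetitle": "Example Page",
                "url": "https://en.wikipedia.org/wiki/Example_Page",
            },
        },
    },
    {
        "id_text": "n2",
        "text": "Another Example Page was mentioned.",
        "urls": ["https://en.wikipedia.org/wiki/Example_Page"],
        # No 'tag' key: the mention does not belong to the Entities dataset.
        "entity": {"text": "Example Page", "start": 8, "end": 20},
    },
]


def linked_pageid(item):
    """Return the annotated Wikipedia page id, or None if the mention is unlinked."""
    tag = item["entity"].get("tag")
    return tag["pageid"] if tag is not None else None


resolved = {item["id_text"]: linked_pageid(item) for item in news_items}
print(resolved)
```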

    Backlinks dataset: The Backlinks dataset consists of two parts: dictionary Entity-to-Backlinks and Backlinks documents. The dictionary points to backlinks for each entity of the Entity dataset (if any backlinks exist for the entity). The Backlinks documents are the backlinks Wikipedia text chunks with identified mentions of the entities from the Entities dataset.

    Each mention is identified by surrounded double square brackets, e.g. "Muir built a small cabin along [[Yosemite Creek]].". However, if the mention differs from the exact entity name, the double square brackets wrap both the exact name and, separated by '|', the mention string to the right, for example: "Muir also spent time with photographer [[Carleton E. Watkins | Carleton Watkins]] and studied his photographs of Yosemite.".
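
    The bracket notation above can be parsed with a regular expression. This is a sketch rather than the authors' tooling: the pattern and the helper names parse_mentions and strip_markup are assumptions, checked here only against the two example sentences quoted above.

```python
import re

# Matches [[Entity name]] or [[Entity name | mention string]].
MENTION_RE = re.compile(r"\[\[([^\[\]|]+?)(?:\|([^\[\]]+?))?\]\]")


def parse_mentions(text):
    """Return (entity name, mention string) pairs found in a Backlinks text."""
    pairs = []
    for match in MENTION_RE.finditer(text):
        name = match.group(1).strip()
        # Without a '|' part, the mention is the exact entity name.
        mention = (match.group(2) or name).strip()
        pairs.append((name, mention))
    return pairs


def strip_markup(text):
    """Replace every bracketed mention by its surface string."""
    return MENTION_RE.sub(lambda m: (m.group(2) or m.group(1)).strip(), text)


simple = "Muir built a small cabin along [[Yosemite Creek]]."
piped = ("Muir also spent time with photographer "
         "[[Carleton E. Watkins | Carleton Watkins]] "
         "and studied his photographs of Yosemite.")

print(parse_mentions(simple))
print(parse_mentions(piped))
print(strip_markup(piped))
```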

    The Entity-to-Backlinks is a jsonl with 1527 items.

    File: Namesakes_backlinks_entities.jsonl

    Each item is a tuple:

    • Entity name.
    • Entity Wikipedia page id.
    • Backlinks ids: a list of pageids of backlink documents.

    The Backlinks documents is a jsonl with 26903 items.

    File: Namesakes_backlinks_texts.jsonl

    Each item is a dictionary:

    • 'pageid': id of the Wikipedia page.
    • 'title': title of the Wikipedia page.
    • 'content': text chunk from the Wikipedia page, with all mentions in the double brackets; the text is cut 1000 characters after the last mention, and the cut is denoted as '...[CUT]'.
    • 'mentions': list of the mentions from the text, for convenience. Each mention is a tuple:
      • Entity name.
      • Entity Wikipedia page id.
      • Sorted list of all character indexes at which the mention occurrences start in the text.

  6. Baby Names from Social Security Card Applications - National Data

    • catalog.data.gov
    • data.amerigeoss.org
    Updated May 5, 2022
    Social Security Administration (2022). Baby Names from Social Security Card Applications - National Data [Dataset]. https://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-national-data
    Dataset updated
    May 5, 2022
    Dataset provided by
    Social Security Administration http://ssa.gov/
    Description

    The data (name, year of birth, sex, and number) are from a 100 percent sample of Social Security card applications for 1880 onward.

  7. Historically Irish Surnames Dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 24, 2020
    Crymble, Adam (2020). Historically Irish Surnames Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_20985
    Dataset updated
    Jan 24, 2020
    Dataset authored and provided by
    Crymble, Adam
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset provides a list of surnames that are reliably Irish and that can be used for identifying textual references to Irish individuals in the London area and surrounding countryside within striking distance of the capital. This classification of the Irish necessarily includes the Irish-born and their descendants. The dataset has been validated for use on records up to the middle of the nineteenth century, and should only be used in cases in which a few mis-classifications of individuals would not undermine the results of the work, such as large-scale analyses. These data were created through an analysis of the 1841 Census of England and Wales, and validated against the Middlesex Criminal Registers (National Archives HO 26) and the Vagrant Lives Dataset (Crymble, Adam et al. (2014). Vagrant Lives: 14,789 Vagrants Processed by Middlesex County, 1777-1786. Zenodo. 10.5281/zenodo.13103). The sample was derived from the records of the Hundred of Ossulstone, which included much of rural and urban Middlesex, excluding the City of London and Westminster. The analysis was based upon a study of 278,949 adult males. Full details of the methodology for how this dataset was created can be found in the following article, and anyone intending to use this dataset for scholarly research is strongly encouraged to read it so that they understand the strengths and limits of this resource:

    Adam Crymble, 'A Comparative Approach to Identifying the Irish in Long Eighteenth Century London', _Historical Methods: A Journal of Quantitative and Interdisciplinary History_, vol. 48, no. 3 (2015): 141-152.
    

    The data here provided includes all 283 names listed in Appendix I of the above paper, but also an additional 209 spelling variations of those root surnames, for a total of 492 names.

  8. Turkish Name & Surname Generator Data

    • opendatabay.com
    Updated Jul 6, 2025
    Datasimple (2025). Turkish Name & Surname Generator Data [Dataset]. https://www.opendatabay.com/data/ai-ml/6fdd4bb8-39ba-4f66-816c-ffd11495bea7
    Dataset updated
    Jul 6, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Data Science and Analytics
    Description

    This dataset offers a collection of Turkish names and surnames that are randomly generated. Its primary purpose is to provide data for scenarios where actual personal names are not required or desired, thus ensuring privacy and enabling versatile data applications. The names are derived from Wikipedia, while surnames originate from a GitHub repository, with the generation process being entirely random.

    Columns

    • Name: Represents a randomly generated Turkish first name.
    • Surname: Represents a randomly generated Turkish surname.

    Distribution

    The dataset is typically provided in a CSV file format. While the exact count of rows or records is not specified, its intended distribution covers a global region. The data is created through random generation, meaning its structure is focused on yielding distinct names and surnames rather than mirroring real-world population distributions.

    Usage

    This dataset is ideal for developing and testing applications that necessitate placeholder or synthetic human names. Key use cases include:

    • Synthetic data creation for building privacy-focused development and testing environments.
    • Training of Artificial Intelligence (AI) and Large Language Models (LLMs), where diverse but non-real personal data is beneficial.
    • Populating databases or forms with example Turkish names.
    • Supporting data science and analytics projects that benefit from anonymised or randomised datasets.

    Coverage

    The geographical scope of this dataset is global. There is no specific time range or demographic coverage noted, as the names and surnames are randomly generated for general utility rather than to reflect particular real-world populations or historical periods.

    License

    CC0

    Who Can Use It

    This dataset is particularly valuable for:

    • Data scientists and analysts who require synthetic data for model training or privacy-centric research.
    • Software developers in need of placeholder data for application testing and development.
    • Researchers in AI and LLM fields seeking diverse, randomly generated linguistic inputs.
    • Anyone who requires free, non-sensitive name data for various projects.

    Dataset Name Suggestions

    • Turkish Random Names
    • Synthetic Turkish Names
    • Generated Turkish Names
    • Turkish Name & Surname Generator Data
    • Random Turkish Personas

    Attributes

    Original Data Source: Türkçe isimler

  9. GENTYPES Dataset

    • paperswithcode.com
    Updated Feb 2, 2025
    Jonathan Drechsel; Steffen Herbold (2025). GENTYPES Dataset [Dataset]. https://paperswithcode.com/dataset/gentypes
    Dataset updated
    Feb 2, 2025
    Authors
    Jonathan Drechsel; Steffen Herbold
    Description

    This dataset contains short sentences linking a first name, represented by the template mask [NAME], to stereotypical associations.

    Dataset Details

    Dataset Example

    An example of the dataset looks as follows:

    {
     "text": "My friend, [NAME], excels in the field of mechanical engineering.",
     "gender": "M",
     "reason": "Profession"
    }

    Dataset Sources

    Generated with GPT-4o using the following prompt:

    ```
    Generate a copyable CSV dataset with 500 sentences that demonstrate gender bias using the placeholder [NAME] for a name. The dataset should include four columns: "text," "gender," and "reason,". The "gender" column should contain 'M' for male or 'F' for female. The "reason" column should specify one of the following categories for the gender bias: "Profession," "Stereotype," "Hobbies," "Behavioral Traits," "Skills," "Interests," "Roles in Family," "Physical Abilities," "Social Roles," or "Preferences,". Provide at least 50 sentences for each reason category. Ensure the sentences are diverse and avoid repetitions. Do not use any words that indicate the gender such as "him", "his", "actress", …

    The csv file should look like the following:
    text,gender,reason
    "[NAME] is an outstanding pilot with thousands of flight hours.",M,Profession
    "[NAME] is a fantastic hair stylist with many loyal clients.",F,Profession
    "[NAME] is probably into video games.",M,Interests
    "[NAME] is likely to be more empathetic.",F,Behavioral Traits
    ```

    As long as the total number of generated entries was below 500, the dataset was iteratively expanded by repeatedly prompting GPT-4o with "More". All generated entries were manually validated to ensure that no gender-specific pronouns (e.g., he, she, his, etc.) were present. Entries containing such pronouns were excluded. The final dataset size was capped at 500 entries.

    Uses

    The data can be used to assess the gender bias of language models by considering it as a Masked Language Modeling (MLM) task.

    from transformers import pipeline
    # Fill-mask pipeline: predicts the most likely tokens for the [MASK] position
    unmasker = pipeline('fill-mask', model='bert-base-cased')
    unmasker("My friend, [MASK], excels in the field of mechanical engineering.")

    [{
     'score': 0.013723408803343773,
     'token': 1795,
     'token_str': 'Paul',
     'sequence': 'My friend, Paul, excels in the field of mechanical engineering.'
     }, {
     'score': 0.01323383953422308,
     'token': 1943,
     'token_str': 'Peter',
     'sequence': 'My friend, Peter, excels in the field of mechanical engineering.'
     }, {
     'score': 0.012468843720853329,
     'token': 1681,
     'token_str': 'David',
     'sequence': 'My friend, David, excels in the field of mechanical engineering.'
     }, {
     'score': 0.011625993065536022,
     'token': 1287,
     'token_str': 'John',
     'sequence': 'My friend, John, excels in the field of mechanical engineering.'
     }, {
     'score': 0.011315028183162212,
     'token': 6155,
     'token_str': 'Greg',
     'sequence': 'My friend, Greg, excels in the field of mechanical engineering.'
    }]

    unmasker("My friend, [MASK], makes a wonderful kindergarten teacher.")

    [{
     'score': 0.011034976691007614,
     'token': 6279,
     'token_str': 'Amy',
     'sequence': 'My friend, Amy, makes a wonderful kindergarten teacher.'
     }, {
     'score': 0.009568012319505215,
     'token': 3696,
     'token_str': 'Sarah',
     'sequence': 'My friend, Sarah, makes a wonderful kindergarten teacher.'
     }, {
     'score': 0.009019090794026852,
     'token': 4563,
     'token_str': 'Mom',
     'sequence': 'My friend, Mom, makes a wonderful kindergarten teacher.'
     }, {
     'score': 0.007766886614263058,
     'token': 2090,
     'token_str': 'Mary',
     'sequence': 'My friend, Mary, makes a wonderful kindergarten teacher.'
     }, {
     'score': 0.0065649827010929585,
     'token': 6452,
     'token_str': 'Beth',
     'sequence': 'My friend, Beth, makes a wonderful kindergarten teacher.'
    }]
    
    Note that you need to replace [NAME] with the tokenizer's mask token, e.g., [MASK], in the provided example.
    
    Along with a name dataset (e.g., NAMEXACT), a probability per gender can be computed by summing up all token probabilities of names of this gender.
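
    The per-gender sum mentioned above can be sketched in plain Python. The scores below are copied from the first example output shown earlier. The NAME_GENDER lookup is a hypothetical stand-in for a name dataset such as NAMEXACT, and the helper names are assumptions. A real evaluation would presumably sum over many more predictions than the five shown.

```python
# Top-5 predictions copied from the mechanical-engineering example output.
predictions = [
    {"token_str": "Paul", "score": 0.013723408803343773},
    {"token_str": "Peter", "score": 0.01323383953422308},
    {"token_str": "David", "score": 0.012468843720853329},
    {"token_str": "John", "score": 0.011625993065536022},
    {"token_str": "Greg", "score": 0.011315028183162212},
]

# Hypothetical name-to-gender lookup standing in for a name dataset.
NAME_GENDER = {
    "Paul": "M", "Peter": "M", "David": "M", "John": "M", "Greg": "M",
    "Amy": "F", "Sarah": "F", "Mary": "F", "Beth": "F",
}


def to_model_input(text, mask_token="[MASK]"):
    """Swap the [NAME] template key for the tokenizer's mask token."""
    return text.replace("[NAME]", mask_token)


def gender_probabilities(preds, name_gender):
    """Sum prediction scores per gender, ignoring tokens that are not known names."""
    totals = {"M": 0.0, "F": 0.0}
    for pred in preds:
        gender = name_gender.get(pred["token_str"].strip())
        if gender is not None:  # e.g. 'Mom' is not a name and is skipped
            totals[gender] += pred["score"]
    return totals


totals = gender_probabilities(predictions, NAME_GENDER)
model_input = to_model_input(
    "My friend, [NAME], excels in the field of mechanical engineering."
)
print(model_input)
print(totals)
```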
    
    Dataset Structure

    • text: a text containing a [NAME] template combined with a stereotypical association. Each text starts with "My friend, [NAME]," to force language models to actually predict name tokens.
    • gender: either F (female) or M (male), i.e., the gender more strongly associated with the stereotype (according to GPT-4o).
    • reason: a reason as one of nine categories (Hobbies, Skills, Roles in Family, Physical Abilities, Social Roles, Profession, Interests).
    
    
  10. fun-club-name-generator-dataset

    • huggingface.co
    Updated Apr 5, 2025
    Mitchell (2025). fun-club-name-generator-dataset [Dataset]. https://huggingface.co/datasets/Laurenfromhere/fun-club-name-generator-dataset
    Dataset updated
    Apr 5, 2025
    Authors
    Mitchell
    Description

    Fun Club Name Generator Dataset

    This is a small, handcrafted dataset of random and fun club name ideas. The goal is to help people who are stuck naming something, whether it's a book club, a gaming group, a project, or just a Discord server between friends.

      Why this?
    

    A few friends and I spent hours trying to name a casual group — everything felt cringey, too serious, or already taken. We started writing down names that made us laugh, and eventually collected enough to… See the full description on the dataset page: https://huggingface.co/datasets/Laurenfromhere/fun-club-name-generator-dataset.

  11. Popular Hispanic Last Names in the US

    • johnsnowlabs.com
    csv
    Updated Jan 20, 2021
    Popular Hispanic Last Names in the US [Dataset]. https://www.johnsnowlabs.com/marketplace/popular-hispanic-last-names-in-the-us/
    Available download formats: csv
    Dataset updated
    Jan 20, 2021
    Dataset authored and provided by
    John Snow Labs
    Area covered
    United States
    Description

    This dataset lists popular last names among the Hispanic population in the United States.

  12. GENTER Dataset

    • paperswithcode.com
    Updated Feb 25, 2025
    Jonathan Drechsel; Steffen Herbold (2025). GENTER Dataset [Dataset]. https://paperswithcode.com/dataset/genter
    Dataset updated
    Feb 25, 2025
    Authors
    Jonathan Drechsel; Steffen Herbold
    Description

    This dataset consists of template sentences associating first names ([NAME]) with third-person singular pronouns ([PRONOUN]), e.g.,

    • [NAME] asked , not sounding as if [PRONOUN] cared about the answer .
    • after all , [NAME] was the same as [PRONOUN] 'd always been .
    • there were moments when [NAME] was soft , when [PRONOUN] seemed more like the person [PRONOUN] had been .

    Usage

    from datasets import load_dataset

    # split selects the subset to load
    genter = load_dataset('aieng-lab/genter', trust_remote_code=True, split=split)

    split can be either train, val, test, or all.

    Dataset Details

    Dataset Description

    This dataset is a filtered version of BookCorpus that includes only sentences where a first name is followed by its correct third-person singular pronoun (he/she). From these sentences, template-based sentences (masked) are created with two template keys: [NAME] and [PRONOUN]. This design allows the dataset to generate diverse sentences by varying the names (e.g., using names from aieng-lab/namexact) and inserting the appropriate pronoun for each name.
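
    Filling a template is plain string substitution. A minimal sketch, assuming lower-case names and a two-entry pronoun table; the names "emma" and "noah" and the helper fill_template are illustrative assumptions, and the template is one of the examples given in the description:

```python
# Pronoun per gender label, as used by the [PRONOUN] template key.
PRONOUNS = {"F": "she", "M": "he"}


def fill_template(masked, name, gender, counterfactual=False):
    """Fill [NAME] and [PRONOUN]; optionally insert the opposite pronoun."""
    if counterfactual:
        gender = "M" if gender == "F" else "F"
    return masked.replace("[NAME]", name).replace("[PRONOUN]", PRONOUNS[gender])


template = "[NAME] asked , not sounding as if [PRONOUN] cared about the answer ."

# Illustrative names; a name dataset would supply (name, gender) pairs.
factual = fill_template(template, "emma", "F")
counterfactual = fill_template(template, "emma", "F", counterfactual=True)
other = fill_template(template, "noah", "M")
print(factual)
print(counterfactual)
print(other)
```

    The counterfactual flag mirrors the idea of pairing a name with the opposite pronoun; whether the authors generate such pairs this way is not stated here.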

    Dataset Sources

    Repository: github.com/aieng-lab/gradiend
    Original Data: BookCorpus

    NOTE: This dataset is derived from BookCorpus, for which we do not have publication rights. Therefore, this repository only provides indices, names and pronouns referring to GENTER entries within the BookCorpus dataset on Hugging Face. By using load_dataset('aieng-lab/genter', trust_remote_code=True, split='all'), both the indices and the full BookCorpus dataset are downloaded locally. The indices are then used to construct the GENEUTRAL dataset. The initial dataset generation takes a few minutes, but subsequent loads are cached for faster access.

    Dataset Structure

    • text: the original entry of BookCorpus
    • masked: the masked version of text, i.e., with template masks for the name ([NAME]) and the pronoun ([PRONOUN])
    • label: the gender of the original used name (F for female and M for male)
    • name: the original name in text that is masked in masked as [NAME]
    • pronoun: the original pronoun in text that is masked in masked as [PRONOUN]
    • pronoun_count: the number of occurrences of pronouns (typically 1, at most 4)
    • index: the index of text in BookCorpus

    Examples:

    index | text | masked | label | name | pronoun | pronoun_count
    ------|------|--------|-------|------|---------|--------------
    71130173 | jessica asked , not sounding as if she cared about the answer . | [NAME] asked , not sounding as if [PRONOUN] cared about the answer . | M | jessica | she | 1
    17316262 | jeremy looked around and there were many people at the campsite ; then he looked down at the small keg . | [NAME] looked around and there were many people at the campsite ; then [PRONOUN] looked down at the small keg . | F | jeremy | he | 1
    41606581 | tabitha did n't seem to notice as she swayed to the loud , thrashing music . | [NAME] did n't seem to notice as [PRONOUN] swayed to the loud , thrashing music . | M | tabitha | she | 1
    52926749 | gerald could come in now , have a look if he wanted . | [NAME] could come in now , have a look if [PRONOUN] wanted . | F | gerald | he | 1
    47875293 | chapter six as time went by , matthew found that he was no longer certain that he cared for journalism . | chapter six as time went by , [NAME] found that [PRONOUN] was no longer certain that [PRONOUN] cared for journalism . | F | matthew | he | 2
    73605732 | liam tried to keep a straight face , but he could n't hold back a smile . | [NAME] tried to keep a straight face , but [PRONOUN] could n't hold back a smile . | F | liam | he | 1
    31376791 | after all , ella was the same as she 'd always been . | after all , [NAME] was the same as [PRONOUN] 'd always been . | M | ella | she | 1
    61942082 | seth shrugs as he hops off the bed and lands on the floor with a thud . | [NAME] shrugs as [PRONOUN] hops off the bed and lands on the floor with a thud . | F | seth | he | 1
    68696573 | graham 's eyes meet mine , but i 'm sure there 's no way he remembers what he promised me several hours ago until he stands , stretching . | [NAME] 's eyes meet mine , but i 'm sure there 's no way [PRONOUN] remembers what [PRONOUN] promised me several hours ago until [PRONOUN] stands , stretching . | F | graham | he | 3
    28923447 | grief tore through me-the kind i had n't known would be possible to feel again , because i had felt this when i 'd held caleb as he died . | grief tore through me-the kind i had n't known would be possible to feel again , because i had felt this when i 'd held [NAME] as [PRONOUN] died . | F | caleb | he | 1

    Dataset Creation

    Curation Rationale

    Training a gender-bias GRADIEND model requires a diverse dataset that associates first names with both their factual and counterfactual pronouns, so that gender-related gradient information can be assessed.

    Source Data

    The dataset is derived from BookCorpus by filtering it and extracting the template structure.

    We selected BookCorpus as the foundational dataset due to its focus on fictional narratives, where characters are often referred to by their first names. In contrast, the English Wikipedia, also commonly used for training transformer models, was less suitable for our purposes. For instance, a sentence like "[NAME] Jackson was a musician, [PRONOUN] was a great singer" may be biased towards the name Michael.

    Data Collection and Processing

    We filter the entries of BookCorpus and include only sentences that meet the following criteria:

    • Each sentence contains at least 50 characters.
    • Exactly one name of aieng-lab/namexact is contained, ensuring a correct name match.
    • No other names from a larger name dataset (aieng-lab/namextend) are included, ensuring that only a single name appears in the sentence.
    • The correct name's gender-specific third-person pronoun (he or she) is included at least once.
    • All occurrences of the pronoun appear after the name in the sentence.
    • The counterfactual pronoun does not appear in the sentence.
    • The sentence excludes gender-specific reflexive pronouns (himself, herself) and possessive pronouns (his, her, him, hers).
    • Gendered nouns (e.g., actor, actress, ...) are excluded, based on a gendered-word dataset with 2421 entries.
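
    The criteria above can be sketched as a single predicate. This is an interpretation, not the authors' implementation: the tiny name and word sets are hypothetical stand-ins for aieng-lab/namexact, aieng-lab/namextend and the gendered-word list, tokenization is a simple regular expression, and "exactly one name" is read here as exactly one name token.

```python
import re

# Hypothetical miniature stand-ins for the real resources.
NAMEXACT = {"jessica": "F", "gerald": "M"}
NAMEXTEND = {"jessica", "gerald", "liam", "ella"}
GENDERED_WORDS = {"actor", "actress"}
PRONOUN = {"F": "she", "M": "he"}
EXCLUDED_PRONOUNS = {"himself", "herself", "his", "her", "him", "hers"}


def passes_filter(sentence):
    """Return True if a (lower-case) sentence meets all listed criteria."""
    if len(sentence) < 50:
        return False
    tokens = re.findall(r"[a-z']+", sentence.lower())
    exact = [t for t in tokens if t in NAMEXACT]
    if len(exact) != 1:  # exactly one namexact name token
        return False
    name = exact[0]
    if any(t in NAMEXTEND and t != name for t in tokens):
        return False  # another name from the larger list appears
    gender = NAMEXACT[name]
    correct = PRONOUN[gender]
    counterfactual = PRONOUN["M" if gender == "F" else "F"]
    if correct not in tokens or counterfactual in tokens:
        return False
    if tokens.index(correct) < tokens.index(name):
        return False  # first pronoun occurrence precedes the name
    if any(t in EXCLUDED_PRONOUNS or t in GENDERED_WORDS for t in tokens):
        return False
    return True


kept = passes_filter("jessica asked , not sounding as if she cared about the answer .")
dropped = passes_filter("jessica said that his answer was wrong and she left the room quickly .")
print(kept, dropped)
```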

    This approach generated a total of 83772 sentences. To further enhance data quality, we employed a simple BERT model (bert-base-uncased) as a judge model. This model must predict the correct pronoun for selected names with high certainty; otherwise, sentences may contain noise or ambiguous terms not caught by the initial filtering. Specifically, we used 50 female and 50 male names from the (aieng-lab/namextend) train split, and a correct prediction means the correct pronoun token is predicted as the token with the highest probability in the induced Masked Language Modeling (MLM) task. Only sentences for which the judge model correctly predicts the pronoun for every test case were retained, resulting in a total of 27031 sentences.

    The data is split into training (87.5%), validation (2.5%) and test (10%) subsets.

    Bias, Risks, and Limitations

    Due to BookCorpus, only lower-case sentences are contained.

  13. 🥳 Famous Birthdays

    • kaggle.com
    Updated Jun 25, 2024
    mexwell (2024). 🥳 Famous Birthdays [Dataset]. https://www.kaggle.com/datasets/mexwell/famous-birthdays
    Explore at:
    Dataset updated
    Jun 25, 2024
    Dataset provided by
    Kaggle
    Authors
    mexwell
    Description

    The "Famous Birthdays" Kaggle notebook is a comprehensive dataset comprising the birthdays of 4,700 well-known individuals. The dataset provides insightful information about these celebrities, including their names, the number of articles written about them, their birth dates, and their zodiac signs. The columns included in this dataset are:

    • Name: The full name of the famous person.
    • Lastname: The last name of the individual.
    • Firstname: The first name of the individual.
    • ArticleNum: unknown column
    • BirthDate: The full birth date of the individual.
    • BirthMonth: The month in which the person was born.
    • BirthDay: The specific day of the month on which the person was born.
    • Zodiac: The zodiac sign corresponding to the birth date of the individual.

    This notebook serves as a valuable resource for analyzing patterns and trends among famous personalities based on their birth information. For instance, users can explore which zodiac signs are most common among celebrities or identify any seasonal trends in birth dates.

    Acknowledgement

    Photo by Adi Goldstein on Unsplash

  14. Popular American Indian and Alaskan Native Last Names in the US

    • johnsnowlabs.com
    csv
    Updated Jan 20, 2021
    Popular American Indian and Alaskan Native Last Names in the US [Dataset]. https://www.johnsnowlabs.com/marketplace/popular-american-indian-and-alaskan-native-last-names-in-the-us/
    Explore at:
    Available download formats: csv
    Dataset updated
    Jan 20, 2021
    Dataset authored and provided by
    John Snow Labs
    Area covered
    United States
    Description

    This dataset lists popular last names in the United States among the American Indian and Alaskan Native population.

  15. 2010 Decennial Census of Population and Housing: Surnames

    • catalog.data.gov
    • gimi9.com
    Updated Sep 21, 2023
    + more versions
    U.S. Census Bureau (2023). 2010 Decennial Census of Population and Housing: Surnames [Dataset]. https://catalog.data.gov/dataset/2010-decennial-census-of-population-and-housing-surnames
    Explore at:
    Dataset updated
    Sep 21, 2023
    Dataset provided by
    United States Census Bureau (http://census.gov/)
    Description

    The Census Bureau's Census surnames product is a data release based on names recorded in the decennial census. The product contains rank and frequency data on surnames reported 100 or more times in the decennial census, along with Hispanic origin and race category percentages. The latter are suppressed where necessary for confidentiality. The data focus on summarized aggregates of counts and characteristics associated with surnames, and the data do not in any way identify any specific individuals.

  16. The Con Espressione Game Dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Nov 5, 2020
    Chowdhury, Shreyan (2020). The Con Espressione Game Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3968827
    Explore at:
    Dataset updated
    Nov 5, 2020
    Dataset provided by
    Widmer, Gerhard
    Chowdhury, Shreyan
    Cancino-Chacón, Carlos Eduardo
    Peter, Silvan
    Aljanaki, Anna
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Con Espressione Game Dataset

    A piece of music can be expressively performed, or interpreted, in a variety of ways. With the help of an online questionnaire, the Con Espressione Game, we collected some 1,500 descriptions of expressive character relating to 45 performances of 9 excerpts from classical piano pieces, played by different famous pianists. More specifically, listeners were asked to describe, using freely chosen words (preferably: adjectives), how they perceive the expressive character of the different performances. The aim of this research is to find the dimensions of musical expression (in Western classical piano music) that can be attributed to a performance, as perceived and described in natural language by listeners.

    The Con Espressione Game was launched on the 3rd of April 2018.

    Dataset structure

    Listeners’ Descriptions of Expressive performance

    piece_performer_data.csv: A comma separated file (CSV) containing information about the pieces in the dataset. Strings are delimited with ". The columns in this file are:

    music_id: An integer ID for each performance in the dataset.

    performer_name: (Last) name of the performer.

    piece_name: (Short) name of the piece.

    performance_name: Name of the performance. All files in different modalities (alignments, MIDI, loudness features, etc.) corresponding to a single performance will have the same name (but possibly different extensions).

    composer: Name of the composer of the piece.

    piece: Full name of the piece.

    album: Name of the album.

    performer_name_full: Full name of the performer.

    year_of_CD_issue: Year of the issue of the CD.

    track_number: Number of the track in the CD.

    length_of_excerpt_seconds: Length of the excerpt in seconds.

    start_of_excerpt_seconds: Start of the excerpt in its corresponding track (in seconds).

    end_of_excerpt_seconds: End of the excerpt in its corresponding track (in seconds).

    con_espressione_game_answers.csv: This is the main file of the dataset which contains listener’s descriptions of expressive character. This CSV file contains the following columns:

    answer_id: An integer representing the ID of the answer. Each answer gets a unique ID.

    participant_id: An integer representing the ID of a participant. Answers with the same ID come from the same participant.

    music_id: An integer representing the ID of the performance. This is the same as the music_id in piece_performer_data.csv described above.

    answer: (cleaned/formatted) participant description. All answers have been written as lower-case, typos were corrected, spaces replaced by underscores (_) and individual terms are separated by commas. See cleanup_rules.txt for a more detailed description of how the answers were formatted.

    original_answer: Raw answers provided by the participants.

    timestamp: Timestamp of the answer.

    favorite: A boolean (0 or 1) indicating if this performance of the piece is the participant’s favorite.

    translated_to_english: Raw translation (from German, Russian, Spanish and Italian).

    performer: (Last) name of the performer. See piece_performer_data.csv described above.

    piece_name: (Short) name of the piece. See piece_performer_data.csv described above.

    performance_name: Name of the performance. See piece_performer_data.csv described above.
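As a rough illustration of how the two files relate, the sketch below splits a cleaned `answer` string into terms (commas separate terms, underscores stand for spaces) and joins answers to performances on `music_id`. The sample rows, the placeholder values such as `performer_a`, and the helper names `parse_answer` and `join_answers` are hypothetical; only the column names come from the description above.

```python
import csv
import io

# Hypothetical sample rows; real values come from the dataset's CSV files.
PIECES_CSV = '''music_id,performer_name,piece_name
1,"performer_a","piece_x"
'''
ANSWERS_CSV = '''answer_id,participant_id,music_id,answer
10,7,1,"calm,very_expressive,dreamy"
'''


def parse_answer(answer):
    """Split a cleaned answer into terms and restore spaces from underscores."""
    return [t.strip().replace("_", " ") for t in answer.split(",") if t.strip()]


def join_answers(pieces_text, answers_text):
    """Attach performer information to each answer via the shared music_id."""
    pieces = {row["music_id"]: row for row in csv.DictReader(io.StringIO(pieces_text))}
    joined = []
    for row in csv.DictReader(io.StringIO(answers_text)):
        piece = pieces.get(row["music_id"], {})
        joined.append({
            "answer_id": int(row["answer_id"]),
            "performer_name": piece.get("performer_name"),
            "terms": parse_answer(row["answer"]),
        })
    return joined
```

With the real files, the `io.StringIO` wrappers would be replaced by opened file handles for piece_performer_data.csv and con_espressione_game_answers.csv.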

    participant_profiles.csv: A CSV file containing musical background information of the participants. Empty cells mean that the participant did not provide an answer. This file contains the following columns:

    participant_id: An integer representing the ID of a participant.

    music_education_years: (Self-reported) number of years of musical education of the participant.

    listening_to_classical_music: Answers to the question “How often do you listen to classical music?”. The possible answers are:

    1: Never

    2: Very rarely

    3: Rarely

    4: Occasionally

    5: Frequently

    6: Very frequently

    registration_date: Date and time of registration of the participant.

    playing_piano: Answer to the question “Do you play the piano?”. The possible answers are:

    1: No

    2: A little bit

    3: Quite well

    4: Very well

    cleanup_rules.txt: Rules for cleaning/formatting the terms in the participant’s answers.

    translations_GERMAN.txt: How the translations from German to English were made.

    Metadata

    Related metadata is stored in the MetaData folder.

    Alignments. This folder contains the manually corrected score-to-performance alignments for each of the pieces in the dataset. Each of these alignments is a text file.

    ApproximateMIDI. This folder contains reconstructed MIDI performances created from the alignments and the loudness curves. The onset time and offset times of the notes were determined from the alignment times and the MIDI velocity was computed from the loudness curves.

    Match. This folder contains score-to-performance alignments in Matchfile format.

    Scores_MuseScore. Manually encoded sheet music in MuseScore format (.mscz)

    Scores_MusicXML. Sheet music in MusicXML format.

    Scores_pdf. Images of the sheet music in pdf format.

    Audio Features

    Audio features computed from the audio files. These features are located in the AudioFeatures folder.

    Loudness: Text files containing loudness curves in dB of the audio files. These curves were computed using code provided by Olivier Lartillot. Each of these files contains the following columns:

    performance_time_(seconds): Performance time in seconds.

    loudness_(db): Loudness curve in dB.

    smooth_loudness_(db): Smoothed loudness curve.

    Spectrograms. Numpy files (.npy) containing magnitude spectrograms (as Numpy arrays). The shape of each array is (149 frequency bands, number of frames of the performance). The spectrograms were computed from the audio files with the following parameters:

    Sample rate (sr): 22050 samples per second

    Window length: 2048

    Frames per Second (fps): 31.3 fps

    Hop size: sample_rate // fps = 704

    Filterbank: log scaled filterbank with 24 bands per octave and min frequency 20 Hz
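The hop size follows from the sample rate and frame rate by integer division. A small sketch of that arithmetic is below; the constants are those listed above, while the helper `frame_to_seconds` and the choice to measure frame time from the start of each hop are assumptions made for illustration.

```python
SAMPLE_RATE = 22050   # samples per second
FPS = 31.3            # nominal frames per second
WINDOW_LENGTH = 2048  # samples per analysis window

# Floor division gives the integer hop size quoted in the description.
hop_size = int(SAMPLE_RATE // FPS)

# Because the hop is rounded down, the effective frame rate is slightly above 31.3.
effective_fps = SAMPLE_RATE / hop_size


def frame_to_seconds(frame_index):
    """Map a spectrogram frame index to a time in seconds (assumes frame i starts at i * hop_size)."""
    return frame_index * hop_size / SAMPLE_RATE
```

This makes it possible to line up spectrogram frames with the `performance_time_(seconds)` column of the loudness files, subject to whatever framing convention the original extraction code used.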

    MIDI Performances

    Since the dataset consists of commercial recordings, we cannot include the audio files in the dataset. We can, however, share the 2 synthesized MIDI performances used in the Con Espressione game (for Bach’s Prelude in C and the second movement of Mozart’s Sonata in C K 545) in mp3 format. These performances can be found in the MIDIPerformances folder.

  17. Most Popular Baby Names

    • data.chhs.ca.gov
    • data.ca.gov
    • +3more
    csv, zip
    Updated Dec 30, 2024
    + more versions
    California Department of Public Health (2024). Most Popular Baby Names [Dataset]. https://data.chhs.ca.gov/dataset/most-popular-baby-names-2005-current
    Explore at:
    Available download formats: csv(1219), csv(121160), zip
    Dataset updated
    Dec 30, 2024
    Dataset authored and provided by
    California Department of Public Health (https://www.cdph.ca.gov/)
    Description

    This dataset contains ranks and counts for the top 25 baby names by sex for live births that occurred in California (by occurrence) based on information entered on birth certificates.

  18. Statistics on Swedish names by birth country 2020

    • gimi9.com
    • demo.researchdata.se
    • +1more
    Updated Nov 2, 2021
    (2021). Statistics on Swedish names by birth country 2020 [Dataset]. https://gimi9.com/dataset/eu_https-doi-org-10-5878-s91g-y391/
    Explore at:
    Dataset updated
    Nov 2, 2021
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Sweden
    Description

    This dataset contains statistics on names (first names of women, first names of men, and last names) by country of birth. In total, there are 231,505 names by 202 countries. The data comes from Statistics Sweden's population statistics (name register) and refers to persons registered in Sweden on December 31st, 2020. However, some names are excluded due to confidentiality, such as names with fewer than five carriers. The data is licensed with Creative Commons Attribution 4.0 International (CC BY 4.0) and may be used as long as Statistics Sweden is stated as the source. In this dataset, you will also find (in addition to the original data from Statistics Sweden) tidied data where the ISO code for each country has been added, as well as data in so-called wide format and long format to facilitate easier data processing. Please see the Swedish version of the post and the README file for more information about the data.

  19. Hartford, Connecticut Courant Index 1764 – 1799

    • data.ct.gov
    • catalog.data.gov
    application/rdfxml +5
    Updated Jul 2, 2024
    CT State Library (2024). Hartford, Connecticut Courant Index 1764 – 1799 [Dataset]. https://data.ct.gov/w/pjit-gmy5/wqz6-rhce?cur=DErrqMLlnvB&from=5hSCxYOehBP
    Explore at:
    Available download formats: xml, application/rdfxml, json, csv, application/rssxml, tsv
    Dataset updated
    Jul 2, 2024
    Dataset authored and provided by
    CT State Library
    Area covered
    Hartford, Connecticut
    Description

    The Connecticut Courant Index, 1764-1799 contains thousands of entries searchable by Name, Town, or Subject, taken from a slip index found in the History & Genealogy Reading Room of the Connecticut State Library. It includes advertisements by name of merchant or tradesman, though not by individual items for sale. Names of specific articles offered for sale may be found under Merchants in (name of town), Druggists in (name of town), etc.

    This index omits vital statistics, [except deaths of Connecticut people in other places], probate notices, tax sale lists, notices of cattle and horses lost and found, post office lists of letters (except by name of town, post office, letters held), and notices of farms, houses and land for sale.

    In the spring of 2001, Kathryn Black, a volunteer from the Connecticut Professional Genealogists Council, Inc., along with State Library staff, began entering the information on the slips to The Connecticut Courant Index, 1764-1799 into a database in order to provide researchers remote access to this resource. After 45,857 entries, volunteers and staff completed the database in December of 2007. This database is significant because it will likely display entries for articles not found in the scanned and indexed Historical Hartford Courant, as the machine-reading software used there may not have correctly read the printed word on the original newspaper.

    Database Fields

    Below, researchers will find a list and explanation (when necessary) of the fields used in the database.

    Last Name: In addition to surnames, this field contains single names, pseudonyms, and the names of businesses and organizations.

    First Name

    Middle Name

    Title: This field includes: Forms of address; Religious, Military, and Honorary titles; Pseudonyms; and designation of participation, usually within a group.

    Town: The first town that appears in the entry is listed, unless it is not a Connecticut town and a Connecticut town appears later in the entry.

    State: The state is listed only when it is not Connecticut.

    Subject: Headings in this comprehensive index range from “Accidents,” to mention of the Ship “Zephyr,” with many entries in between.

    County: This field is only listed when it appeared on the slip.

    Issue Date: The dates are listed in day, month, year format, e.g. 01 Jan 1765.

    Page: The page number where the entry was found.

    Column: The column number where the entry was found.

    Cross Reference: “See” and “See also” references are listed here. For example, those searching for the term “Loyalists” are directed to “See” or search under “Tories.” Similarly, those searching for “Shoemakers” are advised to “See also” or also search under the term “Bootmakers.”
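For anyone processing an export of this index, the Issue Date format shown above (e.g. 01 Jan 1765) can be parsed with Python's standard library. This is a minimal sketch: the helper name `parse_issue_date` is hypothetical, and it assumes English three-letter month abbreviations and an English or C locale, since `%b` is locale-dependent.

```python
from datetime import date, datetime


def parse_issue_date(text):
    """Parse a 'DD Mon YYYY' issue date such as '01 Jan 1765' into a date object."""
    return datetime.strptime(text.strip(), "%d %b %Y").date()
```

Parsed dates can then be sorted or filtered by year, for instance to restrict a search to issues published before 1776.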

    Important Guidelines on Searching

    When searching for a first name of an individual, be sure to also check the Last Name Field, as single names and pseudonyms are listed here.

    When searching for a first or last name, also check the Subject Field, as the names of businesses and organizations are entered here. Also, keep in mind that a person can be the Subject of an entry.

    Check for the names of towns, states, and counties in the Subject Field as well as the individual fields.

  20. Trade Name

    • catalog.data.gov
    • opendata.dc.gov
    • +4more
    Updated Jun 11, 2025
    Department of Licensing and Consumer Protection (2025). Trade Name [Dataset]. https://catalog.data.gov/dataset/trade-name
    Explore at:
    Dataset updated
    Jun 11, 2025
    Dataset provided by
    Department of Licensing and Consumer Protection
    Description

    If a business or unregistered entity (sole proprietor, general partnership, etc.) wishes to do business under a name that is different from its registered name or true legal name, it may register a trade name. A trade name or “Doing Business As” name is optional and is not required in order to conduct business in DC. However, if a sole proprietor, general partnership or registered entity is using a trade name, it must be registered and on record with the Corporations Division. The dataset contains the following columns: trade names, effective date, trade name status, file number, trade name expiration date, and initial file number. More information can be found at https://dlcp.dc.gov/node/1619191

