100+ datasets found
  1. American Names by Multi-Ethnic/National Origin

    • kaggle.com
    zip
    Updated Aug 22, 2023
    Cite
    Louis Teitelbaum (2023). American Names by Multi-Ethnic/National Origin [Dataset]. https://www.kaggle.com/datasets/louisteitelbaum/american-names-by-multi-ethnic-national-origin
    Explore at:
    Available download formats: zip (778154 bytes)
    Dataset updated
    Aug 22, 2023
    Authors
    Louis Teitelbaum
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Area covered
    United States
    Description

    This dataset includes all personal names listed in the Wikipedia category “American people by ethnic or national origin” and all subcategories fitting the pattern “American People of [ ] descent”, in total more than 25,000 individuals. Each individual is represented by a row, with columns indicating binary membership (0/1) in each ethnic/national category.

    Ethnicity inference is an essential tool for identifying disparities in public health and social sciences. Existing datasets linking personal names to ethnic or national origin often neglect to recognize multi-ethnic or multi-national identities. Furthermore, existing datasets use coarse classification schemes (e.g. classifying both Indian and Japanese people as “Asian”) that may not be suitable for many research questions. This dataset remedies these problems by including both very fine-grained ethnic/national categories (e.g. Afghan-Jewish) and broader ones (e.g. European). Users can choose the categories that are relevant to their research. Since many Americans on Wikipedia are associated with multiple overlapping or distinct ethnicities/nationalities, these multi-ethnic associations are also reflected in the data.

    Data were obtained from the Wikipedia API and reviewed manually to remove stage names, pen names, mononyms, first initials (when full names are available on Wikipedia), nicknames, honorific titles, and pages that correspond to a group or event rather than an individual.

    This dataset was designed for use in training classification algorithms, but may also be independently interesting inasmuch as it is a representative sample of Americans who are famous enough to have their own Wikipedia page, along with detailed information on their ethnic/national origins.

    DISCLAIMER: Due to the incomplete nature of Wikipedia, data may not properly reflect all ethnic/national associations for any given individual. For example, there is no guarantee that a given Cuban Jewish person will be listed in both the “American People of Cuban descent” and the “American People of Jewish descent” categories.
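The binary membership columns described above lend themselves to simple filtering. The following is a minimal sketch, assuming a CSV layout with a `name` column and one 0/1 column per category; the column names, the sample rows, and the helper `select_by_categories` are all hypothetical and not taken from the dataset itself.

```python
import csv
import io

# Hypothetical sample: one row per person, one 0/1 column per category.
# Column names and rows are invented for illustration only.
SAMPLE = """name,Cuban,Jewish,European
Ana Example,1,1,0
Ben Example,0,1,1
Cara Example,1,0,0
"""

def select_by_categories(rows, required):
    """Return names whose membership flag is 1 for every category in `required`."""
    return [r["name"] for r in rows if all(r[c] == "1" for c in required)]

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
both = select_by_categories(rows, ["Cuban", "Jewish"])
print(both)
```

Because a person may be missing from a category they belong to (see the disclaimer), a 0 in a column is better read as "not listed" than as "not a member".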

  2. Popular Baby Names

    • catalog.data.gov
    • data.cityofnewyork.us
    • +5 more
    Updated Jul 12, 2025
    + more versions
    Cite
    data.cityofnewyork.us (2025). Popular Baby Names [Dataset]. https://catalog.data.gov/dataset/popular-baby-names
    Explore at:
    Dataset updated
    Jul 12, 2025
    Dataset provided by
    data.cityofnewyork.us
    Description

    Popular baby names by sex and ethnic group. Data were collected through civil birth registration. Each record represents the ranking of a baby name in order of frequency. Data can be used to represent the popularity of a name. Caution should be used when assessing the rank of a baby name if the frequency count is close to 10; the ranking may vary year to year.

  3. Baby Names from Social Security Card Applications - National Data

    • catalog.data.gov
    Updated Jul 4, 2025
    + more versions
    Cite
    Social Security Administration (2025). Baby Names from Social Security Card Applications - National Data [Dataset]. https://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-national-data
    Explore at:
    Dataset updated
    Jul 4, 2025
    Dataset provided by
    Social Security Administration (http://ssa.gov/)
    Description

    The data (name, year of birth, sex, and number) are from a 100 percent sample of Social Security card applications from 1880 onward.
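As a rough illustration of working with the four fields named above, the sketch below totals counts per (name, sex) across years. The file layout is an assumption, not something the listing states: it supposes one comma-separated `name,sex,count` line per record, with the birth year supplied separately. All sample values are invented.

```python
from collections import Counter

# Assumed layout (not stated in the listing): "name,sex,count" lines,
# grouped by year of birth. The numbers below are invented.
SAMPLE_BY_YEAR = {
    1880: "Mary,F,100\nJohn,M,120\nAnna,F,40\n",
    1881: "Mary,F,90\nJohn,M,110\n",
}

def total_by_name(sample_by_year):
    """Sum the counts for each (name, sex) pair over all years."""
    totals = Counter()
    for _year, text in sample_by_year.items():
        for line in text.splitlines():
            name, sex, count = line.split(",")
            totals[(name, sex)] += int(count)
    return totals

totals = total_by_name(SAMPLE_BY_YEAR)
print(totals.most_common(1))
```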

  4. Namesakes

    • figshare.com
    json
    Updated Nov 20, 2021
    Cite
    Oleg Vasilyev; Aysu Altun; Nidhi Vyas; Vedant Dharnidharka; Erika Lampert; John Bohannon (2021). Namesakes [Dataset]. http://doi.org/10.6084/m9.figshare.17009105.v1
    Explore at:
    Available download formats: json
    Dataset updated
    Nov 20, 2021
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Oleg Vasilyev; Aysu Altun; Nidhi Vyas; Vedant Dharnidharka; Erika Lampert; John Bohannon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract

    Motivation: creating a challenging dataset for testing Named-Entity Linking. The Namesakes dataset consists of three closely related datasets: Entities, News and Backlinks. Entities were collected as Wikipedia text chunks corresponding to highly ambiguous entity names. The News were collected as random news text chunks, containing mentions that either belong to the Entities dataset or can be easily confused with them. Backlinks were obtained from Wikipedia dump data with the intention of having mentions linked to the entities of the Entities dataset. The Entities and News are human-labeled, resolving the mentions of the entities.

    Methods

    Entities were collected as Wikipedia text chunks corresponding to highly ambiguous entity names: the most popular people names, the most popular locations, and organizations with name ambiguity. In each Entities text chunk, the named entities with a name similar to the chunk's Wikipedia page name are labeled. For labeling, these entities were suggested to human annotators (odetta.ai) to tag as "Same" (same as the page entity) or "Other". The labeling was done by 6 experienced annotators who had passed a preliminary trial task. The only accepted tags are those assigned in agreement by at least 5 annotators and then passed through reconciliation with an experienced reconciliator.

    The News were collected as random news text chunks, containing mentions which either belong to the Entities dataset or can be easily confused with them. In each News text chunk one mention was selected for labeling, and 3-10 Wikipedia pages from Entities were suggested as the labels for an annotator to choose from. The labeling was done by 3 experienced annotators (odetta.ai), after the annotators passed a preliminary trial task. The results were reconciled by an experienced reconciliator. All the labeling was done using Lighttag (lighttag.io).

    Backlinks were obtained from Wikipedia dump data (dumps.wikimedia.org/enwiki/20210701) with the intention of having mentions linked to the entities of the Entities dataset. The backlinks were filtered to leave only mentions in good-quality text; each text was cut 1000 characters after the last mention.

    Usage Notes

    Entities: File: Namesakes_entities.jsonl. The Entities dataset consists of 4148 Wikipedia text chunks containing human-tagged mentions of entities. Each mention is tagged either as "Same" (meaning that the mention is of this Wikipedia page entity), or "Other" (meaning that the mention is of some other entity, just having the same or similar name). The Entities dataset is a jsonl list; each item is a dictionary with the following keys and values:
    • 'pagename': page name of the Wikipedia page.
    • 'pageid': page id of the Wikipedia page.
    • 'title': title of the Wikipedia page.
    • 'url': URL of the Wikipedia page.
    • 'text': the text chunk from the Wikipedia page.
    • 'entities': list of the mentions in the page text; each entity is represented by a dictionary with the keys:
        • 'text': the mention as a string from the page text.
        • 'start': start character position of the entity in the text.
        • 'end': end (one-past-last) character position of the entity in the text.
        • 'tag': annotation tag given as a string, either 'Same' or 'Other'.
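A minimal sketch of reading records that use the Entities keys described above. It relies on 'end' being one-past-last, so `text[start:end]` recovers the mention. The sample record and the helper name `mentions_by_tag` are invented for illustration; only the key names come from the description.

```python
import json

# One hypothetical record with the documented keys; all values are invented.
SAMPLE_JSONL = json.dumps({
    "pagename": "John_Smith_(explorer)",
    "pageid": 1,
    "title": "John Smith (explorer)",
    "url": "https://example.org/wiki/John_Smith_(explorer)",
    "text": "John Smith sailed in 1607. Another John Smith was a poet.",
    "entities": [
        {"text": "John Smith", "start": 0, "end": 10, "tag": "Same"},
        {"text": "John Smith", "start": 35, "end": 45, "tag": "Other"},
    ],
})

def mentions_by_tag(jsonl_text):
    """Group (pageid, start, mention) triples by their 'Same'/'Other' tag."""
    grouped = {"Same": [], "Other": []}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        for ent in record["entities"]:
            # 'end' is one-past-last, so a plain slice yields the mention.
            span = record["text"][ent["start"]:ent["end"]]
            if span != ent["text"]:
                raise ValueError("offsets do not match the mention text")
            grouped[ent["tag"]].append((record["pageid"], ent["start"], span))
    return grouped

grouped = mentions_by_tag(SAMPLE_JSONL)
print(grouped)
```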

    News: File: Namesakes_news.jsonl. The News dataset consists of 1000 news text chunks, each one with a single annotated entity mention. The annotation either points to the corresponding entity from the Entities dataset (if the mention is of that entity), or indicates that the mentioned entity does not belong to the Entities dataset. The News dataset is a jsonl list; each item is a dictionary with the following keys and values:
    • 'id_text': id of the sample.
    • 'text': the text chunk.
    • 'urls': list of URLs of Wikipedia entities suggested to labelers for identification of the entity mentioned in the text.
    • 'entity': a dictionary describing the annotated entity mention in the text:
        • 'text': the mention as a string found by an NER model in the text.
        • 'start': start character position of the mention in the text.
        • 'end': end (one-past-last) character position of the mention in the text.
        • 'tag': this key exists only if the mentioned entity is annotated as belonging to the Entities dataset. If so, the value is a dictionary identifying the Wikipedia page assigned by annotators to the mentioned entity:
            • 'pageid': Wikipedia page id.
            • 'pagetitle': page title.
            • 'url': page URL.
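Since the 'tag' key is present only for mentions that belong to the Entities dataset, code reading the News file has to treat it as optional. The sketch below shows one way to do that; the two sample records and the helper `linked_pages` are hypothetical, and only the key names follow the description above.

```python
import json

# Two invented records: the first is linked to an Entities page, the second
# has no 'tag' key because its mention is not in the Entities dataset.
SAMPLE_NEWS = "\n".join([
    json.dumps({
        "id_text": "n1",
        "text": "John Smith founded the colony.",
        "urls": ["https://example.org/wiki/1", "https://example.org/wiki/2"],
        "entity": {
            "text": "John Smith", "start": 0, "end": 10,
            "tag": {"pageid": 1, "pagetitle": "John Smith (explorer)",
                    "url": "https://example.org/wiki/1"},
        },
    }),
    json.dumps({
        "id_text": "n2",
        "text": "John Smith won the local bake-off.",
        "urls": ["https://example.org/wiki/1", "https://example.org/wiki/2"],
        "entity": {"text": "John Smith", "start": 0, "end": 10},
    }),
])

def linked_pages(jsonl_text):
    """Map each sample id to its annotated pageid, or None when unlinked."""
    result = {}
    for line in jsonl_text.splitlines():
        record = json.loads(line)
        tag = record["entity"].get("tag")  # absent for out-of-dataset mentions
        result[record["id_text"]] = tag["pageid"] if tag else None
    return result

links = linked_pages(SAMPLE_NEWS)
print(links)
```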

    Backlinks dataset: The Backlinks dataset consists of two parts: dictionary Entity-to-Backlinks and Backlinks documents. The dictionary points to backlinks for each entity of the Entity dataset (if any backlinks exist for the entity). The Backlinks documents are the backlinks Wikipedia text chunks with identified mentions of the entities from the Entities dataset.

    Each mention is identified by surrounding double square brackets, e.g. "Muir built a small cabin along [[Yosemite Creek]].". However, if the mention differs from the exact entity name, the double square brackets wrap both the exact name and, separated by '|', the mention string to the right, for example: "Muir also spent time with photographer [[Carleton E. Watkins | Carleton Watkins]] and studied his photographs of Yosemite.".
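The bracket convention above can be parsed with a regular expression. This is a sketch rather than the authors' tooling: the pattern and the helper `extract_mentions` are assumptions, tested here only against the two example sentences quoted above.

```python
import re

# [[Entity name]] or [[Entity name | mention string]]
MENTION = re.compile(r"\[\[([^\[\]|]+?)(?:\s*\|\s*([^\[\]]+?))?\]\]")

def extract_mentions(text):
    """Return (entity_name, mention_string) pairs found in a Backlinks text."""
    pairs = []
    for match in MENTION.finditer(text):
        entity = match.group(1).strip()
        # Without a '|' part, the mention string equals the entity name.
        mention = (match.group(2) or entity).strip()
        pairs.append((entity, mention))
    return pairs

plain = extract_mentions("Muir built a small cabin along [[Yosemite Creek]].")
piped = extract_mentions(
    "Muir also spent time with photographer "
    "[[Carleton E. Watkins | Carleton Watkins]] and studied his photographs."
)
print(plain, piped)
```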

    The Entity-to-Backlinks dictionary is a jsonl with 1527 items. File: Namesakes_backlinks_entities.jsonl. Each item is a tuple:
    • Entity name.
    • Entity Wikipedia page id.
    • Backlinks ids: a list of pageids of backlink documents.

    The Backlinks documents file is a jsonl with 26903 items. File: Namesakes_backlinks_texts.jsonl. Each item is a dictionary:
    • 'pageid': id of the Wikipedia page.
    • 'title': title of the Wikipedia page.
    • 'content': text chunk from the Wikipedia page, with all mentions in double brackets; the text is cut 1000 characters after the last mention, and the cut is denoted as '...[CUT]'.
    • 'mentions': list of the mentions from the text, for convenience. Each mention is a tuple:
        • Entity name.
        • Entity Wikipedia page id.
        • Sorted list of all character indexes at which the mention occurrences start in the text.

  5. USA Names

    • console.cloud.google.com
    Updated Jul 15, 2023
    Cite
    https://console.cloud.google.com/marketplace/browse?filter=partner:U.S.%20Social%20Security%20Administration&hl=de (2023). USA Names [Dataset]. https://console.cloud.google.com/marketplace/product/social-security-administration/us-names?hl=de
    Explore at:
    Dataset updated
    Jul 15, 2023
    Dataset provided by
    Google (http://google.com/)
    Area covered
    United States
    Description

    This public dataset was created by the Social Security Administration and contains all names from Social Security card applications for births that occurred in the United States after 1879. Note that many people born before 1937 never applied for a Social Security card, so their names are not included in this data. For others who did apply, records may not show the place of birth, and again their names are not included in the data. All data are from a 100% sample of records on Social Security card applications as of the end of February 2015. To safeguard privacy, the Social Security Administration restricts names to those with at least 5 occurrences. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset.

  6. Baby Names DataSet

    • kaggle.com
    zip
    Updated Mar 21, 2019
    Cite
    Samrat Rai (2019). Baby Names DataSet [Dataset]. https://www.kaggle.com/samrat77/baby-names-dataset
    Explore at:
    Available download formats: zip (87271 bytes)
    Dataset updated
    Mar 21, 2019
    Authors
    Samrat Rai
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    There's a story behind every dataset and here's your opportunity to share yours.

    Content

    What's inside is more than just rows and columns. Make it easy for others to get started by describing how you acquired the data and what time period it represents, too.

    Acknowledgements

    We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research.

    Inspiration

    Your data will be in front of the world's largest data science community. What questions do you want to see answered?

  7. Indian Names Dataset

    • kaggle.com
    zip
    Updated Jul 19, 2023
    Cite
    Harsh Malhotra (2023). Indian Names Dataset [Dataset]. https://www.kaggle.com/datasets/harshm27/indian-names-dataset
    Explore at:
    Available download formats: zip (1128 bytes)
    Dataset updated
    Jul 19, 2023
    Authors
    Harsh Malhotra
    Description

    A dataset full of first names can be incredibly helpful in various applications and analyses. Firstly, it can be used in demographic studies to analyze naming trends and patterns over time, providing insights into cultural and societal changes. Additionally, such a dataset can be utilized in market research and targeted advertising, allowing businesses to personalize their marketing strategies based on customers' names. It can also be employed in language processing tasks, such as named entity recognition, sentiment analysis, or gender prediction. Moreover, the dataset can serve as a valuable resource for generating test data, creating fictional characters, or enhancing natural language generation models. Overall, a comprehensive first name dataset has diverse applications across multiple domains.

    This dataset contains common Indian names. The data can be used for any sort of NLP problem, such as name generation. An upvote would be appreciated :)

  8. Labelled FHYA Dataset

    • zivahub.uct.ac.za
    txt
    Updated Feb 2, 2022
    Cite
    Jarryd Dunn (2022). Labelled FHYA Dataset [Dataset]. http://doi.org/10.25375/uct.19029692.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Feb 2, 2022
    Dataset provided by
    University of Cape Town
    Authors
    Jarryd Dunn
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This collection contains the datasets created as part of a master's thesis. The collection consists of two datasets in two forms, as well as the corresponding entity descriptions for each of the datasets.

    The experiment_doc_labels_clean documents contain the data used for the experiments. The JSON file consists of a list of JSON objects. The JSON objects contain the following fields:
    • id: Document id.
    • ner_tags: List of IOB tags indicating mention boundaries based on the majority label assigned using crowdsourcing.
    • el_tags: List of entity ids based on the majority label assigned using crowdsourcing.
    • all_ner_tags: List of lists of IOB tags assigned by each of the users.
    • all_el_tags: List of lists of entity IDs assigned by each of the users annotating the data.
    • tokens: List of tokens from the text.

    The experiment_doc_labels_clean-U.tsv file contains the dataset used for the experiments, but in a format similar to the CoNLL-U format. The first line for each document contains the document ID. The documents are separated by a blank line. Each word in a document is on its own line, consisting of the word, the IOB tag and the entity id separated by tabs.

    While the experiments were being completed, the annotation system was left open until all the documents had been annotated by three users. This resulted in the all_docs_complete_labels_clean.json and all_docs_complete_labels_clean-U.tsv datasets. These documents take the same form as experiment_doc_labels_clean.json and experiment_doc_labels_clean-U.tsv.

    Each of the documents described above contains an entity id. The IDs match the entities stored in the entity_descriptions CSV files. Each row in these files corresponds to a mention of an entity and takes the form: {ID}${Mention}${Context}[N]

    Three sets of entity descriptions are available:
    1. entity_descriptions_experiments.csv: This file contains all the mentions from the subset of the data used for the experiments as described above. However, the data has not been cleaned, so there are multiple entity IDs which actually refer to the same entity.
    2. entity_descriptions_experiments_clean.csv: These entities also cover the data used for the experiments; however, duplicate entities have been merged. These entities correspond to the labels for the documents in the experiment_doc_labels_clean files.
    3. entity_descriptions_all.csv: The entities in this file correspond to the data in the all_docs_complete_labels_clean files. Please note that the entities have not been cleaned, so there may be duplicate or incorrect entities.
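A minimal sketch of reading the CoNLL-U-like layout described for the -U.tsv files: a document ID line, then one tab-separated word / IOB tag / entity id line per token, with blank lines between documents. The sample text, the "-" placeholder used for tokens outside any mention, and the helper `parse_documents` are assumptions for illustration; the real files may differ in detail.

```python
# Invented sample in the described layout. The "-" entity id for tokens
# outside a mention is an assumption, not something the description states.
SAMPLE_TSV = (
    "doc-1\n"
    "John\tB\tE12\n"
    "Dunn\tI\tE12\n"
    "spoke\tO\t-\n"
    "\n"
    "doc-2\n"
    "Cape\tB\tE7\n"
    "Town\tI\tE7\n"
)

def parse_documents(tsv_text):
    """Return {doc_id: [(word, iob_tag, entity_id), ...]} for each document."""
    documents = {}
    for block in tsv_text.strip().split("\n\n"):
        lines = block.splitlines()
        doc_id = lines[0].strip()          # first line holds the document ID
        documents[doc_id] = [tuple(line.split("\t")) for line in lines[1:]]
    return documents

docs = parse_documents(SAMPLE_TSV)
print(docs["doc-1"])
```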

  9. USA Name Data

    • kaggle.com
    zip
    Updated Feb 12, 2019
    Cite
    Data.gov (2019). USA Name Data [Dataset]. https://www.kaggle.com/datagov/usa-names
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Feb 12, 2019
    Dataset provided by
    Data.gov (https://data.gov/)
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    United States
    Description

    Context

    Cultural diversity in the U.S. has led to great variations in names and naming traditions and names have been used to express creativity, personality, cultural identity, and values. Source: https://en.wikipedia.org/wiki/Naming_in_the_United_States

    Content

    This public dataset was created by the Social Security Administration and contains all names from Social Security card applications for births that occurred in the United States after 1879. Note that many people born before 1937 never applied for a Social Security card, so their names are not included in this data. For others who did apply, records may not show the place of birth, and again their names are not included in the data.

    All data are from a 100% sample of records on Social Security card applications as of the end of February 2015. To safeguard privacy, the Social Security Administration restricts names to those with at least 5 occurrences.

    Fork this kernel to get started with this dataset.

    Acknowledgements

    https://bigquery.cloud.google.com/dataset/bigquery-public-data:usa_names

    https://cloud.google.com/bigquery/public-data/usa-names

    Dataset Source: Data.gov. This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source — http://www.data.gov/privacy-policy#data_policy — and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

    Banner Photo by @dcp from Unsplash.

    Inspiration

    What are the most common names?

    What are the most common female names?

    Are there more female or male names?

    Female names by a wide margin?

  10. Popular Baby Names - Dataset - data.sa.gov.au

    • data.sa.gov.au
    Updated Mar 1, 2025
    + more versions
    Cite
    (2025). Popular Baby Names - Dataset - data.sa.gov.au [Dataset]. https://data.sa.gov.au/data/dataset/popular-baby-names
    Explore at:
    Dataset updated
    Mar 1, 2025
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    South Australia
    Description

    List of male and female baby names in South Australia from 1944 to 2024. The annual data for baby names is published January/February each year.

  11. Top 100 baby names in England and Wales: historical data

    • ons.gov.uk
    • cy.ons.gov.uk
    xlsx
    Updated Jul 31, 2025
    Cite
    Office for National Statistics (2025). Top 100 baby names in England and Wales: historical data [Dataset]. https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/livebirths/datasets/babynamesenglandandwalestop100babynameshistoricaldata
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jul 31, 2025
    Dataset provided by
    Office for National Statistics (http://www.ons.gov.uk/)
    License

    Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
    License information was derived automatically

    Description

    Historic lists of top 100 names for baby boys and girls for 1904 to 2024 at 10-yearly intervals.

  12. Distribution of first name and last name frequencies by country

    • figshare.com
    xlsx
    Updated Feb 2, 2023
    Cite
    Mike Thelwall (2023). Distribution of first name and last name frequencies by country [Dataset]. http://doi.org/10.6084/m9.figshare.21956795.v2
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Feb 2, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Mike Thelwall
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Distribution of first and last name frequencies of academic authors by country.

    Spreadsheet 1 contains 50 countries, with names based on affiliations in Scopus journal articles 2001-2021.

    Spreadsheet 2 contains 200 countries, with names based on affiliations in Scopus journal articles 2001-2021, using a marginally updated last name extraction algorithm that is almost the same except for Dutch/Flemish names.

    From the paper: Can national researcher mobility be tracked by first or last name uniqueness?

    For example, the distribution for the UK shows a single peak for international names, with no national names; Belgium has a national peak and an international peak; and China has mainly a national peak. The 50 countries are:

    No  Code  Country
    1   SB    Serbia
    2   IE    Ireland
    3   HU    Hungary
    4   CL    Chile
    5   CO    Colombia
    6   NG    Nigeria
    7   HK    Hong Kong
    8   AR    Argentina
    9   SG    Singapore
    10  NZ    New Zealand
    11  PK    Pakistan
    12  TH    Thailand
    13  UA    Ukraine
    14  SA    Saudi Arabia
    15  RO    Romania
    16  ID    Indonesia
    17  IL    Israel
    18  MY    Malaysia
    19  DK    Denmark
    20  CZ    Czech Republic
    21  ZA    South Africa
    22  AT    Austria
    23  FI    Finland
    24  PT    Portugal
    25  GR    Greece
    26  NO    Norway
    27  EG    Egypt
    28  MX    Mexico
    29  BE    Belgium
    30  CH    Switzerland
    31  SW    Sweden
    32  PL    Poland
    33  TW    Taiwan
    34  NL    Netherlands
    35  TK    Turkey
    36  IR    Iran
    37  RU    Russia
    38  AU    Australia
    39  BR    Brazil
    40  KR    South Korea
    41  ES    Spain
    42  CA    Canada
    43  IT    Italy
    44  FR    France
    45  IN    India
    46  DE    Germany
    47  US    USA
    48  UK    UK
    49  JP    Japan
    50  CN    China

  13. Popular 5000 Last Names in the US

    • johnsnowlabs.com
    csv
    Updated Jan 20, 2021
    Cite
    John Snow Labs (2021). Popular 5000 Last Names in the US [Dataset]. https://www.johnsnowlabs.com/marketplace/popular-5000-last-names-in-the-us/
    Explore at:
    Available download formats: csv
    Dataset updated
    Jan 20, 2021
    Dataset authored and provided by
    John Snow Labs
    Area covered
    United States
    Description

    This dataset lists 5000 popular last names in the United States. The data are broken down by race, showing the percentage for each category.

  14. Top 100 Baby Names

    • data.qld.gov.au
    • researchdata.edu.au
    • +1 more
    csv
    Updated Feb 13, 2025
    + more versions
    Cite
    Justice (2025). Top 100 Baby Names [Dataset]. https://www.data.qld.gov.au/dataset/top-100-baby-names
    Explore at:
    Available download formats: csv (2 KiB), csv, csv (200 KiB)
    Dataset updated
    Feb 13, 2025
    Dataset authored and provided by
    Justice
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    Queensland Top 100 Baby Names

  15. Baby names for girls in England and Wales

    • ons.gov.uk
    • cy.ons.gov.uk
    xlsx
    Updated Jul 31, 2025
    Cite
    Office for National Statistics (2025). Baby names for girls in England and Wales [Dataset]. https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/livebirths/datasets/babynamesenglandandwalesbabynamesstatisticsgirls
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jul 31, 2025
    Dataset provided by
    Office for National Statistics (http://www.ons.gov.uk/)
    License

    Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
    License information was derived automatically

    Description

    Rank and count of the top names for baby girls, changes in rank since the previous year and breakdown by country, region, mother's age and month of birth.

  16. Facebook Profiles Datasets

    • brightdata.com
    .json, .csv, .xlsx
    Updated Jun 19, 2024
    Cite
    Bright Data (2024). Facebook Profiles Datasets [Dataset]. https://brightdata.com/products/datasets/facebook/profiles
    Explore at:
    Available download formats: .json, .csv, .xlsx
    Dataset updated
    Jun 19, 2024
    Dataset authored and provided by
    Bright Data (https://brightdata.com/)
    License

    https://brightdata.com/license

    Area covered
    Worldwide
    Description

    Use our Facebook Profiles dataset to explore public profile details such as names, profile and cover photos, work history, education, and photo galleries. Common use cases include people and company research, influencer discovery, and academic studies of career and education signals on Facebook.
    • Over 31M records available
    • Price starts at $250/100K records
    • Data formats are available in JSON, NDJSON, CSV, XLSX and Parquet
    • 100% ethical and compliant data collection

    Included datapoints:
    • Profile URL
    • Profile Name
    • Facebook Profile ID
    • Profile Photo
    • Cover Photo
    • Work History (Title, Company, Company ID, Company URL, Start/End Dates)
    • College Education (Name, ID, URL)
    • High School Education (Name, ID, URL)
    • Photo Galleries
    • And much more

  17. Popular White Last Names in the US

    • johnsnowlabs.com
    csv
    Updated Jan 20, 2021
    Cite
    John Snow Labs (2021). Popular White Last Names in the US [Dataset]. https://www.johnsnowlabs.com/marketplace/popular-white-last-names-in-the-us/
    Explore at:
    Available download formats: csv
    Dataset updated
    Jan 20, 2021
    Dataset authored and provided by
    John Snow Labs
    Area covered
    United States
    Description

    This dataset lists popular last names among White people in the United States.

  18. Baby names for boys in England and Wales

    • ons.gov.uk
    • cy.ons.gov.uk
    xlsx
    Updated Jul 31, 2025
    Cite
    Office for National Statistics (2025). Baby names for boys in England and Wales [Dataset]. https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/livebirths/datasets/babynamesenglandandwalesbabynamesstatisticsboys
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jul 31, 2025
    Dataset provided by
    Office for National Statistics (http://www.ons.gov.uk/)
    License

    Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
    License information was derived automatically

    Description

    Rank and count of the top names for baby boys, changes in rank since the previous year and breakdown by country, region, mother's age and month of birth.

  19. glenglat: Global englacial temperature database

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    Updated Mar 13, 2025
    Cite
    Welty, Ethan; Jacquemart, Mylène; Carcanade, Guillem; Gastaldello, Marcus; Van Tricht, Lander; Flowers, Gwenn E.; Sugiyama, Shin; Gurung, Tika Ram; Prinz, Rainer; Abermann, Jakob; Steiner, Jakob Friedrich; Barandun, Martina; Gagliardini, Olivier; Vincent, Christian; Thompson, Lonnie G.; Zhang, Tong; Hoelzle, Martin (2025). glenglat: Global englacial temperature database [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_11516611
    Explore at:
    Dataset updated
    Mar 13, 2025
    Dataset provided by
    State Key Laboratory of Earth Surface Processes and Resource Ecology (ESPRE), Beijing Normal University (BNU), China
    Institute of Low Temperature Science (ILTS), Hokkaido University, Japan
    Department of Atmospheric and Cryospheric Sciences (ACINN), University of Innsbruck (UIBK), Austria
    Byrd Polar and Climate Research Center
    Department of Earth Sciences, Simon Fraser University (SFU), Canada
    Department of Geosciences, University of Fribourg (UniFr), Switzerland
    World Glacier Monitoring Service
    Laboratory of Hydraulics, Hydrology and Glaciology (VAW), ETH Zurich, Switzerland
    Institut des Géosciences de l'Environnement
    Department of Water and Climate, Vrije Universiteit Brussel (VUB), Belgium
    Department of Geography and Regional Science, University of Graz, Austria
    International Centre for Integrated Mountain Development
    Authors
    Welty, Ethan; Jacquemart, Mylène; Carcanade, Guillem; Gastaldello, Marcus; Van Tricht, Lander; Flowers, Gwenn E.; Sugiyama, Shin; Gurung, Tika Ram; Prinz, Rainer; Abermann, Jakob; Steiner, Jakob Friedrich; Barandun, Martina; Gagliardini, Olivier; Vincent, Christian; Thompson, Lonnie G.; Zhang, Tong; Hoelzle, Martin
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Open-access database of englacial temperature measurements compiled from data submissions and published literature. It is developed on GitHub and published to Zenodo. The dataset is described in the following publication:

    Mylène Jacquemart, Ethan Welty, Marcus Gastaldello, and Guillem Carcanade (2025). glenglat: A database of global englacial temperatures. Earth System Science Data Discussions. https://doi.org/10.5194/essd-2024-249

    Dataset structure

    The dataset adheres to the Frictionless Data Tabular Data Package specification. The metadata in datapackage.json describes, in detail, the contents of the tabular data files in the data folder:

    source.csv: Description of each data source (either a personal communication or the reference to a published study).

    borehole.csv: Description of each borehole (location, elevation, etc), linked to source.csv via source_id and less formally via source identifiers in notes.

    profile.csv: Description of each profile (date, etc), linked to borehole.csv via borehole_id and to source.csv via source_id and less formally via source identifiers in notes.

    measurement.csv: Description of each measurement (depth and temperature), linked to profile.csv via borehole_id and profile_id.

    For boreholes with many profiles (e.g. from automated loggers), pairs of profile.csv and measurement.csv are stored separately in subfolders of data named {source.id}-{glacier}, where glacier is a simplified and kebab-cased version of the glacier name (e.g. flowers2022-little-kluane).
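
The key relationships described above can be illustrated with a small join. This is only a sketch using Python's standard library: the sample rows and the source identifier example2020 are made up, and only a few of the documented columns are included.

```python
import csv
import io

# Made-up sample rows; the real files live in the package's data folder.
PROFILE_CSV = """borehole_id,id,source_id,date_max
1,1,example2020,2020-07-01
1,2,example2020,2020-08-01
"""

MEASUREMENT_CSV = """borehole_id,profile_id,depth,temperature
1,1,5.0,-2.1
1,1,10.0,-3.4
1,2,5.0,-1.9
"""


def read_rows(text):
    """Parse CSV text into a list of dicts (all values stay strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def join_measurements(profiles, measurements):
    """Attach each measurement to its profile via (borehole_id, profile_id)."""
    by_key = {(p["borehole_id"], p["id"]): p for p in profiles}
    joined = []
    for m in measurements:
        profile = by_key[(m["borehole_id"], m["profile_id"])]
        joined.append({
            "borehole_id": m["borehole_id"],
            "profile_id": m["profile_id"],
            "date_max": profile["date_max"],
            "depth": float(m["depth"]),
            "temperature": float(m["temperature"]),
        })
    return joined


rows = join_measurements(read_rows(PROFILE_CSV), read_rows(MEASUREMENT_CSV))
for row in rows:
    print(row)
```

The same lookup pattern would extend to borehole.csv by keying on borehole_id alone.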

    Supporting information

    The folder sources, available on GitHub but omitted from dataset releases on Zenodo, contains subfolders (with names matching column source.id) with files that document how and from where the data was extracted.

    Tables

    Jump to: source · borehole · profile · measurement

    source

    Sources of information considered in the compilation of this database. Column names and categorical values closely follow the Citation Style Language (CSL) 1.0.2 specification. Names of people in non-Latin scripts are followed by a latinization in square brackets (e.g. В. С. Загороднов [V. S. Zagorodnov]) and non-English titles are followed by a translation in square brackets. The family name of Latin-script names is wrapped in curly braces when it is not the last word of the name (e.g. Emmanuel {Le Meur}, {Duan} Keqin) or the name ends in two or more unabbreviated words (e.g. Jon Ove {Hagen}). The family name of a Chinese name (and of the latinization) is wrapped in curly braces when it is not the first character.

    name type description

    id (required) string Unique identifier constructed from the first author's lowercase, latinized, family name and the publication year, followed as needed by a lowercase letter to ensure uniqueness (e.g. Загороднов 1981 → zagorodnov1981a).

    author string Author names (optionally followed by their ORCID or contact email in parentheses) as a pipe-delimited list.

    year (required) year Year of publication.

    type (required) string Item type.
    - article-journal: Journal article
    - book: Book (if the entire book is relevant)
    - chapter: Book section
    - document: Document not fitting into any other category
    - dataset: Collection of data
    - map: Geographic map
    - paper-conference: Paper published in conference proceedings
    - personal-communication: Personal communication between individuals
    - speech: Presentation (talk, poster) at a conference
    - report: Report distributed by an institution
    - thesis-phd: Doctor of Philosophy (PhD) thesis
    - thesis-msc: Master of Science (MSc) thesis
    - webpage: Website or page on a website

    title (required) string Item title.

    url string URL (DOI if available).

    language (required) string Language as ISO 639-1 two-letter language code.
    - da: Danish
    - de: German
    - en: English
    - es: Spanish
    - fr: French
    - ja: Japanese
    - ko: Korean
    - ru: Russian
    - sv: Swedish
    - zh: Chinese

    container_title string Title of the container (e.g. journal, book).

    volume integer Volume number of the item or container.

    issue string Issue number (e.g. 1) or range (e.g. 1-2) of the item or container, with an optional letter prefix (e.g. F1) or part number (e.g. 75pt2).

    page string Page number (e.g. 1) or range (e.g. 1-2) of the item in the container, with an optional letter prefix (e.g. S1).

    version string Version number (e.g. 1.0) of the item.

    editor string Editor names (e.g. of the containing book) as a pipe-delimited list.

    collection_title string Title of the collection (e.g. book series).

    collection_number string Number (e.g. 1) or range (e.g. 1-2) in the collection (e.g. book series volume).

    publisher string Publisher name.
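
The curly-brace convention for family names described at the top of this table can be read mechanically. The sketch below is an illustration rather than part of the database tooling: it covers only the Latin-script rule (braced part if present, otherwise the last word), assumes a single name rather than a pipe-delimited list, and does not implement the separate rule for Chinese names. The function names are hypothetical.

```python
import re


def latin_form(name):
    """Return the bracketed latinization if present, else the name itself."""
    # Drop a trailing parenthetical (e.g. an ORCID or contact email).
    name = re.sub(r"\s*\([^)]*\)\s*$", "", name)
    match = re.search(r"\[([^\]]+)\]\s*$", name)
    return match.group(1).strip() if match else name.strip()


def family_name(name):
    """Family name: the curly-braced part if any, else the last word."""
    latin = latin_form(name)
    braced = re.search(r"\{([^}]+)\}", latin)
    if braced:
        return braced.group(1)
    return latin.split()[-1]


for example in [
    "Emmanuel {Le Meur}",
    "{Duan} Keqin",
    "Jon Ove {Hagen}",
    "В. С. Загороднов [V. S. Zagorodnov]",
]:
    print(example, "->", family_name(example))
```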

    borehole

    Metadata about each borehole.

    name type description

    id (required) integer Unique identifier.

    source_id (required) string Identifier of the source of the earliest temperature measurements. This is also the source of the borehole attributes unless otherwise stated in notes.

    glacier_name (required) string Glacier or ice cap name (as reported).

    glims_id string Global Land Ice Measurements from Space (GLIMS) glacier identifier.

    location_origin (required) string Origin of location (latitude, longitude).
    - submitted: Provided in data submission
    - published: Reported as coordinates in original publication
    - digitized: Digitized from published map with complete axes
    - estimated: Estimated from published plot by comparing to a map (e.g. Google Maps, CalTopo)
    - guessed: Estimated with difficulty, for example by comparing elevation to a map (e.g. Google Maps, CalTopo)

    latitude (required) number [degree] Latitude (EPSG 4326).

    longitude (required) number [degree] Longitude (EPSG 4326).

    elevation_origin (required) string Origin of elevation (elevation).
    - submitted: Provided in data submission
    - published: Reported as number in original publication
    - digitized: Digitized from published plot with complete axes
    - estimated: Estimated from elevation contours in published map
    - guessed: Estimated with difficulty, for example by comparing location (latitude, longitude) to a map of contemporary elevations (e.g. CalTopo, Google Maps)

    elevation (required) number [m] Elevation above sea level.

    mass_balance_area string Mass balance area.
    - ablation: Ablation area
    - equilibrium: Near the equilibrium line
    - accumulation: Accumulation area

    label string Borehole name (e.g. as labeled on a plot).

    date_min date (%Y-%m-%d) Begin date of drilling, or if not known precisely, the first possible date (e.g. 2019 → 2019-01-01).

    date_max date (%Y-%m-%d) End date of drilling, or if not known precisely, the last possible date (e.g. 2019 → 2019-12-31).

    drill_method string Drilling method.
    - mechanical: Push, percussion, rotary
    - thermal: Hot point, electrothermal, steam
    - combined: Mechanical and thermal

    ice_depth number [m] Starting depth of continuous ice. Infinity (INF) indicates that only snow, firn, or intermittent ice was reached.

    depth number [m] Total borehole depth (not including drilling in the underlying bed).

    to_bed boolean Whether the borehole reached the glacier bed.

    temperature_uncertainty number [°C] Estimated temperature uncertainty (as reported).

    notes string Additional remarks about the study site, the borehole, or the measurements therein as a pipe-delimited list. Sources are referenced by source.id. Quality concerns are prefixed with '[flag]'.

    curator string Names of people who added the data to the database, as a pipe-delimited list.

    investigators string Names of people and/or agencies who performed the work, as a pipe-delimited list. Each entry is in the format 'person (agency; ...) {notes}', where only person or one agency is required. Person and agency may contain a latinized form in square brackets.

    funding string Funding sources as a pipe-delimited list. Each entry is in the format 'funder [rorid] > award [number] url', where only funder is required and rorid is the funder's ROR (https://ror.org) ID (e.g. 01jtrvx49).
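
Several columns in this table are pipe-delimited lists, and notes marks quality concerns with a '[flag]' prefix. A minimal sketch of splitting such a value follows; the function name and the sample string are made up, and it assumes that whitespace around the pipe is not significant.

```python
def split_notes(notes):
    """Split a pipe-delimited notes string into (flagged, other) lists."""
    flagged, other = [], []
    for entry in notes.split("|"):
        entry = entry.strip()
        if not entry:
            continue  # ignore empty entries such as a trailing pipe
        if entry.startswith("[flag]"):
            flagged.append(entry[len("[flag]"):].strip())
        else:
            other.append(entry)
    return flagged, other


# Made-up example value, not taken from the database.
sample = "Drilled near the summit | [flag] Thermistor calibration not reported | see example2020"
flagged, other = split_notes(sample)
print("flagged:", flagged)
print("other:", other)
```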

    profile

    Date and time of each measurement profile.

    name type description

    borehole_id (required) integer Borehole identifier.

    id (required) integer Borehole profile identifier (starting from 1 for each borehole).

    source_id (required) string Source identifier.

    measurement_origin (required) string Origin of measurements (measurement.depth, measurement.temperature).
    - submitted: Provided as numbers in data submission
    - published: Numbers read from original publication
    - digitized-discrete: Digitized with Plot Digitizer from discrete points of depth versus temperature
    - digitized-continuous: Digitized with Plot Digitizer from a continuous data source (e.g. line plot of depth versus temperature)

    date_min date (%Y-%m-%d) Measurement date, or if not known precisely, the first possible date (e.g. 2019 → 2019-01-01).

    date_max (required) date (%Y-%m-%d) Measurement date, or if not known precisely, the last possible date (e.g. 2019 → 2019-12-31).

    time time (%H:%M:%S) Measurement time.

    utc_offset number [h] Time offset relative to Coordinated Universal Time (UTC).

    equilibrium string Whether and how reported temperatures equilibrated following drilling.
    - true: Equilibrium was measured
    - estimated: Equilibrium was estimated (typically by extrapolation)
    - false: Equilibrium was not reached

    notes string Additional remarks about the profile or the measurements therein as a pipe-delimited list. Sources are referenced by source.id. Quality concerns are prefixed with '[flag]'.
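
The date_min and date_max columns encode imprecise dates as the first and last possible day (e.g. 2019 → 2019-01-01 and 2019-12-31). The sketch below shows one way to compute such bounds. The year case follows the examples in the table; handling of year-month values is an assumption added for illustration, and the function name is hypothetical.

```python
import calendar
import datetime


def date_bounds(partial):
    """Expand 'YYYY', 'YYYY-MM', or 'YYYY-MM-DD' into (date_min, date_max)."""
    parts = [int(p) for p in partial.split("-")]
    if len(parts) == 1:
        year = parts[0]
        return datetime.date(year, 1, 1), datetime.date(year, 12, 31)
    if len(parts) == 2:
        # Assumption: a known month spans its first to its last day.
        year, month = parts
        last_day = calendar.monthrange(year, month)[1]
        return datetime.date(year, month, 1), datetime.date(year, month, last_day)
    day = datetime.date(*parts)
    return day, day


date_min, date_max = date_bounds("2019")
print(date_min.isoformat(), date_max.isoformat())
```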

    measurement

    Temperature measurements with depth.

    name type description

    borehole_id (required) integer Borehole identifier.

    profile_id (required) integer Borehole profile identifier.

    depth (required) number [m] Depth below the glacier surface.

    temperature (required) number [°C] Temperature.

  20. Data on Alaskan Population demographics ranging from 1940 to 2015

    • dataone.org
    • search.dataone.org
    Updated Feb 7, 2019
    Cite
    United States Census Bureau; Juliet Bachtel; John Randazzo (2019). Data on Alaskan Population demographics ranging from 1940 to 2015 [Dataset]. http://doi.org/10.5063/F1CV4FZX
    Dataset updated
    Feb 7, 2019
    Dataset provided by
    Knowledge Network for Biocomplexity
    Authors
    United States Census Bureau; Juliet Bachtel; John Randazzo
    Time period covered
    Jan 1, 1940 - Dec 31, 2015
    Area covered
    Variables measured
    lat, lng, Year, city, ANVSA, Negro, Other, Place, White, Aleut., and 138 more
    Description

    These data comprise Census records on Alaskan population demographics, compiled for the State of Alaskan Salmon and People (SASAP) Project. Decennial census data were originally extracted from the IPUMS National Historical Geographic Information System website: https://data2.nhgis.org/main (Citation: Steven Manson, Jonathan Schroeder, David Van Riper, and Steven Ruggles. IPUMS National Historical Geographic Information System: Version 12.0 [Database]. Minneapolis: University of Minnesota. 2017. http://doi.org/10.18128/D050.V12.0). A number of relevant tables of basic demographics on age and race, household income and poverty levels, and labor force participation were extracted.

    These particular variables were selected as part of an effort to understand and potentially quantify various dimensions of well-being in Alaskan communities. The file "censusdata_master.csv" is a consolidation of all 21 other data files in the package. For detailed information on how the datasets vary over different years, view the file "readme.docx" available in this data package.

    The included .Rmd file is a script which combines the 21 files by year into a single file (censusdata_master.csv). It also cleans up place names (including typographical errors) and uses the USGS place names dataset and the SASAP regions dataset to assign latitude and longitude values and region values to each place in the dataset. Note that some places were not assigned a region or location because they do not fit well into the regional framework.

    Considerable heterogeneity exists between census surveys each year. While we have attempted to combine these datasets in a way that makes sense, there may be some discrepancies or unexpected values. Please send a description of any unusual values to the dataset contact.
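
The package's own consolidation is done by the included .Rmd (R) script, which is not reproduced here. Purely as an illustration of the general idea of stacking per-year tables whose columns differ, a Python sketch could look like the following. The place name and values are made up, and the function is hypothetical; only the column names Year, Place, White and Other are taken from the variable list above.

```python
import csv
import io


def combine_tables(tables):
    """Stack per-year tables whose columns differ, tagging rows with Year.

    `tables` maps a year to CSV text. Columns missing from a given year
    are left empty in the combined output.
    """
    rows, columns = [], ["Year"]
    for year, text in sorted(tables.items()):
        for row in csv.DictReader(io.StringIO(text)):
            row = {"Year": str(year), **row}
            for name in row:
                if name not in columns:
                    columns.append(name)
            rows.append(row)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Made-up inputs: two survey years that report different columns.
TABLES = {
    1940: "Place,White\nExampleville,10\n",
    2010: "Place,Other\nExampleville,5\n",
}
combined = combine_tables(TABLES)
print(combined)
```

Leaving missing cells empty, rather than filling them with zeros, keeps "not reported in that census" distinguishable from a true zero count.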
    
Cite
Louis Teitelbaum (2023). American Names by Multi-Ethnic/National Origin [Dataset]. https://www.kaggle.com/datasets/louisteitelbaum/american-names-by-multi-ethnic-national-origin

American Names by Multi-Ethnic/National Origin

25,540 Americans, 491 Overlapping Ethnic/National Categories

Available download formats: zip (778154 bytes)
Dataset updated
Aug 22, 2023
Authors
Louis Teitelbaum
License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically

Area covered
United States
Description

This dataset includes all personal names listed in the Wikipedia category “American people by ethnic or national origin” and all subcategories fitting the pattern “American People of [ ] descent”, in total more than 25,000 individuals. Each individual is represented by a row, with columns indicating binary membership (0/1) in each ethnic/national category.

Ethnicity inference is an essential tool for identifying disparities in public health and social sciences. Existing datasets linking personal names to ethnic or national origin often neglect to recognize multi-ethnic or multi-national identities. Furthermore, existing datasets use coarse classification schemes (e.g. classifying both Indian and Japanese people as “Asian”) that may not be suitable for many research questions. This dataset remedies these problems by including both very fine-grained ethnic/national categories (e.g. Afghan-Jewish) and broader ones (e.g. European). Users can choose the categories that are relevant to their research. Since many Americans on Wikipedia are associated with multiple overlapping or distinct ethnicities/nationalities, these multi-ethnic associations are also reflected in the data.

Data were obtained from the Wikipedia API and reviewed manually to remove stage names, pen names, mononyms, first initials (when full names are available on Wikipedia), nicknames, honorific titles, and pages that correspond to a group or event rather than an individual.

This dataset was designed for use in training classification algorithms, but may also be independently interesting inasmuch as it is a representative sample of Americans who are famous enough to have their own Wikipedia page, along with detailed information on their ethnic/national origins.

DISCLAIMER: Due to the incomplete nature of Wikipedia, data may not properly reflect all ethnic/national associations for any given individual. For example, there is no guarantee that a given Cuban Jewish person will be listed in both the “American People of Cuban descent” and the “American People of Jewish descent” categories.
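
Because each row carries 0/1 membership flags, selecting individuals by one or more categories reduces to a column filter. The sketch below assumes a layout with a name column and one column per category; the actual column names in the file are not given in this description, and all rows shown are made up.

```python
import csv
import io

# Made-up rows; the real column names are not shown in this description.
SAMPLE = """name,Cuban,Jewish,European
Person A,1,1,0
Person B,0,1,1
Person C,0,0,1
"""


def members(rows, categories, require_all=True):
    """Names whose 0/1 flags are set for all (or any) of the categories."""
    combine = all if require_all else any
    return [r["name"] for r in rows if combine(r[c] == "1" for c in categories)]


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
both = members(rows, ["Cuban", "Jewish"])
either = members(rows, ["Cuban", "Jewish"], require_all=False)
print(both)
print(either)
```

Given the disclaimer above, a 0 in such a filter is better read as "not listed in that Wikipedia category" than as confirmed non-membership.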
