94 datasets found
  1. USA Name Data

    • kaggle.com
    zip
    Updated Feb 12, 2019
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Data.gov (2019). USA Name Data [Dataset]. https://www.kaggle.com/datagov/usa-names
    Explore at:
    zip(0 bytes)Available download formats
    Dataset updated
    Feb 12, 2019
    Dataset provided by
    Data.govhttps://data.gov/
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    United States
    Description

    Context

    Cultural diversity in the U.S. has led to great variations in names and naming traditions and names have been used to express creativity, personality, cultural identity, and values. Source: https://en.wikipedia.org/wiki/Naming_in_the_United_States

    Content

    This public dataset was created by the Social Security Administration and contains all names from Social Security card applications for births that occurred in the United States after 1879. Note that many people born before 1937 never applied for a Social Security card, so their names are not included in this data. For others who did apply, records may not show the place of birth, and again their names are not included in the data.

    All data are from a 100% sample of records on Social Security card applications as of the end of February 2015. To safeguard privacy, the Social Security Administration restricts names to those with at least 5 occurrences.

    Fork this kernel to get started with this dataset.

    Acknowledgements

    https://bigquery.cloud.google.com/dataset/bigquery-public-data:usa_names

    https://cloud.google.com/bigquery/public-data/usa-names

    Dataset Source: Data.gov. This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source — http://www.data.gov/privacy-policy#data_policy — and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

    Banner Photo by @dcp from Unplash.

    Inspiration

    What are the most common names?

    What are the most common female names?

    Are there more female or male names?

    Female names by a wide margin?

  2. d

    Race and ethnicity data for first, middle, and last names

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 8, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Rosenman, Evan; Olivella, Santiago; Imai, Kosuke (2023). Race and ethnicity data for first, middle, and last names [Dataset]. http://doi.org/10.7910/DVN/SGKW0K
    Explore at:
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Rosenman, Evan; Olivella, Santiago; Imai, Kosuke
    Description

    We provide datasets that that estimate the racial distributions associated with first, middle, and last names in the United States. The datasets cover five racial categories: White, Black, Hispanic, Asian, and Other. The provided data are computed from the voter files of six Southern states -- Alabama, Florida, Georgia, Louisiana, North Carolina, and South Carolina -- that collect race and ethnicity data upon registration. We include seven voter files per state, sourced between 2018 and 2021 from L2, Inc. Together, these states have approximately 36MM individuals who provide self-reported race and ethnicity. The last name datasets includes 338K surnames, while the middle name dictionaries contains 126K middle names and the first name datasets includes 136K first names. For each type of name, we provide a dataset of P(race | name) probabilities and P(name | race) probabilities. We include only names that appear at least 25 times across the 42 (= 7 voter files * 6 states) voter files in our dataset. These data are closely related to the the dataset: "Name Dictionaries for "wru" R Package", https://doi.org/10.7910/DVN/7TRYAC. These are the probabilities used in the latest iteration of the "WRU" package (Khanna et al., 2022) to make probabilistic predictions about the race of individuals, given their names and geolocations.

  3. Namesakes

    • figshare.com
    json
    Updated Nov 20, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Oleg Vasilyev; Aysu Altun; Nidhi Vyas; Vedant Dharnidharka; Erika Lampert; John Bohannon (2021). Namesakes [Dataset]. http://doi.org/10.6084/m9.figshare.17009105.v1
    Explore at:
    jsonAvailable download formats
    Dataset updated
    Nov 20, 2021
    Dataset provided by
    figshare
    Figsharehttp://figshare.com/
    Authors
    Oleg Vasilyev; Aysu Altun; Nidhi Vyas; Vedant Dharnidharka; Erika Lampert; John Bohannon
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract

    Motivation: creating challenging dataset for testing Named-Entity
    

    Linking. The Namesakes dataset consists of three closely related datasets: Entities, News and Backlinks. Entities were collected as Wikipedia text chunks corresponding to highly ambiguous entity names. The News were collected as random news text chunks, containing mentions that either belong to the Entities dataset or can be easily confused with them. Backlinks were obtained from Wikipedia dump data with intention to have mentions linked to the entities of the Entity dataset. The Entities and News are human-labeled, resolving the mentions of the entities.Methods

    Entities were collected as Wikipedia 
    

    text chunks corresponding to highly ambiguous entity names: the most popular people names, the most popular locations, and organizations with name ambiguity. In each Entities text chunk, the named entities with the name similar to the chunk Wikipedia page name are labeled. For labeling, these entities were suggested to human annotators (odetta.ai) to tag as "Same" (same as the page entity) or "Other". The labeling was done by 6 experienced annotators that passed through a preliminary trial task. The only accepted tags are the tags assigned in agreement by not less than 5 annotators, and then passed through reconciliation with an experienced reconciliator.

    The News were collected as random news text chunks, containing mentions which either belong to the Entities dataset or can be easily confused with them. In each News text chunk one mention was selected for labeling, and 3-10 Wikipedia pages from Entities were suggested as the labels for an annotator to choose from. The labeling was done by 3 experienced annotators (odetta.ai), after the annotators passed a preliminary trial task. The results were reconciled by an experienced reconciliator. All the labeling was done using Lighttag (lighttag.io).

    Backlinks were obtained from Wikipedia dump data (dumps.wikimedia.org/enwiki/20210701) with intention to have mentions linked to the entities of the Entity dataset. The backlinks were filtered to leave only mentions in a good quality text; each text was cut 1000 characters after the last mention.

    Usage NotesEntities:
    

    File: Namesakes_entities.jsonl The Entities dataset consists of 4148 Wikipedia text chunks containing human-tagged mentions of entities. Each mention is tagged either as "Same" (meaning that the mention is of this Wikipedia page entity), or "Other" (meaning that the mention is of some other entity, just having the same or similar name). The Entities dataset is a jsonl list, each item is a dictionary with the following keys and values: Key: ‘pagename’: page name of the Wikipedia page. Key ‘pageid’: page id of the Wikipedia page. Key ‘title’: title of the Wikipedia page. Key ‘url’: URL of the Wikipedia page. Key ‘text’: The text chunk from the Wikipedia page. Key ‘entities’: list of the mentions in the page text, each entity is represented by a dictionary with the keys: Key 'text': the mention as a string from the page text. Key ‘start’: start character position of the entity in the text. Key ‘end’: end (one-past-last) character position of the entity in the text. Key ‘tag’: annotation tag given as a string - either ‘Same’ or ‘Other’.

    News: File: Namesakes_news.jsonl The News dataset consists of 1000 news text chunks, each one with a single annotated entity mention. The annotation either points to the corresponding entity from the Entities dataset (if the mention is of that entity), or indicates that the mentioned entity does not belong to the Entities dataset. The News dataset is a jsonl list, each item is a dictionary with the following keys and values: Key ‘id_text’: Id of the sample. Key ‘text’: The text chunk. Key ‘urls’: List of URLs of wikipedia entities suggested to labelers for identification of the entity mentioned in the text. Key ‘entity’: a dictionary describing the annotated entity mention in the text: Key 'text': the mention as a string found by an NER model in the text. Key ‘start’: start character position of the mention in the text. Key ‘end’: end (one-past-last) character position of the mention in the text. Key 'tag': This key exists only if the mentioned entity is annotated as belonging to the Entities dataset - if so, the value is a dictionary identifying the Wikipedia page assigned by annotators to the mentioned entity: Key ‘pageid’: Wikipedia page id. Key ‘pagetitle’: page title. Key 'url': page URL.

    Backlinks dataset: The Backlinks dataset consists of two parts: dictionary Entity-to-Backlinks and Backlinks documents. The dictionary points to backlinks for each entity of the Entity dataset (if any backlinks exist for the entity). The Backlinks documents are the backlinks Wikipedia text chunks with identified mentions of the entities from the Entities dataset.

    Each mention is identified by surrounded double square brackets, e.g. "Muir built a small cabin along [[Yosemite Creek]].". However, if the mention differs from the exact entity name, the double square brackets wrap both the exact name and, separated by '|', the mention string to the right, for example: "Muir also spent time with photographer [[Carleton E. Watkins | Carleton Watkins]] and studied his photographs of Yosemite.".

    The Entity-to-Backlinks is a jsonl with 1527 items. File: Namesakes_backlinks_entities.jsonl Each item is a tuple: Entity name. Entity Wikipedia page id. Backlinks ids: a list of pageids of backlink documents.

    The Backlinks documents is a jsonl with 26903 items. File: Namesakes_backlinks_texts.jsonl Each item is a dictionary: Key ‘pageid’: Id of the Wikipedia page. Key ‘title’: Title of the Wikipedia page. Key 'content': Text chunk from the Wikipedia page, with all mentions in the double brackets; the text is cut 1000 characters after the last mention, the cut is denoted as '...[CUT]'. Key 'mentions': List of the mentions from the text, for convenience. Each mention is a tuple: Entity name. Entity Wikipedia page id. Sorted list of all character indexes at which the mention occurrences start in the text.

  4. Justin Name Data Analysis

    • kaggle.com
    zip
    Updated Mar 8, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Justin Kelley (2022). Justin Name Data Analysis [Dataset]. https://www.kaggle.com/datasets/justinkelley/justin-name-data-analysis
    Explore at:
    zip(14402 bytes)Available download formats
    Dataset updated
    Mar 8, 2022
    Authors
    Justin Kelley
    License

    http://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Context

    So I just recently finished my Google Data Analytics Certification and I want to stay fresh! I went through the public data sets and found USA names. I thought might be fun to pull some information on just my name and analyze

    Content

    You will find the State, Year and Count

    Acknowledgements

    Big Query Public Data Set USA Names

    Inspiration

    Need to sharpen my skills

  5. d

    Popular Baby Names

    • catalog.data.gov
    • data.cityofnewyork.us
    • +5more
    Updated Jul 12, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    data.cityofnewyork.us (2025). Popular Baby Names [Dataset]. https://catalog.data.gov/dataset/popular-baby-names
    Explore at:
    Dataset updated
    Jul 12, 2025
    Dataset provided by
    data.cityofnewyork.us
    Description

    Popular Baby Names by Sex and Ethnic Group Data were collected through civil birth registration. Each record represents the ranking of a baby name in the order of frequency. Data can be used to represent the popularity of a name. Caution should be used when assessing the rank of a baby name if the frequency count is close to 10; the ranking may vary year to year.

  6. Baby Names from Social Security Card Applications - National Data

    • catalog.data.gov
    Updated Jul 4, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Social Security Administration (2025). Baby Names from Social Security Card Applications - National Data [Dataset]. https://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-national-data
    Explore at:
    Dataset updated
    Jul 4, 2025
    Dataset provided by
    Social Security Administrationhttp://ssa.gov/
    Description

    The data (name, year of birth, sex, and number) are from a 100 percent sample of Social Security card applications for 1880 on.

  7. d

    First name file since 1900

    • datasets.ai
    • gimi9.com
    0
    Updated Nov 20, 2018
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Plateforme ouverte des données publiques françaises (2018). First name file since 1900 [Dataset]. https://datasets.ai/datasets/5bf42c958b4c4144b0110ce8
    Explore at:
    0Available download formats
    Dataset updated
    Nov 20, 2018
    Dataset authored and provided by
    Plateforme ouverte des données publiques françaises
    Description

    The first names file contains data on the first names attributed to children born in France since 1900. These data are available at the level of France and by department.

    The files available for download list births and not living people in a given year. They are available in two formats (DBASE and CSV). To use these large files, it is recommended to use a database manager or statistical software. The file at the national level can be opened from some spreadsheets. The file at the departmental level is however too large (3.8 million lines) to be consulted with a spreadsheet, so it is proposed in a lighter version with births since 2000 only.

    The data can be accessed in: - a national data file containing the first names attributed to children born in France between 1900 and 2022 (data before 2012 relate only to France outside Mayotte) and the numbers by sex associated with each first name; - a departmental data file containing the same information at the department of birth level; - a lighter data file that contains information at the department level of birth since the year 2000.

  8. u

    Labelled FHYA Dataset

    • zivahub.uct.ac.za
    txt
    Updated Feb 2, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jarryd Dunn (2022). Labelled FHYA Dataset [Dataset]. http://doi.org/10.25375/uct.19029692.v1
    Explore at:
    txtAvailable download formats
    Dataset updated
    Feb 2, 2022
    Dataset provided by
    University of Cape Town
    Authors
    Jarryd Dunn
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This collection contains a the datasets created as part of a masters thesis. The collection consists of two datasets in two forms as well as the corresponding entity descriptions for each of the datasets.The experiment_doc_labels_clean documents contain the data used for the experiments. The JSON file consists of a list of JSON objects. The JSON objects contain the following fields: id: Document idner_tags: List of IOB tags indicating mention boundaries based on the majority label assigned using crowdsourcing.el_tags: List of entity ids based on the majority label assigned using crowdsourcing.all_ner_tags: List of lists of IOB tags assigned by each of the users.all_el_tags: List of lists of entity IDs assigned by each of the users annotating the data.tokens: List of tokens from the text.The experiment_doc_labels_clean-U.tsv contains the dataset used for the experiments but in in a format similar to the CoNLL-U format. The first line for each document contains the document ID. The documents are separated by a blank line. Each word in a document is on its own line consisting of the word the IOB tag and the entity id separated by tags.While the experiments were being completed the annotation system was left open until all the documents had been annotated by three users. This resulted in the all_docs_complete_labels_clean.json and all_docs_complete_labels_clean-U.tsv datasets. The all_docs_complete_labels_clean.json and all_docs_complete_labels_clean-U.tsv documents take the same form as the experiment_doc_labels_clean.json and experiment_doc_labels_clean-U.tsv.Each of the documents described above contain an entity id. The IDs match to the entities stored in the entity_descriptions CSV files. Each of row in these files corresponds to a mention for an entity and take the form:{ID}${Mention}${Context}[N]Three sets of entity descriptions are available:1. entity_descriptions_experiments.csv: This file contains all the mentions from the subset of the data used for the experiments as described above. However, the data has not been cleaned so there are multiple entity IDs which actually refer to the same entity.2. entity_descriptions_experiments_clean.csv: These entities also cover the data used for the experiments, however, duplicate entities have been merged. These entities correspond to the labels for the documents in the experiment_doc_labels_clean files.3. entity_descriptions_all.csv: The entities in this file correspond to the data in the all_docs_complete_labels_clean. Please note that the entities have not been cleaned so there may be duplicate or incorrect entities.

  9. h

    PIPPA-ShareGPT-formatted-named

    • huggingface.co
    Updated Apr 20, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    minipasila (2024). PIPPA-ShareGPT-formatted-named [Dataset]. https://huggingface.co/datasets/mpasila/PIPPA-ShareGPT-formatted-named
    Explore at:
    Dataset updated
    Apr 20, 2024
    Authors
    minipasila
    License

    https://choosealicense.com/licenses/agpl-3.0/https://choosealicense.com/licenses/agpl-3.0/

    Description

    This is modified version of KaraKaraWitch/PIPPA-ShareGPT-formatted. I added randomized names for each conversation and moved the description of the character into a system message and some other cleanup. The randomized names might be causing some problems like the bot using incorrect pronouns etc.

      Original dataset card:
    
    
    
    
    
      KaraKaraWitch/PIPPA-IHaveNeverFeltNeedToSend
    

    I've never felt the need to send a photo of my

    The following is the… See the full description on the dataset page: https://huggingface.co/datasets/mpasila/PIPPA-ShareGPT-formatted-named.

  10. d

    Census Data

    • catalog.data.gov
    • data.globalchange.gov
    • +3more
    Updated Mar 1, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    U.S. Bureau of the Census (2024). Census Data [Dataset]. https://catalog.data.gov/dataset/census-data
    Explore at:
    Dataset updated
    Mar 1, 2024
    Dataset provided by
    U.S. Bureau of the Census
    Description

    The Bureau of the Census has released Census 2000 Summary File 1 (SF1) 100-Percent data. The file includes the following population items: sex, age, race, Hispanic or Latino origin, household relationship, and household and family characteristics. Housing items include occupancy status and tenure (whether the unit is owner or renter occupied). SF1 does not include information on incomes, poverty status, overcrowded housing or age of housing. These topics will be covered in Summary File 3. Data are available for states, counties, county subdivisions, places, census tracts, block groups, and, where applicable, American Indian and Alaskan Native Areas and Hawaiian Home Lands. The SF1 data are available on the Bureau's web site and may be retrieved from American FactFinder as tables, lists, or maps. Users may also download a set of compressed ASCII files for each state via the Bureau's FTP server. There are over 8000 data items available for each geographic area. The full listing of these data items is available here as a downloadable compressed data base file named TABLES.ZIP. The uncompressed is in FoxPro data base file (dbf) format and may be imported to ACCESS, EXCEL, and other software formats. While all of this information is useful, the Office of Community Planning and Development has downloaded selected information for all states and areas and is making this information available on the CPD web pages. The tables and data items selected are those items used in the CDBG and HOME allocation formulas plus topics most pertinent to the Comprehensive Housing Affordability Strategy (CHAS), the Consolidated Plan, and similar overall economic and community development plans. The information is contained in five compressed (zipped) dbf tables for each state. When uncompressed the tables are ready for use with FoxPro and they can be imported into ACCESS, EXCEL, and other spreadsheet, GIS and database software. The data are at the block group summary level. The first two characters of the file name are the state abbreviation. The next two letters are BG for block group. Each record is labeled with the code and name of the city and county in which it is located so that the data can be summarized to higher-level geography. The last part of the file name describes the contents . The GEO file contains standard Census Bureau geographic identifiers for each block group, such as the metropolitan area code and congressional district code. The only data included in this table is total population and total housing units. POP1 and POP2 contain selected population variables and selected housing items are in the HU file. The MA05 table data is only for use by State CDBG grantees for the reporting of the racial composition of beneficiaries of Area Benefit activities. The complete package for a state consists of the dictionary file named TABLES, and the five data files for the state. The logical record number (LOGRECNO) links the records across tables.

  11. Z

    INTRODUCTION OF COVID-NEWS-US-NNK AND COVID-NEWS-BD-NNK DATASET

    • data.niaid.nih.gov
    Updated Jul 19, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Nafiz Sadman; Nishat Anjum; Kishor Datta Gupta (2024). INTRODUCTION OF COVID-NEWS-US-NNK AND COVID-NEWS-BD-NNK DATASET [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4047647
    Explore at:
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Independent University, Bangladesh
    University of Memphis, USA
    Silicon Orchard Lab, Bangladesh
    Authors
    Nafiz Sadman; Nishat Anjum; Kishor Datta Gupta
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Bangladesh, United States
    Description

    Introduction

    There are several works based on Natural Language Processing on newspaper reports. Mining opinions from headlines [ 1 ] using Standford NLP and SVM by Rameshbhaiet. Al.compared several algorithms on a small and large dataset. Rubinet. al., in their paper [ 2 ], created a mechanism to differentiate fake news from real ones by building a set of characteristics of news according to their types. The purpose was to contribute to the low resource data available for training machine learning algorithms. Doumitet. al.in [ 3 ] have implemented LDA, a topic modeling approach to study bias present in online news media.

    However, there are not many NLP research invested in studying COVID-19. Most applications include classification of chest X-rays and CT-scans to detect presence of pneumonia in lungs [ 4 ], a consequence of the virus. Other research areas include studying the genome sequence of the virus[ 5 ][ 6 ][ 7 ] and replicating its structure to fight and find a vaccine. This research is crucial in battling the pandemic. The few NLP based research publications are sentiment classification of online tweets by Samuel et el [ 8 ] to understand fear persisting in people due to the virus. Similar work has been done using the LSTM network to classify sentiments from online discussion forums by Jelodaret. al.[ 9 ]. NKK dataset is the first study on a comparatively larger dataset of a newspaper report on COVID-19, which contributed to the virus’s awareness to the best of our knowledge.

    2 Data-set Introduction

    2.1 Data Collection

    We accumulated 1000 online newspaper report from United States of America (USA) on COVID-19. The newspaper includes The Washington Post (USA) and StarTribune (USA). We have named it as “Covid-News-USA-NNK”. We also accumulated 50 online newspaper report from Bangladesh on the issue and named it “Covid-News-BD-NNK”. The newspaper includes The Daily Star (BD) and Prothom Alo (BD). All these newspapers are from the top provider and top read in the respective countries. The collection was done manually by 10 human data-collectors of age group 23- with university degrees. This approach was suitable compared to automation to ensure the news were highly relevant to the subject. The newspaper online sites had dynamic content with advertisements in no particular order. Therefore there were high chances of online scrappers to collect inaccurate news reports. One of the challenges while collecting the data is the requirement of subscription. Each newspaper required $1 per subscriptions. Some criteria in collecting the news reports provided as guideline to the human data-collectors were as follows:

    The headline must have one or more words directly or indirectly related to COVID-19.

    The content of each news must have 5 or more keywords directly or indirectly related to COVID-19.

    The genre of the news can be anything as long as it is relevant to the topic. Political, social, economical genres are to be more prioritized.

    Avoid taking duplicate reports.

    Maintain a time frame for the above mentioned newspapers.

    To collect these data we used a google form for USA and BD. We have two human editor to go through each entry to check any spam or troll entry.

    2.2 Data Pre-processing and Statistics

    Some pre-processing steps performed on the newspaper report dataset are as follows:

    Remove hyperlinks.

    Remove non-English alphanumeric characters.

    Remove stop words.

    Lemmatize text.

    While more pre-processing could have been applied, we tried to keep the data as much unchanged as possible since changing sentence structures could result us in valuable information loss. While this was done with help of a script, we also assigned same human collectors to cross check for any presence of the above mentioned criteria.

    The primary data statistics of the two dataset are shown in Table 1 and 2.

    Table 1: Covid-News-USA-NNK data statistics

    No of words per headline

    7 to 20

    No of words per body content

    150 to 2100

    Table 2: Covid-News-BD-NNK data statistics No of words per headline

    10 to 20

    No of words per body content

    100 to 1500

    2.3 Dataset Repository

    We used GitHub as our primary data repository in account name NKK^1. Here, we created two repositories USA-NKK^2 and BD-NNK^3. The dataset is available in both CSV and JSON format. We are regularly updating the CSV files and regenerating JSON using a py script. We provided a python script file for essential operation. We welcome all outside collaboration to enrich the dataset.

    3 Literature Review

    Natural Language Processing (NLP) deals with text (also known as categorical) data in computer science, utilizing numerous diverse methods like one-hot encoding, word embedding, etc., that transform text to machine language, which can be fed to multiple machine learning and deep learning algorithms.

    Some well-known applications of NLP includes fraud detection on online media sites[ 10 ], using authorship attribution in fallback authentication systems[ 11 ], intelligent conversational agents or chatbots[ 12 ] and machine translations used by Google Translate[ 13 ]. While these are all downstream tasks, several exciting developments have been made in the algorithm solely for Natural Language Processing tasks. The two most trending ones are BERT[ 14 ], which uses bidirectional encoder-decoder architecture to create the transformer model, that can do near-perfect classification tasks and next-word predictions for next generations, and GPT-3 models released by OpenAI[ 15 ] that can generate texts almost human-like. However, these are all pre-trained models since they carry huge computation cost. Information Extraction is a generalized concept of retrieving information from a dataset. Information extraction from an image could be retrieving vital feature spaces or targeted portions of an image; information extraction from speech could be retrieving information about names, places, etc[ 16 ]. Information extraction in texts could be identifying named entities and locations or essential data. Topic modeling is a sub-task of NLP and also a process of information extraction. It clusters words and phrases of the same context together into groups. Topic modeling is an unsupervised learning method that gives us a brief idea about a set of text. One commonly used topic modeling is Latent Dirichlet Allocation or LDA[17].

    Keyword extraction is a process of information extraction and sub-task of NLP to extract essential words and phrases from a text. TextRank [ 18 ] is an efficient keyword extraction technique that uses graphs to calculate the weight of each word and pick the words with more weight to it.

    Word clouds are a great visualization technique to understand the overall ’talk of the topic’. The clustered words give us a quick understanding of the content.

    4 Our experiments and Result analysis

    We used the wordcloud library^4 to create the word clouds. Figure 1 and 3 presents the word cloud of Covid-News-USA- NNK dataset by month from February to May. From the figures 1,2,3, we can point few information:

    In February, both the news paper have talked about China and source of the outbreak.

    StarTribune emphasized on Minnesota as the most concerned state. In April, it seemed to have been concerned more.

    Both the newspaper talked about the virus impacting the economy, i.e, bank, elections, administrations, markets.

    Washington Post discussed global issues more than StarTribune.

    StarTribune in February mentioned the first precautionary measurement: wearing masks, and the uncontrollable spread of the virus throughout the nation.

    While both the newspaper mentioned the outbreak in China in February, the weight of the spread in the United States are more highlighted through out March till May, displaying the critical impact caused by the virus.

    We used a script to extract all numbers related to certain keywords like ’Deaths’, ’Infected’, ’Died’ , ’Infections’, ’Quarantined’, Lock-down’, ’Diagnosed’ etc from the news reports and created a number of cases for both the newspaper. Figure 4 shows the statistics of this series. From this extraction technique, we can observe that April was the peak month for the covid cases as it gradually rose from February. Both the newspaper clearly shows us that the rise in covid cases from February to March was slower than the rise from March to April. This is an important indicator of possible recklessness in preparations to battle the virus. However, the steep fall from April to May also shows the positive response against the attack. We used Vader Sentiment Analysis to extract sentiment of the headlines and the body. On average, the sentiments were from -0.5 to -0.9. Vader Sentiment scale ranges from -1(highly negative to 1(highly positive). There were some cases

    where the sentiment scores of the headline and body contradicted each other,i.e., the sentiment of the headline was negative but the sentiment of the body was slightly positive. Overall, sentiment analysis can assist us sort the most concerning (most negative) news from the positive ones, from which we can learn more about the indicators related to COVID-19 and the serious impact caused by it. Moreover, sentiment analysis can also provide us information about how a state or country is reacting to the pandemic. We used PageRank algorithm to extract keywords from headlines as well as the body content. PageRank efficiently highlights important relevant keywords in the text. Some frequently occurring important keywords extracted from both the datasets are: ’China’, Government’, ’Masks’, ’Economy’, ’Crisis’, ’Theft’ , ’Stock market’ , ’Jobs’ , ’Election’, ’Missteps’, ’Health’, ’Response’. Keywords extraction acts as a filter allowing quick searches for indicators in case of locating situations of the economy,

  12. Distribution of first name and last name frequencies by country

    • figshare.com
    xlsx
    Updated Feb 2, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Mike Thelwall (2023). Distribution of first name and last name frequencies by country [Dataset]. http://doi.org/10.6084/m9.figshare.21956795.v2
    Explore at:
    xlsxAvailable download formats
    Dataset updated
    Feb 2, 2023
    Dataset provided by
    figshare
    Figsharehttp://figshare.com/
    Authors
    Mike Thelwall
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Distribution of first and last name frequencies of academic authors by country.

    Spreadsheet 1 contains 50 countries, with names based on affiliations in Scopus journal articles 2001-2021.

    Spreadsheet 2 contains 200 countries, with names based on affiliations in Scopus journal articles 2001-2021, using a marginally updated last name extraction algorithm that is almost the same except for Dutch/Flemish names.

    From the paper: Can national researcher mobility be tracked by first or last name uniqueness?

    For example the distribution for the UK shows a single peak for international names, with no national names, Belgium has a national peak and an international peak, and China has mainly a national peak. The 50 countries are:

    No Code Country 1 SB Serbia 2 IE Ireland 3 HU Hungary 4 CL Chile 5 CO Columbia 6 NG Nigeria 7 HK Hong Kong 8 AR Argentina 9 SG Singapore 10 NZ New Zealand 11 PK Pakistan 12 TH Thailand 13 UA Ukraine 14 SA Saudi Arabia 15 RO Israel 16 ID Indonesia 17 IL Israel 18 MY Malaysia 19 DK Denmark 20 CZ Czech Republic 21 ZA South Africa 22 AT Austria 23 FI Finland 24 PT Portugal 25 GR Greece 26 NO Norway 27 EG Egypt 28 MX Mexico 29 BE Belgium 30 CH Switzerland 31 SW Sweden 32 PL Poland 33 TW Taiwan 34 NL Netherlands 35 TK Turkey 36 IR Iran 37 RU Russia 38 AU Australia 39 BR Brazil 40 KR South Korea 41 ES Spain 42 CA Canada 43 IT France 44 FR France 45 IN India 46 DE Germany 47 US USA 48 UK UK 49 JP Japan 50 CN China

  13. d

    Factori People Data API | USA | B2B + B2C | ID,Name,Job...

    • datarade.ai
    .json
    Updated Jun 30, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Factori (2023). Factori People Data API | USA | B2B + B2C | ID,Name,Job level,Postal,Email,Loan,Insurance [Dataset]. https://datarade.ai/data-products/factori-person-api-usa-b2b-b2c-id-name-job-level-post-factori
    Explore at:
    .jsonAvailable download formats
    Dataset updated
    Jun 30, 2023
    Dataset authored and provided by
    Factori
    Area covered
    United States of America
    Description

    Factori houses an extensive dataset of US People data, providing valuable insights into individuals across various demographic and behavioral dimensions. Our US People Data section is dedicated to helping you understand the breadth and depth of the information available through our API.

    Data Collection and Aggregation Our People data is gathered and aggregated through surveys, digital services, and public data sources. We use powerful profiling algorithms to collect and ingest only fresh and reliable data points. This ensures that the data you access is up-to-date and accurate.

    Here are some of the data categories and attributes we offer within US People Data Graph: - Geography: City, State, ZIP, County, CBSA, Census Tract, etc. - Demographics: Gender, Age Group, Marital Status, Language, etc. - Financial: Income Range, Credit Rating Range, Credit Type, Net Worth Range, etc. - Persona: Consumer type, Communication preferences, Family type, etc. - Interests: Content, Brands, Shopping, Hobbies, Lifestyle, etc. - Household: Number of Children, Number of Adults, IP Address, etc. - Behaviors: Brand Affinity, App Usage, Web Browsing, etc. - Firmographics: Industry, Company, Occupation, Revenue, etc. - Retail Purchase: Store, Category, Brand, SKU, Quantity, Price, etc.

    Here's the data schema: Person_id first_name last_name gender age year month day full_address city state zipcode zip4 delivery_point_bar_code carrier_route walk_sequence_code fips_state_code fips_county_code country_name latitude longtitude address_type metropolitan_statistical_area core_based_statistical_area census_tract census_block
    census_block_group primary_address pre_address street post_address address_suffix address_secondline address_abrev
    census_median_home_value home_market_value property_build_year property_with_ac property_with_pool property_with_water property_with_sewer general_home_value property_fuel_type household_id census_median_household_income
    household_size occupation_home_office
    dwell_type household_income marital_status length_of_residence number_of_kids pre_school_kids single_parent working_women_in_house_hold homeowner children
    adults generations net_worth education_level education_history occupation occuptation_business_owner credit_lines credit_card_user newly_issued_credit_card_user credit_range_new credit_cards loan_to_value and alot more...

  14. Facebook Profiles Datasets

    • brightdata.com
    .json, .csv, .xlsx
    Updated Jun 19, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Bright Data (2024). Facebook Profiles Datasets [Dataset]. https://brightdata.com/products/datasets/facebook/profiles
    Explore at:
    .json, .csv, .xlsxAvailable download formats
    Dataset updated
    Jun 19, 2024
    Dataset authored and provided by
    Bright Datahttps://brightdata.com/
    License

    https://brightdata.com/licensehttps://brightdata.com/license

    Area covered
    Worldwide
    Description

    Use our Facebook Profiles dataset to explore public profile details such as names, profile and cover photos, work history, education, and photo galleries. Common use cases include people and company research, influencer discovery, and academic studies of career and education signals on Facebook. Over 31M records available Price starts at $250/100K records Data formats are available in JSON, NDJSON, CSV, XLSX and Parquet. 100% ethical and compliant data collection Included datapoints:

    Profile URL Profile Name Facebook Profile ID Profile Photo Cover Photo Work History (Title, Company, Company ID, Company URL, Start/End Dates) College Education (Name, ID, URL) High School Education (Name, ID, URL) Photo Galleries And much more

  15. Places

    • catalog.data.gov
    • geodata.bts.gov
    • +1more
    Updated Oct 21, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    U.S. Department of Commerce, U.S. Census Bureau, Geography Division (Point of Contact) (2025). Places [Dataset]. https://catalog.data.gov/dataset/places2
    Explore at:
    Dataset updated
    Oct 21, 2025
    Dataset provided by
    United States Census Bureauhttp://census.gov/
    United States Department of Commercehttp://commerce.gov/
    Description

    The Places dataset was published on September 22, 2025 from the U.S. Department of Commerce, U.S. Census Bureau, Geography Division and is part of the U.S. Department of Transportation (USDOT)/Bureau of Transportation Statistics (BTS) National Transportation Atlas Database (NTAD). This resource is a member of a series. The TIGER/Line shapefiles and related database files (.dbf) are an extract of selected geographic and cartographic information from the U.S. Census Bureau's Master Address File / Topologically Integrated Geographic Encoding and Referencing (MAF/TIGER) System (MTS). The MTS represents a seamless national file with no overlaps or gaps between parts, however, each TIGER/Line shapefile is designed to stand alone as an independent data set, or they can be combined to cover the entire nation. The TIGER/Line shapefiles include both incorporated places (legal entities) and census designated places or CDPs (statistical entities). An incorporated place is established to provide governmental functions for a concentration of people as opposed to a minor civil division (MCD), which generally is created to provide services or administer an area without regard, necessarily, to population. Places always nest within a state but may extend across county and county subdivision boundaries. An incorporated place is usually a city, town, village, or borough, but can have other legal descriptions. CDPs are delineated for the decennial census as the statistical counterparts of incorporated places. CDPs are delineated to provide data for settled concentrations of population that are identifiable by name but are not legally incorporated under the laws of the state in which they are located. The boundaries for CDPs are often defined in partnership with state, local, and/or tribal officials and usually coincide with visible features or the boundary of an adjacent incorporated place or another legal entity. CDP boundaries often change from one decennial census to the next with changes in the settlement pattern and development; a CDP with the same name as in an earlier census does not necessarily have the same boundary. The only population/housing size requirement for CDPs is that they must contain some housing and population. The boundaries of most incorporated places in this shapefile are as of January 1, 2024, as reported through the Census Bureau's Boundary and Annexation Survey (BAS). The boundaries of all CDPs were delineated as part of the Census Bureau's Participant Statistical Areas Program (PSAP) for the 2020 Census, but some CDPs were added or updated through the 2024 BAS as well. A data dictionary, or other source of attribute information, is accessible at https://doi.org/10.21949/1529072

  16. 1,000 People - Spanish Handwriting OCR Dataset

    • nexdata.ai
    Updated Dec 19, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Nexdata (2023). 1,000 People - Spanish Handwriting OCR Dataset [Dataset]. https://www.nexdata.ai/datasets/ocr/1405
    Explore at:
    Dataset updated
    Dec 19, 2023
    Dataset authored and provided by
    Nexdata
    Variables measured
    Device, Writer, Data size, Data format, Data content, Accuracy rate, Photographic angle, Collecting environment, Population distribution
    Description

    The writers are Europeans who often write spanish. The device is scanner, the collection angle is eye-level angle. The dataset content includes address, company name, personal name.The dataset can be used for Spanish OCR models and handwritten text recognition systems.

  17. U

    USGS National Transportation Dataset (NTD) Downloadable Data Collection

    • data.usgs.gov
    • catalog.data.gov
    Updated Dec 25, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    U.S. Geological Survey, National Geospatial Technical Operations Center (2024). USGS National Transportation Dataset (NTD) Downloadable Data Collection [Dataset]. https://data.usgs.gov/datacatalog/data/USGS:ad3d631d-f51f-4b6a-91a3-e617d6a58b4e
    Explore at:
    Dataset updated
    Dec 25, 2024
    Dataset provided by
    United States Geological Surveyhttp://www.usgs.gov/
    Authors
    U.S. Geological Survey, National Geospatial Technical Operations Center
    License

    U.S. Government Workshttps://www.usa.gov/government-works
    License information was derived automatically

    Description

    The USGS Transportation downloadable data from The National Map (TNM) is based on TIGER/Line data provided through U.S. Census Bureau and supplemented with HERE road data to create tile cache base maps. Some of the TIGER/Line data includes limited corrections done by USGS. Transportation data consists of roads, railroads, trails, airports, and other features associated with the transport of people or commerce. The data include the name or route designator, classification, and location. Transportation data support general mapping and geographic information system technology analysis for applications such as traffic safety, congestion mitigation, disaster planning, and emergency response. The National Map transportation data is commonly combined with other data themes, such as boundaries, elevation, hydrography, and structure ...

  18. Data from: Demographic Aspects of First Names

    • kaggle.com
    zip
    Updated May 22, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    GrannySmithApples (2022). Demographic Aspects of First Names [Dataset]. https://www.kaggle.com/datasets/grannysmithapples/demographic-aspects-of-first-names
    Explore at:
    zip(63433 bytes)Available download formats
    Dataset updated
    May 22, 2022
    Authors
    GrannySmithApples
    Description

    Demographic data on first names -- in particular race.

    Notice that there is a row for many first names, and there is a row for "all other first names" that can throw off your analysis.

    I got this data from the Harvard Dataverse: https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/TYJKEZ/MPMHFE&version=1.3, and I did absolutely nothing and all credit is theirs. I'm surprised I couldn't already find this data on Kaggle, so if there is a reason it shouldn't be here, I will happily take this down.

  19. Name datasets (US, Spain, Argentina, Chile, Italy)

    • kaggle.com
    zip
    Updated Dec 29, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Rosina Scampino (2022). Name datasets (US, Spain, Argentina, Chile, Italy) [Dataset]. https://www.kaggle.com/datasets/rosinascampino/name-datasets-us-spain-argentina-chile-italy/data
    Explore at:
    zip(12920057 bytes)Available download formats
    Dataset updated
    Dec 29, 2022
    Authors
    Rosina Scampino
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Spain, Italy, United States, Chile
    Description

    I started these datasets to learn how to manipulate files in different formats with python. You can see the Github repo here https://github.com/rokelina/names-analysis

  20. Indian Names Dataset

    • kaggle.com
    zip
    Updated Jul 19, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Harsh Malhotra (2023). Indian Names Dataset [Dataset]. https://www.kaggle.com/datasets/harshm27/indian-names-dataset
    Explore at:
    zip(1128 bytes)Available download formats
    Dataset updated
    Jul 19, 2023
    Authors
    Harsh Malhotra
    Description

    A dataset full of first names can be incredibly helpful in various applications and analyses. Firstly, it can be used in demographic studies to analyze naming trends and patterns over time, providing insights into cultural and societal changes. Additionally, such a dataset can be utilized in market research and targeted advertising, allowing businesses to personalize their marketing strategies based on customers' names. It can also be employed in language processing tasks, such as name entity recognition, sentiment analysis, or gender prediction. Moreover, the dataset can serve as a valuable resource for generating test data, creating fictional characters, or enhancing natural language generation models. Overall, a comprehensive first name dataset has diverse applications across multiple domains.

    This dataset contains common Indian names, this data can be used for any sort of NLP problems such as name generation etc. An upvote would be appreciated :)

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Data.gov (2019). USA Name Data [Dataset]. https://www.kaggle.com/datagov/usa-names
Organization logo

USA Name Data

USA Name Data (BigQuery Dataset)

Explore at:
zip(0 bytes)Available download formats
Dataset updated
Feb 12, 2019
Dataset provided by
Data.govhttps://data.gov/
License

https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

Area covered
United States
Description

Context

Cultural diversity in the U.S. has led to great variations in names and naming traditions and names have been used to express creativity, personality, cultural identity, and values. Source: https://en.wikipedia.org/wiki/Naming_in_the_United_States

Content

This public dataset was created by the Social Security Administration and contains all names from Social Security card applications for births that occurred in the United States after 1879. Note that many people born before 1937 never applied for a Social Security card, so their names are not included in this data. For others who did apply, records may not show the place of birth, and again their names are not included in the data.

All data are from a 100% sample of records on Social Security card applications as of the end of February 2015. To safeguard privacy, the Social Security Administration restricts names to those with at least 5 occurrences.

Fork this kernel to get started with this dataset.

Acknowledgements

https://bigquery.cloud.google.com/dataset/bigquery-public-data:usa_names

https://cloud.google.com/bigquery/public-data/usa-names

Dataset Source: Data.gov. This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source — http://www.data.gov/privacy-policy#data_policy — and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

Banner Photo by @dcp from Unplash.

Inspiration

What are the most common names?

What are the most common female names?

Are there more female or male names?

Female names by a wide margin?

Search
Clear search
Close search
Google apps
Main menu