100+ datasets found

Baby Names from Social Security Card Applications - National Data
catalog.data.gov
s.cnmilf.com
Updated May 5, 2022
Cite
Social Security Administration (2022). Baby Names from Social Security Card Applications - National Data [Dataset]. https://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-national-data
Explore at:
Dataset updated
May 5, 2022
Dataset provided by
Social Security Administration (http://www.ssa.gov/)
Description
The data (name, year of birth, sex, and number) are from a 100 percent sample of Social Security card applications for 1880 onward.
Baby Name popularity over time - Dataset - data.govt.nz - discover and use...
catalogue.data.govt.nz
Updated Nov 8, 2017
Cite
(2017). Baby Name popularity over time - Dataset - data.govt.nz - discover and use data [Dataset]. https://catalogue.data.govt.nz/dataset/baby-name-popularity-over-time
Explore at:
Dataset updated
Nov 8, 2017
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This data set lists the sex and number of birth registrations for each first name, from 1900 onward. Years are grouped by the date of the birth registration, not by the date of birth. Some birth registrations are not included, such as registrations with a sex other than Male or Female (i.e. indeterminate or not recorded), or where the birth registration date is not recorded. These excluded records are so few their exclusion is unlikely to have any significant impact on the data. Where a name has less than 10 instances in a particular year, the name will not be included in the data for that year. Due to this, total volumes will be less than the total birth registrations in that year. As first and middle names are recorded in our system together, the first name has been split off from the middle names. Due to the size of the data set, this was done with an automated system, generally looking for the first space in the name. This means there may be names not correctly added. Also, certain symbols in names may not carry through to the data correctly. Please let us know using the contact email address if you find any errors in the data.
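The two processing rules described above (taking the text before the first space as the first name, and leaving out names with fewer than 10 registrations in a year) can be illustrated with a minimal sketch. The registry's actual system is not published in this listing, so the function names `split_first_name` and `suppress_small_counts` and the sample values are hypothetical.

```python
from __future__ import annotations


def split_first_name(given_names: str) -> str:
    """Return the text before the first space, mirroring the 'first space' rule."""
    return given_names.strip().split(" ", 1)[0]


def suppress_small_counts(counts: dict[str, int], threshold: int = 10) -> dict[str, int]:
    """Keep only names with at least `threshold` registrations in a year."""
    return {name: n for name, n in counts.items() if n >= threshold}


# Hypothetical inputs, for illustration only.
first = split_first_name("Mary Jane")
kept = suppress_small_counts({"Mary": 120, "Zephyrine": 4})
```

A rule like this will mis-split first names that themselves contain a space, which is consistent with the caveat in the description that some names may not be correctly added.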
Popular Baby Names
kaggle.com
data.cityofnewyork.us
Updated May 5, 2023
Cite
Utkarsh Singh (2023). Popular Baby Names [Dataset]. https://www.kaggle.com/datasets/utkarshx27/popular-baby-names/data
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
May 5, 2023
Dataset provided by
Kaggle
Authors
Utkarsh Singh
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Popular Baby Names by Sex and Ethnic Group. Data were collected through civil birth registration. Each record represents the ranking of a baby name in the order of frequency. Data can be used to represent the popularity of a name. Caution should be used when assessing the rank of a baby name if the frequency count is close to 10; the ranking may vary year to year.
Data from: Validated names for experimental studies on race and ethnicity
zenodo.org
dataverse.harvard.edu
zip
Updated Mar 17, 2023
Cite
Charles Crabtree; Jae Yeon Kim (2023). Validated names for experimental studies on race and ethnicity [Dataset]. http://doi.org/10.5281/zenodo.7741926
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.7741926
Dataset updated
Mar 17, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Charles Crabtree; Jae Yeon Kim
License
Attribution 2.0 (CC BY 2.0) https://creativecommons.org/licenses/by/2.0/
License information was derived automatically
Description
A large and fast-growing number of studies across the social sciences use experiments to better understand the role of race in human interactions, particularly in the American context. Researchers often use names to signal the race of individuals portrayed in these experiments. However, those names might also signal other attributes, such as socioeconomic status (e.g., education and income) and citizenship. If they do, researchers need pre-tested names with data on perceptions of these attributes. Such data would permit researchers to draw correct inferences about the causal effect of race in their experiments. In this paper, we provide the largest dataset of validated name perceptions based on three different surveys conducted in the United States. In total, our data include over 44,170 name evaluations from 4,026 respondents for 600 names. In addition to respondent perceptions of race, income, education, and citizenship from names, our data also include respondent characteristics. Our data will be broadly helpful for researchers conducting experiments on the manifold ways in which race shapes American life.

License: CC-By Attribution 4.0 International
Names and dates of birth
kaggle.com
zip
Updated Jul 6, 2021
Cite
smitop (2021). Names and dates of birth [Dataset]. https://www.kaggle.com/datasets/smitop/names-and-dates-of-birth/data
Explore at:
Available download formats: zip (1152873 bytes)
Dataset updated
Jul 6, 2021
Authors
smitop
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Dataset

This dataset was created by smitop

Released under CC0: Public Domain

Contents
Statistics on Swedish names by birth country 2020
snd.se
pdf, zip
Updated Nov 2, 2021
Cite
Peter M. Dahlgren (2021). Statistics on Swedish names by birth country 2020 [Dataset]. http://doi.org/10.5878/s91g-y391
Explore at:
Available download formats: zip (232422238), pdf (248253)
Unique identifier
https://doi.org/10.5878/s91g-y391
Dataset updated
Nov 2, 2021
Dataset provided by
University of Gothenburg
Swedish National Data Service
Authors
Peter M. Dahlgren
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Dec 31, 2020
Area covered
Sweden
Dataset funded by
The Institute for Media Studies
Description
This dataset contains statistics on names (first names of women, first names of men, and last names) by country of birth. In total, there are 231,505 names by 202 countries. The data comes from Statistics Sweden's population statistics (name register) and refers to persons registered in Sweden on December 31st, 2020. However, some names are excluded due to confidentiality, such as names with fewer than five carriers. The data is licensed with Creative Commons Attribution 4.0 International (CC BY 4.0) and may be used as long as Statistics Sweden is stated as the source. In this dataset, you will also find (in addition to the original data from Statistics Sweden) tidied data where the ISO code for each country has been added, as well as data in so-called wide format and long format to facilitate easier data processing.

Please see the Swedish version of the post and the README file for more information about the data.
Popular Baby Names - Dataset - data.sa.gov.au
data.sa.gov.au
Updated Mar 5, 2025
Cite
(2025). Popular Baby Names - Dataset - data.sa.gov.au [Dataset]. https://data.sa.gov.au/data/dataset/popular-baby-names
Explore at:
Dataset updated
Mar 5, 2025
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
South Australia
Description
List of male and female baby names in South Australia from 1944 to 2024. The annual data for baby names is published January/February each year.
Geonames - All Cities with a population > 1000
public.opendatasoft.com
data.smartidf.services
csv, excel, geojson +1
Updated Mar 10, 2024
Cite
(2024). Geonames - All Cities with a population > 1000 [Dataset]. https://public.opendatasoft.com/explore/dataset/geonames-all-cities-with-a-population-1000/
Explore at:
Available download formats: csv, json, geojson, excel
Dataset updated
Mar 10, 2024
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
All cities with a population > 1000 or seats of adm div (ca 80.000)

Sources and Contributions

Sources: GeoNames is aggregating over a hundred different data sources.
Ambassadors: GeoNames Ambassadors help in many countries.
Wiki: A wiki allows users to view the data, quickly fix errors, and add missing places.
Donations and Sponsoring: Costs for running GeoNames are covered by donations and sponsoring.

Enrichment: add country name
Popular White Last Names in the US
johnsnowlabs.com
csv
Updated Jan 20, 2021
Cite
John Snow Labs (2021). Popular White Last Names in the US [Dataset]. https://www.johnsnowlabs.com/marketplace/popular-white-last-names-in-the-us/
Explore at:
Available download formats: csv
Dataset updated
Jan 20, 2021
Dataset authored and provided by
John Snow Labs
Area covered
United States
Description
This dataset represents the popular last names in the United States for White.
Historically Irish Surnames Dataset
zenodo.org
data.niaid.nih.gov
csv, txt
Updated Jan 24, 2020
Cite
Adam Crymble (2020). Historically Irish Surnames Dataset [Dataset]. http://doi.org/10.5281/zenodo.20985
Explore at:
Available download formats: csv, txt
Unique identifier
https://doi.org/10.5281/zenodo.20985
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Adam Crymble
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset provides a list of surnames that are reliably Irish and that can be used for identifying textual references to Irish individuals in the London area and surrounding countryside within striking distance of the capital. This classification of the Irish necessarily includes the Irish-born and their descendants. The dataset has been validated for use on records up to the middle of the nineteenth century, and should only be used in cases in which a few mis-classifications of individuals would not undermine the results of the work, such as large-scale analyses. These data were created through an analysis of the 1841 Census of England and Wales, and validated against the Middlesex Criminal Registers (National Archives HO 26) and the Vagrant Lives Dataset (Crymble, Adam et al. (2014). Vagrant Lives: 14,789 Vagrants Processed by Middlesex County, 1777-1786. Zenodo. 10.5281/zenodo.13103). The sample was derived from the records of the Hundred of Ossulstone, which included much of rural and urban Middlesex, excluding the City of London and Westminster. The analysis was based upon a study of 278,949 adult males. Full details of the methodology for how this dataset was created can be found in the following article, and anyone intending to use this dataset for scholarly research is strongly encouraged to read it so that they understand the strengths and limits of this resource:

Adam Crymble, 'A Comparative Approach to Identifying the Irish in Long Eighteenth Century London', _Historical Methods: A Journal of Quantitative and Interdisciplinary History_, vol. 48, no. 3 (2015): 141-152.

The data here provided includes all 283 names listed in Appendix I of the above paper, but also an additional 209 spelling variations of those root surnames, for a total of 492 names.
INTRODUCTION OF COVID-NEWS-US-NNK AND COVID-NEWS-BD-NNK DATASET
data.niaid.nih.gov
zenodo.org
Updated Jul 19, 2024
Cite
Nafiz Sadman (2024). INTRODUCTION OF COVID-NEWS-US-NNK AND COVID-NEWS-BD-NNK DATASET [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4047647
Explore at:
Dataset updated
Jul 19, 2024
Dataset provided by
Nishat Anjum
Nafiz Sadman
Kishor Datta Gupta
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Bangladesh, United States
Description
Introduction

Several works apply Natural Language Processing to newspaper reports. Rameshbhai et al. [1] mined opinions from headlines using Stanford NLP and SVM, comparing several algorithms on a small and a large dataset. Rubin et al., in their paper [2], created a mechanism to differentiate fake news from real news by building a set of characteristics of news according to their types. The purpose was to contribute to the low-resource data available for training machine learning algorithms. Doumit et al. [3] implemented LDA, a topic modeling approach, to study bias present in online news media.

However, little NLP research has been invested in studying COVID-19. Most applications involve classification of chest X-rays and CT scans to detect the presence of pneumonia in the lungs [4], a consequence of the virus. Other research areas include studying the genome sequence of the virus [5][6][7] and replicating its structure to fight it and find a vaccine. This research is crucial in battling the pandemic. The few NLP-based research publications include sentiment classification of online tweets by Samuel et al. [8] to understand the fear persisting in people due to the virus. Similar work has been done using an LSTM network to classify sentiments from online discussion forums by Jelodar et al. [9]. To the best of our knowledge, the NKK dataset is the first study on a comparatively larger dataset of newspaper reports on COVID-19, which contributed to awareness of the virus.

2 Data-set Introduction

2.1 Data Collection

We accumulated 1000 online newspaper reports from the United States of America (USA) on COVID-19. The newspapers include The Washington Post (USA) and StarTribune (USA). We named this dataset “Covid-News-USA-NNK”. We also accumulated 50 online newspaper reports from Bangladesh on the issue and named the dataset “Covid-News-BD-NNK”. The newspapers include The Daily Star (BD) and Prothom Alo (BD). All of these newspapers are among the top providers and the most read in their respective countries. The collection was done manually by 10 human data collectors of age group 23- with university degrees. This approach was preferred over automation to ensure that the news was highly relevant to the subject. The newspapers' online sites had dynamic content with advertisements in no particular order, so online scrapers were likely to collect inaccurate news reports. One challenge in collecting the data was the subscription requirement: each newspaper required $1 per subscription. The criteria given as guidelines to the human data collectors were as follows:

The headline must have one or more words directly or indirectly related to COVID-19.

The content of each news must have 5 or more keywords directly or indirectly related to COVID-19.

The genre of the news can be anything as long as it is relevant to the topic. Political, social, and economic genres are to be prioritized.

Avoid taking duplicate reports.

Maintain a time frame for the above mentioned newspapers.

To collect these data we used a Google Form for the USA and BD. Two human editors went through each entry to check for spam or troll entries.

2.2 Data Pre-processing and Statistics

Some pre-processing steps performed on the newspaper report dataset are as follows:

Remove hyperlinks.

Remove non-English alphanumeric characters.

Remove stop words.

Lemmatize text.

While more pre-processing could have been applied, we tried to keep the data as unchanged as possible, since changing sentence structures could result in the loss of valuable information. While this was done with the help of a script, we also assigned the same human collectors to cross-check for any presence of the above-mentioned criteria.
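The four pre-processing steps listed above can be sketched as follows. This is not the authors' script: the stop-word list is a tiny illustrative assumption, the regular expressions are one possible reading of "hyperlinks" and "non-English alphanumeric characters", and lemmatization is left as a pluggable callable (identity by default) because the source does not say which lemmatizer was used.

```python
import re
from typing import Callable, Iterable

# Tiny illustrative stop-word list (assumption); a real pipeline would use a fuller one.
STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "is", "are", "on", "for"})

URL_RE = re.compile(r"(?:https?://|www\.)\S+")   # step 1: hyperlinks
NON_ALNUM_RE = re.compile(r"[^A-Za-z0-9\s]")     # step 2: anything outside English letters/digits


def preprocess(text: str,
               stop_words: Iterable[str] = STOP_WORDS,
               lemmatize: Callable[[str], str] = lambda w: w) -> str:
    """Apply the four steps in order and return a space-joined token string."""
    text = URL_RE.sub(" ", text)
    text = NON_ALNUM_RE.sub(" ", text)
    stop = set(stop_words)
    tokens = [t for t in text.split() if t.lower() not in stop]  # step 3: stop words
    return " ".join(lemmatize(t) for t in tokens)                # step 4: lemmatize


cleaned = preprocess("Visit https://example.com for the latest COVID-19 news!")
```

Passing a real lemmatizer as `lemmatize` would complete step 4; the default leaves tokens unchanged so that the sketch has no external dependencies.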

The primary data statistics of the two datasets are shown in Tables 1 and 2.

Table 1: Covid-News-USA-NNK data statistics

No. of words per headline: 7 to 20
No. of words per body content: 150 to 2100

Table 2: Covid-News-BD-NNK data statistics

No. of words per headline: 10 to 20
No. of words per body content: 100 to 1500

2.3 Dataset Repository

We used GitHub as our primary data repository, under the account name NKK^1. Here, we created two repositories, USA-NKK^2 and BD-NNK^3. The dataset is available in both CSV and JSON formats. We regularly update the CSV files and regenerate the JSON using a Python script. We also provide a Python script file for essential operations. We welcome all outside collaboration to enrich the dataset.
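Regenerating JSON from the CSV files, as mentioned above, might look roughly like the following sketch. The authors' actual script is not reproduced in this listing; the function name `csv_to_json` and the column names in the usage line are hypothetical.

```python
import csv
import io
import json


def csv_to_json(csv_text: str) -> str:
    """Convert CSV text with a header row into a JSON array of objects."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows, ensure_ascii=False, indent=2)


# Hypothetical two-column file, for illustration only.
example = csv_to_json("headline,body\nSample headline,Sample body\n")
```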

3 Literature Review

Natural Language Processing (NLP) deals with text (also known as categorical) data in computer science, utilizing numerous diverse methods like one-hot encoding, word embedding, etc., that transform text to machine language, which can be fed to multiple machine learning and deep learning algorithms.

Some well-known applications of NLP include fraud detection on online media sites [10], authorship attribution in fallback authentication systems [11], intelligent conversational agents or chatbots [12], and machine translation as used by Google Translate [13]. While these are all downstream tasks, several exciting developments have been made in algorithms built solely for Natural Language Processing tasks. The two most prominent are BERT [14], which uses a bidirectional transformer encoder architecture and can perform near-perfect classification and word prediction tasks, and the GPT-3 models released by OpenAI [15], which can generate almost human-like text. However, these are all pre-trained models, since they carry a huge computation cost. Information Extraction is a generalized concept of retrieving information from a dataset. Information extraction from an image could be retrieving vital feature spaces or targeted portions of an image; information extraction from speech could be retrieving information about names, places, etc. [16]. Information extraction in texts could be identifying named entities and locations or essential data. Topic modeling is a sub-task of NLP and also a process of information extraction. It clusters words and phrases of the same context together into groups. Topic modeling is an unsupervised learning method that gives us a brief idea about a set of texts. One commonly used topic modeling method is Latent Dirichlet Allocation, or LDA [17].

Keyword extraction is a process of information extraction and a sub-task of NLP that extracts essential words and phrases from a text. TextRank [18] is an efficient keyword extraction technique that uses graphs to calculate the weight of each word and picks the words with the highest weights.
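The graph-based weighting idea behind TextRank can be sketched in a few lines. This is a simplified illustration, not the implementation used for the dataset: it assumes an unweighted co-occurrence graph over already-tokenized text, and the window size, damping factor, and iteration count are illustrative defaults.

```python
from collections import defaultdict


def textrank_keywords(tokens, window=2, damping=0.85, iterations=50, top_k=5):
    """Rank words by a PageRank-style score on a word co-occurrence graph."""
    # Link each word to the words that follow it within `window` positions.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    if not neighbors:
        return []
    # Power iteration: each word receives an equal share of each neighbor's score.
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        scores = {
            w: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[w])
            for w in neighbors
        }
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


top = textrank_keywords("a hub b hub c hub d".split(), window=1, top_k=1)
```

In the toy input, the word that co-occurs with every other word ends up with the largest weight, which is the behavior the paragraph describes.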

Word clouds are a great visualization technique to understand the overall ’talk of the topic’. The clustered words give us a quick understanding of the content.

4 Our experiments and Result analysis

We used the wordcloud library^4 to create the word clouds. Figures 1 and 3 present the word clouds of the Covid-News-USA-NNK dataset by month from February to May. From Figures 1, 2, and 3, we can point out a few observations:

In February, both newspapers talked about China and the source of the outbreak.

StarTribune emphasized Minnesota as the most concerned state. In April, it seemed to be more concerned.

Both newspapers talked about the virus impacting the economy, i.e., banks, elections, administrations, markets.

Washington Post discussed global issues more than StarTribune.

StarTribune in February mentioned the first precautionary measure, wearing masks, and the uncontrollable spread of the virus throughout the nation.

While both newspapers mentioned the outbreak in China in February, the spread in the United States is highlighted more heavily from March through May, displaying the critical impact caused by the virus.

We used a script to extract all numbers related to certain keywords like ’Deaths’, ’Infected’, ’Died’, ’Infections’, ’Quarantined’, ’Lock-down’, ’Diagnosed’, etc. from the news reports and created a case-count series for both newspapers. Figure 4 shows the statistics of this series. From this extraction technique, we can observe that April was the peak month for COVID cases, as the count rose gradually from February. Both newspapers clearly show that the rise in COVID cases from February to March was slower than the rise from March to April. This is an important indicator of possible recklessness in preparations to battle the virus. However, the steep fall from April to May also shows the positive response against the attack.

We used Vader Sentiment Analysis to extract the sentiment of the headlines and the body. On average, the sentiments were from -0.5 to -0.9. The Vader Sentiment scale ranges from -1 (highly negative) to 1 (highly positive). There were some cases where the sentiment scores of the headline and body contradicted each other, i.e., the sentiment of the headline was negative but the sentiment of the body was slightly positive. Overall, sentiment analysis can help us sort the most concerning (most negative) news from the positive news, from which we can learn more about the indicators related to COVID-19 and the serious impact caused by it. Moreover, sentiment analysis can also provide information about how a state or country is reacting to the pandemic.

We used the PageRank algorithm to extract keywords from the headlines as well as the body content. PageRank efficiently highlights important relevant keywords in the text. Some frequently occurring important keywords extracted from both datasets are: ’China’, ’Government’, ’Masks’, ’Economy’, ’Crisis’, ’Theft’, ’Stock market’, ’Jobs’, ’Election’, ’Missteps’, ’Health’, ’Response’. Keywords extraction acts as a filter allowing quick searches for indicators in case of locating situations of the economy,
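The keyword-based number extraction described in this section could be approximated with a regular expression, as in the sketch below. The authors' script is not shown in the source, so the pattern, the choice to allow up to three words between a number and a keyword, and the function name `extract_counts` are all assumptions; the keyword tuple is a subset of the list given in the text.

```python
import re

# Subset of the keywords named in the description (lower-cased for matching).
KEYWORDS = ("deaths", "infected", "died", "infections", "quarantined", "diagnosed")


def extract_counts(text, keywords=KEYWORDS):
    """Return (keyword, number) pairs where a number is followed closely by a keyword."""
    pattern = re.compile(
        r"(\d[\d,]*)\s+(?:\w+\s+){0,3}?("
        + "|".join(map(re.escape, keywords))
        + r")\b",
        re.IGNORECASE,
    )
    return [
        (keyword.lower(), int(number.replace(",", "")))
        for number, keyword in pattern.findall(text)
    ]


# Hypothetical sentence, not taken from the dataset.
pairs = extract_counts("Officials reported 1,200 new infections and 35 deaths on Monday.")
```

Summing such pairs per month would give a series of the kind the description attributes to Figure 4, though a pattern this simple will miss numbers written as words or placed after the keyword.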
Mental Health Services Monthly Statistics
digital.nhs.uk
csv, pdf, xls, xlsx
Updated Jul 21, 2016
Cite
(2016). Mental Health Services Monthly Statistics [Dataset]. https://digital.nhs.uk/data-and-information/publications/statistical/mental-health-services-monthly-statistics
Explore at:
Available download formats: csv (13.0 kB), csv (272.1 kB), pdf (239.2 kB), pdf (729.1 kB), csv (387.3 kB), csv (375.0 kB), csv (1.3 MB), xlsx (118.7 kB), xls (1.1 MB), xls (994.8 kB), xls (389.6 kB), xls (138.2 kB), csv (5.3 kB)
Dataset updated
Jul 21, 2016
License
https://digital.nhs.uk/about-nhs-digital/terms-and-conditions
Time period covered
Mar 1, 2016 - May 31, 2016
Area covered
England
Description
This release presents experimental statistics from the Mental Health Services Data Set (MHSDS), using final submissions for April 2016 and provisional submissions for May 2016. This is the fifth monthly release from the dataset, which replaces the Mental Health and Learning Disabilities Dataset (MHLDDS). As well as analysis of waiting times, first published in March 2016, this release includes elements of the reports that were previously included in monthly reports produced from final MHLDDS submissions. In this publication a new data file has been produced to present the data for people identified as having learning disabilities and/or autistic spectrum disorder (LDA) characteristics. Because of the scope of the changes to the dataset (resulting in the name change to MHSDS and the new name for these monthly reports) it will take time to re-introduce all possible measures that were previously part of the MHLDS Monthly Reports. Additional measures will be added to this report in the coming months. Further details about these changes and the consultation that informed them were announced in November. From January 2016 the release includes information on people in children and young people's mental health services, including CAMHS, for the first time. Learning disabilities and autism services have been included since September 2014.

This release of final data for April 2016 comprises:

- An Executive Summary, which presents national-level analysis across the whole dataset and also for some specific service areas and age groups.
- Data tables about access and waiting times in mental health services for the based on provisional data for the period 1 March 2016 to 31 May 2016.
- A monthly data file which presents 92 measures for mental health, learning disability and autism services at National, Provider and Clinical Commissioning Group (CCG) level.
- A Currency and Payments (CAP) data file, containing three measures relating to people assigned to Adult Mental Health Care Clusters. Further measures will be added in future releases.
- A data file containing the measures relating to people with learning disabilities and/or autism.
- Exploratory analysis of the coverage and completeness of access and waiting times statistics for people entering the Early Intervention in Psychosis pathway.
- A set of provider level data quality measures for both months. The report comprises validity measures for various data items at National and Provider level. From the publication of April data, a coverage report is included showing the number of providers submitting each month and number of records submitted.
- A metadata file, which provides contextual information for each measure, including a full description, current uses, method used for analysis and some notes on usage.

We will release the reports as experimental statistics until the characteristics of data flowed using the new data standard are understood.

A correction was made to this publication on 10 September 2018. This amendment relates to statistics in the monthly CSV data file; the specific measures affected are listed in the “Corrected Measures” CSV. All listed measures have now been corrected. NHS Digital apologises for any inconvenience caused.
Popular Black Last Names in the US
johnsnowlabs.com
csv
Updated Jan 20, 2021
Cite
John Snow Labs (2021). Popular Black Last Names in the US [Dataset]. https://www.johnsnowlabs.com/marketplace/popular-black-last-names-in-the-us/
Explore at:
Available download formats: csv
Dataset updated
Jan 20, 2021
Dataset authored and provided by
John Snow Labs
Area covered
United States
Description
This dataset represents the popular last names in the United States for Black.
Historic US Census - 1920
redivis.com
application/jsonl +7
Updated Jan 10, 2020
Cite
Stanford Center for Population Health Sciences (2020). Historic US Census - 1920 [Dataset]. http://doi.org/10.57761/v43s-pk48
Explore at:
Available download formats: sas, csv, stata, spss, avro, parquet, application/jsonl, arrow
Unique identifier
https://doi.org/10.57761/v43s-pk48
Dataset updated
Jan 10, 2020
Dataset provided by
Redivis Inc.
Authors
Stanford Center for Population Health Sciences
Time period covered
Jan 1, 1920 - Dec 31, 1920
Area covered
United States
Description
Abstract

The Integrated Public Use Microdata Series (IPUMS) Complete Count Data include more than 650 million individual-level and 7.5 million household-level records. The microdata are the result of collaboration between IPUMS and the nation’s two largest genealogical organizations, Ancestry.com and FamilySearch, and provide the largest and richest source of individual-level and household data.

Before Manuscript Submission

All manuscripts (and other items you'd like to publish) must be submitted to phsdatacore@stanford.edu for approval prior to journal submission.

We will check your cell sizes and citations.

For more information about how to cite PHS and PHS datasets, please visit:

https://phsdocs.developerhub.io/need-help/citing-phs-data-core

Documentation

Historic data are scarce and often exist only in aggregate tables. The key advantage of historic US census data is the availability of individual and household level characteristics that researchers can tabulate in ways that benefit their specific research questions. The data contain demographic variables, economic variables, migration variables and family variables. Within households, it is possible to create relational data as all relations between household members are known. For example, having data on the mother and her children in a household enables researchers to calculate the mother’s age at birth. Another advantage of the Complete Count data is the possibility to follow individuals over time using a historical identifier.

In sum: the historic US census data are a unique source for research on social and economic change and can provide population health researchers with information about social and economic determinants.

The historic US 1920 census data was collected in January 1920. Enumerators collected data by traveling to households and counting the residents who regularly slept at the household. Individuals lacking permanent housing were counted as residents of the place where they were when the data was collected. Household members absent on the day of data collection were either listed to the household with the help of other household members or were scheduled for the last census subdivision.

Notes

We provide household and person data separately so that it is convenient to explore the descriptive statistics on each level. In order to obtain a full dataset, merge the household and person on the variables SERIAL and SERIALP. In order to create a longitudinal dataset, merge datasets on the variable HISTID.
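The household-person merge described above can be sketched without any external library. This assumes that SERIAL is the household identifier in the household file and that SERIALP carries the matching value in the person file, which is how the note reads; the record values and the extra columns shown are hypothetical.

```python
def merge_household_person(households, persons):
    """Attach each person's household record by matching SERIALP to SERIAL."""
    by_serial = {h["SERIAL"]: h for h in households}
    merged = []
    for person in persons:
        household = by_serial.get(person["SERIALP"])
        if household is not None:  # inner join: skip persons with no household row
            merged.append({**household, **person})
    return merged


# Hypothetical records, for illustration only.
households = [{"SERIAL": 1, "FARM": 0}]
persons = [{"SERIALP": 1, "AGE": 30}, {"SERIALP": 2, "AGE": 5}]
full = merge_household_person(households, persons)
```

The same pattern, keyed on HISTID instead, would correspond to the longitudinal merge mentioned in the note.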

Households with more than 60 people in the original data were broken up for processing purposes. Every person in the large households is considered to be in their own household. The original large households can be identified using the variable SPLIT, reconstructed using the variable SPLITHID, and the original count is found in the variable SPLITNUM.

Coded variables derived from string variables are still in progress. These variables include: occupation and industry.

Missing observations have been allocated and some inconsistencies have been edited for the following variables: SPEAKENG, YRIMMIG, CITIZEN, AGE, BPL, MBPL, FBPL, LIT, SCHOOL, OWNERSHP, MORTGAGE, FARM, CLASSWKR, OCC1950, IND1950, MARST, RACE, SEX, RELATE, MTONGUE. The flag variables indicating an allocated observation for the associated variables can be included in your extract by clicking the ‘Select data quality flags’ box on the extract summary page.

Most inconsistent information was not edited for this release, thus there are observations outside of the universe for some variables. In particular, the variables GQ, and GQTYPE have known inconsistencies and will be improved with the next release.

Section 2

This dataset was created on 2020-01-10 18:46:34.647 by merging multiple datasets together. The source datasets for this version were:

IPUMS 1920 households: This dataset includes all households from the 1920 US census.

IPUMS 1920 persons: This dataset includes all individuals from the 1920 US census.

IPUMS 1920 Lookup: This dataset includes variable names, variable labels, variable values, and corresponding variable value labels for the IPUMS 1920 datasets.
A corpus of names drawn from the local birth registers of England and Wales,...
dtechtive.com
find.data.gov.scot
txt, xlsx, zip
Updated Jan 25, 2018
University of Edinburgh (2018). A corpus of names drawn from the local birth registers of England and Wales, 1838-2014 [Dataset]. http://doi.org/10.7488/ds/2294
Available download formats: xlsx (30.21 MB), zip (5.395 MB), txt (0.0166 MB)
Unique identifier
https://doi.org/10.7488/ds/2294
Dataset updated
Jan 25, 2018
Dataset provided by
University of Edinburgh
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
England, UNITED KINGDOM
Description
This dataset comprises a corpus of names, in both the first and middle position, for approximately 22 million individuals born in England and Wales between 1838 and 2014. This data is obtained from birth records made available by a set of volunteer-run genealogical resources - collectively, the 'UK local BMD project' (http://www.ukbmd.org.uk/local) - and has been re-purposed here to demonstrate the applicability of network analysis methods to an onomastic dataset. The ownership and licensing of the intellectual property constituting the original birth records is detailed at https://www.ukbmd.org.uk/TermsAndConditions. Under section 29A of the UK Copyright, Designs and Patents Act 1988, a copyright exception permits copies to be made of lawfully accessible material in order to conduct text and data mining for non-commercial research. The data included in this dataset represents the outcome of such a text-mining analysis. No birth records are included in this dataset, and nor is it possible for records to be reconstructed from the data presented herein. The data comprises an archive of tables, presenting this corpus in various forms: as a rank order of names (in both the first and middle position) by number of registered births per year, and by the total number of births across all years sampled. An overview of the data is also provided, with summary statistics such as the number of usable records registered per year, most popular names per year, and measures of forename diversity and the surname-to-forename usage ratio (an indicator of which forenames are more likely to be transferred uses of surnames). These tables are extensive but not exhaustive, and do not exclude the possibility that errors are present in the corpus. 
Data are also presented both as '.expression' files (an input format readable by the network analysis tool Graphia Professional) and as '.layout' files, a text file format output by Graphia Professional that describes the characteristics of the network so that it may be replicated. Characteristics of the original birth records that allow the identification of individuals - for instance, full name or location of birth - have been removed.
Data from: people-detection
huggingface.co
Updated Mar 26, 2025
FourierYe (2025). people-detection [Dataset]. https://huggingface.co/datasets/FourierYe/people-detection
Dataset updated
Mar 26, 2025
Authors
FourierYe
License
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Description
Dataset Card for Dataset Name

This dataset card aims to be a base template for new datasets. It has been generated using this raw template.

Dataset Details Dataset Description

Curated by: [More Information Needed] Funded by [optional]: [More Information Needed] Shared by [optional]: [More Information Needed] Language(s) (NLP): [More Information Needed] License: [More Information Needed]

Dataset Sources [optional]

Repository: [More… See the full description on the dataset page: https://huggingface.co/datasets/FourierYe/people-detection.
higbie/ALWW: American Labor Who's Who (1925) Dataset
data.niaid.nih.gov
zenodo.org
Updated Feb 5, 2022
Toby Higbie (2022). higbie/ALWW: American Labor Who's Who (1925) Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_597301
Dataset updated
Feb 5, 2022
Dataset authored and provided by
Toby Higbie
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
American Labor Who's Who dataset, version 2.2.0

A dataset derived from the digitized text of Solon de Leon, et al., The American Labor Who's Who (New York: Hanford Press, 1925). This release includes separate files for the U.S. and "Other Countries" sections of the directory.

The American Labor Who's Who (ALWW), published in 1925, is a directory of activists in the fields of trade unionism, immigrant rights, civil liberties, and progressive and radical politics. The directory includes roughly 1,300 entries for U.S. activists and 300 additional non-US activists. Each entry is a telegraphic biography. Some provide only name, professional title, and address at the time of publication, but many sketch rich life histories. Nearly all provide details on birth date and place, family background, education, migration, and work histories, as well as key organizations, events, publications, and home and work addresses.

The ALWW dataset is derived from the text hosted on the HathiTrust digital library: https://catalog.hathitrust.org/Record/000591300. Faculty, staff, and students at UCLA corrected the plain text from the scanned document and parsed the text into comma-separated fields. This release includes separate files for US entries and "Other Countries" entries. About 30 individuals are listed in the US section with the notation "see other countries," mainly Canadians and Mexicans. This subset is also in a separate file in this release.

For more information about this and related projects see: http://socialjusticehistory.org/projects/networkedlabor/.

Contributors: Tobias Higbie, Principal Investigator, UCLA History Department; Craig Messner, UCLA Center for Digital Humanities; Nick de Carlo, UCLA Center for Digital Humanities; Zoe Borovsky, UCLA Library.

Contents of Release: The US and Other Country datasets were developed separately, as reflected in their different version numbers. The US entries are more developed and cleaner. Consider the Other Country files beta releases. The files listed below are the most up-to-date available. Previous versions are also available via GitHub.

alww-us-2-2-0.csv (all US entries)

alww-othercountries-o.3.2.csv (all other country entries)

alww-othercountries-0.3.2-subset-crossrefd.csv (other country entries cross-referenced in the US entry section)

Field Layouts

The field layouts for the US and Other Country files are slightly different in this release.

US Entries: The fields for the US file (alww-us-2-2-0.csv) include: NAME [first and last], NAME-ALWW [name as it appears in the original text], TITLES [named offices or occupations in 1925], ORGS [compiled list of organizations belonged to at any time], BIRTHDATES [m/d/y where present], BIRTHCOUNTRY [derived from Birthplace], BIRTHPLACE [as listed], FATHER [father's occupation, in a few cases includes mother], CAREER (UNABBREVIATED) [education and experience, usually chronological, most common abbreviations expanded to full words], CAREER (ABBREVIATED) [same as previous with original abbreviations], HOME ADDRESS [where present], WORK ADDRESS [where present], PUBLICATIONS [incomplete], INDEX CATEGORY 1 [categories derived from the ALWW index, many have more than one category], INDEX CATEGORY 2, INDEX CATEGORY 3, INDEX CATEGORY 4, INDEX CATEGORY 5, INDEX CATEGORY 6, INDEX CATEGORY 7, INDEX CATEGORY 8, ORIGINAL [unparsed entry text carried over from earlier versions].

Other Countries: The other country files (alww-othercountries-o.3.2.csv and alww-othercountries-0.3.2-subset-crossrefd.csv) include these fields: Name [last, first], Titles [named offices or occupations in 1925], Organizations [compiled list of organizations belonged to at any time], Birthdate [as listed], Birthplace [as listed], Father [father's occupation], Other [same as Career above], HomeAddress [as listed], WorkAddress [as listed], Publications [derived from entries].
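Because the two layouts differ, combining them needs a small harmonization step. The sketch below is one possible approach and not part of the release: it renames Other Countries columns to the US column names listed above and turns "last, first" names into "first last". The target key "CAREER" for the Other field is a hypothetical choice, since the layout does not say which of the two US career columns it corresponds to, and the sample row is invented.

```python
def to_first_last(name):
    """Turn 'Last, First' into 'First Last'; leave names without a comma unchanged."""
    last, sep, first = name.partition(",")
    if not sep:
        return name.strip()
    return f"{first.strip()} {last.strip()}"


# Other Countries column -> US column, based on the field lists above.
FIELD_MAP = {
    "Name": "NAME",
    "Titles": "TITLES",
    "Organizations": "ORGS",
    "Birthdate": "BIRTHDATES",
    "Birthplace": "BIRTHPLACE",
    "Father": "FATHER",
    "Other": "CAREER",  # hypothetical target; the US file has two career columns
    "HomeAddress": "HOME ADDRESS",
    "WorkAddress": "WORK ADDRESS",
    "Publications": "PUBLICATIONS",
}


def harmonize(row):
    """Rename an Other Countries row to US-style keys and normalize the name order."""
    out = {FIELD_MAP.get(key, key): value for key, value in row.items()}
    if "NAME" in out:
        out["NAME"] = to_first_last(out["NAME"])
    return out


# Invented example row, not taken from the dataset.
example = harmonize({"Name": "Doe, Jane", "Titles": "Sec."})
```

Birth dates are left untouched here: the US file uses m/d/y where present, while the Other Countries files keep the date as listed, so any parsing would need to be checked against the actual values. Names with suffixes after a second comma would also need extra handling.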

Related datasets: American Labor Press Directory (1925); American Labor Press Directory (1940); Who's Who in Labor (1947).
Alaskan Population Demographic Information from Decennial and American...
search.dataone.org
knb.ecoinformatics.org
Updated Apr 11, 2019
United States Census Bureau; Juliet Bachtel; John Randazzo; Erika Gavenus (2019). Alaskan Population Demographic Information from Decennial and American Community Survey Census Data, 1940-2016 [Dataset]. http://doi.org/10.5063/F1XW4H3V
Unique identifier
https://doi.org/10.5063/F1XW4H3V
Dataset updated
Apr 11, 2019
Dataset provided by
Knowledge Network for Biocomplexity
Authors
United States Census Bureau; Juliet Bachtel; John Randazzo; Erika Gavenus
Time period covered
Jan 1, 1940 - Dec 31, 2015
Area covered

Variables measured
lat, lng, Year, city, ANVSA, Negro, Other, Place, White, Aleut., and 145 more
Description
These data comprise Census records relating to the Alaskan people's population demographics for the State of Alaska's Salmon and People (SASAP) Project. Decennial census data were originally extracted from the IPUMS National Historical Geographic Information System website: https://data2.nhgis.org/main (Citation: Steven Manson, Jonathan Schroeder, David Van Riper, and Steven Ruggles. IPUMS National Historical Geographic Information System: Version 12.0 [Database]. Minneapolis: University of Minnesota. 2017. http://doi.org/10.18128/D050.V12.0). A number of relevant tables of basic demographics on age and race, household income and poverty levels, and labor force participation were extracted. These particular variables were selected as part of an effort to understand and potentially quantify various dimensions of well-being in Alaskan communities. The file "censusdata_master.csv" is a consolidation of all 21 other data files in the package. For detailed information on how the datasets vary over different years, view the file "readme.docx" available in this data package. The included .Rmd file is a script which combines the 21 files by year into a single file (censusdata_master.csv). It also cleans up place names (including typographical errors) and uses the USGS place names dataset and the SASAP regions dataset to assign latitude and longitude values and region values to each place in the dataset. Note that some places were not assigned a region or location because they do not fit well into the regional framework. Considerable heterogeneity exists between census surveys each year. While we have attempted to combine these datasets in a way that makes sense, there may be some discrepancies or unexpected values. The RMarkdown document SASAPWebsiteGraphicsCensus.Rmd is used to generate a variety of figures using these data, including the additional file Chignik_population.png.
An additional set of 25 figures showing regional trends in population and income metrics is also included.
Australian Government Indigenous Programs & Policy Locations (AGIL) dataset
data.gov.au
data.wu.ac.at
csv +6
Updated Jul 31, 2024
Services Australia (2024). Australian Government Indigenous Programs & Policy Locations (AGIL) dataset [Dataset]. https://data.gov.au/data/dataset/agil-dataset
Available download formats: csv (3203), kmz (118699), excel (.xlsx) (234063), csv (106128), esri gdb - zipped (84298), esri shapefile - zipped (87049), xml (6200), csv (120644), pdf (66644)
Dataset updated
Jul 31, 2024
Dataset authored and provided by
Services Australia http://www.humanservices.gov.au/
License
Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Area covered
Australia
Description
This dataset has been developed by the Australian Government as an authoritative source of indigenous location names across Australia. It is sponsored by the Spatial Policy Branch within the Department of Communications and managed solely by the Department of Human Services.

The dataset is designed to support the accurate positioning, consistent reporting, and effective delivery of Australian Government programs and services to indigenous locations.

The dataset contains Preferred and Alternate names for indigenous locations where Australian Government programs and services have been, are being, or may be provided. The Preferred name will always default to a State or Territory jurisdiction's gazetted name, so the term 'preferred' does not imply that this is the locally known name for the location. Similarly, locational details are aligned, where possible, with those published in State and Territory registers.

This dataset is NOT a complete listing of all locations at which indigenous people reside. Town and city names are not included in the dataset. The dataset contains names that represent indigenous communities, outstations, defined indigenous areas within a town or city or locations where services have been provided.
Index of Marfleets, Pre-1500 - Dataset - B2FIND
b2find.dkrz.de
Updated Apr 30, 2023
(2023). Index of Marfleets, Pre-1500 - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/453676de-713d-57b3-8236-c7d01a426734
Dataset updated
Apr 30, 2023
Description
Abstract copyright UK Data Service and data collection copyright owner. The main aim of this project was to record information about individuals pre-1500 with the surname Marfleet or variants of the surname Marfleet. Main Topics: The dataset contains the names of individuals pre-1500 with the surname Marfleet or variants thereof, with details of the source in which they are mentioned, including the date of the source and a transcription or precis of the relevant part. The variants of the surname Marfleet which appear are Mareflete, Marflet, Marflete, Mayrflete, Mereflet, Merfflete, Merfle, Merfleet, Merflet, Merflete, Merfleyt, Meriflet, Mersflet, Mirfleet, Mirflete and Moreflett. No sampling (total universe). Transcription of existing materials. Compilation or synthesis of existing material.

Social Security Administration (2022). Baby Names from Social Security Card Applications - National Data [Dataset]. https://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-national-data

15 scholarly articles cite this dataset (View in Google Scholar)
