100+ datasets found

Is Demography Destiny? Application of Machine Learning Techniques to...
plos.figshare.com
figshare.com
docx
Updated Jun 3, 2023
Cite
Wei Luo; Thin Nguyen; Melanie Nichols; Truyen Tran; Santu Rana; Sunil Gupta; Dinh Phung; Svetha Venkatesh; Steve Allender (2023). Is Demography Destiny? Application of Machine Learning Techniques to Accurately Predict Population Health Outcomes from a Minimal Demographic Dataset [Dataset]. http://doi.org/10.1371/journal.pone.0125602
Explore at:
Available download formats: docx
Unique identifier
https://doi.org/10.1371/journal.pone.0125602
Dataset updated
Jun 3, 2023
Dataset provided by
PLOS ONE
Authors
Wei Luo; Thin Nguyen; Melanie Nichols; Truyen Tran; Santu Rana; Sunil Gupta; Dinh Phung; Svetha Venkatesh; Steve Allender
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
For years, we have relied on population surveys to keep track of regional public health statistics, including the prevalence of non-communicable diseases. Because of the cost and limitations of such surveys, we often do not have the up-to-date data on health outcomes of a region. In this paper, we examined the feasibility of inferring regional health outcomes from socio-demographic data that are widely available and timely updated through national censuses and community surveys. Using data for 50 American states (excluding Washington DC) from 2007 to 2012, we constructed a machine-learning model to predict the prevalence of six non-communicable disease (NCD) outcomes (four NCDs and two major clinical risk factors), based on population socio-demographic characteristics from the American Community Survey. We found that regional prevalence estimates for non-communicable diseases can be reasonably predicted. The predictions were highly correlated with the observed data, in both the states included in the derivation model (median correlation 0.88) and those excluded from the development for use as a completely separated validation sample (median correlation 0.85), demonstrating that the model had sufficient external validity to make good predictions, based on demographics alone, for areas not included in the model development. This highlights both the utility of this sophisticated approach to model development, and the vital importance of simple socio-demographic characteristics as both indicators and determinants of chronic disease.
LivWell: a sub-national database on the Living conditions of Women and their...
zenodo.org
data.niaid.nih.gov
bin, csv
Updated Nov 3, 2022
Cite
Camille Belmin; Roman Hoffmann; Mahmoud Elkasabi; Peter-Paul Pichler (2022). LivWell: a sub-national database on the Living conditions of Women and their Well-being for 52 countries [Dataset]. http://doi.org/10.5281/zenodo.5821533
Explore at:
Available download formats: bin, csv
Unique identifier
https://doi.org/10.5281/zenodo.5821533
Dataset updated
Nov 3, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Camille Belmin; Roman Hoffmann; Mahmoud Elkasabi; Peter-Paul Pichler
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
LivWell is a global longitudinal database which provides a range of key indicators related to women’s socioeconomic status, health and well-being, access to basic services, and demographic outcomes. Data are available at the sub-national level for 52 countries and 447 regions. A total of 134 indicators are based on 199 Demographic and Health Surveys for the period 1990-2019, supplemented by extensive information on socioeconomic and climatic conditions in the respective regions for a total of 190 indicators. The resulting data offer various opportunities for policy-relevant research on gender inequality, inclusive development, and demographic trends at the sub-national level.

For a full description, please refer to the article describing the database here: (link to come)

The companion repository livwelldata makes it easy to use the database in R. The R package can be downloaded by following the instructions in this git repository: https://gitlab.pik-potsdam.de/belmin/livwelldata. The version of the database in the package is the same as in this repository.
California Population Trends by Geography
data.cnra.ca.gov
data.ca.gov
csv, website
Updated Apr 22, 2025
Cite
California Department of Water Resources (2025). California Population Trends by Geography [Dataset]. https://data.cnra.ca.gov/dataset/population-trends-by-geography
Explore at:
Available download formats: csv (317335), website
Dataset updated
Apr 22, 2025
Dataset authored and provided by
California Department of Water Resources (http://www.water.ca.gov/)
License
U.S. Government Works https://www.usa.gov/government-works
License information was derived automatically
Area covered
California
Description
This dataset provides population estimate trends from 1998 to the current year for each of California’s 58 counties, further disaggregated by Detailed Analysis Units (DAUs) - the smallest geographic units historically used by the California Department of Water Resources for water planning as part of the California Water Plan. DAUs are subdivisions of Planning Areas and often align with county boundaries, although a single DAU may span multiple counties. They have traditionally supported water demand estimates based on crop and land use types.

The population estimates were developed using U.S. Census Bureau 2000, 2010, and 2020 data. Throughout the estimation process, intermediate results were reviewed and adjusted as needed, with professional judgment applied to smooth trends where appropriate.

Since the California Water Plan is retiring DAUs as its planning and analysis framework, future updates to this dataset will transition away from DAU based geography. Instead, population estimates will be provided based on other geographic units, such as the 8-digit Hydrologic Units (HUC8) defined by the U.S. Geological Survey’s Watershed Boundary Dataset.

A dashboard is available for visualizing historical population trends by county and DAU.
ARCHIVED: Mpox Vaccinations Given to SF Residents by Demographics
catalog.data.gov
data.sfgov.org
Updated Mar 29, 2025
Cite
data.sfgov.org (2025). ARCHIVED: Mpox Vaccinations Given to SF Residents by Demographics [Dataset]. https://catalog.data.gov/dataset/mpx-vaccinations-given-to-sf-residents-by-demographics
Explore at:
Dataset updated
Mar 29, 2025
Dataset provided by
data.sfgov.org
Area covered
San Francisco
Description
In early February 2024, we will be retiring the Mpox Vaccinations Given to SF Residents by Demographics dataset. This dataset will be archived and will no longer be updated. A historic record of this data will remain available.

A. SUMMARY
This dataset represents doses of mpox vaccine (JYNNEOS) administered in California to residents of San Francisco ages 18 years or older. This dataset only includes doses of the JYNNEOS vaccine given on or after 5/1/2022. All vaccines given to people who live in San Francisco are included, no matter where the vaccination took place. The data are broken down by multiple demographic stratifications.

B. HOW THE DATASET IS CREATED
Information on doses administered to those who live in San Francisco is from the California Immunization Registry (CAIR2), run by the California Department of Public Health (CDPH). Information on individuals' city of residence, age, race, ethnicity, and sex is recorded in CAIR2 and is self-reported at the time of vaccine administration. Because CAIR2 does not include information on sexual orientation, we pull information from the San Francisco Department of Public Health's Epic Electronic Health Record (EHR). The populations represented in our Epic data and the CAIR2 data are different. Epic data only include vaccinations administered at SFDPH managed sites to SF residents.

Data notes for population characteristic types are listed below.
Age
* Data only include individuals who are 18 years of age or older.
Race/ethnicity
* The response option "Other Race" is categorized by the data source system, and the response option "Unknown" refers to a lack of data.
Sex
* The response option "Other" is categorized by the source system, and the response option "Unknown" refers to a lack of data.
Sexual orientation
* The response option "Unknown/Declined" refers to a lack of data or individuals who reported multiple different sexual orientations during their most recent interaction with SFDPH.

For convenience, we provide the 2020 5-year American Community Survey population estimates.

C. UPDATE PROCESS
Updated daily via automated process.

D. HOW TO USE THIS DATASET
This dataset includes many different types of demographic groups. Filter the "demographic_group" column to explore a topic area. Then, the "demographic_subgroup" column shows each group or category within that topic area and the total count of doses administered to that population subgroup.

E. CHANGE LOG
UPDATE 1/3/2023: Due to low case numbers, this page will no longer include vaccinations after 12/31/2022.
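
The filtering step described under "HOW TO USE THIS DATASET" could be sketched roughly as follows. This is only an illustration: the column names "demographic_group" and "demographic_subgroup" come from the description above, while the count column name "total_doses" and all sample values are hypothetical placeholders, not taken from the actual file.

```python
import csv
import io

# Tiny made-up extract. Only "demographic_group" and "demographic_subgroup"
# are named in the dataset description; "total_doses" and every value below
# are hypothetical placeholders.
SAMPLE_CSV = """demographic_group,demographic_subgroup,total_doses
Age,18-24,10
Age,25-34,30
Sex,Female,5
Sex,Male,35
"""


def doses_by_subgroup(csv_text, group, count_column="total_doses"):
    """Return {subgroup: doses} for the rows belonging to one demographic group."""
    result = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["demographic_group"] == group:
            result[row["demographic_subgroup"]] = int(row[count_column])
    return result


age_counts = doses_by_subgroup(SAMPLE_CSV, "Age")
print(age_counts)
```

With a real export, the header of the downloaded file should be checked first, and `count_column` adjusted to whatever the count field is actually called.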
NYSERDA Low- to Moderate-Income New York State Census Population Analysis...
catalog.data.gov
datasets.ai
Updated Jun 28, 2025
Cite
data.ny.gov (2025). NYSERDA Low- to Moderate-Income New York State Census Population Analysis Dataset: Average for 2013-2015 [Dataset]. https://catalog.data.gov/dataset/nyserda-low-to-moderate-income-new-york-state-census-population-analysis-dataset-aver-2013
Explore at:
Dataset updated
Jun 28, 2025
Dataset provided by
data.ny.gov
Area covered
New York
Description
How does your organization use this dataset? What other NYSERDA or energy-related datasets would you like to see on Open NY? Let us know by emailing OpenNY@nyserda.ny.gov.

The Low- to Moderate-Income (LMI) New York State (NYS) Census Population Analysis dataset results from the LMI market database designed by APPRISE as part of the NYSERDA LMI Market Characterization Study (https://www.nyserda.ny.gov/lmi-tool). All data are derived from the U.S. Census Bureau’s American Community Survey (ACS) 1-year Public Use Microdata Sample (PUMS) files for 2013, 2014, and 2015. Each row in the LMI dataset is an individual record for a household that responded to the survey and each column is a variable of interest for analyzing the low- to moderate-income population.

The LMI dataset includes: county/county group, households with elderly, households with children, economic development region, income groups, percent of poverty level, low- to moderate-income groups, household type, non-elderly disabled indicator, race/ethnicity, linguistic isolation, housing unit type, owner-renter status, main heating fuel type, home energy payment method, housing vintage, LMI study region, LMI population segment, mortgage indicator, time in home, head of household education level, head of household age, and household weight.

The LMI NYS Census Population Analysis dataset is intended for users who want to explore the underlying data that supports the LMI Analysis Tool. The majority of those interested in LMI statistics and generating custom charts should use the interactive LMI Analysis Tool at https://www.nyserda.ny.gov/lmi-tool. This underlying LMI dataset is intended for users with experience working with survey data files and producing weighted survey estimates using statistical software packages (such as SAS, SPSS, or Stata).
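
As a rough illustration of what a weighted survey estimate means here, the sketch below computes a weighted share from household-level records. The description mentions a household weight and an owner-renter status variable, but the key names "household_weight" and "owner_renter", along with all values, are hypothetical. It produces a point estimate only and says nothing about sampling error.

```python
def weighted_share(records, predicate, weight_key="household_weight"):
    """Weighted share of households satisfying `predicate`.

    Each record counts in proportion to its survey weight rather than
    as a single unit, which is what distinguishes a weighted estimate
    from a raw row count.
    """
    total = sum(r[weight_key] for r in records)
    if total == 0:
        raise ValueError("total weight is zero")
    matching = sum(r[weight_key] for r in records if predicate(r))
    return matching / total


# Hypothetical records: field names and values are placeholders.
households = [
    {"owner_renter": "Renter", "household_weight": 120.0},
    {"owner_renter": "Owner", "household_weight": 80.0},
    {"owner_renter": "Renter", "household_weight": 50.0},
]

renter_share = weighted_share(households, lambda r: r["owner_renter"] == "Renter")
print(round(renter_share, 2))
```

Two of the three sample rows are renters, yet the weighted share differs from 2/3 because the rows carry unequal weights.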
San Francisco Population and Demographic Census Data
catalog.data.gov
data.sfgov.org
Updated Mar 29, 2025
Cite
data.sfgov.org (2025). San Francisco Population and Demographic Census Data [Dataset]. https://catalog.data.gov/dataset/san-francisco-population-and-demographic-census-data
Explore at:
Dataset updated
Mar 29, 2025
Dataset provided by
data.sfgov.org
Area covered
San Francisco
Description
A. SUMMARY
This dataset contains population and demographic estimates and associated margins of error obtained and derived from the US Census. The data is presented over multiple years and geographies. The data is sourced primarily from the American Community Survey.

B. HOW THE DATASET IS CREATED
The raw data is obtained from the census API. Some estimates are published as-is and some are derived.

C. UPDATE PROCESS
New estimates and years of data are appended to this dataset. To request additional census data for San Francisco, email support@datasf.org.

D. HOW TO USE THIS DATASET
The dataset is long and contains multiple estimates, years and geographies. To use this dataset, you can filter by the overall segment, which contains information about the source, years, geography, demographic category and reporting segment. For census data used in specific reports, you can filter to the reporting segment. To use a subset of the data, you can create a filtered view. More information on how to filter data and create a view can be found here
Demographics
catalog.data.gov
datasets.ai
Updated Nov 22, 2024
Cite
Lake County Illinois GIS (2024). Demographics [Dataset]. https://catalog.data.gov/dataset/demographics-0be32
Explore at:
Dataset updated
Nov 22, 2024
Dataset provided by
Lake County Illinois GIS
Description
Lake County, Illinois Demographic Data. Explanation of field attributes:
Total Population – The entire population of Lake County.
White – Individuals who are of Caucasian race. This is a percent.
African American – Individuals who are of African American race. This is a percent.
Asian – Individuals who are of Asian race. This is a percent.
Hispanic – Individuals who are of Hispanic ethnicity. This is a percent.
Does not Speak English – Individuals who speak a language other than English in their household. This is a percent.
Under 5 years of age – Individuals who are under 5 years of age. This is a percent.
Under 18 years of age – Individuals who are under 18 years of age. This is a percent.
18-64 years of age – Individuals who are between 18 and 64 years of age. This is a percent.
65 years of age and older – Individuals who are 65 years old or older. This is a percent.
Male – Individuals who are male in gender. This is a percent.
Female – Individuals who are female in gender. This is a percent.
High School Degree – Individuals who have obtained a high school degree. This is a percent.
Associate Degree – Individuals who have obtained an associate degree. This is a percent.
Bachelor’s Degree or Higher – Individuals who have obtained a bachelor’s degree or higher. This is a percent.
Utilizes Food Stamps – Households receiving food stamps/ part of SNAP (Supplemental Nutrition Assistance Program). This is a percent.
Median Household Income – A median household income refers to the income level earned by a given household where half of the homes in the area earn more and half earn less. This is a dollar amount.
No High School – Individuals who have not obtained a high school degree. This is a percent.
Poverty – Poverty refers to families and people whose income in the past 12 months is below the poverty level. This is a percent.
Data from: Population Assessment of Tobacco and Health (PATH) Study [United...
icpsr.umich.edu
Updated Jun 27, 2025
Cite
Inter-university Consortium for Political and Social Research [distributor] (2025). Population Assessment of Tobacco and Health (PATH) Study [United States] Restricted-Use Files [Dataset]. http://doi.org/10.3886/ICPSR36231.v42
Explore at:
Unique identifier
https://doi.org/10.3886/ICPSR36231.v42
Dataset updated
Jun 27, 2025
Dataset provided by
Inter-university Consortium for Political and Social Research (https://www.icpsr.umich.edu/web/pages/)
License
https://www.icpsr.umich.edu/web/ICPSR/studies/36231/terms
Area covered
United States
Description
The PATH Study was launched in 2011 to inform the Food and Drug Administration's regulatory activities under the Family Smoking Prevention and Tobacco Control Act (TCA). The PATH Study is a collaboration between the National Institute on Drug Abuse (NIDA), National Institutes of Health (NIH), and the Center for Tobacco Products (CTP), Food and Drug Administration (FDA). The study sampled over 150,000 mailing addresses across the United States to create a national sample of people who use or do not use tobacco. 45,971 adults and youth constitute the first (baseline) wave, Wave 1, of data collected by this longitudinal cohort study. These 45,971 adults and youth along with 7,207 "shadow youth" (youth ages 9 to 11 sampled at Wave 1) make up the 53,178 participants that constitute the Wave 1 Cohort. Respondents are asked to complete an interview at each follow-up wave. Youth who turn 18 by the current wave of data collection are considered "aged-up adults" and are invited to complete the Adult Interview. Additionally, "shadow youth" are considered "aged-up youth" upon turning 12 years old, when they are asked to complete an interview after parental consent. At Wave 4, a probability sample of 14,098 adults, youth, and shadow youth ages 10 to 11 was selected from the civilian, noninstitutionalized population (CNP) at the time of Wave 4. This sample was recruited from residential addresses not selected for Wave 1 in the same sampled Primary Sampling Unit (PSU)s and segments using similar within-household sampling procedures. This "replenishment sample" was combined for estimation and analysis purposes with Wave 4 adult and youth respondents from the Wave 1 Cohort who were in the CNP at the time of Wave 4. This combined set of Wave 4 participants, 52,731 participants in total, forms the Wave 4 Cohort. At Wave 7, a probability sample of 14,863 adults, youth, and shadow youth ages 9 to 11 was selected from the CNP at the time of Wave 7. 
This sample was recruited from residential addresses not selected for Wave 1 or Wave 4 in the same sampled PSUs and segments using similar within-household sampling procedures. This "second replenishment sample" was combined for estimation and analysis purposes with the Wave 7 adult and youth respondents from the Wave 4 Cohorts who were at least age 15 and in the CNP at the time of Wave 7. This combined set of Wave 7 participants, 46,169 participants in total, forms the Wave 7 Cohort. Please refer to the Restricted-Use Files User Guide that provides further details about children designated as "shadow youth" and the formation of the Wave 1, Wave 4, and Wave 7 Cohorts. Dataset 0002 (DS0002) contains the data from the State Design Data. This file contains 7 variables and 82,139 cases. The state identifier in the State Design file reflects the participant's state of residence at the time of selection and recruitment for the PATH Study. Dataset 1011 (DS1011) contains the data from the Wave 1 Adult Questionnaire. This data file contains 2,021 variables and 32,320 cases. Each of the cases represents a single, completed interview. Dataset 1012 (DS1012) contains the data from the Wave 1 Youth and Parent Questionnaire. This file contains 1,431 variables and 13,651 cases. Dataset 1411 (DS1411) contains the Wave 1 State Identifier data for Adults and has 5 variables and 32,320 cases. Dataset 1412 (DS1412) contains the Wave 1 State Identifier data for Youth (and Parents) and has 5 variables and 13,651 cases. The same 5 variables are in each State Identifier dataset, including PERSONID for linking the State Identifier to the questionnaire and biomarker data and 3 variables designating the state (state Federal Information Processing System (FIPS), state abbreviation, and full name of the state). The State Identifier values in these datasets represent participants' state of residence at the time of Wave 1, which is also their state of residence at the time of recruitment. 
Dataset 1611 (DS1611) contains the Tobacco Universal Product Code (UPC) data from Wave 1. This data file contains 32 variables and 8,601 cases. This file contains UPC values on the packages of tobacco products used or in the possession of adult respondents at the time of Wave 1. The UPC values can be used to identify and validate the specific products used by respondents and augment the analyses of the characteristics of tobacco products used
World Population & Health Data 2014 - 2024
kaggle.com
Updated Jan 21, 2025
Cite
Faizal Rosyid (2025). World Population & Health Data 2014 - 2024 [Dataset]. https://www.kaggle.com/datasets/faizalrosyid/world-population-and-health-data-2014-2024
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Jan 21, 2025
Dataset provided by
Kaggle
Authors
Faizal Rosyid
License
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Area covered
World
Description
This dataset provides an extensive view of global population statistics and health metrics across various countries from 2014 to 2024. It combines population data with vital health-related indicators, making it a valuable resource for understanding trends in population growth and health outcomes worldwide. Researchers, data scientists, and policymakers can utilize this dataset to analyze correlations between population dynamics and health performance at a global scale.

Key Features:
- Country: Name of the country.
- Year: Year of the data (2014–2024).
- Population: Total population for the respective year and country.
- Country Code: ISO 3-letter country codes for easy identification.
- Health Expenditure (health_exp): Percentage of GDP spent on healthcare.
- Life Expectancy (life_expect): Average life expectancy at birth in years.
- Maternal Mortality (maternal_mortality): Maternal deaths per 100,000 live births.
- Infant Mortality (infant_mortality): Deaths of infants under 1 year per 1,000 live births.
- Neonatal Mortality (neonatal_mortality): Deaths of newborns (0–28 days) per 1,000 live births.
- Under-5 Mortality (under_5_mortality): Deaths of children under 5 years per 1,000 live births.
- HIV Prevalence (prev_hiv): Percentage of the population living with HIV.
- Tuberculosis Incidence (inci_tuberc): Estimated new and relapse TB cases per 100,000 people.
- Undernourishment Prevalence (prev_undernourishment): Percentage of the population that is undernourished.

Use Cases:
- Health Policy Analysis: Understand trends in healthcare expenditure and its relationship to health outcomes.
- Global Health Research: Investigate global or regional disparities in health and nutrition.
- Population Studies: Analyze population growth trends alongside health indicators.
- Data Visualization: Build visual dashboards for storytelling and impactful data representation.
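
A minimal sketch of the kind of health policy analysis mentioned above: a Pearson correlation between health expenditure and life expectancy. The keys health_exp and life_expect follow the field names listed under Key Features. The four rows are invented purely for illustration and do not come from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Invented rows, for illustration only (not values from the dataset).
rows = [
    {"health_exp": 4.0, "life_expect": 68.0},
    {"health_exp": 6.0, "life_expect": 73.0},
    {"health_exp": 8.0, "life_expect": 75.0},
    {"health_exp": 10.0, "life_expect": 80.0},
]

r = pearson([row["health_exp"] for row in rows],
            [row["life_expect"] for row in rows])
print(f"r = {r:.3f}")
```

On the real data, rows with missing values would need to be dropped first, and a correlation across countries should not be read as a causal effect of spending.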
Annual Population Survey: Well-Being, April 2011 - March 2015: Secure Access...
beta.ukdataservice.ac.uk
Updated 2016
Cite
Social Survey Division Office For National Statistics (2016). Annual Population Survey: Well-Being, April 2011 - March 2015: Secure Access [Dataset]. http://doi.org/10.5255/ukda-sn-7961-1
Explore at:
Unique identifier
https://doi.org/10.5255/ukda-sn-7961-1
Dataset updated
2016
Dataset provided by
UK Data Service (https://ukdataservice.ac.uk/)
datacite
Authors
Social Survey Division Office For National Statistics
Description
The Annual Population Survey (APS) is a major survey series, which aims to provide data that can produce reliable estimates at local authority level. Key topics covered in the survey include education, employment, health and ethnicity. The APS comprises key variables from the Labour Force Survey (LFS) (held at the UK Data Archive under GN 33246), all of its associated LFS boosts and the APS boost. Thus, the APS combines results from five different sources: the LFS (waves 1 and 5); the English Local Labour Force Survey (LLFS), the Welsh Labour Force Survey (WLFS), the Scottish Labour Force Survey (SLFS) and the Annual Population Survey Boost Sample (APS(B) - however, this ceased to exist at the end of December 2005, so APS data from January 2006 onwards will contain all the above data apart from APS(B)). Users should note that the LLFS, WLFS, SLFS and APS(B) are not held separately at the UK Data Archive. For further detailed information about methodology, users should consult the Labour Force Survey User Guide, selected volumes of which have been included with the APS documentation for reference purposes (see 'Documentation' table below).

The APS aims to provide enhanced annual data for England, covering a target sample of at least 510 economically active persons for each Unitary Authority (UA)/Local Authority District (LAD) and at least 450 in each Greater London Borough. In combination with local LFS boost samples such as the WLFS and SLFS, the survey provides estimates for a range of indicators down to Local Education Authority (LEA) level across the United Kingdom.

APS Well-Being data
Since April 2011, the APS has included questions about personal and subjective well-being. The responses to these questions have been made available as annual sub-sets to the APS Person level files. It is important to note that the size of the achieved sample of the well-being questions within the dataset is approximately 165,000 people. This reduction arises because the well-being questions are asked only of persons aged 16 and above who gave a personal interview; proxy answers are not accepted. As a result, some caution should be used when analysing responses to well-being questions at detailed geography areas and also in relation to any other variables where respondent numbers are relatively small. For lower-level geography analysis, it is recommended that the variable UACNTY09 be used.

As well as annual datasets, three-year pooled datasets are available. When combining multiple APS datasets together, it is important to account for the rotational design of the APS and ensure that no person appears more than once in the multiple year dataset. This is because the well-being datasets are not designed to be longitudinal e.g. they are not designed to track individuals over time/be used for longitudinal analysis. They are instead cross-sectional, and are designed to use a cross-section of the population to make inferences about the whole population. For this reason, the three-year dataset has been designed to include only a selection of the cases from the individual year APS datasets, chosen in such a way that no individuals are included more than once, and the cases included are approximately equally spread across the three years. Further information is available in the 'Documentation' section below.

Secure Access APS Well-Being data
Secure Access datasets for the APS Well-Being include additional variables not included in either the standard End User Licence (EUL) versions (see under GN 33357) or the Special Licence (SL) access versions (see under GN 33376). Extra variables that typically can be found in the Secure Access version but not in the EUL or SL versions relate to:
geography, including:
Postcodes
Census Area Statistics (CAS) Wards
Census Output Areas
Nomenclature of Units for Territorial Statistics (NUTS) level 2 and 3 areas
Lower and Middle Layer Super Output Areas
Travel to Work Areas
Unitary authority / Local Authority District of place of work (main job)
region of place of work for first and second jobs
qualifications, education and training including level of highest qualification, qualifications from Government schemes, qualifications related to work, qualifications from school, qualifications from university or college and qualifications gained from outside the UK
detailed ethnic group for Scottish respondents
detailed religious denomination for Northern Irish respondents
length health problem has limited activity
learning difficulty or learning disability
occupation in apprenticeship or second job
number of bedrooms
number of dependent children in household aged under 19
Prospective users of the Secure Access version of the APS Well-Being will need to fulfil additional requirements, commencing with the completion of an extra application form to demonstrate to the data owners exactly why they need access to the extra, more detailed variables, in order to obtain permission to use that version. Secure Access data users must also complete face-to-face training and agree to the Secure Access User Agreement and Licence Compliance Policy (see 'Access' section below). Therefore, users are encouraged to download and inspect the EUL version of the data prior to ordering the Secure Access (or SL) version. Further details and links to all APS studies available from the UK Data Archive can be found via the APS Key Data series webpage.

APS Well-Being Datasets: Information, July 2016
From 2012-2015, the ONS published separate APS datasets aimed at providing initial estimates of subjective well-being, based on the Integrated Household Survey. In 2015 these were discontinued. A separate set of well-being variables and a corresponding weighting variable have been added to the April-March APS person datasets from A11M12 onwards. Users should no longer use the bespoke well-being datasets (SNs 6994, 6999, 7091, 7092, 7364, 7365, 7565, 7566 and 7961), but should now use the variables included on the April-March APS person datasets instead. Further information on the transition can be found on the Personal well-being in the UK: 2015 to 2016

Documentation and coding frames
The APS is compiled from variables present in the LFS. For variable and value labelling and coding frames that are not included either in the data or in the current APS documentation (e.g. coding frames for education, industrial and geographic variables, which are held in LFS User Guide Vol.5, Classifications), users are advised to consult the latest versions of the LFS User Guides, which are available from the ONS Labour Force Survey - User Guidance webpages.

May 2018 Update
Due to a change in the Travel-to-Work Area coding structure from 2001 to 2011, the variable TTWA9D has been relabelled in the pooled data file for 2012-2015.
Data from: A 24-hour dynamic population distribution dataset based on mobile...
data.niaid.nih.gov
explore.openaire.eu
Updated Feb 16, 2022
Cite
Olle Järv (2022). A 24-hour dynamic population distribution dataset based on mobile phone data from Helsinki Metropolitan Area, Finland [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4724388
Explore at:
Dataset updated
Feb 16, 2022
Dataset provided by
Claudia Bergroth
Tuuli Toivonen
Henrikki Tenkanen
Matti Manninen
Olle Järv
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Helsinki Metropolitan Area, Finland
Description
Related article: Bergroth, C., Järv, O., Tenkanen, H., Manninen, M., Toivonen, T., 2022. A 24-hour population distribution dataset based on mobile phone data from Helsinki Metropolitan Area, Finland. Scientific Data 9, 39.

In this dataset:

We present temporally dynamic population distribution data from the Helsinki Metropolitan Area, Finland, at the level of 250 m by 250 m statistical grid cells. Three hourly population distribution datasets are provided for regular workdays (Mon – Thu), Saturdays and Sundays. The data are based on aggregated mobile phone data collected by the biggest mobile network operator in Finland. Mobile phone data are assigned to statistical grid cells using an advanced dasymetric interpolation method based on ancillary data about land cover, buildings and a time use survey. The data were validated by comparing population register data from Statistics Finland for night-time hours and a daytime workplace registry. The resulting 24-hour population data can be used to reveal the temporal dynamics of the city and examine population variations relevant to, for instance, spatial accessibility analyses, crisis management and planning.

Please cite this dataset as:

Bergroth, C., Järv, O., Tenkanen, H., Manninen, M., Toivonen, T., 2022. A 24-hour population distribution dataset based on mobile phone data from Helsinki Metropolitan Area, Finland. Scientific Data 9, 39. https://doi.org/10.1038/s41597-021-01113-4

Organization of data

The dataset is packaged into a single zip file, Helsinki_dynpop_matrix.zip, which contains the following files:

HMA_Dynamic_population_24H_workdays.csv represents the dynamic population for an average workday in the study area.

HMA_Dynamic_population_24H_sat.csv represents the dynamic population for an average Saturday in the study area.

HMA_Dynamic_population_24H_sun.csv represents the dynamic population for an average Sunday in the study area.

target_zones_grid250m_EPSG3067.geojson represents the statistical grid in ETRS89/ETRS-TM35FIN projection that can be used to visualize the data on a map using e.g. QGIS.

Column names

YKR_ID : a unique identifier for each statistical grid cell (n=13,231). The identifier is compatible with the statistical YKR grid cell data by Statistics Finland and Finnish Environment Institute.

H0, H1 ... H23 : Each field represents the proportional distribution of the total population in the study area between grid cells during a one-hour period. In total, 24 fields are formatted as “Hx”, where x stands for the hour of the day (values ranging from 0 to 23). For example, H0 stands for the first hour of the day: 00:00 - 00:59. The sum of all cell values for each field equals 100 (i.e. 100% of the total population for each one-hour period).

In order to visualize the data on a map, the result tables can be joined with the target_zones_grid250m_EPSG3067.geojson data. The data can be joined by using the field YKR_ID as a common key between the datasets.
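The join and the column-sum property described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the authors' workflow: the sample rows, the YKR_ID values, the comma delimiter and the helper names (`read_population`, `column_totals`, `join_to_grid`) are all assumptions made for the example, and only two of the 24 hour columns are shown.

```python
import csv
import io

# Hypothetical two-cell extract in the documented layout (YKR_ID, H0 ... H23).
# The IDs and values are invented; a comma delimiter is assumed.
SAMPLE_CSV = """YKR_ID,H0,H1
5785640,60.0,45.5
5785641,40.0,54.5
"""

# Hypothetical stand-in for target_zones_grid250m_EPSG3067.geojson, already parsed.
SAMPLE_GRID = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"YKR_ID": 5785640}, "geometry": None},
        {"type": "Feature", "properties": {"YKR_ID": 5785641}, "geometry": None},
    ],
}


def read_population(text):
    """Return {YKR_ID: {hour column: share in percent}} from CSV text."""
    rows = {}
    for record in csv.DictReader(io.StringIO(text)):
        ykr_id = int(record.pop("YKR_ID"))
        rows[ykr_id] = {hour: float(value) for hour, value in record.items()}
    return rows


def column_totals(population):
    """Sum every hour column over all grid cells; each should be close to 100."""
    totals = {}
    for shares in population.values():
        for hour, value in shares.items():
            totals[hour] = totals.get(hour, 0.0) + value
    return totals


def join_to_grid(grid, population):
    """Copy the hourly shares into the properties of the matching grid cell."""
    joined = []
    for feature in grid["features"]:
        ykr_id = feature["properties"]["YKR_ID"]
        if ykr_id in population:
            properties = {**feature["properties"], **population[ykr_id]}
            joined.append({**feature, "properties": properties})
    return joined


population = read_population(SAMPLE_CSV)
totals = column_totals(population)
joined = join_to_grid(SAMPLE_GRID, population)
```

With the real files, `totals` would hold one entry per hour column, and a value far from 100 would suggest that rows were lost or duplicated before the join.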

License: Creative Commons Attribution 4.0 International.

Related datasets

Järv, Olle; Tenkanen, Henrikki & Toivonen, Tuuli. (2017). Multi-temporal function-based dasymetric interpolation tool for mobile phone data. Zenodo. https://doi.org/10.5281/zenodo.252612

Tenkanen, Henrikki, & Toivonen, Tuuli. (2019). Helsinki Region Travel Time Matrix [Data set]. Zenodo. http://doi.org/10.5281/zenodo.3247564
International Datasets
kaggle.com
Updated Jun 27, 2017
+ more versions
US Census Bureau (2017). International Datasets [Dataset]. https://www.kaggle.com/census/international-data/code
Dataset updated
Jun 27, 2017
Dataset provided by
Kaggle http://kaggle.com/
Authors
US Census Bureau
Description
Content

The United States Census Bureau’s International Dataset provides estimates of country populations since 1950 and projections through 2050. Specifically, the data set includes midyear population figures broken down by age and gender assignment at birth. Additionally, they provide time-series data for attributes including fertility rates, birth rates, death rates, and migration rates.

The full documentation is available here. For basic field details, please see the data dictionary.

Note: The U.S. Census Bureau provides estimates and projections for countries and areas that are recognized by the U.S. Department of State that have a population of at least 5,000.

Acknowledgements

This dataset was created by the United States Census Bureau.

Inspiration

Which countries have made the largest improvements in life expectancy? Based on current trends, how long will it take each country to catch up to today’s best performers?

Use this dataset with BigQuery

You can use Kernels to analyze, share, and discuss this data on Kaggle, but if you’re looking for real-time updates and bigger data, check out the data on BigQuery, too: https://cloud.google.com/bigquery/public-data/international-census.
Current Population Survey (CPS)
search.dataone.org
dataverse.harvard.edu
Updated Nov 21, 2023
Damico, Anthony (2023). Current Population Survey (CPS) [Dataset]. http://doi.org/10.7910/DVN/AK4FDD
Unique identifier
https://doi.org/10.7910/DVN/AK4FDD
Dataset updated
Nov 21, 2023
Dataset provided by
Harvard Dataverse
Authors
Damico, Anthony
Description
analyze the current population survey (cps) annual social and economic supplement (asec) with r

the annual march cps-asec has been supplying the statistics for the census bureau's report on income, poverty, and health insurance coverage since 1948. wow. the us census bureau and the bureau of labor statistics (bls) tag-team on this one. until the american community survey (acs) hit the scene in the early aughts (2000s), the current population survey had the largest sample size of all the annual general demographic data sets outside of the decennial census - about two hundred thousand respondents. this provides enough sample to conduct state- and a few large metro area-level analyses. your sample size will vanish if you start investigating subgroups by state - consider pooling multiple years. county-level is a no-no.

despite the american community survey's larger size, the cps-asec contains many more variables related to employment, sources of income, and insurance - and can be trended back to harry truman's presidency. aside from questions specifically asked about an annual experience (like income), many of the questions in this march data set should be treated as point-in-time statistics. cps-asec generalizes to the united states non-institutional, non-active duty military population.

the national bureau of economic research (nber) provides sas, spss, and stata importation scripts to create a rectangular file (rectangular data means only person-level records; household- and family-level information gets attached to each person). to import these files into r, the parse.SAScii function uses nber's sas code to determine how to import the fixed-width file, then RSQLite to put everything into a schnazzy database. you can try reading through the nber march 2012 sas importation code yourself, but it's a bit of a proc freak show.

this new github repository contains three scripts:

2005-2012 asec - download all microdata.R
download the fixed-width file containing household, family, and person records
import by separating this file into three tables, then merge 'em together at the person-level
download the fixed-width file containing the person-level replicate weights
merge the rectangular person-level file with the replicate weights, then store it in a sql database
create a new variable - one - in the data table

2012 asec - analysis examples.R
connect to the sql database created by the 'download all microdata' program
create the complex sample survey object, using the replicate weights
perform a boatload of analysis examples

replicate census estimates - 2011.R
connect to the sql database created by the 'download all microdata' program
create the complex sample survey object, using the replicate weights
match the sas output shown in the png file below

2011 asec replicate weight sas output.png
statistic and standard error generated from the replicate-weighted example sas script contained in this census-provided person replicate weights usage instructions document.

click here to view these three scripts

for more detail about the current population survey - annual social and economic supplement (cps-asec), visit:
the census bureau's current population survey page
the bureau of labor statistics' current population survey page
the current population survey's wikipedia article

notes:

interviews are conducted in march about experiences during the previous year. the file labeled 2012 includes information (income, work experience, health insurance) pertaining to 2011. when you use the current population survey to talk about america, subtract a year from the data file name.

as of the 2010 file (the interview focusing on america during 2009), the cps-asec contains exciting new medical out-of-pocket spending variables most useful for supplemental (medical spending-adjusted) poverty research.

confidential to sas, spss, stata, sudaan users: why are you still rubbing two sticks together after we've invented the butane lighter? time to transition to r. :D
ReferenceUSA Historical Consumer Datasets
arizona.figshare.com
Updated Aug 6, 2024
+ more versions
University of Arizona Libraries (2024). ReferenceUSA Historical Consumer Datasets [Dataset]. http://doi.org/10.25422/azu.data.26222102.v1
Unique identifier
https://doi.org/10.25422/azu.data.26222102.v1
Dataset updated
Aug 6, 2024
Dataset provided by
University of Arizona Research Data Repository
Authors
University of Arizona Libraries
License
http://rightsstatements.org/vocab/InC/1.0/
Description
Dataset available only to University of Arizona affiliates. To obtain access, you must log in to ReDATA with your NetID. Data is for research use by each individual downloader only. Sharing and/or redistribution of any portion of this dataset is prohibited.

This ReferenceUSA dataset from Data Axle (formerly Infogroup) contains household data about US consumers in annual snapshots from 2006-2021. It includes details such as family demographics, income, home ownership status, lifestyle, location and more, which can help users to create marketing plans and conduct competitive analyses.

Consumer profiles are described with 58-66 indicators. Data for all states are combined into single files for each year between 2006 and 2012, while there is a file for each state in 2013-2021. The Layout - Consumer DB Historical 2006-2012.xlsx in Documentation.zip applies to 2006-2012. Codebooks for 2013, 2014, 2015, 2017, 2018, 2019 and 2021 are not included, but files in 2013-2021 have similar layouts, so 2016 Historical Residential File Layout.xlsx and 2020 Historical Residential File Layout.xlsx in Documentation.zip apply to 2013-2021.

The University of Arizona University Libraries also subscribe to Data Axle Reference Solutions, which provides this data in a searchable, online database with historical data available going back to 2003.

NOTE: The uncompressed datasets are very large.

Detailed file descriptions and MD5 hash values for each file can be found in the README.txt file.

For inquiries regarding the contents of this dataset, please contact the Corresponding Author listed in the README.txt file. Administrative inquiries (e.g., removal requests, trouble downloading, etc.) can be directed to data-management@arizona.edu
Mid-year population estimates - Dataset - Open Data NI
admin.opendatani.gov.uk
Updated Jul 9, 2024
+ more versions
(2024). Mid-year population estimates - Dataset - Open Data NI [Dataset]. https://admin.opendatani.gov.uk/dataset/mye01t09
Dataset updated
Jul 9, 2024
License
Open Government Licence 3.0 http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
Description
Description of Data

Notes: The estimates are produced using a variety of data sources and statistical models. Therefore small estimates should not be taken to refer to particular individuals. The migration element of the components of change has been largely derived from a data source which is known to be deficient in recording young adult males and outflows from Northern Ireland. Therefore the estimates are subject to adjustment to account for this and, while deemed acceptable for their use, will not provide definitive numbers of the population in the reported groups/areas. Further information is available in the Limitations section of the statistical bulletin: NISRA 2023 Mid-year Population Estimates webpage

Time Period

Estimates are provided for mid-2001 to mid-2023.

Methodology

The cohort-component method was used to create the population estimates for 2023. This method updates the Census estimates by 'ageing on' populations and applying information on births, deaths and migration. Further information is available at: NISRA 2023 Mid-year Population Estimates webpage

Geographic Referencing

Population Estimates are based on a large number of secondary datasets. Where the full address was available, the Pointer Address database was used to allocate a unique property reference number (UPRN) and geo-spatial co-ordinates to each home address. These can then be used to map the address to particular geographies. Where it was not possible to assign a unique property reference number to an address using the Pointer database, or where the secondary dataset contained only postcode information, the Central Postcode Directory was used to map home address postcodes to higher geographies. A small proportion of records with unknown geography were apportioned based on the spatial characteristics of known records.

Further Information

NISRA Mid-year Population Estimates webpage

Contact: NISRA Customer Services, 02890 255156, census@nisra.gov.uk

Responsible Statistician: Shauna Dunlop
SFDPH reporting - demographics population estimates
data.sfgov.org
csv, xlsx, xml
Updated Mar 27, 2025
American Community Survey (2025). SFDPH reporting - demographics population estimates [Dataset]. https://data.sfgov.org/widgets/cedd-86uf?mobile_redirect=true
Available download formats: xlsx, csv, xml
Dataset updated
Mar 27, 2025
Dataset authored and provided by
American Community Survey
License
ODC Public Domain Dedication and Licence (PDDL) v1.0 http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
Description
This filtered view contains the population estimates for San Francisco demographic groups from the U.S. Census Bureau’s American Community Survey that are used in the Department of Public Health’s public reporting. Details on the underlying demographic data from the American Community Survey are available below. The demographics included are race/ethnicity and age groups. Different age groups are used for case reporting than for vaccination reporting. The specific groups used in each of these reports can be found by using the "reporting_segment" column. We are using 2016-2020 ACS estimates in our public reporting, but additional years are included in this view as well for historical purposes.

The COVID-19 reports which use this data are available on SF.gov by clicking here.

San Francisco Population and Demographic Census data dataset filtered on:
"reporting_segment" =
'COVID-19 cases/testing reporting - age brackets'
OR 'Seasonal vaccine reporting - age groups'
OR 'from census table B03002' (race/ethnicity)
AND "geography_name" = 'San Francisco County, California'
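The filter above can be read as a row predicate. The sketch below is one interpretation, not the portal's own query: it assumes the three OR'd values all apply to "reporting_segment", that the AND on "geography_name" applies to the whole group, and that "(race/ethnicity)" is an annotation rather than part of the stored value. The sample rows and the `row_id` field are hypothetical.

```python
# Segment values copied from the filter description; the grouping is an assumption.
SEGMENTS = {
    "COVID-19 cases/testing reporting - age brackets",
    "Seasonal vaccine reporting - age groups",
    "from census table B03002",
}
GEOGRAPHY = "San Francisco County, California"


def in_filtered_view(row):
    """True when the row matches one of the segments AND the county geography."""
    return (
        row.get("reporting_segment") in SEGMENTS
        and row.get("geography_name") == GEOGRAPHY
    )


# Hypothetical rows, for illustration only.
rows = [
    {"row_id": 1, "reporting_segment": "Seasonal vaccine reporting - age groups",
     "geography_name": GEOGRAPHY},
    {"row_id": 2, "reporting_segment": "Seasonal vaccine reporting - age groups",
     "geography_name": "California"},
    {"row_id": 3, "reporting_segment": "some other segment",
     "geography_name": GEOGRAPHY},
]

kept = [row for row in rows if in_filtered_view(row)]
```

Only the first sample row passes: the second fails on geography and the third on segment.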
A. SUMMARY This dataset contains population and demographic estimates and associated margins of error obtained and derived from the US Census. The data is presented over multiple years and geographies. The data is sourced primarily from the American Community Survey.

B. HOW THE DATASET IS CREATED The raw data is obtained from the census API. Some estimates are published as-is and some are derived.

C. UPDATE PROCESS New estimates and years of data are appended to this dataset. To request additional census data for San Francisco, email support@datasf.org

D. HOW TO USE THIS DATASET The dataset is long and contains multiple estimates, years and geographies. To use this dataset, you can filter by the overall segment which contains information about the source, years, geography, demographic category and reporting segment. For census data used in specific reports, you can filter to the reporting segment. To use a subset of the data, you can create a filtered view. More information on how to filter data and create a view can be found here.
Demographic Dataset
universe.roboflow.com
zip
Updated Apr 3, 2025
BaoDungVu (2025). Demographic Dataset [Dataset]. https://universe.roboflow.com/baodungvu/demographic-zeevc/dataset/7
Available download formats: zip
Dataset updated
Apr 3, 2025
Dataset authored and provided by
BaoDungVu
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Age Gender
Description
Demographic

## Overview

Demographic is a dataset for classification tasks - it contains Age Gender annotations for 200 images.

## Getting Started

You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

## License

This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
Global Population Count Grid Time Series Estimates - Dataset - NASA Open...
data.nasa.gov
data.staging.idas-ds1.appdat.jsc.nasa.gov
Updated Apr 23, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
nasa.gov (2025). Global Population Count Grid Time Series Estimates - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/global-population-count-grid-time-series-estimates
Explore at:
Dataset updated
Apr 23, 2025
Dataset provided by
NASA http://nasa.gov/
Description
The Global Population Count Grid Time Series Estimates provide a back-cast time series of population grids based on the year 2000 population grid from SEDAC's Global Rural-Urban Mapping Project, Version 1 (GRUMPv1) data set. The grids were created by using rates of population change between decades from the coarser resolution History Database of the Global Environment (HYDE) database to back-cast the GRUMPv1 population count grids. Mismatches between the spatial extent of the HYDE calculated rates and GRUMPv1 population data were resolved via infilling rate cells based on a focal mean of values. Finally, the grids were adjusted so that the population totals for each country equaled the UN World Population Prospects (2008 Revision) estimates for that country for the respective year (1970, 1980, 1990, and 2000). These data do not represent census observations for the years prior to 2000, and therefore can at best be thought of as estimations of the populations in given locations. The population grids are consistent internally within the time series, but are not recommended for use in creating longer time series with any other population grids, including GRUMPv1, Gridded Population of the World, Version 4 (GPWv4), or non-SEDAC developed population grids. These population grids served as an input to SEDAC's Global Estimated Net Migration Grids by Decade: 1970-2000 data set.
L2 Voter and Demographic Dataset
redivis.com
stanford.redivis.com
application/jsonl +7
Updated Aug 5, 2025
+ more versions
Stanford University Libraries (2025). L2 Voter and Demographic Dataset [Dataset]. http://doi.org/10.57761/jnrs-nf57
Available download formats: sas, arrow, csv, parquet, application/jsonl, spss, avro, stata
Unique identifier
https://doi.org/10.57761/jnrs-nf57
Dataset updated
Aug 5, 2025
Dataset provided by
Redivis Inc.
Authors
Stanford University Libraries
Description
Abstract

The L2 Voter and Demographic Dataset includes demographic and voter history tables for all 50 states and the District of Columbia. The dataset is built from publicly available government records about voter registration and election participation. These records indicate whether a person voted in an election or not, but they do not record whom that person voted for. Voter registration and election participation data are augmented by demographic information from outside data sources.

Methodology

To create this file, L2 processes registered voter data on an ongoing basis for all 50 states and the District of Columbia, with refreshes of the underlying state voter data typically at least every six months and refreshes of telephone numbers and National Change of Address processing approximately every 30 to 60 days. These data are standardized and enhanced with proprietary commercial data and modeling codes and consist of approximately 185,000,000 records nationwide.

Usage

For each state, there are two available tables: demographic and voter history. The demographic and voter tables can be joined on the LALVOTERID variable. One can also use the LALVOTERID variable to link the L2 Voter and Demographic Dataset with the L2 Consumer Dataset.

In addition, the LALVOTERID variable can be used to validate the state. For example, let's look at the LALVOTERID = LALCA3169443. The characters in the fourth and fifth positions of this identifier are 'CA' (California). The second way to validate the state is by using the RESIDENCE_ADDRESSES_STATE variable, which should have a value of 'CA' (California).
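The two state checks just described can be sketched in Python. This is an illustration only: the function names are hypothetical, and the check that every identifier starts with "LAL" is an assumption inferred from the single example given, not a documented rule.

```python
def state_from_lalvoterid(voter_id):
    """Return the two characters in the fourth and fifth positions of the ID."""
    # Assumption: identifiers start with "LAL", as in the example LALCA3169443.
    if len(voter_id) < 5 or not voter_id.startswith("LAL"):
        raise ValueError(f"unexpected LALVOTERID format: {voter_id!r}")
    return voter_id[3:5]


def state_is_consistent(voter_id, residence_state):
    """Compare the state embedded in the ID with the residence state value."""
    return state_from_lalvoterid(voter_id) == residence_state.strip().upper()


example_state = state_from_lalvoterid("LALCA3169443")
example_ok = state_is_consistent("LALCA3169443", "CA")
```

A mismatch between the two values would flag a record worth inspecting before it is joined to another table.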

The date appended to each table name represents when the data was last updated. These dates will differ state by state because states update their voter files at different cadences.

The demographic files use 698 consistent variables. For more information about these variables, see 2025-01-10-VM2-File-Layout.xlsx.

The voter history files have different variables depending on the state. The 2025-08-05-L2-Voter-Dictionaries.tar.gz file contains .csv data dictionaries for each state's demographic and voter files. While the demographic file data dictionaries should mirror the 2025-01-10-VM2-File-Layout.xlsx file, the voter file data dictionaries will be unique to each state.

2025-04-24-National-File-Notes.pdf contains L2 Voter and Demographic Dataset ("National File") release notes from 2018 to 2025.

2025-08-05-L2-Voter-Fill-Rate.tar.gz contains .tab files tracking the percent of non-null values for any given field.

Predicting Credit Card Customer Segmentation
kaggle.com
Updated Mar 10, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
The Devastator (2024). Predicting Credit Card Customer Segmentation [Dataset]. https://www.kaggle.com/datasets/thedevastator/predicting-credit-card-customer-attrition-with-m
Dataset updated
Mar 10, 2024
Dataset provided by
Kaggle
Authors
The Devastator
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Predicting Credit Card Customer Segmentation

Exploring Key Customer Characteristics

By [source]

About this dataset

This dataset contains customer information collected from a consumer credit card portfolio, with the aim of helping analysts predict customer attrition. It includes demographic details such as age, gender, marital status and income category, as well as information about each customer’s relationship with the credit card provider, such as the card type, number of months on book and inactive periods. It also holds key data about customers’ spending behavior in the run-up to their churn decision, such as total revolving balance, credit limit and average open-to-buy rate, along with metrics such as the total amount change from quarter 4 to quarter 1, the average utilization ratio and a Naive Bayes classifier attrition flag (card category combined with contacts count in a 12-month period, dependent count, education level and months inactive). Together these variables capture up-to-date information that can indicate long-term account stability or an impending departure, which helps when managing a portfolio or serving individual customers.

How to use the dataset

This dataset can be used to analyze the key factors that influence customer attrition. Analysts can use this dataset to understand customer demographics, spending patterns, and relationship with the credit card provider to better predict customer attrition.

Research Ideas

Using customer demographics, such as gender, marital status, education level and income category, to determine which customer demographic is more likely to churn.

Analyzing the customer’s spending behavior leading up to churning and using this data to better predict the likelihood of a customer churning in the future.

Creating a classifier that can predict potential customers who are more susceptible to attrition based on their credit score, credit limit, utilization ratio and other spending behavior metrics over time; this could be used as an early warning system for predicting potential attrition before it happens.

Acknowledgements

If you use this dataset in your research, please credit the original authors. Data Source

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: BankChurners.csv

| Column name | Description |
|:----------------|:--------------------------------------------------------------------|
| CLIENTNUM | Unique identifier for each customer. (Integer) |
| Attrition_Flag | Flag indicating whether or not the customer has churned out. (Boolean) |
| Customer_Age | Age of customer. (Integer) |
| Gender | Gender of customer. (String) |
| Dependent_count | Number of dependents that customer has. (Integer) |
| Education_Level ...
