Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The National Health and Nutrition Examination Survey (NHANES) provides data on the health and environmental exposure of the non-institutionalized US population. Such data have considerable potential for understanding how the environment and behaviors impact human health. These data are also currently leveraged to answer public health questions such as the prevalence of disease. However, these data must first be processed before new insights can be derived through large-scale analyses. NHANES data are stored across hundreds of files with multiple inconsistencies. Correcting such inconsistencies takes systematic cross-examination and considerable effort but is required for accurately and reproducibly characterizing the associations between the exposome and diseases (e.g., cancer mortality outcomes). Thus, we developed a set of curated and unified datasets and accompanying code by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous NHANES (1999-2018), totaling 134,310 participants and 4,740 variables. The variables convey 1) demographic information, 2) dietary consumption, 3) physical examination results, 4) occupation, 5) questionnaire items (e.g., physical activity, general health status, medical conditions), 6) medications, 7) mortality status linked from the National Death Index, 8) survey weights, 9) environmental exposure biomarker measurements, and 10) chemical comments that indicate which measurements are below or above the lower limit of detection. We also provide a data dictionary listing the variables and their descriptions to help researchers browse the data, and R markdown files with example code for calculating summary statistics and running regression models to help accelerate high-throughput analysis of the exposome and secular trends in cancer mortality.
CSV Data Record: The curated NHANES datasets and the data dictionaries include 13 .csv files and 1 Excel file. The curated NHANES datasets comprise 10 .csv files, one for each module, labeled as follows: 1) mortality, 2) dietary, 3) demographics, 4) response, 5) medications, 6) questionnaire, 7) chemicals, 8) occupation, 9) weights, and 10) comments. The 11th file is a dictionary that lists the variable name, description, module, category, units, CAS Number, comment use, chemical family, chemical family shortened, number of measurements, and cycles available for all 4,740 variables in NHANES ("dictionary_nhanes.csv"). The 12th file contains the harmonized categories for the categorical variables ("dictionary_harmonized_categories.csv"). The 13th file contains the dictionary of descriptors for the drug codes ("dictionary_drug_codes.csv"). The 14th file is an Excel file that contains the cleaning documentation, which records all the inconsistencies for all affected variables to help curate each of the NHANES datasets ("nhanes_inconsistencies_documentation.xlsx").
R Data Record: For researchers who want to conduct their analysis in the R programming language, the curated NHANES datasets and the data dictionaries can be downloaded as a .zip file, which includes an .RData file and an .R file. The .RData file contains all the aforementioned datasets as R data objects ("w - nhanes_1988_2018.RData"). This .RData file also contains the R scripts for all customized functions that were written to curate the data. The .R file shows how we used the customized functions (i.e., our pipeline) to curate the data ("m - nhanes_1988_2018.R").
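Because the curated data are split into one file per module, most analyses start by joining modules on a shared participant identifier. The following is a minimal Python sketch of that step, not the authors' pipeline. The column names (SEQN, age, sex, deceased) and the inline sample rows are hypothetical stand-ins; the actual headers should be looked up in dictionary_nhanes.csv.

```python
import csv
import io

# Hypothetical miniature stand-ins for two module files. The real files
# are far larger and their column names may differ (assumption).
DEMOGRAPHICS_CSV = """SEQN,age,sex
1,34,F
2,51,M
3,47,F
"""
MORTALITY_CSV = """SEQN,deceased
1,0
3,1
"""

def read_module(text, key="SEQN"):
    """Index the rows of one module by participant identifier."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(text))}

def left_join(base, other):
    """Attach the columns of `other` to every participant in `base`.

    Participants absent from `other` keep only their base columns.
    """
    merged = {}
    for pid, row in base.items():
        combined = dict(row)
        combined.update(other.get(pid, {}))
        merged[pid] = combined
    return merged

demographics = read_module(DEMOGRAPHICS_CSV)
mortality = read_module(MORTALITY_CSV)
merged = left_join(demographics, mortality)

for pid, row in merged.items():
    print(pid, row)
```

With the downloaded files, the io.StringIO(text) call would be replaced by an opened file handle, for example open(path, newline=""). Any population-level estimate would additionally need the survey weights from the weights module, which this sketch does not apply.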
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The dataset, Survey-SR, provides the nutrient data for assessing dietary intakes from the national survey What We Eat In America, National Health and Nutrition Examination Survey (WWEIA, NHANES). Historically, USDA databases have been used for national nutrition monitoring (1). Currently, the Food and Nutrient Database for Dietary Studies (FNDDS) (2) is used by the Food Surveys Research Group, ARS, to process dietary intake data from WWEIA, NHANES. Nutrient values for FNDDS are based on Survey-SR. Survey-SR was referred to as the "Primary Data Set" in older publications. Early versions of the dataset were composed mainly of commodity-type items such as wheat flour, sugar, and milk. However, with increased consumption of commercial processed and restaurant foods and changes in how national nutrition monitoring data are used (1), many commercial processed and restaurant items have been added to Survey-SR.
The current version, Survey-SR 2013-2014, is mainly based on the USDA National Nutrient Database for Standard Reference (SR) 28 (2) and contains sixty-six nutrients each for 3,404 foods. These nutrient data will be used for assessing intake data from WWEIA, NHANES 2013-2014. Nutrient profiles were added for 265 new foods and updated for about 500 foods from the version used for the previous survey (WWEIA, NHANES 2011-12). New foods added include mainly commercially processed foods such as several gluten-free products, milk substitutes, sauces and condiments such as sriracha, pesto and wasabi, Greek yogurt, breakfast cereals, low-sodium meat products, whole grain pastas and baked products, and several beverages including bottled tea and coffee, coconut water, malt beverages, hard cider, fruit-flavored drinks, fortified fruit juices and fruit and/or vegetable smoothies. Several school lunch pizzas and chicken products, fast-food sandwiches, and new beef cuts were also added, as they are now reported more frequently by survey respondents. Nutrient profiles were updated for several commonly consumed foods such as cheddar, mozzarella and American cheese, ground beef, butter, and catsup. The changes in nutrient values may be due to reformulations in products, changes in the market shares of brands, or more accurate data. Examples of more accurate data include analytical data, market share data, and data from a nationally representative sample.
Resources in this dataset:
Resource Title: USDA National Nutrient Database for Standard Reference Dataset for What We Eat In America, NHANES 2013-14 (Survey SR 2013-14).
File Name: SurveySR_2013_14 (1).zip
Resource Description: Access database downloaded on November 16, 2017. US Department of Agriculture, Agricultural Research Service, Nutrient Data Laboratory. USDA National Nutrient Database for Standard Reference Dataset for What We Eat In America, NHANES (Survey-SR), October 2015.
Resource Title: Data Dictionary.
File Name: SurveySR_DD.pdf
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Human papillomavirus (HPV) infection is an important carcinogenic infection that is highly prevalent in many populations. However, independent influencing factors and predictive models for HPV infection in both U.S. and Beijing females have rarely been confirmed. In this study, our first objective was to explore the HPV infection-related factors shared by U.S. and Beijing females. Our second objective was to develop an R package for identifying the top-performing prediction models and to build predictive models for HPV infection using this package.
Methods: This cross-sectional study used data from the 2009–2016 NHANES (a national population-based study) and 2019 data on Beijing female union workers from various industries. Prevalence, potential influencing factors, and predictive models for HPV infection in both cohorts were explored.
Results: The analyses included 2,259 participants in the NHANES cohort (age: 20–59 years) and 1,593 participants in the Beijing female cohort (age: 20–70 years). The HPV infection rates in the U.S. NHANES and Beijing female cohorts were 45.73% and 8.22%, respectively. The number of male sex partners, marital status, and history of HPV infection were the predominant factors that influenced HPV infection in both cohorts. However, condom use was not an independent influencing factor for HPV infection in either cohort. The R package Modelbest was established. The nomogram developed with the Modelbest package performed better than the nomogram that included only the factors significant in multivariate regression analysis.
Conclusion: Despite the widespread availability of HPV vaccines, HPV infection is still prevalent. Compared with condom promotion, avoidance of multiple sexual partners seems to be more effective for preventing HPV infection. Nomograms developed with Modelbest can provide improved personalized risk assessment for HPV infection. Our R package Modelbest has the potential to be a powerful tool for future predictive model studies.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Melanoma is the fourth leading cause of cancer-related death worldwide. The continuous exploration and reporting of risk factors of melanoma is important for standardizing and reducing the incidence of the disease. Calcium signaling is a promising therapeutic target for melanoma; however, the relationship between total serum calcium levels and melanoma development remains unclear.
Methods: In this study, we included patients with melanoma from the National Health and Nutrition Examination Survey (NHANES) database from 2003 to 2006 and from 2009 to 2016. The baseline clinical characteristics of the participants were analyzed using the chi-square and rank-sum tests. Subsequently, a fitted model was constructed to evaluate the relationship between total serum calcium levels and melanoma development. The performance of total serum calcium levels and covariates in predicting the risk of melanoma was assessed based on ROC curves. Finally, LASSO regression analysis was performed using the "glmnet" R package to identify clinical characteristics associated with melanoma.
Results: A total of 13,432 participants were included in this study. Age, race, household poverty-to-income ratio, response of the skin to sunlight after a certain period of non-exposure, wearing long-sleeved shirts, frequency of sunscreen use, and arthritis were significantly correlated with the development of melanoma. The p-values of total serum calcium levels in three fitted models were
Population-based county-level estimates for diagnosed (DDP), undiagnosed (UDP), and total diabetes prevalence (TDP) were acquired from the Institute for Health Metrics and Evaluation (IHME) for the years 2004-2012 (Evaluation 2017). Prevalence estimates were calculated using a two-stage approach. The first stage used National Health and Nutrition Examination Survey (NHANES) data to predict high fasting plasma glucose (FPG) levels (≥126 mg/dL) and/or hemoglobin A1C (HbA1C) levels (≥6.5% [48 mmol/mol]) based on self-reported demographic and behavioral characteristics (Dwyer-Lindgren, Mackenbach et al. 2016). This model was then applied to Behavioral Risk Factor Surveillance System (BRFSS) data to impute high FPG and/or HbA1C status for each BRFSS respondent (Dwyer-Lindgren, Mackenbach et al. 2016). The second stage used the imputed BRFSS data to fit a series of small area models, which were used to predict the county-level prevalence of each of the diabetes-related outcomes (Dwyer-Lindgren, Mackenbach et al. 2016). Diagnosed diabetes was defined as the proportion of adults (age 20+ years) who reported a previous diabetes diagnosis, represented as an age-standardized prevalence percentage. Undiagnosed diabetes was defined as the proportion of adults (age 20+ years) who had a high FPG or HbA1C but did not report a previous diagnosis of diabetes. Total diabetes was defined as the proportion of adults (age 20+ years) who reported a previous diabetes diagnosis and/or had a high FPG or HbA1C. The age-standardized diabetes prevalence (%) was used as the outcome.
The Environmental Quality Index (EQI) was constructed for 2000-2005 for all US counties and is composed of five domains (air, water, built, land, and sociodemographic), each composed of variables to represent the environmental quality of that domain.
Domain-specific EQIs were developed using principal components analysis (PCA) to reduce these variables within each domain while the overall EQI was constructed from a second PCA from these individual domains (L. C. Messer et al., 2014). To account for differences in environment across rural and urban counties, the overall and domain-specific EQIs were stratified by rural urban continuum codes (RUCCs) (U.S. Department of Agriculture, 2015). This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. It can be accessed through the following means: Human health data are not available publicly. EQI data are available at: https://edg.epa.gov/data/Public/ORD/NHEERL/EQI. Format: Data are stored as csv files. This dataset is associated with the following publication: Jagai, J., A. Krajewski, S. Shaikh, D. Lobdell, and R. Sargis. Association between environmental quality and diabetes in the U.S.A.. Journal of Diabetes Investigation. John Wiley & Sons, Inc., Hoboken, NJ, USA, 11(2): 315-324, (2020).
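The three diabetes outcome definitions in this record can be written as a small classification rule. The Python sketch below is an illustration under stated assumptions, not the IHME method: it applies the thresholds given in the text to individual measurements and computes a crude, unweighted percentage, whereas the published estimates come from imputation, small area models, and age standardization. The function and field names are hypothetical.

```python
# Thresholds as stated in the record.
FPG_THRESHOLD_MG_DL = 126.0
HBA1C_THRESHOLD_PCT = 6.5

def diabetes_status(reported_diagnosis, fpg_mg_dl=None, hba1c_pct=None):
    """Classify one adult under the three outcome definitions.

    A missing measurement (None) is treated as not high, which is a
    simplifying assumption of this sketch.
    """
    high_fpg = fpg_mg_dl is not None and fpg_mg_dl >= FPG_THRESHOLD_MG_DL
    high_a1c = hba1c_pct is not None and hba1c_pct >= HBA1C_THRESHOLD_PCT
    high_glycemia = high_fpg or high_a1c
    diagnosed = bool(reported_diagnosis)
    return {
        "diagnosed": diagnosed,
        "undiagnosed": high_glycemia and not diagnosed,
        "total": diagnosed or high_glycemia,
    }

def crude_prevalence_pct(statuses, outcome):
    """Unweighted percentage of people positive for one outcome."""
    return 100.0 * sum(s[outcome] for s in statuses) / len(statuses)

# Four made-up adults, for illustration only.
people = [
    diabetes_status(True, fpg_mg_dl=110.0, hba1c_pct=6.0),
    diabetes_status(False, fpg_mg_dl=130.0, hba1c_pct=6.0),
    diabetes_status(False, fpg_mg_dl=100.0, hba1c_pct=6.7),
    diabetes_status(False, fpg_mg_dl=95.0, hba1c_pct=5.4),
]
for outcome in ("diagnosed", "undiagnosed", "total"):
    print(outcome, crude_prevalence_pct(people, outcome))
```

By construction, every person counted as diagnosed or undiagnosed is also counted in the total, so total prevalence equals the sum of the other two.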
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Performance of classification models (testing data).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Estimated US population prevalence and number needed to treat to avoid all-cause mortality in select NHANES patient subgroups.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Objective: Hearing loss can cause speech and language delays, communication barriers, and learning problems. Such factors are associated with reduced academic achievement, social isolation, decreased quality of life, and poorer health outcomes. We use a national cohort of children to examine how subclinical hearing loss is associated with academic/educational performance. The goal of this study is to determine if different levels of subclinical hearing loss (pure tone average ≤ 25 dB HL) are associated with educational testing outcomes in children.
Design: Analysis of children 6–16 years old who participated in the National Health and Nutrition Examination Survey (NHANES-III, 1988–1994) was performed. Air-conduction thresholds were measured at 0.5, 1, 2, 4, 6, and 8 kHz. A four-frequency pure-tone average (PTA) was calculated from 0.5, 1, 2, and 4 kHz. Hearing thresholds were divided into categories (≤ 0, 1–10, and 11–25 dB) for analysis. The outcomes of interest were the Wide Range Achievement Test (WRAT-R) and Wechsler Intelligence Scale for Children (WISC-R). Analysis was conducted using ANOVA and logistic regression.
Results: We analyzed 3,965 participants. In univariable analysis, the average scores in scaled math, reading, digit span (short-term memory), and block design (visual-motor skills) were significantly lower with worsening hearing categories (p < 0.01). In multivariable regression, PTAs of 1–10 dB HL (OR 1.72, 95% CI 1.29–2.29, p < 0.01) and 11–25 dB HL (OR 2.99, 95% CI 1.3–6.65, p < 0.01), compared to PTA of ≤0 dB HL, were associated with poor reading test performance (
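The four-frequency pure-tone average and the hearing categories described in the Design section can be sketched in Python as follows. This is an illustration, not the study's code. Two details are assumptions: the record does not say how non-integer averages falling between the listed ranges (for example 0.5 or 10.5 dB) were assigned, so the sketch uses upper-bound cutoffs at 0, 10, and 25 dB, and it does not say which ear the average was taken from.

```python
# Frequencies (kHz) that enter the four-frequency average, per the record.
PTA_FREQUENCIES_KHZ = (0.5, 1, 2, 4)

def four_frequency_pta(thresholds_db):
    """Mean air-conduction threshold (dB HL) over 0.5, 1, 2, and 4 kHz.

    `thresholds_db` maps frequency in kHz to threshold in dB HL; the
    6 and 8 kHz thresholds, if present, are ignored.
    """
    values = [thresholds_db[f] for f in PTA_FREQUENCIES_KHZ]
    return sum(values) / len(values)

def hearing_category(pta_db):
    """Assign the analysis category using upper-bound cutoffs (assumption)."""
    if pta_db <= 0:
        return "<=0 dB"
    if pta_db <= 10:
        return "1-10 dB"
    if pta_db <= 25:
        return "11-25 dB"
    return ">25 dB (outside the subclinical range)"

# Made-up thresholds for one ear, for illustration only.
example = {0.5: 5, 1: 10, 2: 5, 4: 10, 6: 15, 8: 20}
pta = four_frequency_pta(example)
print(pta, hearing_category(pta))
```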
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Infection status by nicotine exposure.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Characteristics of NHANES 1999–2002 participants by quartiles (Q) of red blood cell (RBC) folate and dietary folate equivalents (DFE)1.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Performance of 5-fold cross-validation of classification models (training data; mean ± SD).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Key parameters of various machine learning models used for regression task.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Performance of 5-fold cross-validation of predictive models (training data).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Endometriosis is a common chronic inflammatory and estrogen-dependent disease that mostly affects people of childbearing age. The dietary inflammatory index (DII) is a novel instrument for assessing the overall inflammatory potential of diet. However, no studies have shown the relationship between DII and endometriosis to date. This study aimed to elucidate the relationship between DII and endometriosis. Data were acquired from the National Health and Nutrition Examination Survey (NHANES) 2001–2006. DII was calculated using an inbuilt function in the R package. Relevant patient information was obtained through a questionnaire containing their gynecological history. Based on the endometriosis questionnaire item, participants who answered yes were considered cases (with endometriosis), and participants who answered no were considered controls (without endometriosis). Multivariate weighted logistic regression was applied to examine the correlation between DII and endometriosis. Subgroup analysis and smoothing curve fitting between DII and endometriosis were conducted as further investigation. Compared with the control group, patients tended to have a higher DII (P = 0.014). Adjusted multivariate regression models showed that DII was positively correlated with the incidence of endometriosis (P < 0.05). Analysis of subgroups revealed no significant heterogeneity. In middle-aged and older women (age ≥ 35 years), the smoothing curve fitting results demonstrated a non-linear relationship between DII and the prevalence of endometriosis. Therefore, using DII as an indicator of dietary-related inflammation may help to provide new insight into the role of diet in the prevention and management of endometriosis.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Hazard ratios (HR) of overall cancer and 95% confidence intervals (95% CI) by quartiles (Q) of red blood cell (RBC) folate, serum folate, and dietary folate equivalents (DFE) 1,2, NHANES 1999–2002.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Note: Bold represents P < 0.05.
Model 1: adjusted for age, sex, race/ethnicity.
Model 2: adjusted for model 1 + estimated glomerular filtration rate, albuminuria, hypertension, smoking status, body mass index, total cholesterol, HDL-cholesterol, self-reported cancer, aspartate aminotransferase, alanine aminotransferase, total bilirubin, alkaline phosphatase, hepatitis B virus core IgG status, hepatitis C virus IgG status, C-reactive protein, white blood cell count, and serum albumin.
*Between percentiles; 0.5 was used to indicate that this was between percentiles.
Association between gamma gap and all-cause mortality with gamma gap dichotomized at different cutpoints (Hazard Ratios, 95% CI).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Nonparametric correlation coefficients between log transformed HOMA-IR and HOMA-β.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Descriptive statistics of the study cohort.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Physical activity (PA) is important for students in secondary school; however, trends in PA among secondary school students have shown a significant decline. There is a need to understand the PA of middle school students.
Objective: The first objective is to identify the PA levels and screen time of students in middle school. The second objective is to examine the PA levels and screen time among students of different genders.
Methods: Participants from four consecutive two-year cycles of the National Health and Nutrition Examination Survey (NHANES, 2011–2012, 2013–2014, 2015–2016, and 2017–2018) were included in this study. A Spearman correlation model was used to identify the correlation between participants' demographics, PA, and screen time data. A negative binomial regression model was used to describe students' PA and screen time (dependent variables) in different grades (independent variable). Gender and age were taken as control variables.
Results: After data preprocessing, 2516 participants were included in this study. A significant correlation was found between grade and PA, but not between grade and screen time. Negative binomial regression shows that students have the lowest PA in their transition year, grade 6, and that their screen time decreased as grade increased. Significant differences were found across genders. Future efforts should focus on developing school transition support programs designed to improve PA.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Hazard ratios of overall cancer incidence and 95% confidence intervals (95% CI) by continuous levels of red blood cell (RBC) folate, serum folate, and dietary folate equivalents (DFE) 1, NHANES 1999–2002.