These datasets provide de-identified insurance data for diabetes. The data is provided by three managed care organizations in Allegheny County (Gateway Health Plan, Highmark Health, and UPMC) and represents their insured population for the 2015 and 2016 calendar years. Disclaimer: Users should be cautious of using administrative claims data as a measure of disease prevalence and interpreting trends over time, as data provided were collected for purposes other than surveillance. Limitations of these data include but are not limited to: misclassification, duplicate individuals, exclusion of individuals who did not seek care in past two years and those who are: uninsured, enrolled in plans not represented in the dataset, or were not enrolled in one of the represented plans for at least 90 days.
https://creativecommons.org/publicdomain/zero/1.0/
Context
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to predict, based on diagnostic measurements, whether a patient has diabetes.
Content
Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.
- Pregnancies: Number of times pregnant
- Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- BloodPressure: Diastolic blood pressure (mm Hg)
- SkinThickness: Triceps skin fold thickness (mm)
- Insulin: 2-hour serum insulin (mu U/ml)
- BMI: Body mass index (weight in kg/(height in m)^2)
- DiabetesPedigreeFunction: Diabetes pedigree function
- Age: Age (years)
- Outcome: Class variable (0 or 1)
Inspiration
Can you build a machine learning or deep learning model that accurately predicts whether the patients in the dataset have diabetes?
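Before any modeling, it can help to check a file against the nine columns listed in this entry. The following is a minimal sketch using only the Python standard library; the helper names (`load_rows`, `positive_rate`, `bmi`) and the two sample rows are made up for illustration and do not come from the dataset itself.

```python
import csv
import io

# Column names as listed in the dataset description.
COLUMNS = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
           "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]


def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index as defined in the description: kg / m^2."""
    return weight_kg / height_m ** 2


def load_rows(text: str) -> list[dict[str, float]]:
    """Parse CSV text with a header row and validate it against COLUMNS."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    rows = []
    for record in reader:
        row = {name: float(record[name]) for name in COLUMNS}
        if row["Outcome"] not in (0.0, 1.0):
            raise ValueError("Outcome must be 0 or 1")
        rows.append(row)
    return rows


def positive_rate(rows: list[dict[str, float]]) -> float:
    """Share of rows whose Outcome is 1."""
    return sum(r["Outcome"] for r in rows) / len(rows)


# Hypothetical rows, for illustration only; not taken from the real file.
SAMPLE = ",".join(COLUMNS) + "\n" + (
    "2,120,70,25,80,28.5,0.4,30,0\n"
    "5,160,85,30,150,35.0,0.9,45,1\n"
)

if __name__ == "__main__":
    rows = load_rows(SAMPLE)
    print(len(rows), positive_rate(rows))
```

Checking the class balance with something like `positive_rate` before training is a cheap way to decide whether plain accuracy is a sensible metric for the prediction task.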
Population-based county-level estimates for prevalence of diabetes control (DC) were obtained from the Institute for Health Metrics and Evaluation (IHME) for the years 2004-2012 (16). DC prevalence rate was defined as the proportion of people within a county who had previously been diagnosed with diabetes (high fasting plasma glucose ≥126 mg/dL, hemoglobin A1c (HbA1c) ≥6.5%, or diabetes diagnosis) but do not currently have high fasting plasma glucose or HbA1c for the period 2004-2012. DC prevalence estimates were calculated using a two-stage approach. The first stage used National Health and Nutrition Examination Survey (NHANES) data to predict high fasting plasma glucose (FPG) levels (≥126 mg/dL) and/or HbA1C levels (≥6.5% [48 mmol/mol]) based on self-reported demographic and behavioral characteristics (16). This model was then applied to Behavioral Risk Factor Surveillance System (BRFSS) data to impute high FPG and/or HbA1C status for each BRFSS respondent (16). The second stage used the imputed BRFSS data to fit a series of small area models, which were used to predict county-level prevalence of diabetes-related outcomes, including DC (16). The EQI was constructed for 2006-2010 for all US counties and is composed of five domains (air, water, built, land, and sociodemographic), each composed of variables to represent the environmental quality of that domain. Domain-specific EQIs were developed using principal components analysis (PCA) to reduce these variables within each domain, while the overall EQI was constructed from a second PCA from these individual domains (L. C. Messer et al., 2014). To account for differences in environment across rural and urban counties, the overall and domain-specific EQIs were stratified by rural urban continuum codes (RUCCs) (U.S. Department of Agriculture, 2015).
Results are reported as prevalence rate differences (PRD) with 95% confidence intervals (CIs) comparing the highest quintile/worst environmental quality to the lowest quintile/best environmental quality exposure metrics. PRDs are representative of the entire period of interest, 2004-2012. Due to the availability of DC data and covariate data, not all counties were captured; however, the majority (3134 of 3142) were used in the analysis. This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. It can be accessed through the following means: Human health data are not available publicly. EQI data are available at: https://edg.epa.gov/data/Public/ORD/NHEERL/EQI. Format: Data are stored as csv files. This dataset is associated with the following publication: Jagai, J., A. Krajewski, K. Price, D. Lobdell, and R. Sargis. Diabetes control is associated with environmental quality in the USA. Endocrine Connections. BioScientifica Ltd., Bristol, UK, 10(9): 1018-1026, (2021).
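To make the quintile comparison concrete, the sketch below bins counties into EQI quintiles by rank and takes the difference in mean prevalence between the highest (worst quality) and lowest (best quality) quintiles. This is only a crude, unadjusted illustration: the description does not state how the published PRDs or their 95% CIs were modeled, and the function names and toy values here are assumptions.

```python
from statistics import mean


def quintile_labels(values):
    """Assign each value a quintile from 1 (lowest) to 5 (highest) by rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, idx in enumerate(order):
        labels[idx] = rank * 5 // len(values) + 1
    return labels


def prevalence_rate_difference(eqi, prevalence):
    """Mean prevalence in the top EQI quintile minus the bottom quintile.

    A higher EQI is taken to mean worse environmental quality, matching
    the 'highest quintile/worst environmental quality' wording.
    No covariate adjustment and no confidence interval are computed.
    """
    labels = quintile_labels(eqi)
    top = [p for p, q in zip(prevalence, labels) if q == 5]
    bottom = [p for p, q in zip(prevalence, labels) if q == 1]
    return mean(top) - mean(bottom)


if __name__ == "__main__":
    # Toy values for ten hypothetical counties.
    eqi = [0.1 * i for i in range(10)]
    prevalence = [5, 5, 6, 6, 7, 7, 8, 8, 9, 9]
    print(prevalence_rate_difference(eqi, prevalence))
```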
This is a source dataset for a Let's Get Healthy California indicator at https://letsgethealthy.ca.gov/. This table displays the prevalence of diabetes in California. It contains data for California only. The data are from the California Behavioral Risk Factor Surveillance Survey (BRFSS). The California BRFSS is an annual cross-sectional health-related telephone survey that collects data about California residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services. The BRFSS is conducted by the Public Health Survey Research Program of California State University, Sacramento under contract from CDPH. This prevalence rate does not include pre-diabetes or gestational diabetes. It is based on the question: "Has a doctor, or nurse or other health professional ever told you that you have diabetes?" The sample size for 2014 was 8,832. NOTE: Denominator data and weighting were taken from the California Department of Finance, not the U.S. Census. Values may therefore differ from what has been published in the national BRFSS data tables by the Centers for Disease Control and Prevention (CDC) or other federal agencies.
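The note on weighting matters because a weighted prevalence can differ noticeably from a raw share of "yes" answers. A minimal sketch of the weighted estimate follows; the function name and the toy weights are illustrative assumptions and do not reflect the actual California BRFSS weighting procedure.

```python
def weighted_prevalence(responses, weights):
    """Weighted share of respondents answering 'yes'.

    responses: sequence of bools (True = told they have diabetes)
    weights:   sequence of positive survey weights, one per respondent
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    yes = sum(w for r, w in zip(responses, weights) if r)
    return yes / total


if __name__ == "__main__":
    # Toy data: half the respondents say yes, but they carry little weight.
    responses = [True, False, False, True]
    weights = [100, 300, 500, 100]
    print(weighted_prevalence(responses, weights))
```

With these toy numbers the unweighted share is 50% while the weighted estimate is 20%, which is why a different source of denominators and weights can shift the published values.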
This dataset was created by Yendoh Derek
Released under Other (specified in description)
This data set provides de-identified population data for diabetes and hypertension comorbidity prevalence in Allegheny County. The data is provided by three managed care organizations in Allegheny County (Gateway Health Plan, Highmark Health, and UPMC) and represents their insured population for the 2015 and 2016 calendar years. Disclaimer: Users should be cautious of using administrative claims data as a measure of disease prevalence and interpreting trends over time, as data provided were collected for purposes other than surveillance. Limitations of these data include but are not limited to: misclassification, duplicate individuals, exclusion of individuals who did not seek care in past two years and those who are: uninsured, enrolled in plans not represented in the dataset, or were not enrolled in one of the represented plans for at least 90 days.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by Omar Belfeki
Released under MIT
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Diabetes.csv and arff’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/amrikkatoch308/diabetescsv-and-arff on 14 February 2022.
--- No further description of dataset provided by original source ---
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is from a project investigating the role that diabetes self-management, knowledge, and management self-efficacy have on clinical targets among Type 2 diabetes patients in Thailand. Data have been de-identified. The patient data are in the file MontiFinal.csv, and a description of the variables contained therein is provided in DataDictionary.xls.
This dataset contains data from 204 participants from the pilot period of the AI-READI project (July 19, 2023 to November 30, 2023). Data from multiple modalities are included. The data in this dataset contain no protected health information (PHI). Information related to the sex and race/ethnicity of the participants as well as medication used has also been removed. A detailed description of the dataset is available in the AI-READI documentation for v1.0.0 of the dataset at https://docs.aireadi.org
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Diabetes is a chronic disease that must be constantly monitored, especially in cases of type 1 and 2 diabetes mellitus. Nowadays, technology is helping society to provide innovative solutions in the field of health through sensors or smart devices. In this field, continuous glucose sensors are a huge advance in the development of artificial intelligence algorithms capable of predicting glucose values or obtaining any type of relevant information to improve the quality of patients' health. Unfortunately, few datasets exist in this area. Therefore, this study aims to provide the scientific community with a dataset from a patient with type 1 diabetes covering the period from 2023/09/10 to 2024/05/13 (226 days with data).
Data Records
The data are recorded in a single file entitled glucose_data.csv. The file is in comma-separated values (CSV) format.
The following characteristics can be found in each row of the dataset:
The dataset contains a total of 41702 samples with an average of 185 samples per day. A summary of the samples per day can be found in the attached image.
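The per-day summary can be reproduced by grouping rows on the calendar date of each reading. The sketch below assumes a column named `timestamp` in `YYYY-MM-DD HH:MM:SS` format; both the column name and the format are hypothetical, since the description does not list the fields of glucose_data.csv, and the sample rows are invented.

```python
import csv
import io
from collections import Counter
from datetime import datetime


def samples_per_day(text, column="timestamp", fmt="%Y-%m-%d %H:%M:%S"):
    """Count rows per calendar day.

    `column` and `fmt` are assumptions; adjust them to the real file.
    """
    counts = Counter()
    for record in csv.DictReader(io.StringIO(text)):
        day = datetime.strptime(record[column], fmt).date()
        counts[day] += 1
    return counts


def average_per_day(counts):
    """Mean number of samples over the days that have data."""
    return sum(counts.values()) / len(counts)


# Invented rows for illustration only.
SAMPLE = (
    "timestamp,glucose\n"
    "2023-09-10 00:05:00,110\n"
    "2023-09-10 00:10:00,112\n"
    "2023-09-11 08:00:00,98\n"
)

if __name__ == "__main__":
    counts = samples_per_day(SAMPLE)
    print(average_per_day(counts))
    # Consistency check on the stated totals: 41702 samples over 226 days.
    print(round(41702 / 226))
```

The last line checks the stated figures against each other: 41702 samples over 226 days is about 184.5 per day, which rounds to the 185 quoted.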
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by Dayam Nadeem
Released under MIT
https://creativecommons.org/publicdomain/zero/1.0/
What is Diabetes Dataset - Pima Indians Dataset?
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage. The .csv file contains several variables: a number of independent variables (medical predictors) and a single dependent target variable (Outcome).
Acknowledgments
When we use this dataset in our research, we credit the authors as:
Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., & Johannes, R.S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications and Medical Care (pp. 261-265). IEEE Computer Society Press.
The dataset is also published for reuse in the Google research dataset.
License: CC0: Public Domain.
I uploaded this dataset to practice data analysis with my students: I teach at a college and want my students to apply what we study to a large dataset. It may not be up to date, and I have noted the collection years, but it is a good resource for practice.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is raw data from a cross-sectional study of 510 people living with diabetes attending the Queen Elizabeth Central Hospital diabetes clinic. Ethical approval for the study was granted by the College of Medicine Research and Ethics Committee (Ref: P.08/17/229). The data were collected between November 2017 and May 2018 using an interviewer-administered questionnaire that solicited data on participants' demographic and clinical characteristics, five social cognitive theory factors (self-efficacy, outcome expectations, knowledge, social support and barriers to self-management) and self-management (diet, exercise, foot care, medication, self-monitoring of blood glucose and smoking). The data were entered into a Microsoft Access database, then exported into Stata version 14.0 for cleaning and analysis.
This dataset was created by Niels B0hr
Population-based county-level estimates for diagnosed (DDP), undiagnosed (UDP), and total diabetes prevalence (TDP) were acquired from the Institute for Health Metrics and Evaluation (IHME) for the years 2004-2012 (Evaluation 2017). Prevalence estimates were calculated using a two-stage approach. The first stage used National Health and Nutrition Examination Survey (NHANES) data to predict high fasting plasma glucose (FPG) levels (≥126 mg/dL) and/or hemoglobin A1C (HbA1C) levels (≥6.5% [48 mmol/mol]) based on self-reported demographic and behavioral characteristics (Dwyer-Lindgren, Mackenbach et al. 2016). This model was then applied to Behavioral Risk Factor Surveillance System (BRFSS) data to impute high FPG and/or A1C status for each BRFSS respondent (Dwyer-Lindgren, Mackenbach et al. 2016). The second stage used the imputed BRFSS data to fit a series of small area models, which were used to predict the county-level prevalence of each of the diabetes-related outcomes (Dwyer-Lindgren, Mackenbach et al. 2016). Diagnosed diabetes was defined as the proportion of adults (age 20+ years) who reported a previous diabetes diagnosis, represented as an age-standardized prevalence percentage. Undiagnosed diabetes was defined as the proportion of adults (age 20+ years) who have a high FPG or HbA1C but did not report a previous diagnosis of diabetes. Total diabetes was defined as the proportion of adults (age 20+ years) who reported a previous diabetes diagnosis and/or had a high FPG/HbA1C. The age-standardized diabetes prevalence (%) was used as the outcome. The EQI was constructed for 2000-2005 for all US counties and is composed of five domains (air, water, built, land, and sociodemographic), each composed of variables to represent the environmental quality of that domain.
Domain-specific EQIs were developed using principal components analysis (PCA) to reduce these variables within each domain, while the overall EQI was constructed from a second PCA from these individual domains (L. C. Messer et al., 2014). To account for differences in environment across rural and urban counties, the overall and domain-specific EQIs were stratified by rural urban continuum codes (RUCCs) (U.S. Department of Agriculture, 2015). This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. It can be accessed through the following means: Human health data are not available publicly. EQI data are available at: https://edg.epa.gov/data/Public/ORD/NHEERL/EQI. Format: Data are stored as csv files. This dataset is associated with the following publication: Jagai, J., A. Krajewski, S. Shaikh, D. Lobdell, and R. Sargis. Association between environmental quality and diabetes in the U.S.A. Journal of Diabetes Investigation. John Wiley & Sons, Inc., Hoboken, NJ, USA, 11(2): 315-324, (2020).
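The diagnosed, undiagnosed, and total diabetes definitions given in this entry can be written down as a small classifier. The sketch below applies the stated thresholds (FPG ≥126 mg/dL, HbA1C ≥6.5%) to individual records and reports crude percentages. It is an illustration only: the function names and toy records are assumptions, and it performs neither the imputation nor the age standardization used for the IHME estimates.

```python
FPG_THRESHOLD_MG_DL = 126.0   # high fasting plasma glucose, from the text
HBA1C_THRESHOLD_PCT = 6.5     # high HbA1C, from the text


def has_high_glycemia(fpg_mg_dl, hba1c_pct):
    """True if FPG and/or HbA1C meets the stated threshold."""
    return fpg_mg_dl >= FPG_THRESHOLD_MG_DL or hba1c_pct >= HBA1C_THRESHOLD_PCT


def classify(reported_diagnosis, fpg_mg_dl, hba1c_pct):
    """Return 'diagnosed', 'undiagnosed', or 'none' for one adult."""
    if reported_diagnosis:
        return "diagnosed"
    if has_high_glycemia(fpg_mg_dl, hba1c_pct):
        return "undiagnosed"
    return "none"


def prevalence(people):
    """Crude (not age-standardized) prevalence percentages.

    people: iterable of (reported_diagnosis, fpg_mg_dl, hba1c_pct) tuples.
    Total diabetes = reported diagnosis and/or high FPG/HbA1C.
    """
    labels = [classify(*p) for p in people]
    n = len(labels)
    diagnosed = 100.0 * labels.count("diagnosed") / n
    undiagnosed = 100.0 * labels.count("undiagnosed") / n
    return {"diagnosed": diagnosed,
            "undiagnosed": undiagnosed,
            "total": diagnosed + undiagnosed}


if __name__ == "__main__":
    # Toy records, for illustration only.
    people = [(True, 110, 6.0), (False, 130, 6.0),
              (False, 100, 6.6), (False, 95, 5.4)]
    print(prevalence(people))
```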
https://creativecommons.org/publicdomain/zero/1.0/
Context
Diabetes is one of the most prevalent chronic diseases in the United States, affecting millions of Americans each year and placing a substantial financial burden on the economy. It is a serious chronic condition in which the body loses the ability to effectively regulate blood glucose levels, leading to a reduced quality of life and decreased life expectancy. During digestion, food is broken down into sugars, which enter the bloodstream. This triggers the pancreas to release insulin, a hormone that helps cells in the body use these sugars for energy. Diabetes is typically characterized by either insufficient insulin production or the body's inability to use insulin effectively.
Chronic high blood sugar levels in individuals with diabetes can lead to severe complications, including heart disease, vision loss, kidney disease, and lower-limb amputation. Although there is no cure for diabetes, strategies such as maintaining a healthy weight, eating a balanced diet, staying physically active, and receiving medical treatments can help mitigate its effects. Early diagnosis is crucial, as it allows for lifestyle modifications and more effective treatment, making predictive models for assessing diabetes risk valuable tools for public health officials.
The scale of the diabetes epidemic is significant. According to the Centers for Disease Control and Prevention (CDC), as of 2018, approximately 34.2 million Americans have diabetes, while 88 million have prediabetes. Alarmingly, the CDC estimates that 1 in 5 individuals with diabetes and about 8 in 10 individuals with prediabetes are unaware of their condition. Type II diabetes is the most common form, and its prevalence varies based on factors such as age, education, income, geographic location, race, and other social determinants of health. The burden of diabetes disproportionately affects those with lower socioeconomic status. The economic impact is also substantial, with the cost of diagnosed diabetes reaching approximately $327 billion annually, and total costs, including undiagnosed diabetes and prediabetes, nearing $400 billion each year.
Content
The Behavioral Risk Factor Surveillance System (BRFSS) is a health-related telephone survey conducted annually by the CDC. Each year, the survey collects responses from over 400,000 Americans on health-related risk behaviors, chronic health conditions, and the use of preventative services. It has been conducted every year since 1984. For this project, an XPT file of the 2023 dataset available on the CDC website was used. This original dataset contains responses from 433,323 individuals and has 345 features. These features are either questions directly asked of participants, or calculated variables based on individual participant responses.
I have selected 20 features from this dataset that are suitable for working on the topic of diabetes, and I have saved them in a CSV file without making any changes to the data. The goal is to make the data easier to work with. For more information or to access updated data, you can refer to the CDC website. I initially examined the original dataset from the CDC and found no duplicate entries. That dataset contains 330 columns and features. Therefore, the duplicate rows in this 20-feature subset are not errors; they represent different individuals whose responses coincide on the selected features. In my opinion, removing these entries would both introduce errors and reduce accuracy.
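The point about duplicates can be checked directly: rows that are distinct across all columns can become identical once only a subset of columns is kept. A minimal sketch with made-up records follows; the helper names and the toy columns are hypothetical and are not BRFSS variable names.

```python
from collections import Counter


def project(rows, keep):
    """Keep only the columns at the given indices."""
    return [tuple(row[i] for i in keep) for row in rows]


def duplicate_summary(rows):
    """Return (number of repeated patterns, number of surplus rows)."""
    counts = Counter(tuple(row) for row in rows)
    repeated = [n for n in counts.values() if n > 1]
    return len(repeated), sum(n - 1 for n in repeated)


if __name__ == "__main__":
    # Toy records: (sex, age, has_diabetes, some_other_answer).
    full = [("F", 52, "yes", 1), ("F", 52, "yes", 2), ("M", 40, "no", 3)]
    print(duplicate_summary(full))                      # all columns
    print(duplicate_summary(project(full, (0, 1, 2))))  # selected columns only
```

In the toy data the first two respondents differ only in the dropped column, so they appear as one repeated pattern after projection even though the full table has no duplicates.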
Explore some of the following research questions:
- Can survey questions from the BRFSS provide accurate predictions of whether an individual has diabetes?
- What risk factors are most predictive of diabetes risk?
- Can we use a subset of the risk factors to accurately predict whether an individual has diabetes?
- Can we create a short form of questions from the BRFSS using feature selection to accurately predict if someone might have diabetes or is at high risk of diabetes?
Acknowledgements
It is important to reiterate that I did not create this dataset; it is simply a summarized and reformatted dataset derived from the BRFSS 2023 dataset available on the CDC website. It is also worth noting that none of the data in this dataset discloses individuals' identities.
Inspiration
Zidian Xie et al., for "Building Risk Prediction Models for Type 2 Diabetes Using Machine Learning Techniques" using the 2014 BRFSS, and Alex Teboul, for building the Diabetes Health Indicators dataset based on the BRFSS 2015, were the inspiration for creating this dataset and exploring the BRFSS in general.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Diabetes is a chronic disease that must be constantly monitored, especially in cases of type 1 and 2 diabetes mellitus. Nowadays, technology is helping society to provide innovative solutions in the field of health through sensors or smart devices. In this field, continuous glucose sensors are a huge advance in the development of artificial intelligence algorithms capable of predicting glucose values or obtaining any type of relevant information to improve the quality of patients' health. Unfortunately, few datasets exist in this area. Therefore, this study aims to provide the scientific community with a dataset from a patient with type 1 diabetes covering the period from 2023/09/10 to 2024/02/26 (149 days with data).
Data Records
The data are recorded in a single file entitled glucose_data.csv. The file is in comma-separated values (CSV) format.
The following characteristics can be found in each row of the dataset:
The dataset contains a total of 29137 samples with an average of 191 samples per day. A summary of the samples per day can be found in the attached image.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Rajesh Sahani
Released under Apache 2.0
https://spdx.org/licenses/CC0-1.0.html
A door-to-door survey was conducted to enumerate all household members by age and sex in a low-income community in Peru. 856 adults 35 years and older were eligible to participate in screening for type 2 diabetes and hypertension; 709 (83%) participated. 130 (18.3%) were diagnosed with hypertension and/or type 2 diabetes, of whom 109 (84%) participated at program onset; 22 were added later from earlier non-participants in screening or program onset, forming the cohort of 131 patients with diabetes and/or hypertension. The primary care program had components of the Chronic Care Model, community health workers, and freely accessible visits and medications. The program operated between September 2011 and May 2014 and consisted of two care periods (separated by a six-month hiatus): first a 10-month home-care period, then a 17-month clinic-care period. The dataset consists of two files corresponding to two exposures: the 27-month program overall (post- versus pre-; N=262 observations, 131 pairs with patients as self-controls) and care period (clinic versus home; N=211, made up of 109 home and 102 clinic observations, which exceeds 131 because 80 patients participated in both care periods). Exposures were evaluated for their effects on guidelines-based pharmacotherapy standards: hypoglycemic and antihypertensive medications, low-dose aspirin, and first-line angiotensin converting enzyme inhibitor (ACEi) treatment of diabetes with elevated blood pressure.
Methods
From 2011 to 2014, data were collected prospectively during weekly home visits or monthly clinic visits on paper encounter forms, which were entered into Microsoft Excel as part of the standard operation of the community-based program. In January 2020, the University of Arizona institutional review board approved the use of the de-identified data for a study of the program's effects on clinical outcomes.
Time-series data (fasting glucose and blood pressure) were collapsed on the median of monthly average fasting glucose and blood pressure values during the program (27 months) and the respective care periods, home (10 months) and clinic (17 months). Antihypertensive and hypoglycemic agents were collapsed on the highest dose ever received, and angiotensin-converting enzyme inhibitors (ACEi) and aspirin on whether any dose was ever received, by treatment-eligible groups and within program and care period time intervals. Retention in care was obtained by counting visits and elapsed months (from first to last patient encounters) during the program and care periods. Treatment-eligible groups were low-dose aspirin candidates (10-year cardiovascular disease (CVD) risk >=10% by the Framingham alternate model that uses clinical factors only, no laboratory factors); blood pressure (BP) treatment candidates (BP >=130/80 mm Hg if diabetic or >=140/90 mm Hg if non-diabetic); hypoglycemic agent candidates (patients with diabetes); and diabetic ACEi candidates (diabetes with BP >=130/80 mm Hg). Data have been transformed into two files corresponding to two exposures: 1) program, post- versus pre- (referent), N=262 observations; and 2) care period, clinic versus home (referent), N=211 observations. There are two data files in text (comma-delimited) format. Pre-post....csv contains the 262 observations for the program exposure. Care period....csv contains the 211 care period observations. Each file has a data dictionary also in comma-delimited format. "Pre-post data dict....csv" describes the variables in the program exposure study. "Care period data dict....csv" describes the variables in the care period exposure study.
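The treatment-eligible groups described in this entry translate naturally into a rule-based function. The sketch below is an illustration under stated assumptions: the `Patient` fields and group labels are hypothetical names, the 10-year CVD risk is taken as a precomputed input (the Framingham alternate model itself is not implemented), and a threshold such as ">=130/80" is read as systolic >=130 or diastolic >=80, which the description does not spell out.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    diabetic: bool
    systolic: float        # mm Hg
    diastolic: float       # mm Hg
    cvd_risk_10yr: float   # fraction, e.g. 0.12 for 12% (precomputed)


def bp_at_or_above(p: Patient, systolic_cut: float, diastolic_cut: float) -> bool:
    # Assumption: ">= 130/80" means systolic >= 130 OR diastolic >= 80.
    return p.systolic >= systolic_cut or p.diastolic >= diastolic_cut


def eligible_groups(p: Patient) -> set:
    """Return the treatment-eligible groups a patient falls into."""
    groups = set()
    if p.cvd_risk_10yr >= 0.10:
        groups.add("aspirin")
    if p.diabetic:
        groups.add("hypoglycemic")
        if bp_at_or_above(p, 130, 80):
            groups.add("bp_treatment")
            groups.add("diabetic_acei")
    elif bp_at_or_above(p, 140, 90):
        groups.add("bp_treatment")
    return groups


if __name__ == "__main__":
    print(sorted(eligible_groups(Patient(True, 132, 78, 0.08))))
    print(sorted(eligible_groups(Patient(False, 135, 85, 0.12))))
```

Note that a non-diabetic patient at 135/85 mm Hg is not a BP treatment candidate under these rules, while a diabetic patient at the same reading would be, because the cut-off depends on diabetes status.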