Population-based cancer incidence rates were abstracted from the National Cancer Institute's State Cancer Profiles for all U.S. counties for which data were available. State Cancer Profiles is a national county-level database of cancer data collected by state public health surveillance systems. All-site cancer is defined as any type of cancer captured in the state registry data, excluding non-melanoma skin cancer. All-site age-adjusted cancer incidence rates were abstracted separately for males and females. County-level annual age-adjusted all-site cancer incidence rates for 2006–2010 were available for 2687 of 3142 (85.5%) U.S. counties. Rates for counties with fewer than 16 reported cases in a specific area-sex-race category are suppressed to ensure confidentiality and stability of rate estimates; this accounted for 14 counties in our study. Two states, Kansas and Virginia, do not provide data because state legislation and regulations prohibit the release of county-level data to outside entities. Data from Michigan do not include cases diagnosed in other states because data exchange agreements prohibit the release of data to third parties. Finally, state data are not available for three states: Minnesota, Ohio, and Washington. The age-adjusted average annual incidence rate for all counties was 453.7 per 100,000 persons. We selected 2006–2010 because it follows the period represented by the Environmental Quality Index (EQI) exposure data, which were constructed to represent 2000–2005. We also gathered data for the three leading cancer types for males (lung, prostate, and colorectal) and females (lung, breast, and colorectal). The EQI was used as an exposure metric, an indicator of cumulative environmental exposures at the county level for the period 2000 to 2005. A complete description of the datasets used in the EQI is provided in Lobdell et al., and the methods used for index construction are described by Messer et al.
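The suppression rule and the county coverage figure described above can be illustrated with a minimal Python sketch. The function name and the use of a crude (not age-adjusted) rate are assumptions made for illustration; only the threshold of 16 cases and the county counts come from the text.

```python
from typing import Optional

# From the text: cells with fewer than 16 reported cases are suppressed.
SUPPRESSION_THRESHOLD = 16


def crude_rate_per_100k(cases: int, population: int) -> Optional[float]:
    """Return a crude rate per 100,000 persons, or None if suppressed.

    Hypothetical helper: the source reports age-adjusted rates, which also
    need age-specific counts and a standard population.
    """
    if cases < SUPPRESSION_THRESHOLD:
        return None  # suppressed for confidentiality and rate stability
    return cases / population * 100_000


# Share of U.S. counties with available rates, as reported in the text.
coverage_percent = round(2687 / 3142 * 100, 1)
```

A suppressed cell is returned as `None` rather than zero so that downstream code cannot mistake a withheld value for a true rate of zero.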
The EQI was developed for the period 2000–2005 because it was the time period for which the most recent data were available when index construction was initiated. The EQI includes variables representing each of the environmental domains. The air domain includes 87 variables representing criteria and hazardous air pollutants. The water domain includes 80 variables representing overall water quality, general water contamination, recreational water quality, drinking water quality, atmospheric deposition, drought, and chemical contamination. The land domain includes 26 variables representing agriculture, pesticides, contaminants, facilities, and radon. The built domain includes 14 variables representing roads, highway/road safety, public transit behavior, business environment, and subsidized housing environment. The sociodemographic domain includes 12 variables representing socioeconomics and crime. This dataset is not publicly accessible because EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual-level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. Human health data are not available publicly. EQI data are available at: https://edg.epa.gov/data/Public/ORD/NHEERL/EQI. Format: Data are stored as csv files. This dataset is associated with the following publication: Jagai, J., L. Messer, K. Rappazzo, C. Gray, S. Grabich, and D. Lobdell. County-level environmental quality and associations with cancer incidence. Cancer. John Wiley & Sons Incorporated, New York, NY, USA, 123(15): 2901-2908, (2017).
https://digital.nhs.uk/about-nhs-digital/terms-and-conditions
Rapid Cancer Registration Data (RCRD) provides a quick, indicative source of cancer data. It is provided to support the planning and provision of cancer services. The data is based on rapid processing of cancer registration data sources, in particular Cancer Outcomes and Services Dataset (COSD) information. In comparison, National Cancer Registration Data (NCRD) relies on additional data sources, enhanced follow-up with trusts and expert processing by cancer registration officers. RCRD may be useful for service improvement projects, including healthcare planning and prioritisation. However, it is poorly suited to epidemiological research because of limitations in data quality and completeness.
Number and rate of new cancer cases diagnosed annually from 1992 to the most recent diagnosis year available. Included are all invasive cancers and in situ bladder cancer with cases defined using the Surveillance, Epidemiology and End Results (SEER) Groups for Primary Site based on the World Health Organization International Classification of Diseases for Oncology, Third Edition (ICD-O-3). Random rounding of case counts to the nearest multiple of 5 is used to prevent inappropriate disclosure of health-related information.
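The random rounding mentioned above can be sketched in Python as follows. The exact scheme used by the data provider is not stated in the description, so this block assumes one common unbiased variant: a count is rounded up to the next multiple of 5 with probability equal to its remainder divided by 5, and down otherwise. The function name `random_round` is hypothetical.

```python
import random
from typing import Optional


def random_round(count: int, base: int = 5,
                 rng: Optional[random.Random] = None) -> int:
    """Randomly round a non-negative count to an adjacent multiple of base.

    Assumed scheme: round up with probability remainder/base, so the
    expected value of the rounded count equals the original count.
    """
    if rng is None:
        rng = random.Random()
    remainder = count % base
    if remainder == 0:
        return count  # already a multiple of base, left unchanged
    lower = count - remainder
    if rng.random() < remainder / base:
        return lower + base
    return lower
```

Because the rounding is random, the same true count can be published as two different values in different tables, which is why small differences between totals and the sum of their parts are expected in such data.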
Open Government Licence 3.0 http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
A measure of the number of adults diagnosed with any type of cancer in a year who are still alive one year after diagnosis.
Purpose: This indicator attempts to capture the success of the NHS in preventing people from dying once they have been diagnosed with any type of cancer.
Current version updated: Feb-17
Next version due: Feb-18
Open Government Licence 3.0 http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
One-year and five-year net survival for adults (15-99) in England diagnosed with one of 29 common cancers, by age and sex.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Dataset Card for Lung Cancer
Dataset Summary
An effective cancer prediction system helps people learn their cancer risk at low cost and make appropriate decisions based on their risk status. The data were collected from the website of an online lung cancer prediction system.
Supported Tasks and Leaderboards
[More Information Needed]
Languages
[More Information Needed]
Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/nateraw/lung-cancer.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Links to code and bioRxiv pre-print:
1. Multi-lens Neural Machine (MLNM) Code
2. An AI-assisted Tool For Efficient Prostate Cancer Diagnosis (bioRxiv Pre-print)
Digitized hematoxylin and eosin (H&E)-stained whole-slide images (WSIs) of 40 prostatectomy and 59 core needle biopsy specimens were collected from 99 prostate cancer patients at Tan Tock Seng Hospital, Singapore. There were 99 WSIs in total, one per specimen. H&E-stained slides were scanned at 40× magnification (specimen-level pixel size 0.25 μm × 0.25 μm) using an Aperio AT2 Slide Scanner (Leica Biosystems). Institutional review board approval was obtained from the hospital for this study, and all the data were de-identified.
Prostate glandular structures in core needle biopsy slides were manually annotated and classified using the ASAP annotation tool (ASAP). A senior pathologist reviewed 10% of the annotations in each slide, ensuring that some reference annotations were provided to the researcher at different regions of the core. It is to be noted that partial glands appearing at the edges of the biopsy cores were not annotated.
Patches of size 512 × 512 pixels were cropped from whole slide images at resolutions 5×, 10×, 20×, and 40× with an annotated gland centered at each patch. This dataset contains these cropped images.
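The multi-resolution cropping described above can be sketched as a coordinate calculation. This is a minimal illustration, not the authors' code: it assumes the gland centre is given in pixel coordinates of the 40× scan, and the function name `patch_region` is hypothetical. A 512 × 512 patch at a lower magnification covers a proportionally larger area of the 40× image, which is then downsampled to 512 × 512.

```python
from typing import Tuple

BASE_MAGNIFICATION = 40  # from the text: slides were scanned at 40x
PATCH_SIZE = 512         # from the text: patches are 512 x 512 pixels


def patch_region(center_x: int, center_y: int,
                 magnification: int) -> Tuple[int, int, int]:
    """Return (left, top, side) of the square region, in 40x pixel
    coordinates, that a gland-centred patch covers at the given
    magnification (assumed to be one of 5, 10, 20, 40)."""
    scale = BASE_MAGNIFICATION // magnification  # 8, 4, 2 or 1
    side = PATCH_SIZE * scale
    left = center_x - side // 2
    top = center_y - side // 2
    return left, top, side


# Regions for one hypothetical gland centre at all four resolutions.
regions = {m: patch_region(10_000, 8_000, m) for m in (5, 10, 20, 40)}
```

The sketch does not handle glands close to the slide border, where the region would extend outside the image; how the original pipeline treats that case is not stated.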
This dataset is used to train two AI models for Gland Segmentation (99 patients) and Gland Classification (46 patients). Tables 1 and 2 illustrate both gland segmentation and gland classification datasets. We have put the two corresponding sub-datasets as two zip files as follows:
Table 1: The number of slides and patches in training, validation, and test sets for gland segmentation task. There is one H&E stained WSI for each prostatectomy or core needle biopsy specimen.
| #Slides | Train | Valid | Test | Total |
|---|---|---|---|---|
| Prostatectomy | 17 | 8 | 15 | 40 |
| Biopsy | 26 | 13 | 20 | 59 |
| Total | 43 | 21 | 35 | 99 |

| #Patches | Train | Valid | Test | Total |
|---|---|---|---|---|
| Prostatectomy | 7795 | 3753 | 7224 | 18772 |
| Biopsy | 5559 | 4028 | 5981 | 15568 |
| Total | 13354 | 7781 | 13205 | 34340 |
Table 2: The number of slides and patches in training, validation, and test sets for gland classification task. There is one H&E stained WSI for each prostatectomy or core needle biopsy specimen. The gland classification datasets are the subsets of the gland segmentation datasets. GS: Gleason Score. B: Benign. M: Malignant.
| #Slides (GS 3+3:3+4:4+3) | Train | Valid | Test | Total |
|---|---|---|---|---|
| Biopsy | 10:9:1 | 3:7:0 | 6:10:0 | 19:26:1 |

| #Patches (B:M) | Train | Valid | Test | Total |
|---|---|---|---|---|
| Biopsy | 1557:2277 | 1216:1341 | 1543:2718 | 4316:6336 |
NB: The gland classification folder (gland_classification_dataset.zip) may contain extra patches whose labels could not be identified from the H&E slides. They were not used in the machine learning study.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains information on 213 cancer patients undergoing clinical or surgical treatment, characterized by sociodemographic and clinical data as well as data from the Care Transition Measure (CTM 15-Brazil). Data were collected 7 to 30 days after discharge from hospital, from June to August 2019. Understanding these data can contribute to improving the quality of care transitions and avoiding hospital readmissions. To this end, this dataset contains a broad array of variables:
*gender
*age group
*place of residence
*race
*marital status
*schooling
*paid work activity
*type of treatment
*cancer staging
*metastasis
*comorbidities
*main complaint
*continuous-use medication
*diagnosis
*cancer type
*diagnostic year
*oncology treatment
*first hospitalization
*readmission in the last 30 days
*number of hospitalizations in the last 30 days
*readmission in the last 6 months
*number of hospitalizations in the last 6 months
*readmission in the last year
*number of hospitalizations in the last year
*questions 1-15 from CTM 15-Brazil
The data are presented as a single Excel XLSX file: cancer patient´s care transitions dataset.xlsx.
Analyses of this dataset have the potential to generate hospital readmission prevention strategies to be implemented by the hospital team. Researchers who are interested in care transitions (CTs) of cancer patients can extensively explore the variables described here.
The project from which these data were extracted was approved by the institution’s research ethics committee (approval n. 3.266.259/2019) at Associação Hospital de Caridade Ijuí, Rio Grande do Sul, Brazil.
Open Government Licence 3.0 http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
A measure of the number of adults diagnosed with any type of cancer in a year who are still alive five years after diagnosis.
Purpose: This indicator attempts to capture the success of the NHS in preventing people from dying once they have been diagnosed with any type of cancer.
Current version updated: Feb-17
Next version due: Feb-18
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Pancreatic cancer is an extremely deadly type of cancer. Once diagnosed, the five-year survival rate is less than 10%. However, if pancreatic cancer is caught early, the odds of surviving are much better. Unfortunately, many cases of pancreatic cancer show no symptoms until the cancer has spread throughout the body. A diagnostic test to identify people with pancreatic cancer could be enormously helpful.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data set from: What Defines Quality of Life for Older Patients Diagnosed with Cancer? A Qualitative Study
Abstract of the study: The treatment of cancer can have a significant impact on quality of life in older patients and this needs to be taken into account in decision making. However, quality of life can consist of many different components with varying importance between individuals. We set out to assess how older patients with cancer define quality of life and the components that are most significant to them. This was a single-centre, qualitative interview study. Patients aged 70 years or older with cancer were asked to answer open-ended questions: What makes life worthwhile? What does quality of life mean to you? What could affect your quality of life? Subsequently, they were asked to choose the five most important determinants of quality of life from a predefined list: cognition, contact with family or with community, independence, staying in your own home, helping others, having enough energy, emotional well-being, life satisfaction, religion and leisure activities. Afterwards, answers to the open-ended questions were independently categorized by two authors. The proportion of patients mentioning each category in the open-ended questions were compared to the predefined questions. Overall, 63 patients (median age 76 years) were included. When asked, “What makes life worthwhile?”, patients identified social functioning (86%) most frequently. Moreover, to define quality of life, patients most frequently mentioned categories in the domains of physical functioning (70%) and physical health (48%). Maintaining cognition was mentioned in 17% of the open-ended questions and it was the most commonly chosen option from the list of determinants (72% of respondents). In conclusion, physical functioning, social functioning, physical health and cognition are important components in quality of life. When discussing treatment options, the impact of treatment on these aspects should be taken into consideration.
Reference of research paper: Seghers PAL, Kregting JA, van Huis-Tanja LH, Soubeyran P, O'Hanlon S, Rostoft S, Hamaker ME, Portielje JEA. What Defines Quality of Life for Older Patients Diagnosed with Cancer? A Qualitative Study. Cancers. 2022; 14(5):1123. https://doi.org/10.3390/cancers14051123
Content of the data set: The first tab describes what questions were asked, the second tab shows all individual anonymised answers to the open questions, and the fourth shows the definitions that were used to classify all answers. Q1-Q4 show how the answers were categorised.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The dataset consists of CT brain scans with cancer, tumor, and aneurysm. Each scan represents a detailed image of a patient's brain taken using CT (Computed Tomography). The data are presented in 2 different formats: .jpg and .dcm.
The dataset of CT brain scans is valuable for research in neurology, radiology, and oncology. It allows the development and evaluation of computer-based algorithms, machine learning models, and deep learning techniques for automated detection, diagnosis, and classification of these conditions.
keywords: aneurysm, cancer detection, cancer segmentation, tumor, computed tomography, head, skull, brain scan, eye sockets, sinuses, medical imaging, radiology dataset, neurology dataset, oncology dataset, image dataset, abnormalities detection, brain anatomy, health, brain formations, imaging procedure, x-rays measurements, machine learning, computer vision, deep learning
https://www.scilifelab.se/data/restricted-access/
Welcome to the CSAW-M dataset homepage
This page includes the files and metadata related to CSAW-M, a curated dataset of mammograms with expert assessments of the masking of cancer. CSAW-M is collected from over 10,000 individuals and annotated with potential masking. In contrast to previous approaches, which measure breast image density as a proxy, our dataset directly provides annotations of masking potential assessments from five specialists. We trained deep learning models on CSAW-M to estimate the masking level, and showed that the estimated masking is significantly more predictive of screening participants diagnosed with interval and large invasive cancers (without being explicitly trained for these tasks) than its breast density counterparts. Please find the paper corresponding to our work here and the GitHub repo here.
CSAW-M Research Use License
Please read carefully all the terms and conditions of the CSAW-M Research Use License.
How to access the dataset:
If you want to get access to the data, please use the "Request access to files" option above (currently, non-Swedish researchers need to have a general figshare account to be able to request access). We will ask you to agree to our terms and conditions and provide us with some information about what you will use the data for. We will then receive the request and process it, after which you will be able to download all the files.
If you use this Work, please cite our paper:
@article{sorkhei2021csaw,
  title={CSAW-M: An Ordinal Classification Dataset for Benchmarking Mammographic Masking of Cancer},
  author={Sorkhei, Moein and Liu, Yue and Azizpour, Hossein and Azavedo, Edward and Dembrower, Karin and Ntoula, Dimitra and Zouzos, Athanasios and Strand, Fredrik and Smith, Kevin},
  year={2021}
}
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description: Breast cancer is the most common cancer amongst women in the world. It accounts for 25% of all cancer cases, and affected over 2.1 million people in 2015 alone. It starts when cells in the breast begin to grow out of control. These cells usually form tumors that can be seen via X-ray or felt as lumps in the breast area. The key challenge in its detection is how to classify tumors as malignant (cancerous) or benign (non-cancerous). We ask you to complete the analysis of classifying these tumors using machine learning (with SVMs) and the Breast Cancer Wisconsin (Diagnostic) Dataset. Acknowledgements: This dataset was sourced from Kaggle. Objective: Understand the dataset and clean it up (if required). Build classification models to predict whether the cancer type is malignant or benign. Also fine-tune the hyperparameters and compare the evaluation metrics of various classification algorithms.
Open Government Licence 3.0 http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
Age-standardised rate of mortality from oral cancer (ICD-10 codes C00-C14) in persons of all ages and sexes per 100,000 population.
Rationale
Over the last decade in the UK (between 2003-2005 and 2012-2014), oral cancer mortality rates have increased by 20% for males and 19% for females (1). Five-year survival rates are 56%. Most oral cancers are triggered by tobacco and alcohol, which together account for 75% of cases (2). Cigarette smoking is associated with an increased risk of the more common forms of oral cancer. The risk among cigarette smokers is estimated to be 10 times that for non-smokers. More intense use of tobacco increases the risk, while ceasing to smoke for 10 years or more reduces it to almost the same as that of non-smokers (3). Oral cancer mortality rates can be used in conjunction with registration data to inform service planning as well as comparing survival rates across areas of England to assess the impact of public health prevention policies such as smoking cessation.
References:
(1) Cancer Research Campaign. Cancer Statistics: Oral – UK. London: CRC, 2000.
(2) Blot WJ, McLaughlin JK, Winn DM et al. Smoking and drinking in relation to oral and pharyngeal cancer. Cancer Res 1988; 48: 3282-7.
(3) La Vecchia C, Tavani A, Franceschi S et al. Epidemiology and prevention of oral cancer. Oral Oncology 1997; 33: 302-12.
Definition of numerator
All cancer mortality for lip, oral cavity and pharynx (ICD-10 C00-C14) in the respective calendar years aggregated into quinary age bands (0-4, 5-9,…, 85-89, 90+). This does not include secondary cancers or recurrences. Data are reported according to the calendar year in which the cancer was diagnosed.
Counts of deaths for years up to and including 2019 have been adjusted where needed to take account of the MUSE ICD-10 coding change introduced in 2020. Detailed guidance on the MUSE implementation is available at: https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/deaths/articles/causeofdeathcodinginmortalitystatisticssoftwarechanges/january2020
Counts of deaths for years up to and including 2013 have been double adjusted by applying comparability ratios from both the IRIS coding change and the MUSE coding change where needed to take account of both the MUSE ICD-10 coding change and the IRIS ICD-10 coding change introduced in 2014. The detailed guidance on the IRIS implementation is available at: https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/deaths/bulletins/impactoftheimplementationofirissoftwareforicd10causeofdeathcodingonmortalitystatisticsenglandandwales/2014-08-08
Counts of deaths for years up to and including 2010 have been triple adjusted by applying comparability ratios from the 2011 coding change, the IRIS coding change and the MUSE coding change where needed to take account of the MUSE ICD-10 coding change, the IRIS ICD-10 coding change and the ICD-10 coding change introduced in 2011. The detailed guidance on the 2011 implementation is available at https://webarchive.nationalarchives.gov.uk/ukgwa/20160108084125/http://www.ons.gov.uk/ons/guide-method/classifications/international-standard-classifications/icd-10-for-mortality/comparability-ratios/index.html
Definition of denominator
Population-years (aggregated populations for the three years) for people of all ages, aggregated into quinary age bands (0-4, 5-9, …, 85-89, 90+)
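The numerator and denominator definitions above feed a directly age-standardised rate. The following Python sketch shows the usual direct standardisation calculation under stated assumptions: the standard population weights are supplied by the caller (the description does not say which standard population is used), comparability ratios are applied by simple multiplication, and the function names are hypothetical.

```python
from typing import Sequence


def adjust_deaths(deaths: float, comparability_ratios: Sequence[float]) -> float:
    """Apply one or more comparability ratios to a death count
    (for example two ratios for a 'double adjusted' year)."""
    adjusted = deaths
    for ratio in comparability_ratios:
        adjusted *= ratio
    return adjusted


def age_standardised_rate(deaths: Sequence[float],
                          population_years: Sequence[float],
                          standard_weights: Sequence[float],
                          per: int = 100_000) -> float:
    """Directly age-standardised rate per `per` population.

    Each sequence has one entry per age band (0-4, 5-9, ..., 90+).
    The weights are an assumed standard population, not taken from
    the source.
    """
    if not (len(deaths) == len(population_years) == len(standard_weights)):
        raise ValueError("all inputs need one value per age band")
    weighted = sum(w * d / n for d, n, w
                   in zip(deaths, population_years, standard_weights))
    return weighted / sum(standard_weights) * per
```

Each age-specific rate is weighted by the share of the standard population in that band, so areas with different age structures can be compared on the same footing.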
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset processes the competition's 314GB+ dataset (54,713 images) into TFRecords for fast data loading during training.
All images are resized to 768x1280 and saved in 100 TFRecords, so each TFRecord contains roughly 548 images, for a dataset of 8.6GB+.
TFRecords have the benefit of loading large chunks of data containing many samples instead of loading every image and label separately.
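The split of 54,713 images over 100 TFRecord files can be illustrated with a small sharding sketch. This is only an assumption about how the images might be distributed (round-robin assignment); the description does not say how the author assigned images to files, and the function names are hypothetical. Writing the records themselves would additionally need TensorFlow and is not shown.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def shard_sizes(n_items: int, n_shards: int) -> List[int]:
    """Sizes of n_shards shards that differ by at most one item."""
    base, extra = divmod(n_items, n_shards)
    return [base + 1 if i < extra else base for i in range(n_shards)]


def assign_shards(items: Sequence[T], n_shards: int) -> List[List[T]]:
    """Assign items to shards round-robin, preserving order per shard."""
    shards: List[List[T]] = [[] for _ in range(n_shards)]
    for index, item in enumerate(items):
        shards[index % n_shards].append(item)
    return shards


# Shard sizes for the figures quoted in the description.
sizes = shard_sizes(54_713, 100)
```

With these figures every shard holds 547 or 548 images, which matches the "roughly 548 images" per TFRecord quoted above.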
Dataset Description
Note: The dataset for this challenge contains radiographic breast images of female subjects. The goal of this competition is to identify cases of breast cancer in mammograms from screening exams. It is important to identify cases of cancer for obvious reasons, but false positives also have downsides for patients. As millions of women get mammograms each year, a useful machine learning tool could help a great many people. This competition uses a hidden test. When your submitted notebook is scored the actual test data (including a full length sample submission) will be made available to your notebook.
Files
[train/test]_images/[patient_id]/[image_id].dcm The mammograms, in DICOM format. You can expect roughly 8,000 patients in the hidden test set. There are usually but not always 4 images per patient. Note that many of the images use the JPEG 2000 format, which you may need special libraries to load.
sample_submission.csv A valid sample submission. Only the first few rows are available for download.
[train/test].csv Metadata for each patient and image. Only the first few rows of the test set are available for download.
site_id - ID code for the source hospital.
patient_id - ID code for the patient.
image_id - ID code for the image.
laterality - Whether the image is of the left or right breast.
view - The orientation of the image. The default for a screening exam is to capture two views per breast.
age - The patient's age in years.
implant - Whether or not the patient had breast implants. Site 1 only provides breast implant information at the patient level, not at the breast level.
density - A rating for how dense the breast tissue is, with A being the least dense and D being the most dense. Extremely dense tissue can make diagnosis more difficult. Only provided for train.
machine_id - An ID code for the imaging device.
cancer - Whether or not the breast was positive for malignant cancer. The target value. Only provided for train.
biopsy - Whether or not a follow-up biopsy was performed on the breast. Only provided for train.
invasive - If the breast is positive for cancer, whether or not the cancer proved to be invasive. Only provided for train.
BIRADS - 0 if the breast required follow-up, 1 if the breast was rated as negative for cancer, and 2 if the breast was rated as normal. Only provided for train.
prediction_id - The ID for the matching submission row. Multiple images will share the same prediction ID. Test only.
difficult_negative_case - True if the case was unusually difficult. Only provided for train.
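Because several images share one prediction_id, image-level model scores have to be combined into a single value per submission row. The sketch below shows one possible way to do that in Python. Averaging the scores is an assumption (a maximum or another rule could equally be used), and the example IDs are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def aggregate_by_prediction_id(
        rows: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Average image-level cancer scores for each prediction_id.

    `rows` yields (prediction_id, score) pairs, one per image.
    """
    grouped: Dict[str, List[float]] = defaultdict(list)
    for prediction_id, score in rows:
        grouped[prediction_id].append(score)
    return {pid: sum(scores) / len(scores)
            for pid, scores in grouped.items()}


# Hypothetical image-level scores; the IDs are placeholders.
example = aggregate_by_prediction_id([
    ("patientA_L", 0.2),
    ("patientA_L", 0.4),
    ("patientA_R", 0.9),
])
```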
https://digital.nhs.uk/services/data-access-request-service-dars
The National Cancer Registration and Analysis Service (NCRAS) at Public Health England supplies cancer registration data to NHS Digital. These data can be linked to other data held by NHS Digital in order to provide notifications on an individual's cancer status, to support research studies and to identify potential research participants for clinical trials.
NCRAS is the population-based cancer registry for England. It collects, quality assures and analyses data on all people living in England who are diagnosed with malignant and pre-malignant neoplasms, with national coverage since 1971.
The Cancer Registration dataset comprises England data to the present day, and Welsh data up to April 2017.
Timescales for dissemination of agreed data can be found under 'Our Service Levels' at the following link: https://digital.nhs.uk/services/data-access-request-service-dars/data-access-request-service-dars-process Standard response
All individuals diagnosed with cancer from 2000 to 2007 were identified in the Cancer Register of Southern Sweden, but only individuals who were also identified in the Population Register of Scania were included in this cohort. Age- and gender-matched controls were identified in the Population Register of Scania. The controls were checked against the Cancer Register of Southern Sweden to confirm that they had no prior diagnosis of cancer, and against the Population Register of Scania to confirm that they were alive at the time of diagnosis of the matched case. Spouses of cancer patients were also used as controls.
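The eligibility rules for matched controls described above can be sketched in Python. This is an illustration only: the record layout, the use of birth year as the age-matching key and the function name are assumptions, and the actual registers and matching procedure are not specified beyond what the text says.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Person:
    # Hypothetical record layout, not the registers' actual schema.
    person_id: str
    sex: str
    birth_year: int
    cancer_diagnosis: Optional[date] = None
    death_date: Optional[date] = None


def eligible_controls(case: Person, population: List[Person]) -> List[Person]:
    """Return people who could serve as controls for `case`: same sex and
    birth year, no cancer diagnosis on or before the case's diagnosis date,
    and alive on that date."""
    if case.cancer_diagnosis is None:
        raise ValueError("case must have a diagnosis date")
    index_date = case.cancer_diagnosis
    controls = []
    for person in population:
        if person.person_id == case.person_id:
            continue
        matched = (person.sex == case.sex
                   and person.birth_year == case.birth_year)
        cancer_free = (person.cancer_diagnosis is None
                       or person.cancer_diagnosis > index_date)
        alive = person.death_date is None or person.death_date > index_date
        if matched and cancer_free and alive:
            controls.append(person)
    return controls
```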
For each individual, healthcare costs were monitored related to the date of diagnosis. Costs for outpatient care, inpatient care, number of days in hospital and medications were included. Costs were also calculated for the controls.
Other information available about the individuals in the cohort includes age, sex, domicile, type of tumor and medication.
Purpose:
To study the health cost per individual in relation to mortality and comorbidity.
The dataset includes the study controls (individuals matched by age and sex). Spouses of cancer patients were also included in the control group.
Open Database License (ODbL) v1.0 https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
This dataset provides the annual number of people diagnosed with rare cancer since 2010
https://ega-archive.org/dacs/EGAC00001000701
The dataset for Direct Detection of Early-Stage Cancers using Circulating Tumor DNA includes 602 bam files from next-generation sequencing on the Illumina HiSeq2500 or MiSeq. The samples analyzed include cancer cell lines as well as plasma and tissue specimens from healthy individuals and patients with cancer.