12 datasets found
  1. Cancer County-Level

    • kaggle.com
    zip
    Updated Dec 3, 2022
    Cite
    The Devastator (2022). Cancer County-Level [Dataset]. https://www.kaggle.com/datasets/thedevastator/exploring-county-level-correlations-in-cancer-ra
    Explore at:
    Available download formats: zip (146998 bytes)
    Dataset updated
    Dec 3, 2022
    Authors
    The Devastator
    Description

    Exploring County-Level Correlations in Cancer Rates and Trends

    A Multivariate Ordinary Least Squares Regression Model

    By Noah Rippner [source]

    About this dataset

    This dataset supports analysis of patterns and trends in cancer rates across individual U.S. counties. It combines data from cancer.gov and the US Census American Community Survey, covering the age-adjusted death rate, average deaths per year, recent trends (2) in death rates, average annual counts, and whether the objective of 45.5 was met. Linear regression models built on these variables can reveal correlations that help explain how cancer prevalence differs between counties over time, which makes it easier to target health initiatives and resources.

    How to use the dataset

    This Kaggle dataset combines county-level data from the US Census American Community Survey and cancer.gov for exploring correlations between county-level cancer rates, trends, and mortality statistics. Each row describes one U.S. county: its age-adjusted death rate, average deaths per year, recent trend (2) in death rates, average annual count of cases detected within 5 years, and whether or not an objective of 45.5 (1) was met.

    To get the most from this dataset, you should be comfortable with basic descriptive analytics: calculating summary statistics such as the mean and median; summarizing categorical variables with frequency tables; and creating visualizations such as charts and histograms. It also helps to know how to apply linear regression or other machine learning techniques (support vector machines, random forests, neural networks), to distinguish supervised from unsupervised learning, to review diagnostic tests when evaluating models, to interpret findings and hypothesize reasons for the patterns found during exploration, and to communicate results through slides or documents. With this background you can apply different methods of analysis to the dataset accurately and effectively.

    Once these concepts are understood, start by importing the dataset into your tool of choice: Tableau (Public or Desktop), QlikView, the SAS analytical suite, or a Python notebook with packages such as scikit-learn for building predictive models. A brief description of the table's columns is provided above. With basic SQL knowledge you can compute statistics with simple queries, select subsets of columns, filter rows by condition, and sort by specific attributes. In Python you can parse the portion of the data you need, group and aggregate categories, and join tables where necessary before fitting any predictive model. From there, explore the available features with correlation and covariance matrices and with scatter plots, which show how the variables are distributed and related and reveal trends in a given metric.
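
    As a rough illustration of the first exploration steps described here (summary statistics and a correlation), the sketch below uses only the Python standard library. The CSV text, its column names, and its values are hypothetical stand-ins, not the real file layout.

```python
import csv
import io
import math
import statistics

# Hypothetical sample standing in for the real file. Column names and values
# are illustrative assumptions, not taken from the dataset.
SAMPLE_CSV = """county,age_adjusted_death_rate,average_deaths_per_year
A,40.0,10
B,50.0,20
C,60.0,30
D,70.0,40
"""


def load_rows(text):
    """Parse CSV text into dicts, converting the numeric columns to float."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            "county": row["county"],
            "rate": float(row["age_adjusted_death_rate"]),
            "deaths": float(row["average_deaths_per_year"]),
        })
    return rows


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


rows = load_rows(SAMPLE_CSV)
rates = [r["rate"] for r in rows]
deaths = [r["deaths"] for r in rows]

summary = {"mean": statistics.fmean(rates), "median": statistics.median(rates)}
r_value = pearson(rates, deaths)
print(summary, round(r_value, 3))
```

    With the real file, the same functions would be pointed at an opened CSV file instead of the in-memory sample, after checking the actual column headers.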

    Research Ideas

    • Building a predictive cancer incidence model based on county-level demographic data to identify high-risk areas and target public health interventions.
    • Analyzing correlations between age-adjusted death rate, average annual count, and recent trends in order to develop more effective policy initiatives for cancer prevention and healthcare access.
    • Utilizing the dataset to construct a machine learning algorithm that can predict county-level mortality rates based on socio-economic factors such as poverty levels and educational attainment rates.

    Acknowledgements

    If you use this dataset i...

  2. CDC WONDER: Cancer Statistics

    • catalog.data.gov
    • healthdata.gov
    • +4 more
    Updated Jul 29, 2025
    + more versions
    Cite
    Centers for Disease Control and Prevention, Department of Health & Human Services (2025). CDC WONDER: Cancer Statistics [Dataset]. https://catalog.data.gov/dataset/cdc-wonder-cancer-statistics
    Explore at:
    Dataset updated
    Jul 29, 2025
    Description

    The United States Cancer Statistics (USCS) online databases in WONDER provide cancer incidence and mortality data for the United States for the years since 1999, by year, state and metropolitan area (MSA), age group, race, ethnicity, sex, childhood cancer classification, and cancer site. The databases report case counts, deaths, crude and age-adjusted incidence and death rates, and 95% confidence intervals for rates. The USCS data are the official federal statistics on cancer incidence from registries having high-quality data and cancer mortality statistics for 50 states and the District of Columbia. USCS are produced by the Centers for Disease Control and Prevention (CDC) and the National Cancer Institute (NCI), in collaboration with the North American Association of Central Cancer Registries (NAACCR). Mortality data are provided by the Centers for Disease Control and Prevention (CDC), National Center for Health Statistics (NCHS), National Vital Statistics System (NVSS).
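
    To make the difference between crude and age-adjusted rates concrete, here is a minimal sketch of direct age standardization, a common way to compute age-adjusted rates. The age groups, counts, and standard-population weights are made up for illustration, and the sketch is not claimed to reproduce WONDER's exact procedure.

```python
# Hypothetical age groups: (deaths, population, standard-population weight).
# All numbers are invented for illustration; the weights sum to 1.
AGE_GROUPS = [
    (10, 100_000, 0.5),
    (90, 100_000, 0.3),
    (400, 100_000, 0.2),
]


def crude_rate(groups, per=100_000):
    """Total deaths divided by total population, scaled to `per` people."""
    deaths = sum(d for d, _, _ in groups)
    population = sum(p for _, p, _ in groups)
    return deaths / population * per


def age_adjusted_rate(groups, per=100_000):
    """Direct standardization: weighted sum of the age-specific rates."""
    return sum(weight * (d / p) for d, p, weight in groups) * per


crude = crude_rate(AGE_GROUPS)
adjusted = age_adjusted_rate(AGE_GROUPS)
print(f"crude: {crude:.1f}, age-adjusted: {adjusted:.1f} per 100,000")
```

    The two figures differ because the crude rate reflects the population's own age mix, while the adjusted rate reweights each age group to a fixed standard population so that areas with different age structures can be compared.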

  3. Cancer data of United States of America

    • kaggle.com
    zip
    Updated Apr 18, 2024
    Cite
    Tanisha1604 (2024). Cancer data of United States of America [Dataset]. https://www.kaggle.com/datasets/tanisha1604/cancer-data-of-united-states-of-america
    Explore at:
    Available download formats: zip (346754 bytes)
    Dataset updated
    Apr 18, 2024
    Authors
    Tanisha1604
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Area covered
    United States
    Description

    About Dataset

    The dataset contains 2 .csv files with various demographic and health-related data for different regions. Here's a brief description of each column:

    File 1st

    • avganncount: Average number of cancer cases diagnosed annually.

    • avgdeathsperyear: Average number of deaths due to cancer per year.

    • target_deathrate: Target death rate due to cancer.

    • incidencerate: Incidence rate of cancer.

    • medincome: Median income in the region.

    • popest2015: Estimated population in 2015.

    • povertypercent: Percentage of population below the poverty line.

    • studypercap: Per capita number of cancer-related clinical trials conducted.

    • binnedinc: Binned median income.

    • medianage: Median age in the region.

    • pctprivatecoveragealone: Percentage of population covered by private health insurance alone.

    • pctempprivcoverage: Percentage of population covered by employee-provided private health insurance.

    • pctpubliccoverage: Percentage of population covered by public health insurance.

    • pctpubliccoveragealone: Percentage of population covered by public health insurance only.

    • pctwhite: Percentage of White population.

    • pctblack: Percentage of Black population.

    • pctasian: Percentage of Asian population.

    • pctotherrace: Percentage of population belonging to other races.

    • pctmarriedhouseholds: Percentage of married households.

    • birthrate: Birth rate in the region.

    File 2nd

    This file contains demographic information about different regions, including details about household size and geographical location. Here's a description of each column:

    • statefips: The FIPS code representing the state.

    • countyfips: The FIPS code representing the county or census area within the state.

    • avghouseholdsize: The average household size in the region.

    • geography: The geographical location, typically represented as the county or census area name followed by the state name.

    Each row in the file represents a specific region, providing details about household size and geographical location. This information can be used for various demographic analyses and studies.
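
    County-level tables are often keyed by a five-digit identifier made of the two-digit state FIPS code followed by the three-digit county FIPS code. The sketch below builds that key from the statefips and countyfips columns described above. It assumes the codes may have lost their leading zeros or been read as floats; the sample rows are hypothetical.

```python
def county_geoid(statefips, countyfips):
    """Combine state and county FIPS codes into a zero-padded 5-digit key."""
    state = int(float(statefips))    # tolerate "1", "01", 1, or "1.0"
    county = int(float(countyfips))
    return f"{state:02d}{county:03d}"


def index_by_geoid(rows):
    """Map each row's 5-digit county key to the row itself."""
    return {county_geoid(r["statefips"], r["countyfips"]): r for r in rows}


# Hypothetical rows shaped like the second file's columns.
sample_rows = [
    {"statefips": "1", "countyfips": "1", "avghouseholdsize": "2.5",
     "geography": "Example County, Example State"},
    {"statefips": "53", "countyfips": "35.0", "avghouseholdsize": "2.4",
     "geography": "Another County, Another State"},
]

lookup = index_by_geoid(sample_rows)
print(sorted(lookup))
```

    A key built this way can serve as a join key against other county-level tables that carry a FIPS column.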

  4. Cervical Cancer Risk Classification - Dataset - CKAN

    • data.poltekkes-smg.ac.id
    Updated Oct 7, 2024
    + more versions
    Cite
    (2024). Cervical Cancer Risk Classification - Dataset - CKAN [Dataset]. https://data.poltekkes-smg.ac.id/dataset/cervical-cancer-risk-classification
    Explore at:
    Dataset updated
    Oct 7, 2024
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Cervical Cancer Risk Factors for Biopsy: This dataset was obtained from the UCI Repository, which is gratefully acknowledged. The file contains a list of risk factors for cervical cancer leading to a biopsy examination. About 11,000 new cases of invasive cervical cancer are diagnosed each year in the U.S. However, the number of new cervical cancer cases has been declining steadily over the past decades. Although it is the most preventable type of cancer, each year cervical cancer kills about 4,000 women in the U.S. and about 300,000 women worldwide. In the United States, cervical cancer mortality rates plunged by 74% from 1955 - 1992 thanks to increased screening and early detection with the Pap test.

    AGE Fifty percent of cervical cancer diagnoses occur in women ages 35 - 54, and about 20% occur in women over 65 years of age. The median age of diagnosis is 48 years. About 15% of women develop cervical cancer between the ages of 20 - 30. Cervical cancer is extremely rare in women younger than age 20. However, many young women become infected with multiple types of human papilloma virus, which then can increase their risk of getting cervical cancer in the future. Young women with early abnormal changes who do not have regular examinations are at high risk for localized cancer by the time they are age 40, and for invasive cancer by age 50.

    SOCIOECONOMIC AND ETHNIC FACTORS Although the rate of cervical cancer has declined among both Caucasian and African-American women over the past decades, it remains much more prevalent in African-Americans, whose death rates are twice as high as those of Caucasian women. Hispanic American women have more than twice the risk of invasive cervical cancer as Caucasian women, also due to a lower rate of screening. These differences, however, are almost certainly due to social and economic differences. Numerous studies report that high poverty levels are linked with low screening rates. In addition, lack of health insurance, limited transportation, and language difficulties hinder a poor woman’s access to screening services.

    HIGH SEXUAL ACTIVITY Human papilloma virus (HPV) is the main risk factor for cervical cancer. In adults, the most important risk factor for HPV is sexual activity with an infected person. Women most at risk for cervical cancer are those with a history of multiple sexual partners, sexual intercourse at age 17 years or younger, or both. A woman who has never been sexually active has a very low risk for developing cervical cancer. Sexual activity with multiple partners increases the likelihood of many other sexually transmitted infections (chlamydia, gonorrhea, syphilis). Studies have found an association between chlamydia and cervical cancer risk, including the possibility that chlamydia may prolong HPV infection.

    FAMILY HISTORY Women have a higher risk of cervical cancer if they have a first-degree relative (mother, sister) who has had cervical cancer.

    USE OF ORAL CONTRACEPTIVES Studies have reported a strong association between cervical cancer and long-term use of oral contraception (OC). Women who take birth control pills for more than 5 - 10 years appear to have a much higher risk of HPV infection (up to four times higher) than those who do not use OCs. (Women taking OCs for fewer than 5 years do not have a significantly higher risk.) The reasons for this risk from OC use are not entirely clear. Women who use OCs may be less likely to use a diaphragm, condoms, or other methods that offer some protection against sexually transmitted diseases, including HPV. Some research also suggests that the hormones in OCs might help the virus enter the genetic material of cervical cells.

    HAVING MANY CHILDREN Studies indicate that having many children increases the risk for developing cervical cancer, particularly in women infected with HPV.

    SMOKING Smoking is associated with a higher risk for precancerous changes (dysplasia) in the cervix and for progression to invasive cervical cancer, especially for women infected with HPV.

    IMMUNOSUPPRESSION Women with weak immune systems (such as those with HIV / AIDS) are more susceptible to acquiring HPV. Immunocompromised patients are also at higher risk for having cervical precancer develop rapidly into invasive cancer.

    DIETHYLSTILBESTROL (DES) From 1938 - 1971, diethylstilbestrol (DES), an estrogen-related drug, was widely prescribed to pregnant women to help prevent miscarriages. The daughters of these women face a higher risk for cervical cancer. DES is no longer prescribed.

  5. County Cancer Death Rates

    • kaggle.com
    zip
    Updated Dec 3, 2023
    Cite
    The Devastator (2023). County Cancer Death Rates [Dataset]. https://www.kaggle.com/datasets/thedevastator/county-cancer-death-rates/discussion
    Explore at:
    Available download formats: zip (883348 bytes)
    Dataset updated
    Dec 3, 2023
    Authors
    The Devastator
    Description

    County Cancer Death Rates

    County-level cancer death rates with related variables

    By Noah Rippner [source]

    About this dataset

    This dataset provides comprehensive information on county-level cancer death and incidence rates, as well as various related variables. It includes data on age-adjusted death rates, average deaths per year, recent trends in cancer death rates, recent 5-year trends in death rates, and average annual counts of cancer deaths or incidence. The dataset also includes the federal information processing standards (FIPS) codes for each county.

    Additionally, the dataset indicates whether each county met the objective of a targeted death rate of 45.5. The recent trend in cancer deaths or incidence is also captured for analysis purposes.

    The purpose of the death.csv file within this dataset is to offer detailed information specifically concerning county-level cancer death rates and related variables. On the other hand, the incd.csv file contains data on county-level cancer incidence rates and additional relevant variables.

    To provide more context and understanding about the included data points, there is a separate file named cancer_data_notes.csv. This file serves to provide informative notes and explanations regarding the various aspects of the cancer data used in this dataset.

    Note that this description accompanies a linear regression walkthrough in Python that uses this dataset. The walkthrough shows how to source and import the data, then moves through data preparation and exploratory analysis, model selection, and key model diagnostics.

    It's essential to bear in mind that this example serves as an initial attempt at creating a multivariate Ordinary Least Squares regression model using these datasets from various sources like cancer.gov along with US Census American Community Survey data. This baseline model allows easy comparisons with future iterations intended for improvements or refinements.
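
    For readers unfamiliar with multivariate Ordinary Least Squares, the sketch below fits such a model by solving the normal equations with the Python standard library. It is not the walkthrough's own code. The two predictors and the target are synthetic values generated from a known linear rule, so the recovered coefficients can be checked.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        if abs(aug[col][col]) < 1e-12:
            raise ValueError("singular design matrix")
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                for c in range(col, n + 1):
                    aug[r][c] -= factor * aug[col][c]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_ols(features, targets):
    """Return [intercept, b1, b2, ...] minimizing the sum of squared errors."""
    x = [[1.0] + list(row) for row in features]  # prepend intercept column
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(x, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Synthetic data generated exactly from y = 2 + 3*x1 - 1*x2.
features = [(1.0, 2.0), (2.0, 1.0), (3.0, 5.0), (4.0, 3.0), (5.0, 8.0)]
targets = [2 + 3 * x1 - x2 for x1, x2 in features]

coefficients = fit_ols(features, targets)
print([round(c, 6) for c in coefficients])
```

    In practice a library such as statsmodels or scikit-learn would normally be used instead, since it also reports the diagnostics mentioned in the description; the hand-rolled solver here only shows what the fit computes.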

    Important columns in this dataset include County names and their corresponding FIPS codes, a standardized coding system defined by the Federal Information Processing Standards (FIPS). The Met Objective of 45.5? (1) column denotes whether a county achieved the targeted death rate of 45.5.

    Overall, this dataset aims to offer valuable insights into county-level cancer death and incidence rates across various regions, providing policymakers, researchers, and healthcare professionals with essential information for analysis and decision-making purposes.

    How to use the dataset

    • Familiarize Yourself with the Columns:

      • County: The name of the county.
      • FIPS: The Federal Information Processing Standards code for the county.
      • Met Objective of 45.5? (1): Indicates whether the county met the objective of a death rate of 45.5 (Boolean).
      • Age-Adjusted Death Rate: The age-adjusted death rate for cancer in the county.
      • Average Deaths per Year: The average number of deaths per year due to cancer in the county.
      • Recent Trend (2): The recent trend in cancer death rates/incidence in the county.
      • Recent 5-Year Trend (2) in Death Rates: The recent 5-year trend in cancer death rates/incidence in the county.
      • Average Annual Count: The average annual count of cancer deaths/incidence in the county.
    • Determine Counties Meeting Objective: Use this dataset to identify counties that have or have not met the objective death rate threshold of 45.5. Look for entries where Met Objective of 45.5? (1) is marked as True or False.

    • Analyze Age-Adjusted Death Rates: Study and compare age-adjusted death rates across different counties using Age-Adjusted Death Rate values provided as floats.

    • Explore Average Deaths per Year: Examine and compare average annual counts and trends regarding deaths caused by cancer, using Average Deaths per Year as a reference point.

    • Investigate Recent Trends: Assess recent trends related to cancer deaths or incidence by analyzing data under columns such as Recent Trend, Recent Trend (2), and Recent 5-Year Trend (2) in Death Rates. These columns provide information on how cancer death rates/incidence have changed over time.

    • Compare Counties: Utilize this dataset to compare counties based on their cancer death rates and related variables. Identify counties with lower or higher average annual counts, age-adjusted death rates, or recent trends to analyze and understand the factors contributing ...
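
    The steps in this list can be sketched with the Python standard library. Because the exact encoding of the Met Objective of 45.5? (1) column is not confirmed here, the sketch recomputes the split from the Age-Adjusted Death Rate column, assuming that "met" means a rate at or below 45.5 and that some rates may be non-numeric placeholders. The sample rows are hypothetical.

```python
import csv
import io

OBJECTIVE = 45.5  # target death rate named in the dataset description

# Hypothetical rows using column names from the description; the values and
# the "*" placeholder for a missing rate are assumptions.
SAMPLE_CSV = """County,FIPS,Age-Adjusted Death Rate
Alpha County,01001,40.2
Beta County,01003,51.7
Gamma County,01005,*
"""


def parse_rate(text):
    """Return the rate as a float, or None when it is not numeric."""
    try:
        return float(text)
    except ValueError:
        return None


def split_by_objective(rows, objective=OBJECTIVE):
    """Group county names by whether their rate is at or below the objective."""
    met, not_met, unknown = [], [], []
    for row in rows:
        rate = parse_rate(row["Age-Adjusted Death Rate"])
        if rate is None:
            unknown.append(row["County"])
        elif rate <= objective:
            met.append(row["County"])
        else:
            not_met.append(row["County"])
    return met, not_met, unknown


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
met, not_met, unknown = split_by_objective(rows)
print("met:", met, "not met:", not_met, "unknown:", unknown)
```

    Keeping the unknown group separate avoids silently counting counties with missing rates as having met or missed the objective.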

  6. Subtypes of Native American ancestry and leading causes of death: Mapuche...

    • heidata.uni-heidelberg.de
    txt
    Updated Oct 24, 2018
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Justo Lorenzo Bermejo; Felix Boekstegers; Rosa González Silos; Katherine Marcelain; Pablo Baez Benavides; Carol Barahona Ponce; Bettina Müller; Catterina Ferreccio; Jill Koshiol; Christine Fischer; Barbara Peil; Janet Sinsheimer; Macarena Fuentes Guajardo; Olga Barajas; Rolando Gonzalez-Jose; Gabriel Bedoya; Maria Cátira Bortolini; Samuel Canizales-Quinteros; Carla Gallo; Andres Ruiz Linares; Francisco Rothhammer (2018). Subtypes of Native American ancestry and leading causes of death: Mapuche ancestry-specific associations with gallbladder cancer risk in Chile [Dataset]. http://doi.org/10.11588/DATA/IDSI88
    Explore at:
    Available download formats: txt (263073), txt (36100)
    Dataset updated
    Oct 24, 2018
    Dataset provided by
    heiDATA
    Authors
    Justo Lorenzo Bermejo; Felix Boekstegers; Rosa González Silos; Katherine Marcelain; Pablo Baez Benavides; Carol Barahona Ponce; Bettina Müller; Catterina Ferreccio; Jill Koshiol; Christine Fischer; Barbara Peil; Janet Sinsheimer; Macarena Fuentes Guajardo; Olga Barajas; Rolando Gonzalez-Jose; Gabriel Bedoya; Maria Cátira Bortolini; Samuel Canizales-Quinteros; Carla Gallo; Andres Ruiz Linares; Francisco Rothhammer
    License

    https://heidata.uni-heidelberg.de/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.11588/DATA/IDSI88

    Area covered
    Chile
    Description

    Latin Americans are highly heterogeneous regarding the type of Native American ancestry. Consideration of specific associations with common diseases may lead to substantial advances in unraveling disease etiology and in disease prevention. Here we investigate possible associations between the type of Native American ancestry and leading causes of death. After an aggregate-data study based on genome-wide genotype data from 1805 admixed Chileans and 639,789 deaths, we validate an identified association with gallbladder cancer relying on individual data from 64 gallbladder cancer patients, with and without a family history, and 170 healthy controls. Native American proportions were markedly underestimated when the two main types of Native American ancestry in Chile, originating from the Mapuche and Aymara indigenous peoples, were combined. Consideration of the type of Native American ancestry was crucial to identify disease associations. Native American ancestry showed no association with gallbladder cancer mortality (P = 0.26). By contrast, each 1% increase in the Mapuche proportion represented a 3.7% increased mortality risk by gallbladder cancer (95%CI 3.1–4.3%, P = 6×10^−27). Individual-data results and extensive sensitivity analyses confirmed the association between Mapuche ancestry and gallbladder cancer. Increasing Mapuche proportions were also associated with an increased mortality due to asthma and, interestingly, with a decreased mortality by diabetes. The mortality due to skin, bladder, larynx, bronchus and lung cancers increased with increasing Aymara proportions. The described methods should be considered in future studies on human population genetics and human health. Complementary individual-based studies are needed to apportion the genetic and non-genetic components of associations identified relying on aggregate data.
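
    To give a feel for the size of the reported effect, the sketch below converts the per-point estimate (3.7% per 1% of Mapuche proportion, 95% CI 3.1–4.3%) into a relative risk for a larger difference in ancestry proportion. It assumes the effect compounds multiplicatively per percentage point, as it would in a log-linear model; the description does not state the model form, so this is an assumption.

```python
def relative_risk(percent_increase_per_point, points):
    """Relative risk after `points` percentage points, compounding per point."""
    return (1 + percent_increase_per_point / 100) ** points


# Estimate and 95% confidence bounds quoted in the description.
ESTIMATE, LOWER, UPPER = 3.7, 3.1, 4.3

# Hypothetical comparison: a 10-percentage-point higher Mapuche proportion.
rr_point = relative_risk(ESTIMATE, 10)
rr_low = relative_risk(LOWER, 10)
rr_high = relative_risk(UPPER, 10)
print(f"relative risk: {rr_point:.2f} ({rr_low:.2f} to {rr_high:.2f})")
```

    Under an additive reading instead (3.7% times 10), the figure would be lower, so the choice of model form matters when extrapolating beyond a single percentage point.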

  7. Cancer Regression

    • kaggle.com
    Updated Apr 14, 2024
    Cite
    Varun Raskar (2024). Cancer Regression [Dataset]. https://www.kaggle.com/datasets/varunraskar/cancer-regression
    Explore at:
    Dataset updated
    Apr 14, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Varun Raskar
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    The dataset contains 2 .csv files

    This file contains various demographic and health-related data for different regions. Here's a brief description of each column:

    File 1st

    avganncount: Average number of cancer cases diagnosed annually.

    avgdeathsperyear: Average number of deaths due to cancer per year.

    target_deathrate: Target death rate due to cancer.

    incidencerate: Incidence rate of cancer.

    medincome: Median income in the region.

    popest2015: Estimated population in 2015.

    povertypercent: Percentage of population below the poverty line.

    studypercap: Per capita number of cancer-related clinical trials conducted.

    binnedinc: Binned median income.

    medianage: Median age in the region.

    pctprivatecoveragealone: Percentage of population covered by private health insurance alone.

    pctempprivcoverage: Percentage of population covered by employee-provided private health insurance.

    pctpubliccoverage: Percentage of population covered by public health insurance.

    pctpubliccoveragealone: Percentage of population covered by public health insurance only.

    pctwhite: Percentage of White population.

    pctblack: Percentage of Black population.

    pctasian: Percentage of Asian population.

    pctotherrace: Percentage of population belonging to other races.

    pctmarriedhouseholds: Percentage of married households.

    birthrate: Birth rate in the region.

    File 2nd

    This file contains demographic information about different regions, including details about household size and geographical location. Here's a description of each column:

    statefips: The FIPS code representing the state.

    countyfips: The FIPS code representing the county or census area within the state.

    avghouseholdsize: The average household size in the region.

    geography: The geographical location, typically represented as the county or census area name followed by the state name.

    Each row in the file represents a specific region, providing details about household size and geographical location. This information can be used for various demographic analyses and studies.
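
    As one example of the demographic analyses mentioned here, the sketch below splits the geography field into county and state and averages avghouseholdsize per state. It assumes the two parts are separated by a comma, which the description implies but does not confirm; the sample rows are hypothetical.

```python
from collections import defaultdict


def split_geography(text):
    """Split 'County Name, State Name' on the last comma (assumed separator)."""
    county, separator, state = text.rpartition(",")
    if not separator:
        return text.strip(), None  # no comma found: state is unknown
    return county.strip(), state.strip()


def mean_household_size_by_state(rows):
    """Average the avghouseholdsize column within each state."""
    sizes = defaultdict(list)
    for row in rows:
        _, state = split_geography(row["geography"])
        sizes[state].append(float(row["avghouseholdsize"]))
    return {state: sum(values) / len(values) for state, values in sizes.items()}


# Hypothetical rows shaped like the second file's columns.
sample_rows = [
    {"geography": "First County, State A", "avghouseholdsize": "2.0"},
    {"geography": "Second County, State A", "avghouseholdsize": "3.0"},
    {"geography": "Third County, State B", "avghouseholdsize": "2.4"},
]

by_state = mean_household_size_by_state(sample_rows)
print(by_state)
```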

  8. DICOM converted Slide Microscopy images for the TCGA-CHOL collection

    • zenodo.org
    bin
    Updated Aug 20, 2024
    Cite
    David Clunie; William Clifford; David Pot; Ulrike Wagner; Keyvan Farahani; Erika Kim; Andrey Fedorov (2024). DICOM converted Slide Microscopy images for the TCGA-CHOL collection [Dataset]. http://doi.org/10.5281/zenodo.13346204
    Explore at:
    Available download formats: bin
    Dataset updated
    Aug 20, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    David Clunie; William Clifford; David Pot; Ulrike Wagner; Keyvan Farahani; Erika Kim; Andrey Fedorov
    License

    Attribution 3.0 (CC BY 3.0) (https://creativecommons.org/licenses/by/3.0/)
    License information was derived automatically

    Description

    This dataset corresponds to a collection of images and/or image-derived data available from National Cancer Institute Imaging Data Commons (IDC) [1]. This dataset was converted into DICOM representation and ingested by the IDC team. You can explore and visualize the corresponding images using IDC Portal here: TCGA-CHOL. You can use the manifests included in this Zenodo record to download the content of the collection following the Download instructions below.

    Collection description

    Cholangiocarcinoma is a cancer that develops in the bile duct. The bile duct is a network of tubes that carry bile from the liver and gallbladder to the small intestine. Tumors that start in bile duct branches that lie inside the liver are called intrahepatic bile duct cancer, while those that form outside the liver are called extrahepatic bile duct cancer. About 10% of all cholangiocarcinoma are intrahepatic and 90% are extrahepatic. TCGA studied both subtypes of cholangiocarcinoma.

    Although cholangiocarcinoma is a rare cancer, the incidence and mortality rates for the disease have been increasing worldwide in the last three decades. Between 2,000 and 3,000 Americans are diagnosed with cholangiocarcinoma each year, the majority of them with tumors at advanced stages. This cancer is more prevalent in Asia and the Middle East, where parasitic infection of the bile duct increases the risk of cholangiocarcinoma. Other diseases of the bile duct or liver, such as bile duct stones and liver disease, obesity, diabetes, and smoking are also risk factors. When intrahepatic and extrahepatic cholangiocarcinoma spread to other parts of the body, only 2% of patients survive five years after diagnosis.

    Please see the TCGA-CHOL information page to learn more about the images and to obtain any supporting metadata for this collection.

    Citation guidelines can be found on the Citing TCGA in Publications and Presentations information page.

    Files included

    A manifest file's name indicates the IDC data release in which a version of collection data was first introduced. For example, collection_id-idc_v8-aws.s5cmd corresponds to the contents of the collection_id collection introduced in IDC data release v8. If there is a subsequent version of this Zenodo page, it will indicate when a subsequent version of the corresponding collection was introduced.

    1. tcga_chol-idc_v10-aws.s5cmd: manifest of files available for download from public IDC Amazon Web Services buckets
    2. tcga_chol-idc_v10-gcs.s5cmd: manifest of files available for download from public IDC Google Cloud Storage buckets
    3. tcga_chol-idc_v10-dcf.dcf: Gen3 manifest (for details see https://learn.canceridc.dev/data/organization-of-data/guids-and-uuids)

    Note that manifest files that end in -aws.s5cmd reference files stored in Amazon Web Services (AWS) buckets, while -gcs.s5cmd reference files in Google Cloud Storage. The actual files are identical and are mirrored between AWS and GCP.
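
    The naming convention described above can be checked programmatically. The sketch below parses a .s5cmd manifest file name into its collection id, IDC release number, and storage provider. The regular expression is inferred from the examples in this description and is an assumption, not an official IDC specification.

```python
import re

# Pattern inferred from names such as "tcga_chol-idc_v10-aws.s5cmd".
MANIFEST_PATTERN = re.compile(
    r"^(?P<collection>.+)-idc_v(?P<release>\d+)-(?P<store>aws|gcs)\.s5cmd$"
)

STORE_NAMES = {"aws": "Amazon Web Services", "gcs": "Google Cloud Storage"}


def parse_manifest_name(filename):
    """Return collection, release, and store for a .s5cmd manifest, else None."""
    match = MANIFEST_PATTERN.match(filename)
    if match is None:
        return None
    return {
        "collection": match.group("collection"),
        "release": int(match.group("release")),
        "store": STORE_NAMES[match.group("store")],
    }


aws_info = parse_manifest_name("tcga_chol-idc_v10-aws.s5cmd")
gcs_info = parse_manifest_name("tcga_chol-idc_v10-gcs.s5cmd")
dcf_info = parse_manifest_name("tcga_chol-idc_v10-dcf.dcf")  # not an s5cmd file
print(aws_info, gcs_info, dcf_info)
```

    The Gen3 .dcf manifest deliberately returns None here, since this sketch only covers the two .s5cmd variants.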

    Download instructions

    Each manifest includes instructions in its header on how to download the included files.

    To download the files using .s5cmd manifests:

    1. Install the idc-index package: pip install --upgrade idc-index
    2. Download the files referenced by a manifest included in this dataset by passing the .s5cmd manifest file: idc download manifest.s5cmd

    To download the files using .dcf manifest, see manifest header.

    Acknowledgments

    Imaging Data Commons team has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, under Task Order No. HHSN26110071 under Contract No. HHSN261201500003l.

    References

    [1] Fedorov, A., Longabaugh, W. J. R., Pot, D., Clunie, D. A., Pieper, S. D., Gibbs, D. L., Bridge, C., Herrmann, M. D., Homeyer, A., Lewis, R., Aerts, H. J. W., Krishnaswamy, D., Thiriveedhi, V. K., Ciausu, C., Schacherer, D. P., Bontempi, D., Pihl, T., Wagner, U., Farahani, K., Kim, E. & Kikinis, R. National Cancer Institute Imaging Data Commons: Toward Transparency, Reproducibility, and Scalability in Imaging Artificial Intelligence. RadioGraphics (2023). https://doi.org/10.1148/rg.230180

  9. RSNA Screening Mammography Breast Cancer Detection (RSNA-SMBC) Dataset

    • registry.opendata.aws
    Updated Aug 1, 2024
    Cite
    Radiological Society of North America (https://www.rsna.org/) (2024). RSNA Screening Mammography Breast Cancer Detection (RSNA-SMBC) Dataset [Dataset]. https://registry.opendata.aws/rsna-screening-mammography-breast-cancer-detection/
    Dataset updated
    Aug 1, 2024
    Dataset provided by
    Radiological Society of North America
    Description

    According to the WHO, breast cancer is the most commonly occurring cancer worldwide. In 2020 alone, there were 2.3 million new breast cancer diagnoses and 685,000 deaths. Yet breast cancer mortality in high-income countries has dropped by 40% since the 1980s when health authorities implemented regular mammography screening in age groups considered at risk. Early detection and treatment are critical to reducing cancer fatalities, and your machine learning skills could help streamline the process radiologists use to evaluate screening mammograms. Currently, early detection of breast cancer requires the expertise of highly-trained human observers, making screening mammography programs expensive to conduct. RSNA collected screening mammograms and supporting information from two sites, totaling just under 20,000 imaging studies.

  10. Table_3_Comparison of Radiomic Features in a Diverse Cohort of Patients With...

    • frontiersin.figshare.com
    • datasetcatalog.nlm.nih.gov
    xlsx
    Updated Jun 1, 2023
    + more versions
    Cite
    Jennifer B. Permuth; Shraddha Vyas; Jiannong Li; Dung-Tsa Chen; Daniel Jeong; Jung W. Choi (2023). Table_3_Comparison of Radiomic Features in a Diverse Cohort of Patients With Pancreatic Ductal Adenocarcinomas.xlsx [Dataset]. http://doi.org/10.3389/fonc.2021.712950.s003
    Available download formats: xlsx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Jennifer B. Permuth; Shraddha Vyas; Jiannong Li; Dung-Tsa Chen; Daniel Jeong; Jung W. Choi
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Significant racial disparities in pancreatic cancer incidence and mortality rates exist, with the highest rates in African Americans compared to Non-Hispanic Whites and Hispanic/Latinx populations. Computer-derived quantitative imaging or “radiomic” features may serve as non-invasive surrogates for underlying biological factors and heterogeneity that characterize pancreatic tumors from African Americans, yet studies are lacking in this area. The objective of this pilot study was to determine if the radiomic tumor profile extracted from pretreatment computed tomography (CT) images differs between African Americans, Non-Hispanic Whites, and Hispanic/Latinx with pancreatic cancer.

    Methods: We evaluated a retrospective cohort of 71 pancreatic cancer cases (23 African American, 33 Non-Hispanic White, and 15 Hispanic/Latinx) who underwent pretreatment CT imaging at Moffitt Cancer Center and Research Institute. Whole lesion semi-automated segmentation was performed on each slice of the lesion on all pretreatment venous phase CT exams using Healthmyne Software (Healthmyne, Madison, WI, USA) to generate a volume of interest. To reduce feature dimensionality, 135 highly relevant non-texture and texture features were extracted from each segmented lesion and analyzed for each volume of interest.

    Results: Thirty features were identified and significantly associated with race/ethnicity based on Kruskal-Wallis test. Ten of the radiomic features were highly associated with race/ethnicity independent of tumor grade, including sphericity, volumetric mean Hounsfield units (HU), minimum HU, coefficient of variation HU, four gray level texture features, and two wavelet texture features. A radiomic signature summarized by the first principal component partially differentiated African American from non-African American tumors (area underneath the curve = 0.80). Poorer survival among African Americans compared to Non-African Americans was observed for tumors with lower volumetric mean CT [HR: 3.90 (95% CI: 1.19–12.78), p=0.024], lower GLCM Avg Column Mean [HR: 4.75 (95% CI: 1.44,15.37), p=0.010], and higher GLCM Cluster Tendency [HR: 3.36 (95% CI: 1.06–10.68), p=0.040], and associations persisted in volumetric mean CT and GLCM Avg Column after adjustment for key clinicopathologic factors.

    Conclusions: This pilot study identified several textural radiomics features associated with poor overall survival among African Americans with PDAC, independent of other prognostic factors such as grade. Our findings suggest that CT radiomic features may serve as surrogates for underlying biological factors and add value in predicting clinical outcomes when integrated with other parameters in ongoing and future studies of cancer health disparities.

  11. Validation of a genetic risk score for Arkansas women of color

    • plos.figshare.com
    docx
    Updated May 31, 2023
    Cite
    Athena Starlard-Davenport; Richard Allman; Gillian S. Dite; John L. Hopper; Erika Spaeth Tuff; Stewart Macleod; Susan Kadlubar; Michael Preston; Ronda Henry-Tillman (2023). Validation of a genetic risk score for Arkansas women of color [Dataset]. http://doi.org/10.1371/journal.pone.0204834
    Available download formats: docx
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Athena Starlard-Davenport; Richard Allman; Gillian S. Dite; John L. Hopper; Erika Spaeth Tuff; Stewart Macleod; Susan Kadlubar; Michael Preston; Ronda Henry-Tillman
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    African American women in the state of Arkansas have high breast cancer mortality rates. Breast cancer risk assessment tools developed for African American women underestimate breast cancer risk. Combining African American breast cancer associated single-nucleotide polymorphisms (SNPs) into breast cancer risk algorithms may improve individualized estimates of a woman’s risk of developing breast cancer and enable improved recommendation of screening and chemoprevention for women at high risk. The goal of this study was to confirm, with an independent dataset consisting of Arkansas women of color, whether a genetic risk score derived from common breast cancer susceptibility SNPs can be combined with a clinical risk estimate provided by the Breast Cancer Risk Assessment Tool (BCRAT) to produce a more accurate individualized breast cancer risk estimate. A population-based cohort of African American women representative of Arkansas consisted of 319 cases and 559 controls for this study. Five-year and lifetime risks from the BCRAT were measured and combined with a risk score based on 75 independent susceptibility SNPs in African American women. We used the odds ratio (OR) per adjusted standard deviation to evaluate the improvement in risk estimates produced by combining the polygenic risk score (PRS) with 5-year and lifetime risk scores estimated using BCRAT. For 5-year risk, the OR per standard deviation increased from 1.84 to 2.08 with the addition of the polygenic risk score, and from 1.79 to 2.07 for the lifetime risk score. Reclassification analysis indicated that 13% of cases had their 5-year risk increased above the 1.66% guideline threshold (NRI = 0.020 (95% CI -0.040, 0.080)) and 6.3% of cases had their lifetime risk increased above the 20% guideline threshold by the addition of the polygenic risk score (NRI = 0.034 (95% CI 0.000, 0.070)). Our data confirmed that discriminatory accuracy of BCRAT is improved for African American women in Arkansas with the inclusion of specific SNP breast cancer risk alleles.

  12. Data_Sheet_2_Association of Cancer Diagnosis and Therapeutic Stage With...

    • frontiersin.figshare.com
    • datasetcatalog.nlm.nih.gov
    pdf
    Updated Jun 3, 2023
    Cite
    Jesus Ángel Dominguez-Rojas; Pablo Vásquez-Hoyos; Rodrigo Pérez-Morales; Ana María Monsalve-Quintero; Lupe Mora-Robles; Alejandro Diaz-Diaz; Silvio Fabio Torres; Ángel Castro-Dajer; Lizeth Yuliana Cabanillas-Burgos; Vladimir Aguilera-Avendaño; Edwin Mauricio Cantillano-Quintero; Anna Camporesi; Asya Agulnik; Sheena Mukkada; Giancarlo Alvarado-Gamarra; Ninoska Rojas-Soto; Ana Luisa Mendieta-Zevallos; Mariela Violeta Tello-Pezo; Liliana Vásquez-Ponce; Rubén Eduardo Lasso-Palomino; María Camila Pérez-Arroyave; Mónica Trujillo-Honeysberg; Juan Gonzalo Mesa-Monsalve; Carlos Alberto Pardo González; Juan Francisco López Cubillos; Sebastián Gonzalez-Dambrauskas; Alvaro Coronado-Munoz (2023). Data_Sheet_2_Association of Cancer Diagnosis and Therapeutic Stage With Mortality in Pediatric Patients With COVID-19, Prospective Multicenter Cohort Study From Latin America.pdf [Dataset]. http://doi.org/10.3389/fped.2022.885633.s002
    Available download formats: pdf
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Jesus Ángel Dominguez-Rojas; Pablo Vásquez-Hoyos; Rodrigo Pérez-Morales; Ana María Monsalve-Quintero; Lupe Mora-Robles; Alejandro Diaz-Diaz; Silvio Fabio Torres; Ángel Castro-Dajer; Lizeth Yuliana Cabanillas-Burgos; Vladimir Aguilera-Avendaño; Edwin Mauricio Cantillano-Quintero; Anna Camporesi; Asya Agulnik; Sheena Mukkada; Giancarlo Alvarado-Gamarra; Ninoska Rojas-Soto; Ana Luisa Mendieta-Zevallos; Mariela Violeta Tello-Pezo; Liliana Vásquez-Ponce; Rubén Eduardo Lasso-Palomino; María Camila Pérez-Arroyave; Mónica Trujillo-Honeysberg; Juan Gonzalo Mesa-Monsalve; Carlos Alberto Pardo González; Juan Francisco López Cubillos; Sebastián Gonzalez-Dambrauskas; Alvaro Coronado-Munoz
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Latin America
    Description

    Background: Children with cancer are at risk of critical disease and mortality from COVID-19 infection. In this study, we describe the clinical characteristics of pediatric patients with cancer and COVID-19 from multiple Latin American centers and risk factors associated with mortality in this population.

    Methods: This study is a multicenter, prospective cohort study conducted at 12 hospitals from 6 Latin American countries (Argentina, Bolivia, Colombia, Ecuador, Honduras and Peru) from April to November 2021. Patients younger than 14 years of age that had an oncological diagnosis and COVID-19 or multisystemic inflammatory syndrome in children (MIS-C) who were treated in the inpatient setting were included. The primary exposure was the diagnosis and treatment status, and the primary outcome was mortality. We defined “new diagnosis” as patients with no previous diagnosis of cancer, “established diagnosis” as patients with cancer and ongoing treatment and “relapse” as patients with cancer and ongoing treatment that had a prior cancer-free period. A frequentist analysis was performed including a multivariate logistic regression for mortality.

    Results: Two hundred and ten patients were included in the study; 30 (14%) died during the study period and 67% of patients who died were admitted to critical care. Demographics were similar in survivors and non-survivors. Patients with low weight for age (


Cite
The Devastator (2022). Cancer County-Level [Dataset]. https://www.kaggle.com/datasets/thedevastator/exploring-county-level-correlations-in-cancer-ra

Cancer County-Level

Study county-level cancer correlations

21 scholarly articles cite this dataset
Available download formats: zip (146998 bytes)
Dataset updated
Dec 3, 2022
Authors
The Devastator
Description

Exploring County-Level Correlations in Cancer Rates and Trends

A Multivariate Ordinary Least Squares Regression Model

By Noah Rippner [source]

About this dataset

This dataset offers an opportunity to examine patterns and trends in cancer rates across the United States at the individual county level. Drawing on data from cancer.gov and the US Census American Community Survey, it shows how the age-adjusted death rate, average deaths per year, and recent trends vary between counties, alongside other metrics such as average annual counts, whether the objective of 45.5 was met, and the recent trend (2) in death rates. Linear regression models built on these data can reveal correlations between variables that help explain cancer prevalence across counties over time, making it easier to target health initiatives and resources accurately.

How to use the dataset

This Kaggle dataset combines county-level data from the US Census American Community Survey and cancer.gov for exploring correlations between county-level cancer rates, trends, and mortality statistics. It contains records for all U.S. counties covering the age-adjusted death rate, average deaths per year, the recent trend (2) in death rates, the average annual count of cases detected within 5 years, and whether or not an objective of 45.5 (1) was met in the county associated with each row of the table.

To use this dataset to its fullest potential, you should be comfortable with basic descriptive analytics: calculating summary statistics such as the mean and median, summarizing categorical variables with frequency tables, and creating visualizations such as charts and histograms. It also helps to know how to apply linear regression or other machine learning techniques (support vector machines, random forests, neural networks), to distinguish supervised from unsupervised learning, to review diagnostic tests when evaluating models, to interpret findings and hypothesize about the patterns that visualizations reveal, and to communicate results through clear slides or documents. With this background you can apply different methods of analysis to the dataset accurately and effectively.
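
As a minimal sketch of the descriptive steps named above (summary statistics and a frequency table), the following uses only the Python standard library. The rows and the field names `death_rate` and `recent_trend` are hypothetical stand-ins, since the dataset's actual column names are not listed here.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical rows; the real file's column names and values may differ.
rows = [
    {"county": "A", "death_rate": 46.0, "recent_trend": "falling"},
    {"county": "B", "death_rate": 52.5, "recent_trend": "stable"},
    {"county": "C", "death_rate": 39.5, "recent_trend": "falling"},
    {"county": "D", "death_rate": 61.0, "recent_trend": "rising"},
    {"county": "E", "death_rate": 46.0, "recent_trend": "stable"},
]

rates = [row["death_rate"] for row in rows]

# Summary statistics for a numeric variable.
summary = {
    "count": len(rates),
    "mean": mean(rates),
    "median": median(rates),
    "min": min(rates),
    "max": max(rates),
}

# Frequency table for a categorical variable.
trend_counts = Counter(row["recent_trend"] for row in rows)

print(summary)
for trend, count in trend_counts.most_common():
    print(f"{trend}: {count}")
```

With the real file, the same two steps would run over rows read by `csv.DictReader` instead of the inline list.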

Once these concepts are understood, you can start exploring the dataset by importing it into your tool of choice: Tableau Public or Desktop, QlikView, the SAS analytics suite, or Python notebooks, where packages such as scikit-learn can be loaded for building predictive models. A brief description of the table's column structure is provided above. With a working knowledge of basic SQL, you can run statistical operations as simple queries, select subsets of columns under specific conditions, and sort by particular attributes. In Python, you can parse the portion of the data you need and group or aggregate categories before fitting any predictive model, joining additional tables where necessary; the details vary with the tool used. From there, you can examine the available features in depth, build correlation and covariance matrices, plot distributions and relationships, and use scatter plots of the relevant fields to identify trends.
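
The correlation matrix mentioned above can be sketched as follows. This is an illustration only: it uses a hand-written Pearson correlation, and the column names (`poverty_pct`, `median_income`, `death_rate`) and their values are made up rather than taken from the dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Map every ordered pair of column names to its Pearson correlation."""
    names = list(columns)
    return {
        (a, b): pearson(columns[a], columns[b])
        for a in names
        for b in names
    }


# Hypothetical values for five counties.
columns = {
    "poverty_pct": [10.0, 15.0, 20.0, 25.0, 30.0],
    "median_income": [70.0, 60.0, 50.0, 40.0, 30.0],
    "death_rate": [40.0, 44.0, 47.0, 55.0, 58.0],
}

matrix = correlation_matrix(columns)
for (a, b), r in matrix.items():
    print(f"{a:>14} vs {b:<14} r = {r:+.3f}")
```

A strong correlation in such a matrix only flags a relationship worth plotting; it says nothing about cause, which is why the scatter plots remain a useful next step.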

Research Ideas

  • Building a predictive cancer incidence model based on county-level demographic data to identify high-risk areas and target public health interventions.
  • Analyzing correlations between age-adjusted death rate, average annual count, and recent trends in order to develop more effective policy initiatives for cancer prevention and healthcare access.
  • Utilizing the dataset to construct a machine learning algorithm that can predict county-level mortality rates based on socio-economic factors such as poverty levels and educational attainment rates.
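
The third idea, predicting county-level mortality from socio-economic factors, matches the ordinary least squares model named in the dataset title. Below is a minimal sketch that fits such a model by solving the normal equations in plain Python. The predictor names (`poverty`, `college`) and all values are hypothetical, and the target is generated from known coefficients so the fit can be checked; this is not the original author's model.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build the augmented matrix so row operations apply to both sides.
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - known) / aug[i][i]
    return x


def fit_ols(features, target):
    """Return [intercept, coef_1, ...] minimizing the squared error."""
    design = [[1.0] + list(row) for row in features]  # prepend intercept column
    k = len(design[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, target)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical predictors per county: (poverty percentage, college percentage).
features = [
    (10.0, 35.0), (15.0, 30.0), (20.0, 22.0),
    (25.0, 25.0), (30.0, 12.0), (12.0, 40.0),
]
# Target built from known coefficients: 30 + 1.0 * poverty - 0.2 * college.
target = [30.0 + 1.0 * p - 0.2 * c for p, c in features]

intercept, b_poverty, b_college = fit_ols(features, target)
print(f"intercept={intercept:.3f}, poverty={b_poverty:.3f}, college={b_college:.3f}")
```

On real data the target would carry noise, so the fitted coefficients would be estimates, and a library routine with built-in diagnostics would usually be preferable to hand-rolled elimination.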

Acknowledgements

If you use this dataset i...
