CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Summary
This metadata record provides details of the data supporting the related manuscript: “Breast cancer gene expression datasets do not reflect the disease at the population level”. The related study aimed to determine how representative publicly available tumor gene expression datasets are of clinical populations.
As the data are all publicly available in appropriate community repositories, no primary data are included with this metadata record. Instead, the attached spreadsheet lists the 70 publicly available datasets, along with their respective details, including the repositories in which they are stored and their accession numbers. The 70 datasets represent 16,130 breast carcinomas.
Data access
All of the gene expression datasets analysed in the study are already publicly available, and their accession numbers and original publication references are listed in the Supplementary Table included with this metadata record.
The 70 publicly available datasets were identified by restricting the search to studies representing a minimum of 50 breast cancer patients with primary tumours.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ACRG is an independent, not-for-profit consortium that aims to accelerate research and improve treatment for patients affected by the most commonly diagnosed cancers in Asia by generating a genomic data resource for those cancers. ACRG is focusing its initial efforts on Asian liver, gastric and lung cancers.
Goals
- Generate comprehensive genomics data sets for Asia-prevalent cancers
- Conduct all research under good clinical practices and in accordance with local laws
- Uncover key mutations and pathways for developing targeted therapies
- Discover molecular tumor classifiers for patient stratification
- Discover prognostic markers to identify high-risk patients
- Freely share resulting raw data with the scientific community to empower researchers globally and enable development of new diagnostics and medicines
- Publish data analysis results jointly in prominent scientific journals
Over the next two years, Lilly, Merck and Pfizer have committed to create an extensive pharmacogenomic cancer database composed of data from approximately 2,000 tissue samples from patients with lung and gastric cancer. The database will be made publicly available to researchers and, over time, further populated with clinical data from a longitudinal analysis of patients. Comparison of the contrasting genomic signatures of these cancers could inform new approaches to treatment. Lilly has assumed responsibility for ultimately providing the data to the research public through an open-source concept managed by Lilly's Singapore research site. Moreover, Lilly, Merck and Pfizer will each provide technical and intellectual expertise. One dataset can be found at http://gigadb.org/dataset/100034
By Noah Rippner
This dataset offers an opportunity to examine patterns and trends in cancer rates across the United States at the individual county level. Using data from cancer.gov and the US Census American Community Survey, it shows how the age-adjusted death rate, average deaths per year, and recent trends vary between counties, along with other key metrics such as average annual counts, whether the objective of 45.5 was met, and the recent trend in death rates. Linear regression models built on these data can be used to estimate correlations between variables and to better understand cancer prevalence across counties over time, which makes it easier to target health initiatives and resources accurately.
This Kaggle dataset combines county-level data from the US Census American Community Survey and cancer.gov for exploring correlations between county-level cancer rates, trends, and mortality statistics. It contains records from all U.S. counties covering the age-adjusted death rate, average deaths per year, recent trend in death rates, average annual count of cases detected within 5 years, and whether or not an objective of 45.5 was met in the county associated with each row in the table.
To use this dataset to its fullest potential, you need to understand how to:
- perform simple descriptive analytics, including summary statistics such as the mean and median;
- summarize categorical variables using frequency tables;
- create data visualizations such as charts and histograms;
- apply linear regression or other machine learning techniques such as support vector machines (SVMs), random forests or neural networks, and distinguish supervised from unsupervised learning;
- review diagnostic tests to evaluate your models;
- interpret your findings and hypothesize reasons for the patterns discovered during exploration;
- communicate results through effective presentation slides or documents.
Having this understanding will enable you to apply different methods of analysis to this dataset accurately and effectively.
Once these concepts are understood, you are ready to start exploring the dataset. First, import it into your tool of choice, for example Tableau Public or Desktop, QlikView, the SAS analytics suite, or a Python notebook; for predictive modeling in Python, load the packages you need, such as scikit-learn. Second, review the description of the table's column structure provided above. With a basic knowledge of SQL, you can compute statistics with simple queries, select subsets of columns under specific conditions, and sort on particular attributes. In Python, you can parse the portion of the data you need and group or aggregate categories before building any predictive model; where necessary, you can also join additional tables, with the details varying by tool. From there, you can examine the available features in more depth by computing correlation and covariance matrices and by drawing scatter plots and distribution plots for the metrics of interest to reveal trends.
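As a rough illustration of the workflow described above, the following sketch computes summary statistics, a Pearson correlation, and a one-variable least-squares fit using only the Python standard library. The records, the column names `poverty_pct` and `death_rate`, and all values are hypothetical placeholders rather than rows from the dataset; in practice you would load the real table and substitute its actual column names.

```python
from statistics import mean, median

# Hypothetical county records. Column names and values are illustrative only
# and are constructed to lie exactly on a straight line.
counties = [
    {"county": "A", "poverty_pct": 10.0, "death_rate": 150.0},
    {"county": "B", "poverty_pct": 15.0, "death_rate": 165.0},
    {"county": "C", "poverty_pct": 20.0, "death_rate": 180.0},
    {"county": "D", "poverty_pct": 25.0, "death_rate": 195.0},
]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / (sxx * syy) ** 0.5


xs = [c["poverty_pct"] for c in counties]
ys = [c["death_rate"] for c in counties]

print("mean death rate:", mean(ys))
print("median death rate:", median(ys))
print("correlation:", pearson_r(xs, ys))
print("slope, intercept:", fit_line(xs, ys))
```

For larger analyses, a library such as scikit-learn would normally replace the hand-written fit; the manual version is used here only to keep the example self-contained.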
- Building a predictive cancer incidence model based on county-level demographic data to identify high-risk areas and target public health interventions.
- Analyzing correlations between age-adjusted death rate, average annual count, and recent trends in order to develop more effective policy initiatives for cancer prevention and healthcare access.
- Utilizing the dataset to construct a machine learning algorithm that can predict county-level mortality rates based on socio-economic factors such as poverty levels and educational attainment rates
If you use this dataset i...
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Note: This is part 2 of the dataset.
Part 1 can be found at: https://zenodo.org/records/13799069
Part 2 can be found at: https://zenodo.org/records/12784601
Part 3 can be found at: https://zenodo.org/records/14659131
Background: Lung cancer risk classification is an increasingly important area of research as low-dose thoracic CT screening programs have become standard of care for patients at high risk for lung cancer. There is limited availability of large, annotated public databases for the training and testing of algorithms for lung nodule classification.
Methods: Screening chest CT scans done between January 1, 2015 and June 30, 2021 at Duke University Health System were considered for this study. Efficient nodule annotation was performed semi-automatically by using a publicly available deep learning nodule detection algorithm trained on the LUNA16 dataset to identify initial candidates, which were then accepted based on nodule location in the radiology text report or manually annotated by a medical student and a fellowship-trained cardiothoracic radiologist.
Results: The dataset contains 1613 CT volumes with 2487 annotated nodules, selected from a total dataset of 2061 patients, with the remaining data reserved for future testing. Radiologist spot-checking confirmed the semi-automated annotation had an accuracy rate of >90%.
Conclusions: The Duke Lung Cancer Screening Dataset 2024 is the first large dataset for CT screening for lung cancer reflecting the use of current CT technology. It represents a useful resource for lung cancer risk classification research, and the efficient annotation methods described for its creation may be used to generate similar databases for research in the future.
Dataset part Details:
Part 1: DLCS subsets 1 to 7, metadata, and annotations.
Part 2: DLCS subsets 8 and 9, and CT image info metadata.
Part 3: DLCS subset 10.
Updates and Versions:
Code Repository:
To support reproducible open-access research and benchmarking, we have shared several pre-trained models and baseline results in GitLab and GitHub repositories.
GitLab: https://gitlab.oit.duke.edu/cvit-public/ai_lung_health_benchmarking
GitHub: https://github.com/fitushar/AI-in-Lung-Health-Benchmarking-Detection-and-Diagnostic-Models-Across-Multiple-CT-Scan-Datasets
Funding:
This work was supported by the Duke Department of Radiology Charles E. Putman Vision Award, NIH/NIBIB P41-EB028744, and NIH/NCI R01-CA261457.
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
MRI-based artificial intelligence (AI) research on patients with brain gliomas has been rapidly increasing in popularity in recent years in part due to a growing number of publicly available MRI datasets. Notable examples include The Cancer Genome Atlas Glioblastoma dataset (TCGA-GBM) consisting of 262 subjects and the International Brain Tumor Segmentation (BraTS) challenge dataset consisting of 542 subjects (including 243 preoperative cases from TCGA-GBM). The public availability of these glioma MRI datasets has fostered the growth of numerous emerging AI techniques including automated tumor segmentation, radiogenomics, and MRI-based survival prediction. Despite these advances, existing publicly available glioma MRI datasets have been largely limited to only 4 MRI contrasts (T2, T2/FLAIR, and T1 pre- and post-contrast) and imaging protocols vary significantly in terms of magnetic field strength and acquisition parameters. Here we present the University of California San Francisco Preoperative Diffuse Glioma MRI (UCSF-PDGM) dataset. The UCSF-PDGM dataset includes 501 subjects with histopathologically-proven diffuse gliomas who were imaged with a standardized 3 Tesla preoperative brain tumor MRI protocol featuring predominantly 3D imaging, as well as advanced diffusion and perfusion imaging techniques. The dataset also includes isocitrate dehydrogenase (IDH) mutation status for all cases and O[6]-methylguanine-DNA methyltransferase (MGMT) promoter methylation status for World Health Organization (WHO) grade III and IV gliomas. The UCSF-PDGM has been made publicly available in the hopes that researchers around the world will use these data to continue to push the boundaries of AI applications for diffuse gliomas.
Data collection was performed in accordance with relevant guidelines and regulations and was approved by the University of California San Francisco institutional review board with a waiver for consent. The dataset population consisted of 501* adult patients with histopathologically confirmed grade II-IV diffuse gliomas who underwent preoperative MRI, initial tumor resection, and tumor genetic testing at a single medical center between 2015 and 2021. Patients with any prior history of brain tumor treatment were excluded; however, history of tumor biopsy was not considered an exclusion criterion.
All subjects’ tumors were tested for IDH mutations by genetic sequencing of tissue acquired during biopsy or resection. All grade III and IV tumors were tested for MGMT methylation status using a methylation sensitive quantitative PCR assay.
The 501* cases included in the UCSF-PDGM include 55 (11%) grade II, 42 (9%) grade III, and 403 (80%) grade IV tumors. There was a male predominance for all tumor grades (56%, 60%, and 60%, respectively for grades II-IV). IDH mutations were identified in a majority of grade II (83%) and grade III (67%) tumors and a small minority of grade IV tumors (8%). MGMT promoter hypermethylation was detected in 63% of grade IV gliomas and was not tested for in a majority of lower grade gliomas. 1p/19q codeletion was detected in 20% of grade II tumors and a small minority of grade III (5%) and IV (<1%) tumors. Tabulated details and glossary are available in the Data Access and Detailed Description tabs below.
All preoperative MRI was performed on a 3.0 tesla scanner (Discovery 750, GE Healthcare, Waukesha, Wisconsin, USA) and a dedicated 8-channel head coil (Invivo, Gainesville, Florida, USA). The imaging protocol included 3D T2-weighted, T2/FLAIR-weighted, susceptibility-weighted (SWI), diffusion-weighted (DWI), pre- and post-contrast T1-weighted images, 3D arterial spin labeling (ASL) perfusion images, and 2D 55-direction high angular resolution diffusion imaging (HARDI). Over the study period, two gadolinium-based contrast agents were used: gadobutrol (Gadovist, Bayer, LOC) at a dose of 0.1 mL/kg and gadoterate (Dotarem, Guerbet, Aulnay-sous-Bois, France) at a dose of 0.2 mL/kg.
HARDI data were eddy current corrected and processed using the Eddy and DTIFIT modules from FSL 6.0.2 yielding isotropic diffusion weighted images (DWI) and several quantitative diffusivity maps: mean diffusivity (MD), axial diffusivity (AD), radial diffusivity (RD), and fractional anisotropy (FA). Eddy correction was performed with outlier replacement on and topup correction off. DTIFIT was performed with simple least squares regression. Each image contrast was registered and resampled to the 3D space defined by the T2/FLAIR image (1 mm isotropic resolution) using automated non-linear registration (Advanced Normalization Tools). Resampled co-registered data were then skull stripped using a previously described and publicly available deep-learning algorithm: https://www.github.com/ecalabr/brain_mask/.
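The diffusivity maps named above are conventionally derived, voxel by voxel, from the three eigenvalues of the fitted diffusion tensor. The sketch below applies the standard textbook definitions of MD, AD, RD and FA to a single voxel; it is an illustration of those definitions, not a reproduction of the FSL DTIFIT implementation or of the authors' pipeline, and the function name `dti_scalars` and the example eigenvalues are hypothetical.

```python
import math


def dti_scalars(l1, l2, l3):
    """Return MD, AD, RD and FA for one voxel from tensor eigenvalues.

    Eigenvalues are sorted so that l1 is the largest (principal) one.
    """
    l1, l2, l3 = sorted((l1, l2, l3), reverse=True)
    md = (l1 + l2 + l3) / 3.0   # mean diffusivity
    ad = l1                     # axial diffusivity (along the principal axis)
    rd = (l2 + l3) / 2.0        # radial diffusivity (perpendicular to it)
    norm = math.sqrt(l1 ** 2 + l2 ** 2 + l3 ** 2)
    if norm == 0.0:
        fa = 0.0                # no diffusion signal: define FA as zero
    else:
        dev = (l1 - md) ** 2 + (l2 - md) ** 2 + (l3 - md) ** 2
        fa = math.sqrt(1.5 * dev) / norm  # fractional anisotropy in [0, 1]
    return {"MD": md, "AD": ad, "RD": rd, "FA": fa}


# Illustrative eigenvalues in mm^2/s (hypothetical values, not from the dataset).
print(dti_scalars(1.7e-3, 0.3e-3, 0.2e-3))
```

FA is 0 for perfectly isotropic diffusion (all eigenvalues equal) and approaches 1 when diffusion is confined to a single direction.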
Multicompartment tumor segmentation of study data was undertaken as part of the 2021 BraTS challenge. Briefly, image data first underwent automated segmentation using an ensemble model consisting of prior BraTS challenge winning segmentation algorithms. Images were then manually corrected by trained radiologists and approved by 2 expert reviewers. Segmentation included three major tumor compartments: enhancing tumor, non-enhancing/necrotic tumor, and surrounding FLAIR abnormality (sometimes referred to as edema).
The UCSF-PDGM adds to an existing body of publicly available diffuse glioma MRI datasets that are commonly used in AI research applications. As MRI-based AI research applications continue to grow, new data are needed to foster development of new techniques and increase the generalizability of existing algorithms. The UCSF-PDGM not only significantly increases the total number of publicly available diffuse glioma MRI cases, but also provides a unique contribution in terms of MRI technique. The inclusion of 3D sequences and advanced MRI techniques like ASL and HARDI provides a new opportunity for researchers to explore the potential utility of cutting-edge clinical diagnostics for AI applications. In addition, these advanced imaging techniques may prove useful for radiogenomic studies focused on identification of IDH mutations or MGMT promoter methylation.
The UCSF-PDGM dataset, particularly when combined with existing publicly available datasets, has the potential to fuel the next phase of radiologic AI research on diffuse gliomas. However, the UCSF-PDGM dataset’s potential will only be realized if the radiology AI research community takes advantage of this new data resource. We hope that this dataset sparks inspiration in the next generation of AI researchers, and we look forward to the new techniques and discoveries that the UCSF-PDGM will generate.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background
Clustering of gene expression data is widely used to identify novel subtypes of cancer. Plenty of clustering approaches have been proposed, but there is a lack of knowledge regarding their relative merits and how data characteristics influence the performance. We evaluate how cluster analysis choices affect the performance by studying four publicly available human cancer data sets: breast, brain, kidney and stomach cancer. In particular, we focus on how the sample size, distribution of subtypes and sample heterogeneity affect the performance.
Results
In general, increasing the sample size had limited effect on the clustering performance, e.g. for the breast cancer data similar performance was obtained for n = 40 as for n = 330. The relative distribution of the subtypes had a noticeable effect on the ability to identify the disease subtypes, and data with disproportionate cluster sizes turned out to be difficult to cluster. Both the choice of clustering method and selection method affected the ability to identify the subtypes, but the relative performance varied between data sets, making it difficult to rank the approaches. For some data sets, the performance was substantially higher when the clustering was based on data from only one sex compared to data from a mixed population. This suggests that homogeneous data are easier to cluster than heterogeneous data and that clustering males and females individually may be beneficial and increase the chance to detect novel subtypes. It was also observed that the performance often differed substantially between females and males.
Conclusions
The number of samples seems to have a limited effect on the performance, while the heterogeneity, at least with respect to sex, is important for the performance. Hence, by analyzing the sexes separately, the possible loss caused by having fewer samples could be outweighed by the benefit of more homogeneous data.
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
HANCOCK is a comprehensive, monocentric dataset of 763 head and neck cancer patients, including diverse data modalities. It contains histopathology imaging (whole-slide images of H&E-stained primary tumors and tissue microarrays with immunohistochemical staining) alongside structured clinical data (demographics, tumor pathology characteristics, laboratory blood measurements) and textual data (de-identified surgery reports and medical histories). All patients were treated curatively, and data span diagnoses from 2005–2019. This multimodal collection enables research into integrative analyses – for example, combining histologic features with clinical parameters for outcome prediction. Early analyses have demonstrated that fusing these modalities improves prognostic modeling compared to single-source data, and that leveraging histology with foundation models can enhance endpoint prediction. HANCOCK aims to facilitate precision oncology studies by providing a large public resource for developing and benchmarking multimodal machine learning methods in head and neck cancer.
Head and neck cancer (HNC) is a prevalent malignancy with poor outcomes – it is the 7th most common cancer globally and carries a 5-year survival of only ~25–60% despite modern treatments. Improving patient prognosis may require personalized, multimodal therapy decisions, using information from pathology, clinical, and other data sources. However, progress in multimodal prediction has been limited by the lack of large public datasets that integrate these diverse data types. To our knowledge, existing HNC datasets are either small or incomplete; for example, a radiomics study included 288 oropharyngeal cases, and a proteomics-focused set with imaging had only 122 cases. The Cancer Genome Atlas (TCGA) provides multi-omics for >500 HNC cases, but lacks crucial data like pathology reports, blood tests, or comprehensive imaging for each patient. These limitations hinder robust multimodal research.
HANCOCK was created to address this gap. It aggregates 763 patients’ data from a single academic center, capturing a real-world, uniformly treated cohort. The dataset uniquely combines whole slide histopathology images, tissue microarray images, detailed clinical parameters, pathology reports, and lab values in one resource. By curating and harmonizing these modalities, HANCOCK enables researchers to explore complex data interdependencies and develop multimodal predictive models. The patient population reflects typical HNC demographics – 80% male, median age 61, with 72% being former or current smokers – aligning with expected epidemiology and supporting generalizability. In summary, HANCOCK is an unprecedented multimodal HNC dataset that can fuel research in machine learning, prognostic biomarker discovery, and integrative oncology, ultimately advancing personalized head and neck cancer care.
The following sections describe how the HANCOCK data were collected, processed, and prepared for public sharing.
Patients included in HANCOCK were those diagnosed with head and neck cancer between 2005 and 2019 at University Hospital Erlangen (Germany) who underwent a curative-intent initial treatment (surgery and/or definitive therapy). This encompasses cancers of the oral cavity, oropharynx, hypopharynx, and larynx. Patients treated palliatively or with recurrent/metastatic disease at presentation were excluded to focus on first-course, curative treatments. The cohort consists of 763 patients (approximately 80% male, 20% female) with a median age of 61 years. Notably, ~72% have a history of tobacco use, which is consistent with real-world HNC risk factors. The distribution of tumor subsites and stages reflects typical HNC presentation, and thus the dataset is broadly representative of the general HNC patient population. Being a single-center dataset, there is limited geographic diversity; however, the homogeneous data acquisition and treatment context reduce variability in data quality. No significant selection biases were introduced aside from the exclusion of non-curative cases – all major HNC subsite cases over the inclusion period were captured, providing a comprehensive real-world sample. Ethical approval was obtained for this retrospective data collection and sharing (Ethics Committee vote #23-22-Br), and all data were fully de-identified prior to release.
Histopathology: Tissue specimens from the primary tumors (and involved lymph nodes, if present) were obtained from the pathology archives. All samples were formalin-fixed and paraffin-embedded (FFPE) and stained with hematoxylin and eosin (H&E) following routine protocols. Digital whole-slide imaging was performed on these histology slides. A total of 709 H&E slides of primary tumor tissue (from 701 patients, 8 of whom had two slides) were scanned at high resolution using a 3DHISTECH P1000 scanner at an effective 82.44× magnification (0.1213 µm/pixel). Additionally, 396 H&E slides of lymph node metastases were scanned, using two systems: an Aperio Leica GT450 at 40× (0.2634 µm/pixel) and the 3DHISTECH P1000 at ~51× (0.1945 µm/pixel). (Multiple scanners were utilized over the course of the project; all resulting images were cross-verified for quality.) The digital whole slide images (WSIs) are provided in the pyramidal Aperio SVS format, a TIFF-based format compatible with standard viewers.
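The pixel sizes and magnifications quoted above can be related by a simple rule of thumb. The sketch below assumes that effective magnification is roughly 10 divided by the pixel size in µm; this constant is not stated in the source and is inferred only because it reproduces the two P1000 figures (0.1213 µm/pixel gives about 82.44×, 0.1945 µm/pixel gives about 51×). The GT450 value of 40× appears to be a nominal objective setting and does not follow this rule exactly. The function names are hypothetical.

```python
# Assumed constant (not from the source): magnification * pixel size ~= 10 um.
REFERENCE_UM = 10.0


def effective_magnification(mpp):
    """Approximate effective magnification from microns per pixel (mpp)."""
    if mpp <= 0:
        raise ValueError("microns per pixel must be positive")
    return REFERENCE_UM / mpp


def pixels_to_mm(n_pixels, mpp):
    """Physical length in millimetres covered by n_pixels at the given mpp."""
    return n_pixels * mpp / 1000.0


for mpp in (0.1213, 0.1945, 0.2634):
    print(mpp, "um/pixel ->", round(effective_magnification(mpp), 2), "x")
```

When comparing slides from different scanners, working in µm/pixel rather than nominal magnification avoids this kind of ambiguity.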
In addition to full slides, tissue microarrays (TMAs) were constructed from each patient’s tumor block to sample important regions. For each case, two cylindrical core biopsies (diameter 1.5 mm) were taken – one from the tumor center and one from the invasive tumor front. These cores were assembled into TMA blocks and stained on separate slides with a panel of eight stains: H&E plus immunohistochemical (IHC) markers targeting various immune cells and tumor biomarkers. The IHC markers include CD3, CD8, CD56, CD68, CD163, PD-L1, and MHC-1, which label T cells (CD3, CD8), natural killer cells (CD56), monocytes/macrophages (CD68, CD163), and a tumor immune checkpoint ligand (PD-L1), as well as MHC class I expression. Each core appears on up to 8 stained TMA slides (one per stain), yielding up to 16 TMA images per patient (two cores × eight stains). In the dataset, TMA images are provided for both the tumor-center and tumor-front cores; these too are digitized high-resolution images (consistent microscope settings, ~40×). The combination of WSIs and TMAs yields a rich imaging dataset: 701 patients have at least one primary tumor WSI (62 patients lack WSIs due to unavailable tissue), and all patients have TMA core images unless the tumor block was exhausted. This imaging data offers both broad tissue context from WSIs and targeted cellular detail from TMAs. Manual tumor region annotations are also included for the primary tumor WSIs (see Data Analysis below).
Clinical and Pathology Data: A wide array of non-imaging data was extracted from hospital information systems and pathology reports for each patient. Key demographic variables (age, sex, etc.) and tumor pathology details were collected, including primary tumor site, histologic subtype, grade, TNM stage, resection margin status, depth of invasion, perineural and lymphovascular invasion, and nodal metastasis status. These pathology parameters were recorded in a structured format for each case. Standard clinical coding systems were used where applicable: e.g., diagnoses are coded with ICD-10 codes and procedures with OPS codes (the German procedure classification system). The dataset includes these codes for each patient’s conditions and treatments. Comprehensive laboratory blood test results at diagnosis or pre-treatment were also compiled, covering complete blood counts, coagulation measures, electrolytes, kidney function, C-reactive protein, and other relevant analytes. Reference ranges for each lab parameter are provided alongside the values to indicate whether a result was normal or abnormal. Most patients have a full panel of these lab results, though some values are missing if a test was not clinically indicated; the dataset notes availability per patient. All structured data have been cleaned and validated – for example, harmonizing category values and checking consistency (e.g. TNM stages align with recorded tumor sites).
Textual Data (Surgical Reports and Histories): Unstructured clinical text was also included to add rich context on treatment details. Surgery reports (operative notes) from the primary tumor resection and associated medical history summaries were retrieved from the hospital’s electronic records. For each patient, the operative report from their first definitive surgery and the corresponding
These data contain the results of GC-MS, LC-MS and immunochemistry analyses of mask sample extracts. The data include tentatively identified compounds through library searches and compound abundance. This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. It can be accessed through the following means: The data cannot be accessed. Format: The dataset contains the identification of compounds found in the mask samples as well as the abundance of those compounds for individuals who participated in the trial. This dataset is associated with the following publication: Pleil, J., M. Wallace, J. McCord, M. Madden, J. Sobus, and G. Ferguson. How do cancer-sniffing dogs sort biological samples? Exploring case-control samples with non-targeted LC-Orbitrap, GC-MS, and immunochemistry methods. Journal of Breath Research. Institute of Physics Publishing, Bristol, UK, 14(1): 016006, (2019).
Population-based cancer incidence rates were abstracted from the National Cancer Institute State Cancer Profiles for all counties in the United States for which data were available. This is a national county-level database of cancer data that are collected by state public health surveillance systems. All-site cancer is defined as any type of cancer that is captured in the state registry data, though non-melanoma skin cancer is not included. All-site age-adjusted cancer incidence rates were abstracted separately for males and females. County-level annual age-adjusted all-site cancer incidence rates for years 2006–2010 were available for 2687 of 3142 (85.5%) counties in the U.S. Counties for which there are fewer than 16 reported cases in a specific area-sex-race category are suppressed to ensure confidentiality and stability of rate estimates; this accounted for 14 counties in our study. Two states, Kansas and Virginia, do not provide data because of state legislation and regulations which prohibit the release of county level data to outside entities. Data from Michigan do not include cases diagnosed in other states because data exchange agreements prohibit the release of data to third parties. Finally, state data are not available for three states: Minnesota, Ohio, and Washington. The age-adjusted average annual incidence rate for all counties was 453.7 per 100,000 persons. We selected 2006–2010 as it is subsequent in time to the EQI exposure data, which were constructed to represent the years 2000–2005. We also gathered data for the three leading causes of cancer for males (lung, prostate, and colorectal) and females (lung, breast, and colorectal). The EQI was used as an exposure metric and as an indicator of cumulative environmental exposures at the county level representing the period 2000 to 2005. A complete description of the datasets used in the EQI is provided in Lobdell et al., and the methods used for index construction are described by Messer et al.
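The suppression rule and coverage figure described above can be sketched in a few lines. The threshold of 16 cases and the counts 2687 and 3142 come from the text; the function names and the example case counts are hypothetical, and the sketch is only an illustration of the rule, not the registry's actual procedure.

```python
SUPPRESSION_THRESHOLD = 16  # from the text: fewer than 16 reported cases are suppressed


def suppress_small_counts(case_counts, threshold=SUPPRESSION_THRESHOLD):
    """Replace counts below the threshold with None to mark them as suppressed."""
    return {
        county: (n if n >= threshold else None)
        for county, n in case_counts.items()
    }


def coverage_pct(n_available, n_total):
    """Percentage of counties with available rates, rounded to one decimal place."""
    return round(100.0 * n_available / n_total, 1)


# Hypothetical case counts for three counties (illustrative values only).
example = {"County X": 120, "County Y": 15, "County Z": 16}
print(suppress_small_counts(example))

# Coverage reported in the text: 2687 of 3142 counties.
print(coverage_pct(2687, 3142))
```

Marking suppressed values as `None` rather than zero keeps them from being mistaken for true zero counts in later analysis.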
The EQI was developed for the period 2000–2005 because it was the time period for which the most recent data were available when index construction was initiated. The EQI includes variables representing each of the environmental domains. The air domain includes 87 variables representing criteria and hazardous air pollutants. The water domain includes 80 variables representing overall water quality, general water contamination, recreational water quality, drinking water quality, atmospheric deposition, drought, and chemical contamination. The land domain includes 26 variables representing agriculture, pesticides, contaminants, facilities, and radon. The built domain includes 14 variables representing roads, highway/road safety, public transit behavior, business environment, and subsidized housing environment. The sociodemographic environment includes 12 variables representing socioeconomics and crime. This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. It can be accessed through the following means: Human health data are not available publicly. EQI data are available at: https://edg.epa.gov/data/Public/ORD/NHEERL/EQI. Format: Data are stored as csv files. This dataset is associated with the following publication: Jagai, J., L. Messer, K. Rappazzo, C. Gray, S. Grabich, and D. Lobdell. County-level environmental quality and associations with cancer incidence. Cancer. John Wiley & Sons Incorporated, New York, NY, USA, 123(15): 2901-2908, (2017).
The Sister Study is a prospective cohort of 50,884 U.S. women aged 35 to 74 years, conducted by the NIEHS. Eligible participants were women without a history of breast cancer but with at least one sister diagnosed with breast cancer at the time of enrollment (2003–2009). Datasets used in this research effort include health outcomes, lifestyle factors, socioeconomic factors, medication history, and built and natural environment factors. This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual-level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. It can be accessed through the following means: Contact NIEHS Sister Study (https://sisterstudy.niehs.nih.gov/English/index1.htm) for data access. Format: Datasets are provided in SAS and/or CSV format.
Resistance to endocrine therapy in estrogen receptor-positive (ER+) breast cancer is a major clinical problem with poorly understood mechanisms. In this study, the authors evaluated the mechanism by which minichromosome maintenance protein 3 (MCM3) influences endocrine resistance and its predictive/prognostic potential in ER+ breast cancer. Data access: The mass spectrometry proteomics data generated during the study are publicly available in the PRIDE repository under the accession number https://identifiers.org/pride.project:PXD001087. Data on the effect of MCM3 knockdown on gene expression in TamR cell lines are publicly available in Gene Expression Omnibus under the accession number https://identifiers.org/geo:GSE148878. The three microarray gene datasets analysed during the study are publicly available in Gene Expression Omnibus under the following accession numbers: https://identifiers.org/geo:GSE20361, https://identifiers.org/geo:GSE38829 and https://identifiers.org/geo:GSE50820. Microarray data from 2555 breast cancer patients (cohort 3), analysed during the study, were obtained from the kmplot.com database (www.kmplot.com). The survival analysis datasets supporting figures 2 and 3 are not publicly available to protect patient privacy, but will be made available to authorized researchers who have an approved Institutional Review Board application and have obtained approval from The Regional Committees on Health Research Ethics for Southern Denmark. Please contact the corresponding author with data access requests. All other datasets generated during the study (including immunohistochemistry data and phospho-specific cell cycle antibody microarray analysis data) will be made available upon reasonable request from the corresponding author, Dr. Henrik Ditzel, email address: hditzel@health.sdu.dk. Supplementary tables 1 and 4 are available in the figshare repository, as part of this data record.
Uncropped western blots are part of the supplementary files. Study approval: All tissue samples were collected in compliance with informed consent policy. The study was approved by the local ethical committee at Karolinska Institute, Stockholm, Sweden. Study aims and methodology: Approximately 80% of breast cancers express the estrogen receptor (ER+), rendering them suitable for adjuvant anti-estrogen treatment. Although tamoxifen is of great benefit for many ER+ breast cancer patients, recurrence occurs in approximately 30% over 15 years of follow-up. Tamoxifen is a selective estrogen receptor modulator (SERM) with both antagonistic and agonistic tissue-dependent effects. In vitro, tamoxifen acts as a partial estrogen antagonist, by antagonizing the estrogen regulation of the transcription of most ER-regulated genes and inhibiting growth of estrogen receptor-dependent breast cancer cells. The minichromosome maintenance 3 protein (MCM3) belongs to a family of 6 highly conserved minichromosome maintenance proteins (MCM2-MCM7) that are essential to ensure eukaryotic DNA is replicated only once per cell cycle, and additionally acts as a helicase to drive replication elongation.
In this study, the authors used a quantitative proteomic approach combined with systems biology analyses, to gain insight into the biology of endocrine resistance in breast cancer, and to identify potential predictive or prognostic markers. MCM3 levels in primary tumors from four independent cohorts of breast cancer patients receiving adjuvant tamoxifen mono-therapy or no adjuvant treatment, including the Stockholm tamoxifen (STO-3) trial, were evaluated.
Cohort 1: ER+ primary breast cancer tissues from 79 patients collected from Herlev and Roskilde Hospitals, Denmark.
Cohort 2: Retrospective cohort of 589 patients from the Danish Breast Cancer Co-operative Group (DBCG) 89C randomized study.
Cohort 3: 2555 breast cancer patients of which 2051 were ER+ (1802 were endocrine treated and 503 did not receive any systemic treatment). All data for cohort 3 were obtained from www.kmplot.com database.
Cohort 4: Stockholm Breast Cancer Study Group randomised tamoxifen STO-3 trial 1976-1990. A cohort of 1,780 postmenopausal women with breast cancer was randomised to adjuvant tamoxifen for 2 or 5 years (n = 886), or no adjuvant endocrine therapy (n = 894).
The human breast cancer cell line MCF-7 was originally received from The Breast Cancer Task Force Cell Culture Bank, Mason Research Institute.
Cell lines including the tamoxifen-sensitive subline MCF-7/S0.5, the tamoxifen-resistant cell lines MCF-7/TAMR-1 (TamR-1), MCF-7/TAMR-4 (TamR-4) and MCF-7/TAMR-7 (TamR-7), and the fulvestrant-resistant cell line FulvR-1 were derived from the MCF-7 cells. The following techniques and assays are described in more detail in the published article and its supplementary methods: mass spectrometry-based proteomic analysis, descriptions of the characteristics of patient cohorts, xenograft tumor models, immunohistochemical staining, targeted gene knockdown using siRNA, and statistical analyses. Datasets supporting the findings reported in the article: All the datasets supporting the findings of this study are listed in the data file Løkkegaard, S. et al.xlsx. Supplementary table 1 (List of proteins with increased expression in TAMR-1 vs. MCF-7S0.5 cells.xlsx) lists the 275 proteins showing increased expression and the 264 showing reduced expression in TAMR-1 vs. MCF-7/S0.5 cells, defined as ≥1.5-fold differential expression. Supplementary table 4 (MCM3 gene knockdown on gene expression profile of tamoxifen-resistant breast cancer cell lines.xlsx) includes data on genes exhibiting altered expression in TamR cells following MCM3 knockdown (transfected with MCM3-specific siRNAs) versus cells transfected with siControl (FDR < 0.05 and ≥ 1.5-fold altered expression). Software needed to access data: Files in .dta format can be accessed using the Stata statistical software.
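The selection thresholds quoted above (≥ 1.5-fold altered expression, FDR < 0.05) can be expressed as a small Python filter. This is a minimal sketch, not the authors' pipeline: it assumes fold change is given as a positive treated/control ratio (the supplementary tables may use a different encoding, such as log2 values), and the function name is hypothetical.

```python
def is_differentially_expressed(fold_change, fdr=None, min_fold=1.5, max_fdr=0.05):
    """Apply the fold-change and FDR thresholds to one gene or protein.

    fold_change: treated/control ratio, must be > 0 (assumed encoding).
    fdr: adjusted p-value; pass None to apply the fold-change rule only.
    """
    if fold_change <= 0:
        raise ValueError("fold_change must be a positive ratio")
    # Treat up- and down-regulation symmetrically: 0.5 counts as 2-fold down.
    magnitude = max(fold_change, 1.0 / fold_change)
    if magnitude < min_fold:
        return False
    if fdr is not None and fdr >= max_fdr:
        return False
    return True


if __name__ == "__main__":
    print(is_differentially_expressed(2.0, fdr=0.01))
    print(is_differentially_expressed(1.2, fdr=0.01))
```

Passing `fdr=None` mirrors the rule stated for Supplementary table 1, where only the fold-change threshold is mentioned.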
Triple-negative breast cancer (TNBC) is a heterogeneous disease that lacks both effective patient stratification strategies and therapeutic targets. Whilst elevated levels of the MET receptor tyrosine kinase are associated with TNBCs and predict poor clinical outcome, the functional role of MET in TNBC is still poorly understood. In this study, the authors utilized an established Met-dependent transgenic mouse model of TNBC, human cell lines, and patient-derived xenografts to investigate the role of MET in TNBC tumourigenesis. Data access: Processed RNA sequencing datasets generated during the study are available in Gene Expression Omnibus: https://identifiers.org/geo:GSE162272. The raw RNA sequencing data are available in Sequence Read Archive: https://identifiers.org/ncbi/insdc.sra:SRP294504. All other datasets generated and analysed during the study (including tumoursphere formation assays, tumoursphere proliferation assays, immunohistochemistry data, quantitative RT-PCR, in vivo inhibitor treatments (including tumour volume calculations), flow cytometry data and immunofluorescence data) are publicly available in the figshare repository as part of this data record. The publicly available TCGA data analysed during the study are available in cBioPortal for Cancer Genomics: https://identifiers.org/cbioportal:brca_tcga_pub. Microarray data from the MMTV-Metmt;Trp53fl/+;Cre tumours analysed during the study are available in Gene Expression Omnibus: https://identifiers.org/geo:GSE41601. RNA sequencing data from breast cancer pairs of primary tumors and PDXs, analysed during the study, are also available in Gene Expression Omnibus: https://identifiers.org/geo:GSE142767. Uncropped Western blots are part of the supplementary files that accompany the article.
Study approval and patient consent: All human participants provided informed consent for this study and tissue was collected at McGill University Health Center in accordance with the protocols approved by the research ethics board (SUR-99-780). All animal studies linked to this protocol were approved by the McGill University Animal Care Committee (2014-7514). The Biobank protocol (05-006) and the protocol to generate PDX from biobank tissues (14-168) were both approved by the Jewish General Hospital ethics committee. Study aims and methodology: In the present study, the authors assayed tumour-initiating cell (TIC) properties to directly investigate the role of Met in tumour initiation and identify FGFR1 signaling as a key convergent pathway with Met for the maintenance of TICs. Primary mouse cell lines were established by dissociation of MMTV-Metmt, Trp53fl/+;Cre, and MMTV-Metmt;Trp53fl/+;Cre mammary tumours as previously described. Additionally, the following cell lines were used during the study: BT-20, HCC70, HCC1937, HCC1954, HCC1395, MDA-MB-468, MDA-MB-436, MDA-MB-157, MDA-MB-231, BT-549, and Hs578T. The following are described in more detail in the published article: cell culture, patient-derived xenografts, antibodies and reagents, lentiviral infection, tumoursphere formation assays, tumoursphere proliferation assays, Western blot analysis, quantitative RT-PCR, immunohistochemistry, RNA sequencing, in vivo limiting dilution assay, in vivo inhibitor treatments, flow cytometry, immunofluorescence, tumour dissociation, analysis of gene expression data, and statistical analysis. Data supporting the figures, supplementary figures and supplementary tables in the article: This data record consists of a total of 38 data files in the following file formats: .xlsx, .pdf, .csv, .txt, .png and .tiff. A list of all the datasets generated during the study is included in the file Sung, V. et al.xlsx.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
[Dataset provided by]: Eric Bair, Robert Tibshirani
An important goal of DNA microarray research is to develop tools to diagnose cancer more accurately based on the genetic profile of a tumor. There are several existing techniques in the literature for performing this type of diagnosis. Unfortunately, most of these techniques assume that different subtypes of cancer are already known to exist. Their utility is limited when such subtypes have not been previously identified. Although methods for identifying such subtypes exist, these methods do not work well for all datasets. It would be desirable to develop a procedure to find such subtypes that is applicable in a wide variety of circumstances. Even if no information is known about possible subtypes of a certain form of cancer, clinical information about the patients, such as their survival time, is often available. In this study, we develop some procedures that utilize both the gene expression data and the clinical data to identify subtypes of cancer and use this knowledge to diagnose future patients. These procedures were successfully applied to several publicly available datasets. We present diagnostic procedures that accurately predict the survival of future patients based on the gene expression profile and survival times of previous patients. This has the potential to be a powerful tool for diagnosing and treating cancer.
History: 2013-01-20 - Posted date; 2016-10-28 - First online date
https://creativecommons.org/publicdomain/zero/1.0/
List of active studies submitted by Roswell Park Cancer Institute (RPCI) to National Cancer Institute (NCI) annually as part of the Cancer Center Report Grant reporting. It includes the primary site, protocol, principal investigator, date opened, phase and study name.
This is a dataset hosted by the State of New York. The state has an open data platform and updates its information according to the amount of data that is brought in. Explore New York State using Kaggle and all of the data sources available through the State of New York organization page!
This dataset is maintained using Socrata's API and Kaggle's API. Socrata has assisted countless organizations with hosting their open data and has been an integral part of the process of bringing more data to the public.
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
| Characteristic | Value (N = 26254) |
|---|---|
| Age (years) | Mean ± SD: 61.4 ± 5; Median (IQR): 60 (57-65); Range: 43-75 |
| Sex | Male: 15512 (59%); Female: 10742 (41%) |
| Race | White: 23969 (91.3%) |
| Ethnicity | Not Available |
Background: The aggressive and heterogeneous nature of lung cancer has thwarted efforts to reduce mortality from this cancer through the use of screening. The advent of low-dose helical computed tomography (CT) altered the landscape of lung-cancer screening, with studies indicating that low-dose CT detects many tumors at early stages. The National Lung Screening Trial (NLST) was conducted to determine whether screening with low-dose CT could reduce mortality from lung cancer.
Methods: From August 2002 through April 2004, we enrolled 53,454 persons at high risk for lung cancer at 33 U.S. medical centers. Participants were randomly assigned to undergo three annual screenings with either low-dose CT (26,722 participants) or single-view posteroanterior chest radiography (26,732). Data were collected on cases of lung cancer and deaths from lung cancer that occurred through December 31, 2009. This dataset includes the low-dose CT scans from 26,254 of these subjects, as well as digitized histopathology images from 451 subjects.
Results: The rate of adherence to screening was more than 90%. The rate of positive screening tests was 24.2% with low-dose CT and 6.9% with radiography over all three rounds. A total of 96.4% of the positive screening results in the low-dose CT group and 94.5% in the radiography group were false positive results. The incidence of lung cancer was 645 cases per 100,000 person-years (1060 cancers) in the low-dose CT group, as compared with 572 cases per 100,000 person-years (941 cancers) in the radiography group (rate ratio, 1.13; 95% confidence interval [CI], 1.03 to 1.23). There were 247 deaths from lung cancer per 100,000 person-years in the low-dose CT group and 309 deaths per 100,000 person-years in the radiography group, representing a relative reduction in mortality from lung cancer with low-dose CT screening of 20.0% (95% CI, 6.8 to 26.7; P=0.004). The rate of death from any cause was reduced in the low-dose CT group, as compared with the radiography group, by 6.7% (95% CI, 1.2 to 13.6; P=0.02).
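As an arithmetic check on the Results, the rate ratio and the relative reduction in lung-cancer mortality can be recomputed from the per-100,000 person-year rates quoted above. This is a minimal sketch with hypothetical function names; because the published rates are rounded, the recomputed reduction comes out close to, but not exactly, the reported 20.0%.

```python
def rate_ratio(rate_a: float, rate_b: float) -> float:
    """Ratio of two incidence rates expressed in the same units."""
    return rate_a / rate_b


def relative_reduction(rate_intervention: float, rate_comparison: float) -> float:
    """Fractional reduction of the intervention rate relative to the comparison rate."""
    return 1.0 - rate_intervention / rate_comparison


if __name__ == "__main__":
    # Lung cancer incidence: 645 (low-dose CT) vs 572 (radiography) per 100,000 person-years.
    print(round(rate_ratio(645, 572), 2))
    # Lung cancer deaths: 247 (low-dose CT) vs 309 (radiography) per 100,000 person-years.
    print(round(100 * relative_reduction(247, 309), 1))
```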
Conclusions: Screening with the use of low-dose CT reduces mortality from lung cancer. (Funded by the National Cancer Institute; National Lung Screening Trial ClinicalTrials.gov number, NCT00047385).
Data Availability: A summary of the National Lung Screening Trial and its available datasets is provided on the Cancer Data Access System (CDAS). CDAS is maintained by Information Management System (IMS), contracted by the National Cancer Institute (NCI) as keepers and statistical analyzers of the NLST trial data. The full clinical data set from NLST is available through CDAS. Users of TCIA can download without restriction a publicly distributable subset of that clinical data, along with the CT and histopathology images collected during the trial. (These were previously restricted.)
Summary: This metadata record provides details of the data supporting the claims of the related manuscript: “The CINSARC signature predicts the clinical outcome in patients with Luminal B breast cancer”. The related study tested the prognostic value for disease-free survival (DFS) of CINSARC, a multigene expression signature originally developed in sarcomas and shown to have prognostic impact in various cancers, in a series of 6035 early-stage invasive primary breast cancers.
Type of data: prognostic value for DFS of CINSARC
Subject of data: Homo sapiens
Sample size: 6035
Population characteristics: All cases were invasive breast carcinomas profiled using DNA microarrays or RNA-sequencing with expression and clinicopathological data available. All samples are pre-treatment samples (operative specimen or diagnostic biopsy before neo-adjuvant chemotherapy). The detailed characteristics of patients and tumours analysed in the present study are available in Supplementary Table 10.
Recruitment: publicly available transcriptomic data of invasive primary breast cancer enrolled in 36 retrospective studies published over a 10-year period between 2002 and 2012.
Data access: All data sets of primary breast cancer were downloaded from the Gene Expression Omnibus (GEO, https://www.ncbi.nlm.nih.gov/geo/), ArrayExpress (https://www.ebi.ac.uk/arrayexpress/), Genomic Data Commons (GDC, https://portal.gdc.cancer.gov/) and cBioPortal (https://www.cbioportal.org/) databases. All accession IDs are provided in Supplementary Table 10 (Table S10 revised.xlsx), which is included with this data record. The data underlying the figures and tables of the related article are contained in the files ‘Goncalves_supporting_data.xlsx’ and ‘Table S8.xlsx’, which are included with this data record. A detailed list of the data underlying each figure and table of the related article is available in the file ‘Goncalves_2021_underlying_data_list.xlsx’, which is included with this data record.
Corresponding author(s) for this study: Pr François BERTUCCI, MD PhD, Département d’Oncologie Médicale, Institut Paoli-Calmettes, 232 Bd. Ste-Marguerite, 13009 Marseille, France; e-mail: bertuccif@ipc.unicancer.fr; Phone: +33 4 91 22 35 37; Fax: +33 4 91 22 36 70
Study approval: The details of Institutional Review Board and Ethical Committee approval and patients’ consent for the 36 studies analysed in the related study are present in their corresponding publications, which are listed in Supplementary Table 10 of the related article.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Background and Purpose: Colorectal cancer is a common fatal malignancy, the fourth most common cancer in men and the third most common cancer in women worldwide. Timely detection of cancer in its early stages is essential for treating the disease. Currently, there is a lack of datasets for histopathological image segmentation of rectal cancer, which often hampers assessment accuracy when computer technology is used to aid diagnosis. Methods: The present study provides a new publicly available Enteroscope Biopsy Histopathological Hematoxylin and Eosin Image Dataset for Image Segmentation Tasks (EBHI-Seg). To demonstrate the validity and extensiveness of EBHI-Seg, experimental results on EBHI-Seg were evaluated using classical machine learning methods and deep learning methods. Results: The experimental results showed that deep learning methods had better image segmentation performance on EBHI-Seg. The maximum Dice score was 0.948 for the classical machine learning methods and 0.965 for the deep learning methods. Conclusion: This publicly available dataset contains 5,170 images of six types of tumor differentiation stages and the corresponding ground truth images. The dataset can support researchers in developing new segmentation algorithms for medical diagnosis of colorectal cancer, which can be used in the clinical setting to help doctors and patients.
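The Dice metric reported above compares a predicted segmentation mask with its ground truth as twice the overlap divided by the total foreground size of both masks. The following Python sketch uses plain nested lists of 0/1 values; the function name is hypothetical, and returning 1.0 when both masks are empty is an assumed convention, not something stated in the dataset description.

```python
def dice_coefficient(pred, truth):
    """Dice score between two equal-sized binary masks (nested lists of 0/1)."""
    intersection = 0
    pred_total = 0
    truth_total = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            pred_total += p
            truth_total += t
            intersection += p * t  # 1 only where both masks mark foreground
    if pred_total + truth_total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / (pred_total + truth_total)


if __name__ == "__main__":
    predicted = [[1, 1], [0, 0]]
    ground_truth = [[1, 0], [0, 0]]
    print(dice_coefficient(predicted, ground_truth))
```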
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
We provide a large, annotated dataset of 597 whole-body PSMA-PET/CT studies from 378 male patients with suspected or diagnosed prostate carcinoma. Scans were acquired at LMU University Hospital, LMU Munich (2014–2022) on three clinical PET/CT scanners, jointly operated by the Departments of Nuclear Medicine and Radiology. All PSMA-avid tumor lesions were manually segmented on the PET images in 3D space using a dedicated software solution. 537 studies contain at least one lesion, while 60 show no lesions. The dataset includes anonymized DICOM files, DICOM segmentation masks, and a TSV file with patient age at imaging, PET/CT manufacturer and model name, PET radionuclide, and use of CT contrast agent. This dataset was used in the autoPET III and IV Grand Challenges to enable the development of machine-learning models for automated lesion segmentation in whole-body PET/CT.
We provide a large, annotated dataset of whole-body PSMA-PET/CT studies from patients with suspected or diagnosed prostate cancer to support developing and benchmarking machine learning (ML) models for automated quantitative PET/CT analysis. Alongside the FDG-PET/CT dataset on TCIA, this dataset addresses the scarcity of publicly available, high-quality annotated PET/CT data. The FDG and PSMA-PET/CT datasets were jointly provided as training data for developing ML models in the autoPET III and autoPET IV Grand Challenges for automated lesion segmentation in whole-body PET/CT.
The institutional review board (Ethics Committee, Medical Faculty, LMU Munich), as well as the institutional data security and privacy review board, approved the publication of anonymized data. This retrospective dataset comprises 597 whole-body PSMA-PET/CT studies from 378 male patients (ages 48–92*) with suspected or diagnosed prostate carcinoma.
*Due to PHI criteria, all ages above 89 years in the metadata and the spreadsheet are set to 90 years, regardless of actual age.
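The age rule in the footnote above amounts to a simple top-coding step. A minimal Python sketch, with a hypothetical function name and not taken from the curators' de-identification pipeline:

```python
def top_code_age(age_years: int, threshold: int = 89, coded_value: int = 90) -> int:
    """Replace any age above the threshold with a single coded value."""
    return coded_value if age_years > threshold else age_years


if __name__ == "__main__":
    # The cohort spans ages 48 to 92; only ages above 89 are altered.
    print([top_code_age(age) for age in (48, 89, 90, 92)])
```

One consequence for analysis is that a recorded age of 90 in the metadata means "90 or older", so statistics such as the mean age are slightly biased downward.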
Scans were conducted at LMU University Hospital, LMU Munich, between 2014 and 2022 using three clinical PET/CT scanners: Siemens Biograph mCT Flow 20, Siemens Biograph 64-4R TruePoint, and GE Discovery 690. 537 studies contain at least one PSMA-avid tumor lesion; 60 studies do not contain any PSMA-avid tumor lesion. The imaging protocol consisted of a diagnostic CT scan, usually from the skull base to the mid-thigh, with the following scan parameters: reference tube current exposure time product of 143 mAs (mean); tube voltage of 120 kV or 100 kV for most cases (range: [80, 140] kV); slice thickness of 2.5 - 5.0 mm (mean: 2.82 mm); and x-y resolution of mainly 0.98 mm. Intravenous contrast enhancement was used in most studies, except for patients with contraindications (26 studies). The whole-body PSMA-PET scan was acquired on average 74 minutes after intravenous injection of 246 MBq 18F-PSMA (mean, 369 studies) or 214 MBq 68Ga-PSMA (mean, 228 studies), respectively. The PET data were reconstructed with attenuation correction derived from corresponding CT data using standard, vendor-provided image reconstruction algorithms with a slice thickness ranging from 3.0 - 5.0 mm (mean: 3.49 mm) and x-y resolution ranging from 2.73 - 4.07 mm (mean: 3.56 mm).
All PSMA-avid tumor lesions, including the primary tumor and/or all metastases, were manually segmented on the PET images by a single reader with 3 years of experience in hybrid imaging using dedicated software (mint Medical, Heidelberg, Germany) and validated by board-certified medical imaging experts with 4 years and >10 years of experience in hybrid imaging. Tumor lesions with significantly increased PSMA expression were segmented in 3D space by drawing circular VOIs, in which voxels with uptake values above a user-defined threshold were pre-segmented automatically and then manually corrected slice by slice. The resulting 3D binary segmentation masks were saved as NRRD files by the software. These files were exported, combined into a single segmentation mask per study, and converted to DICOM SEG using the highdicom package v0.22.0 in Python v3.8.13. In addition, patient metadata was extracted from imaging DICOM tags: patient age at imaging (in years), PET/CT manufacturer and model name, PET radionuclide, and use of CT contrast agent. Information on radionuclides and the use of CT contrast agents was visually reviewed and validated by a radiologist with 10 years of experience in hybrid imaging.
For each of the 597 PSMA PET/CT studies, we provide the anonymized original PET and CT DICOM files, and the corresponding segmentation mask as DICOM SEG. To view the DICOM data, we recommend open-source medical image data viewers such as 3D Slicer or the Medical Imaging Interaction Toolkit. For computational analysis, e.g. in Python, 3D image volumes can be read using open-source libraries such as pydicom, nibabel, or SimpleITK.
The patient metadata extracted from DICOM tags is shared in a TSV file. Each row contains the information on one study. Each study is uniquely identified by a case identifier number and the study date.
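Because each study is identified by the combination of case identifier and study date, a natural way to load the TSV file is to index rows by that pair. The sketch below uses only the Python standard library; the column names (`case_id`, `study_date`, `age`) and the sample values are hypothetical placeholders, so check the header of the actual file before use.

```python
import csv
import io


def index_studies(tsv_file, id_column="case_id", date_column="study_date"):
    """Read a tab-separated metadata file and key each row by (case id, study date).

    The default column names are assumptions; pass the real header names.
    """
    studies = {}
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        key = (row[id_column], row[date_column])
        if key in studies:
            raise ValueError(f"duplicate study key: {key}")
        studies[key] = row
    return studies


if __name__ == "__main__":
    # Made-up rows standing in for the real file contents.
    sample = (
        "case_id\tstudy_date\tage\n"
        "case_0001\t2019-03-14\t67\n"
        "case_0001\t2020-06-02\t68\n"
    )
    studies = index_studies(io.StringIO(sample))
    print(len(studies))
```

Raising on a duplicate key is a deliberate choice: it surfaces any violation of the one-row-per-study assumption instead of silently overwriting a record.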
This dataset contains images of the head which, in theory, could pose re-identification risks using advanced image processing techniques. For this dataset, TCIA implemented a “de-facing” pipeline to generate a version without identifiable facial features, which is published under an open-access license on TCIA. The original unaltered data will be made available at https://general.datacommons.cancer.gov/.
Full Abstract: Introduction: Triple-negative breast cancer (TNBC) is a highly metastatic type of breast cancer and one of the largest contributors to cancer mortality in women. Unlike other breast cancers, TNBC lacks any approved therapeutic targets. Scientists are rigorously attempting to decipher molecular pathways enriched in TNBC and to design clinically applicable therapeutics. Many TNBC drugs that successfully produce general antitumor effects in vitro fail to display significant long-lasting positive effects at the clinical level. This is in part because they do not effectively suppress the growth of cancer stem cells (CSCs), which have increased ability to evolve into metastatic tumors and are associated with enrichment of immunosuppressive pathways. Moreover, it has been shown that in TNBC, dormant CSCs are able to change their metabolic signature to escape the toxic effects of these drugs; these modified metabolic signatures are shown to be causally associated with increased metastasis. Therefore, a successful, clinically applicable therapy must have the ability to selectively inhibit CSC growth, the metastatic metabolic signature, and pathways involved in immunosuppression. Objective: This study will evaluate the potential of four recently proposed TNBC treatments (all of which successfully reduced tumor viability in vitro and/or in vivo) to inhibit genes involved in CSC survival, metastatic metabolic signature, and tumor immunosuppression. Methods: TNBC cell lines and/or patient-derived xenografts were treated with four different treatments: DCC-2036, 9Gy proton irradiation, miR302b+cisplatin combination, and DFX+doxorubicin combination. Genome-wide mRNA profiling (via either RNA-seq or microarray) was performed on control and treated groups. Data were obtained from publicly deposited NCBI GEO datasets.
We assessed the differential expression of over 40 genes associated with CSC growth, metastatic metabolic modifications, and immunosuppression in TNBC tumors. Limma statistical analysis was performed. GSEA was also used to complement results from individual gene expression analysis. Results: DCC-2036 treatment significantly induced the expression of CSC TNBC biomarkers (such as ALDH2, CD44, CCR5, and SNAI1) and genes associated with the TNBC metastatic metabolomic signature (such as PPARGC1A). DCC-2036 showed inconsistent effects on the expression of immunosuppressive markers. 9Gy proton irradiation had mixed effects on the expression of our candidate genes, yet mostly induced the expression of stemness, metastatic, and immunosuppressive markers. miR302b+cisplatin and DFX+doxorubicin both failed to inhibit the candidate genes, yet without significantly inducing their expression. GSEA analysis confirmed the results obtained for all four treatments. Conclusions: Observing cancer rebound in TNBC patients after treatment with traditional cancer drugs is common and often happens when treatments fail to inhibit CSC growth, metabolic pathways associated with metastasis, and oncogenic immunosuppressive pathways. Our analysis shows that all four treatments failed to significantly impact the expression of protein pathways associated with increased metastasis and immunosuppression. It is worth noting that the researchers did report a decrease in tumor viability due to treatment of their experimental models with all four treatments. However, these findings correspond to the viability of the whole cell culture or tumor, not the viability of specifically the CSCs; in TNBC, CSCs make up only a small proportion of the total mass of the tumor, so the reported antiproliferative effects of the treatments do not necessarily suggest the treatment has effectively targeted the CSC population.
Therefore, we hypothesize that these non-targeted therapies will likely not show positive effects in clinical studies. Furthermore, none of the researchers performed any assays evaluating CSC growth—such as CSC-labelled flow cytometry—or metastasis—such as secondary tumor transplantation. Therefore, we encourage the researchers to perform more rigorous assays to evaluate the translatable potential of their treatments. Finally, the outline of this study provides a useful rationale for future studies to evaluate emerging TNBC therapies and serves as a motivation for further in-silico research focus.