https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
The Cancer Genome Atlas Breast Invasive Carcinoma (TCGA-BRCA) data collection is part of a larger effort to build a research community focused on connecting cancer phenotypes to genotypes by providing clinical images matched to subjects from The Cancer Genome Atlas (TCGA). Clinical, genetic, and pathological data reside in the Genomic Data Commons (GDC) Data Portal, while the radiological data are stored on The Cancer Imaging Archive (TCIA).
Matched TCGA patient identifiers allow researchers to explore the TCGA/TCIA databases for correlations between tissue genotype, radiological phenotype and patient outcomes. Tissues for TCGA were collected from many sites all over the world in order to reach their accrual targets, usually around 500 specimens per cancer type. For this reason the image data sets are also extremely heterogeneous in terms of scanner modalities, manufacturers and acquisition protocols. In most cases the images were acquired as part of routine care and not as part of a controlled research study or clinical trial.
Imaging Source Site (ISS) Groups are being populated and governed by participants from institutions that have provided imaging data to the archive for a given cancer type. Modeled after TCGA analysis groups, ISS groups are given the opportunity to publish a marker paper for a given cancer type per the guidelines in the table above. This opportunity will generate increased participation in building these multi-institutional data sets as they become an open community resource. Learn more about the TCGA Breast Phenotype Research Group.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Following the same steps used in the previous course, we downloaded the TCGA-BRCA data using R and Bioconductor, in particular the TCGAbiolinks package. We downloaded transcriptome profiling data (gene expression quantification) where the experimental strategy is RNA-Seq and the workflow type is HTSeq-FPKM-UQ, restricted to primary solid tumor data of the Affymetrix GPL86 profile, together with clinical data.
This dataset is used in the paper for whole slide image (WSI) classification, a digital pathology task. It consists of histopathology and cytopathology images and is used to evaluate the performance of the proposed WSI classification method.
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
At the time of our study, 108 cases with breast MRI data were available in The Cancer Genome Atlas Breast Invasive Carcinoma (TCGA-BRCA) collection. To minimize variations in image quality across the multi-institutional cases, we included only breast MRI studies acquired on GE 1.5 Tesla scanners (GE Medical Systems, Milwaukee, Wisconsin, USA), yielding a total of 93 cases. We then excluded cases that had missing images in the dynamic sequence (1 patient) or that did not, at the time, have gene expression analysis available in the TCGA Data Portal (8 patients). Applying these criteria resulted in a dataset of 84 breast cancer patients, with MRIs from four institutions: Memorial Sloan Kettering Cancer Center, the Mayo Clinic, the University of Pittsburgh Medical Center, and the Roswell Park Cancer Institute. The cases contributed by each institution were 9 (date range 1999-2002), 5 (1999-2003), 46 (1999-2004), and 24 (1999-2002), respectively. The dataset of biopsy-proven invasive breast cancers included 74 (88%) ductal, 8 (10%) lobular, and 2 (2%) mixed. Of these, 73 (87%) were ER+, 67 (80%) were PR+, and 19 (23%) were HER2+. Various types of analyses were conducted using the combined imaging, genomic, and clinical data. Those analyses are described in several manuscripts created by the group (cited below). Additional information about the methodology behind the Radiologist Annotations file can be found on the TCGA Breast Image Feature Scoring Project page.
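The cohort arithmetic in this description can be checked with a few lines of Python. This is only a consistency sketch built from the counts quoted above; the variable names and the `percent` helper are illustrative and not part of the dataset.

```python
# Counts quoted in the cohort description above.
ge_15t_cases = 93
excluded_missing_dynamic_images = 1
excluded_no_gene_expression = 8

final_cohort = (ge_15t_cases
                - excluded_missing_dynamic_images
                - excluded_no_gene_expression)

cases_per_institution = {
    "Memorial Sloan Kettering Cancer Center": 9,
    "Mayo Clinic": 5,
    "University of Pittsburgh Medical Center": 46,
    "Roswell Park Cancer Institute": 24,
}
histology = {"ductal": 74, "lobular": 8, "mixed": 2}
receptor_positive = {"ER+": 73, "PR+": 67, "HER2+": 19}


def percent(count, total):
    """Return count as a whole-number percentage of total."""
    return round(100 * count / total)


# The institution and histology breakdowns should both add up to the cohort.
assert final_cohort == 84
assert sum(cases_per_institution.values()) == final_cohort
assert sum(histology.values()) == final_cohort

for label, count in {**histology, **receptor_positive}.items():
    print(f"{label}: {count} ({percent(count, final_cohort)}%)")
```

The receptor-status groups are not expected to sum to 84, since a tumor can be positive for more than one receptor.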
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This is SNP data downloaded from the Xena public database.
NerdyVisky/sample-tcga-brca dataset hosted on Hugging Face and contributed by the HF Datasets community
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
We extracted 320 samples of the TNBC subtype of breast cancer from METABRIC (Breast Cancer; 2,509 samples in total), merged them with 127 TNBC samples from TCGA-BRCA, and used the merged 447 samples for validation.
This dataset was created by Joy Dhar
NerdyVisky/TCGA-BRCA-30-samples dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
TCGA BRCA samples somatic mutation data in BED format.
This dataset was created by sajju
Released under Other (specified in description)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
TCGA BRCA non-paired sample gene level read counts from Level 3 RNASeq-v2 data.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Introduction
This dataset consists of 1097 breast cancer patient cases and is designed for survival analysis using both histopathological and clinical information. The combination of these data sources allows for the exploration of disease progression patterns and the development of predictive models.
Histopathological Data
The dataset includes a folder containing histopathological image patches extracted from whole-slide imaging (WSI) scans.
Optical magnification: x20
Patch size: 1000 x 1000 pixels
Region selection: Only patches containing tissue are included, discarding areas without relevant information
Image-Derived Data
For each patient, a CSV file is provided with extracted information from the histopathological patches:
Histograms: Representation of the pixel intensity distribution in each image
Cell count: Number of cells present in the selected patches
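As an illustration of what a pixel-intensity histogram contains, here is a minimal dependency-free sketch. The number of bins, the grayscale representation, and the function name are assumptions; the description does not say how the histograms in the CSV files were actually computed.

```python
from collections import Counter


def intensity_histogram(patch, bins=8, max_value=255):
    """Count pixels per intensity bin for a grayscale patch (rows of ints)."""
    bin_width = (max_value + 1) / bins
    counts = Counter(int(pixel // bin_width) for row in patch for pixel in row)
    return [counts.get(i, 0) for i in range(bins)]


# Tiny made-up 2 x 4 grayscale patch, for illustration only.
patch = [
    [0, 10, 200, 255],
    [31, 32, 128, 129],
]
print(intensity_histogram(patch))  # → [3, 1, 0, 0, 2, 0, 1, 1]
```

For real 1000 x 1000 patches, a vectorized routine such as NumPy's `numpy.histogram` would be the more practical choice.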
Clinical Data
A second CSV file contains clinical information about the patients, which is essential for survival analysis. The included variables are:
Time until death: The time elapsed until the patient’s death
Vital status: Indicates whether the patient is deceased or still alive
Other clinical variables: Factors that may influence survival and help contextualize the histopathological data
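A time-to-event column and a vital-status column are the inputs a Kaplan-Meier estimate needs. The sketch below is a minimal pure-Python version, assuming that the time column holds the last follow-up time for patients who are still alive (treated as censored); that assumption and the example values are hypothetical and not taken from the dataset.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct death time.

    times  -- follow-up time per patient
    events -- 1 if death was observed at that time, 0 if censored (alive)
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        leaving = sum(1 for ti in times if ti == t)
        if deaths:
            # Multiply by the fraction of at-risk patients surviving time t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set after t.
        at_risk -= leaving
    return curve


# Made-up follow-up data, for illustration only.
times = [5, 8, 8, 12, 15]
events = [1, 1, 0, 0, 1]
for t, s in kaplan_meier(times, events):
    print(t, round(s, 3))
```

Dedicated survival-analysis packages add confidence intervals and covariate models; this sketch only shows how the two clinical columns combine into a survival curve.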
Dataset Objective
The primary objective of this dataset is to facilitate the development of survival models that integrate histopathological and clinical information. This will help identify patterns in breast cancer progression and enhance predictive capabilities for estimating patient survival time.
This dataset is ideal for exploring machine learning methods applied to digital pathology and survival analysis in oncology.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset provides mitosis detection results obtained by applying the "Mitosis Detection, Fast and Slow" (MDFS) algorithm (arXiv:2208.12587, "Mitosis Detection, Fast and Slow: Robust and Efficient Detection of Mitotic Figures") to the TCGA-BRCA dataset.
The MDFS algorithm exemplifies a robust and efficient two-stage process for mitosis detection. Initially, potential mitotic figures are identified and later refined. The proposed model for the preliminary identification of candidates, the EUNet, stands out for its swift and accurate performance, largely due to its structural design. EUNet operates by outlining candidate areas at a lower resolution, significantly expediting the detection process. In the second phase, the initially identified candidates undergo further refinement using a more intricate classifier network, namely the EfficientNet-B7. The MDFS algorithm was originally developed for the MIDOG challenges.
Viewing in QuPath
The dataset at hand comprises GeoJSON files in two categories: mitosis and proxy (mimicker -- the candidates that are unlikely to be mitosis based on our algorithm). Users can open and visualize each category overlaid on the Whole Slide Image (WSI) using QuPath. Simply drag and drop the annotation file onto the opened image in the program. Additionally, users can employ the provided Python snippet to read the annotation into a Python dictionary or a Numpy array.
Loading in Python
To load the GeoJSON files in Python, users can use the following code:
import json
import numpy as np
import pandas as pd


def load_geojson(filename):
    """Load a GeoJSON annotation file; return its properties and points."""
    # Load the GeoJSON file
    with open(filename, 'r') as f:
        data = json.load(f)
    # Extract the slide-level properties and store them in a dictionary
    slide_properties = data["properties"]
    # Convert the points to a NumPy array with columns (x, y, score);
    # the reshape keeps an (N, 3) shape even when the file has no features
    points_np = np.array(
        [(feat['geometry']['coordinates'][0],
          feat['geometry']['coordinates'][1],
          feat['properties']['score'])
         for feat in data['features']],
        dtype=float,
    ).reshape(-1, 3)
    # Convert the points to a pandas DataFrame
    points_df = pd.DataFrame(points_np, columns=['x', 'y', 'score'])
    return slide_properties, points_np, points_df


mitosis_properties, mitosis_points_np, mitosis_points_df = load_geojson('mitosis.geojson')
mimickers_properties, mimickers_points_np, mimickers_points_df = load_geojson('mimickers.geojson')
Properties
Each WSI in the dataset includes the candidate's centroid, bounding box, hotspot location, hotspot mitotic count, and hotspot mitotic score. The structures of the mitosis and mimicker property dictionaries are as follows:
Mitosis property dictionary structure:
mitosis_properties = {
    'slide_id': slide_id,
    'slide_height': img_h,
    'slide_width': img_w,
    'wsi_mitosis_count': num_mitosis,
    'mitosis_threshold': 0.5,
    'hotspot_rect': {'x1': hotspot[0], 'y1': hotspot[1], 'x2': hotspot[2], 'y2': hotspot[3]},
    'hotspot_mitosis_count': mitosis_count,
    'hotspot_mitosis_score': mitosis_score,
}
Proxy figure (mimicker) property dictionary structure:
mimicker_properties = {
    'slide_id': slide_id,
    'slide_height': img_h,
    'slide_width': img_w,
    'wsi_mimicker_count': num_mimicker,
    'mitosis_threshold': 0.5,
}
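One way to combine the point annotations with these properties is to select the detections that fall inside the hotspot rectangle. The sketch below assumes that point coordinates and `hotspot_rect` share the same pixel coordinate system, that `x1 <= x2` and `y1 <= y2`, and that the bounds are inclusive. The function name and example values are hypothetical, and this is not necessarily how `hotspot_mitosis_count` was computed.

```python
def points_in_hotspot(features, hotspot_rect, threshold=0.5):
    """Return (x, y, score) for features inside hotspot_rect with score >= threshold."""
    selected = []
    for feat in features:
        x, y = feat["geometry"]["coordinates"][:2]
        score = feat["properties"]["score"]
        inside = (hotspot_rect["x1"] <= x <= hotspot_rect["x2"]
                  and hotspot_rect["y1"] <= y <= hotspot_rect["y2"])
        if inside and score >= threshold:
            selected.append((x, y, score))
    return selected


# Made-up annotation content mirroring the structure described above.
annotation = {
    "properties": {
        "slide_id": "example-slide",
        "mitosis_threshold": 0.5,
        "hotspot_rect": {"x1": 100, "y1": 100, "x2": 200, "y2": 200},
    },
    "features": [
        {"geometry": {"coordinates": [150, 120]}, "properties": {"score": 0.9}},
        {"geometry": {"coordinates": [300, 120]}, "properties": {"score": 0.8}},
        {"geometry": {"coordinates": [110, 190]}, "properties": {"score": 0.4}},
    ],
}
props = annotation["properties"]
hits = points_in_hotspot(annotation["features"],
                         props["hotspot_rect"],
                         props["mitosis_threshold"])
print(hits)  # → [(150, 120, 0.9)]
```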
Disclaimer:
It should be noted that we did not conduct a comprehensive review of all mitotic figures within each WSI, and we do not purport these to be free of errors. Nonetheless, a pathologist examined the resulting hotspot regions of interest from 757 WSIs within the TCGA-BRCA Mitosis Dataset, and we found strong correlations between pathologist and MDFS mitotic counts (r = 0.8, p < 0.001). Furthermore, MDFS-derived mitosis scores are shown to be as prognostic as pathologist-assigned mitosis scores [1]. This examination was also aimed at verifying the quality of the hotspot selections, ensuring that they were not primarily driven by excessive false detections or artifacts and that they lay in a plausible location in the tumor landscape.
[1] Ibrahim, Asmaa, et al. "Artificial Intelligence-Based Mitosis Scoring in Breast Cancer: Clinical Application." Modern Pathology 37.3 (2024): 100416.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
TCGA BRCA non-paired sample isoform level read counts from Level 3 RNASeq-v2 data.
This dataset was created by RAJIB BAG_1
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Arya Z.E.
Released under CC0: Public Domain
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
470 pairs of isoforms with switching events in the TCGA BRCA dataset. For each isoform pair, two clusters of samples with differential isoform expression patterns were identified by K-means clustering; cluster sizes are summarized on the complete data set (K-means cluster sample size (overall)) and on the subset in which both isoforms were detected (K-means cluster sample size (detected)). (XLSX 63 kb)
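To make the clustering step concrete, here is a generic two-cluster K-means (Lloyd's algorithm) over per-sample expression of an isoform pair. The initialization, distance measure, and example values are assumptions for illustration; the description does not state which K-means settings or preprocessing were used.

```python
def two_means(points, iterations=20):
    """Cluster 2-D points into two groups with Lloyd's k-means algorithm."""
    # Deterministic start: the lexicographically smallest and largest points.
    centers = [min(points), max(points)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min((0, 1), key=lambda k: (p[0] - centers[k][0]) ** 2
                                      + (p[1] - centers[k][1]) ** 2)
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append((sum(m[0] for m in members) / len(members),
                                    sum(m[1] for m in members) / len(members)))
            else:
                new_centers.append(centers[k])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


# Made-up (isoform_a, isoform_b) expression values per sample.
samples = [(9.0, 1.0), (8.5, 1.5), (9.5, 0.5), (1.0, 8.0), (1.5, 9.0), (0.5, 8.5)]
labels, centers = two_means(samples)
print(labels)                               # → [1, 1, 1, 0, 0, 0]
print(labels.count(0), labels.count(1))     # cluster sample sizes
```

The two counts printed at the end correspond to the "K-means cluster sample size" idea: how many samples fall in each expression pattern.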
Background: DNA methylation is a common event in the early development of various tumors, including breast cancer (BRCA), and has been studied as a potential tumor biomarker. Although previous studies have reported a cluster of aberrant promoter methylation changes in BRCA, none of these research groups have proved the specificity of these DNA methylation changes. Here we aimed to identify specific DNA methylation signatures in BRCA that can be used as diagnostic and prognostic markers.
Methods: Differentially methylated sites were identified using The Cancer Genome Atlas (TCGA) BRCA data set. We screened for BRCA-differential methylation by comparing methylation profiles of BRCA patients, healthy breast biopsies, and blood samples. These differentially methylated sites were compared with samples from nine main cancers to identify BRCA-specific methylated sites. A BayesNet model was built to distinguish BRCA patients from healthy donors. The model was validated using three independent Gene Expression Omnibus (GEO) data sets. In addition, we carried out Cox regression analysis to identify DNA methylation markers that are significantly related to the overall survival (OS) of BRCA patients and verified them in the validation cohort.
Results: We identified seven differentially methylated sites (DMSs) that were highly correlated with the cell cycle as potential specific diagnostic biomarkers for BRCA patients. The combination of the 7 DMSs achieved ~94% sensitivity in predicting BRCA, ~95% specificity comparing healthy vs. cancer samples, and ~88% specificity in excluding other cancers. We also identified 6 methylation sites that are highly correlated with the OS of BRCA patients and can be used to accurately predict the survival of BRCA patients (training cohort: likelihood ratio = 70.25, p = 3.633 × 10^-13, area under the curve (AUC) = 0.784; validation cohort: AUC = 0.734). Stratification analysis by age, clinical stage, tumor type, and chemotherapy retained statistical significance.
Conclusion: In summary, our study demonstrated the role of methylation profiles in the diagnosis and prognosis of BRCA. This signature is superior to currently published methylation markers for the diagnosis and prognosis of BRCA patients. It is a promising biomarker for early diagnosis and prognosis of BRCA.
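For readers less familiar with the reported metrics, sensitivity and specificity can be computed from confusion-matrix counts as sketched below. The counts are made up for illustration and are not the study's data; the function names are likewise illustrative.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of actual positives (e.g. BRCA samples) predicted positive."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of actual negatives (e.g. healthy samples) predicted negative."""
    return true_neg / (true_neg + false_pos)


# Made-up confusion-matrix counts, for illustration only.
print(round(sensitivity(true_pos=45, false_neg=5), 2))
print(round(specificity(true_neg=18, false_pos=2), 2))
```

The abstract reports two specificities because the negative class differs: healthy samples in one comparison and other cancer types in the other.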
TCGA Breast Invasive Carcinoma. Source data from GDAC Firehose. Previously known as TCGA Provisional. This dataset contains summary data visualizations and clinical data from a broad sampling of 1,108 carcinomas from 1,101 patients. The data was gathered as part of the Broad Institute of MIT and Harvard Firehose initiative, a cancer analysis pipeline. The clinical data includes mutation count, information about mutated genes, patient demographics, sample type, disease code, Adjuvant Postoperative Pharmaceutical Therapy Administered Indicator, American Joint Committee on Cancer Metastasis Stage Code, American Joint Committee on Cancer Publication Version Type, American Joint Committee on Cancer Tumor Stage Code, Brachytherapy first reference point administered total dose, Cent17 Copy Number, and the Days to Sample Collection. The dataset includes Next-Generation Clustered Heat Maps (NG-CHM) viewable via an embedded NG-CHM Heat Map Viewer, provided by MD Anderson Cancer Center, which offers a graphical environment for exploring clustered or non-clustered heat map data. The data set also includes copy-number segment data downloadable as .seg files and viewable via the Integrative Genomics Viewer.