Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Aim: To predict base-resolution DNA methylation in cancerous and paracancerous tissues. Materials & methods: We collected six cancer DNA methylation datasets from The Cancer Genome Atlas and five cancer datasets from Gene Expression Omnibus and established machine learning models using paired cancerous and paracancerous tissues. Tenfold cross-validation and independent validation were performed to demonstrate the effectiveness of the proposed method. Results: The developed cross-tissue prediction models substantially increase prediction accuracy at more than 68% of CpG sites and help enhance the statistical power of differential methylation analyses. An XGBoost model leveraging multiple correlated CpGs may further improve prediction accuracy. Conclusion: This study provides a powerful tool for DNA methylation analysis and has the potential to yield new epigenetic insights into cancer research.

The authors employed machine learning models to predict genome-wide DNA methylation (DNAm) levels in cancerous tissues (CTs) and paracancerous tissues (PTs) when one of them is difficult to obtain.
The proposed model based on a single CpG site improves the mean absolute error at more than 68% of CpGs.
A multiple-CpG-based XGBoost model can further improve predictive performance when there is considerable variability between individuals.
Combining measured and predicted PTs to enlarge the sample size makes the CpG sites detected in differential methylation analysis statistically more significant.
The prediction models perform better when CTs, rather than PTs, are used as predictors.
The aggressiveness of cancers and patient outcomes may be predictable using well-predicted DNAm profiles in CT/PT.
Functional enrichment analysis based on highly correlated CpG sites identified important pathways involved in cancer progression.
The cross-tumor DNAm prediction model has the potential to be applied to an external cancer dataset for a subset of probes with high correlation in both cancers.
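As a rough illustration of the single-CpG evaluation summarized above, the following Python sketch fits one linear model for a single CpG site on paired samples and scores it by mean absolute error under tenfold cross-validation. The text does not state the model form or the reference against which the improvement in mean absolute error is measured. The linear fit, the baseline that reuses the predictor-tissue value unchanged, the function names `fit_line` and `cv_mae`, and the simulated beta values are all assumptions made for illustration, not the authors' implementation.

```python
import random
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x.

    Falls back to a constant model (the mean of y) if x has no variance.
    """
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:
        return my, 0.0
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b


def clip01(v):
    """Keep a predicted methylation beta value inside [0, 1]."""
    return min(1.0, max(0.0, v))


def cv_mae(predictor, target, k=10, seed=0):
    """K-fold cross-validated MAE for one CpG site.

    Returns (model_mae, baseline_mae). The baseline is an assumed reference:
    it uses the predictor-tissue value directly as the estimate.
    """
    idx = list(range(len(predictor)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    model_err, base_err = [], []
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        a, b = fit_line([predictor[i] for i in train],
                        [target[i] for i in train])
        for i in test:
            model_err.append(abs(clip01(a + b * predictor[i]) - target[i]))
            base_err.append(abs(predictor[i] - target[i]))
    return mean(model_err), mean(base_err)


# Simulated paired beta values for one CpG site (hypothetical data).
rng = random.Random(1)
pt = [rng.uniform(0.2, 0.8) for _ in range(40)]
ct = [clip01(0.2 + 0.5 * x + rng.gauss(0, 0.02)) for x in pt]

model_mae, baseline_mae = cv_mae(pt, ct)
print(f"model MAE = {model_mae:.3f}, baseline MAE = {baseline_mae:.3f}")
```

Running `cv_mae` once per CpG site and counting the sites where the model error falls below the baseline error would give a proportion comparable in spirit to the reported figure. Swapping the `predictor` and `target` arguments corresponds to using CTs as predictors instead of PTs.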