CPEM: Accurate cancer type classification based on somatic alterations using an ensemble of a random forest and a deep neural network

Abstract

With recent advances in DNA sequencing technologies, fast acquisition of large-scale genomic data has become commonplace. For cancer studies, in particular, there is an increasing need for the classification of cancer type based on somatic alterations detected from sequencing analyses. However, the ever-increasing size and complexity of the data make the classification task extremely challenging. In this study, we evaluate the contributions of various input features, such as mutation profiles, mutation rates, mutation spectra and signatures, and somatic copy number alterations that can be derived from genomic data, and further utilize them for accurate cancer type classification. We introduce a novel ensemble of machine learning classifiers, called CPEM (Cancer Predictor using an Ensemble Model), which is tested on 7,002 samples representing over 31 different cancer types collected from The Cancer Genome Atlas (TCGA) database. We first systematically examined the impact of the input features. Features known to be associated with specific cancers had relatively high importance in our initial prediction model. We further investigated various machine learning classifiers and feature selection methods to derive the ensemble-based cancer type prediction model achieving up to 84% classification accuracy in the nested 10-fold cross-validation. Finally, we narrowed down the target cancers to the six most common types and achieved up to 94% accuracy.
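
The evaluation scheme described in the abstract (feature selection, an ensemble of a random forest and a neural network, and nested cross-validation) can be illustrated with a minimal sketch. This is not the CPEM implementation. It assumes scikit-learn, uses `MLPClassifier` as a small stand-in for the deep neural network, `SelectKBest` as a stand-in for the feature selection step, soft voting as one plausible way to combine the two models, and synthetic data in place of the TCGA features. The function names `build_ensemble` and `nested_cv_accuracy` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler


def build_ensemble(seed=0):
    """Soft-voting ensemble of a random forest and a small neural network.

    Assumption: averaging predicted class probabilities (soft voting) is used
    here for illustration; the abstract does not say how CPEM combines models.
    """
    forest = RandomForestClassifier(n_estimators=50, random_state=seed)
    # The network is scaled first because MLPs are sensitive to feature scale.
    network = make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=seed),
    )
    return VotingClassifier([("rf", forest), ("mlp", network)], voting="soft")


def nested_cv_accuracy(X, y, outer_folds=10, inner_folds=3, seed=0):
    """Return one accuracy score per outer fold of a nested cross-validation.

    The inner loop picks the number of selected features; the outer loop
    scores the tuned pipeline on data the inner loop never saw.
    """
    pipeline = Pipeline([
        ("select", SelectKBest(f_classif)),  # stand-in feature selection
        ("ensemble", build_ensemble(seed)),
    ])
    inner = StratifiedKFold(n_splits=inner_folds, shuffle=True, random_state=seed)
    outer = StratifiedKFold(n_splits=outer_folds, shuffle=True, random_state=seed)
    search = GridSearchCV(
        pipeline, {"select__k": [10, 20]}, cv=inner, scoring="accuracy"
    )
    return cross_val_score(search, X, y, cv=outer, scoring="accuracy")


if __name__ == "__main__":
    # Synthetic three-class data; real inputs would be somatic alteration features.
    X, y = make_classification(
        n_samples=300, n_features=40, n_informative=10, n_classes=3, random_state=0
    )
    scores = nested_cv_accuracy(X, y)
    print("mean accuracy over outer folds:", scores.mean())
```

Keeping feature selection inside the pipeline matters in this sketch: the selector is refit on each training split, so the held-out outer fold never influences which features are chosen.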
